Core concepts

This page covers the design decisions that distinguish Dreambase from pgvector, Pinecone, and Weaviate: the hybrid row model (why vectors belong in the row, not a separate store), HNSW co-location (what it means for query performance), the cost-based hybrid planner (how it chooses between scalar-first and ANN-first execution), and embedding freshness (why the dual-store lag problem disappears by design).

The hybrid row model

In most database systems, a row is a collection of typed scalar values: integers, text, timestamps, booleans. Dreambase extends this with a first-class VECTOR(dims) column type. A hybrid row is one that carries both structured scalar columns and a vector embedding in the same storage unit.

This is different from storing vector data in a blob column or a JSON field. The VECTOR column is indexed by a co-located HNSW structure — it participates in query planning, not just storage.

-- A hybrid row in Dreambase
SELECT id, user_id, content,        -- scalar columns
       embedding,                    -- VECTOR(1536) column
       embedding NEAR $1 AS score   -- ANN distance expression
FROM documents
WHERE user_id = 'u_441'             -- SQL predicate on a scalar column
ORDER BY score                      -- ascending distance: nearest first
LIMIT 5;
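
To make the semantics of this query concrete, here is a minimal Python sketch of what it computes, using an exact brute-force scan in place of the HNSW index. It is a conceptual model, not Dreambase code: the Document class, the hybrid_query function, and the choice of cosine distance for NEAR are assumptions made for illustration, since this page does not say which metric NEAR uses.

```python
from dataclasses import dataclass
import math


@dataclass
class Document:
    """One hybrid row: scalar columns plus an embedding (hypothetical model)."""
    id: int
    user_id: str
    content: str
    embedding: list[float]


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Assumed metric: 1 - cosine similarity, so smaller means closer.

    Both vectors must be non-zero.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def hybrid_query(rows, user_id, query_vec, limit=5):
    """Exact equivalent of: WHERE user_id = ... ORDER BY score LIMIT n."""
    matching = [r for r in rows if r.user_id == user_id]
    scored = [(cosine_distance(r.embedding, query_vec), r) for r in matching]
    scored.sort(key=lambda pair: pair[0])  # ascending distance: nearest first
    return scored[:limit]


rows = [
    Document(1, "u_441", "alpha", [1.0, 0.0]),
    Document(2, "u_441", "beta", [0.0, 1.0]),
    Document(3, "u_999", "gamma", [1.0, 0.0]),
]
results = hybrid_query(rows, "u_441", [1.0, 0.1])
```

Row 3 has the closest vector to the query but belongs to another user, so it never appears in the result. An ANN index trades this exactness for speed and returns approximately the same ordering.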

Vector co-location

Co-location means the vector index and the row data live on the same storage pages. When the query planner decides to do an ANN scan, it reads the vector index and the row data in the same I/O operations — there is no second store to contact, no network round-trip to a separate vector service.

The practical implication: hybrid queries (SQL predicates + ANN) do not pay the latency of a distributed join between two separate services. The planner has full visibility into the cost of both access paths.

Query planner internals

The Dreambase hybrid query planner is a cost-based optimizer that considers two primary execution strategies for hybrid queries:

  1. Scalar-first: apply SQL WHERE predicates to reduce the candidate set, then run ANN on the surviving rows. Best when the predicates are highly selective, that is, when few rows pass them (e.g., WHERE user_id = 'u_441' eliminates 99% of rows).
  2. ANN-first: run ANN to get the top-K vector-similar candidates, then apply SQL predicates to filter. Best when the WHERE clause is not very selective (most rows pass it), so the ANN scan's small candidate set loses few rows to the filter.

The planner uses statistics collected at INSERT time to estimate selectivity. You can inspect the plan with EXPLAIN HYBRID.
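
The trade-off between the two strategies can be illustrated with a toy cost model. The sketch below is not Dreambase's actual optimizer: the function names, the cost formulas, and the example numbers are assumptions chosen only to show how a selectivity estimate can tip the decision. Here pass_fraction is the estimated fraction of rows that satisfy the WHERE clause, so a predicate that eliminates 99% of rows has a pass_fraction of 0.01.

```python
import math


def scalar_first_cost(n_rows: int, pass_fraction: float) -> float:
    """Filter first, then compare the query vector to each surviving row.

    Assumed cost: one distance computation per surviving row.
    """
    return n_rows * pass_fraction


def ann_first_cost(n_rows: int, pass_fraction: float, k: int) -> float:
    """Search the ANN index, over-fetching so that about k rows survive the filter.

    Assumed cost: roughly log2(n) node visits per candidate returned, with
    k / pass_fraction candidates needed on average.
    """
    candidates = min(n_rows, k / pass_fraction)
    return candidates * math.log2(max(n_rows, 2))


def choose_strategy(n_rows: int, pass_fraction: float, k: int) -> str:
    """Pick the cheaper plan under the toy cost model above."""
    if pass_fraction <= 0.0:
        return "scalar-first"  # nothing passes; the filter alone answers the query
    scalar = scalar_first_cost(n_rows, pass_fraction)
    ann = ann_first_cost(n_rows, pass_fraction, k)
    return "scalar-first" if scalar <= ann else "ann-first"


# Under this model, a predicate that keeps 1% of 100,000 rows favors
# scalar-first, and one that keeps 90% of them favors ANN-first.
selective = choose_strategy(100_000, 0.01, k=5)
broad = choose_strategy(100_000, 0.9, k=5)
```

A real planner weighs more than distance computations (I/O, index parameters, recall targets), so treat the crossover point here as illustrative only.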

Embedding freshness model

In dual-store architectures, the vector store is a derived representation of the primary data store. It becomes stale whenever data in the primary store changes and the embedding pipeline has not yet run. The lag is typically the embedding job interval — often 15 minutes to 48 hours in production.

In Dreambase, the embedding column is part of the row. An UPDATE that changes content and embedding in the same statement is a single atomic transaction. There is no derived representation — the vector column is authoritative, not a copy.

The implication: the sync-lag form of "stale embedding" is not a failure mode in Dreambase. As long as content and embedding are written in the same statement, a hybrid query that starts after the update commits, even one second later, retrieves the updated vector.
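
The difference between the two freshness models can be shown with a small in-memory simulation. This is a conceptual sketch, not client code for any real database: DualStore, HybridStore, run_sync_job, and the toy embed function are hypothetical names. DualStore stands in for a primary database plus a separately synced vector store, and HybridStore for a single store that writes content and embedding together.

```python
def embed(text: str) -> list[float]:
    """Toy stand-in for an embedding model (hypothetical): length and vowel count."""
    return [float(len(text)), float(sum(text.count(v) for v in "aeiou"))]


class DualStore:
    """Primary store plus a derived vector store refreshed by a periodic job."""

    def __init__(self):
        self.primary = {}   # id -> content
        self.vectors = {}   # id -> embedding, possibly stale

    def update(self, row_id, content):
        self.primary[row_id] = content  # the vector store is not touched here

    def run_sync_job(self):
        for row_id, content in self.primary.items():
            self.vectors[row_id] = embed(content)

    def is_stale(self, row_id):
        return self.vectors.get(row_id) != embed(self.primary[row_id])


class HybridStore:
    """Single store: content and embedding are written together in one row."""

    def __init__(self):
        self.rows = {}  # id -> (content, embedding)

    def update(self, row_id, content):
        # Both columns change in one assignment, modelling one atomic statement.
        self.rows[row_id] = (content, embed(content))

    def is_stale(self, row_id):
        content, embedding = self.rows[row_id]
        return embedding != embed(content)


dual = DualStore()
dual.update(1, "first draft")
dual.run_sync_job()
dual.update(1, "a much longer second draft")
stale_before_sync = dual.is_stale(1)   # True until the next sync job runs
dual.run_sync_job()
stale_after_sync = dual.is_stale(1)    # False once the job has caught up

hybrid = HybridStore()
hybrid.update(1, "first draft")
hybrid.update(1, "a much longer second draft")
hybrid_stale = hybrid.is_stale(1)      # False: there is no second copy to lag
```

The window in which stale_before_sync is true corresponds to the embedding job interval described above. The single-store model has no such window because there is only one write.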

Comparison to the dual-store pattern

The dual-store pattern (e.g., Postgres + Pinecone) requires:

  • Two write operations per INSERT or UPDATE (one to each database)
  • A synchronization mechanism to keep them consistent
  • Application-layer logic to join results from two different query interfaces
  • Monitoring for sync lag and divergence

Dreambase eliminates all of these by making the vector column a first-class part of the SQL row. The trade-off is real and worth stating plainly. Dreambase does not support vectors above 4096 dimensions, so models that emit 8192-dimension embeddings, such as large Matryoshka models used at full width, are out. Sustained write throughput above approximately 50K rows per second will saturate the HNSW index maintenance path before the SQL path. If your workload is write-heavy at that scale, or if you need sub-millisecond ANN at hundreds of millions of vectors with no SQL requirement, a dedicated vector store remains the better choice.
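
As a quick planning aid, the two numeric limits above can be encoded in a small check. This is a hypothetical helper, not part of any Dreambase tooling: the function name and constants are introduced here for illustration, and the write ceiling is the approximate figure quoted above rather than a hard limit.

```python
MAX_DIMS = 4096                   # dimension ceiling stated above
MAX_WRITE_ROWS_PER_SEC = 50_000   # approximate sustained-write ceiling stated above


def fit_concerns(dims: int, sustained_writes_per_sec: int) -> list[str]:
    """Return the reasons, if any, that a workload falls outside the stated limits."""
    concerns = []
    if dims > MAX_DIMS:
        concerns.append(f"{dims} dims exceeds the {MAX_DIMS}-dim limit")
    if sustained_writes_per_sec > MAX_WRITE_ROWS_PER_SEC:
        concerns.append(
            f"{sustained_writes_per_sec} rows/s exceeds ~{MAX_WRITE_ROWS_PER_SEC} rows/s"
        )
    return concerns


ok = fit_concerns(dims=1536, sustained_writes_per_sec=2_000)         # no concerns
too_wide = fit_concerns(dims=8192, sustained_writes_per_sec=2_000)   # one concern
```

An empty list means neither stated limit is exceeded. It says nothing about the third case above, very large ANN-only workloads with sub-millisecond targets, which needs benchmarking rather than a threshold.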