When SQL meets vector: how a cost-based planner handles both

The hardest engineering problem we faced building Dreambase wasn't the HNSW implementation. It wasn't the storage layout or the wire protocol compatibility. It was the query planner: specifically, deciding in what order to evaluate a scalar filter and an ANN (approximate nearest neighbor) scan when both appear in the same query.

This sounds like a narrow implementation detail. It isn't. For workloads at scale, the wrong execution order can mean a 40ms query that runs in 2,000ms instead — and the threshold at which each strategy wins is not static. It depends on dataset size, filter selectivity, vector dimension, HNSW build parameters, and the specific query. That's the problem a cost-based planner exists to solve.

The planning problem, framed concretely

A typical hybrid query in Dreambase looks something like this:

SELECT id, title, content
FROM documents
WHERE tenant_id = 'acme' AND category = 'legal' AND deleted = false
ORDER BY embedding <-> $query_vec
LIMIT 10;

This query has two components: a scalar filter (tenant_id, category, deleted) and a vector similarity sort (ORDER BY embedding <-> $query_vec). The planner must decide: do I run the ANN scan first, get the top-k candidates, then apply the scalar filter? Or do I apply the scalar filter first, reduce the candidate set, then run ANN search over the filtered rows?

Both strategies are correct in the sense that they produce the same result. They are not equivalent in cost.

Scalar-first execution: when it wins

Scalar-first means: evaluate the WHERE clause against structured indexes (B-tree on tenant_id, category, etc.), produce a reduced candidate set, then run vector similarity search only over that set.

This strategy is efficient when the scalar filter is highly selective — meaning it reduces the candidate set dramatically relative to the full table. If tenant_id = 'acme' matches 2,000 rows out of 10 million, running ANN search over those 2,000 rows is far cheaper than scanning the HNSW graph built on all 10 million and then discarding the non-matching results.

The catch: scalar-first requires that ANN search over an arbitrary subset of rows is possible. Pure HNSW graphs are built over the full dataset; they don't support filtered sub-graph traversal efficiently without modification. Dreambase's vector index maintains per-row bitmaps alongside the graph structure, so filtered traversal is supported natively. When the planner decides scalar-first, it passes the bitmap of matching row IDs to the ANN traversal as an allowlist.

Scalar-first is the right call when selectivity is high. We define "high selectivity" operationally as: the scalar filter reduces the candidate set to less than roughly 5% of the full table. Below that threshold, the filtered ANN traversal has so few valid hops to make that it completes much faster than a full-graph traversal followed by post-filter.

ANN-first execution: when it wins

ANN-first means: traverse the HNSW graph to find the top-K candidates by vector similarity (usually K is larger than the requested LIMIT, to account for post-filter attrition), then apply the scalar filter to that candidate set.

This is faster when the scalar filter is low selectivity — meaning it matches most of the table, and the set reduction isn't worth the overhead of applying it before the ANN scan. If category = 'legal' matches 60% of documents, pre-filtering to 60% and running filtered ANN is not materially cheaper than running full ANN and discarding the 40% that don't match. The ANN graph traversal is efficient to begin with; the scalar filter adds overhead without reducing the scan footprint much.

ANN-first also has an accuracy implication: if K (the oversampled candidate count) is not large enough, some qualifying rows may not appear in the top-K and will be excluded from the final result even if they would have ranked in the top-10 after filtering. The planner must account for expected filter selectivity when setting K. We use a formula of K = limit / estimated_selectivity * 1.5, clamped to a maximum of min(table_size, 500) to avoid O(n) degradation on low-selectivity filters with large LIMIT values.

How the cost model works

At plan time, the planner has access to:

Table statistics: row count, column cardinality estimates from the histogram (updated on ANALYZE)
HNSW index statistics: graph node count, avg out-degree, expected traversal depth at current ef parameter
Query-time estimates: scalar predicate selectivity from column statistics, query vector norm (affects cosine vs dot product distance computation cost)

The planner computes two cost estimates — scalar-first and ANN-first — and picks the lower one. The cost functions are empirically calibrated, not purely theoretical. Early versions derived from textbook cost models were off by 2-3x in practice because they didn't account for cache behavior: the filtered ANN traversal has much better spatial locality than a full graph scan, which matters significantly at table sizes above ~500K rows where working sets no longer fit in L3 cache.

We also maintain a per-table "plan history" table: recent queries, their estimated cost, and their actual wall-clock time. The planner uses this to detect systematic estimation errors — cases where the estimates are consistently off in one direction — and adjusts its weighting coefficients. This is not query-specific plan caching (a traditional database approach) but a lightweight adaptive calibration that improves estimation accuracy over time without the stale-plan problems of aggressive caching.

What this looks like in practice

On a synthetic workload of 5 million document rows with 1536-dimension embeddings (OpenAI text-embedding-3-small dimensions) and a mix of high- and low-selectivity filters:

High-selectivity queries (filter matches <3% of rows): scalar-first averages ~18ms p99. ANN-first for the same queries averages ~140ms p99 — the post-filter discard is expensive at this selectivity because the ANN candidate oversampling requirement becomes very large.
Low-selectivity queries (filter matches >40% of rows): ANN-first averages ~22ms p99. Scalar-first averages ~35ms p99 — the filtered ANN traversal gains little from pre-filtering and pays extra overhead setting up the bitmap allowlist.
The planner correctly picks the winning strategy ~92% of the time on this workload, based on validation against exhaustive execution of both plans on a held-out query set.

The 8% where the planner picks wrong are mostly at the boundary: selectivity around 5-8% where the two strategies are nearly equivalent. The cost difference in those cases is small — typically under 5ms either way. We have not observed a case where the planner was dramatically wrong in a direction that mattered.

We want to be precise here: these numbers are from our internal test corpus, not from a standardized benchmark. We haven't published them as an external reproducible benchmark because the workload characteristics (dimension, selectivity distribution, hardware) affect the numbers substantially. If you want to reproduce them on your data, the SQL + vector guide covers how to use EXPLAIN ANALYZE in Dreambase to see which strategy the planner chose and the actual vs estimated costs for a given query.

The broader point is that a hybrid query optimizer isn't a feature you toggle — it's an ongoing calibration problem. The decision of "scalar first or ANN first" can't be right in a static way because the answer depends on statistics that change as your data grows. Building a planner that gets this right consistently is a significant part of what makes a hybrid database different from a vector store with a metadata filter attached.