Embedding freshness: measuring and eliminating index staleness

Embedding freshness measures how closely your vector index reflects the current state of the data it was built from. In a fresh index, a row you updated 5 minutes ago has an embedding computed from its current content. In a stale index, the stored embedding was computed from an earlier version of the row, so any query that retrieves it is matching against old data.

Most teams building RAG systems have a rough sense that their index might lag behind. Few have a precise measurement of how stale it actually is, how often specific rows drift, or what the distribution of lag looks like across their corpus. Without that measurement, freshness SLAs are aspirational rather than operational.

A taxonomy of staleness patterns

Not all staleness looks the same. We've observed four distinct patterns in production systems, each with different detection approaches and remediation strategies.

Batch lag. The most common pattern. The embedding pipeline runs periodically, and any changes between runs are not reflected in the index until the next run. This produces a sawtooth freshness curve: perfect immediately after a batch run, progressively more stale until the next run. The maximum lag is the batch interval (plus however long the batch itself takes to run). The average lag, for updates that arrive uniformly over the interval, is half the batch interval.
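The sawtooth arithmetic can be checked with a small simulation. This is a sketch under simplifying assumptions: batches run at exact multiples of the interval and complete instantly, which real pipelines do not. The function name batch_lags and the 4-hour interval are illustrative choices, not taken from any particular system.

```python
from statistics import mean


def batch_lags(update_times, batch_interval):
    """Return how long each update waits for the next batch run.

    Assumes batches run at t = batch_interval, 2 * batch_interval, ...
    and finish instantly. An update landing exactly on a run boundary
    is treated as having missed that run.
    """
    lags = []
    for t in update_times:
        since_last_run = t % batch_interval
        lags.append(batch_interval - since_last_run)
    return lags


interval_hours = 4.0
# 1,000 updates spread evenly across one 4-hour interval
updates = [interval_hours * (i + 0.5) / 1000 for i in range(1000)]
lags = batch_lags(updates, interval_hours)

max_lag = max(lags)   # approaches the full interval
avg_lag = mean(lags)  # approaches half the interval
print(f"max lag: {max_lag:.3f}h, average lag: {avg_lag:.3f}h")
```

If updates cluster just after each run (for example, a nightly import that lands shortly after the nightly embedding job), the average lag moves toward the full interval rather than half of it.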

Partial updates. The pipeline runs but processes only a subset of rows — perhaps due to rate limits, timeouts, or a bug in the incremental detection logic. Rows that were changed but not processed remain stale even after the "latest" batch. These are harder to detect than batch lag because the batch job reports success. The index appears up to date based on job completion logs. Only a per-row audit reveals the gap.

Missing rows. Rows that were inserted after the last full index build but missed by the incremental logic, perhaps because the incremental query used updated_at > last_run but the row's updated_at wasn't set correctly on insert. These rows exist in the source database but are absent from the vector index entirely. Queries that should retrieve them cannot, so the system returns less relevant matches or nothing at all instead of a stale result. This is a correctness problem, not just a freshness problem.

Model drift. The embedding model used to generate the stored vectors is different from the model used to generate the query vector. This happens when a team upgrades their embedding model without re-embedding the corpus. The similarity scores become unreliable: the stored vectors and query vectors inhabit different geometric spaces, and cosine similarity between them is meaningless. This isn't technically "staleness" in the temporal sense, but it's a freshness violation in the more general sense that the index no longer faithfully represents queryable similarity.
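To make the distinctions concrete, the four patterns can be expressed as a single per-row classification. The sketch below is illustrative only: the SourceRow and AuditRecord types, the classify function, and the idea of comparing against the completion time of the last batch that reported success are all assumptions made for this example, not part of any specific pipeline.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceRow:
    content_hash: str   # hash of the row's current content
    updated_at: float   # epoch seconds


@dataclass
class AuditRecord:
    embedded_at: float  # epoch seconds
    model_version: str
    content_hash: str   # hash of the text that was embedded


def classify(row: SourceRow,
             audit: Optional[AuditRecord],
             query_model: str,
             last_successful_batch_at: float) -> str:
    """Map one row onto the staleness patterns described above."""
    if audit is None:
        return "missing_row"      # never embedded at all
    if audit.model_version != query_model:
        return "model_drift"      # vector lives in a different space
    if audit.content_hash == row.content_hash:
        return "fresh"
    # Content has changed since the embedding was computed.
    if row.updated_at <= last_successful_batch_at:
        # A batch reported success after the change but skipped this row.
        return "partial_update"
    return "batch_lag"            # simply waiting for the next run
```

The order of the checks matters: a row with no audit record cannot be tested for model or content mismatch, and a model mismatch invalidates the vector regardless of whether the content hash still matches.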

How to measure staleness in practice

For batch-lag and partial-update detection, the most reliable approach is maintaining an embedding audit table in your source database. On each successful embedding write, record:

CREATE TABLE embedding_audit (
  row_id        BIGINT        NOT NULL,
  table_name    TEXT          NOT NULL,
  embedded_at   TIMESTAMPTZ   NOT NULL DEFAULT NOW(),
  model_version TEXT          NOT NULL,
  content_hash  TEXT          NOT NULL,  -- hex-encoded SHA-256 of the text that was embedded
  PRIMARY KEY (table_name, row_id)       -- one current record per source row; upsert on re-embed
);

The content_hash column is the key element. It lets you detect partial updates: if the content hash stored in embedding_audit no longer matches a fresh hash of the current row content, the embedding is stale even if embedded_at is recent. This catches the case where a row was re-embedded from an old version of the content due to a bug in the incremental change detection.
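On the pipeline side, the hash has to be computed from exactly the text that was sent to the embedding model, and in the same way every time. A minimal sketch, assuming the text is hashed as UTF-8 and stored as lowercase hex; the helper names content_hash, audit_record, and is_stale are hypothetical:

```python
import hashlib
from datetime import datetime, timezone


def content_hash(text: str) -> str:
    """Hex-encoded SHA-256 of the UTF-8 bytes of the text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def audit_record(row_id: int, table_name: str,
                 embedded_text: str, model_version: str) -> dict:
    """Build the values for one embedding_audit row.

    embedded_text must be the exact string that was embedded,
    not a fresh read of the source row.
    """
    return {
        "row_id": row_id,
        "table_name": table_name,
        "embedded_at": datetime.now(timezone.utc),
        "model_version": model_version,
        "content_hash": content_hash(embedded_text),
    }


def is_stale(record: dict, current_text: str) -> bool:
    """True if the row's current content differs from what was embedded."""
    return record["content_hash"] != content_hash(current_text)
```

Hashing the string that was actually embedded, rather than re-reading the row at audit time, is what lets the audit catch an embedding computed from an old version of the content. Any normalization applied before embedding (trimming, template wrapping) has to be applied identically wherever the comparison hash is computed.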

A staleness report then becomes:

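SELECT
  d.id,
  d.updated_at,
  ea.embedded_at,
  EXTRACT(EPOCH FROM (d.updated_at - ea.embedded_at)) / 3600 AS lag_hours,
  -- sha256() returns bytea, so hex-encode it before comparing with the TEXT column.
  -- Assumes content_hash was stored as lowercase hex of the UTF-8 text.
  (ea.content_hash <> encode(sha256(convert_to(d.content, 'UTF8')), 'hex')) AS content_mismatch
FROM documents d
LEFT JOIN embedding_audit ea ON ea.row_id = d.id AND ea.table_name = 'documents'
WHERE d.updated_at > NOW() - INTERVAL '7 days'
ORDER BY lag_hours DESC NULLS FIRST;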

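The NULLS FIRST sort surfaces missing rows (where ea.row_id IS NULL) at the top. Those are the highest priority: not stale, but absent entirely. Below them, a positive lag_hours means the row changed after it was last embedded; a zero or negative value means the embedding is newer than the latest update, in which case a content_mismatch of true points to a bug rather than ordinary lag.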
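The per-row report can then be reduced to the distribution the introduction asks for. The following sketch assumes the report rows have been fetched as (row_id, lag_hours, content_mismatch) tuples, with None in the last two positions for rows that have no audit record; the summarize function and the nearest-rank percentile are choices made for this example.

```python
import math


def percentile(sorted_values, p):
    """Nearest-rank percentile of an ascending, non-empty list."""
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(report_rows):
    """Summarize (row_id, lag_hours, content_mismatch) tuples."""
    missing = [rid for rid, lag, _ in report_rows if lag is None]
    # Positive lag: the row changed after it was last embedded.
    stale_lags = sorted(lag for _, lag, _ in report_rows
                        if lag is not None and lag > 0)
    # Embedding is newer than the last update, yet the hash disagrees.
    hidden = [rid for rid, lag, mismatch in report_rows
              if lag is not None and lag <= 0 and mismatch]
    return {
        "missing": len(missing),
        "stale": len(stale_lags),
        "p50_lag_hours": percentile(stale_lags, 50) if stale_lags else None,
        "p95_lag_hours": percentile(stale_lags, 95) if stale_lags else None,
        "mismatch_despite_newer_embedding": len(hidden),
    }


rows = [
    (1, None, None),   # no audit record: missing from the index
    (2, 5.0, True),
    (3, -1.0, True),   # embedded after the update, but from old content
    (4, 2.0, True),
    (5, 0.5, True),
    (6, 10.0, True),
    (7, -3.0, False),  # fresh
]
summary = summarize(rows)
print(summary)
```

Tracking the 95th percentile alongside the median matters because partial updates tend to show up as a long tail: most rows are embedded on schedule while a small subset falls further behind with every run.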

What freshness lag actually costs retrieval quality

The cost depends on the volatility of your corpus and the query workload. We can characterize it along two axes.

First, update rate: how frequently do rows change? A static document corpus that's updated monthly has essentially zero freshness cost. A product catalog where prices update multiple times daily has a high freshness cost even from a 1-hour batch lag.

Second, retrieval sensitivity: are users querying for information that is likely to have changed recently? A customer asking "what are the current return policies" is sensitive to freshness — if the policy changed 3 hours ago, a stale embedding will retrieve the old policy text and produce an incorrect answer. A customer asking "how do I return a product" is less sensitive — the procedure is unlikely to have changed, and the answer will be accurate regardless of whether the embedding is from this morning or last week.

A rough rule: if more than 10% of your corpus changes within any given batch interval, and if more than 20% of user queries are sensitive to recent changes, you have a meaningful freshness problem. The 10% / 20% thresholds are not derived from formal study — they're a practical heuristic for deciding whether to invest in reducing lag.
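Written as a check, the heuristic looks like the sketch below. The function name and parameters are hypothetical, and the default thresholds are just the 10% and 20% figures from the rule above, so they should be treated as tunable rather than authoritative.

```python
def has_meaningful_freshness_problem(rows_changed_per_interval: int,
                                     corpus_size: int,
                                     freshness_sensitive_queries: int,
                                     total_queries: int,
                                     change_threshold: float = 0.10,
                                     sensitivity_threshold: float = 0.20) -> bool:
    """Apply the rough rule: both fractions must exceed their threshold."""
    changed_fraction = rows_changed_per_interval / corpus_size
    sensitive_fraction = freshness_sensitive_queries / total_queries
    return (changed_fraction > change_threshold
            and sensitive_fraction > sensitivity_threshold)


# 15% of rows change per batch interval, 30% of queries care about recency
volatile = has_meaningful_freshness_problem(15_000, 100_000, 300, 1_000)
# Same corpus churn, but only 5% of queries care about recency
tolerant = has_meaningful_freshness_problem(15_000, 100_000, 50, 1_000)
```

The first input can be measured from the source table's updated_at values over one batch interval. The second requires labeling a sample of the query log, which is usually the harder of the two to obtain.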

To be clear, "fresher is always better" is not necessarily true from an engineering economics perspective. Moving from a 4-hour batch to real-time co-located embedding is a significant architectural change. Whether it's worth making depends on the two axes above. A static corpus doesn't benefit from the investment. A highly volatile corpus with freshness-sensitive queries absolutely does.

Architecture patterns ordered by freshness guarantee

From weakest to strongest freshness guarantee, with real trade-offs at each level:

Nightly batch: Maximum lag = 24h. Cost: minimal. Appropriate for slowly-changing corpora. Failure mode: partial updates and missing rows.

Hourly or sub-hourly batch: Maximum lag = batch interval (minutes to an hour). Cost: linear increase in embedding API spend. Failure mode: same as above, compressed.

CDC-triggered streaming embedding: Maximum lag = time spent waiting in the message queue + embedding latency (typically 1-60 seconds at normal throughput). Cost: significant infrastructure addition (queue, consumer service). Failure mode: consumer lag under high write load; complex retry semantics for embedding failures.

Co-located write-path embedding: Maximum lag = zero (embedding computed in the same write transaction as the row update). Cost: write latency increases by embedding computation time; requires embedding at the database layer rather than asynchronously. Failure mode: write failures rather than silent staleness — a harder failure mode to ignore, which is actually a feature.

The last pattern is the one Dreambase implements. It eliminates the freshness gap by construction, at the cost of embedding latency on writes. For most applications, that trade-off is acceptable: write latency is less user-visible than query accuracy, and removing the async embedding pipeline eliminates substantial operational complexity. But it's not the right choice for every workload: a system with very high write throughput and no retrieval freshness requirement would pay the embedding latency cost for nothing. The architecture choice should match the freshness requirement, not default to whichever pattern the framework tutorial demonstrated first.
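The write-path idea itself can be illustrated without any particular product. The sketch below is not Dreambase's implementation; it is a minimal stand-in using Python's built-in sqlite3 module, with a hypothetical fake_embed function in place of a real embedding model and vectors stored as JSON text. It shows the property that matters: the row and its embedding commit together or not at all.

```python
import hashlib
import json
import sqlite3


def fake_embed(text):
    """Stand-in for a real embedding model: 4 floats derived from a hash."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255 for b in digest[:4]]


def upsert_document(conn, doc_id, content, embed=fake_embed):
    """Write the row and its embedding in a single transaction."""
    with conn:  # commits on success, rolls back if anything raises
        conn.execute(
            "INSERT OR REPLACE INTO documents (id, content) VALUES (?, ?)",
            (doc_id, content),
        )
        vector = embed(content)  # a failure here aborts the row write too
        conn.execute(
            "INSERT OR REPLACE INTO embeddings (doc_id, vector) VALUES (?, ?)",
            (doc_id, json.dumps(vector)),
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, content TEXT NOT NULL)")
conn.execute("CREATE TABLE embeddings (doc_id INTEGER PRIMARY KEY, vector TEXT NOT NULL)")

upsert_document(conn, 1, "Returns accepted within 30 days.")
```

If the embed call raises, the document write is rolled back with it, so the caller sees an error instead of the index silently drifting. That is the "write failures rather than silent staleness" trade-off in concrete form, and it is also why embedding latency lands directly on the write path.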