LLM retrieval drift: why your RAG pipeline returns stale context

There is a class of production bug in RAG systems that doesn't surface as an error. The query runs. The vector search returns results. The LLM generates an answer. Latency looks fine. The only symptom is that the answer is wrong — not because the model failed, but because the context it retrieved was stale.

We call this retrieval drift: the gradual divergence between the live state of your source-of-truth database and the state your vector index was built from. It is silent because no alarm fires. It is insidious because the LLM, given stale but plausible context, will produce plausible but incorrect answers. Users may not notice immediately. Your eval metrics may not catch it if they don't account for freshness. By the time the problem is attributed to retrieval rather than model quality, the stale-index pattern is often deeply embedded in the architecture.

What causes retrieval drift

The most common cause is the batch embedding pipeline. The team runs an embedding job — nightly, every 6 hours, every hour in ambitious setups — that reads new or updated records from the source database, calls an embedding model, and upserts vectors into the vector store. This works well in prototypes and early production. It breaks down under two conditions: when data changes faster than the pipeline runs, and when the pipeline fails silently.

Consider an internal knowledge base application at a growing software company. The engineering team publishes runbooks, architectural decision records, and incident post-mortems regularly, roughly 40-60 document updates per day. The embedding job runs every 4 hours. Between runs, any query about a recently updated runbook retrieves the old version. If an incident has just happened and the on-call engineer asks "what are the current escalation steps for our Payments service," they may get a procedure that was superseded a few hours earlier.
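To put rough numbers on this scenario, here is a back-of-the-envelope sketch. It assumes updates arrive uniformly over the day, every run succeeds, and the job itself takes no time; none of that holds exactly in practice, so treat the output as a floor rather than a forecast. The function name is ours, chosen for illustration.

```python
def expected_staleness(updates_per_day: float, interval_hours: float) -> dict:
    """Rough staleness figures for a fixed-interval batch embedding job.

    Assumes uniform update arrival, no failed runs, and zero job runtime.
    """
    updates_per_window = updates_per_day * interval_hours / 24
    return {
        "updates_per_window": updates_per_window,
        # A document updated at a uniformly random moment waits half a window on average.
        "mean_lag_hours": interval_hours / 2,
        # A document updated just after a run waits the full window.
        "max_lag_hours": interval_hours,
        # Just before a run fires, every update from that window is still unindexed.
        "peak_stale_docs": updates_per_window,
    }


for rate in (40, 60):
    print(rate, expected_staleness(rate, interval_hours=4))
```

At 40-60 updates per day and a 4-hour interval, that works out to roughly 7 to 10 documents serving an outdated version at the worst point in each cycle.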

The second mechanism — silent pipeline failure — is often worse. If the embedding job throws an exception midway through a batch and doesn't mark failed records for retry, your vector index develops holes. Some documents are never embedded at all. Queries that should retrieve them don't. The application doesn't error. It just returns the best candidates from the fraction of the corpus that was successfully indexed.
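One way to avoid holes is to make the batch loop record failures per record instead of aborting or swallowing them. The sketch below is a minimal illustration, not a production job: `embed` and `upsert` are hypothetical callables standing in for your embedding model client and vector store client, and `BatchResult` is a name we introduce here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass
class BatchResult:
    embedded: List[str] = field(default_factory=list)
    failed: Dict[str, str] = field(default_factory=dict)  # doc_id -> error text


def run_embedding_batch(
    docs: Iterable[Tuple[str, str]],              # (doc_id, text) pairs
    embed: Callable[[str], List[float]],          # hypothetical embedding client
    upsert: Callable[[str, List[float]], None],   # hypothetical vector store client
) -> BatchResult:
    """Embed each document independently so one failure cannot hide the rest."""
    result = BatchResult()
    for doc_id, text in docs:
        try:
            upsert(doc_id, embed(text))
        except Exception as exc:
            # Record the failure and keep going; the caller decides how to retry.
            result.failed[doc_id] = repr(exc)
        else:
            result.embedded.append(doc_id)
    return result
```

The caller would persist `result.failed` somewhere durable (a retry table, for example) and alert when it is non-empty, so that a partial run is visible rather than silent.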

Measuring your actual embedding lag

The first step in dealing with retrieval drift is quantifying it. Most teams don't have this instrumented, which means they are operating on the assumption that their pipeline is more or less current. That assumption is usually wrong.

A straightforward approach: add a last_embedded_at timestamp to each record in your source database, updated whenever you push the vector to your store. Then run this query periodically:

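SELECT
  -- Rows updated after their vector was last written.
  COUNT(*) FILTER (WHERE updated_at > last_embedded_at) AS stale_rows,
  -- Rows with no vector at all; these are not included in the lag figures.
  COUNT(*) FILTER (WHERE last_embedded_at IS NULL) AS never_embedded,
  -- Lag for a stale row is how long its latest update has been waiting.
  MAX(NOW() - updated_at)
    FILTER (WHERE updated_at > last_embedded_at) AS max_lag_interval,
  PERCENTILE_CONT(0.95) WITHIN GROUP (
    ORDER BY EXTRACT(EPOCH FROM (NOW() - updated_at))
  ) FILTER (WHERE updated_at > last_embedded_at) AS p95_lag_seconds
FROM documents
WHERE updated_at IS NOT NULL;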

Run this for a week. You will likely see that the stale row count is not zero between batch runs, and that the max lag interval is longer than you expected — especially around weekends or holidays when the batch job still runs but data updates happen at irregular rates.

If your answer to this query is "we can't easily run this because our vector store doesn't know about last_embedded_at," that tells you something fundamental about the dual-store architecture: the state of the vector index is opaque relative to your source database. You cannot easily ask "what does my vector index know?" without instrumenting it yourself.
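If the vector store can export an ID and a write timestamp for each vector, the question "what does my vector index know?" can be answered by reconciling the two sides directly. The sketch below assumes you can produce two mappings: document ID to `updated_at` from the source database, and document ID to an embedding timestamp from the vector store's metadata. Both inputs and the names `IndexGaps` and `find_index_gaps` are hypothetical; how you obtain the index-side mapping depends on your store.

```python
from datetime import datetime
from typing import Dict, List, NamedTuple


class IndexGaps(NamedTuple):
    missing: List[str]  # present in the source, absent from the index
    stale: List[str]    # indexed, but embedded before the latest update


def find_index_gaps(
    source_updated_at: Dict[str, datetime],
    index_embedded_at: Dict[str, datetime],
) -> IndexGaps:
    """Compare source-of-truth update times against index write times."""
    missing, stale = [], []
    for doc_id, updated_at in source_updated_at.items():
        embedded_at = index_embedded_at.get(doc_id)
        if embedded_at is None:
            missing.append(doc_id)
        elif embedded_at < updated_at:
            stale.append(doc_id)
    return IndexGaps(sorted(missing), sorted(stale))
```

The `missing` list corresponds to the holes left by failed batches; the `stale` list corresponds to documents waiting on the next run.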

What stale retrieval actually costs

The impact of retrieval drift depends entirely on the content type and update rate. For a knowledge base where documents change infrequently, a 4-hour lag may be inconsequential. For a product search application where item prices and availability update in near real time, a 4-hour lag means the LLM describes a product that is out of stock, at a price that was adjusted three hours ago.

There is a second-order effect that's less obvious: model confidence. A well-calibrated LLM will hedge when it's uncertain, but it has no way to know that the context it received is stale. It treats the retrieved chunks as current fact. This means stale retrieval doesn't just produce wrong answers — it produces confidently wrong answers. That distinction matters a lot for user trust. An error that the system flags or hedges is recoverable. An error presented as a confident assertion damages credibility in ways that are slow to repair.

Architectural approaches to eliminating drift

There are three approaches to the freshness problem. The first two narrow the gap operationally; the third removes it at the storage level.

Reduce batch interval. Running the embedding job every 15 minutes instead of every 4 hours cuts the maximum lag from hours to minutes, but it doesn't eliminate it. It can also raise embedding API costs: a record that changes several times within what used to be one window is now re-embedded on each run instead of once. And the failure-mode problem persists: a job that fails at 2:14 AM on a Saturday leaves a gap that stays open until a later run succeeds, and if failed records aren't marked for retry, it may not close at all. This approach adds operational load to a pipeline that is already a reliability risk.

CDC-triggered embedding. Use a change data capture stream (Debezium, Postgres logical replication, etc.) to trigger re-embedding on each row change, rather than polling in batches. This gets you sub-minute freshness for most workloads. It also adds a Kafka or other message queue dependency, a consumer service, and a new failure mode: if the CDC consumer falls behind under load, you get lag again. You have replaced a batch pipeline with a streaming pipeline: fresher, but with more infrastructure.
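The shape of a CDC consumer, and of its new failure mode, can be sketched with the standard library. Here a `queue.Queue` stands in for the Kafka topic or replication stream, and `embed`, `upsert`, and `clock` are hypothetical callables; a real consumer would also need offset commits, retries, and backpressure handling, which are omitted.

```python
import queue
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ChangeEvent:
    doc_id: str
    text: str
    committed_at: float  # source commit time, in seconds since the epoch


def drain_changes(
    events: queue.Queue,                          # stand-in for the CDC stream
    embed: Callable[[str], List[float]],          # hypothetical embedding client
    upsert: Callable[[str, List[float]], None],   # hypothetical vector store client
    clock: Callable[[], float],                   # e.g. time.time
) -> float:
    """Re-embed every queued change and return the worst lag seen, in seconds."""
    worst_lag = 0.0
    while True:
        try:
            event = events.get_nowait()
        except queue.Empty:
            return worst_lag
        upsert(event.doc_id, embed(event.text))
        # Lag is measured from the source commit to the moment the vector lands.
        worst_lag = max(worst_lag, clock() - event.committed_at)
```

The returned lag is the number to watch: when the consumer falls behind under load, it grows even though nothing has errored.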

Co-located storage. Store the vector in the same row as the structured data. Write them atomically in a single transaction. The vector is always as fresh as the row. There is no pipeline to fall behind, no CDC consumer to lag, no last_embedded_at delta to monitor. The architecture eliminates the freshness gap at the storage level rather than trying to close it operationally.
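A minimal sketch of the co-located pattern, using SQLite from the Python standard library purely as a stand-in for a database that stores vectors alongside rows. The table layout, the JSON-encoded `embedding` column, and `toy_embed` are all assumptions made for illustration; a real system would use a native vector column type and a real embedding model.

```python
import json
import sqlite3


def toy_embed(text):
    """Hypothetical stand-in for an embedding model call."""
    if not text.strip():
        raise ValueError("cannot embed empty text")
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


def open_store(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "id TEXT PRIMARY KEY, body TEXT NOT NULL, embedding TEXT NOT NULL)"
    )
    return conn


def save_document(conn, doc_id, body, embed=toy_embed):
    """Write the text and its vector together, or write nothing."""
    vector = embed(body)  # if this raises, the row is left untouched
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO documents (id, body, embedding) VALUES (?, ?, ?)",
            (doc_id, body, json.dumps(vector)),
        )
```

Because the body and the vector travel in the same row and the same transaction, there is no state in which a reader can see new text with an old vector.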

We are not saying CDC is a bad approach — for teams with specific constraints (an existing Postgres investment they can't move, a very large existing corpus), it is often the right intermediate step. What we are saying is that CDC-triggered embedding is treating the symptom. The cause is the separation between your write path and your embedding path. Co-location eliminates the cause.

The practical question is how much freshness drift your application can tolerate before it becomes a user-visible quality problem. For support chatbots and internal knowledge tools, that threshold is lower than most teams initially assume. The architecture decision should be made with the freshness SLA in mind — and that SLA should be explicit, not implicit in the batch interval you happened to configure on launch day.
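Making the SLA explicit can be as simple as writing it down as data and checking the measured figures against it. The sketch below assumes you already collect `p95_lag_seconds` and `never_embedded` from the lag query; the `FreshnessSLA` type and the threshold used in the usage note are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FreshnessSLA:
    max_p95_lag_seconds: float
    max_never_embedded: int = 0


def check_freshness(sla: FreshnessSLA, p95_lag_seconds: float, never_embedded: int) -> List[str]:
    """Return a human-readable list of SLA violations; empty means healthy."""
    violations = []
    if p95_lag_seconds > sla.max_p95_lag_seconds:
        violations.append(
            f"p95 lag {p95_lag_seconds:.0f}s exceeds SLA of {sla.max_p95_lag_seconds:.0f}s"
        )
    if never_embedded > sla.max_never_embedded:
        violations.append(
            f"{never_embedded} documents never embedded (allowed: {sla.max_never_embedded})"
        )
    return violations
```

For example, `check_freshness(FreshnessSLA(max_p95_lag_seconds=300), 900, 3)` returns two violations, which a scheduled check could route to the same alerting channel as latency or error-rate alarms.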