
Six lessons from running RAG pipelines in production

13 min read · Andrew Keil

Prototypes are easy. You pick a chunking library, call an embedding API, push vectors into a store, write a retrieval function, and hook it to an LLM. Two days, and you have something that impresses in a demo. Then you try to run it in production, and the gap between "works in a notebook" and "works reliably at scale" turns out to be a series of hard, unrelated problems.

These lessons aren't theoretical. They come from the conversations and engineering telemetry we've accumulated since 2021 from early-stage teams that shipped a RAG prototype and then spent the next several months unshipping the parts that broke. The failure patterns are consistent enough that we've seen the same bugs across entirely unrelated products.

Lesson 1: freshness matters more than recall, for most applications

Standard RAG evaluation focuses heavily on recall — whether the relevant documents appear in the retrieved set. Freshness — whether the retrieved documents reflect current state — receives far less attention in evals, and far more user complaints in production.

A documentation assistant for an internal tooling platform was passing recall evals with scores above 0.85. Users were still filing complaints that answers were wrong. When the team dug in, the pattern was consistent: the retrieved chunks were from the right document, but from an older version. The embedding job ran nightly, so any document edited after a run stayed stale until the next one, up to 24 hours later. The recall metric couldn't catch this because it measures whether the right document was retrieved, not whether the right version was retrieved.

The fix in that case was moving from nightly to hourly batches, then to re-embedding triggered by change data capture (CDC), so that an edit in the source database kicks off re-embedding of just that document. Each step reduced the complaint rate. The fundamental lesson: your freshness SLA should be defined explicitly, not left implicit in the batch schedule you happened to set up.
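To make "defined explicitly" concrete, here is a minimal sketch of a freshness check. It assumes you store, alongside each vector, the edit timestamp of the document version that was embedded; the 15-minute SLA, the function name, and the two dictionaries are hypothetical and stand in for queries against your own source table and index metadata.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical SLA: an edit should be reflected in the index within 15 minutes.
FRESHNESS_SLA = timedelta(minutes=15)


def stale_documents(source_updated_at, embedded_version_at, now, sla=FRESHNESS_SLA):
    """Return {doc_id: lag} for documents whose indexed copy violates the SLA.

    source_updated_at:   doc_id -> last edit time in the system of record
    embedded_version_at: doc_id -> edit time of the version currently in the index
                         (missing key means the document was never embedded)
    """
    violations = {}
    for doc_id, updated in source_updated_at.items():
        indexed = embedded_version_at.get(doc_id)
        if indexed is not None and indexed >= updated:
            continue  # the index already reflects the latest edit
        lag = now - updated  # how long the latest edit has gone unindexed
        if lag > sla:
            violations[doc_id] = lag
    return violations


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    source = {"runbook": now - timedelta(minutes=30)}
    index = {"runbook": now - timedelta(hours=3)}
    print(stale_documents(source, index, now))
```

Run on a schedule and exported as a metric, a check like this turns the freshness SLA into something that can page, instead of something users discover for you.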

Lesson 2: chunk size is workload-specific and rarely what the tutorial said

The most common default in RAG tutorials is 512 tokens with 50-token overlap. That number has no principled basis — it appeared in an early LangChain example and propagated from there. For some workloads it's fine. For others it's actively harmful.

Shorter chunks (128-256 tokens) tend to work better for retrieval precision on documents with dense, localized information: API reference documentation, structured runbooks, code files. The retrieved chunk is likely to contain the exact answer without surrounding noise. Longer chunks (1024-2048 tokens) tend to work better when the answer depends on context that spans a paragraph or section: narrative explanations, research summaries, financial disclosures. If a passage in a financial disclosure only makes sense alongside the surrounding paragraph, a 256-token chunk cut from it produces answers that are technically correct but misleadingly incomplete.

The right answer is to instrument retrieval quality across chunk sizes on your actual query distribution, not to pick a number from a tutorial. Offline evals using golden query sets are necessary here — production traffic alone doesn't tell you what you missed because you chunked wrong.
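A sweep like that does not need much machinery. The sketch below is a toy harness under loud assumptions: tokens are whitespace-split words, relevance is Jaccard word overlap standing in for your real embedding similarity, and a query counts as a hit when a retrieved chunk contains the golden answer string. All function names are hypothetical; the shape of the loop is the point, not the scorer.

```python
def tokenize(text):
    # Assumption: whitespace tokens stand in for model tokens.
    return text.lower().split()


def chunk(tokens, size, overlap):
    """Split tokens into windows of `size`, sharing `overlap` tokens with the previous window."""
    step = size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than size")
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks


def jaccard(query_tokens, chunk_tokens):
    # Stand-in scorer; replace with embedding similarity in a real eval.
    q, c = set(query_tokens), set(chunk_tokens)
    return len(q & c) / len(q | c) if (q | c) else 0.0


def hit_rate_at_k(docs, golden, size, overlap, k):
    """Fraction of golden queries whose answer string appears in a top-k chunk."""
    if not golden:
        raise ValueError("golden set is empty")
    chunks = [c for doc in docs for c in chunk(tokenize(doc), size, overlap)]
    hits = 0
    for query, answer in golden:
        q = tokenize(query)
        top = sorted(chunks, key=lambda c: jaccard(q, c), reverse=True)[:k]
        if any(answer.lower() in " ".join(c) for c in top):
            hits += 1
    return hits / len(golden)


def sweep(docs, golden, sizes, overlap_fraction=0.1, k=3):
    """Evaluate the same golden set at each candidate chunk size."""
    return {
        size: hit_rate_at_k(docs, golden, size, int(size * overlap_fraction), k)
        for size in sizes
    }


if __name__ == "__main__":
    docs = ["alpha beta gamma delta"]
    golden = [("beta gamma", "beta gamma")]
    print(sweep(docs, golden, sizes=[2, 4], overlap_fraction=0.0, k=1))
```

Hit rate alone favors long chunks, so in practice you would track it next to a precision-oriented number (for example, how much of the retrieved text is actually relevant) before picking a size.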

Lesson 3: metadata filtering is not optional in multi-tenant applications

Consider a semantic search system over support tickets for a B2B SaaS product. The team built it as a single flat namespace in their vector store. Search worked well in testing. In production, enterprise customers complained that results sometimes included ticket content from other organizations. The vector search was returning semantically similar tickets regardless of tenant ownership, because no filter was applied at query time.

This is obvious in retrospect. It's easy to miss when you're building a proof of concept on a single-tenant test dataset. The fix requires that tenancy filters are applied at query time: not as a post-retrieval step in application code, but inside the retrieval call, so that the ANN search itself is scoped to the correct tenant's data. Post-retrieval filtering in application code also gives wrong results. If you ask for the global top 20 and then discard 15 because they belong to other tenants, you are left with 5 results when you wanted 20, and on an unlucky query you are left with none, even though the tenant has relevant tickets further down the global ranking. You're returning whatever happened to fall in the top 20 globally. It also means other tenants' content has already left the store and is sitting in application memory.

Row-level security at the vector query layer is not a nice-to-have for multi-tenant applications. It is a correctness and security requirement.
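The difference between the two approaches is easy to see on a toy dataset. The sketch below uses exact brute-force search over an in-memory list as a stand-in for a vector store; the row layout, tenant names, and both function names are hypothetical. In a real store the scoped version corresponds to whatever filter or WHERE mechanism your retrieval call exposes.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def search(rows, query, k, tenant_id=None):
    """Exact top-k by dot product.

    When tenant_id is given, other tenants' rows are excluded BEFORE ranking,
    so k counts results for that tenant only.
    """
    candidates = [r for r in rows if tenant_id is None or r["tenant"] == tenant_id]
    return sorted(candidates, key=lambda r: dot(r["vec"], query), reverse=True)[:k]


def post_filtered(rows, query, k, tenant_id, fetch):
    """Anti-pattern: search globally for `fetch` rows, then drop other tenants in app code."""
    global_hits = search(rows, query, fetch)
    return [r for r in global_hits if r["tenant"] == tenant_id][:k]


# Toy data: three "globex" tickets happen to score highest for this query.
ROWS = [
    {"id": "g1", "tenant": "globex", "vec": [0.9, 0.1]},
    {"id": "g2", "tenant": "globex", "vec": [0.8, 0.2]},
    {"id": "g3", "tenant": "globex", "vec": [0.7, 0.3]},
    {"id": "a1", "tenant": "acme", "vec": [0.6, 0.4]},
    {"id": "a2", "tenant": "acme", "vec": [0.5, 0.5]},
    {"id": "a3", "tenant": "acme", "vec": [0.1, 0.9]},
]

if __name__ == "__main__":
    query = [1.0, 0.0]
    scoped = search(ROWS, query, k=2, tenant_id="acme")
    leaky = post_filtered(ROWS, query, k=2, tenant_id="acme", fetch=3)
    print([r["id"] for r in scoped])  # two acme tickets
    print([r["id"] for r in leaky])   # nothing survives the filter
```

Raising `fetch` hides the problem on small data but never removes it: the number of other-tenant rows that can outrank yours grows with every customer you add.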

Lesson 4: monitor retrieval quality separately from generation quality

When an LLM application produces a bad answer, the instinct is to blame the model. More often, the failure is in retrieval: the wrong context was passed, or no relevant context was passed, and the model made its best guess from weak signal. You cannot know which is true without instrumenting retrieval independently.

The minimum viable retrieval observability setup: log every query vector (or a hash of it), the retrieved chunk IDs, and the scores. Sample 1-5% of production queries for human or model-based relevance labeling. Track three numbers over time: top-1 precision (was the first result relevant?), any-hit recall (was at least one relevant result in the top-5?), and mean reciprocal rank. If these degrade, it's a retrieval problem. If they hold steady while user satisfaction falls, it's a generation problem.
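The three numbers are cheap to compute once the sampled queries are labeled. A minimal sketch, assuming each sampled query has been reduced to a rank-ordered list of booleans (True meaning the chunk at that rank was judged relevant); the function name and the input shape are our assumptions, not a standard API.

```python
def retrieval_metrics(samples, k=5):
    """Compute top-1 precision, any-hit recall@k, and MRR over labeled samples.

    samples: list of per-query relevance lists ordered by rank,
             e.g. [False, True, False] means only the second result was relevant.
    MRR here is truncated at k: a first relevant hit below rank k counts as 0.
    """
    if not samples:
        raise ValueError("no labeled samples")
    top1 = hits = rr_sum = 0.0
    for relevance in samples:
        top_k = relevance[:k]
        if top_k and top_k[0]:
            top1 += 1  # first result was relevant
        if any(top_k):
            hits += 1  # at least one relevant result in the top k
        for rank, is_relevant in enumerate(top_k, start=1):
            if is_relevant:
                rr_sum += 1.0 / rank  # reciprocal rank of the first relevant hit
                break
    n = len(samples)
    return {
        "top1_precision": top1 / n,
        "any_hit_recall": hits / n,
        "mrr": rr_sum / n,
    }


if __name__ == "__main__":
    labeled = [
        [True, False],
        [False, True, False],
        [False, False, False],
        [False, False, False, True],
    ]
    print(retrieval_metrics(labeled))
```

Computed daily over the labeled sample and plotted next to user satisfaction, these give you the split described above: retrieval numbers moving points at retrieval, retrieval numbers flat points at generation.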

Without this instrumentation, you are debugging a system with one observable output — the final answer — and two latent failure modes that are indistinguishable from the outside.

Lesson 5: the sync job always drifts, eventually

Every team that has run a dual-store architecture for more than 12 months in production has a story about the sync job. It ran fine for eight months, then a schema migration changed the table that was being polled, and the job started silently processing zero rows. Or the embedding API changed its rate limits and the job started queuing up, creating a lag that grew to 72 hours before anyone noticed. Or someone updated a batch of 50,000 documents at once for a legal review, and the job ran for 14 hours straight, blocking other updates.

The operational surface of an embedding sync pipeline grows over time. It accumulates edge cases. It develops implicit dependencies on things that change. We have not seen a production dual-store system that didn't eventually require substantial engineering time to keep the sync job reliable. This cost is not visible on the architecture diagram. It is visible in on-call paging history.
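Most of those stories share one feature: the job kept "succeeding" while doing the wrong thing. A small watchdog over the job's run history catches the common cases. The sketch below is a heuristic under stated assumptions: the `SyncRun` record, its fields, the thresholds, and the alert names are all hypothetical, and a streak of zero-row runs is only a canary, since it can be legitimate on a quiet dataset.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class SyncRun:
    finished_at: datetime
    rows_processed: int
    # Edit time of the oldest change still waiting to be embedded, if any.
    oldest_pending_change: Optional[datetime] = None


def check_sync_health(
    runs,
    now,
    max_silence=timedelta(minutes=30),
    max_zero_runs=3,
    max_lag=timedelta(hours=1),
):
    """Return a list of alert names for a run history ordered oldest to newest."""
    if not runs:
        return ["no_runs"]
    alerts = []
    latest = runs[-1]
    # The job has stopped completing at all.
    if now - latest.finished_at > max_silence:
        alerts.append("stalled")
    # The job completes but has processed nothing for several runs in a row,
    # which is what a polled table changing underneath it looks like.
    recent = runs[-max_zero_runs:]
    if len(recent) == max_zero_runs and all(r.rows_processed == 0 for r in recent):
        alerts.append("zero_rows")
    # The job is working but falling behind the source.
    pending = latest.oldest_pending_change
    if pending is not None and now - pending > max_lag:
        alerts.append("lagging")
    return alerts


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    history = [SyncRun(now - timedelta(minutes=m), 0) for m in (60, 40, 20)]
    print(check_sync_health(history, now))
```

None of this makes the sync job reliable. It shortens the time between "the job drifted" and "someone noticed", which in the 72-hour lag story above is the part that hurt.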

Lesson 6: two databases mean two failure modes and two operational runbooks

This is the most basic lesson and also the most consistently underweighted. When your vector store goes down, your RAG pipeline degrades or fails. When your SQL database has a performance issue, your vector store is still up — but if the sync job depends on the SQL database, it stops processing. You now have an incident that requires understanding the interaction between two systems with different operational characteristics.

We are not saying this is unmanageable. Production teams manage complex multi-system architectures every day. What we are saying is that the dual-store pattern carries a fixed operational overhead that you pay every month, in on-call time, in runbook maintenance, in the cognitive load of tracing queries across two systems. That cost should be part of the architecture decision, not an afterthought discovered in month seven of production.

The teams that have moved away from dual-store patterns consistently report the same thing: the queries got simpler, the failure modes got simpler, and the debugging got simpler. Not faster retrieval — simpler debugging. That is often more valuable than a marginal latency improvement.