RAG That Actually Works: Beyond the Naive Pipeline
Naive RAG — embed, retrieve top-k, stuff into a prompt — fails the moment it meets a real corpus. Here's what production retrieval requires.
The naive RAG recipe is seductive: embed your documents, retrieve the top-k most similar chunks, and stuff them into the prompt. It works on a tidy demo corpus and falls apart on a real one. Production retrieval is a pipeline of deliberate choices.
Chunking is a modelling decision
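How you split documents determines what the system can retrieve. Fixed-size chunks cut through tables and sever sentences from the context that gives them meaning; structure-aware or semantic chunking splits on headings, paragraphs, and topic boundaries instead. The right strategy depends on your corpus, and the only way to know is to measure retrieval quality for each candidate strategy on the same evaluation queries.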
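To make the difference concrete, here is a minimal sketch comparing a fixed-size splitter with a paragraph-packing one. The function names, the blank-line paragraph convention, and the sample document are all assumptions made for illustration; a real corpus would need splitting rules that match its own structure.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Cut every `size` characters, ignoring document structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def paragraph_chunks(text: str, max_chars: int) -> list[str]:
    """Pack whole paragraphs (separated by blank lines) into chunks.

    A paragraph is never cut: one that is longer than `max_chars`
    becomes its own oversized chunk.
    """
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow, so close the chunk.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Hypothetical document with a small table in the middle.
doc = (
    "Refund policy overview.\n\n"
    "Plan | Refund window\nBasic | 14 days\nPro | 30 days\n\n"
    "Contact support to start a refund."
)
table = "Plan | Refund window\nBasic | 14 days\nPro | 30 days"

fixed = fixed_size_chunks(doc, 40)
structured = paragraph_chunks(doc, 80)

print("table intact in a fixed-size chunk:", any(table in c for c in fixed))
print("table intact in a paragraph chunk:", any(table in c for c in structured))
```

Neither splitter is the right answer in general. The point is that each candidate produces a different set of retrievable units, which is why the choice has to be scored against the same queries.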
Hybrid retrieval beats pure vectors
Dense vector search captures meaning but misses exact terms such as names, codes, and identifiers. Sparse keyword search captures those but misses paraphrase. Combining the two with a fusion step covers both failure modes, and it is often one of the biggest quality levers in real systems.
- Use dense retrieval for semantic similarity
- Use sparse retrieval (BM25) for exact-term matches
- Fuse the results, then re-rank
- Tune the fusion weights against your evaluation set
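The fusion step in the list above can be sketched with weighted reciprocal rank fusion (RRF). RRF is one common fusion method chosen here as an assumption; the article does not prescribe a specific one. The function name, the retriever labels, and the constant `k=60` (a conventional default) are likewise illustrative.

```python
from collections import defaultdict


def weighted_rrf(
    rankings: dict[str, list[str]],
    weights: dict[str, float],
    k: int = 60,
) -> list[str]:
    """Fuse ranked lists of document ids with weighted reciprocal rank fusion.

    Each retriever contributes weight / (k + rank) for every document it
    returned, with rank starting at 1. Documents are returned by descending
    fused score; ties are broken by document id for determinism.
    """
    scores: dict[str, float] = defaultdict(float)
    for retriever, ranked_ids in rankings.items():
        weight = weights.get(retriever, 1.0)
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += weight / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a dense retriever and a BM25 retriever.
dense_hits = ["doc_a", "doc_b", "doc_c"]
sparse_hits = ["doc_c", "doc_d", "doc_a"]

fused = weighted_rrf(
    {"dense": dense_hits, "sparse": sparse_hits},
    {"dense": 1.0, "sparse": 1.0},
)
print(fused)
```

Because RRF works on ranks rather than raw scores, it sidesteps the problem that dense similarity and BM25 scores live on different scales. The per-retriever weights are the knobs to tune against the evaluation set.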
Re-ranking is non-negotiable
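Retrieval is recall-oriented: it casts a wide net. A cross-encoder re-ranker then scores each candidate against the query by reading the two together, which is far more precise than the initial retrieval but too slow to run over the whole corpus. Retrieving a wider set, say twenty candidates, and re-ranking down to the best four gives the model less irrelevant context to work from and typically improves grounding.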
Retrieval decides what's possible to answer. Re-ranking decides what actually gets answered.
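The retrieve-wide, keep-few pattern looks roughly like the sketch below. The `rerank` helper and its `score_fn` parameter are hypothetical names. In a real system `score_fn` would typically wrap a cross-encoder model; the token-overlap scorer here is only a stand-in so the example runs without dependencies, and it is not a substitute for a learned re-ranker.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 4,
) -> list[str]:
    """Score every (query, candidate) pair and keep the top_n best passages."""
    scored = [(score_fn(query, passage), passage) for passage in candidates]
    # Stable sort: candidates with equal scores keep their retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:top_n]]


def toy_overlap_score(query: str, passage: str) -> float:
    """Stand-in scorer: fraction of query tokens that appear in the passage."""
    query_tokens = set(query.lower().split())
    passage_tokens = set(passage.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & passage_tokens) / len(query_tokens)


# Hypothetical candidates returned by the first-stage retriever.
candidates = [
    "Billing cycles and invoices",
    "API rate limits",
    "How to rotate or reset an api key",
    "Reset your password",
    "Key concepts overview",
]

best = rerank("reset api key", candidates, toy_overlap_score, top_n=2)
print(best)
```

Keeping the scorer behind a plain callable means the first-stage retriever and the re-ranker can be swapped and evaluated independently.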
Measure faithfulness, not vibes
The most dangerous failure mode in RAG is a confident, fluent, wrong answer. You need automated evaluation of faithfulness (is the answer supported by the retrieved context?) and relevance (did retrieval surface the right context?). Without these numbers, every change is a guess.
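As a rough sketch of what those two numbers can look like: recall@k for retrieval relevance, and the fraction of answer sentences supported by the context for faithfulness. All names here are illustrative assumptions. In particular, `toy_is_supported` is a crude word-overlap check standing in for the judge; in practice that role is usually played by an entailment model or an LLM-based grader.

```python
import re
from typing import Callable


def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Share of the known-relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant_ids) / len(relevant_ids)


def faithfulness(
    answer: str,
    context: str,
    is_supported: Callable[[str, str], bool],
) -> float:
    """Share of answer sentences that the judge marks as supported by context."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    if not sentences:
        return 0.0
    return sum(is_supported(s, context) for s in sentences) / len(sentences)


def toy_is_supported(sentence: str, context: str, threshold: float = 0.8) -> bool:
    """Stand-in judge: most of the sentence's words must occur in the context."""
    words = set(re.findall(r"\w+", sentence.lower()))
    context_words = set(re.findall(r"\w+", context.lower()))
    return bool(words) and len(words & context_words) / len(words) >= threshold


# Hypothetical evaluation case.
context = "Pro plans have a 30 day refund window."
answer = "Pro plans have a 30 day refund window. Refunds are paid in bitcoin."

relevance = recall_at_k(["d3", "d1", "d9"], {"d1", "d2"}, k=3)
faith = faithfulness(answer, context, toy_is_supported)
print(f"recall@3={relevance:.2f} faithfulness={faith:.2f}")
```

The second sentence of the sample answer is fluent and confident but has no support in the context, which is exactly the failure the faithfulness score is meant to surface.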
Close the loop
Real corpora drift, queries shift, and edge cases accumulate. The systems that stay good are the ones where failed answers feed back into the evaluation set and drive the next round of tuning. RAG is not a build-once artifact; it is a system you operate.
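One simple way to wire up that feedback loop is to promote low-scoring production answers into the evaluation set. This is a minimal sketch under stated assumptions: the `LoggedAnswer` record, the `grow_eval_set` helper, and the 0.7 threshold are all hypothetical, and a real pipeline would usually add human review before a case becomes part of the benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoggedAnswer:
    """A production answer with its automated faithfulness score (hypothetical)."""
    query: str
    faithfulness: float


def grow_eval_set(
    eval_queries: list[str],
    logged: list[LoggedAnswer],
    min_faithfulness: float = 0.7,
) -> list[str]:
    """Return the eval set plus any new queries whose answers scored too low."""
    seen = set(eval_queries)
    grown = list(eval_queries)
    for item in logged:
        if item.faithfulness < min_faithfulness and item.query not in seen:
            grown.append(item.query)
            seen.add(item.query)
    return grown


eval_queries = ["how do I reset my api key"]
logged = [
    LoggedAnswer("what is the refund window for pro", 0.4),
    LoggedAnswer("how do I reset my api key", 0.3),  # already in the eval set
    LoggedAnswer("where are invoices stored", 0.95),  # passed, not added
]

eval_queries = grow_eval_set(eval_queries, logged)
print(eval_queries)
```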