Every RAG demo works. The chunking is clean, the embeddings are fresh, the LLM says coherent things. Then you put it in front of real users and the retrieval collapses on 30% of queries within the first week.
## Where it actually breaks
The failure surface isn't the retrieval step. It's the twelve decisions that happen before and after it: chunk size, overlap, embedding model, index type, similarity threshold, reranking, context window packing, prompt format, output parsing, citation extraction, groundedness verification, and the feedback loop back to retrieval.
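Spelled out, those decisions look like a pipeline config, where every field is a place the system can silently regress. A minimal sketch; the names and defaults here are illustrative, not any real library's API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class RagPipelineConfig:
    # One field per decision -- values are illustrative placeholders.
    chunk_size: int = 512               # tokens per chunk
    chunk_overlap: int = 64             # tokens shared by adjacent chunks
    embedding_model: str = "colbert-ir/colbertv2.0"
    index_type: str = "hnsw"
    similarity_threshold: float = 0.35  # minimum score to keep a hit
    reranker: Optional[str] = None      # e.g. a cross-encoder, or None
    max_context_tokens: int = 4096      # context window packing budget
    prompt_template: str = "default"
    output_parser: str = "json"
    citation_extraction: bool = True
    groundedness_check: bool = True
    feedback_to_retrieval: bool = False # loop user signals back to retrieval
```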
Even a four-line retrieval setup locks in several of those decisions. Here's the demo-grade version, using RAGatouille over ColBERTv2 (`docs` stands in for your corpus):

```python
from ragatouille import RAGPretrainedModel

docs = ["..."]  # your corpus: a list of document strings

RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

# Chunk size (max_document_length), embedding model, and index type are
# all fixed right here -- three of the twelve decisions, made implicitly.
RAG.index(collection=docs, index_name="my-kb", max_document_length=180)

results = RAG.search(query="what is contextual retrieval?", k=5)
```

The fix isn't a better vector store. It's treating each of those twelve stages as a testable unit and building an eval harness that catches regressions before they reach users.
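A minimal sketch of what that harness can look like, assuming you maintain a small golden set of (query, expected passage IDs) pairs. `retrieve` is a stand-in for whatever your retrieval stage exposes, and the recall floor is illustrative:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GoldenCase:
    query: str
    expected_doc_ids: set[str]  # passages a correct retrieval must surface

def recall_at_k(retrieve: Callable[[str, int], list[str]],
                cases: list[GoldenCase], k: int = 5) -> float:
    """Fraction of golden cases where at least one expected passage
    appears in the top-k retrieved IDs."""
    hits = 0
    for case in cases:
        retrieved = set(retrieve(case.query, k))
        if retrieved & case.expected_doc_ids:
            hits += 1
    return hits / len(cases)

def check_retrieval_recall(retrieve, golden_cases):
    # Run in CI against a frozen index and fail the build on regression.
    # The 0.85 floor is illustrative -- set it from your baseline run.
    assert recall_at_k(retrieve, golden_cases, k=5) >= 0.85
```

The same pattern extends stage by stage: a golden set and a threshold for reranking, for citation extraction, for groundedness, each checked independently so a regression points at the stage that caused it.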