Every RAG demo works. The chunking is clean, the embeddings are fresh, the LLM says coherent things. Then you put it in front of real users and the retrieval collapses on 30% of queries within the first week.
## Where it actually breaks
The failure surface isn't the retrieval step. It's the twelve decisions that happen before and after it: chunk size, overlap, embedding model, index type, similarity threshold, reranking, context window packing, prompt format, output parsing, citation extraction, groundedness verification, and the feedback loop back to retrieval.
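Spelled out, those decisions look like a pipeline config, where every field is a place the system can silently regress. A minimal sketch; the names and defaults here are illustrative, not any real library's API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class RagPipelineConfig:
    # One field per decision -- values are illustrative placeholders.
    chunk_size: int = 512               # tokens per chunk
    chunk_overlap: int = 64             # tokens shared by adjacent chunks
    embedding_model: str = "colbert-ir/colbertv2.0"
    index_type: str = "hnsw"
    similarity_threshold: float = 0.35  # minimum score to keep a hit
    reranker: Optional[str] = None      # e.g. a cross-encoder, or None
    max_context_tokens: int = 4096      # context window packing budget
    prompt_template: str = "default"
    output_parser: str = "json"
    citation_extraction: bool = True
    groundedness_check: bool = True
    feedback_to_retrieval: bool = False # loop user signals back to retrieval
```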
Even a four-line retrieval setup locks in several of those decisions. Here's the demo-grade version, using RAGatouille over ColBERTv2 (`docs` stands in for your corpus):

```python
from ragatouille import RAGPretrainedModel

docs = ["..."]  # your corpus: a list of document strings

RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

# Chunk size (max_document_length), embedding model, and index type are
# all fixed right here -- three of the twelve decisions, made implicitly.
RAG.index(collection=docs, index_name="my-kb", max_document_length=180)

results = RAG.search(query="what is contextual retrieval?", k=5)
```

The fix isn't a better vector store. It's treating each of those twelve stages as a testable unit and building an eval harness that catches regressions before they reach users.
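A minimal sketch of what that harness can look like, assuming you maintain a small golden set of (query, expected passage IDs) pairs. `retrieve` is a stand-in for whatever your retrieval stage exposes, and the recall floor is illustrative:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GoldenCase:
    query: str
    expected_doc_ids: set[str]  # passages a correct retrieval must surface

def recall_at_k(retrieve: Callable[[str, int], list[str]],
                cases: list[GoldenCase], k: int = 5) -> float:
    """Fraction of golden cases where at least one expected passage
    appears in the top-k retrieved IDs."""
    hits = 0
    for case in cases:
        retrieved = set(retrieve(case.query, k))
        if retrieved & case.expected_doc_ids:
            hits += 1
    return hits / len(cases)

def check_retrieval_recall(retrieve, golden_cases):
    # Run in CI against a frozen index and fail the build on regression.
    # The 0.85 floor is illustrative -- set it from your baseline run.
    assert recall_at_k(retrieve, golden_cases, k=5) >= 0.85
```

The same pattern extends stage by stage: a golden set and a threshold for reranking, for citation extraction, for groundedness, each checked independently so a regression points at the stage that caused it.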