
RAG Architecture Design

RAG (Retrieval-Augmented Generation) is the architecture that grounds an LLM in your private documents by retrieving relevant chunks at query time and injecting them into the prompt. The pipeline has five components: ingestion (parsing + chunking), embedding (turning chunks into vectors), storage (a vector DB), retrieval (similarity search + reranking), and generation (the LLM call with retrieved context). RAG is how you get an LLM to answer 'What's our refund policy?' from your own help center without retraining the model. It is the single highest-ROI AI architecture pattern in enterprise, and the one most consistently botched.
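The five stages can be sketched end to end. Everything below is an illustrative stand-in, not a specific vendor's API: the bag-of-characters `embed` stub replaces a real embedding model, a plain list replaces the vector DB, and `generate` only assembles the grounded prompt rather than calling an LLM.

```python
# Minimal sketch of the five RAG stages: ingest, embed, store, retrieve, generate.
# All component names here are illustrative stand-ins, not a library's real API.
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    text: str
    vector: list  # embedding vector

def embed(text: str) -> list:
    # Stand-in embedding: letter-frequency vector. A real system calls an
    # embedding model (API or local encoder) here.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord('a')] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def ingest(docs: dict) -> list:
    # Stages 1-3: parse + chunk (one chunk per paragraph for brevity),
    # embed, and store in an in-memory "vector DB".
    return [Chunk(doc_id, para, embed(para))
            for doc_id, text in docs.items()
            for para in text.split("\n\n")]

def retrieve(store: list, query: str, k: int = 2) -> list:
    # Stage 4: similarity search over the store.
    qv = embed(query)
    return sorted(store, key=lambda c: cosine(qv, c.vector), reverse=True)[:k]

def generate(query: str, chunks: list) -> str:
    # Stage 5: a real system sends this grounded prompt to an LLM.
    context = "\n".join(f"[{c.doc_id}] {c.text}" for c in chunks)
    return f"Answer '{query}' using only:\n{context}"

store = ingest({"refunds": "Refunds are issued within 14 days.\n\nShipping is free."})
print(generate("What's our refund policy?", retrieve(store, "refund policy")))
```

The point of the skeleton is that four of the five stages are plain retrieval engineering; the LLM only appears at the very end.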

Also known as: Retrieval-Augmented Generation, RAG Pipeline, Knowledge-Grounded LLM, Document Q&A Architecture

The Trap

The trap is treating RAG as 'embed your docs and ship.' The naive pipeline retrieves the wrong chunks 30-50% of the time on real enterprise data because: (1) your documents are messy PDFs and Confluence exports, not clean Markdown, (2) chunking by 512 tokens cuts policies in half, (3) embeddings retrieve based on semantic similarity, not relevance: 'What's our refund policy?' often pulls the marketing page about refunds, not the actual policy, (4) one-shot retrieval misses multi-hop questions. The fix isn't a better embedding model; it's better chunking, hybrid search (keyword + semantic), and reranking.
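Problem (2), fixed-token chunking that cuts policies in half, is addressed by splitting on document structure instead. A minimal sketch, assuming Markdown-style `##` headings; real pipelines need a parser per source format (PDF, Confluence, etc.):

```python
# Section-aware chunking sketch: split on headings instead of fixed token
# windows, and carry metadata (doc title, section) with each chunk.
# Assumes Markdown-style "## Title" headings; other formats need other parsers.
import re

def chunk_by_sections(doc_title: str, text: str) -> list:
    chunks = []
    current_section, buf = "Preamble", []

    def flush():
        body = "\n".join(buf).strip()
        if body:
            chunks.append({"doc": doc_title, "section": current_section, "text": body})

    for line in text.splitlines():
        m = re.match(r"^#{1,6}\s+(.*)", line)
        if m:
            flush()  # close the previous section as one whole chunk
            current_section, buf = m.group(1).strip(), []
        else:
            buf.append(line)
    flush()
    return chunks

doc = """## Refund Policy
Full refunds within 30 days of purchase.

## Shipping
Orders ship within 2 business days."""

chunks = chunk_by_sections("Help Center", doc)
print(chunks)
```

Each chunk now carries its section title as metadata, so a policy is never split mid-paragraph and the retriever can filter or boost by section.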

What to Do

Build RAG in five layers and measure each one independently. (1) Chunking: chunk by semantic boundaries (sections, not token counts) and store metadata (doc title, section, date). (2) Hybrid retrieval: combine BM25 keyword search with vector search; either alone misses ~30%. (3) Reranking: use a cross-encoder to re-score the top 50 candidates down to the top 5. (4) Citation: force the LLM to cite the chunk ID it used; reject answers without citations. (5) Eval set: 100+ real questions with hand-labeled correct chunks. Measure retrieval recall@5 separately from answer quality.
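Layer (2), hybrid retrieval, needs a way to merge the BM25 ranking and the vector ranking into one list. One common approach is reciprocal rank fusion (RRF); the two input rankings below are stubbed, since in production they come from your search index and your vector DB:

```python
# Hybrid-retrieval sketch: fuse a keyword (BM25) ranking and a vector ranking
# with reciprocal rank fusion (RRF). Input rankings are stubbed for the demo.
def rrf_fuse(rankings: list, k: int = 60) -> list:
    # Standard RRF: score(d) = sum over rankings of 1 / (k + rank(d)),
    # with rank starting at 1. k=60 is the commonly used constant.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_top = ["policy_v3", "faq_refunds", "marketing_refunds"]   # keyword hits
vector_top = ["marketing_refunds", "policy_v3", "blog_post"]   # semantic hits

fused = rrf_fuse([bm25_top, vector_top])
print(fused)  # documents ranked well by BOTH signals rise to the top
```

Note how `policy_v3` wins: it was ranked near the top by both signals, while `marketing_refunds` was first only in the semantic list. That is exactly the failure mode from the trap above (marketing page beating the actual policy) being corrected by the keyword signal.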

Formula

RAG Quality = Retrieval Recall@K × Generation Faithfulness − Citation Failures
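A worked instance of the heuristic: recall@K is measured against the hand-labeled eval set from layer (5), while faithfulness and the citation-failure rate come from grading answers. All numbers below are illustrative.

```python
# Worked instance of the article's quality heuristic. recall@K is measured
# against hand-labeled gold chunks; the other two terms come from answer
# grading. Every number here is illustrative.
def recall_at_k(labeled: list, k: int = 5) -> float:
    # labeled: (retrieved chunk IDs in rank order, gold chunk ID) per question
    hits = sum(1 for retrieved, gold in labeled if gold in retrieved[:k])
    return hits / len(labeled)

eval_set = [
    (["c1", "c7", "c3", "c9", "c2"], "c7"),  # hit
    (["c4", "c5", "c6", "c8", "c1"], "c2"),  # miss
    (["c2", "c1", "c9", "c5", "c7"], "c2"),  # hit
    (["c3", "c2", "c8", "c6", "c4"], "c8"),  # hit
]

recall = recall_at_k(eval_set)   # 3 of 4 questions -> 0.75
faithfulness = 0.90              # fraction of answers grounded in context
citation_failures = 0.05         # fraction of answers with no valid citation

rag_quality = recall * faithfulness - citation_failures
print(round(rag_quality, 3))  # 0.75 * 0.90 - 0.05 = 0.625
```

Because the terms multiply, a weak retrieval layer caps quality no matter how faithful the generator is, which is why the formula puts recall first.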

In Practice

Notion AI's 'Q&A' feature uses RAG over your workspace. Anthropic's documentation cites multiple production deployments where customers tuned chunking and reranking to lift answer accuracy from ~60% to >90% on internal-knowledge tasks. The pattern is consistent: the lift came from retrieval-layer fixes (better chunking, hybrid search, reranking), not from upgrading the LLM.

Pro Tips

1. Always log the retrieved chunks alongside the answer. When users complain about a wrong answer, 80% of the time the LLM was right given what was retrieved; the retrieval was wrong. You can't debug what you can't see.

2. Reranking is the cheapest lift in RAG. A small cross-encoder reranker on top 50 → top 5 typically adds 10-25 points to recall@5 for a few extra cents per query. Skip it and you're leaving accuracy on the table.
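The rerank step itself is just "re-score each (query, passage) pair, keep the best N." The scorer below is a toy word-overlap stand-in so the sketch runs anywhere; in production that function would be a cross-encoder model scoring each pair jointly.

```python
# Reranking sketch: re-score top-50 candidates pairwise against the query
# and keep the top 5. toy_score is a word-overlap stand-in for a real
# cross-encoder model.
def toy_score(query: str, passage: str) -> float:
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / max(len(q), 1)

def rerank(query: str, candidates: list, top_n: int = 5) -> list:
    scored = sorted(candidates, key=lambda c: toy_score(query, c), reverse=True)
    return scored[:top_n]

# 50 first-stage candidates: 48 irrelevant, 2 relevant buried at the bottom.
candidates = [f"passage {i} about shipping" for i in range(48)]
candidates += ["our refund policy allows returns", "refund requests take 5 days"]

top = rerank("refund policy details", candidates, top_n=5)
print(top[:2])
```

The design point: the first-stage retriever only has to get the right chunk *somewhere* in the top 50; the reranker's job is to pull it into the top 5 that actually reach the prompt.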

3. If your documents change frequently, build incremental re-embedding into the pipeline from day one. Backfilling 6 months of stale embeddings is the most expensive technical debt in RAG systems.
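One simple way to make re-embedding incremental is content hashing: store a hash per chunk and re-embed only chunks whose hash changed. A minimal sketch, with the actual embedding call left out:

```python
# Incremental re-embedding sketch: hash each chunk's content and re-embed
# only chunks whose hash changed since the last run. The embedding call
# itself is omitted; this shows just the change detection.
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def chunks_to_reembed(chunks: dict, seen_hashes: dict) -> list:
    # chunks: chunk_id -> current text; seen_hashes: chunk_id -> stored hash
    stale = []
    for chunk_id, text in chunks.items():
        h = content_hash(text)
        if seen_hashes.get(chunk_id) != h:
            stale.append(chunk_id)       # needs (re-)embedding
            seen_hashes[chunk_id] = h    # record so the next run skips it
    return stale

store = {}
first = chunks_to_reembed({"a": "v1 text", "b": "hello"}, store)   # both new
second = chunks_to_reembed({"a": "v2 text", "b": "hello"}, store)  # only "a" changed
print(first, second)
```

The same hash table also tells you which stored vectors belong to deleted chunks (IDs in `seen_hashes` but not in the current crawl), so the index never accumulates orphans.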

Myth vs Reality

Myth

"Bigger context windows kill RAG"

Reality

False. Long-context models complement RAG; they don't replace it. Even with 1M-token windows, you still need to retrieve relevant docs (you have 10M+ tokens of corporate content), and stuffing everything wastes money and degrades attention quality. The best architectures use RAG to pre-filter, then leverage long context for nuanced reasoning across 10-20 retrieved docs.

Myth

"Better embedding models solve retrieval problems"

Reality

Embedding upgrades typically add 2-5 points of recall. Hybrid search adds 10-15. Reranking adds 10-25. Better chunking adds 10-30. The embedding model is rarely the bottleneck once you're using a competent one (e.g., text-embedding-3 or Voyage).

Try it

Run the numbers.

Pressure-test the concept against your own knowledge: answer the challenge or try the live scenario.


Knowledge Check

Your RAG system has 65% answer accuracy on a 100-question eval set. The LLM almost always gives a correct answer when the right chunk is in the context window. What's the highest-leverage fix?

Industry benchmarks

Is your number good?

Calibrate against real-world tiers. Use these ranges as targets, not absolutes.

RAG Retrieval Recall@5

Enterprise document Q&A, after hybrid retrieval + reranking

Excellent: > 90%
Good: 80-90%
Average: 65-80%
Poor: < 65%

Source: Anthropic & vector DB vendor (Pinecone, Weaviate) public benchmarks

Real-world cases

Companies that lived this.

Verified narratives with the numbers that prove (or break) the concept.


Notion AI Q&A

2024 · Success

Notion shipped a workspace-scoped Q&A feature powered by RAG over user documents. Public engineering posts discuss the iterative path from naive embeddings to production quality: better chunking respecting page hierarchy, hybrid retrieval, and per-workspace permission filters. The result: a feature users actually trust rather than a demo that hallucinates.

Architecture: RAG with permission filters

Key Lift: Hierarchy-aware chunking + hybrid search

RAG quality in enterprise comes from respecting the structure of source documents and combining multiple retrieval signals, not from picking the trendiest LLM.


Hypothetical: Internal Knowledge Bot at a Large Bank

Composite scenario · Success

A retail bank built an internal RAG bot for branch staff to query policy documents. v1 used naive chunking and dense embeddings only; recall@5 was 54% and branch staff abandoned it. A 6-week rebuild added: (a) PDF parsing that preserved tables, (b) section-aware chunking, (c) BM25 + vector hybrid retrieval, (d) cross-encoder reranking, (e) forced citations. Recall@5 jumped to 87%, daily active users went from 40 to 1,800.

v1 Recall@5: 54%
v2 Recall@5: 87%
DAU (v1): 40
DAU (v2): 1,800

RAG is a pipeline, not a model. The 33-point recall jump came entirely from non-LLM components: parsing, chunking, retrieval fusion, and reranking.

Decision scenario

The RAG Bake-Off

Your team has a working RAG MVP at 65% recall@5 on your eval set. The CEO wants to ship in 4 weeks. You have $40K of cloud + API budget for the quarter. The product team wants to add features; the platform team wants to fix retrieval.

Current Recall@5: 65%
Eval Set Size: 120 questions
Time to Ship: 4 weeks
Budget Remaining: $40,000

Decision 1

You can either ship at 65% accuracy with prominent 'AI may be wrong' disclaimers, or delay 2 weeks to fix the retrieval layer first.

Option A: Ship at 65% with disclaimers. Users will tell us what's broken in production, and we'll learn faster from real traffic.

Outcome: Launch goes live. By week 3, support tickets cite the AI as 'unreliable.' Internal champions stop recommending it. Daily usage flatlines at 12% of expected. The team spends Q2 fighting the credibility hole created by shipping too early. Real-traffic learning is real, but trust is a one-shot resource; you spent it on a demo-quality launch.

Launch Recall: 65% · Usage After 3 Weeks: 12% of expected · Trust Recovery Time: 1+ quarter

Option B: Delay 2 weeks. Spend $5K on a reranker, $3K on better chunking infrastructure, and rebuild the eval set to 250 questions. Then ship.

Outcome: Recall@5 climbs from 65% to 84%. End-to-end accuracy from 60% to 78%. The 2-week delay costs you nothing politically (the CEO trusts the eval numbers). Launch lands with strong reviews. Daily usage hits 60% of expected by week 4. You earned the right to add features.

Launch Recall: 65% → 84% · End-to-End Accuracy: 60% → 78% · Usage After 4 Weeks: 60% of expected


Beyond the concept

Turn RAG Architecture Design into a live operating decision.

Use this concept as the framing layer, then move into a diagnostic if it maps directly to a current bottleneck.

Typical response time: 24h · No retainer required
