Interactive documentation

Retrieval-Augmented Generation

Seven sections. Pain first, mechanism second, trade-offs third, failure modes last. Built for engineers who want to understand — not just use.

~60% hallucination reduction
<1 min knowledge update latency
100× cheaper than fine-tuning

The problem

01 / 07 — Pain

Before touching RAG, understand what breaks without it. The pattern: LLM asked about private data → hallucinates, refuses, or simply lies with confidence.

The failure

You ask an LLM about your company's internal policy, last quarter's metrics, or a document it has never seen. It either hallucinates an answer or says "I don't know." Both are useless in production.

Static knowledge cutoff

LLMs are frozen at training time. The world keeps moving. Your internal data — policies, reports, code, tickets — was never in the model to begin with.

Context window limits

You can't paste your 10,000-page knowledge base into a prompt. Even if you could — every token costs money, adds latency, and dilutes focus.

Hallucination under uncertainty

Without grounding, the model fills gaps with plausible-sounding fiction. It doesn't signal uncertainty — it's confident and wrong, which is the worst kind of wrong.

Fine-tuning is expensive

Retraining costs $10,000–$1M+. It takes weeks. The model still can't cite sources. And the moment your data changes, you repeat everything.

The insight

What if, before generating an answer, the model first retrieved the relevant facts from your knowledge base? That's RAG — grounding generation in retrieval. Not retraining. Not context stuffing. Retrieve, then generate.
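
In code, the insight is a two-step function. A minimal sketch, where `retrieve(query, k)` and `llm(prompt)` are hypothetical stand-ins for your vector search and completion call, not real library APIs:

    def rag_answer(query: str, k: int = 4) -> str:
        # Step 1: retrieve the k chunks most similar to the query
        # from the knowledge base (hypothetical retriever).
        chunks = retrieve(query, k)
        # Step 2: generate, grounding the model in what was retrieved.
        context = "\n\n".join(chunks)
        prompt = (
            "Answer using only the context below. "
            "If it is insufficient, say so.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}"
        )
        return llm(prompt)  # hypothetical completion call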

Hallucination drop: ~60% vs. baseline LLM on domain questions
Knowledge update latency: <1 min vs. weeks for fine-tuning
Cost vs. fine-tuning: 100× cheaper to add new knowledge

How it works

02 / 07 — Mechanism

RAG has two distinct phases — offline indexing and online retrieval+generation. Most production failures happen when engineers treat them as one thing.

[Diagram: RAG two-phase architecture. Offline (build once): documents → chunking → embed chunks → vector DB of indexed embeddings. Online (per query): user query → embed query → similarity search retrieves top-k chunks from the vector DB → query + top-k chunks → LLM generates.]
Critical constraint

The embedding model is used twice: once offline to encode documents, once online to encode the query. Both calls must go through the exact same model. Swap one without re-indexing and the two vector spaces become incompatible. Every similarity score becomes noise. The system appears to run but retrieves garbage.
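
A minimal end-to-end sketch of both phases using the sentence-transformers library and a brute-force cosine search; the model name and corpus are placeholders. The point to notice is that `model` is a single object used in both phases:

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # ONE model for both phases

    # Offline: index the corpus once.
    chunks = ["Refunds are processed within 14 days.", "..."]  # placeholder corpus
    index = model.encode(chunks, normalize_embeddings=True)    # (n, d) matrix

    # Online: embed the query with the SAME model, then cosine search.
    query_vec = model.encode(["What is the refund window?"],
                             normalize_embeddings=True)[0]
    scores = index @ query_vec             # cosine similarity (vectors normalized)
    top_k = np.argsort(scores)[::-1][:3]   # indices of the 3 best chunks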

Pipeline deep dive

03 / 07 — Pipeline

Six steps. Click each one to understand what happens inside — and what specifically breaks.
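
As a taste of the first step in the diagram above, here is a minimal fixed-size chunker with overlap. The sizes are illustrative, not tuned; real pipelines usually chunk on tokens or sentence boundaries rather than characters:

    def chunk(text: str, size: int = 500, overlap: int = 100) -> list[str]:
        """Split text into fixed-size character windows with overlap.

        Overlap keeps a sentence that straddles a boundary retrievable
        from at least one chunk (see "chunk boundaries destroy document
        context" under Trade-offs).
        """
        step = size - overlap
        return [text[i:i + size]
                for i in range(0, max(len(text) - overlap, 1), step)]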

Live simulator

04 / 07 — Demo

Watch every step of the pipeline execute in real time. Edit the query, observe how similarity scores shift.

rag-pipeline-trace.sh
Step 1 — embed query
Step 2 — similarity search over the index
Step 3 — top-k chunks passed to LLM
Step 4 — LLM generates grounded answer
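
The simulator's trace maps onto a few lines of code. A sketch reusing `model`, `index`, and `chunks` from the mechanism section, plus the hypothetical `llm()` call from earlier:

    def trace(query: str, k: int = 3) -> str:
        print("Step 1 — embed query")
        q = model.encode([query], normalize_embeddings=True)[0]

        print("Step 2 — similarity search over the index")
        scores = index @ q
        top = np.argsort(scores)[::-1][:k]
        for i in top:
            print(f"  {scores[i]:.3f}  {chunks[i][:60]}")

        print("Step 3 — top-k chunks passed to LLM")
        context = "\n\n".join(chunks[i] for i in top)

        print("Step 4 — LLM generates grounded answer")
        return llm(f"Context:\n{context}\n\nQuestion: {query}")

Edit the query and rerun to watch the scores shift, which is exactly what the live widget animates.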

Trade-offs

05 / 07 — Trade-offs

RAG solves real problems and introduces new ones. The honest account — no cheerleading.

Strengths
  • Real-time knowledge updates — no retraining
  • Citable, source-grounded answers
  • Dramatically lower hallucination rate
  • Works with proprietary / private data
  • Decomposable — swap embedding model, vector DB, or LLM independently
Weaknesses
  • Two failure surfaces: retrieval and generation
  • Garbage chunking = garbage retrieval = garbage answers
  • Chunk boundaries destroy document context
  • Added latency: embed + search + generate
  • Semantic similarity ≠ logical relevance

RAG vs alternatives

Approach         Update speed   Cost        Accuracy   Best for
RAG              Minutes        Low         High       Dynamic docs, Q&A, support bots
Fine-tuning      Weeks          Very high   Medium     Style / tone / domain adaptation
Full context     Instant        Very high   Medium     Small doc sets (<50 pages)
Keyword search   Real-time      Very low    Low        Exact term lookup, legal clauses
Common mistake

Teams choose RAG because it's trendy, not because it's appropriate. If your knowledge base has fewer than 1,000 chunks, you might be better off with full-context stuffing or a hybrid keyword+vector approach. Know your data before architecting.
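
That judgment can be made roughly mechanical. An illustrative heuristic only; the thresholds are placeholders, not benchmarks:

    def pick_architecture(n_chunks: int, corpus_tokens: int,
                          context_budget: int = 100_000) -> str:
        # Whole corpus fits in one prompt: retrieval adds two failure
        # surfaces for no benefit.
        if corpus_tokens <= context_budget:
            return "full-context stuffing"
        # Small corpus, or queries dominated by exact terms: hybrid
        # keyword + vector usually beats pure vector search.
        if n_chunks < 1_000:
            return "hybrid keyword+vector"
        return "RAG"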

Failure modes

06 / 07 — Failures

Each failure below is real, common, and usually discovered in production — not in the demo. Click to reveal root cause and fix.

Predict & verify

07 / 07 — Quiz

Prediction builds real understanding. Reason through each question before selecting an answer. Surface learners can follow tutorials. Deep learners can predict what breaks and when.

Next level

After mastering the basics: study HyDE (hypothetical document embeddings), re-ranking with cross-encoders, multi-vector retrieval (ColBERT), and hybrid BM25+vector fusion. These are the patterns that separate production RAG from demo RAG.
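
Of those, hybrid BM25+vector fusion is the simplest to sketch. Reciprocal rank fusion (RRF) merges the two ranked lists without comparing their incompatible scores; k=60 is the constant from the original RRF paper. The usage line assumes you already have the two rankings as lists of chunk IDs:

    from collections import defaultdict

    def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
        """Fuse ranked lists of chunk IDs: score(id) += 1 / (k + rank)."""
        scores: defaultdict[str, float] = defaultdict(float)
        for ranking in rankings:
            for rank, doc_id in enumerate(ranking, start=1):
                scores[doc_id] += 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    # Hypothetical usage: fused = rrf([bm25_top100, vector_top100])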

RAG Documentation  ·  Built with first-principles teaching  ·  Pain → Root cause → Mechanism → Trade-off → Failure mode