Retrieval-Augmented Generation for Agents

Giving agents access to your knowledge

Intermediate · 16 min · Builder
What you'll be able to do
  • Explain why RAG exists and trace the six-stage pipeline from load to generate
  • Choose sensible chunking, embedding, and indexing strategies for real documents
  • Combine vector and keyword search with reranking to maximize retrieval quality
  • Design an agentic RAG loop where the agent controls retrieval, and know when classic RAG is the better choice
  • Evaluate a RAG system with RAGAS and diagnose its common failure modes
At a glance

An LLM only knows what it was trained on, frozen at a date, with none of your private documents. Retrieval-Augmented Generation (RAG) fixes this by fetching relevant text at query time and feeding it to the model as evidence. This lesson takes you from the classic retrieve-then-generate pipeline to agentic RAG, where the agent itself decides when, what, and how often to retrieve.

  1. Why RAG exists
  2. The six-stage pipeline
  3. Chunking, embeddings, and the vector store
  4. Hybrid search and reranking
  5. Agentic RAG: the agent controls retrieval
  6. Evaluation and failure modes

Why RAG exists

Think of a language model as a brilliant expert who read the entire internet once, years ago, then walked into a sealed room and shut the door. They reason beautifully, but they can't look anything up, they haven't seen today's news, and they've never read a single page of your company's documents. That sealed-room expert is your LLM by default.

Three gaps follow directly from how it was built. It hallucinates — confidently inventing facts it never learned. It has a knowledge cutoff — nothing after its training date exists to it. And it has no access to your private data — your wiki, contracts, tickets, and code were never in its training set.

RAG closes all three gaps with one move: instead of asking the model to recall an answer, you retrieve relevant passages from an external source at query time and put them directly into the prompt as evidence. It's the difference between a closed-book exam and an open-book one. The model's job shifts from remembering to reading and synthesizing — something it is far better and more reliable at.

Retrieval-Augmented Generation = fetch relevant text for a query, then generate an answer grounded in that text.

The payoff is concrete: answers stay current (re-index, don't retrain), they cite your private corpus, and they are grounded — you can trace each claim back to a source. For agents this is essential. An agent that can't look things up is limited to what's baked into its weights; an agent with RAG can reason over your entire knowledge base.

The six-stage pipeline

At heart, RAG is a library with a smart librarian. First you build the library: you bring in the books, cut them into readable passages, and file each one where it can be found again. Then, when a question arrives, the librarian fetches the most relevant passages and hands them to the model to read. Every RAG system, simple or agentic, is built from the same six stages — the first four happen offline when you ingest documents, the last two happen at query time.

  1. Load — pull raw documents from their source (PDFs, web pages, databases, tickets).
  2. Chunk — split each document into passages small enough to retrieve precisely.
  3. Embed — turn each chunk into a vector that captures its meaning.
  4. Index — store those vectors in a vector database for fast similarity search.
  5. Retrieve — embed the user's query and find the closest chunks.
  6. Generate — feed the retrieved chunks plus the question to the LLM, which answers from them.
python
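# Indexing (offline): run once per document set.
# load, chunk, embed, index, and llm are placeholders for your own components.
chunks = chunk(load("docs/"))
index.add(vectors=embed(chunks), payloads=chunks)  # keep each chunk's text next to its vector

# Query time: run per request.
def answer(query):
    hits = index.search(embed(query), k=5)  # the 5 chunks closest to the query
    context = "\n\n".join(h.text for h in hits)
    return llm(f"Answer using only this context:\n{context}\n\nQ: {query}")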

Each stage has its own failure modes: a bad chunk boundary, a weak embedding model, or a retrieval miss can sink the whole answer. The rest of the lesson is about getting each stage right.

Chunking, embeddings, and the vector store

Before the librarian can file anything, you have to decide how to cut the books and how to file them. Those two decisions — how you chunk, and which embedding model you use — quietly determine most of your retrieval quality.

Chunking is splitting documents into passages, and it is the most underrated lever. The intuition: a chunk is the unit you retrieve, so it should hold one coherent idea. Chunks that are too large dilute relevance and waste context; chunks that are too small bisect a key sentence or split a table. For English prose, a common starting point is 256–512 tokens with 20–30% overlap (the overlap keeps an idea from being severed at a boundary). Better still are semantic chunking, which cuts on topic shifts rather than at fixed sizes, and parent-child (small-to-big) chunking, which indexes small chunks for precise matching but returns the larger parent passage to the LLM, giving you both precision and context.
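To make the overlap and parent-child ideas concrete, here is a minimal sketch. It assumes the document has already been tokenized into a list (whitespace words stand in for model tokens), and the function names `chunk_tokens` and `parent_child` are illustrative, not from any library.

```python
def chunk_tokens(tokens, size=400, overlap=0.25):
    """Split a token list into fixed-size windows that overlap by a fraction."""
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")
    step = max(1, int(size * (1 - overlap)))  # size=400, overlap=0.25 gives a step of 300
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


def parent_child(tokens, parent_size=400, child_size=100):
    """Pair each small child chunk with the larger parent it was cut from."""
    pairs = []
    for parent_id, parent in enumerate(chunk_tokens(tokens, parent_size, overlap=0.0)):
        for child in chunk_tokens(parent, child_size, overlap=0.0):
            # Embed and index `child`; hand `parent` to the LLM when the child matches.
            pairs.append({"child": child, "parent_id": parent_id, "parent": parent})
    return pairs


# Hypothetical 1,000-token document.
tokens = [f"w{i}" for i in range(1000)]
chunks = chunk_tokens(tokens)
pairs = parent_child(tokens)
```

With these settings, consecutive chunks share their last and first 100 tokens, so a sentence that straddles a boundary still appears whole in at least one chunk.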

Embeddings map text to vectors (lists of numbers) positioned so that similar meanings land near each other in space. "Cancel my subscription" and "how do I end my plan" point to nearby coordinates even though they share few words; that nearness is what retrieval exploits. Model choice matters. By 2026 MTEB scores, Google's Gemini Embedding leads the general leaderboard (~68.3), while open-source models such as NV-Embed-v2 post higher raw numbers (~72) on the older English-only task set, so those two figures are not directly comparable. Among commercial API options, Cohere embed-v4 (~65.2) and OpenAI text-embedding-3-large (~64.6) are competitive for general English retrieval, and Voyage AI's voyage-3-large is reported to beat them by 4–6 MTEB points on domain-specific corpora (legal, medical, code). Don't assume any embedder works equally well for your vertical: benchmark on your own data.
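"Near each other in space" is usually measured with cosine similarity. The sketch below uses made-up three-dimensional vectors purely for illustration; real embedding models return hundreds or thousands of dimensions, and the values here are assumptions, not output from any model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-written toy vectors standing in for real embeddings.
cancel_subscription = [0.9, 0.1, 0.2]   # "Cancel my subscription"
end_my_plan = [0.8, 0.2, 0.3]           # "how do I end my plan"
weather_report = [0.1, 0.9, 0.1]        # an unrelated sentence

sim_paraphrase = cosine_similarity(cancel_subscription, end_my_plan)
sim_unrelated = cosine_similarity(cancel_subscription, weather_report)
```

The paraphrase pair scores far higher than the unrelated pair, which is the property retrieval relies on.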

Vector databases index those vectors for fast nearest-neighbor search — finding the closest points to your query without scanning every one:

Store | Best for
Chroma | local prototyping, <10M vectors
pgvector | already on Postgres, no new infra
Weaviate | native hybrid search, self-host or cloud
Pinecone | zero-ops managed scale, >10M vectors
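To see what any of these stores computes, it helps to write the exact, brute-force version. The sketch below is a toy in-memory store (the class name `TinyVectorStore` and the vectors are hypothetical) that scans every row; production vector databases avoid that full scan with approximate indexes, but return the same kind of result.

```python
import heapq
import math


def _unit(vector):
    """Scale a vector to length 1 so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


class TinyVectorStore:
    """Exact nearest-neighbor search by scanning all stored vectors."""

    def __init__(self):
        self._rows = []  # list of (unit vector, text)

    def add(self, vector, text):
        self._rows.append((_unit(vector), text))

    def search(self, query, k=2):
        q = _unit(query)
        scored = (
            (sum(a * b for a, b in zip(q, vec)), text)
            for vec, text in self._rows
        )
        return heapq.nlargest(k, scored)  # [(score, text), ...], best first


store = TinyVectorStore()
store.add([0.9, 0.1, 0.2], "How to cancel a subscription")
store.add([0.1, 0.9, 0.1], "Weekend weather outlook")
store.add([0.5, 0.3, 0.6], "Changing your billing plan")

hits = store.search([0.8, 0.2, 0.3], k=2)
```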

Hybrid search and reranking

Vector search has one blind spot, and it's a costly one. It captures meaning but misses exact tokens. Ask for error code E1042 or product Acme-Pro-X and embeddings, which blur such strings into nearby concepts, can sail right past the chunk that names them — the way a search for "big cat" might surface lions when you needed the literal word "jaguar." The 2025–2026 production standard fixes this with hybrid search: run a dense vector query and a sparse BM25 keyword query in parallel, then merge results with reciprocal rank fusion (RRF). BM25 (a classic keyword-ranking algorithm) catches the exact terminology embeddings miss; vectors catch the paraphrases BM25 misses. You get both.
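Reciprocal rank fusion itself is only a few lines. The sketch below assumes each retriever has already returned a ranked list of document ids; the ids are invented, and `k=60` is a commonly used default for the smoothing constant rather than something the text above specifies.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists: each doc earns 1 / (k + rank) from every list it appears in."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results for a query that mentions error code E1042.
vector_hits = ["doc_paraphrase", "doc_both", "doc_vague"]   # dense retriever
bm25_hits = ["doc_E1042", "doc_both", "doc_paraphrase"]     # keyword retriever

fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

Documents that both retrievers rank well rise to the top, while the exact-match chunk that only BM25 found still makes the fused list. Because RRF uses ranks rather than raw scores, there is no need to normalize BM25 scores against cosine similarities.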

The second upgrade is reranking, and it rests on a simple trade-off. Initial retrieval uses a fast bi-encoder that embeds the query and the chunks separately — efficient, so it can scan millions, but coarse. A cross-encoder reranker (Cohere Rerank, or open-source BGE-reranker) instead reads the query and each candidate together, scoring relevance far more accurately — but it's too slow to run over everything. So you combine them: retrieve the top ~20 candidates cheaply with the bi-encoder, rerank those 20 with the cross-encoder, and pass only the top 3–5 to the LLM.

python
cands = hybrid_search(query, k=20)          # vector + BM25, fused via RRF
top = reranker.rerank(query, cands)[:5]      # cross-encoder, keep best 5
answer = llm(query, context=top)

This matters more than it looks: stuffing 20 weakly-relevant chunks degrades answers (the lost-in-the-middle problem, where models attend least to the middle of a long context). Five precise chunks beat twenty noisy ones.
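The retrieve-then-rerank funnel can be sketched with two scoring functions of different cost. Everything here is a stand-in: `word_overlap` plays the cheap first-stage retriever and `phrase_aware` plays the expensive cross-encoder. Neither is a real model; the point is only that the slow scorer runs on the shortlist and never on the whole corpus.

```python
def retrieve_then_rerank(query, corpus, fast_score, slow_score,
                         n_candidates=20, n_final=5):
    """Shortlist with a cheap scorer, then reorder the shortlist with a precise one."""
    candidates = sorted(
        corpus, key=lambda doc: fast_score(query, doc), reverse=True
    )[:n_candidates]
    reranked = sorted(
        candidates, key=lambda doc: slow_score(query, doc), reverse=True
    )
    return reranked[:n_final]


def word_overlap(query, doc):
    """Cheap stand-in for first-stage retrieval: count of shared words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


slow_calls = {"count": 0}


def phrase_aware(query, doc):
    """Stand-in for a cross-encoder: reads query and doc together."""
    slow_calls["count"] += 1
    bonus = 2 if query.lower() in doc.lower() else 0
    return word_overlap(query, doc) + bonus


corpus = [f"filler note {i}" for i in range(50)] + [
    "our policy covers a refund in rare cases",
    "the refund policy allows 30 days",
]

top = retrieve_then_rerank(
    "refund policy", corpus, word_overlap, phrase_aware,
    n_candidates=5, n_final=1,
)
```

Both refund documents tie in the cheap stage, and the precise stage breaks the tie in favor of the one that actually contains the phrase, after scoring only five documents out of fifty-two.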

Watch out

More context is not better context

It is tempting to raise k and dump everything into the prompt. Resist it. Irrelevant chunks actively distract the model and bury the right answer in the middle of a long context, where models attend to it least. Retrieve broadly, then rerank down to a handful of high-signal passages.

Agentic RAG: the agent controls retrieval

Classic RAG is a reflex: it retrieves exactly once on every query, blindly. It looks things up even when no lookup is needed, and it never goes back for a second pass when the first one isn't enough. Agentic RAG turns that reflex into a deliberate choice by handing the retrieval decision to the agent itself. Instead of a hard-wired step, the retriever becomes a tool the agent calls inside its reasoning loop whenever it decides it needs to, the same way a person decides whether, what, and how many times to look something up.

That shift unlocks five patterns now standard in production:

  1. Query rewriting / decomposition — reshape a vague question into precise search queries, or split a multi-part question into sub-queries.
  2. Multi-hop retrieval — retrieve, read, then retrieve again using what was just learned (typically capped at 4–6 hops to bound latency).
  3. Tool routing — choose among retrievers: vector store, BM25, web search, or SQL, depending on the question.
  4. Self-check / faithfulness gate — before answering, judge whether the retrieved context actually supports the claim.
  5. Re-retrieve on failure — if context is weak, rewrite the query or fall back to web search rather than guessing.
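Several of the patterns above fit into one small control loop. The sketch below is an assumption-heavy illustration: `retrieve`, `is_supported`, `rewrite`, and `generate` are hypothetical callables that would be backed by a retriever and an LLM in practice, and here they are replaced by hard-coded stubs over a two-entry knowledge base so the loop can run on its own.

```python
def agentic_answer(question, retrieve, is_supported, rewrite, generate, max_hops=4):
    """Retrieve, self-check, and re-retrieve with a rewritten query until supported."""
    query = question
    evidence = []
    for hop in range(1, max_hops + 1):
        evidence.extend(retrieve(query))            # the retriever is just a tool call
        if is_supported(question, evidence):        # faithfulness gate before answering
            return {"answer": generate(question, evidence), "hops": hop}
        query = rewrite(question, evidence)         # re-retrieve instead of guessing
    return {"answer": None, "hops": max_hops}       # hop cap bounds latency


# Stub tools over a toy knowledge base (all hypothetical).
KB = {
    "who owns the billing service": ["The billing service is owned by team Payments."],
    "who manages team Payments": ["Team Payments is managed by Dana."],
}


def retrieve(query):
    return KB.get(query, [])


def is_supported(question, evidence):
    return any("managed by" in passage for passage in evidence)


def rewrite(question, evidence):
    if any("team Payments" in passage for passage in evidence):
        return "who manages team Payments"          # follow-up built from what was learned
    return "who owns the billing service"           # first sub-question


def generate(question, evidence):
    return evidence[-1]


result = agentic_answer(
    "who manages the team that owns the billing service",
    retrieve, is_supported, rewrite, generate,
)
gave_up = agentic_answer("anything", lambda q: [], is_supported, rewrite, generate)
```

The original question matches nothing, so the loop decomposes it, learns which team owns the service, and uses that fact to form the next query. When retrieval never produces support, the loop stops at the hop cap and returns no answer rather than a guess.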

Two influential variants build on this. Corrective RAG (CRAG) adds a lightweight evaluator that labels retrieved docs Correct / Incorrect / Ambiguous and triggers a web-search fallback when local retrieval is poor — a quality check before the model ever sees the context. GraphRAG (Microsoft Research, 2024) uses an LLM to extract an entity-relationship graph plus community summaries from your corpus, enabling global, cross-document questions ("what are the main themes across all these reports?") that chunk-based vector search cannot answer at all — reporting up to a 3.4x accuracy gain on complex multi-hop queries.
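The CRAG-style gate can be sketched as a small routing function. This assumes an evaluator has already produced a relevance score between 0 and 1 for each retrieved document; the thresholds, the function names, and the choice to combine local and web results in the Ambiguous case are illustrative assumptions, not details taken from the description above.

```python
def crag_label(scores, upper=0.7, lower=0.3):
    """Label a retrieval as Correct, Incorrect, or Ambiguous from per-doc scores."""
    if any(score >= upper for score in scores):
        return "Correct"        # at least one document is clearly relevant
    if all(score < lower for score in scores):
        return "Incorrect"      # nothing useful came back (also true for no results)
    return "Ambiguous"


def choose_context(local_docs, scores, web_search, upper=0.7, lower=0.3):
    """Pick the context to pass on, calling the web fallback only when needed."""
    label = crag_label(scores, upper, lower)
    if label == "Correct":
        return label, local_docs
    if label == "Incorrect":
        return label, web_search()
    return label, local_docs + web_search()


# Hypothetical web-search tool.
def fake_web_search():
    return ["web result"]


good = choose_context(["local A", "local B"], [0.9, 0.1], fake_web_search)
poor = choose_context(["local A"], [0.1], fake_web_search)
unsure = choose_context(["local A"], [0.5], fake_web_search)
```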

Key insight

Agentic RAG is a trade-off, not a free upgrade

Letting the agent loop adds latency, tokens, and failure points, and is harder to debug. For high-volume FAQ-style or single-document lookups, classic retrieve-then-generate is faster, cheaper, and more predictable. Reach for agentic RAG when questions are genuinely multi-step or span many sources.

Evaluation and failure modes

RAG fails quietly: unlike a crash, a wrong answer looks exactly like a right one — fluent, confident, and resting on the wrong source. So you cannot trust your eyes; you have to measure. The key move is to evaluate retrieval and generation separately, because the question "did we fetch the right text?" is different from "did the model use it correctly?" The RAGAS framework scores four metrics with an LLM-as-judge (a second model grading the output):

Metric | What it asks | Target
Faithfulness | Is the answer grounded in the context? | ≥ 0.75
Answer relevancy | Does it actually address the question? | ≥ 0.80
Context precision | Are retrieved chunks relevant? | ≥ 0.70
Context recall | Was all needed info retrieved? | ≥ 0.80

These numbers are a diagnostic map. Low context recall means a retrieval miss (the answer wasn't fetched); low context precision means retrieval noise (you fetched junk alongside the answer); low faithfulness with good context means the model ignored or hallucinated despite having the answer in front of it.
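That diagnostic map translates directly into a triage function. The sketch below takes the targets from the table above and a plain dictionary of scores; the function name and the returned labels are illustrative, and the input is assumed to be averaged metric values you have already computed.

```python
TARGETS = {
    "context_recall": 0.80,
    "context_precision": 0.70,
    "faithfulness": 0.75,
}


def diagnose(scores, targets=TARGETS):
    """Map metric scores to the likely failing stage."""
    findings = []
    if scores["context_recall"] < targets["context_recall"]:
        findings.append("retrieval_miss")        # the answer was never fetched
    if scores["context_precision"] < targets["context_precision"]:
        findings.append("retrieval_noise")       # junk fetched alongside the answer
    retrieval_ok = not findings
    if retrieval_ok and scores["faithfulness"] < targets["faithfulness"]:
        findings.append("unfaithful_generation")  # good context, model ignored it
    return findings or ["no_issue_flagged"]


report = diagnose(
    {"context_recall": 0.91, "context_precision": 0.84, "faithfulness": 0.52}
)
```

Checking retrieval first matters: low faithfulness only points at the generator once you know the right context was actually in the prompt.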

python
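from ragas import evaluate
from ragas.metrics import Faithfulness, ContextPrecision, ContextRecall

# dataset: an evaluation dataset built beforehand, holding for each sample the
# question, the retrieved contexts, the generated answer, and a reference answer.
scores = evaluate(
    dataset,
    metrics=[Faithfulness(), ContextPrecision(), ContextRecall()],
)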

Watch for the classic traps: chunk-boundary artifacts losing key context, a stale index (docs changed but weren't re-indexed), and hallucinated citations that blend retrieved and parametric knowledge. And don't over-trust the numbers — overfitting prompts to RAGAS yields high scores that fail on real traffic. Calibrate against human-judged, production-sampled queries too.
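A stale index is the easiest of these traps to catch mechanically. One possible approach, sketched below, is to store a content hash for each document at indexing time and compare it against the current files; the function names and the sample documents are hypothetical.

```python
import hashlib


def fingerprint(text):
    """Stable content hash recorded alongside the document at indexing time."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def index_drift(current_docs, indexed_hashes):
    """Compare live documents against the hashes saved when they were indexed."""
    changed = sorted(
        doc_id for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != fingerprint(text)   # new or edited
    )
    deleted = sorted(doc_id for doc_id in indexed_hashes if doc_id not in current_docs)
    return {"reindex": changed, "remove": deleted}


# Hashes saved at the last indexing run (hypothetical).
indexed = {
    "faq.md": fingerprint("Refunds take 5 days."),
    "old.md": fingerprint("Legacy notes."),
}
# What the source of truth contains today.
current = {
    "faq.md": "Refunds take 3 days.",
    "new.md": "Fresh page.",
}

drift = index_drift(current, indexed)
```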

Try it: Build and evaluate a hybrid RAG pipeline

Take a small set of your own documents (a handful of PDFs, a wiki export, or markdown notes) and build a RAG pipeline end to end.

  1. Chunk the docs at ~400 tokens with 25% overlap, then try parent-child chunking and compare.
  2. Embed and index with Chroma or pgvector using a current embedding model.
  3. Retrieve with pure vector search, then add BM25 + RRF hybrid search, then add a cross-encoder reranker (Cohere Rerank or BGE-reranker), passing only the top 5 chunks to the model.
  4. Evaluate with RAGAS on ~15 hand-written question/answer pairs, recording faithfulness and context precision at each upgrade.

Write down how each change moved the scores. Then add one agentic step — query rewriting before retrieval — and measure whether it helps. The goal is to feel how chunking, hybrid search, and reranking each shift quality, and to build the instinct to measure rather than guess.

Key takeaways

  1. RAG grounds an LLM in external text to fix hallucination, knowledge cutoff, and lack of private-data access, shifting the model from remembering to reading.
  2. The pipeline is load → chunk → embed → index → retrieve → generate; chunking strategy and embedding-model choice are high-leverage decisions.
  3. Hybrid (vector + BM25) search plus cross-encoder reranking to 3–5 precise chunks beats pure vector search and beats dumping in many noisy chunks.
  4. Agentic RAG lets the agent decide when and what to retrieve, rewrite queries, and self-check. It is powerful for multi-hop questions but slower and harder to debug than classic RAG.
  5. Evaluate retrieval and generation separately with metrics like RAGAS faithfulness and context precision, and validate against human-judged production traffic.

Quiz

Lock in what you learned

Check your understanding

1. What is the core mechanism by which RAG reduces hallucination and staleness?

2. Why is hybrid (vector + BM25) search preferred over pure vector search in production?

3. Which statement about agentic RAG is accurate?

4. A RAG system returns a fluent answer, but RAGAS shows high context precision and recall yet low faithfulness. What is the most likely problem?