Context Engineering
The successor discipline to prompt engineering
- Explain how context engineering differs from prompt engineering and why it became the successor discipline for production agents
- Name and diagnose the four context failure modes — poisoning, distraction, rot, and clash — and what causes each
- Apply the four core strategies (write, select, compress, isolate) to keep an agent's context lean across a long run
- Use compaction, just-in-time retrieval, and tool-result management to fight context rot in real code
- Decide when to reach for sub-agent isolation, and understand its limits
Context engineering is the discipline of deliberately curating everything that fills a model's context window across an entire agent run — not just the one prompt you wrote. As agents loop for dozens or hundreds of steps, the context window becomes a scarce, contested resource where quality, not capacity, decides whether the agent stays coherent or quietly falls apart. This lesson gives you the strategies (write, select, compress, isolate), the failure modes (poisoning, distraction, rot, clash), and the production tooling to keep an agent's context lean and high-signal.
- 1From writing a prompt to engineering a context
- 2The context window is a scarce, contested resource
- 3How contexts fail: poisoning, distraction, confusion, clash
- 4The four strategies: write, select, compress, isolate
- 5Just-in-time retrieval and compaction in practice
- 6Sub-agent isolation — the architectural mitigation
From writing a prompt to engineering a context
Prompt engineering is about the one message you craft for a single interaction. Context engineering is about everything the model sees at every step of a long agent run — and in 2026, that distinction is the whole ballgame.
The term was popularized by Shopify CEO Tobi Lütke on June 19, 2025, who framed it as "the art of providing all the context for the task to be plausibly solvable by the LLM." Andrej Karpathy endorsed it as "the delicate art and science of filling the context window with just the right information for the next step."
The scope is the key difference. A prompt is one string. A context is the entire state assembled before each model call: the system instructions, the running conversation, retrieved documents, tool definitions, prior tool results, and long-term memory. In an agent that loops 80 times, you are not writing one prompt — you are managing how that whole pile evolves, grows, and decays across every iteration.
This is why the practitioner consensus in 2025–2026 has broadly shifted: prompt engineering alone is no longer sufficient for production AI. The prompt is now just one ingredient in a much larger, dynamic context that you must engineer deliberately.
Key insight
The one-line definition
Context engineering = curating the full information state across an entire agent run, not crafting a single prompt. The guiding principle: find the smallest set of high-signal tokens that maximize the likelihood of your desired outcome — true no matter how large your context window is.
The context window is a scarce, contested resource
It is tempting to think a 200K-token window means you have 200K tokens of usable working memory. You don't. Performance degrades well before you hit the limit, and that degradation is universal.
This is context rot: measurable accuracy loss as input length grows. Chroma's 2025 study of 18 frontier models — including GPT-4.1, Claude Opus 4, and Gemini 2.5 Pro — found degradation at every length increment. It is architectural, rooted in the quadratic cost of attention, not a gap that more training closes. A bigger window does not fix it.
The most famous symptom is the lost-in-the-middle effect (Liu et al., Stanford/TACL 2024): information buried in the middle of a long context is often ignored, with accuracy drops exceeding 30% versus the same fact placed at the start or end.
The practical consequence reframes everything: the context window is a budget shared by competing claimants — instructions, history, tool outputs, retrieved chunks. Every token you spend on low-signal content is a token taxed from the model's attention. High-signal minimalism beats information-dumping. The job is not to fill the window; it is to fill it well.
Watch out
Bigger windows are not a cure
A common and costly misconception: "we'll just use the 1M-token model." Context rot affects every model tested at every length. Larger windows buy you headroom, not immunity. You still have to curate.
How contexts fail: poisoning, distraction, confusion, clash
Drew Breunig's widely-cited June 2025 taxonomy names four distinct ways a context degrades an agent — each with a different mechanism and fix.
- Context poisoning. A hallucination enters the context, gets written back, and is then treated as ground truth in every subsequent step — compounding the error. Note this is not the same as a one-off hallucination; the danger is the feedback loop that amplifies it.
- Context distraction. As history accumulates, the model over-relies on its past steps instead of reasoning freshly, repeating prior actions rather than adapting.
- Context confusion. Irrelevant content — too many tool definitions, stale chunks — influences the response, e.g. the model calls the wrong tool because a tempting-but-wrong one is in scope.
- Context clash. Two pieces of contradictory information sit in the same window (an old plan and a revised one), producing inconsistent, conflicting guidance.
The lesson: failures are not random. Each maps to a specific corrective strategy — pruning poisoned state, summarizing to reduce distraction, scoping tools to reduce confusion, and reconciling or removing contradictions to resolve clash.
Example
Poisoning in the wild
A research agent hallucinates that an API endpoint is /v2/users. That false fact gets written into its working notes. For the next 20 steps it keeps calling /v2/users, ignoring 404s because its own context "confirms" the endpoint exists. The fix isn't a smarter model — it's not letting unverified claims persist in context.
The four strategies: write, select, compress, isolate
If the window is a budget, these are the four moves you make to stay under it. Think of them as the levers on a mixing board: you push some content out of the window, pull only what you need back in, shrink what remains, and wall off the messy parts. LangChain formalized this production toolkit into four named strategies — treat them as a checklist for any long-running agent.
| Strategy | What it does | Concrete mechanism |
|---|---|---|
| Write | Persist info outside the context window | Scratchpads, files, long-term memory stores |
| Select | Pull only relevant stored info back in | RAG, embeddings, knowledge graphs |
| Compress | Shrink what's in-context | Summarization, compaction, pruning |
| Isolate | Separate contexts to prevent pollution | Sub-agents, sandboxing |
The common mistake is equating context engineering with RAG alone. RAG is only the select strategy. A robust agent uses all four: it writes intermediate results to a scratchpad so they don't bloat the window, selects the few documents it needs right now, compresses old turns into summaries, and isolates risky sub-tasks into clean sub-agent contexts.
A second high-leverage technique cuts across all four: static/dynamic separation. Put rarely-changing system instructions first so they can be prompt-cached for cost and latency, and put dynamic content (new tool results, fresh retrievals) last to exploit the model's recency bias. Caching complements context engineering; it does not replace the curation work.
Tip
MCP standardizes the inflow
The Model Context Protocol (Anthropic, late 2024; 10,000+ public servers by late 2025) standardizes how tools, resources, and prompts enter the context window. It doesn't decide what belongs there — that's still your job — but it makes the select and tool-definition plumbing uniform across systems.
Just-in-time retrieval and compaction in practice
Two patterns do most of the heavy lifting in production. Both follow the same instinct a careful researcher uses: keep your desk clear, and pull a reference only when you actually open it.
Just-in-time retrieval. Instead of pre-loading every possibly-relevant document, the agent holds lightweight identifiers — file paths, stored queries, URLs — and loads the actual content into context only when a step needs it. Think of it as keeping a table of contents in mind rather than the whole library on the desk. This progressive disclosure keeps the active context small and high-signal.
# Don't pre-stuff everything. Hold references; load on demand.
file_index = ["docs/auth.md", "docs/billing.md", "docs/api.md"]
def step(state):
# Agent decides which reference it needs *now*
needed = model.pick_relevant(state.goal, file_index)
content = read_file(needed) # load into context just-in-time
return model.act(state.goal, content) # small, focused contextCompaction. When the conversation grows past a threshold, summarize it — preserving high-signal artifacts (architectural decisions, unresolved bugs, key identifiers) and discarding low-signal content (redundant tool dumps, already-resolved chatter). Anthropic's Claude API ships a native compaction API (beta compact-2026-01-12, type compact_20260112) that auto-summarizes when input tokens exceed a configurable threshold (default 100,000; minimum 50,000). Claude Code triggers auto-compaction at roughly 95% of context capacity. Crucially, compaction is not lossless — the art is choosing what to keep. The Claude Cookbook documents the three core levers: compaction, tool-result clearing, and memory injection.
Tip
Steer your summaries
Both the compaction API and Claude Code's /compact accept custom summarization instructions. For a coding agent, explicitly tell it to preserve file paths, function signatures, and open bugs — otherwise generic summarization will discard exactly the details the next step needs.
Sub-agent isolation — the architectural mitigation
The strongest lever against context failure is structural: don't let one context window absorb everything.
In sub-agent isolation, a main orchestrator spawns ephemeral sub-agents, each with a clean, task-scoped context window. A sub-agent does its messy work — dozens of tool calls, dead ends, verbose outputs — in its own context, then returns a condensed summary (typically 1,000–2,000 tokens) to the parent. The parent never sees the noise; the sub-agent always starts clean. Anthropic measured a 90.2% performance improvement for this architecture over single long-context agents.
def orchestrator(goal):
subtasks = plan(goal)
summaries = []
for task in subtasks:
# Each sub-agent gets a fresh, isolated context
result = run_subagent(task) # noisy work stays inside
summaries.append(result.summary) # only ~1–2k tokens returns
return synthesize(goal, summaries)But isolation redistributes context management rather than eliminating it: the orchestrator still accumulates summary traffic and must manage its own growing context. Active research continues to target the residual problem of cross-agent context contamination when many sub-agents share a single orchestrator's window. Isolation is powerful, not free.
Key insight
Why isolation works
Each failure mode feeds on accumulated state. A fresh sub-agent context has no poisoned facts, no distracting history, no rot, no clashing instructions to inherit. You're not making the model smarter — you're denying the failure modes the accumulated context they need to take hold.
Try it: Audit and shrink an agent's context
Take any agent transcript you have (or run a multi-step task in Claude Code or a LangGraph agent and capture the full message history). Then:
- Measure. Count the tokens at the longest point of the run. What fraction is system instructions, conversation history, tool results, and retrieved content?
- Diagnose. Find one instance of each failure mode you can spot: a persisted unverified claim (poisoning), a repeated action (distraction), an irrelevant-but-tempting tool definition (confusion), or two contradictory instructions (clash).
- Apply a strategy. Pick the single biggest token consumer and apply the matching strategy — move it to an external scratchpad (write), replace it with a just-in-time reference (select), summarize it with a steered prompt that preserves key identifiers (compress), or push the noisy sub-task into an isolated sub-agent that returns a 1–2k-token summary (isolate).
- Re-measure. Report the before/after token count and, if you can, whether task success changed.
Write up your single highest-leverage change in three sentences. The goal is to internalize the core instinct: find the smallest set of high-signal tokens that gets the outcome.
Key takeaways
- 1Context engineering curates the full information state across an entire agent run; prompt engineering crafts a single message — conflating them is the central mistake.
- 2Context rot is universal and architectural: every frontier model degrades as input grows, and bigger windows buy headroom, not immunity.
- 3The four failure modes — poisoning, distraction, confusion, and clash — each have a distinct cause and a distinct fix, so diagnose before you patch.
- 4Use all four strategies — write, select, compress, isolate — not just RAG; high-signal minimalism beats information-dumping every time.
- 5Sub-agent isolation gives large, measured gains by keeping each context clean, but it redistributes context management to the orchestrator rather than removing it.
Quiz
Lock in what you learned
Check your understanding
0 / 4 answered
1.What most precisely distinguishes context engineering from prompt engineering?
2.A 200K-token model still loses accuracy on long inputs well before hitting its limit. Why?
3.Which strategy mapping is correct?
4.What is the key caveat about sub-agent isolation?
Go deeper
Hand-picked sources to keep learning
Primary authoritative source (Sep 29, 2025): system prompt design, just-in-time retrieval, compaction, note-taking, and multi-agent architectures.
The write / select / compress / isolate framework with LangGraph implementation patterns.
Official reference for the native compaction API (beta compact-2026-01-12): thresholds, custom summarization, streaming, billing, and caching integration.
Code-level recipe for the three core levers: compaction, tool-result clearing, and memory injection.
The definitive taxonomy of the four failure modes (poisoning, distraction, confusion, clash) with mechanisms and mitigations.
Technical treatment of context rot, the Chroma 18-model study, and the sub-agent architecture solution with measured results.