Context Windows & Prompting
The model's working memory — and how to fill it well
- Explain the context window as the model's sole working memory and reason about token budgets
- Use the system, user, assistant, and tool roles correctly to structure what the model sees
- Apply zero-shot, few-shot, chain-of-thought, role assignment, delimiters, and output-format prompting
- Diagnose and fix common failure modes like ambiguity and the lost-in-the-middle effect
- Distinguish an agent's system prompt from a chatbot prompt and write one accordingly
The context window is the only thing a model can see — its entire working memory for a single inference call. This lesson shows you what goes into that window, the message roles that structure it, and the prompting techniques that turn raw model capability into reliable behavior. It ends by showing why prompting an agent is a different discipline from prompting a chatbot.
- 1The context window is everything the model knows
- 2Sizes, token budgets, and the limits of "bigger"
- 3Message roles: who is speaking and with what authority
- 4The prompting toolkit: instructions, examples, roles, delimiters, format
- 5Zero-shot, few-shot, and chain-of-thought
- 6Failure modes: ambiguity and lost-in-the-middle
- 7Prompting an agent vs. prompting a chatbot
The context window is everything the model knows
A language model has no hidden memory between calls. For one inference, the only thing it can reason about is the text you place in its context window — and nothing outside that window exists to it. The window is the model's entire working memory.
What lives in there is broader than the question you typed. On a single agent turn the context typically holds:
- the system prompt (policy, persona, rules)
- the conversation history (every prior user and assistant turn)
- tool schemas (the definitions of functions the model can call)
- tool results returned from those calls
- retrieved documents (RAG passages, files, search hits)
- the model's own prior output in this turn
Everything is concatenated into one sequence of tokens, and the model attends over all of it at once. This single fact explains most of an agent's behavior. If a detail isn't in the window, the model cannot use it, no matter how obvious it seems to you. If contradictory instructions are both in the window, the model has to reconcile them. Mastering agents starts with controlling exactly what enters this window — and what you keep out.
Sizes, token budgets, and the limits of "bigger"
Think of the context window like a desk: you can spread out only so much paper before things fall off the edge. That edge is measured in tokens — chunks of text, roughly 3–4 characters of English each — and the window is a hard cap, because input plus output together must fit on the desk. As of mid-2026, those desks have grown enormous:
| Model | Context window |
|---|---|
| GPT-4o | 128K tokens |
| Claude Opus 4.6 / Sonnet 4.6 | 1M tokens |
| GPT-5 series | 400K–1M tokens |
| Gemini 2.5 / 3.1 Pro | 1M tokens |
| Llama 4 Scout | 10M tokens |
But a big window is a budget, not a free lunch. Tokens are priced, so a bloated context costs real money on every call. More importantly, models often degrade in quality before the advertised limit — effective, reliable capacity is frequently only 60–70% of the marketed number for demanding tasks. In a long-running agent the window fills relentlessly with tool results and prior reasoning, so you must treat it as a scarce, contested resource. Practical levers — compaction (summarizing history can cut 60–80% of tokens), just-in-time retrieval (pull a document only when needed), and sub-agents that return distilled summaries — exist precisely because a larger window does not eliminate the need to manage what's in it.
Watch out
Big window ≠ RAG is obsolete
Even a 1M- or 10M-token window does not retire retrieval and memory. You still pay per token, you still hit quality degradation, and you still face the lost-in-the-middle effect. Fitting everything in is rarely the same as the model using everything well.
Message roles: who is speaking and with what authority
Imagine a transcript where every line is labeled with who said it — the boss, a customer, the assistant, or a machine reporting back. That labeling is what roles do: APIs don't hand the model one undifferentiated blob of text, they tag each message with a role so the model knows its source and how much authority it carries. The common roles:
system— high-priority instructions: policy, persona, rules. Set by the developer, not the end user.user— input from the end user.assistant— the model's own prior turns.tool— results returned from a function/tool call.
Providers differ in the details. Anthropic keeps system as a top-level field separate from the messages array. OpenAI puts roles in the array and adds a distinct higher-trust developer role; their model follows a chain of command — developer > user — analogous to function definitions outranking arguments. Google uses system_instruction plus user/model roles.
A crucial subtlety: roles set priority, not isolation. The model still processes every role's tokens together in one window. The system prompt is weighted more heavily, but it is not a sealed channel the model obeys unconditionally — which is exactly what prompt injection exploits by smuggling adversarial instructions into user or tool content.
Key insight
Priority, not a firewall
Treat the system prompt as the highest-priority voice in a shared room, not a locked door. Never assume content arriving via tool or user is safe just because your real instructions sit in system.
The prompting toolkit: instructions, examples, roles, delimiters, format
Prompting is how you shape behavior without retraining. Five durable levers do most of the work:
- Clear instructions — be specific, prefer positive phrasing ("do X") over negative ("don't do Y"), and explain the why so the model generalizes.
- Role assignment — give the model a role in the system prompt ("You are a meticulous financial analyst") to focus tone and behavior.
- Delimiters — separate sections unambiguously. Anthropic recommends XML tags; OpenAI recommends XML tags and Markdown headers. Both help any model parse multi-part prompts.
- Examples (few-shot) — show 3–5 worked examples to lock in format and style.
- Output format — state the exact shape you want, ideally with a schema.
Here's the toolkit assembled:
<role>You are a precise data-extraction assistant.</role>
<instructions>
Extract the company, amount, and date from each invoice line.
Return only valid JSON. If a field is missing, use null.
</instructions>
<example>
Input: "Acme Corp billed $4,200 on 2026-03-01"
Output: {"company": "Acme Corp", "amount": 4200, "date": "2026-03-01"}
</example>
<document>
{{ the text to process }}
</document>Notice the document sits last, right before the model answers — that placement is deliberate, as the next section explains.
Zero-shot, few-shot, and chain-of-thought
These three strategies are really one question: how much do you show the model before asking it to work? They sit on a spectrum of scaffolding, from "just tell it what to do" to "walk it through worked examples and reasoning."
- Zero-shot — instructions only, no examples. Fastest, and often enough for strong modern models.
- Few-shot (multishot) — 3–5 in-context examples. It dramatically improves format consistency on structured tasks. Wrap examples in
<example>tags (Anthropic) or as user/assistant turn pairs (OpenAI). - Chain-of-thought (CoT) — ask the model to reason step by step before answering. The original work was few-shot (Wei et al., 2022); the zero-shot trigger "Let's think step by step" (Kojima et al., 2022) is now widely preferred.
Here is the counterintuitive recent finding: for highly capable current models, zero-shot CoT often matches or beats few-shot CoT (arXiv:2506.14641, EMNLP 2025). When the model already reasons well, hand-picked exemplars can constrain it rather than help; few-shot's main remaining value is aligning output format, not teaching reasoning.
One more distinction that trips people up: CoT prompting is not the same as a reasoning model. CoT instructs a standard model to externalize reasoning in its visible output. Reasoning models (the o-series, Claude extended thinking) have a separate internal thinking budget they spend before the final answer — a different mechanism, covered later in the course.
Tip
Reach for few-shot for shape, CoT for hard reasoning
Use few-shot when you need a consistent format the model keeps drifting from. Reach for chain-of-thought (or a reasoning model) when the task needs multi-step logic. They solve different problems and stack well together.
Failure modes: ambiguity and lost-in-the-middle
Two failures account for a surprising share of bad outputs.
Ambiguity. Vague prompts force the model to guess your intent, and it often guesses wrong or inconsistently. The fix is unglamorous but reliable: state the audience, the constraints, the format, and what to do in edge cases. Spelling out "if the field is missing, return null" removes a whole class of silent errors.
Lost in the middle. Models have a U-shaped attention bias: they reliably use information at the start and end of a long context but show 10–25% accuracy degradation for material buried in the middle (Liu et al., TACL 2024). This is structural, not a bug you can prompt away.
The practical fix is positional:
- Put long documents at the top of the prompt.
- Put the question or task instruction at the very end, right before the model generates.
Anthropic reports this ordering can improve response quality by up to 30%. It's also why the toolkit example earlier placed <document> last. When you can't avoid a long middle, lean on retrieval to surface only the relevant passages rather than dumping everything and hoping the model finds it.
Prompting an agent vs. prompting a chatbot
A chatbot prompt mostly sets tone, scope, and response format — it shapes one good reply. An agent system prompt is categorically different because it governs an autonomous loop, not a single turn. It must encode policy and procedure, not just personality.
An effective agent system prompt specifies:
- Role and goal — what the agent is for.
- Available tools and when to use each — the heuristics for choosing actions.
- Multi-step strategy — how to plan, sequence, and recover from errors.
- Stopping conditions — when the task is done and it should hand back a final answer.
- Output format for downstream consumers — other code or agents may parse the result.
You are a research agent. Goal: answer the user's question with cited sources.
Tools:
- web_search(query): use FIRST when facts may be recent or you're unsure.
- fetch_page(url): use to read a promising result in full.
Strategy: search, read the 2 most relevant pages, cross-check claims across
sources, then answer. If a page fails to load, try the next result.
Stop when: you can answer with at least two agreeing sources, OR after 5 tool
calls — then answer with the caveat that evidence was limited.
Output: a 3-sentence answer followed by a `Sources:` list of URLs.This is the on-ramp to context engineering — deliberately curating the whole window across a long run — which the course returns to in depth.
Try it: Engineer one prompt three ways
Pick a small structured task (e.g., extract {name, email, company} from a messy paragraph). (1) Write a zero-shot prompt and run it. (2) Add 3 few-shot examples wrapped in <example> tags and rerun. (3) Add an explicit output schema and a rule for missing fields (use null). Compare format consistency across the three. Then take a ~2,000-word document, ask one question about a fact buried in its middle, and run it twice: once with the question before the document, once with the document first and the question last. Note any difference in accuracy. Write four sentences: which prompt variant was most reliable, and whether position changed the answer.
Key takeaways
- 1The context window is the model's entire working memory; if something isn't in it, the model cannot use it.
- 2A bigger window is a priced budget, not a free pass — effective capacity is often 60–70% of the advertised limit.
- 3Roles (system, user, assistant, tool) set priority, not isolation — all tokens are processed together, which is why prompt injection works.
- 4Modern models make zero-shot CoT competitive with few-shot; place long documents first and the task last to beat lost-in-the-middle.
- 5An agent system prompt must encode tools, strategy, stopping conditions, and output format — far more than a chatbot's tone and scope.
Quiz
Lock in what you learned
Check your understanding
0 / 4 answered
1.Why does a 1M-token context window NOT eliminate the need for retrieval and memory systems?
2.What does the lost-in-the-middle effect imply for prompt construction?
3.Why can prompt injection succeed even when your real instructions are in the system prompt?
4.Which is the clearest difference between an agent system prompt and a chatbot prompt?
Go deeper
Hand-picked sources to keep learning
Canonical reference: clarity, XML tags, few-shot, long-context placement, roles, and output control.
Current guidance on developer/system/user/assistant roles, few-shot, CoT, and structured output.
Defines and quantifies the U-shaped attention bias and the 10–25% middle-position degradation.
Comprehensive 2024 taxonomy of 58+ text prompting techniques and standardized vocabulary.
EMNLP 2025 finding that zero-shot CoT can outperform few-shot CoT on highly capable models; few-shot's main role is format alignment.
The bridge from prompting to context engineering: compaction, retrieval, sub-agents.