Anatomy of an Agent
The four parts under every framework
- Name and explain the five components of an agent — model, instructions, tools, memory, and orchestration — and what each one is responsible for
- Trace a single user request as it flows through all five components in order
- Explain why separating these components cleanly makes an agent debuggable and testable
- Map the anatomy onto real frameworks (LangGraph, CrewAI, OpenAI Agents SDK, Google ADK)
- Diagnose which component is at fault when an agent misbehaves
An agent is not a model — it is a system with five interlocking parts: a model, instructions, tools, memory, and an orchestration loop that ties them together. This lesson dissects each part, traces how a single request flows through them, and shows that the same anatomy lives under every framework you'll meet, from LangGraph to the OpenAI Agents SDK. Once you can name the parts, you can debug them, because each one fails in its own distinct way.
- 1An agent is a system, not a model
- 2The five components
- 3How a request flows through the parts
- 4Separation of concerns: why clean parts are debuggable
- 5The same anatomy, every framework
- 6Where each component goes wrong
An agent is a system, not a model
Start with the trap almost everyone falls into: believing the model is the agent. It isn't. The model is one part — the reasoning core — and on its own it can only produce text. What turns a model into an agent is the machinery wrapped around it that lets it act, remember, and keep going until a goal is met.
The cleanest way to picture this is a body. The model is the brain: it decides. But a brain in a jar can't do anything. It needs instructions (its values and goals), tools (hands to act on the world), memory (to carry context from one moment to the next), and an orchestration loop (the nervous system that connects perception to action and back). Take any part away and the agent stops being an agent — no tools and it can only talk; no loop and it acts once and halts.
This framing isn't academic. As of 2026 it maps cleanly onto every major framework and even onto the formal research literature, where an agent is written as a tuple of policy, memory, tools, verifiers, and environment. The practical payoff is large: when you see an agent as five separable parts rather than one inscrutable blob, you can reason about it, test each piece, and find bugs fast.
The five components
In plain terms: each component answers one question. Who is doing the thinking? The model. What is it trying to do and how should it behave? The instructions. What can it actually touch in the world? The tools. What does it remember? The memory. What keeps it running and stops it? The orchestration. Hold those five questions in your head and you have the whole anatomy.
Here it is in one table. Internalize it — the rest of the course is detail on each row.
| Component | What it is | Its one job |
|---|---|---|
| Model | The LLM at the core | Map the current context to the next decision (think or act) |
| Instructions | System prompt + policy + tool descriptions | Tell the model who it is, what to pursue, and how to behave |
| Tools | Typed, schema-validated functions | Let the agent sense and change the outside world |
| Memory | Context window → conversation store → long-term knowledge | Carry information across steps and across runs |
| Orchestration | The runtime loop | Run the perceive→reason→act→observe cycle, enforce budgets, handle errors |
Three subtleties to notice. Instructions are broader than "the system prompt" — in production they include tool descriptions, injected examples, and any retrieved context, because all of it shapes the model's behavior. Memory is layered, not a single thing: the immediate context window, a conversation store, and long-term knowledge are three different mechanisms. And orchestration is far more than a while loop — it owns stopping conditions, token and cost budgets, retries, and loop detection. Each row is a place where an agent can succeed or fail independently of the others.
Key insight
Why "instructions" deserve their own box
Anthropic's research found that the system-prompt/instructions layer has a disproportionate effect on reliability: vague instructions are a leading cause of goal drift and hallucinated actions. Instructions aren't decoration on top of the model — they are the agent's policy.
How a request flows through the parts
So far the five parts look like a static diagram. They aren't — they're a pipeline a single request runs through, often many times in a single turn. Picture passing a baton: each component does its job, then hands the growing message history to the next. Here is the exact path:
- Instructions are injected into the context window: the system prompt, the available tool schemas, and any relevant memory.
- The model reasons over that context and emits one of two things: a final text answer, or a tool call (a structured request like
search(query="...")). - If it's a tool call, the orchestrator executes the tool and captures the result.
- The result is appended to memory (the message history) so the model can see what happened.
- The loop repeats from step 2, now with the new observation in context.
It stops when the model produces a text-only answer (it's done) or when a budget guard fires — max iterations, max tokens, max cost, or a timeout. That stopping logic lives in orchestration, not in the model, which is exactly why it's reliable: you never trust a non-deterministic model to decide when to quit.
Notice how each component touches the request in turn, hands off cleanly, and reads from a shared, growing message history. That history is the working memory, and keeping it clean is half the battle in real agents.
Example
One turn, concretely
Goal: "What's the weather in Tokyo, and should I pack an umbrella?"
- Instructions + the
get_weathertool schema enter the context. - Model decides it needs data → emits
get_weather(city="Tokyo"). - Orchestrator runs the tool →
{"temp": 14, "rain_chance": 0.8}. - Result appended to history.
- Model loops, now sees 80% rain, and produces a text answer: "14°C and likely rain — pack the umbrella." Text-only response → loop stops.
Separation of concerns: why clean parts are debuggable
Why bother keeping these five parts distinct? Because when something breaks, you want to know which thing broke — not stare at a wall of logs guessing. That's the whole architectural payoff: clean separation makes failures isolatable. A tool failure is not a model failure is not a memory failure. When each component has defined inputs and outputs, you can point at the broken one.
The practical test is mockability: if your parts are cleanly separated, you can swap any one for a fake and test the rest. A "mock" here just means a stand-in that returns a fixed, predictable result instead of doing the real work.
# Test the orchestration loop without spending tokens or hitting APIs
def fake_model(messages):
# Always asks for the calculator once, then answers
if not any(m["role"] == "tool" for m in messages):
return ToolCall("calculator", {"expr": "2+2"})
return FinalAnswer("The result is 4.")
def fake_calculator(expr):
return "4"
# Now exercise the loop, budget guards, and history handling in isolation
run_agent(model=fake_model, tools={"calculator": fake_calculator}, goal="2+2?")Because run_agent only depends on interfaces (a callable model, a dict of tools), you can verify stopping conditions and history management with zero network calls. The same logic lets you mock a flaky tool to test error handling, or replay a fixed message history to test the model's behavior deterministically. Agents are non-deterministic; clean seams are how you make them testable anyway.
The same anatomy, every framework
Here's the news that makes every framework suddenly easy: they are all the same five parts in different clothes. Each framework is just an opinionated way of assembling model, instructions, tools, memory, and orchestration. The vocabulary changes; the anatomy doesn't.
| Framework | Model | Instructions | Tools | Memory | Orchestration |
|---|---|---|---|---|---|
| LangGraph | any LLM | node prompts | tool nodes | shared state object + checkpoints | directed graph (nodes + conditional edges) |
| CrewAI | any LLM | role / goal / backstory | agent tools | crew memory (pluggable vector store) | sequential / hierarchical process |
| OpenAI Agents SDK | OpenAI-first | agent instructions + guardrails | function tools | sessions | the runner + handoffs between agents |
| Google ADK | Gemini-first | agent config | tool registry | built-in state | workflow agents (Sequential / Parallel / Loop) |
| AG2 / AutoGen | any LLM | conversable-agent system message | registered functions | conversation history | structured agent conversations |
The lesson is liberating: learn the anatomy once and every framework becomes legible. Open the OpenAI Agents SDK and see Agents, Handoffs, Guardrails, Sessions, Tracing, and you can immediately place each one — Guardrails and Sessions are instructions and memory; Handoffs are an orchestration pattern. Frameworks earn their keep by providing the hard parts — state management, checkpointing, tracing, human-in-the-loop — that are tedious to build correctly. Anthropic's caution isn't "avoid frameworks"; it's "don't use them as black boxes you can't reason about."
Watch out
Two names that trip people up in 2026
OpenAI's production agent framework is the Agents SDK (released March 2025) — Swarm was an experimental prototype and is deprecated; the Swarm README explicitly directs users to migrate. And AutoGen entered maintenance mode at Microsoft in late 2025 as Microsoft pivoted to the Microsoft Agent Framework; the active community continuation of AutoGen 0.2 is AG2, maintained by the original creators outside Microsoft. Citing "Microsoft AutoGen" (active) or "OpenAI Swarm" (current) is outdated.
Where each component goes wrong
Now the real reason to learn the anatomy: debugging. When an agent misbehaves, the symptom usually points straight at one component — because each part fails in its own characteristic way. Learn the signatures and you stop guessing.
- Model — hallucination, goal drift, and compounding errors that grow over many turns. Research on long-horizon tasks shows context reliability degrades significantly as the number of steps increases, with earlier instructions and observations becoming progressively less influential by the final turn.
- Instructions — ambiguous or conflicting goals that cause wrong actions or, worse, infinite loops where the agent can never decide it's done.
- Tools — malformed arguments, schema mismatches, API timeouts, and dangerous side effects. Anthropic's “Building Effective Agents” guide found that teams spent more time on tool schemas than on prompts; typed (Pydantic/JSON Schema) arguments sharply cut malformed calls.
- Memory — context overflow, stale embeddings, low-quality retrieval, and prompt injection hidden inside retrieved content.
- Orchestration — missing budget guards causing runaway loops, routing errors in multi-agent setups, and race conditions.
So when an agent misbehaves, resist debugging "the agent." Ask: which component? A wrong-but-confident answer points at the model or instructions; a crash on a tool result points at tools or schemas; an agent that forgets what it did points at memory; an agent that never stops points at orchestration. The symptom is the map.
Tip
Observability is the emerging sixth concern
You can't fix a component you can't see. Production runtimes now emit OpenTelemetry-style spans for every LLM call, tool call, retrieval, and handoff. Tools like LangSmith, Langfuse, and Arize Phoenix turn the five-part anatomy into a trace you can actually read.
Try it: Label the anatomy
Take one agent you can inspect — your own toy loop, a LangGraph tutorial, or the OpenAI Agents SDK quickstart — and produce a one-page dissection.
- Identify all five components in the code or config: where is the model called, where do instructions live, how are tools defined, what stores memory, and where is the loop?
- Draw the request flow as five numbered steps for a single example task, naming which component owns each step.
- Predict one failure per component: write the specific bug you'd expect from the model, the instructions, a tool, memory, and orchestration — then say how you'd detect it (which span or log line).
- Bonus — prove separation: replace the real model call with a hard-coded fake (return a fixed tool call, then a fixed answer) and confirm the loop still runs end-to-end with no API calls. If it can't, your components aren't cleanly separated yet.
Key takeaways
- 1An agent is a system of five parts — model, instructions, tools, memory, orchestration — not just an LLM.
- 2A request flows in order: instructions enter context → model reasons → tool runs → result appends to memory → loop, until a final answer or a budget guard stops it.
- 3Clean separation of components makes failures isolatable and lets you mock any one part to test the others.
- 4Every framework — LangGraph, CrewAI, OpenAI Agents SDK, Google ADK, AG2 — is just these five parts under different names.
- 5Each component fails in its own way, so naming the broken part is the fastest path to a fix.
Quiz
Lock in what you learned
Check your understanding
0 / 4 answered
1.Which statement best captures the core idea of agent anatomy?
2.In a single agent turn, what is the correct order of events?
3.Why does cleanly separating the components make an agent easier to debug?
4.An agent runs forever and never returns a final answer. Which component is the most likely culprit?
Go deeper
Hand-picked sources to keep learning
The canonical source on workflows vs agents, component design, and why tool (ACI) design matters as much as prompts.
A concise, code-forward walkthrough of the minimal loop and how model, tools, history, and orchestration fit together.
Component breakdown with request-flow diagrams and a per-component failure-modes table.
Current overview of model options, the three-layer memory taxonomy, the tool/MCP layer, and framework-to-anatomy mapping.
Side-by-side mapping of how LangGraph, CrewAI, OpenAI Agents SDK, AG2, and Google ADK implement the five components.
Academic survey formalizing the agent tuple (policy, memory, tools, verifiers, environment) and cataloging failure modes per component.