Building with a Graph Framework
Stateful, controllable agents with LangGraph
- Model an agent as a LangGraph StateGraph using nodes, edges, conditional edges, and typed state
- Explain why an explicit state machine is more reliable and debuggable than a hidden agent loop
- Add persistence with a checkpointer so an agent can resume across crashes, sessions, and users
- Implement a human-in-the-loop approval gate using interrupt() and Command(resume=...)
- Use current v1.0 APIs and avoid the deprecated v0.x patterns you'll see in older tutorials
A from-scratch agent loop works in a notebook, but it is a black box: when it loops forever, crashes mid-task, or needs a human to approve a payment, you have nowhere to stand. LangGraph fixes this by modeling your agent as an explicit state machine — a graph of nodes and edges with typed, checkpointed state — so execution becomes inspectable, resumable, and safe to pause. This lesson builds a real LangGraph agent and shows why explicit control beats an opaque loop in production.
1. Why model an agent as a graph?
2. Nodes, edges, and typed state
3. A worked example: a tool-using agent
4. Persistence: checkpoints and resuming
5. Human-in-the-loop with interrupt()
6. Why explicit control wins in production
Why model an agent as a graph?
Think of the agent you built earlier as a recipe you run in your head: call the model, run any tools, feed the results back, repeat. That while loop is easy to write and works fine — right up until something goes wrong. When the loop spins forever, fails on step 14 of 20, or needs a human to approve a refund, where do you stand? There is no place to look inside, no place to pause, no place to pick up where it left off. The control flow lives invisibly inside Python, and you can't hold an invisible thing in your hands.
LangGraph's one move is to make that control flow visible. Instead of a hidden loop, you describe your agent as a directed graph — a state machine you can literally draw on a whiteboard. Each step of work is a node. The arrows wiring steps together are edges. And a shared, typed state object travels through the whole thing like a clipboard passed hand to hand.
Once the structure is an object rather than buried logic, the hard operations become straightforward: you can visualize the agent, save its state at every step, rewind it, and slot a human decision into the middle. None of that is practical with a raw loop. LangGraph reached a stable v1.0 on October 22, 2025 (Python 3.10+), with a commitment to no breaking changes until v2.0. Companies including Uber, LinkedIn, Klarna, and JP Morgan use it for agents that have to survive contact with production.
Key insight
The core reframe
A from-scratch agent hides its control flow inside a loop. LangGraph turns that control flow into data — a graph plus a checkpointed state object — so you can inspect, persist, pause, and replay it. That is the entire reason to adopt it.
Nodes, edges, and typed state
Picture a flowchart you can actually execute: boxes are steps, arrows are the order, and a clipboard of shared facts gets handed from box to box. That is a LangGraph StateGraph, built from exactly three primitives:
- State: the clipboard. A TypedDict describing everything that flows through the graph. Each node receives the current state and returns a partial update (just the fields it changed).
- Nodes: the boxes. Plain Python functions that read state and return a dict of fields to merge back in. Nodes do the actual work: call a model, run a tool, transform data.
- Edges: the arrows. A normal edge always goes A→B. A conditional edge runs a small router function that inspects state and returns the name of the next node, which is how the graph branches.
The one subtle part is how a returned update gets merged into the clipboard. That is controlled by reducers. Without a reducer, a returned value simply overwrites the existing field. The built-in add_messages reducer instead appends to a running message list — which is exactly what a chat agent needs, since it must accumulate history rather than replace it. The shortcut MessagesState bundles a messages field with that reducer pre-wired, and is the recommended starting point.
```python
from typing import Annotated, TypedDict

from langgraph.graph import MessagesState  # messages + add_messages

# Or define your own state with a reducer:
from langgraph.graph.message import add_messages

class AgentState(TypedDict):
    messages: Annotated[list, add_messages]  # appends, not overwrites
    step_count: int  # plain field: overwrites
```

So the whole mental model fits in one line: nodes are boxes, edges are arrows, state is the clipboard, and reducers decide whether a new note replaces the old one or gets stapled on next to it.
Watch out
v0.x APIs you'll still see in old tutorials
In v1.0, wire the entry and exit with add_edge(START, "node") and add_edge("node", END) rather than the set_entry_point()/set_finish_point() calls common in older tutorials. ToolExecutor is replaced by ToolNode. And create_react_agent (from langgraph.prebuilt) is deprecated in favor of create_agent from the langchain package (langchain.agents). add_conditional_edges() is unchanged.
A worked example: a tool-using agent
Let's build the canonical agent: a model node, a tool node, and a conditional edge that bounces between them until the model stops asking for tools. If that sounds familiar, it should — it is the ReAct loop you wrote by hand earlier, except now it's an explicit, drawable graph instead of a hidden while.
```python
from langgraph.graph import StateGraph, MessagesState, START, END
from langgraph.prebuilt import ToolNode
from langchain.chat_models import init_chat_model
from langchain_core.tools import tool

@tool
def get_weather(city: str) -> str:
    """Return the current weather for a city."""
    return f"It's 18°C and clear in {city}."

tools = [get_weather]
# Needs the langchain-anthropic package and an ANTHROPIC_API_KEY.
model = init_chat_model("anthropic:claude-sonnet-4-5").bind_tools(tools)

def call_model(state: MessagesState):
    # Append the model's reply (which may contain tool calls) to the history.
    return {"messages": [model.invoke(state["messages"])]}

def should_continue(state: MessagesState) -> str:
    # Route to the tool node if the model asked for tools; otherwise stop.
    last = state["messages"][-1]
    return "tools" if last.tool_calls else END

builder = StateGraph(MessagesState)
builder.add_node("model", call_model)
builder.add_node("tools", ToolNode(tools))
builder.add_edge(START, "model")
builder.add_conditional_edges("model", should_continue, ["tools", END])
builder.add_edge("tools", "model")  # loop back
graph = builder.compile()
```

Read should_continue carefully: it asks one question (did the model just request a tool?) and routes accordingly. That tiny router is the agent loop, now written as data instead of buried in Python control flow. Run graph.invoke({"messages": [("user", "Weather in Oslo?")]}) and execution flows START → model → (tools → model)* → END, looping through the tool node as many times as the model needs. You can also inspect the exact topology: graph.get_graph().draw_mermaid() returns Mermaid source that renders as a diagram of the agent you built.
Tip
Prebuilt vs. manual
For a standard tool-calling agent, the prebuilt create_agent gives you this graph in one line. Build a manual StateGraph like the above when you need custom routing, parallel branches, a supervisor over sub-agents, or approval gates — i.e. exactly the cases where a graph earns its keep.
Persistence: checkpoints and resuming
Here is the practical payoff of an explicit graph that a raw loop can't easily match: it can save its entire state after every single step. You enable this by handing the graph a checkpointer when you compile it. With one configured, LangGraph serializes the full state at each node transition — like hitting save in a video game after every move — and keys those saves by a thread_id.
```python
from langgraph.checkpoint.memory import MemorySaver

# `builder` is the StateGraph from the worked example above.
graph = builder.compile(checkpointer=MemorySaver())
config = {"configurable": {"thread_id": "user-42"}}
graph.invoke({"messages": [("user", "Hi")]}, config)
# Later, same thread_id: prior messages are restored automatically.
graph.invoke({"messages": [("user", "What did I just ask?")]}, config)
```

Because every save is tagged with a thread_id, each user or session gets its own isolated memory, as durable as the checkpointer behind it. Three concrete wins fall out of this one mechanism:
- Conversation memory — resume a thread days later with full history intact.
- Fault tolerance — a crash mid-run resumes from the last checkpoint, not from the start.
- Multi-user isolation — one graph serves many threads with no state bleeding between them.
You pick the checkpointer to match where the agent runs: MemorySaver for dev (state lives in RAM and vanishes on restart), SqliteSaver for a single long-running process, and PostgresSaver / AsyncPostgresSaver for real production, where saves must survive restarts and scale across machines. The SQLite and Postgres savers ship as separate packages, langgraph-checkpoint-sqlite and langgraph-checkpoint-postgres.
Watch out
MemorySaver is NOT for production
MemorySaver keeps state in process RAM and loses everything on restart. It is documented as development/testing only. Production agents must use a durable checkpointer such as PostgresSaver or AsyncPostgresSaver.
Human-in-the-loop with interrupt()
Some actions should never happen on the agent's own say-so: issuing a refund, sending an email, deleting a row. You want a human to look first. Checkpointing makes this clean, because pausing is just saving and waiting. Call interrupt() inside any node and the graph freezes on the spot: it serializes state to the checkpointer and hands a payload back to your caller, who can show it to a human and collect a decision.
```python
from langgraph.types import interrupt, Command

def approve_refund(state):
    # Assumes the graph's state includes an "amount" field.
    decision = interrupt({"action": "refund", "amount": state["amount"]})
    if decision != "approve":
        return {"messages": [("assistant", "Refund cancelled.")]}
    return {"messages": [("assistant", "Refund issued.")]}
```

When a run reaches this node, invoke stops at the interrupt and returns the payload to the caller. To wake the graph back up, you call it again with Command(resume=value) and the same thread_id so it finds the right saved state:
```python
graph.invoke(Command(resume="approve"), config)
```

This is a genuine pause that can last seconds or days: the human is fully outside the request, not blocking a thread somewhere. The exact same mechanism powers reviewing a draft before it sends, editing a plan before it runs, or correcting a tool call before it fires.
Watch out
interrupt() has sharp edges
Four rules: (1) a checkpointer is mandatory: no persistence, no resume. (2) Never wrap interrupt() in a bare try/except: it pauses the graph by raising a special exception that must propagate to LangGraph. (3) On resume the node re-executes from the top, so any code before interrupt() must be idempotent (don't charge the card before the gate). (4) Payloads must be JSON-serializable.
Why explicit control wins in production
A black-box agent loop is charming in a demo and miserable in production, for one reason: when it misbehaves, you can't see why. LangGraph's explicit structure pays off exactly where reliability matters, because every part of the agent is now a thing you can point at.
- Debuggability. Every node maps cleanly to a trace span. With LangSmith, you see the graph, which edges were taken, and the precise state delta at each node — not just an undifferentiated wall of model calls. This deep observability is one of LangGraph's main advantages over opaque frameworks.
- Time-travel debugging. Because state is checkpointed at every transition, LangGraph Studio lets you rewind to any earlier checkpoint, edit the state, and fork a new execution path from there. A nondeterministic agent becomes reproducible: you can replay the exact moment it went wrong.
- Bounded behavior. Explicit edges plus a recursion limit make a runaway loop a configuration setting, not a 3 a.m. surprise.
One clarification worth internalizing: LangGraph is a pure orchestration runtime. It depends on LangChain packages, but it does not force you to use LangChain chains or prompt templates — you can call any LLM SDK that supports tool calling directly inside a node. It is MIT-licensed and model-agnostic at the framework layer.
The rule of thumb to leave with: reach for a graph when control flow is complex enough that you need to see and steer it. For a plain tool-caller, a one-line prebuilt agent is still enough — don't pay for machinery you won't use.
Key insight
What you can point at, you can fix
A hidden loop gives you a wall of logs and a shrug. An explicit graph gives you a span per node, a saved state per step, and a rewind button. Production reliability is mostly the ability to locate a failure — and that is precisely what making control flow visible buys you.
Try it: Build a refund agent with an approval gate
Build a StateGraph agent that handles customer refund requests, using current v1.0 APIs.
Steps:
- Start from MessagesState and a model bound to one tool, lookup_order(order_id).
- Wire the standard loop: add_edge(START, "model"), a should_continue conditional edge to a ToolNode or END, and add_edge("tools", "model").
- Add an approve_refund node that calls interrupt({"amount": ...}) before issuing any refund. Route the model to it when a refund is requested.
- Compile with a checkpointer (MemorySaver is fine for the lab) and run a refund request with a fixed thread_id. Confirm the graph pauses at the interrupt.
- Resume with graph.invoke(Command(resume="approve"), config) and verify the refund completes. Then run a fresh thread and resume with "deny".
Stretch: render the graph with graph.get_graph().draw_mermaid(), then swap MemorySaver for SqliteSaver and confirm the conversation survives a process restart.
Reflection (2–3 sentences): Where would interrupt() re-running the node from the top cause a bug if your pre-interrupt code weren't idempotent?
Key takeaways
1. LangGraph turns an agent's hidden loop into an explicit, inspectable state machine: nodes (functions), edges (routing), and a typed shared state.
2. State reducers control merging: plain fields overwrite, the add_messages reducer appends; MessagesState is the recommended starting point.
3. A checkpointer persists state at every step keyed by thread_id, giving you conversation memory, crash recovery, and multi-user isolation, but MemorySaver is dev-only.
4. interrupt() plus Command(resume=...) implements human-in-the-loop approval, but requires a checkpointer and an idempotent node because the node re-runs from the top on resume.
5. Explicit graphs unlock LangSmith tracing and LangGraph Studio time-travel debugging; use v1.0 APIs (START/END, ToolNode, create_agent), not deprecated v0.x patterns.
Quiz
Lock in what you learned
Check your understanding
1. What does the add_messages reducer do when a node returns a value for the messages field?
2. In LangGraph v1.0, how do you mark a node as the entry point of the graph?
3. Which statement about interrupt() is correct?
4. Why is MemorySaver unsuitable for production?
Go deeper
Hand-picked sources to keep learning
Authoritative v1.0 reference for StateGraph, nodes, edges, state, and checkpointing.
Complete conceptual docs for interrupt(), Command(resume=...), and human-in-the-loop patterns with examples.
Every breaking change and v0.x → v1.0 substitution (START/END, ToolNode, create_agent).
Official October 22, 2025 announcement covering new features, design philosophy, and production adoption.
MIT-licensed source, releases, and changelog. Check the releases tab for the latest patch version.
Practical walkthrough of current patterns: StateGraph, MessagesState, ToolNode, and checkpointing.