Long-Horizon & Stateful Tasks

Staying coherent over hundreds of steps

Advanced · 14 min · Builder
What you'll be able to do
  • Explain why long-horizon tasks fail non-linearly, distinguishing error compounding, goal drift, context rot, and session discontinuity
  • Decompose a long task into checkpointed subtasks that survive a context reset
  • Design error recovery that resumes from the last known-good state instead of restarting
  • Manage context across many steps using the sawtooth pattern, external state, and the four memory types
  • Build progress-tracking and termination artifacts that stop agents from drifting or quitting early
At a glance

A coding agent that fixes one bug looks magical. Ask it to ship a whole feature across hundreds of steps and several context windows, and you watch it slowly lose the plot — forgetting the goal, retrying dead ends, declaring victory on broken code. This lesson explains why long-horizon agents drift and break, and the production engineering — decomposition, checkpoints, external state, and active context management — that keeps them coherent over hours of work.

  1. Why long horizons break agents
  2. The four ways agents lose the plot
  3. Decomposition and checkpoints
  4. Recovering from failed steps
  5. Managing context over a long run
  6. Progress tracking and termination

Why long horizons break agents

Start with the intuition: a single agent step is a coin flip you almost always win, but a long task is hundreds of those flips in a row — and the chance of getting every one right shrinks fast. That is the whole problem in a sentence. Reliability multiplies, so a model that is excellent per step can still be useless end to end.

The math is brutal. If each step succeeds 99% of the time, a 300-step task succeeds only about 5% of the time (0.99 to the 300th power) — this is error compounding, where strong per-step accuracy collapses into a weak overall system.
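To make the arithmetic concrete, here is a minimal sketch. It assumes every step succeeds independently with the same probability, which is a simplification (real steps are correlated), and the function names are illustrative.

```python
def run_success_rate(per_step: float, steps: int) -> float:
    """Probability that all `steps` independent steps succeed."""
    return per_step ** steps


def required_per_step(target: float, steps: int) -> float:
    """Per-step success rate needed to reach `target` end-to-end success."""
    return target ** (1.0 / steps)
```

Under these assumptions, `run_success_rate(0.99, 300)` comes out near 0.049, matching the 5% figure above, while the same model on a 10-step unit lands near 0.90. Going the other way, `required_per_step(0.5, 300)` is about 0.9977: even a coin-flip outcome on 300 steps needs each step to succeed roughly 99.77% of the time.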

The empirical picture is worse than this simple multiplication suggests. METR tracks the task-completion time horizon (the length of task, measured in human working time, that a frontier agent finishes at 50% success) and finds it doubling roughly every 4–7 months, with the pace accelerating to roughly every 4 months in 2024–2025. That horizon has grown from minutes in 2019 to multi-hour tasks by mid-2026, though measurements beyond a few hours still carry high uncertainty. Within a single run, however, doubling a task's duration roughly quadruples its failure rate, and every measured agent shows noticeable degradation after about 35 minutes of human-equivalent work.

Why the cliff? A 2025 study of 3,100+ agent trajectories found failures are abrupt, not gradual — agents jump from partial robustness to near-systematic failure. And the causes are mostly process-level (72.5%): bad environment interactions, misread instructions, planning errors, and accumulated history errors. Only 27.5% are design-level limits like memory caps. The lesson: you can't buy your way out with a bigger model alone. You have to engineer the run.

Key insight

The cliff, not the slope

Long-horizon agents don't degrade gracefully. They hold together, then collapse. That means your job is not to nudge average quality up a few points — it's to detect the approaching cliff (drift, rising retries, context pressure) and intervene before the run tips over.
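One way to watch for the cliff is a small monitor over recent steps. The sketch below is an assumption-laden illustration: the class name, the window size, and both thresholds are hypothetical placeholders to tune per workload, and it covers only two of the signals (rising retries and context pressure). Detecting goal drift needs a separate check against the stored goal.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List


@dataclass
class CliffMonitor:
    """Tracks warning signs over a rolling window of recent steps."""
    window: int = 20                    # assumed: recent steps to consider
    max_retry_rate: float = 0.3         # assumed threshold, not a standard
    max_context_fraction: float = 0.6   # act before the 60-70% zone
    recent: deque = field(default_factory=deque)

    def record_step(self, was_retry: bool) -> None:
        """Remember whether the latest step was a retry."""
        self.recent.append(was_retry)
        if len(self.recent) > self.window:
            self.recent.popleft()

    def warnings(self, context_tokens: int, context_limit: int) -> List[str]:
        """Return the names of any signals currently over threshold."""
        found = []
        if self.recent and sum(self.recent) / len(self.recent) > self.max_retry_rate:
            found.append("rising_retries")
        if context_tokens / context_limit >= self.max_context_fraction:
            found.append("context_pressure")
        return found
```

A harness could call `warnings(...)` after every step and force a checkpoint or compaction as soon as the list is non-empty, rather than waiting for an outright failure.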

The four ways agents lose the plot

"Drift" is a catch-all word, and catch-all words hide the fix. If you can name which way an agent is losing the plot, you know which lever to pull. Four distinct failure modes dominate long runs:

  1. Error compounding: small mistakes feed the next step as fact. A wrong tool result or a hallucinated assumption becomes baked-in 'knowledge' the agent reasons from for the rest of the run.
  2. Goal drift: the agent slowly stops pursuing the original objective. Triggers include early instructions fading as context fills, recent cues overriding the goal (pattern-matching override), session interruptions, the model's defaults overriding your system prompt, and subgoal displacement, where over-optimizing a sub-task quietly erases the parent objective.
  3. Context rot: accuracy degrades well before the window is full. Enterprise data attributes 65% of agent failures to context drift and memory loss, and accuracy starts to fall at around 60–70% of the advertised window capacity, not at 100%.
  4. Session discontinuity: the task simply outlives the context window, and state doesn't survive the reset.

The big misconception is that long-horizon failure means 'ran out of tokens.' It rarely does. Frontier models can hold near-perfect goal adherence for tens of thousands of tokens before drift sets in — yet every evaluated model eventually drifts. The token limit is the cliff's edge; drift pushes you off long before you reach it.

Watch out

Bigger context is not a fix

A 1M-token window doesn't eliminate drift — it delays it. Bigger windows and stronger models push the failure threshold further out but don't change the shape of the failure curve. Engineering the run is non-optional, at every model size.

Decomposition and checkpoints

The fix follows directly from the diagnosis: if a 300-step task is fragile because it's 300 steps, stop running it as 300 steps. Break it into a handful of independently verifiable units, complete one at a time, and save your progress after each. That saved progress is a checkpoint — a durable, external record of 'this much is genuinely done and verified,' not a feeling, but a committed artifact you can return to.

Anthropic's production pattern for multi-session coding agents uses a two-agent harness. An Initializer runs once and sets up the scaffolding: a feature-list JSON with pass/fail status, a progress.md file, an init.sh, and an initial git commit. Then each Coding agent session reads the progress file and git log, picks one feature, implements it, runs tests, commits with a descriptive message, and updates the progress file.

```python
def pick_next_task(state):
    """One feature at a time; clean handoff between sessions."""
    for task in state["features"]:
        if task["status"] == "pending":
            return task
    return None  # nothing pending -> candidate to terminate
```

Git commits are your checkpoints: each is a known-good state you can resume from or roll back to. Working one feature at a time avoids half-finished partial states and hands the next session a clean boundary — the opposite of the tempting but fragile 'do everything in parallel for speed' approach.
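The Initializer's scaffolding might look like the sketch below. This is an illustration, not Anthropic's actual harness: the function name is hypothetical, the file names `features.json` and `progress.md` and the `name`/`status`/`verified` fields are assumed, and the `init.sh` script and the initial git commit are left out to keep the example self-contained.

```python
import json
from pathlib import Path


def initialize_workspace(root: Path, feature_names):
    """Run once: create the durable artifacts every later session reads."""
    root.mkdir(parents=True, exist_ok=True)

    features_path = root / "features.json"
    if not features_path.exists():      # never clobber recorded progress
        state = {"features": [
            {"name": name, "status": "pending", "verified": False}
            for name in feature_names
        ]}
        features_path.write_text(json.dumps(state, indent=2))

    progress_path = root / "progress.md"
    if not progress_path.exists():
        progress_path.write_text("# Progress log\n\n")

    return json.loads(features_path.read_text())
```

The existence checks make the initializer safe to re-run by accident: a second call returns the recorded state instead of resetting every feature to pending.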

Key insight

A checkpoint is an artifact, not a memory

If 'progress so far' lives only in the context window, a reset erases it. A real checkpoint is written outside the agent — a git commit, a row in a status file — so it survives the model forgetting, the session ending, and the process crashing.
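A checkpoint only helps if a crash cannot corrupt it. A common way to get that for a status file is to write to a temporary file and then swap it into place. The sketch below assumes a JSON status file on a local filesystem; the function names are illustrative.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write state so a crash mid-write never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp, indent=2)
            tmp.flush()
            os.fsync(tmp.fileno())      # push bytes to disk before the swap
        os.replace(tmp_path, path)      # swap the new file in as one rename
    except BaseException:
        os.unlink(tmp_path)             # clean up the partial temp file
        raise


def load_checkpoint(path: str, default: dict) -> dict:
    """Return the last saved state, or `default` on a first run."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default
```

The temporary file is created in the same directory as the target so the final rename stays on one filesystem. A reader then sees either the old checkpoint or the new one, never a mix.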

Recovering from failed steps

Over hundreds of steps, some will fail — that's a certainty, not a risk. The naive reaction is to restart the whole task from scratch, which throws away good work and re-burns the budget. The production reaction is checkpoint-based resumption: pick back up from the last known-good state, not the beginning. Think of it like a video game with save points — death sends you to the last save, not to the title screen.

Three principles make this work:

  • Idempotency — design steps so that re-running them is safe (e.g., 'ensure file X exists' beats 'create file X', which would error or duplicate on a retry). Idempotent steps make retries free of side-effect damage.
  • Per-step retries with backoff — retry the failed unit a bounded number of times, with growing waits between attempts, before escalating — rather than failing the whole run.
  • Record the failure — and this is the one teams skip. The progress file must log failed approaches, not just completed work. Otherwise the next session happily retries the same dead end.
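The first two principles can be sketched in a few lines. The helper names and default values here are assumptions for illustration; the `sleep` parameter exists only so the wait can be swapped out in tests.

```python
import time
from pathlib import Path


def ensure_file(path: Path, content: str = "") -> bool:
    """Idempotent: safe to call repeatedly. Returns True only if created."""
    if path.exists():
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(content)
    return True


def retry_with_backoff(step, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run `step()` up to `max_attempts` times, doubling the wait each time.

    Re-raises the last error so the caller can escalate instead of looping.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

The two compose: because `ensure_file` does nothing on a second call, wrapping it in `retry_with_backoff` cannot duplicate or overwrite work when an earlier attempt partly succeeded.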

Durable execution frameworks like Temporal or Azure Durable Functions give you this at the infrastructure layer almost for free: native checkpointing, retries, and replay. LangGraph's Time Travel lets you replay from any checkpointed state. The mindset shift: a long-horizon agent is less like a script you run once and more like a resumable workflow that expects to be interrupted, crash, and pick back up.
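To show the resumable-workflow idea without depending on any framework, here is a plain-Python sketch. It is a hypothetical helper, not the API of Temporal, Azure Durable Functions, or LangGraph, and it assumes each step has a unique name and is safe to re-run if a crash lands between the step finishing and the checkpoint being written.

```python
import json
from pathlib import Path


def run_resumable(steps, checkpoint_path: Path):
    """Run named steps in order, skipping any already recorded as complete.

    `steps` is a list of (name, callable) pairs.
    """
    done = []
    if checkpoint_path.exists():
        done = json.loads(checkpoint_path.read_text())["completed"]

    for name, action in steps:
        if name in done:
            continue                    # finished in an earlier session
        action()                        # may raise; earlier work stays saved
        done.append(name)
        checkpoint_path.write_text(json.dumps({"completed": done}))
    return done
```

If the second of three steps raises, the checkpoint still lists the first, so the next invocation starts at the second step rather than the beginning. A production version would also want an atomic checkpoint write and per-step retries.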

Tip

Write down the dead ends

A progress file that lists only successes is a trap. Record failed approaches explicitly ("tried X, broke Y, abandoned"). Successive sessions without this re-explore the same failures and burn the iteration budget on known dead ends.
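A progress log that captures dead ends can be very simple. The sketch below assumes a Markdown file with one line per attempt. The line format and both function names are invented for illustration, and the parsing is deliberately naive (it would misread an approach description that itself contains " (").

```python
from datetime import datetime, timezone
from pathlib import Path


def log_attempt(progress_path: Path, task: str, approach: str,
                succeeded: bool, note: str = "") -> None:
    """Append one line per attempt, failures included, to the progress file."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M")
    outcome = "DONE" if succeeded else "FAILED"
    line = f"- [{outcome}] {stamp} {task}: {approach}"
    if note:
        line += f" ({note})"
    with progress_path.open("a") as f:
        f.write(line + "\n")


def known_dead_ends(progress_path: Path, task: str):
    """Return approaches already tried and abandoned for `task`."""
    if not progress_path.exists():
        return []
    dead = []
    for line in progress_path.read_text().splitlines():
        if line.startswith("- [FAILED]") and f" {task}: " in line:
            approach = line.split(f" {task}: ", 1)[1]
            dead.append(approach.split(" (", 1)[0])
    return dead
```

A new session can call `known_dead_ends(...)` before planning and put the result in its prompt, so the approaches it must not repeat are stated explicitly rather than rediscovered.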

Managing context over a long run

Working memory (the context window) grows without bound unless you actively prune it: every tool result and every message piles on. Accuracy starts sliding once you cross roughly 60–70% of capacity, well before 'full.' So the goal is to keep context small and information-dense rather than large and noisy. The tool for that is the sawtooth pattern, three simple moves repeated at every checkpoint:

  1. Produce a structured summary of what was learned and done.
  2. Prepend it to a persistent Knowledge block at the top of context.
  3. Delete the raw messages it summarized.

Context sawtooths up as you work, then drops sharply at each compaction — staying bounded indefinitely (hence the name: the usage graph looks like saw teeth). The Focus Agent implements an autonomous version, cutting tokens 22.7% on SWE-bench Lite (up to 57% on individual instances) with no accuracy loss in its evaluation — a promising early result on a small sample. LangChain's Deep Agents SDK uses threshold triggers: offload tool results over 20K tokens to the filesystem with a path pointer, truncate old inputs at 85% capacity, and summarize when offloading isn't enough.
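The offloading trigger is easy to picture in plain Python. The sketch below is not the Deep Agents SDK's API. It borrows only the 20K-token threshold quoted above, and everything else is assumed for illustration: the function names, the four-characters-per-token estimate, and the pointer format.

```python
from pathlib import Path

OFFLOAD_TOKENS = 20_000     # threshold quoted above
CHARS_PER_TOKEN = 4         # crude assumption; use a real tokenizer in practice


def estimate_tokens(text: str) -> int:
    """Rough token count from character length."""
    return len(text) // CHARS_PER_TOKEN


def maybe_offload(result: str, store_dir: Path, call_id: str) -> str:
    """Return the result unchanged if small, else a pointer to a saved file."""
    if estimate_tokens(result) <= OFFLOAD_TOKENS:
        return result
    store_dir.mkdir(parents=True, exist_ok=True)
    path = store_dir / f"{call_id}.txt"
    path.write_text(result)
    preview = result[:200]
    return f"[tool result offloaded to {path}; first 200 chars: {preview!r}]"
```

The agent keeps a short pointer in context and can read the file back through a file tool if it later needs the detail, so a single huge tool result no longer pushes the window toward the degradation zone.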

Underneath, lean on the four memory types: working (the live context, session-scoped), episodic (a timeline of past decisions and tool calls), semantic (a knowledge base, often vector-stored), and procedural (reusable skills and tool definitions). The one thing you must never summarize away is the goal itself — store it externally and re-read it verbatim, because repeated summarization is lossy and can hallucinate 'facts' that then poison the rest of the run.
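As a rough sketch of how the four stores and the externally held goal fit together: the container below uses plain lists and dicts as stand-ins (a real semantic store would more likely be a vector database), and the class name, function name, and goal file are all hypothetical.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List


@dataclass
class AgentMemory:
    working: List[str] = field(default_factory=list)          # live context
    episodic: List[dict] = field(default_factory=list)        # past decisions, tool calls
    semantic: Dict[str, str] = field(default_factory=dict)    # stand-in knowledge base
    procedural: Dict[str, str] = field(default_factory=dict)  # skills, tool definitions


def build_prompt(goal_path: Path, memory: AgentMemory) -> str:
    """Re-read the goal verbatim from disk on every step, never from a summary."""
    goal = goal_path.read_text().strip()
    context = "\n".join(memory.working)
    return "GOAL (verbatim):\n" + goal + "\n\nCONTEXT:\n" + context
```

Because the goal is read from the file each time, no amount of compaction of `working` can paraphrase or distort it.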

Example

Sawtooth in one pass

Run 1 fills context to 80% across 40 tool calls. At the checkpoint, the agent writes a 600-token Knowledge block ("auth module done, DB schema is X, the migration approach failed — use Y"), deletes the 40 raw messages, and continues at ~10% usage. The fact density went up; the token count went down.
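The three moves in this example can be sketched as a single compaction function. This is a minimal illustration, not the Focus Agent's implementation: `Context` is a hypothetical container, and `summarize` is a caller-supplied function that in practice would be a model call.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Context:
    knowledge: List[str] = field(default_factory=list)  # persistent block
    messages: List[str] = field(default_factory=list)   # raw, disposable


def compact(ctx: Context, summarize: Callable[[List[str]], str]) -> None:
    """One sawtooth drop: summarize, store the summary, delete the raw log."""
    if not ctx.messages:
        return
    summary = summarize(ctx.messages)   # move 1: structured summary
    ctx.knowledge.insert(0, summary)    # move 2: prepend to Knowledge
    ctx.messages.clear()                # move 3: drop what was summarized
```

Calling `compact` at every checkpoint keeps `messages` bounded by the work done since the last checkpoint, while `knowledge` grows by one short entry each time.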

Progress tracking and termination

A long run can fail in two opposite ways, and you have to guard against both. The agent can quit early — declaring victory on incomplete or broken work — or it can never quit, looping forever and over-polishing. Knowing when a task is truly done is still an open problem in 2026, so you don't leave it to the model's judgment; you engineer it.

The core artifact is an external task list with pass/fail status — the same feature-list JSON the Initializer created. It is the single source of truth that survives every context reset and prevents a premature 'done.' Pair it with three guards:

  • Mandatory end-to-end verification before any task flips to done. No test pass, no completion. This is the antidote to confident-but-wrong self-assessment, where a drifting agent believes it finished.
  • Max-iteration / time caps with a graceful partial-completion state, so the agent stops cleanly instead of spinning forever.
  • A planner/worker split: a large frontier model plans once, and smaller models execute. Isolating planning from execution prevents mid-run goal drift and can cut cost by up to ~90%.

```python
def should_terminate(state, iteration):
    """Decide from external state, not from the model's self-assessment."""
    if iteration >= state["max_iterations"]:
        return "halt_partial"          # graceful, with progress saved
    if all(t["status"] == "done" and t["verified"]
           for t in state["features"]):
        return "complete"
    return "continue"
```

Termination is a verified-checklist question, not a vibe — and 'is the goal achieved?' is exactly the judgment a drifting agent gets wrong.
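The verification guard can be enforced at the single place where status changes. In this sketch, `complete_task` and the `run_tests` callback are hypothetical names. The callback stands in for whatever end-to-end check the project uses and is assumed to return a truthy value only on a pass.

```python
def complete_task(task: dict, run_tests) -> bool:
    """Flip a task to done only if its end-to-end check passes."""
    passed = bool(run_tests(task["name"]))   # hypothetical test-runner callback
    task["verified"] = passed
    task["status"] = "done" if passed else "pending"
    return passed
```

If this function is the only code path allowed to write `"done"`, the agent has no way to mark a feature complete without a passing check.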

Try it: Make an agent survive a context reset

Take a multi-step task you can run with any agent framework (e.g., 'scaffold a small CLI app with three features and tests'). Build the harness, not just the agent:

  1. Initializer step: write a features.json (each item: name, status: pending|done, verified: false) and an empty progress.md.
  2. Worker step: implement exactly one pending feature, run its test, and only on a pass set status: done, verified: true. Commit to git with a descriptive message. Append both the success and any failed approach to progress.md.
  3. Reset: kill the process and start a fresh session that reads only features.json, progress.md, and git log — no prior chat history. Confirm it resumes the next pending feature without redoing finished work.
  4. Terminate: stop when every feature is done and verified, or when you hit a max-iteration cap (save a partial-completion state).

Write 3–4 sentences on what broke when you forced the reset, and which artifact (features list, progress file, or git log) saved the run.

Key takeaways

  1. Long-horizon tasks fail non-linearly: errors compound, and doubling task duration roughly quadruples failure. Agents collapse at a cliff rather than degrading gently.
  2. Most long-run failure is process-level drift (goal drift, error compounding, context rot), not running out of tokens; accuracy slips at 60–70% of window capacity.
  3. Decompose into one-at-a-time, independently verifiable subtasks and checkpoint each completion to durable external state like git commits.
  4. Recover by resuming from the last known-good checkpoint with idempotent, retryable steps, and record failed approaches so sessions don't retry dead ends.
  5. Keep context bounded with the sawtooth pattern, store the goal externally and re-read it, and gate termination on a verified pass/fail task list with iteration caps.

Quiz

Lock in what you learned

Check your understanding

1. An agent step succeeds 99% of the time. Why is a 300-step task still likely to fail?

2. What does the evidence say is the dominant cause of long-horizon agent failure?

3. Why must a long-horizon agent's progress file record failed approaches, not just completed work?

4. What is the 'sawtooth' context pattern?
