Coding Agents

The killer app of agentic AI

Advanced · 15 min · Builder

What you'll be able to do

Explain why software development is uniquely suited to agentic AI compared to other domains
Trace the edit-run-test loop and the role of repo-level context files in a coding agent's workflow
Interpret SWE-bench scores and the trajectory of measured progress without over-reading the numbers
Compare the leading coding agents — Claude Code, OpenAI Codex, Cursor, Aider, and Devin — and pick one for a task
Apply practical habits for working with coding agents and recognize where they still fail and humans remain essential

At a glance

Coding is the killer app of agentic AI: the one domain where agents already merge real pull requests and where coding tools rank among the top commercial use cases. This lesson explains why software is uniquely agent-friendly (tests pass or fail, sandboxes are reversible), walks through the edit-run-test loop and the repo context that make it work, traces SWE-bench's climb from near-zero to ~79% on Verified by early 2026 (and higher since), and tours the tools that matter (Claude Code, OpenAI Codex, Cursor, Aider, and Devin) alongside an honest map of where humans stay essential.

1. Why coding is the killer app
2. The edit-run-test loop
3. Repo context and AGENTS.md
4. SWE-bench and measured progress
5. The tool landscape
6. Working well with coding agents
7. Where humans stay essential

Why coding is the killer app

If you want to know where agentic AI works today, follow the money: coding agents are the most commercially successful category, and it is not an accident. Software has a property almost no other domain offers — a cheap, automatic, ground-truth verifier. Tests pass or fail. Code compiles or it doesn't. A linter is either clean or it isn't. The agent never has to guess whether it succeeded; it runs the checks and reads the result.

Four properties stack up to make programming agent-friendly:

Verifiable outcomes — tests, compilation, and linting give an objective success signal.
Sandboxable, reversible environments — code runs in a container, and git makes every change undoable.
Clean programmatic interfaces — shells, editors, debuggers, and package managers are built to be driven by text commands.
A well-defined problem space — "make the failing test pass" is a crisp goal in a way "write a good marketing plan" never is.

Contrast this with a research or support agent, which often has no runnable check on whether its answer is correct. That missing verifier is exactly why coding raced ahead while other categories are still maturing.

The edit-run-test loop

Strip a coding agent to its core and you find one loop running over and over:

```text
read relevant files  →  plan a change  →  edit the code
        ↑                                      ↓
   read output  ←  run build / tests / linter ─┘   (repeat until checks pass
                                                    or an iteration cap is hit)
```

The magic is in the last arrow: the agent runs something and reads the result, then decides its next move. A test failure isn't a dead end — it's an observation that drives the next edit. This is the ReAct loop you met earlier, specialized for code, and it typically runs for a bounded number of turns (often around 10) before stopping.

The load-bearing requirement is a runnable verification signal. Give the agent a failing test and a command to run it, and it iterates toward green with real feedback. Take that away — a codebase with no tests — and the loop has no closure condition. The agent edits until the code looks done, not until it is done, and confidently hands you something that may not work.
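The loop above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular agent implements it: `propose_edit` is a hypothetical stand-in for the model call that edits files, `check` is whatever verifier you supply, and the default cap of 10 simply mirrors the "around 10" turns mentioned earlier.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    passed: bool
    output: str  # what the agent reads: test failures, compiler errors, lint output


def edit_run_test(
    propose_edit: Callable[[str], None],  # hypothetical: the model edits files given feedback
    check: Callable[[], CheckResult],     # runs the build, tests, or linter
    max_iterations: int = 10,             # assumed cap; real agents vary
) -> bool:
    """Iterate until the check passes or the iteration cap is hit."""
    result = check()
    for _ in range(max_iterations):
        if result.passed:
            return True
        # A failure is an observation: its output drives the next edit.
        propose_edit(result.output)
        result = check()
    return result.passed
```

Note what happens with no real verifier: if `check` always reports success, the loop exits immediately and the function returns `True` regardless of whether the code works. The closure condition is only as good as the check behind it.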

Key insight

No verifier, no agency

A coding agent is only as reliable as the signal it can run. The single highest-leverage thing you can do to make an agent succeed on your repo is give it a fast, trustworthy command — a test, a type-check, a build — that tells it whether it's right.
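One way to package that signal is a single entry point that runs your checks in order and stops at the first failure. The sketch below uses only the standard library; the `Verdict` type and the `verify` name are illustrative, and the commands you pass in are your own project's.

```python
import subprocess
from dataclasses import dataclass
from typing import Optional


@dataclass
class Verdict:
    ok: bool
    failed_command: Optional[list[str]]  # None when every check passed
    output: str                          # combined stdout/stderr of the failing check


def verify(commands: list[list[str]], timeout_s: int = 120) -> Verdict:
    """Run checks in the given order and stop at the first non-zero exit code."""
    for command in commands:
        proc = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
        if proc.returncode != 0:
            return Verdict(False, command, proc.stdout + proc.stderr)
    return Verdict(True, None, "")
```

Ordering the list from cheapest to slowest (for example a type-check before the full test suite) keeps the feedback fast, which is the "fast, trustworthy" property the agent depends on.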

Repo context and AGENTS.md

A real codebase is far too big to fit in the model's context window, so the agent works the way a new engineer does: it can't memorize the whole repo, so it needs a map and a few cheat sheets. Two techniques provide exactly that.

Repo maps. Tools like Aider build a compact map of the repository — files, key symbols, and signatures — so the model knows what exists and where without reading every file. Think of it as a table of contents rather than the full book. The agent then pulls full file contents only when it needs them, conserving context.
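To make the "table of contents" idea concrete, here is a deliberately simplified sketch that lists top-level classes and functions for Python files using the standard `ast` module. It is not Aider's implementation, which is more sophisticated and covers many languages; the function names here are illustrative.

```python
import ast
from pathlib import Path


def summarize_file(path: Path) -> list[str]:
    """Return one line per top-level class or function, with argument names only."""
    tree = ast.parse(path.read_text(encoding="utf-8"))
    funcs = (ast.FunctionDef, ast.AsyncFunctionDef)
    lines = []
    for node in tree.body:
        if isinstance(node, funcs):
            args = ", ".join(a.arg for a in node.args.args)
            lines.append(f"def {node.name}({args})")
        elif isinstance(node, ast.ClassDef):
            lines.append(f"class {node.name}")
            for item in node.body:
                if isinstance(item, funcs):
                    args = ", ".join(a.arg for a in item.args.args)
                    lines.append(f"    def {item.name}({args})")
    return lines


def build_repo_map(root: Path) -> str:
    """Build a compact symbol listing for every parseable .py file under root."""
    sections = []
    for path in sorted(root.rglob("*.py")):
        try:
            symbols = summarize_file(path)
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that do not parse
        if symbols:
            header = f"{path.relative_to(root).as_posix()}:"
            sections.append(header + "\n" + "\n".join(f"  {s}" for s in symbols))
    return "\n".join(sections)
```

The map carries names and signatures but no function bodies, so it costs a small fraction of the tokens the full files would. The agent can then request a specific file only when the map suggests it is relevant.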

Context files. AGENTS.md (the emerging cross-tool standard) and Claude Code's CLAUDE.md are repository-level files an agent loads at session start. They hold build/test commands, code-style rules, architectural decisions, and tooling quirks — the tacit knowledge a new teammate would otherwise ask about.

But don't overstate their value. A 2026 study (arXiv 2602.11988) found that developer-written context files improved task success by only ~4% while raising inference cost up to 19%, and LLM-generated context files actually hurt performance by ~3%. Use them for concrete, hard-to-guess facts — "run tests with pytest -q", "never touch legacy/" — not as a dumping ground.

Watch out

Context files aren't a magic boost

Keep AGENTS.md short and factual. Bloated, auto-generated context files cost tokens, degrade the model as the window fills, and can lower success rates. Curate, don't accumulate.

SWE-bench and measured progress

How do you measure whether a coding agent is actually any good? You give it real bugs and check whether its fix works. That is exactly what SWE-bench does, and it's the benchmark that defined the field. Each task is a real GitHub issue; the agent must edit a real codebase so the project's hidden tests pass — no partial credit for code that merely looks plausible. It comes in two tiers: the full set (2,294 issues) and SWE-bench Verified (500 human-validated issues, filtered so every task is genuinely solvable). Verified scores run higher and are the headline number you'll see quoted.
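The "no partial credit" rule can be sketched as follows. SWE-bench tasks specify tests the fix must turn green (`FAIL_TO_PASS`) and tests that must stay green (`PASS_TO_PASS`); the data class and function names below are illustrative rather than the official evaluation harness.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    instance_id: str               # hypothetical identifier for illustration
    fail_to_pass: dict[str, bool]  # tests the fix must turn green
    pass_to_pass: dict[str, bool]  # tests that must stay green


def is_resolved(task: TaskResult) -> bool:
    """A task counts only if every required test passes; there is no partial credit."""
    return (
        bool(task.fail_to_pass)
        and all(task.fail_to_pass.values())
        and all(task.pass_to_pass.values())
    )


def resolved_rate(results: list[TaskResult]) -> float:
    """Percentage of tasks fully resolved."""
    if not results:
        return 0.0
    return 100.0 * sum(is_resolved(r) for r in results) / len(results)
```

A patch that fixes the reported bug but breaks one previously passing test scores exactly the same as a patch that does nothing, which is what makes the headline percentage hard to game with plausible-looking code.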

The trajectory is the story. Scores went from effectively 0% in 2023, to ~13% in early 2024, to ~79% on Verified by early 2026, and kept climbing through mid-2026 as top agents crossed into the 85–90% range. That is one of the steepest capability climbs in modern AI.

Two caveats keep you honest:

The harness matters. A harness (or scaffold) is the agent loop wrapped around the model — the code that decides how it reads files, runs tests, and retries. A score belongs to a model plus its harness, not the model alone, so the same model can score very differently under a different loop. Always cite the harness with the number.
It's a trajectory, not a ceiling. Numbers that were state-of-the-art eighteen months ago are now mid-pack. Treat any specific figure as a snapshot.

Complementary benchmarks like METR's Time Horizon measure how long an agent can work on one coherent task autonomously — a frontier metric now measured in hours, not minutes.

The tool landscape

These tools all share the same edit-run-test engine; what differs is where they live and how much they do on their own — a terminal command you steer turn by turn, an IDE that suggests as you type, or a cloud worker you hand a ticket and walk away from. Five systems dominate as of 2026, each occupying a different niche:

Tool	Form factor	Distinctive trait
Claude Code	CLI + cloud	`CLAUDE.md`, plan mode, subagents, hooks, MCP; ~87.6% SWE-bench Verified on Opus 4.7
OpenAI Codex	Cloud agent	Relaunched May 2025; runs each task in an isolated sandbox and proposes PRs
Cursor	AI-native IDE	VS Code fork; Agent Mode, the Composer model, parallel background cloud agents
Aider	Terminal	Open-source, model-agnostic (100+ LLMs), repo-map, auto-commits with git
Devin	Autonomous agent	Own shell, browser, and editor; best on scoped, junior-level tasks

A few clarifications on points that trip people up. The 2025 OpenAI Codex is not the 2021 Codex: the old one was a code-completion model behind early GitHub Copilot, while the new one is a full autonomous agent. Aider is a harness, not a model, so its quality depends heavily on the LLM you point it at (GPT-5 with high reasoning hits 88% on Aider's polyglot benchmark). And Cursor is a full fork, not "VS Code with a plugin"; its proprietary models and cloud VMs are what make its agent mode work. GitHub Copilot, the dominant IDE completion tool, has grown agent modes too but remains primarily an in-editor assistant.

Working well with coding agents

Getting value from these tools is a skill. A handful of habits separate frustration from a force multiplier.

Give it a verifier. Point the agent at a test command or build before you ask for the change. The loop needs a signal.
Manage context deliberately. A debugging session can burn tens of thousands of tokens, and a full window degrades the model. Use /clear between unrelated tasks, lean on compaction/summarization, and push investigation work into subagents so it doesn't pollute the main thread.
Plan before editing. Plan mode (explore and propose first) catches misunderstandings before any code changes.
Scope tightly. Agents shine on clearly bounded tasks with verifiable outcomes — the kind a junior engineer would finish in a few hours.
Review every diff. You are the senior reviewer. Read the change, run it, and own it.
Keep humans on risky actions. Use permission gates for destructive or irreversible operations.
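As an illustration of the last habit, a permission gate can be as simple as classifying a command before it runs and asking a human when it looks destructive. The sketch below is an assumption-laden toy: the deny-list is deliberately incomplete, it inspects only the first program in the command (so chained shell commands would slip through), and real tools generally favor stricter allow-lists.

```python
import shlex
from typing import Callable

# Illustrative, incomplete deny-list; not taken from any specific tool.
RISKY_PROGRAMS = {"rm", "sudo", "dd", "mkfs"}


def needs_approval(command: str) -> bool:
    """Return True if the command looks destructive or hard to undo."""
    tokens = shlex.split(command)
    if not tokens:
        return False
    program, args = tokens[0], tokens[1:]
    if program in RISKY_PROGRAMS:
        return True
    if program == "git" and args:
        if args[0] == "push" and ("--force" in args or "-f" in args):
            return True
        if args[0] == "reset" and "--hard" in args:
            return True
        if args[0] == "clean":
            return True
    return False


def may_run(command: str, approve: Callable[[str], bool]) -> bool:
    """Safe commands run freely; risky ones need an explicit human yes."""
    return (not needs_approval(command)) or approve(command)
```

In an interactive session `approve` would prompt the user; in an unattended run it would typically just return `False`, so risky operations are refused rather than silently executed.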

```bash
# A tight, reviewable Claude Code task in non-interactive mode
claude -p "Fix the failing test in tests/test_auth.py. \
Run: pytest tests/test_auth.py -q. Iterate until it passes. \
Do not modify files outside src/auth/."
```
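A prompt instruction such as "do not modify files outside src/auth/" is a request, not a guarantee, so it is worth checking when you review the diff. Here is a minimal sketch of that check; the function name is illustrative, and the list of changed paths is assumed to come from something like `git diff --name-only`.

```python
from pathlib import PurePosixPath


def out_of_scope(changed_paths: list[str], allowed_dirs: list[str]) -> list[str]:
    """Return the changed paths that fall outside every allowed directory."""
    allowed = [PurePosixPath(d) for d in allowed_dirs]
    violations = []
    for raw in changed_paths:
        path = PurePosixPath(raw)
        # Compare path components, not string prefixes, so that
        # "src/authz/..." is not mistaken for "src/auth/...".
        if not any(base == path or base in path.parents for base in allowed):
            violations.append(raw)
    return violations
```

For example, `out_of_scope(["src/auth/tokens.py", "src/authz/roles.py"], ["src/auth/"])` returns `["src/authz/roles.py"]`. An empty result means the agent stayed inside the boundary you set.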

Tip

Treat the agent like a fast junior engineer

Hand it a scoped task with clear acceptance criteria and a way to check its work — then review the result. The bottleneck isn't the agent's speed; it's the quality of the spec and the verifier you give it.

Where humans stay essential

The progress is real, and so are the limits. Coding agents still fail predictably in a few places.

Ambiguous specs. Without a precise goal and a verifier, the agent fills gaps with assumptions — often wrong ones.
Complex inter-file reasoning. Changes that ripple across many modules, or depend on subtle cross-cutting invariants, exceed what current agents reliably track.
Tacit knowledge. "We don't do it that way here" — undocumented conventions, business context, and team history — lives in people's heads, not the repo.
Mid-task scope changes. Autonomous agents like Devin do best on stable, scoped work; shifting requirements throw them off.

The honest framing: Devin's PR merge rate reached 67% in 2025 (up from 34% the year before). That is impressive, and it still leaves a third of PRs unmerged. Agents are not general-purpose software engineers. They are powerful executors of well-specified work. Architecture, judgment on edge cases, and oversight remain human jobs, which is precisely why your skill at scoping, verifying, and reviewing is what determines whether these tools help or hurt.

Try it: Run a coding agent on one scoped bug

Pick a small repo of yours (or clone a tiny open-source project) that has at least one test command.

1) Establish the verifier: find or write a single failing test and confirm the command to run it (e.g. pytest -q).
2) Add minimal context: create an AGENTS.md (or CLAUDE.md) with just the build/test command and one rule (e.g. "don't edit legacy/").
3) Run an agent (Claude Code, Aider, or another) and give it exactly one scoped task: make that test pass, with the test command in the prompt.
4) Observe the loop: watch it edit, run, read the failure, and iterate.
5) Review the diff before accepting, then run the full test suite yourself.

Write three sentences: what the agent got right, where it needed steering, and whether the verifier was the thing that made it converge. This builds the core instinct: scope tightly, give a verifier, review every diff.

Key takeaways

1. Coding leads agentic AI because software offers automatic verifiers (tests, compilation, linting), reversible sandboxes, clean interfaces, and well-defined goals.
2. The edit-run-test loop is the engine, and it only works with a runnable verification signal: no verifier, no real closure condition.
3. SWE-bench Verified rose from ~0% in 2023 to ~79% by early 2026 and continued climbing through mid-2026 (top agents exceeding 85–90%), but scores depend on the harness and represent a moving trajectory, not a ceiling.
4. Claude Code, OpenAI Codex, Cursor, Aider, and Devin each fit a different niche, and Aider's quality depends on whichever model you pair with it.
5. Agents excel on tightly scoped, verifiable tasks; ambiguous specs, deep inter-file reasoning, and tacit knowledge keep humans essential for architecture and review.

Quiz

Lock in what you learned

Check your understanding

1. Why is software development uniquely suited to agentic AI compared to domains like research or customer support?

2. What is the single most important requirement for the edit-run-test loop to actually work?

3. Which statement about SWE-bench is correct?

4. Which is an accurate characterization of the leading coding tools?

Go deeper

Hand-picked sources to keep learning

Claude Code Best Practices (Official Docs)

Authoritative guide to CLAUDE.md, plan mode, subagents, and context management.

SWE-bench Official Leaderboard

Live coding-agent scores on real GitHub issues; mind the Verified vs full distinction.

Evaluating AGENTS.md (arXiv 2602.11988)

Empirical study finding repository context files add only ~4% — and LLM-generated ones can hurt.

Devin's 2025 Performance Review — Cognition AI

First-party data on PR merge rates, speed, and an honest take on limitations.

Introducing Codex — OpenAI

Official launch of the 2025 autonomous Codex agent; explains the sandbox-per-task model.

Aider GitHub Repository

Open-source terminal coding agent; documents the repo-map approach and polyglot benchmark.