Coding Agents
The killer app of agentic AI
- Explain why software development is uniquely suited to agentic AI compared to other domains
- Trace the edit-run-test loop and the role of repo-level context files in a coding agent's workflow
- Interpret SWE-bench scores and the trajectory of measured progress without over-reading the numbers
- Compare the leading coding agents — Claude Code, OpenAI Codex, Cursor, Aider, and Devin — and pick one for a task
- Apply practical habits for working with coding agents and recognize where they still fail and humans remain essential
Coding is the killer app of agentic AI — the one place where agents already merge real pull requests and rank as a top commercial use case. This lesson explains why software is uniquely agent-friendly (tests pass or fail, sandboxes are reversible), walks the edit-run-test loop and repo context that make it work, traces SWE-bench's climb from near-zero to ~79%, and tours the tools that matter — Claude Code, OpenAI Codex, Cursor, Aider, and Devin — alongside an honest map of where humans stay essential.
- 1. Why coding is the killer app
- 2. The edit-run-test loop
- 3. Repo context and AGENTS.md
- 4. SWE-bench and measured progress
- 5. The tool landscape
- 6. Working well with coding agents
- 7. Where humans stay essential
Why coding is the killer app
If you want to know where agentic AI works today, follow the money: coding agents are the most commercially successful category, and it is not an accident. Software has a property almost no other domain offers — a cheap, automatic, ground-truth verifier. Tests pass or fail. Code compiles or it doesn't. A linter is either clean or it isn't. The agent never has to guess whether it succeeded; it runs the checks and reads the result.
Four properties stack up to make programming agent-friendly:
- Verifiable outcomes: tests, compilation, and linting give an objective success signal.
- Sandboxable, reversible environments: code runs in a container, and git makes every change undoable.
- Clean programmatic interfaces: shells, editors, debuggers, and package managers are built to be driven by text commands.
- A well-defined problem space: "make the failing test pass" is a crisp goal in a way "write a good marketing plan" never is.
Contrast this with a research or support agent, which often has no runnable check on whether its answer is correct. That missing verifier is exactly why coding raced ahead while other categories are still maturing.
The edit-run-test loop
Strip a coding agent to its core and you find one loop running over and over:
read relevant files → plan a change → edit the code
        ↑                                  ↓
   read output ← run build / tests / linter ─┘
(repeat until checks pass or an iteration cap is hit)

The magic is in the last arrow: the agent runs something and reads the result, then decides its next move. A test failure isn't a dead end; it's an observation that drives the next edit. This is the ReAct loop you met earlier, specialized for code, and it typically runs for a bounded number of turns (often around 10) before stopping.
The load-bearing requirement is a runnable verification signal. Give the agent a failing test and a command to run it, and it iterates toward green with real feedback. Take that away — a codebase with no tests — and the loop has no closure condition. The agent edits until the code looks done, not until it is done, and confidently hands you something that may not work.
Key insight
No verifier, no agency
A coding agent is only as reliable as the signal it can run. The single highest-leverage thing you can do to make an agent succeed on your repo is give it a fast, trustworthy command — a test, a type-check, a build — that tells it whether it's right.
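The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular tool implements it: `propose_edit` is a hypothetical callback standing in for the model's plan-and-edit step, `verify` is whatever runnable check you supply, and the default cap of 10 simply mirrors the "around 10" turns mentioned earlier.

```python
import subprocess
from typing import Callable, Optional


def run_verifier(command: list[str]) -> tuple[bool, str]:
    """Run the check command and return (passed, combined output)."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr


def edit_run_test_loop(
    propose_edit: Callable[[Optional[str]], None],
    verify: Callable[[], tuple[bool, str]],
    max_iterations: int = 10,
) -> bool:
    """Edit, run the verifier, and feed its output back until it passes or the cap is hit."""
    observation: Optional[str] = None  # no feedback exists before the first run
    for _ in range(max_iterations):
        propose_edit(observation)      # hypothetical: the model edits files here
        passed, output = verify()      # run build / tests / linter
        if passed:
            return True                # closure condition: the checks are green
        observation = output           # a failure becomes input to the next edit
    return False                       # iteration cap hit without passing
```

In use, `verify` might be `lambda: run_verifier(["pytest", "-q"])`. Without such a command there is nothing meaningful to pass as `verify`, which is the "no closure condition" problem in code form.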
Repo context and AGENTS.md
A real codebase is far too big to fit in the model's context window, so the agent works the way a new engineer does: rather than memorizing the whole repo, it relies on a map and a few cheat-sheets. Two techniques provide exactly that.
Repo maps. Tools like Aider build a compact map of the repository — files, key symbols, and signatures — so the model knows what exists and where without reading every file. Think of it as a table of contents rather than the full book. The agent then pulls full file contents only when it needs them, conserving context.
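To make the "table of contents" idea concrete, here is a rough sketch that lists classes and function signatures for Python files using the standard `ast` module. It is an assumption-laden illustration of the general idea, not Aider's implementation; the function names `map_source` and `build_repo_map` are invented for this example, and it handles only Python source.

```python
import ast
from pathlib import Path


def _signature(node: ast.AST) -> str:
    """Format a function node as 'def name(arg, ...)' using positional argument names."""
    args = ", ".join(a.arg for a in node.args.args)
    return f"def {node.name}({args})"


def map_source(source: str) -> list[str]:
    """List top-level classes and functions, plus methods, found in Python source text."""
    functions = (ast.FunctionDef, ast.AsyncFunctionDef)
    entries = []
    for node in ast.parse(source).body:
        if isinstance(node, functions):
            entries.append(_signature(node))
        elif isinstance(node, ast.ClassDef):
            entries.append(f"class {node.name}")
            entries.extend(
                f"    {_signature(item)}" for item in node.body if isinstance(item, functions)
            )
    return entries


def build_repo_map(root: str) -> str:
    """Concatenate per-file symbol listings into one compact text map."""
    lines = []
    for path in sorted(Path(root).rglob("*.py")):
        try:
            symbols = map_source(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that cannot be parsed
        if symbols:
            lines.append(f"{path.relative_to(root)}:")
            lines.extend(f"  {s}" for s in symbols)
    return "\n".join(lines)
```

The resulting text is far smaller than the files themselves, so it can sit in the context window while full file contents are fetched only on demand.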
Context files. AGENTS.md (the emerging cross-tool standard) and Claude Code's CLAUDE.md are repository-level files an agent loads at session start. They hold build/test commands, code-style rules, architectural decisions, and tooling quirks — the tacit knowledge a new teammate would otherwise ask about.
But don't overstate their value. A 2026 study (arXiv 2602.11988) found that developer-written context files improved task success by only ~4% while raising inference cost by up to 19%, and LLM-generated context files actually hurt performance by ~3%. Use them for concrete, hard-to-guess facts such as "run tests with pytest -q" or "never touch legacy/", not as a dumping ground.
Watch out
Context files aren't a magic boost
Keep AGENTS.md short and factual. Bloated, auto-generated context files cost tokens, degrade the model as the window fills, and can lower success rates. Curate, don't accumulate.
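One way to enforce "curate, don't accumulate" is a small check that flags a context file once it grows past a budget. The sketch below is purely illustrative: the sample file contents reuse the two facts quoted above, while the function name and the 40-line budget are arbitrary assumptions, not a recommendation from the study or from any tool.

```python
# A deliberately short context file: one command and one rule.
AGENTS_MD = """\
# AGENTS.md
- Run tests with: pytest -q
- Never touch legacy/
"""


def context_file_warnings(text: str, max_lines: int = 40) -> list[str]:
    """Return warnings if the context file exceeds an (assumed) line budget."""
    warnings = []
    non_blank = [line for line in text.splitlines() if line.strip()]
    if len(non_blank) > max_lines:
        warnings.append(
            f"{len(non_blank)} non-blank lines exceeds the budget of {max_lines}"
        )
    return warnings
```

A check like this could run in CI so that growth of the context file becomes a deliberate decision rather than an accident.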
SWE-bench and measured progress
How do you measure whether a coding agent is actually any good? You give it real bugs and check whether its fix works. That is exactly what SWE-bench does, and it's the benchmark that defined the field. Each task is a real GitHub issue; the agent must edit a real codebase so the project's hidden tests pass — no partial credit for code that merely looks plausible. It comes in two tiers: the full set (2,294 issues) and SWE-bench Verified (500 human-validated issues, filtered so every task is genuinely solvable). Verified scores run higher and are the headline number you'll see quoted.
The trajectory is the story. Scores went from effectively 0% in 2023, to ~13% in early 2024, to roughly 79% on Verified by early 2026, and kept climbing through mid-2026 as top agents crossed into the 85–90% range. That is one of the steepest capability climbs in modern AI.
Two caveats keep you honest:
- The harness matters. A harness (or scaffold) is the agent loop wrapped around the model — the code that decides how it reads files, runs tests, and retries. A score belongs to a model plus its harness, not the model alone, so the same model can score very differently under a different loop. Always cite the harness with the number.
- It's a trajectory, not a ceiling. Numbers that were state-of-the-art eighteen months ago are now mid-pack. Treat any specific figure as a snapshot.
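The two caveats suggest recording a score together with the harness that produced it. The sketch below shows the arithmetic and one possible record shape; the class name, field names, and the labels `model-x` and `harness-a` are hypothetical, and the 395-of-500 figure is simply what a 79% score works out to on a 500-task set, not a reported result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkRun:
    model: str     # hypothetical model label
    harness: str   # the scaffold the model ran under
    resolved: int  # tasks whose hidden tests passed (no partial credit)
    total: int     # tasks attempted

    @property
    def resolve_rate(self) -> float:
        """Percentage of tasks resolved."""
        return 100.0 * self.resolved / self.total

    def citation(self) -> str:
        """Format the score so the harness always travels with the number."""
        return (
            f"{self.model} + {self.harness}: "
            f"{self.resolve_rate:.1f}% ({self.resolved}/{self.total})"
        )


run = BenchmarkRun(model="model-x", harness="harness-a", resolved=395, total=500)
print(run.citation())  # model-x + harness-a: 79.0% (395/500)
```

Keeping `harness` as a required field makes it harder to quote a bare number that hides which agent loop produced it.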
Complementary benchmarks like METR's Time Horizon measure how long an agent can work on one coherent task autonomously — a frontier metric now measured in hours, not minutes.
The tool landscape
These tools all share the same edit-run-test engine; what differs is where they live and how much they do on their own — a terminal command you steer turn by turn, an IDE that suggests as you type, or a cloud worker you hand a ticket and walk away from. Five systems dominate as of 2026, each occupying a different niche:
| Tool | Form factor | Distinctive trait |
|---|---|---|
| Claude Code | CLI + cloud | CLAUDE.md, plan mode, subagents, hooks, MCP; ~87.6% SWE-bench Verified on Opus 4.7 |
| OpenAI Codex | Cloud agent | Relaunched May 2025; runs each task in an isolated sandbox and proposes PRs |
| Cursor | AI-native IDE | VS Code fork; Agent Mode, the Composer model, parallel background cloud agents |
| Aider | Terminal | Open-source, model-agnostic (100+ LLMs), repo-map, auto-commits with git |
| Devin | Autonomous agent | Own shell, browser, and editor; best on scoped, junior-level tasks |
A few clarifications that trip people up. The 2025 OpenAI Codex is not the 2021 Codex — the old one was a code-completion model behind early GitHub Copilot; the new one is a full autonomous agent. Aider is a harness, not a model — its quality depends entirely on the LLM you point it at (GPT-5 with high reasoning hits 88% on Aider's polyglot benchmark). And Cursor is a full fork, not "VS Code with a plugin" — proprietary models and cloud VMs are what make its agent mode work. GitHub Copilot, the dominant IDE completion tool, has grown agent modes too but remains primarily an in-editor assistant.
Working well with coding agents
Getting value from these tools is a skill. A handful of habits separate frustration from a force multiplier.
- Give it a verifier. Point the agent at a test command or build before you ask for the change. The loop needs a signal.
- Manage context deliberately. A debugging session can burn tens of thousands of tokens, and a full window degrades the model. Use /clear between unrelated tasks, lean on compaction/summarization, and push investigation work into subagents so it doesn't pollute the main thread.
- Plan before editing. Plan mode (explore and propose first) catches misunderstandings before any code changes.
- Scope tightly. Agents shine on clearly bounded tasks with verifiable outcomes — the kind a junior engineer would finish in a few hours.
- Review every diff. You are the senior reviewer. Read the change, run it, and own it.
- Keep humans on risky actions. Use permission gates for destructive or irreversible operations.
# A tight, reviewable Claude Code task in non-interactive mode
claude -p "Fix the failing test in tests/test_auth.py. \
Run: pytest tests/test_auth.py -q. Iterate until it passes. \
Do not modify files outside src/auth/."
Tip
Treat the agent like a fast junior engineer
Hand it a scoped task with clear acceptance criteria and a way to check its work — then review the result. The bottleneck isn't the agent's speed; it's the quality of the spec and the verifier you give it.
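The "keep humans on risky actions" habit can be sketched as a simple permission gate placed in front of command execution. This is a naive illustration under stated assumptions: the prefix list is incomplete and invented for the example, `approve` is a hypothetical callback (for instance, a prompt shown to the user), and prefix matching misses reordered flags, so real tools need stronger policies.

```python
import shlex
from typing import Callable

# Illustrative, incomplete list of command prefixes treated as destructive.
DESTRUCTIVE_PREFIXES = [
    ["rm"],
    ["git", "push", "--force"],
    ["git", "reset", "--hard"],
]


def needs_approval(command: str) -> bool:
    """Return True if the command starts with a known destructive prefix."""
    tokens = shlex.split(command)
    return any(tokens[: len(prefix)] == prefix for prefix in DESTRUCTIVE_PREFIXES)


def gate(command: str, approve: Callable[[str], bool]) -> bool:
    """Return True if the command may run; risky commands require human approval."""
    if not needs_approval(command):
        return True
    return approve(command)
```

With this shape, `gate("pytest -q", approve)` proceeds without asking, while `gate("git reset --hard HEAD~1", approve)` runs only if the human says yes.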
Where humans stay essential
The progress is real, and so are the limits. Coding agents still fail predictably in a few places.
- Ambiguous specs. Without a precise goal and a verifier, the agent fills gaps with assumptions — often wrong ones.
- Complex inter-file reasoning. Changes that ripple across many modules, or depend on subtle cross-cutting invariants, exceed what current agents reliably track.
- Tacit knowledge. "We don't do it that way here" — undocumented conventions, business context, and team history — lives in people's heads, not the repo.
- Mid-task scope changes. Autonomous agents like Devin do best on stable, scoped work; shifting requirements throw them off.
The honest framing: Devin's PR merge rate reached 67% in 2025 (up from 34% the year before) — impressive, and still a third of PRs needing rework. Agents are not general-purpose software engineers. They are powerful executors of well-specified work. Architecture, judgment on edge cases, and oversight remain human jobs — which is precisely why your skill at scoping, verifying, and reviewing is what determines whether these tools help or hurt.
Try it: Run a coding agent on one scoped bug
Pick a small repo of yours (or clone a tiny open-source project) that has at least one test command. 1) Establish the verifier: find or write a single failing test and confirm the command to run it (e.g. pytest -q). 2) Add minimal context: create an AGENTS.md (or CLAUDE.md) with just the build/test command and one rule (e.g. "don't edit legacy/"). 3) Run an agent — Claude Code, Aider, or another — and give it exactly one scoped task: make that test pass, with the test command in the prompt. 4) Observe the loop: watch it edit, run, read the failure, and iterate. 5) Review the diff before accepting, then run the full test suite yourself. Write three sentences: what the agent got right, where it needed steering, and whether the verifier was the thing that made it converge. This builds the core instinct — scope tightly, give a verifier, review every diff.
Key takeaways
- 1. Coding leads agentic AI because software offers automatic verifiers (tests, compilation, linting), reversible sandboxes, clean interfaces, and well-defined goals.
- 2. The edit-run-test loop is the engine, and it only works with a runnable verification signal: no verifier, no real closure condition.
- 3. SWE-bench Verified rose from ~0% in 2023 to ~79% by early 2026 and continued climbing through mid-2026 (top agents exceeding 85–90%), but scores depend on the harness and represent a moving trajectory, not a ceiling.
- 4. Claude Code, OpenAI Codex, Cursor, Aider, and Devin each fit a different niche, and Aider's quality depends on whichever model you pair with it.
- 5. Agents excel on tightly scoped, verifiable tasks; ambiguous specs, deep inter-file reasoning, and tacit knowledge keep humans essential for architecture and review.
Quiz
Lock in what you learned
Check your understanding
1. Why is software development uniquely suited to agentic AI compared to domains like research or customer support?
2. What is the single most important requirement for the edit-run-test loop to actually work?
3. Which statement about SWE-bench is correct?
4. Which is an accurate characterization of the leading coding tools?
Go deeper
Hand-picked sources to keep learning
Authoritative guide to CLAUDE.md, plan mode, subagents, and context management.
Live coding-agent scores on real GitHub issues; mind the Verified vs full distinction.
Empirical study finding repository context files add only ~4% — and LLM-generated ones can hurt.
First-party data on PR merge rates, speed, and an honest take on limitations.
Official launch of the 2025 autonomous Codex agent; explains the sandbox-per-task model.
Open-source terminal coding agent; documents the repo-map approach and polyglot benchmark.