Evaluating Agents

If you can't measure it, you can't ship it

Advanced · 16 min · Builder · Decision-maker
What you'll be able to do
  • Explain why evaluation is the real bottleneck for shipping agents and adopt eval-driven development
  • Distinguish outcome, trajectory, and efficiency evaluation and know when each matters
  • Choose between code-based, model-based (LLM-as-judge), and human graders — and mitigate judge bias
  • Read agent benchmarks (SWE-bench, GAIA, τ-bench, WebArena) critically, including contamination and reliability caveats
  • Build a small representative eval set and run it offline before deploying and online after
At a glance

Evaluating an agent is harder than evaluating a model: the agent is non-deterministic, takes many steps, and its value lives in both the final outcome and the path it took to get there. This lesson gives you a three-layer eval stack — outcome, trajectory, efficiency — plus graders, benchmarks, and an offline-to-online workflow you can actually ship against.

  1. Why evaluation is the bottleneck
  2. Three things worth measuring
  3. Three kinds of graders
  4. LLM-as-judge: power and pitfalls
  5. Building a representative eval set
  6. Reading public benchmarks critically
  7. Offline suites and online monitoring

Why evaluation is the bottleneck

Here is the trap almost every team falls into. You tweak a prompt, run the agent once on a demo, watch it work, ship it — and then learn about the real failures from an angry user. The thing standing between a flashy demo and a dependable product is not a smarter model. It is evaluation: a repeatable way to tell whether a change made the agent better or worse. Without it, every change you make is a coin flip you cannot see the result of.

The 2025-2026 best practice has a name: eval-driven development. Before you build, you write down what the agent must do and turn that into a test suite. Then you iterate against the suite instead of against gut feel. It is test-driven development, adapted for systems that are non-deterministic — systems that can produce a different answer each time you run them.

Why is this so much harder than grading a plain chatbot? A chatbot gives you one output you grade once. An agent runs a multi-step trajectory — it calls a tool, looks at the result, re-plans, calls another tool — and the very same input can take a different path on a different run. So you have to grade both the destination and the journey, across repeated trials, often with no single official 'right answer' to check against.

Key insight

The reframe

An agent eval set is not a final exam you run once at the end. It is the instrument panel you build first and watch the whole time. Teams without one fly blind and find out about failures from customers.

Three things worth measuring

When you grade an agent, there are really only three questions to ask — and skipping any one will blindside you. Think of them as a stack: did it work, did it work well, and what did it cost?

  1. Outcome — did the task actually succeed? Check the final state of the world, not the agent's own words. An agent will cheerfully announce "Done, I booked your flight" while no reservation exists in the database. Grade the world, not the transcript.
  2. Trajectory — was the path reasonable? Did it pick sensible tools, avoid redundant calls, recover from errors, and reason soundly? A right answer stumbled into through ten wasted steps is a fragile answer that will break the moment the task shifts.
  3. Efficiency — what did it cost? Token spend, latency, and number of steps. An agent that succeeds but takes 90 seconds and $0.40 per task may simply be too expensive to ship.
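
| Dimension  | Question           | Typical signal                          |
|------------|--------------------|-----------------------------------------|
| Outcome    | Did it succeed?    | Final state / DB check, pass/fail       |
| Trajectory | Was the path good? | Tool-call quality, step count, reasoning |
| Efficiency | What did it cost?  | Tokens, latency, steps                  |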

Measuring outcome alone is the single most common mistake. Trajectory eval is what catches agents that succeed for the wrong reasons — the ones that look great on your demo and collapse on the next variation.
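
To make the trajectory and efficiency layers concrete, here is a minimal sketch of a report computed from one recorded run. The `Step` and `Trace` shapes, the tool names, and the `max_steps` budget are all assumptions for illustration; a real agent framework will expose its traces differently.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Step:
    tool: str        # hypothetical tool name, e.g. "search_flights"
    args: tuple      # hashable summary of the call arguments
    ok: bool = True  # False if the tool call returned an error


@dataclass
class Trace:
    steps: list = field(default_factory=list)
    tokens: int = 0
    latency_s: float = 0.0


def trajectory_report(trace: Trace, max_steps: int = 10) -> dict:
    """Summarize path quality (trajectory) and cost (efficiency) for one run."""
    calls = Counter((s.tool, s.args) for s in trace.steps)
    # Every repeat of an identical (tool, args) call counts as redundant.
    redundant = sum(n - 1 for n in calls.values())
    errors = sum(1 for s in trace.steps if not s.ok)
    return {
        "steps": len(trace.steps),
        "redundant_calls": redundant,
        "tool_errors": errors,
        "within_step_budget": len(trace.steps) <= max_steps,
        "tokens": trace.tokens,
        "latency_s": trace.latency_s,
    }


example = Trace(
    steps=[
        Step("search_flights", ("NYC", "SFO")),
        Step("search_flights", ("NYC", "SFO")),  # identical repeat
        Step("book_flight", ("F1",), ok=False),
    ],
    tokens=1200,
    latency_s=3.5,
)
report = trajectory_report(example)
```

Counts like these will not tell you whether the reasoning was sound, but they are cheap to compute on every run and flag the runs worth reading by hand.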

Watch out

Grade the state, not the story

An agent can report success in fluent prose while the actual database, file system, or API state shows failure. Always verify outcome against ground-truth state, never against the agent's self-report.

Three kinds of graders

Once you know what to measure, you need to decide who does the grading. There are three options, and a mature setup uses all three — each covers the others' blind spots.

  • Code-based graders — plain functions that check the result. Fast, cheap, and deterministic: a string match, an exact-value check, a binary pass/fail, or comparing the final database state to an expected one. Reach for these wherever the outcome is objectively verifiable.
  • Model-based graders (LLM-as-judge) — an LLM reads the output and scores it against a rubric. This is the only practical option for subjective or open-ended tasks ("is this summary faithful and genuinely useful?") where no exact answer exists. The catch: judges are non-deterministic and carry predictable biases (next section).
  • Human graders — the gold standard for nuance, and the source of truth you calibrate the other two against. Accurate but slow, expensive, and impossible to run on every commit.
python
# Code-based outcome grader: check real state, not the transcript.
def grade_booking(db, expected):
    booking = db.get_latest_booking(user_id=expected["user_id"])
    return {
        "passed": booking is not None
        and booking["flight"] == expected["flight"]
        and booking["status"] == "confirmed",
        "detail": booking,
    }

Rule of thumb: use code graders wherever you can, LLM judges where you must, and humans to keep the other two honest.

LLM-as-judge: power and pitfalls

An LLM judge lets you apply human-like judgment to thousands of cases for cents apiece, which is exactly why it is so tempting, and so dangerous, to treat it as a stand-in for a human. It is not equivalent. In the Agent-as-a-Judge study (Zhuge et al., 2024), a single LLM judge disagreed with the human majority vote about 31% of the time on code tasks. And the disagreements are systematic, not random noise you can average away:

  • Length bias — favors longer answers regardless of quality.
  • Position bias — favors whichever option is shown first in a pairwise comparison.
  • Self-model bias — judges prefer outputs that resemble their own style.
  • Deception vulnerability — a confident, persuasive tone can win the verdict even when the content is nonsense.

The good news: some of these are cheap to detect. To catch position bias, present the same pair twice with the order swapped — if the verdict flips, the judge was biased on that case. Standard mitigations include randomizing order, scoring against an explicit written rubric (not a vague "which is better?"), and polling multiple judges.
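
The order-swap check can be sketched in a few lines. The `Judge` signature here is hypothetical: it stands in for whatever wraps your LLM call and is assumed to return the string "first" or "second". The stub judge at the bottom exists only to show what a fully position-biased judge looks like under the test.

```python
from typing import Callable

# Hypothetical judge signature: (prompt, first_answer, second_answer) -> "first" | "second".
# In practice this would wrap an LLM call with a written rubric.
Judge = Callable[[str, str, str], str]


def swap_test(judge: Judge, prompt: str, a: str, b: str) -> str:
    """Judge both orderings; return "a", "b", or "tie" if the verdict flips."""
    verdict_ab = judge(prompt, a, b)  # a shown first
    verdict_ba = judge(prompt, b, a)  # b shown first
    winner_ab = "a" if verdict_ab == "first" else "b"
    winner_ba = "b" if verdict_ba == "first" else "a"
    return winner_ab if winner_ab == winner_ba else "tie"


def always_first(prompt: str, first: str, second: str) -> str:
    """Stub judge with pure position bias: always picks whatever is shown first."""
    return "first"


# A purely position-biased judge never survives the swap.
biased_result = swap_test(always_first, "Which answer is better?", "answer A", "answer B")
```

Verdicts that come back as "tie" are the ones to inspect manually; a high tie rate suggests the rubric or the judge prompt needs work before the scores mean anything.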

A key advance is Agent-as-a-Judge (Zhuge et al., 2024): instead of grading only the final answer, an agent evaluates another agent by inspecting its entire trajectory. On code-generation evaluation, this agent judge differed from the human majority only 0.3% of the time — versus 31% for a single LLM judge — because seeing how the work was done is far more informative than seeing only what was produced.

Tip

Swap test for position bias

For any pairwise LLM-judge comparison, run it twice with the order reversed. Only trust verdicts that survive the swap; treat flips as ties and investigate. This one trick removes a huge source of phantom signal.

Building a representative eval set

Your eval set is the most valuable asset you will build — and the good news is it does not start big. Begin with 20-50 hand-reviewed tasks pulled from real failures, support tickets, and your own dogfooding. Avoid synthetic data here: machine-generated tasks tend to be too clean and too easy, and they lull you into a false sense of safety.

Follow these rules:

  1. Include negatives, not just positives. Cover cases where the agent should act and cases where it should decline or ask for clarification. An agent that never says no is a liability.
  2. Prove tasks are solvable. Give each task a reference solution, so a 0% score means "the agent failed," not "the task was impossible."
  3. Grade the outcome, not the exact path. Penalizing a valid, creative solution just because it differs from your expected steps produces misleading failures — a classic grader bug.
  4. Close the loop. Every production failure becomes a new eval case, so the set grows alongside your understanding of where the agent breaks.
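
The rules above can be partly enforced by a small check that runs before the suite does. This is a sketch under assumed names: the `EvalCase` fields and the three `expected_behavior` labels are illustrative, not a standard schema.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    id: str
    input: str
    expected_behavior: str      # assumed labels: "act", "decline", or "clarify"
    reference_solution: str     # evidence that the task is solvable
    source: str = "production"  # e.g. "support_ticket", "dogfooding"


def check_eval_set(cases: list) -> list:
    """Return a list of problems; an empty list means these checks passed."""
    problems = []
    if not any(c.expected_behavior in ("decline", "clarify") for c in cases):
        problems.append("no negative cases (decline/clarify)")
    if not any(c.expected_behavior == "act" for c in cases):
        problems.append("no positive cases (act)")
    for c in cases:
        if not c.reference_solution.strip():
            problems.append(f"{c.id}: missing reference solution")
    ids = [c.id for c in cases]
    if len(ids) != len(set(ids)):
        problems.append("duplicate case ids")
    return problems


cases = [
    EvalCase("c1", "Book the 9am flight to SFO", "act", "book_flight('F1') then confirm"),
    EvalCase("c2", "Book a flight for someone else's account", "decline", ""),
]
problems = check_eval_set(cases)
```

The check cannot tell you whether the cases are representative, only that the set is structurally sound; the hand review still has to happen.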

Finally, measure reliability with pass^k, not just pass@k. pass@k asks "did at least one of k attempts succeed?" — that is a capability metric (can it ever do this?). pass^k asks "did all k attempts succeed?" — the reliability metric that actually matters in production (can it do this every time?). τ-bench showed top models with respectable single-pass rates falling below 25% at pass^8: they fail most of the time precisely when you need consistency.
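
Both metrics can be estimated for a single task from n recorded trials with c successes. The sketch below uses the standard combinatorial estimators (sampling k of the n trials without replacement); treat the exact form any given benchmark uses as something to confirm against its own documentation. The function names are illustrative.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k trials succeeds), given c successes in n trials."""
    _check(n, c, k)
    if n - c < k:
        return 1.0  # too few failures to fill k draws, so some draw must succeed
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k trials succeed), given c successes in n trials."""
    _check(n, c, k)
    return comb(c, k) / comb(n, k)  # comb(c, k) is 0 when k > c


# A task that succeeded on 6 of 8 runs (75% per trial), evaluated at k = 4.
capability = pass_at_k(8, 6, 4)
reliability = pass_hat_k(8, 6, 4)
```

In this example the agent always has at least one success among any 4 runs, yet the chance that all 4 succeed is only 15/70, roughly 0.21. Average the per-task values across the suite to get the headline numbers.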

Watch out

A 100% pass rate is a red flag

Saturating your eval set does not mean the agent is perfect — it means the set stopped measuring the hard cases. Progress will look like it stalled when really only the difficult tasks remain. Keep adding harder cases from production.

Reading public benchmarks critically

Public benchmarks are useful for one thing: comparing models against each other on a shared task. They become misleading the moment you read a headline number as a promise of what an agent will do for you. A few you should know, and why each comes with an asterisk:

  • SWE-bench Verified — 500 human-validated GitHub issues; the flagship coding-agent benchmark. Frontier models post 80-94%, but OpenAI stopped reporting Verified scores after finding training-data contamination across frontier models (the test answers had effectively leaked into training). The contamination-controlled SWE-bench Pro tells a soberer story: Claude Opus 4.5 scores ~46% on Pro versus ~81% on Verified.
  • GAIA — 466 general-assistant questions (reasoning, web, tool use) across 3 difficulty levels. Top score as of Sept 2025 is 74.55% (Claude Sonnet 4.5 via HAL's Generalist Agent); Level 3 stays brutal at ~65%.
  • τ-bench (tau-bench) — tool-agent-user interaction in retail and airline domains, with simulated users and policy rules to obey. It introduced pass^k. Leading models hit pass@1 of ~60-75% on retail tasks, but consistency is the open problem — pass^8 collapses below 25% for many top models.
  • WebArena — browser agents in real web environments. Top scores as of 2026 sit around ~65-68% versus a ~78% human baseline (up from ~14% at launch), driven largely by computer-use agents operating at the pixel level.

Two cautions before you trust any leaderboard. First, scores are often not reproducible without the exact scaffolding, and there is growing evidence that agents can exploit a benchmark's structure to inflate scores without genuinely solving the tasks (a form of reward hacking). Second, agent evals are expensive: running a capable scaffold on a full benchmark can cost thousands of dollars per run, and a statistically reliable multi-run protocol can reach hundreds of thousands of dollars. Use public benchmarks to compare models, but make your private, domain-specific eval set the thing you actually ship against.

Offline suites and online monitoring

Evaluation lives in two places, and you need both. Think of offline as the dress rehearsal and online as opening night with a real audience.

Offline — your eval set runs in CI before anything ships, exactly like a unit-test suite. Wire it so that a drop in score blocks the merge (Braintrust does this natively via GitHub Actions; LangSmith and DeepEval cover the same loop). This catches regressions before users do and turns the vague question "did this change help?" into a number you can gate on.

Online — once deployed, real traffic becomes the ultimate test set, full of inputs you never imagined. Run A/B tests between agent versions, sample live traces, and apply lightweight LLM-judge checks to production runs to catch drift the offline set never anticipated. Then feed every interesting live failure back into the offline set — that is how the loop closes.
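
One way to wire the feedback loop is to sample a fixed fraction of live traces, run a cheap check on each, and convert failures into candidate eval cases. This is a sketch under assumptions: the trace dictionary keys (`id`, `input`), the `check` callable, and the 10% default rate are all hypothetical, and the check could be a code grader or a lightweight LLM judge.

```python
import hashlib


def sampled(trace_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of traces by hashing the id."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < rate


def harvest_failures(traces, check, rate=0.1):
    """Run `check` on sampled traces; return failures as candidate eval cases."""
    new_cases = []
    for trace in traces:
        if not sampled(trace["id"], rate):
            continue
        if not check(trace):
            new_cases.append({
                "id": f"prod-{trace['id']}",
                "input": trace["input"],
                "source": "production",
            })
    return new_cases


live = [
    {"id": "t1", "input": "refund order 123", "ok": True},
    {"id": "t2", "input": "cancel and rebook", "ok": False},
]
# rate=1.0 samples everything, which keeps this example deterministic.
candidates = harvest_failures(live, check=lambda t: t["ok"], rate=1.0)
```

Hashing the trace id instead of calling a random generator means the same trace is always either in or out of the sample, which makes reruns and debugging reproducible. Candidates still need a human pass and a reference solution before they join the offline set.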

python
# Offline harness sketch: score every case, gate on a threshold.
def run_suite(agent, cases, graders, min_pass_rate=0.85):
    if not cases:
        raise ValueError("eval set is empty")     # avoid dividing by zero below
    results = []
    for case in cases:
        trace = agent.run(case["input"])          # full trajectory
        scores = {name: g(trace, case) for name, g in graders.items()}
        results.append({"id": case["id"], "scores": scores})
    pass_rate = sum(r["scores"]["outcome"]["passed"] for r in results) / len(results)
    # Raise explicitly: a bare assert is skipped when Python runs with -O.
    if pass_rate < min_pass_rate:
        raise AssertionError(f"Regression: {pass_rate:.0%} below {min_pass_rate:.0%} gate")
    return results

The offline suite gives you a fast, repeatable signal; online eval keeps you honest against the messy reality the suite can never fully capture. Offline tells you what you already know to test for; online tells you what you forgot.

Try it: Build a 10-case eval set and gate on it

Take any small agent you've built (or a toy tool-calling loop).

  1. Write 10 eval cases as JSON: 7 where the agent should act and succeed, 3 negatives where it should decline or ask for clarification. Pull at least 3 from a real failure you've actually seen.
  2. Implement two graders: a code-based outcome grader that checks final state (return value, file, or mock DB) and a one-line LLM-as-judge rubric grader for one open-ended case.
  3. Run each case 5 times and report both pass@5 and pass^5; notice how much lower pass^5 is.
  4. For your LLM-judge case, run the comparison twice with swapped order and record whether the verdict flips.
  5. Wrap it in a run_suite() that asserts an 85% outcome pass rate so a regression would fail CI.

Deliverable: the cases file, the two graders, and a short note on what the pass@5-vs-pass^5 gap told you about your agent's reliability.

Key takeaways

  1. Evaluation, not model choice, is the bottleneck for shipping agents — adopt eval-driven development and build the suite before you build the agent.
  2. Measure all three dimensions: outcome (verify final state, not the transcript), trajectory (was the path sound?), and efficiency (tokens, latency, steps).
  3. Combine code-based, LLM-as-judge, and human graders; LLM judges carry systematic length, position, and self-model bias that you must actively detect and mitigate.
  4. Use pass^k for reliability, not just pass@k for capability — strong single-pass agents often fall below 25% when every trial must succeed.
  5. Treat public benchmark scores skeptically (contamination, reward hacking, cost) and ship against a private, growing eval set seeded from real failures.

Quiz

Lock in what you learned

Check your understanding

1. An agent's transcript says "I've successfully booked your flight," and your outcome grader marks the task passed. What is the most likely flaw?

2. Which metric best captures an agent's RELIABILITY for production?

3. You run a pairwise LLM-judge comparison, then run it again with the two answers swapped, and the verdict flips. What does this indicate?

4. Why should you be skeptical of an 81% SWE-bench Verified score when deciding what a coding agent can really do?
