Verification-Driven Development

Give Claude something to check itself against

Intermediate · 12 min · Builder
What you'll be able to do
  • Explain why Claude stops at 'looks done' and how a missing verifier turns you into the verification loop
  • Provide a concrete verifier in your prompt — test command, build, screenshot, or diff — so Claude iterates to green on its own
  • Use /goal <condition> to keep Claude working across turns until a condition holds, and clear it when done
  • Apply /run and /verify to confirm a change works in the real app, and /code-review, /simplify, and /security-review on the diff
  • Configure a Stop hook for guaranteed end-of-turn verification in unattended and CI runs
At a glance

The single highest-leverage habit with Claude Code is giving it a way to check its own work. Without a verifier — a test command, a build, a screenshot, a diff — Claude stops when the work merely 'looks done' and you become the verification loop. This lesson shows how to wire verification signals into prompts, and how /goal and Stop hooks close the loop for unattended runs.

  1. Why 'looks done' is the trap
  2. Put the verifier in the prompt
  3. /goal: keep working until the condition holds
  4. /run and /verify: confirm it works, not just that tests pass
  5. Review the diff: /code-review, /simplify, /security-review
  6. Stop hooks: guaranteed verification at end of turn

Why 'looks done' is the trap

Claude Code runs an agentic loop with three blended phases: gather context, take action, then verify results. The catch is the last phase: Claude can only verify against something checkable. If you give it no check, it falls back to its own judgment of whether the work looks done — and 'looks done' is a vibe, not a fact.

This is the most important sentence in the lesson: without a verifier, you become the verification loop. Claude writes the code, declares victory, and hands you something that compiles in its head. You run it, find the bug, paste the error back, and Claude fixes it. Round and round. You are doing the job a failing test could do automatically.

The fix is to flip the relationship. Instead of you checking Claude's work, give Claude something it can check its own work against — a command that exits non-zero, a build that fails, a screenshot that looks wrong. Then Claude's verify phase has teeth: it runs the check, sees red, and iterates until green, all in one turn, before it ever returns to you.

Key insight

The one-line mental model

A prompt without a verifier asks Claude to guess when it's done. A prompt with a verifier tells Claude when it's done. The second one is the only one that scales to unattended work.
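To make "runs the check, sees red, iterates until green" concrete, here is a minimal sketch of that loop in Python. It is an illustration of the idea only, not how Claude Code is implemented; `run_check`, `iterate_until_green`, and the `fix` callback are hypothetical names introduced for this example.

```python
import subprocess
from typing import Callable, Sequence


def run_check(cmd: Sequence[str]) -> tuple[bool, str]:
    """Run a verifier command. Green means exit code 0."""
    proc = subprocess.run(list(cmd), capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr


def iterate_until_green(
    cmd: Sequence[str],
    fix: Callable[[str], None],
    max_fixes: int = 5,
) -> bool:
    """Re-run the check after each fix attempt until it passes or the budget runs out."""
    ok, output = run_check(cmd)
    fixes = 0
    while not ok and fixes < max_fixes:
        fix(output)  # hypothetical: the step that edits code in response to the failure
        fixes += 1
        ok, output = run_check(cmd)
    return ok
```

The point of the sketch is the stopping rule: the loop ends on the command's exit code, not on anyone's impression that the work is finished.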

Put the verifier in the prompt

The habit is simple: every non-trivial task should name the signal Claude uses to know it succeeded. You don't need new tooling — you need to say the check out loud in the prompt. Four kinds of verifier cover almost everything:

| Verifier | What it proves | How to hand it to Claude |
|---|---|---|
| Test command | Behavior is correct | "Make pytest tests/test_auth.py pass" |
| Build / typecheck | Code compiles and types check | "Run npm run build and tsc --noEmit; fix until both succeed" |
| Screenshot | The UI actually renders right | Paste a target mockup, then "match this; use /verify to compare" |
| Diff | The change is scoped and clean | "Keep the diff under 40 lines; run git diff and remove anything unrelated" |

The difference in outcome is stark. Compare these two prompts:

text
# Weak — no verifier, Claude guesses when done
Add input validation to the signup form.

# Strong — verifier named, Claude iterates to green
Add input validation to the signup form.
The tests in tests/signup.test.ts encode the rules.
Run `npm test -- signup` and keep going until it passes.

With the strong prompt, Claude doesn't stop at 'looks done' — it runs the suite, reads the failures, and fixes them in a loop without your involvement. A passing test suite is something it cannot rationalize away.
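The diff verifier can be made just as mechanical as a test command. The sketch below assumes you want to enforce the "under 40 lines" budget by parsing `git diff --numstat` output; the function names and the default budget are illustrative choices, not part of Claude Code.

```python
import subprocess


def count_changed_lines(numstat: str) -> int:
    """Sum added + deleted lines from `git diff --numstat` output."""
    total = 0
    for line in numstat.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        added, deleted = parts[0], parts[1]
        # Binary files are reported as "-" in both columns; count them as 0.
        total += int(added) if added.isdigit() else 0
        total += int(deleted) if deleted.isdigit() else 0
    return total


def diff_within_budget(numstat: str, budget: int = 40) -> bool:
    """True when the change touches strictly fewer than `budget` lines."""
    return count_changed_lines(numstat) < budget


def working_tree_numstat() -> str:
    """Read the numstat for uncommitted changes (requires git and a repository)."""
    result = subprocess.run(
        ["git", "diff", "--numstat"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

A prompt could then name the script as the check, for example "keep going until `diff_within_budget(working_tree_numstat())` is true", instead of leaving "small diff" to judgment.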

Tip

No test yet? Ask for the test first

If a behavior has no check, have Claude write the test (or a tiny repro script) before the fix, confirm it fails, then make it pass. Now the rest of the work is verification-driven for free.
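As a sketch of what "test first" can look like, the example below encodes a made-up email rule as two small tests. The function `is_valid_email` and the rule itself are hypothetical; in a real project the tests would live in their own file and be written while the function is still missing or wrong, so the first run is red.

```python
import re


def is_valid_email(value: str) -> bool:
    """Hypothetical rule: exactly one "@", no whitespace, and a dotted domain."""
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", value) is not None


def test_rejects_malformed_addresses() -> None:
    # Written first: against a stub that accepts everything, this test fails.
    for bad in ["", "no-at-sign", "a@b", "two@@example.com", "space in@example.com"]:
        assert not is_valid_email(bad), bad


def test_accepts_ordinary_address() -> None:
    assert is_valid_email("ada@example.com")
```

Functions named `test_*` are collected by pytest, so "make pytest pass" becomes the verifier for the rest of the task.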

Watch out

Shell-mode output is not a verifier

Running a command with the ! prefix adds its output to context but does not trigger Claude's verify phase — there's no interpretation. If you want Claude to react to a result, ask it explicitly ("run the tests and fix any failures"), don't just paste output.

/goal — keep working until the condition holds

A verifier in the prompt closes the loop within a turn. But some conditions span many turns — "the whole suite is green," "the build is clean across all packages." That's what /goal is for: you state a condition, and Claude keeps working across turns until it's met, instead of returning to you after each step.

text
/goal all tests in `npm test` pass and `npm run build` succeeds

Claude now treats that condition as the bar for done. It edits, runs the suite, sees failures, fixes them, re-runs — turn after turn — and only truly settles when the condition holds. When you're finished, remove the goal:

text
/goal clear

| Form | Effect |
|---|---|
| /goal <condition> | Set the goal; Claude works across turns until it's met |
| /goal (no argument) | Show the current or most recently achieved goal |
| /goal clear | Remove an active goal early (also stop, off, reset, none, cancel) |

The condition should be checkable, not aspirational. /goal the code is high quality gives Claude nothing to test against — you're back to 'looks done.' /goal pytest exits 0 and ruff reports no errors gives it a target it can verify on every turn.
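One way to test whether a goal is checkable is to ask whether it could be written as a script. The sketch below does that for the pytest and ruff example; `unmet_conditions` and `GOAL_CHECKS` are hypothetical names, and the commands assume both tools are installed on your PATH.

```python
import subprocess
from typing import Mapping, Sequence


def unmet_conditions(checks: Mapping[str, Sequence[str]]) -> list[str]:
    """Return the names of conditions whose command did not exit 0."""
    failed = []
    for name, cmd in checks.items():
        try:
            result = subprocess.run(list(cmd), capture_output=True, text=True)
            passed = result.returncode == 0
        except OSError:
            passed = False  # a missing tool means the condition cannot be confirmed
        if not passed:
            failed.append(name)
    return failed


# Illustrative mapping for "/goal pytest exits 0 and ruff reports no errors".
GOAL_CHECKS = {
    "pytest exits 0": ["pytest", "-q"],
    "ruff reports no errors": ["ruff", "check", "."],
}
```

The goal holds when `unmet_conditions(GOAL_CHECKS)` returns an empty list. "The code is high quality" has no such mapping, which is the sign that it is aspirational.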

Example

A good goal vs. a useless one

Useless: /goal make the app better. Good: /goal every page in the e2e suite (npm run e2e) passes and there are no TypeScript errors. The second one Claude can grind on autonomously because it can measure it.

/run and /verify — confirm it works, not just that tests pass

Tests prove logic; they don't prove the app runs. A change can pass every unit test and still crash on launch, render a blank screen, or wire up the wrong handler. That gap is what /run and /verify close (Claude Code v2.1.145+).

  • /run launches and drives your project's app so you (and Claude) see the change working in the running app, not just in tests.
  • /verify goes further: it builds the app, runs it, and observes the result to confirm a change does what it should — explicitly not relying on tests or type checks.

Both need a small per-project setup so Claude knows how to build and launch your app. You generate that once with /run-skill-generator, which writes a per-project skill teaching /run and /verify how to start and drive the app from a clean environment.

text
# After making a UI change:
/verify   # builds, launches, drives the app, reports whether the change actually works

This turns a screenshot or a running app into a first-class verifier — the same loop-closing idea as a test command, but for behavior that only shows up at runtime.
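To illustrate the kind of check that only exists at runtime, here is a minimal smoke-check sketch: start the app, request a page, and fail if it does not respond or renders nothing. This is not how /verify works internally; `DemoApp` is a stand-in for your real application, and every name here is an assumption made for the example.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoApp(BaseHTTPRequestHandler):
    """Stand-in for a real app; a real project would launch its own server."""

    body = b"<h1>Signup</h1>"

    def do_GET(self) -> None:
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(self.body)))
        self.end_headers()
        self.wfile.write(self.body)

    def log_message(self, format: str, *args) -> None:
        pass  # keep the check's output quiet


def page_renders(url: str, timeout: float = 5.0) -> bool:
    """True when the URL answers 200 with a non-blank body."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))  # no proxies
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.status == 200 and bool(resp.read().strip())
    except OSError:
        return False


def smoke_check(handler: type) -> bool:
    """Start the app on a free local port, probe it once, then shut it down."""
    server = HTTPServer(("127.0.0.1", 0), handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        return page_renders(f"http://127.0.0.1:{server.server_address[1]}/")
    finally:
        server.shutdown()
        server.server_close()
```

A unit test of the handler's logic could pass while `smoke_check` fails, which is the gap between "tests pass" and "the app runs".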

Note

Why runtime verification matters

A 'looks done' UI fix that passes tests but renders nothing is exactly the failure tests miss. /verify catches it because it judges the running app, which is the only ground truth that matters for user-facing behavior.

Review the diff: /code-review, /simplify, /security-review

Once the work is green, a second independent pass over the diff catches what a passing suite won't — subtle bugs, needless complexity, and security holes. These bundled skills all operate on git diff, so they read only what changed.

| Command | What it checks | Key options |
|---|---|---|
| /code-review [level] [--fix] [--comment] [target] | Correctness bugs and reuse/simplification/efficiency cleanups | level = low/medium/high/xhigh/max/ultra; --fix applies findings to your working tree; --comment posts inline GitHub PR comments; ultra runs a deep cloud review |
| /simplify [target] (v2.1.154+) | Cleanup only: reuse, simplification, efficiency, abstraction (4 parallel agents); does not hunt for bugs | Pass a path or PR reference to target it |
| /security-review | Injection, auth issues, exposed credentials, data exposure on the branch diff | |

Higher effort levels trade speed for coverage: low/medium surface fewer, high-confidence findings; high through max go broader and may include uncertain ones; ultra runs a multi-agent review in the cloud.

text
/code-review high --fix        # broad review, then apply the fixes to your tree
/code-review medium --comment  # post findings as inline comments on the PR
/simplify                      # cleanup-only pass; use /code-review for bugs
/security-review               # scan the diff for vulnerabilities before you ship

Think of these as verifiers for quality, complementing the verifiers for correctness you wired into the prompt. On versions before v2.1.154, /simplify is equivalent to /code-review --fix.

Tip

Separate the writer from the reviewer

Running /code-review gives you an independent pass over the change — a different lens than the session that wrote it. For higher stakes, review in a fresh session so the reviewer carries no bias from how the code was built.

Stop hooks — guaranteed verification at end of turn

Everything so far depends on you asking for the check, or on Claude choosing to run it. For unattended runs — background agents, CI, claude -p scripts — you want verification that fires no matter what. That's a Stop hook.

Hooks are shell commands wired to lifecycle points (PreToolUse, PostToolUse, Stop, SessionStart, and more) in .claude/settings.json. Unlike CLAUDE.md instructions, which are advisory, hooks are deterministic: the harness runs them, not Claude at its discretion. A Stop hook runs when Claude is about to end its turn. If the hook exits with code 2, the stop is blocked, the hook's stderr is passed back to Claude, and Claude keeps working. Other non-zero exit codes are treated as non-blocking errors, so the command below maps any failure to exit code 2 and sends the check output to stderr.

json
{
  "hooks": {
    "Stop": [
      {
        "matcher": "",
        "hooks": [
          { "type": "command", "command": "(npm test && npm run build) >&2 || exit 2" }
        ]
      }
    ]
  }
}

Now Claude cannot end the turn while tests fail or the build breaks — the loop is closed by the harness, not by trust. This is the bedrock under autonomous workflows: pair a Stop hook with /goal (or a headless claude -p run) and the agent grinds to a verified state without supervision.
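When the check outgrows a one-line command, the hook can be a small script. The sketch below rests on assumptions to confirm against the hooks reference for your version: that the Stop hook receives a JSON payload on stdin with a `stop_hook_active` field, and that exit code 2 blocks the stop and passes stderr back to Claude. The names `CHECKS`, `decide`, and `main` are hypothetical.

```python
import json
import subprocess
import sys
from typing import Callable

# Commands that must all exit 0 before the turn may end (taken from the lesson).
CHECKS = [["npm", "test"], ["npm", "run", "build"]]


def decide(
    payload: dict,
    run: Callable = subprocess.run,
    allow_stop_on_retry: bool = True,
) -> tuple[int, str]:
    """Return (exit_code, message). Exit code 2 is assumed to block the stop."""
    # Assumption: stop_hook_active is true when Claude is already continuing
    # because an earlier Stop hook blocked. Letting that second stop through
    # prevents an endless loop, at the cost of allowing only one forced retry.
    if allow_stop_on_retry and payload.get("stop_hook_active"):
        return 0, ""
    for cmd in CHECKS:
        result = run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            return 2, f"{' '.join(cmd)} failed:\n{result.stdout}{result.stderr}"
    return 0, ""


def main() -> int:
    """Entry point: read the hook payload from stdin and report via stderr."""
    code, message = decide(json.load(sys.stdin))
    if message:
        print(message, file=sys.stderr)
    return code

# When saved as a hook script, end the file with: raise SystemExit(main())
```

Passing `allow_stop_on_retry=False` keeps blocking for as long as the checks are red; in that case, bound the run some other way, such as a turn limit.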

Watch out

Hooks guarantee, CLAUDE.md only suggests

Putting "always run the tests before finishing" in CLAUDE.md is advisory — Claude may skip it, and compaction can drop it. If end-of-turn verification must happen every time, it belongs in a Stop hook, not in prose.

Tip

Scope unattended runs defensively

When running headless, combine the Stop hook with guardrails: claude -p --allowedTools "Edit,Bash(npm test*)" --max-turns 10 --max-budget-usd 5.00. The hook enforces the finish line; the flags bound the blast radius.

Try it: Wire a verifier into a real change

Pick a small project with a test command (or scaffold one).

Step 1 (feel the trap): ask Claude to add a feature with a vague prompt (e.g. "add email validation to the signup form") and note how it stops at 'looks done.'

Step 2 (add a verifier): write, or have Claude write, one failing test that encodes the rule, confirm it fails, then re-prompt: "Make npm test -- signup pass; keep going until green." Watch Claude iterate to green inside a single turn.

Step 3 (set a goal): run /goal `npm test` exits 0 and `npm run build` succeeds, make a slightly bigger change, and observe Claude working across turns until the condition holds; then run /goal clear.

Step 4 (guarantee it): add a Stop hook to .claude/settings.json that runs npm test && npm run build, then start a headless run with claude -p --max-turns 8 "refactor the validation helper" and confirm the turn cannot end while the check is red.

Finally, write three sentences comparing how much you had to verify yourself in Step 1 versus Step 4.

Key takeaways

  1. Claude stops when work 'looks done'; without a check you become the verification loop. Always name a verifier in the prompt.
  2. The four everyday verifiers are a test command, a successful build/typecheck, a screenshot, and a clean diff. State the check explicitly.
  3. /goal <condition> makes Claude work across turns until a checkable condition holds; remove it with /goal clear.
  4. /run and /verify (v2.1.145+) launch and drive the real app to confirm a change works at runtime, not just that tests pass.
  5. /code-review (--fix, --comment), /simplify, and /security-review run independent verification passes over the git diff.
  6. Stop hooks in .claude/settings.json give guaranteed end-of-turn verification for unattended and CI runs: they are deterministic, unlike advisory CLAUDE.md instructions.

Quiz

Lock in what you learned

Check your understanding


1.You ask Claude to 'add caching to the API endpoint' with no further detail, and it reports the work is done — but the endpoint is still slow. What is the root cause in verification terms?

2.Which /goal is well-formed for autonomous, multi-turn work?

3.A UI change passes every unit test but renders a blank screen on launch. Which tool is designed to catch this, and why?

4.For an unattended background run, why is a Stop hook more reliable than putting 'always run the tests before finishing' in CLAUDE.md?
