Reflection & Self-Correction

Agents that check their own work

Advanced · 13 min · Builder · Researcher

What you'll be able to do

Implement a generator–critic–revise loop and reason about when to stop iterating
Explain the Reflexion architecture and why verbal self-reflection learns without weight updates
Distinguish intrinsic self-correction from grounded (extrinsic) feedback and predict which will work
Use tests, execution results, and verifier models as reliable correction signals
Recognize the failure modes — blind spots, sycophancy, over-reflection — and design around them

At a glance

Reflection lets an agent look at its own output, decide it's wrong, and try again — sometimes a genuine superpower, sometimes a confident march deeper into error. This lesson separates the two: when a model criticizing itself actually helps, and when it just rationalizes its own mistakes. The punchline is that reflection works when it's anchored to ground truth — tests, execution, a real verifier — and is unreliable when the model is only judging itself.

1. What reflection actually is
2. Reflexion: verbal reinforcement
3. The crucial split: intrinsic vs. grounded
4. Verifiers, ORMs, and PRMs
5. How reflection backfires
6. Building a grounded reflection loop

What reflection actually is

Humans rarely ship the first draft. We write, reread, wince, and revise. Reflection gives an agent the same move: after producing an output, it generates a critique of that output, then revises — optionally looping several times. A useful mental model is a small inner committee: a generator that does the work, an evaluator that judges it, and a step that feeds the judgment back in.

Concretely, the simplest reflection loop is three prompts:

Generate — produce a first attempt.
Critique — ask: "What is wrong with this? Be specific."
Revise — produce a new attempt that fixes the named problems.

This is appealing because it costs nothing but inference — no retraining, no extra data — and it sometimes turns a mediocre answer into a good one. But the appeal hides a trap that the rest of this lesson is about: a model critiquing itself is only as good as its ability to see its own mistakes. If the error sits in the model's blind spot, the critique will be confident, fluent, and wrong. Reflection is a powerful tool, but it is not free reliability.
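
The three prompts above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `llm` is a hypothetical callable that takes a prompt string and returns the model's text, and the `NO ISSUES` stop phrase is an assumption made here so the loop has an exit. Note that nothing external checks the critique, so this is the intrinsic loop with all the limits just described.

```python
from typing import Callable


def reflect(task: str, llm: Callable[[str], str], max_rounds: int = 2) -> str:
    """Generate -> critique -> revise, using only the model's own judgment."""
    draft = llm(f"Task:\n{task}\n\nWrite your best answer.")
    for _ in range(max_rounds):
        critique = llm(
            f"Task:\n{task}\n\nAnswer:\n{draft}\n\n"
            "What is wrong with this? Be specific. "
            "If nothing is wrong, reply with exactly: NO ISSUES"
        )
        if critique.strip() == "NO ISSUES":
            break  # the critic found nothing, which is not proof of correctness
        draft = llm(
            f"Task:\n{task}\n\nAnswer:\n{draft}\n\nCritique:\n{critique}\n\n"
            "Rewrite the answer so it fixes the problems named in the critique."
        )
    return draft
```

The loop is bounded by `max_rounds` on purpose: without an outside signal there is no principled way to know that another round would help.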

Reflexion: verbal reinforcement

Here is the core question: how does an agent learn from a failure without retraining? Reflexion (Shinn et al., NeurIPS 2023) gives the canonical answer — instead of updating weights, store the lesson in words and feed it back on the next attempt. Think of it as keeping a diary of mistakes and rereading it before you try again. Three components make it work:

Actor — generates actions or text (often a ReAct agent).
Evaluator — scores the trajectory, as a scalar or free-form feedback.
Self-Reflection memory — an episodic buffer of verbal critiques the agent reads before retrying.

After a failed attempt, the agent writes a short reflection — "I assumed the file was JSON; it was CSV, so parsing failed. Next time, check the format first" — and carries it into the next episode's context. This is verbal reinforcement learning: learning across attempts with zero gradient updates.
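
A rough sketch of that actor, evaluator, and memory cycle is shown here. It is a simplification under stated assumptions: `actor` and `evaluate` are hypothetical callables (the evaluator returns a pass flag plus feedback text), the same `actor` callable is reused to write the reflection, and the prompt wording is invented for illustration rather than taken from the paper.

```python
from typing import Callable, List, Tuple


def reflexion(
    task: str,
    actor: Callable[[str], str],
    evaluate: Callable[[str], Tuple[bool, str]],
    max_trials: int = 3,
) -> Tuple[str, List[str]]:
    """Retry a task, carrying verbal lessons (not weight updates) between trials."""
    memory: List[str] = []  # episodic buffer of natural-language reflections
    attempt = ""
    for _ in range(max_trials):
        lessons = "\n".join(f"- {m}" for m in memory) or "(none yet)"
        attempt = actor(f"Task:\n{task}\n\nLessons from earlier attempts:\n{lessons}")
        passed, feedback = evaluate(attempt)
        if passed:
            break
        # Self-reflection step: turn the evaluator's feedback into a reusable lesson.
        memory.append(actor(
            f"Task:\n{task}\n\nFailed attempt:\n{attempt}\n\n"
            f"Evaluator feedback:\n{feedback}\n\n"
            "In one or two sentences, say what went wrong and what to do differently."
        ))
    return attempt, memory
```

Everything the agent "learned" lives in `memory`; clear that list between trials and each attempt starts from scratch.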

The published gains are real on structured tasks: 91% pass@1 on HumanEval (vs. GPT-4's 80%), 130/134 AlfWorld tasks solved, +20% on HotpotQA. The catch — central to everything below — is that Reflexion requires a reliable evaluator. With a clear pass/fail signal it shines; without one, the reflection step can produce plausible but misleading critiques that steer the agent wrong.

Watch out

Reflexion does NOT fine-tune

A common misread: "Reflexion improves the model by training it." It does not touch weights. All learning lives in an in-context episodic memory of natural-language reflections. Remove that memory between attempts and the benefit vanishes.

The crucial split: intrinsic vs. grounded

Here is the single most important distinction in this lesson. The plain-language version: does the critique come from the model guessing whether it was wrong, or from the world telling it so?

Intrinsic self-correction — the model revises using only its own judgment, no external signal. The same network that made the error is now grading the error.
Grounded (extrinsic) self-correction — the critique is anchored to something the model cannot rationalize away: a unit test that fails, code that throws, a tool that returns an error, a separately trained verifier.

The research verdict (TACL 2024, replicated across multiple 2025 papers) is blunt: pure intrinsic self-correction does not reliably improve reasoning, and can flip correct answers to wrong ones. Generator and evaluator are the same network, so they share the same blind spots: for an error the model couldn't catch the first time, a second look tends to produce confident, wrong feedback.

Grounded feedback is a different animal. When a code agent runs the test suite and sees AssertionError: expected 5, got 4, that is external ground truth. The model can't argue with it. RLEF (ICML 2025) builds on exactly this, using execution errors and unit-test results as reward signals for multi-turn code generation. The design rule follows directly: don't trust the model to grade itself on reasoning; give it a real grader.
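
To make "external ground truth" concrete, here is one way a grounded signal could be produced for Python code: run the candidate together with assert-based tests in a fresh interpreter and capture what comes back. The names `RunResult` and `run_tests` are hypothetical, chosen for this sketch, and a real system would run untrusted code inside a sandbox rather than directly on the host.

```python
import os
import subprocess
import sys
import tempfile
from dataclasses import dataclass


@dataclass
class RunResult:
    passed: bool
    error: str  # stderr from the run; empty when nothing was reported


def run_tests(code: str, tests: str, timeout: float = 10.0) -> RunResult:
    """Run candidate code plus assert-based tests in a fresh Python process."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + tests + "\n")
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path], capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return RunResult(False, f"timed out after {timeout} seconds")
    finally:
        os.unlink(path)  # always clean up the temporary file
    return RunResult(proc.returncode == 0, proc.stderr.strip())


candidate = "def add(a, b):\n    return a + b - 1"  # deliberately buggy
result = run_tests(candidate, "assert add(2, 3) == 5, 'expected 5, got 4'")
print(result.passed)                  # False
print(result.error.splitlines()[-1])  # AssertionError: expected 5, got 4
```

The traceback in `result.error` is exactly the kind of text worth pasting into the critique prompt: it names a concrete failure the model did not get to choose.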

Verifiers, ORMs, and PRMs

Sometimes there is no test to run — a logic puzzle, an essay, a plan. The next best thing is a verifier model: a separate model trained to judge correctness, acting as the grader the task itself can't provide. Two flavors matter:

Outcome Reward Model (ORM) — scores only the final answer. Sparse signal: right or wrong, with no idea where things went off.
Process Reward Model (PRM) — scores each intermediate step. Dense, step-level feedback that catches an error the moment it appears, like a teacher marking each line of working rather than just the final number.

PRMs cost more compute (you evaluate every step) but can pay off: reported results on reasoning benchmarks show process verifiers running more than 8% more accurate than ORMs and 1.5–5x more compute-efficient when used to guide search. OpenAI's Let's Verify Step by Step (2023) popularized PRMs for math; by 2025 they're applied to SQL synthesis, tool use, logical deduction, and software engineering, anywhere steps are identifiable.
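
The difference between the two signals is easy to see on a toy example. In the sketch below, `check_step` and `check_answer` are hand-written stand-ins for trained reward models (an assumption for illustration only), and taking the minimum step score is just one possible way to aggregate; multiplying step scores is another.

```python
from typing import Callable, List, Optional

Solution = List[str]  # reasoning steps; the last step carries the final answer


def orm_score(solution: Solution, score_answer: Callable[[str], float]) -> float:
    """Outcome reward: look only at the final step."""
    return score_answer(solution[-1])


def prm_score(solution: Solution, score_step: Callable[[str], float]) -> float:
    """Process reward: score every step; here the weakest step sets the total."""
    return min(score_step(step) for step in solution)


def first_bad_step(
    solution: Solution, score_step: Callable[[str], float], threshold: float = 0.5
) -> Optional[int]:
    """Index of the first step scoring below threshold, or None if all pass."""
    for i, step in enumerate(solution):
        if score_step(step) < threshold:
            return i
    return None


# Toy stand-ins for trained reward models (assumptions for illustration only).
def check_step(step: str) -> float:
    lhs, rhs = step.split("=")
    a, op, b = lhs.split()
    value = int(a) + int(b) if op == "+" else int(a) - int(b)
    return 1.0 if value == int(rhs) else 0.0


def check_answer(last_step: str) -> float:
    return 1.0 if last_step.split("=")[1].strip() == "5" else 0.0


sound = ["2 + 3 = 5", "5 + 0 = 5"]
lucky = ["2 + 2 = 5", "5 + 0 = 5"]  # wrong first step, right final answer

print(orm_score(sound, check_answer), orm_score(lucky, check_answer))  # 1.0 1.0
print(prm_score(sound, check_step), prm_score(lucky, check_step))      # 1.0 0.0
print(first_bad_step(lucky, check_step))                               # 0
```

The outcome score cannot tell the sound derivation from the lucky one, while the process score both rejects the lucky one and points at the step to fix.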

The deeper point is decorrelation of error modes. A verifier helps most when its mistakes don't overlap with the generator's. Using the same model to generate and verify gives you correlated blind spots and "rubber-stamping," where the critic just validates the generator. Different models, different temperatures, or an execution environment break that correlation — which is the whole game.

Key insight

Why separation beats self-critique

If generator and critic are the same model trained the same way, their errors are correlated — the critic is blind exactly where the generator is. Truly independent correction comes from a different model, a different environment, or a formal check. Independence, not cleverness, is what makes the critic useful.

How reflection backfires

Reflection isn't free — done carelessly it actively makes answers worse. There are three signature failure modes; know them or you'll ship them.

Self-consistency / coherence trap. Across rounds the model builds a plausible but wrong justification, getting more confident as it elaborates. It's not converging on truth; it's converging on a story.
Sycophantic reflection. Asked to critique its own work, the model agrees with itself — "On reflection, my answer looks correct" — instead of genuinely attacking it.
Quality degradation (over-reflection). Too many iterations introduce new errors or hedge a correct answer into mush. More loops is not more quality.
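
One simple guard against over-reflection is to keep the best candidate seen so far and stop as soon as revisions stop improving an external score. The sketch below assumes two hypothetical callables: `revise`, which produces the next draft, and `score`, which should be a grounded measure such as the fraction of tests passed or a verifier's rating, not the model's opinion of itself. The `patience` parameter is likewise an illustrative choice.

```python
from typing import Callable, Tuple


def revise_with_guard(
    draft: str,
    revise: Callable[[str], str],
    score: Callable[[str], float],
    max_rounds: int = 3,
    patience: int = 1,
) -> Tuple[str, float]:
    """Revise up to max_rounds times, but never return worse than the best seen."""
    best, best_score = draft, score(draft)
    current, stale = draft, 0
    for _ in range(max_rounds):
        current = revise(current)
        current_score = score(current)
        if current_score > best_score:
            best, best_score, stale = current, current_score, 0
        else:
            stale += 1
            if stale >= patience:
                break  # revisions stopped helping; more loops risk degrading the answer
    return best, best_score
```

Returning `best` instead of the latest draft means a bad late revision can cost compute but cannot make the final answer worse.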

There's a striking diagnosis here. The Self-Correction Blind Spot study (2025) tested 14 open-source non-reasoning models and found a 64.5% average blind-spot rate: models fix an error when it's shown to them externally but miss the identical error in their own output. The fix was almost comically simple: inserting the word "Wait" right after the flawed output, so that the model's continuation starts with it, cut blind spots by 89.3%. The capability to self-correct exists; it just needs activating. One likely reason: supervised fine-tuned (SFT) models are trained mostly on error-free demonstrations, so correcting a mistake mid-stream is out of distribution. RL-trained reasoning models (DeepSeek-R1, the o-series) learn correction through outcome feedback and self-correct far more reliably.

Building a grounded reflection loop

Now put the principles together into something you'd actually ship: separate the critic, prefer grounded signals, and bound the retries (2–3 attempts is standard; then escalate to a human). Here's a code-revision loop grounded in test execution — the most reliable form of reflection.

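```python
import re

FENCE = "`" * 3  # Markdown code-fence delimiter

def extract_code(reply):
    # Hypothetical helper: take the last fenced block in the reply, else the whole reply.
    blocks = re.findall(FENCE + r"[\w+-]*\n(.*?)" + FENCE, reply, re.DOTALL)
    return (blocks[-1] if blocks else reply).strip()

def reflect_and_fix(task, model, run_tests, max_iters=3):
    code = model.generate(f"Write code for:\n{task}")
    for attempt in range(max_iters):
        result = run_tests(code)          # GROUND TRUTH, not self-opinion
        if result.passed:
            return code, attempt
        if attempt == max_iters - 1:
            break                         # budget spent: skip a rewrite nobody would test
        # The critique is anchored to a real failure signal
        critique = model.generate(
            f"This code failed:\n{code}\n\n"
            f"Test output:\n{result.error}\n\n"
            "Explain the specific bug, then rewrite the code."
        )
        code = extract_code(critique)
    return code, max_iters            # bounded: escalate to a human now
```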

The load-bearing line is run_tests(code): the model is reacting to a real failure, not its own guess. Language Agent Tree Search (LATS, ICML 2024) generalizes this — it combines Monte Carlo Tree Search with Reflexion-style critiques, so failed branches become verbal lessons for the rest of the search (94.4% pass@1 on HumanEval with GPT-4; EM score of 0.61 on HotpotQA, substantially above ReAct's baseline). For high-stakes tasks with no executable check, decorrelate errors instead: use multi-agent debate or a PRM as the external grader.

Try it: Grounded vs. ungrounded reflection

Pick a small coding task with a known answer (e.g., "parse a date string in three formats and return ISO-8601"). Build two reflection loops over the same model. Loop A (intrinsic): generate, then ask the model to critique and revise using only its own judgment, for up to 3 rounds. Loop B (grounded): generate, run a tiny unit-test suite, and feed the actual pass/fail and error message back into the critique, for up to 3 rounds. Run both on 5 task variants and record: final correctness, how often Loop A increased confidence while staying wrong, and how often Loop B fixed a real bug. Write three sentences on what you observed. You should see the grounded loop convert failures into fixes far more reliably than the self-judging one — the core lesson, demonstrated on your own machine.

Key takeaways

1. Reflection = generate, critique, revise, but a model is only as good a critic as its ability to see its own blind spots.
2. Reflexion stores verbal critiques in episodic memory and learns across attempts with no weight updates; it needs a reliable evaluator to work.
3. Pure intrinsic self-correction is unreliable for reasoning and can make correct answers wrong; grounded feedback (tests, execution, verifiers) is what pays off.
4. Separate the generator and critic (different model, environment, or a formal check) to decorrelate error modes and avoid rubber-stamping.
5. Bound retries to 2–3 iterations and escalate; watch for sycophancy, the coherence trap, and over-reflection that degrades good answers.

Quiz

Lock in what you learned

Check your understanding

1. Why is pure intrinsic self-correction (a model critiquing itself with no external signal) unreliable for reasoning tasks?

2. How does Reflexion learn from a failed attempt?

3. What makes grounded (extrinsic) feedback more reliable than self-critique?

4. Which statement about Process Reward Models (PRMs) and verifier design is correct?

Go deeper

Hand-picked sources to keep learning

Reflexion: Language Agents with Verbal Reinforcement Learning (NeurIPS 2023)

The foundational paper. Actor–Evaluator–Self-Reflection loop and the benchmark results cited here.

When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLMs (TACL 2024)

The critical survey showing intrinsic self-correction does not reliably help reasoning. Essential counterweight to the hype.

Self-Correction Bench: Revealing and Addressing the Self-Correction Blind Spot in LLMs

NeurIPS 2025 workshop. 64.5% blind-spot rate across 14 non-reasoning models, and the surprising 'Wait' fix that cuts it by 89.3%.

RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning

ICML 2025 spotlight. Current SOTA direction for grounded self-correction in coding.

Language Agent Tree Search (LATS)

ICML 2024. Combines MCTS with Reflexion-style critiques for multi-path exploration.

Prompt Engineering Guide: Reflexion

Clear, accessible walkthrough of the Reflexion architecture and results.