Roles, Debate & Collaboration

Getting better answers from disagreement

Advanced · 13 min · Builder · Researcher

What you'll be able to do

Assign clear specialized roles (planner, researcher, critic, executor) and explain why separation of concerns beats one overloaded agent
Run a multi-agent debate correctly — and cite the evidence on when it helps versus when it collapses into sycophantic consensus
Distinguish voting/self-consistency from true debate and pick the cheaper option when it suffices
Implement a generator–critic loop, the most cost-efficient collaboration pattern
Reason about diminishing returns: optimal agent counts, round limits, and the quadratic context cost of collaboration

At a glance

A team of agents only beats a single strong agent when you give them something a solo model lacks: different jobs, structured disagreement, or independent votes. This lesson covers the four collaboration patterns that actually work — role specialization, multi-agent debate, voting/self-consistency, and generator–critic loops — and is honest about where each one earns its cost and where it quietly makes things worse.

1. Why disagreement produces better answers
2. Role specialization: planner, researcher, critic, executor
3. Multi-agent debate: structured disagreement
4. Voting, self-consistency, and ensembling
5. Generator–critic and reviewer loops
6. Costs and diminishing returns

Why disagreement produces better answers

A single agent doing everything is like one person who is simultaneously the author, the fact-checker, and the editor of a report. They will miss their own mistakes, because the same reasoning that produced the error also evaluates it. The fix is structural, not just smarter prompting: split the work across agents that have different jobs, different information, or different vantage points, and let them check each other.

This is the unifying idea behind every pattern in this lesson. Collaboration helps for two distinct reasons:

Specialization — a focused agent with a tailored prompt, its own tools, and a single objective carries less context and makes fewer errors than a generalist juggling everything at once.
Independent perspectives — when several agents reason separately and then reconcile, idiosyncratic errors tend to cancel while correct reasoning reinforces.
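The "errors cancel" claim can be made concrete with a little arithmetic. The sketch below is an idealized model, not a measurement: it assumes each voter is right with the same probability p, that answers are simply right or wrong, and that voters are fully independent (real LLM samples are correlated, so actual gains are smaller). The function name is illustrative.

```python
from math import comb


def majority_correct_prob(p: float, n: int) -> float:
    """Chance that a strict majority of n independent voters is correct,
    when each voter is right with probability p (n assumed odd)."""
    need = n // 2 + 1  # smallest number of correct votes that forms a majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


for n in (1, 3, 5):
    print(n, round(majority_correct_prob(0.7, n), 3))
# 1 0.7
# 3 0.784
# 5 0.837
```

Under these assumptions, a 70%-accurate voter becomes an 84%-accurate panel of five. The gain shrinks as the voters' errors become correlated, which is why diversity matters throughout this lesson.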

Anthropic's production multi-agent research system (June 2025) is the headline proof: a Claude Opus 4 lead researcher coordinating Claude Sonnet 4 subagents outperformed single-agent Opus 4 by 90.2% on Anthropic's internal research eval. The same write-up reports that multi-agent systems use roughly 15× more tokens than a chat interaction. Collaboration is a lever, not a free upgrade; the rest of the lesson is about pulling it deliberately.

Key insight

The core principle

Separating concerns across specialized roles is the most reliably effective multi-agent pattern. Debate, voting, and critique are powerful but conditional — they help only with the right configuration. Specialization helps almost always.

Role specialization: planner, researcher, critic, executor

The most dependable way to use multiple agents is to give each one a role: a tailored system prompt, a scoped tool set, and one clear objective. A common four-role team:

Planner — decomposes the goal into ordered subtasks; holds the strategy, not the details.
Researcher — gathers facts via search/retrieval tools; returns evidence, not opinions.
Critic — stress-tests proposals, finds gaps and errors; has no incentive to defend the work.
Executor — carries out concrete steps (write code, call APIs, draft the output).

Crucially, role specialization is a prompting technique, not a model-selection one. The same base model is instantiated several times with different system prompts and tools — you do not need different underlying models. Each role gets less context and a narrower job, which reduces cognitive load and per-call token cost.
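To make "one model, several roles" concrete, here is a minimal framework-free sketch. Everything in it is an assumption for illustration: the `Role` dataclass, the `ModelFn` signature, and `fake_model`, which stands in for a real LLM call so the example runs offline. Tool wiring is deliberately left out; the `tools` field only records which tools a role would be allowed to use.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical signature: (system_prompt, user_message) -> reply text.
ModelFn = Callable[[str, str], str]


@dataclass
class Role:
    name: str
    system_prompt: str
    tools: list[str] = field(default_factory=list)  # names of tools this role may use

    def run(self, model: ModelFn, message: str) -> str:
        # Every role calls the same underlying model; only the prompt differs.
        return model(self.system_prompt, message)


def fake_model(system_prompt: str, message: str) -> str:
    """Stand-in for a real LLM call: echoes the first sentence of the prompt."""
    return f"[{system_prompt.split('.')[0]}] {message}"


planner = Role("planner", "You are a planner. Output an ordered task list only.")
critic = Role("critic", "You are a critic. List flaws; do not rewrite.", tools=["search"])

print(planner.run(fake_model, "Ship the billing report"))
print(critic.run(fake_model, "Draft v1"))
```

Swapping `fake_model` for a real client call is the only change needed to try this against an actual model; the roles themselves are just data.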

In frameworks this is first-class: CrewAI gives every agent a role, goal, and backstory; LangGraph models roles as nodes; the OpenAI Agents SDK uses handoffs between role-defined agents.

python

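from crewai import Agent
from crewai_tools import SerperDevTool  # any CrewAI-compatible search tool works here

# Assumption: SerperDevTool reads its API key (SERPER_API_KEY) from the environment.
search_tool = SerperDevTool()

# The planner only plans: no tools, a narrow goal.
planner = Agent(
    role="Planner",
    goal="Break the objective into an ordered, verifiable task list",
    backstory="A meticulous strategist who never writes code, only plans.",
    tools=[],  # planning needs no external tools
)

# The critic gets a search tool so it can check claims instead of guessing.
critic = Agent(
    role="Critic",
    goal="Find flaws, missing cases, and unsupported claims in a draft",
    backstory="A skeptical reviewer rewarded only for catching real problems.",
    tools=[search_tool],
)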

Watch out

The expert-persona benefit is mixed

Assigning a vivid role ("Senior Security Researcher") increases useful behavioral divergence and can help on alignment-style tasks. But studies show near-zero average benefit on many specialized tasks, and personas that include social or demographic details can degrade zero-shot reasoning. Use roles to shape behavior and tools, not as a magic accuracy boost.

Multi-agent debate: structured disagreement

Think of a panel of experts arguing toward a verdict: each states a position, hears the others, and refines. Multi-agent debate works the same way: N agents independently propose an answer, read each other's responses, and iteratively revise across several rounds until they converge. The foundational result, Du et al. (2023), Improving Factuality and Reasoning through Multiagent Debate, showed this improves mathematical reasoning (on GSM8K, a grade-school math word-problem benchmark) and reduces hallucination versus a single pass. ChatEval (ICLR 2024) extended it to evaluation: diverse critic roles judging a response agreed with human ratings better than a single GPT-4 judge, with roughly 10% higher Kendall-τ (a rank-correlation score, where higher means closer to human judgment) on the TopicalChat benchmark.

The catch, established by Smit et al. (Should we be going MAD?, ICML 2024), is that debate does not reliably beat simpler methods like self-consistency unless it is carefully tuned. The two failure modes to know:

Sycophancy / consensus collapse — LLMs are agreeable by default, so a debate can converge on a wrong answer before the correct one surfaces, scoring below a single agent. A confident majority can bully a correct minority into conforming ("tyranny of the majority").
Degeneration of thought — agents become overconfident and stop genuinely considering new perspectives, while "problem drift" decays the discussion over long runs.

Diversity is the antidote. ChatEval's gains came from heterogeneous roles (one judging factual accuracy, one style, one relevance); homogeneous agents lose most of the benefit.
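The debate protocol described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure from any of the cited papers: the `Agent` signature, the stop-on-consensus rule, and the majority-vote fallback are all choices made here, and the stub agents replace real model calls.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical agent signature: (question, peer_answers) -> this agent's answer.
Agent = Callable[[str, Sequence[str]], str]


def debate(question: str, agents: Sequence[Agent], rounds: int = 3) -> str:
    # First round: every agent answers independently, seeing no peers.
    answers = [agent(question, []) for agent in agents]
    for _ in range(rounds - 1):
        if len(set(answers)) == 1:
            break  # consensus reached (which may still be a wrong consensus)
        # Each agent revises after reading all the *other* agents' answers.
        answers = [
            agent(question, answers[:i] + answers[i + 1:])
            for i, agent in enumerate(agents)
        ]
    # Fall back to a majority vote if the agents never fully converge.
    return Counter(answers).most_common(1)[0][0]


def stubborn(answer: str) -> Agent:
    """Stub agent that never changes its mind."""
    return lambda question, peers: answer


def follower(initial: str) -> Agent:
    """Stub agent that adopts the most common peer answer once it sees peers."""
    def agent(question: str, peers: Sequence[str]) -> str:
        return Counter(peers).most_common(1)[0][0] if peers else initial
    return agent


# A correct majority pulls the lone wrong agent toward the right answer.
print(debate("What is 6 * 7?", [stubborn("42"), stubborn("42"), follower("41")]))  # 42
# The same mechanism lets a wrong majority override a correct minority.
print(debate("What is 6 * 7?", [stubborn("41"), stubborn("41"), follower("42")]))  # 41
```

The second call is the "tyranny of the majority" failure in miniature: nothing in the protocol itself distinguishes a correct consensus from a confident wrong one.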

Example

Debate that works vs. debate that collapses

Works: 3 agents with distinct personas (optimist, skeptic, domain expert) debate a tricky claim for 3 rounds; the skeptic surfaces a flaw the others missed, and the group revises toward the correct answer.

Collapses: 3 identical agents with the same prompt agree in round 1 to be polite, lock in a plausible-but-wrong answer, and spend two more rounds reassuring each other. That is 9 model calls (3 agents × 3 rounds) for worse accuracy than a single call.

Voting, self-consistency, and ensembling

Sometimes you don't need agents to argue — you just need them to vote. Ask a hard question to five people who can't see each other's answers, then go with the majority: the lone mistakes cancel out and the common answer wins. That is the intuition behind self-consistency, and it is debate's cheaper cousin. Self-consistency (Wang et al., 2022) runs one model multiple times independently, sampling different reasoning chains, then takes the majority answer. There is no inter-agent communication at all — and that is the whole point.

This is the most common confusion in the field, so be precise:

Pattern	Communication?	Cost	Risk
Self-consistency	None (independent samples, then vote)	Lower	Ties; no cross-checking
Multi-agent debate	Agents read and respond across rounds	Higher	Sycophancy, drift

Because samples are independent, self-consistency is immune to sycophancy — no agent can talk another out of a correct answer. Smit et al. found properly tuned self-consistency is often competitive with debate, and it is markedly simpler and cheaper. Ranked-voting variants (2025) further improve over plain majority voting for reasoning tasks.

The practical rule: debate only earns its extra cost when agents bring genuinely different information or tools to the table. If they would all reason over the same context, you usually want self-consistency — sample several times and vote, rather than letting agents negotiate.
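As a rough sketch of "sample several times and vote": the function below takes any zero-argument sampler and returns the majority answer with its vote share. The name `self_consistency`, the sampler signature, and the scripted answers are assumptions for illustration; in practice the sampler would be a model call at non-zero temperature followed by answer extraction.

```python
from collections import Counter
from typing import Callable


def self_consistency(sample: Callable[[], str], n: int = 5) -> tuple[str, float]:
    """Draw n independent answers and return (majority answer, vote share)."""
    votes = Counter(sample() for _ in range(n))
    # On a tie, Counter keeps first-seen order, so the earliest answer wins.
    answer, count = votes.most_common(1)[0]
    return answer, count / n


# Scripted stand-in for five independent model samples.
scripted = iter(["42", "41", "42", "43", "42"])
print(self_consistency(lambda: next(scripted), n=5))  # ('42', 0.6)
```

The vote share doubles as a cheap confidence signal: a 3-of-5 win is a weaker result than 5-of-5, and a low share can be the trigger for escalating to a costlier pattern.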

Tip

Default to the cheaper pattern

If your agents share the same context and tools, reach for self-consistency (independent sampling + majority vote) before multi-agent debate. You keep most of the accuracy benefit, avoid sycophancy entirely, and pay a fraction of the tokens.

Generator–critic and reviewer loops

The generator–critic loop is the most cost-efficient collaboration pattern, because it adds just one extra agent call per iteration while providing structured external feedback. It is directed and asymmetric: agent A produces a draft, agent B critiques it, A revises — repeat until the critic is satisfied or a cap is hit.

This is not the same as debate. Debate is an N-agent symmetric pattern where everyone sees everyone and converges on consensus; generator–critic is a two-agent directed pipeline with distinct producer and reviewer roles. The asymmetry is a feature — the critic has no draft to defend, so it is far less prone to sycophancy.

python

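def generator_critic(task, generate, critique, max_rounds=3):
    """Draft, critique, revise; stop on approval or after max_rounds critiques."""
    draft = generate(task)  # initial draft, no feedback yet
    for _ in range(max_rounds):
        review = critique(task, draft)  # expected shape: {"approved": bool, "feedback": str}
        if review["approved"]:
            return draft
        # Revise using the critic's feedback (generate must accept a feedback keyword).
        draft = generate(task, feedback=review["feedback"])
    return draft  # cap reached: the last revision is returned without a final review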
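To see the loop run end to end, here is a usage sketch with stub roles. The loop is repeated so the example is self-contained; `generate` and `critique` are hypothetical stand-ins (a real critic would be a model call with a skeptical system prompt, not a string check).

```python
def generator_critic(task, generate, critique, max_rounds=3):
    # Same loop as above, repeated so this sketch runs on its own.
    draft = generate(task)
    for _ in range(max_rounds):
        review = critique(task, draft)
        if review["approved"]:
            return draft
        draft = generate(task, feedback=review["feedback"])
    return draft


def generate(task, feedback=None):
    # Stub generator: produces a buggy first draft, then fixes it once given feedback.
    if feedback:
        return "def add(a, b): return a + b"
    return "def add(a, b): return a - b"


def critique(task, draft):
    # Stub critic: a deterministic check standing in for an LLM judgment.
    ok = "a + b" in draft
    return {"approved": ok, "feedback": "" if ok else "add must use +, not -"}


print(generator_critic("write add(a, b)", generate, critique))
# def add(a, b): return a + b
```

Here the critic rejects the first draft, the generator revises once, and the second critique approves, so the loop exits after two of its three allowed rounds.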

Research has extended multi-role loops in various directions — for example, recent self-training frameworks (such as SAGE, 2026) chain four roles like Challenger, Planner, Solver, Critic to evolve a model's weights from self-generated data. Note the distinction: that is a training-time technique, not a runtime inference pattern like the loop above. For runtime use, the two-agent generator–critic version captures most of the collaboration value at the lowest cost and is the pattern to reach for first.

Watch out

Always cap the loop

A generator and critic can ping-pong forever. Always set max_rounds (2–4 is typical) and an explicit approval condition, or the loop becomes an expensive way to never finish.

Costs and diminishing returns

Collaboration scales cost faster than benefit, and there is real research on where the curve flattens.

Agent count and rounds. A 2025 literature review found the sweet spot is 3–4 diverse agents over 2–4 rounds. Beyond that, accuracy plateaus or degrades from problem drift and degeneration of thought. More is not better.
Context explodes quadratically. In a fully connected debate, every agent reads every other agent's output, so tokens scale ~O(N²) per round. With 6 agents, each one reads 5 peer outputs, which is 30 cross-reads per round before any agent writes a reply. Debate systems often incur ~6× the compute of single-agent inference.
Token overhead is pattern-specific. The headline "15× cost" applies to deep-research orchestration like Anthropic's, not to every multi-agent system. A simple generator–critic loop adds only ~1–2× per iteration.
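A back-of-the-envelope estimate shows how quickly the quadratic term dominates. The model below is a simplification with made-up token sizes: it assumes a fully connected debate in which, after the first round, each agent re-reads the prompt plus only the latest reply from every peer (no accumulated history), so real systems that carry full transcripts would cost more.

```python
def debate_tokens(n_agents: int, rounds: int, prompt: int, reply: int) -> int:
    """Rough input + output token count for a fully connected debate."""
    # Round 1: each agent reads the prompt and writes one reply.
    total = n_agents * (prompt + reply)
    # Later rounds: each agent reads the prompt and (n_agents - 1) peer replies,
    # then writes its own reply, i.e. prompt + n_agents * reply tokens.
    per_agent = prompt + n_agents * reply
    total += (rounds - 1) * n_agents * per_agent
    return total


single_pass = 500 + 300  # illustrative sizes: 500-token prompt, 300-token reply
for n in (3, 6):
    cost = debate_tokens(n, rounds=3, prompt=500, reply=300)
    print(n, cost, cost / single_pass)
# 3 10800 13.5
# 6 32400 40.5
```

Doubling the agent count from 3 to 6 triples the bill in this model, which is the practical meaning of the O(N²) term.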

The industry has converged accordingly: in 2026, major vendors (Anthropic, OpenAI, AutoGen/AG2, Cognition, LangChain) default to orchestrator + isolated subagents, with supervisor/orchestrator-worker as the dominant topology. Peer-mesh debate is reserved for specific high-value accuracy scenarios where its cost is justified.

The discipline mirrors the whole course: use the least amount of collaboration that solves the problem. Add a critic before you add a debate; add a vote before you add a mesh.

Tip

A practical escalation ladder

1) Single agent → 2) generator–critic loop → 3) self-consistency vote → 4) role-specialized team with a supervisor → 5) multi-agent debate. Climb only when the previous rung demonstrably fails on your eval set.
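One way to make "climb only when the previous rung fails" mechanical is to score each rung on the same eval set and stop at the first one that meets a target. This is a sketch under assumptions: the `Pattern` signature, exact-match scoring, and the lambda stubs are illustrative, and a real eval would need a task-appropriate grader.

```python
from typing import Callable, Optional, Sequence

Pattern = Callable[[str], str]  # hypothetical: takes a task, returns an answer


def accuracy(pattern: Pattern, eval_set: Sequence[tuple[str, str]]) -> float:
    """Fraction of eval tasks answered with an exact match."""
    hits = sum(pattern(task) == expected for task, expected in eval_set)
    return hits / len(eval_set)


def first_sufficient(
    ladder: Sequence[tuple[str, Pattern]],
    eval_set: Sequence[tuple[str, str]],
    target: float,
) -> Optional[str]:
    """Return the name of the cheapest rung whose accuracy meets the target."""
    for name, pattern in ladder:  # ladder must be ordered cheapest first
        if accuracy(pattern, eval_set) >= target:
            return name
    return None  # no rung was good enough


eval_set = [("2+2", "4"), ("3*3", "9")]
ladder = [
    ("single agent", lambda task: "4"),  # stub: right on one task only
    ("generator-critic", lambda task: {"2+2": "4", "3*3": "9"}[task]),  # stub: right on both
]
print(first_sufficient(ladder, eval_set, target=0.9))  # generator-critic
```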

Try it: Build a generator–critic loop, then add a vote

Pick a task with a checkable answer (e.g. "write a Python function that returns the nth prime" or "summarize this article in 3 bullets, each grounded in the text").

Part 1 — Generator–critic. Implement two roles with the same base model but different system prompts: a generator that produces a draft and a critic that returns {approved: bool, feedback: str}. Loop with max_rounds=3, feeding the critic's feedback back to the generator. Log each round.

Part 2 — Self-consistency. Now run the generator 5 times independently (no critic, higher temperature) and take the majority/best answer by a simple rule. Compare quality and token cost against Part 1.

Reflect (2–3 sentences): Which pattern gave the better answer? Which was cheaper? Did the critic ever just agree to be agreeable — and if so, how would you make it more skeptical? This builds the instinct for choosing the least collaboration that solves the problem.

Key takeaways

1. Role specialization (planner/researcher/critic/executor) is the most reliably effective collaboration pattern, and it is a prompting technique, achievable with one base model instantiated multiple times.
2. Multi-agent debate can improve factuality and reasoning, but only when agents are diverse and tuned; without that it collapses into sycophantic consensus and can score below a single agent.
3. Self-consistency (independent sampling plus a majority vote, with no inter-agent communication) is cheaper, immune to sycophancy, and often competitive with full debate.
4. The generator–critic loop is the most cost-efficient pattern: it adds one extra call per round, its asymmetric reviewer has no draft to defend, and it must always be capped.
5. Collaboration scales cost faster than benefit: 3–4 agents and 2–4 rounds is the sweet spot, and context grows quadratically, so use the least collaboration that solves the problem.

Quiz

Lock in what you learned

Check your understanding


1. Which collaboration pattern is described as the most reliably effective across tasks?

2. What is the key failure mode that can make multi-agent debate score below a single agent?

3. How does self-consistency differ from multi-agent debate?

4. According to 2025 research, what is the approximate sweet spot for a multi-agent debate?

Go deeper

Hand-picked sources to keep learning

Du et al. — Improving Factuality and Reasoning through Multiagent Debate

The foundational multi-agent debate paper; project page with code and results.

Should we be going MAD? Multi-Agent Debate Strategies for LLMs (ICML 2024)

Critical empirical assessment: debate often doesn't beat self-consistency without tuning. Essential counter-evidence.

ChatEval: Multi-Agent Debate for LLM Evaluation (arXiv 2308.07201)

ICLR 2024; diverse critic roles beat a single judge, with concrete Kendall-τ numbers on TopicalChat.

Literature Review of Multi-Agent Debate for Problem-Solving (arXiv 2506.00066)

2025 meta-analysis: optimal agent count (3–4), round limits (2–4), context explosion, and failure modes.

Peacemaker or Troublemaker: How Sycophancy Shapes Multi-Agent Debate (arXiv 2509.23055)

2025 paper formally defining sycophancy in debate and measuring its impact on accuracy.

How we built our multi-agent research system (Anthropic Engineering Blog)

Primary source and production case study: Lead Researcher + Subagents; 90.2% improvement over single-agent Opus 4 on Anthropic's internal research eval, with multi-agent systems using about 15× the tokens of a chat interaction.