Multi-Agent Orchestration at Scale/Lesson 6 of 6

Cost, Observability & Orchestration Gotchas

Track the spend, attribute the agents, dodge the traps

Advanced 13 minBuilderDecision-maker

What you'll be able to do

Reason about cost per paradigm — subagents summarize back, teammates each hold a full window (most expensive), workflows keep state in script variables but can spawn hundreds — and profile on a small slice first
Attribute spend with /usage (by skill/subagent/plugin/MCP), /insights (history and trends), live /workflows token panels, and OpenTelemetry agent_id / parent_agent_id spans
Apply the permission model across layers: parent mode wins, workflow subagents always run acceptEdits with the parent allowlist, routines run with no prompts, and protected paths are never auto-approved except under bypass
Diagnose the real failures of workflows (caps, runaway loops, same-session resume, literal meta block), agent teams (experimental, no resume/rewind restore, lagging status), and background/cloud sessions
Treat these features as research preview / experimental and defer to /help and the live docs for the installed version

At a glance

Coordinating MANY agents — workflows, agent teams, background sessions — adds two costs the single-subagent lesson never had to worry about: real money and real operational traps. This lesson gives you a per-paradigm cost model (why teammates are the most expensive thing you can run), the observability stack that attributes that spend (/usage, /insights, /workflows, OpenTelemetry agent_id spans), the permission model across orchestration layers (parent mode wins; you can only relax within it), and a troubleshooting playbook for the experimental-feature failures you will actually hit. The through-line: always profile on a small slice before a large run.

1The cost mental model: count the active context windows
2Observability & attribution: where did the money go?
3Permissions across orchestration layers: parent mode wins
4Workflow gotchas (research preview, v2.1.154+)
5Agent-team gotchas (experimental, OFF by default, v2.1.32+)
6Background & cloud gotchas (supervisor, /goal, provider limits)
7Currency: most of this is preview or experimental

The cost mental model: count the active context windows

A single subagent is essentially free to reason about: it does heavy work in an isolated window and hands back a summary, so your main context — and your bill's growth — stays bounded. The moment you coordinate many agents, that intuition breaks, and the single number that predicts your spend is how many full context windows are alive at once.

Each paradigm sits at a different point on that curve:

Paradigm	What each agent holds	Where intermediate state lives	Scale	Cost shape
Subagents	Isolated window, summarizes back	Returned summary in main context	a few	Cheapest — you pay for the work, keep the summary
Dynamic workflows	Fresh window per spawned agent	Script variables, OUTSIDE your context window	up to 1,000/run, 16 concurrent	Cheaper per-agent context (no chatter in your window) but can spawn hundreds
Agent teams	A full, fresh context window each	Shared mailbox + task list	3–5 recommended	Most expensive per agent — every teammate is a whole session
Background agents	Full window per session, often a worktree	Each session's own context	a handful	Per active session; idle ones cost nothing once stopped

The key contrast is workflows vs. teams. A workflow can spawn hundreds of agents, but their intermediate results live in script variables outside your context window — only the final verified answer lands in your session, so you don't pay context for every agent's chatter. An agent team is the opposite: each teammate is a full Claude Code session with its own complete window, the single most expensive unit of work in the toolkit. Cost scales roughly linearly with the number of active agents, so a 5-teammate team is on the order of 5× a single session.

Because of that linearity, the discipline is non-negotiable: profile on a small slice first. Run the audit on one directory, ask the narrow version of the research question, spin up one teammate before five. Read the real token numbers, then multiply by your intended scale before you commit.

Key insight

Teammates are the expensive thing; workflows hide their state

If you remember one cost fact: an agent-team teammate is a full context window (most expensive per agent), while a workflow keeps intermediate results in script variables that never enter your window. Same goal, very different bill. Anthropic's own C-compiler experiment ran 2,000+ sessions over ~2 weeks for ~$20,000 of API — order-of-magnitude proof that coordinated multi-session work is a real budget line, not a rounding error.

Watch out

Always test on a small slice

Cost scales ~linearly with active agents, so a misjudged large run multiplies the mistake. Before any wide fan-out: one directory, one narrow question, one teammate. Watch the per-phase tokens in /workflows, confirm the answer quality, THEN scale. This single habit is the cheapest insurance in the whole module.

Observability & attribution: where did the money go?

Once many agents are spending in parallel, you need to attribute that spend — which skill, which subagent, which phase. Claude Code gives you four lenses, from live to historical to programmatic.

/usage — the breakdown. Shows cost broken down by skill, subagent, plugin, and MCP server. This is your first stop after a big run: it tells you what spent, not just how much.

/insights — the history and trends. Where /usage is a snapshot, /insights gives you history, per-skill cost, and trends over time — so you can see that a particular skill or workflow has quietly become your biggest line item across many sessions.

/workflows — live per-phase, per-agent tokens. While a workflow runs, drill into a phase (Enter / →) to see agent count, token totals, elapsed time, and per-agent prompts / tool calls / results. This is the panel you watch during your small-slice profiling run.

OpenTelemetry spans — business attribution. For org-scale accounting, Claude Code emits OpenTelemetry spans carrying agent_id and parent_agent_id. That parent/child linkage lets a cost platform (e.g. CloudZero) roll spend up the orchestration tree — attributing a child agent's tokens back to the workflow or team lead that spawned it — so you can answer "which initiative cost what," not just "which process."

text

Live, this run:      /workflows   → per-phase / per-agent tokens, elapsed time
This session:        /usage       → cost by skill / subagent / plugin / MCP
Across sessions:     /insights    → history, per-skill cost, trends
Org / business unit: OTel spans   → agent_id + parent_agent_id → cost platform

Tip

Match the lens to the question

Debugging a single run's spend right now → /workflows. "What's expensive this session?" → /usage. "What's trending up over weeks?" → /insights. "Charge this back to a team / product?" → OpenTelemetry agent_id / parent_agent_id spans into your observability platform. They compose: the OTel parent_agent_id is what makes child-agent cost roll up to the orchestration that caused it.

Permissions across orchestration layers: parent mode wins

Coordinating agents multiplies permission decisions, so the model is built on one rule: the parent's mode takes precedence, and a child can only RELAX within the constraints it inherits — never escalate beyond them. If the parent is in bypassPermissions or auto, that wins; a child cannot tighten its way past a parent that's locked down, and it cannot grant itself powers the parent didn't have.

Each layer then has its own behavior within that frame:

Layer	How it runs	What still prompts
Workflow subagents	Always `acceptEdits` with the parent's tool allowlist	File edits auto-approved; non-allowlisted shell / web / MCP can still prompt mid-run
Agent teammates	Inherit from the lead's mode	Per the inherited mode
Routines (cloud)	No prompts at all	Nothing prompts — scope is set up-front instead
Any layer	Subject to protected-path rule	Protected paths never auto-approve (except bypass)

Workflow subagents are the surprising one: regardless of the parent's permission mode, they always run in acceptEdits so file edits are auto-approved — but they still carry the parent's allowlist, so a shell command, web fetch, or MCP call that isn't on that allowlist can still prompt you mid-run. (That's why a workflow can pause even though it's supposed to run unattended.)

Routines run on the other extreme: no prompts whatsoever. You can't approve things in a scheduled cloud run, so you bound their reach up front instead — by repos, branch-push settings (branches default to a claude/ prefix), environment network access, and connectors.

And cutting across all of it: protected paths are never auto-approved — .claude, .git, .npmrc, .mcp.json (and the like) — except under bypassPermissions. No permission mode short of full bypass will silently let an agent rewrite your .mcp.json or git internals.

Watch out

An 'unattended' workflow can still stop and ask

Workflow subagents auto-approve file edits (always acceptEdits), but a non-allowlisted shell / web / MCP call can still pop a prompt mid-run. If you walk away from a large workflow that needs those tools, it can stall waiting on you. Pre-approve the tools it needs in the parent's allowlist, or expect to babysit the run.

Note

You can relax, not escalate

Inheritance is one-directional: a child can drop to a looser mode only if the parent already permits it. It can never grant itself more than the parent had. So lock down at the top — the lead session / parent mode — and trust that children can't quietly exceed it. Protected paths (.claude, .git, .npmrc, .mcp.json) stay off-limits to auto-approval unless you're in full bypassPermissions.

Workflow gotchas (research preview, v2.1.154+)

Dynamic Workflows is a research preview and needs v2.1.154+ (claude --version). The traps below come straight from how the runtime is built.

The caps are hard: 16 concurrent / 1,000 total. At most 16 agents run at once (bounded by CPU cores) and 1,000 agent calls per run total. The danger is a silent runaway loop: a generation that recurses or fans out unbounded can burn straight through the 1,000-call cap with no error — just a surprising bill and a stalled run. Read the script Claude generated and sanity-check its loops before approving a large run.

Resume is same-session only. Pause/resume works within one Claude Code session (completed agents return cached results; the rest run live). But there's no checkpoint persistence across sessions — exit Claude Code and the workflow restarts fresh next time. Don't rely on closing your laptop and picking up where you left off.

The meta block must be a literal. The mandatory export const meta = {...} must be a literal object — no function calls, no variables, no dynamic construction. Any deviation breaks parsing:

// CORRECT — a literal
export const meta = {
  name: 'audit',
  description: 'Audit a directory for X',
  phases: [{ title: 'Scan' }, { title: 'Verify' }]
};

// BREAKS PARSING — built dynamically
export const meta = buildMeta(process.env.NAME);

No mid-run input. A workflow can't take user input mid-run — only permission prompts can pause it. You can't "answer a question" partway through; design the prompt so the run is self-contained.

Budget overruns — test small. Workflows cost significantly more than single-session work (parallel agents, full context per agent). The mitigation is the same discipline as everywhere else: run it on one directory / a narrow question first, watch per-phase tokens in /workflows, then scale.

API note: different sources name the scripting primitives differently — the official docs describe spawn(definition, prompt, args); third-party writeups describe agent(), parallel(), pipeline(), phase(). Don't memorize an API. Read the script Claude actually generates and lean on /help for your version.

Watch out

Silent runaway loops hit the cap with no error

The 1,000-call cap is a guardrail, not a warning system — an unbounded loop in the generated script can blow through it silently and expensively. Before approving a large run, open the script (the approval dialog's 'View raw script', or /workflows) and confirm the fan-out is bounded by your data, not by an open-ended loop.

Agent-team gotchas (experimental, OFF by default, v2.1.32+)

Agent teams are experimental and disabled by default — you must opt in:

bash

export CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1
# or settings.json: { "env": { "CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS": "1" } }

Because they're experimental, the rough edges are real and worth knowing before you trust a team with real work:

/resume and /rewind do NOT restore in-process teammates. Session restore brings back your lead, not the live teammates — you'll re-create them. Don't treat a team as durable across a restart.
Task status lags. Teammates sometimes finish work but fail to mark the task complete. When the shared task list looks stuck, nudge them manually ("mark your task complete") rather than assuming they're still working.
Shutdown is slow. The lead requests exit and the teammate approves it — this can take a while. Budget time; don't yank the session.
One team at a time per lead, no nested teams, and no lead transfer (the lead is your main session, fixed).
Cleanup MUST run from the lead. Say "clean up the team" to the lead session — that removes the shared task list (~/.claude/tasks/{team-name}/), mailbox, and config consistently. Running cleanup from a teammate won't tidy shared resources.

Cost reminder: every teammate is a full context window, so size to 3–5 teammates with ~5–6 tasks each and don't leave teams running unattended — idle teammates are still expensive context windows.

Watch out

Experimental means 'verify, don't trust'

Off by default for a reason: state restore is incomplete (no teammate restore on /resume//rewind), task status can silently lag, and shutdown drags. Treat a team as a powerful but fragile tool — keep runs short, nudge stuck tasks, and always tear down from the lead. Defer to /help and the live docs; the behavior shifts between versions.

Background & cloud gotchas (supervisor, /goal, provider limits)

Unattended and cloud features have their own failure surface — these are the ones learners hit first.

Unpinned background sessions stop after ~1 hour. Background sessions live in a per-user supervisor (daemon), and unpinned idle sessions stop after ~1 hour. If you want one alive while you're away, pin it (Ctrl+T in agent view) — pinned sessions also get restarted by the supervisor on a binary update.

Supervisor stalled? Restart it without losing work.

bash

claude daemon status                    # diagnose
claude daemon stop --any --keep-workers # restart supervisor, PRESERVE sessions

The --keep-workers flag is the important part: it restarts the supervisor while preserving the running sessions.

/goal can't read files — it judges from the transcript only. /goal <condition> pursues a condition across turns, but a small fast model (typically Haiku) evaluates after each turn from the transcript ONLY — it can't run shell or read files independently. So the condition must be provable from Claude's own output. "All tests in test/auth pass" only works if Claude actually runs them and the result is in the transcript; a condition that requires inspecting a file the goal-checker can't see will never resolve.

Provider limits — Monitor, PushNotification, and the ultra features. Several capabilities are unavailable on Bedrock / Vertex / Foundry / ZDR: the Monitor and PushNotification tools, plus ultraplan / ultrareview. (Dynamic workflows are available on Bedrock/Vertex/Foundry; /loop, agent teams, and agent view are local.) Two more sharp edges: Monitor is also disabled with DISABLE_TELEMETRY / CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC; and ultraplan disconnects Remote Control (both use claude.ai/code), so you can't run cloud planning and a phone-connected Remote Control session at once.

Tip

Pin it, or it's gone in an hour

The most common 'where did my background agent go?' is an unpinned idle session the supervisor reaped after ~1 hour. Pin with Ctrl+T for anything you expect to outlive a coffee break. And if the whole supervisor seems wedged, claude daemon stop --any --keep-workers restarts it without throwing away the sessions it hosts.

Note

If your goal never resolves, it's probably unprovable from the transcript

/goal's evaluator (a small fast model) sees the transcript, not your filesystem. Phrase conditions so they're satisfied by something Claude actually emits — run the tests, print lint output — rather than a state only a file read could confirm. Add a bound like '…or stop after 20 turns' so a never-provable goal can't loop forever.

Currency: most of this is preview or experimental

These features are new and move fast. State this plainly to yourself before you build on them:

Research preview: Dynamic Workflows (v2.1.154+), agent view / background agents (v2.1.139+), ultraplan / ultrareview, cloud routines.
Experimental, OFF by default: Agent teams (v2.1.32+) — opt in with CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1.
Current keywords (NOT deprecated): ultracode (triggers dynamic workflows) and ultrathink (deeper reasoning for one prompt) are current.

Because flags, caps, and even the scripting-API surface shift between versions, two habits keep you safe: check claude --version before assuming a feature exists, and defer to /help and the live docs for your installed version rather than to anything memorized — including this lesson. When in doubt about a workflow's behavior, read the script Claude generated; it's the ground truth for that run.

Key insight

The durable skill is the discipline, not the syntax

Command names and caps will drift. What won't: count your active context windows to predict cost, profile on a small slice before scaling, attribute spend with /usage + /insights + OTel, lock permissions at the parent, and read the generated script. Learn the model and the habits; look up the exact syntax each time with /help.

Try it: Profile a small slice, then attribute the spend

Practice the cost-and-observability discipline before you ever run a large fan-out. Treat every feature here as preview/experimental — check claude --version and /help first.

Baseline the lenses: in any repo, run /usage and /insights and note what each shows — /usage breaks the current session down by skill/subagent/plugin/MCP; /insights shows history and per-skill trends. Write one sentence on which question each answers.
Profile a workflow on a small slice: with ultracode (needs v2.1.154+), ask for a workflow that audits one directory for a narrow issue. While it runs, open /workflows, drill into a phase (Enter/→), and record agent count, per-agent tokens, and elapsed time. Multiply by your intended full-repo scale to estimate the real cost — do NOT run the full version yet.
Read the generated script: from the approval dialog ('View raw script') or /workflows, open the script and confirm the fan-out is bounded by your data, not an open-ended loop. Note where the meta block is a literal.
Permission probe: give the workflow a task needing a shell command that is NOT on your allowlist and confirm it prompts mid-run even though file edits auto-approve. Then add the tool to the parent allowlist and confirm the prompt disappears.
(Optional, experimental) Tiny team: enable CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1, spawn ONE teammate, give it a small task, then /resume-test that it does NOT restore the teammate. Tear down with 'clean up the team' from the lead.
Goal sanity check: set /goal <a condition provable from output, e.g. the test command's result appears in the transcript> or stop after 10 turns and confirm it resolves — then try a condition that requires reading a file and observe it never resolving.

Deliverable: a short note with your small-slice token numbers, the extrapolated full-run cost, the one tool that prompted mid-run despite acceptEdits, and one sentence on why the file-reading /goal condition never resolved.

Key takeaways

1Cost scales ~linearly with active agents: subagents summarize back (cheap), workflows keep state in script variables but spawn up to 1,000 (16 concurrent), and agent teammates each hold a FULL context window — the most expensive per agent.
2Always profile on a small slice — one directory, a narrow question, one teammate — and read the real token numbers before scaling a multi-agent run.
3Attribute spend with /usage (by skill/subagent/plugin/MCP), /insights (history + trends), live /workflows token panels, and OpenTelemetry agent_id / parent_agent_id spans for business chargeback.
4Parent permission mode wins; children can only relax within inherited constraints. Workflow subagents always run acceptEdits with the parent allowlist (edits auto-approved; non-allowlisted shell/web/MCP still prompt); routines run with no prompts; protected paths (.claude/.git/.npmrc/.mcp.json) never auto-approve except under bypass.
5Workflow traps: 16-concurrent/1,000-total caps, silent runaway loops, same-session-only resume, a literal meta block, and no mid-run input. Team traps: experimental + off by default, no teammate restore on /resume or /rewind, lagging task status, slow shutdown, one team at a time, cleanup from the lead.
6Background/cloud traps: pin background sessions or they stop ~1h; recover a stalled supervisor with `claude daemon stop --any --keep-workers`; /goal judges from the transcript only; Monitor/PushNotification/ultra features are unavailable on Bedrock/Vertex/Foundry/ZDR; ultraplan disconnects Remote Control.

Quiz

Lock in what you learned

Check your understanding

0 / 4 answered

1.You need to run the same large code audit two ways: (A) a dynamic workflow that fans out across files, and (B) a 5-teammate agent team. Why is the team materially more expensive per agent?

2.After a heavy multi-agent session, you want to know which SKILL has been quietly growing into your biggest cost line over the past few weeks. Which tool is the right one?

3.A workflow subagent is running while you're in default permission mode. Which statement about its permissions is correct?

4.You set `/goal all tests in test/auth pass and lint is clean` and the goal never resolves even though the code looks fine. What is the most likely cause?

Go deeper

Hand-picked sources to keep learning

Dynamic workflows (official docs)

The 16-concurrent / 1,000-total caps, the literal meta block, same-session resume, /workflows monitoring, and the acceptEdits + parent-allowlist permission behavior.

Agent teams (official docs)

Experimental + off-by-default enablement, full-window-per-teammate cost, the /resume & /rewind restore limitation, lagging task status, and cleanup-from-the-lead.

Agent view / background agents (official docs)

The supervisor daemon, the ~1h stop for unpinned sessions, pinning with Ctrl+T, and `claude daemon stop --any --keep-workers` recovery.

Monitoring usage & cost (official docs)

/usage and /insights breakdowns and the OpenTelemetry agent_id / parent_agent_id spans for business attribution.

Permission modes (official docs)

Parent-mode precedence, relaxing within inherited constraints, protected paths, and how each orchestration layer prompts.

Building a C compiler with Claude (Anthropic engineering)

Order-of-magnitude reality check on coordinated multi-session cost: 2,000+ sessions over ~2 weeks for ~$20,000 of API.