Cost & Latency Optimization
Fast, cheap, good — engineering the trade-offs
- Reason about where an agent's cost and latency actually come from, using input/output token asymmetry and loop multiplication
- Apply prompt caching correctly on Anthropic and OpenAI, including when it loses money
- Design a model router or cascade that sends most traffic to cheaper models while protecting quality
- Use streaming and parallel tool calls to attack perceived and real latency separately
- Put guardrails — token budgets and iteration caps — on agent loops to stop runaway spend
An agent that calls a model fifty times to finish one task is fifty times the bill and fifty times the wait of a single completion. This lesson gives you the token-economics intuition behind that multiplier and the production levers — routing, prompt caching, streaming, parallelism, and context pruning — that cut spend 45-85% and latency 3-5x without sacrificing the quality your users feel.
1. Where cost and latency actually come from
2. Prompt caching: stop paying for the same prefix
3. Model routing and cascades: cheap first, escalate when needed
4. Streaming and perceived latency
5. Parallelism, batching, and speculative decoding
6. Cutting context and loop steps
Where cost and latency actually come from
Every optimization in this lesson follows from one asymmetry: output tokens are expensive, input tokens are cheap — but input tokens are where the volume hides.
Output tokens cost roughly 4-6x more than input tokens and dominate latency, because the model generates them one at a time, sequentially. Input tokens are processed in parallel, so they are cheap and fast per token. But agents load enormous prefixes — system prompts, tool schemas, retrieved documents, prior turns — on every iteration. In software-engineering agent traces, input-to-output ratios exceed 150:1. Loading context, not generating answers, is where the resources go.
Now multiply. A single LLM call is one prompt and one completion. An agent loop re-sends the growing context every turn, so a task that takes 50 steps can burn 50x the tokens of a one-shot call. This is why an unconstrained coding agent can cost $5-8 per task even though per-token prices have fallen ~10x per year since 2022.
The takeaway for builders: don't micro-optimize the answer. Attack the prefix you re-send every turn, and attack the number of turns. Those two levers dominate everything else.
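To make the multiplier concrete, here is a rough sketch of the arithmetic. The function name, the token counts, and the prices (dollars per million tokens) are illustrative assumptions, not figures from any provider's price list.

```python
def loop_cost(prefix_tokens, growth_per_turn, output_per_turn, turns,
              input_price=3.0, output_price=15.0):
    """Estimate the cost of an agent loop that re-sends its context every turn.

    Prices are placeholder values in dollars per million tokens.
    """
    input_tokens = 0
    context = prefix_tokens
    for _ in range(turns):
        input_tokens += context      # the whole context is re-sent this turn
        context += growth_per_turn   # tool results and replies accumulate
    output_tokens = output_per_turn * turns
    dollars = (input_tokens * input_price
               + output_tokens * output_price) / 1_000_000
    return input_tokens, output_tokens, dollars


# One-shot call versus a 50-step loop over the same 10k-token prefix.
one_shot = loop_cost(10_000, 0, 300, turns=1)
agent = loop_cost(10_000, 1_500, 300, turns=50)
print("one-shot:", one_shot)
print("agent:   ", agent)
```

Under these assumed numbers the loop reads about 2.3 million input tokens against 15,000 output tokens, so nearly all of the spend is re-reading context. Shrinking `prefix_tokens`, `growth_per_turn`, or `turns` moves the total far more than trimming the answer does.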
Key insight
Two numbers to instrument first
Before optimizing, log tokens-per-task and iterations-per-task in your traces. Cost scales with both. A task that thrashes through 30 iterations can cost 5x more than one that finishes in 6, for the same outcome, and you will never see it without per-task telemetry.
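One minimal way to capture those two numbers is a small per-task record that you update after every model call. The class and field names below are hypothetical; adapt them to whatever usage object your provider's SDK returns.

```python
import time
from dataclasses import dataclass, field


@dataclass
class TaskTrace:
    """Accumulates token and iteration counts for a single agent task."""
    task_id: str
    iterations: int = 0
    input_tokens: int = 0
    output_tokens: int = 0
    started: float = field(default_factory=time.perf_counter)

    def record(self, input_tokens: int, output_tokens: int) -> None:
        """Call once per model call with the usage the provider reports."""
        self.iterations += 1
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens

    def summary(self) -> dict:
        """Return the per-task numbers worth logging."""
        return {
            "task_id": self.task_id,
            "iterations": self.iterations,
            "total_tokens": self.input_tokens + self.output_tokens,
            "seconds": time.perf_counter() - self.started,
        }


trace = TaskTrace("demo-task")
trace.record(input_tokens=1000, output_tokens=50)
trace.record(input_tokens=1200, output_tokens=40)
print(trace.summary())
```

Keeping input and output tokens separate matters because they are priced differently and respond to different optimizations.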
Prompt caching: stop paying for the same prefix
Here is the intuition: your agent sends the provider the same long opening (system prompt, tool definitions, reference docs) on every single turn, and every turn the provider does the expensive work of reading it from scratch. Prompt caching tells the provider to keep that work around and reuse it. Technically, it reuses the KV-cache (the model's pre-computed internal representation of the prompt) instead of recomputing the prefix. Since agents re-send a large, stable prefix every turn, it is usually the biggest single cost win.
The two major providers differ in ways that matter:
| | Anthropic | OpenAI |
|---|---|---|
| Activation | Explicit cache_control markers | Automatic, no code change |
| Cache read price | 0.1x base input (90% off) | 0.5x base input (50% off) |
| Cache write price | 1.25x (5-min TTL) / 2.0x (1-hr TTL) | No write premium |
| Default TTL | 5 minutes (opt-in 1-hour via "ttl": "1h") | ~5-10 minutes idle, up to ~1 hour (varies) |
| Min prompt | 1,024–4,096 tokens (model-dependent) | 1,024 tokens |
Place the cache breakpoint at the end of your stable content (system prompt, tool definitions, long reference docs) and put volatile content — the latest user turn — after it. On Anthropic, mark the last stable block with cache_control:
```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-5",
    system=[
        {
            "type": "text",
            "text": SYSTEM_AND_TOOLS,
            "cache_control": {"type": "ephemeral"},  # cache the stable prefix
        }
    ],
    messages=[
        {"role": "user", "content": latest_user_message}  # volatile, NOT cached
    ],
    max_tokens=1024,
)
```

By default the cache TTL is 5 minutes; pass "ttl": "1h" in cache_control for the longer TTL (at 2.0x write cost). A chatty agent reusing a 20k-token stable prefix can drop input cost by ~85% on cache hits.
Watch out
Caching can lose money on low reuse
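An Anthropic cache write costs 1.25x (5-min) or 2.0x (1-hour) of normal input, and each later hit costs 0.1x. With the 5-minute TTL, one hit already repays the premium (1.25 + 0.1 = 1.35x, versus 2x uncached); with the 1-hour TTL you need two hits (2.0 + 0.2 = 2.2x, versus 3x uncached). A prefix that is written and never read again before the TTL expires is strictly worse than not caching. Cache the stable, frequently reused prefix, never the part that changes every call.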
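The break-even point can be worked out directly from the multipliers in the table above. This is a sketch under one simplifying assumption: the first request writes the cache and every later request hits it before the TTL expires. The function names are made up for illustration.

```python
def cached_cost(requests, write_mult, read_mult=0.1):
    """Relative input cost of a prefix sent `requests` times with caching.

    Units are multiples of one uncached read of the prefix. The first
    request pays the write price; the rest are assumed to be cache hits.
    """
    if requests < 1:
        return 0.0
    return write_mult + read_mult * (requests - 1)


def break_even_requests(write_mult, read_mult=0.1, limit=100):
    """Smallest request count at which caching beats not caching."""
    for n in range(1, limit + 1):
        if cached_cost(n, write_mult, read_mult) < n:  # uncached cost is n
            return n
    return None


for label, write_mult in [("5-min TTL", 1.25), ("1-hour TTL", 2.0)]:
    print(label, "pays off from request", break_even_requests(write_mult))
```

With these multipliers the 5-minute cache pays off from the second request (one hit) and the 1-hour cache from the third (two hits). If hits are less reliable in your traffic, lower the assumed hit count and re-run the comparison.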
Model routing and cascades: cheap first, escalate when needed
Most agent sub-tasks — classifying intent, extracting a field, summarizing a tool result — do not need a frontier model. Routing sends each request to the cheapest model that can handle it; a cascade tries a cheap model first and escalates only when its output is low-confidence.
The numbers are compelling. Production systems route 60-80% of queries to cheaper models and cut cost 45-85%. The RouteLLM paper (LMSYS, 2024) demonstrated cost reductions of up to 85% on MT Bench while retaining 95% of GPT-4 quality, with smaller but still significant savings on other benchmarks (MMLU: ~45%, GSM8K: ~35%). A small fine-tuned classifier (~0.5B params) adds only milliseconds of routing latency.
```python
def route(query: str) -> str:
    # `classifier` is assumed to be a small, pre-trained difficulty model
    difficulty = classifier.predict(query)  # cheap, ~ms
    if difficulty < 0.3:
        return "haiku"   # simple extraction / classification
    if difficulty < 0.7:
        return "sonnet"  # most real work
    return "opus"        # hard reasoning, planning
```

Use LiteLLM or RouteLLM rather than rolling your own. LiteLLM gives you fallback chains, cost tracking, and load balancing across 100+ providers for free; RouteLLM supplies trained routers you can drop in.
Tip
Route by capability, not just price
A cheaper model can be more expensive overall if it needs extra loop iterations or retries to succeed. If your tiny model fails a sub-task three times before escalating, you paid for four calls instead of one. Treat routing decisions as first-class telemetry: watch escalation rates and per-route success, because routing quality drifts on new query types.
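The cascade pattern and its telemetry can be sketched in a few lines. Everything here is hypothetical: the tier functions stand in for real model calls, and the confidence score is assumed to come from whatever signal you trust (a verifier, log-probabilities, or a self-check).

```python
from typing import Callable, List, Tuple

# Each tier is (name, model_fn); model_fn returns (answer, confidence in [0, 1]).
Tier = Tuple[str, Callable[[str], Tuple[str, float]]]


def cascade(query: str, tiers: List[Tier], threshold: float = 0.8):
    """Try tiers cheapest-first and stop at the first confident answer."""
    attempts = []
    answer = None
    for name, model_fn in tiers:
        answer, confidence = model_fn(query)
        attempts.append(name)  # log this: it is your escalation telemetry
        if confidence >= threshold:
            break
    # If no tier is confident, the last (strongest) tier's answer is returned.
    return answer, attempts


def small_model(query: str) -> Tuple[str, float]:
    """Stand-in cheap model: unsure about 'why' questions."""
    return "short answer", 0.4 if "why" in query else 0.95


def large_model(query: str) -> Tuple[str, float]:
    """Stand-in frontier model."""
    return "detailed answer", 0.9


tiers = [("small", small_model), ("large", large_model)]
print(cascade("extract the date", tiers))
print(cascade("why did the build fail", tiers))
```

The `attempts` list is the part worth keeping in production: the share of queries with more than one entry is your escalation rate, and each extra entry is a call you paid for.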
Streaming and perceived latency
There are two latencies. Real latency is how long until the full response is done. Perceived latency is how long until the user sees something. Streaming attacks the second, not the first.
The metric that captures the wait is Time to First Token (TTFT). With streaming, you deliver tokens as they generate, so a 4-second response feels fast because text starts flowing in a few hundred milliseconds. MLPerf Inference v6.0 (April 2026) defines strict interactive TTFT P99 thresholds (e.g., ≤ 1.5–2.0 s for large reasoning models) and per-output-token latency targets to reflect real-world responsiveness requirements. Techniques like staircase streaming in multi-agent inference can cut TTFT by up to 93%.
In practice: stream the final, user-facing response. For internal agent steps (a tool call the user never reads), streaming buys nothing — collect the full structured output and act on it. Most SDKs make this a one-line change:
```python
with client.messages.stream(
    model="claude-sonnet-4-5", max_tokens=1024, messages=msgs
) as stream:
    for text in stream.text_stream:
        emit(text)  # render incrementally to the user
```

Streaming changes nothing about your bill or total compute; it changes whether your product feels responsive.
Watch out
Streaming does not make the model faster
A common myth: 'we added streaming and cut latency.' The full response takes exactly as long; you have only moved when the user sees output. That is genuinely valuable for UX, but do not count it as a throughput or cost win — and never let it substitute for real latency work like routing to a faster model or parallelizing tools.
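To keep the two latencies separate in your own measurements, you can wrap any stream of text chunks and record both numbers. This is a sketch: `fake_model` is a stand-in generator, and in practice you would pass your SDK's text stream instead.

```python
import time
from typing import Iterable, Iterator


def timed_stream(chunks: Iterable[str], stats: dict) -> Iterator[str]:
    """Yield chunks unchanged while recording TTFT and total time in `stats`."""
    start = time.perf_counter()
    for chunk in chunks:
        # setdefault only stores the value the first time, i.e. the first chunk
        stats.setdefault("ttft", time.perf_counter() - start)
        yield chunk
    stats["total"] = time.perf_counter() - start


def fake_model() -> Iterator[str]:
    """Stand-in for a streaming response: three chunks, 50 ms apart."""
    for piece in ["Hello", ", ", "world"]:
        time.sleep(0.05)
        yield piece


stats = {}
text = "".join(timed_stream(fake_model(), stats))
print(text, stats)
```

Here `ttft` is what the user perceives and `total` is the real latency. Streaming improves only the first; if `total` is the problem, look at routing or parallelism instead.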
Parallelism, batching, and speculative decoding
Imagine an agent that needs to check the weather, look up a flight, and search a database — three independent lookups. If you run them one after another, the user waits for all three in a row. If the calls don't depend on each other, there is no reason to: fire them at the same time and you only wait for the slowest one.
That is the core idea. When an agent makes several independent tool calls, running them sequentially makes total latency the sum of all calls. Run them concurrently and latency collapses to the slowest single call — a reported 3-5x reduction. Modern function-calling APIs emit parallel tool calls natively; your job is to actually execute them in parallel.
```python
import asyncio

async def run_all(tool_calls):  # run_tool is your own async tool executor
    return await asyncio.gather(*(run_tool(c) for c in tool_calls))
```

Three more throughput levers:
- Batching trades latency for throughput. Async/offline batch inference costs 5-10x less per token than synchronous serving; continuous batching (standard in vLLM, TensorRT-LLM, TGI) doubles or triples requests-per-second on the same GPU. Use it for background jobs — not user-facing real-time calls, where it adds queueing delay.
- Speculative decoding uses a small draft model to propose tokens a large model verifies in parallel: a lossless 2-3x latency cut. QuantSpec (SqueezeAILab/UC Berkeley, ICML 2025) reaches ~2.5x speedup with >90% acceptance rate. Crucially, the output distribution is provably identical — no quality trade-off.
- Semantic caching stores embeddings of past queries and returns cached answers for similar ones — up to 73% cost reduction in high-repetition workloads like support and FAQs.
Watch out
Parallel tool calls can raise token use slightly
Running tools concurrently doesn't cut tokens — the model still processes every tool result, sometimes simultaneously, which can nudge token count up. The latency win (sum → slowest) almost always justifies it, but if you're optimizing for cost alone, measure rather than assume.
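The sum-versus-slowest latency claim is easy to check on your own tools. The sketch below uses a hypothetical `run_tool` that just sleeps, with made-up durations for the three lookups from the example above.

```python
import asyncio
import time


async def run_tool(name: str, seconds: float) -> str:
    """Hypothetical tool: sleeps to stand in for network or database latency."""
    await asyncio.sleep(seconds)
    return f"{name} done"


async def compare(calls):
    """Run the same calls sequentially, then concurrently, and time both."""
    start = time.perf_counter()
    sequential = [await run_tool(name, secs) for name, secs in calls]
    seq_time = time.perf_counter() - start

    start = time.perf_counter()
    concurrent = await asyncio.gather(
        *(run_tool(name, secs) for name, secs in calls)
    )
    par_time = time.perf_counter() - start
    return sequential, list(concurrent), seq_time, par_time


calls = [("weather", 0.10), ("flights", 0.20), ("database", 0.15)]
sequential, concurrent, seq_time, par_time = asyncio.run(compare(calls))
print(f"sequential {seq_time:.2f}s, concurrent {par_time:.2f}s")
```

`asyncio.gather` returns results in the order the calls were passed, so downstream code can treat both result lists the same way. To check the token side as well, log usage for both variants rather than only wall-clock time.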
Cutting context and loop steps
The two biggest multipliers — the re-sent prefix and the number of turns — are also the two you control directly.
Prune the context. Conversation summarization, memory compaction, and pruning stale tool results cut token usage 50-80%. The ACON framework (Oct 2025) showed 26-54% peak-token reduction with no fine-tuning. Tools like LLMLingua-2 compress long prompts and RAG contexts with minimal quality loss. But beware the deeper failure: nearly 65% of enterprise AI failures in 2025 were context drift and memory loss, not raw exhaustion. Managing what is in context — relevance, staleness, redundancy — matters more than window size.
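As a rough illustration of pruning stale tool results, the sketch below keeps recent messages verbatim and truncates older tool outputs. The message shape (dicts with `role` and `content`) and the thresholds are assumptions; this is not how ACON or LLMLingua-2 work, only the simplest version of the idea.

```python
def prune_history(messages, keep_last=4, max_tool_chars=200):
    """Truncate tool results older than the last `keep_last` messages.

    Messages stay in place, so any pairing between a tool call and its
    result is preserved; only the bulky content is shortened.
    """
    cutoff = len(messages) - keep_last
    pruned = []
    for i, msg in enumerate(messages):
        is_stale_tool = msg["role"] == "tool" and i < cutoff
        if is_stale_tool and len(msg["content"]) > max_tool_chars:
            stub = msg["content"][:max_tool_chars] + " [truncated]"
            pruned.append({**msg, "content": stub})
        else:
            pruned.append(msg)
    return pruned


history = [
    {"role": "system", "content": "You are a helpful agent."},
    {"role": "user", "content": "Find the failing test."},
    {"role": "tool", "content": "x" * 1000},   # old, bulky tool output
    {"role": "assistant", "content": "Looking at the logs."},
    {"role": "user", "content": "Now fix it."},
    {"role": "tool", "content": "y" * 1000},   # recent, still relevant
    {"role": "assistant", "content": "Patch applied."},
]
pruned = prune_history(history)
print([len(m["content"]) for m in pruned])
```

Truncation is the bluntest option. Replacing the stale result with a one-line summary usually keeps more of the signal, at the price of an extra cheap model call.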
Cap the loop. Uncontrolled retries and recursion are the number-one driver of runaway agent spend. Every production agent needs hard guardrails:
```python
MAX_ITERS, TOKEN_BUDGET = 12, 100_000
spent = 0
for i in range(MAX_ITERS):
    resp = step(history)  # step() is assumed to append its result to history
    spent += resp.usage.total_tokens
    if resp.done:
        break  # task finished within budget
    if spent > TOKEN_BUDGET:
        escalate_to_human(history)  # token budget exhausted
        break
else:
    escalate_to_human(history)  # iteration cap hit; don't retry forever
```

When a budget is hit, escalate to a human; do not retry indefinitely. A capped, observable loop is the difference between a $0.30 task and a $30 one.
Tip
Bigger context window is not the fix
It's tempting to solve long-horizon agents by buying a larger context window. But if 65% of failures are drift and staleness, a bigger window just lets you carry more irrelevant junk. Curate aggressively; summarize old turns; drop tool results you no longer need. Context engineering beats context volume.
Try it: Profile and optimize an agent loop
Take any small tool-using agent (or the one from the 'first agent from scratch' lesson) and instrument it. Step 1: Log per-task total_tokens, iterations, and wall-clock latency for five representative tasks. Step 2: Add prompt caching to the stable prefix (system prompt + tool schemas) and re-measure input cost across a multi-turn session — confirm you see cache hits after the first call. Step 3: Add a hard MAX_ITERS cap and a TOKEN_BUDGET, with an escalation branch instead of infinite retries. Step 4: Convert any independent sequential tool calls to asyncio.gather and measure the latency change. Report a before/after table of cost-per-task and latency-per-task. Target: at least a 40% cost reduction without changing task success on your five examples.
Key takeaways
1. Cost and latency in agents are dominated by the large prefix re-sent every turn and the number of loop iterations, not by the final answer.
2. Prompt caching cuts input cost 50-90%, but cache writes cost more than normal input, so it pays off only when the prefix is actually re-read: at least one hit at the 5-minute TTL, two at the 1-hour TTL.
3. Routing 60-80% of traffic to cheaper models cuts cost 45-85% while retaining ~95% of frontier quality, but route by capability, not just price.
4. Streaming improves perceived latency only; parallel tool calls and speculative decoding cut real latency 2-5x without changing outputs.
5. Every production agent needs a token budget and iteration cap that escalate to a human, because uncontrolled loops are the top driver of runaway spend.
Quiz
Lock in what you learned
Check your understanding
1. Why does an agent loop cost so much more than a single LLM call for the same final answer?
2. When does enabling Anthropic prompt caching actually LOSE money?
3. Which statement about streaming is correct?
4. What is the single most important guardrail against runaway agent spend?