Cost & Latency Optimization

Fast, cheap, good — engineering the trade-offs

Advanced · 14 min · Builder · Decision-maker
What you'll be able to do
  • Reason about where an agent's cost and latency actually come from, using input/output token asymmetry and loop multiplication
  • Apply prompt caching correctly on Anthropic and OpenAI, including when it loses money
  • Design a model router or cascade that sends most traffic to cheaper models while protecting quality
  • Use streaming and parallel tool calls to attack perceived and real latency separately
  • Put guardrails — token budgets and iteration caps — on agent loops to stop runaway spend
At a glance

An agent that calls a model fifty times to finish one task is fifty times the bill and fifty times the wait of a single completion. This lesson gives you the token-economics intuition behind that multiplier and the production levers — routing, prompt caching, streaming, parallelism, and context pruning — that cut spend 45-85% and latency 3-5x without sacrificing the quality your users feel.

  1. Where cost and latency actually come from
  2. Prompt caching: stop paying for the same prefix
  3. Model routing and cascades: cheap first, escalate when needed
  4. Streaming and perceived latency
  5. Parallelism, batching, and speculative decoding
  6. Cutting context and loop steps

Where cost and latency actually come from

Every optimization in this lesson follows from one asymmetry: output tokens are expensive, input tokens are cheap — but input tokens are where the volume hides.

Output tokens cost roughly 4-6x more than input tokens and dominate latency, because the model generates them one at a time, sequentially. Input tokens are processed in parallel, so they are cheap and fast per token. But agents load enormous prefixes — system prompts, tool schemas, retrieved documents, prior turns — on every iteration. In software-engineering agent traces, input-to-output ratios exceed 150:1. Loading context, not generating answers, is where the resources go.

Now multiply. A single LLM call is one prompt and one completion. An agent loop re-sends the growing context every turn, so a task that takes 50 steps burns at least 50x the input tokens of a one-shot call, and more as each step's output is appended to the next step's prefix. This is why an unconstrained coding agent can cost $5-8 per task even though per-token prices have fallen roughly 10x per year since 2022.
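
A rough back-of-the-envelope sketch of that multiplier follows. The prefix size, per-step growth, and step count are illustrative assumptions, not measurements; the point is only that re-sending a growing context makes total input tokens grow faster than the step count.

```python
def loop_input_tokens(prefix: int, growth_per_step: int, steps: int) -> int:
    """Total input tokens when every step re-sends the prefix plus all prior growth."""
    total = 0
    context = prefix
    for _ in range(steps):
        total += context             # the whole context is re-read on this step
        context += growth_per_step   # tool results and model output appended for the next step
    return total


# Assumed numbers: an 8k-token prefix that grows by 1k tokens per step.
one_shot = loop_input_tokens(prefix=8_000, growth_per_step=0, steps=1)
agent = loop_input_tokens(prefix=8_000, growth_per_step=1_000, steps=50)

print(one_shot, agent, round(agent / one_shot, 1))  # 8000 1625000 203.1
```

Under these assumptions the 50-step loop reads roughly 200x the input of the one-shot call, which is why shrinking the prefix and the step count pays off more than trimming the final answer.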

The takeaway for builders: don't micro-optimize the answer. Attack the prefix you re-send every turn, and attack the number of turns. Those two levers dominate everything else.

Key insight

Two numbers to instrument first

Before optimizing, log tokens-per-task and iterations-per-task in your traces. Cost scales with both. A task that thrashes through 30 iterations can cost 5x or more what the same task costs when it finishes in 6, for the same outcome, and you will never see the difference without per-task telemetry.
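
One possible shape for that telemetry is sketched below. `TaskTrace` and its field names are hypothetical, not part of any SDK; in a real agent you would call `record` with the usage numbers your provider returns for each model call.

```python
import time
from dataclasses import dataclass, field


@dataclass
class TaskTrace:
    """Hypothetical per-task accumulator for tokens, iterations, and wall-clock time."""
    task_id: str
    input_tokens: int = 0
    output_tokens: int = 0
    iterations: int = 0
    started: float = field(default_factory=time.perf_counter)

    def record(self, input_tokens: int, output_tokens: int) -> None:
        """Call once per model call inside the agent loop."""
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens
        self.iterations += 1

    def summary(self) -> dict:
        return {
            "task_id": self.task_id,
            "iterations": self.iterations,
            "total_tokens": self.input_tokens + self.output_tokens,
            "wall_seconds": round(time.perf_counter() - self.started, 3),
        }


trace = TaskTrace("demo-task")
trace.record(input_tokens=9_000, output_tokens=120)
trace.record(input_tokens=9_400, output_tokens=80)
print(trace.summary())
```

Emitting one summary per task, rather than one log line per call, is what makes a 6-iteration run and a 30-iteration run directly comparable.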

Prompt caching: stop paying for the same prefix

Here is the intuition: your agent sends the provider the same long opening (system prompt, tool definitions, reference docs) on every single turn, and on every turn the provider does the expensive work of reading it from scratch. Prompt caching tells the provider to keep that work around and reuse it. Technically, it reuses the KV cache (the model's precomputed internal representation of the prompt) instead of recomputing the prefix. Because agents re-send a large, stable prefix every turn, caching is usually the biggest single cost win.

The two major providers differ in ways that matter:

| | Anthropic | OpenAI |
|---|---|---|
| Activation | Explicit cache_control markers | Automatic, no code change |
| Cache read price | 0.1x base input (90% off) | 0.5x base input (50% off) |
| Cache write price | 1.25x (5-min TTL) / 2.0x (1-hr TTL) | No write premium |
| Default TTL | 5 minutes (opt-in 1-hour via "ttl": "1h") | About 5-10 minutes of inactivity, up to ~1 hour off-peak |
| Min prompt | 1,024–4,096 tokens (model-dependent) | 1,024 tokens |

Place the cache breakpoint at the end of your stable content (system prompt, tool definitions, long reference docs) and put volatile content — the latest user turn — after it. On Anthropic, mark the last stable block with cache_control:

python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-5",
    system=[
        {
            "type": "text",
            "text": SYSTEM_AND_TOOLS,
            "cache_control": {"type": "ephemeral"},  # cache the stable prefix
        }
    ],
    messages=[
        {"role": "user", "content": latest_user_message}  # volatile, NOT cached
    ],
    max_tokens=1024,
)

By default the cache TTL is 5 minutes; pass "ttl": "1h" in cache_control for the longer TTL (at 2.0x write cost). A chatty agent reusing a 20k-token stable prefix can drop input cost by ~85% on cache hits.
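
To confirm that caching is actually working, you can price each call from its usage counters. The sketch below assumes the Anthropic Messages API usage fields `input_tokens`, `cache_creation_input_tokens`, and `cache_read_input_tokens`, and the 5-minute multipliers from the table above; the $3 per million tokens base price and the 500-token user turn are illustrative assumptions.

```python
def effective_input_cost(input_tokens: int, cache_write_tokens: int,
                         cache_read_tokens: int, price_per_mtok: float,
                         write_mult: float = 1.25, read_mult: float = 0.1) -> float:
    """Dollar cost of the input side of one call, given its cache counters."""
    billable = (input_tokens
                + cache_write_tokens * write_mult   # prefix written to the cache
                + cache_read_tokens * read_mult)    # prefix served from the cache
    return billable * price_per_mtok / 1_000_000


PRICE = 3.0  # assumed base input price in dollars per million tokens

uncached = effective_input_cost(20_500, 0, 0, PRICE)       # no caching at all
first_call = effective_input_cost(500, 20_000, 0, PRICE)   # pays the write premium
cache_hit = effective_input_cost(500, 0, 20_000, PRICE)    # later turns

print(f"uncached ${uncached:.4f}, first call ${first_call:.4f}, hit ${cache_hit:.4f}")
print(f"saving on a hit: {1 - cache_hit / uncached:.0%}")
```

If `cache_read_input_tokens` stays at zero after the first call, the prefix is probably changing between turns or is below the model's minimum cacheable length.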

Watch out

Caching can lose money on low reuse

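An Anthropic cache write costs 1.25x (5-min) or 2.0x (1-hour) of normal input, and each hit saves 0.9x. Compared with sending the same calls uncached, the 5-minute premium is recovered by the first hit and the 1-hour premium by the second; the more conservative figure of roughly 1.4 hits comes from charging the entire 1.25x write against the 0.9x saved per hit. Either way, a prefix that is written and never read back is strictly worse than not caching. Cache the stable, frequently reused prefix, never the part that changes every call.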
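
The break-even point depends on the accounting. Charging the whole 1.25x write against the 0.9x saved per hit gives about 1.4; comparing total cost with and without caching over the same calls, so that only the premium above a normal call counts, gives a lower number. The sketch below does the second calculation using the multipliers quoted above; `hits_to_break_even` is a hypothetical helper name.

```python
import math


def hits_to_break_even(write_mult: float, read_mult: float = 0.1) -> int:
    """Smallest number of cache hits at which caching is strictly cheaper.

    Cached cost for one write plus n hits:  write_mult + n * read_mult
    Uncached cost for the same n + 1 calls: 1 + n
    """
    premium = write_mult - 1.0         # extra paid on the write
    saving_per_hit = 1.0 - read_mult   # saved on every later read
    return math.floor(premium / saving_per_hit) + 1


print(hits_to_break_even(1.25))  # 5-minute TTL -> 1
print(hits_to_break_even(2.0))   # 1-hour TTL   -> 2
```

With zero hits the write premium is pure loss in both cases, which is the situation to avoid.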

Model routing and cascades: cheap first, escalate when needed

Most agent sub-tasks — classifying intent, extracting a field, summarizing a tool result — do not need a frontier model. Routing sends each request to the cheapest model that can handle it; a cascade tries a cheap model first and escalates only when its output is low-confidence.

The numbers are compelling. Production systems route 60-80% of queries to cheaper models and cut cost 45-85%. The RouteLLM paper (LMSYS, 2024) demonstrated cost reductions of up to 85% on MT Bench while retaining 95% of GPT-4 quality, with smaller but still significant savings on other benchmarks (MMLU: ~45%, GSM8K: ~35%). A small fine-tuned classifier (~0.5B params) adds only milliseconds of routing latency.

python
def route(query: str) -> str:
    difficulty = classifier.predict(query)  # cheap, ~ms
    if difficulty < 0.3:
        return "haiku"      # simple extraction / classification
    if difficulty < 0.7:
        return "sonnet"     # most real work
    return "opus"           # hard reasoning, planning
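
The router above picks a model before any call is made. A cascade instead calls the cheap model first and escalates on low confidence. The following is a minimal sketch of that pattern, not a production design: the `cascade` function, the tier tuple shape, the 0.8 threshold, and the stand-in models are all assumptions, and how you obtain a confidence score (log-probabilities, a verifier, a self-check prompt) is left open.

```python
from typing import Callable, List, Tuple

# A tier is (name, call), where call returns (answer, confidence in [0, 1]).
Tier = Tuple[str, Callable[[str], Tuple[str, float]]]


def cascade(query: str, tiers: List[Tier], threshold: float = 0.8) -> Tuple[str, str]:
    """Try tiers cheapest-first; return (tier_name, answer) from the first confident one."""
    for name, call in tiers[:-1]:
        answer, confidence = call(query)
        if confidence >= threshold:
            return name, answer
    # The last tier is the fallback: accept its answer regardless of confidence.
    name, call = tiers[-1]
    answer, _ = call(query)
    return name, answer


# Stand-in models for the demo: the small one is only confident on short queries.
def small(q: str) -> Tuple[str, float]:
    return ("short answer", 0.95 if len(q) < 40 else 0.4)


def large(q: str) -> Tuple[str, float]:
    return ("careful answer", 0.9)


tiers = [("small", small), ("large", large)]
print(cascade("What is 2 + 2?", tiers))
print(cascade("Plan a three-phase migration of our billing database with rollback.", tiers))
```

Note that an escalated query pays for both calls, so a cascade only saves money when most traffic stops at the first tier.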

Use LiteLLM or RouteLLM rather than rolling your own. LiteLLM gives you fallback chains, cost tracking, and load balancing across 100+ providers; RouteLLM supplies trained routers you can drop in front of a strong/weak model pair.

Tip

Route by capability, not just price

A cheaper model can be more expensive overall if it needs extra loop iterations or retries to succeed. If your tiny model fails a sub-task three times before escalating, you paid for four calls instead of one. Treat routing decisions as first-class telemetry: watch escalation rates and per-route success, because routing quality drifts on new query types.
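
A quick way to sanity-check a route is to compute its expected cost per task from the telemetry you already collect. The per-call prices and rates below are assumed for illustration only, and `expected_cost` is a hypothetical helper.

```python
def expected_cost(cheap_cost: float, strong_cost: float,
                  avg_cheap_attempts: float, escalation_rate: float) -> float:
    """Expected cost per task for a cheap-first route."""
    return cheap_cost * avg_cheap_attempts + escalation_rate * strong_cost


CHEAP, STRONG = 0.002, 0.030  # assumed dollars per call

healthy = expected_cost(CHEAP, STRONG, avg_cheap_attempts=1.0, escalation_rate=0.2)
thrashing = expected_cost(CHEAP, STRONG, avg_cheap_attempts=3.0, escalation_rate=0.9)

print(f"healthy route:   ${healthy:.4f} vs ${STRONG:.4f} direct")
print(f"thrashing route: ${thrashing:.4f} vs ${STRONG:.4f} direct")
```

In the second case the cheap-first route costs more than sending everything to the strong model, which is the failure mode escalation-rate monitoring is meant to catch.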

Streaming and perceived latency

There are two latencies. Real latency is how long until the full response is done. Perceived latency is how long until the user sees something. Streaming attacks the second, not the first.

The metric that captures the wait is Time to First Token (TTFT). With streaming, you deliver tokens as they generate, so a 4-second response feels fast because text starts flowing in a few hundred milliseconds. MLPerf Inference v6.0 (April 2026) defines strict interactive TTFT P99 thresholds (e.g., ≤ 1.5–2.0 s for large reasoning models) and per-output-token latency targets to reflect real-world responsiveness requirements. Techniques like staircase streaming in multi-agent inference can cut TTFT by up to 93%.

In practice: stream the final, user-facing response. For internal agent steps (a tool call the user never reads), streaming buys nothing — collect the full structured output and act on it. Most SDKs make this a one-line change:

python
with client.messages.stream(model="claude-sonnet-4-5", max_tokens=1024, messages=msgs) as stream:
    for text in stream.text_stream:
        emit(text)  # render incrementally to the user

Streaming changes nothing about your bill or total compute — it changes whether your product feels responsive.
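
To see the two latencies separately, time the first chunk and the last chunk of the same stream. The sketch below uses a hypothetical `fake_stream` generator in place of a real SDK stream so it runs offline; `measure` works on any iterable of text chunks.

```python
import time
from typing import Iterable, Iterator, List, Tuple


def fake_stream(chunks: List[str], first_delay: float = 0.05,
                gap: float = 0.01) -> Iterator[str]:
    """Stand-in for an SDK text stream: a slow first chunk, then steady chunks."""
    time.sleep(first_delay)
    for i, chunk in enumerate(chunks):
        if i:
            time.sleep(gap)
        yield chunk


def measure(stream: Iterable[str]) -> Tuple[float, float, str]:
    """Return (ttft_seconds, total_seconds, full_text) for an iterable of text chunks."""
    start = time.perf_counter()
    ttft = None
    parts = []
    for chunk in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # perceived latency
        parts.append(chunk)
    total = time.perf_counter() - start         # real latency
    return (ttft if ttft is not None else total), total, "".join(parts)


ttft, total, text = measure(fake_stream(["Stream", "ing ", "works"]))
print(f"TTFT {ttft:.3f}s, total {total:.3f}s, text={text!r}")
```

Tracking both numbers per request keeps a streaming rollout from being mistaken for a real latency improvement.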

Watch out

Streaming does not make the model faster

A common myth: 'we added streaming and cut latency.' The full response takes exactly as long; you have only moved when the user sees output. That is genuinely valuable for UX, but do not count it as a throughput or cost win — and never let it substitute for real latency work like routing to a faster model or parallelizing tools.

Parallelism, batching, and speculative decoding

Imagine an agent that needs to check the weather, look up a flight, and search a database — three independent lookups. If you run them one after another, the user waits for all three in a row. If the calls don't depend on each other, there is no reason to: fire them at the same time and you only wait for the slowest one.

That is the core idea. When an agent makes several independent tool calls, running them sequentially makes total latency the sum of all calls. Run them concurrently and latency collapses to the slowest single call — a reported 3-5x reduction. Modern function-calling APIs emit parallel tool calls natively; your job is to actually execute them in parallel.

python
import asyncio

async def run_all(tool_calls):  # await this from your async agent loop
    return await asyncio.gather(*(run_tool(c) for c in tool_calls))
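
A self-contained timing comparison makes the sum-versus-slowest effect visible. Here `run_tool` is a stand-in that sleeps instead of calling a real tool, and the durations are made up.

```python
import asyncio
import time


async def run_tool(name: str, seconds: float) -> str:
    await asyncio.sleep(seconds)  # stands in for network or I/O wait
    return f"{name}: done"


async def main():
    calls = [("weather", 0.10), ("flights", 0.20), ("database", 0.15)]

    start = time.perf_counter()
    sequential = [await run_tool(n, s) for n, s in calls]  # one after another
    seq_time = time.perf_counter() - start

    start = time.perf_counter()
    concurrent = await asyncio.gather(*(run_tool(n, s) for n, s in calls))
    par_time = time.perf_counter() - start

    return sequential, list(concurrent), seq_time, par_time


sequential, concurrent, seq_time, par_time = asyncio.run(main())
print(f"sequential {seq_time:.2f}s, concurrent {par_time:.2f}s")
```

The sequential run takes about the sum of the three waits and the concurrent run about the longest one. `asyncio.gather` preserves input order in its results; passing `return_exceptions=True` keeps one failing tool from discarding the others' results.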

Three more throughput levers:

  • Batching trades latency for throughput. Async/offline batch inference can cost far less per token than synchronous serving (5-10x in reported best cases; the hosted batch APIs from OpenAI and Anthropic are priced at 50% off); continuous batching (standard in vLLM, TensorRT-LLM, TGI) doubles or triples requests-per-second on the same GPU. Use it for background jobs, not user-facing real-time calls, where it adds queueing delay.
  • Speculative decoding uses a small draft model to propose tokens a large model verifies in parallel: a lossless 2-3x latency cut. QuantSpec (SqueezeAILab/UC Berkeley, ICML 2025) reaches ~2.5x speedup with >90% acceptance rate. Crucially, the output distribution is provably identical — no quality trade-off.
  • Semantic caching stores embeddings of past queries and returns cached answers for similar ones — up to 73% cost reduction in high-repetition workloads like support and FAQs.
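
The draft-and-verify loop behind speculative decoding can be shown with a toy. The sketch below covers only the greedy case, with plain Python functions standing in for the two models; it is not QuantSpec or any library's implementation. Real systems verify all drafted positions in one batched forward pass of the large model and use a rejection-sampling rule to stay lossless under sampling.

```python
from typing import Callable, List

Model = Callable[[List[int]], int]  # prefix -> greedy next token id


def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n_new: int, k: int = 4) -> List[int]:
    out = list(prompt)
    goal = len(prompt) + n_new
    while len(out) < goal:
        # 1) The draft model cheaply proposes up to k tokens.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2) The target checks each position (a single batched pass in a real system).
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok != expected:
                accepted.append(expected)  # target's token replaces the bad guess
                break                      # everything drafted after it is discarded
            accepted.append(tok)
        out.extend(accepted)
    return out


# Toy models: the draft agrees with the target except at every third position.
def target(prefix: List[int]) -> int:
    return (sum(prefix) * 31 + len(prefix)) % 50


def draft(prefix: List[int]) -> int:
    guess = target(prefix)
    return guess if len(prefix) % 3 else (guess + 1) % 50


prompt = [1, 2, 3]
spec = speculative_decode(target, draft, prompt, n_new=20)
base = greedy_decode(target, prompt, n_new=20)
print(spec == base)
```

Every emitted token is one the target itself would have produced, so the output matches plain greedy decoding; the speedup comes from accepting several drafted tokens per target pass.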

Watch out

Parallel tool calls can raise token use slightly

Running tools concurrently doesn't cut tokens — the model still processes every tool result, sometimes simultaneously, which can nudge token count up. The latency win (sum → slowest) almost always justifies it, but if you're optimizing for cost alone, measure rather than assume.

Cutting context and loop steps

The two biggest multipliers — the re-sent prefix and the number of turns — are also the two you control directly.

Prune the context. Conversation summarization, memory compaction, and pruning stale tool results cut token usage 50-80%. The ACON framework (Oct 2025) showed 26-54% peak-token reduction with no fine-tuning. Tools like LLMLingua-2 compress long prompts and RAG contexts with minimal quality loss. But beware the deeper failure: nearly 65% of enterprise AI failures in 2025 were context drift and memory loss, not raw exhaustion. Managing what is in context — relevance, staleness, redundancy — matters more than window size.
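
As a concrete starting point, the simplest pruning rule is to keep recent messages verbatim and stub out old tool results. The sketch below is an illustration under assumptions: `prune_history` is a hypothetical helper, and the message shape (dicts with "role" and "content", tool output under role "tool") follows a common chat convention rather than any specific SDK.

```python
from typing import Dict, List

Message = Dict[str, str]  # {"role": ..., "content": ...}


def prune_history(history: List[Message], keep_recent: int = 4,
                  stub: str = "[tool result elided]") -> List[Message]:
    """Keep the last keep_recent messages verbatim; stub out older tool results."""
    cutoff = max(len(history) - keep_recent, 0)
    pruned = []
    for i, msg in enumerate(history):
        if i < cutoff and msg["role"] == "tool":
            pruned.append({"role": "tool", "content": stub})  # keep the slot, drop the payload
        else:
            pruned.append(dict(msg))
    return pruned


history = [
    {"role": "user", "content": "Find flights to Lisbon"},
    {"role": "tool", "content": "<4,000 tokens of raw search results>"},
    {"role": "assistant", "content": "Cheapest is TP123 at 09:40."},
    {"role": "user", "content": "And the weather there?"},
    {"role": "tool", "content": "22C, clear"},
    {"role": "assistant", "content": "22C and clear."},
]
pruned = prune_history(history, keep_recent=3)
print([m["content"] for m in pruned if m["role"] == "tool"])
```

Leaving a stub in place, rather than deleting the message, keeps each tool call paired with a result, which some chat APIs require.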

Cap the loop. Uncontrolled retries and recursion are the number-one driver of runaway agent spend. Every production agent needs hard guardrails:

python
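MAX_ITERS, TOKEN_BUDGET = 12, 100_000
spent = 0
for i in range(MAX_ITERS):
    resp = step(history)                # step() is your own single-iteration function
    spent += resp.usage.total_tokens
    if resp.done:
        break                           # finished normally
    if spent > TOKEN_BUDGET:
        escalate_to_human(history)      # token budget exceeded
        break
else:
    escalate_to_human(history)          # iteration cap hit; don't retry forever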

When a budget is hit, escalate to a human — do not retry indefinitely. A capped, observable loop is the difference between a $0.30 task and a $30 one.

Tip

Bigger context window is not the fix

It's tempting to solve long-horizon agents by buying a larger context window. But if 65% of failures are drift and staleness, a bigger window just lets you carry more irrelevant junk. Curate aggressively; summarize old turns; drop tool results you no longer need. Context engineering beats context volume.

Try it: Profile and optimize an agent loop

Take any small tool-using agent (or the one from the 'first agent from scratch' lesson) and instrument it.

  1. Log per-task total_tokens, iterations, and wall-clock latency for five representative tasks.
  2. Add prompt caching to the stable prefix (system prompt + tool schemas) and re-measure input cost across a multi-turn session; confirm you see cache hits after the first call.
  3. Add a hard MAX_ITERS cap and a TOKEN_BUDGET, with an escalation branch instead of infinite retries.
  4. Convert any independent sequential tool calls to asyncio.gather and measure the latency change.

Report a before/after table of cost-per-task and latency-per-task. Target: at least a 40% cost reduction without changing task success on your five examples.

Key takeaways

  1. Cost and latency in agents are dominated by the large prefix re-sent every turn and the number of loop iterations, not by the final answer.
  2. Prompt caching cuts input cost 50-90%, but cache writes cost more than normal input, so it only pays off when the prefix is actually reused (conservatively, above ~1.4 hits per prefix).
  3. Routing 60-80% of traffic to cheaper models cuts cost 45-85% while retaining ~95% of frontier quality, but route by capability, not just price.
  4. Streaming improves perceived latency only; parallel tool calls and speculative decoding cut real latency 2-5x without changing outputs.
  5. Every production agent needs a token budget and iteration cap that escalate to a human, because uncontrolled loops are the top driver of runaway spend.

Quiz

Lock in what you learned

Check your understanding

1. Why does an agent loop cost so much more than a single LLM call for the same final answer?

2. When does enabling Anthropic prompt caching actually LOSE money?

3. Which statement about streaming is correct?

4. What is the single most important guardrail against runaway agent spend?
