Production Engineering for Agents/Lesson 2 of 6

Observability & Tracing

Seeing inside an agent's decisions

Advanced · 13 min · Builder

What you'll be able to do

Explain why agents are uniquely hard to debug and why standard APM tools fall short
Model an agent run as a trace of nested spans and name the data every span must capture
Apply the OpenTelemetry GenAI semantic conventions and know their current maturity and trade-offs
Choose between LangSmith, Langfuse, Arize Phoenix, and MLflow for a given team and constraint
Run the trace-to-fix loop: instrument, score, harvest failures into evals, and gate releases

At a glance

An agent can return an HTTP 200 with a confidently wrong answer: no exception, no stack trace, just quietly broken. This lesson shows you how to make nondeterministic, multi-step agents debuggable by capturing their execution as traces and spans, what to log at each model and tool call, how the OpenTelemetry GenAI conventions standardize that data, which platforms to use (LangSmith, Langfuse, Arize Phoenix, MLflow), and how to turn a failed trace into a fix that ships.

1. Why agents break in the dark
2. The mental model: traces and spans
3. What every span must capture
4. The standard: OpenTelemetry GenAI conventions
5. Choosing a platform
6. From trace to fix

Why agents break in the dark

A traditional bug announces itself: an exception, a 500, a red line in the logs. Agents are worse than that. The defining failure mode is the silent failure — the agent returns a clean HTTP 200 carrying an answer that is confidently, completely wrong. Nothing crashed. The model just chose the wrong tool, retrieved the wrong document, or hallucinated a fact, and then wrote a fluent paragraph around it.

Three properties compound to make this hard:

Nondeterminism. The same prompt can produce different tool calls on different runs. You cannot reliably reproduce the failure by re-running it.
Long, multi-step traces. A deep agent can run hundreds of steps over several minutes — model call, tool call, retrieval, reflection, repeat. The mistake might be in step 37.
Silent failures. Because there is no error, you do not even know which run went wrong unless you are scoring outputs.

The only way to debug a system like this is to record what it actually did, step by step, in a form you can inspect after the fact. That recording is a trace, and producing it is the entire discipline of agent observability.

Watch out

Standard APM is not enough

Generic distributed tracing in tools like Datadog APM, Jaeger, and Zipkin will show you HTTP latency and service maps, but on its own it has no concept of tokens, prompt/completion content, model parameters, cost, or the meaning of a tool call. You need either an LLM-specific platform or the OpenTelemetry GenAI extensions layered on top of your existing APM. Raw distributed tracing alone is not enough.

The mental model: traces and spans

Agent observability borrows its core abstraction from distributed tracing in microservices. You need exactly two nouns.

A trace is one complete agent execution — the whole run, end to end, for a single request.
A span is one discrete operation inside that trace: an LLM call, a tool invocation, a retrieval step, a state transition, or a memory read/write.

Spans nest. The top-level span is the agent invocation; inside it sits a chat span (the planning model call), which spawns an execute_tool span, which contains an HTTP span, and so on. The parent/child links let you reconstruct the exact tree of what called what, in order, with timing. This is why the unit of debugging is the trace tree, not a flat log file.

Think of it like a flame graph for reasoning. When the answer is wrong, you do not re-run the agent — you open the trace, walk down to the span where the decision went sideways (the model picked the wrong tool, the retriever returned junk), and read the exact input and output at that node. The bug stops being a mystery and becomes a coordinate.
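To make the "walk down to the span" idea concrete, here is a minimal sketch of a trace tree in plain Python. The `Span` class, its field names, and the example run are all hypothetical illustrations, not any platform's data model; real platforms store the same parent/child shape with many more fields.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, Optional


@dataclass
class Span:
    """One step in a run. Field names are illustrative, not a platform schema."""
    name: str
    input: str = ""
    output: str = ""
    children: list["Span"] = field(default_factory=list)

    def walk(self, depth: int = 0) -> Iterator[tuple[int, "Span"]]:
        """Yield (depth, span) pairs in call order (depth-first, pre-order)."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


def find_first(root: Span, is_suspect: Callable[[Span], bool]) -> Optional[list[str]]:
    """Return the path of span names down to the first suspect span, or None."""
    if is_suspect(root):
        return [root.name]
    for child in root.children:
        path = find_first(child, is_suspect)
        if path is not None:
            return [root.name] + path
    return None


# Hypothetical run: the retriever found nothing, but the model answered anyway.
trace = Span("invoke_agent", input="What is our refund window?", children=[
    Span("chat", output="call search_docs('refund window')", children=[
        Span("execute_tool search_docs", input="refund window", output="[]", children=[
            Span("http GET /search", output="200 OK, 0 hits"),
        ]),
    ]),
    Span("chat", output="Refunds are accepted within 90 days."),
])

# Print the tree with indentation showing nesting.
for depth, span in trace.walk():
    print("  " * depth + span.name)

# Locate the step where the run went wrong: a tool span with an empty result.
bad_path = find_first(
    trace, lambda s: s.name.startswith("execute_tool") and s.output == "[]"
)
print(" > ".join(bad_path))
```

The path returned by `find_first` is the "coordinate" described above: it names the exact node whose input and output you would read next.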

Key insight

The one-line mental model

Trace = one run. Span = one step. Spans nest into a tree. Every platform you will meet — LangSmith, Langfuse, Phoenix — is an opinionated UI over this same trace-of-spans structure. Learn the structure once and the tools become interchangeable.

What every span must capture

A span is only as useful as the data attached to it. Logging the final answer is not enough — the most valuable debug signal lives in intermediate steps: the tool arguments the model chose, which documents it retrieved (and which it didn't), and why it picked one action over another.

The minimum viable schema per span:

Field	What it holds / why it matters
Span type	LLM / tool / retrieval / agent / memory
Start & end timestamp	Yields latency for that step
Input	Prompt or tool arguments
Output	Completion or tool result
Token counts	Input + output tokens
Cost	Computed from tokens × model price
Error state & retry count	Distinguishes flaky from broken
Model name	Which model actually ran
Linking IDs	trace ID, parent span ID, session ID, user ID

The linking IDs are what stitch isolated spans into a navigable tree and let you group a multi-turn conversation (a session) or trace a complaint back to a specific user.
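The schema above could be sketched as a single record type. This is an assumption-laden illustration: the class name `SpanRecord`, the field names, the `example-model` identifier, and the prices in `PRICE_PER_MTOK` are all made up for the example, so substitute your platform's schema and your provider's real rates.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical prices in USD per 1M tokens; replace with your provider's rates.
PRICE_PER_MTOK = {"example-model": {"input": 2.00, "output": 8.00}}


@dataclass
class SpanRecord:
    span_type: str                      # "llm" | "tool" | "retrieval" | "agent" | "memory"
    start_s: float                      # start timestamp, seconds
    end_s: float                        # end timestamp, seconds
    input: str                          # prompt or tool arguments
    output: str                         # completion or tool result
    trace_id: str                       # which run this span belongs to
    span_id: str
    parent_span_id: Optional[str] = None  # None for the root span
    session_id: Optional[str] = None      # groups a multi-turn conversation
    user_id: Optional[str] = None
    model: Optional[str] = None           # only meaningful for LLM spans
    input_tokens: int = 0
    output_tokens: int = 0
    error: Optional[str] = None
    retry_count: int = 0

    @property
    def latency_s(self) -> float:
        """Latency falls out of the two timestamps."""
        return self.end_s - self.start_s

    @property
    def cost_usd(self) -> float:
        """Cost = tokens x model price; 0.0 when the model has no known price."""
        price = PRICE_PER_MTOK.get(self.model or "")
        if price is None:
            return 0.0
        total = self.input_tokens * price["input"] + self.output_tokens * price["output"]
        return total / 1_000_000


span = SpanRecord(
    span_type="llm", start_s=10.0, end_s=11.5,
    input="Plan the next step", output="Call the search tool",
    trace_id="t1", span_id="s2", parent_span_id="s1",
    session_id="sess-9", user_id="u-42",
    model="example-model", input_tokens=1_000, output_tokens=500,
)
print(span.latency_s, span.cost_usd)
```

Deriving latency and cost from stored fields, rather than storing them separately, keeps the record consistent if prices are corrected later.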

For high-volume production you rarely keep full detail on everything: log baseline metrics (tokens, cost, latency) for 100% of requests, but capture full-detail traces for only 10–20% — using tail-based sampling so you always keep every error trace.
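A tail-based sampling decision is made after the run finishes, which is what lets it see whether an error occurred. The sketch below is one simple way to do that, assuming a hypothetical `keep_full_trace` helper and a 15% default rate chosen from the range above; production collectors offer richer policies.

```python
import hashlib


def keep_full_trace(trace_id: str, had_error: bool, sample_rate: float = 0.15) -> bool:
    """Decide, after the run completes, whether to keep the full-detail trace."""
    # Error traces are always kept, regardless of the sample rate.
    if had_error:
        return True
    # Hash the trace ID into [0, 1) so every process makes the same decision
    # for the same trace, without coordinating.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < sample_rate


# Baseline metrics (tokens, cost, latency) would still be logged for every run;
# this only gates the expensive full-detail payload.
print(keep_full_trace("trace-123", had_error=True))
```

Hashing the trace ID instead of calling a random number generator keeps the decision reproducible, which helps when several services each hold part of the same trace.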

Tip

Tracing won't slow you down

The old fear that tracing adds latency is outdated. Langfuse and LangSmith both export spans via async background batching, so the application path is barely touched. Only naive synchronous export blocks the request — avoid that one pattern and the overhead is negligible.
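The background-batching pattern can be sketched with the standard library alone. This is not how Langfuse or LangSmith implement their exporters; `BatchingExporter` and its parameters are hypothetical, and the sketch only shows the shape of the idea: the request path enqueues and returns, and a worker thread ships batches.

```python
import queue
import threading
from typing import Callable


class BatchingExporter:
    """Queue spans on the request path; send them in batches from a worker thread."""

    def __init__(self, send_batch: Callable[[list], None],
                 max_batch: int = 50, flush_interval_s: float = 0.5) -> None:
        self._send_batch = send_batch
        self._max_batch = max_batch
        self._flush_interval_s = flush_interval_s
        self._queue: queue.Queue = queue.Queue()
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def record(self, span: dict) -> None:
        """Called by application code: enqueue and return immediately."""
        self._queue.put_nowait(span)

    def _drain(self) -> list:
        """Pull up to max_batch spans off the queue without blocking."""
        batch = []
        while len(batch) < self._max_batch:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return batch

    def _flush(self) -> None:
        batch = self._drain()
        while batch:
            self._send_batch(batch)
            batch = self._drain()

    def _run(self) -> None:
        while not self._stop.is_set():
            # Sleep until the next flush, or wake early on shutdown.
            self._stop.wait(self._flush_interval_s)
            self._flush()

    def shutdown(self) -> None:
        """Stop the worker, then flush anything still queued."""
        self._stop.set()
        self._worker.join()
        self._flush()


sent_batches: list = []
exporter = BatchingExporter(sent_batches.append, max_batch=3, flush_interval_s=0.01)
for i in range(7):
    exporter.record({"span_id": i})
exporter.shutdown()
print([len(batch) for batch in sent_batches])
```

Calling `shutdown()` before the process exits matters in short-lived jobs; otherwise the last batch can be lost with the daemon thread.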

The standard: OpenTelemetry GenAI conventions

Imagine two tools both record "which model ran," but one calls the field model_name and the other calls it llm.model. Now your dashboards, your queries, and your switch-to-a-new-vendor plan all break on a naming mismatch. A semantic convention is just an agreed-upon dictionary of field names so this never happens — everyone writes gen_ai.request.model and everything interoperates.

To avoid every vendor inventing its own names, the industry is converging on the OpenTelemetry GenAI Semantic Conventions (OTel GenAI SemConv) — a CNCF-backed, vendor-neutral standard for LLM and agent telemetry. (CNCF, the Cloud Native Computing Foundation, is the open-source body that also stewards Kubernetes, so this is an industry standard, not one company's API.) Standardizing the names means you can switch backends without re-instrumenting.
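The naming-mismatch problem can be seen in a few lines. In this sketch, `model_name` and `llm.model` come from the example above, while `prompt_tokens`, `completion_tokens`, the `ALIASES` table, and the `normalize` helper are hypothetical; it illustrates the kind of glue code a shared convention makes unnecessary.

```python
# Map ad-hoc field names onto the shared GenAI convention names.
ALIASES = {
    "model_name": "gen_ai.request.model",
    "llm.model": "gen_ai.request.model",
    "prompt_tokens": "gen_ai.usage.input_tokens",
    "completion_tokens": "gen_ai.usage.output_tokens",
}


def normalize(attributes: dict) -> dict:
    """Rename known aliases; leave unknown keys unchanged."""
    return {ALIASES.get(key, key): value for key, value in attributes.items()}


tool_a = {"model_name": "gpt-4o", "prompt_tokens": 120}
tool_b = {"llm.model": "gpt-4o", "completion_tokens": 45}

# After normalization both tools can be queried with the same field names.
print(normalize(tool_a))
print(normalize(tool_b))
```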

Key attribute and operation names:

text

# Attributes
gen_ai.request.model            -> "gpt-4o" / "claude-sonnet-..."
gen_ai.usage.input_tokens       -> prompt tokens
gen_ai.usage.output_tokens      -> completion tokens
gen_ai.response.finish_reasons  -> ["stop"], ["tool_calls"]
gen_ai.provider.name            -> "openai" / "anthropic"

# Span operation names
invoke_agent | chat | execute_tool | invoke_workflow

Provider-specific extras are gated by gen_ai.provider.name — e.g. OpenAI adds gen_ai.usage.cache_read.input_tokens and gen_ai.usage.reasoning.output_tokens for o-series models. Since v1.39, the conventions also cover MCP (mcp.method.name, mcp.session.id, mcp.protocol.version) and use W3C Trace Context propagation so an agent trace and the MCP server's trace are linked rather than two disconnected islands.
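As a rough illustration of how the names listed above fit together, the sketch below builds the attribute dictionary for one chat span using only the standard library. The helper names `genai_chat_attributes` and `span_name` are hypothetical, and the "operation name followed by model" span-name pattern reflects one reading of the conventions, so check the spec version you pin.

```python
from typing import Sequence


def genai_chat_attributes(provider: str, request_model: str,
                          input_tokens: int, output_tokens: int,
                          finish_reasons: Sequence[str]) -> dict:
    """Collect GenAI convention attributes for a single model call."""
    return {
        "gen_ai.operation.name": "chat",
        "gen_ai.provider.name": provider,
        "gen_ai.request.model": request_model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "gen_ai.response.finish_reasons": list(finish_reasons),
    }


def span_name(attributes: dict) -> str:
    """Assumed naming pattern: '<operation> <model>'."""
    return f'{attributes["gen_ai.operation.name"]} {attributes["gen_ai.request.model"]}'


attrs = genai_chat_attributes(
    provider="openai", request_model="gpt-4o",
    input_tokens=812, output_tokens=96, finish_reasons=["tool_calls"],
)
print(span_name(attrs))
print(attrs)
```

With the OpenTelemetry Python API installed, a dictionary like this could be attached to an active span with `span.set_attributes(attrs)`; the sketch stops short of that so it runs without extra dependencies.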

Watch out

These conventions are still experimental

As of v1.41 (mid-2026) the GenAI conventions are still in Development status — not stable. Attribute names can change between minor versions. Opt in explicitly with OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental and be ready to update instrumentation when you bump versions.
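One way to opt in from Python, sketched under two assumptions: that the variable holds a comma-separated list of opt-in values (so existing entries should be preserved), and that instrumentation reads it at start-up (so this must run before instrumentation is initialized). The helper `enable_genai_semconv` is hypothetical.

```python
import os
from typing import MutableMapping

OPT_IN_VAR = "OTEL_SEMCONV_STABILITY_OPT_IN"
GENAI_OPT_IN = "gen_ai_latest_experimental"


def enable_genai_semconv(env: MutableMapping[str, str] = os.environ) -> str:
    """Add the GenAI opt-in value without clobbering values already set."""
    current = [v.strip() for v in env.get(OPT_IN_VAR, "").split(",") if v.strip()]
    if GENAI_OPT_IN not in current:
        current.append(GENAI_OPT_IN)
    env[OPT_IN_VAR] = ",".join(current)
    return env[OPT_IN_VAR]


# Call this before any instrumentation library is imported or initialized.
print(enable_genai_semconv())
```

Setting the variable in the deployment environment instead is equally valid; the point is to make the opt-in explicit and reviewable rather than implicit.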

Choosing a platform

The conventions decide what names to record; a platform is the product that collects, stores, and shows you those traces in a UI built for LLMs — searchable trace trees, token and cost rollups, side-by-side run comparison. You could write traces to a generic database, but these platforms save you from building the viewer yourself. Four matter:

LangSmith (LangChain, commercial + on-prem). A three-tier model: runs (individual steps) → traces (one end-to-end execution) → threads (a full conversation). In 2025 it added Polly, an AI assistant that answers questions such as "why did the agent loop?" from trace data, and langsmith-fetch, a CLI for debugging traces from inside coding agents. Despite the name, it works with any framework via its SDK, not just LangChain.
Langfuse (open-source, YC W23, self-hostable). Built on OpenTelemetry. Hierarchy: Sessions → Traces → Observations (Generations, Spans, Events). Strong combo of prompt management + eval + tracing; added full LLM-as-judge execution tracing in Oct 2025 so you can audit the evaluator too.
Arize Phoenix (fully open-source, self-hostable, no feature gates). The only major platform built entirely on OpenTelemetry with no proprietary tracing layer, using OpenInference conventions and 10 span kinds including GUARDRAIL and EVALUATOR. Best for eval rigor and data sovereignty. (Managed tier is Arize AX.)
MLflow Tracing. As of 3.6.0 it has full OTel support and GenAI SemConv compatibility — a natural fit for teams already using MLflow for model lifecycle.

A second convention layer, OpenInference (Arize-originated), extends OTel with richer RAG metadata and explicit retriever/reranker/embedding span types; Phoenix uses it, and many tools emit both.

Key insight

Lock-in is a function of how you instrument

Instrument with vendor-neutral OTel GenAI SemConv and you can swap backends freely. Instrument with proprietary decorators (LangSmith-only @traceable, Langfuse-only annotations) and switching means re-instrumenting every span. Choose the convention deliberately — it is a more durable decision than choosing the UI.
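One way to keep that decision reversible is to route all instrumentation through a single seam that emits convention-named attributes, so only the sink changes when the backend does. Everything in this sketch is hypothetical (`SpanSink`, `InMemorySink`, `traced`, and the `duration_s` key); `gen_ai.operation.name`, `gen_ai.request.model`, and `error.type` follow the OTel naming discussed earlier.

```python
import functools
import time
from typing import Callable, Optional, Protocol


class SpanSink(Protocol):
    """Anything that can receive a finished span; one adapter per backend."""
    def emit(self, name: str, attributes: dict) -> None: ...


class InMemorySink:
    """Stand-in backend for tests and local runs."""
    def __init__(self) -> None:
        self.spans: list = []

    def emit(self, name: str, attributes: dict) -> None:
        self.spans.append((name, attributes))


def traced(sink: SpanSink, operation: str, model: Optional[str] = None) -> Callable:
    """Decorator that records one span per call using neutral attribute names."""
    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            attributes = {"gen_ai.operation.name": operation}
            if model is not None:
                attributes["gen_ai.request.model"] = model
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            except Exception as exc:
                attributes["error.type"] = type(exc).__name__
                raise
            finally:
                # Emit on success and on failure alike.
                attributes["duration_s"] = time.perf_counter() - start
                sink.emit(f"{operation} {fn.__name__}", attributes)
        return wrapper
    return decorator


sink = InMemorySink()


@traced(sink, "execute_tool")
def lookup_order(order_id: str) -> dict:
    return {"order_id": order_id, "status": "shipped"}


lookup_order("A-17")
print(sink.spans[0][0])
```

Application code depends only on `traced`; swapping vendors means writing a new sink, not touching every decorated function.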

From trace to fix

Tracing is not an end in itself — its payoff is a closed loop that turns reactive debugging into continuous improvement. The trace-to-fix workflow has four stages:

Instrument spans at every LLM, tool, and retrieval step (see the schema above).
Score traces online with LLM-as-judge or rule-based scorers — correctness, groundedness, did-it-finish.
Harvest failures into evals. Every failed or low-scoring trace becomes a new entry in your eval dataset. Real production failures are the highest-signal test cases you will ever get.
Gate releases in CI. Run the eval set on every change; block any release that degrades a quality metric.

This is the bridge from the Evaluating Agents lesson: traces are how you find the failures, evals are how you prevent their return.
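The score, harvest, and gate stages can be sketched in a few functions. The trace dictionaries, the scoring rule ("retrieved at least one document and finished normally"), and the function names are all assumptions made for illustration; a real pipeline would use your platform's trace export and a scorer tuned to your task.

```python
def score_trace(trace: dict) -> float:
    """Rule-based scorer: 1.0 if the run retrieved something and finished cleanly."""
    retrieved = bool(trace.get("retrieved_docs"))
    finished = trace.get("finish_reason") == "stop"
    return 1.0 if (retrieved and finished) else 0.0


def harvest_failures(traces: list, threshold: float = 0.5) -> list:
    """Turn every low-scoring production trace into an eval case."""
    return [
        {"input": t["input"], "source_trace_id": t["trace_id"]}
        for t in traces
        if score_trace(t) < threshold
    ]


def gate_release(baseline_pass_rate: float, candidate_pass_rate: float,
                 tolerance: float = 0.0) -> bool:
    """CI gate: allow the release only if quality did not degrade."""
    return candidate_pass_rate >= baseline_pass_rate - tolerance


production_traces = [
    {"trace_id": "t1", "input": "What is the refund window?",
     "retrieved_docs": ["policy.md"], "finish_reason": "stop"},
    {"trace_id": "t2", "input": "How long is the warranty?",
     "retrieved_docs": [], "finish_reason": "stop"},
]

new_eval_cases = harvest_failures(production_traces)
print(new_eval_cases)
print(gate_release(baseline_pass_rate=0.90, candidate_pass_rate=0.85))
```

In CI, the boolean from `gate_release` would typically become the job's exit status, so a regression blocks the merge instead of producing a warning.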

A minimal instrumented step looks like this:

python

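from langfuse import get_client, observe
from openai import OpenAI

MODEL = "gpt-4o"

client = OpenAI()        # reads OPENAI_API_KEY from the environment
langfuse = get_client()  # reads Langfuse credentials from the environment


@observe(as_type="generation")
def plan_step(user_goal: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": user_goal}],
    )
    # Attach model and token counts to the current generation span.
    # Method name as in the Langfuse Python SDK v3; other SDK versions
    # name this call differently.
    langfuse.update_current_generation(
        model=MODEL,
        usage_details={
            "input": resp.usage.prompt_tokens,
            "output": resp.usage.completion_tokens,
        },
    )
    # content can be None (for example on a tool-call response).
    return resp.choices[0].message.content or ""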

The decorator opens a span, records input and output automatically, and links it to its parent — turning an opaque step into a coordinate you can inspect when the answer is wrong.

Example

Multi-agent: don't let sub-agents go dark

When an orchestrator hands off to a sub-agent, you must propagate parent-child span context across the handoff boundary (W3C Trace Context headers carry it over HTTP). Skip this and a failure inside the sub-agent appears to the orchestrator's trace as an unexplained black box — exactly the silent failure you built tracing to eliminate.
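The handoff can be sketched with the W3C `traceparent` header, whose format is `version-traceid-parentid-flags` (`00`, then 32 hex characters, 16 hex characters, and 2 hex characters). The helpers `inject` and `start_child_span` are hypothetical and cover only the happy path; in practice an OpenTelemetry SDK propagator does this for you.

```python
import re
import secrets

# version "00", 32-hex trace ID, 16-hex parent span ID, 2-hex flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(trace_id: str, span_id: str, sampled: bool = True) -> dict:
    """Orchestrator side: put the current span's context into outgoing headers."""
    flags = "01" if sampled else "00"
    return {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}


def start_child_span(headers: dict) -> dict:
    """Sub-agent side: continue the caller's trace, or start a new one."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        # No usable context: this is the "disconnected island" case.
        return {"trace_id": secrets.token_hex(16),
                "parent_span_id": None,
                "span_id": secrets.token_hex(8)}
    trace_id, parent_span_id, _flags = match.groups()
    return {"trace_id": trace_id,
            "parent_span_id": parent_span_id,
            "span_id": secrets.token_hex(8)}


headers = inject("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
child = start_child_span(headers)
print(headers["traceparent"])
print(child["trace_id"], child["parent_span_id"])
```

Because the sub-agent's span carries the orchestrator's trace ID and names the orchestrator's span as its parent, both halves of the run appear in one tree.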

Try it: Instrument and break your own agent

Take any small tool-using agent you've built (or a 30-line ReAct loop with a calculator and a web-search tool).

1) Add tracing with Langfuse or Arize Phoenix (both are free and self-hostable); wrap each model call, tool call, and retrieval in a span.
2) Run five varied requests and open the trace tree for each. Confirm every span shows input, output, token counts, latency, and is linked to its parent.
3) Now induce a silent failure: give the agent a tool that returns plausible-but-wrong data (e.g. a calculator that's off by one), ask a question that depends on it, and watch it return a confident, wrong 200 OK answer.
4) Open the trace and find the exact span where the wrong value entered the reasoning.
5) Write a one-line LLM-as-judge or rule-based scorer that would have flagged that run, and turn the failing input into a single eval case.

You've just run the full trace-to-fix loop end to end.
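If you want a starting point for the silent-failure part of the exercise, the toy below fakes the agent entirely: no model is called, and `off_by_one_add`, `run_agent`, and `first_bad_span` are hypothetical names. It only shows the shape of a broken tool, a clean-looking response, and a rule that pinpoints the offending span.

```python
from typing import Optional


def off_by_one_add(a: int, b: int) -> int:
    """Deliberately broken tool: plausible output, wrong value."""
    return a + b + 1


def run_agent(a: int, b: int) -> dict:
    """Toy 'agent': call the tool, then wrap the result in a fluent answer."""
    spans = []
    result = off_by_one_add(a, b)
    spans.append({"name": "execute_tool add",
                  "input": {"a": a, "b": b}, "output": result})
    answer = f"{a} + {b} = {result}"
    spans.append({"name": "chat",
                  "input": f"tool result: {result}", "output": answer})
    # Nothing raised, so the caller sees a normal success response.
    return {"status": 200, "answer": answer, "spans": spans}


def first_bad_span(trace: dict) -> Optional[int]:
    """Rule-based check: recompute each add and report the first mismatch."""
    for index, span in enumerate(trace["spans"]):
        if span["name"] == "execute_tool add":
            expected = span["input"]["a"] + span["input"]["b"]
            if span["output"] != expected:
                return index
    return None


trace = run_agent(2, 2)
print(trace["status"], trace["answer"])
print("first bad span index:", first_bad_span(trace))

# The failing input becomes a permanent eval case.
eval_case = {"input": {"a": 2, "b": 2}, "expected": 4}
```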

Key takeaways

1. Agents fail silently: a 200 OK can carry a wrong answer. Combined with nondeterminism and long multi-step runs, that makes recorded traces the only viable debugging tool.
2. Model every run as a trace of nested spans, and capture per span: type, timing, input, output, token counts, cost, errors, model name, and linking IDs.
3. OpenTelemetry GenAI Semantic Conventions are the vendor-neutral standard (covering MCP since v1.39) but remain experimental. Set OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental and expect attribute names to change between minor versions.
4. LangSmith, Langfuse, Arize Phoenix, and MLflow are the leading platforms; pick by ecosystem, self-hosting needs, and eval rigor, and instrument with neutral conventions to avoid lock-in.
5. The trace-to-fix loop (instrument, score, harvest failures into evals, gate CI) converts reactive debugging into continuous improvement.

Quiz

Lock in what you learned

Check your understanding

1. What makes agent failures uniquely hard to detect compared with traditional software bugs?

2. In the trace/span model, what is a 'span'?

3. What is the current status of the OpenTelemetry GenAI Semantic Conventions as of v1.41 (mid-2026)?

4. Which sequence correctly describes the trace-to-fix workflow?

Go deeper

Hand-picked sources to keep learning

AI Agent Observability — Evolving Standards and Best Practices (OpenTelemetry)

Official OTel perspective on agent observability and the GenAI conventions.

OpenTelemetry — Semantic Conventions for GenAI Spans

Official specification for span attributes, operation names, provider extensions, and the experimental stability status of the GenAI conventions.

Langfuse — Observability Data Model

The Sessions → Traces → Observations model and OTel integration, from the open-source platform's docs.

Braintrust — Agent Observability: The Complete Guide for 2026

Span types, the minimum viable schema, a tool comparison, and the trace-to-fix pipeline.

Arthur AI — Best Practices for Building Agents, Part 1: Observability and Tracing

Production engineering view: the five areas to instrument and how to build eval sets from trace data.

What is Arize Phoenix?

Docs for the fully open-source, OTel + OpenInference observability platform with its 10 span kinds.