Structured Outputs & Reliability
Making model output safe to program against
- Explain why free-text output breaks programmatic pipelines and what 'reliable' output really means
- Distinguish JSON mode (valid syntax) from structured outputs / constrained decoding (schema-guaranteed) and pick the right one
- Define an output schema with JSON Schema, Pydantic, or Zod and bind it to a provider
- Implement a validate-and-retry loop that feeds the validator's error back to the model for self-correction
- Reason about provider support, schema limitations, and the trade-offs of each approach
An agent's output is useless if your code can't reliably parse it. This lesson shows how to turn an LLM from something that emits prose into something that emits data your program can trust — moving from hopeful prompting to JSON mode, to constrained decoding that mathematically guarantees a schema, and finally to validation-and-retry as a safety net. By the end you'll know which guarantee each technique gives you and how to wire it into a real pipeline.
- 1. Why free text breaks pipelines
- 2. JSON mode vs. structured outputs
- 3. How constrained decoding works
- 4. Defining schemas: JSON Schema, Pydantic, Zod
- 5. Validation and self-correcting retries
- 6. Trade-offs and provider support
Why free text breaks pipelines
An agent is a program, and programs pass data between steps. The moment you need the model's output to drive the next step — populate a database row, choose a branch, call another tool — free text becomes a liability. You asked for a category and a confidence score; the model replied: "Sure! This ticket looks like a billing issue, I'd say ~85% confident." Now you're writing a regex to scrape billing and 85 out of a sentence that could be phrased a hundred ways tomorrow.
The core problem is a mismatch of contracts. Your code expects a fixed shape — keys, types, allowed values. A language model, left to its own devices, produces plausible text, not parseable data. It may add a friendly preamble, wrap JSON in markdown fences, rename a field, return a number as a word, or omit a key entirely.
In an agent loop these small failures compound. A 3% parse-failure rate per step sounds tolerable until a ten-step task fails roughly a quarter of the time. Structured outputs exist to replace hope with a contract — so the rest of your system can treat the model like any other typed API.
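The compounding is easy to check with a few lines of arithmetic. This is a minimal sketch that assumes every step fails independently with the same probability, which is a simplification; real failures are often correlated. The function name task_failure_rate is made up for this example.

```python
def task_failure_rate(per_step_failure: float, steps: int) -> float:
    """Probability that at least one of `steps` independent steps fails."""
    return 1 - (1 - per_step_failure) ** steps


# Print the whole-task failure rate for a 3% per-step parse-failure rate.
for steps in (1, 5, 10, 20):
    print(f"{steps:>2} steps: {task_failure_rate(0.03, steps):.1%}")
```

With a 3% per-step rate, ten steps come out at about 26%, which is the "roughly a quarter" quoted above.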
Key insight
Schema conformance ≠ correctness
Structured outputs guarantee the shape of the answer, not its truth. A model can return perfectly valid JSON whose category is wrong or whose number is hallucinated. Schema enforcement and factual accuracy are orthogonal — you still need evals and validation logic for the values themselves.
JSON mode vs. structured outputs
Both features promise "JSON output," but they make two very different promises — and conflating them is the most common mistake in this area. Think of it as the difference between asking someone to write in English versus asking them to fill out this exact form: the first constrains the language, the second constrains the shape.
JSON mode (e.g. response_format={"type": "json_object"}) is soft guidance. It guarantees the output is syntactically valid JSON — it will parse — but says nothing about its shape. The model can return {"foo": "bar"} when you expected {"category": str, "confidence": float}. Your json.loads() succeeds; your schema check still fails. OpenAI introduced JSON mode in November 2023.
Structured outputs (a.k.a. constrained or grammar-based decoding) is a hard constraint. The provider compiles your JSON Schema into a token-level grammar and, at every generation step, masks out any token that would violate the schema. The model cannot emit an invalid field name or a wrong-typed value. This is a guarantee enforced by the decoder, not a best-effort instruction, although it only covers responses that run to completion: output cut off by a token limit, or a refusal, can still fall outside the schema. OpenAI shipped this in August 2024 (response_format={"type": "json_schema", "json_schema": {..., "strict": true}}; note that strict sits inside the json_schema object).
| Feature | Valid JSON? | Matches your schema? | Mechanism |
|---|---|---|---|
| JSON mode | Yes | No guarantee | Instruction / fine-tuning |
| Structured outputs | Yes | Guaranteed | Token masking at decode time |
Use JSON mode only when any JSON will do. For anything your code branches on, use structured outputs.
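To make the difference concrete, here is a sketch of the two response_format payloads as they appear in OpenAI's Chat Completions API. The schema name "ticket" and the three fields are illustrative choices for this lesson, not anything the API prescribes. The strict flag lives inside the json_schema object.

```python
import json

# JSON mode: valid JSON is guaranteed, the shape is not.
json_mode = {"type": "json_object"}

# Structured outputs: the schema itself travels with the request.
# "ticket" is an arbitrary name chosen for this example.
structured = {
    "type": "json_schema",
    "json_schema": {
        "name": "ticket",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "category": {
                    "type": "string",
                    "enum": ["billing", "technical", "account"],
                },
                "confidence": {"type": "number"},
                "needs_human": {"type": "boolean"},
            },
            # Strict mode expects every property to be listed as required
            # and additionalProperties to be false.
            "required": ["category", "confidence", "needs_human"],
            "additionalProperties": False,
        },
    },
}

print(json.dumps(structured, indent=2))
```

Either dictionary would be passed as the response_format argument of a chat completion request; only the second one tells the decoder which keys and types are legal.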
Watch out
JSON mode does not enforce your schema
Empirically, JSON-mode parse-against-schema failures run ~2–5% on strong models and 5–12% on weaker ones (e.g. DeepSeek). Full structured outputs push that below ~0.1–0.3%. If you only need valid JSON, JSON mode is fine; if you need your JSON, you need constrained decoding.
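The gap between "parses" and "matches your schema" can be seen without calling any model. The sketch below uses only the standard library. check_ticket is a hypothetical hand-rolled validator standing in for Pydantic or Zod, and the three raw strings are made-up examples of model replies.

```python
import json

ALLOWED = {"billing", "technical", "account"}


def check_ticket(raw: str) -> str:
    """Classify a raw model reply as 'ok', 'parse_error', or 'schema_error'."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return "parse_error"
    if not isinstance(data, dict):
        return "schema_error"
    category = data.get("category")
    conf = data.get("confidence")
    ok = (
        isinstance(category, str)
        and category in ALLOWED
        # bool is a subclass of int in Python, so exclude it explicitly
        and isinstance(conf, (int, float))
        and not isinstance(conf, bool)
        and 0 <= conf <= 1
        and isinstance(data.get("needs_human"), bool)
    )
    return "ok" if ok else "schema_error"


replies = [
    'Sure! {"category": "billing"}',  # prose preamble: does not parse
    '{"foo": "bar"}',  # valid JSON, wrong shape
    '{"category": "billing", "confidence": 0.85, "needs_human": false}',
]
for reply in replies:
    print(check_ticket(reply), "<-", reply)
```

JSON mode rules out the first kind of failure but not the second. Only schema enforcement, or a validator like this one, catches the wrong-shape case.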
How constrained decoding works
It helps to picture what happens during generation. At each step an LLM produces a probability distribution over its entire vocabulary (~100k+ tokens) and samples one. Constrained decoding inserts a gate: given the schema and the tokens generated so far, it computes which next tokens are grammatically legal and sets the probability of every illegal token to zero before sampling.
Concretely, if the schema says the next key must be "confidence" and a number must follow, the decoder will not allow the model to open a stray field or emit a letter where a digit belongs. The schema is compiled once into a finite-state machine / grammar; the per-token check is a fast mask.
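The gate can be illustrated with a toy decoder. This is a simplified sketch, not how any production engine is implemented: the eight-token vocabulary, the hand-written state machine, and the fixed logits are all invented for the example, and it picks the best legal token greedily instead of sampling.

```python
import math

VOCAB = ['{', '}', '"confidence"', '"note"', ':', '0.85', 'high', ',']

# Hand-written grammar for the one-field object {"confidence": <number>}:
# state -> {legal token: next state}. A real engine compiles this from the schema.
GRAMMAR = {
    "start": {'{': "key"},
    "key": {'"confidence"': "colon"},
    "colon": {':': "value"},
    "value": {'0.85': "close"},
    "close": {'}': "done"},
}

# Fake model scores, one per vocabulary entry. Left alone, this "model"
# always prefers the stray key '"note"' and the word 'high'.
LOGITS = [0.1, 0.2, 1.0, 3.0, 0.5, 1.5, 2.5, 0.0]


def masked_argmax(logits, state):
    """Pick the best-scoring token among those the grammar allows in `state`."""
    legal = GRAMMAR[state]
    # A logit of -inf is a probability of zero after softmax.
    masked = [score if tok in legal else -math.inf
              for tok, score in zip(VOCAB, logits)]
    best = max(range(len(VOCAB)), key=masked.__getitem__)
    return VOCAB[best]


def generate(logits_per_step):
    """Decode step by step, masking illegal tokens before each choice."""
    state, out = "start", []
    for logits in logits_per_step:
        tok = masked_argmax(logits, state)
        out.append(tok)
        state = GRAMMAR[state][tok]
        if state == "done":
            break
    return "".join(out)


print(generate([LOGITS] * 5))
```

Even though the raw scores favor an illegal token at every step, the mask leaves only schema-legal choices, so the decoded string is a well-formed object with the expected key.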
For self-hosted inference this is what libraries like XGrammar (the default backend in vLLM, SGLang, and TensorRT-LLM) and Outlines do. XGrammar uses vocabulary partitioning and cached token masks to add under ~40 microseconds per token.
A useful consequence: constrained decoding need not cost latency, and it can be faster than free generation. Some benchmarks have shown speedups of up to ~50%, and up to 100x over naive grammar approaches. Engines get there by caching token masks and by skipping ahead through tokens the grammar fully determines (fixed keys, quotes, braces), so the intuition that "adding a constraint must slow things down" often fails here.
Tip
Faster, not slower
Don't avoid structured outputs for fear of latency. The per-token mask is cheap, and engines can skip over tokens the schema fully determines. Treat any speedup as a bonus on top of the reliability win, and measure on your own workload.
Defining schemas: JSON Schema, Pydantic, Zod
Under the hood every provider speaks JSON Schema — the standard vocabulary for describing object shapes (type, properties, required, enum). You can hand-write it, but in practice two libraries have become the default authoring layer.
In Python, use Pydantic (v2). You declare a model as a class; the SDK derives the JSON Schema for you and parses the response back into a typed object.
```python
from typing import Literal

from openai import OpenAI
from pydantic import BaseModel, Field


class Ticket(BaseModel):
    # Literal becomes a JSON Schema enum, so only these three values are legal.
    category: Literal["billing", "technical", "account"]
    confidence: float = Field(ge=0, le=1)  # range is also checked by Pydantic on parse
    needs_human: bool


client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Older SDK releases expose this helper as client.beta.chat.completions.parse.
resp = client.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": "Card declined twice today."}],
    response_format=Ticket,  # Pydantic model → JSON Schema, strict
)
ticket = resp.choices[0].message.parsed  # a Ticket instance, or None if the model refused
```

In TypeScript, the equivalent is Zod, used by the Vercel AI SDK (generateObject), the OpenAI SDK, and Anthropic's zodOutputFormat(). One schema definition gives you runtime validation and a static type. Define the schema once, in your language's idiom, and let the SDK translate to JSON Schema and back.
Example
Anthropic structured outputs (2025+)
Claude gained native structured outputs, announced in public beta on November 14, 2025 and now generally available. Use client.messages.parse() with a Pydantic model (Python) or zodOutputFormat() (TS), via output_config.format. Before this, the standard pattern was to define a tool whose input schema matched the desired output and force the call with tool_choice. That is a workaround, not the same feature; don't conflate the two.
Validation and self-correcting retries
Even with native structured outputs you should validate — and when you're on older JSON mode, smaller models, or a provider without hard constraints, validation-and-retry is your primary defense. The pattern is simple and effective: parse, and on failure send the exact validator error back to the model so it can fix its own mistake.
The key insight: a bare retry ("try again") barely helps; a retry with the precise error message lets the model self-correct.
```python
from pydantic import ValidationError


def extract(messages, max_retries=2):
    """Validate the model's JSON against Ticket, feeding errors back on failure."""
    messages = list(messages)  # copy so the caller's list is not mutated
    for _ in range(max_retries + 1):
        raw = call_model(messages)  # placeholder for your provider call; returns a JSON string
        try:
            return Ticket.model_validate_json(raw)
        except ValidationError as e:
            # Show the model its own output plus the exact validator error.
            messages += [
                {"role": "assistant", "content": raw},
                {"role": "user",
                 "content": f"That failed validation:\n{e}\nReturn corrected JSON only."},
            ]
    raise RuntimeError("schema conformance failed after retries")
```

This is exactly what the Instructor library automates (Pydantic/Zod + auto-retry, on top of any provider). Cap retries at 2–3, then escalate or fall back to a simpler schema; an unbounded loop just burns tokens. Instructor sits on top of native structured outputs: it uses them when available and falls back to prompt-based coercion when not.
Tip
Feed back the error, not a vague nudge
Retry loops that don't pass the validator's exact message back are largely ineffective. The error text ("confidence: input should be ≤ 1") is the signal the model needs to repair its output.
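To watch the loop run end to end without an API key, here is a stdlib-only simulation. Everything in it is a stand-in: fake_model is a scripted stub that answers badly at first and correctly once an error message appears in the conversation, and validate is a hypothetical minimal validator used in place of Pydantic. It shows the control flow only and says nothing about how often a real model repairs its output.

```python
import json


def validate(raw: str) -> dict:
    """Return the parsed ticket or raise ValueError with a precise message."""
    data = json.loads(raw)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("top level: input should be an object")
    conf = data.get("confidence")
    if isinstance(conf, bool) or not isinstance(conf, (int, float)):
        raise ValueError("confidence: input should be a number")
    if not 0 <= conf <= 1:
        raise ValueError("confidence: input should be between 0 and 1")
    return data


def fake_model(messages: list) -> str:
    """Scripted stub: replies 85 at first, 0.85 once an error was fed back."""
    saw_error = any("failed validation" in m["content"] for m in messages)
    return '{"confidence": 0.85}' if saw_error else '{"confidence": 85}'


def extract_with_retry(messages, call_model, max_retries=2):
    """Parse, and on failure append the exact validator error before retrying."""
    messages = list(messages)  # work on a copy
    for _ in range(max_retries + 1):
        raw = call_model(messages)
        try:
            return validate(raw)
        except ValueError as e:
            messages += [
                {"role": "assistant", "content": raw},
                {"role": "user",
                 "content": f"That failed validation:\n{e}\nReturn corrected JSON only."},
            ]
    raise RuntimeError("schema conformance failed after retries")


start = [{"role": "user", "content": "Card declined twice today."}]
print(extract_with_retry(start, fake_model))
```

Passing the model call in as a parameter keeps the loop testable: the same function works with a stub here and with a real provider call in production.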
Trade-offs and provider support
The good news: you rarely have to build constrained decoding yourself anymore. As of late 2025 all three major API providers offer native, schema-guaranteed structured outputs — you pass a schema and they handle the token masking. The catch is that each supports a slightly different slice of JSON Schema, so the same schema can pass on one provider and be rejected by another.
- OpenAI: response_format with json_schema + strict: true; gpt-4o-2024-08-06+, o-series, GPT-4.1. The most mature implementation; SDK .parse() takes Pydantic directly.
- Anthropic (Claude): output_config.format (generally available); client.messages.parse() / zodOutputFormat(); announced November 2025, now GA across Claude 3.5/4-series models.
- Google Gemini: response_schema with response_mime_type='application/json', across Gemini 1.5/2.0/2.5.
In practice, the gaps are specific. Claude, for example, does not allow recursive schemas, numeric range constraints (minimum/maximum/multipleOf), string-length constraints (minLength/maxLength), or external $ref URLs, and requires additionalProperties: false (limits: 20 strict tools, 24 optional params, 16 union-typed params). OpenAI has its own restrictions. Validate your schema against each provider you target. Constraints you have to strip from the schema you send, such as numeric ranges, can still be enforced client-side by your Pydantic or Zod validator.
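A cheap pre-flight check can catch portability problems before a provider rejects the request. The sketch below walks a JSON Schema and flags the keywords listed above as unsupported on Claude. The keyword list is copied from this lesson and may lag behind the provider's documentation, find_unsupported is a hypothetical helper and not part of any SDK, and it does not attempt to detect recursive schemas.

```python
UNSUPPORTED = {"minimum", "maximum", "multipleOf", "minLength", "maxLength"}


def find_unsupported(schema, path="$"):
    """Return (path, problem) pairs for one JSON Schema dict and its subschemas."""
    problems = []
    if not isinstance(schema, dict):
        return problems
    for key in sorted(UNSUPPORTED & schema.keys()):
        problems.append((path, f"unsupported keyword: {key}"))
    ref = schema.get("$ref")
    if isinstance(ref, str) and not ref.startswith("#"):
        problems.append((path, f"external $ref: {ref}"))
    if schema.get("type") == "object" and schema.get("additionalProperties") is not False:
        problems.append((path, "additionalProperties must be false"))
    # Recurse only into places that hold subschemas, so a property that
    # happens to be named "minimum" is not mistaken for the keyword.
    for key in ("properties", "$defs", "definitions"):
        for name, sub in schema.get(key, {}).items():
            problems += find_unsupported(sub, f"{path}.{key}.{name}")
    if isinstance(schema.get("items"), dict):
        problems += find_unsupported(schema["items"], f"{path}.items")
    for key in ("anyOf", "allOf", "oneOf"):
        for i, sub in enumerate(schema.get(key, [])):
            problems += find_unsupported(sub, f"{path}.{key}[{i}]")
    return problems


ticket_schema = {
    "type": "object",
    "properties": {
        "category": {"type": "string", "enum": ["billing", "technical", "account"]},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["category", "confidence"],
}
for where, problem in find_unsupported(ticket_schema):
    print(where, "->", problem)
```

Running a check like this in CI for each target provider turns "rejected at request time" into a failure you see before deployment.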
Decision guide: any JSON → JSON mode; typed data you branch on → provider structured outputs; self-hosted models → XGrammar/Outlines; multi-provider or need auto-retry → Instructor.
Try it: From prose to a typed contract
Build a ticket classifier two ways and compare. (1) Write a Ticket schema with three fields — category (an enum of billing|technical|account), confidence (a float 0–1), and needs_human (bool) — using Pydantic if you're in Python or Zod in TypeScript. (2) Run 20 short support messages through plain prompting ("reply with JSON") and log how many outputs fail to parse or fail schema validation. (3) Re-run the same 20 through your provider's structured outputs (.parse() / response_format with strict: true). (4) Add a validate-and-retry wrapper (max 2 retries) that feeds the validation error back on failure, and confirm it rescues the rare miss. Report the three failure rates side by side and note any schema features your provider rejected. You'll feel, concretely, the jump from hope to contract.
Key takeaways
- 1. Free text breaks pipelines because programs need a fixed contract; small per-step parse failures compound across an agent loop.
- 2. JSON mode guarantees only valid JSON syntax; structured outputs / constrained decoding guarantee your schema by masking illegal tokens at decode time.
- 3. Constrained decoding is a hard mathematical guarantee, and it is often faster than free generation, not slower.
- 4. Author schemas in Pydantic (Python) or Zod (TypeScript); SDKs derive JSON Schema and parse responses back into typed objects.
- 5. Always keep a validate-and-retry net that feeds the exact validator error back to the model, capped at 2–3 attempts, and remember that schema conformance is not correctness.
Quiz
Lock in what you learned
Check your understanding
1. What does JSON mode (response_format: json_object) actually guarantee?
2. How do structured outputs / constrained decoding enforce a schema?
3. Which statement about latency and reliability is correct?
4. What makes a validation-retry loop actually effective at self-correction?