Production Engineering for Agents/Lesson 4 of 6

Reliability & Guardrails

Designing for the 20% that goes wrong

Advanced · 14 min · Builder

What you'll be able to do

Design input and output guardrails that validate, screen, and schema-check every model turn
Choose correctly between retries, fallbacks, and circuit breakers for a given failure
Cap an agent's iterations and wall-clock time so it can never loop forever
Place human approval gates on irreversible, high-blast-radius actions
Assemble these layers into a defense-in-depth architecture and reason about its limits

At a glance

An agent that works 80% of the time is a demo; the engineering is all in the 20% that goes wrong. This lesson shows you how to build reliability the way production teams do — as layered defense-in-depth: input and output guardrails, retries with backoff, fallbacks, timeouts, circuit breakers, hard iteration caps, and human approval gates on dangerous actions. No single layer is trustworthy on its own, so you stack them.

1. Reliability is a stack, not a switch
2. Input and output guardrails
3. Retries, fallbacks, and circuit breakers
4. Iteration caps and timeouts
5. Human approval gates for high-risk actions
6. Guardrail frameworks and their limits

Reliability is a stack, not a switch

Here is the trap every new agent builder falls into: when the agent misbehaves, you reach for a better prompt — "You must never do X." That feels like a fix, but it is one thin layer, and a thin layer always leaks. The model can be talked out of any instruction by a clever input, a confusing tool result, or just bad luck. A 2025 Robust Intelligence study found 78% of production LLM apps were vulnerable to at least one class of prompt injection — and most of them had exactly that kind of system prompt telling the model to behave.

The production answer is defense-in-depth: a strategy borrowed from security and aviation where you assume every layer is fallible and stack independent checks, so no single failure can collapse the system. For an agent, the minimum stack is four layers:

Pre-LLM input guardrails — screen and validate what goes into the model (PII, injection patterns, schema).
Post-LLM output guardrails — validate what comes out (format, toxicity, hallucination, policy).
Action/tool validation — check tool arguments and permissions before execution.
Monitoring and telemetry — observe failure rates so you can react.
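One way the four layers above might fit together in a single turn is sketched here. Every name in it (ALLOWED_TOOLS, validate_action, run_turn, the guard callables) is a hypothetical placeholder rather than part of any library, and the model is assumed to return a proposed tool call as a dict.

```python
from collections import Counter

# Hypothetical allowlist; in practice this comes from your own tool inventory.
ALLOWED_TOOLS = {"search", "read_file"}
telemetry = Counter()  # layer 4: count outcomes so failure rates are visible


def validate_action(tool, args):
    """Layer 3: refuse tools outside the allowlist or with malformed arguments."""
    if tool not in ALLOWED_TOOLS:
        raise PermissionError(f"tool not allowed: {tool}")
    if not isinstance(args, dict):
        raise ValueError("tool args must be a dict")


def run_turn(text, input_guard, call_model, output_guard, execute_tool):
    """Pass one turn through input guard, model, output guard, and action check."""
    try:
        input_guard(text)                                     # layer 1
        proposal = output_guard(call_model(text))             # layer 2
        validate_action(proposal["tool"], proposal["args"])   # layer 3
    except (ValueError, PermissionError) as exc:
        telemetry[f"blocked:{type(exc).__name__}"] += 1
        return {"status": "refused", "reason": str(exc)}
    telemetry["allowed"] += 1
    return {"status": "ok", "result": execute_tool(proposal["tool"], proposal["args"])}
```

Each layer is an independent check, so a request that slips past one can still be stopped by the next, and the counter records which layer did the stopping.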

Think of it like aviation: redundant systems, checklists, and circuit breakers, because the cost of a single uncaught failure is high. The rest of this lesson walks each layer and the resilience patterns that connect them.

Key insight

The mental model

No guardrail catches everything. The best frameworks catch only 60–85% of serious problems and false-positive on 5–15% of turns. You are not building a wall; you are building Swiss cheese where the holes don't line up.
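The Swiss cheese picture can be made concrete with a little arithmetic. The sketch below assumes the layers fail independently, which real guardrails only approximate, so treat the numbers as a rough upper bound on the benefit. The function names are illustrative.

```python
from math import prod


def combined_miss_rate(catch_rates):
    """Chance a problem slips past every layer, ASSUMING independent layers."""
    return prod(1.0 - c for c in catch_rates)


def combined_false_positive_rate(fp_rates):
    """Chance at least one layer flags a legitimate turn, same assumption."""
    return 1.0 - prod(1.0 - f for f in fp_rates)


# Three layers at the low end of the quoted ranges: 60% catch, 5% false positive.
print(f"{combined_miss_rate([0.60, 0.60, 0.60]):.3f}")            # 0.064
print(f"{combined_false_positive_rate([0.05, 0.05, 0.05]):.3f}")  # 0.143
```

Stacking shrinks the miss rate quickly, but false positives add up too, which is one reason each layer should be tuned rather than simply piled on.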

Input and output guardrails

A guardrail is just a check that runs around a model call — and where you put it changes what it can afford to do. There are two positions, with very different time and money budgets.

Pre-LLM (input) guardrails run in the hot path before every model call, so they must be fast and deterministic: regex, rule sets, lightweight classifiers, schema checks. Typical jobs: PII detection, prompt-injection screening, and rejecting malformed input. Because they run constantly, an expensive check here taxes every single request.
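As a rough illustration of a fast, deterministic input check, here is a regex-based PII redactor. The two patterns are deliberately simple assumptions for the example; real PII detection needs far broader coverage than an email and a US Social Security number format.

```python
import re

# Illustrative patterns only, not a complete PII detector.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
US_SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact_pii(text: str) -> tuple[str, int]:
    """Replace matches with placeholders; return the new text and the hit count."""
    text, n_email = EMAIL.subn("[EMAIL]", text)
    text, n_ssn = US_SSN.subn("[SSN]", text)
    return text, n_email + n_ssn


clean, hits = redact_pii("Mail bob@example.com, SSN 123-45-6789.")
print(clean)  # Mail [EMAIL], SSN [SSN].
```

Compiled regexes like these cost microseconds per request, which is the kind of budget a hot-path check has to live within.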

Post-LLM (output) guardrails can afford slightly more latency: format/schema compliance, toxicity filtering, and hallucination or policy checks — sometimes using an auxiliary LLM as a judge (a second model whose only job is to grade the first model's answer). That judge adds 200–800ms per turn and 15–40% to provider cost depending on prompt size, so use it deliberately.
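One way to use a judge deliberately is to run the cheap deterministic checks first and call the judge only when they pass. In this sketch the judge is just a callable you supply; output_guard_with_judge and its parameters are hypothetical names, not an API from any framework.

```python
from typing import Callable


def output_guard_with_judge(
    answer: str,
    cheap_checks: list[Callable[[str], bool]],
    judge: Callable[[str], bool],
) -> bool:
    """Return True only if every cheap check and then the judge accept the answer."""
    # A failure here returns early, so the slower, paid judge call is skipped.
    if not all(check(answer) for check in cheap_checks):
        return False
    return judge(answer)
```

With this ordering, answers that are empty, oversized, or badly formatted never reach the judge, so its added latency and cost apply only to the turns that need a semantic check.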

Here is a minimal pattern that screens input, validates structured output against a schema, and refuses on failure:

python

from pydantic import BaseModel, ValidationError
import re

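# Optional groups so "ignore all previous instructions" matches, not just the two-word variants.
INJECTION = re.compile(r"ignore (all |any )?(previous |prior )?instructions", re.I)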

class Reply(BaseModel):
    answer: str
    confidence: float

def input_guard(text: str) -> None:
    if INJECTION.search(text):
        raise ValueError("input rejected: injection pattern")

def output_guard(raw: str) -> Reply:
    return Reply.model_validate_json(raw)  # raises if schema/format is wrong

def guarded_turn(user_text: str, call_model) -> Reply:
    input_guard(user_text)
    raw = call_model(user_text)
    try:
        return output_guard(raw)
    except ValidationError as e:
        # Chain the original error so the schema details survive in the traceback.
        raise ValueError(f"output failed schema: {e}") from e

The non-obvious rule: treat all external data as untrusted input, not just the end user. A poisoned web page, a malicious tool result, or a hostile document can carry instructions just as easily as a typed message. OWASP's ASI06 Memory & Context Poisoning warns that retrieved documents, tool outputs, and MCP server responses are injection vectors too.
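Applying that rule means tool results and retrieved documents go through a screen before they reach the prompt, just as user text does. The sketch below is a minimal illustration under assumed names (SUSPICIOUS, wrap_untrusted). The pattern list is far from complete, and wrapping content in delimiters lowers the risk without removing it.

```python
import re

# A tiny, illustrative pattern list; real screens use classifiers as well.
SUSPICIOUS = re.compile(
    r"(ignore (all |any )?(previous |prior )?instructions|you are now|system prompt)",
    re.I,
)


def wrap_untrusted(source: str, content: str) -> str:
    """Screen external content, then label it as data before it enters the prompt."""
    if SUSPICIOUS.search(content):
        raise ValueError(f"{source}: possible injected instructions")
    return f"<untrusted source={source!r}>\n{content}\n</untrusted>"
```

A caller would run every web page, tool output, or MCP response through wrap_untrusted before appending it to the model context.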

Watch out

Injection isn't only from users

In 2025 VirusTotal found 314 malicious agent skills from a single publisher that used plain natural-language instructions — not code — to hijack agents. Guardrails must treat tool schemas and tool outputs as an attack surface, not just the user message.

Retries, fallbacks, and circuit breakers

When a dependency fails, you have three different tools — and people constantly grab the wrong one. The trick is to first ask what kind of failure this is: a one-off hiccup, a dead provider, or a dependency that is sick and staying sick. Each gets its own pattern, and a serious system wants all three.

Retries with exponential backoff + jitter handle transient failures: a network blip, a TLS error, a brief rate limit. Backoff spaces attempts out (wait longer each time); jitter randomizes the spacing so a fleet of clients doesn't retry in lockstep and create a self-inflicted spike (the "thundering herd").
Fallbacks handle continuity: when the primary model or provider is down, route to a secondary so the user still gets an answer instead of an error.
Circuit breakers handle persistent failures: when a dependency's failure rate crosses a threshold, the breaker opens and stops sending requests for a cooldown — like a household fuse tripping. This prevents cascading failure and gives the dependency room to recover.

python

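import random
import time

class TransientError(Exception):
    """Raised by fn for failures worth retrying (timeouts, brief rate limits)."""

def with_backoff(fn, attempts=5, base=0.5, cap=8.0):
    for i in range(attempts):
        try:
            return fn()
        except TransientError:
            if i == attempts - 1:
                raise  # out of attempts: surface the last error
            # exponential backoff (base * 2^i, capped) scaled by 0.5x-1.5x jitter
            delay = min(cap, base * 2 ** i) * (0.5 + random.random())
            time.sleep(delay)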

Think of it as a triage rule: retry for transient, fall back for continuity, trip the breaker for sustained outages. Gateways like Portkey bundle all three across providers, but the patterns matter even if you build them yourself.
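The fallback leg of that triage rule can be as small as an ordered list of providers. This is a sketch under assumptions: ProviderDown is a hypothetical exception your own client wrappers would raise, and each provider is a plain callable taking the prompt.

```python
class ProviderDown(Exception):
    """Hypothetical error a provider wrapper raises when it cannot serve a request."""


def call_with_fallback(prompt, providers):
    """Try each (name, fn) in order; return (name, reply) from the first that works."""
    errors = []
    for name, fn in providers:
        try:
            return name, fn(prompt)
        except ProviderDown as exc:
            errors.append(f"{name}: {exc}")
    # Every provider failed: report all of them, not just the last one.
    raise ProviderDown("all providers failed: " + "; ".join(errors))
```

Returning the provider name alongside the reply lets telemetry show how often traffic is actually landing on the secondary.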

Watch out

Per-execution retries betray you in multi-agent systems

with_retry(stop_after_attempt=10) on a tool looks safe until 10 parallel agents each make 10 attempts and up to 100 requests hammer a service that is already down. The fix is a shared circuit breaker in graph state that every worker checks before calling the dependency.
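A shared breaker might look like the sketch below. It is an in-process illustration with assumed names (CircuitBreaker, guarded_call); workers in separate processes would need the same state in a shared store. It also simplifies recovery by resetting fully after the cooldown, where a fuller version would let a single probe request through first.

```python
import threading
import time


class CircuitBreaker:
    """Opens after `threshold` consecutive failures, then blocks calls for `cooldown` s."""

    def __init__(self, threshold=5, cooldown=30.0, clock=time.monotonic):
        self.threshold, self.cooldown, self.clock = threshold, cooldown, clock
        self.failures = 0
        self.opened_at = None
        self._lock = threading.Lock()  # one breaker is shared by every worker

    def allow(self):
        with self._lock:
            if self.opened_at is None:
                return True
            if self.clock() - self.opened_at >= self.cooldown:
                self.opened_at, self.failures = None, 0  # cooldown over: close again
                return True
            return False

    def record(self, ok):
        with self._lock:
            self.failures = 0 if ok else self.failures + 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()


def guarded_call(breaker, fn):
    """Check the shared breaker before calling; record the outcome afterwards."""
    if not breaker.allow():
        raise RuntimeError("circuit open: skipping call")
    try:
        result = fn()
    except Exception:
        breaker.record(False)
        raise
    breaker.record(True)
    return result


calls_made = []

def dead_service():
    calls_made.append(1)
    raise ConnectionError("down")

breaker = CircuitBreaker(threshold=5, cooldown=30.0)
for _worker in range(10):          # 10 agents...
    for _attempt in range(10):     # ...each allowed 10 attempts
        try:
            guarded_call(breaker, dead_service)
        except (ConnectionError, RuntimeError):
            pass
print(len(calls_made))  # 5
```

Of the 100 attempts, only the first five reach the dead service; the rest are refused locally until the cooldown expires.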

Iteration caps and timeouts

An agent's loop is its superpower — and its scariest liability. The same loop that lets it keep trying until a job is done will, if the goal is unachievable or a tool always errors, happily run forever: burning tokens and money with nothing to show for it. This is a reliability problem before it is a cost problem, and the fix is boring and non-negotiable: hard caps that make "stop" a guaranteed outcome, not a hope.

Two caps, applied at both the single-agent and orchestrator level:

Max iterations — e.g. max_steps = 50. After N reasoning/tool cycles, stop and return the best partial result (or escalate to a human).
Wall-clock timeout — e.g. a 60-second budget for the whole run, independent of step count, to catch a single tool that hangs without ever returning.

python

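import time
from types import SimpleNamespace

def init(goal):
    # Placeholder state (an assumption); a real agent would track messages and tool results.
    return SimpleNamespace(goal=goal, done=False, result=None)

def finalize(state, reason):
    # Return the best partial result along with the reason the run stopped.
    return {"result": state.result, "reason": reason}

def run_agent(goal, step, max_steps=50, time_budget=60):
    deadline = time.monotonic() + time_budget
    state = init(goal)
    for _ in range(max_steps):
        # Checked between steps only: a step that blocks forever also needs its own per-call timeout.
        if time.monotonic() > deadline:
            return finalize(state, reason="timeout")
        state = step(state)
        if state.done:
            return finalize(state, reason="done")
    return finalize(state, reason="max_steps")  # never loops forever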

This maps directly to OWASP ASI08 Cascading Failures: an uncapped loop in one agent can saturate shared resources and take down a whole multi-agent system. Always make "give up gracefully" an explicit, reachable state, not an accident.
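A deadline checked between steps cannot interrupt a tool call that never returns, so a hanging tool needs a timeout around the call itself. One possible sketch uses a worker thread from the standard library. The names call_with_timeout and _pool are assumptions for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_pool = ThreadPoolExecutor(max_workers=4)  # hypothetical shared pool for tool calls


def call_with_timeout(fn, timeout_s, *args):
    """Run fn(*args) in a worker thread; raise TimeoutError if it takes too long."""
    future = _pool.submit(fn, *args)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # cancel() only stops a task that has not started; a running one continues.
        future.cancel()
        raise TimeoutError(f"tool call exceeded {timeout_s}s") from None
```

The caller gets control back on time, but Python cannot kill the stuck thread, so it keeps occupying a pool slot. For tools that can truly hang, a subprocess or the client library's own timeout setting is the safer choice.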

Tip

Self-correction beats hard failure

When an output guardrail catches a problem, don't always reject. Feed the specific issue back as a targeted correction prompt and let the agent retry — it's far more user-friendly. Just keep the retry counted against your max-iteration cap so a self-correction loop can't run away.
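A bounded self-correction loop could be sketched like this. The helper name self_correct and the convention that validate returns None on success, or a short problem description otherwise, are assumptions made for the example.

```python
def self_correct(call_model, validate, prompt, max_attempts=3):
    """Retry with targeted feedback; return (output, attempts_used).

    `validate` returns None when the output is fine, or a short problem description.
    """
    current = prompt
    problem = None
    for attempt in range(1, max_attempts + 1):
        output = call_model(current)
        problem = validate(output)
        if problem is None:
            return output, attempt
        # Feed back the specific issue rather than a generic "try again".
        current = f"{prompt}\n\nYour last answer was rejected: {problem}. Fix only that."
    raise ValueError(f"still invalid after {max_attempts} attempts: {problem}")
```

Returning the number of attempts used lets the caller subtract them from the agent's overall step budget, so corrections cannot extend the run past its cap.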

Human approval gates for high-risk actions

Some actions can't be undone: moving money, deleting data, modifying production, sending an external email. For these, no guardrail is smart enough to fully trust on its own — the right pattern is to simply stop and ask a person. That is a human in the loop (HITL): the agent pauses before the dangerous step and waits for explicit approval. Reserve these gates for irreversible, high-blast-radius actions only; gating everything trains users to rubber-stamp without reading, which is worse than no gate at all.
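Keeping the gate narrow can be as simple as an explicit set of irreversible actions. The action names and the dispatch helper below are hypothetical; the real list should come from an inventory of your own tools.

```python
# Hypothetical action names, chosen to mirror the examples in the text.
IRREVERSIBLE = {"wire_transfer", "delete_records", "deploy_production", "send_external_email"}


def needs_human_approval(action: str) -> bool:
    """Gate only irreversible actions so approvals stay rare enough to be read."""
    return action in IRREVERSIBLE


def dispatch(action, args, execute, ask_human):
    """Run the action, pausing for a human decision only when it is irreversible."""
    if needs_human_approval(action) and not ask_human(action, args):
        return {"status": "cancelled", "action": action}
    return {"status": "done", "result": execute(action, args)}
```

Everything outside the set runs without interruption, which keeps the approval prompt meaningful when it does appear.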

In LangGraph, interrupt() pauses graph execution with a state snapshot and surfaces a payload to a human; resuming injects the decision back in. (interrupt_before / interrupt_after is a static, node-level alternative.) Both require a checkpointer — LangGraph raises a compile-time error otherwise, because pausing means persisting and restoring state.

python

from langgraph.types import interrupt, Command

def execute_transfer(state):
    decision = interrupt({  # pauses; waits for a human
        "action": "wire_transfer",
        "amount": state["amount"],
        "to": state["recipient"],
    })
    if decision != "approve":
        return Command(goto="cancelled")
    return Command(goto="do_transfer")

This is increasingly not optional. The EU AI Act's main provisions apply from August 2, 2026, requiring high-risk AI systems to allow human oversight during operation. (Note: a Digital Omnibus political agreement reached on May 7, 2026 shifts rules for certain high-risk areas to December 2, 2027 and for systems integrated into regulated products to August 2, 2028; final legislative adoption is still pending, so treat August 2026 as the current planning baseline and verify the live Commission page before quoting.) The US Treasury's Financial Services AI Risk Management Framework (February 2026) requires documented human review at defined decision points. In regulated domains, the gate is a compliance control, not a nicety.

Watch out

No persistence, no pause

interrupt() does not work without a checkpointer. A human gate is only as durable as the state store behind it — if your process can die while waiting for approval, you need real persistence, not an in-memory snapshot.
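To show what "real persistence" means in the simplest terms, here is a sketch of a pending-approval store backed by SQLite from the standard library. It is not LangGraph's checkpointer and does not use its API. The table layout and function names are assumptions chosen for illustration.

```python
import json
import sqlite3


def open_store(path):
    """Open (or create) the approvals table at the given database path."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS approvals ("
        "id TEXT PRIMARY KEY, payload TEXT NOT NULL, decision TEXT)"
    )
    db.commit()
    return db


def request_approval(db, request_id, payload):
    """Persist the pending request before the process starts waiting."""
    db.execute(
        "INSERT OR IGNORE INTO approvals (id, payload) VALUES (?, ?)",
        (request_id, json.dumps(payload)),
    )
    db.commit()


def decide(db, request_id, decision):
    """Record the human's decision, e.g. 'approve' or 'reject'."""
    db.execute("UPDATE approvals SET decision = ? WHERE id = ?", (decision, request_id))
    db.commit()


def get_decision(db, request_id):
    """Return the recorded decision, or None while the request is still pending."""
    row = db.execute(
        "SELECT decision FROM approvals WHERE id = ?", (request_id,)
    ).fetchone()
    return row[0] if row else None
```

With a file path instead of an in-memory database, a restarted process can reopen the store and find the request still pending or already decided, which is the property an in-memory snapshot lacks.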

Guardrail frameworks and their limits

You don't have to hand-build every check — a mature ecosystem of libraries gives you injection screening, content classifiers, and dialogue rails out of the box. The skill is knowing what each one is good at, and where its ceiling is. The current landscape:

Framework	What it does	Notes (2026)
Guardrails AI	Python input/output validators, RAIL spec, Guardrails Hub, standalone REST server	v0.10.0; ~7k stars; OpenAI-SDK-compatible server
NVIDIA NeMo Guardrails	Colang-scripted dialogue rails (topics, flows)	v0.20.0 (Jan 2026); production-ready microservice container available; +100–300ms
Meta LlamaFirewall	3 layers: PromptGuard 2 (jailbreaks), Agent Alignment Checks (CoT auditor for injection), CodeShield (insecure code)	Built for agents, not chatbots
Meta Llama Guard 3	Fine-tuned content-safety classifier	Strong turnkey option, but no tunable threshold
Garak	LLM vulnerability scanner, 120+ adversarial probes	Red-team before you ship

Use the OWASP Top 10 for Agentic Applications (released December 2025, ASI01–ASI10) as your threat checklist — it covers agent-specific risks like goal hijack, tool misuse, and rogue agents that the chatbot-era OWASP LLM Top 10 misses.

The honest caveat: frameworks reduce risk, they don't eliminate it. They miss 15–40% of serious problems and false-alarm on a non-trivial share of legitimate turns. That is exactly why you stack them with the resilience patterns above instead of trusting any one library to be the whole defense.

Note

Check maturity notes before shipping

NeMo Guardrails is excellent for multi-turn dialogue rails. v0.17.0 (Oct 2025) carried a 'not recommended for production as-is' warning; v0.20.0 (Jan 2026) ships a production-ready Kubernetes microservice. Always read the maturity notes for the specific version you deploy.

Try it: Harden a fragile agent

Start from a minimal tool-using agent that calls one flaky external API and has an uncapped loop. Add four layers in order and verify each: (1) an input guardrail that rejects an obvious injection string and a schema-invalid request; (2) a retry-with-backoff-and-jitter wrapper plus a circuit breaker that opens after 5 consecutive failures and cools down for 30s; (3) a hard max_steps=20 cap and a 45-second wall-clock timeout, returning a graceful partial result instead of hanging; and (4) a human approval gate on one simulated irreversible action (e.g. delete_records) using a checkpointer-backed interrupt() (or a simple input() stand-in). Finally, run Garak or hand-craft 5 adversarial inputs and record which layer caught each. Write one paragraph: which failures got through, and what layer you'd add next.

Key takeaways

1. Reliability is defense-in-depth: stack input guards, output guards, action validation, and monitoring so no single failure collapses the agent.
2. Retries handle transient failures, fallbacks handle continuity, and circuit breakers handle sustained outages; they are distinct and you want all three.
3. Hard max-iteration caps plus a wall-clock timeout are non-negotiable so an agent can never loop forever or hang.
4. Put human approval gates on irreversible, high-blast-radius actions; in regulated domains they are now a compliance requirement, not a nicety.
5. Guardrail frameworks catch only 60–85% of serious problems, so use them as one layer among many, never as the whole defense.

Quiz

Lock in what you learned

Check your understanding

1. In a multi-agent system, why is per-execution retry (e.g. stop_after_attempt=10) on a shared tool dangerous?

2. Which pattern is the right tool for a dependency that is persistently failing (not just a transient blip)?

3. What is required for LangGraph's interrupt() to pause an agent for human approval?

4. Which statement best reflects the real-world effectiveness of guardrail frameworks in 2026?

Go deeper

Hand-picked sources to keep learning

OWASP Top 10 for Agentic Applications (Dec 2025)

Authoritative taxonomy of the 10 agentic-specific risks (ASI01–ASI10) with mitigations. Use it as your threat checklist.

Portkey — Retries, fallbacks, and circuit breakers in LLM apps

Practical guide to when and how to use each resilience pattern and how they complement each other.

LangGraph Interrupts — Human-in-the-Loop (official docs)

interrupt() and interrupt_before/after for approval gates, including the persistence requirement and resumption.

Guardrails AI — GitHub (v0.10.0)

Leading Python framework for input/output validation: validators, RAIL spec, Guardrails Hub, and a standalone REST server.

LlamaFirewall: An open-source guardrail system for secure AI agents

Meta's three-layer defense (PromptGuard 2, Agent Alignment Checks, CodeShield) designed for autonomous agents, not chatbots.

Red Hat — Every layer counts: defense-in-depth for AI agents

A concrete multi-layer defense framework spanning sandboxing, guardrails, red-teaming, identity, and tracing.