Frontiers & Capstone/Lesson 5 of 5

Capstone: Build a Production Agent

Put it all together into a portfolio project

Advanced 18 minBuilder
What you'll be able to do
  • Scope a capstone agent that is narrow, verifiable, and genuinely useful — not a toy demo
  • Design the architecture: model, typed tool contracts, deterministic state, memory layers, and orchestration
  • Add retrieval and structured outputs so the agent is grounded and safe to program against
  • Build a golden eval dataset and gate it in CI so quality is measured, not assumed
  • Add guardrails, step-level tracing, and a deployment path, then showcase the project credibly
At a glance

This is where the whole course converges: you take everything — tools, memory, retrieval, evaluation, guardrails, observability — and assemble it into one production-grade agent you can put on your portfolio. By 2026 the differentiator is no longer model intelligence; it is whether your system chooses the right action, stays within policy, and is measurable, debuggable, and trusted around real users and money. This lesson hands you a concrete project blueprint and an ordered build checklist you can finish in one to two weeks.

  1. 1What actually counts as a capstone in 2026
  2. 2Choosing a project that can actually be finished
  3. 3The architecture: contracts, state, and orchestration
  4. 4Grounding: retrieval and structured outputs
  5. 5Evaluation: the part that makes it credible
  6. 6Guardrails, observability, and shipping

What actually counts as a capstone in 2026

Think of a capstone less like a clever demo and more like a small product you could hand to a stranger. The job is to prove you can build a system that acts reliably around real users and real money — not that you can write one impressive prompt.

That is why a capstone is not a notebook that calls one API. It is a small distributed system in which the LLM is the planner and executor, wrapped in the engineering that makes it dependable: typed tool contracts, deterministic state, trace-level observability, and tested quality.

The instinct to reach for raw model power first is exactly backwards. In 2026, model quality is commoditized — every frontier model can plan and call tools. What separates a portfolio project that gets you hired from one that gets ignored is architecture: the scaffolding around the model, not the model itself.

Concretely, reviewers in 2026 look for five signals:

  1. A deployed MVP (an endpoint or CLI that actually runs), not a notebook.
  2. CI-gated eval tests that fail the build on regression — quality measured, not claimed.
  3. Hybrid retrieval against real data, not a toy API.
  4. Observable traces you can replay and debug step by step.
  5. A README that documents architecture decisions and measured quality.

Hit those five and you have demonstrated production engineering — which is the entire point.

Watch out

A demo is not a capstone

A notebooks-only build against a toy API does not demonstrate production skill. Recruiters and reviewers in 2026 specifically screen for a deployed MVP, CI-gated evals, observable traces, and documented architectural choices. Build for those signals from day one.

Choosing a project that can actually be finished

Most capstones fail before a line of code is written — they die from being too ambitious. The cure is brutal scoping: pick something small enough to finish, concrete enough to test, and real enough to matter.

Those three words have precise meanings. A good capstone is narrow, verifiable, and useful — three constraints that kill almost every grandiose idea, which is exactly the point.

  • Narrow — one user, one workflow, one domain. "Triage incoming support tickets and draft replies" beats "build a company-wide assistant."
  • Verifiable — you can write a test that says this run was correct. If you cannot define success, you cannot evaluate it, and an unevaluated agent is just a vibe.
  • Useful — it connects to real infrastructure you (or a stand-in service) actually have: a real database, a real document set, a real ticketing API.

Strong, scopeable archetypes:

ProjectToolsWhy it works
Support-ticket triage agentsearch docs, label, draft replyclear success criteria, real data
Codebase Q&A agentrepo search, file read, run testsverifiable via tests
Research-and-report agentweb search, fetch, summarizetrajectory is inspectable
Internal-data analystSQL query, chart, summarizestructured, deterministic checks

Pick one. Write a one-paragraph spec: the user, the goal, the tools, and one sentence defining a successful run. That sentence becomes your first eval case — the seed everything else grows from.

Tip

The simplest-thing-first rule

Anthropic's Building Effective Agents guidance still holds in 2026: start with the simplest solution and add agentic complexity only when simpler approaches fail. If a fixed workflow (prompt chain or router) solves your task, ship that — and note in your README why you did or didn't need a full agent loop.

The architecture: contracts, state, and orchestration

The most important mental shift: the LLM is the planner, never the whole system. It decides what to do next; everything that makes those decisions safe, repeatable, and debuggable lives in the engineering around it. Picture the model sitting inside a sturdy harness — the harness is your real work.

Three pieces of that harness carry the most weight.

1. Typed tool contracts. Every tool gets a strict schema and input validation. Bad arguments should be rejected before they touch your infrastructure — and the rejection message should be useful feedback the model can read and recover from, not a stack trace.

2. Deterministic state. The agent's state changes through a reducer — a single function that takes the current state plus an event and returns the next state — instead of mutating a blob in place. Predictable transitions are what make a run reproducible and debuggable: replay the same events, get the same state.

3. Orchestration. This is the loop that feeds each tool result back to the model and runs until done. Build it yourself first to understand it, then optionally adopt a framework like LangGraph (stable v1.0, late 2025) for checkpointing, persistence, and human-in-the-loop.

python
from pydantic import BaseModel, field_validator

class SearchTicketsArgs(BaseModel):
    query: str
    max_results: int = 5

    @field_validator("max_results")
    @classmethod
    def cap(cls, v: int) -> int:
        if not 1 <= v <= 20:
            raise ValueError("max_results must be 1..20")
        return v

def search_tickets(args: SearchTicketsArgs) -> dict:
    rows = db.search(args.query, limit=args.max_results)
    return {"count": len(rows), "results": rows}  # bounded, structured

Validate, execute, return a bounded structured result. That single discipline — applied to every tool — is 80% of reliability.

Key insight

Build it raw before you reach for a framework

Write the loop with direct API calls once. Frameworks abstract the exact mechanics you most need to understand when something breaks — debugging a framework you don't understand is far harder than debugging forty lines of your own code. Adopt LangGraph after you know what it's hiding.

Grounding: retrieval and structured outputs

Two additions move your agent from plausible to trustworthy. Retrieval grounds its claims in real data so it stops making things up, and structured outputs make its results safe to program against so the rest of your system can rely on them.

Retrieval. Don't reach for a vector database first — that is the most common over-engineering mistake. Most production wins come from structured state plus summaries plus task artifacts; vector search is added later, for large document sets. When you do add it, use hybrid retrieval — keyword search (BM25, which matches exact terms) combined with vector search (which matches meaning) — followed by a reranking step that re-sorts the candidates by relevance. That combination is what reviewers expect in 2026, and it materially outperforms naive top-k cosine search. Make retrieval agentic: let the model decide when and what to retrieve, rather than stuffing everything into context.

Structured outputs. Free text breaks pipelines — you cannot reliably parse, store, or test prose. Instead, define a schema for the agent's final answer and validate every result against it, retrying on failure.

python
from pydantic import BaseModel

class TriageResult(BaseModel):
    category: str
    priority: int          # 1..5
    draft_reply: str
    cited_doc_ids: list[str]

# Pass TriageResult as the structured-output schema; validate,
# and on ValidationError, retry once with the error appended
# as feedback before failing the run.

A validated TriageResult with cited_doc_ids is something you can store, test, and audit — and the citations let you check the answer was actually grounded. Free-text prose gives you none of that.

Watch out

Treat retrieved content as untrusted

Anything you pull from documents, web pages, or tool outputs is untrusted input. It can carry indirect prompt injection. Never let retrieved text silently escalate the agent's privileges or trigger irreversible actions — keep policy enforcement outside the model.

Evaluation: the part that makes it credible

Here is the rule that separates engineers from hobbyists: if you can't measure it, you can't ship it. Anyone can demo an agent that works once; proving it works consistently is the hard part, and evaluation is how you prove it.

Good evaluation rests on three pillars:

  1. Success / quality — did the run complete the task? (completion rate, groundedness, coherence)
  2. Process / trajectory — were the right tools chosen, with valid arguments, in a sensible number of steps?
  3. Trust / safety — does it resist prompt injection and stay within policy?

Trajectory evaluation is non-negotiable because agents produce silent failures: a correct-looking answer reached via the wrong data source or the wrong tool. The output looks fine, so checking only the final answer misses it entirely — only inspecting the path catches it.

Build a golden dataset — a curated set of inputs paired with known-good expectations. Start with 20–30 cases covering your most critical use case; 50–100 catches obvious failures; 200–500 is production-ready. Source cases three ways: developer-curated traces from your own logs, anonymized production data, and synthetic multi-turn cases generated by "dueling LLMs" (one model writes hard cases, another critiques them).

Gate it in CI. Run the eval set on every change and fail the build on regression — so quality can only go up.

python
def test_triage_eval():
    results = [run_agent(c.input) for c in GOLDEN_CASES]
    success = mean(judge(r, c.expected) for r, c in zip(results, GOLDEN_CASES))
    assert success >= 0.85   # block the merge if quality drops

Watch out

LLM-as-judge is not reliable out of the box

LLM judges show position, length, and agreeableness biases, exceed 50% error on complex tasks, and agree with domain experts only ~64–68% in specialized fields. Before trusting judge scores, add a rubric, few-shot examples, structured output, and calibration against a set of human labels.

Guardrails, observability, and shipping

Guardrails are the seatbelts of an agent: you install them before the crash, not after. Treating them as a post-launch patch is the single most common — and most expensive — mistake. Industry surveys from 2025 (Deloitte, KPMG) consistently find that roughly 80% of organizations lack mature governance capabilities before deploying agents, and the reactive scramble after an incident is the canonical anti-pattern. Bake them in from the start.

Guardrails checklist:

  • Capability gating — allowlist which tools the agent can use per environment.
  • Human-in-the-loop — an approval gate for any irreversible or money-moving action.
  • Data boundaries — redact sensitive outputs, use short-lived tokens.
  • Injection resilience — treat all retrieved content as untrusted.
  • Budgets — rate limits, token/cost caps, and max-iteration caps so a runaway loop can't bankrupt you.
  • Policy-as-code — enforce policy outside the LLM, deterministically, so it can't be talked out of a rule.

Observability means you can see what the agent did, step by step. Emit a trace for every step. A minimum payload: traceId, stepId, tool name, an arguments hash, duration, a result summary, model name, and token counts. Traces power both live debugging and offline eval replay — they turn a nondeterministic agent into something you can actually inspect.

Use LangSmith (with LangGraph) or Langfuse (framework-agnostic, OpenTelemetry GenAI) to collect them.

Deployment comes in two shapes: agent-as-service (a backend /run endpoint, with long runs pushed onto an async queue like ARQ or Celery + Redis so requests don't time out) or agent-in-repo (local execution). For long-running, failure-prone workflows, durable execution (e.g., Temporal) keeps state alive across crashes, so a process dying mid-run doesn't lose everything. Then write the README that explains your choices and shows your measured numbers — that document is half of what reviewers grade.

Example

A minimal step trace

json
{
  "traceId": "run_8f2",
  "stepId": 3,
  "tool": "search_tickets",
  "argsHash": "a1b2c3",
  "durationMs": 240,
  "resultSummary": "3 tickets matched",
  "model": "claude-opus-4-5",
  "tokensIn": 1840,
  "tokensOut": 96
}

Replaying these for a failed run is how you turn a nondeterministic agent into a debuggable one.

Try it: Ship a thin but complete production agent

Build one end-to-end slice of a real capstone, following the production build order.

1. Spec (15 min). Write a one-paragraph spec for a narrow, verifiable agent (e.g., support-ticket triage). End with one sentence defining a successful run — this is your first eval case.

2. Tool contract (30 min). Implement one tool with a Pydantic schema, input validation, and a bounded structured return. Reject bad arguments with a useful error message.

3. Loop + structured output (45 min). Wire the model–tool loop with direct API calls (no framework yet). Return a validated structured final result with citations.

4. Eval + CI gate (45 min). Hand-write 10–20 golden cases. Write a test that runs them and asserts a success threshold; wire it so it fails the build on regression.

5. Trace + guardrail (30 min). Emit a step-level trace (traceId, stepId, tool, argsHash, duration, tokens). Add one guardrail: an iteration cap and a human-approval gate for any irreversible action.

Deliverable. A repo with the agent, the eval suite passing in CI, sample traces, and a README that states your architecture choices and your measured success rate. That README is what a reviewer reads first.

Key takeaways

  1. 1In 2026 the differentiator is architecture, not model intelligence — typed tool contracts, deterministic state, observability, and evaluation are the project.
  2. 2Scope ruthlessly: a good capstone is narrow, verifiable, and connected to real infrastructure, with a one-sentence definition of a successful run that becomes your first eval case.
  3. 3Build the loop raw before adopting a framework, add structured outputs, and prefer structured state plus summaries over vectors until you actually need large-scale recall.
  4. 4Trajectory evaluation and a CI-gated golden dataset (20–50 cases minimum) are non-negotiable because agents fail silently — correct answers reached the wrong way.
  5. 5Guardrails, step-level traces, and a clear README of architecture and measured quality are what make the project deployable and what reviewers actually grade.

Quiz

Lock in what you learned

Check your understanding

0 / 4 answered

1.In 2026, what most distinguishes a portfolio-worthy capstone agent from a toy demo?

2.Why is trajectory (process) evaluation essential, not optional, even when the final answer looks correct?

3.What is the recommended approach to memory and retrieval when starting a capstone?

4.Which statement about guardrails and deployment is accurate for production agents in 2026?

Go deeper

Hand-picked sources to keep learning