Safety, Security & Governance/Lesson 2 of 5

Securing Agentic Systems

Least privilege for software that acts on its own

Advanced · 14 min · Builder

What you'll be able to do

Apply the principle of least privilege to agent tools and data, addressing OWASP's three root causes of excessive agency
Choose the right sandbox (container, gVisor, or microVM) for a given trust level and isolate untrusted code
Design allowlists and risk-tiered confirmation gates that stop irreversible actions without causing confirmation fatigue
Inject secrets through a credential broker instead of environment variables, and treat all tool outputs as untrusted input
Compose these controls into a defense-in-depth posture with logging and monitoring that shrinks the blast radius of a compromise

At a glance

An agent amplifies every security mistake you give it: a single over-privileged agent hit by prompt injection can exfiltrate data or delete records at machine speed. This lesson is the defensive playbook — least privilege scoped to the task, OS-level sandboxing, confirmation gates for irreversible actions, secrets that never touch the agent's hands, and treating every byte of external data as hostile. You'll leave with a concrete, layered architecture you can apply on Monday.

1. Think in blast radius
2. Least privilege, scoped to the task
3. Sandboxing: match isolation to trust
4. Allowlists and risk-tiered confirmation gates
5. Secrets the agent never holds
6. All external data is untrusted
7. Defense in depth and monitoring

Think in blast radius

Security for chatbots is about what the model says. Security for agents is about what the model does — and an agent does it at machine speed, across many tool calls, without a human watching each one. The mental model that organizes everything in this lesson is blast radius: if this agent were fully compromised right now — say, by a malicious instruction hidden in a web page it just fetched — how much damage could it do before anyone noticed?

The answer is exactly the sum of every privilege you handed it. A database tool with DELETE rights, a shell with network access, an email tool with your sender identity, a long-lived API key in its environment — each one widens the radius. Your job as a defender is not to make the agent unhackable (you can't), but to make a successful hack boring: scoped to a tiny, reversible, observable slice of the world.

This reframes design. Instead of asking "what can this agent do?", you ask "what's the worst it can do if it's turned against me?" Every control below — least privilege, sandboxing, gates, secret brokering — is a way to shrink that worst case. No single control is sufficient; together they make compromise survivable.

Least privilege, scoped to the task

Think about how a good office grants building access: a contractor hired to fix one room gets a key to that room for that day — not a master key to the whole campus, forever. That is least privilege, and it is the single highest-leverage control here: give the agent the minimum capability needed for the task in front of it, and nothing more. OWASP's LLM06:2025 — Excessive Agency (renumbered from the old LLM08) names three distinct root causes you must address independently:

Excessive functionality — tools the agent doesn't need are present at all. Remove them.
Excessive permissions — the tools it has carry broader rights than the task requires. Scope them.
Excessive autonomy — high-impact actions proceed without any human check. Gate them.

Scoping is action-level, not tool-level. A database connector for a reporting task should hold SELECT only — not UPDATE or DELETE — unless the task explicitly demands writes. The current best practice is dynamic scoping (permissions granted per task, not standing) backed by ephemeral tokens that expire when the task completes, rather than long-lived credentials sitting around to be stolen.
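
To make dynamic, action-level scoping concrete, here is a minimal sketch of a task-scoped grant that expires on its own. The `Grant` class, the `grant_for_task` helper, and the scope strings such as `db:select` are illustrative assumptions, not part of any particular framework; a real system would enforce the same check in the database or API layer as well.

```python
import time
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Grant:
    """A permission set tied to one agent and one task, with a hard expiry."""
    agent_id: str
    task_id: str
    actions: FrozenSet[str]   # action-level scopes, e.g. {"db:select"} (hypothetical names)
    expires_at: float         # deadline on the monotonic clock

    def allows(self, action: str, now: Optional[float] = None) -> bool:
        # Deny if the grant has expired or the action was never granted.
        current = time.monotonic() if now is None else now
        return current < self.expires_at and action in self.actions


def grant_for_task(agent_id, task_id, actions, ttl_seconds) -> Grant:
    # Permissions are created per task and carry their own deadline.
    return Grant(agent_id, task_id, frozenset(actions), time.monotonic() + ttl_seconds)


reporting = grant_for_task("agent-reports-01", "task-42", {"db:select"}, ttl_seconds=300)
print(reporting.allows("db:select"))  # True while the grant is live
print(reporting.allows("db:delete"))  # False: DELETE was never granted
```

Because the grant is created per task and carries its own deadline, there is no standing permission left behind for an attacker to reuse once the task ends.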

Three controls are the non-negotiable minimum for any production agent: a unique agent identity, minimal permissions scoped to the current task, and full tool-call logging. If you ship nothing else from this lesson, ship those three.

Watch out

Excessive agency is three problems, not one

It's tempting to treat "too much agency" as a single dial. OWASP is more precise: excessive functionality, excessive permissions, and excessive autonomy are independent failures with independent fixes. Removing an unused tool does nothing about an over-scoped one. Audit all three.

Sandboxing: match isolation to trust

When an agent runs code, whether code that ships with it or code it generated on the fly, it needs an execution sandbox, and the strength of that sandbox should match how much you trust what's running. Picking the wrong tier is a classic mistake: a standard Docker container shares the host kernel, so a kernel vulnerability or a misconfiguration (a privileged container, a mounted Docker socket) can let code escape to the host. Containers are fine for fully trusted code, not for AI-generated or otherwise untrusted code.

The practical tiering, with rough numbers:

Tier	Mechanism	Approximate overhead	Use for
Docker container	Namespace and cgroup isolation, shared host kernel	Minimal	Fully trusted code only
gVisor	User-space kernel intercepts syscalls	10–30% on I/O-heavy work, millisecond-scale startup	Multi-tenant SaaS agents
Firecracker microVM	Dedicated guest kernel per workload (KVM)	~125 ms boot, <5 MiB memory per VM	Untrusted / AI-generated code
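
The tiering in the table can be encoded as a small policy function so the choice is made in one reviewed place rather than ad hoc. This is a sketch under assumptions: the `Trust` enum, the `choose_sandbox` name, and the string labels are hypothetical and are not flags for any real runtime.

```python
from enum import Enum


class Trust(Enum):
    TRUSTED = "trusted"            # code you wrote and reviewed
    MULTI_TENANT = "multi_tenant"  # many customers' workloads on shared hosts
    UNTRUSTED = "untrusted"        # AI-generated or third-party code


# Mapping taken from the table above; the values are labels, not runtime options.
SANDBOX_FOR_TRUST = {
    Trust.TRUSTED: "docker",
    Trust.MULTI_TENANT: "gvisor",
    Trust.UNTRUSTED: "firecracker",
}


def choose_sandbox(trust: Trust, ai_generated: bool = False) -> str:
    # AI-generated code is treated as untrusted no matter who requested it.
    if ai_generated:
        trust = Trust.UNTRUSTED
    return SANDBOX_FOR_TRUST[trust]


print(choose_sandbox(Trust.TRUSTED))                     # docker
print(choose_sandbox(Trust.TRUSTED, ai_generated=True))  # firecracker
```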

Isolation also has two axes, not one. Anthropic's Claude Code sandboxing (2025) enforces both: filesystem isolation (reads and writes confined to the working directory) and network isolation (outbound traffic routed through a proxy that enforces a domain allowlist). It uses Linux bubblewrap and macOS Seatbelt at the OS level, so the limits cover subprocesses and scripts too. In Anthropic's internal usage, sandboxing cut permission prompts by 84% while keeping the agent safe.

Allowlists and risk-tiered confirmation gates

Two everyday ideas do the heavy lifting here. The first is a guest list at a door: if you're not on it, you don't get in — no exceptions, no arguing about who isn't allowed. The second is the difference between a button you can press freely and one behind a glass cover you must lift first. Together these two controls govern what actions are even possible and which require a human.

Allowlists (default-deny). Don't enumerate what's forbidden — enumerate what's permitted, and deny everything else. Allowlist the domains an agent may reach, the commands it may run, the file paths it may touch. A denylist is a guarantee you missed something; a default-deny allowlist fails closed.
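
A default-deny domain allowlist might look like the sketch below. The host names and the `is_allowed_url` helper are hypothetical; the point is that the check compares the exact parsed hostname and returns False for anything it does not recognize, including URLs it cannot parse.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist; everything not listed is denied.
ALLOWED_HOSTS = frozenset({"api.example.com", "docs.example.com"})


def is_allowed_url(url: str) -> bool:
    """Default-deny: only https URLs whose exact host is listed pass."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False  # unparseable input fails closed
    if parts.scheme != "https":
        return False
    host = (parts.hostname or "").lower()
    return host in ALLOWED_HOSTS


print(is_allowed_url("https://api.example.com/v1/items"))   # True
print(is_allowed_url("https://api.example.com.evil.net/"))  # False: different host
print(is_allowed_url("https://api.example.com@evil.net/"))  # False: host is evil.net
```

Matching the parsed hostname exactly, rather than searching the URL string for an allowed substring, is what keeps lookalike hosts and userinfo tricks out.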

Confirmation gates, tiered by risk. The naive design — ask a human to approve every action — backfires. It produces confirmation fatigue: people rubber-stamp prompts without reading them, and the gate becomes theater. The fix is to gate on risk level:

Low / read-only (search, read a file): autonomous, no gate.
High / irreversible / high-blast-radius (delete records, send email, deploy, spend money): require explicit human approval.
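
The two tiers above can be sketched as a small dispatcher. The action names, the `RISK_TIER` table, and the `ask_human` callback are assumptions for illustration; in a real system the approval would come from a UI or a review queue rather than a function argument.

```python
from typing import Callable

# Hypothetical action names; anything not listed here is denied outright.
RISK_TIER = {
    "search_web": "low",
    "read_file": "low",
    "delete_records": "high",
    "send_email": "high",
    "deploy": "high",
}


def run_action(name: str, execute: Callable[[], str],
               ask_human: Callable[[str], bool]) -> str:
    tier = RISK_TIER.get(name)
    if tier is None:
        # Default-deny: unknown actions never run.
        raise PermissionError(f"{name!r} is not on the allowlist")
    if tier == "high" and not ask_human(name):
        # Only high-risk actions interrupt a human, which limits fatigue.
        raise PermissionError(f"{name!r} was not approved")
    return execute()


print(run_action("read_file", lambda: "file contents", ask_human=lambda n: False))
# file contents  (low tier: the human is never asked)
```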

Design around irreversibility where you can. GitHub's Copilot coding agent (2025) cannot push to the default branch; it works on its own branch and can only propose changes through pull requests, and those PRs don't run CI automatically until a human reviews them and starts the workflows. The dangerous action is structurally impossible, not merely discouraged.

Watch out

Never cache an approval

An approval is for one action, one time. NVIDIA's guidance is explicit: approvals must never be cached or persisted across sessions. The agent's context — and any injected instructions hiding in it — changes between runs, so a legitimate "yes" yesterday is an open door for an adversary today.
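
One way to make an approval non-reusable is to bind it to the exact tool call and consume it on first use. The sketch below is an assumption about how that could be structured, not NVIDIA's implementation; `OneShotApprovals` and `action_fingerprint` are hypothetical names, and approvals live only in memory so nothing persists across sessions.

```python
import hashlib
import json


def action_fingerprint(tool: str, args: dict) -> str:
    """Stable hash of the exact tool call being approved."""
    payload = json.dumps({"tool": tool, "args": args}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class OneShotApprovals:
    """Holds approvals in memory only; each one is consumed on first use."""

    def __init__(self) -> None:
        self._pending = set()

    def approve(self, tool: str, args: dict) -> None:
        self._pending.add(action_fingerprint(tool, args))

    def consume(self, tool: str, args: dict) -> bool:
        fingerprint = action_fingerprint(tool, args)
        if fingerprint in self._pending:
            self._pending.remove(fingerprint)  # cannot be replayed
            return True
        return False


approvals = OneShotApprovals()
approvals.approve("send_email", {"to": "ops@example.com", "subject": "Report"})
print(approvals.consume("send_email", {"to": "ops@example.com", "subject": "Report"}))  # True
print(approvals.consume("send_email", {"to": "ops@example.com", "subject": "Report"}))  # False
```

Because the fingerprint covers the arguments, an approval for one recipient does not authorize a later call with a different one.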

Secrets the agent never holds

Don't hand the agent the key; hand it a valet ticket. A valet can move your car without ever holding the title or knowing where you live; the ticket works once, for one purpose, and then it's useless. With that picture in mind: the common instinct to pass an API key as an environment variable into the agent process is wrong, and it's worth understanding why. Environment variables are inherited by every child process the agent spawns, can be read on Linux from /proc/<pid>/environ by any process running as the same user, tend to leak into logs and tool outputs, and, most dangerously, are trivially exfiltrated once a prompt injection convinces the agent to print or send them. If the secret is in the process, the agent can leak it.

The correct pattern is a credential broker: a service (for example HashiCorp Vault or AWS Secrets Manager) that mints short-lived, task-scoped tokens on demand and injects them at the point of use, much as OIDC-based sign-in gives a human a short-lived token instead of a reusable password. The agent's reasoning loop never sees the raw credential: the tool layer asks the broker, the broker checks policy, and a token that expires minutes later does the work.

```python
# WRONG: the secret lives in the agent's environment for the life of the process
import os
import stripe

stripe.api_key = os.environ["STRIPE_SECRET_KEY"]  # one injection away from exfiltration

# RIGHT: a broker mints a short-lived, scoped token at call time.
# `broker`, `payments_client`, and `AGENT_IDENTITY` are placeholders for your own
# broker client, payments wrapper, and agent identity, not a real library API.
def charge(amount_cents: int, customer: str) -> dict:
    token = broker.issue(
        scope="payments:charge",      # this action only
        ttl_seconds=120,              # expires fast
        actor=AGENT_IDENTITY,         # attributable
    )
    # The token stays inside this tool function and is never placed
    # in the model's context window.
    return payments_client(token).charge(amount_cents, customer)
```

Rotate automatically, scope narrowly, and keep an audit trail of every issuance.
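
For local experiments, a stub broker with the same `issue(scope, ttl_seconds, actor)` call shape as the example above can stand in for a real service. Everything here is a hypothetical sketch: `StubBroker`, `Token`, and the policy dictionary are invented for illustration and do not reflect the Vault or AWS APIs.

```python
import secrets
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    value: str
    scope: str
    actor: str
    expires_at: float


class StubBroker:
    """In-memory stand-in for a real broker; for local exercises only."""

    def __init__(self, policy: dict) -> None:
        self._policy = policy    # actor -> set of scopes it may request
        self.audit_log = []      # every issuance and denial is recorded

    def issue(self, scope: str, ttl_seconds: int, actor: str) -> Token:
        allowed = scope in self._policy.get(actor, set())
        self.audit_log.append(
            {"actor": actor, "scope": scope, "allowed": allowed, "at": time.time()}
        )
        if not allowed:
            raise PermissionError(f"{actor} may not request {scope}")
        return Token(secrets.token_urlsafe(32), scope, actor, time.time() + ttl_seconds)

    @staticmethod
    def is_valid(token: Token, scope: str) -> bool:
        # A token works only for its own scope and only until it expires.
        return token.scope == scope and time.time() < token.expires_at


broker = StubBroker({"agent-billing-01": {"payments:charge"}})
token = broker.issue(scope="payments:charge", ttl_seconds=120, actor="agent-billing-01")
print(StubBroker.is_valid(token, "payments:charge"))  # True
print(StubBroker.is_valid(token, "payments:refund"))  # False: wrong scope
```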

All external data is untrusted

Here is the rule that surprises people: every piece of data the agent ingests from outside (retrieved documents, tool outputs, API responses, web pages, emails, calendar invites) is untrusted input, because any of it can carry an indirect prompt injection: instructions hidden in content the agent fetches, designed to hijack it. This isn't rare. Injection-style findings appeared in over 73% of assessed production deployments in 2025, and prompt injection is classified as a system-level architectural vulnerability, not something you can patch by fine-tuning the model or writing a better system prompt.

The reason it can't be prompted away: the model has no reliable channel separation between your trusted instructions and the untrusted content sitting next to them in the same context window. So defense is structural. Meta's "Rule of Two" captures it cleanly: a component should satisfy no more than two of (A) processes untrusted input, (B) accesses sensitive data, (C) can change external state. A component doing all three has an unacceptable blast radius — split it.
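
The Rule of Two lends itself to a mechanical check at design-review time. The sketch below assumes you can describe each component with three booleans; the `Component` class and the example component names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    untrusted_input: bool   # (A) processes untrusted input
    sensitive_data: bool    # (B) accesses sensitive data
    changes_state: bool     # (C) can change external state

    def properties_held(self) -> int:
        return sum([self.untrusted_input, self.sensitive_data, self.changes_state])


def violates_rule_of_two(component: Component) -> bool:
    return component.properties_held() > 2


# One component doing everything holds all three properties.
inbox_agent = Component("inbox_agent", True, True, True)

# Split: the reader sees untrusted mail and private data but cannot act;
# the sender can act but never sees raw untrusted content.
reader = Component("mail_reader", True, True, False)
sender = Component("mail_sender", False, True, True)

print(violates_rule_of_two(inbox_agent))                       # True
print([violates_rule_of_two(c) for c in (reader, sender)])     # [False, False]
```

The split only helps if whatever passes between the two components is itself constrained, for example a fixed schema rather than free text.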

Don't forget the supply chain. A 2025 scan of 1,808 MCP servers found 66% had security findings; path-traversal bugs were among the top critical findings, and CVE-2025-53109 (symlink bypass to RCE) and CVE-2025-53110 (directory-containment bypass via prefix matching) both hit Anthropic's own Filesystem MCP server. Auditing the MCP servers and tools you connect is part of your security posture, not an afterthought.
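
The prefix-matching class of bug mentioned above is easy to reproduce. The sketch below is not the code of the affected server; it contrasts a naive string-prefix check with a check that resolves the path first and then compares whole path components. The directory names are hypothetical.

```python
from pathlib import Path


def naive_is_inside(root: str, candidate: str) -> bool:
    # BUG: plain string prefix matching, shown only to illustrate the failure mode.
    return candidate.startswith(root)


def is_inside(root: str, candidate: str) -> bool:
    """Resolve symlinks and '..' first, then compare whole path components."""
    root_path = Path(root).resolve()
    target = Path(candidate).resolve()
    return target == root_path or root_path in target.parents


# A sibling directory that merely shares the prefix slips past the naive check.
print(naive_is_inside("/srv/agent", "/srv/agent-secrets/key"))  # True (wrong)
print(is_inside("/srv/agent", "/srv/agent-secrets/key"))        # False
```

Resolving before comparing also catches `..` segments and symlinks that point outside the root, although a check followed by a later open can still race with a symlink swap, so an OS-level sandbox remains the stronger control.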

Key insight

The Rule of Two

Untrusted input + sensitive data + ability to change external state = the lethal combination. Allowing any agent component all three at once is the architectural root of most catastrophic agent incidents. Cap every component at two of the three, and a compromise stays contained.

Defense in depth and monitoring

A castle doesn't rely on one wall. It has a moat, then an outer wall, then an inner keep — because the designers assumed any single barrier would eventually be breached. That assumption is the whole idea behind defense in depth: no single control on this list is sufficient, so you stack independent layers, and when one fails — and one will — the others still contain the damage. A complete posture layers:

Input sanitization — treat fetched/retrieved content as untrusted; strip or quarantine instructions.
Least-privilege tools — minimal functionality, minimal permissions, ephemeral tokens.
OS-level sandbox — filesystem and network isolation matched to trust level.
Network egress filtering — default-deny allowlist of reachable domains.
Output validation — schema-check and bound what tools return before it re-enters context.
Confirmation gates — risk-tiered human approval for irreversible actions.
Audit logging & monitoring — full tool-call logs tied to a unique agent identity, alerting on anomalies.

That last layer turns the others from hope into operations. Log every tool call with the agent identity, arguments, and result; monitor for spikes in volume (a sign of a runaway or hijacked agent) and for actions that escape the allowlist. You can't prevent every compromise, but with attributable logs and anomaly alerts you can detect it fast and prove what happened afterward — which is the difference between an incident and a catastrophe.
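
A minimal sketch of that logging layer follows: every tool call goes through one wrapper that records the agent identity, arguments, and result, and raises alerts on volume spikes and off-allowlist calls. The `ToolAuditor` class, its threshold, and the alert strings are assumptions for illustration; a production version would ship records to a log pipeline and redact sensitive arguments before writing them.

```python
import json
import time
from collections import deque
from typing import Any, Callable


class ToolAuditor:
    def __init__(self, agent_id: str, allowed_tools: set, max_calls_per_minute: int,
                 sink: Callable[[str], None] = print) -> None:
        self.agent_id = agent_id
        self.allowed_tools = allowed_tools
        self.max_calls_per_minute = max_calls_per_minute
        self.sink = sink                # where JSON log lines go
        self._recent = deque()          # timestamps of calls in the last 60 s
        self.alerts = []

    def call(self, tool: str, fn: Callable[..., Any], **args: Any) -> Any:
        now = time.time()
        self._recent.append(now)
        while self._recent and now - self._recent[0] > 60:
            self._recent.popleft()
        if len(self._recent) > self.max_calls_per_minute:
            self.alerts.append(f"volume spike: {len(self._recent)} calls in 60s")

        record = {"ts": now, "agent": self.agent_id, "tool": tool, "args": args}
        if tool not in self.allowed_tools:
            self.alerts.append(f"off-allowlist tool: {tool}")
            record["result"] = "DENIED"
            self.sink(json.dumps(record, default=str))
            raise PermissionError(tool)

        result = fn(**args)
        record["result"] = repr(result)[:200]   # bound what gets logged
        self.sink(json.dumps(record, default=str))
        return result
```

Routing every call through a single choke point is the design choice that matters: it guarantees that denied attempts are logged too, which is often the earliest sign of a hijacked agent.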

Try it: Shrink the blast radius

Take a small agent you've built (or the from-scratch agent from Module 3) that has at least one tool which writes or sends — a file writer, a shell, or an email tool. Do four things and write one paragraph on each:

Scope it down. List every permission its tools currently hold, then cut each to the minimum the task needs (e.g., SELECT-only DB access, a single allowlisted output directory). Note what you removed.
Add a risk-tiered gate. Classify each tool action as low (read-only) or high (irreversible/external write). Add a human-approval prompt to the high-risk ones only — and confirm you are NOT caching the approval.
Move the secret. If any credential is read from an environment variable, replace it with a stub credential-broker function that returns a short-lived token at call time and keeps the raw value out of the model's context.
Apply the Rule of Two. Identify any single component that processes untrusted input AND accesses sensitive data AND can change external state. Sketch how you'd split it so no component holds all three.

Deliverable: a before/after diff plus your four paragraphs. This is the exact audit you'll run on every production agent you ship.

Key takeaways

1. Design for blast radius: assume the agent will be compromised and make the worst case small, reversible, and observable.
2. Least privilege has three independent fronts (excessive functionality, permissions, and autonomy, per OWASP LLM06:2025), and all three need fixing.
3. Match the sandbox to trust: containers for trusted code, gVisor for multi-tenant, Firecracker microVMs for untrusted or AI-generated code.
4. Never put secrets in the agent's environment; broker short-lived, task-scoped tokens, and gate irreversible actions on risk without caching approvals.
5. Treat all external data as untrusted, cap every component at two of {untrusted input, sensitive data, external writes}, and layer controls with audit logging.

Quiz

Lock in what you learned

Check your understanding


1. OWASP's LLM06:2025 (Excessive Agency) identifies three root causes. Which set is correct?

2. You need to run untrusted, AI-generated code in production with the strongest practical isolation. Which option fits best?

3. Why is passing an API key as an environment variable into an agent process considered unsafe?

4. According to Meta's 'Rule of Two,' which agent component has an unacceptable blast radius?

Go deeper

Hand-picked sources to keep learning

OWASP AI Agent Security Cheat Sheet

The most comprehensive practical reference: least-privilege scoping, risk tiers, memory isolation, multi-agent signing, and confirmation gate design. Living document.

OWASP LLM06:2025 — Excessive Agency

The canonical 2025 entry for agent over-privilege and autonomy, defining the three root causes and their mitigations (allowlists, HITL gates, downstream enforcement).

Making Claude Code More Secure: Sandboxing (Anthropic)

Production example of dual filesystem + network isolation via bubblewrap/Seatbelt; reduced permission prompts by 84%.

How GitHub's Agentic Security Principles Work (GitHub Blog)

Six concrete principles including 'prevent irreversible changes': the agent can only open pull requests and never commits directly to the default branch.

Sandboxing AI Agents: MicroVMs, gVisor & Isolation (Northflank)

Technical comparison of Docker vs gVisor vs Firecracker vs Kata with boot times, overhead, and use-case guidance.

Practical Security Guidance for Sandboxing Agentic Workflows (NVIDIA)

Tiered security model, OS-level vs application-level controls, secret-injection patterns, and the rule that approvals must never be cached.