Deploying & Scaling Agents

From notebook to always-on service

Advanced · 14 min · Builder
What you'll be able to do
  • Decide between synchronous and async/queue-based deployment for a given agent task
  • Design a queue-and-worker architecture for long-running agent runs
  • Persist agent state and sessions so a run can survive a crash and resume
  • Manage provider rate limits, quotas, and concurrency before they cause outages
  • Choose a compute target (serverless, containers, micro-VMs, durable execution) and control cost at scale
At a glance

An agent that runs beautifully in your notebook is not a product. Production agents are long-running, stateful, non-deterministic, and burst dozens of external API calls per task — which breaks the request-response model most web services assume. This lesson shows you the deployment patterns that actually hold up: async queues, durable execution, state persistence, rate-limit management, and cost control.

  1. Why agents break the web-service playbook
  2. Synchronous vs. async execution
  3. Queues and workers for long-running agents
  4. State and session persistence
  5. Concurrency, rate limits, and provider quotas
  6. Durable execution for resilience
  7. Choosing compute and controlling cost

Why agents break the web-service playbook

A normal web service follows a tidy contract: a request comes in, you compute, you return a response in milliseconds, and you forget everything. Agents violate every clause of that contract.

An agent run is long (a research or coding task can take minutes to hours), stateful (it carries a growing message history and plan across steps), non-deterministic (the same input can take a different path), and bursty (a single task may fire dozens of sequential LLM and tool calls). The standard HTTP request lifecycle assumes the opposite of all four.

The practical consequence: you cannot just wrap your agent in a Flask route and call it done. The request will time out, a crash will lose all progress, and concurrent users will collide with provider rate limits you never hit in development.

Deploying an agent is therefore less about "hosting a model" and more about building a small distributed system around it — one that can run work in the background, remember where it was, recover from failure, and stay inside provider limits. The rest of this lesson builds that system piece by piece.

Synchronous vs. async execution

The first decision is the most consequential: does the caller wait, or not?

Synchronous (request-response) fits short, stateless tasks that finish inside an HTTP timeout — document classification, a one-shot Q&A, a single tool call. The client sends a request and blocks for the answer. Simple, and perfectly fine when the work takes a few seconds.

Async / background is required the moment a run can outlast a request. HTTP timeouts are typically 30–300 seconds; an agent run that takes 20 minutes cannot live inside one. The pattern:

  1. The web service enqueues a job and returns immediately with a job_id.
  2. A worker pulls the job and executes the agent run in the background.
  3. The client polls GET /jobs/{id} or receives a push (webhook, Slack message, email) on completion.
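
The three steps above can be sketched with nothing but the standard library. This is a minimal illustration rather than a production design: the in-memory JOBS dict and the thread pool stand in for a real database and broker, and fake_agent_run, submit, and get_job are hypothetical names chosen for this example.

```python
import time
import uuid
from concurrent.futures import ThreadPoolExecutor

# In-memory stand-ins: a real deployment would use a broker and a database.
JOBS: dict[str, dict] = {}
POOL = ThreadPoolExecutor(max_workers=2)


def fake_agent_run(prompt: str) -> str:
    """Hypothetical placeholder for a long agent run."""
    time.sleep(0.05)
    return f"answer to: {prompt}"


def _work(job_id: str, prompt: str) -> None:
    """Step 2: the worker executes the run in the background."""
    JOBS[job_id]["status"] = "running"
    try:
        # Store the result before flipping the status, so a poller that
        # sees "done" always finds the result in place.
        JOBS[job_id]["result"] = fake_agent_run(prompt)
        JOBS[job_id]["status"] = "done"
    except Exception as exc:
        JOBS[job_id]["error"] = str(exc)
        JOBS[job_id]["status"] = "failed"


def submit(prompt: str) -> str:
    """Step 1: record the job, hand it to a worker, return at once."""
    job_id = uuid.uuid4().hex
    JOBS[job_id] = {"status": "queued", "result": None}
    POOL.submit(_work, job_id, prompt)
    return job_id


def get_job(job_id: str) -> dict:
    """Step 3: what a GET /jobs/{id} handler would return."""
    return dict(JOBS[job_id])
```

A client would call submit once, keep the returned ID, and poll get_job until the status is done or failed.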

The cardinal rule: never design a synchronous UI for a 20-minute agent task. A spinner that blocks for twenty minutes is a guaranteed timeout and a terrible experience. Hand back a job ID, show progress, and notify on completion.

Watch out

The timeout trap

If your agent occasionally finishes in 8 seconds, a synchronous endpoint seems to work in testing — until a harder task takes 90 seconds and the load balancer kills the connection. "Usually fast" is not a deployment model. If worst-case duration can exceed your timeout, go async.

Queues and workers for long-running agents

Async execution is implemented with a queue (a durable list of pending jobs) and workers (processes that pull and run them). Common stacks: Celery + Redis or RabbitMQ in Python, or AWS SQS feeding container workers. The queue decouples accepting work from doing work, which lets you scale the two independently and absorb bursts.

A minimal Celery setup looks like this:

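python
# tasks.py
from celery import Celery

# Hypothetical project module: supply your own agent, result store, and
# the rate-limit exception your provider SDK raises.
from myapp.agent import RateLimitError, agent, save_result

app = Celery("agents", broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/1")

# acks_late: acknowledge only after the task finishes, so an interrupted
# job is redelivered instead of silently lost.
# reject_on_worker_lost: also requeue if the worker process is killed mid-task.
@app.task(bind=True, acks_late=True, reject_on_worker_lost=True, max_retries=3)
def run_agent(self, job_id: str, prompt: str):
    try:
        result = agent.run(prompt)          # the long-running work
        save_result(job_id, result)
    except RateLimitError as exc:
        # exponential backoff (1 s, 2 s, 4 s), then retry
        raise self.retry(exc=exc, countdown=2 ** self.request.retries)

python
# api.py: enqueue and return immediately
# (router and Req are assumed to be a FastAPI APIRouter and a Pydantic model)
from uuid import uuid4
from tasks import run_agent

@router.post("/agent")
def start(req: Req):
    job_id = uuid4().hex
    run_agent.delay(job_id, req.prompt)     # non-blocking
    return {"job_id": job_id, "status": "queued"}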

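Set acks_late=True so a job is acknowledged only after it completes; if the worker goes away mid-run, the broker redelivers it. Add reject_on_worker_lost=True if you also want the job requeued when the worker process itself is killed. A redelivered job runs again from the top, so the task must be safe to repeat. With a Redis broker, redelivery waits for the visibility timeout (one hour by default), and a task that runs longer than that timeout is redelivered while still in progress, so set the timeout above your worst-case run time. Route heavy and light tasks to separate queues so a flood of slow jobs can't starve fast ones.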
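
Routing heavy and light work to separate queues can be as simple as choosing a queue name at enqueue time. The sketch below is one possible approach under stated assumptions: the queue names, the step threshold, and the pick_queue helper are all hypothetical, and the step estimate is something your own code would have to supply.

```python
# Hypothetical queue names and threshold, chosen for illustration.
LIGHT_QUEUE = "agents.light"
HEAVY_QUEUE = "agents.heavy"
HEAVY_STEP_THRESHOLD = 10


def pick_queue(estimated_steps: int) -> str:
    """Send runs expected to take many steps to the heavy queue."""
    if estimated_steps >= HEAVY_STEP_THRESHOLD:
        return HEAVY_QUEUE
    return LIGHT_QUEUE
```

With Celery, the chosen name could be passed at enqueue time via apply_async(..., queue=pick_queue(n)), with separate worker pools started using the worker's -Q option so each pool consumes only its own queue.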

State and session persistence

Agents are stateful, and that state must live outside the worker process — otherwise a restart erases the run. There are five patterns you'll combine in practice:

  1. Windowed memory — keep the last N messages in Redis.
  2. Summarized memory — periodically compress history with an LLM call.
  3. Semantic / RAG memory — retrieve relevant facts from a vector DB.
  4. Checkpoint / serialization — snapshot the full agent state to durable storage (used by LangGraph and AutoGen).
  5. Hybrid storage — match the medium to the data type.
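
The first two patterns in the list combine naturally: keep a short window verbatim and fold whatever falls out of it into a running summary. The sketch below assumes a plain in-process deque rather than Redis, and naive_summarize is a hypothetical stand-in for the LLM call that would do the real compression.

```python
from collections import deque
from typing import Callable


class WindowedMemory:
    """Keep the last `window` messages verbatim; fold older ones into a summary."""

    def __init__(self, window: int, summarize: Callable[[str, list[str]], str]):
        self.window = window
        self.summarize = summarize
        self.summary = ""
        self.recent: deque[str] = deque()

    def add(self, message: str) -> None:
        self.recent.append(message)
        overflow = []
        while len(self.recent) > self.window:
            overflow.append(self.recent.popleft())
        if overflow:
            # Only messages leaving the window are sent to the summarizer.
            self.summary = self.summarize(self.summary, overflow)

    def context(self) -> list[str]:
        """What would be sent to the model on the next step."""
        head = [f"[summary] {self.summary}"] if self.summary else []
        return head + list(self.recent)


def naive_summarize(previous: str, dropped: list[str]) -> str:
    """Stand-in for an LLM call: only counts how many messages were folded in."""
    count = int(previous.split()[0]) if previous else 0
    return f"{count + len(dropped)} earlier messages"
```

Because the summarizer is passed in as a callable, the same class works with a real model call in production and a cheap fake in tests.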

For production checkpointing with LangGraph, the rule is simple: use MemorySaver for development, and PostgresSaver in production. Skip SqliteSaver — it buckles under write contention at real concurrency.

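python
from langgraph.checkpoint.postgres import PostgresSaver

# DB_URL, builder, job_id, and user_msg are assumed to be defined elsewhere.
# In current langgraph-checkpoint-postgres releases, from_conn_string is a
# context manager that opens and closes the database connection.
with PostgresSaver.from_conn_string(DB_URL) as checkpointer:
    checkpointer.setup()   # creates the checkpoint tables on first use
    graph = builder.compile(checkpointer=checkpointer)

    # thread_id ties a run to its persisted state; resume by reusing it.
    cfg = {"configurable": {"thread_id": job_id}}
    graph.invoke({"messages": [user_msg]}, cfg)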

On AWS, DynamoDBSaver stores small checkpoints (<350KB) directly and offloads large ones to S3 with a reference pointer. Expect checkpointing to add roughly 10–20% overhead versus a stateless chain — cheap insurance for resumability.
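
The inline-or-offload idea is easy to see in miniature. The sketch below is not DynamoDBSaver's implementation; it only illustrates the pattern, with two dicts standing in for a DynamoDB table and an S3 bucket. The function names and key layout are hypothetical, and the threshold is the figure quoted above.

```python
import json
import uuid

INLINE_LIMIT_BYTES = 350 * 1024  # threshold quoted in the text

# Dicts stand in for a DynamoDB table and an S3 bucket.
table: dict[str, dict] = {}
bucket: dict[str, bytes] = {}


def save_checkpoint(thread_id: str, state: dict) -> None:
    """Store small checkpoints inline; offload large ones and keep a pointer."""
    blob = json.dumps(state).encode("utf-8")
    if len(blob) < INLINE_LIMIT_BYTES:
        table[thread_id] = {"kind": "inline", "data": blob}
    else:
        key = f"checkpoints/{thread_id}/{uuid.uuid4().hex}"
        bucket[key] = blob
        table[thread_id] = {"kind": "pointer", "key": key}


def load_checkpoint(thread_id: str) -> dict:
    """Follow the pointer if the checkpoint was offloaded."""
    item = table[thread_id]
    blob = item["data"] if item["kind"] == "inline" else bucket[item["key"]]
    return json.loads(blob)
```

The caller never needs to know where a checkpoint landed, which is what lets the size threshold change later without touching agent code.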

Key insight

Statelessness is not a virtue here

The instinct to make services stateless is right for CRUD APIs and wrong for agents. Forcing statelessness means re-sending the entire run history on every call (expensive) or losing recovery entirely. A stateful, persisted session — keyed by a thread_id — is the correct model for almost every non-trivial agent.

Concurrency, rate limits, and provider quotas

This is the failure mode that catches teams by surprise. Unmanaged rate limiting is the number-one cause of agent failures in production.

Providers enforce multiple simultaneous limits: requests per minute (RPM), tokens per minute (TPM), tokens per day (TPD), and concurrent requests — at the organization and project level. A single agent task can fire dozens of sequential calls, so it looks like a small DDoS. Scaling workers makes this worse: more workers means more requests hitting the same shared quota at once.
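
A back-of-the-envelope check makes the shared-quota problem concrete. The sketch below assumes every worker draws on one organization-wide RPM limit and that you want to keep some headroom; the function name, parameters, and numbers are illustrative, not provider values.

```python
def max_safe_workers(rpm_limit: int, calls_per_worker_per_min: int,
                     headroom_pct: int = 80) -> int:
    """How many workers fit under a shared requests-per-minute limit."""
    usable_rpm = rpm_limit * headroom_pct // 100
    return usable_rpm // calls_per_worker_per_min
```

For example, with a 1,000 RPM limit and workers that each issue about 20 calls per minute, max_safe_workers(1000, 20) allows 40 workers at 80% headroom. Every worker beyond that number adds rate-limit errors rather than throughput.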

Two principles fix this:

  • Check quota before you start, not during. Before kicking off a 30-step workflow, confirm you have headroom. Failing on step 27 wastes everything before it.
  • Put an AI gateway in front of every provider call. Tools like Portkey, TrueFoundry, or AgentGateway enforce token/request budgets uniformly across all traffic, attribute cost per feature, cache, and fail over to a backup model when a provider throttles.
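
python
# A gateway centralizes limits, retries, and failover,
# instead of scattering ad-hoc backoff across your codebase.
from portkey_ai import Portkey

# GATEWAY_KEY is assumed to be defined elsewhere. Each target also needs
# its own provider credentials, omitted here.
client = Portkey(
    api_key=GATEWAY_KEY,
    config={"strategy": {"mode": "fallback"},
            "targets": [{"provider": "anthropic"},
                        {"provider": "openai"}]},
)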

Don't try to scale your way out of a quota problem — adding workers without managing quota just produces synchronized rate-limit errors.
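
The "check quota before you start" principle can be sketched as a small pre-flight function. This is an illustration under assumptions: QuotaSnapshot is a hypothetical view of remaining quota (in practice it might come from a gateway or from provider response headers), and the per-step token estimate and safety factor are values you would have to calibrate yourself.

```python
from dataclasses import dataclass


@dataclass
class QuotaSnapshot:
    """Hypothetical view of remaining quota at the moment a run is requested."""
    tokens_left_today: int
    tokens_left_this_minute: int


def preflight_ok(quota: QuotaSnapshot, steps: int, est_tokens_per_step: int,
                 safety_factor: float = 1.5) -> bool:
    """Return True only if the whole run is likely to fit in remaining quota."""
    est_total = int(steps * est_tokens_per_step * safety_factor)
    fits_daily = est_total <= quota.tokens_left_today
    fits_minute = est_tokens_per_step <= quota.tokens_left_this_minute
    return fits_daily and fits_minute
```

A run that fails this check can be left in the queue and retried later, which costs nothing, instead of failing partway through after the earlier steps have already been paid for.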

Durable execution for resilience

Retries handle a single flaky call. But what happens when a worker crashes 18 steps into a 25-step run? Without help, you restart from zero — re-paying for and re-doing all the prior work, and possibly repeating side effects.

Durable execution solves this. Frameworks like Temporal and Restate checkpoint every step of a workflow automatically, so if a worker dies, another resumes exactly where it left off. They also pause-and-resume on rate limits and retry transient failures transparently.

  • Temporal uses a pull model: dedicated workers pull tasks from a queue. Its OpenAI Agents SDK integration reached GA on March 23, 2026 and wraps agent calls as Activities automatically — you don't hand-write activity code. Workflows must be deterministic; the non-deterministic LLM and tool calls live in Activities.
  • Restate uses a push model: the server pushes work to serverless functions, so you get durable agents on Vercel, Cloudflare Workers, Lambda, or Modal with no dedicated worker fleet. It records each step in a durable journal.

Choose Temporal when you already run worker infrastructure and want a battle-tested platform; choose Restate when you want durability on plain serverless without managing workers.
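
The core idea behind both frameworks, recording each completed step so a restart can skip it, fits in a few lines. The sketch below is a toy illustration and not how Temporal or Restate are implemented: the journal is a plain dict standing in for durable storage, and run_durably is a hypothetical name.

```python
from typing import Callable


def run_durably(journal: dict[str, str],
                steps: list[tuple[str, Callable[[], str]]]) -> dict[str, str]:
    """Run named steps in order, skipping any already recorded in the journal."""
    for name, fn in steps:
        if name in journal:
            continue              # finished before the crash: reuse the result
        journal[name] = fn()      # record the result as soon as the step completes
    return journal
```

If a step raises partway through, the journal still holds every earlier result, so calling run_durably again with the same journal resumes at the failed step instead of starting from zero. A real system would persist the journal after every step and would also have to handle a crash that lands between a side effect and its journal write.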

Tip

When durable execution earns its keep

If a run is short and idempotent, queue retries are enough. Reach for Temporal or Restate when runs are long, multi-step, and expensive to repeat — exactly the workflows where a mid-run crash would otherwise cost real money and re-trigger side effects like sent emails or filed tickets.

Choosing compute and controlling cost

Where you run the agent — the compute target — is the box your code executes inside: a serverless function, a container, or a lightweight virtual machine. The choice is constrained by one brutal fact: each box has a maximum time it will let a single run stay alive — its execution time limit. If your agent runs longer, the platform kills it mid-task.

Target                        | Max execution time | Notes
------------------------------|--------------------|-----------------------------------------------------
AWS Lambda                    | 15 min (hard)      | Fine for short tasks only
Azure Functions (Consumption) | 10 min (hard)      | HTTP responses also capped at 230 s by load balancer
Google Cloud Functions        | 60 min (HTTP)      | Workable for medium runs
Containers (ECS/Fargate/K8s)  | No platform limit  | Standard for long agents
Micro-VMs (Firecracker)       | No platform limit  | Strong isolation, fast resume

The misconception to kill: serverless is fine for long-running agents. It isn't. Many non-trivial agent runs exceed Lambda's 15 minutes or Azure Consumption's 10-minute ceiling, and the platform kills any single invocation that does. Use containers or micro-VMs for anything long. Micro-VMs (Firecracker-based platforms such as Fly.io, or Kata Containers) add per-workload kernel isolation and sub-25ms snapshot resume. That makes them the right choice when running untrusted or LLM-generated code, since containers share the host kernel and have had critical escape CVEs.
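
The selection rule can be written down directly from the limits in the table. The sketch below is a simplification under assumptions: the target names are made-up identifiers, the limits are the table's figures converted to seconds, and real decisions would also weigh cost, cold starts, and operational overhead.

```python
# Limits in seconds, taken from the table in this section.
# None means the platform imposes no execution time limit.
LIMITS = {
    "aws_lambda": 15 * 60,
    "azure_functions_consumption": 10 * 60,
    "google_cloud_functions_http": 60 * 60,
    "containers": None,
    "micro_vm": None,
}


def viable_targets(worst_case_seconds: int,
                   runs_untrusted_code: bool = False) -> list[str]:
    """List targets whose time limit exceeds the worst-case run duration."""
    if runs_untrusted_code:
        return ["micro_vm"]   # per the text: isolate untrusted code in a micro-VM
    return [name for name, limit in LIMITS.items()
            if limit is None or worst_case_seconds < limit]
```

Note that the input is the worst-case duration, not the typical one: a run that usually takes five minutes but sometimes takes twenty rules out both Lambda and Azure Consumption.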

On cost: API prices dropped 60–80% over 2024–2026, yet total token consumption grew 13x, because an agent re-sends its growing history on every step and so multiplies the tokens spent per task. Control it with model routing (cheap model for simple steps, premium for hard ones), which cuts spend 40–60% when the cheap model handles its steps without quality loss, plus per-feature cost attribution and real-time usage dashboards so an overrun is visible before the invoice.
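
Model routing is simple to prototype. In the sketch below, the model names, the per-million-token prices, and the set of "hard" step kinds are all made up for illustration; the point is only the shape of the technique, not the savings you should expect.

```python
from typing import Callable

# Hypothetical model names and per-million-token prices, for illustration only.
PRICE_PER_MTOK = {"small-model": 0.5, "large-model": 10.0}
HARD_STEP_KINDS = {"plan", "code", "synthesize"}


def route(step_kind: str) -> str:
    """Send hard steps to the premium model and everything else to the cheap one."""
    return "large-model" if step_kind in HARD_STEP_KINDS else "small-model"


def run_cost(steps: list[tuple[str, int]],
             router: Callable[[str], str] = route) -> float:
    """Cost in dollars for a list of (step_kind, tokens) pairs."""
    return sum(PRICE_PER_MTOK[router(kind)] * tokens / 1_000_000
               for kind, tokens in steps)
```

Comparing run_cost(steps) against run_cost(steps, router=lambda kind: "large-model") on logged runs gives a per-feature estimate of what routing would save, which is the kind of number a cost dashboard can track.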

Try it: Make a synchronous agent survive a crash

Take any agent you've built and put it behind a queue. (1) Add a POST /agent endpoint that enqueues the run with Celery+Redis (or SQS) and returns a job_id immediately, plus a GET /jobs/{id} to poll status. (2) Set acks_late=True and add exponential-backoff retries on rate-limit errors. (3) Persist run state with LangGraph's PostgresSaver, keyed by thread_id = job_id. (4) Start the run, then kill -9 the worker partway through and restart it — confirm the run resumes from its last checkpoint rather than restarting. (5) Write three sentences on what would need to change to handle 100 concurrent users without hitting provider rate limits (hint: an AI gateway and a pre-flight quota check). This exercise turns a fragile demo into something with the bones of a production deployment.

Key takeaways

  1. Agents are long-running, stateful, non-deterministic, and bursty, so they need an async, distributed-system architecture, not a plain request-response endpoint.
  2. Use a queue and background workers for any run that can exceed an HTTP timeout: enqueue, return a job ID, execute in a worker, then poll or push the result.
  3. Persist agent state outside the worker (LangGraph PostgresSaver in production, DynamoDB+S3 on AWS) so a run can survive a crash and resume by thread_id.
  4. Manage provider rate limits proactively with an AI gateway and pre-flight quota checks; unmanaged rate limiting is the top cause of production agent failures, and adding workers makes it worse.
  5. Durable execution (Temporal pull-based, Restate push-based) resumes long workflows from their last checkpoint; pick container or micro-VM compute and use model routing to control cost.

Quiz

Lock in what you learned

Check your understanding

1.Why is a synchronous HTTP endpoint usually the wrong way to deploy a long-running agent?

2.In a Celery-based agent deployment, what does setting acks_late=True accomplish?

3.Which statement about provider rate limits for agents is correct?

4.What problem does durable execution (Temporal, Restate) specifically solve?
