Designing Tools Agents Can Use
Tool design is prompt engineering for actions
- Write tool names, descriptions, and parameter schemas that an LLM selects and fills in correctly
- Pick the right granularity — consolidating operations that go together and splitting ones that don't
- Design return values and error messages as high-signal feedback the model can act on
- Make side-effecting tools idempotent and gate destructive actions behind explicit confirmation
- Evaluate tools with realistic multi-step tasks and iterate on the descriptions that fail
An agent is only as capable as the tools you hand it, and a tool the model can't understand is worse than no tool at all. This lesson treats tool design as what it really is — prompt engineering for actions — and walks through naming, descriptions, typed parameters, bounded return values, model-readable errors, and the safety invariants that keep nondeterministic agents from doing damage.
1. Your tool is a prompt the model reads every turn
2. Names and descriptions: writing for the model
3. Granularity: not too many, not too few
4. Typed parameters: make invalid states unrepresentable
5. Return values and errors are feedback, not logs
6. Idempotency, confirmation, and safety in the tool layer
7. Evaluate tools with real tasks, then iterate
Your tool is a prompt the model reads every turn
Picture how the agent actually sees your tool. It never runs your code to find out what it does; it reads the words you wrote — the name, the description, the parameter list — sitting right there in its context alongside the system prompt. From the model's point of view, a tool is a paragraph it has to interpret, not a function it can inspect. That single fact reframes the whole job.
Here is the mental model that changes everything: a tool definition is not API documentation for humans — it is a prompt for the model. Every tool's name, description, and parameter schema is loaded into the agent's context on every call. The model reads them to decide which tool to use and how to fill it in, exactly the way it reads your system prompt.
That reframing has teeth. In 2025, Anthropic published "Writing effective tools for AI agents", making the case directly. A paper on rewriting tool descriptions (arXiv:2602.20426) reported that on catalogs of 100+ tools, improving descriptions alone produced consistent, significant gains in tool-selection accuracy and end-to-end task success. The tool's implementation never changed; only the words the model reads did.
So the craft of tool design is the craft of writing for a model: clear names, an unambiguous description, a schema that makes wrong inputs impossible to express, and outputs sized for a context window. Get the wording right and a mediocre model uses your tools reliably. Get it wrong and your best model fumbles. Everything in this lesson follows from treating the tool surface as prose the model must understand under pressure.
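To make the "prompt, not docs" idea concrete, here is a minimal sketch of what gets handed to the model on each request. The dict layout (`name`, `description`, `input_schema`) mirrors the general shape of common tool-calling APIs but is simplified and illustrative, and the four-characters-per-token ratio is a rough assumption, not a real tokenizer.

```python
import json

# Hypothetical tool definition; the field names are illustrative only.
SEARCH_INVOICES = {
    "name": "search_invoices",
    "description": (
        "Search invoices by customer or status. Use when the user asks about "
        "billing history. Do not use for refunds (use refund_payment)."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "customer_id": {"type": "string", "description": "e.g. 'cus_123'"},
            "status": {"type": "string", "enum": ["open", "closed", "refunded"]},
        },
        "required": ["customer_id"],
    },
}


def estimate_tool_tokens(tools: list[dict], chars_per_token: int = 4) -> int:
    """Roughly estimate the context cost of sending these definitions once."""
    serialized = json.dumps(tools)
    return len(serialized) // chars_per_token


def overhead_for_conversation(tools: list[dict], turns: int) -> int:
    """Definitions are resent every turn, so the cost scales with turn count."""
    return estimate_tool_tokens(tools) * turns
```

Running `overhead_for_conversation` over a full catalog gives a quick, if crude, sense of how much context the tool surface consumes before the agent has done any work.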
Names and descriptions: writing for the model
Think of the name and description as the only briefing the model gets before it has to act. If a smart new hire couldn't pick the right tool from those words alone, neither can the model — so write them like instructions, not labels.
Start with the name. Use snake_case, consistently, and name the action the model is taking: search_invoices, send_email, cancel_subscription. For larger tool sets, namespace with a prefix or suffix — asana_search, jira_search — so related tools cluster. Anthropic notes the prefix-vs-suffix choice has a measurable effect on tool-selection scores, so pick one convention and hold it.
The description is a mini-prompt. A reliable template:
"Tool to <do X>. Use when <Y>. Do not use for <Z>."
State the critical constraint first, keep it under ~1024 characters, and remember every character costs tokens on every call. Spell out what the model can't infer — units, formats, side effects, and when not to reach for this tool. OpenAI suggests the intern test: could a competent new hire, with no other context, pick the right tool from your description alone?
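```python
# Assumption: @tool is LangChain's decorator; swap in your framework's equivalent.
from langchain_core.tools import tool


@tool
def refund_payment(payment_id: str, amount_cents: int) -> str:
    """Issue a refund for a completed payment.

    Use when a customer is owed money back on a settled charge.
    Do NOT use for disputes or chargebacks (use open_dispute).
    amount_cents must be <= the original charge. Idempotent per
    (payment_id, amount_cents).
    """
    ...  # implementation omitted; the docstring is what the model reads
```
Key insight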
Descriptions are prompt engineering, not docs
Tool descriptions are read by the model on every request and directly steer selection. Small wording changes produce large, measurable accuracy differences — so version them, A/B them, and treat a vague description as a bug, not a cosmetic flaw.
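One way to treat a vague description as a bug is to lint it like code. The sketch below fills the "Tool to X. Use when Y. Do not use for Z." template and checks the naming and length conventions from this section. The helper names `build_description` and `lint_tool` are hypothetical, and the substring checks are deliberately crude.

```python
import re

MAX_DESCRIPTION_CHARS = 1024  # length budget suggested in the text
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def build_description(does: str, use_when: str, do_not_use_for: str) -> str:
    """Fill the 'Tool to X. Use when Y. Do not use for Z.' template."""
    return f"Tool to {does}. Use when {use_when}. Do not use for {do_not_use_for}."


def lint_tool(name: str, description: str) -> list[str]:
    """Return a list of problems; an empty list means the tool passes."""
    problems = []
    lowered = description.lower()
    if not SNAKE_CASE.match(name):
        problems.append(f"name '{name}' is not snake_case")
    if len(description) > MAX_DESCRIPTION_CHARS:
        problems.append("description exceeds 1024 characters")
    if "use when" not in lowered:
        problems.append("description does not say when to use the tool")
    if "do not use" not in lowered:
        problems.append("description does not say when NOT to use the tool")
    return problems
```

A check like this can run in CI alongside unit tests, so a description change is reviewed with the same care as a code change.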
Granularity: not too many, not too few
How many tools should an agent have, and how big should each one be? This is a Goldilocks problem, and both extremes fail in opposite ways — flood the model with choices, or hand it one blunt instrument it has to wrestle.
The first mistake is too many tools: wrapping every API endpoint 1:1 gives you fifty overlapping functions whose descriptions blur together. Every one is read on every turn, inflating cost and latency, and selection accuracy collapses. OpenAI's guidance is to keep the active set under ~20 tools per turn and defer the rest with dynamic loading or a tool-search step. As catalogs pass 50–150 tools, that search step typically leans on embedding similarity over the descriptions, and a poorly described tool becomes indistinguishable from the right one.
The second mistake is too few, too coarse: a single manage_files(action, ...) god-tool forces the model to encode intent into stringly-typed arguments it gets wrong.
The principle: each tool does one atomic action — copy_file, move_file, delete_file — but consolidate operations the agent always uses together. If the model invariably calls get_customer_by_id, then list_transactions, then list_notes, fuse them into one get_customer_context that returns all three. You've cut three round-trips to one and removed two chances to pick the wrong next tool.
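The consolidation described here might look like the following sketch. The three in-memory dicts are hypothetical stand-ins for real backends; the point is only that `get_customer_context` wraps the three lookups the agent would otherwise chain.

```python
# Hypothetical in-memory backends standing in for three separate API calls.
CUSTOMERS = {"cus_1": {"name": "Ada Lovelace", "plan": "pro"}}
TRANSACTIONS = {"cus_1": [{"amount_cents": 4200, "status": "settled"}]}
NOTES = {"cus_1": ["Prefers email contact."]}


def get_customer_by_id(customer_id: str) -> dict:
    return CUSTOMERS[customer_id]


def list_transactions(customer_id: str) -> list[dict]:
    return TRANSACTIONS.get(customer_id, [])


def list_notes(customer_id: str) -> list[str]:
    return NOTES.get(customer_id, [])


def get_customer_context(customer_id: str) -> dict:
    """Return profile, transactions, and notes in a single tool call.

    Use when you need an overall picture of one customer.
    """
    return {
        "customer": get_customer_by_id(customer_id),
        "transactions": list_transactions(customer_id),
        "notes": list_notes(customer_id),
    }
```

Only the fused function needs to be exposed to the agent; the three narrow ones can remain as internal helpers.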
Watch out
More tools is not more capable
It is tempting to expose your whole API "just in case." Don't. Beyond ~20 active tools, every added tool taxes cost, latency, and selection accuracy. Curate aggressively and load rarely-used tools on demand.
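Loading tools on demand can start very simply. This is a minimal sketch of a tool-search step, assuming a catalog that maps tool names to descriptions. Scoring here is naive word overlap, chosen only to keep the example self-contained; a production system would more likely use embeddings or a proper search index.

```python
def select_active_tools(query: str, catalog: dict[str, str], limit: int = 20) -> list[str]:
    """Pick at most `limit` tool names whose descriptions overlap the query."""
    query_words = set(query.lower().split())
    scored = []
    for name, description in catalog.items():
        overlap = len(query_words & set(description.lower().split()))
        if overlap > 0:
            scored.append((overlap, name))
    # Highest overlap first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:limit]]


# Hypothetical catalog for illustration.
CATALOG = {
    "search_invoices": "search invoices by customer or status",
    "send_email": "send an email to a customer",
    "cancel_subscription": "cancel a customer subscription",
}
```

Note that the ranking depends entirely on the description text, which is another reason a vague description quietly hides a tool from the agent.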
Typed parameters: make invalid states unrepresentable
The cheapest bug to fix is the one the model can't even write down. If your schema admits only valid inputs and the provider enforces it (as strict mode does), the model cannot emit a structurally malformed call, so there is nothing to validate, retry, or apologize for on that front. That is the goal of parameter design: shrink the space of expressible mistakes as close to zero as you can.
Use JSON Schema to its fullest:
- Enums for finite value sets (status: "open" | "closed" | "refunded"): the model can't hallucinate "refunded!".
- Strong types over free strings: integers, ISO-8601 dates, booleans. Encode units in the name (amount_cents, timeout_ms).
- Explicit required/optional, with concrete examples inside each field's description.
- Strict mode on. OpenAI's strict mode (additionalProperties: false, every field marked required, null as a type for genuinely optional fields) guarantees schema adherence: no missing keys, no invented enum values.
```python
from pydantic import BaseModel, Field
from typing import Literal


class BookFlight(BaseModel):
    origin: str = Field(description="IATA airport code, e.g. 'SFO'")
    destination: str = Field(description="IATA airport code, e.g. 'JFK'")
    date: str = Field(description="Departure date, ISO-8601 'YYYY-MM-DD'")
    cabin: Literal["economy", "business", "first"] = "economy"
```

LangChain, the OpenAI SDK, and Anthropic all derive their JSON Schema from models like this. Validation at the tool boundary is your last line of defense, but a tight schema means there's far less left to catch.
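The strict-mode rules listed earlier in this section (no additional properties, every field required, null allowed for genuinely optional fields) can be applied mechanically. The helper below is a hypothetical sketch that handles only flat object schemas whose `type` is a single string; it is not part of any SDK.

```python
import copy


def to_strict_schema(schema: dict) -> dict:
    """Return a tightened copy of a flat object schema.

    - additionalProperties is set to False
    - every property is listed as required
    - properties that were optional also accept null
    """
    strict = copy.deepcopy(schema)
    originally_required = set(strict.get("required", []))
    for name, prop in strict["properties"].items():
        if name in originally_required:
            continue
        # Optional field: keep it in the schema but allow an explicit null.
        prop["type"] = [prop["type"], "null"]
        if "enum" in prop and None not in prop["enum"]:
            prop["enum"] = prop["enum"] + [None]
    strict["required"] = list(strict["properties"])
    strict["additionalProperties"] = False
    return strict


# Hand-written schema for illustration; "cabin" is optional here.
BOOK_FLIGHT = {
    "type": "object",
    "properties": {
        "origin": {"type": "string", "description": "IATA airport code, e.g. 'SFO'"},
        "cabin": {"type": "string", "enum": ["economy", "business", "first"]},
    },
    "required": ["origin"],
}
```

The enum is extended with null as well, because under JSON Schema semantics a null value would otherwise fail the enum check even though the type allows it.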
Return values and errors are feedback, not logs
Whatever a tool returns isn't a log file you skim later — it's the next thing the model reads, and it shapes the very next decision. Treat every return value as a message to the agent: say the useful thing, leave out the noise, and when something breaks, tell it what to do instead of dumping the wreckage.
So return high-signal, not high-volume. Anthropic's guidance: prioritize relevance over completeness. Strip UUIDs and mime_type noise in favor of human-readable identifiers, and offer a response_format parameter (concise vs detailed) — their concise mode used roughly one-third the tokens. For large result sets, paginate, filter, or truncate (Claude Code caps tool output near 25,000 tokens by default).
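A `response_format` switch and a truncation guard could be sketched as follows. The record fields and the `format_invoices` name are assumptions made for illustration, and the character limit stands in for a real token budget.

```python
def format_invoices(invoices: list[dict], response_format: str = "concise",
                    max_chars: int = 2000) -> str:
    """Render invoices for the model; 'concise' drops internal identifiers."""
    if response_format not in ("concise", "detailed"):
        raise ValueError("response_format must be 'concise' or 'detailed'")
    lines = []
    for inv in invoices:
        line = f"{inv['customer_name']}: {inv['amount_cents']} cents, {inv['status']}"
        if response_format == "detailed":
            line += f" (id={inv['id']}, mime_type={inv['mime_type']})"
        lines.append(line)
    text = "\n".join(lines)
    if len(text) > max_chars:
        # Tell the model the output was cut so it can narrow the query.
        text = text[:max_chars] + "\n[truncated: refine filters or request the next page]"
    return text
```

The truncation notice matters as much as the cut itself: without it, the agent has no way to know that more results exist.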
Errors are the subtler art. An error message is feedback to the model, not a log line for you. A raw stack trace burns context on something the model can't act on. Structure errors so the model knows what to do next:
```json
{
  "error_code": "AMOUNT_EXCEEDS_BALANCE",
  "context": {"requested_cents": 5000, "available_cents": 3200},
  "action": "Retry with amount_cents <= 3200, or call check_balance first."
}
```

That triple (a machine-readable error_code, the relevant context, and an explicit action) turns a failure into a recovery the agent can execute on its own, instead of a dead end it loops on or hallucinates around.
Example
Two errors, same bug
Bad: Traceback (most recent call last): ... ValueError: insufficient funds. The model can't parse the cause or the fix.
Good: {"error_code":"INSUFFICIENT_FUNDS","context":{"short_by_cents":1800},"action":"Reduce amount by 1800 cents or top up the account."}. The agent now knows exactly how to recover.
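One way to get the error_code / context / action shape consistently is to wrap every tool in a decorator. This is a minimal sketch; `ToolError`, `model_readable_errors`, and `withdraw` are hypothetical names, and the fallback wording for unexpected exceptions is an assumption you would tune for your own agent.

```python
import functools


class ToolError(Exception):
    """Raised by tool code when the model can recover by changing its call."""

    def __init__(self, error_code: str, context: dict, action: str):
        super().__init__(error_code)
        self.error_code = error_code
        self.context = context
        self.action = action


def model_readable_errors(tool_fn):
    """Convert exceptions into {error_code, context, action} dicts."""
    @functools.wraps(tool_fn)
    def wrapper(*args, **kwargs):
        try:
            return tool_fn(*args, **kwargs)
        except ToolError as err:
            return {"error_code": err.error_code, "context": err.context,
                    "action": err.action}
        except Exception as err:
            # Unexpected failure: keep the stack trace out of the model's context.
            return {"error_code": "INTERNAL_ERROR",
                    "context": {"exception_type": type(err).__name__},
                    "action": "Do not retry with the same arguments; report the failure."}
    return wrapper


@model_readable_errors
def withdraw(amount_cents: int, available_cents: int = 3200) -> dict:
    if amount_cents > available_cents:
        raise ToolError(
            "AMOUNT_EXCEEDS_BALANCE",
            {"requested_cents": amount_cents, "available_cents": available_cents},
            f"Retry with amount_cents <= {available_cents}, or call check_balance first.",
        )
    return {"withdrawn_cents": amount_cents}
```

The full traceback can still go to your own logs inside the `except` branches; it just should not be what the model reads.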
Idempotency, confirmation, and safety in the tool layer
Here is an uncomfortable truth about agents: they retry, and they don't always know they're repeating themselves. A response times out, the loop fires the same call again, and now you've charged a customer twice. Because the model is nondeterministic, you cannot trust it to never do the dangerous thing — you have to make the dangerous thing impossible to do twice (or, sometimes, impossible to do without a human nod).
So design side-effecting tools to be idempotent: accept a unique, high-entropy idempotency key (a UUID, for example) so that a repeated call is a no-op, not a second charge.
Some actions can't be made safe by idempotency alone, because even a single execution is irreversible: deleting a production table, wiring money, posting publicly. For those, separate the decision from the execution and require an explicit confirmation step or a policy-service approval gate before the irreversible action fires.
The deeper rule: safety invariants live in the tool layer, not the prompt. A system-prompt instruction like "never delete more than 10 rows" can be forgotten, overridden, or injected away over a long conversation. Enforce limits where they can't be bypassed:
- RBAC scope checks and allowlists inside the tool, on every call.
- Rate limits independent of model output.
- Sandboxing for code execution and file access.
- Least privilege — give each tool only the permissions it strictly needs.
The model proposes; the tool layer disposes — and the tool layer is the only place you can guarantee the rules hold.
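As an illustration of enforcing invariants inside the tool, here is a sketch of a delete tool with an allowlist, a per-call row cap, and a rate limit. The limits, table names, and `delete_rows` signature are assumptions; the `now` parameter exists only so the rate limit can be exercised deterministically.

```python
import time
from typing import Optional

MAX_ROWS_PER_CALL = 10
ALLOWED_TABLES = {"notes", "drafts"}  # hypothetical allowlist
MAX_CALLS_PER_MINUTE = 5
_call_times: list[float] = []


def delete_rows(table: str, row_ids: list[int], now: Optional[float] = None) -> dict:
    """Delete rows, enforcing limits regardless of what the prompt says."""
    now = time.monotonic() if now is None else now
    if table not in ALLOWED_TABLES:
        return {"error_code": "TABLE_NOT_ALLOWED", "context": {"table": table},
                "action": f"Use one of: {sorted(ALLOWED_TABLES)}."}
    if len(row_ids) > MAX_ROWS_PER_CALL:
        return {"error_code": "TOO_MANY_ROWS",
                "context": {"requested": len(row_ids), "max": MAX_ROWS_PER_CALL},
                "action": f"Delete at most {MAX_ROWS_PER_CALL} rows per call."}
    recent = [t for t in _call_times if now - t < 60]
    if len(recent) >= MAX_CALLS_PER_MINUTE:
        return {"error_code": "RATE_LIMITED", "context": {"window_seconds": 60},
                "action": "Wait before deleting more rows."}
    _call_times[:] = recent + [now]
    # The real deletion would happen here; this sketch only reports the count.
    return {"deleted": len(row_ids)}
```

None of these checks depend on the model remembering an instruction, which is what makes them hold over a long or adversarial conversation.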
Evaluate tools with real tasks, then iterate
You can't tell whether a tool is well-designed by rereading its description — it looks fine to you because you already know what it means. The only honest test is to watch a real agent try to use it on a real task and see where it stumbles. The good news: most stumbles point at a specific line of prose you can fix.
A trivial one-call test won't do. Use realistic multi-step tasks that resemble production work, because the failures that matter show up only when tools compose.
Instrument every run and track:
| Signal | What a problem looks like | Likely fix |
|---|---|---|
| Tool-selection accuracy | Wrong tool chosen | Sharper description / namespacing |
| Parameter errors | Malformed or missing args | Enums, examples, strict mode |
| Redundant calls | Same data fetched repeatedly | Tune pagination / consolidate tools |
| Token consumption | Bloated context, high cost | Add concise mode, truncate results |
| Error / retry rate | Loops, give-ups | Make errors pedagogical, add idempotency |
Treat tool descriptions as living prompts: when an eval shows the agent picking the wrong tool, the fix is usually three better words in a description, not a new tool. The description-rewriting research cited earlier formalized exactly this, reporting that curriculum-style refinement of descriptions was the single biggest lever on large-catalog reliability. Build a small eval harness early, run it on every tool change, and let the failures tell you which words to rewrite.
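A small harness can start as nothing more than a log of tool calls and a summary function. The sketch below computes three of the signals from the table; the `ToolCall` record and its field names are hypothetical, and the "expected" tool is assumed to come from a human-labelled reference run.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    expected_tool: str    # tool a reviewer judged correct for this step
    chosen_tool: str      # tool the agent actually called
    args_valid: bool      # whether the arguments passed schema validation
    tokens_returned: int  # size of the tool result


def summarize(calls: list[ToolCall]) -> dict:
    """Aggregate per-call records into run-level metrics."""
    total = len(calls)
    if total == 0:
        return {"selection_accuracy": 0.0, "parameter_error_rate": 0.0,
                "avg_tokens_returned": 0.0}
    correct = sum(c.expected_tool == c.chosen_tool for c in calls)
    invalid = sum(not c.args_valid for c in calls)
    return {
        "selection_accuracy": correct / total,
        "parameter_error_rate": invalid / total,
        "avg_tokens_returned": sum(c.tokens_returned for c in calls) / total,
    }
```

Comparing these numbers before and after a description change shows whether the rewrite helped or merely moved the failure elsewhere.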
Try it: Redesign a leaky tool
Take a deliberately bad tool: do_stuff(action: str, data: str) -> str that wraps create/read/update/delete on a notes store and returns raw exceptions on failure.
- Split & name. Replace it with atomic, snake_case tools (create_note, get_note, update_note, delete_note). Add namespacing if you imagine 30+ tools.
- Schema. Define typed parameters with Pydantic: use a Literal enum somewhere, mark optional fields explicitly, and embed an example in each field description. Enable strict mode.
- Returns & errors. Make get_note return a concise, human-readable result (no UUIDs/mime types). Replace raw exceptions with {error_code, context, action} JSON.
- Safety. Make delete_note require a confirmation flag and enforce a per-call rate limit inside the tool body, not in the prompt. Add an idempotency key to create_note.
- Evaluate. Write one realistic multi-step task ('find the note about Q3 budget, append a line, then delete the outdated draft') and run your agent against both the old and new tools. Record tool-selection accuracy, parameter errors, and token use. Note which single description change fixed the most failures.
Key takeaways
1. A tool definition is a prompt the model reads every turn: names, descriptions, and schemas drive selection, and small wording changes have large, measurable effects.
2. Right-size granularity: one atomic action per tool, but consolidate operations the agent always uses together, and keep the active set under ~20.
3. Use typed schemas, enums, examples, and strict mode to make invalid tool calls impossible to express.
4. Return high-signal results and structure errors as machine-readable feedback (error_code + context + action), never raw stack traces.
5. Enforce idempotency, confirmation gates, and safety invariants (RBAC, rate limits, least privilege) in the tool layer, not the prompt.
Quiz
Lock in what you learned
Check your understanding
1. Why is a tool's description considered prompt engineering rather than documentation?
2. An agent always calls get_customer_by_id, then list_transactions, then list_notes in sequence. What is the best tool-design move?
3. Which error format is best for an LLM agent to recover from?
4. Where should the invariant 'never delete more than 10 rows' be enforced?
Go deeper
Hand-picked sources to keep learning
- First-party, most authoritative guide: namespacing, descriptions, response-format control, consolidation, error design, and eval-driven iteration.
- Strict mode, JSON Schema, enums, required/optional fields, the intern test, and keeping tool sets small.
- Empirical evidence that description quality is the primary bottleneck at scale: consistent, significant gains in selection accuracy and task success on catalogs of 100+ tools.
- Practitioner guide on naming, description templates, parameter design (enums, examples), and continuous improvement via production monitoring.
- Why error messages must be structured feedback for the model, not stack traces for developers.