Tool Use & Function Calling

How an agent reaches outside its own head

Intermediate · 15 min · Builder

What you'll be able to do

Write a correct tool/function schema (name, description, JSON Schema parameters) that the model can call reliably
Trace the full 5-step request/response cycle from tool definition to final answer
Use parallel tool calls and tool_choice to control when and how the model invokes tools
Explain how the model decides which tool to call, and handle tool errors so it can self-correct
Identify the key API differences between OpenAI, Anthropic, and Google when porting tool-calling code

At a glance

An LLM on its own can only produce text — it cannot fetch a stock price, query your database, or send an email. Tool use (OpenAI calls it function calling) is the mechanism that fixes this: the model emits a structured request to run a function, your code runs it, and you feed the result back. This lesson shows you exactly how that loop works, how to write tool schemas the model can use reliably, and where OpenAI, Anthropic, and Google differ in practice.

1. The model asks; your code acts
2. The tool schema: name, description, parameters
3. The request/response cycle
4. Parallel calls and tool_choice
5. How the model decides, and how to make it reliable
6. Returning observations and handling errors
7. Provider differences that bite you

The model asks; your code acts

Start with a plain picture. A language model is a text-in, text-out system: it reads words and predicts more words. That is all it does. So when you want it to fetch live data or take a real action, there has to be a hand-off — the model says what it wants, and something outside the model actually does it. That hand-off is tool use.

Here is the single most important idea, and the one beginners most often get wrong: the model never runs any code. When you give a model a calculator tool, it does not compute anything. It looks at the conversation, decides a calculation would help, and emits a structured message that says, in effect, "please call calculate with the argument expression='42*17'." Your application receives that request, runs the real function, and sends the answer back. The model then continues reasoning with the result in hand.

Think of the model as a brilliant analyst locked in a room with a phone. It cannot leave or touch anything itself — but it can pick up the phone and dictate precise requests to an assistant outside, then react to whatever the assistant reports back. Function calling is that phone.

This design is what lets an LLM reach beyond its training data: it can read live data, run code, and take real-world actions, all while staying a pure text-in, text-out system. "Function calling" (OpenAI's 2023 term) and "tool use" (Anthropic's term) describe the same mechanism — we'll use them interchangeably.

Watch out

The model does not execute anything

The model only describes a call as structured JSON. Your application is always responsible for execution — and therefore for sandboxing, permissions, and safety. If a tool can delete files or spend money, that risk lives in your code, not the model's.
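To make the hand-off concrete, here is a minimal sketch of the application side. It assumes the model's request arrives as a JSON string with `name` and `arguments` keys (a simplified shape, not any provider's exact wire format). The `calculate` function, `TOOL_REGISTRY`, and `dispatch` are hypothetical names invented for illustration.

```python
import json


def calculate(expression: str) -> str:
    """Hypothetical tool. Supports only 'a*b' so the sketch never eval()s model output."""
    left, right = expression.split("*")
    return str(int(left) * int(right))


# Allowlist: the only functions the model is permitted to trigger.
TOOL_REGISTRY = {"calculate": calculate}


def dispatch(call_json: str) -> str:
    """Run one model-requested call. The model supplied only this JSON text."""
    call = json.loads(call_json)
    func = TOOL_REGISTRY.get(call["name"])
    if func is None:
        # Anything outside the allowlist is refused, not executed.
        return f"Unknown tool: {call['name']}"
    return func(**call["arguments"])


# What the model emits is just data describing a call.
model_output = '{"name": "calculate", "arguments": {"expression": "42*17"}}'
print(dispatch(model_output))  # prints 714
```

The registry doubles as a permission boundary: a request for a function that is not in it is rejected by your code, whatever the model asked for.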

The tool schema: name, description, parameters

If the model is going to request a function, it first needs to know which functions exist and how to call them. You provide that as a list of schemas — think of each schema as the "menu entry" for one tool: its name, what it's for, and the inputs it takes. Every schema has three parts:

name: a short identifier (letters, digits, underscores, and hyphens; 1–64 characters), e.g. get_weather.
description: natural language explaining when and how to use the tool. This is the single biggest factor in whether the model picks the right tool. Treat it as prompt engineering, not a code comment.
parameters (OpenAI) / input_schema (Anthropic): a JSON Schema object defining the expected inputs, their types, and which are required. (JSON Schema is just a standard JSON format for describing the shape of data: types, fields, and constraints.)

json

{
  "name": "get_weather",
  "description": "Get the current weather for a city. Use when the user asks about temperature, rain, or conditions in a specific location.",
  "input_schema": {
    "type": "object",
    "properties": {
      "city": { "type": "string", "description": "City name, e.g. 'Paris'" },
      "unit": { "type": "string", "enum": ["celsius", "fahrenheit"] }
    },
    "required": ["city"]
  }
}

The schema is sent as part of every request and counts as input tokens — the tools array, the model's tool calls, and the results you return all add to your token bill.
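Because the schema travels with every request, it can be worth linting it locally before you send it. The sketch below checks an Anthropic-style definition for a few common slips. `check_tool_schema` is a hypothetical helper, the name pattern reflects the usual provider rule (letters, digits, underscores, hyphens, up to 64 characters), and the five-word description threshold is an arbitrary assumption, not a provider requirement.

```python
import re

# Assumed naming rule: letters, digits, underscores, hyphens; 1 to 64 characters.
NAME_PATTERN = re.compile(r"^[a-zA-Z0-9_-]{1,64}$")


def check_tool_schema(tool: dict) -> list:
    """Return a list of problems found in an Anthropic-style tool definition."""
    problems = []
    if not NAME_PATTERN.match(tool.get("name", "")):
        problems.append("name must be 1-64 letters, digits, underscores, or hyphens")
    # Arbitrary heuristic: very short descriptions rarely say when to use the tool.
    if len(tool.get("description", "").split()) < 5:
        problems.append("description is too short to guide tool selection")
    schema = tool.get("input_schema", {})
    if schema.get("type") != "object":
        problems.append("input_schema.type must be 'object'")
    properties = schema.get("properties", {})
    for field in schema.get("required", []):
        if field not in properties:
            problems.append(f"required field '{field}' is not in properties")
    return problems


get_weather_tool = {
    "name": "get_weather",
    "description": "Get the current weather for a city. Use when the user asks "
                   "about temperature, rain, or conditions in a specific location.",
    "input_schema": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. 'Paris'"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}

vague_tool = {
    "name": "get weather!",
    "description": "weather tool",
    "input_schema": {"type": "object", "properties": {}, "required": ["city"]},
}

print(check_tool_schema(get_weather_tool))  # no problems found
print(check_tool_schema(vague_tool))        # three problems found
```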

Tip

Spend your effort on the description

A vague description ("weather tool") leads to wrong or missed calls. A precise one that states when to use it, what it returns, and any gotchas dramatically improves selection accuracy. The model reads descriptions the way a new engineer reads function docs.

The request/response cycle

Now put the model and your code in conversation. A single tool call is a small back-and-forth: the model asks, your code answers, and you hand the answer back so the model can keep going. Concretely it's a five-step loop:

1. Send the conversation plus the tools array to the model.
2. Receive a response. If the model wants a tool, it stops with stop_reason: "tool_use" (Anthropic) or finish_reason: "tool_calls" (OpenAI), and the response carries a structured call with the tool name and arguments.
3. Execute the real function in your code with those arguments.
4. Return the result in a new message: a tool_result block (Anthropic) or a role: "tool" message (OpenAI).
5. Continue: the model reads the result and either calls another tool or writes its final answer.

Steps 1–5 repeat until the model returns plain text. This is exactly the ReAct loop — reason, act, observe — implemented at the API level.

python

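import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
# messages, tools, and run_tool are defined by your application
response = client.messages.create(model="claude-opus-4-8", max_tokens=1024, messages=messages, tools=tools)

while response.stop_reason == "tool_use":
    # Echo the assistant turn first, then answer every tool_use block in it
    # (a single response may contain several parallel calls).
    messages.append({"role": "assistant", "content": response.content})
    results = [
        {"type": "tool_result", "tool_use_id": block.id, "content": str(run_tool(block.name, block.input))}
        for block in response.content if block.type == "tool_use"  # your code executes
    ]
    messages.append({"role": "user", "content": results})
    response = client.messages.create(model="claude-opus-4-8", max_tokens=1024, messages=messages, tools=tools)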

Ordering is strict: the tool result must immediately follow the assistant message that requested it — interleaving other messages causes a 400 error.

Example

One turn, concretely

User: "What's the weather in Tokyo?" → model emits get_weather(city="Tokyo") and stops → your code calls the real weather API, gets 18°C, clear → you return that as a tool result → model replies "It's currently 18°C and clear in Tokyo." Two model calls, one tool execution in between.
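The same turn can be traced offline with a scripted stand-in for the model, which makes the "two model calls, one tool execution" count easy to verify. Everything here is hypothetical scaffolding: `fake_model` is not a real model or SDK call, and the message shapes are simplified rather than any provider's wire format.

```python
def get_weather(city: str) -> str:
    """Hypothetical stand-in for a real weather API call."""
    return f"{city}: 18C, clear"


TOOLS = {"get_weather": get_weather}


def fake_model(messages: list) -> dict:
    """Scripted stand-in for a model: request the tool once, then answer."""
    last = messages[-1]
    if last["role"] == "tool":
        return {"stop": "end_turn", "text": f"Weather report: {last['content']}"}
    return {"stop": "tool_use", "call": {"name": "get_weather", "args": {"city": "Tokyo"}}}


def run_turn(user_text: str) -> tuple:
    """Drive the loop and count model calls and tool executions."""
    messages = [{"role": "user", "content": user_text}]
    model_calls = 0
    tool_runs = 0
    while True:
        reply = fake_model(messages)          # steps 1 and 2: send, receive
        model_calls += 1
        if reply["stop"] != "tool_use":
            return reply["text"], model_calls, tool_runs
        call = reply["call"]
        result = TOOLS[call["name"]](**call["args"])  # step 3: execute
        tool_runs += 1
        # Step 4: the result immediately follows the assistant's request.
        messages.append({"role": "assistant", "content": call})
        messages.append({"role": "tool", "content": result})


answer, model_calls, tool_runs = run_turn("What's the weather in Tokyo?")
print(answer, model_calls, tool_runs)
```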

Parallel calls and tool_choice

Two controls let you tune how many tools run per round-trip and whether the model is allowed to skip tools at all.

Parallel tool calls

Some tasks need several tools that don't depend on each other. Rather than make one round-trip per tool, the model can emit them all in one response. Ask "what's the weather in Tokyo, London, and Cairo?" and a modern model returns three get_weather calls at once. Your code runs them concurrently (e.g. asyncio.gather) and returns all three results together. This collapses latency from the sum of the calls to the duration of the slowest one.

Note the division of labor: the model decides what can run in parallel and emits the calls; your code is responsible for actually running them concurrently. The model is stateless and runs nothing itself.
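A minimal sketch of the application's half of that split, using `asyncio.gather`. The `get_weather` coroutine is a hypothetical tool whose sleep stands in for network latency, and the `calls` list imitates three calls the model emitted in a single response (the `id` and `args` keys are a simplified shape, not a provider format).

```python
import asyncio


async def get_weather(city: str) -> str:
    """Hypothetical async tool. The sleep stands in for network latency."""
    await asyncio.sleep(0.05)
    return f"{city}: 18C, clear"


async def run_parallel(calls: list) -> list:
    """Run every requested call concurrently and pair each result with its call id."""
    # gather() returns results in the same order as the awaitables it was given,
    # so zipping with the original calls keeps ids and results aligned.
    outputs = await asyncio.gather(*(get_weather(**c["args"]) for c in calls))
    return [{"id": c["id"], "content": out} for c, out in zip(calls, outputs)]


calls = [
    {"id": "call_1", "args": {"city": "Tokyo"}},
    {"id": "call_2", "args": {"city": "London"}},
    {"id": "call_3", "args": {"city": "Cairo"}},
]

results = asyncio.run(run_parallel(calls))
for item in results:
    print(item)
```

Keeping each result attached to its call id matters because every provider expects one result per requested call, matched by id.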

tool_choice

By default the model decides whether to use a tool at all ("auto"). You can override that:

Intent	OpenAI	Anthropic
Model decides (default)	`auto`	`auto`
Must call some tool	`required`	`any`
Force a specific tool	`{type:"function", ...}`	`{type:"tool", name:"..."}`
Never call a tool	`none`	`none`

The concepts line up, but the names differ: OpenAI's required is Anthropic's any, a classic migration bug. On Anthropic, tool_choice also changes the size of the tool-use system prompt that is added to the request (the exact token count depends on the model and the setting). More importantly, forcing a tool (any or tool) is incompatible with Claude's extended thinking; only auto and none work with reasoning mode.
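One way to avoid the naming trap is to express the intent once and translate it per provider. The sketch below is a hypothetical helper, not part of either SDK. The intent labels ("auto", "must", "force", "never") are invented here, and the OpenAI values assume the Chat Completions shape.

```python
from typing import Optional, Union


def tool_choice_for(provider: str, intent: str,
                    tool_name: Optional[str] = None) -> Union[str, dict]:
    """Translate a provider-neutral intent into that provider's tool_choice value."""
    if provider not in ("openai", "anthropic"):
        raise ValueError(f"unsupported provider: {provider}")
    if intent == "force":
        if not tool_name:
            raise ValueError("forcing a tool requires tool_name")
        if provider == "openai":
            return {"type": "function", "function": {"name": tool_name}}
        return {"type": "tool", "name": tool_name}
    table = {
        # OpenAI uses bare strings for the non-specific modes.
        "openai": {"auto": "auto", "must": "required", "never": "none"},
        # Anthropic wraps every mode in an object with a "type" key.
        "anthropic": {
            "auto": {"type": "auto"},
            "must": {"type": "any"},
            "never": {"type": "none"},
        },
    }
    return table[provider][intent]


print(tool_choice_for("openai", "must"))
print(tool_choice_for("anthropic", "must"))
print(tool_choice_for("anthropic", "force", "get_weather"))
```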

How the model decides — and how to make it reliable

How does the model know which tool to reach for? Not by running anything — it matches meaning. It reads the user's intent, compares it against each tool's description, and picks the closest fit, the same way you'd skim function docs to find the right call. Three things steer that choice:

Tool descriptions — the primary signal. Clear, specific descriptions win.
System prompt policy — explicit instructions like "always verify prices with the lookup_price tool before answering."
What's already in context — models skip tools for stable knowledge they already have, and reach for tools for fresh or external facts.

Two reliability levers matter in production:

Strict mode (strict: true) constrains the model's arguments so they conform to your schema. On OpenAI it requires additionalProperties: false, every field listed in required, and optional fields typed as nullable, e.g. ["string", "null"]. It is off by default in Chat Completions, so you must opt in. Anthropic long had no equivalent flag and now offers a comparable strict option for tool definitions as part of its structured outputs feature; check the current docs for model support. Either way, validate inputs in your code: without strict mode (or your own validation), arguments can partially violate the schema.
Tool-set size — more tools is not better. Past ~30–50 tools, selection accuracy degrades. Keep the active set scoped to the task, or for large catalogs use dynamic tool pre-filtering (Anthropic ships this as Tool Search, using regex and BM25 modes) to surface relevant tools on demand before the model context fills with irrelevant schemas.
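For the "validate inputs in your code" half of the first lever, here is a minimal hand-rolled sketch. It covers only required fields, unexpected fields, primitive types, and enums, which is a small subset of JSON Schema. In practice a dedicated validation library is the more robust choice. `validate_args` is a hypothetical helper.

```python
# Maps JSON Schema primitive type names to Python types (subset only).
JSON_TYPES = {"string": str, "number": (int, float), "integer": int, "boolean": bool}


def validate_args(args: dict, schema: dict) -> list:
    """Return human-readable problems with model-supplied arguments."""
    errors = []
    properties = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in args:
            errors.append(f"missing required field '{name}'")
    for name, value in args.items():
        spec = properties.get(name)
        if spec is None:
            errors.append(f"unexpected field '{name}'")
            continue
        expected = JSON_TYPES.get(spec.get("type"))
        # Caveat: Python bools are ints, so True would pass an "integer" check here.
        if expected and not isinstance(value, expected):
            errors.append(f"field '{name}' must be of type {spec['type']}")
        elif "enum" in spec and value not in spec["enum"]:
            errors.append(f"field '{name}' must be one of {spec['enum']}")
    return errors


weather_schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["city"],
}

print(validate_args({"city": "Paris"}, weather_schema))
print(validate_args({"city": "Paris", "unit": "kelvin"}, weather_schema))
print(validate_args({"unit": "celsius"}, weather_schema))
```

The returned strings are written so they can be handed straight back to the model as a tool error, which gives it what it needs to correct the call.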

Under the hood, models are fine-tuned and RL-trained on curated tool-call datasets to recognize when a structured call beats a text answer, so tool-calling quality varies between models and depends heavily on that training data.

Returning observations and handling errors

Whatever your tool returns is the only thing the model sees of the outside world — it becomes the model's observation, the next thing it reasons about. So return results that are useful, bounded, and machine-readable. A 50,000-row dump wastes context; a summarized, structured result keeps the agent on track.
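One simple way to keep an observation bounded is to return a small sample plus the counts the model needs to reason about what it is not seeing. This is a sketch under assumptions: `summarize_rows` and its field names are hypothetical, and the right limit depends on your data and context budget.

```python
import json


def summarize_rows(rows: list, limit: int = 5) -> str:
    """Return a bounded, machine-readable observation instead of the full dump."""
    payload = {
        "total_rows": len(rows),
        "returned_rows": min(limit, len(rows)),
        "truncated": len(rows) > limit,
        "rows": rows[:limit],
    }
    return json.dumps(payload)


# Hypothetical query result: 50,000 rows the model does not need to read in full.
rows = [{"order_id": i, "total": i * 10} for i in range(50000)]
observation = summarize_rows(rows)
print(len(observation), "characters instead of", len(json.dumps(rows)))
```

Including `total_rows` and `truncated` tells the model that more data exists, so it can ask for a narrower query instead of treating the sample as complete.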

Errors deserve the same care. It's tempting to treat a failed tool call as something to hide or crash on, but to the model an error is just more information — feedback it can act on. When a tool fails, return a descriptive error rather than crashing the loop:

python

try:
    data = lookup_order(order_id)
    result = {"type": "tool_result", "tool_use_id": tid, "content": json.dumps(data)}
except OrderNotFound:
    result = {
        "type": "tool_result", "tool_use_id": tid, "is_error": True,
        "content": "No order found with that ID. Ask the user to re-check the number."
    }

Anthropic uses the is_error flag; OpenAI returns the error string in the role: "tool" message. Either way, a good error message ("Rate limit exceeded. Retry after 60 seconds.") lets the model self-correct. Claude will typically retry an invalid call 2–3 times with corrections before giving up gracefully. Vague errors ("Error 500") leave it stuck. Write tool errors the way you'd write them for a junior engineer who can only see that one line.
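For comparison, here is a sketch of the same error path in the OpenAI Chat Completions shape, where there is no error flag and the descriptive text simply goes in the content of the role: "tool" message. `lookup_order`, `OrderNotFound`, the in-memory `ORDERS` table, and `run_lookup_for_openai` are hypothetical stand-ins for your own code.

```python
import json


class OrderNotFound(Exception):
    """Raised by the hypothetical lookup when an ID is unknown."""


# Hypothetical in-memory data standing in for a real order system.
ORDERS = {"A100": {"status": "shipped"}}


def lookup_order(order_id: str) -> dict:
    if order_id not in ORDERS:
        raise OrderNotFound(order_id)
    return ORDERS[order_id]


def run_lookup_for_openai(tool_call_id: str, order_id: str) -> dict:
    """Build the role: 'tool' message, turning failures into actionable text."""
    try:
        content = json.dumps(lookup_order(order_id))
    except OrderNotFound:
        content = (f"Error: no order found with ID '{order_id}'. "
                   "Ask the user to re-check the number.")
    return {"role": "tool", "tool_call_id": tool_call_id, "content": content}


print(run_lookup_for_openai("call_1", "A100"))
print(run_lookup_for_openai("call_2", "Z999"))
```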

Provider differences that bite you

Good news first: the mechanism is identical everywhere — model asks, your code acts, you return the result. What changes between providers is only the wire format: the field names and the exact shape of the JSON. That sounds trivial, but a handful of these differences silently break code during a migration. Here are the ones that actually bite:

	OpenAI	Anthropic	Google Gemini
Schema key	`parameters`	`input_schema`	`parameters` (in `functionDeclarations`)
Stop signal	`finish_reason: "tool_calls"`	`stop_reason: "tool_use"`	function-call part in response
Where calls live	`tool_calls` array	`tool_use` content blocks	`functionCall` parts
Arguments format	JSON string → needs `JSON.parse()`	already-parsed object (`input`)	parsed object
Result message	`role: "tool"`	`tool_result` block	`functionResponse` part
"Must use a tool"	`tool_choice: required`	`tool_choice: any`	`mode: ANY`

The arguments format is the sharpest edge: OpenAI hands you a JSON string you must parse; Anthropic and Google give you a ready object. Two more notes: OpenAI's old functions parameter is deprecated (as of late 2023) — use tools with type: "function". And above the raw APIs sits the Model Context Protocol (MCP) — an open standard released by Anthropic in November 2024, adopted by OpenAI (March 2025) and Google, and since donated to the Linux Foundation's Agentic AI Foundation (December 2025) — that standardizes tool discovery and invocation across providers, so you describe a tool once and any MCP-aware model can use it. Framework abstractions (Vercel AI SDK, LangChain, LlamaIndex) normalize these differences for you.
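If you are not using a framework, a small adapter can absorb the differences in the table. The sketch below works on plain dictionaries shaped like each provider's JSON response (OpenAI Chat Completions, Anthropic Messages, Gemini REST) rather than on SDK objects. `extract_calls` and the sample responses are hypothetical and trimmed to the relevant fields.

```python
import json


def extract_calls(provider: str, response: dict) -> list:
    """Normalize tool calls from a raw response dict into {id, name, args}."""
    if provider == "openai":
        message = response["choices"][0]["message"]
        return [
            # OpenAI delivers arguments as a JSON string that must be parsed.
            {"id": c["id"], "name": c["function"]["name"],
             "args": json.loads(c["function"]["arguments"])}
            for c in message.get("tool_calls") or []
        ]
    if provider == "anthropic":
        return [
            {"id": b["id"], "name": b["name"], "args": b["input"]}
            for b in response["content"] if b["type"] == "tool_use"
        ]
    if provider == "gemini":
        parts = response["candidates"][0]["content"]["parts"]
        return [
            # This sketch does not track an id for Gemini calls.
            {"id": None, "name": p["functionCall"]["name"],
             "args": p["functionCall"]["args"]}
            for p in parts if "functionCall" in p
        ]
    raise ValueError(f"unsupported provider: {provider}")


openai_response = {"choices": [{"finish_reason": "tool_calls", "message": {
    "role": "assistant", "tool_calls": [{"id": "call_1", "type": "function",
        "function": {"name": "get_weather", "arguments": '{"city": "Tokyo"}'}}]}}]}

anthropic_response = {"stop_reason": "tool_use", "content": [
    {"type": "text", "text": "Let me check."},
    {"type": "tool_use", "id": "toolu_1", "name": "get_weather",
     "input": {"city": "Tokyo"}}]}

gemini_response = {"candidates": [{"content": {"role": "model", "parts": [
    {"functionCall": {"name": "get_weather", "args": {"city": "Tokyo"}}}]}}]}

for name, resp in [("openai", openai_response), ("anthropic", anthropic_response),
                   ("gemini", gemini_response)]:
    print(name, extract_calls(name, resp))
```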

Try it: Build a one-tool calling loop

Using the SDK of your choice (Anthropic or OpenAI), define a single get_weather(city, unit?) tool with a clear description and a JSON Schema. Hard-code the function to return a fake but realistic result (e.g. 'Tokyo: 18°C, clear'). Then implement the full loop: (1) send a user message "Compare the weather in Tokyo and London" with your tool; (2) detect the tool_use / tool_calls stop signal; (3) execute the tool(s) — note whether the model issued two parallel calls; (4) return the result(s) in the correct result-message format; (5) loop until the model returns plain text. Stretch: add an error path that returns a descriptive is_error result when an unknown city is requested, and confirm the model recovers by asking the user to clarify. Write 3–4 sentences on what surprised you about how the model sequenced or parallelized the calls.

Key takeaways

1. The model never executes code: it emits a structured JSON call, and your application runs the function and returns the result.
2. A tool schema is name + description + JSON Schema parameters, and the description is the biggest lever on whether the model picks the right tool.
3. Tool use is a strict 5-step loop (send → tool_use stop → execute → return result → continue) where the result must immediately follow the request.
4. Parallel tool calls cut latency to the slowest single call, and tool_choice controls whether the model must, may, or must not call a tool.
5. OpenAI ('function calling'), Anthropic ('tool use'), and Google Gemini share the mechanism but differ in argument format, parameter names, and stop signals, which is the common source of migration bugs.

Quiz

Lock in what you learned

Check your understanding

1. When a model performs 'function calling,' what does the model itself actually do?

2. Which part of a tool schema most strongly determines whether the model picks the right tool?

3. You want the model to be forced to call one of the available tools. What do you set?

4. A tool fails because an order ID doesn't exist. What is the best thing to return to the model?

Go deeper

Hand-picked sources to keep learning

Anthropic — Tool Use with Claude (Overview)

Authoritative reference: defining tools, handling tool calls, parallel use, tool choice, and server tools. Current for 2025–2026.

OpenAI — Function Calling Guide

Canonical OpenAI reference covering schemas, tool_choice, parallel calls, and strict mode.

Martin Fowler — Function calling using LLMs

Clear engineering explainer of how the model decides to call a tool and how the request/response loop works.

Prompt Engineering Guide — Function Calling with LLMs

Provider-neutral walkthrough across OpenAI, Anthropic, and open-source models.

An LLM Compiler for Parallel Function Calling

Research on scheduling parallel and sequential tool calls — the orchestration view of parallel execution.

Anthropic — Tool Search (Dynamic Tool Pre-filtering)

How to use Anthropic's built-in Tool Search feature (regex and BM25 modes) to keep active tool sets small, reducing token usage and improving selection accuracy at scale.