AI Tool Use Patterns
Tool use is when an LLM decides to call an external function — search, calculator, database query, API — instead of (or in addition to) generating text. Modern frontier models (Claude, GPT, Gemini) emit structured tool-call JSON the application executes; the result is fed back into the model. Tool use is the single biggest unlock turning LLMs from chatbots into systems: search beats hallucinated facts, calculators beat hallucinated math, code execution beats hallucinated outputs. The architectural patterns matter as much as the model: which tools you expose, how you scope them, how you handle errors, and how you prevent the model from looping on a broken tool define the system's reliability.
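To make the loop concrete, here is a minimal Python sketch. Everything in it is illustrative: call_model stands in for your provider's chat API (Anthropic, OpenAI, and Gemini each expose an equivalent), and the tool registry is stubbed; real SDKs differ in field names but follow this shape.

```python
import json

def call_model(messages: list[dict], tools: list) -> dict:
    """Hypothetical wrapper over your provider's chat API. Returns either
    {"type": "text", "text": ...} or
    {"type": "tool_call", "name": ..., "args": {...}}."""
    raise NotImplementedError("wire this to your provider's SDK")

# Tool registry: name -> plain Python function.
TOOLS = {
    "search": lambda query: f"top results for {query!r}",    # stub
    "calculator": lambda expression: str(eval(expression)),  # demo only
}

def run_agent(user_message: str, tools: dict = TOOLS, max_steps: int = 8) -> str:
    """Drive the tool-use loop: the model answers in text or emits a
    structured tool call; the application executes it and feeds the
    result back."""
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):  # hard cap so a broken tool can't loop forever
        reply = call_model(messages, tools=list(tools))
        if reply["type"] == "text":
            return reply["text"]
        result = tools[reply["name"]](**reply["args"])  # execute the call
        messages.append({"role": "assistant", "content": json.dumps(reply)})
        messages.append({"role": "tool", "name": reply["name"], "content": result})
    return "Step budget exhausted; escalate instead of guessing."
```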
The Trap
The trap is exposing too many tools at once. Models degrade as the tool list grows — past ~20-30 tools, selection accuracy drops sharply because the model has to reason over too many options. The other trap is vague tool descriptions ('search_thing — searches things'); the LLM cannot guess intent it isn't told. And the worst trap: tools without idempotency or error contracts — when a tool fails, the model retries; if 'send_email' isn't idempotent, you ship 4 emails to the same customer. Tool use shifts most of the engineering work from prompts to tool design.
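To make the send_email failure concrete, here is a minimal sketch of the idempotency-key fix. All names are hypothetical, and the in-memory store stands in for a durable one; the point is that a retry with the same key becomes a no-op instead of a duplicate send.

```python
import hashlib

_SENT: set[str] = set()  # demo store; use a durable store (DB/Redis) in production

def _deliver(to: str, subject: str, body: str) -> None:
    print(f"delivering to {to}")  # stand-in for the real email transport

def send_email(to: str, subject: str, body: str,
               idempotency_key: str | None = None) -> dict:
    """Idempotent send: a retry with the same key (or identical content)
    is suppressed, so a model retrying after a timeout can't ship
    duplicates."""
    key = idempotency_key or hashlib.sha256(
        f"{to}|{subject}|{body}".encode()).hexdigest()
    if key in _SENT:
        return {"status": "duplicate_suppressed", "idempotency_key": key}
    _SENT.add(key)
    _deliver(to, subject, body)
    return {"status": "sent", "idempotency_key": key}
```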
What to Do
Design tools like a public API: (1) Narrow scope — one tool, one action; resist 'multipurpose' tools that branch on parameters. (2) Self-documenting — names, parameter descriptions, examples, and explicit return schemas the model can introspect. (3) Idempotent or marked non-idempotent — if 'create_order' isn't idempotent, name it 'create_order_unsafe' so the model treats it carefully. (4) Structured errors — return machine-readable error codes and recovery hints; the model uses them to retry sensibly. (5) Tool selection scaffolding — if you have 50+ tools, use a router pattern: a first call selects the tool category, a second call uses the narrowed toolset. (6) Eval per tool — every tool has its own test suite for selection accuracy and parameter fidelity.
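To make points (1) through (4) concrete, here is what one narrowly scoped, self-documenting tool definition can look like, written in the JSON-schema function-calling shape the major providers all use in some variant. Exact field names vary by provider, and issue_refund is a made-up example.

```python
# One narrowly scoped, self-documenting tool definition. Note how the
# description states when to use it, gives an example call, documents the
# return shape, and loudly flags non-idempotency (point 3).
REFUND_TOOL = {
    "name": "issue_refund",  # one tool, one action; not "handle_payment_op"
    "description": (
        "Issue a refund for a single existing order. Use ONLY when the "
        "customer explicitly asks for money back on a specific order. "
        "Example: issue_refund(order_id='ord_123', amount_cents=1999, "
        "reason='damaged item'). Returns {refund_id, status} on success, "
        "or {error_code, recovery_hint} on failure. NOT idempotent: "
        "calling twice refunds twice."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "order_id": {
                "type": "string",
                "description": "Existing order ID, e.g. 'ord_123'",
            },
            "amount_cents": {
                "type": "integer",
                "description": "Refund amount in cents; must not exceed the order total",
            },
            "reason": {
                "type": "string",
                "description": "Short reason, recorded on the order",
            },
        },
        "required": ["order_id", "amount_cents", "reason"],
    },
}
```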
In Practice
Anthropic's tool-use documentation describes patterns proven across customer deployments: scope tools narrowly, write rich descriptions with examples, and handle errors as data the model can reason over. OpenAI's function calling and Google's Gemini tool use follow analogous patterns. Cursor and Claude Code expose tools (read_file, edit_file, run_command, search_codebase) with tight scopes and structured outputs — each tool is small enough that the model picks correctly, and errors come back as structured information the model can act on. The tools are the product more than the prompt.
Pro Tips
- 01
Treat tool descriptions as first-class prompts. Models pick tools based on the description; ambiguous descriptions cause silent misrouting that's painful to debug. Run an eval where you give the model 100 user queries and check tool-selection accuracy before you optimize anything else.
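A minimal sketch of that eval, reusing the hypothetical call_model wrapper from the first sketch; the labeled queries are placeholders for your own data.

```python
# Labeled queries: which tool the model *should* pick (None = answer directly).
EVAL_SET = [
    {"query": "what's 23% of 1,840?", "expected_tool": "calculator"},
    {"query": "find Acme's current pricing page", "expected_tool": "search"},
    {"query": "say hi", "expected_tool": None},
    # ... extend to ~100 queries covering every tool
]

def selection_accuracy(tools: list[str]) -> float:
    """Fraction of queries where the model picks the expected tool."""
    hits = 0
    for case in EVAL_SET:
        reply = call_model([{"role": "user", "content": case["query"]}],
                           tools=tools)
        picked = reply["name"] if reply["type"] == "tool_call" else None
        hits += picked == case["expected_tool"]
    return hits / len(EVAL_SET)
```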
- 02
Structured errors are how you turn flaky tools into reliable systems. Returning {error: 'ratelimit', retry_after_ms: 800} lets the model wait and retry. Returning 'something went wrong' makes the model improvise — usually badly.
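A sketch of what the harness side can do with that payload, assuming every tool returns a structured dict like the one in the tip: transient rate limits get retried in code, and every other structured error goes back to the model as data.

```python
import time

def execute_tool(fn, args: dict, max_retries: int = 2) -> dict:
    """Honor machine-readable recovery hints in the harness; pass any
    other structured error back to the model to reason over."""
    result = fn(**args)
    for _ in range(max_retries):
        if result.get("error") != "ratelimit":
            break  # success, or an error the model should see
        time.sleep(result.get("retry_after_ms", 1000) / 1000)  # wait as hinted
        result = fn(**args)
    return result
```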
- 03
If a tool is dangerous, put the confirmation gate in the application layer, not the prompt. 'You must say MAYISEND first' is a prompt-injection liability; a real confirmation gate in code is not.
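A minimal sketch of that gate; the tool names and the confirm callback are hypothetical, and tools is the same name-to-function registry used in the earlier sketches.

```python
DANGEROUS_TOOLS = {"send_email", "issue_refund", "delete_account"}

def execute_with_gate(name: str, args: dict, tools: dict, confirm) -> dict:
    """Hard gate in code: dangerous tools never execute without approval
    from the application (or a human), no matter what the prompt says."""
    if name in DANGEROUS_TOOLS and not confirm(name, args):
        # Returned as structured data so the model knows why nothing ran.
        return {
            "error": "confirmation_denied",
            "recovery_hint": "Ask the user to approve this action explicitly.",
        }
    return tools[name](**args)

# Example approval callback for a CLI; a web app would use a review UI.
def cli_confirm(name: str, args: dict) -> bool:
    return input(f"Run {name}({args})? [y/N] ").strip().lower() == "y"
```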
Myth vs Reality
Myth
“More tools make the model more capable”
Reality
Tool count and selection accuracy are inversely related past ~20 tools. A focused 8-tool agent often outperforms a 40-tool agent because the model can correctly choose what to call. Curate aggressively and use router patterns when the surface genuinely must be wider.
Myth
“The LLM handles tool errors gracefully if you just tell it to”
Reality
Models retry on opaque errors but rarely diagnose. They will call the same broken tool with the same broken arguments three times then either give up or invent an answer. Structured errors with recovery hints fix this; vague errors do not.
Knowledge Check
You're shipping an agent with 8 tools and per-call selection accuracy of 92%. A teammate proposes adding 25 more tools 'to make it more capable.' Selection accuracy in tests with the new tool set drops to 71%. What should you do?
Industry benchmarks
Is your number good?
Calibrate against real-world tiers. Use these ranges as targets — not absolutes.
Tool Selection Accuracy
Production agent with a curated toolset of 5-15 tools.
- Excellent: > 97%
- Good: 92-97%
- Marginal: 85-92%
- Don't Ship: < 85%
Source: Anthropic and OpenAI tool-use evaluation patterns
Real-world cases
Companies that lived this.
Verified narratives with the numbers that prove (or break) the concept.
Claude Code (Anthropic)
2024-2026
Claude Code exposes a deliberately small, sharp toolset (read_file, edit_file, run_bash, search files, web fetch). Each tool has narrow scope, structured outputs, and clear error messages. The product's reliability comes substantially from this restraint — the agent picks correctly because there are few choices, and error handling is consistent because every tool returns the same structured shape. The lesson is the inverse of 'more tools = more capability.'
- Tool Count: Small, focused set
- Tool Description Style: Explicit examples, structured I/O
- Error Handling: Structured, recoverable
Restraint in tool design is a competitive advantage. The temptation to expose every internal API is the road to a 40-tool agent that can't reliably pick any of them.
Cursor
2024-2026
Cursor's agentic mode (Composer / Agent) exposes file-editing and codebase-search tools. The product invested heavily in tool-call reliability — argument fidelity for multi-file edits, structured diffs, and error recovery when a tool call fails partway through. Public posts emphasize that the engineering work is mostly in the tools and harness, not in the prompts.
- Engineering Focus: Tools and harness over prompts
- Reliability Lever: Structured tool I/O
When AI products feel reliable, it is almost always because someone invested in tool design and error contracts, not because the prompts are clever.
Hypothetical: The 60-Tool Agent
Composite scenario
A B2B SaaS company exposed all 60 of its internal API endpoints as tools to one agent. Selection accuracy dropped to 64% in evals; almost 4 of 10 calls hit the wrong endpoint. They refactored to a router pattern (sketched after this case): a first call categorizes the request (CRM / billing / reporting / admin), a second call selects within that 8-12 tool subset. End-to-end accuracy jumped to 91%. The product became reliable; the API surface didn't change at all.
- Tools (initial): 60 in one set
- Selection Accuracy (initial): 64%
- Tools per Subset (router): 8-12
- Selection Accuracy (router): 91%
When you can't reduce the tool count, partition it. The model handles 'pick a category, then pick a tool within it' far better than 'pick one of 60.'
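A minimal sketch of that two-stage router, reusing the hypothetical call_model and run_agent from the first sketch; the categories and their tool subsets are placeholders.

```python
# Hypothetical 8-12 tool subsets keyed by category; contents elided here.
CATEGORIES: dict[str, dict] = {"crm": {}, "billing": {}, "reporting": {}, "admin": {}}

ROUTER_TOOL = {
    "name": "select_category",
    "description": "Pick which product area this request belongs to.",
    "input_schema": {
        "type": "object",
        "properties": {"category": {"type": "string", "enum": list(CATEGORIES)}},
        "required": ["category"],
    },
}

def route_and_run(user_message: str) -> str:
    """Stage 1: pick a category (one tool, four options). Stage 2: run the
    normal agent loop over that category's 8-12 tool subset."""
    pick = call_model([{"role": "user", "content": user_message}],
                      tools=[ROUTER_TOOL])
    subset = CATEGORIES[pick["args"]["category"]]
    return run_agent(user_message, tools=subset)
```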