AI Tool Use Patterns
Tool use is when an LLM decides to call an external function — search, calculator, database query, API — instead of (or in addition to) generating text. Modern frontier models (Claude, GPT, Gemini) emit structured tool-call JSON the application executes; the result is fed back into the model. Tool use is the single biggest unlock turning LLMs from chatbots into systems: search beats hallucinated facts, calculators beat hallucinated math, code execution beats hallucinated outputs. The architectural patterns matter as much as the model: which tools you expose, how you scope them, how you handle errors, and how you prevent the model from looping on a broken tool define the system's reliability.
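To make the loop concrete, here is a minimal Python sketch. Everything in it is illustrative: call_model stands in for your provider's chat API (Anthropic, OpenAI, and Gemini each expose an equivalent), and the tool registry is stubbed; real SDKs differ in field names but follow this shape.

```python
import json

def call_model(messages: list[dict], tools: list) -> dict:
    """Hypothetical wrapper over your provider's chat API. Returns either
    {"type": "text", "text": ...} or
    {"type": "tool_call", "name": ..., "args": {...}}."""
    raise NotImplementedError("wire this to your provider's SDK")

# Tool registry: name -> plain Python function.
TOOLS = {
    "search": lambda query: f"top results for {query!r}",    # stub
    "calculator": lambda expression: str(eval(expression)),  # demo only
}

def run_agent(user_message: str, tools: dict = TOOLS, max_steps: int = 8) -> str:
    """Drive the tool-use loop: the model answers in text or emits a
    structured tool call; the application executes it and feeds the
    result back."""
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):  # hard cap so a broken tool can't loop forever
        reply = call_model(messages, tools=list(tools))
        if reply["type"] == "text":
            return reply["text"]
        result = tools[reply["name"]](**reply["args"])  # execute the call
        messages.append({"role": "assistant", "content": json.dumps(reply)})
        messages.append({"role": "tool", "name": reply["name"], "content": result})
    return "Step budget exhausted; escalate instead of guessing."
```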
The Trap
The trap is exposing too many tools at once. Models degrade as the tool list grows — past ~20-30 tools, selection accuracy drops sharply because the model has to reason over too many options. The other trap is vague tool descriptions ('search_thing — searches things'); the LLM cannot guess intent it isn't told. And the worst trap: tools without idempotency or error contracts — when a tool fails, the model retries; if 'send_email' isn't idempotent, you ship 4 emails to the same customer. Tool use shifts most of the engineering work from prompts to tool design.
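To make the send_email failure concrete, here is a minimal sketch of the idempotency-key fix. All names are hypothetical, and the in-memory store stands in for a durable one; the point is that a retry with the same key becomes a no-op instead of a duplicate send.

```python
import hashlib

_SENT: set[str] = set()  # demo store; use a durable store (DB/Redis) in production

def _deliver(to: str, subject: str, body: str) -> None:
    print(f"delivering to {to}")  # stand-in for the real email transport

def send_email(to: str, subject: str, body: str,
               idempotency_key: str | None = None) -> dict:
    """Idempotent send: a retry with the same key (or identical content)
    is suppressed, so a model retrying after a timeout can't ship
    duplicates."""
    key = idempotency_key or hashlib.sha256(
        f"{to}|{subject}|{body}".encode()).hexdigest()
    if key in _SENT:
        return {"status": "duplicate_suppressed", "idempotency_key": key}
    _SENT.add(key)
    _deliver(to, subject, body)
    return {"status": "sent", "idempotency_key": key}
```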
What to Do
Design tools like a public API: (1) Narrow scope — one tool, one action; resist 'multipurpose' tools that branch on parameters. (2) Self-documenting — names, parameter descriptions, examples, and explicit return schemas the model can introspect. (3) Idempotent or marked non-idempotent — if 'create_order' isn't idempotent, name it 'create_order_unsafe' so the model treats it carefully. (4) Structured errors — return machine-readable error codes and recovery hints; the model uses them to retry sensibly. (5) Tool selection scaffolding — if you have 50+ tools, use a router pattern: a first call selects the tool category, a second call uses the narrowed toolset. (6) Eval per tool — every tool has its own test suite for selection accuracy and parameter fidelity.
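To make points (1) through (4) concrete, here is what one narrowly scoped, self-documenting tool definition can look like, written in the JSON-schema function-calling shape the major providers all use in some variant. Exact field names vary by provider, and issue_refund is a made-up example.

```python
# One narrowly scoped, self-documenting tool definition. Note how the
# description states when to use it, gives an example call, documents the
# return shape, and loudly flags non-idempotency (point 3).
REFUND_TOOL = {
    "name": "issue_refund",  # one tool, one action; not "handle_payment_op"
    "description": (
        "Issue a refund for a single existing order. Use ONLY when the "
        "customer explicitly asks for money back on a specific order. "
        "Example: issue_refund(order_id='ord_123', amount_cents=1999, "
        "reason='damaged item'). Returns {refund_id, status} on success, "
        "or {error_code, recovery_hint} on failure. NOT idempotent: "
        "calling twice refunds twice."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "order_id": {
                "type": "string",
                "description": "Existing order ID, e.g. 'ord_123'",
            },
            "amount_cents": {
                "type": "integer",
                "description": "Refund amount in cents; must not exceed the order total",
            },
            "reason": {
                "type": "string",
                "description": "Short reason, recorded on the order",
            },
        },
        "required": ["order_id", "amount_cents", "reason"],
    },
}
```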
In Practice
Anthropic's tool-use documentation describes patterns proven across customer deployments: scope tools narrowly, write rich descriptions with examples, and handle errors as data the model can reason over. OpenAI's function calling and Google's Gemini tool use follow analogous patterns. Cursor and Claude Code expose tools (read_file, edit_file, run_command, search_codebase) with tight scopes and structured outputs — each tool is small enough that the model picks correctly, and errors come back as structured information the model can act on. The tools are the product more than the prompt.
Pro Tips
- 01
Treat tool descriptions as first-class prompts. Models pick tools based on the description; ambiguous descriptions cause silent misrouting that's painful to debug. Run an eval where you give the model 100 user queries and check tool-selection accuracy before you optimize anything else.
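A minimal sketch of that eval, reusing the hypothetical call_model wrapper from the first sketch; the labeled queries are placeholders for your own data.

```python
# Labeled queries: which tool the model *should* pick (None = answer directly).
EVAL_SET = [
    {"query": "what's 23% of 1,840?", "expected_tool": "calculator"},
    {"query": "find Acme's current pricing page", "expected_tool": "search"},
    {"query": "say hi", "expected_tool": None},
    # ... extend to ~100 queries covering every tool
]

def selection_accuracy(tools: list[str]) -> float:
    """Fraction of queries where the model picks the expected tool."""
    hits = 0
    for case in EVAL_SET:
        reply = call_model([{"role": "user", "content": case["query"]}],
                           tools=tools)
        picked = reply["name"] if reply["type"] == "tool_call" else None
        hits += picked == case["expected_tool"]
    return hits / len(EVAL_SET)
```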
- 02
Structured errors are how you turn flaky tools into reliable systems. Returning {error: 'ratelimit', retry_after_ms: 800} lets the model wait and retry. Returning 'something went wrong' makes the model improvise — usually badly.
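A sketch of what the harness side can do with that payload, assuming every tool returns a structured dict like the one in the tip: transient rate limits get retried in code, and every other structured error goes back to the model as data.

```python
import time

def execute_tool(fn, args: dict, max_retries: int = 2) -> dict:
    """Honor machine-readable recovery hints in the harness; pass any
    other structured error back to the model to reason over."""
    result = fn(**args)
    for _ in range(max_retries):
        if result.get("error") != "ratelimit":
            break  # success, or an error the model should see
        time.sleep(result.get("retry_after_ms", 1000) / 1000)  # wait as hinted
        result = fn(**args)
    return result
```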
- 03
If a tool is dangerous, put the confirmation gate in the application layer, not the prompt. 'You must say MAYISEND first' is a prompt-injection liability; a real confirmation gate in code is not.
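A minimal sketch of that gate; the tool names and the confirm callback are hypothetical, and tools is the same name-to-function registry used in the earlier sketches.

```python
DANGEROUS_TOOLS = {"send_email", "issue_refund", "delete_account"}

def execute_with_gate(name: str, args: dict, tools: dict, confirm) -> dict:
    """Hard gate in code: dangerous tools never execute without approval
    from the application (or a human), no matter what the prompt says."""
    if name in DANGEROUS_TOOLS and not confirm(name, args):
        # Returned as structured data so the model knows why nothing ran.
        return {
            "error": "confirmation_denied",
            "recovery_hint": "Ask the user to approve this action explicitly.",
        }
    return tools[name](**args)

# Example approval callback for a CLI; a web app would use a review UI.
def cli_confirm(name: str, args: dict) -> bool:
    return input(f"Run {name}({args})? [y/N] ").strip().lower() == "y"
```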
Myth vs Reality
Myth
“More tools make the model more capable”
Reality
Tool count and selection accuracy are inversely related past ~20 tools. A focused 8-tool agent often outperforms a 40-tool agent because the model can correctly choose what to call. Curate aggressively and use router patterns when the surface genuinely must be wider.
Myth
“The LLM handles tool errors gracefully if you just tell it to”
Reality
Models retry on opaque errors but rarely diagnose. They will call the same broken tool with the same broken arguments three times then either give up or invent an answer. Structured errors with recovery hints fix this; vague errors do not.
Knowledge Check
You're shipping an agent with 8 tools and per-call selection accuracy of 92%. A teammate proposes adding 25 more tools 'to make it more capable.' Selection accuracy in tests with the new tool set drops to 71%. What should you do?
Industry benchmarks
Is your number good?
Calibrate against real-world tiers. Use these ranges as targets — not absolutes.
Tool Selection Accuracy
Production agent with a curated toolset of 5-15 tools.
- Excellent: > 97%
- Good: 92-97%
- Marginal: 85-92%
- Don't Ship: < 85%
Source: Anthropic and OpenAI tool-use evaluation patterns
Real-world cases
Companies that lived this.
Verified narratives with the numbers that prove (or break) the concept.
Claude Code (Anthropic)
2024-2026
Claude Code exposes a deliberately small, sharp toolset (read_file, edit_file, run_bash, search files, web fetch). Each tool has narrow scope, structured outputs, and clear error messages. The product's reliability comes substantially from this restraint — the agent picks correctly because there are few choices, and error handling is consistent because every tool returns the same structured shape. The lesson is the inverse of 'more tools = more capability.'
- Tool Count: Small, focused set
- Tool Description Style: Explicit examples, structured I/O
- Error Handling: Structured, recoverable
Restraint in tool design is a competitive advantage. The temptation to expose every internal API is the road to a 40-tool agent that can't reliably pick any of them.
Cursor
2024-2026
Cursor's agentic mode (Composer / Agent) exposes file-editing and codebase-search tools. The product invested heavily in tool-call reliability — argument fidelity for multi-file edits, structured diffs, and error recovery when a tool call fails partway through. Public posts emphasize that the engineering work is mostly in the tools and harness, not in the prompts.
- Engineering Focus: Tools and harness over prompts
- Reliability Lever: Structured tool I/O
When AI products feel reliable, it is almost always because someone invested in tool design and error contracts, not because the prompts are clever.
Hypothetical: The 60-Tool Agent
Composite scenario
A B2B SaaS company exposed all 60 of its internal API endpoints as tools to one agent. Selection accuracy dropped to 64% in evals; almost 4 of 10 calls hit the wrong endpoint. They refactored to a router pattern (sketched after this case): a first call categorizes the request (CRM / billing / reporting / admin), a second call selects within that 8-12 tool subset. End-to-end accuracy jumped to 91%. The product became reliable; the API surface didn't change at all.
- Tools (initial): 60 in one set
- Selection Accuracy (initial): 64%
- Tools per Subset (router): 8-12
- Selection Accuracy (router): 91%
When you can't reduce the tool count, partition it. The model handles 'pick a category, then pick a tool within it' far better than 'pick one of 60.'
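A minimal sketch of that two-stage router, reusing the hypothetical call_model and run_agent from the first sketch; the categories and their tool subsets are placeholders.

```python
# Hypothetical 8-12 tool subsets keyed by category; contents elided here.
CATEGORIES: dict[str, dict] = {"crm": {}, "billing": {}, "reporting": {}, "admin": {}}

ROUTER_TOOL = {
    "name": "select_category",
    "description": "Pick which product area this request belongs to.",
    "input_schema": {
        "type": "object",
        "properties": {"category": {"type": "string", "enum": list(CATEGORIES)}},
        "required": ["category"],
    },
}

def route_and_run(user_message: str) -> str:
    """Stage 1: pick a category (one tool, four options). Stage 2: run the
    normal agent loop over that category's 8-12 tool subset."""
    pick = call_model([{"role": "user", "content": user_message}],
                      tools=[ROUTER_TOOL])
    subset = CATEGORIES[pick["args"]["category"]]
    return run_agent(user_message, tools=subset)
```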