Multi-Agent System Design
Multi-agent systems decompose a task across specialized LLM agents that coordinate via messages, shared state, or an orchestrator. Common patterns: (1) Orchestrator-worker — a planner agent dispatches subtasks to specialist agents (researcher, writer, critic, executor). (2) Pipeline — agents hand off sequential stages. (3) Debate/critic loops — two or more agents adversarially refine an answer. (4) Swarm — many short-lived agents work on shards of the same problem in parallel. The promise is scaling intelligence beyond a single context window; the cost is communication overhead, error compounding, and a debugging nightmare.
The Trap
The trap is going multi-agent because it sounds sophisticated when a single well-prompted call would do. Each additional agent in the chain multiplies error rates: if each agent is 90% reliable on its sub-task, a 5-agent pipeline is 0.9^5 ≈ 59% reliable end-to-end. Costs balloon — message-passing means agents repeatedly re-read context, often 3-10x the tokens of a monolithic call. And debugging is brutal: when the system produces a bad answer, you must trace which of N agents introduced the error, often through hundreds of inter-agent messages. KnowMBA POV: multi-agent systems sound clever in demos and break in production. Default to single-agent + tools; reach for multi-agent only when the problem genuinely cannot be expressed in one prompt.
What to Do
Apply a four-question gate before going multi-agent: (1) Does the task have genuinely independent sub-problems that can run in parallel? (If sequential, you probably want a single agent with tools, not multiple agents.) (2) Do the sub-tasks need different system prompts, models, or guardrails? (If not, one agent suffices.) (3) Can you accept a 2-5x cost increase and 30-60% lower end-to-end reliability without remediation? (4) Have you instrumented per-agent traces, message-bus logging, and per-step eval harnesses? If any answer is no, simplify. When you do build multi-agent, enforce: (a) typed message contracts between agents, (b) max-hop budgets to kill runaway loops, (c) a single source of truth for shared state (avoid agents re-deriving the same facts), (d) per-agent eval suites — a system is only as reliable as its weakest agent.
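A minimal sketch of guardrails (a), (b), and (c), assuming a Python orchestrator. AgentMessage, SharedState, MAX_HOPS, and route are hypothetical names for illustration, not any framework's API:

from dataclasses import dataclass, field

MAX_HOPS = 8  # guardrail (b): hard budget that kills runaway agent-to-agent loops

@dataclass(frozen=True)
class AgentMessage:
    # guardrail (a): a typed contract for every inter-agent handoff
    sender: str
    recipient: str
    task_id: str
    payload: dict
    hop: int  # incremented on each handoff

@dataclass
class SharedState:
    # guardrail (c): single source of truth so agents don't re-derive facts
    facts: dict = field(default_factory=dict)

def route(msg: AgentMessage, state: SharedState) -> AgentMessage:
    if msg.hop >= MAX_HOPS:
        raise RuntimeError(f"hop budget exhausted on task {msg.task_id}")
    # dispatch to the recipient agent here; write its findings into state.facts
    return AgentMessage(msg.recipient, "orchestrator", msg.task_id,
                        {"status": "done"}, msg.hop + 1)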
Formula
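End-to-end reliability of an n-stage agent pipeline is multiplicative: R = r1 × r2 × ... × rn, which collapses to r^n when every stage has the same per-agent reliability r. A quick sanity check in Python, using the figures from this section:

reliability = lambda r, n: r ** n
print(reliability(0.90, 5))  # ~0.59: the 5-agent pipeline above
print(reliability(0.91, 7))  # ~0.52: the 7-agent email pipeline below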
In Practice
Anthropic's published engineering account of a multi-agent research system describes an orchestrator-worker pattern where a lead agent decomposes research queries into parallel subagent tasks. The post is candid about the trade-offs: the multi-agent system uses roughly 15x more tokens than a single Claude call, and is reserved for breadth-first research questions where the parallel exploration is worth the cost. Most queries do not justify it. The headline lesson from one of the most credible multi-agent deployments in production: even when it works, it is expensive and hard, and you should not reach for it by default.
Pro Tips
- 01
In most cases, a single agent with many tools beats many agents with few tools. The model has full context; debugging is one trace; cost is bounded by one loop. Multi-agent is the right call only when the parallelism is the entire point (e.g., searching the web 12 ways simultaneously).
- 02
If you must orchestrate, treat the orchestrator like a critical infra service: typed schemas for handoffs, retry/backoff per worker, dead-letter queue for failed sub-tasks, and a hard max-hop budget. Without these, your 'agent system' is a distributed system without distributed-system discipline.
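A sketch of the retry and dead-letter pieces of that discipline, assuming Python workers; call_worker and DEAD_LETTER are hypothetical names:

import random
import time

DEAD_LETTER: list = []  # failed sub-tasks parked for inspection, not silently dropped

def call_worker(worker, task, max_retries=3):
    # retry with exponential backoff plus jitter; dead-letter on exhaustion
    for attempt in range(max_retries):
        try:
            return worker(task)  # one worker-agent invocation
        except Exception:
            time.sleep(2 ** attempt + random.random())
    DEAD_LETTER.append(task)  # keep the pipeline moving; a human triages later
    return None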
- 03
Run an A/B before going multi-agent: same task, single-agent baseline vs multi-agent variant, on a 200-item eval. If the multi-agent version isn't materially better on quality (not just 'feels smarter'), keep the single agent. The token bill alone usually decides it.
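A sketch of that A/B harness; ab_test, score, and the variant callables are hypothetical, and each variant is assumed to return (answer, tokens_used):

def ab_test(eval_set, single_agent, multi_agent, score):
    # compare average quality and total token spend on identical items
    totals = {"single": [0.0, 0], "multi": [0.0, 0]}  # [quality_sum, tokens]
    for item in eval_set:
        for name, variant in (("single", single_agent), ("multi", multi_agent)):
            answer, tokens = variant(item)
            totals[name][0] += score(item, answer)
            totals[name][1] += tokens
    n = len(eval_set)
    return {name: {"avg_quality": q / n, "total_tokens": t}
            for name, (q, t) in totals.items()}

If avg_quality barely moves while total_tokens multiplies, the single agent wins by default.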
Myth vs Reality
Myth
“More agents = more intelligence”
Reality
More agents = more places for the system to fail. Each handoff is a lossy compression of context. Adding a 'critic' agent that re-reads outputs catches some errors and introduces others. Empirically, beyond 3-4 specialized agents, marginal quality gains turn negative as coordination overhead dominates.
Myth
“Multi-agent systems are how AGI will work, so we should build that way now”
Reality
Future architectures are speculative; today's production systems suffer from the same engineering realities as any distributed system: latency, partial failure, debugging cost. Build for what works today; rearchitect when the model capabilities or tools materially change.
Try it
Run the numbers.
Pressure-test the concept against your own knowledge.
Knowledge Check
Your team is designing a customer-support automation. Current proposal: 6 specialized agents (intent-classifier, KB-retriever, policy-checker, response-drafter, tone-reviewer, sender). Each is 92% reliable in isolation. What's the most important critique?
Industry benchmarks
Is your number good?
Calibrate against real-world tiers. Use these ranges as targets — not absolutes.
When to Use Multi-Agent
Engineering decision framework
Strong Fit
Parallel breadth-first research, swarm simulation, independent sub-tasks
Conditional Fit
Sequential stages with materially different prompts and guardrails
Probably Wrong
Sequential stages that could be one prompt with tools
Anti-Pattern
Multi-agent because 'agents are the future'
Source: Anthropic engineering blog + production deployment patterns
Real-world cases
Companies that lived this.
Verified narratives with the numbers that prove (or break) the concept.
Anthropic Research System
2024-2025
Anthropic published an engineering account of a multi-agent research system used internally and in product surfaces. A lead orchestrator agent decomposes a query into parallel research subagent tasks; subagents explore different facets and return findings; the orchestrator synthesizes. The post explicitly states the system uses ~15x the tokens of a single Claude call, and is reserved for queries where breadth-first parallel research justifies the cost. Most queries are still single-agent.
Token Cost vs Single Call
~15x
Pattern
Orchestrator-worker, parallel
Default Posture
Single-agent unless parallelism is the point
Even at the frontier, multi-agent is reserved for problems where parallel exploration is worth a 15x token premium. Not the default; not even close.
Hypothetical: The 7-Agent Customer Email Pipeline
Composite scenario
A SaaS company built a 7-agent pipeline for customer email replies: classifier → retriever → policy-checker → drafter → tone-reviewer → personalizer → sender. Each agent was ~91% reliable. End-to-end reliability landed at 0.91^7 ≈ 51%. The team rebuilt as a single agent with retrieval and policy tools; reliability rose to 84%, latency dropped from 14s to 3s, and token cost fell 70%. The 'sophisticated' pipeline was worse on every dimension.
Multi-Agent Reliability
~51%
Single-Agent Reliability
~84%
Latency
14s → 3s
Token Cost Reduction
70%
Multiplicative error decay is the silent killer of multi-agent systems. If you can't justify why splitting helps more than it hurts, don't split.
Decision scenario
The Architect's Multi-Agent Pitch
Your principal engineer proposes a 6-agent system to handle internal IT tickets: classifier, knowledge-retriever, policy-validator, action-planner, executor, summarizer. Each agent is estimated at 90% reliable. Your current single-agent + tools prototype is 78% reliable. Throughput needs are 10K tickets/day. The principal argues 'specialized agents are more accurate.'
Tickets per Day
10,000
Single-Agent Prototype Reliability
78%
Proposed Multi-Agent Per-Stage Reliability
90%
Single-Agent Token Cost per Ticket
~6,000 tokens
Estimated Multi-Agent Tokens per Ticket
~28,000 tokens (5x rebroadcast)
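Running the implied arithmetic first (a sketch in Python; every input is an estimate from the figures above):

stages, per_stage = 6, 0.90
multi_reliability = per_stage ** stages      # 0.9^6 ~= 0.53 end-to-end
single_reliability = 0.78                    # current prototype baseline

tickets_per_day = 10_000
single_tokens = 6_000 * tickets_per_day      # ~60M tokens/day
multi_tokens = 28_000 * tickets_per_day      # ~280M tokens/day, ~4.7x the spend

# Per-stage 90% compounds to ~53% end-to-end: worse than the 78% baseline,
# at nearly 5x the token cost, before any coordination failures are counted.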
Decision 1
You need to decide architecture before the next sprint. The principal is influential and the design 'feels' more rigorous.
Approve the 6-agent architecture — specialized agents will produce better outputs, and the team can iterate to improve each stage independently
Stay single-agent + tools. Invest the next two sprints in prompt engineering, eval harnesses, and tool reliability to raise the 78% baseline. Revisit multi-agent only if you hit a ceiling that single-agent provably cannot cross. ✓ Optimal
Beyond the concept
Turn Multi-Agent System Design into a live operating decision.
Use this concept as the framing layer, then move into a diagnostic if it maps directly to a current bottleneck.