Why multi-agent LLM systems fail at exactly the scale they are sold to solve

There is a graph everyone in the multi-agent-LLM space has seen and almost nobody publishes. It is a plot of task-completion rate against the number of agents in the system, and it does the same thing across every architecture I have seen: it climbs to a peak somewhere between N=3 and N=6, plateaus through N=8, and then deteriorates sharply. Somewhere between N=15 and N=20, the system performs worse than a single capable agent.

I have seen versions of this graph in three different companies' internal evaluations over the last quarter. The companies are different, the agent architectures are different, the underlying models are different. The shape of the curve is identical.

That is not a coincidence. It is the same emergent-coordination wall that the swarm-robotics literature ran into in the late 1990s, that the distributed-consensus community ran into in the mid-2010s, and that the multi-agent reinforcement-learning community ran into in the late 2010s. The wall has a name in each field. In production LLM systems it has not yet acquired one, mostly because the companies hitting it have a strong commercial incentive not to publish the curve.

What the curve looks like

The cleanest published version is in the METR multi-agent evaluation report, Section 4.2: task-completion rate on the HCAST suite, by agent count, on Claude 3.7 Sonnet, with a standard plan-and-execute orchestrator. Single-agent baseline is 41%. Three-agent system peaks at 58%. Eight-agent system stays around 56%. Twelve-agent system drops to 47%. Twenty-agent system drops to 31% — below the single-agent baseline.

Figure 1 (METR-style, illustrative). Multi-agent LLM task-completion curve by agent count, plan-and-execute orchestrator.

Source: Adapted from METR HCAST multi-agent benchmark Q1 2026 (Claude 3.7 Sonnet, plan-and-execute). Curve shape is consistent with three independent vendor internal evaluations the author has reviewed under NDA.

The numbers are slightly different in each evaluation. The shape is not. The peak is always between N=3 and N=6. The collapse is always sharp once you cross some company-specific threshold somewhere between N=10 and N=15.
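To make the shape concrete, here is a minimal Python sketch that takes the five completion rates quoted above and expresses each as a multiple of the single-agent baseline. The variable and function names are illustrative only and are not part of any METR tooling.

```python
# Task-completion rates quoted in the text (agent count -> percent).
completion_rate = {1: 41, 3: 58, 8: 56, 12: 47, 20: 31}


def relative_to_baseline(rates: dict[int, float]) -> dict[int, float]:
    """Return each rate as a multiple of the single-agent (N=1) rate."""
    baseline = rates[1]
    return {n: round(rate / baseline, 2) for n, rate in rates.items()}


ratios = relative_to_baseline(completion_rate)

# Agent count with the highest completion rate.
peak_n = max(completion_rate, key=completion_rate.get)

# Agent counts that do worse than a single agent.
below_baseline = [n for n, ratio in ratios.items() if ratio < 1.0]

for n, ratio in ratios.items():
    print(f"N={n:>2}: {completion_rate[n]}% ({ratio:.2f}x baseline)")
```

On these figures the peak is roughly 1.4 times the baseline, and only the twenty-agent configuration falls below it.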

Why it happens

Four mechanisms, all of which compound.

Context pollution. Each agent's context window fills, in steady state, with summaries of every other agent's actions and intermediate state. By N=8, between 35% and 60% of any given agent's context is other agents' work, not its own task. Signal-to-noise collapses. This is the LLM-specific failure mode.
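One way to see why that share grows with N is a toy model. It is an assumption of this sketch, not something any of the evaluations measured: each agent needs a fixed number of tokens for its own task and receives one fixed-size summary from every other agent. The function name, parameter names, and token sizes below are all hypothetical.

```python
def foreign_context_share(n_agents: int, own_tokens: int, summary_tokens: int) -> float:
    """Fraction of one agent's context taken up by other agents' summaries.

    Toy model (assumption): every other agent contributes one summary of
    `summary_tokens` tokens; the agent's own task needs `own_tokens` tokens.
    """
    foreign = (n_agents - 1) * summary_tokens
    return foreign / (own_tokens + foreign)


# Hypothetical sizes: 4,000 tokens of own work, 500-token summary per peer.
for n in (1, 4, 8, 12, 20):
    share = foreign_context_share(n, own_tokens=4000, summary_tokens=500)
    print(f"N={n:>2}: {share:.0%} of context is other agents' work")
```

With these made-up sizes the share at N=8 is a little under half, inside the 35% to 60% range quoted above, and it keeps rising as agents are added.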

Coordination overhead. The Nth agent adds N-1 new coordination edges, so the total grows as N(N-1)/2. A 12-agent system has 66 pairwise coordination relationships to manage. The orchestrator (usually a separate LLM) becomes a context-window bottleneck before the workers do; in three of the four production systems I have looked at, the orchestrator's context is the first thing to overflow at scale.
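The pairwise count is easy to check. The sketch below assumes a fully connected group, where every agent may need to coordinate with every other agent; the function names are illustrative.

```python
def pairwise_edges(n_agents: int) -> int:
    """Number of distinct agent pairs in a fully connected group of n agents."""
    return n_agents * (n_agents - 1) // 2


def edges_added_by_next_agent(n_agents: int) -> int:
    """Extra pairwise edges created when agent number n joins n - 1 others."""
    return pairwise_edges(n_agents) - pairwise_edges(n_agents - 1)


for n in (3, 5, 8, 12, 20):
    print(
        f"N={n:>2}: {pairwise_edges(n):>3} pairs, "
        f"+{edges_added_by_next_agent(n)} from the last agent"
    )
```

Going from 12 agents to 20 takes the pair count from 66 to 190, nearly three times as many relationships for fewer than twice as many agents.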

Goal drift. Agents acting on summarised global state and partial local state will systematically misestimate the joint objective. The literature in multi-agent RL has been documenting this since 2017 — partial observability plus delayed reward equals coordinated drift away from the intended optimum. LLM agents inherit the same pathology, with an additional contribution from hallucinated state.

Cascade failure. If one agent makes a wrong commitment and the orchestrator does not catch it within two cycles, the wrong commitment propagates through the dependency graph and the system spends the next several cycles rationalising the error. This is the failure mode that turns "underperforming" into "below baseline."
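To see how far a single bad commitment can travel, here is a small sketch that walks a task dependency graph and collects everything built on top of the faulty task. The graph and task names are hypothetical, and the breadth-first walk is just one simple way to model the propagation.

```python
from collections import deque


def tainted_tasks(dependents: dict[str, list[str]], bad_task: str) -> set[str]:
    """Return every task that directly or transitively builds on `bad_task`.

    `dependents` maps a task to the tasks that consume its output.
    """
    seen = {bad_task}
    queue = deque([bad_task])
    while queue:
        task = queue.popleft()
        for child in dependents.get(task, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen - {bad_task}


# Hypothetical dependency graph for a small plan-and-execute run.
dependents = {
    "schema": ["api", "migration"],
    "api": ["client", "docs"],
    "migration": ["deploy"],
    "client": ["deploy"],
}

print(sorted(tainted_tasks(dependents, "schema")))
```

In this example a wrong commitment at the root task contaminates all five downstream tasks, while the same mistake in a leaf task such as "docs" contaminates nothing else. The earlier in the graph the error lands, the more work the system later spends rationalising it.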

These mechanisms are structural. They are not artefacts of model quality. A stronger model does not flatten the curve: the METR Section 4.4 results show the curve is the same shape on Claude 3.7 as on Claude 3.5, with only a vertical offset of 4–6 percentage points.

What the bio-inspired literature already said

If the curve looks familiar, it is because it is the same curve Edward O. Wilson and Bert Hölldobler documented in social-insect colonies in The Ants (1990). Centrally planned coordination scales sub-linearly until the central planner saturates, then performance degrades. The biological solution, and it is the only solution that scales past a few hundred agents in any biological system, is stigmergy: agents coordinating not by talking to each other, but by leaving and reading state in a shared environment.

A foraging ant does not negotiate with the next ant. It deposits a pheromone trail; the next ant reads the trail and acts. The total coordination overhead is O(N), not O(N²). The communication channel is a shared substrate, not a peer-to-peer network. The colony scales to millions of agents without a central planner.

The architectural translation to LLM agents is straightforward but has only recently started to be adopted: replace the orchestrator-mediated communication with a shared blackboard (technically, a structured key-value store with append-only history) that all agents read from and write to. No agent maintains a model of any other agent's state. Each agent reads the relevant slice of the blackboard, takes its action, writes the result back. The orchestrator becomes a scheduler, not a state-aggregator.
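A minimal sketch of that pattern, under stated assumptions, looks like the following. Every name here (Entry, Blackboard, Agent, run_round) is hypothetical and does not correspond to any vendor's API, and the lambda functions stand in for LLM calls. It is meant to show the shape of the idea, not a production runtime.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass(frozen=True)
class Entry:
    seq: int      # position in the append-only history
    key: str      # slice of the blackboard, e.g. "plan" or "code"
    author: str   # agent that wrote the entry
    value: Any


@dataclass
class Blackboard:
    history: list[Entry] = field(default_factory=list)

    def write(self, key: str, author: str, value: Any) -> Entry:
        """Append a new entry; earlier entries are never modified."""
        entry = Entry(len(self.history), key, author, value)
        self.history.append(entry)
        return entry

    def read(self, key: str) -> Any:
        """Return the latest value written under `key`, or None."""
        for entry in reversed(self.history):
            if entry.key == key:
                return entry.value
        return None


@dataclass
class Agent:
    name: str
    reads: str                 # key this agent consumes
    writes: str                # key this agent produces
    act: Callable[[Any], Any]  # stand-in for an LLM call

    def step(self, board: Blackboard) -> None:
        # The agent looks only at its own slice, never at another agent.
        observed = board.read(self.reads)
        if observed is not None:
            board.write(self.writes, self.name, self.act(observed))


def run_round(board: Blackboard, agents: list[Agent]) -> None:
    """The scheduler only decides who runs; it aggregates no state."""
    for agent in agents:
        agent.step(board)


board = Blackboard()
board.write("task", "user", "add login endpoint")
agents = [
    Agent("planner", "task", "plan", lambda t: f"plan for: {t}"),
    Agent("coder", "plan", "code", lambda p: f"code from [{p}]"),
    Agent("reviewer", "code", "review", lambda c: f"approved: {c}"),
]
run_round(board, agents)
print(board.read("review"))
```

Each agent performs one read and at most one write per round, so the work per round grows linearly with the number of agents, and the scheduler never holds anyone's state. A real system would also need to decide when an agent should skip a round because nothing in its slice has changed, which this sketch leaves out.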

A handful of production systems are quietly moving to this architecture. Sierra's most recent technical post describes their "shared workspace" abstraction, which is stigmergic in everything but name. The LangGraph team's checkpoint-based coordination is doing the same thing, with different terminology. Two of the three internal evaluations I mentioned at the top of this column showed the curve flatten significantly — task-completion rate stays at the N=5 peak through at least N=20 — once the architecture switched from orchestrator-mediated to blackboard-mediated coordination.

The result is not a surprise to anyone who has read the swarm-robotics literature. It is something of a surprise to the LLM-agent vendors who shipped 2024-vintage products predicated on the inverse architecture.

What it implies commercially

Three things.

One: the marketing narrative around "agent fleets" needs revision. The dominant claim in the multi-agent space — that N agents can do roughly N times the work of one agent — is empirically false for any current architecture past N≈8. The accurate claim is that 3–5 well-coordinated agents can do roughly 1.4x the work of one agent, and further scaling does not work without changing the architecture. Customers who were sold on the larger number are starting to notice.

Two: the architectural pivot is non-trivial and most vendors have not done it yet. Moving from orchestrator-mediated to blackboard-mediated coordination requires rewriting the agent runtime, the state-management layer, and the planning sub-system. It is a 6-to-12-month engineering programme. Vendors who ship the pivot first acquire a real moat; vendors who do not will get caught when their N=15 demos stop working in front of enterprise procurement.

Three: the durable winners are the platform layers. The bio-inspired-AI analogy holds here too: the colony's intelligence does not live in any individual ant; it lives in the substrate (the pheromone field, in nature; the shared blackboard, in software). The companies that own the blackboard primitive — LangGraph, AWS Bedrock Agents with Memory, increasingly Microsoft Foundry — capture more durable value than the companies that build individual agents on top of it. The agent is the ant. The platform is the nest. The rent accrues to whoever owns the nest.

The Concept-Index point

This is, again, Emergent Coordination as the binding variable. Whenever a system's performance depends on a population of agents working together, the population's coordination architecture sets the ceiling on what the system can do. Agent quality only matters once the coordination architecture is right; until then, adding agents actively makes things worse.

The swarm-robotics community learned this in 1998. The multi-agent RL community learned it in 2018. The LLM-agent community is learning it in 2026. The lesson is the same each time.

— Kairos Thorne, Singapore. 5 May 2026.
