The inference cost curve is moving 4x faster than the training one — and that breaks most AI business plans

The interesting line in the most recent Epoch AI compute tracker is not the one analysts keep quoting. The training-cost curve for frontier models has indeed bent — roughly an 11x reduction in dollars-per-effective-FLOP between GPT-4's 2023 training run and the Q1 2026 generation. That is large. It is also a slow-moving variable that almost no business plan is exposed to directly.

The line that matters for revenue is the inference one. And the inference line has moved faster than almost anyone modelled.

Pull the published API prices for GPT-4-class capability. OpenAI's gpt-4-turbo launched in April 2024 at $30/M output tokens; Anthropic's Claude 3.7 Sonnet, at materially higher benchmark performance, cost $3.00/M output tokens in April 2026. That is a 90% compression in 24 months. Strip out the model upgrades and look only at the same model held constant: gpt-4o output fell from $15 to $2.50 over the same period, a drop of ~83%. Either way you slice it, the curve is steep.
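The two compressions quoted above can be checked, and converted into a per-year rate, with a few lines of arithmetic. This is a minimal sketch: the prices are the ones in the text, while the function names and the assumption of a constant yearly rate of decline over the 24 months are mine.

```python
def total_decline(old_price: float, new_price: float) -> float:
    """Fractional price drop between two quotes (0.9 means a 90% fall)."""
    return 1.0 - new_price / old_price


def annualised_decline(old_price: float, new_price: float, months: float) -> float:
    """Constant yearly decline that would produce the same total drop.

    Assumes the price fell at a steady compound rate, which is a
    simplification; real price cuts arrive in steps.
    """
    return 1.0 - (new_price / old_price) ** (12.0 / months)


# Dollars per million output tokens, as quoted in the text.
quotes = {
    "cross-model (gpt-4-turbo -> Claude 3.7 Sonnet)": (30.00, 3.00),
    "same-model (gpt-4o)": (15.00, 2.50),
}

for label, (old, new) in quotes.items():
    print(
        f"{label}: total {total_decline(old, new):.1%}, "
        f"annualised {annualised_decline(old, new, 24):.1%}"
    )
```

On these inputs the 90% two-year drop works out to a little under 70% per year, and the same-model drop to roughly 60% per year.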

Figure 1 — Illustrative, indexed

Inference-cost curve vs. training-cost curve, indexed Apr 2024 = 100

Source: Author's compilation from Epoch AI, OpenAI & Anthropic API pricing histories, and published benchmark equivalence work. Curves are stylised; the relative slopes are the point.

A few people have flagged this. Andrej Karpathy noted in February that the per-token cost of frontier-equivalent capability was falling roughly 10x per year. Mosaic's research blog put numbers on it. What none of them has done — and what is the subject of this column — is work through what the curve does to the margin structure of the application layer.

The three forces compounding

The inference cost curve is not bending for one reason. It is bending for at least three, and they compound multiplicatively.

The first is architectural. Mixture-of-Experts routing, speculative decoding, prefix caching, KV-cache compression, and 4-bit quantisation each shave roughly 30–60% off serving cost, and they stack. Claude 3.7's inference stack is conservatively 4x more efficient per parameter than GPT-4's was at launch. None of this is research; all of it is in production.
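To see why 4x can fairly be called conservative, here is a rough sketch of how per-technique savings stack. It assumes, optimistically, that each technique removes its fraction of whatever cost the previous ones left behind; in practice the savings overlap, so treat the upper figure as a ceiling rather than a forecast. The function name is illustrative.

```python
from math import prod


def stacked_efficiency(savings: list[float]) -> float:
    """Overall efficiency multiple when each saving applies to the cost
    remaining after the previous ones (an idealised, independent stack)."""
    remaining_cost = prod(1.0 - s for s in savings)
    return 1.0 / remaining_cost


# Five techniques named in the text: MoE routing, speculative decoding,
# prefix caching, KV-cache compression, 4-bit quantisation.
TECHNIQUES = 5

low = stacked_efficiency([0.30] * TECHNIQUES)   # every technique at the low end
high = stacked_efficiency([0.60] * TECHNIQUES)  # every technique at the high end

print(f"all at 30%: {low:.1f}x cheaper")
print(f"all at 60%: {high:.1f}x cheaper")
```

Even with every technique at the bottom of its range, the idealised stack comes out near 6x, so a 4x figure leaves room for the overlap that real serving stacks exhibit.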

The second is silicon. The Blackwell B200 platform delivers roughly 2.4x the inference tokens-per-watt of H100 at FP4. The TPU v6e delivers something similar for Google's internal workloads. Per Q1 FY27 10-Q filings, hyperscaler capex run-rate is on pace for $610bn in calendar 2026, more than double 2024, and roughly 70% of that build-out is inference-skewed, not training-skewed. Capacity is being added faster than demand, deliberately, because the option value of holding spare inference capacity exceeds its marginal cost.

The third is competition. Two years ago there were three companies serving frontier-capability inference at scale. Today there are at least seven, plus a long tail of open-weight serving companies (Together, Fireworks, Groq, Cerebras) running heavily-optimised stacks on commodity hyperscaler capacity. Llama 3.3 405B inference on Together AI clears at $0.88/M output tokens, for capability that was state-of-the-art eighteen months ago and priced at $60/M.

Stack the three forces: efficient architectures (3–5x) × better silicon (~2x) × competitive pricing pressure (~2x). That multiplies out to roughly 12–20x, which is approximately the curve in Figure 1.
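The multiplication is simple enough to spell out. The sketch below uses the ranges given in the text; the dictionary layout and helper name are mine, and treating the three forces as fully independent multipliers is the column's simplification, not a measured result.

```python
from math import prod

# (low, high) cost-reduction multiples, as quoted in the text.
forces = {
    "architecture": (3.0, 5.0),
    "silicon": (2.0, 2.0),
    "competition": (2.0, 2.0),
}

low_multiple = prod(lo for lo, _ in forces.values())
high_multiple = prod(hi for _, hi in forces.values())


def implied_price_drop(multiple: float) -> float:
    """Fractional price fall implied by an N-fold cost reduction."""
    return 1.0 - 1.0 / multiple


print(f"combined multiple: {low_multiple:.0f}x to {high_multiple:.0f}x")
print(
    f"implied price drop: {implied_price_drop(low_multiple):.1%} "
    f"to {implied_price_drop(high_multiple):.1%}"
)
```

A 12x to 20x combined multiple implies a price fall of about 92% to 95%, the same order of magnitude as the 90% compression in published API prices.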

What it re-prices

Now to the part that breaks business plans.

A 2024-vintage AI application-layer plan — call it a coding assistant, a contact-centre agent, a vertical-SaaS copilot — modelled gross margin against a presumed input cost that fell at roughly 30% per year. That was the consensus. At 30% per year, a SaaS product priced at $40/seat/month with $14 of underlying inference cost in year one becomes a $40/$10 model in year two, $40/$7 in year three. Gross margins climb. Standard story.

What actually happened is closer to 70% per year. The same $14 cost is now $4.20, on its way to $1.25 by Q4 2026. That should be wonderful news for the application layer — and for the application layer that owns its distribution and pricing, it is.
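The gap between the plan and the outcome is easiest to see side by side. This is a back-of-envelope sketch using the $40 seat price and $14 year-one inference cost from the example; it assumes inference is the only cost of goods sold and that the seat price never changes, neither of which holds for a real company.

```python
def inference_cost(initial_cost: float, annual_decline: float, years: int) -> float:
    """Per-seat inference cost after compounding a constant yearly decline."""
    return initial_cost * (1.0 - annual_decline) ** years


def gross_margin(price: float, cost: float) -> float:
    """Gross margin as a fraction of price, counting inference as the only COGS."""
    return (price - cost) / price


SEAT_PRICE = 40.0    # $/seat/month, from the example in the text
YEAR_ONE_COST = 14.0  # $/seat/month of inference in year one

for label, decline in (("planned, 30%/yr", 0.30), ("actual, 70%/yr", 0.70)):
    print(label)
    for years_elapsed in range(3):
        cost = inference_cost(YEAR_ONE_COST, decline, years_elapsed)
        margin = gross_margin(SEAT_PRICE, cost)
        print(f"  year {years_elapsed + 1}: cost ${cost:5.2f}, gross margin {margin:.1%}")
```

At a fixed seat price the faster curve pushes the modelled margin well above 90% within two years, which is exactly the number a procurement team can also compute.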

For everyone else, it is a problem. Three reasons:

One: the customer notices. Enterprise procurement teams now openly cite the published API rates in renewal conversations. "Your input cost fell 80%, we want 30% of that back" is a sentence being said in 2026 renewals in a way it was not in 2025. Net-revenue-retention curves at AI-native SaaS companies are starting to flatten earlier than the Bessemer benchmark suggests they should.

Two: the moat collapses. A vertical agent priced at $400/seat in 2024 because the underlying model was uniquely capable now has three competitors offering equivalent capability at $90, because the model is no longer uniquely capable. The category is being commoditised top-down by the inference curve itself.

Three: the unit economics that justified the 2024-vintage Series B valuations presupposed gross margin expansion and ARR expansion. The expansion is happening at the gross-margin line. It is not happening at the ARR line, because the customer captured most of the surplus through price renegotiation. Net result: revenue multiples compress while gross margins do not.
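A toy renewal makes the mechanism concrete. The sketch below reuses the $40 seat and $14 inference cost from earlier and the "cost fell 80%, we want 30% of that back" line; reading that demand as 30% of the vendor's dollar saving coming off the seat price is my assumption, as are the function names.

```python
def renegotiate(
    price: float, old_cost: float, cost_decline: float, pass_through: float
) -> tuple[float, float]:
    """Return (new_price, new_cost) after the customer claws back
    `pass_through` of the vendor's per-seat dollar saving.

    Hypothetical model: the clawback is taken off the seat price.
    """
    new_cost = old_cost * (1.0 - cost_decline)
    saving = old_cost - new_cost
    new_price = price - pass_through * saving
    return new_price, new_cost


def gross_margin(price: float, cost: float) -> float:
    """Gross margin as a fraction of price."""
    return (price - cost) / price


PRICE, COST = 40.0, 14.0
new_price, new_cost = renegotiate(PRICE, COST, cost_decline=0.80, pass_through=0.30)

print(f"seat price: ${PRICE:.2f} -> ${new_price:.2f}")
print(f"gross margin: {gross_margin(PRICE, COST):.1%} -> {gross_margin(new_price, new_cost):.1%}")
```

On these inputs revenue per seat falls by about 8% while gross margin rises from 65% to roughly 92%: expansion on one line, contraction on the other. Raising `pass_through` above 0.5 models the case where the customer keeps most of the surplus.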

Where the rents actually accrue

If the application layer is mostly giving the surplus back, where does the surplus end up?

Some accrues to the model labs, briefly. OpenAI and Anthropic both still post negative gross margin on inference at posted prices, by all credible analysis, though the gap is closing. The labs are not the durable beneficiary; they are the conduit.

Some accrues to the silicon, briefly. Nvidia margins, as I argued in the column on Blackwell, are also under structural compression even as units rise. Picks-and-shovels is real, but the shovels are also being commoditised.

Most of it accrues to whoever owns the demand-side relationship. This is the Aggregator Theory result and it appears yet again. When the marginal cost of intelligence falls toward the marginal cost of bandwidth, the only durable rent is over the customer relationship — the trust, the distribution, the workflow integration, the data feedback loop that makes your deployment of a commodity model materially better than a competitor's deployment of the same model.

That is why Microsoft can run Copilot at a presumed loss on raw inference and still have it be the most profitable individual product launch in the company's history: the inference is the loss-leader, the Office 365 entrenchment is the rent. It is why ServiceNow can charge $35/seat for Now Assist while Anthropic charges $20 for the same number of tokens directly: ServiceNow owns the workflow.

The implication for operators: do not invest in being uniquely good at running the inference. Invest in being uniquely good at owning the place the inference happens. The curve in Figure 1 is going to keep falling for at least another 18 months, and a 70% YoY decline does not slow down for moats made of model quality. It does slow down — sometimes it stops — for moats made of distribution, integration, and switching cost.

What to watch in the next four prints

Three numbers:

Hyperscaler inference-revenue mix. AWS Bedrock, Azure OpenAI Service, and Google Vertex are all guiding toward inference being >60% of their AI revenue by end-FY27. If that number disappoints in the next two prints, it means the demand curve is sloping down faster than the cost curve — bad for everyone.
AI-native SaaS gross-margin guides. Watch for any 2024-vintage AI-first SaaS company that has guided to "long-term gross margin in the 70–75% range" revising that guide down in the next 90 days. Three or more in a quarter is a signal.
Open-weight inference share. Together AI and Fireworks both publish quarterly throughput figures. If the open-weight share of total served tokens crosses 25% by Q3 2026 — currently around 14% on best estimate — the proprietary-model premium has effectively ended.

Inference deflation is the most important AI variable nobody is watching closely enough. It will determine, more than capex, more than training scale-out, more than even regulatory action, who keeps the surplus when this is over.

— Kairos Thorne, Singapore. 26 May 2026.
