The energy cost of inference at scale

Datacenter operators report efficiency gains. The aggregate numbers tell a different story.

When people talk about AI's energy problem, they almost always talk about training. The numbers are impressive in the worst sense. Training GPT-4 consumed an estimated 50 gigawatt-hours of electricity. Enough to power San Francisco for three days. The figure gets cited so often it has become a kind of shorthand for AI excess. But training is a one-off event. You do it, you have a model, you move on.

Inference is the other thing. Inference is what happens every time someone asks ChatGPT a question, every time a recommendation engine ranks a feed, every time an AI coding assistant suggests a completion. It happens billions of times a day, and it never stops. The International Energy Agency estimates that 80 to 90 percent of AI's total compute is now consumed by inference, not training. The expensive part is not building the model. It is running it.

The numbers nobody puts on the invoice

Global datacenters consumed roughly 415 terawatt-hours of electricity in 2024. That is about 1.5% of worldwide electricity generation. The IEA projects this will more than double to 945 TWh by 2030, with AI as the primary driver. In the United States alone, the Lawrence Berkeley National Laboratory forecasts datacenter demand growing from 176 TWh in 2023 to somewhere between 325 and 580 TWh by 2028. The range is wide because nobody is quite sure how fast this scales. That uncertainty should concern people more than it seems to.

Ireland offers a preview. Around 21% of the country's electricity now goes to datacenters. The IEA estimates that share could reach 32% by 2026. In the US state of Virginia, these facilities already consume 26% of the state's electricity. These are not projections from alarmist think-tanks. These are operational numbers from grid operators watching the load curves in real time.

The cost is not abstract. In the PJM electricity market (stretching from Illinois to North Carolina), datacenters accounted for an estimated $9.3 billion price increase in the 2025-26 capacity market. The average residential bill in western Maryland is expected to rise by $18 a month as a direct consequence. Carnegie Mellon estimates that datacenters and cryptocurrency mining could push average US electricity bills up 8% by 2030, with increases exceeding 25% in the highest-demand markets such as northern Virginia. The people paying that premium are not the people asking ChatGPT to summarise their emails.

Efficiency gains that do not add up

The industry response to these numbers is reliably optimistic. Google reports a 33x efficiency improvement for Gemini between May 2024 and May 2025. Quantisation techniques can reduce memory requirements by 75% and cut energy consumption by 60-80%. Mixture of Experts architectures activate only 5-10% of parameters per token. DeepSeek-V3 activates 37 billion parameters out of 671 billion total. The engineering is genuinely impressive.
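The sparsity and quantisation figures above are easy to sanity-check. A quick sketch (the DeepSeek-V3 parameter counts are from the text; the 16-bit-to-4-bit comparison is one common way to arrive at a 75% memory reduction, offered here as an illustration):

```python
# Check the Mixture of Experts sparsity figure cited above.
total_params = 671e9     # DeepSeek-V3 total parameters
active_params = 37e9     # parameters activated per token

active_fraction = active_params / total_params
print(f"{active_fraction:.1%} of parameters active per token")
# 37/671 ≈ 5.5%, squarely inside the 5-10% range typical of MoE models.

# Quantisation arithmetic: moving weights from 16-bit to 4-bit storage
# cuts weight memory by 75%, one way to reach the figure cited above.
fp16_bytes, int4_bytes = 2.0, 0.5
reduction = 1 - int4_bytes / fp16_bytes
print(f"{reduction:.0%} weight-memory reduction from 16-bit to 4-bit")
```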

But efficiency gains tell you how much energy each query costs. They do not tell you how much energy all the queries cost. And the demand curve is not politely waiting for the efficiency curve to catch up. Worldwide, AI datacenter power consumption is expected to reach 90 TWh by 2026. That is roughly a tenfold increase from 2022 levels. The efficiency per query improves. The total consumption climbs anyway. This pattern has a name in economics: the Jevons paradox. Make something cheaper to run and people run more of it.
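The divergence between the per-query curve and the total curve is just multiplication. A toy model makes it concrete (all numbers below are illustrative assumptions, not operator data: a 40% annual efficiency gain losing to a doubling of query volume):

```python
# Toy Jevons-paradox arithmetic: energy per query falls every year,
# yet total energy rises, because demand grows faster than efficiency.
# All figures are illustrative assumptions, not measurements.

energy_per_query_wh = 3.0        # assumed Wh per query in year 0
queries_per_day = 1e9            # assumed daily query volume in year 0

efficiency_gain_per_year = 0.40  # each query gets 40% cheaper per year
demand_growth_per_year = 2.0     # query volume doubles per year

for year in range(4):
    total_twh = energy_per_query_wh * queries_per_day * 365 / 1e12
    print(f"year {year}: {energy_per_query_wh:.2f} Wh/query, "
          f"{queries_per_day:.1e} queries/day, {total_twh:.2f} TWh/year")
    energy_per_query_wh *= (1 - efficiency_gain_per_year)
    queries_per_day *= demand_growth_per_year
```

Per-query energy falls by almost 80% over the run; annual consumption still climbs every year. That is the shape of the curves the industry is actually on.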

The GPU problem

The fundamental issue is architectural. GPUs were designed for graphics rendering and repurposed for AI. They are general-purpose machines doing specialised work, which means they carry overhead. An Nvidia H100 draws 350 watts in its PCIe form and up to 700 watts as an SXM module. It is optimised for flexibility: you can run any model on it, switch between training and inference, deploy it for image generation or protein folding or language processing. That generality comes at a thermodynamic cost.

In a previous piece I wrote about the interconnect bottleneck: how GPT-4's training cluster achieved only 32-36% utilisation of available compute due to communication overhead between chips. Inference has its own version of this problem. The dominant bottleneck is memory bandwidth. Every time a model generates a token, it needs to read the model's weights from memory. For a large language model, that means moving gigabytes of data through a memory bus that was not designed for this access pattern. The compute units sit idle, waiting for data. The watts keep flowing.
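The bandwidth ceiling can be put in numbers. A back-of-envelope sketch (assuming a 70-billion-parameter model in 8-bit weights and roughly 3 TB/s of HBM bandwidth, an H100-class figure; real serving stacks batch many requests to amortise the weight reads):

```python
# Back-of-envelope: the memory-bandwidth ceiling on single-stream decoding.
# Illustrative assumptions: 70B parameters at 1 byte each (8-bit weights),
# ~3 TB/s of HBM bandwidth on an H100-class accelerator.

params = 70e9            # model parameters
bytes_per_param = 1      # 8-bit quantised weights
hbm_bandwidth = 3e12     # bytes per second

weight_bytes = params * bytes_per_param
# Each decoded token must stream the full weight set through memory,
# so bandwidth, not arithmetic throughput, caps an unbatched request.
max_tokens_per_sec = hbm_bandwidth / weight_bytes
print(f"upper bound: {max_tokens_per_sec:.1f} tokens/s per stream")
# Batching N requests reads the weights once per step for all N outputs,
# which is why serving systems fight so hard to keep batches full.
```

Under these assumptions the hard ceiling is about 43 tokens per second for a single stream, no matter how many teraflops the chip advertises. That is what "compute units sit idle, waiting for data" looks like in arithmetic.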

This is not a problem you can engineer around within the existing paradigm. You can compress models. You can batch requests more cleverly. You can push the memory closer to the compute. But you are still fighting the basic physics of a general-purpose chip running a specific workload. The transistors that make the GPU flexible are the same transistors that make it inefficient for any single task.

The arithmetic of scale

Consider what happens as AI inference scales to the level the industry is projecting. The major labs want AI assistants handling billions of queries per day. They want AI embedded in every search result, every email draft, every customer service interaction. They want models running continuously, reasoning through multi-step tasks, generating not single responses but entire workflows.

At current efficiency levels, this trajectory requires building power generation capacity equivalent to several small countries. Not datacenters. Power plants. The infrastructure commitments being made by Microsoft, Google, Amazon, and Meta collectively amount to tens of billions of dollars in energy procurement, including reviving nuclear plants, signing long-term deals with renewable developers, and in some cases simply building gas-fired generation.
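The "several small countries" claim follows from simple unit conversion. A sketch (the query volume, per-query energy, and PUE below are all illustrative assumptions chosen to be conservative):

```python
# Rough conversion from projected query volume to generation capacity.
# Illustrative assumptions: 10 billion queries/day at 3 Wh each,
# with a facility PUE of 1.2 (cooling and overhead on top of IT load).

queries_per_day = 10e9
wh_per_query = 3.0
pue = 1.2

energy_per_day_wh = queries_per_day * wh_per_query * pue
avg_power_gw = energy_per_day_wh / 24 / 1e9   # Wh/day -> avg watts -> GW
annual_twh = energy_per_day_wh * 365 / 1e12

print(f"average draw: {avg_power_gw:.1f} GW")
print(f"annual energy: {annual_twh:.1f} TWh")
# ~1 GW is the scale of a single large nuclear reactor, so even these
# conservative inputs already imply dedicated power-plant-scale capacity,
# before reasoning workloads multiply the tokens generated per query.
```

And these inputs assume short, single-shot responses. Multi-step reasoning workflows multiply the tokens per interaction, and the power draw with them.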

The energy cost of inference at scale is not a footnote in AI's development story. It is a constraint. And like most constraints, it will not be solved by doing the same thing harder. It will be solved by doing something different.

In a companion piece, we look at what that different thing might be: the emerging class of specialised silicon that abandons the GPU's generality entirely, and what the AI landscape looks like when the model becomes the chip.

← Back to Writing