The energy cost of inference at scale

Datacenter operators report efficiency gains. The aggregate numbers tell a different story.

When people talk about AI's energy problem, they almost always talk about training. The numbers are impressive in the worst sense. Training GPT-4 consumed an estimated 50 gigawatt-hours of electricity. Enough to power San Francisco for three days. The figure gets cited so often it has become a kind of shorthand for AI excess. But training is a one-off event. You do it, you have a model, you move on.

Inference is the other thing. Inference is what happens every time someone asks ChatGPT a question, every time a recommendation engine ranks a feed, every time an AI coding assistant suggests a completion. It happens billions of times a day, and it never stops. The International Energy Agency estimates that 80 to 90 percent of AI's total compute is now consumed by inference, not training. The expensive part is not building the model. It is running it.

The numbers nobody puts on the invoice

Global datacenters consumed roughly 415 terawatt-hours of electricity in 2024. That is about 1.5% of worldwide electricity generation. The IEA projects this will more than double to 945 TWh by 2030, with AI as the primary driver. In the United States alone, the Lawrence Berkeley National Laboratory forecasts datacenter demand growing from 176 TWh in 2023 to somewhere between 325 and 580 TWh by 2028. The range is wide because nobody is quite sure how fast this scales. That uncertainty should concern people more than it seems to.

Ireland offers a preview. Around 21% of the country's electricity now goes to datacenters. The IEA estimates that share could reach 32% by 2026. In the US state of Virginia, these facilities already consume 26% of the state's electricity. These are not projections from alarmist think-tanks. These are operational numbers from grid operators watching the load curves in real time.

The cost is not abstract. In the PJM electricity market (stretching from Illinois to North Carolina), datacenters accounted for an estimated $9.3 billion price increase in the 2025-26 capacity market. The average residential bill in western Maryland is expected to rise by $18 a month as a direct consequence. Carnegie Mellon estimates that datacenters and cryptocurrency mining could push average US electricity bills up 8% by 2030, with increases exceeding 25% in the highest-demand markets such as northern Virginia. The people paying that premium are not the people asking ChatGPT to summarise their emails.

Efficiency gains that do not add up

The industry response to these numbers is reliably optimistic. Google reports a 33x efficiency improvement for Gemini between May 2024 and May 2025. Quantisation techniques can reduce memory requirements by 75% and cut energy consumption by 60-80%. Mixture of Experts architectures activate only 5-10% of parameters per token. DeepSeek-V3 activates 37 billion parameters out of 671 billion total. The engineering is genuinely impressive.
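The sparsity and quantisation figures above are easy to sanity-check. A quick sketch (the DeepSeek-V3 parameter counts are from the text; the 16-bit-to-4-bit comparison is one common way to arrive at a 75% memory reduction, offered here as an illustration):

```python
# Check the Mixture of Experts sparsity figure cited above.
total_params = 671e9     # DeepSeek-V3 total parameters
active_params = 37e9     # parameters activated per token

active_fraction = active_params / total_params
print(f"{active_fraction:.1%} of parameters active per token")
# 37/671 ≈ 5.5%, squarely inside the 5-10% range typical of MoE models.

# Quantisation arithmetic: moving weights from 16-bit to 4-bit storage
# cuts weight memory by 75%, one way to reach the figure cited above.
fp16_bytes, int4_bytes = 2.0, 0.5
reduction = 1 - int4_bytes / fp16_bytes
print(f"{reduction:.0%} weight-memory reduction from 16-bit to 4-bit")
```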

But efficiency gains tell you how much energy each query costs. They do not tell you how much energy all the queries cost. And the demand curve is not politely waiting for the efficiency curve to catch up. Worldwide, AI datacenter power consumption is expected to reach 90 TWh by 2026. That is roughly a tenfold increase from 2022 levels. The efficiency per query improves. The total consumption climbs anyway. This pattern has a name in economics: the Jevons paradox. Make something cheaper to run and people run more of it.
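The divergence between the per-query curve and the total curve is just multiplication. A toy model makes it concrete (all numbers below are illustrative assumptions, not operator data: a 40% annual efficiency gain losing to a doubling of query volume):

```python
# Toy Jevons-paradox arithmetic: energy per query falls every year,
# yet total energy rises, because demand grows faster than efficiency.
# All figures are illustrative assumptions, not measurements.

energy_per_query_wh = 3.0        # assumed Wh per query in year 0
queries_per_day = 1e9            # assumed daily query volume in year 0

efficiency_gain_per_year = 0.40  # each query gets 40% cheaper per year
demand_growth_per_year = 2.0     # query volume doubles per year

for year in range(4):
    total_twh = energy_per_query_wh * queries_per_day * 365 / 1e12
    print(f"year {year}: {energy_per_query_wh:.2f} Wh/query, "
          f"{queries_per_day:.1e} queries/day, {total_twh:.2f} TWh/year")
    energy_per_query_wh *= (1 - efficiency_gain_per_year)
    queries_per_day *= demand_growth_per_year
```

Per-query energy falls by almost 80% over the run; annual consumption still climbs every year. That is the shape of the curves the industry is actually on.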

The GPU problem

The fundamental issue is architectural. GPUs were designed for graphics rendering and repurposed for AI. They are general-purpose machines doing specialised work, which means they carry overhead. An Nvidia H100 draws 350 watts in its PCIe form and up to 700 watts as an SXM module. It is optimised for flexibility: you can run any model on it, switch between training and inference, deploy it for image generation or protein folding or language processing. That generality comes at a thermodynamic cost.

In a previous piece I wrote about the interconnect bottleneck: how GPT-4's training cluster achieved only 32-36% utilisation of available compute due to communication overhead between chips. Inference has its own version of this problem. The dominant bottleneck is memory bandwidth. Every time a model generates a token, it needs to read the model's weights from memory. For a large language model, that means moving gigabytes of data through a memory bus that was not designed for this access pattern. The compute units sit idle, waiting for data. The watts keep flowing.
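The bandwidth ceiling can be put in numbers. A back-of-envelope sketch (assuming a 70-billion-parameter model in 8-bit weights and roughly 3 TB/s of HBM bandwidth, an H100-class figure; real serving stacks batch many requests to amortise the weight reads):

```python
# Back-of-envelope: the memory-bandwidth ceiling on single-stream decoding.
# Illustrative assumptions: 70B parameters at 1 byte each (8-bit weights),
# ~3 TB/s of HBM bandwidth on an H100-class accelerator.

params = 70e9            # model parameters
bytes_per_param = 1      # 8-bit quantised weights
hbm_bandwidth = 3e12     # bytes per second

weight_bytes = params * bytes_per_param
# Each decoded token must stream the full weight set through memory,
# so bandwidth, not arithmetic throughput, caps an unbatched request.
max_tokens_per_sec = hbm_bandwidth / weight_bytes
print(f"upper bound: {max_tokens_per_sec:.1f} tokens/s per stream")
# Batching N requests reads the weights once per step for all N outputs,
# which is why serving systems fight so hard to keep batches full.
```

Under these assumptions the hard ceiling is about 43 tokens per second for a single stream, no matter how many teraflops the chip advertises. That is what "compute units sit idle, waiting for data" looks like in arithmetic.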

This is not a problem you can engineer around within the existing paradigm. You can compress models. You can batch requests more cleverly. You can push the memory closer to the compute. But you are still fighting the basic physics of a general-purpose chip running a specific workload. The transistors that make the GPU flexible are the same transistors that make it inefficient for any single task.

The arithmetic of scale

Consider what happens as AI inference scales to the level the industry is projecting. The major labs want AI assistants handling billions of queries per day. They want AI embedded in every search result, every email draft, every customer service interaction. They want models running continuously, reasoning through multi-step tasks, generating not single responses but entire workflows.

At current efficiency levels, this trajectory requires building power generation capacity equivalent to several small countries. Not datacenters. Power plants. The infrastructure commitments being made by Microsoft, Google, Amazon, and Meta collectively amount to tens of billions of dollars in energy procurement, including reviving nuclear plants, signing long-term deals with renewable developers, and in some cases simply building gas-fired generation.
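The "several small countries" claim follows from simple unit conversion. A sketch (the query volume, per-query energy, and PUE below are all illustrative assumptions chosen to be conservative):

```python
# Rough conversion from projected query volume to generation capacity.
# Illustrative assumptions: 10 billion queries/day at 3 Wh each,
# with a facility PUE of 1.2 (cooling and overhead on top of IT load).

queries_per_day = 10e9
wh_per_query = 3.0
pue = 1.2

energy_per_day_wh = queries_per_day * wh_per_query * pue
avg_power_gw = energy_per_day_wh / 24 / 1e9   # Wh/day -> avg watts -> GW
annual_twh = energy_per_day_wh * 365 / 1e12

print(f"average draw: {avg_power_gw:.1f} GW")
print(f"annual energy: {annual_twh:.1f} TWh")
# ~1 GW is the scale of a single large nuclear reactor, so even these
# conservative inputs already imply dedicated power-plant-scale capacity,
# before reasoning workloads multiply the tokens generated per query.
```

And these inputs assume short, single-shot responses. Multi-step reasoning workflows multiply the tokens per interaction, and the power draw with them.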

The energy cost of inference at scale is not a footnote in AI's development story. It is a constraint. And like most constraints, it will not be solved by doing the same thing harder. It will be solved by doing something different.

In a companion piece, we look at what that different thing might be: the emerging class of specialised silicon that abandons the GPU's generality entirely, and what the AI landscape looks like when the model becomes the chip.

← Back to Writing