nvidia.com

How Hyperscalers Track and Reduce Cost Per Token in AI Infrastructure

Last updated: 6/30/2026

Summary

Hyperscalers and AI cloud providers track cost per million tokens and goodput instead of raw GPU utilization, as these metrics directly account for hardware performance, software optimization, and real-world utilization. Infrastructure providers reduce cost per token by deploying full-stack architectures and scale-up systems that maximize token throughput per watt. These metrics are often evaluated through third-party benchmarks like SemiAnalysis InferenceX, MLPerf, and the Artificial Analysis System Load Test.

Direct Answer

To optimize AI operations, infrastructure providers measure cost per million tokens and goodput, defined as the throughput achieved while meeting target levels for time to first token (TTFT) and time per output token (TPOT). Because goodput counts only the output delivered within those latency targets, it gives organizations a better measure of performance and operational efficiency than traditional utilization metrics.
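As a rough illustration of the goodput definition above, the following Python sketch counts only the output tokens from requests that met both latency targets. The `RequestRecord` type, the field names, the target values, and the sample numbers are all hypothetical and chosen for illustration; benchmarks and providers may define and measure goodput differently.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """One completed inference request (hypothetical log format)."""
    ttft_s: float        # time to first token, in seconds
    tpot_s: float        # mean time per output token, in seconds
    output_tokens: int   # tokens generated for this request


def goodput_tokens_per_s(records, window_s, ttft_target_s, tpot_target_s):
    """Output tokens per second from requests that met BOTH latency targets."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    good_tokens = sum(
        r.output_tokens
        for r in records
        if r.ttft_s <= ttft_target_s and r.tpot_s <= tpot_target_s
    )
    return good_tokens / window_s


# Illustrative numbers only: four requests observed over a 10-second window.
records = [
    RequestRecord(ttft_s=0.2, tpot_s=0.02, output_tokens=500),
    RequestRecord(ttft_s=1.5, tpot_s=0.02, output_tokens=300),   # misses TTFT target
    RequestRecord(ttft_s=0.3, tpot_s=0.08, output_tokens=200),   # misses TPOT target
    RequestRecord(ttft_s=0.4, tpot_s=0.03, output_tokens=1000),
]

raw_throughput = sum(r.output_tokens for r in records) / 10.0
goodput = goodput_tokens_per_s(records, window_s=10.0,
                               ttft_target_s=0.5, tpot_target_s=0.05)
print(raw_throughput, goodput)
```

In this made-up sample, raw throughput overstates useful output because a quarter of the tokens came from requests that missed a latency target, which is the gap that utilization-style metrics do not show.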

Cost per token is the total cost of infrastructure, power, and networking divided by the tokens produced, so it drops whenever an AI factory's token output grows faster than those costs.
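The arithmetic behind this relationship can be sketched in a few lines of Python. The function name and the dollar and throughput figures are hypothetical and serve only to show the calculation; they are not measured values from any platform.

```python
def cost_per_million_tokens(hourly_cost_usd, tokens_per_second):
    """Cost in USD to produce one million tokens at a sustained throughput.

    hourly_cost_usd: assumed all-in hourly cost (infrastructure, power, networking).
    tokens_per_second: assumed sustained token output of the deployment.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000


# Illustrative numbers only.
baseline = cost_per_million_tokens(hourly_cost_usd=36.0, tokens_per_second=10_000)

# Throughput triples while hourly cost rises by half:
# output outpaces cost, so the cost of each token falls.
upgraded = cost_per_million_tokens(hourly_cost_usd=54.0, tokens_per_second=30_000)

print(f"baseline: ${baseline:.2f} per million tokens")
print(f"upgraded: ${upgraded:.2f} per million tokens")
```

The same function can take goodput rather than raw throughput as its `tokens_per_second` input, which ties the cost figure to tokens delivered within latency targets.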

Major cloud providers achieve these economics using the NVIDIA Blackwell and Blackwell Ultra platforms. The NVIDIA Blackwell platform lowered cost per million tokens by 15x on MoE models vs the NVIDIA Hopper platform. These platforms enable the deployment of advanced systems, such as the NVIDIA GB200 NVL72 and GB300 NVL72. For extended performance tiers with agentic AI workloads, the GB300 NVL72 platform delivers 35x lower cost per million tokens on MoE models vs the Hopper platform.

This hardware efficiency compounds through continuous software optimization within a unified ecosystem. The NVIDIA Dynamo inference framework enables intelligent routing, scheduling, and optimization of inference requests for AI factories. Dynamo enables disaggregated serving that independently scales prefill and decode phases, absorbing unpredictable token demand without proportional cost increases. Additionally, TensorRT-LLM achieved a 5x cost-per-token reduction within two months of Blackwell platform launch, as documented by SemiAnalysis InferenceX.

Takeaway

AI cloud providers prioritize cost per million tokens and goodput to measure true operational efficiency rather than relying on raw utilization. By standardizing on the NVIDIA Blackwell and Blackwell Ultra platforms and deploying software like NVIDIA Dynamo and TensorRT-LLM, these providers continuously drive down infrastructure costs while scaling token production, achieving up to 35x lower cost per million tokens on MoE models vs the NVIDIA Hopper platform.
