AI Infrastructure Study Series
Day 8: Inference Economics
Understanding what changes after training: why inference may become the larger long-term market, and what hardware and software tradeoffs matter most.
Summary
Training builds the model once. Inference runs the model every second of every day. While training dominates today's headlines and GPU budgets, inference is where AI actually meets users, and it generates cost continuously, not once. As AI models are embedded into more products and services, inference compute demand grows cumulatively and may eventually exceed training demand. Today we study the economics of inference: what drives cost per token, why the hardware competitive landscape is more diverse than in training, and why software optimization has outsized economic leverage in this layer.
1) Why This Matters
Every time someone asks ChatGPT a question, generates an image, uses an AI coding assistant, or runs a search query enhanced by AI, that is inference. Training happens once per model generation. Inference happens billions of times per day, every day, for as long as the model is in service. The economics of inference determine whether AI services can scale profitably.
For investors, inference economics is where infrastructure spending translates into revenue. The cost per token determines API pricing, which determines margins, which determines whether the AI business model actually works at scale. Understanding this layer reveals which hardware and software companies have the most leverage over AI's unit economics.
2) One-Sentence Definitions
| Term | Simple Definition | Why It Matters |
| --- | --- | --- |
| Inference | Using a trained model to generate outputs from user inputs: the "serving" phase of AI. | Runs 24/7 for the life of the model; the cost never stops |
| Latency | Time from user request to first response. Critical for real-time applications like chat and search. | Users expect sub-second responses; latency drives UX |
| Throughput | Number of tokens or requests processed per second. Higher throughput = lower cost per token. | Determines how efficiently hardware is used |
| Batching | Grouping multiple user requests together for simultaneous GPU processing. | Increases GPU utilization but can increase per-request latency |
| KV Cache | Memory that stores previous token computations during generation. Grows with context length. | Consumes massive amounts of HBM; limits concurrent requests per GPU |
| Cost per Token | The fundamental unit of inference economics: how much it costs to process one token of input or output. | Determines API pricing, margins, and commercial viability |
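Before moving on, it helps to make cost per token concrete. Here is a minimal back-of-the-envelope sketch; both inputs are hypothetical round numbers I chose for illustration, not vendor pricing or benchmarked throughput:

```python
# Back-of-the-envelope cost per token. Both inputs are illustrative
# assumptions, not real pricing or measured throughput.
GPU_COST_PER_HOUR = 2.00    # assumed all-in $/hour for one GPU
TOKENS_PER_SECOND = 1_000   # assumed sustained throughput across batched requests

cost_per_token = GPU_COST_PER_HOUR / (TOKENS_PER_SECOND * 3600)
print(f"Cost per token: ${cost_per_token:.8f}")
print(f"Cost per million tokens: ${cost_per_token * 1e6:.2f}")
# -> roughly $0.56 per million tokens under these assumptions; doubling
#    throughput at the same hourly cost halves the cost per token.
```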
3) A Simple Analogy
Think of training vs inference like developing a recipe vs running a restaurant.
- Training = developing the perfect recipe: expensive, but you do it once
- Inference = cooking meals for customers every day using that recipe: the cost runs continuously
- Latency = how fast the food reaches the table after ordering
- Batching = cooking 10 orders at once: a more efficient kitchen, but the first customer waits a bit longer
- KV Cache = remembering each diner's previous courses to inform the next: more courses, more notes to keep
- Cost per Token = the cost of each individual dish served: this determines whether the restaurant is profitable
4) Training vs Inference: Two Different Economies
| Dimension | Training | Inference |
| --- | --- | --- |
| Cost type | One-time fixed cost | Continuous variable cost |
| Frequency | Once per model generation (months apart) | Billions of times per day, 24/7 |
| Who does it | ~5 frontier labs | Every company and user running AI |
| Key metric | Total FLOPS, MFU | Cost per token, latency, throughput |
| Hardware need | Maximum FLOPS + bandwidth, tightly coupled | Efficiency per watt, cost per token, memory capacity |
| NVIDIA dominance | Near-monopoly | Strong but more contested; ASICs compete |
The fundamental asymmetry: training is done by a handful of companies, a few times per year. Inference is done by everyone, every second. Over time, the cumulative compute demand from inference is expected to surpass training, making inference the larger long-term market for AI hardware.
5) What Drives Inference Cost
Model Size and Memory
Larger models need more GPU memory and more computation per token. A frontier-scale model often cannot fit on a single GPU, requiring multi-GPU serving with the associated communication overhead. The KV cache for long-context models (128K+ tokens) can consume tens of gigabytes of HBM per request, directly limiting how many users one GPU can serve simultaneously.
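To see where those tens of gigabytes come from, here is a minimal KV cache sizing sketch. The configuration is a hypothetical 70B-class model with grouped-query attention; exact numbers vary by architecture:

```python
# KV cache sizing for a hypothetical 70B-class model with grouped-query
# attention. All configuration values below are illustrative assumptions.
num_layers    = 80        # assumed transformer layers
num_kv_heads  = 8         # assumed KV heads (grouped-query attention)
head_dim      = 128       # assumed dimension per head
bytes_per_val = 2         # FP16/BF16 cache entries
context_len   = 128_000   # one long-context request

# Both keys and values are cached at every layer: hence the factor of 2.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_val
cache_gb = bytes_per_token * context_len / 1e9

print(f"KV cache per token: {bytes_per_token / 1024:.0f} KiB")
print(f"KV cache for one 128K-token request: {cache_gb:.1f} GB")
# -> ~320 KiB per token, ~42 GB per request: long contexts sharply
#    limit how many users a single GPU can serve at once.
```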
The Latency-Throughput Tradeoff
Batching multiple requests together increases GPU utilization and throughput, lowering cost per token. But larger batches increase the wait time for individual users. Every inference system must balance this tradeoff based on the application: real-time chat demands low latency; batch analytics can tolerate higher latency for better throughput.
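A toy model makes the tradeoff visible. The step-time formula below (a fixed cost to stream the weights once per batch, plus a small per-request increment) is an illustrative assumption about memory-bound decoding, not a measurement of any real system:

```python
# Toy latency/throughput model for batched decoding. The step-time
# constants below are illustrative assumptions, not measurements.
FIXED_MS_PER_STEP = 20.0   # assumed cost to stream weights, paid once per batch
MS_PER_REQUEST    = 0.5    # assumed incremental cost per request in the batch

for batch in (1, 8, 32, 128):
    step_ms = FIXED_MS_PER_STEP + MS_PER_REQUEST * batch
    tokens_per_sec = batch * 1000 / step_ms   # one token per request per step
    print(f"batch={batch:4d}  step={step_ms:6.1f} ms  "
          f"throughput={tokens_per_sec:7.0f} tok/s")
# Throughput climbs steeply with batch size while each user's per-token
# latency (the step time) grows only modestly, until the batch saturates
# compute or the KV cache exhausts HBM.
```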
Software Optimization: The Biggest Lever
In inference, software optimization has outsized economic impact. Quantization (reducing numerical precision from FP16 to INT8 or INT4) can cut memory and compute requirements by 2–4× with minimal quality loss. Speculative decoding uses a small model to draft tokens that a large model then verifies in parallel. Continuous batching dynamically groups incoming requests to maximize GPU utilization. FlashAttention reduces memory overhead for attention computation. These techniques can reduce inference cost by 2–10× on the same hardware, making software optimization economically equivalent to a hardware generation leap.
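As one concrete example, here is a minimal sketch of symmetric per-tensor INT8 quantization in NumPy. Production stacks use more sophisticated schemes (per-channel scales, INT4 group quantization, activation-aware methods), but the memory arithmetic is the same:

```python
import numpy as np

# Minimal symmetric per-tensor INT8 quantization sketch, applied to a
# mock weight matrix. Real serving stacks use finer-grained schemes.
rng = np.random.default_rng(0)
w = rng.standard_normal((4096, 4096)).astype(np.float16)  # mock FP16 weights

scale = np.abs(w).max() / 127.0              # one scale for the whole tensor
w_int8 = np.round(w / scale).astype(np.int8)
w_dequant = w_int8.astype(np.float16) * scale

err = np.abs(w.astype(np.float32) - w_dequant.astype(np.float32)).mean()
print(f"FP16 size: {w.nbytes / 1e6:.1f} MB, INT8 size: {w_int8.nbytes / 1e6:.1f} MB")
print(f"Mean absolute quantization error: {err:.4f}")
# Memory halves going FP16 -> INT8 (and quarters at INT4). Moving fewer
# bytes from HBM per token is what cuts cost, since decoding is
# typically memory-bandwidth-bound.
```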
What Beginners Often Get Wrong
People assume inference is just a smaller version of training: same hardware, same approach, lower intensity. In reality, inference has completely different optimization priorities. Training maximizes total FLOPS. Inference optimizes cost per token, latency, and energy efficiency. This is why dedicated inference hardware (ASICs, custom chips) can compete with NVIDIA GPUs in inference even though they cannot compete in training.
6) The Inference Hardware Landscape
Unlike training, where NVIDIA holds a near-monopoly, inference has a more diverse competitive hardware landscape. Different optimization axes (cost per token, latency, power efficiency) create openings for alternative architectures.
| Hardware | Company | Inference Strength | Limitation |
| --- | --- | --- | --- |
| H100 / B200 GPU | NVIDIA | Versatile, mature CUDA ecosystem, TensorRT optimization | Expensive; high cost per token at scale |
| TPU | Google | Optimized for Google's models (Gemini), integrated with GCP | Mostly captive to the Google ecosystem |
| Inferentia / Trainium | AWS (Amazon) | Cost-efficient inference on AWS, designed for high throughput | AWS-only; software ecosystem still maturing |
| Groq LPU | Groq | Extremely low-latency inference | Small scale, limited model support |
| MI300X GPU | AMD | Large HBM capacity, good for serving large models | ROCm software ecosystem weaker than CUDA |
| Apple Silicon / Qualcomm | Apple, Qualcomm | On-device inference: privacy, no network latency, no cloud cost | Limited to smaller models, constrained memory |
The key structural difference: training requires the absolute best hardware tightly coupled together (NVIDIA's moat). Inference values efficiency, cost, and flexibility, creating space for specialized alternatives. This is why NVIDIA's dominance in inference is strong but not as unassailable as in training.
7) Why Inference Will Likely Become the Bigger Market
Cumulative Demand
Training a frontier model uses a massive cluster for a few months. Serving that model to millions of users requires GPUs running 24/7 indefinitely. Over the model's lifetime, total inference compute is expected to far exceed the compute used for training.
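A rough sanity check uses the standard scaling heuristics: training costs about 6·N·D FLOPs for N parameters and D training tokens, and generating one token costs about 2·N FLOPs. The traffic figures below are hypothetical:

```python
# Lifetime compute comparison under the common heuristics:
# training ~ 6*N*D FLOPs, inference ~ 2*N FLOPs per generated token.
# Serving volume and deployment length are illustrative assumptions.
N = 70e9                    # assumed model parameters
D = 15e12                   # assumed training tokens
train_flops = 6 * N * D

tokens_per_day  = 100e9     # assumed serving volume: 100B tokens/day
days_in_service = 2 * 365   # assumed two-year deployment
infer_flops = 2 * N * tokens_per_day * days_in_service

print(f"Training:  {train_flops:.2e} FLOPs (paid once)")
print(f"Inference: {infer_flops:.2e} FLOPs (cumulative)")
print(f"Inference/training ratio: {infer_flops / train_flops:.1f}x")
# -> even at these modest traffic assumptions, lifetime inference compute
#    passes the one-time training run, and real serving volume grows.
```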
Expanding Use Cases
AI is being embedded into search, productivity tools, coding, customer service, healthcare, finance, and more. Each new integration creates a new stream of inference demand. The total number of AI-powered applications is growing faster than the number of new model training runs.
Cost Reduction Drives Demand
As inference cost per token falls (through hardware improvements, software optimization, and smaller, more efficient models), new use cases become economically viable. This is Jevons Paradox again: cheaper inference does not reduce demand; it unlocks more of it.
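The arithmetic behind that claim is simple elasticity: total spend is cost per token times tokens demanded, so spending rises whenever demand grows faster than cost falls. The multipliers below are purely illustrative:

```python
# Toy Jevons arithmetic: spend = (cost per token) x (tokens demanded).
# Both multipliers are illustrative assumptions, not forecasts.
cost_decline  = 10   # cost per token falls 10x
demand_growth = 25   # token demand grows 25x over the same period

spend_multiplier = demand_growth / cost_decline
print(f"Total inference spend changes by {spend_multiplier:.1f}x")
# -> 2.5x more total spending despite 10x cheaper tokens; total spend
#    shrinks only if demand grows slower than cost declines.
```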
Edge Inference
Not all inference happens in data centers. On-device AI on smartphones, laptops, and embedded systems creates a parallel demand stream. Apple Intelligence, Qualcomm's AI Engine, and MediaTek's NPUs are expanding inference to billions of edge devices, a market that is additive to data center inference.
8) Why Investors Should Care
Inference economics determines whether AI can scale as a business. If the cost per token is too high, AI services cannot be offered affordably to mass markets. If cost per token falls fast enough, AI becomes embedded in everything, and inference hardware demand grows for decades.
The Core Framework
Training Builds the Model. Inference Builds the Business.
Training is a cost center: a one-time investment to create a model. Inference is the revenue engine: the ongoing compute that serves users and generates income. The long-term economics of AI are determined not by how much it costs to train a model, but by how cheaply and efficiently that model can serve millions of users. Investors should track: cost per token trends, inference hardware diversification, software optimization adoption, and the ratio of inference to training compute across major cloud providers.
9) Connecting to the Stack
Day 1 + Day 2 → Day 8
GPU compute power (Day 1) determines inference throughput. HBM capacity (Day 2) limits KV cache size and concurrent users per GPU. Memory bandwidth determines whether inference is compute-bound or memory-bound.
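A quick roofline-style calculation shows why. During single-stream decoding, every generated token must stream the full set of weights from HBM, so bandwidth caps tokens per second. The figures below are round illustrative assumptions:

```python
# Roofline-style sketch: single-stream decode speed is capped by how fast
# the weights can be read from HBM. Figures are illustrative assumptions.
hbm_bandwidth_gbs = 3_350   # assumed HBM bandwidth (roughly H100-class), GB/s
model_bytes       = 70e9    # assumed 70B parameters at 1 byte each (INT8)

max_tokens_per_sec = hbm_bandwidth_gbs * 1e9 / model_bytes
print(f"Bandwidth-bound ceiling: ~{max_tokens_per_sec:.0f} tokens/s per stream")
# -> ~48 tokens/s for one user. Batching amortizes each weight read
#    across many requests, which is how serving escapes this limit.
```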
Day 3 → Day 8
Networking (Day 3) matters for distributed inference: when models are too large for one GPU, inter-GPU communication overhead directly impacts latency and throughput.
Day 5 + Day 6 → Day 8
Server design (Day 5) and power/cooling (Day 6) determine the operational cost of running inference 24/7. Energy efficiency per token is a critical metric because inference runs continuously, unlike training's finite duration.
Day 7 → Day 8
Training (Day 7) produces the model. Inference (Day 8) serves it. Together they represent the full lifecycle cost of AI, and the balance between them is shifting toward inference as AI adoption grows.
Day 8 → Day 9
Day 9 will map all the bottlenecks across the entire AI stack, from compute to memory to networking to packaging to power to economics, into one unified view. Inference cost is one of the most important bottlenecks to resolve for AI to scale commercially.
10) What I Learned Today
- Inference is a continuous variable cost, unlike training's one-time fixed cost, and it scales with user demand. Over time, total inference compute is expected to exceed training compute, making inference the larger long-term hardware market.
- Inference cost is driven by model size, KV cache memory, batch efficiency, and generation length. Software optimization (quantization, speculative decoding, continuous batching) can reduce costs 2–10× on the same hardware, giving software outsized economic leverage.
- The inference hardware market is more competitive than training. Google TPU, AWS Inferentia, Groq LPU, and AMD MI300X all compete on cost per token and efficiency, creating a more diverse landscape where NVIDIA's dominance is strong but not monopolistic.
11) One Question I'm Still Thinking About
As inference cost per token continues to fall, will AI become so cheap to run that inference demand grows faster than cost declines, and if so, does that mean total inference hardware spending accelerates even as unit economics improve?
12) What Comes Next
In Day 9, I'll build the AI Stack Bottlenecks Map: a unified view of every major bottleneck across compute, memory, networking, packaging, power, training, and inference. After eight days of studying individual layers, Day 9 synthesizes them into one picture that shows where the constraints compound and where the biggest opportunities lie.
Continue the AI Infrastructure Study Series
This series is designed to make the AI stack easier to follow, one layer at a time, from compute and memory to networking, packaging, and system economics.
Next: Day 9 – AI Stack Bottlenecks Map