AI Infrastructure Study Series

Day 8: Inference Economics

Understanding what changes after training — why inference may become the larger long-term market, and what hardware and software tradeoffs matter most.

Summary

Training builds the model once. Inference runs the model every second of every day. While training dominates today's headlines and GPU budgets, inference is where AI actually meets users — and it generates cost continuously, not once. As AI models are embedded into more products and services, inference compute demand grows cumulatively and may eventually exceed training demand. Today we study the economics of inference: what drives cost per token, why the hardware competitive landscape is more diverse than training, and why software optimization has outsized economic leverage in this layer.

1) Why This Matters

Every time someone asks ChatGPT a question, generates an image, uses an AI coding assistant, or runs a search query enhanced by AI — that is inference. Training happens once per model generation. Inference happens billions of times per day, every day, for as long as the model is in service. The economics of inference determine whether AI services can scale profitably.

For investors, inference economics is where infrastructure spending translates into revenue. The cost per token determines API pricing, which determines margins, which determines whether the AI business model actually works at scale. Understanding this layer reveals which hardware and software companies have the most leverage over AI's unit economics.

2) One-Sentence Definitions

| Term | Simple Definition | Why It Matters |
| --- | --- | --- |
| Inference | Using a trained model to generate outputs from user inputs — the "serving" phase of AI. | Runs 24/7 for the life of the model — cost never stops |
| Latency | Time from user request to first response. Critical for real-time applications like chat and search. | Users expect sub-second response — latency drives UX |
| Throughput | Number of tokens or requests processed per second. Higher throughput = lower cost per token. | Determines how efficiently hardware is used |
| Batching | Grouping multiple user requests together for simultaneous GPU processing. | Increases GPU utilization but can increase per-request latency |
| KV Cache | Memory that stores previous token computations during generation. Grows with context length. | Consumes massive HBM — limits concurrent requests per GPU |
| Cost per Token | The fundamental unit of inference economics — how much it costs to process one token of input or output. | Determines API pricing, margins, and commercial viability |
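The last definition is the one to internalize. As a rough sketch, cost per token falls directly out of two numbers: what the GPU costs per hour and how many tokens it can serve per second. The $2.50/hour rate and 1,000 tokens/s figure below are illustrative assumptions, not measurements from any provider.

```python
# Back-of-envelope cost per token. The GPU price and throughput are
# illustrative assumptions, not figures from any specific provider.

def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    """Cost to generate one million tokens on a single GPU at steady state."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000

# Assumed: a $2.50/hour GPU sustaining 1,000 tokens/s across all batched requests.
cost = cost_per_million_tokens(2.50, 1000)
print(f"${cost:.2f} per million tokens")  # roughly $0.69
```

Every lever discussed below (batching, quantization, better kernels) works by raising the denominator of that division.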

3) A Simple Analogy

Think of training vs inference like developing a recipe vs running a restaurant.

Training = developing the perfect recipe — expensive, but you do it once

Inference = cooking meals for customers every day using that recipe — cost runs continuously

Latency = how fast the food reaches the table after ordering

Batching = cooking 10 orders at once — more efficient kitchen, but the first customer waits a bit longer

KV Cache = remembering each diner's previous courses to inform the next — more courses, more notes to keep

Cost per Token = the cost of each individual dish served — this determines whether the restaurant is profitable

4) Training vs Inference: Two Different Economies

| Dimension | Training | Inference |
| --- | --- | --- |
| Cost type | One-time fixed cost | Continuous variable cost |
| Frequency | Once per model generation (months apart) | Billions of times per day, 24/7 |
| Who does it | ~5 frontier labs | Every company and user running AI |
| Key metric | Total FLOPS, MFU | Cost per token, latency, throughput |
| Hardware need | Maximum FLOPS + bandwidth, tightly coupled | Efficiency per watt, cost per token, memory capacity |
| NVIDIA dominance | Near-monopoly | Strong but more contested — ASICs compete |

The fundamental asymmetry: training is done by a handful of companies, a few times per year. Inference is done by everyone, every second. Over time, the cumulative compute demand from inference is expected to surpass training — making inference the larger long-term market for AI hardware.

5) What Drives Inference Cost

Model Size and Memory

Larger models need more GPU memory and more computation per token. A frontier-scale model often cannot fit on a single GPU, requiring multi-GPU serving with the associated communication overhead. The KV cache for long-context models (128K+ tokens) can consume tens of gigabytes of HBM per request, directly limiting how many users one GPU can serve simultaneously.
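The "tens of gigabytes" claim is easy to verify with the standard KV cache sizing formula. The model shape below (layers, KV heads, head dimension) is hypothetical, chosen to resemble a large open-weight model with grouped-query attention; treat it as an assumption for illustration.

```python
# KV cache size estimate for a single long-context request.
# Model shape parameters are hypothetical assumptions.

def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache for one request: 2x for keys and values,
    stored per layer, per KV head, per token position."""
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_value

# Assumed shape: 80 layers, 8 KV heads, head_dim 128, 128K context, FP16 values.
size = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, context_len=131_072)
print(f"{size / 2**30:.1f} GiB of HBM per request")  # 40.0 GiB
```

On a GPU with 80 GB of HBM, a single request at this scale would consume half the card's memory before counting the model weights, which is exactly why KV cache limits concurrent users.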

The Latency-Throughput Tradeoff

Batching multiple requests together increases GPU utilization and throughput — lowering cost per token. But larger batches increase the wait time for individual users. Every inference system must balance this tradeoff based on the application: real-time chat demands low latency; batch analytics can tolerate higher latency for better throughput.
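A toy model makes the tradeoff visible: each decoding step has a fixed overhead (dominated by loading weights from memory) plus a small incremental cost per sequence in the batch. The timing constants below are invented for illustration; real numbers depend on model, hardware, and serving stack.

```python
# Toy model of the latency-throughput tradeoff in batched decoding.
# Timing constants are invented assumptions, not measured values.

FIXED_MS = 20.0    # per-step overhead: weight loads dominate at small batches
PER_SEQ_MS = 0.5   # incremental cost of each extra sequence in the batch

def step_time_ms(batch: int) -> float:
    """Time for one decoding step that advances every sequence by one token."""
    return FIXED_MS + PER_SEQ_MS * batch

for batch in (1, 8, 64):
    t = step_time_ms(batch)
    throughput = batch / (t / 1000)  # tokens/s across the whole batch
    print(f"batch={batch:3d}  latency/token={t:5.1f} ms  throughput={throughput:7.1f} tok/s")
```

In this sketch, going from batch 1 to batch 64 multiplies throughput by about 25x while each user's per-token latency only grows about 2.5x, which is why serving systems batch aggressively until latency budgets bite.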

Software Optimization: The Biggest Lever

In inference, software optimization has outsized economic impact. Quantization (reducing numerical precision from FP16 to INT8 or INT4) can cut memory and compute requirements by 2–4× with minimal quality loss. Speculative decoding uses a small model to draft tokens that a large model then verifies in parallel. Continuous batching dynamically groups incoming requests to maximize GPU utilization. FlashAttention reduces memory overhead for attention computation. These techniques can reduce inference cost by 2–10× on the same hardware — making software optimization economically equivalent to a hardware generation leap.
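To see why quantization moves the needle, compare raw weight storage at each precision for a hypothetical 70B-parameter model (the parameter count is an assumption for illustration; quality impact is not modeled here).

```python
# Weight memory footprint at different precisions, for a hypothetical
# 70B-parameter model. Quality tradeoffs are not modeled.

PARAMS = 70e9

def weights_gb(params: float, bits: int) -> float:
    """Raw weight storage in gigabytes at the given precision."""
    return params * bits / 8 / 1e9

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weights_gb(PARAMS, bits):.0f} GB of weights")
# FP16: 140 GB, INT8: 70 GB, INT4: 35 GB: halving precision halves memory
```

At FP16 this model needs multiple GPUs just to hold the weights; at INT4 it fits on one, and the freed HBM goes to KV cache and larger batches, compounding the cost savings.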

What Beginners Often Get Wrong

People assume inference is just a smaller version of training — same hardware, same approach, lower intensity. In reality, inference has completely different optimization priorities. Training maximizes total FLOPS. Inference optimizes cost per token, latency, and energy efficiency. This is why dedicated inference hardware (ASICs, custom chips) can compete with NVIDIA GPUs in inference even though they cannot compete in training.

6) The Inference Hardware Landscape

Unlike training — where NVIDIA holds a near-monopoly — inference has a more diverse hardware competitive landscape. Different optimization axes (cost/token, latency, power efficiency) create openings for alternative architectures.

| Hardware | Company | Inference Strength | Limitation |
| --- | --- | --- | --- |
| H100 / B200 GPU | NVIDIA | Versatile, mature CUDA ecosystem, TensorRT optimization | Expensive — high cost per token at scale |
| TPU | Google | Optimized for Google's models (Gemini), integrated with GCP | Mostly captive to Google ecosystem |
| Inferentia / Trainium | AWS (Amazon) | Cost-efficient inference on AWS, designed for high throughput | AWS-only, software ecosystem still maturing |
| Groq LPU | Groq | Extremely low latency inference | Small scale, limited model support |
| MI300X GPU | AMD | Large HBM capacity — good for large model serving | ROCm software ecosystem weaker than CUDA |
| Apple Silicon / Qualcomm | Apple, Qualcomm | On-device inference — privacy, zero latency, no cloud cost | Limited to smaller models, constrained memory |

The key structural difference: training requires the absolute best hardware tightly coupled together (NVIDIA's moat). Inference values efficiency, cost, and flexibility — creating space for specialized alternatives. This is why NVIDIA's dominance in inference is strong but not as unassailable as in training.

7) Why Inference Will Likely Become the Bigger Market

Cumulative Demand

Training a frontier model uses a massive cluster for a few months. Serving that model to millions of users requires GPUs running 24/7 indefinitely. Over the model's lifetime, total inference compute is expected to far exceed the compute used for training.
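The cumulative claim can be made concrete with the standard scaling approximations: training takes roughly 6*N*D FLOPs for N parameters and D training tokens, while generating one token at inference takes roughly 2*N FLOPs. The model size, training token count, and daily serving volume below are invented for illustration.

```python
# When does cumulative inference compute pass training compute?
# Uses common approximations: training ~ 6*N*D FLOPs,
# inference ~ 2*N FLOPs per generated token.
# N, D, and the daily token volume are illustrative assumptions.

N = 70e9              # model parameters (assumed)
D = 15e12             # training tokens (assumed)
DAILY_TOKENS = 100e9  # tokens served per day at large scale (assumed)

training_flops = 6 * N * D
daily_inference_flops = 2 * N * DAILY_TOKENS
crossover_days = training_flops / daily_inference_flops
print(f"Inference compute passes training compute after ~{crossover_days:.0f} days")  # ~450
```

Under these assumptions the crossover comes in roughly 15 months, and every increase in serving volume pulls it earlier; a model serving ten times the traffic crosses over in a month and a half.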

Expanding Use Cases

AI is being embedded into search, productivity tools, coding, customer service, healthcare, finance, and more. Each new integration creates a new stream of inference demand. The total number of AI-powered applications is growing faster than the number of new model training runs.

Cost Reduction Drives Demand

As inference cost per token falls — through hardware improvements, software optimization, and smaller efficient models — new use cases become economically viable. This is Jevons Paradox again: cheaper inference does not reduce demand, it unlocks more of it.

Edge Inference

Not all inference happens in data centers. On-device AI on smartphones, laptops, and embedded systems creates a parallel demand stream. Apple Intelligence, Qualcomm's AI Engine, and MediaTek's NPUs are expanding inference to billions of edge devices — a market that is additive to data center inference.

8) Why Investors Should Care

Inference economics determines whether AI can scale as a business. If the cost per token is too high, AI services cannot be offered affordably to mass markets. If cost per token falls fast enough, AI becomes embedded in everything — and inference hardware demand grows for decades.

The Core Framework

Training Builds the Model. Inference Builds the Business.

Training is a cost center — a one-time investment to create a model. Inference is the revenue engine — the ongoing compute that serves users and generates income. The long-term economics of AI are determined not by how much it costs to train a model, but by how cheaply and efficiently that model can serve millions of users. Investors should track: cost per token trends, inference hardware diversification, software optimization adoption, and the ratio of inference to training compute across major cloud providers.

9) Connecting to the Stack

Day 1 + Day 2 → Day 8

GPU compute power (Day 1) determines inference throughput. HBM capacity (Day 2) limits KV cache size and concurrent users per GPU. Memory bandwidth determines whether inference is compute-bound or memory-bound.
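The compute-bound vs memory-bound question has a simple roofline-style answer for single-stream decoding: every weight must be read from HBM for each generated token, so arithmetic intensity is tiny. The hardware numbers below approximate an H100-class GPU and are rough assumptions, not spec-sheet precision.

```python
# Roofline-style check: is batch-1 decoding compute- or memory-bound?
# Hardware figures approximate an H100-class GPU; treat as rough assumptions.

PEAK_FLOPS = 1.0e15  # ~1 PFLOP/s dense FP16 (assumed)
PEAK_BW = 3.35e12    # ~3.35 TB/s HBM bandwidth (assumed)
N = 70e9             # model parameters (illustrative)

# Decoding one token at batch size 1: ~2 FLOPs per parameter, and every
# FP16 weight (~2 bytes) must be streamed from HBM.
flops_per_token = 2 * N
bytes_per_token = 2 * N
intensity = flops_per_token / bytes_per_token  # 1 FLOP per byte
ridge = PEAK_FLOPS / PEAK_BW                   # ~300 FLOP/byte to saturate compute
print(f"arithmetic intensity {intensity:.0f} vs ridge point {ridge:.0f}: memory-bound")
```

Being two orders of magnitude below the ridge point is the quantitative reason batching works: more sequences per step reuse the same weight reads, raising intensity toward the compute-bound regime.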

Day 3 → Day 8

Networking (Day 3) matters for distributed inference — when models are too large for one GPU, inter-GPU communication overhead directly impacts latency and throughput.

Day 5 + Day 6 → Day 8

Server design (Day 5) and power/cooling (Day 6) determine the operational cost of running inference 24/7. Energy efficiency per token is a critical metric because inference runs continuously, unlike training's finite duration.

Day 7 → Day 8

Training (Day 7) produces the model. Inference (Day 8) serves it. Together they represent the full lifecycle cost of AI — and the balance between them is shifting toward inference as AI adoption grows.

Day 8 → Day 9

Day 9 will map all the bottlenecks across the entire AI stack — from compute to memory to networking to packaging to power to economics — into one unified view. Inference cost is one of the most important bottlenecks to resolve for AI to scale commercially.

10) What I Learned Today

  • Inference is continuous variable cost — unlike training's one-time fixed cost — and scales with user demand. Over time, total inference compute is expected to exceed training compute, making inference the larger long-term hardware market.
  • Inference cost is driven by model size, KV cache memory, batch efficiency, and generation length. Software optimization (quantization, speculative decoding, continuous batching) can reduce costs 2–10× on the same hardware — giving software outsized economic leverage.
  • The inference hardware market is more competitive than training. Google TPU, AWS Inferentia, Groq LPU, and AMD MI300X all compete on cost-per-token and efficiency — creating a more diverse landscape where NVIDIA's dominance is strong but not monopolistic.

11) One Question I'm Still Thinking About

As inference cost per token continues to fall, will AI become so cheap to run that inference demand grows faster than cost declines — and if so, does that mean total inference hardware spending accelerates even as unit economics improve?

12) What Comes Next

In Day 9, I'll build the AI Stack Bottlenecks Map — a unified view of every major bottleneck across compute, memory, networking, packaging, power, training, and inference. After eight days of studying individual layers, Day 9 synthesizes them into one picture that shows where the constraints compound and where the biggest opportunities lie.

Continue the AI Infrastructure Study Series

This series is designed to make the AI stack easier to follow — one layer at a time, from compute and memory to networking, packaging, and system economics.

Next: Day 9 — AI Stack Bottlenecks Map
Sources & Methodology: Market data sourced from TradingView, Finviz, FRED, and SEC EDGAR filings. All analysis and commentary represent the author's independent assessment and are intended for educational purposes only.
Written & reviewed by Luke, Independent Market Analyst

Luke — Independent Market Analyst

Luke is an independent market analyst and the founder of EverHealthAI. He covers U.S. equities, geopolitical risk, macroeconomic trends, and AI infrastructure — with a focus on helping long-term investors understand the forces shaping capital markets. All content is written and edited by a human author and is intended for educational purposes only.
