AI Infrastructure Study Series

Day 9: AI Stack Bottlenecks Map

Mapping every major bottleneck across the AI infrastructure stack — where they compound, how they cascade, and which ones last the longest.

Summary

After eight days studying each layer of the AI infrastructure stack individually, today we assemble the full picture. Every layer has its own bottleneck, and those bottlenecks do not exist in isolation — they cascade. When one is resolved, the next one becomes binding. The most important insight for investors is that bottlenecks do not disappear — they migrate. Understanding which bottleneck is binding today, which is next, and how long each takes to resolve is the key to navigating AI infrastructure investment over different time horizons.

1) Why This Matters

Most AI infrastructure analysis focuses on one layer at a time — GPUs, or memory, or power. But the real constraint on AI deployment is never a single layer. It is the interaction between layers, where one bottleneck masks another until it is resolved. Seeing the full map changes how you evaluate investments, timelines, and competitive dynamics.

For investors, the bottleneck map is a navigation tool. It answers: "Where is the constraint right now? What resolves it? And when it clears, where does the constraint move next?" The companies that sit at the longest-lasting bottlenecks have the most durable investment cases.

2) The Full Bottleneck Map

| Layer | Core Bottleneck | Resolution Timeline | Key Companies |
| --- | --- | --- | --- |
| GPU Compute (Day 1) | NVIDIA near-monopoly in training; CUDA software lock-in limits alternatives | 2–3 years | NVIDIA, AMD, Google, Amazon |
| Memory / HBM (Day 2) | HBM production concentrated in 3 suppliers; demand exceeds supply | 1–2 years | SK hynix, Samsung, Micron |
| Networking (Day 3) | NVIDIA vertical integration (NVLink + InfiniBand); communication overhead limits MFU | 2–3 years | NVIDIA/Mellanox, Broadcom, Arista |
| Foundry / Packaging (Day 4) | CoWoS packaging tighter than chip fab; ASML EUV equipment supply limited | 2–4 years | TSMC, ASML, Samsung, Intel |
| Systems / Servers (Day 5) | NVIDIA expanding into rack-scale systems; OEM value compression; liquid cooling transition | 1–2 years | NVIDIA, Supermicro, Dell, Foxconn |
| Power / Cooling (Day 6) | Grid infrastructure takes 5–10 years to build; AI rack power density 5–10× higher | 5–10 years | Vertiv, Eaton, Schneider, utilities, nuclear |
| Training Economics (Day 7) | Scaling laws drive exponential cost; MFU at 30–60%; failed runs waste resources | Ongoing | Hyperscalers, NVIDIA (full stack) |
| Inference Economics (Day 8) | Cost per token still too high for many use cases; KV cache limits concurrency | 1–3 years | NVIDIA, Google, AMD, Groq, software optimizers |

3) A Simple Analogy

Think of the AI stack as a multi-lane highway system.

Each layer = one section of the highway

A bottleneck = the narrowest section — it sets the speed for the entire route

Bottleneck cascade = widening one section just reveals the next narrow point

Resolution timeline = how long it takes to widen each section — chip lanes take 1–3 years, power lanes take 5–10 years

4) The Bottleneck Cascade: How Constraints Migrate

Bottlenecks do not disappear — they migrate. Resolving one layer's constraint reveals the next layer's constraint. Understanding this cascade is essential for anticipating where investment opportunities will move.
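
To make the cascade concrete, here is a minimal sketch: treat each layer as a capacity, let deployable AI capacity equal the minimum across layers, and watch the binding layer shift as each constraint is relieved. The layer names match the map above, but the capacity numbers are purely illustrative assumptions.

```python
# Minimal sketch of the bottleneck cascade: deployable capacity is capped by the
# weakest layer, so relieving one layer just hands the constraint to the next.
# All numbers are illustrative assumptions (arbitrary "capacity units"), not data.

layers = {
    "packaging (CoWoS)": 100,
    "power/cooling":     140,
    "HBM supply":        180,
    "GPU design":        250,
}

def binding_constraint(capacities):
    """Return (layer, capacity) for the layer that currently caps deployment."""
    return min(capacities.items(), key=lambda kv: kv[1])

for step in range(3):
    layer, cap = binding_constraint(layers)
    print(f"Step {step}: deployable capacity = {cap}, bound by {layer}")
    layers[layer] *= 2   # "resolve" the current bottleneck by doubling its capacity

# Output shows the constraint migrating: packaging -> power/cooling -> HBM supply.
```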

Phase 1 — Now: Packaging Bottleneck

GPU designs are advancing fast, but TSMC's CoWoS packaging capacity cannot keep up. The dies exist but cannot all be assembled. AI chip supply is capped by packaging, not design.

Phase 2 — 1–2 Years: Power Bottleneck Emerges

As TSMC expands CoWoS capacity, more AI chips reach the market. But data centers cannot deploy them all because electrical infrastructure takes years to build. The binding constraint shifts from packaging to power.

Phase 3 — 3–5 Years: Economics Bottleneck

As power infrastructure catches up, the constraint shifts to economics. Scaling laws push training costs toward $10B per model. The question becomes: can any organization justify that cost? Algorithmic efficiency and new architectures become the binding factor.

Phase 4 — 5–10 Years: Physical Limits

Scaling laws may hit diminishing returns. New computing paradigms (optical, quantum, neuromorphic) may emerge to bypass current physical constraints. The stack could look fundamentally different.

What Beginners Often Get Wrong

People assume that once a bottleneck is "solved," the constraint disappears and the industry can scale freely. In reality, resolving one bottleneck simply reveals the next one. There is no point at which all constraints are cleared simultaneously. AI infrastructure investment is about tracking which bottleneck is binding now and which will be binding next — not waiting for a bottleneck-free future that will never arrive.

5) The Time Map: Sorting Bottlenecks by Duration

Not all bottlenecks are equal. Their resolution timelines determine how long the associated investment opportunities last.

Short-Term (1–2 years)

HBM supply expansion — production ramp underway

Inference software optimization — fast-moving, high-impact

Server/rack design adaptation — OEMs adjusting quickly

Mid-Term (2–4 years)

CoWoS packaging capacity — TSMC expanding aggressively

ASML High-NA EUV supply — limited annual output

GPU competition — AMD, custom ASICs gaining ground

Cooling infrastructure transition — air to liquid

Long-Term (5–10 years)

Data center power infrastructure — transmission, substations, grid

Energy source diversification — nuclear SMR, renewables at scale

Scaling law economics — training cost vs value curve

Physical compute limits — transistor scaling endpoints

6) The NVIDIA Concentration Thread

One pattern that runs across the entire bottleneck map is NVIDIA's presence at almost every layer. This concentration is both a strength and a systemic risk.

| Stack Layer | NVIDIA's Role | Competitive Threat |
| --- | --- | --- |
| Compute | GPU design (H100, B200) + CUDA ecosystem | AMD MI300X, Google TPU, custom ASICs |
| Networking | NVLink (scale-up) + InfiniBand/Mellanox (scale-out) | Ultra Ethernet, Broadcom, Arista |
| Systems | HGX board → DGX server → NVL72 rack | Hyperscaler custom designs, ODMs |
| Software | CUDA, TensorRT, NeMo, AI Enterprise | PyTorch ecosystem, open-source inference engines |

NVIDIA's vertical integration across compute, networking, systems, and software gives it extraordinary pricing power and customer lock-in. But it also means that any successful competitive entry at one layer (e.g., AMD in GPUs, Ethernet in networking) could weaken the entire integrated stack advantage. Investors should track both NVIDIA's expansion and the competitive attacks at each layer.

7) Why Investors Should Care

The bottleneck map is the single most useful framework for AI infrastructure investing. It tells you where constrained supply creates pricing power, how long that pricing power lasts, and where it will move next.

The Core Framework

Bottlenecks Don't Disappear — They Migrate

When one constraint is resolved, the next layer's constraint becomes binding. Investment opportunity follows the bottleneck. Short-term investors should track the current binding constraint (packaging, HBM). Long-term investors should position at the longest-lasting bottleneck (power infrastructure, energy). The companies sitting at the most durable bottlenecks — those with 5–10 year resolution timelines — have the most structural investment cases.

8) Connecting to the Stack

Days 1–8 → Day 9

Every layer studied in Days 1–8 contributes one piece to the bottleneck map. Today's synthesis does not add new information — it reveals the structure connecting all previous lessons.

Day 9 → Day 10

The bottleneck map sets up the final question: given these constraints and their timelines, which companies are best positioned to capture value across the AI infrastructure stack over the next 1–3 years? Day 10 will answer that.

9) What I Learned Today

  • AI infrastructure bottlenecks do not disappear — they migrate. Resolving GPU supply reveals packaging constraints. Resolving packaging reveals power constraints. Investment opportunity follows this cascade.
  • Bottleneck resolution timelines range from 1–2 years (HBM, inference software) to 5–10 years (power infrastructure, energy sourcing). The longest-lasting bottlenecks represent the most durable investment opportunities.
  • Power and energy infrastructure is the most enduring bottleneck in the AI stack — semiconductor technology advances in 1–3 year cycles while power infrastructure takes 5–10 years, creating a structural timeline mismatch that caps AI deployment speed regardless of chip progress.

10) One Question I'm Still Thinking About

If bottlenecks always migrate rather than disappear, does that mean AI infrastructure is permanently supply-constrained — and if so, does the traditional semiconductor cycle of boom-and-bust apply to AI hardware, or is this a structurally different demand pattern?

11) What Comes Next

In Day 10, I'll conclude the series with Who Wins Across the AI Infrastructure Stack — tying together all ten days to analyze competitive positioning, likely value capture, and what investors should watch over the next 1–3 years across every layer of the stack.

Continue the AI Infrastructure Study Series

This series is designed to make the AI stack easier to follow — one layer at a time, from compute and memory to networking, packaging, and system economics.

Next: Day 10 — Who Wins Across the AI Infrastructure Stack?

AI Infrastructure Study Series

Day 8: Inference Economics

Understanding what changes after training — why inference may become the larger long-term market, and what hardware and software tradeoffs matter most.

Summary

Training builds the model once. Inference runs the model every second of every day. While training dominates today's headlines and GPU budgets, inference is where AI actually meets users — and it generates cost continuously, not once. As AI models are embedded into more products and services, inference compute demand grows cumulatively and may eventually exceed training demand. Today we study the economics of inference: what drives cost per token, why the hardware competitive landscape is more diverse than training, and why software optimization has outsized economic leverage in this layer.

1) Why This Matters

Every time someone asks ChatGPT a question, generates an image, uses an AI coding assistant, or runs a search query enhanced by AI — that is inference. Training happens once per model generation. Inference happens billions of times per day, every day, for as long as the model is in service. The economics of inference determine whether AI services can scale profitably.

For investors, inference economics is where infrastructure spending translates into revenue. The cost per token determines API pricing, which determines margins, which determines whether the AI business model actually works at scale. Understanding this layer reveals which hardware and software companies have the most leverage over AI's unit economics.

2) One-Sentence Definitions

| Term | Simple Definition | Why It Matters |
| --- | --- | --- |
| Inference | Using a trained model to generate outputs from user inputs — the "serving" phase of AI. | Runs 24/7 for the life of the model — cost never stops |
| Latency | Time from user request to first response. Critical for real-time applications like chat and search. | Users expect sub-second response — latency drives UX |
| Throughput | Number of tokens or requests processed per second; higher throughput = lower cost per token. | Determines how efficiently hardware is used |
| Batching | Grouping multiple user requests together for simultaneous GPU processing. | Increases GPU utilization but can increase per-request latency |
| KV Cache | Memory that stores previous token computations during generation. Grows with context length. | Consumes massive HBM — limits concurrent requests per GPU |
| Cost per Token | The fundamental unit of inference economics — how much it costs to process one token of input or output. | Determines API pricing, margins, and commercial viability |

3) A Simple Analogy

Think of training vs inference like developing a recipe vs running a restaurant.

Training = developing the perfect recipe — expensive, but you do it once

Inference = cooking meals for customers every day using that recipe — cost runs continuously

Latency = how fast the food reaches the table after ordering

Batching = cooking 10 orders at once — more efficient kitchen, but the first customer waits a bit longer

KV Cache = remembering each diner's previous courses to inform the next — more courses, more notes to keep

Cost per Token = the cost of each individual dish served — this determines whether the restaurant is profitable

4) Training vs Inference: Two Different Economies

| Dimension | Training | Inference |
| --- | --- | --- |
| Cost type | One-time fixed cost | Continuous variable cost |
| Frequency | Once per model generation (months apart) | Billions of times per day, 24/7 |
| Who does it | ~5 frontier labs | Every company and user running AI |
| Key metric | Total FLOPS, MFU | Cost per token, latency, throughput |
| Hardware need | Maximum FLOPS + bandwidth, tightly coupled | Efficiency per watt, cost per token, memory capacity |
| NVIDIA dominance | Near-monopoly | Strong but more contested — ASICs compete |

The fundamental asymmetry: training is done by a handful of companies, a few times per year. Inference is done by everyone, every second. Over time, the cumulative compute demand from inference is expected to surpass training — making inference the larger long-term market for AI hardware.

5) What Drives Inference Cost

Model Size and Memory

Larger models need more GPU memory and more computation per token. A frontier-scale model often cannot fit on a single GPU, requiring multi-GPU serving with the associated communication overhead. The KV cache for long-context models (128K+ tokens) can consume tens of gigabytes of HBM per request, directly limiting how many users one GPU can serve simultaneously.
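
To see how "tens of gigabytes per request" arises, here is a back-of-the-envelope KV cache calculation using the standard sizing formula (two tensors per layer, times layers, KV heads, head dimension, bytes per value, and context length). The model dimensions are assumptions chosen to resemble a large model, not the specs of any particular product.

```python
# Rough KV cache sizing for one request. Assumed model shape, not a real spec.
num_layers    = 80       # transformer layers
num_kv_heads  = 8        # grouped-query attention KV heads
head_dim      = 128      # dimension per head
bytes_per_val = 2        # FP16/BF16
context_len   = 128_000  # tokens in context

# 2 = one K tensor and one V tensor per layer
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_val
kv_total_gb = kv_bytes_per_token * context_len / 1e9

print(f"KV cache per token : {kv_bytes_per_token / 1024:.0f} KiB")
print(f"KV cache @128K ctx : {kv_total_gb:.1f} GB per request")
# ~320 KiB/token and ~42 GB per 128K-token request under these assumptions --
# a single 80 GB GPU could hold only one or two such requests at once.
```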

The Latency-Throughput Tradeoff

Batching multiple requests together increases GPU utilization and throughput — lowering cost per token. But larger batches increase the wait time for individual users. Every inference system must balance this tradeoff based on the application: real-time chat demands low latency; batch analytics can tolerate higher latency for better throughput.
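
A toy model makes the tradeoff visible: batching amortizes each decoding step across more requests, so throughput and cost per token improve while each individual request waits longer per step. The GPU price, step time, and per-request overhead below are assumptions for illustration only.

```python
# Toy latency/throughput/cost-per-token model. All parameters are assumptions.
gpu_cost_per_hour = 3.00    # $/hr, assumed rental price
step_time_ms      = 30.0    # time for one decoding step at batch size 1
batch_overhead_ms = 1.5     # extra step time per additional request in the batch

for batch in (1, 4, 16, 64):
    step_ms = step_time_ms + batch_overhead_ms * (batch - 1)
    tokens_per_sec = batch * 1000.0 / step_ms            # tokens generated per second
    cost_per_m_tok = gpu_cost_per_hour / (tokens_per_sec * 3600) * 1e6
    print(f"batch={batch:3d}  step={step_ms:5.1f} ms  "
          f"throughput={tokens_per_sec:7.1f} tok/s  cost=${cost_per_m_tok:.2f}/M tokens")
# Larger batches cut cost per token sharply but stretch each user's per-step latency.
```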

Software Optimization: The Biggest Lever

In inference, software optimization has outsized economic impact. Quantization (reducing numerical precision from FP16 to INT8 or INT4) can cut memory and compute requirements by 2–4× with minimal quality loss. Speculative decoding uses a small model to draft tokens that a large model then verifies in parallel. Continuous batching dynamically groups incoming requests to maximize GPU utilization. FlashAttention reduces memory overhead for attention computation. These techniques can reduce inference cost by 2–10× on the same hardware — making software optimization economically equivalent to a hardware generation leap.
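
A quick sketch of why quantization is such a large lever: weight memory scales linearly with bytes per parameter, so moving from FP16 to INT4 cuts the weight footprint by 4x, which can be the difference between multi-GPU and single-GPU serving. The parameter count and HBM size are assumed round numbers.

```python
# Weight-memory savings from quantization. A 70B-parameter model is an assumed size.
params     = 70e9
gpu_hbm_gb = 80        # assumed HBM capacity per GPU

for fmt, bytes_per_param in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    weights_gb  = params * bytes_per_param / 1e9
    gpus_needed = int(-(-weights_gb // gpu_hbm_gb))   # ceiling division
    print(f"{fmt}: {weights_gb:6.0f} GB of weights -> >= {gpus_needed} GPU(s) "
          f"(before KV cache and activations)")
```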

What Beginners Often Get Wrong

People assume inference is just a smaller version of training — same hardware, same approach, lower intensity. In reality, inference has completely different optimization priorities. Training maximizes total FLOPS. Inference optimizes cost per token, latency, and energy efficiency. This is why dedicated inference hardware (ASICs, custom chips) can compete with NVIDIA GPUs in inference even though they cannot compete in training.

6) The Inference Hardware Landscape

Unlike training — where NVIDIA holds a near-monopoly — inference has a more diverse hardware competitive landscape. Different optimization axes (cost/token, latency, power efficiency) create openings for alternative architectures.

| Hardware | Company | Inference Strength | Limitation |
| --- | --- | --- | --- |
| H100 / B200 GPU | NVIDIA | Versatile, mature CUDA ecosystem, TensorRT optimization | Expensive — high cost per token at scale |
| TPU | Google | Optimized for Google's models (Gemini), integrated with GCP | Mostly captive to Google ecosystem |
| Inferentia / Trainium | AWS (Amazon) | Cost-efficient inference on AWS, designed for high throughput | AWS-only, software ecosystem still maturing |
| Groq LPU | Groq | Extremely low latency inference | Small scale, limited model support |
| MI300X GPU | AMD | Large HBM capacity — good for large model serving | ROCm software ecosystem weaker than CUDA |
| Apple Silicon / Qualcomm | Apple, Qualcomm | On-device inference — privacy, no network latency, no cloud cost | Limited to smaller models, constrained memory |

The key structural difference: training requires the absolute best hardware tightly coupled together (NVIDIA's moat). Inference values efficiency, cost, and flexibility — creating space for specialized alternatives. This is why NVIDIA's dominance in inference is strong but not as unassailable as in training.

7) Why Inference Will Likely Become the Bigger Market

Cumulative Demand

Training a frontier model uses a massive cluster for a few months. Serving that model to millions of users requires GPUs running 24/7 indefinitely. Over the model's lifetime, total inference compute is expected to far exceed the compute used for training.
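
A rough comparison in GPU-hours shows the asymmetry: training occupies a large cluster for a bounded window, while serving occupies a fleet continuously for the model's service life. Every input below is an illustrative assumption.

```python
# Training vs lifetime inference, in GPU-hours. All inputs are illustrative assumptions.
training_gpus = 25_000
training_days = 90

serving_gpus  = 60_000     # average fleet size serving the model
serving_years = 2          # assumed service life before replacement

training_gpu_hours  = training_gpus * training_days * 24
inference_gpu_hours = serving_gpus * serving_years * 365 * 24

print(f"Training : {training_gpu_hours / 1e6:6.1f}M GPU-hours (one-time)")
print(f"Inference: {inference_gpu_hours / 1e6:6.1f}M GPU-hours (ongoing)")
print(f"Ratio    : {inference_gpu_hours / training_gpu_hours:.0f}x")
# ~54M GPU-hours to train vs ~1,050M GPU-hours to serve under these assumptions.
```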

Expanding Use Cases

AI is being embedded into search, productivity tools, coding, customer service, healthcare, finance, and more. Each new integration creates a new stream of inference demand. The total number of AI-powered applications is growing faster than the number of new model training runs.

Cost Reduction Drives Demand

As inference cost per token falls — through hardware improvements, software optimization, and smaller efficient models — new use cases become economically viable. This is Jevons Paradox again: cheaper inference does not reduce demand, it unlocks more of it.

Edge Inference

Not all inference happens in data centers. On-device AI on smartphones, laptops, and embedded systems creates a parallel demand stream. Apple Intelligence, Qualcomm's AI Engine, and MediaTek's NPUs are expanding inference to billions of edge devices — a market that is additive to data center inference.

8) Why Investors Should Care

Inference economics determines whether AI can scale as a business. If the cost per token is too high, AI services cannot be offered affordably to mass markets. If cost per token falls fast enough, AI becomes embedded in everything — and inference hardware demand grows for decades.

The Core Framework

Training Builds the Model. Inference Builds the Business.

Training is a cost center — a one-time investment to create a model. Inference is the revenue engine — the ongoing compute that serves users and generates income. The long-term economics of AI are determined not by how much it costs to train a model, but by how cheaply and efficiently that model can serve millions of users. Investors should track: cost per token trends, inference hardware diversification, software optimization adoption, and the ratio of inference to training compute across major cloud providers.

9) Connecting to the Stack

Day 1 + Day 2 → Day 8

GPU compute power (Day 1) determines inference throughput. HBM capacity (Day 2) limits KV cache size and concurrent users per GPU. Memory bandwidth determines whether inference is compute-bound or memory-bound.

Day 3 → Day 8

Networking (Day 3) matters for distributed inference — when models are too large for one GPU, inter-GPU communication overhead directly impacts latency and throughput.

Day 5 + Day 6 → Day 8

Server design (Day 5) and power/cooling (Day 6) determine the operational cost of running inference 24/7. Energy efficiency per token is a critical metric because inference runs continuously, unlike training's finite duration.

Day 7 → Day 8

Training (Day 7) produces the model. Inference (Day 8) serves it. Together they represent the full lifecycle cost of AI — and the balance between them is shifting toward inference as AI adoption grows.

Day 8 → Day 9

Day 9 will map all the bottlenecks across the entire AI stack — from compute to memory to networking to packaging to power to economics — into one unified view. Inference cost is one of the most important bottlenecks to resolve for AI to scale commercially.

10) What I Learned Today

  • Inference is continuous variable cost — unlike training's one-time fixed cost — and scales with user demand. Over time, total inference compute is expected to exceed training compute, making inference the larger long-term hardware market.
  • Inference cost is driven by model size, KV cache memory, batch efficiency, and generation length. Software optimization (quantization, speculative decoding, continuous batching) can reduce costs 2–10× on the same hardware — giving software outsized economic leverage.
  • The inference hardware market is more competitive than training. Google TPU, AWS Inferentia, Groq LPU, and AMD MI300X all compete on cost-per-token and efficiency — creating a more diverse landscape where NVIDIA's dominance is strong but not monopolistic.

11) One Question I'm Still Thinking About

As inference cost per token continues to fall, will AI become so cheap to run that inference demand grows faster than cost declines — and if so, does that mean total inference hardware spending accelerates even as unit economics improve?

12) What Comes Next

In Day 9, I'll build the AI Stack Bottlenecks Map — a unified view of every major bottleneck across compute, memory, networking, packaging, power, training, and inference. After eight days of studying individual layers, Day 9 synthesizes them into one picture that shows where the constraints compound and where the biggest opportunities lie.

Continue the AI Infrastructure Study Series

This series is designed to make the AI stack easier to follow — one layer at a time, from compute and memory to networking, packaging, and system economics.

Next: Day 9 — AI Stack Bottlenecks Map

AI Infrastructure Study Series

Day 6: Data Center Power and Cooling

Understanding why electricity, thermal density, and cooling infrastructure have become the binding physical constraints on AI deployment.

Summary

Every layer studied so far — GPUs, memory, networking, packaging, servers — ultimately depends on two physical resources: electricity and cooling. No matter how advanced the silicon, a chip that cannot be powered cannot compute, and a chip that cannot be cooled must be throttled or shut down. Today we study why data center power and cooling have become the longest-duration bottleneck in AI infrastructure — a constraint measured not in chip generations but in years of physical construction and regulatory approvals.

1) Why This Matters

Semiconductor technology advances on 1–2 year cycles. A new GPU generation, a new process node, a new packaging technique — these move fast. But the power infrastructure that feeds data centers — transmission lines, substations, transformers, generation capacity — takes 5 to 10 years to build. This mismatch means that even if every chip-level bottleneck were resolved tomorrow, AI deployment would still be constrained by how fast electricity can reach the data center.

For investors, this is the layer where AI infrastructure meets energy infrastructure. The companies that supply power equipment, cooling systems, and grid connectivity are indirect but structural beneficiaries of AI spending — and the constraints here will persist longer than any semiconductor bottleneck.

2) One-Sentence Definitions

| Term | Simple Definition | Why It Matters |
| --- | --- | --- |
| Power Capacity | Total electricity a data center can draw, measured in MW. New AI facilities target 100 MW to 1 GW+. | Power supply — not demand — is the bottleneck |
| Power Density | Power consumed per rack, measured in kW/rack. AI racks draw 40–120 kW vs 10–20 kW for traditional racks. | 5–10× higher density requires completely different infrastructure |
| PUE | Power Usage Effectiveness — total facility power ÷ IT equipment power. Lower is better (1.0 = perfect). | Cooling overhead directly impacts operating cost |
| Air Cooling | Traditional cooling using fans and air conditioning. Sufficient for standard racks, insufficient for AI density. | Reaching its physical limits with AI workloads |
| Direct Liquid Cooling (DLC) | Cold plates attached directly to GPUs/CPUs with circulating coolant. Far more efficient than air. | Required by NVIDIA GB200 NVL72 — becoming the new standard |
| Immersion Cooling | Submerging entire servers in non-conductive coolant. Highest thermal efficiency but early-stage adoption. | Future potential for extreme density deployments |

3) A Simple Analogy

Think of AI data center infrastructure like city utilities for a rapidly growing factory district.

Power Capacity = the city's electrical grid — you can build factories fast, but running power lines to them takes years

Power Density = how many appliances per apartment — AI racks are like cramming 20 industrial ovens into each unit

Air Cooling = opening windows — works for a normal apartment, not for 20 ovens

Direct Liquid Cooling = running water pipes directly to each oven — much more effective

Immersion Cooling = submerging the entire oven in a cooling bath — maximum efficiency, complex plumbing

4) Why Power Is the Ultimate Bottleneck

The Timeline Mismatch

GPU generations advance every 1–2 years. TSMC delivers new process nodes every 2–3 years. But building the power infrastructure to feed a new data center — transmission lines, substations, transformers, grid interconnection — takes 5 to 10 years including permitting and environmental review. This is the fundamental mismatch: chip technology is outrunning the physical grid.

The Scale of AI Power Demand

To put the numbers in perspective: Microsoft, Google, Meta, and Amazon are each investing over $50 billion per year in AI infrastructure. A single new AI data center campus targets 100 MW to over 1 GW of power capacity. For reference, 1 GW is roughly the output of a single full-scale nuclear reactor. Data centers are projected to consume 6–9% of total U.S. electricity by 2030.
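
A quick sizing exercise shows how fast those megawatts get used up: divide campus power by the PUE overhead and the per-rack draw to estimate how many AI racks a site can actually host. The rack power, PUE, and GPUs-per-rack figures below are assumptions within the ranges used in this series.

```python
# How many AI racks fit inside a given power envelope. All inputs are assumptions.
campus_power_mw = 1000     # a 1 GW campus
rack_power_kw   = 120      # high-density AI rack (GB200-class, assumed)
pue             = 1.2      # facility overhead for cooling and distribution
gpus_per_rack   = 72       # NVL72-style rack, assumed

it_power_mw = campus_power_mw / pue                 # power left for IT equipment
racks       = it_power_mw * 1000 / rack_power_kw

print(f"IT power available : {it_power_mw:.0f} MW")
print(f"Racks supported    : {racks:,.0f}")
print(f"GPUs supported     : {racks * gpus_per_rack:,.0f}")
# Roughly 6,900 racks / ~500,000 GPUs for 1 GW under these assumptions --
# and that much power is roughly one full-scale nuclear reactor's output.
```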

Why Getting Power Is Hard

Building a data center is relatively fast — 12 to 18 months for construction. But securing the power to run it involves a chain of physical infrastructure that cannot be accelerated easily. High-voltage transmission lines require environmental impact assessments and land acquisition. Large power transformers have 2–3 year lead times globally, and demand is surging. Grid interconnection studies and utility agreements add further delays. This is why hyperscalers are now going directly to power sources — signing contracts to restart nuclear plants, building next to power stations, and investing in on-site generation.

What Beginners Often Get Wrong

People assume that if you have the money, you can build AI infrastructure quickly. In reality, the binding constraint is not capital — it is the physical time required to build power and cooling infrastructure. A hyperscaler can order $10 billion worth of GPUs, but if the data center does not have the electrical capacity to run them, those GPUs sit idle. This is why power availability has become the most important site-selection criterion for new AI data centers.

5) The Cooling Transition: From Air to Liquid

As power density rises, cooling must keep pace. Almost all of the electrical power consumed by a GPU is converted to heat. If that heat is not removed, the chip throttles its performance or shuts down entirely.

| Cooling Method | How It Works | Best For | Limitation |
| --- | --- | --- | --- |
| Air Cooling | Fans push cold air over components | Standard racks (10–20 kW) | Cannot handle AI-density heat loads |
| Rear Door HX (RDHx) | Water coils on the rack's rear door cool exhaust air | Retrofitting existing air-cooled facilities | Bridge solution — not sufficient for highest density |
| Direct Liquid Cooling (DLC) | Cold plates on GPUs/CPUs with circulating coolant | High-density AI racks (40–120 kW+) | Requires new plumbing, CDUs, and secondary cooling |
| Immersion Cooling | Servers submerged in non-conductive liquid | Extreme density, future deployments | Complex maintenance, high upfront cost, early adoption |

The transition from air cooling to liquid cooling is not optional — it is being driven by hardware requirements. NVIDIA's GB200 NVL72 rack system requires direct liquid cooling as a baseline specification. This means every data center deploying next-generation NVIDIA hardware must invest in coolant distribution units (CDUs), piping infrastructure, and secondary heat rejection systems. The cooling transition is structural and recurring: every new AI data center that gets built will need liquid cooling from day one.
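
The physics behind the air-to-liquid transition fits in one heat-balance line, Q = flow × specific heat × temperature rise: the mass flow of coolant needed to carry away a rack's heat. The rack power and allowed temperature rise below are assumptions.

```python
# Coolant flow needed to remove rack heat: Q = m_dot * c_p * dT. Assumed inputs.
rack_heat_w = 120_000      # 120 kW rack; essentially all electrical power becomes heat
delta_t_c   = 10.0         # allowed coolant temperature rise across the rack, in K

cp_water = 4186.0          # J/(kg*K)
cp_air   = 1005.0          # J/(kg*K)
rho_air  = 1.2             # kg/m^3

water_kg_s = rack_heat_w / (cp_water * delta_t_c)
air_kg_s   = rack_heat_w / (cp_air * delta_t_c)

print(f"Water flow: {water_kg_s:.1f} kg/s  (~{water_kg_s * 60:.0f} L/min)")
print(f"Air flow  : {air_kg_s:.1f} kg/s  (~{air_kg_s / rho_air:.1f} m^3/s)")
# ~2.9 kg/s of water vs ~12 kg/s (~10 m^3/s) of air for the same 120 kW --
# which is why air cooling runs out of headroom at AI rack densities.
```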

6) The Energy Source Question

Renewables

All major hyperscalers have carbon neutrality goals. Solar and wind are growing fast but face intermittency challenges — AI workloads run 24/7 and need baseload power. Renewables alone cannot yet meet the scale and reliability demands of large AI campuses.

Nuclear (SMR + Restarts)

Nuclear provides carbon-free baseload power — exactly what AI data centers need. Microsoft signed a deal to restart a Three Mile Island reactor. Amazon is building data centers near nuclear plants. SMRs (Small Modular Reactors) are being explored as dedicated on-site power for future campuses. Nuclear is emerging as the serious long-term answer to AI power demand.

Natural Gas

Natural gas generation remains the fastest path to large-scale reliable power. Many new AI data centers are being sited near gas-fired power plants. While not carbon-neutral, gas bridges the gap until nuclear and renewables can scale sufficiently.

Grid Stress

The rapid growth in data center power demand is straining regional grids. Northern Virginia, the largest U.S. data center market, has already experienced grid capacity constraints. Utilities are pushing back on new interconnection requests. Grid congestion is becoming a real limiting factor for AI expansion.

7) Who Matters at This Layer

| Company / Segment | Role in AI Power/Cooling | What Investors Should Watch |
| --- | --- | --- |
| Vertiv | Power management, thermal management, CDUs — core data center infrastructure | AI-related revenue growth, backlog size, liquid cooling orders |
| Schneider Electric | Power distribution, UPS, cooling, data center management software | Data center segment growth, cooling product mix shift |
| Eaton | Electrical distribution, power quality, UPS systems | Data center order growth, transformer/switchgear lead times |
| nVent | Liquid cooling solutions, rack infrastructure, thermal management | Liquid cooling revenue ramp, hyperscaler adoption |
| Utilities (Dominion, AES, NextEra) | Grid power supply to data center campuses | Data center interconnection pipeline, rate case filings, CapEx for grid expansion |
| Nuclear (Constellation, NuScale, Oklo) | Baseload clean power for next-generation AI campuses | PPA announcements with hyperscalers, SMR development milestones, regulatory approvals |

8) Why Investors Should Care

Power and cooling are fundamentally different from semiconductor bottlenecks. Chip-level bottlenecks (CoWoS capacity, EUV supply) are technology and manufacturing problems that can be solved with investment and engineering on 1–3 year horizons. Power bottlenecks are physical infrastructure and regulatory problems that take 5–10 years to resolve. This makes power the longest-duration constraint in the AI stack.

The Core Framework

Chip Speed ≠ Deployment Speed — Power Sets the Pace

Semiconductor technology can advance in 1–2 year cycles. Power infrastructure takes 5–10 years. This mismatch means that AI deployment speed is ultimately gated not by how fast chips improve, but by how fast electricity and cooling can be delivered to the facility. Investors must track power availability, grid capacity, cooling infrastructure investment, and energy sourcing strategies as leading indicators of AI infrastructure growth — not lagging ones.

9) Connecting to the Stack

Day 1–4 → Day 6

GPUs (Day 1), HBM (Day 2), interconnects (Day 3), and packaging (Day 4) all consume power and generate heat. Every improvement in chip performance increases the power and cooling demands at the facility level.

Day 5 → Day 6

AI servers and racks from Day 5 drive the extreme power density (40–120 kW/rack) that makes power and cooling the binding constraint. The server layer creates the demand; the facility layer must supply it.

Day 6 → Day 7

Power and cooling costs feed directly into training economics. Day 7 will study what drives the total cost of training frontier AI models — and electricity is one of the largest line items.

The Full Chain So Far

GPU designed → manufactured at TSMC (Day 4) → packaged with HBM via CoWoS (Day 4) → connected via NVLink (Day 3) → installed in AI server (Day 5) → powered and cooled by data center infrastructure (Day 6). Each layer depends on every layer before it.

10) What I Learned Today

  • AI server racks consume 5–10× more power than traditional racks (40–120 kW vs 10–20 kW), and the electrical infrastructure to deliver that power takes years to build — making power the ultimate physical bottleneck for AI scaling.
  • Cooling is transitioning from air to direct liquid cooling (DLC), driven by NVIDIA's GB200 NVL72 requiring liquid cooling as a baseline spec. This creates structural, recurring demand for cooling infrastructure with every new AI data center.
  • Power and cooling bottlenecks are fundamentally different from semiconductor bottlenecks — they are physical infrastructure and regulatory problems with 5–10 year resolution timelines, making them the longest-lasting constraint in the AI stack.

11) One Question I'm Still Thinking About

If AI power demand continues growing at the current pace, will the grid infrastructure crisis force hyperscalers toward fully on-site generation — and could nuclear SMRs eventually make data centers energy-independent?

12) What Comes Next

In Day 7, I'll study Training Economics — what actually drives the cost of training frontier AI models. Hardware utilization, networking efficiency, energy costs, and model scaling laws all converge to determine whether training a model costs $10 million or $1 billion. Power and cooling from today's study are a major component of that equation.

Continue the AI Infrastructure Study Series

This series is designed to make the AI stack easier to follow — one layer at a time, from compute and memory to networking, packaging, and system economics.

Next: Day 7 — Training Economics

AI Infrastructure Study Series

Day 7: Training Economics

Understanding what actually drives the cost of training frontier AI models — and why those costs are rising exponentially.

Summary

Every layer of the AI stack studied so far — GPUs, memory, networking, packaging, servers, power — converges into one question: how much does it cost to train a frontier AI model? The answer is not simply "GPU price × quantity." Training cost is a compound equation shaped by GPU utilization, network efficiency, power consumption, failure rates, and scaling laws. Today we study the economics of training and why costs are rising exponentially — creating a structural dynamic where only a handful of companies can afford to build frontier models.

1) Why This Matters

Hyperscalers are each spending over $50 billion per year on AI infrastructure. That is not charity — it is the physical cost of staying competitive in frontier model development. Understanding training economics explains why those budgets exist, why they keep growing, and which infrastructure layers benefit most from the spending.

For investors, training economics is the bridge between hardware and business value. It answers the question: "Why does all this infrastructure spending make economic sense?" — and reveals which companies and technologies have the most leverage over cost.

2) One-Sentence Definitions

| Term | Simple Definition | Why It Matters |
| --- | --- | --- |
| Training | The process of teaching a model by running massive data through it to learn parameter weights. Requires thousands of GPUs for weeks to months. | The single most expensive phase of building an AI model |
| Training Cost | Total cost of one full training run — hardware, power, cooling, networking, labor, and wasted runs. | GPT-4 class: ~$100M+; next-gen: $1B+ |
| GPU Utilization (MFU) | The ratio of actual compute used vs the GPU's theoretical maximum. Realistic range: 30–60%. | The hidden cost multiplier — low MFU means wasted GPUs |
| Scaling Laws | Empirical finding that improving model performance by a fixed amount requires roughly 10× more compute. | Drives exponential cost growth across model generations |
| Parallelism | Techniques for distributing training across many GPUs — data, tensor, pipeline, and expert parallelism. | Parallelism efficiency directly determines GPU utilization |
| Failed Runs | Training attempts that crash or diverge mid-run, wasting all GPU time and power consumed up to that point. | A hidden but significant cost — hundreds of failures per large training run |

3) A Simple Analogy

Think of training a frontier AI model like building a skyscraper.

GPUs = skilled construction workers — each has a fixed hourly output

GPU Utilization (MFU) = the percentage of time workers are actually laying bricks vs waiting for materials or coordinating

Scaling Laws = the cost of building higher — going from 10 to 20 floors costs not 2× but 3–5×

Failed Runs = demolishing a half-built building and starting over — all materials and labor wasted

Power + Cooling = the electricity and climate control for the construction site — runs 24/7 for months

4) The Training Cost Equation

Training cost is not a single line item. It is a compound equation with multiple interacting factors. Here is how the major cost components break down:

| Cost Component | What It Includes | Typical Share | Stack Connection |
| --- | --- | --- | --- |
| Hardware (CapEx) | GPU purchase or cloud rental, depreciated over 3–5 years | ~60–70% of total | Day 1 (GPU), Day 4 (foundry) |
| Power | Electricity to run thousands of GPUs for weeks/months | ~10–15% | Day 6 (power) |
| Networking | NVLink switches, InfiniBand/Ethernet fabric | ~10–20% | Day 3 (networking) |
| Cooling | DLC infrastructure, CDUs, cooling power consumption | ~3–5% | Day 6 (cooling) |
| Labor | ML engineers, infrastructure engineers, researchers | ~5–10% | — |
| Failed Runs (Waste) | GPU time and power lost to crashes, bugs, divergence, restarts | ~10–20%+ (often unreported) | Day 5 (system reliability) |
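
One way to turn those shares into a dollar figure is to build an effective cost per GPU-hour from hardware depreciation and electricity, then multiply by cluster size and run length. This simplified sketch covers only those two line items plus a waste factor, and every input is an assumed round number, not a measured cost.

```python
# Back-of-the-envelope training run cost. All inputs are illustrative assumptions,
# and networking, cooling, and labor line items are omitted for brevity.
gpus            = 25_000
run_days        = 90
gpu_capex       = 35_000      # $ per GPU (accelerator only)
system_overhead = 1.6         # servers, racks, fabric as a multiple of GPU cost
deprec_years    = 4
gpu_power_kw    = 1.2         # per GPU, including its share of server power
pue             = 1.2
power_price     = 0.07        # $/kWh

hours = run_days * 24
capex_per_gpu_hr = gpu_capex * system_overhead / (deprec_years * 365 * 24)
power_per_gpu_hr = gpu_power_kw * pue * power_price

hardware_cost = capex_per_gpu_hr * gpus * hours
power_cost    = power_per_gpu_hr * gpus * hours
waste_factor  = 1.15          # failed runs / restarts, assumed 15% overhead

total = (hardware_cost + power_cost) * waste_factor
print(f"Hardware (depreciation): ${hardware_cost / 1e6:6.1f}M")
print(f"Power                  : ${power_cost / 1e6:6.1f}M")
print(f"Total incl. ~15% waste : ${total / 1e6:6.1f}M")
# Lands near $100M under these assumptions -- the same ballpark as the
# GPT-4-class figures quoted in this post.
```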

What Beginners Often Get Wrong

People quote training costs as a single number — "$100M to train GPT-4" — as if it is a clean purchase. In reality, training cost includes enormous hidden waste: failed training runs that crash mid-way, GPU idle time during communication waits, and engineering months spent debugging instability. The real total cost of building a frontier model is significantly higher than the headline compute cost.

5) GPU Utilization: The Hidden Cost Multiplier

GPU utilization — measured as MFU (Model FLOPS Utilization) — is the most underappreciated variable in training economics. It determines how much of your GPU investment actually does useful work.

Why MFU Is Never 100%

Communication overhead: GPUs must synchronize parameters across thousands of devices. Network speed (NVLink, InfiniBand) directly determines wait time.

Memory bottlenecks: Moving data between HBM and compute units takes time. Some workloads are memory-bound, not compute-bound.

Pipeline bubbles: In pipeline parallelism, some GPUs sit idle waiting for their turn in the pipeline.

Failure recovery: Hardware crashes require rolling back to the last checkpoint and restarting.

Realistic MFU Numbers

Well-optimized training: 40–55% MFU

Best-in-class (Google, Meta): 50–60%+

Poorly optimized: 20–35%

The economic impact: improving MFU from 30% to 50% means the same training run can be done with ~40% fewer GPUs — or finished in ~40% less time on the same cluster. MFU optimization is economically equivalent to buying more GPUs.
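
The arithmetic behind that claim: frame training as a fixed compute budget divided by delivered FLOPs, and the GPU count scales inversely with MFU. The total compute and per-GPU peak below are assumed round numbers.

```python
# GPUs needed for a fixed training budget at different MFU levels.
# Total FLOPs and per-GPU peak throughput are assumed round numbers.
total_flops  = 1e25          # training compute budget
peak_per_gpu = 1e15          # ~1 PFLOP/s peak per accelerator (assumed)
target_days  = 90

seconds = target_days * 24 * 3600
for mfu in (0.30, 0.40, 0.50):
    delivered_per_gpu = peak_per_gpu * mfu
    gpus = total_flops / (delivered_per_gpu * seconds)
    print(f"MFU {mfu:.0%}: {gpus:,.0f} GPUs for a {target_days}-day run")
# Going from 30% to 50% MFU needs 30/50 = 0.6x the GPUs -- i.e. ~40% fewer.
```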

This is why the entire stack matters for training economics. Faster NVLink (Day 3) reduces communication overhead. Higher HBM bandwidth (Day 2) reduces memory bottlenecks. Better system reliability (Day 5) reduces failure recovery time. Every infrastructure layer studied so far feeds directly into MFU — and MFU feeds directly into cost.

6) Scaling Laws: Why Costs Rise Exponentially

Scaling laws are the single most important long-term driver of training economics. Research from OpenAI, Google DeepMind, and others has established a consistent empirical pattern: improving model performance by a fixed amount requires roughly 10× more compute.

| Model Generation | Estimated Training Cost | Approximate Scale |
| --- | --- | --- |
| GPT-3 era (2020) | ~$5–10M | Thousands of GPUs, weeks |
| GPT-4 era (2023) | ~$100M+ | Tens of thousands of GPUs, months |
| Current frontier (2025–26) | ~$500M–$1B+ | 50K–100K+ GPUs, months |
| Next generation | ~$5B–$10B? | 100K+ GPUs, potentially 6+ months |

This exponential escalation happens because model performance improves on a logarithmic scale — each incremental improvement requires disproportionately more compute. Early gains come cheaply; further gains become enormously expensive. This is the AI equivalent of diminishing returns.

However, scaling laws are not fixed fate. Algorithmic innovations (Mixture of Experts architectures, better training recipes), data quality improvements, and hardware efficiency gains can shift the curve — achieving the same performance with less compute. Historically, though, efficiency gains have not reduced total compute demand. Instead, they have enabled even larger training runs — a pattern known as Jevons Paradox.
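
The interaction between the roughly 10× compute rule and efficiency gains can be written as one multiplication per generation: cost grows by the compute multiple divided by whatever combined hardware and algorithmic efficiency improvement arrives over the same interval. The starting cost and the efficiency factor below are assumptions.

```python
# Generation-over-generation training cost under scaling laws.
# Starting cost, compute multiple, and efficiency gain are all assumptions.
cost             = 100e6     # ~GPT-4-era training cost, in dollars
compute_multiple = 10        # ~10x more compute per generation (scaling laws)
efficiency_gain  = 2.5       # combined hardware + algorithmic cost reduction per generation

for gen in range(1, 4):
    cost = cost * compute_multiple / efficiency_gain
    print(f"Generation +{gen}: ~${cost / 1e9:.1f}B per training run")
# ~$0.4B, ~$1.6B, ~$6.4B: even strong efficiency gains only slow, not stop,
# the exponential climb -- consistent with the $1B+ and $5-10B estimates above.
```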

7) Who Can Afford Frontier Training?

When a single training run costs $1 billion or more, the number of organizations that can compete at the frontier shrinks dramatically.

Can Afford $1B+ Training

Microsoft/OpenAI — Azure infrastructure + dedicated clusters

Google/DeepMind — TPU pods + custom infrastructure

Meta — Massive internal GPU clusters

Amazon — AWS + Trainium custom chips

xAI — Large GPU cluster (Memphis Supercluster)

Structural Implication

The exponential cost curve acts as a natural barrier to entry. Frontier AI development is converging toward a natural oligopoly — not because of regulation, but because of physics and economics. Only companies with $50B+ annual CapEx budgets can sustain the hardware investment required for each successive generation. Startups can fine-tune or build smaller models, but training frontier models from scratch is becoming a big-company-only activity.

8) Why Investors Should Care

Training economics is the mechanism that translates hardware demand into business reality. Understanding it reveals why GPU demand grows nonlinearly, why infrastructure CapEx keeps increasing, and which parts of the stack have the most economic leverage.

The Core Framework

Scaling Laws Drive the Entire Stack's Economics

Scaling laws dictate that each new model generation requires ~10× more compute. This exponential growth cascades through every infrastructure layer: more GPUs (Day 1) → more HBM (Day 2) → more networking (Day 3) → more packaging (Day 4) → more servers (Day 5) → more power and cooling (Day 6). Training economics is not a separate topic — it is the demand engine that drives the entire AI hardware supply chain. Investors who understand scaling laws understand why AI infrastructure spending is structural, not cyclical.

9) Connecting to the Stack

Day 1 + Day 2 → Day 7

GPU compute (Day 1) and HBM bandwidth (Day 2) determine the raw performance ceiling. Whether that ceiling is reached depends on MFU — and MFU is shaped by memory bottlenecks and compute efficiency.

Day 3 → Day 7

NVLink and InfiniBand from Day 3 directly determine communication overhead — one of the biggest drags on MFU. Faster interconnects mean less GPU idle time and lower effective training cost.

Day 4 + Day 5 → Day 7

Foundry/packaging capacity (Day 4) limits how many GPUs exist. Server/rack design (Day 5) determines system reliability and failure rates. Both feed into training cost through hardware availability and waste.

Day 6 → Day 7

Power and cooling from Day 6 are direct cost line items in training. Electricity cost scales linearly with GPU count and training duration. Cooling efficiency affects PUE, which multiplies the power bill.

Day 7 → Day 8

Training produces a finished model. Day 8 will study what happens after training — inference economics. Inference may ultimately become the larger market, with different hardware requirements, cost structures, and competitive dynamics.

10) What I Learned Today

  • Training cost is a compound equation — not just GPU price × quantity — shaped by GPU utilization (MFU), network efficiency, power, cooling, and failure waste. Improving MFU from 30% to 50% is economically equivalent to adding 40% more GPUs.
  • Scaling laws dictate that each new model generation requires roughly 10× more compute, driving training costs from ~$10M (GPT-3) to ~$100M (GPT-4) to $1B+ for current frontier models — and potentially $5–10B for the next generation.
  • The exponential cost curve makes frontier AI development a natural oligopoly — only 5 or fewer organizations can afford $1B+ training runs, structurally concentrating frontier model development among the largest companies.

11) One Question I'm Still Thinking About

If algorithmic efficiency improves fast enough to bend the scaling curve, will that slow down hardware demand — or will Jevons Paradox hold, with efficiency gains simply enabling even larger models and sustaining exponential compute growth?

12) What Comes Next

In Day 8, I'll study Inference Economics — what changes after training is complete. Inference is how trained models serve users at scale, and it may become a larger long-term market than training. The hardware requirements, cost structures, optimization levers, and competitive dynamics of inference are fundamentally different from training — and understanding that difference is critical for AI infrastructure investors.

Continue the AI Infrastructure Study Series

This series is designed to make the AI stack easier to follow — one layer at a time, from compute and memory to networking, packaging, and system economics.

Next: Day 8 — Inference Economics

AI Infrastructure Study Series

Day 5: AI Systems, Servers, and Racks

Understanding how GPUs, memory, networking, cooling, and power come together in the physical systems that actually run AI workloads.

Summary

Everything studied so far — GPUs, HBM, NVLink, foundry, packaging — must be assembled into a physical system before it can do any real work. An AI server is not just a computer with GPUs plugged in. It is a tightly integrated system where compute, memory, interconnects, power delivery, and cooling all constrain each other. Today we study how these components come together at the server, rack, and cluster level — and why the system layer creates its own bottlenecks and value capture dynamics that investors need to understand.

1) Why This Matters

It is easy to focus on individual components — a faster GPU, more HBM, better packaging. But none of these components deliver value until they are integrated into a working system that can be deployed in a data center. The system layer is where all component-level constraints compound. A server that cannot be cooled cannot run. A rack that exceeds the building's power budget cannot be deployed.

For investors, this means the AI hardware value chain does not end at the chip. Server design, system integration, power delivery, and cooling architecture are all layers where value is created and captured — and where new bottlenecks emerge.

2) One-Sentence Definitions

| Term | Simple Definition | Why It Matters |
| --- | --- | --- |
| AI Server | A high-performance server designed around GPU accelerators as the primary compute engine, with the CPU in a supporting role. | Where all components become a working system |
| AI Rack | A physical frame holding multiple AI servers stacked vertically, typically 42U tall. | Power density per rack is 5–10× higher than traditional racks |
| AI Cluster | Multiple AI racks connected by high-speed networks to form one large-scale compute system. | Frontier model training requires thousands of GPUs in one cluster |
| HGX | NVIDIA's GPU baseboard platform — 8 GPUs on one board, connected via NVLink — that OEMs build servers around. | The standard building block for third-party AI servers |
| DGX | NVIDIA's own complete AI server — GPU board, chassis, power, cooling, and software integrated. | NVIDIA's move from chip seller to system seller |
| OEM / ODM | OEMs (Dell, HPE, Supermicro) sell branded servers; ODMs (Foxconn, Quanta, Wistron) manufacture custom designs for hyperscalers. | Different business models with very different margin structures |

3) A Simple Analogy

Think of the AI system stack like building a car.

GPU = the engine — core power

HBM = the fuel tank — feeds the engine at high speed

NVLink = the high-speed fuel lines inside the powertrain

CoWoS Packaging = assembling engine + fuel tank into one powertrain unit

AI Server = the finished car — engine, fuel, cooling, electrical all integrated

AI Rack = a row in a parking lot — multiple cars lined up

AI Cluster = the full parking lot — hundreds of cars connected by roads (network) working as one fleet

4) How AI Servers Differ from Traditional Servers

AI servers are not just regular servers with GPUs added. They are fundamentally different machines, designed around entirely different constraints.

| Dimension | Traditional Server | AI Server |
| --- | --- | --- |
| Primary compute | CPU | GPU (CPU is supporting) |
| Power per rack | 10–20 kW | 40–120 kW+ |
| Cooling | Air cooling is sufficient | Direct liquid cooling often required |
| Key interconnect | CPU ↔ memory (DDR) | GPU ↔ GPU (NVLink) |
| Physical size | 1U–2U typically | 4U–8U+ per server, or full-rack systems |

What Beginners Often Get Wrong

People think of AI servers as "regular servers with extra GPUs." In reality, AI servers require completely different power infrastructure, different cooling systems, different interconnect architectures, and different physical dimensions. A data center built for traditional servers often cannot simply swap in AI servers — the entire facility may need to be redesigned.

5) NVIDIA's System Strategy: From Chips to Racks

NVIDIA is no longer just a chip company. It is systematically expanding its sales unit from individual GPUs to complete systems.

| Product | What It Is | Scale | Who Builds the Server |
| --- | --- | --- | --- |
| HGX | GPU baseboard (8 GPUs + NVLink) | Board-level | OEM/ODM designs the server around it |
| DGX | Complete AI server (GPU + chassis + cooling + software) | Server-level | NVIDIA designs everything |
| GB200 NVL72 | Full-rack system (72 GPUs + NVLink + liquid cooling) | Rack-level | NVIDIA defines the full rack architecture |

This progression matters because it means NVIDIA's average selling price is moving from chip-level ($30K–$40K per GPU) to rack-level (potentially $2M–$3M+ per rack). It also means NVIDIA is capturing value that used to belong to OEMs — chassis design, cooling integration, system software. This is one of the most important structural shifts in the AI hardware value chain.
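
The size of that ASP shift is simple multiplication: 72 GPUs at chip-level prices, plus the NVLink switches, cooling, and integration bundled into the rack, lands in the millions per unit. The per-GPU price and system premium below are assumptions consistent with the ranges quoted above.

```python
# Rough rack-level ASP implied by chip-level prices. Inputs are assumptions.
gpus_per_rack  = 72
price_per_gpu  = 35_000      # midpoint of the $30K-$40K range quoted above
system_premium = 1.15        # NVLink switches, CPUs, cooling, integration (assumed)

rack_asp = gpus_per_rack * price_per_gpu * system_premium
print(f"Implied rack ASP: ${rack_asp / 1e6:.1f}M")
# ~$2.9M per rack -- inside the $2M-$3M+ range quoted above.
```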

6) The Three Players: OEMs, ODMs, and Hyperscalers

Traditional OEMs

Dell, HPE, Lenovo, Supermicro

Build branded AI servers around NVIDIA HGX boards. Add their own chassis, power, cooling, and support. Serve enterprise customers. Supermicro has grown fastest in AI server share by moving quickly on GPU server SKUs.

ODMs

Foxconn, Quanta, Wistron, Inventec

Manufacture custom-designed servers at scale for hyperscalers. No brand of their own — they build to customer spec. Handle the largest volume of AI servers but operate at lower margins.

Hyperscalers

Microsoft, Google, Meta, Amazon

Increasingly designing their own AI servers and even custom chips (Google TPU, Amazon Trainium). Use ODMs for manufacturing. Their internal design efforts aim to reduce dependency on NVIDIA and optimize for their specific workloads.

The competitive tension here is clear: NVIDIA is moving up to sell complete systems, hyperscalers are moving down to design their own, and OEMs/ODMs are caught in between. Who captures the most value at this layer will depend on who controls the design authority — and right now, that authority belongs to NVIDIA and the hyperscalers, not the assemblers.

7) Who Matters at This Layer

| Company / Segment | Role in AI Systems | What Investors Should Watch |
| --- | --- | --- |
| NVIDIA | GPU designer expanding into full system/rack design (DGX, NVL72) | System ASP growth, DGX/NVL72 adoption, OEM relationship dynamics |
| Supermicro | Fastest-growing OEM in AI servers — GPU-first product strategy | AI server revenue share, gross margin trends, cooling/rack innovation |
| Dell / HPE | Traditional OEMs adapting to AI server demand | AI server backlog, enterprise adoption pace, competitive positioning vs Supermicro |
| Foxconn / Quanta | ODMs manufacturing custom AI servers at scale for hyperscalers | AI server revenue growth, hyperscaler order concentration, margin structure |
| Hyperscalers (MSFT, GOOG, META, AMZN) | Designing custom AI servers and chips to reduce NVIDIA dependency | Custom chip progress (TPU, Trainium), CapEx allocation, internal vs external GPU mix |

8) Why Investors Should Care

The system layer is where component-level value becomes deployable infrastructure. It is also where a critical shift is happening: NVIDIA is expanding from selling chips to selling racks, while hyperscalers are expanding from buying systems to designing their own.

The Core Framework

Design Authority = Value Capture

In the AI system value chain, whoever controls the design captures the most value. NVIDIA controls GPU architecture, NVLink topology, and increasingly the full rack design. Hyperscalers control their own infrastructure specs and custom chips. OEMs and ODMs that only assemble face margin compression as the design owners expand their reach. The key investor question is: who holds design authority, and is it expanding or shrinking?

9) Connecting to the Stack

Day 1 → Day 5

GPUs from Day 1 are the primary compute engine inside every AI server. The server exists to make the GPU usable.

Day 2 → Day 5

HBM from Day 2 sits on the same package as the GPU. The server's memory subsystem must accommodate both HBM bandwidth and system DRAM for CPU tasks.

Day 3 → Day 5

NVLink from Day 3 defines GPU-to-GPU connectivity inside the server. InfiniBand and Ethernet from Day 3 connect servers across the rack and cluster.

Day 4 → Day 5

Every GPU and HBM die in the server was manufactured and packaged through the foundry and CoWoS process from Day 4. Packaging capacity directly limits how many servers can be built.

Day 5 → Day 6

AI servers generate extreme power demand and heat. Day 6 will study why data center power infrastructure and cooling systems are becoming the next major deployment bottleneck.

10) What I Learned Today

  • AI servers are fundamentally different from traditional servers — power density is 5–10× higher, liquid cooling is increasingly required, and GPU-to-GPU interconnects (not CPU-to-memory) are the performance-defining link.
  • NVIDIA is expanding from chip seller to system seller (HGX → DGX → NVL72), raising its ASP from per-GPU to per-rack while compressing OEM value-add.
  • In the AI system value chain, design authority determines value capture — NVIDIA and hyperscalers hold it, while OEMs and ODMs face margin pressure as assemblers.

11) One Question I'm Still Thinking About

As NVIDIA moves to rack-scale systems like NVL72, will OEMs like Dell and Supermicro find ways to add meaningful differentiation — or will they gradually become distribution and support channels for NVIDIA's pre-designed systems?

12) What Comes Next

In Day 6, I'll study Data Center Power and Cooling — the physical infrastructure that determines whether AI servers can actually be deployed at scale. Power delivery, thermal density, liquid cooling, and facility upgrades are becoming the binding constraints on AI infrastructure growth.

Continue the AI Infrastructure Study Series

This series is designed to make the AI stack easier to follow — one layer at a time, from compute and memory to networking, packaging, and system economics.

Next: Day 6 — Data Center Power and Cooling

AI Infrastructure Study Series

Day 4: Foundry and Advanced Packaging

Understanding why chip manufacturing and packaging — not just chip design — determine how many AI accelerators can reach the market.

Summary

AI chips do not appear out of thin air. Someone has to physically manufacture the silicon, then assemble multiple dies into a single high-performance package. That someone is overwhelmingly one company: TSMC. Today we study why the foundry and advanced packaging layers have become the tightest bottlenecks in the entire AI hardware supply chain — tighter, in many cases, than chip design itself. The key lesson: AI chip supply is a multiplication chain of design × fabrication × packaging × equipment, and a bottleneck at any single node constrains the entire output.

1) Why This Matters

In the AI hardware stack, most attention goes to chip designers like NVIDIA and AMD. But no matter how brilliant a GPU design is, it cannot reach the market unless a foundry can manufacture it and a packaging line can assemble it. Manufacturing and packaging set the hard ceiling on how many AI accelerators the world can actually use.

For investors, this means the real supply constraint often sits not with the designer, but with the manufacturer. Understanding foundry economics, packaging bottlenecks, and equipment dependencies is essential for reading AI hardware supply accurately.

2) One-Sentence Definitions

Term Simple Definition Why It Matters
Foundry A factory that physically manufactures semiconductor chips from a designer's blueprints. TSMC holds 90%+ of leading-edge capacity
Advanced Packaging Technology that assembles multiple dies (GPU, HBM, I/O) into a single package with ultra-fast interconnects. CoWoS is the current AI packaging standard
Yield The percentage of usable chips from a wafer — the foundry's most critical competitive metric. Low yield = higher cost per chip
Process Node Generation labels (3nm, 5nm) describing how densely transistors can be packed on a die. Smaller node = more compute, but exponentially harder to make
EUV Lithography Extreme ultraviolet light technology required to print circuit patterns at leading-edge nodes. ASML is the sole supplier worldwide

3) A Simple Analogy

Think of the AI chip supply chain as a construction project.

Fabless Designer (NVIDIA, AMD) = the architect who draws the blueprints

Foundry (TSMC) = the construction company that actually builds the structure

Advanced Packaging (CoWoS) = connecting multiple buildings into one high-speed campus

Yield = the completion rate — out of 100 buildings, how many pass inspection

EUV Equipment (ASML) = the only crane manufacturer in the world

4) Why Packaging Became the Real Bottleneck

The End of Monolithic Scaling

In the past, performance scaling came from shrinking transistors on a single die — classic Moore's Law. But physics has imposed hard limits. As dies grow larger, yield collapses because one defect kills the entire chip. The cost per transistor, which used to decline with each node, has started increasing at the most advanced processes.

The Chiplet Solution

Instead of building one massive monolithic die, designers now split functions across smaller dies called chiplets and connect them inside a single package. NVIDIA's Blackwell B200 places two GPU dies in one package, linked by an ultra-high-bandwidth die-to-die interconnect (roughly 10 TB/s). This is only possible because of TSMC's CoWoS (Chip-on-Wafer-on-Substrate) advanced packaging technology.
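
To see why yield collapses with die size, and why splitting a design into chiplets helps, here is a minimal sketch using the classic Poisson yield model. The defect density, die areas, and wafer cost are assumed illustrative values, not foundry data.

```python
import math

# Minimal sketch: Poisson defect model for die yield.
# Defect density, die areas, and wafer cost below are illustrative assumptions.

def die_yield(area_mm2: float, defects_per_cm2: float) -> float:
    """Probability that a die of the given area contains zero defects."""
    return math.exp(-area_mm2 * defects_per_cm2 / 100.0)

D0 = 0.10            # assumed defect density, defects per cm^2
WAFER_AREA = 70_000  # usable area of a 300 mm wafer in mm^2 (rough)
WAFER_COST = 20_000  # assumed leading-edge wafer cost in USD

def silicon_cost_per_good_die(area_mm2: float) -> float:
    """Wafer cost spread across the good dies of this size (ignores edge loss)."""
    gross_dies = WAFER_AREA // area_mm2
    return WAFER_COST / (gross_dies * die_yield(area_mm2, D0))

mono = silicon_cost_per_good_die(800)         # one reticle-sized monolithic die
chiplet = 2 * silicon_cost_per_good_die(400)  # two known-good chiplets per product

print(f"monolithic 800 mm^2: yield {die_yield(800, D0):.0%}, ~${mono:,.0f} of silicon per good die")
print(f"2 x 400 mm^2 chiplets: yield {die_yield(400, D0):.0%} each, ~${chiplet:,.0f} of silicon per product")
# The chiplet route wins because bad dies are screened out before assembly
# (known-good-die testing); the trade-off is an added packaging step, i.e. CoWoS.
```

The point is not the exact numbers but the shape: at the same defect density, far less silicon is wasted when bad dies can be discarded before assembly, and that assembly step is exactly the work CoWoS takes on.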

What CoWoS Actually Does

GPU dies and HBM dies are placed side-by-side on a silicon interposer — a thin layer of silicon containing thousands of TSVs (Through-Silicon Vias) that provide ultra-high-bandwidth die-to-die communication. This interposer is then mounted on a substrate. The result is a multi-die system that behaves like a single chip. The critical problem is that CoWoS packaging capacity is more constrained than chip fabrication itself. TSMC has been expanding CoWoS capacity aggressively, but demand from NVIDIA, AMD, Google, and Amazon continues to outpace supply.

What Beginners Often Get Wrong

People assume "chip shortage" means the fab cannot produce enough GPU dies. In the AI era, the more common bottleneck is packaging. The dies may already exist, but they cannot be assembled into functional products because CoWoS lines are full. This is why TSMC's CoWoS capacity expansion plans matter more to AI chip supply than wafer capacity alone.

5) The Process Node Race Still Matters

Advanced packaging is the hot topic, but process node competition remains vital. Each chiplet inside an advanced package is built on a leading-edge node. More advanced nodes deliver better power efficiency and compute density per die, which directly translates to better AI performance per package.

Company Current Leading Node Next Generation Status
TSMC 3nm (N3E) 2nm (N2) — first GAA node Mass production expected 2025
Samsung 3nm GAA 2nm GAA Early GAA adopter, but yield challenges persist
Intel Intel 3 Intel 18A Seeking external customers; limited track record

Each node transition increases manufacturing complexity and cost. A 3nm wafer is estimated to cost over $20,000, and the number of EUV exposure steps grows with each new node, driving up both cycle time and expense. This cost structure is why foundry leadership has consolidated around TSMC — it is the only company that has sustained the economics of leading-edge manufacturing at scale.

6) The Double Bottleneck: TSMC + ASML

TSMC — The Foundry Bottleneck

TSMC dominates both leading-edge chip fabrication and CoWoS advanced packaging. It controls two of the four factors in the AI chip supply equation. Its CapEx guidance, CoWoS capacity expansion timeline, and utilization rates effectively set the ceiling for global AI chip supply.

ASML — The Hidden Monopoly

Behind TSMC sits another monopoly. ASML is the sole supplier of EUV lithography systems. Its next-generation High-NA EUV tools, essential for 2nm and beyond, cost over $350M per unit with extremely limited annual output. TSMC's expansion speed is constrained by ASML's equipment delivery schedule.

This creates a double bottleneck structure: foundry capacity is gated by TSMC, and TSMC's own expansion is gated by ASML. Investors who track only chip designers are looking at demand signals while missing the supply constraints that actually determine output.

7) Who Matters at This Layer

Company / Segment Role in AI Supply Chain What Investors Should Watch
TSMC Leading-edge fab + CoWoS packaging — the single largest bottleneck CapEx guidance, CoWoS expansion timeline, utilization rates
ASML Sole EUV lithography supplier — the hidden monopoly behind the foundry Order backlog, High-NA EUV delivery schedule, revenue geography
Samsung / Intel Challenger foundries — potential diversification of TSMC concentration risk Yield improvements, major external customer wins, GAA milestones
OSAT (ASE, Amkor) Downstream packaging and test — volume beneficiaries of AI chip growth Advanced packaging revenue mix, capacity additions, TSMC partnerships
Materials (Ajinomoto, Ibiden, Shinko) ABF substrates, interposer materials, bumping — critical packaging inputs ABF supply/demand balance, CapEx, CoWoS material qualification

8) Why Investors Should Care

The most common mistake in AI hardware investing is focusing only on chip designers. NVIDIA can design the best GPU in the world, but if TSMC cannot manufacture it and CoWoS cannot package it, that design never reaches the market. The real supply constraint sits in manufacturing and packaging.

The Core Framework

AI Chip Supply = Design × Fabrication × Packaging × Equipment

This is a multiplication chain, not a sum: a bottleneck at any single node constrains the entire output. TSMC controls two of the four factors (fabrication and packaging), and ASML controls the equipment factor. Investors must track each node independently — looking only at NVIDIA tells you about demand, not about supply.
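
One way to make the framework concrete is to treat each factor as a capacity-limited stage and note that shipped output is gated by the tightest stage. A toy sketch with made-up monthly capacities:

```python
# Toy model of the AI chip supply chain. Each stage's capacity is expressed as the
# number of finished accelerators per month it could support. Figures are made up.

stage_capacity = {
    "design (dies taped out)":        1_500_000,
    "fabrication (good dies out)":    1_200_000,
    "packaging (CoWoS slots)":          700_000,
    "equipment (EUV-limited starts)":   900_000,
}

def binding_constraint(capacities: dict) -> tuple:
    """Shipped output is set by the stage with the smallest capacity."""
    stage = min(capacities, key=capacities.get)
    return stage, capacities[stage]

print(binding_constraint(stage_capacity))
# -> ('packaging (CoWoS slots)', 700000)

# Doubling a non-binding stage does nothing for shipped supply:
stage_capacity["design (dies taped out)"] *= 2
print(binding_constraint(stage_capacity))
# -> still ('packaging (CoWoS slots)', 700000)
```

Under these assumed numbers, tracking NVIDIA's roadmap alone tells you about the first stage; TSMC's CoWoS expansion and ASML's delivery schedule tell you which stage actually binds.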

9) Connecting to the Stack

Day 1 → Day 4

The GPU dies from Day 1 are manufactured at TSMC on leading-edge nodes. Without the foundry, those designs are just files on a server.

Day 2 → Day 4

The HBM dies from Day 2 are placed next to GPU dies inside a CoWoS package. HBM's bandwidth advantage only materializes because advanced packaging connects HBM and GPU at extremely close range.

Day 3 → Day 4

The interconnect hierarchy from Day 3 begins at the package level. A die-to-die link inside the CoWoS package connects the GPU dies first; NVLink and the cluster network then extend connectivity outward to servers and racks.

Day 4 → Day 5

Once chips are fabricated and packaged, they go into servers and racks. Day 5 will cover how GPUs, memory, networking, cooling, and power come together in full AI systems.

10) What I Learned Today

  • The foundry is the factory that turns chip designs into physical silicon, and TSMC dominates leading-edge manufacturing with over 90% share.
  • Advanced packaging (CoWoS) assembles multiple dies into one high-performance system, and it is currently the tightest supply bottleneck in AI hardware — tighter than chip fabrication itself.
  • AI chip supply is a multiplication of design × fab × packaging × equipment, with TSMC and ASML as the most irreplaceable nodes in the entire chain.

11) One Question I'm Still Thinking About

If Samsung and Intel eventually close the yield gap and offer competitive advanced packaging, how quickly could the TSMC concentration risk actually unwind — and what would that mean for pricing power across the foundry layer?

12) What Comes Next

In Day 5, I'll move from manufacturing to full system integration and study AI Systems, Servers, and Racks. Once we understand how chips are designed, connected, and manufactured, the next question is how they are assembled into deployable AI infrastructure — including OEMs, server builders, cooling, and rack-scale architecture.

Continue the AI Infrastructure Study Series

This series is designed to make the AI stack easier to follow — one layer at a time, from compute and memory to networking, packaging, and system economics.

Next: Day 5 — AI Systems, Servers, and Racks