The AI Chip Wars: NVIDIA vs AMD vs Custom Silicon
By Deepak Iyer
NVIDIA controls 80% of the AI training chip market. AMD's MI300X is gaining ground. Google, Amazon, and Apple are building their own silicon. The battle for AI compute is the most important hardware fight of the decade — and it's far from settled.
Every model you've ever used — ChatGPT, Claude, Gemini, DeepSeek — runs on chips. Lots of chips. The AI revolution isn't a software story. It's a hardware story with software on top. And the company that controls the hardware controls the economics of the entire industry.
Right now, that company is NVIDIA. But the challengers are closer than they've been in years.
## NVIDIA: The $5 Trillion Monopoly
Let's start with the numbers, because they're ridiculous.
NVIDIA posted $130.5 billion in revenue for fiscal year 2025 (ending January 2025). Operating income: $81.5 billion. Net income: $72.9 billion. For a semiconductor company, these margins are obscene. Apple makes phones with 40% margins and people call it a money machine. NVIDIA makes GPUs with roughly 56% net margins.
The company surpassed $5 trillion in market capitalization in late 2025, making it — depending on the day — the most valuable company on earth. Jensen Huang, the CEO who co-founded NVIDIA in 1993 at a Denny's in San Jose, is now one of the richest people alive.
How did a gaming GPU company become this? CUDA. In the mid-2000s, NVIDIA invested over a billion dollars developing CUDA, a software platform that let GPUs run general-purpose parallel computations. When deep learning took off in the 2010s, CUDA was the only mature platform for training neural networks at scale. Every major ML framework — TensorFlow, PyTorch, JAX — leaned on CUDA for GPU acceleration. By the time competitors realized what was happening, the software ecosystem lock-in was nearly impossible to break.
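To see what that lock-in looks like in practice, here's a minimal PyTorch sketch (assuming PyTorch is installed): the "cuda" device string is the default fast path baked into countless training scripts, which is exactly what makes switching vendors painful.

```python
import torch

# Typical training code assumes an NVIDIA GPU: "cuda" is the default fast path,
# and everything else is an afterthought.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(4096, 4096).to(device)
batch = torch.randn(32, 4096, device=device)

with torch.no_grad():
    out = model(batch)  # dispatches to cuBLAS-backed kernels when CUDA is present
print(out.shape, device)
```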
As of 2025, NVIDIA controls more than 80% of the market for GPUs used in training and deploying AI models. Their latest chips — the H100, H200, and the new Blackwell B200 series — are the standard hardware in every major data center on earth. When Anthropic commits to buying $30 billion in compute from Azure, most of that money goes to NVIDIA GPUs on the other end.
The current flagship for AI training is the B200, part of the Blackwell architecture. It's manufactured on TSMC's 4nm-class process, packs 208 billion transistors, and offers roughly 2.5x the training performance of the H100, though at a higher power draw (around 1,000W per GPU versus the H100's 700W). Data center versions ship in DGX systems: servers packing eight B200s connected via NVLink for ultra-fast inter-GPU communication.
NVIDIA's not just selling chips anymore. They're selling systems: networking (acquired Mellanox for $7 billion in 2020), software stacks, reference architectures, and entire data center designs. The competitive moat isn't just hardware. It's the fact that switching away from NVIDIA means rewriting your entire software stack.
## AMD: The Credible Challenger
AMD's been the eternal bridesmaid in AI compute. That might be changing.
The MI300X, launched in December 2023 on AMD's CDNA 3 architecture, was the company's first serious shot at NVIDIA's AI dominance. Built on a 5nm process with 153 billion transistors, 304 compute units, and 192GB of HBM3 memory, the MI300X offers more memory than any NVIDIA competitor at its price point. In HPC workloads, its peak FP64 matrix performance — 163.4 TFLOPS — is genuinely competitive with NVIDIA's best.
The memory advantage matters more than most people realize. Large language models are memory-bound during inference. If you can fit more of the model into GPU memory, you need fewer chips, which means lower costs. The MI300X's 192GB of HBM3 — versus the H100's 80GB of HBM3 — means you can run larger models on fewer cards. For inference-heavy workloads, that's a real economic argument.
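Here's the back-of-the-envelope math, as a rough sketch that counts only FP16 weights and ignores KV cache, activations, and framework overhead:

```python
def gpus_needed(params_billion: float, bytes_per_param: float, gpu_mem_gb: float) -> int:
    """Rough GPU count needed just to hold the weights.

    Ignores KV cache, activations, and framework overhead, which all add a
    real-world margin on top of this floor.
    """
    weight_gb = params_billion * bytes_per_param  # 1e9 params x N bytes ~= GB
    return int(-(-weight_gb // gpu_mem_gb))       # ceiling division

# A 70B-parameter model in FP16 (2 bytes per parameter) is ~140 GB of weights.
for name, mem_gb in [("H100, 80 GB HBM3", 80), ("MI300X, 192 GB HBM3", 192)]:
    print(f"{name}: {gpus_needed(70, 2, mem_gb)} GPU(s) just for weights")
```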
AMD followed up with the MI325X in October 2024, bumping to 256GB of HBM3E at 6TB/s bandwidth. And the MI350X, slated for mid-2025, moves to CDNA 4 on a 3nm process with 288GB of HBM3E at 8TB/s. The trajectory is clear: AMD is closing the hardware gap chip by chip.
But here's the problem: ROCm, AMD's answer to CUDA, still isn't there. Every developer who's tried to port CUDA code to ROCm has a war story. Libraries that don't compile. Kernels that run 30% slower. Documentation that ends mid-sentence. AMD's been pouring resources into ROCm — and it's genuinely better than it was two years ago — but the software gap remains the single biggest obstacle to adoption.
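To be fair, the porting story at the framework level is better than the war stories suggest: PyTorch's ROCm builds route the familiar torch.cuda calls through HIP, so simple code like the sketch below should run unmodified on a supported AMD GPU. The pain lives lower in the stack, in library coverage and kernel performance.

```python
import torch

# On PyTorch's ROCm builds, the torch.cuda API is backed by HIP, so
# torch.cuda.is_available() returns True on a supported AMD GPU and the
# same "cuda" device string works. torch.version.hip is set only on ROCm builds.
backend = "ROCm/HIP" if getattr(torch.version, "hip", None) else "CUDA (or CPU-only)"
print("PyTorch build:", backend)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.randn(1024, 1024, device=device)
y = x @ x.T  # same code path on NVIDIA and AMD; the gap is in the libraries
             # and kernel performance underneath, not the Python API
print(y.shape, device)
```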
The Frontier supercomputer at Oak Ridge National Lab runs on AMD MI250X GPUs and held the title of world's fastest supercomputer from 2022 until late 2024, when El Capitan, another AMD-powered machine, took the top spot. That's proof the hardware works at extreme scale. But supercomputers have dedicated software teams that can hand-optimize for specific hardware. Startups running PyTorch training loops need things to just work. Right now, NVIDIA just works. AMD mostly works, if you squint.
AMD's AI datacenter revenue reportedly hit about $5 billion in 2024. Growing fast, but against NVIDIA's roughly $130 billion? It's a rounding error. AMD needs the software ecosystem to mature, and they need a few anchor customers to prove out large-scale AI training on MI300X/MI350X clusters. Meta's been testing MI300X internally, which could be the inflection point.
## Google TPUs: The In-House Advantage
Google took a different approach: build your own chips.
The Tensor Processing Unit effort started in 2013, when Google recruited Amir Salek to establish custom silicon capabilities. The first TPU was deployed internally in 2015, and by 2018 Google was offering TPUs to external customers through Google Cloud. We're now on TPU v7.
TPUs aren't general-purpose GPUs. They're ASICs — application-specific integrated circuits designed from the ground up for tensor operations. The original 2017 paper showed TPUs achieving 15-30x higher performance and 30-80x higher performance-per-watt compared to contemporary CPUs and GPUs for neural network inference. Those numbers have only improved.
Google uses TPUs to train all its Gemini models. Anthropic signed a deal in October 2025 for access to up to one million TPUs, with well over a gigawatt of AI compute capacity coming online in 2026. That's not a side project. That's industrial-scale AI infrastructure.
The advantage of custom silicon is total vertical integration. Google designs the chip, builds the compiler (XLA), writes the framework (JAX), and trains the model. Every layer of the stack is optimized for every other layer. No wasted transistors. No compatibility layers. No CUDA dependency.
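A rough sketch of what that looks like from the developer side (assuming JAX is installed): the same jit-compiled Python lowers through XLA to whatever backend is available, whether that's a CPU, a GPU, or a Cloud TPU.

```python
import jax
import jax.numpy as jnp

# XLA compiles this once per input shape; the same Python code lowers to
# CPU, GPU, or TPU kernels depending on which backend JAX finds.
@jax.jit
def attention_scores(q, k):
    return jax.nn.softmax(q @ k.T / jnp.sqrt(q.shape[-1]), axis=-1)

q = jnp.ones((8, 64))
k = jnp.ones((8, 64))
print(attention_scores(q, k).shape)  # (8, 8)
print(jax.devices())                 # e.g. [TpuDevice(...)] on a Cloud TPU VM
```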
The disadvantage is that you can't buy a TPU and put it in your own data center. TPUs are only available through Google Cloud. If you're not comfortable being locked into GCP, TPUs aren't an option. That's by design — Google wants to sell cloud compute, not hardware.
For Google's own workloads and their cloud customers, TPUs offer a genuine alternative to NVIDIA. For everyone else, they're irrelevant.
## Amazon Trainium: The Cost Play
Amazon saw what Google was doing and said "me too." Trainium, Amazon's custom AI training chip, is now in its second generation (Trainium2), and Amazon claims it offers up to 40% better price-performance than comparable GPU instances.
Trainium2 powers Amazon's Trn2 EC2 instances and UltraClusters — configurations of up to 100,000 chips connected via high-bandwidth networking. The pitch is simple: if you're already on AWS (and most of the internet is), you can train on Trainium for less money than renting NVIDIA GPUs.
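One caveat worth spelling out: "40% better price-performance" doesn't mean a 40% smaller bill. A quick sketch with made-up numbers (these are not AWS list prices) shows the arithmetic:

```python
# Hypothetical figures purely for illustration -- not actual AWS pricing.
gpu_instance_cost_per_hour = 40.0  # assumed hourly cost of a GPU-based instance
training_hours = 1_000             # assumed length of a training run

gpu_bill = gpu_instance_cost_per_hour * training_hours

# 40% better price-performance means the same work costs about 1/1.4 as much,
# i.e. roughly a 29% smaller bill for an equivalent run.
trainium_bill = gpu_bill / 1.4

print(f"GPU run:      ${gpu_bill:,.0f}")
print(f"Trainium run: ${trainium_bill:,.0f} ({1 - trainium_bill / gpu_bill:.0%} cheaper)")
```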
The evidence is mixed. Some workloads show significant savings. Others show compatibility headaches similar to AMD's ROCm problems. Amazon's Neuron SDK is improving, but it's still young. The most telling data point: Anthropic, which made AWS its primary cloud and training partner, has since signed multi-cloud deals with Azure and GCP. If Amazon's own chips were clearly superior, Anthropic wouldn't be diversifying.
## Apple Silicon: The Inference Dark Horse
Nobody talks about Apple in the AI chip wars, and that's a mistake.
The M-series chips — from the M1 through the latest M4 generation — aren't designed for data center training. They're designed for on-device inference. But on-device inference is going to be a massive market as AI moves from the cloud to the edge.
Apple's unified memory architecture means the GPU and CPU share the same memory pool. The M3 Ultra packs up to 512GB of unified memory, which means you can run surprisingly large models locally without a data center. Developers are already running quantized versions of 70B+ parameter models on M-series Macs. Try that on a consumer NVIDIA card.
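The math is simple enough to sketch. The figures below count only the quantized weights, ignore KV cache and runtime overhead, and use a 24GB card as a stand-in for a typical high-end consumer GPU:

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weights-only footprint in GB; ignores KV cache and runtime overhead."""
    return params_billion * bits_per_weight / 8

model_gb = weight_gb(70, 4)  # a 4-bit-quantized 70B model: ~35 GB of weights

for name, mem_gb in [("typical 24 GB consumer GPU", 24), ("128 GB M-series Mac", 128)]:
    verdict = "fits" if model_gb < mem_gb else "does not fit"
    print(f"{name}: {model_gb:.0f} GB of weights {verdict} in {mem_gb} GB")
```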
Apple Neural Engine — the dedicated neural processing unit in every M-series and A-series chip — handles common inference operations at remarkable power efficiency. Apple Intelligence runs entirely on-device for privacy-sensitive tasks. As models get smaller and more efficient (which they will), Apple's edge computing advantage grows.
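For developers, targeting the Neural Engine usually goes through Core ML. A minimal sketch, assuming PyTorch and coremltools are installed; the toy model and output file name are just for illustration:

```python
import torch
import coremltools as ct

# A tiny stand-in model; real workloads would be a full vision or language model.
net = torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.ReLU())
traced = torch.jit.trace(net.eval(), torch.randn(1, 256))

# Convert to Core ML and ask the runtime to prefer the Neural Engine
# (ops the ANE doesn't support fall back to CPU).
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(shape=(1, 256))],
    convert_to="mlprogram",
    compute_units=ct.ComputeUnit.CPU_AND_NE,
)
mlmodel.save("tiny_model.mlpackage")
```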
Don't count Apple out. They're playing a different game, but it might be the game that matters most in five years.
## Who Wins?
Short answer: NVIDIA, for at least the next 2-3 years. CUDA's ecosystem lock-in is that strong.
Medium answer: the market fragments. NVIDIA dominates training. Custom silicon (TPUs, Trainium) captures cloud-native workloads. AMD takes the price-sensitive middle. Apple owns the edge.
Long answer: the winner is whoever breaks NVIDIA's software moat. That's AMD's ROCm challenge, and it's the most important open question in AI hardware. If ROCm reaches CUDA parity — and AMD's investing billions to make that happen — the AI chip market could become genuinely competitive for the first time in a decade.
Until then, Jensen Huang keeps collecting. $130 billion in revenue. $73 billion in profit. And a leather jacket that's worth more than most chip companies.
## Key Terms Explained

**Deep Learning**: A subset of machine learning that uses neural networks with many layers (hence 'deep') to learn complex patterns from large amounts of data.

**GPU**: Graphics Processing Unit. A chip built for massively parallel computation, originally for rendering graphics and now the workhorse of AI training and inference.

**Inference**: Running a trained model to make predictions on new data.

**Neural Network**: A computing system loosely inspired by biological brains, consisting of interconnected nodes (neurons) organized in layers.