An AI infrastructure expert reveals why the real battle isn't about models - it's about who owns the compute, and why we're heading toward a massive shift from training to inference that will reshape everything.
Key Takeaways
- In 5 years, AI will be 95% inference and 5% training - completely flipping today's focus
- AMD GPUs can be 4x more cost-efficient than Nvidia, but PyTorch-CUDA lock-in keeps everyone trapped
- Google has the holy trinity: products, data, and compute ownership - making them the sleeping giant
- OpenAI doesn't own their compute, so they start "with something at their ankle" against competitors that do
- The H100 inference pricing is a bubble built on training economics that will eventually burst
- Agents and reasoning will shift computing from throughput-bound to latency-bound, attacking Nvidia's strengths
- Most companies are paying Nvidia's 90% margin plus a ~30% cloud provider markup, leaving them only a thin slice of profit
- Memory access speed, not raw compute power, will determine the next generation of AI chip winners
- Current data center investments are still chasing training when the real money will be in inference infrastructure
- Switching hardware providers is so costly that even a 7x improvement isn't enough to get people to move
The Great Hardware Deception: Why Everyone's Locked Into Nvidia's Expensive Ecosystem
Here's something that might shock you: you can get four times better cost efficiency by switching from Nvidia to AMD GPUs for running large language models. Four times. That's not a small improvement - that's the difference between profit and bankruptcy for many AI companies. So why isn't everyone making this switch immediately?
The answer reveals one of the most brilliant business strategies in tech history. Steve Morin, whose company ZML helps run any model on any hardware, explains the trap the entire industry is caught in: "Probably the most important reason is the PyTorch-CUDA duo, and that's very, very hard to break. These two are very much intertwined."
Think of it like this: PyTorch is the framework most people use to build and train AI models, and it was built specifically to work with CUDA, which is Nvidia's software platform. The two are so deeply connected that while you technically can run PyTorch on other chips like AMD or Apple, there are always "tens of little details that don't exactly run like you would expect."
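To make that concrete, here's a minimal sketch of what device selection looks like from the PyTorch side (assuming a stock PyTorch install; AMD's ROCm builds deliberately present themselves through the same "cuda" device name):

```python
import torch

# PyTorch nominally abstracts the backend: the same code is meant to run on
# CUDA (Nvidia), ROCm (AMD, which reuses the "cuda" device name), and MPS
# (Apple). In practice, operator coverage, numerics, and speed differ per backend.
def pick_device() -> torch.device:
    if torch.cuda.is_available():          # true for both CUDA and ROCm builds
        return torch.device("cuda")
    if torch.backends.mps.is_available():  # Apple silicon
        return torch.device("mps")
    return torch.device("cpu")

device = pick_device()
x = torch.randn(4, 4, device=device)
y = x @ x  # portable in principle; which kernels exist, and how fast, varies
```

The catch is that "runs" and "runs well" are different things: the same few lines can hit missing kernels or silent slowdowns the moment you leave Nvidia hardware.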
But the lock-in goes deeper than just technical compatibility. There's a self-perpetuating cycle happening in the market. Cloud providers buy mostly Nvidia GPUs because their customers want Nvidia. Customers want Nvidia because that's what they've trained on and they can reuse their code. So cloud providers keep buying Nvidia because that's what customers will rent.
"There's like this self-perpetuating circle of people just buy Nvidia because they want to resell, and people just use Nvidia because it's there," Morin explains. "But it's by far not the most efficient platform and arguably even in terms of software it's not the best software platform."
The switching costs are so high that being incrementally better isn't enough. Morin has seen this firsthand: "I know for a fact that being seven times better in whatever metric you want - whether it's spend, whether it's performance - it's not enough to get people to switch. People will choose nothing over something."
This creates a fascinating market dynamic where the best technology doesn't necessarily win - the most embedded technology wins. And Nvidia has spent decades making sure they're so deeply embedded in the development stack that extracting them would require rebuilding everything from scratch.
The Training vs. Inference Revolution That's Coming
Most people in AI are still thinking about training - building bigger models, running longer training runs, buying more compute for research. But Morin sees a massive shift coming that will flip the entire industry upside down.
"In five years I would say 95% inference, 5% training," he predicts. That's a complete reversal from where much of the focus is today.
The difference between training and inference isn't just technical - it's philosophical. Training is like doing research. You want more of everything - more GPUs, more memory, more interconnect bandwidth. You're constantly iterating, changing things, seeing how they work. "It's like changing the wheel of a moving car," Morin says.
Inference is production. It's the exact opposite mindset. You want reliability, efficiency, predictability. You don't want to be waking up at 3 AM because your production system is down. "Training is research and inference is production, and it's fundamentally different in terms of infrastructure."
The key difference comes down to interconnects - the high-speed connections between GPUs. For training massive models, you need these incredibly fast connections so thousands of GPUs can work together. For inference, you often want to avoid interconnects entirely if possible, because they add complexity and failure points.
This is why current models are designed to fit on single machines or small clusters. It's not just about technical capability - it's about making deployment practical for production environments.
But here's where it gets really interesting: most of the infrastructure being built today is still optimized for training. All those massive data center investments from Meta, Microsoft, and others? They're buying Nvidia H100s designed for training workloads, not the specialized inference chips that will actually matter in a few years.
The Coming Memory Wars: Why Speed Trumps Size
Everyone focuses on how much compute power AI chips have, but Morin points to a different battleground that will determine the winners: memory access speed. This is where the next generation of AI applications will live or die.
Current GPUs are basically a clever hack. They were designed to render graphics - pixels on screens - which happen to involve a lot of parallel matrix operations similar to what AI needs. "It was always a cool trick and very successful, but it was not dedicated for this," Morin explains.
The problem becomes obvious when you look at what's actually happening during AI inference. Your model might be stored in high-bandwidth memory (HBM), but accessing that memory is still relatively slow compared to on-chip memory called SRAM. "HBM compared to SRAM is absolutely slow," he says.
This is why companies like Cerebras and Groq are building chips with massive amounts of SRAM directly on the chip. Cerebras has 44 gigabytes of SRAM on what they call their "wafer scale engine" - a chip the size of an entire wafer that needs water cooling and copper needles touching the chip directly.
The trade-off is brutal: SRAM is incredibly fast but incredibly expensive. More SRAM means bigger chips, which means lower yields and higher costs. But for certain AI applications - especially the coming wave of agents and reasoning systems - that speed difference is everything.
"When you do inference, single stream, the data is right in the chip so you don't have to get it from memory which is slow, which GPUs have to do," Morin explains. This is why Groq can achieve 80% cost savings compared to Nvidia for certain workloads - they're optimized for the actual bottleneck.
The Google Sleeping Giant: Who Really Owns the AI Stack
While everyone obsesses over OpenAI versus Anthropic versus whoever released a model this week, Morin sees a completely different competitive landscape. He thinks in terms of what he calls "the triangle of win" - the three things you need to win in AI: products, data, and compute.
"Who has all three? Google," he says simply.
This isn't just about having good technology in each area - it's about owning your entire stack. Google has Android, Google Docs, Gmail, and countless other products that generate data. They have that data. And crucially, they own their own compute infrastructure with TPUs.
Compare that to OpenAI, which everyone thinks of as the AI leader. "OpenAI is amazing, but it's not their compute," Morin points out. "Ultimately if you don't own your compute, you're starting with something at your ankle."
Microsoft, even though they're OpenAI's partner, still has to buy Nvidia chips at massive margins. "I talk to a lot of people that build data centers and I ask them, 'Do you get at least a discount or something?' and they're like 'No, the only thing we get is the supply.'"
The math is brutal: "TSMC sells you at 60% margin, Nvidia sells you at 90% margin, and on top of that there's Amazon that takes let's say a 30% margin. So you are a very thin crust on a very big cake."
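Taking those quoted figures at face value and treating each one as a margin on the selling price, a quick sketch shows how the markups compound before the AI company adds its own:

```python
# How the quoted margins stack up for whoever rents the GPU at the end.
# "Margin" is taken as a fraction of the selling price, so price = cost / (1 - margin).
fab_cost     = 1.0                        # normalized silicon input cost
tsmc_price   = fab_cost     / (1 - 0.60)  # TSMC sells at ~60% margin
nvidia_price = tsmc_price   / (1 - 0.90)  # Nvidia sells at ~90% margin
cloud_price  = nvidia_price / (1 - 0.30)  # cloud provider resells at ~30% margin

print(f"End customer pays ~{cloud_price:.0f}x the raw silicon cost")
# ~36x - and the AI company's own margin is whatever thin crust sits on top
```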
Google can just sidestep all of that. They can run their models on TPUs, avoid the Nvidia tax entirely, and keep all those margins for themselves. The problem is that outside of Google, TPUs haven't been a commercial success because of software compatibility issues.
But if Google ever decides to seriously compete - if they're not busy with internal reorganizations - they have structural advantages that are almost impossible to overcome.
The Agent Revolution: Why Nvidia's Dominance Might Crack
The current wave of AI applications - ChatGPT, Claude, Gemini - all work the same way. You ask a question, they generate tokens one by one, and you see the response streaming back. This plays perfectly to GPU strengths because they're optimized for throughput - generating lots of tokens for lots of users simultaneously.
But the next wave of AI applications will work completely differently, and this is where Morin sees Nvidia becoming vulnerable.
Agents and reasoning systems don't care about streaming tokens. You don't want to watch an AI agent "think" for 30 seconds as it slowly generates its reasoning. You want it to think quickly and then give you a complete answer. This shifts everything from throughput-bound to latency-bound computing.
"For agents and reasoning you need to wait until the end of the request to get whatever it is you came for," Morin explains. "You only care about how much time does it take between the beginning of my request and the end."
Current GPUs can generate 10,000 tokens per second across many users, but they can't give you 10,000 tokens per second for just your single request. For agents, that's exactly what you want - maximum speed for individual requests, not maximum throughput across all users.
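A quick illustration of that gap, with assumed numbers:

```python
# Aggregate throughput vs. single-stream latency (all numbers assumed).
# A batched GPU server emits many tokens per second in total, but each
# individual request only advances at its own share of that rate.
aggregate_tokens_per_s = 10_000   # across all concurrent requests
concurrent_requests    = 200      # users being served in the same batch

per_stream_tokens_per_s = aggregate_tokens_per_s / concurrent_requests  # 50 tok/s
agent_reasoning_tokens  = 5_000   # hidden "thinking" before the final answer

wait_s = agent_reasoning_tokens / per_stream_tokens_per_s
print(f"User-visible wait for one agent step: ~{wait_s:.0f} seconds")  # ~100 s
```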
This is where those expensive SRAM-heavy chips like Cerebras and Groq become game-changers. They can deliver extremely high token rates for single users because all the data is right there on the chip, no waiting for memory access.
The shift to reasoning systems goes even deeper. Current models reason "in tokens" - basically thinking out loud in English. But the future is "latent space reasoning" where models think in their internal representation without converting to language. This is much more efficient but requires completely different compute architectures.
"GPUs cannot deliver this at scale, plain and simple," Morin says. "The access to external memory prevents it."
The DeepSeek Wake-Up Call: Efficiency vs. Brute Force
When DeepSeek's models achieved GPT-4 level performance at a fraction of the cost, it sent shockwaves through the AI world and triggered a brief market sell-off. But Morin wasn't surprised at all.
"Constraint is the mother of innovation," he explains. "They had no choice. Here's the thing - if you can buy more, why would you optimize? You can just buy more. So if you are pushed to efficiency, then you will deliver efficiency."
The American approach to AI has been pure brute force: more compute, more data, bigger models. This works when you have unlimited capital and access to the latest chips. But it's not necessarily the smartest approach.
DeepSeek, constrained by export restrictions and limited access to cutting-edge hardware, had to find ways to do more with less. And they succeeded spectacularly, showing that much of the industry's compute spending might be wasteful.
"There's two approaches to scaling," Morin says. "One is we still scale but there's a lot of waste and excess spending on the engineering side, which is the Deep Seek approach. The other approach is Yan LeCun's approach, which is this is not scaling and at some point we need to look the problem in the face and do something better."
The efficiency gains from DeepSeek weren't just impressive - they were existentially threatening to companies built on the assumption that you need massive compute budgets to compete. "Suddenly efficiency is in," Morin notes.
The Data Center Bubble: Why Current Infrastructure Investments Might Be Misguided
Meta is spending $60-65 billion on data centers. Microsoft is spending $80 billion. These are staggering numbers that represent a massive bet on the future of AI infrastructure. But Morin thinks much of this spending is still fighting the last war.
"They're still going after training," he observes. All of this infrastructure is being built around Nvidia H100s and similar chips designed for training workloads. But if the future really is 95% inference and 5% training, then optimizing for training is like building a highway system for horse-drawn carriages.
The problem is worse than just buying the wrong type of compute. The entire economics of data center operations are built around utilization, but most AI companies are terrible at this. "If you deploy inference, the number one thing that will get you is autoscaling," Morin explains.
Instead of dynamically provisioning compute as needed, most companies run their AI infrastructure at full capacity 24/7 because scaling up and down is so complex. "You end up saying 'I have a thousand GPUs 24 hours a day; even if there's nobody in production, I will pay for them' - which, mind you, is what people are doing today. This is crazy."
The efficiency gains from proper autoscaling are massive - "five, sometimes 10x improvement" in cost efficiency. But the current infrastructure and software stacks make this incredibly difficult.
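A toy diurnal demand curve, with assumed numbers, shows where a multiple like that can come from:

```python
# Static provisioning pays for peak capacity around the clock; autoscaling pays
# roughly for the area under the demand curve plus headroom. Numbers assumed.
peak_gpus   = 1000
hourly_rate = 2.50   # $/GPU-hour (assumed)

# Fraction of peak demand for each hour of the day (assumed, spiky profile)
demand = [0.05] * 8 + [0.2] * 4 + [0.9] * 2 + [0.3] * 4 + [0.1] * 6

static_cost     = peak_gpus * hourly_rate * 24
autoscaled_cost = sum(peak_gpus * d * 1.2 for d in demand) * hourly_rate  # 20% headroom

print(f"static:     ${static_cost:,.0f}/day")
print(f"autoscaled: ${autoscaled_cost:,.0f}/day "
      f"(~{static_cost / autoscaled_cost:.1f}x cheaper)")
```

The spikier the real load, the closer that gap gets to the 5-10x Morin describes.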
There's also a perverse incentive structure. Because provisioning compute is so hard and expensive, companies over-buy just to make sure they don't run out. This creates artificial scarcity because compute is sitting unused while other companies can't get access to it.
Morin predicts this will lead to an oversupply situation: "I very much worry there will be an oversupply of these chips. Somewhere in the US there's going to be a data center with like a thousand GPUs that people may buy for 30 cents on the dollar."
The Memory Architecture Revolution: What Comes After GPUs
While everyone focuses on Nvidia versus AMD versus whoever else, Morin sees a much more fundamental shift coming in chip architecture. The bottleneck isn't compute power - it's memory access patterns.
"The next frontier is called compute in memory," he explains. Instead of moving data between memory and processors, you bring the processing directly to where the data lives. Two companies he's watching are Rain AI and Fractile, which are building chips based on this principle.
This isn't just an incremental improvement - it's a completely different approach to chip design. Current architectures are based on the von Neumann model, where memory and processing are separate units. Compute-in-memory architectures blur that distinction.
For AI workloads, especially the coming wave of reasoning and agent applications, this could be transformative. "It makes it much more efficient. You get maybe not SRAM level performance, but you get much faster performance in terms of compute."
The practical result would be AI systems that can think much faster without the memory access delays that plague current architectures. "You want your model to maybe think for like half a second and then boom, right? You don't want to wait 50 seconds."
But this technology is still early. "Maybe not this year, but it's coming," Morin says.
The Platform Wars: Why Software Will Determine Hardware Winners
Despite all the focus on chips and hardware, Morin believes the real battle will be won at the software layer. His company ZML is betting that abstraction will beat specialization.
"The thing with Nvidia is that they spend a lot of energy making you care about stuff you shouldn't care about," he says. "Who gives a damn about CUDA? I don't want to care about that. I want to do my stuff."
The parallel is obvious: nobody cares whether their laptop has an M2 or M3 chip. They just want it to work well. "Imagine if you had to care about these things - that would be insane."
If software can successfully abstract away the hardware differences, then providers will compete on specs and price rather than ecosystem lock-in. "If the software abstracts away those Nvidia idiosyncrasies as they do on CPUs, then the providers will compete on specs and not on fake moats."
This is why the switching costs are so crucial. Right now, changing hardware providers requires months of engineering work and significant risk. But if that barrier disappears, a 30% cost improvement becomes worth switching for, instead of requiring a 7x leap.
The key insight is that making the buy-in zero changes everything. "If the buy-in is zero, you don't need to worry about this. You just buy whatever is best today."
This is fundamentally different from the current model where you have to commit to entire ecosystems. Instead, you could run different workloads on different chips simultaneously, automatically routing to whatever offers the best price-performance at any given moment.
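Here's a minimal sketch of what that could look like once such an abstraction layer exists - every backend name and number below is hypothetical:

```python
from dataclasses import dataclass

# If the software layer hides backend differences, a scheduler can simply pick
# whatever hardware is cheapest per unit of work right now.
@dataclass
class Backend:
    name: str
    tokens_per_s: float    # sustained decode rate for this model (assumed)
    cost_per_hour: float   # $/accelerator-hour (assumed)

    def cost_per_million_tokens(self) -> float:
        return self.cost_per_hour / (self.tokens_per_s * 3600) * 1e6

backends = [
    Backend("nvidia-h100", tokens_per_s=1500, cost_per_hour=4.00),
    Backend("amd-mi300x",  tokens_per_s=1400, cost_per_hour=2.50),
    Backend("sram-asic",   tokens_per_s=6000, cost_per_hour=9.00),
]

best = min(backends, key=lambda b: b.cost_per_million_tokens())
print(f"route to {best.name}: ~${best.cost_per_million_tokens():.2f} per 1M tokens")
```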
The Predictions: What the Next 5 Years Look Like
Based on all these trends, Morin makes some bold predictions about where AI infrastructure is heading:
The shift to inference-first thinking will accelerate. "95% inference, 5% training" means most infrastructure investments should be optimized for production workloads, not research workloads.
Memory architecture will become the key differentiator. Companies that can solve the memory access problem will win, while pure compute power becomes commoditized.
Hardware abstraction will break Nvidia's moat. Once switching costs drop to zero, the 90% margins become unsustainable and providers will compete on price and performance.
Agents and reasoning will drive demand for single-stream performance over aggregate throughput, favoring specialized chips over general-purpose GPUs.
The current data center investments will create an oversupply bubble as infrastructure optimized for training becomes less relevant.
But the biggest prediction is about market structure: "Google has the products, the data, and the compute. They have everything. They can sprinkle it everywhere. This is the sleeping giant in my mind."
If Google ever decides to seriously compete in AI - and if they're not too distracted by internal reorganizations - they have structural advantages that might be impossible to overcome. They can optimize their entire stack, avoid the hardware margins that squeeze everyone else, and leverage data from billions of users.
The current AI boom feels like it's all about models and training runs and research breakthroughs. But the real war is being fought at the infrastructure level, and most people aren't even paying attention to it yet.