In the rapidly evolving landscape of artificial intelligence, few technologies have shifted the paradigm as dramatically as diffusion models. While Large Language Models (LLMs) often dominate the headlines, diffusion has quietly become the backbone of modern generative AI—powering everything from hyper-realistic image generation to Nobel Prize-winning breakthroughs in biology.
In a recent discussion on Decoded, Francois Chaubard, a Y Combinator visiting partner and researcher at Stanford, broke down the mechanics, evolution, and future of this critical technology. With a background starting in Fei-Fei Li’s lab and a decade spent deploying computer vision systems, Chaubard argues that diffusion is not just a tool for creating art—it is a fundamental machine learning framework that every founder and researcher must understand.
Key Takeaways
- Universal Distribution Learning: Diffusion is a framework for learning probability distributions, excelling at mapping between high-dimensional spaces (like images or robot actions), even in low-data regimes.
- Evolution Toward Simplicity: The field has moved from complex thermodynamic schedules to "Flow Matching," a method that simplifies the math into predicting a straight-line velocity between noise and data.
- The "Squint Test" for AGI: Unlike the linear, token-by-token processing of LLMs, diffusion mimics biological processes by leveraging randomness and refining concepts iteratively.
- Beyond Images: While famous for Stable Diffusion, the technology is now state-of-the-art in protein folding, robotic manipulation, weather forecasting, and material science.
Defining Diffusion: The Art of Controlled Noise
At its core, diffusion is a method for learning data distributions. All machine learning models attempt to map inputs to outputs, but diffusion stands out when mapping from one high-dimensional space to another, even when training data is scarce.
The process is built on a counter-intuitive principle: destruction creates the path for construction. To train a model, you take a data sample—such as an image—and iteratively add noise until the original structure is obliterated, leaving only random static. The model’s objective is not to create an image from scratch, but to learn the reverse process: how to remove that noise step-by-step to recover the original data.
"You take some sample of the data... and we just hit it with noise. And then we just keep hitting it with noise... It's very easy to create noisy images. It's hard to walk backwards and create from noise images of you."
This creates a "noiser" and a "denoiser." The "noiser" is a simple mathematical function that destroys information. The "denoiser" is the deep learning model trained to reconstruct it. By learning to navigate from chaos back to order, the model effectively learns the underlying structure of the data.
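This training recipe is compact enough to sketch directly. Below is a minimal, illustrative PyTorch version of the noiser/denoiser pair in the classic (DDPM-style) formulation; the `denoiser_net(x_t, t)` interface and the hyperparameters are assumptions for the sketch, not details from the talk.

```python
import torch

T = 1000                                        # number of noising steps
betas = torch.linspace(1e-4, 0.02, T)           # per-step noise variances
alpha_bars = torch.cumprod(1.0 - betas, dim=0)  # cumulative signal fraction

def noiser(x0, t, eps):
    """Destroys information: mixes clean data x0 with Gaussian noise eps at step t."""
    a = alpha_bars[t].view(-1, 1, 1, 1)
    return a.sqrt() * x0 + (1 - a).sqrt() * eps

def denoising_loss(denoiser_net, x0):
    """Trains the denoiser: predict the exact noise that was added."""
    t = torch.randint(0, T, (x0.shape[0],))
    eps = torch.randn_like(x0)
    x_t = noiser(x0, t, eps)
    return torch.nn.functional.mse_loss(denoiser_net(x_t, t), eps)
```

Note that the noiser is a fixed formula with no learned parameters; all of the intelligence lives in the denoiser network.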
The Evolution: From Thermodynamics to Flow Matching
Since the seminal 2015 paper by Jascha Sohl-Dickstein and colleagues, diffusion has undergone a radical evolution. Early iterations relied on complex noise schedules and machinery borrowed from non-equilibrium thermodynamics. Researchers spent years "hill climbing" on metrics like the Fréchet Inception Distance (FID) to improve image quality, experimenting with how noise was added and how loss functions were calculated.
The Breakthrough of Flow Matching
The most significant leap in recent years is a technique known as Flow Matching. Traditional diffusion models often took a circuitous, wandering path to get from noise to data. This required complex calculations and many steps at inference time.
Flow Matching simplifies this by asking a geometric question: What is the most direct path between the noise distribution and the data distribution? The answer is a straight line—a velocity vector.
"There is a global velocity between the noise and the data, and it's just this straight line... I don't care where you are, go in that line."
This shift has profound engineering implications. It reduces what was once a complex implementation, involving hand-tuned noise (beta) schedules and intricate loss functions, into roughly 15 lines of code. The model simply learns to predict the velocity vector required to move from the current noisy state toward the clean data. This abstraction makes the technology not only more powerful but also significantly easier to implement.
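That "15 lines" claim is easy to check. Here is a hedged sketch of the flow-matching objective and sampler; `model(x_t, t)` stands in for any network that predicts velocity, and the shapes assume image-like batches:

```python
import torch

def flow_matching_loss(model, x1):
    """x1: a batch of clean data. Train the model to predict the straight-line velocity."""
    x0 = torch.randn_like(x1)             # pure-noise endpoint
    t = torch.rand(x1.shape[0], 1, 1, 1)  # random time in [0, 1]
    x_t = (1 - t) * x0 + t * x1           # straight-line interpolation
    v_target = x1 - x0                    # constant velocity along that line
    return ((model(x_t, t) - v_target) ** 2).mean()

@torch.no_grad()
def sample(model, shape, steps=50):
    """Integrate the learned velocity field from noise (t=0) toward data (t=1)."""
    x = torch.randn(shape)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0], 1, 1, 1), i * dt)
        x = x + model(x, t) * dt          # simple Euler step
    return x
```

Compared with classic diffusion, there is no beta schedule to tune and no noise-prediction reparameterization; the loss and the sampler fall directly out of the straight-line picture.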
The "Squint Test": Diffusion vs. LLMs
When discussing the path to Artificial General Intelligence (AGI), comparisons to human cognition are inevitable. Yann LeCun famously used an analogy about the invention of flight: humanity didn't need flapping wings to fly, but it did need airfoils. We needed to "squint" at biology to find the underlying principle without copying the exact mechanism.
Chaubard applies this "squint test" to modern AI architectures, contrasting autoregressive Large Language Models (LLMs) with diffusion models.
The Limitations of Autoregression
Current LLMs generate text one token at a time. They are linear and generally cannot revise a previous thought once it is output. This contrasts sharply with human cognition, which involves recursion, revision, and thinking in broad concepts before refining them into words.
Why Diffusion Mimics Biology
Diffusion models possess two distinct characteristics that align more closely with biological intelligence:
- Constructive Randomness: Biology is inherently noisy. Neurons fire stochastically. Diffusion leverages randomness as a feature, not a bug, using it to explore the solution space effectively.
- Iterative Refinement: Rather than emitting a final answer instantly, diffusion starts with a vague concept (noise) and iteratively refines it into a sharp result. This mirrors how the human brain refines thoughts or motor actions.
"I definitely don't think in one token at a time... I'm thinking in concepts... I think diffusion gives me both of those things for sure."
Beyond Generative Art: Real-World Applications
While the public associates diffusion largely with tools like Midjourney or Stable Diffusion, the technology has silently "eaten" almost every other sector of AI research. It has proven effective wherever there is a need to map high-dimensional inputs to complex outputs.
Scientific Discovery
In the life sciences, diffusion is driving a new era of discovery. DeepMind's AlphaFold, whose protein-structure predictions recently contributed to a Nobel Prize, uses a diffusion process in its latest version to generate protein structures. Startups are using similar models (like DiffDock) to predict how small molecules bind to proteins, accelerating drug discovery.
Robotics and Physical Action
Perhaps the most transformative application is in "Diffusion Policy" for robotics. Robots operate in continuous, high-dimensional action spaces. Traditional methods struggle with the nuance of physical movement. Diffusion allows robots to learn complex policies—from driving autonomous vehicles to manipulating household objects—by treating robot trajectories effectively as "images" of motion that can be denoised.
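Using the flow-matching formulation sketched earlier, the diffusion-policy idea can be summarized in a few lines. The `policy_net(traj, t, obs)` interface, horizon, and action dimension below are illustrative assumptions:

```python
import torch

@torch.no_grad()
def sample_action_chunk(policy_net, obs, horizon=16, action_dim=7, steps=50):
    """Denoise an action trajectory (horizon x action_dim) conditioned on an observation."""
    traj = torch.randn(1, horizon, action_dim)  # start from pure noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1,), i * dt)
        v = policy_net(traj, t, obs)            # velocity prediction, conditioned on obs
        traj = traj + v * dt                    # refine the whole trajectory at once
    return traj.squeeze(0)                      # actions for the robot to execute
```

Conditioning on the observation is what turns a generic denoiser into a policy: the same trajectory-as-image trick, steered by what the robot currently sees.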
Weather and Code
The versatility continues with models like GenCast, which uses diffusion for state-of-the-art weather forecasting, and new experiments in applying diffusion to code generation, moving beyond token-by-token autoregressive decoding.
Strategic Implications for Builders
For founders and researchers, the dominance of diffusion presents a clear strategic signal. The technology is scaling rapidly; just as image generation improved a thousand-fold in five years, other domains like robotics and biology are poised for similar exponential growth.
Currently, the only major areas where diffusion is not the undisputed state-of-the-art are language modeling (dominated by Transformers) and discrete gameplay (dominated by Monte Carlo Tree Search). However, even in these fields, research is blurring the lines.
The advice for builders is straightforward: if you are training models, diffusion should likely be a core component of your stack, even if only for learning latent representations. If you are building products on top of models, you must "update your priors" on what is possible. Problems that seemed intractable due to data scarcity or complexity—like general-purpose robotics—may now be solvable.
Conclusion
Diffusion has graduated from an experimental technique for generating pixelated images to a fundamental pillar of modern AI. By simplifying the mathematics through Flow Matching and proving its versatility across disciplines, it offers a robust alternative to autoregressive models.
As the technology continues to mature, its ability to handle randomness and iterative refinement suggests it will play a central role in bridging the gap between current narrow AI and the flexible, general intelligence of the future.