Overview of Diffusion Models

Introduction to Diffusion Models

Diffusion models are a class of generative models that have gained significant attention in recent years, particularly for their impressive results in image generation. The core idea is to gradually add noise to data and then learn to reverse this process.

Notable examples include Stable Diffusion, DALL-E, Midjourney, and the recent Flux model.

Diffusion models are a type of generative model that create high-quality samples (like images) by learning to reverse a process that gradually adds noise to the data. The core idea is to start with a data sample, corrupt it with noise over time, and then learn a model that can reverse this corruption to reconstruct or generate new data.

1. Brief History of Diffusion Models

1.1. Evolution of Image Generation in Machine Learning

Date	Model	Key Innovation	Strengths	Weaknesses
2014	GANs (Goodfellow et al.)	Generator and discriminator trained in a minimax game to produce realistic images.	Sharp, realistic images	Unstable training, mode collapse
2014	VAEs (Kingma & Welling)	Probabilistic framework using latent variables for image generation.	Stable training, well-defined probabilities	Blurry images
2016	PixelCNN (van den Oord et al.)	Autoregressive model generating images pixel by pixel, with explicit likelihood.	Likelihood modeling, captures pixel dependencies	Slow, limited global coherence
2019	BigGAN	Large-scale GAN models, improving quality and diversity in image generation.	High-resolution, diverse images	Requires enormous computational resources
2019	VQ-VAE-2 (DeepMind)	Combines discrete latent spaces and hierarchy to improve image generation quality.	High-quality, diverse images	Complex architecture
2020	DALL-E (OpenAI)	Text-to-image generation, using multimodal learning to generate images from descriptions.	Breakthrough in text-to-image generation	High computational cost
2021	DDPMs (Ho et al.)	Denoising Diffusion Probabilistic Models (DDPMs): diffusion models for generating images by denoising a Gaussian noise process over several steps.	Stable training, high-quality images	Slow sampling
2022	Latent Diffusion Models (LDMs)	Diffusion models operating in a lower-dimensional latent space for faster and more efficient generation.	Fast sampling, highly realistic images	Computationally complex model design

1.2. Prerequisites for Modern Diffusion Models

Date	Concept/Model	Key Contribution
1980s–1990s	Thermodynamics & Stochastic Processes	Foundations in non-equilibrium thermodynamics and stochastic processes. Brownian motion, Fokker-Planck equations.
2004	Denoising Autoencoders (Vincent)	Early model for denoising noisy inputs, a precursor to the reverse diffusion process.
2011	Score Matching (Hyvärinen)	Technique for matching the gradient of the data distribution, central to score-based diffusion models.
2014	Variational Inference in VAEs (Kingma & Welling)	Introduced variational methods to infer latent variables, inspiring inference methods in diffusion models.
2015	Deep Generative Models with Gaussian Noise	Emphasized stochastic processes in generative modeling, including the use of Gaussian noise.
2019	Neural ODEs (Chen et al.)	Continuous-time neural network models, inspiring continuous diffusion models and variants like score-based models.
2020	Score-Based Generative Models (Song & Ermon)	Unified framework combining score matching and reverse diffusion, leading to efficient diffusion-based models.
2021	Denoising Diffusion Probabilistic Models (Ho)	Formalized forward and reverse diffusion processes using variational inference, making diffusion models practical.

Mathematical Foundations of Diffusion Models

High-Level Intuition

Imagine starting with a clear image and progressively adding small amounts of noise over time until it becomes pure noise. If we can somehow manage to reverse this process in a non-deterministic way, we can generate novel, high-quality samples from a learned data distribution starting from nothing but random noise (basically, from scratch).

Indeed, the success of diffusion models shows that this is indeed possible. Diffusion models are a type of generative model that create high-quality samples (like images) by learning to reverse a process that gradually adds noise to the data. The core idea is to start with a data sample, corrupt it with noise over time, and then learn a model that can reverse this corruption to reconstruct or generate new data.

The following are the fundamental processes that make diffusion models work:

Forward Process: Start with clean data and add noise gradually over several steps until the data becomes almost indistinguishable from random noise.
Reverse Process: Learn to reverse this noise-adding process step by step, generating clean data starting from noise.

Key Concepts

Forward Diffusion Process: Gradually adding noise to data
Reverse Diffusion Process: Learning to remove noise step-by-step
Training Objective: Predicting and removing added noise
Sampling: Generating new data by denoising random noise

The Forward Diffusion Process

Mathematical Formulation

Let $x_0$ be our original data point (e.g., an image). The forward diffusion process adds Gaussian noise over $T$ timesteps:

$$q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} x_{t-1}, \beta_t I)$$

Where:

$x_t$ is the noisy version of the data at timestep $t$
$\beta_t$ is a variance schedule (typically small and increasing over time)
$\mathcal{N}(\mu, \Sigma)$ denotes a Gaussian distribution with mean $\mu$ and covariance $\Sigma$
The notation $\mathcal{N}(x; \mu, \Sigma)$ is shorthand for $x \sim \mathcal{N}(\mu, \Sigma)$
$I$ is the identity matrix

Closed Form for Noisy Data

We can directly express $x_t$ in terms of the original data $x_0$ and noise $\epsilon \sim \mathcal{N}(0, I)$:

$$x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon$$

Where $\bar{\alpha}t = \prod{i=1}^t (1 - \beta_i)$

The Reverse Diffusion Process

Mathematical Formulation

The reverse process is defined as:

$$p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$$

Where:

$\mu_\theta$ and $\Sigma_\theta$ are learned functions (typically neural networks)
$\theta$ represents the parameters of these functions

Training Objective

The model is trained to predict the noise added at each timestep. A common objective is denoising score matching:

$$ \begin{aligned} L(\theta) :=& \, \mathbb{E}_{t, x_0, \epsilon} \left[ \|\epsilon - \epsilon_\theta(x_t, t)\|^2 \right] \\ =& \, \mathbb{E}_{t, x_0, \epsilon} \left[ \left\| \epsilon - \epsilon_\theta(\sqrt{\bar \alpha_t} x_0 + \sqrt{1 - \bar \alpha_t} \epsilon, t) \right\|^2 \right] \\ \end{aligned} $$

Where:

$\epsilon \sim \mathcal{N}(0, I)$ is the noise added to the data
$\epsilon_\theta(x_t, t)$ is the model’s prediction of the total cumulated noise at timestep $t$

It is important to note that although the reverse process is iterative such that we subtract a part of the noise at each step, the objective of the model is to find the total noise at that step (therefore, recovering the original image) in a single step.

Empirically, trying to predict the noise between a single step does not work well. Intuitively, this is likely because two noisy images tend to look similar and have similar features, so the model has a hard time finding the correct noise to subtract.

Effectively, if the model were trained to predict the next intermediate timestep $x_{t}$ from $x_{t-1}$ directly, it would have to explicitly learn the full sequence of steps in the reverse process. This would complicate the training procedure:

The model would need to learn how to transition between each intermediate distribution at each step.
It would be more prone to error accumulation—if the model makes a mistake at one step, that error could propagate through subsequent steps, degrading the quality of the final output.

An explanation on why the reverse process is then iterative is mentioned in the questions section.

Sampling (Generation)

To generate new data, we start from pure noise $x_T \sim \mathcal{N}(0, I)$ and gradually reverse the diffusion process to generate a sample $x_0$ from the data distribution. This involves sampling $x_{T-1}, x_{T-2}, \dots, x_0$ by following the learned reverse process $p_\theta(x_{t-1} | x_t)$.

The sampling process can be described as:

$$x_{t-1} = \frac{1}{\sqrt{1 - \beta_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\epsilon_\theta(x_t, t)\right) + \sqrt{\beta_t}\epsilon$$

Where $\epsilon$ is fresh Gaussian noise at each step.

Key Points and Common Areas of Confusion

Gaussian Noise: Used for its mathematical simplicity and properties like the Central Limit Theorem.
Markov Process: Simplifies forward and reverse processes, making computations tractable.
Model Learning: The model learns to predict the noise added at each timestep, allowing it to denoise data and recover original samples.
Comparison to Other Models: Unlike GANs or VAEs, diffusion models use a more direct probabilistic formulation and tend to have more stable training dynamics.
Variance Schedule: The choice of $\beta_t$ can significantly impact model performance.
Continuous Time Formulation: Some papers use stochastic differential equations for a continuous time formulation.
Prediction Targets: While we’ve described predicting $\epsilon$, some formulations predict $x_0$ or other quantities.
Sampling Techniques: Various techniques like DDIM or ancestral sampling can accelerate the sampling process or provide more control.
Loss Weighting: The loss is often reweighted to focus more on less noisy timesteps, improving sample quality.

Summary of Diffusion Models

Forward Process: Gradually corrupt data by adding Gaussian noise step by step until it becomes nearly indistinguishable from random noise.
Reverse Process: Learn a model that reverses the noise-adding process step by step, gradually refining the noisy sample into clean data.
Training: Train the model to predict the noise added at each step, using a mean-squared error loss between the true noise and the model’s predicted noise.
Sampling: After training, generate new samples by starting with pure noise and applying the learned reverse process to progressively remove noise.
Iterative Procedure: The iterative denoising allows for fine control and gradual refinement of data, ensuring high-quality, stable sample generation.
Noise Reinjection: Injecting fresh noise at each step improves robustness, generalization, and diversity, and prevents overfitting to specific noise configurations.

Overview of Diffusion Models

Introduction to Diffusion Models

1. Brief History of Diffusion Models

1.1. Evolution of Image Generation in Machine Learning

1.2. Prerequisites for Modern Diffusion Models

Mathematical Foundations of Diffusion Models

High-Level Intuition

Key Concepts

The Forward Diffusion Process

Mathematical Formulation

Closed Form for Noisy Data

The Reverse Diffusion Process

Mathematical Formulation

Training Objective

Sampling (Generation)

Key Points and Common Areas of Confusion

Other Questions

1. Why Use an Iterative Procedure Rather Than One-Shotting It?

2. Why Reinject Noise at Each Step (Excluding the Last)?

3. How Do Diffusion Models Compare to GANs and VAEs?

4. What Are the Limitations of Diffusion Models?

5. How Do You Choose the Number of Timesteps?

6. Can Diffusion Models Be Used for Tasks Other Than Generation?

7. What’s the Intuition Behind the Variance Schedule?

8. How Do Diffusion Models Handle Different Types of Data?

9. What Recent Advancements Have Been Made in Diffusion Models?

10. How Do You Evaluate the Quality of Diffusion Models?

11. How is Diversity in Generated Samples Achieved?

Summary of Diffusion Models

Introduction to Diffusion Models#

1. Brief History of Diffusion Models#

1.1. Evolution of Image Generation in Machine Learning#

1.2. Prerequisites for Modern Diffusion Models#

Mathematical Foundations of Diffusion Models#

High-Level Intuition#

Key Concepts#

The Forward Diffusion Process#

Mathematical Formulation#

Closed Form for Noisy Data#

The Reverse Diffusion Process#

Mathematical Formulation#

Training Objective#

Sampling (Generation)#

Key Points and Common Areas of Confusion#

Other Questions#

1. Why Use an Iterative Procedure Rather Than One-Shotting It?#

2. Why Reinject Noise at Each Step (Excluding the Last)?#

3. How Do Diffusion Models Compare to GANs and VAEs?#

4. What Are the Limitations of Diffusion Models?#

5. How Do You Choose the Number of Timesteps?#

6. Can Diffusion Models Be Used for Tasks Other Than Generation?#

7. What’s the Intuition Behind the Variance Schedule?#

8. How Do Diffusion Models Handle Different Types of Data?#

9. What Recent Advancements Have Been Made in Diffusion Models?#

10. How Do You Evaluate the Quality of Diffusion Models?#

11. How is Diversity in Generated Samples Achieved?#

Summary of Diffusion Models#

Introduction to Diffusion Models

1. Brief History of Diffusion Models

1.1. Evolution of Image Generation in Machine Learning

1.2. Prerequisites for Modern Diffusion Models

Mathematical Foundations of Diffusion Models

High-Level Intuition

Key Concepts

The Forward Diffusion Process

Mathematical Formulation

Closed Form for Noisy Data

The Reverse Diffusion Process

Mathematical Formulation

Training Objective

Sampling (Generation)

Key Points and Common Areas of Confusion

Other Questions

1. Why Use an Iterative Procedure Rather Than One-Shotting It?

2. Why Reinject Noise at Each Step (Excluding the Last)?

3. How Do Diffusion Models Compare to GANs and VAEs?

4. What Are the Limitations of Diffusion Models?

5. How Do You Choose the Number of Timesteps?

6. Can Diffusion Models Be Used for Tasks Other Than Generation?

7. What’s the Intuition Behind the Variance Schedule?

8. How Do Diffusion Models Handle Different Types of Data?

9. What Recent Advancements Have Been Made in Diffusion Models?

10. How Do You Evaluate the Quality of Diffusion Models?

11. How is Diversity in Generated Samples Achieved?

Summary of Diffusion Models