Intro to Diffusion Model — Part 2

Part 2 of a series of posts about the diffusion model. In this post, we will go over the mathematics that describe the forward and reverse processes of the diffusion model.

Sep 9, 2023

In the previous post, we introduced the key ideas of the diffusion model, such as denoising, the forward process, and the reverse process. Now, it’s time to move from a general, intuitive description to a more formal mathematical one.

In this post, we will base the mathematical formulation on the paper “Denoising Diffusion Probabilistic Models” (DDPM for short).

Let’s start with indexing the steps in the processes. We define a process to have T time steps, where T is some big integer (for example, the DDPM paper uses T = 1000). The first step, where we have the original image without any noise, is at t = 0, and we mark that image as x₀. In the following steps, we add noise to the image, so for a large enough T, we end up with pure noise.

The original images (those without noise) come from a distribution of “real data”, so we can write it mathematically as x₀~q(x₀). In the forward diffusion process at a time t, we want to sample a new noisy image based on the noisy image at the previous step t-1, which means q(xₜ|xₜ₋₁). In the case of the diffusion model, in each time step, we add Gaussian noise to the image according to some known variance schedule 0 < β₁ < … < β_T < 1, where

q(xₜ|xₜ₋₁) = N(xₜ; √(1-βₜ)·xₜ₋₁, βₜ·I)

The above equation means that the image at time step t is sampled from a normal distribution with a mean of μ = √(1-βₜ)xₜ₋₁ (i.e., it depends on the values of the image at the previous step, which we assume we know) and a variance of σ² = βₜ (I is the identity matrix, which means that the noise in the different parts of the image is independent).
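As a concrete sketch (not code from the paper), a single forward step can be implemented by sampling from that Gaussian; the value of `beta` and the image shape below are illustrative choices:

```python
import numpy as np

def forward_step(x_prev, beta, rng):
    """Sample x_t from N(sqrt(1 - beta) * x_{t-1}, beta * I)."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta) * x_prev + np.sqrt(beta) * noise

rng = np.random.default_rng(0)
x0 = rng.standard_normal((28, 28))   # a stand-in for a real image
x1 = forward_step(x0, beta=0.01, rng=rng)
```

With a small β the result is still very close to the input; repeating the step many times is what gradually destroys the image.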

The complete forward process can be described by the following equation:

q(x₁,…,x_T|x₀) = ∏ₜ₌₁ᵀ q(xₜ|xₜ₋₁)

Now pay attention to this. We can write xₜ as follows:

xₜ = √(1-βₜ)·xₜ₋₁ + √βₜ·ε, where ε ~ N(0, I)

Why can we write it that way? Let’s assume X is a random variable sampled from a normal distribution with mean μ = 0 and variance σ² = 1. Now let’s define a new variable Y = μ + σX. Let’s check what the expectation value and the variance of Y are:

E[Y] = E[μ + σX] = μ + σ·E[X] = μ

Var(Y) = Var(μ + σX) = σ²·Var(X) = σ²

If you are not sure why the equations above are true, you can check out this post about the expectation and variance. The last results show that Y is also a Gaussian random variable (since X is), but with mean μ and variance σ². As for our equation for xₜ, by the same principles, xₜ is a Gaussian random variable (actually a vector) with a mean of √(1-βₜ)xₜ₋₁ (we assume we know xₜ₋₁ since we deal with conditional probability) and a variance of βₜ, which is exactly what we wrote in the first equation.

Although we found an equation for xₜ, it still requires computing all the previous steps: the equation depends on xₜ₋₁, which in turn depends on xₜ₋₂, and so on all the way back to x₀. We can do better, but first, we need to introduce new notations:

αₜ = 1 - βₜ,  ᾱₜ = α₁·α₂·…·αₜ

So we can rewrite the xₜ equation as:

xₜ = √αₜ·xₜ₋₁ + √(1-αₜ)·ε

Now, let’s calculate the expectation and variance of xₜ. Substituting the equation for xₜ₋₁ into the equation for xₜ, and repeating all the way down to x₀ (merging the independent Gaussian noise terms at each substitution), gives:

E[xₜ|x₀] = √(αₜ·αₜ₋₁·…·α₁)·x₀ = √ᾱₜ·x₀

Var(xₜ|x₀) = (1 - αₜ·αₜ₋₁·…·α₁)·I = (1 - ᾱₜ)·I

So we can write xₜ directly in terms of x₀:

xₜ = √ᾱₜ·x₀ + √(1-ᾱₜ)·ε, where ε ~ N(0, I)
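A sketch of this closed-form jump, assuming the standard DDPM notation αₜ = 1 - βₜ and ᾱₜ = α₁·…·αₜ, and a DDPM-style linear variance schedule (the schedule values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear schedule, as in DDPM
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)       # alpha_bar[t] = alpha_1 * ... * alpha_{t+1}

def q_sample(x0, t, rng):
    """Jump directly from x0 to x_t: x_t = sqrt(a_bar_t)*x0 + sqrt(1-a_bar_t)*eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = np.ones((16, 16))
xt = q_sample(x0, t=T - 1, rng=rng)

# By t = T, alpha_bar is tiny, so x_T is essentially pure standard Gaussian noise.
print(alpha_bar[-1])
```

This is the property that makes training efficient: any noise level t can be sampled in one shot, without looping over the t - 1 earlier steps.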

This was the forward process, which took us step by step from the original image x₀ to a pure-noise image. But what we really want to know is the reverse process, or how we can go from noise back to an image. This process can be described in a similar way (but in the opposite direction) to the first equation of the forward process:

p(xₜ₋₁|xₜ) = N(xₜ₋₁; μ(xₜ, t), Σ(xₜ, t))

where we start from an isotropic Gaussian noise image such that

x_T ~ N(0, I)

If we can find the conditional distribution p(xₜ₋₁|xₜ), we can reverse the process: start from a sample of noise x_T and gradually remove the noise until we end up with an image from the real data distribution, x₀.

The problem, of course, is that we don’t know that distribution, but we can still try to estimate it using a neural network. For the estimation, we replace the original unknown distribution p(xₜ₋₁|xₜ) with a parametric one, p_θ(xₜ₋₁|xₜ), where θ denotes the network’s parameters.

As we deal with Gaussian distributions, p_θ(xₜ₋₁|xₜ) can be fully described by a mean and a variance, so we can parameterize it as

p_θ(xₜ₋₁|xₜ) = N(xₜ₋₁; μ_θ(xₜ, t), Σ_θ(xₜ, t))

The conditioning appears as the dependence of μ_θ and Σ_θ on the data from the previous step, xₜ, and on the time step t. However, we can simplify our model by deciding that the variance is not a parameter the network needs to learn (that assumption follows the DDPM paper, but later works such as “Improved Denoising Diffusion Probabilistic Models” also learn the variance).
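Under that fixed-variance assumption, a single reverse step can be sketched as below. The `predict_mean` function stands in for the trained network μ_θ and is purely hypothetical here (an identity map), since learning the actual mean is the subject of the next post; fixing the variance to σₜ² = βₜ is one of the choices discussed in the DDPM paper.

```python
import numpy as np

def reverse_step(x_t, t, betas, predict_mean, rng):
    """One reverse step: x_{t-1} ~ N(mu_theta(x_t, t), sigma_t^2 * I),
    with the variance fixed to sigma_t^2 = beta_t rather than learned."""
    mean = predict_mean(x_t, t)      # stands in for the trained network mu_theta
    if t == 0:
        return mean                  # no noise is added at the final step
    sigma = np.sqrt(betas[t])
    return mean + sigma * rng.standard_normal(x_t.shape)

# Hypothetical stand-in for mu_theta: an identity map, for illustration only.
predict_mean = lambda x, t: x

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)
x_T = rng.standard_normal((8, 8))    # start from isotropic Gaussian noise
x_prev = reverse_step(x_T, t=999, betas=betas, predict_mean=predict_mean, rng=rng)
```

Sampling then amounts to applying `reverse_step` for t = T-1, T-2, …, 0, replacing the identity map with the learned mean predictor.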

In the next post, we are going to see how we can define a cost function to use with the neural network to learn the mean.