Setting & Goals
Consider an empirical data distribution $\pi_{\mathrm{data}}$ on $\mathbb{R}^d$. In order to simulate approximate samples from $\pi_{\mathrm{data}}$, Denoising Diffusion Probabilistic Models (DDPM) simulate a forward diffusion process $(X_t)_{t \in [0,T]}$ on an interval $[0,T]$. The diffusion is initialized at the data distribution, i.e. $X_0 \sim \pi_{\mathrm{data}}$, and is chosen so that the distribution of $X_T$ is very close to a known and tractable reference distribution $\pi_{\mathrm{ref}}$, e.g. a Gaussian distribution. Denote by $\pi_t$ the marginal distribution at time $t$, i.e. $X_t \sim \pi_t$. By choosing a forward diffusion with simple and tractable transition probabilities, e.g. an Ornstein–Uhlenbeck process, it is relatively easy to estimate the score function $\nabla \log \pi_t$ from simulated data: this can be formulated as a simple regression problem. This allows one to simulate the diffusion backward in time and generate approximate samples from $\pi_{\mathrm{data}}$. Why this is useful is another question…
The fact that the mapping from data-samples at time $t=0$ to (approximate) Gaussian samples at time $t=T$ is stochastic and described by a diffusion process is cumbersome. It would be much more convenient to build a deterministic mapping between the data distribution $\pi_{\mathrm{data}}$ and the Gaussian reference distribution $\pi_{\mathrm{ref}}$: this would allow one to associate a likelihood to data samples and to easily “encode”/“decode” data-samples. To do so, one can try to replace diffusions by Ordinary Differential Equations (ODEs).
The diffusion-ODE trick
Consider an arbitrary diffusion process
$$
dX_t \;=\; \mu_t(X_t)\,dt + \sigma_t\,dB_t
$$
with associated distribution $\pi_t$ at time $t$. The Fokker–Planck equation that describes the evolution of $\pi_t$ reads
$$
\partial_t \pi_t \;=\; -\nabla\cdot\big(\mu_t\,\pi_t\big) + \frac{\sigma_t^2}{2}\,\Delta \pi_t.
$$
If there were no diffusion term and $X_t$ were instead following the ordinary differential equation $\dot{X}_t = \mu_t(X_t)$, the associated evolution of the density of $X_t$ would simply be given by the continuity equation
$$
\partial_t \pi_t \;=\; -\nabla\cdot\big(\mu_t\,\pi_t\big). \tag{1}
$$
If one can find a vector field $\widetilde{\mu}_t$ such that
$$
-\nabla\cdot\big(\widetilde{\mu}_t\,\pi_t\big) \;=\; -\nabla\cdot\big(\mu_t\,\pi_t\big) + \frac{\sigma_t^2}{2}\,\Delta \pi_t,
$$
then one can basically replace diffusions by ODEs. The diffusion-ODE trick is the simple remark that
$$
\widetilde{\mu}_t \;=\; \mu_t - \frac{\sigma_t^2}{2}\,\nabla \log \pi_t
$$
does exactly this, as immediate algebra shows. The additional term $-\frac{\sigma_t^2}{2}\,\nabla \log \pi_t$ is intuitive. The coefficient $\frac{\sigma_t^2}{2}$ is there because one is trying to match the term $\frac{\sigma_t^2}{2}\,\Delta \pi_t$ in the Fokker–Planck equation. And the term $-\nabla \log \pi_t$ drives the ODE towards regions where the probability density is small, i.e. it follows the negative gradient of the log-density: it is exactly trying to imitate the spreading effect of the diffusion term $\sigma_t\,dB_t$.
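For completeness, the verification is a one-line computation using the identity $\nabla \pi_t = \pi_t\,\nabla\log\pi_t$:
$$
-\nabla\cdot\big(\widetilde{\mu}_t\,\pi_t\big)
\;=\;
-\nabla\cdot\big(\mu_t\,\pi_t\big) + \frac{\sigma_t^2}{2}\,\nabla\cdot\big(\pi_t\,\nabla\log\pi_t\big)
\;=\;
-\nabla\cdot\big(\mu_t\,\pi_t\big) + \frac{\sigma_t^2}{2}\,\Delta\pi_t.
$$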
What this means is that a diffusion process $(X_t)$ started from $X_0 \sim \pi_0$ and with marginal distributions $\pi_t$ can be imitated by an ODE process $\dot{Y}_t = \widetilde{\mu}_t(Y_t)$ started from $Y_0 \sim \pi_0$. At any time $t \in [0,T]$, the marginal distributions of $X_t$ and $Y_t$ both exactly equal $\pi_t$.
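As a quick numerical sanity check (my own sketch, not part of the original argument), take for $\pi_0$ a centered Gaussian and for the diffusion the Ornstein–Uhlenbeck process $dX_t = -X_t\,dt + \sqrt{2}\,dB_t$ used as forward dynamics below: the marginals stay Gaussian, so the score is known in closed form and the diffusion and ODE simulations can be compared directly.

```python
import numpy as np

# Sanity check of the diffusion-ODE trick on a solvable example:
# the OU diffusion dX = -X dt + sqrt(2) dB started from N(0, s0^2).
# The marginal stays Gaussian, pi_t = N(0, s_t^2) with
#   s_t^2 = s0^2 exp(-2t) + (1 - exp(-2t)),
# so the score grad log pi_t(x) = -x / s_t^2 is closed-form and the
# matching ODE has drift mu - (sigma^2/2) * score = -x + x / s_t^2.

rng = np.random.default_rng(0)
n, T, dt = 100_000, 2.0, 1e-3
s0 = 3.0

def var_t(t):
    return s0**2 * np.exp(-2 * t) + (1.0 - np.exp(-2 * t))

x_sde = s0 * rng.standard_normal(n)  # particles following the diffusion
x_ode = s0 * rng.standard_normal(n)  # particles following the ODE

t = 0.0
while t < T:
    # Euler-Maruyama step for the diffusion
    x_sde += -x_sde * dt + np.sqrt(2.0 * dt) * rng.standard_normal(n)
    # Euler step for the ODE  dY/dt = -Y + Y / s_t^2
    x_ode += (-x_ode + x_ode / var_t(t)) * dt
    t += dt

print("theoretical std:", np.sqrt(var_t(T)))  # approx 1.07
print("diffusion std  :", x_sde.std())
print("ODE std        :", x_ode.std())
```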
The diffusion-ODE trick: application to DDPM
Consider a DDPM with forward dynamics given by an Ornstein-Uhlenbeck (OU) process
$$
dX_t \;=\; -X_t\,dt + \sqrt{2}\,dB_t
$$
and initial condition $X_0 \sim \pi_{\mathrm{data}}$. As explained in these notes, it is relatively straightforward to estimate the score function
$$
(t,x) \;\mapsto\; \nabla \log \pi_t(x)
$$
from data. This means that the forward OU process can be replaced by the forward ODE
$$
\dot{X}_t \;=\; -X_t - \nabla \log \pi_t(X_t)
$$
with $X_0 \sim \pi_{\mathrm{data}}$.
with . Similarly, the reverse diffusion (i.e. the “denoising” diffusion) defined as follows the dynamics
As described for the first time in the beautiful article (Song et al. 2020), the diffusion-ODE trick now shows that the denoising diffusion can be replaced by a denoising ODE with dynamics
$$
\dot{\overline{X}}_t \;=\; \overline{X}_t + \nabla \log \pi_{T-t}(\overline{X}_t).
$$
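To spell out the computation: at time $t$ the reverse diffusion has drift $\overline{\mu}_t(x) = x + 2\,\nabla\log\pi_{T-t}(x)$, noise coefficient $\sqrt{2}$, and marginal distribution $\pi_{T-t}$, so the trick replaces the drift by
$$
\overline{\mu}_t(x) - \nabla\log\pi_{T-t}(x) \;=\; x + \nabla\log\pi_{T-t}(x).
$$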
Interestingly [and I do not know whether there was an obvious way of seeing this from the start], this shows that the forward and backward ODEs are actually the same, just run forward and backward in time. They correspond to the ODE described by the vector field
$$
v_t(x) \;=\; -x - \nabla \log \pi_t(x): \tag{2}
$$
the forward ODE reads $\dot{X}_t = v_t(X_t)$ while the denoising ODE reads $\dot{\overline{X}}_t = -v_{T-t}(\overline{X}_t)$.
The animation below displays the denoising ODE and the associated vector field Equation 2.
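To make this concrete, here is a minimal sketch of a denoising-ODE sampler (my own toy illustration, not the code behind the animation): the data distribution is a 1D two-component Gaussian mixture, so that under the OU forward process the marginals $\pi_t$ remain Gaussian mixtures and the exact score is available in closed form.

```python
import numpy as np

# Toy denoising-ODE sampler: data distribution is the Gaussian mixture
#   pi_data = 0.5 N(-2, 0.1) + 0.5 N(+2, 0.1).
# Under the forward OU process dX = -X dt + sqrt(2) dB, each component
# N(m, s2) evolves into N(m e^{-t}, s2 e^{-2t} + 1 - e^{-2t}), so pi_t
# stays a Gaussian mixture and its score is exact.

rng = np.random.default_rng(1)
means, weights, s2 = np.array([-2.0, 2.0]), np.array([0.5, 0.5]), 0.1
T, dt, n = 5.0, 1e-3, 50_000

def score(x, t):
    m = means * np.exp(-t)                          # component means at time t
    v = s2 * np.exp(-2 * t) + 1.0 - np.exp(-2 * t)  # shared variance at time t
    d = x[:, None] - m[None, :]
    logp = -0.5 * d**2 / v + np.log(weights)[None, :]
    logp -= logp.max(axis=1, keepdims=True)
    r = np.exp(logp)
    r /= r.sum(axis=1, keepdims=True)               # posterior component weights
    return np.sum(r * (-d / v), axis=1)             # grad log pi_t(x)

# start from the reference pi_ref = N(0, 1) ~ pi_T and integrate the
# denoising ODE  dXbar/dt = Xbar + score(Xbar, T - t)  with Euler steps
x = rng.standard_normal(n)
t = 0.0
while t < T:
    x += (x + score(x, T - t)) * dt
    t += dt

print("fraction in right mode:", np.mean(x > 0.0))                 # approx 0.5
print("mode locations        :", x[x > 0].mean(), x[x < 0].mean())  # approx +/- 2
```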
Likelihood computation
With the diffusion-ODE trick, we have just seen that it is possible to build a vector field $v_t$ such that the forward ODE
$$
\dot{X}_t = v_t(X_t), \qquad X_0 \sim \pi_{\mathrm{data}}, \tag{3}
$$
and the backward ODE defined as
$$
\dot{\overline{X}}_t = -v_{T-t}(\overline{X}_t), \qquad \overline{X}_0 \sim \pi_T \approx \pi_{\mathrm{ref}},
$$
are such that $X_t \sim \pi_t$ and $\overline{X}_t \sim \pi_{T-t}$ for all $t \in [0,T]$.
In general, consider a vector field $v_t$ and a bunch of particles distributed according to a distribution $\rho_t$ at time $t$. If each particle follows the vector field for a small amount of time $\delta$, the particles that were in the vicinity of some $x_t$ at time $t$ end up in the vicinity of $x_{t+\delta} = x_t + \delta\,v_t(x_t)$ at time $t+\delta$. At the same time, a volume element around $x_t$ gets stretched by a factor $1 + \delta\,(\nabla\cdot v_t)(x_t)$ while following the vector field, which means that the density of particles at time $t+\delta$ and around $x_{t+\delta}$ equals $\rho_t(x_t)\,\big[1 - \delta\,(\nabla\cdot v_t)(x_t)\big] + o(\delta)$. In other words, $\log \rho_{t+\delta}(x_{t+\delta}) = \log \rho_t(x_t) - \delta\,(\nabla\cdot v_t)(x_t) + o(\delta)$. This means that if one follows a trajectory $t \mapsto x_t$ of $\dot{x}_t = v_t(x_t)$, one gets
$$
\frac{d}{dt}\,\log \rho_t(x_t) \;=\; -(\nabla\cdot v_t)(x_t).
$$
That is the Lagrangian description of the density of particles. Indeed, one could directly get this identity by differentiating $t \mapsto \log \rho_t(x_t)$ with respect to time while using the continuity Equation 1:
$$
\frac{d}{dt}\,\log \rho_t(x_t)
= \frac{(\partial_t \rho_t)(x_t)}{\rho_t(x_t)} + v_t(x_t)\cdot\nabla\log\rho_t(x_t)
= -\frac{\nabla\cdot\big(v_t\,\rho_t\big)(x_t)}{\rho_t(x_t)} + v_t(x_t)\cdot\nabla\log\rho_t(x_t)
= -(\nabla\cdot v_t)(x_t).
$$
When applied to the DDPM, this gives a way to assign a likelihood to data samples, namely
$$
\log \pi_{\mathrm{data}}(x_0) \;=\; \log \pi_T(x_T) + \int_0^T (\nabla\cdot v_t)(x_t)\,dt,
$$
where $t \mapsto x_t$ is the trajectory of the forward ODE Equation 3 initialized at $x_0$, and $\pi_T \approx \pi_{\mathrm{ref}}$ is (essentially) Gaussian. Note that in high-dimensional settings, it may be computationally expensive to compute the divergence term $(\nabla\cdot v_t)(x_t)$ since it typically is $d$ times slower than a gradient computation; for this reason, it is often advocated to use the Hutchinson trace estimator $\nabla\cdot v_t(x) = \mathbb{E}_\varepsilon\big[\varepsilon^\top \partial_x v_t(x)\,\varepsilon\big]$, for a random vector $\varepsilon$ with identity covariance, to get an unbiased estimate of it at a much lower computational cost.
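To check the formula on a solvable case (again a sketch of my own, reusing the centered Gaussian example from above where both the vector field and its divergence are available in closed form):

```python
import numpy as np

# Likelihood of a point x0 under pi_data = N(0, s0^2), computed through
# the forward ODE. Here v_t(x) = -x + x / s_t^2 and its divergence
# div v_t = -1 + 1 / s_t^2 are closed-form, so the ODE-based likelihood
# can be compared with the exact Gaussian log-density.

s0, T, dt = 3.0, 6.0, 1e-4

def var_t(t):
    return s0**2 * np.exp(-2 * t) + 1.0 - np.exp(-2 * t)

x0 = 1.5
x, t, int_div = x0, 0.0, 0.0
while t < T:
    x += (-x + x / var_t(t)) * dt            # follow the forward ODE
    int_div += (-1.0 + 1.0 / var_t(t)) * dt  # accumulate div v_t along the path
    t += dt

# log pi_data(x0) = log pi_T(x_T) + int_0^T div v_t(x_t) dt,
# with pi_T = N(0, var_t(T)), essentially the N(0,1) reference for large T
log_piT = -0.5 * x**2 / var_t(T) - 0.5 * np.log(2 * np.pi * var_t(T))
print("ODE-based log-likelihood:", log_piT + int_div)
print("exact log-likelihood    :", -0.5 * x0**2 / s0**2 - 0.5 * np.log(2 * np.pi * s0**2))
```

In higher dimension, the closed-form divergence inside the loop would be replaced by a Hutchinson estimate $\varepsilon^\top \partial_x v_t(x_t)\,\varepsilon$, which only requires a single Jacobian-vector product per step.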
References
Song, Yang, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. 2020. “Score-Based Generative Modeling Through Stochastic Differential Equations.” ICLR 2021. https://arxiv.org/abs/2011.13456.