Setting & Goals
Consider \(N\) samples \(\mathcal{D}\equiv \{x_i\}_{i=1}^N\) in \(\mathbb{R}^D\) from an unknown data distribution \(\pi_{\mathrm{data}}(dx)\). We would like to build a mechanism that can generate other samples from this data distribution. Implicitly, this means that we will be fitting a statistical model to this finite set of samples and rely on an algorithmic procedure to generate samples from the fitted probabilistic model. Models of this type, which directly define a stochastic procedure that generates data, are called implicit probabilistic models in the ML literature.
Ornstein–Uhlenbeck Noising process
Denoising Diffusion Probabilistic Models (DDPMs) work as follows. Consider a diffusion process \(\{ X_t \}_{t=0}^T\) that starts from the data distribution \(p_0(dx) \equiv \pi_{\mathrm{data}}(dx)\) at time \(t=0\). The notation \(p_t(dx)\) refers to the marginal distribution of the diffusion at time \(0 \leq t \leq T\). Assume furthermore that at time \(t=T\), the marginal distribution is (very close to) a reference distribution \(p_T(dx) = \pi_{\mathrm{ref}}(dx)\) that is straightforward to sample from. Typically, \(\pi_{\mathrm{ref}}(dx)\) is an isotropic Gaussian distribution. This diffusion process is often called the noising process since it transforms the data distribution into a reference measure that can be thought of as “pure noise”. It is often chosen as an Ornstein–Uhlenbeck (OU) diffusion,
\[ dX = - \frac12 X \, dt + dW. \tag{1}\]
This diffusion is reversible with respect to, and quickly converges to, the reference distribution \(\pi_{\mathrm{ref}} = \mathcal{N}(0, I)\) and has the good taste of having simple transition densities: the law of \(X_{t+s}\) given that \(X_t = x_t\) is that of \(e^{-s/2} x_t + \sqrt{1-e^{-s}} \, \mathbf{n}\), which we write as
\[ \alpha_s \, x_t + \sigma_s \, \mathbf{n} \]
with \(\alpha_s^2 + \sigma_s^2 = 1\) and \(\alpha_s = e^{-s/2}\) and \(\mathbf{n}\sim \pi_{\mathrm{ref}} = \mathcal{N}(0,I)\). The forward probability transitions are tractable and given by \[ \mathop{\mathrm{\mathbb{P}}}(X_{t+s} \in dy \, | \, X_t = x ) \; \propto \; \exp {\left\{ -\frac{(y - \alpha_s \, x)^2}{2 \, \sigma^2_s} \right\}} \, dy. \tag{2}\]
This also means that one can directly generate samples from \(p_t(dx)\) by first choosing a data sample \(x_i\) from the data distribution \(\pi_{\mathrm{data}} \equiv p_0\) and blending it with noise by setting \(x_i^{(t)} = \alpha_t \, x_i + \sigma_t \, \mathbf{n}\). In other words, one does not need to simulate the diffusion to generate samples from \(p_t(dx)\), i.e. the method is “simulation free”.
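For concreteness, here is a minimal NumPy sketch of this simulation-free noising step; the two-dimensional Gaussian batch is just a stand-in for actual data samples:

```python
import numpy as np

rng = np.random.default_rng(0)

def alpha(t):
    # alpha_t = e^{-t/2}
    return np.exp(-t / 2.0)

def sigma(t):
    # sigma_t = sqrt(1 - e^{-t}), so that alpha_t^2 + sigma_t^2 = 1
    return np.sqrt(1.0 - np.exp(-t))

def noise_sample(x0, t):
    """Simulation-free sample of X_t given X_0 = x0 (Equation 2)."""
    n = rng.standard_normal(x0.shape)   # n ~ N(0, I)
    return alpha(t) * x0 + sigma(t) * n

# noising a batch of samples at time t = 0.3
x0 = rng.standard_normal((128, 2))      # stand-in for data samples
xt = noise_sample(x0, t=0.3)
```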
The reverse diffusion
In order to generate samples from the data distribution, the DDPM strategy consists in sampling from the Gaussian reference measure \(\pi_{\mathrm{ref}}\) at time \(t=T\) and simulating the OU process backward in time. In other words, one would like to simulate the reverse process \(\overleftarrow{X}_s\) defined as
\[\overleftarrow{X}_s = X_{T-s}.\]
The reverse process is distributed as \(\overleftarrow{X}_0 \sim \pi_{\mathrm{ref}}\) at time \(s=0\) and, crucially, we have that \(\overleftarrow{X}_T \sim \pi_{\mathrm{data}}\). Furthermore, and as explained in this note, the reverse diffusion follows the dynamics
\[ d\overleftarrow{X}_s = {\color{red} + }\frac12 \overleftarrow{X}_s \, ds \; {\color{red} + \nabla \log p_{T-s}(\overleftarrow{X}_s) \, ds} \; + dB \tag{3}\]
where \(B\) is another Wiener process. I have used the notation \(B\) to emphasize that there is no link between this Wiener process and the one used to simulate the forward process. Note that if the initial data distribution \(p_0 \equiv \pi_{\mathrm{data}}\) were equal to the reference measure \(\pi_{\mathrm{ref}}\), i.e. \(p_0 = p_T = \pi_{\mathrm{ref}}\), then it is easy to see that the reverse diffusion would have exactly the same dynamics as the forward diffusion. In order to simulate the reverse diffusion, one needs to be able to estimate the new term \({\color{red}\nabla \log p_{T-s}(x)}\), called the score. If one can estimate the score, it is straightforward to simulate the reverse process \(\overleftarrow{X}_s\) and obtain samples from the data distribution. Estimating the score is the crux of the game.
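To make Equation 3 concrete, here is a minimal sketch that simulates it with a naive Euler-Maruyama scheme (a more careful discretization is discussed at the end of these notes) for a toy data distribution \(\pi_{\mathrm{data}} = \mathcal{N}(m, I)\), for which the score is available in closed form: in that case \(p_t = \mathcal{N}(\alpha_t \, m, I)\), so that \(\nabla \log p_t(x) = -(x - \alpha_t \, m)\).

```python
import numpy as np

rng = np.random.default_rng(1)
m = np.array([3.0, -2.0])                # mean of the toy data distribution N(m, I)
T, n_steps = 5.0, 500
dt = T / n_steps

def score(t, x):
    # exact score for pi_data = N(m, I): here p_t = N(alpha_t m, I)
    return -(x - np.exp(-t / 2.0) * m)

x = rng.standard_normal((1000, 2))       # start from pi_ref = N(0, I)
for k in range(n_steps):
    t = T - k * dt                       # forward time associated to s = k * dt
    drift = 0.5 * x + score(t, x)        # drift of Equation 3
    x = x + drift * dt + np.sqrt(dt) * rng.standard_normal(x.shape)
# x is now approximately distributed as pi_data = N(m, I)
```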
From denoising to estimating the score
In practice, the score is unknown and one has to build an approximation of it
\[\mathcal{S}^\theta(t,x) \; \approx \; \nabla_x \log p_t(x).\]
The approximate score \(\mathcal{S}^\theta(t,x)\) is often parametrized by a neural network with parameters \(\theta \in \Theta\). Since
\[ \log p_t(x_t) \; = \; \log \int \; p(x_t \mid x_0)\; \pi_{\mathrm{data}}(d x_0) \]
it follows that \(\nabla_{x_t} \log p_t(x_t)\) is given by
\[ \nabla_{x_t} \log p_t(x_t) = \frac{\int \; \textcolor{blue}{ \Big\{ \nabla_{x_t} \log p(x_t \mid x_0) \Big\} } \; p(x_t \mid x_0) \, \pi_{\mathrm{data}}(d x_0)}{\int \; p(x_t \mid x_0)\; \pi_{\mathrm{data}}(d x_0)} \]
which is just a conditional expectation:
\[ \nabla_{x_t} \log p_t(x_t) \;=\; \mathop{\mathrm{\mathbb{E}}}[ \textcolor{blue}{ \nabla_{x_t} \log p(x_t \mid x_0) } \mid x_t] \tag{4}\]
and Equation 4 is true for any diffusion process, not only the OU process. Nevertheless, when applied to the OU process of Equation 1, the tractable transition probabilities of Equation 2 give \(\nabla_{x_t} \log p(x_t \mid x_0) = -(x_t - \alpha_t \, x_0)/\sigma_t^2\), so that Equation 4 becomes
\[ \nabla_{x_t} \log p_t(x_t) \; = \; -\frac{x_t - \alpha_t \, \textcolor{green}{\widehat{x}_0(t,x_t)}}{\sigma_t^2} \tag{5}\]
where \(\widehat{x}_0(t,x_t)\) is a “denoising” estimate of the initial position \(x_0\) given a noisy observation \(X_t=x_t\) at time \(t\), i.e.
\[ \textcolor{green}{ \widehat{x}_0(t,x_t) \; = \; \mathop{\mathrm{\mathbb{E}}}[X_0 \; \mid \; X_t = x_t] }. \]
Note that this derivation fundamentally relies on the tractability of the transition density of the OU process, although any other linear Gaussian diffusion could have been used. To simplify notation, I will often write \(\widehat{x}_0(t, x_t)\) as \(\widehat{x}_0(x_t)\) when it is clear that \(x_t\) is a sample obtained at time \(0 \leq t \leq T\). Equation 5 means that to estimate the score, one only needs to train a denoising function \(\widehat{x}^{\theta}_0: [0,T] \times \mathbb{R}^D \to \mathbb{R}^D\) parametrized by a parameter \(\theta \in \Theta\) so that it approximates the true conditional expectation,
\[ \widehat{x}^{\theta}_0(t,x_t) \; \approx \; \widehat{x}_0(t,x_t) = \mathop{\mathrm{\mathbb{E}}}[X_0 \mid X_t=x_t]. \]
It is a standard regression problem: take a bunch of pairs \((x_0, x_t)\) generated as
\[ x_0 \sim \pi_{\mathrm{data}} \qquad \textrm{and} \qquad x_t = \alpha_t x_0 + \sigma_t \, \mathbf{n} \sim p(dx_t \mid x_0) \]
and minimize the Mean Squared Error (MSE) loss, i.e.
\[ \theta \mapsto \mathop{\mathrm{\mathbb{E}}} {\left[ \|\widehat{x}^{\theta}_0(t, x_t) - x_0\|^2 \right]} . \]
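Here is a minimal PyTorch sketch of this regression; the data sampler `sample_data`, the dimension `D`, and the horizon `T` are placeholders chosen to make the snippet self-contained:

```python
import torch
import torch.nn as nn

D, T = 2, 5.0                                # data dimension and horizon (assumptions)
model = nn.Sequential(                       # stand-in for x0_hat^theta(t, x)
    nn.Linear(D + 1, 128), nn.ELU(),
    nn.Linear(128, 128), nn.ELU(),
    nn.Linear(128, D),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def alpha(t): return torch.exp(-t / 2)
def sigma(t): return torch.sqrt(1 - torch.exp(-t))

def sample_data(batch_size):
    # hypothetical data sampler: replace with draws from the actual dataset
    return torch.randn(batch_size, D) + torch.tensor([3.0, -2.0])

for step in range(1_000):
    x0 = sample_data(256)                    # x0 ~ pi_data
    t = T * torch.rand(x0.shape[0], 1)       # random training times in [0, T)
    n = torch.randn_like(x0)
    xt = alpha(t) * x0 + sigma(t) * n        # simulation-free pair (x0, xt)
    x0_pred = model(torch.cat([xt, t], dim=1))
    loss = ((x0_pred - x0) ** 2).sum(dim=1).mean()   # the MSE loss above
    opt.zero_grad()
    loss.backward()
    opt.step()
```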
This can be implemented with a stochastic gradient descent method or any other stochastic optimization procedure, as in the sketch above. Indeed, this implicitly relies on the standard fact that the expectation of a square-integrable random variable \(Z\) is the minimizer of the function \(\gamma \mapsto \mathop{\mathrm{\mathbb{E}}}[(\gamma-Z)^2]\). The score is then defined as
\[ \mathcal{S}^{\theta}(t,x) \; = \; -\frac{x - \alpha_t \, \widehat{x}^{\theta}_0(t,x)}{\sigma^2_t}. \]
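In code, and continuing the sketch above (reusing `model`, `alpha` and `sigma`), this conversion is a one-liner:

```python
def score_estimate(t, x):
    # S^theta(t, x) from Equation 5, with the trained denoiser plugged in
    x0_pred = model(torch.cat([x, t], dim=1))
    return -(x - alpha(t) * x0_pred) / sigma(t) ** 2
```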
One of the important take-aways is that a denoiser, i.e. a function that estimates the conditional expectation of \(X_0\) given \(X_t=x\), can be upgraded to a sampler! Furthermore, for a given estimate \(\mathcal{S}^\theta(t,x_t)\) of the score function, one can easily obtain bounds on the quality of the resulting approximation of the data distribution. For example, if \(q^{\theta}\) denotes the probability distribution on path-space induced by the reverse diffusion with approximate score \(\mathcal{S}^{\theta}\) while \(p\) denotes the path-space distribution of the exact reverse diffusion, the chain rule for the KL divergence gives
\[ \mathop{\mathrm{D_{\text{KL}}}}(\pi_{\mathrm{data}}, q^{\theta}_0) = \mathop{\mathrm{D_{\text{KL}}}}(p_0, q^{\theta}_0) \leq \mathop{\mathrm{D_{\text{KL}}}}(p, q^{\theta}). \]
As described in these notes, Girsanov’s theorem gives that
\[ \mathop{\mathrm{D_{\text{KL}}}}(p, q^{\theta}) = \frac12 \, \mathop{\mathrm{\mathbb{E}}} {\left\{ \int_{0}^T \|\mathcal{S}^{\theta}(t,X_t) - \nabla_x \log p_t(X_t) \|^2 \, dt \right\}} . \]
If there were a time-dependent volatility \(\gamma_t\) in the dynamics, i.e. \(dX = \mu(X) \, dt + \gamma_t \, dW\) instead of the one in Equation 1, the expression would remain essentially the same, the only difference being that the error term \(\|\mathcal{S}^{\theta}(t,X_t) - \nabla_x \log p_t(X_t) \|^2\) would be scaled by a factor \(\gamma^2_t\). In particular, this means that when the approximation of the score is perfect, i.e. \(\mathcal{S}^{\theta}(t,x_t) \equiv \nabla_x \log p_t(x_t)\), the data distribution is perfectly recovered, as expected. When expressed in terms of the denoiser \(\widehat{x}_0\), the upper-bound on \(\mathop{\mathrm{D_{\text{KL}}}}(\pi_{\mathrm{data}}, q^{\theta}_0)\) reads
\[ \frac12 \, \mathop{\mathrm{\mathbb{E}}} {\left\{ \int_{0}^T {\left( \frac{\alpha_t}{\sigma_t^2} \right)} ^2 \, \|\widehat{x}^{\theta}_0(t,X_t) - \mathop{\mathrm{\mathbb{E}}}[X_0 \mid X_t] \|^2 \, dt \right\}} . \]
Since \(\alpha_t / \sigma_t^2 \to \infty\) as \(t \to 0\), this shows that it is probably a good idea to approximate the denoiser \(\widehat{x}_0(t,x_t)\) much more accurately for small times \(t \to 0\) than for large times \(t \to T\).
Denoiser: practical parametrization
In practice, it may not be efficient, or stable, to try to directly parametrize the denoiser \(\widehat{x}^\theta_0(t, x_t)\) with a neural network and simply minimize the loss
\[ \mathop{\mathrm{\mathbb{E}}}\|X_0 - \widehat{x}^{\theta}_0(t, X_t)\|^2. \tag{6}\]
For example, for \(t \approx 0\), we have \(\widehat{x}_0(t,X_t) \approx X_t \approx X_0\), so that it is very easy to reconstruct \(X_0\) from \(X_t\). On the contrary, for large \(t\), there is almost no information contained within \(X_t\) to reconstruct \(X_0\). This means that the typical magnitude of the naive loss of Equation 6 depends strongly on \(t\), which makes it difficult to optimize: with this parametrization, since large values of the loss are typically concentrated at large values of \(t\), the denoiser will not be accurate for \(t \approx 0\), which is exactly the opposite of what is required, as discussed at the end of the previous section. Since \(x_t = \alpha_t \, x_0 + \sigma_t \, \mathbf{n}\), one can define the Signal-to-Noise Ratio as
\[\mathrm{SNR}(t) = \frac{\alpha_t}{\sigma_t}.\]
In order to normalize by the reconstruction difficulty, it makes more sense to train a denoiser by minimizing the loss:
\[ \mathop{\mathrm{\mathbb{E}}} {\left[ \mathrm{SNR}^2(t) \times \|X_0 - \widehat{x}^{\theta}_0(t, X_t)\|^2 \right]} . \]
It turns out that this is entirely equivalent to minimizing the loss
\[ \mathop{\mathrm{\mathbb{E}}} {\left[ \| \mathbf{n}- \widehat{\mathbf{n}}^{\theta}(t, X_t)\|^2 \right]} \]
where \(x_t = \alpha_t \, x_0 + \sigma_t \, \mathbf{n}\) while the denoiser \(\widehat{x}^{\theta}_0\) and noise estimator \(\widehat{\mathbf{n}}^{\theta}\) are parametrized so that
\[ x_t = \alpha_t \, \widehat{x}^{\theta}_0(t, x_t) + \sigma_t \, \widehat{\mathbf{n}}^{\theta}(t,x_t). \]
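To see this equivalence, subtract the parametrization above from the decomposition \(x_t = \alpha_t \, x_0 + \sigma_t \, \mathbf{n}\): this gives \(\alpha_t \, (x_0 - \widehat{x}^{\theta}_0) = \sigma_t \, (\widehat{\mathbf{n}}^{\theta} - \mathbf{n})\), and consequently
\[ \mathrm{SNR}^2(t) \, \|x_0 - \widehat{x}^{\theta}_0(t,x_t)\|^2 \; = \; \|\mathbf{n}- \widehat{\mathbf{n}}^{\theta}(t,x_t)\|^2, \]
so that the two losses agree term by term.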
That is one of the reasons why most papers on DDPMs parametrize the denoiser \(\widehat{x}_0\) by instead building a “noise estimator” \(\widehat{\mathbf{n}}^{\theta}\) with a neural network and setting
\[ \widehat{x}^{\theta}_0(t,x_t) = \frac{x_t - \sigma_t \, \widehat{\mathbf{n}}^{\theta}(t,x_t)}{\alpha_t}. \tag{7}\]
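Continuing the PyTorch sketch from earlier, the change is minimal: the regression target becomes the noise itself, and Equation 7 recovers the denoiser. As before this is only a sketch; `noise_model` stands for the same kind of MLP as `model` above, and `alpha`, `sigma` and `T` are reused:

```python
def noise_loss(noise_model, x0):
    # noise-matching MSE, equivalent to the SNR-weighted denoising loss
    t = T * torch.rand(x0.shape[0], 1)
    n = torch.randn_like(x0)
    xt = alpha(t) * x0 + sigma(t) * n
    n_pred = noise_model(torch.cat([xt, t], dim=1))
    return ((n_pred - n) ** 2).sum(dim=1).mean()

def x0_hat(noise_model, t, x):
    # Equation 7: recover the denoiser from the noise estimator
    n_pred = noise_model(torch.cat([x, t], dim=1))
    return (x - sigma(t) * n_pred) / alpha(t)
```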
Since \(\alpha_t \to 1\) and \(\sigma_t \to 0\) as \(t \to 0\), this also implicitly ensures that \(\widehat{x}^{\theta}_0(t,x_t) \approx x_t\) for \(t \to 0\), as required. With the parametrization of Equation 7, the score is given by
\[ \mathcal{S}^{\theta}(t,x_t) \, = \, -\frac{\widehat{\mathbf{n}}^{\theta}(t,x_t)}{\sigma_t}, \]
which can still lead to instability issues for small times \(t \to 0\), although to a lesser extent. It is unlikely that there is an easy way to avoid these issues entirely since, as soon as the data distribution is supported on a (neighbourhood of a) lower-dimensional manifold, the actual score function does become very large as \(t \to 0^+\): a strong “drift” is required to bring the diffusion back to the manifold.
The “denoising” diffusion
Once the denoiser \(\widehat{x}_0(\ldots)\) is trained, the reverse diffusion defined by \(\overleftarrow{X}_s = X_{T-s}\) has to be simulated. Plugging Equation 5 back into the expression for the dynamics of the reverse diffusion shows that
\[ d\overleftarrow{X}_s = -\frac12 \, \frac{1}{\tanh((T-s)/2)} \, {\left( \overleftarrow{X}_s - \frac{\widehat{x}_0(\overleftarrow{X}_s)}{\cosh((T-s)/2)} \right)} \, ds \; + \; dB. \tag{8}\]
This dynamics is intuitive: as \(s \to T\) we have \(\cosh((T-s)/2) \to 1\) and \(\varepsilon^2 \equiv \tanh((T-s)/2) \sim (T-s)/2 \to 0\) so that the dynamics is similar to
\[ d\overleftarrow{X}_s = -\frac12 \, \frac{1}{\varepsilon^2} \, {\left( \overleftarrow{X}_s - \widehat{x}_0 \right)} \, ds + dB. \]
That is an Ornstein-Uhlenbeck process that converges quickly, i.e. on a time-scale of order \(\mathcal{O}(\varepsilon^2)\), towards a Gaussian distribution centred at \(\widehat{x}_0\) with variance \(\varepsilon^2\).
To discretize the reverse dynamics Equation 8 on a small interval \([\overline{s}, \overline{s}+\delta]\), one can for example consider the slightly simplified (linear) dynamics
\[ d\overleftarrow{X}_s = -\frac12 \, \frac{1}{\tanh((T-s)/2)} \, {\left( \overleftarrow{X}_s - \mu \right)} \, ds + dB. \]
Here, \(\mu = \widehat{x}_0(\overleftarrow{x}_{\overline{s}}) / \cosh((T-\overline{s})/2)\) is frozen at the current position \(\overleftarrow{X}_{\overline{s}} = \overleftarrow{x}_{\overline{s}}\). Solving this linear SDE exactly gives that, conditioned upon \(\overleftarrow{X}_{\overline{s}} = \overleftarrow{x}_{\overline{s}}\), we have
\[ \left\{ \begin{aligned} \mathop{\mathrm{\mathbb{E}}}[ \overleftarrow{X}_{\overline{s} + \delta} ] \quad &= \quad \mu + \lambda \, (\overleftarrow{x}_{\overline{s}} - \mu)\\ \mathop{\mathrm{Var}}[ \overleftarrow{X}_{\overline{s} + \delta} ] \quad &= \quad 2 \, \sinh {\left( \frac{\delta}{2} \right)} \, \lambda \end{aligned} \right. \]
where the coefficient \(0<\lambda<1\) is given by
\[ \lambda = \frac{\sinh((T-\overline{s}-\delta)/2)}{\sinh((T-\overline{s})/2)}. \]
This discretization is more stable and efficient than a naive Euler-Maruyama approach since the drift term can get large as \(s \to T\). With the above discretization, one can easily simulate the reverse diffusion on \([0,T]\) and generate approximate samples from \(\pi_{\mathrm{data}}\).
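As a sketch, and reusing the hypothetical `x0_hat`, `D` and `T` from the training snippets above, the resulting sampler reads:

```python
import math

@torch.no_grad()
def reverse_sample(noise_model, n_samples, n_steps=200):
    # simulate the reverse diffusion with the exact linear updates above
    x = torch.randn(n_samples, D)                      # start from pi_ref = N(0, I)
    delta = T / n_steps
    for k in range(n_steps):
        t = torch.full((n_samples, 1), T - k * delta)  # forward time t = T - s
        mu = x0_hat(noise_model, t, x) / torch.cosh(t / 2)
        lam = torch.sinh((t - delta) / 2) / torch.sinh(t / 2)
        mean = mu + lam * (x - mu)
        var = 2.0 * math.sinh(delta / 2) * lam         # variance of the linear update
        x = mean + var.sqrt() * torch.randn_like(x)
    return x
```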
In the animation below, the method described in these notes was used. This very small example trains in less than a minute on an old laptop without a GPU. The noise estimator was parametrized with an MLP with \(\mathrm{elu}(\ldots)\) non-linearity and two hidden layers of size \(H=128\). It was very useful to add a few Fourier features as input of the network; I could not make this work properly without them.
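For completeness, here is one possible time-featurization of this kind; the exact features used for the animation may differ, so treat this as an illustrative sketch:

```python
def fourier_features(t, n_freqs=8):
    # map times t of shape (batch, 1) to 2 * n_freqs Fourier features
    freqs = 2.0 ** torch.arange(n_freqs)   # geometric ladder of frequencies
    angles = freqs * t                     # broadcasts to (batch, n_freqs)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
```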
The literature on DDPM is enormous and still growing!