Reverse diffusions, Score & Tweedie

SDE
Generative-Model
Published: 12 06 2023

Modified: 12 06 2023

Reversing a diffusion

Imagine a scalar diffusion process $\{X_t\}_{t=0}^{T}$ defined on the interval $[0,T]$,

$$dX = \mu(X)\,dt + \sigma\, dW$$

where $\mu(X)$ is the drift term, $\sigma$ is the diffusion coefficient, and $dW$ is a Wiener process. Denote the distribution of this process at time $t$ ($0 \le t \le T$) as $p_t(dx)$, with an initial distribution of $p_0(dx)$. Now, what happens if we reverse time and examine the process backward? In other words, consider the time-reversed process $\overleftarrow{X}$ defined as

$$\overleftarrow{X}_s = X_{T-s}.$$

Intuitively, the process $\overleftarrow{X}$ is also a diffusion on the interval $[0,T]$, but with an initial distribution $\overleftarrow{X}_0 \sim p_T(dx)$. To gain intuition, consider an Euler discretization of the forward process:

$$X_{t+\delta} = X_t + \mu(X_t)\,\delta + \sigma \sqrt{\delta}\, n \tag{1}$$

where $n \sim \mathcal{N}(0,1)$ represents a noise term independent from $X_t$, and $\delta \ll 1$ is a time increment. Re-arranging terms and making the approximation $\mu(X_t) \approx \mu(X_{t+\delta})$ gives that

$$X_t \approx X_{t+\delta} - \mu(X_{t+\delta})\,\delta + \sigma \sqrt{\delta}\,(-n). \tag{2}$$

This seems to suggest that the time-reversed process follows the dynamics $d\overleftarrow{X} = -\mu(\overleftarrow{X})\,dt + \sigma\, dW$ started from $\overleftarrow{X}_0 \sim p_T(dx)$. However, this conclusion is incorrect: it would imply that the time-reversal of a standard Brownian motion (where $\mu(x) \equiv 0$) starting at zero is also a Brownian motion, started from $p_T(dx) = \mathcal{N}(0,T)$, which is clearly not the case. The flaw in this argument lies in assuming that the noise term $n$ is independent of $X_{t+\delta}$, which is not true, rendering the Euler discretization argument invalid.
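The Brownian-motion counterexample is easy to check numerically. Below is a small sketch (assuming NumPy): a standard Brownian motion is simulated forward from zero, and then the naive "reversal" (another Brownian motion started from $\mathcal{N}(0,T)$) is run for the same duration. The true time-reversal must end exactly back at the starting point $0$, whereas the naive reversal ends with variance $2T$.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_steps, n_paths = 1.0, 1000, 50_000
dt = T / n_steps

# Forward: standard Brownian motion started at 0, via Euler discretization (1).
X = np.zeros(n_paths)
for _ in range(n_steps):
    X += np.sqrt(dt) * rng.standard_normal(n_paths)
# X is now ~ N(0, T); the true time-reversal must end exactly at 0.

# Naive "reversal": a fresh Brownian motion started from N(0, T).
Y = np.sqrt(T) * rng.standard_normal(n_paths)
for _ in range(n_steps):
    Y += np.sqrt(dt) * rng.standard_normal(n_paths)

print(np.var(Y))  # ~ 2T, not 0: the naive reversal is wrong
```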

Deriving the dynamics of the backward process in a rigorous manner is not straightforward (Anderson 1982; Haussmann and Pardoux 1986). What follows is a heuristic derivation that proceeds by estimating the mean and variance of $X_t$ given $X_{t+\delta} = x_{t+\delta}$, assuming $\delta \ll 1$. Here, $x_{t+\delta}$ is treated as a fixed and constant value, and we are only interested in the conditional distribution of $X_t$ given $x_{t+\delta}$. Bayes' law gives

$$\mathbb{P}(X_t \in dx \mid x_{t+\delta}) \;\propto\; p_t(x)\, \exp\left\{ -\frac{\big(x_{t+\delta} - [x + \mu(x)\,\delta]\big)^2}{2\sigma^2 \delta} \right\},$$

where the exponential term corresponds to the transition of the forward diffusion for $\delta \ll 1$. Using the first-order approximation

$$p_t(x) \;\approx\; p_t(x_{t+\delta})\, \exp\Big( \big\langle \nabla \log p_t(x_{t+\delta}),\; x - x_{t+\delta} \big\rangle \Big),$$

eliminating multiplicative constants and higher-order error terms, we obtain:

$$\mathbb{P}(X_t \in dx \mid x_{t+\delta}) \;\propto\; \exp\left\{ -\frac{\Big(x - \big[x_{t+\delta} - \mu(x_{t+\delta})\,\delta + \sigma^2\, \nabla \log p_t(x_{t+\delta})\,\delta\big]\Big)^2}{2\sigma^2 \delta} \right\}.$$
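For the record, the completion-of-square step behind this expression can be spelled out as follows (writing $s \equiv \nabla \log p_t(x_{t+\delta})$ for brevity):

```latex
% With s = \nabla\log p_t(x_{t+\delta}) and \mu(x)\,\delta \approx \mu(x_{t+\delta})\,\delta,
% the exponent of p_t(x) times the forward transition kernel is
\[
  s\,(x - x_{t+\delta})
  \;-\; \frac{\big(x - x_{t+\delta} + \mu(x_{t+\delta})\,\delta\big)^2}{2\sigma^2\delta}
  \;=\;
  -\,\frac{\Big(x - \big[x_{t+\delta} - \mu(x_{t+\delta})\,\delta
      + \sigma^2 s\,\delta\big]\Big)^2}{2\sigma^2\delta} \;+\; C,
\]
% where C collects x-independent terms (absorbed into the normalizing
% constant) and contributions of order O(\delta) that do not depend on x.
```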

For $\delta \ll 1$, this is the transition of the reverse diffusion

$$d\overleftarrow{X}_t = -\mu(\overleftarrow{X}_t)\,dt + \sigma^2\, \nabla \log p_{T-t}(\overleftarrow{X}_t)\,dt + \sigma\, dB.$$

The notation $B$ is used to emphasize that this Brownian motion is distinct from the one used in the forward diffusion. The additional drift term, $\sigma^2\, \nabla \log p_t(x)$, is intuitive: it pushes the reverse diffusion toward regions where the forward diffusion spent a significant amount of time, i.e., where $p_t(dx)$ is large. The popular "denoising diffusion models" (Ho, Jain, and Abbeel 2020) can be seen as discretizations of this backward process, employing various techniques to estimate the additional drift term from data.
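To see the corrected reverse dynamics in action, here is a minimal sketch (assuming NumPy) for the driftless forward diffusion $dX = \sigma\, dW$ started from $\mathcal{N}(0,1)$. In this toy case $p_t = \mathcal{N}(0, 1 + \sigma^2 t)$, so the score is available in closed form, and an Euler discretization of the reverse SDE started from $p_T$ should recover $p_0$:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, T, n_steps, n_paths = 1.0, 1.0, 1000, 50_000
dt = T / n_steps

def score(x, t):
    # Exact score of p_t = N(0, 1 + sigma^2 t) for the driftless forward SDE
    return -x / (1.0 + sigma**2 * t)

# Start the reverse diffusion from p_T = N(0, 1 + sigma^2 T)
X = np.sqrt(1.0 + sigma**2 * T) * rng.standard_normal(n_paths)
for k in range(n_steps):
    s = k * dt  # reverse-time clock; the corresponding forward time is T - s
    X += sigma**2 * score(X, T - s) * dt \
         + sigma * np.sqrt(dt) * rng.standard_normal(n_paths)

print(np.var(X))  # ~ 1.0: the reverse SDE recovers p_0 = N(0, 1)
```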

Denoising Score Matching & Tweedie formula

Maurice Tweedie (1919–1996)

The previous section shows that a quantity often referred to as the "score" in machine learning (ML) literature, i.e. $S_t(x) \equiv \nabla_x \log p_t(x)$, naturally arises when a diffusion process runs backward in time. Interestingly, since the beginning of time statisticians have been using the term "score" to refer to the other derivative, i.e. the derivative of the log-likelihood with respect to a model's parameters, which is a completely different object!

Consider a Brownian diffusion process $dX = \sigma\, dW$ initiated from a distribution $\mu(dx) = p_0(dx)$. If this process is run forward for a duration $\delta > 0$, we have:

$$X_\delta = X_0 + \mathcal{N}(0, \sigma^2 \delta)$$

where $X_0 \sim \mu(dx)$. Now, focusing on a specific sample $y = X_\delta$, the backward dynamics $d\overleftarrow{X}_t = \sigma^2\, \nabla \log p_{T-t}(\overleftarrow{X}_t)\,dt + \sigma\, dB$ suggest that for sufficiently small $\delta$, the following approximation holds:

$$\mathbb{E}[X_0 \mid X_\delta = y] \;\approx\; y + \sigma^2 \delta\, \nabla \log p_\delta(y). \tag{3}$$

Maybe surprisingly, in the case of Brownian dynamics (no drift), this relationship holds exactly, even for arbitrarily large time increments $\delta > 0$. This observation, attributed in (Efron 2011) to Maurice Tweedie (1919–1996), has a straightforward proof. Specifically, if $X \sim \mu(dx)$, then $Y = X + \mathcal{N}(0, \Gamma^2)$ has a density given by:

$$p_Y(y) = \int \mu(x)\, \rho_\Gamma(y - x)\, dx$$

where $\rho_\Gamma(z) \propto \exp\!\big[-z^2/(2\Gamma^2)\big]$ is a centred Gaussian density with variance $\Gamma^2$. It follows that

$$\frac{\mathbb{E}[X \mid Y=y] - y}{\Gamma^2} = \frac{\int \frac{x-y}{\Gamma^2}\, \mu(x)\, \rho_\Gamma(y-x)\, dx}{\int \mu(x)\, \rho_\Gamma(y-x)\, dx} = \nabla_y \log \left\{ \int \mu(x)\, \rho_\Gamma(y-x)\, dx \right\}$$

since $\nabla_y\, \rho_\Gamma(y-x) = \big[(x-y)/\Gamma^2\big]\, \rho_\Gamma(y-x)$. This leads to Tweedie's formula:

$$\mathbb{E}[X \mid Y=y] = y + \Gamma^2\, \nabla \log p_Y(y),$$
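Tweedie's formula can be sanity-checked against classical Gaussian conjugacy. In the toy case $\mu = \mathcal{N}(m, s^2)$ (values below chosen arbitrarily for illustration), $p_Y = \mathcal{N}(m, s^2 + \Gamma^2)$, and the formula must reproduce the usual posterior mean:

```python
import numpy as np

# Toy setup: Gaussian prior mu = N(m, s^2), noise variance Gamma^2 = G2.
m, s2, G2 = 0.5, 2.0, 0.3

def tweedie(y):
    # p_Y = N(m, s^2 + G2), so grad log p_Y(y) = -(y - m) / (s^2 + G2)
    return y + G2 * (-(y - m) / (s2 + G2))

def posterior_mean(y):
    # Classical Gaussian conjugacy gives E[X | Y=y] directly
    return (s2 * y + G2 * m) / (s2 + G2)

y = np.linspace(-3.0, 3.0, 7)
print(np.max(np.abs(tweedie(y) - posterior_mean(y))))  # 0 up to rounding
```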

which is exactly the same as Equation 3, but with an equality sign: no approximation needed! Interestingly, this is what machine-learners often refer to as "denoising score matching" (Vincent 2011): if we have access to a large number of samples $(X_i, Y_i)$, where $X_i \sim \mu(dx)$ and $Y_i = X_i + \mathcal{N}(0, \Gamma^2)$, and fit a regression model $F_\theta$ by minimizing the mean-squared error $\theta \mapsto \mathbb{E}\,\|X - F_\theta(Y)\|^2$, then Tweedie's formula shows that, in the limit of infinite samples and with sufficient flexibility in the regression model, $F_\theta(y) = y + \Gamma^2\, \nabla \log p_Y(y)$. This allows one to estimate $\nabla \log p_Y$ from data, which is often a good approximation of $\nabla \log \mu$ if the variance $\Gamma^2$ of the added noise is not too large. Indeed, things can go bad if $\Gamma^2$ is very small and the number of training data points is not large: no free lunch!
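The denoising regression can be sketched in a few lines (assuming NumPy; a Gaussian $\mu$ is used so the optimum is known in closed form). Fitting a linear $F_\theta(y) = a\,y + b$ by least squares on pairs $(X_i, Y_i)$ recovers the coefficients predicted by Tweedie's formula:

```python
import numpy as np

rng = np.random.default_rng(2)
m, s2, G2, n = 0.5, 2.0, 0.3, 200_000

X = m + np.sqrt(s2) * rng.standard_normal(n)   # X_i ~ mu = N(m, s^2)
Y = X + np.sqrt(G2) * rng.standard_normal(n)   # Y_i = X_i + N(0, Gamma^2)

# Least-squares fit of F(y) = a*y + b predicting X from Y
A = np.stack([Y, np.ones(n)], axis=1)
(a, b), *_ = np.linalg.lstsq(A, X, rcond=None)

# Tweedie: F(y) = y + G2 * grad log p_Y(y) = (s2*y + G2*m) / (s2 + G2)
print(a, s2 / (s2 + G2))       # fitted slope vs. predicted slope
print(b, G2 * m / (s2 + G2))   # fitted intercept vs. predicted intercept
```

With a flexible (e.g. neural-network) regressor instead of a linear one, the same recipe estimates the score of an arbitrary $p_Y$, which is precisely how the drift of the reverse diffusion is learned in practice.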

References

Anderson, Brian D. O. 1982. "Reverse-Time Diffusion Equation Models." Stochastic Processes and Their Applications 12 (3): 313–26.
Efron, Bradley. 2011. “Tweedie’s Formula and Selection Bias.” Journal of the American Statistical Association 106 (496): 1602–14.
Haussmann, Ulrich G., and Etienne Pardoux. 1986. "Time Reversal of Diffusions." The Annals of Probability, 1188–1205.
Ho, Jonathan, Ajay Jain, and Pieter Abbeel. 2020. “Denoising Diffusion Probabilistic Models.” Advances in Neural Information Processing Systems 33: 6840–51.
Vincent, Pascal. 2011. “A Connection Between Score Matching and Denoising Autoencoders.” Neural Computation 23 (7): 1661–74.