<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Alexandre Thiéry</title>
<link>https://alexxthiery.github.io/notes/index_notes_as_list.html</link>
<atom:link href="https://alexxthiery.github.io/notes/index_notes_as_list.xml" rel="self" type="application/rss+xml"/>
<description>Alex Thiery Notes</description>
<generator>quarto-1.7.32</generator>
<lastBuildDate>Mon, 23 Feb 2026 17:00:00 GMT</lastBuildDate>
<item>
  <title>The mean-field Potts model</title>
  <link>https://alexxthiery.github.io/notes/potts_transition/potts.html</link>
  <description><![CDATA[ 





<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/potts_transition/potts_scatter.gif" class="img-fluid figure-img" style="width:85.0%"></p>
<figcaption>Mean-field Potts model with <img src="https://latex.codecogs.com/png.latex?q=3">, displayed in barycentric coordinates</figcaption>
</figure>
</div>
</div>
<p>The mean-field <a href="https://en.wikipedia.org/wiki/Potts_model">Potts model</a> is a very simple model with a first-order phase transition; for Monte Carlo simulation, it is especially interesting because this first-order <a href="https://en.wikipedia.org/wiki/Phase_transition">phase transition</a> implies that standard tempering strategies such as <a href="https://en.wikipedia.org/wiki/Parallel_tempering">parallel tempering</a> or sequential Monte Carlo are inefficient: increasing the number of intermediate temperatures does not help (much) in producing samples at low temperatures <span class="citation" data-cites="woodard2009sufficient bhatnagar2004torpid">(Woodard, Schmidler, and Huber 2009; Bhatnagar and Randall 2004)</span>.</p>
<section id="potts-model" class="level3">
<h3 class="anchored" data-anchor-id="potts-model">Potts model</h3>
<p>Consider the Potts model with <img src="https://latex.codecogs.com/png.latex?q"> colors on the complete graph with <img src="https://latex.codecogs.com/png.latex?N"> vertices. A configuration is <img src="https://latex.codecogs.com/png.latex?%5Csigma=(%5Csigma_1,%5Cdots,%5Csigma_N)"> with <img src="https://latex.codecogs.com/png.latex?%5Csigma_i%5Cin%5C%7B1,%5Cdots,q%5C%7D">. Define the energy <img src="https://latex.codecogs.com/png.latex?%0AE(%5Csigma)%5C;=%5C;-%5Cfrac%7B1%7D%7B2N%7D%5Csum_%7Bi,j=1%7D%5EN%20%5Cmathbf%201%5C%7B%5Csigma_i=%5Csigma_j%5C%7D.%0A"> At inverse temperature <img src="https://latex.codecogs.com/png.latex?%5Cbeta">, the Boltzmann distribution is <img src="https://latex.codecogs.com/png.latex?%5Cmu_%5Cbeta(%5Csigma)=e%5E%7B-%5Cbeta%20E(%5Csigma)%7D%20/%20Z_%5Cbeta">. On the complete graph, the only relevant macroscopic variable is the empirical proportions vector <img src="https://latex.codecogs.com/png.latex?%0A%5Crho=(%5Crho_1,%5Cdots,%5Crho_q),%5Cqquad%0A%5Crho_a=%5Cfrac1N%5Cbigl%7C%5C%7Bi:%5Csigma_i=a%5C%7D%5Cbigr%7C.%0A"></p>
<p>so that <img src="https://latex.codecogs.com/png.latex?%5Csum_%7Ba=1%7D%5Eq%20%5Crho_a=1">. A short computation rewrites the energy in terms of <img src="https://latex.codecogs.com/png.latex?%5Crho">: <img src="https://latex.codecogs.com/png.latex?%0AE(%5Crho)=%20-%5Cfrac%7BN%7D%7B2%7D%5Csum_%7Ba=1%7D%5Eq%20%5Crho_a%5E2,%0A"> up to an additive constant irrelevant for Gibbs weights. The number of configurations with a given <img src="https://latex.codecogs.com/png.latex?%5Crho"> is <img src="https://latex.codecogs.com/png.latex?%5Cexp%5C%7BN%20%5C,%20H(%5Crho)+o(N)%5C%7D">, where <img src="https://latex.codecogs.com/png.latex?H(%5Crho)=-%5Csum_a%20%5Crho_a%5Clog%5Crho_a"> is the <a href="https://en.wikipedia.org/wiki/Entropy_(information_theory)">Shannon entropy</a>. Putting energy and entropy together, typical samples from <img src="https://latex.codecogs.com/png.latex?%5Cmu_%5Cbeta"> concentrate near minimizers of the mean-field free-energy functional: <img src="https://latex.codecogs.com/png.latex?%0A%5CPhi_%5Cbeta(%5Crho)%0A=%5Csum_%7Ba=1%7D%5Eq%20%5Crho_a%5Clog%5Crho_a%0A-%5Cfrac%7B%5Cbeta%7D%7B2%7D%5Csum_%7Ba=1%7D%5Eq%20%5Crho_a%5E2%0A"></p>
<p>constrained to the probability simplex. Everything that follows is geometry: how the minima of <img src="https://latex.codecogs.com/png.latex?%5CPhi_%5Cbeta"> evolve as the inverse temperature <img src="https://latex.codecogs.com/png.latex?%5Cbeta"> varies.</p>
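<p>The free-energy functional above is straightforward to evaluate numerically. The following minimal sketch (plain numpy; not part of the original notes) computes it on the simplex and illustrates the color-permutation symmetry exploited below:</p>

```python
import numpy as np

def potts_free_energy(rho, beta):
    """Phi_beta(rho) = sum_a rho_a log(rho_a) - (beta/2) sum_a rho_a^2."""
    rho = np.asarray(rho, dtype=float)
    return np.sum(rho * np.log(rho)) - 0.5 * beta * np.sum(rho**2)

q, beta = 3, 2.5
rho = np.array([0.5, 0.3, 0.2])
# permuting the colors leaves Phi_beta unchanged
print(potts_free_energy(rho, beta), potts_free_energy(rho[[2, 0, 1]], beta))
# value at the disordered point: log(1/q) - beta/(2q)
print(potts_free_energy(np.full(q, 1 / q), beta))
```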
<p><strong>Local minima:</strong> Two features make the analysis almost trivial. First, permuting colors leaves <img src="https://latex.codecogs.com/png.latex?%5CPhi_%5Cbeta"> unchanged. Second, a stationary point under the simplex constraint satisfies a Lagrange-multiplier condition <img src="https://latex.codecogs.com/png.latex?%0A%5Clog%5Crho_a+1-%5Cbeta%5Crho_a=%5Clambda%0A"> for <img src="https://latex.codecogs.com/png.latex?a=1,%5Cdots,q">. Hence each coordinate <img src="https://latex.codecogs.com/png.latex?%5Crho_a"> must solve the same scalar equation. From this, one finds that all local minima are necessarily of the following two types:</p>
<ul>
<li><p><strong>Disordered point</strong> (uniform): <img src="https://latex.codecogs.com/png.latex?%0A%5Crho%5E%7B%5Cmathrm%7Bdis%7D%7D=%5CBigl(%5Cfrac1q,%5Cdots,%5Cfrac1q%5CBigr),%0A"> always a stationary point.</p></li>
<li><p><strong>Ordered points</strong> (one dominant color, the rest equal): <img src="https://latex.codecogs.com/png.latex?%0A%5Crho%5E%7B%5Cmathrm%7Bord%7D%7D(r)=%5CBigl(r,%5C%20%5Cunderbrace%7B%5Cfrac%7B1-r%7D%7Bq-1%7D,%5Cdots,%5Cfrac%7B1-r%7D%7Bq-1%7D%7D_%7Bq-1%5C%20%5Ctext%7Btimes%7D%7D%5CBigr)%0A"> for some dominant proportion <img src="https://latex.codecogs.com/png.latex?r%3E1/q">. There are <img src="https://latex.codecogs.com/png.latex?q"> such points by choosing which color is dominant.</p></li>
</ul>
<p>The remaining work is algebra: determine, as a function of <img src="https://latex.codecogs.com/png.latex?%5Cbeta">, when ordered stationary points exist, and which of the stationary points are local minima versus saddles. The details are routine and add little insight.</p>
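<p>The routine algebra can also be checked numerically. A sketch (numpy assumed; not from the original notes): restrict the free energy to the ordered slice and scan the inverse temperature for the first appearance of an interior local minimum with a dominant proportion above <img src="https://latex.codecogs.com/png.latex?1/q">:</p>

```python
import numpy as np

q = 3
rs = np.linspace(1 / q + 0.02, 0.999, 4000)

def phi_ordered(r, beta):
    """Phi_beta along the ordered slice rho = (r, (1-r)/(q-1), ..., (1-r)/(q-1))."""
    rest = (1 - r) / (q - 1)
    entropy = r * np.log(r) + (q - 1) * rest * np.log(rest)
    return entropy - 0.5 * beta * (r**2 + (q - 1) * rest**2)

def has_ordered_minimum(beta):
    """Does Phi_beta have an interior local minimum with r > 1/q on the slice?"""
    v = phi_ordered(rs, beta)
    return bool(np.any((v[1:-1] < v[:-2]) & (v[1:-1] < v[2:])))

# smallest beta (on a grid) at which ordered local minima exist, for q = 3
beta_first = next(b for b in np.arange(2.6, 2.8, 0.001) if has_ordered_minimum(b))
print(beta_first)
```

For <img src="https://latex.codecogs.com/png.latex?q=3">, the scan locates the appearance of ordered minima strictly below the coexistence value <img src="https://latex.codecogs.com/png.latex?4%5Clog%202%20%5Capprox%202.77"> from the next section.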
</section>
<section id="the-phase-diagram-in-beta" class="level2">
<h2 class="anchored" data-anchor-id="the-phase-diagram-in-beta">The phase diagram in <img src="https://latex.codecogs.com/png.latex?%5Cbeta"></h2>
<p>Assume <img src="https://latex.codecogs.com/png.latex?q%5Cge%203">. Then the model exhibits a first-order transition, and metastability appears on both sides. There are two key inverse temperatures:</p>
<ul>
<li><strong>Spinodal threshold <img src="https://latex.codecogs.com/png.latex?%5Cbeta_s"></strong>: the inverse temperature at which the ordered stationary points first appear as local minima.</li>
<li><strong>Coexistence threshold <img src="https://latex.codecogs.com/png.latex?%5Cbeta_c"></strong>: the disordered minimum and the ordered minimum have equal free energy.</li>
</ul>
<p>For the normalization above, the coexistence point is <img src="https://latex.codecogs.com/png.latex?%0A%5Cbeta_c=%5Cfrac%7B2(q-1)%5Clog(q-1)%7D%7Bq-2%7D.%0A"></p>
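<p>This formula is easy to verify numerically: at the coexistence point, the disordered free energy and the minimum of the free energy along the ordered slice should coincide. A quick check (numpy; a sketch, not from the original notes):</p>

```python
import numpy as np

def coexistence_gap(q):
    """Phi at the disordered point minus the slice minimum of Phi, at beta_c."""
    beta_c = 2 * (q - 1) * np.log(q - 1) / (q - 2)
    r = np.linspace(1 / q + 1e-4, 1 - 1e-4, 200001)
    rest = (1 - r) / (q - 1)
    phi_ord = (r * np.log(r) + (q - 1) * rest * np.log(rest)
               - 0.5 * beta_c * (r**2 + (q - 1) * rest**2))
    phi_dis = np.log(1 / q) - 0.5 * beta_c / q
    return phi_dis - phi_ord.min()

for q in (3, 5, 10):
    print(q, coexistence_gap(q))   # gaps vanish up to grid resolution
```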
<p>The geometry of <img src="https://latex.codecogs.com/png.latex?%5CPhi_%5Cbeta"> splits into three regimes.</p>
<section id="high-temperature-betabeta_s" class="level3">
<h3 class="anchored" data-anchor-id="high-temperature-betabeta_s">High temperature: <img src="https://latex.codecogs.com/png.latex?%5Cbeta%3C%5Cbeta_s"></h3>
<ul>
<li><strong>Critical points:</strong> only the uniform point <img src="https://latex.codecogs.com/png.latex?%5Crho%5E%7B%5Cmathrm%7Bdis%7D%7D">.</li>
<li><strong>Landscape:</strong> <img src="https://latex.codecogs.com/png.latex?%5CPhi_%5Cbeta"> is strictly minimized at <img src="https://latex.codecogs.com/png.latex?%5Crho%5E%7B%5Cmathrm%7Bdis%7D%7D">.</li>
<li><strong>Meaning:</strong> entropy dominates; colors mix.</li>
</ul>
</section>
<section id="metastable-coexistence-beta_sbetabeta_c" class="level3">
<h3 class="anchored" data-anchor-id="metastable-coexistence-beta_sbetabeta_c">Metastable coexistence: <img src="https://latex.codecogs.com/png.latex?%5Cbeta_s%3C%5Cbeta%3C%5Cbeta_c"></h3>
<ul>
<li><strong>Critical points:</strong> the uniform point remains a (global) minimum, but there are also <img src="https://latex.codecogs.com/png.latex?q"> ordered local minima <img src="https://latex.codecogs.com/png.latex?%5Crho%5E%7B%5Cmathrm%7Bord%7D%7D"> and corresponding saddles separating them from the uniform basin.</li>
<li><strong>Landscape:</strong> multiple basins exist, but the uniform basin is lowest.</li>
</ul>
</section>
<section id="low-temperature-betabeta_c" class="level3">
<h3 class="anchored" data-anchor-id="low-temperature-betabeta_c">Low temperature: <img src="https://latex.codecogs.com/png.latex?%5Cbeta%3E%5Cbeta_c"></h3>
<ul>
<li><strong>Critical points:</strong> the ordered minima become the global minima; the uniform point persists as a local minimum (until it eventually disappears at a second spinodal on the low-temperature side).</li>
<li><strong>Landscape:</strong> the roles swap: ordered basins are deepest; the uniform basin becomes metastable.</li>
<li><strong>Meaning:</strong> energy dominates; one color wins.</li>
</ul>
<p>At <img src="https://latex.codecogs.com/png.latex?%5Cbeta=%5Cbeta_c">, the global minimizer changes discontinuously (as can be seen in the animation at the start of these notes): the equilibrium state jumps from <img src="https://latex.codecogs.com/png.latex?%5Crho%5E%7B%5Cmathrm%7Bdis%7D%7D"> to an ordered vector with one strictly larger coordinate. This is in contrast with the (mean-field) Curie–Weiss Ising model (<img src="https://latex.codecogs.com/png.latex?q=2">), where the ordered minimizers bifurcate continuously from the disordered one, i.e.&nbsp;a second-order transition. For <img src="https://latex.codecogs.com/png.latex?q%5Cge%203">, ordered minima appear while the uniform minimum is still globally optimal, and the eventual swap of global minima happens with a jump. That single change in the geometry of <img src="https://latex.codecogs.com/png.latex?%5CPhi_%5Cbeta"> is the entire origin of the first-order transition and the metastable window.</p>
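<p>The jump can be traced numerically by following the global minimizer along the ordered slice as the inverse temperature crosses the coexistence point (a sketch for <img src="https://latex.codecogs.com/png.latex?q=3">; numpy assumed, not from the original notes):</p>

```python
import numpy as np

q = 3
beta_c = 2 * (q - 1) * np.log(q - 1) / (q - 2)   # = 4 log 2 for q = 3
r = np.linspace(1 / q, 1 - 1e-6, 400001)
rest = (1 - r) / (q - 1)
entropy = r * np.log(r) + (q - 1) * rest * np.log(rest)
sum_sq = r**2 + (q - 1) * rest**2

def equilibrium_r(beta):
    """Dominant proportion of the global minimizer on the ordered slice."""
    return r[np.argmin(entropy - 0.5 * beta * sum_sq)]

print(equilibrium_r(beta_c - 0.05))   # disordered: r = 1/q
print(equilibrium_r(beta_c + 0.05))   # ordered: one dominant color
```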



</section>
</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0">
<div id="ref-bhatnagar2004torpid" class="csl-entry">
Bhatnagar, Nayantara, and Dana Randall. 2004. <span>“Torpid Mixing of Simulated Tempering on the Potts Model.”</span> In <em>SODA</em>, 4:478–87.
</div>
<div id="ref-woodard2009sufficient" class="csl-entry">
Woodard, Dawn, Scott Schmidler, and Mark Huber. 2009. <span>“Sufficient Conditions for Torpid Mixing of Parallel and Simulated Tempering.”</span>
</div>
</div></section></div> ]]></description>
  <category>analysis</category>
  <guid>https://alexxthiery.github.io/notes/potts_transition/potts.html</guid>
  <pubDate>Mon, 23 Feb 2026 17:00:00 GMT</pubDate>
</item>
<item>
  <title>Gegenbauer Polynomials and the Laplacian</title>
  <link>https://alexxthiery.github.io/notes/gegenbauer/gegenbauer.html</link>
  <description><![CDATA[ 





<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/gegenbauer/gegenbauer.jpg" class="img-fluid figure-img" style="width:35.0%"></p>
<figcaption><a href="https://en.wikipedia.org/wiki/Leopold_Gegenbauer">Leopold Gegenbauer</a> (1849 – 1903)</figcaption>
</figure>
</div>
</div>
<section id="zonal-spherical-harmonics" class="level3">
<h3 class="anchored" data-anchor-id="zonal-spherical-harmonics">Zonal spherical harmonics</h3>
<p>Understanding the eigenfunctions of the spherical Laplacian is a central task in analysis on the sphere: these eigenfunctions form the building blocks of harmonic analysis on spheres.</p>
<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/gegenbauer/Spherical_Harmonics.png" class="img-fluid figure-img" style="width:85.0%"></p>
<figcaption>Spherical Harmonics</figcaption>
</figure>
</div>
</div>
<p>A particularly important situation arises when the function of interest depends only on the angular separation from a fixed axis.</p>
<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/gegenbauer/zonal.jpg" class="img-fluid figure-img" style="width:35.0%"></p>
<figcaption>Zonal Spherical Harmonics</figcaption>
</figure>
</div>
</div>
<p>Such zonal functions retain full rotational symmetry around that axis and reduce the spherical Laplacian to a one-dimensional operator. Its eigenfunctions turn out to be polynomials on the interval <img src="https://latex.codecogs.com/png.latex?%5B-1,1%5D"> with a very specific weight. These polynomials are the <a href="https://en.wikipedia.org/wiki/Gegenbauer_polynomials">Gegenbauer polynomials</a>, which generalize several classical families of orthogonal polynomials, including the Legendre and Chebyshev polynomials.</p>
</section>
<section id="reminder-on-differential-operators" class="level2">
<h2 class="anchored" data-anchor-id="reminder-on-differential-operators">Reminder on differential operators</h2>
<p>Many problems require the ability to compute gradients, divergences, and Laplacians in arbitrary coordinate systems. Before specializing to spherical coordinates in higher dimensions, it is useful to recall the geometric meaning of these operators and the minimal formalism needed to manipulate them. Suppose that a point in space is described by a collection of coordinates <img src="https://latex.codecogs.com/png.latex?%0Aq%20=%20(q%5E1,%20%5Cdots,%20q%5En).%0A"> If we make a small change <img src="https://latex.codecogs.com/png.latex?(dq%5E1,%5Cdots,dq%5En)">, the physical displacement has a length denoted by <img src="https://latex.codecogs.com/png.latex?ds">. In an orthogonal coordinate system, each coordinate direction comes with a scale factor <img src="https://latex.codecogs.com/png.latex?h_i(q)"> such that a change <img src="https://latex.codecogs.com/png.latex?dq%5Ei"> corresponds to a “physical” displacement of length <img src="https://latex.codecogs.com/png.latex?h_i(q)%5C,%20dq%5Ei"> along the unit vector <img src="https://latex.codecogs.com/png.latex?e_i">. These functions <img src="https://latex.codecogs.com/png.latex?h_i"> encode how the coordinate system stretches or compresses distances along each axis. Once the scale factors are known, the rest of the differential operators follow directly.</p>
<p><strong>The gradient:</strong> For a scalar function <img src="https://latex.codecogs.com/png.latex?f(q)">, the change produced by varying only <img src="https://latex.codecogs.com/png.latex?q%5Ei"> is approximately <img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B%5Cpartial%20f%7D%7B%5Cpartial%20q%5Ei%7D%20%5C,%20dq%5Ei">. The physical distance traveled in this move is <img src="https://latex.codecogs.com/png.latex?h_i%20%5C,%20dq%5Ei">. The rate of increase of <img src="https://latex.codecogs.com/png.latex?f"> per unit distance in the direction <img src="https://latex.codecogs.com/png.latex?e_i"> is therefore <img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B1%7D%7Bh_i%7D%20%5Cfrac%7B%5Cpartial%20f%7D%7B%5Cpartial%20q%5Ei%7D">. This is the <img src="https://latex.codecogs.com/png.latex?i">th component of the gradient. Summing over all directions gives <img src="https://latex.codecogs.com/png.latex?%0A%5Cnabla%20f%20=%20%5Csum_%7Bi=1%7D%5En%20%5Cleft(%20%5Cfrac%7B1%7D%7Bh_i%7D%5Cfrac%7B%5Cpartial%20f%7D%7B%5Cpartial%20q%5Ei%7D%20%5Cright)%20e_i.%0A"></p>
<p><strong>The divergence</strong> describes the net outflow of a vector field from an infinitesimal volume around a point. Consider a vector field <img src="https://latex.codecogs.com/png.latex?v%20=%20%5Csum_i%20v%5Ei%20e_i">. Place a tiny coordinate-aligned box at a point. Its physical edge lengths are <img src="https://latex.codecogs.com/png.latex?h_i%20%5C,%20dq%5Ei">, so its infinitesimal volume is: <img src="https://latex.codecogs.com/png.latex?%0AdV%20=%20%5Cleft(%20%5Cprod_%7Bi=1%7D%5En%20h_i(q)%20%5Cright)%20dq%5E1%20%5Ccdots%20dq%5En.%0A"> The flux of <img src="https://latex.codecogs.com/png.latex?v"> through the pair of faces orthogonal to <img src="https://latex.codecogs.com/png.latex?e_i"> is <img src="https://latex.codecogs.com/png.latex?v%5Ei"> multiplied by the physical area of those faces: <img src="https://latex.codecogs.com/png.latex?%5Cprod_%7Bj%5Cne%20i%7D%20h_j">. The divergence is the total outward flux divided by the infinitesimal volume: <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathop%7B%5Cmathrm%7Bdiv%7D%7Dv%0A=%0A%5Cfrac%7B1%7D%7B%5Cprod_j%20h_j%7D%0A%5Csum_%7Bi=1%7D%5En%0A%5Cfrac%7B%5Cpartial%7D%7B%5Cpartial%20q%5Ei%7D%0A%5Cleft(%0Av%5Ei%20%5C,%20%5Cfrac%7B%5Cprod_j%20h_j%7D%7Bh_i%7D%0A%5Cright).%0A"></p>
<p>This expression is simply the “flux out through the face in direction <img src="https://latex.codecogs.com/png.latex?i">” minus the “flux in through the opposite face”, summed over directions and normalized by the physical volume.</p>
<p><strong>The Laplacian</strong> is the divergence of the gradient. It describes how a scalar function curves around a point since it measures the net outflow of the gradient field. It also encodes how <img src="https://latex.codecogs.com/png.latex?f"> compares to its local averages over small spheres. In orthogonal coordinates, inserting the expression for the gradient into the divergence formula yields: <span id="eq-master-laplacian"><img src="https://latex.codecogs.com/png.latex?%0A%5CDelta%20f%0A=%0A%5Cfrac%7B1%7D%7B%5Cprod_j%20h_j%7D%0A%5Csum_%7Bi=1%7D%5En%0A%5Cfrac%7B%5Cpartial%7D%7B%5Cpartial%20q%5Ei%7D%0A%5Cleft(%0A%5Cfrac%7B%5Cprod_j%20h_j%7D%7Bh_i%5E2%7D%0A%5Cfrac%7B%5Cpartial%20f%7D%7B%5Cpartial%20q%5Ei%7D%0A%5Cright).%0A%5Ctag%7B1%7D"></span></p>
<p>This is the expression we will apply to spherical coordinates in the next section.</p>
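<p>As a quick sanity check (not in the original notes; plain numpy with finite differences), one can verify Equation&nbsp;1 in the familiar case of plane polar coordinates, where <img src="https://latex.codecogs.com/png.latex?h_r%20=%201"> and <img src="https://latex.codecogs.com/png.latex?h_%5Ctheta%20=%20r">:</p>

```python
import numpy as np

# Check Equation 1 in plane polar coordinates (h_r = 1, h_theta = r)
# for f(x, y) = x^2 * y, whose Cartesian Laplacian is 2 * y.

def f_polar(rr, th):
    x, y = rr * np.cos(th), rr * np.sin(th)
    return x**2 * y

h = 1e-4
r0, th0 = 1.3, 0.7

def d_r(g, rr, th):
    """Central finite difference of g in the radial variable."""
    return (g(rr + h, th) - g(rr - h, th)) / (2 * h)

# radial part of Equation 1: (1/r) d/dr ( r df/dr )
radial = d_r(lambda rr, th: rr * d_r(f_polar, rr, th), r0, th0) / r0
# angular part: (1/r^2) d^2 f / d theta^2
angular = (f_polar(r0, th0 + h) - 2 * f_polar(r0, th0)
           + f_polar(r0, th0 - h)) / h**2 / r0**2

exact = 2 * r0 * np.sin(th0)   # 2 * y at the same point
print(radial + angular, exact)
```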
<section id="spherical-coordinates-and-the-laplacian" class="level3">
<h3 class="anchored" data-anchor-id="spherical-coordinates-and-the-laplacian">Spherical coordinates and the Laplacian</h3>
<p>To analyze rotationally symmetric functions, we now specialize the general formulas from the previous section to spherical coordinates in <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BR%7D%5Ed">. A point <img src="https://latex.codecogs.com/png.latex?x%20%5Cin%20%5Cmathbb%7BR%7D%5Ed"> is described by a radius <img src="https://latex.codecogs.com/png.latex?r%20%5Cge%200"> together with <img src="https://latex.codecogs.com/png.latex?(d-1)"> angular coordinates <img src="https://latex.codecogs.com/png.latex?%0A(r,%5Ctheta_1,%5Cdots,%5Ctheta_%7Bd-1%7D).%0A"> The radius <img src="https://latex.codecogs.com/png.latex?r%20=%20%5C%7Cx%5C%7C"> determines the sphere on which the point lies, and the angles <img src="https://latex.codecogs.com/png.latex?%5Ctheta_1,%5Cdots,%5Ctheta_%7Bd-1%7D"> specify a direction on the unit sphere <img src="https://latex.codecogs.com/png.latex?S%5E%7Bd-1%7D">. Geometrically, the construction is recursive: fixing <img src="https://latex.codecogs.com/png.latex?%5Ctheta_1"> leaves an <img src="https://latex.codecogs.com/png.latex?S%5E%7Bd-2%7D">; choosing <img src="https://latex.codecogs.com/png.latex?%5Ctheta_2"> then fixes a point on that <img src="https://latex.codecogs.com/png.latex?S%5E%7Bd-2%7D">, and so on until the last coordinate <img src="https://latex.codecogs.com/png.latex?%5Ctheta_%7Bd-1%7D">, which parameterizes a circle. In these coordinates, motion in the radial direction has physical length <img src="https://latex.codecogs.com/png.latex?dr"> so that <img src="https://latex.codecogs.com/png.latex?h_r%20=%201">. Motion in the direction of <img src="https://latex.codecogs.com/png.latex?%5Ctheta_1"> traces a circle of radius <img src="https://latex.codecogs.com/png.latex?r"> so that <img src="https://latex.codecogs.com/png.latex?h_%7B%5Ctheta_1%7D%20=%20r">. 
Holding <img src="https://latex.codecogs.com/png.latex?r"> and <img src="https://latex.codecogs.com/png.latex?%5Ctheta_1"> fixed while varying <img src="https://latex.codecogs.com/png.latex?%5Ctheta_2"> traces a circle of radius <img src="https://latex.codecogs.com/png.latex?r%20%5Csin%20%5Ctheta_1">, hence <img src="https://latex.codecogs.com/png.latex?h_%7B%5Ctheta_2%7D%20=%20r%20%5Csin%20%5Ctheta_1">. Continuing in this way, the general pattern is <img src="https://latex.codecogs.com/png.latex?%0Ah_%7B%5Ctheta_k%7D%0A=%20r%20%5Csin%20%5Ctheta_1%20%5Ccdots%20%5Csin%20%5Ctheta_%7Bk-1%7D,%0A%5Cqquad%20k%20=%201,%5Cdots,d-1.%0A"> These scale factors reflect the fact that angular displacements correspond to motion along circles whose radii depend on the previously chosen angles. The physical volume of an infinitesimal coordinate box is the product of all scale factors, giving <img src="https://latex.codecogs.com/png.latex?%0AdV%0A=%20r%5E%7Bd-1%7D%5C,%0A(%5Csin%5Ctheta_1)%5E%7Bd-2%7D%0A(%5Csin%5Ctheta_2)%5E%7Bd-3%7D%0A%5Ccdots%0A(%5Csin%5Ctheta_%7Bd-2%7D)%5C,%0Adr%5C,%20d%5Ctheta_1%20%5Ccdots%20d%5Ctheta_%7Bd-1%7D.%0A"> The factor <img src="https://latex.codecogs.com/png.latex?r%5E%7Bd-1%7D"> is the familiar scaling of the surface area of a sphere. The remaining sine powers encode the intrinsic geometry of <img src="https://latex.codecogs.com/png.latex?S%5E%7Bd-1%7D">.</p>
<p><strong>Spherical Laplacian:</strong> Inserting these scale factors into the general expression Equation&nbsp;1 gives the Laplacian in spherical coordinates: <img src="https://latex.codecogs.com/png.latex?%0A%5CDelta%20f%0A=%0A%5Cbig%5B%20%5Cpartial_r%5E2%20f%20+%20%5Cfrac%7Bd-1%7D%7Br%7D%5Cpartial_r%20f%20%5Cbig%5D%0A+%20%5Cfrac%7B1%7D%7Br%5E2%7D%5C,%5CDelta_%7BS%5E%7Bd-1%7D%7D%20f.%0A"> The operator <img src="https://latex.codecogs.com/png.latex?%5CDelta_%7BS%5E%7Bd-1%7D%7D"> is the Laplacian intrinsic to the unit sphere <img src="https://latex.codecogs.com/png.latex?S%5E%7Bd-1%7D">, whose expression is not particularly enlightening or useful for our purposes here. For example, the first term reads: <img src="https://latex.codecogs.com/png.latex?%0A%5CDelta_%7BS%5E%7Bd-1%7D%7D%20f%0A=%0A%5Cfrac%7B1%7D%7B(%5Csin%5Ctheta_1)%5E%7Bd-2%7D%7D%5C,%5Cpartial_%7B%5Ctheta_1%7D%0A%5Cleft(%0A(%5Csin%5Ctheta_1)%5E%7Bd-2%7D%20%5C,%5Cpartial_%7B%5Ctheta_1%7D%20f%0A%5Cright)%0A+%20%5Ccdots%0A"> What matters is that it acts only on the angular variables <img src="https://latex.codecogs.com/png.latex?%5Ctheta_1,%5Cdots,%5Ctheta_%7Bd-1%7D">, treating <img src="https://latex.codecogs.com/png.latex?r"> as a constant. The factor <img src="https://latex.codecogs.com/png.latex?(d-1)/r"> arises because the surface area of spheres grows like <img src="https://latex.codecogs.com/png.latex?r%5E%7Bd-1%7D">, while the factor <img src="https://latex.codecogs.com/png.latex?1/r%5E2"> preceding <img src="https://latex.codecogs.com/png.latex?%5CDelta_%7BS%5E%7Bd-1%7D%7D"> reflects the fact that angular motion takes place along circles of radius <img src="https://latex.codecogs.com/png.latex?r">. Furthermore, this scaling is clear by dimensional analysis: the Laplacian has units of inverse length squared, and <img src="https://latex.codecogs.com/png.latex?%5CDelta_%7BS%5E%7Bd-1%7D%7D"> is dimensionless since it acts on the unit sphere. 
In the next section, we restrict <img src="https://latex.codecogs.com/png.latex?%5CDelta_%7BS%5E%7Bd-1%7D%7D"> to zonal functions, which depend only on the polar angle <img src="https://latex.codecogs.com/png.latex?%5Ctheta_1">.</p>
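<p>Before doing so, a quick numerical sanity check of the radial part of this decomposition (a sketch with numpy and finite differences; not from the original notes). For a purely radial function the angular term drops out entirely:</p>

```python
import numpy as np

# For the radial function f(x) = exp(-|x|^2 / 2) on R^3, the decomposition
# reduces to Delta f = f''(r) + ((d-1)/r) f'(r) with d = 3.

def f(x):
    return np.exp(-np.dot(x, x) / 2)

x0 = np.array([0.5, 0.4, 0.3])
r = np.linalg.norm(x0)

# f'(r) = -r e^{-r^2/2}, f''(r) = (r^2 - 1) e^{-r^2/2}, so
# Delta f = f'' + (2/r) f' = (r^2 - 3) e^{-r^2/2}
radial_formula = (r**2 - 3) * np.exp(-r**2 / 2)

# compare with a Cartesian finite-difference Laplacian
h = 1e-4
cartesian = sum((f(x0 + h * e) - 2 * f(x0) + f(x0 - h * e)) / h**2
                for e in np.eye(3))
print(radial_formula, cartesian)
```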
</section>
<section id="zonal-functions" class="level3">
<h3 class="anchored" data-anchor-id="zonal-functions">Zonal functions</h3>
<p>We now study the angular part of the Laplacian on the sphere. A particularly important class of functions is the class of zonal functions, which depend only on the angle with a fixed direction. Fix a unit vector <img src="https://latex.codecogs.com/png.latex?e%20%5Cin%20S%5E%7Bd-1%7D">; typically, one takes <img src="https://latex.codecogs.com/png.latex?e"> to be the “north pole” <img src="https://latex.codecogs.com/png.latex?e%20=%20(1,0,%5Cldots,0)"> and we will do so here. A function <img src="https://latex.codecogs.com/png.latex?f:%20S%5E%7Bd-1%7D%20%5Cto%20%5Cmathbb%7BR%7D"> is called zonal (with respect to <img src="https://latex.codecogs.com/png.latex?e">) if it only depends on the inner product <img src="https://latex.codecogs.com/png.latex?x%20%5Ccdot%20e%20=%20%5Ccos%20%5Ctheta_1">; this means that there are functions <img src="https://latex.codecogs.com/png.latex?F:%20%5B0,%5Cpi%5D%20%5Cto%20%5Cmathbb%7BR%7D"> and <img src="https://latex.codecogs.com/png.latex?G:%20%5B-1,1%5D%20%5Cto%20%5Cmathbb%7BR%7D"> such that <img src="https://latex.codecogs.com/png.latex?%0Af(x)%20=%20F(%5Ctheta_1)%20=%20G(z)%0A"> where we set <img src="https://latex.codecogs.com/png.latex?z%20=%20%5Ccos%20%5Ctheta_1%20=%20x%20%5Ccdot%20e">. To keep notation simple, we will often conflate <img src="https://latex.codecogs.com/png.latex?f">, <img src="https://latex.codecogs.com/png.latex?F">, and <img src="https://latex.codecogs.com/png.latex?G"> and write <img src="https://latex.codecogs.com/png.latex?f(x)">, <img src="https://latex.codecogs.com/png.latex?f(%5Ctheta_1)">, or <img src="https://latex.codecogs.com/png.latex?f(z)"> depending on context. The zonal functions describe rotational symmetry around the axis spanned by <img src="https://latex.codecogs.com/png.latex?e"> and arise naturally in problems where only the angular separation between two points matters. 
If <img src="https://latex.codecogs.com/png.latex?f"> depends only on the polar angle <img src="https://latex.codecogs.com/png.latex?%5Ctheta_1">, all derivatives with respect to <img src="https://latex.codecogs.com/png.latex?%5Ctheta_2,%5Cdots,%5Ctheta_%7Bd-1%7D"> vanish. The spherical Laplacian therefore reduces to the one-dimensional operator <img src="https://latex.codecogs.com/png.latex?%0A%5CDelta_%7BS%5E%7Bd-1%7D%7D%20f(x)%0A=%0A%5Cfrac%7B1%7D%7B%5Csin%5E%7Bd-2%7D%5Ctheta_1%7D%5C,%0A%5Cpartial_%7B%5Ctheta_1%7D%5C!%0A%5Cleft(%0A%5Csin%5E%7Bd-2%7D%5Ctheta_1%20%5C,%0A%5Cpartial_%7B%5Ctheta_1%7D%20f(%5Ctheta_1)%0A%5Cright).%0A"> Using <img src="https://latex.codecogs.com/png.latex?%5Csin%5Ctheta_1%20=%20%5Csqrt%7B1-z%5E2%7D"> and the chain rule, simple but tedious algebraic manipulations lead to the expression: <img src="https://latex.codecogs.com/png.latex?%0A%5CDelta_%7BS%5E%7Bd-1%7D%7D%20f(x)%0A=%0A(1-z%5E2)%20f''(z)%20-%20(d-1)%20z%20f'(z).%0A"> A convenient notation for the zonal part of the spherical Laplacian is <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7BL%7D_d%20f(z)%0A=%0A(1-z%5E2)%5C,%20f''(z)%20-%20(d-1)%5C,%20z%5C,%20f'(z)%0A"> for <img src="https://latex.codecogs.com/png.latex?z%20%5Cin%20%5B-1,1%5D">. This operator is simply the restriction of the spherical Laplacian <img src="https://latex.codecogs.com/png.latex?%5CDelta_%7BS%5E%7Bd-1%7D%7D"> to zonal functions. 
If <img src="https://latex.codecogs.com/png.latex?d%5Csigma"> denotes the surface measure on <img src="https://latex.codecogs.com/png.latex?S%5E%7Bd-1%7D">, then for any two zonal functions <img src="https://latex.codecogs.com/png.latex?f,g:%20S%5E%7Bd-1%7D%20%5Cto%20%5Cmathbb%7BR%7D">, one has, up to a constant factor coming from the angles that are integrated out: <img src="https://latex.codecogs.com/png.latex?%5Cint_%7BS%5E%7Bd-1%7D%7D%20f(x)%20%5C,%20g(x)%20%5C,%20d%5Csigma(x)%0A=%0A%5Cint_%7B-1%7D%5E1%20f(z)%20%5C,%20g(z)%20%5C,%20w_d(z)%5C,%20dz,%0A"> where integrating out all angles but the first gives the weight: <img src="https://latex.codecogs.com/png.latex?%0Aw_d(z)%20=%20(1-z%5E2)%5E%7B%5Cfrac%7Bd-3%7D%7B2%7D%7D.%0A"> Furthermore, since the spherical Laplacian is self-adjoint on <img src="https://latex.codecogs.com/png.latex?S%5E%7Bd-1%7D"> with respect to the standard inner product, so is <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7D_d"> with respect to this weighted inner product: <img src="https://latex.codecogs.com/png.latex?%5Cint_%7B-1%7D%5E1%20(%5Cmathcal%7BL%7D_d%20f)(z)%5C,%20g(z)%5C,%20w_d(z)%5C,%20dz%0A=%0A%5Cint_%7B-1%7D%5E1%20f(z)%5C,%20(%5Cmathcal%7BL%7D_d%20g)(z)%5C,%20w_d(z)%5C,%20dz.%0A"> Now, suppose we look for eigenfunctions of the spherical Laplacian on <img src="https://latex.codecogs.com/png.latex?S%5E%7Bd-1%7D">, i.e.&nbsp;functions <img src="https://latex.codecogs.com/png.latex?f"> satisfying <img src="https://latex.codecogs.com/png.latex?-%5CDelta_%7BS%5E%7Bd-1%7D%7D%20f%20=%20%5Clambda%20f">, that are zonal: <a href="https://en.wikipedia.org/wiki/Zonal_spherical_harmonics">zonal spherical harmonics</a>.</p>
<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/gegenbauer/zonal_spherical.jpg" class="img-fluid figure-img" style="width:35.0%"></p>
<figcaption><a href="https://en.wikipedia.org/wiki/Zonal_spherical_harmonics">Zonal spherical harmonics</a></figcaption>
</figure>
</div>
</div>
<p>This eigenvalue equation becomes <img src="https://latex.codecogs.com/png.latex?-%5Cmathcal%7BL%7D_d%20f%20=%20%5Clambda%20f">, which reads: <span id="eq-gegenbauer-ode"><img src="https://latex.codecogs.com/png.latex?%0A(1-z%5E2)%20f''(z)%20-%20(d-1)%20z%20f'(z)%20+%20%5Clambda%20f(z)%20=%200.%0A%5Ctag%7B2%7D"></span> Since the eigenvalues of the spherical Laplacian on <img src="https://latex.codecogs.com/png.latex?S%5E%7Bd-1%7D"> are given by: <img src="https://latex.codecogs.com/png.latex?%5Clambda%20=%20n(n+d-2)"> for integers <img src="https://latex.codecogs.com/png.latex?n%20%5Cge%200">, we set <img src="https://latex.codecogs.com/png.latex?%5Clambda%20=%20n(n+d-2)"> in the following.</p>
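<p>The weighted self-adjointness used above can be tested numerically with two arbitrary smooth functions on <img src="https://latex.codecogs.com/png.latex?%5B-1,1%5D"> (a sketch with <img src="https://latex.codecogs.com/png.latex?d=4"> and a hand-coded trapezoid rule; numpy assumed, not from the original notes):</p>

```python
import numpy as np

d = 4   # weight w_d(z) = (1 - z^2)**((d-3)/2) = sqrt(1 - z^2)

def L(df, d2f, z):
    """Zonal operator L_d f = (1 - z^2) f'' - (d - 1) z f'."""
    return (1 - z**2) * d2f(z) - (d - 1) * z * df(z)

# two arbitrary smooth test functions with their first two derivatives
f,  df,  d2f = (lambda z: z**3), (lambda z: 3 * z**2), (lambda z: 6 * z)
g,  dg,  d2g = (lambda z: z**2 + z), (lambda z: 2 * z + 1), (lambda z: 0 * z + 2)

z = np.linspace(-1.0, 1.0, 400001)
w = (1 - z**2) ** ((d - 3) / 2)
dz = z[1] - z[0]

def integrate(y):
    """Trapezoid rule on the grid z."""
    return dz * (0.5 * y[0] + 0.5 * y[-1] + y[1:-1].sum())

lhs = integrate(L(df, d2f, z) * g(z) * w)
rhs = integrate(f(z) * L(dg, d2g, z) * w)
print(lhs, rhs)   # the two integrals agree
```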
</section>
<section id="gegenbauer-polynomials." class="level3">
<h3 class="anchored" data-anchor-id="gegenbauer-polynomials.">Gegenbauer polynomials.</h3>
<p>Equation&nbsp;2 is a second-order ordinary differential equation on the interval <img src="https://latex.codecogs.com/png.latex?%5B-1,1%5D"> and it is customary to parametrize it by <img src="https://latex.codecogs.com/png.latex?%5Calpha%20%3E%20-%5Ctfrac12"> by setting <img src="https://latex.codecogs.com/png.latex?%5Calpha%20=%20d/2-1">. Since the eigenvalues of the spherical Laplacian in dimension <img src="https://latex.codecogs.com/png.latex?d"> are <img src="https://latex.codecogs.com/png.latex?%5Clambda%20=%20n(n+d-2)">, the equation becomes: <span id="eq-gegenbauer-ode-alpha"><img src="https://latex.codecogs.com/png.latex?(1-z%5E2)%20y''(z)%20-%20(2%5Calpha%20+%201)%20z%20y'(z)%20+%20n(n+2%5Calpha)%5C,%20y(z)%20=%200.%0A%5Ctag%7B3%7D"></span> One can then show that this equation admits polynomial solutions of degree <img src="https://latex.codecogs.com/png.latex?n">, called the <a href="https://en.wikipedia.org/wiki/Gegenbauer_polynomials">Gegenbauer polynomials</a> and usually denoted by <img src="https://latex.codecogs.com/png.latex?C_n%5E%7B(%5Calpha)%7D(z)">. Furthermore, if we insist that the solutions be regular on the entire interval <img src="https://latex.codecogs.com/png.latex?%5B-1,1%5D"> and can be lifted to smooth functions on the sphere, then these polynomial solutions are the only ones! For a given dimension <img src="https://latex.codecogs.com/png.latex?d=2%5Calpha+2"> and eigenvalue <img src="https://latex.codecogs.com/png.latex?%5Clambda%20=%20n(n+d-2)"> of the spherical Laplacian, up to rotational symmetry and normalization, there is a unique eigenfunction and it is described by the Gegenbauer polynomial <img src="https://latex.codecogs.com/png.latex?C_n%5E%7B(%5Calpha)%7D(z)">. 
They can be defined recursively as <img src="https://latex.codecogs.com/png.latex?C_0%5E%7B(%5Calpha)%7D(z)%20=%201"> and <img src="https://latex.codecogs.com/png.latex?C_1%5E%7B(%5Calpha)%7D(z)%20=%202%5Calpha%20z">, together with the three-term recurrence relation: <img src="https://latex.codecogs.com/png.latex?%0A(n+1)%20%5C,%20C_%7Bn+1%7D%5E%7B(%5Calpha)%7D(z)%0A=%202%20(n+%5Calpha)%20z%20%5C,%20C_n%5E%7B(%5Calpha)%7D(z)%0A-%20(n+2%5Calpha%20-1)%20%5C,%20C_%7Bn-1%7D%5E%7B(%5Calpha)%7D(z).%0A"></p>
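<p>As a quick numerical sanity check (a sketch assuming SciPy is available; <code>scipy.special.eval_gegenbauer</code> provides reference values), the three-term recurrence can be implemented directly:</p>

```python
# Numerical check of the Gegenbauer three-term recurrence against SciPy.
import numpy as np
from scipy.special import eval_gegenbauer

def gegenbauer_recurrence(n, alpha, z):
    """Evaluate C_n^{(alpha)}(z) via the three-term recurrence."""
    c_prev, c_curr = np.ones_like(z), 2.0 * alpha * z  # C_0 and C_1
    if n == 0:
        return c_prev
    for k in range(1, n):
        # (k+1) C_{k+1} = 2 (k+alpha) z C_k - (k + 2 alpha - 1) C_{k-1}
        c_prev, c_curr = c_curr, (2.0 * (k + alpha) * z * c_curr
                                  - (k + 2.0 * alpha - 1.0) * c_prev) / (k + 1.0)
    return c_curr

z = np.linspace(-1.0, 1.0, 201)
for n in range(6):
    assert np.allclose(gegenbauer_recurrence(n, 3.0, z),
                       eval_gegenbauer(n, 3.0, z))
```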
<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/gegenbauer/poly.png" class="img-fluid figure-img" style="width:85.0%"></p>
<figcaption>Gegenbauer Polynomials: <img src="https://latex.codecogs.com/png.latex?%5Calpha=3"></figcaption>
</figure>
</div>
</div>
<p>By construction, for any vector <img src="https://latex.codecogs.com/png.latex?e%20%5Cin%20S%5E%7Bd-1%7D">, the function defined on the sphere by <img src="https://latex.codecogs.com/png.latex?Y_n(x)%20=%20C_n%5E%7B(d/2-1)%7D(x%20%5Ccdot%20e)"> is a zonal spherical harmonic of degree <img src="https://latex.codecogs.com/png.latex?n"> in dimension <img src="https://latex.codecogs.com/png.latex?d"> with corresponding eigenvalue <img src="https://latex.codecogs.com/png.latex?%5Clambda%20=%20n(n+d-2)">. <a href="https://en.wikipedia.org/wiki/Legendre_polynomials">Legendre polynomials</a> are a special case of Gegenbauer polynomials obtained by setting <img src="https://latex.codecogs.com/png.latex?%5Calpha%20=%20%5Ctfrac12">, i.e.&nbsp;<img src="https://latex.codecogs.com/png.latex?d=3">. In that case, the Legendre polynomials are orthogonal with respect to the uniform weight on <img src="https://latex.codecogs.com/png.latex?%5B-1,1%5D"> since <img src="https://latex.codecogs.com/png.latex?w_3(z)%20=%201">. Similarly, setting <img src="https://latex.codecogs.com/png.latex?%5Calpha%20=%200"> (i.e.&nbsp;<img src="https://latex.codecogs.com/png.latex?d=2">) gives, up to normalization, the <a href="https://en.wikipedia.org/wiki/Chebyshev_polynomials">Chebyshev polynomials</a> of the first kind, which are orthogonal with respect to the weight <img src="https://latex.codecogs.com/png.latex?w_2(z)%20=%20(1-z%5E2)%5E%7B-1/2%7D">. This also illuminates why the Chebyshev polynomials are defined on the interval <img src="https://latex.codecogs.com/png.latex?%5B-1,1%5D"> with that specific weight and satisfy <img src="https://latex.codecogs.com/png.latex?T_n(%5Ccos%20%5Ctheta)%20=%20%5Ccos(n%20%5Ctheta)">: they are simply the zonal spherical harmonics in dimension <img src="https://latex.codecogs.com/png.latex?d=2">.</p>
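<p>Both special cases can be checked numerically (a sketch assuming SciPy; since the standard Gegenbauer normalization degenerates at <img src="https://latex.codecogs.com/png.latex?%5Calpha%20=%200">, the Chebyshev identity is checked directly):</p>

```python
# Special cases of the Gegenbauer family: Legendre (alpha = 1/2) and the
# Chebyshev identity T_n(cos theta) = cos(n theta) for d = 2.
import numpy as np
from scipy.special import eval_gegenbauer, eval_legendre, eval_chebyt

z = np.linspace(-1.0, 1.0, 101)
for n in range(5):
    # alpha = 1/2 (d = 3): Gegenbauer coincides with Legendre.
    assert np.allclose(eval_gegenbauer(n, 0.5, z), eval_legendre(n, z))

theta = np.linspace(0.0, np.pi, 101)
for n in range(5):
    # d = 2: the zonal harmonics are T_n(cos theta) = cos(n theta).
    assert np.allclose(eval_chebyt(n, np.cos(theta)), np.cos(n * theta))
```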


</section>
</section>

 ]]></description>
  <category>analysis</category>
  <guid>https://alexxthiery.github.io/notes/gegenbauer/gegenbauer.html</guid>
  <pubDate>Sun, 23 Nov 2025 17:00:00 GMT</pubDate>
</item>
<item>
  <title>Scoring Rules for Probabilistic Forecasts</title>
  <link>https://alexxthiery.github.io/notes/scoring_rules/scoring.html</link>
  <description><![CDATA[ 





<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/scoring_rules/Savage.jpg" class="img-fluid figure-img" style="width:35.0%"></p>
<figcaption><a href="https://en.wikipedia.org/wiki/Leonard_Jimmie_Savage">Leonard Jimmie Savage</a> (1917 – 1971)</figcaption>
</figure>
</div>
</div>
<p>When working with probabilistic models, predictions are expressed as full distributions rather than point estimates. To keep things simple, we focus on the case where the outcome <img src="https://latex.codecogs.com/png.latex?Y"> to be predicted takes one of <img src="https://latex.codecogs.com/png.latex?n"> possible values, labeled <img src="https://latex.codecogs.com/png.latex?%5B1:n%5D%20=%20%5C%7B1,2,%5Cldots,n%5C%7D">. A probabilistic forecast then takes the form of a vector <img src="https://latex.codecogs.com/png.latex?%5Cpi=(%5Cpi_1,%5Cpi_2,%5Cldots,%5Cpi_n)"> in the probability simplex <img src="https://latex.codecogs.com/png.latex?%5CDelta%5E%7Bn-1%7D">. Each coordinate <img src="https://latex.codecogs.com/png.latex?%5Cpi_i"> represents the predicted probability that outcome <img src="https://latex.codecogs.com/png.latex?i"> occurs. How should we evaluate the quality of such probabilistic forecasts?</p>
<p>A scoring rule assigns a numerical reward <img src="https://latex.codecogs.com/png.latex?s(%5Cpi,Y)"> to the probabilistic forecast <img src="https://latex.codecogs.com/png.latex?%5Cpi"> when outcome <img src="https://latex.codecogs.com/png.latex?Y"> occurs. If the true distribution of the outcome <img src="https://latex.codecogs.com/png.latex?Y"> is <img src="https://latex.codecogs.com/png.latex?p">, the expected reward for reporting <img src="https://latex.codecogs.com/png.latex?%5Cpi"> is <span id="eq-S"><img src="https://latex.codecogs.com/png.latex?%0AS(%5Cpi,p)%20%5Cequiv%20%5Csum_%7Bi=1%7D%5En%20p_i%20%5C,%20s(%5Cpi,i).%0A%5Ctag%7B1%7D"></span></p>
<p>Although the function <img src="https://latex.codecogs.com/png.latex?s:%20%5CDelta%5E%7Bn-1%7D%20%5Ctimes%20%5B1:n%5D%20%5Cto%20%5Cmathbb%7BR%7D"> is generally non-linear, the function <img src="https://latex.codecogs.com/png.latex?S"> can be extended to a function from <img src="https://latex.codecogs.com/png.latex?%5CDelta%5E%7Bn-1%7D%20%5Ctimes%20%5Cmathbb%7BR%7D%5En"> to <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BR%7D">, linear in its second argument, through Equation&nbsp;1 by interpreting <img src="https://latex.codecogs.com/png.latex?p"> as a vector in <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BR%7D%5En">. The fact that <img src="https://latex.codecogs.com/png.latex?S"> is linear in its second argument will prove very useful later. Furthermore, if one denotes by <img src="https://latex.codecogs.com/png.latex?%5Cdelta_i%20=%20(0,%5Cldots,0,1,0,%5Cldots,0)%20%5Cin%20%5CDelta%5E%7Bn-1%7D"> the Dirac measure at <img src="https://latex.codecogs.com/png.latex?i">, then the scoring rule can be recovered from the expected reward via</p>
<p><img src="https://latex.codecogs.com/png.latex?s(%5Cpi,i)%20=%20S(%5Cpi,%5Cdelta_i)."></p>
<p>The central requirement for the design of scoring rules is that a forecaster has no incentive to misreport their beliefs. This means that if the forecaster’s belief about the distribution of <img src="https://latex.codecogs.com/png.latex?Y"> is given by the probability distribution <img src="https://latex.codecogs.com/png.latex?%5Cpi%20%5Cin%5CDelta%5E%7Bn-1%7D">, then reporting <img src="https://latex.codecogs.com/png.latex?%5Cpi"> should maximize their expected reward. There are a number of situations in which such a design is desirable. For example, in market settings where agents are asked to provide probabilistic forecasts, proper scoring rules incentivize truthful reporting of beliefs. A scoring rule is called proper if the mapping <img src="https://latex.codecogs.com/png.latex?%5Cpi%20%5Cmapsto%20S(%5Cpi,%20p)"> attains its maximum at <img src="https://latex.codecogs.com/png.latex?%5Cpi=p">. Formally, this means that for any two distributions <img src="https://latex.codecogs.com/png.latex?p,%5Cpi%20%5Cin%20%5CDelta%5E%7Bn-1%7D">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AS(p,p)%20%5Cge%20S(%5Cpi,%20p).%0A"></p>
<p>This condition ensures that the best strategy, in expectation, is to report one’s genuine probabilities. If the inequality is strict whenever <img src="https://latex.codecogs.com/png.latex?%5Cpi%20%5Cne%20p">, the scoring rule is called strictly proper. Proper scoring rules have a long history in statistics and decision theory. The natural question arises: what do proper scoring rules look like, and how can we construct them? What functional forms can we use for <img src="https://latex.codecogs.com/png.latex?s(%5Cpi,i)"> that ensure properness?</p>
<p>For each distribution <img src="https://latex.codecogs.com/png.latex?p">, define its self-expected score <img src="https://latex.codecogs.com/png.latex?%0AH(p)=S(p,p)=%5Csum_%7Bi=1%7D%5En%20p_i%20%5C,%20s(p,i).%0A"></p>
<p>It is the average reward a forecaster receives when their reported distribution matches the true distribution. Crucially, the affine function <img src="https://latex.codecogs.com/png.latex?p%20%5Cmapsto%20S(%5Cpi,p)"> describes a supporting hyperplane to the function <img src="https://latex.codecogs.com/png.latex?H"> at the point <img src="https://latex.codecogs.com/png.latex?%5Cpi">: it is linear in <img src="https://latex.codecogs.com/png.latex?p">, matches <img src="https://latex.codecogs.com/png.latex?H"> at <img src="https://latex.codecogs.com/png.latex?p=%5Cpi">, and never exceeds it elsewhere. If one knew that <img src="https://latex.codecogs.com/png.latex?H"> were convex and differentiable, by uniqueness of supporting hyperplanes to convex and differentiable functions, one could immediately write down a representation for <img src="https://latex.codecogs.com/png.latex?S(%5Cpi,p)"> in terms of <img src="https://latex.codecogs.com/png.latex?H">. And <img src="https://latex.codecogs.com/png.latex?H"> is indeed convex since it is the pointwise maximum of the family of affine functions <img src="https://latex.codecogs.com/png.latex?p%20%5Cmapsto%20S(%5Cpi,p)"> indexed by <img src="https://latex.codecogs.com/png.latex?%5Cpi">. Assuming differentiability to keep a few technicalities at bay, this shows that:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign*%7D%0As(%5Cpi,i)%0A&amp;=%20S(%5Cpi,%5Cdelta_i)%0A=%20S(%5Cpi,%20%5Cpi)%20+%20%5Cleft%3C%20%20%5Cnabla%20H(%5Cpi),%20%5Cdelta_i%20-%20%5Cpi%20%20%5Cright%3E%5C%5C%0A&amp;=%20H(%5Cpi)%20+%20%5Cleft%3C%20%20%5Cnabla%20H(%5Cpi),%20%5Cdelta_i%20-%20%5Cpi%20%20%5Cright%3E.%0A%5Cend%7Balign*%7D%0A"></p>
<p>Without assuming differentiability, one can use subgradients instead of gradients to obtain a similar representation. This shows that proper scoring rules <img src="https://latex.codecogs.com/png.latex?s(%5Cpi,i)"> are in one-to-one correspondence with convex functions <img src="https://latex.codecogs.com/png.latex?H(%5Cpi)"> on the probability simplex <img src="https://latex.codecogs.com/png.latex?%5Cpi%20%5Cin%20%5CDelta%5E%7Bn-1%7D">. Similarly, strictly proper scoring rules correspond to strictly convex functions. Extension to continuous sample spaces is possible through the use of functional derivatives instead of gradients or subgradients; see <span class="citation" data-cites="gneiting2007strictly">(Gneiting and Raftery 2007)</span> for details.</p>
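<p>This correspondence can be sketched in code: starting from a differentiable convex <img src="https://latex.codecogs.com/png.latex?H">, define the induced score (the helper names below are illustrative, not from any library) and check that the negative Shannon entropy recovers the logarithmic score:</p>

```python
# Sketch of the correspondence: build s(pi, i) from a convex H via
# s(pi, i) = H(pi) + <grad H(pi), delta_i - pi>, then verify that
# H(p) = sum_i p_i log p_i yields the logarithmic score.
import numpy as np

def score_from_H(H, gradH, pi, i):
    """Scoring rule induced by a differentiable convex H on the simplex."""
    delta = np.zeros_like(pi)
    delta[i] = 1.0
    return H(pi) + gradH(pi) @ (delta - pi)

H = lambda p: np.sum(p * np.log(p))   # negative Shannon entropy
gradH = lambda p: np.log(p) + 1.0     # componentwise gradient

pi = np.array([0.2, 0.3, 0.5])
for i in range(3):
    assert np.isclose(score_from_H(H, gradH, pi, i), np.log(pi[i]))
```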
<p>Let us look at some examples of proper scoring rules defined through this correspondence:</p>
<ol type="1">
<li><p><strong>Logarithmic Score</strong>: The logarithmic scoring rule is defined as <img src="https://latex.codecogs.com/png.latex?s(%5Cpi,i)%20=%20%5Clog(%5Cpi_i)">. The corresponding self-expected score is the negative Shannon entropy: <img src="https://latex.codecogs.com/png.latex?%0AH(p)%20=%20%5Csum_%7Bi=1%7D%5En%20p_i%20%5Clog(p_i).%0A"> It is interesting to note that the logarithmic scoring rule is essentially the only local proper scoring rule, i.e.&nbsp;one of the type <img src="https://latex.codecogs.com/png.latex?s(%5Cpi,i)%20=%20F(%5Cpi_i,%20i)"> for some function <img src="https://latex.codecogs.com/png.latex?F">. In other words, the score assigned to outcome <img src="https://latex.codecogs.com/png.latex?i"> depends only on the predicted probability <img src="https://latex.codecogs.com/png.latex?%5Cpi_i"> of that outcome, and not on the other predicted probabilities <img src="https://latex.codecogs.com/png.latex?%5Cpi_j"> for <img src="https://latex.codecogs.com/png.latex?j%20%5Cne%20i">. Indeed, assuming <img src="https://latex.codecogs.com/png.latex?F"> smooth for simplicity, the condition of properness easily implies that <img src="https://latex.codecogs.com/png.latex?%5Cpi_i%20%5C,%20%5Cpartial_%7B%5Cpi_i%7D%20F(%5Cpi_i,%20i)%20=%20%5Calpha"> for some constant <img src="https://latex.codecogs.com/png.latex?%5Calpha"> independent of <img src="https://latex.codecogs.com/png.latex?i">. Integrating this relation gives that <img src="https://latex.codecogs.com/png.latex?F(%5Cpi_i,%20i)%20=%20%5Calpha%20%5Clog(%5Cpi_i)%20+%20%5Cbeta_i">, where necessarily <img src="https://latex.codecogs.com/png.latex?%5Calpha%3E0"> for properness, and where the <img src="https://latex.codecogs.com/png.latex?%5Cbeta_i"> are arbitrary constants.</p></li>
<li><p><strong>Brier Score</strong>: The Brier scoring rule is given by <img src="https://latex.codecogs.com/png.latex?s(%5Cpi,i)%20=%20%5Cpi_i%20-%20%5Ctfrac12%20%5Csum_%7Bj=1%7D%5En%20%5Cpi_j%5E2">. The associated self-expected score is <img src="https://latex.codecogs.com/png.latex?%0AH(p)%20=%20%5Cfrac12%20%5C,%20%5Csum_%7Bi=1%7D%5En%20p_i%5E2.%0A"></p></li>
<li><p><strong>Spherical Score</strong>: The spherical scoring rule is defined as <img src="https://latex.codecogs.com/png.latex?s(%5Cpi,i)%20=%20%5Cfrac%7B%5Cpi_i%7D%7B%5C%7C%5Cpi%5C%7C_2%7D">. The corresponding self-expected score is <img src="https://latex.codecogs.com/png.latex?%0AH(p)%20=%20%5C%7Cp%5C%7C_2.%0A"></p></li>
<li><p><strong>Energy Score</strong>: consider a distance function <img src="https://latex.codecogs.com/png.latex?d:%20%5B1:n%5D%20%5Ctimes%20%5B1:n%5D%20%5Cto%20%5Cmathbb%7BR%7D_+">. The energy scoring rule is defined through expected distances: <img src="https://latex.codecogs.com/png.latex?%0As(%5Cpi,i)%20=%20-%20%20%7B%5Cleft(%20%20%5Cmathbb%7BE%7D%5Bd(X,i)%5D%20-%20%5Cfrac%2012%20%5C,%20%5Cmathbb%7BE%7D%5Bd(X,X')%5D%20%20%5Cright)%7D%20,%0A"> where <img src="https://latex.codecogs.com/png.latex?X,X'%20%5Csim%20%5Cpi"> are independent. The associated self-expected score is <img src="https://latex.codecogs.com/png.latex?%0AH(p)%20=%20-%5Cfrac12%20%5C,%20%5Cmathbb%7BE%7D%5Bd(X,X')%5D%20=%20-%20%5Cfrac12%20%5C,%5Csum_%7Bi,j=1%7D%5En%20p_i%20p_j%20%5C,%20d(i,j).%0A"> where <img src="https://latex.codecogs.com/png.latex?X,X'%20%5Csim%20p"> are independent. This function is convex in <img src="https://latex.codecogs.com/png.latex?p"> if the distance matrix <img src="https://latex.codecogs.com/png.latex?M_%7Bi,j%7D=d(i,j)"> is negative semi-definite on the subspace of zero-sum vectors, i.e., if for all vectors <img src="https://latex.codecogs.com/png.latex?z%20%5Cin%20%5Cmathbb%7BR%7D%5En"> with <img src="https://latex.codecogs.com/png.latex?%5Csum_%7Bi=1%7D%5En%20z_i%20=%200">, one has <img src="https://latex.codecogs.com/png.latex?%5Csum_%7Bi,j=1%7D%5En%20z_i%20z_j%20%5C,%20d(i,j)%20%5Cle%200">. Luckily, there are many such distances. For example, if the distance <img src="https://latex.codecogs.com/png.latex?d"> is of the form <img src="https://latex.codecogs.com/png.latex?d(i,j)%20=%20%5C%7C%5Cvarphi_i%20-%20%5Cvarphi_j%5C%7C_2%5E2"> for some (feature) vectors <img src="https://latex.codecogs.com/png.latex?%5Cvarphi_1,%5Cldots,%5Cvarphi_n%20%5Cin%20%5Cmathbb%7BR%7D%5Em">, then the corresponding distance matrix is negative semi-definite on the subspace of zero-sum vectors. 
In a continuous setting, for example when <img src="https://latex.codecogs.com/png.latex?%5B1:n%5D"> is replaced by <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BR%7D%5Ed">, one could take <img src="https://latex.codecogs.com/png.latex?%5Cvarphi(x)=x"> to obtain the energy score associated with the squared Euclidean distance; this leads to <img src="https://latex.codecogs.com/png.latex?H(%5Cpi)%20=%20-%20%5Cfrac12%20%5C,%20%5Cmathbb%7BE%7D%5B%5C%7CX-X'%5C%7C_2%5E2%5D%20=%20-%20%5Ctext%7BVar%7D(X)"> when <img src="https://latex.codecogs.com/png.latex?X,X'%20%5Csim%20%5Cpi"> are independent. This shows that in that case the function <img src="https://latex.codecogs.com/png.latex?H"> is <strong>not</strong> strictly convex since it depends only on the variance of the distribution, and is entirely flat on the subspace of distributions with fixed variance! For this reason, the energy score with squared Euclidean distance is proper but typically not strictly proper. In fact, one can check that for any <img src="https://latex.codecogs.com/png.latex?0%3C%20%5Cbeta%20%3C%202">, the distance defined as <img src="https://latex.codecogs.com/png.latex?d(i,j)%20=%20%5C%7C%5Cvarphi_i%20-%20%5Cvarphi_j%5C%7C_2%5E%5Cbeta"> also leads to a negative semi-definite distance matrix on the subspace of zero-sum vectors. But unlike the case <img src="https://latex.codecogs.com/png.latex?%5Cbeta=2"> of squared Euclidean distance, when choosing <img src="https://latex.codecogs.com/png.latex?0%3C%5Cbeta%3C2">, the associated energy score is strictly proper <span class="citation" data-cites="schoenberg1938metric">(Schoenberg 1938)</span> <span class="citation" data-cites="szekely2013energy">(Székely and Rizzo 2013)</span>. This includes, in particular, the case <img src="https://latex.codecogs.com/png.latex?%5Cbeta=1">, which corresponds to the standard Euclidean distance.</p></li>
</ol>
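<p>The examples above can be spot-checked numerically (an illustrative sketch): properness of the logarithmic, Brier, and spherical scores, and the conditional negative-definiteness underlying the energy score with <img src="https://latex.codecogs.com/png.latex?%5Cbeta=1">:</p>

```python
# Spot-check that the scoring rules above are proper: S(p, p) >= S(pi, p)
# for random draws p, pi from the simplex; then check the negative
# semi-definiteness of an energy-score distance matrix on zero-sum vectors.
import numpy as np

rng = np.random.default_rng(0)

def S(score, pi, p):
    """Expected reward of reporting pi when the outcome has distribution p."""
    return sum(p[i] * score(pi, i) for i in range(len(p)))

log_score = lambda pi, i: np.log(pi[i])
brier = lambda pi, i: pi[i] - 0.5 * np.sum(pi ** 2)
spherical = lambda pi, i: pi[i] / np.linalg.norm(pi)

for _ in range(1000):
    p, pi = rng.dirichlet(np.ones(4)), rng.dirichlet(np.ones(4))
    for score in (log_score, brier, spherical):
        assert S(score, p, p) >= S(score, pi, p) - 1e-12

# Energy score with d(i, j) = |phi_i - phi_j|^beta, beta = 1: the quadratic
# form sum_{i,j} z_i z_j d(i, j) should be <= 0 on zero-sum vectors z.
phi = rng.normal(size=6)
D = np.abs(phi[:, None] - phi[None, :])
for _ in range(1000):
    z = rng.normal(size=6)
    z -= z.mean()
    assert z @ D @ z <= 1e-8
```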




<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0">
<div id="ref-gneiting2007strictly" class="csl-entry">
Gneiting, Tilmann, and Adrian E Raftery. 2007. <span>“Strictly Proper Scoring Rules, Prediction, and Estimation.”</span> <em>Journal of the American Statistical Association</em> 102 (477). Taylor &amp; Francis: 359–78.
</div>
<div id="ref-schoenberg1938metric" class="csl-entry">
Schoenberg, Isaac J. 1938. <span>“Metric Spaces and Positive Definite Functions.”</span> <em>Transactions of the American Mathematical Society</em> 44 (3). JSTOR: 522–36.
</div>
<div id="ref-szekely2013energy" class="csl-entry">
Székely, Gábor J, and Maria L Rizzo. 2013. <span>“Energy Statistics: A Class of Statistics Based on Distances.”</span> <em>Journal of Statistical Planning and Inference</em> 143 (8). Elsevier: 1249–72.
</div>
</div></section></div> ]]></description>
  <category>diffusion</category>
  <guid>https://alexxthiery.github.io/notes/scoring_rules/scoring.html</guid>
  <pubDate>Fri, 21 Nov 2025 17:00:00 GMT</pubDate>
</item>
<item>
  <title>Masked Discrete Diffusion</title>
  <link>https://alexxthiery.github.io/notes/DiscreteDiff/DiscreteDiff.html</link>
  <description><![CDATA[ 





<section id="masked-discrete-diffusion" class="level3">
<h3 class="anchored" data-anchor-id="masked-discrete-diffusion">Masked Discrete Diffusion</h3>
<p>We consider a finite state space <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7BX%7D%20=%20%5C%7BM,%201,2,%20%5Cldots,%20V%5C%7D,%0A"> where <img src="https://latex.codecogs.com/png.latex?M"> is a special <strong>masked</strong> state and <img src="https://latex.codecogs.com/png.latex?1,%20%5Cldots,%20V"> correspond to token values in a vocabulary of size <img src="https://latex.codecogs.com/png.latex?V">. On the time interval <img src="https://latex.codecogs.com/png.latex?%5B0,T%5D">, we define a continuous-time Markov chain with initial distribution <img src="https://latex.codecogs.com/png.latex?p_0"> and time-dependent infinitesimal rate matrix <img src="https://latex.codecogs.com/png.latex?Q_t%20%5Cin%20%5Cmathbb%7BR%7D%5E%7B(V+1)%5Ctimes(V+1)%7D"> so that for any <img src="https://latex.codecogs.com/png.latex?x%20%5Cne%20y">, <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BP%7D(X_%7Bt+h%7D%20=%20y%20%5Cmid%20X_t%20=%20x)%20=%20Q_t(x,y)%20%5C,%20h%20+%20o(h).%0A"></p>
<p>If the <strong>total jump rate</strong> out of state <img src="https://latex.codecogs.com/png.latex?x"> is <img src="https://latex.codecogs.com/png.latex?J_t(x)%20=%20%5Csum_%7By%20%5Cne%20x%7D%20Q_t(x,y)">, then</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D(X_%7Bt+h%7D%20=%20x%20%5Cmid%20X_t%20=%20x)%20=%201%20-%20J_t(x)%20%5C,%20h%20+%20o(h)."></p>
<p>Bayes’ rule implies that the time-reversal of this Markov chain is itself Markov, with infinitesimal rate matrix <img src="https://latex.codecogs.com/png.latex?Q_t%5E%7B%5Cstar%7D"> satisfying <img src="https://latex.codecogs.com/png.latex?%0AQ_t%5E%7B%5Cstar%7D(x,y)%20=%20%5Cfrac%7Bp_t(y)%7D%7Bp_t(x)%7D%20%5C,%20Q_t(y,x),%0A"> where <img src="https://latex.codecogs.com/png.latex?p_t"> is the marginal distribution of <img src="https://latex.codecogs.com/png.latex?X_t"> at time <img src="https://latex.codecogs.com/png.latex?t">. We have: <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BP%7D(X_%7Bt-h%7D%20=%20y%20%5Cmid%20X_t%20=%20x)%20=%20Q_t%5E%7B%5Cstar%7D(x,y)%20%5C,%20h%20+%20o(h).%0A"></p>
<p>We are interested in modeling a Markov chain that progressively masks the initial value into the masked state <img src="https://latex.codecogs.com/png.latex?M"> as time <img src="https://latex.codecogs.com/png.latex?t"> goes from <img src="https://latex.codecogs.com/png.latex?0"> to <img src="https://latex.codecogs.com/png.latex?T">. Transitions are only allowed from any token <img src="https://latex.codecogs.com/png.latex?i%20%5Cin%20%5C%7B1,%5Cdots,V%5C%7D"> to the masked state <img src="https://latex.codecogs.com/png.latex?M">, and once in <img src="https://latex.codecogs.com/png.latex?M"> the process remains there. Thus, outside the diagonal, the only nonzero entries of <img src="https://latex.codecogs.com/png.latex?Q_t"> are <img src="https://latex.codecogs.com/png.latex?Q_t(x,M)">. As it will be useful later, we denote by <img src="https://latex.codecogs.com/png.latex?%5Ctau"> the jump time to <img src="https://latex.codecogs.com/png.latex?M"> and we assume <img src="https://latex.codecogs.com/png.latex?%5Ctau%20%3C%20T"> almost surely, so that <img src="https://latex.codecogs.com/png.latex?X_T%20=%20M"> with probability one, and that <img src="https://latex.codecogs.com/png.latex?%5Ctau"> has a continuous distribution. In words: the process starts at some token value and at a random time <img src="https://latex.codecogs.com/png.latex?%5Ctau"> jumps to the masked state <img src="https://latex.codecogs.com/png.latex?M">, where it remains until time <img src="https://latex.codecogs.com/png.latex?T">.</p>
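<p>As an illustrative sketch, take a constant masking rate <img src="https://latex.codecogs.com/png.latex?%5Cbeta">, so that the jump time is exponentially distributed; the requirement that the jump occurs before <img src="https://latex.codecogs.com/png.latex?T"> then only holds approximately for finite <img src="https://latex.codecogs.com/png.latex?T">:</p>

```python
# Minimal simulation of the forward masking chain under a constant masking
# rate beta: each chain sits at its token value and jumps to M at a random
# time tau ~ Exponential(beta), after which it stays masked.
import numpy as np

rng = np.random.default_rng(1)
beta, t, n_chains = 2.0, 0.7, 200_000

tau = rng.exponential(1.0 / beta, size=n_chains)  # jump times to state M
frac_masked = np.mean(tau < t)                    # empirical P(X_t = M)

# Compare with the exact marginal P(X_t = M) = 1 - exp(-beta * t).
assert abs(frac_masked - (1.0 - np.exp(-beta * t))) < 5e-3
```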
</section>
<section id="extension-to-sequences" class="level3">
<h3 class="anchored" data-anchor-id="extension-to-sequences">Extension to Sequences</h3>
<p>We are interested in modeling sequences of <img src="https://latex.codecogs.com/png.latex?L"> discrete tokens, e.g.&nbsp;binary images, genomic sequences, chemical compounds, or protein sequences. Each token takes values in <img src="https://latex.codecogs.com/png.latex?%5C%7B1,2,%5Cldots,V%5C%7D">. We denote by <img src="https://latex.codecogs.com/png.latex?%5Coverline%7Bp%7D_0"> the data distribution over such sequences. For this purpose, we consider <img src="https://latex.codecogs.com/png.latex?L"> independent copies of the above Markov chain, one per coordinate: <img src="https://latex.codecogs.com/png.latex?%0A%5Coverline%7BX%7D_t%20=%20(X_t%5E1,%20%5Cldots,%20X_t%5EL),%0A"> each with rate matrix <img src="https://latex.codecogs.com/png.latex?Q_t"> as defined previously. At time <img src="https://latex.codecogs.com/png.latex?T">, the process reaches the fully masked sequence <img src="https://latex.codecogs.com/png.latex?%5Coverline%7BX%7D_T%20=%20(M,%20%5Cldots,%20M)"> with probability one. Denote by <img src="https://latex.codecogs.com/png.latex?%5Ctau_i"> the jump time of coordinate <img src="https://latex.codecogs.com/png.latex?i">. Since the jump times <img src="https://latex.codecogs.com/png.latex?%5Ctau_i"> are almost surely distinct, the infinitesimal rate matrix <img src="https://latex.codecogs.com/png.latex?%5Coverline%7BQ%7D_t"> of the joint process is nonzero only when a single coordinate changes. If <img src="https://latex.codecogs.com/png.latex?x,%5Cwidehat%7Bx%7D%20%5Cin%20%5Cmathcal%7BX%7D%5EL"> differ by a single coordinate <img src="https://latex.codecogs.com/png.latex?i">, we have <img src="https://latex.codecogs.com/png.latex?%0A%5Coverline%7BQ%7D_t(x,%5Cwidehat%7Bx%7D)%20=%20Q_t(x%5Ei,%20%5Cwidehat%7Bx%7D%5Ei).%0A"></p>
<p>As before, the time-reversal has infinitesimal rate matrix <img src="https://latex.codecogs.com/png.latex?%5Coverline%7BQ%7D_t%5E%7B%5Cstar%7D"> satisfying <img src="https://latex.codecogs.com/png.latex?%0A%5Coverline%7BQ%7D_t%5E%7B%5Cstar%7D(%5Cwidehat%7Bx%7D,%20x)%20=%20%5Cfrac%7B%5Coverline%7Bp%7D_t(x)%7D%7B%5Coverline%7Bp%7D_t(%5Cwidehat%7Bx%7D)%7D%20%5C,%20Q_t(x%5Ei,%20%5Cwidehat%7Bx%7D%5Ei),%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5Coverline%7Bp%7D_t"> is the marginal distribution of <img src="https://latex.codecogs.com/png.latex?%5Coverline%7BX%7D_t">. Since <img src="https://latex.codecogs.com/png.latex?x"> and <img src="https://latex.codecogs.com/png.latex?%5Cwidehat%7Bx%7D"> differ at coordinate <img src="https://latex.codecogs.com/png.latex?i"> only, for <img src="https://latex.codecogs.com/png.latex?%5Coverline%7BQ%7D_t%5E%7B%5Cstar%7D(%5Cwidehat%7Bx%7D,%20x)"> to be non-zero, necessarily <img src="https://latex.codecogs.com/png.latex?%5Cwidehat%7Bx%7D%5Ei%20=%20M"> and <img src="https://latex.codecogs.com/png.latex?x%5Ei%20%5Cin%20%5C%7B1,%5Cldots,V%5C%7D">. Let <img src="https://latex.codecogs.com/png.latex?S%20=%20%5C%7Bj%20:%20%5Cwidehat%7Bx%7D%5Ej%20%5Cneq%20M%5C%7D"> be the set of unmasked coordinates in <img src="https://latex.codecogs.com/png.latex?%5Cwidehat%7Bx%7D">. To observe configuration <img src="https://latex.codecogs.com/png.latex?%5Cwidehat%7Bx%7D"> at time <img src="https://latex.codecogs.com/png.latex?t">, the <img src="https://latex.codecogs.com/png.latex?(L%20-%20%7CS%7C)"> masked coordinates must have <img src="https://latex.codecogs.com/png.latex?%5Ctau%20%3C%20t"> and the <img src="https://latex.codecogs.com/png.latex?%7CS%7C"> unmasked ones <img src="https://latex.codecogs.com/png.latex?%5Ctau%20%5Cge%20t">: <img src="https://latex.codecogs.com/png.latex?%0A%5Coverline%7Bp%7D_t(%5Cwidehat%7Bx%7D)%20=%20%5Cmathbb%7BP%7D(%5Ctau%20%3C%20t)%5E%7BL-%7CS%7C%7D%20%5C,%20%5Cmathbb%7BP%7D(%5Ctau%20%5Cge%20t)%5E%7B%7CS%7C%7D%20%5C,%20%5Coverline%7Bp%7D_0(%5Cwidehat%7Bx%7D%5E%7BS%7D).%0A"></p>
<p>Similarly, and since <img src="https://latex.codecogs.com/png.latex?x"> differs from <img src="https://latex.codecogs.com/png.latex?%5Cwidehat%7Bx%7D"> only at coordinate <img src="https://latex.codecogs.com/png.latex?i">: <img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign*%7D%0A%5Coverline%7Bp%7D_t(x)%0A&amp;=%20%5Cmathbb%7BP%7D(%5Ctau%20%3C%20t)%5E%7BL-%7CS%7C-1%7D%20%5C,%20%5Cmathbb%7BP%7D(%5Ctau%20%5Cge%20t)%5E%7B%7CS%7C+1%7D%20%5C,%0A%5Coverline%7Bp%7D_0(x%5E%7BS%20%5Ccup%20%5C%7Bi%5C%7D%7D)%5C%5C%0A&amp;=%20%5Cmathbb%7BP%7D(%5Ctau%20%3C%20t)%5E%7BL-%7CS%7C-1%7D%20%5C,%20%5Cmathbb%7BP%7D(%5Ctau%20%5Cge%20t)%5E%7B%7CS%7C+1%7D%20%5C,%0A%5Coverline%7Bp%7D_0(%5Cwidehat%7Bx%7D%5E%7BS%7D)%5C,%20%5Coverline%7Bp%7D_0(x%5Ei%20%5Cmid%20%5Cwidehat%7Bx%7D%5E%7BS%7D).%0A%5Cend%7Balign*%7D%0A"></p>
<p>This shows that the time-reversal rate matrix becomes <span id="eq-reverse-rate-matrix"><img src="https://latex.codecogs.com/png.latex?%0A%5Coverline%7BQ%7D_t%5E%7B%5Cstar%7D(%5Cwidehat%7Bx%7D,%20x)%20=%20R(t)%20%5C,%20%5Coverline%7Bp%7D_0(x%5Ei%20%5Cmid%20%5Cwidehat%7Bx%7D%5E%7BS%7D)%5C,%20Q_t(x%5Ei,%20M),%0A%5Ctag%7B1%7D"></span></p>
<p>with time-dependent scalar <img src="https://latex.codecogs.com/png.latex?R(t)%20=%20%5Cfrac%7B%5Cmathbb%7BP%7D(%5Ctau%20%5Cge%20t)%7D%7B%5Cmathbb%7BP%7D(%5Ctau%20%3C%20t)%7D">. To simulate the reverse process that progressively unmasks a fully masked sequence, one only needs to model the conditional distribution <img src="https://latex.codecogs.com/png.latex?%5Coverline%7Bp%7D_0(x%5Ei%20%5Cmid%20%5Cwidehat%7Bx%7D%5E%7BS%7D)"> of the data distribution <img src="https://latex.codecogs.com/png.latex?%5Coverline%7Bp%7D_0">. This is precisely the prediction task of masked language models such as BERT, which estimate token probabilities conditioned on visible context.</p>
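<p>A minimal sampler sketch: since the forward jump times are i.i.d.&nbsp;with a continuous distribution, the reverse process unmasks the coordinates in uniformly random order, filling each one from the conditional distribution given the already-unmasked tokens. Here <code>model(x, i)</code> is a hypothetical stand-in for a learned network approximating that conditional distribution:</p>

```python
# Sketch of the reverse (unmasking) sampler. `model(x, i)` is a hypothetical
# stand-in for a learned BERT-like network returning the V conditional
# probabilities of token i given the currently unmasked coordinates of x.
import numpy as np

M = -1  # marker for the masked state

def unmask(model, L, rng):
    x = np.full(L, M)
    # Forward jump times are i.i.d., so the reverse unmasking order is uniform.
    for i in rng.permutation(L):
        probs = model(x, i)                  # approx. p0(x^i | unmasked part)
        x[i] = rng.choice(len(probs), p=probs)
    return x

# Toy "model": uniform over a vocabulary of size V, ignoring context.
V = 5
uniform_model = lambda x, i: np.ones(V) / V
sample = unmask(uniform_model, L=8, rng=np.random.default_rng(2))
assert sample.shape == (8,) and np.all(sample != M)
```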
</section>
<section id="training" class="level3">
<h3 class="anchored" data-anchor-id="training">Training</h3>
<p>To train the denoising model, Equation&nbsp;1 shows that it is natural to parametrize the conditional distribution</p>
<p><img src="https://latex.codecogs.com/png.latex?f_%5Ctheta(x%5Ei%20%5Cmid%20%5Cwidehat%7Bx%7D%5E%7BS%7D)%20%5Capprox%20%5Coverline%7Bp%7D_0(x%5Ei%20%5Cmid%20%5Cwidehat%7Bx%7D%5E%7BS%7D)"></p>
<p>for all sets <img src="https://latex.codecogs.com/png.latex?S%20%5Csubset%20%5C%7B1,%5Cldots,L%5C%7D"> with <img src="https://latex.codecogs.com/png.latex?i%20%5Cnotin%20S">. Once done, one can define the rate matrix of the time-reversal process as:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Coverline%7BQ%7D_%7Bt,%5Ctheta%7D%5E%7B%5Cstar%7D(%5Cwidehat%7Bx%7D,%20x)%20=%20R(t)%20%5C,%20f_%5Ctheta(x%5Ei%20%5Cmid%20%5Cwidehat%7Bx%7D%5E%7BS%7D)%20%5C,%20Q_t(x%5Ei,%20M).%0A"></p>
<p>If one denotes by <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D"> the law of the forward noising process started from <img src="https://latex.codecogs.com/png.latex?%5Coverline%7Bp%7D_0">, and by <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D_%7B%5Ctheta%7D"> the law of the time-reversal process started from the fully masked sequence <img src="https://latex.codecogs.com/png.latex?(M,M,%20%5Cldots,%20M)"> at time <img src="https://latex.codecogs.com/png.latex?T"> and with learned denoising model <img src="https://latex.codecogs.com/png.latex?f_%5Ctheta">, one can train the model by minimizing</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AD_%7B%5Ctext%7BKL%7D%7D%20%7B%5Cleft(%20%20%5Cmathbb%7BP%7D%5C;%20%7C%7C%20%5C;%20%5Cmathbb%7BP%7D_%7B%5Ctheta%7D%20%20%5Cright)%7D%20%20=%20%5Cmathbb%7BE%7D_%7B%5Cmathbb%7BP%7D%7D%20%20%7B%5Cleft(%20%20%5Clog%20%5Cfrac%7B%5Cmathbb%7BP%7D%7D%7B%5Cmathbb%7BP%7D_%7B%5Ctheta%7D%7D%20%20%5Cright)%7D%20.%0A"></p>
<p>Consider a trajectory <img src="https://latex.codecogs.com/png.latex?x_%7B%5B0,T%5D%7D"> of the forward noising process. The jump at time <img src="https://latex.codecogs.com/png.latex?%5Ctau_i"> of the <img src="https://latex.codecogs.com/png.latex?i">-th coordinate is denoted by <img src="https://latex.codecogs.com/png.latex?%5CDelta_i:%20=%20(x_%7B%5Ctau_i%5E-%7D%5Ei,%20x_%7B%5Ctau_i%5E+%7D%5Ei)%20=%20(x_0%5Ei,%20M)">. To simplify notation, we denote the reverse jump by <img src="https://latex.codecogs.com/png.latex?%5CDelta_i%5E%7B%5Cstar%7D">. The log-likelihood ratio between the two processes is easily shown to be:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Clog%20%5Cfrac%7B%5Cmathbb%7BP%7D%7D%7B%5Cmathbb%7BP%7D_%7B%5Ctheta%7D%7D(x_%7B%5B0,T%5D%7D)%20=%20%5Clog%20%5Coverline%7Bp%7D_0(x_0)%20+%20%5Csum_i%20%5Clog%20%5Cfrac%7B%5Coverline%7BQ%7D_%7B%5Ctau_i%7D(%5CDelta_i)%7D%7B%5Coverline%7BQ%7D%5E%7B%5Cstar%7D_%7B%5Ctau_i,%20%5Ctheta%7D(%5CDelta_i%5E%7B%5Cstar%7D)%7D%20-%20%5Cint_0%5ET%20%20%7B%5Cleft%5C%7B%20%20%5Coverline%7BJ%7D_t(x_t)%20-%20%5Coverline%7BJ%7D%5E%7B%5Cstar%7D_%7Bt,%5Ctheta%7D(x_t)%20%20%5Cright%5C%7D%7D%20%20%5C,%20dt,%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5Coverline%7BJ%7D_t"> and <img src="https://latex.codecogs.com/png.latex?%5Coverline%7BJ%7D%5E%7B%5Cstar%7D_%7Bt,%5Ctheta%7D"> are the total jump rates of the forward and reverse processes respectively. Since <img src="https://latex.codecogs.com/png.latex?%5Coverline%7BJ%7D%5E%7B%5Cstar%7D_%7Bt,%5Ctheta%7D(x_t)"> in fact does not depend on <img src="https://latex.codecogs.com/png.latex?%5Ctheta">, minimizing the KL divergence is equivalent to minimizing:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A-%5Cmathbb%7BE%7D_%7B%5Cmathbb%7BP%7D%7D%20%20%7B%5Cleft(%20%20%5Csum_%7Bi=1%7D%5EL%20%5Clog%20f_%5Ctheta(X_0%5Ei%20%5Cmid%20X_%7B%5Ctau_i%7D%5E%7BS_%7B%7B%5Ctau_i%7D%7D%7D)%20%20%5Cright)%7D%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?S_t"> is the set of unmasked coordinates at time <img src="https://latex.codecogs.com/png.latex?t">. It is more convenient to rewrite this quantity as an integral over time so that one can sample a time <img src="https://latex.codecogs.com/png.latex?t"> uniformly in <img src="https://latex.codecogs.com/png.latex?%5B0,T%5D"> during training. With the Dirac delta function <img src="https://latex.codecogs.com/png.latex?%5Cdelta%5B%5Ctau_i%20=%20t%5D">, we can rewrite this expectation as:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign*%7D%0A-%5Cmathbb%7BE%7D_%7B%5Cmathbb%7BP%7D%7D%20%20%7B%5Cleft(%20%20%5Csum_%7Bi=1%7D%5EL%20%5Clog%20f_%5Ctheta(X_0%5Ei%20%5Cmid%20X_%7B%5Ctau_i%7D%5E%7BS_%7B%5Ctau_i%7D%7D)%20%20%5Cright)%7D%0A&amp;=%0A-%5Cmathbb%7BE%7D_%7B%5Cmathbb%7BP%7D%7D%20%5Cint_%7Bt=0%7D%5E%7BT%7D%20%5Csum_%7Bi=1%7D%5EL%20%5Cdelta%5B%5Ctau_i%20=%20t%5D%20%5C,%20%5Clog%20f_%5Ctheta(X_0%5Ei%20%5Cmid%20X_%7Bt%7D%5E%7BS_%7Bt%7D%7D)%20%5C,%20dt%5C%5C%0A&amp;=%0A-%5Cint_%7Bt=0%7D%5E%7BT%7D%20%5Cfrac%7B%5Cdot%7B%5Cbeta_t%7D%7D%7B%5Cbeta_t%7D%20%5C,%20%5Cmathbb%7BE%7D_%7B%5Cmathbb%7BP%7D%7D%20%7B%5Cleft(%20%5Csum_%7Bi:%20%5C,%20X_t%5Ei=M%7D%20%5Clog%20f_%5Ctheta(X_0%5Ei%20%5Cmid%20X_%7Bt%7D%5E%7BS_%7Bt%7D%7D)%20%5Cright)%7D%20%5C,%20dt%0A%5Cend%7Balign*%7D%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5Cbeta_t%20=%20%5Cmathbb%7BP%7D(%5Ctau%20%5Cle%20t)"> so that <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D(%5Ctau%20%5Cin%20dt%20%7C%20%5Ctau%20%5Cle%20t)%20=%20%5Cfrac%7B%5Cdot%7B%5Cbeta_t%7D%7D%7B%5Cbeta_t%7D%20%5C,%20dt">. For training, it suffices to sample <img src="https://latex.codecogs.com/png.latex?t"> uniformly in <img src="https://latex.codecogs.com/png.latex?%5B0,T%5D">, draw <img src="https://latex.codecogs.com/png.latex?X_0%20%5Csim%20%5Coverline%7Bp%7D_0">, sample the noised configuration <img src="https://latex.codecogs.com/png.latex?X_t"> according to the forward process, and evaluate the resulting unbiased estimate of the loss. The term <img src="https://latex.codecogs.com/png.latex?%5Cdot%7B%5Cbeta_t%7D/%5Cbeta_t"> is large for small <img src="https://latex.codecogs.com/png.latex?t">, counterbalancing the fact that the reconstruction task is much easier when only a few tokens are masked. For standard denoising <a href="../../notes/DDPM/DDPM.html">diffusion models</a>, there is a similar “signal-to-noise” weighting term that balances the easy and hard denoising tasks.</p>
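This training recipe can be sketched in a few lines. The snippet below is a hedged toy illustration: the linear schedule (so that the hazard rate is the reciprocal of time), the mask token, and the placeholder uniform predictor standing in for the learned model are all assumptions made for the example only.

```python
import numpy as np

rng = np.random.default_rng(1)
L, V, T = 8, 4, 1.0
MASK = V                                  # mask token index (assumption for this toy example)

def loss_estimate(f_theta, x0):
    """One unbiased sample of the masked-diffusion training loss, with a
    linear masking schedule beta_t = t / T so that beta_dot / beta = 1 / t.
    `f_theta(i, x_t)` is a hypothetical denoiser returning a length-V
    probability vector for coordinate i given the partially masked x_t."""
    t = rng.uniform(0.0, T)
    masked = rng.random(L) < t / T        # coordinate i is masked at time t iff tau_i < t
    x_t = np.where(masked, MASK, x0)
    weight = T / t                        # T * (beta_dot / beta): importance factor for uniform t
    ce = sum(-np.log(f_theta(i, x_t)[x0[i]]) for i in np.flatnonzero(masked))
    return weight * ce

# Placeholder "model": uniform prediction over the vocabulary.
uniform = lambda i, x_t: np.full(V, 1.0 / V)
x0 = rng.integers(0, V, size=L)
est = np.mean([loss_estimate(uniform, x0) for _ in range(1000)])
print(est)
```

In practice the placeholder predictor would be a neural network; the `T / t` factor is exactly the importance weight coming from sampling `t` uniformly rather than from the schedule.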
</section>
<section id="conclusion" class="level3">
<h3 class="anchored" data-anchor-id="conclusion">Conclusion</h3>
<p>Discrete diffusion models with one-way masking are mathematically almost identical to masked language models. Hence their similar behavior and performance on text generation tasks are not coincidental. The ideas summarized in these notes were developed in a very interesting stream of papers, including <span class="citation" data-cites="ou2024your">(Ou et al. 2024)</span>, <span class="citation" data-cites="sahoo2024simple">(Sahoo et al. 2024)</span>, <span class="citation" data-cites="shi2024simplified">(Shi et al. 2024)</span> and a number of more recent works. One of the potential drawbacks of such masked discrete diffusion models is that the support of the noising distribution is typically strictly smaller, and indeed often much smaller, than the whole state space. This means that when the denoising model is not perfect and wanders outside the support of the noising distribution, one can quickly end up in regions never seen during training. This can lead to poor sample quality and unstable behavior. Other discrete diffusion models, such as those reaching the uniform distribution over all tokens at time <img src="https://latex.codecogs.com/png.latex?T">, are not as badly affected by this issue, although they do suffer from other important computational and modeling challenges. Exciting research directions remain to be explored in this area!</p>



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0">
<div id="ref-ou2024your" class="csl-entry">
Ou, Jingyang, Shen Nie, Kaiwen Xue, Fengqi Zhu, Jiacheng Sun, Zhenguo Li, and Chongxuan Li. 2024. <span>“Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data.”</span> <em>arXiv Preprint arXiv:2406.03736</em>.
</div>
<div id="ref-sahoo2024simple" class="csl-entry">
Sahoo, Subham, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin Chiu, Alexander Rush, and Volodymyr Kuleshov. 2024. <span>“Simple and Effective Masked Diffusion Language Models.”</span> <em>Advances in Neural Information Processing Systems</em> 37: 130136–84.
</div>
<div id="ref-shi2024simplified" class="csl-entry">
Shi, Jiaxin, Kehang Han, Zhe Wang, Arnaud Doucet, and Michalis Titsias. 2024. <span>“Simplified and Generalized Masked Diffusion for Discrete Data.”</span> <em>Advances in Neural Information Processing Systems</em> 37: 103131–67.
</div>
</div></section></div> ]]></description>
  <category>DDPM</category>
  <category>score</category>
  <guid>https://alexxthiery.github.io/notes/DiscreteDiff/DiscreteDiff.html</guid>
  <pubDate>Mon, 20 Oct 2025 17:00:00 GMT</pubDate>
</item>
<item>
  <title>Schrödinger Bridges</title>
  <link>https://alexxthiery.github.io/notes/shrodinger_bridge/shrodinger.html</link>
  <description><![CDATA[ 





<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/shrodinger_bridge/erwin.jpg" class="img-fluid figure-img" style="width:35.0%"></p>
<figcaption><a href="https://en.wikipedia.org/wiki/Erwin_Schr%C3%B6dinger">Erwin Schrödinger</a> (1887 – 1961)</figcaption>
</figure>
</div>
</div>
<p>Consider two discrete probability distributions <img src="https://latex.codecogs.com/png.latex?%5Cnu_0"> and <img src="https://latex.codecogs.com/png.latex?%5Cnu_1"> over <img src="https://latex.codecogs.com/png.latex?%5B%5B1,n%5D%5D">. We would like to find a <a href="https://en.wikipedia.org/wiki/Coupling_(probability)">coupling</a> <img src="https://latex.codecogs.com/png.latex?%5Cgamma(x_0,x_1)"> between <img src="https://latex.codecogs.com/png.latex?%5Cnu_0"> and <img src="https://latex.codecogs.com/png.latex?%5Cnu_1"> such that, under this coupling <img src="https://latex.codecogs.com/png.latex?(X_0,X_1)%20%5Csim%20%5Cgamma">, the distance <img src="https://latex.codecogs.com/png.latex?d(X_0,X_1)"> between <img src="https://latex.codecogs.com/png.latex?X_0"> and <img src="https://latex.codecogs.com/png.latex?X_1"> is small. Naturally, this can be formulated as the following <a href="https://en.wikipedia.org/wiki/Transportation_theory_(mathematics)">optimal transport</a> problem:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathop%7B%5Cmathrm%7Bargmin%7D%7D_%7B%5Cgamma%20%5Cin%20%5CPi(%5Cnu_0,%20%5Cnu_1)%7D%20%5C;%20%5Csum_%7Bx_0,x_1%7D%20d(x_0,x_1)%20%5C,%20%5Cgamma(x_0,x_1),%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5CPi(%5Cnu_0,%20%5Cnu_1)"> is the set of couplings between <img src="https://latex.codecogs.com/png.latex?%5Cnu_0"> and <img src="https://latex.codecogs.com/png.latex?%5Cnu_1">. It is a <a href="https://en.wikipedia.org/wiki/Linear_programming">linear program</a> and can be solved efficiently when <img src="https://latex.codecogs.com/png.latex?n"> is not too large. However, the optimal transport plan is often very sparse since it puts mass on at most <img src="https://latex.codecogs.com/png.latex?(2n-1)"> pairs <img src="https://latex.codecogs.com/png.latex?(x_0,x_1)">. This can be undesirable in some applications. Furthermore, small changes in the distributions <img src="https://latex.codecogs.com/png.latex?%5Cnu_0"> and <img src="https://latex.codecogs.com/png.latex?%5Cnu_1"> can lead to large changes in the optimal transport plan. This sensitivity can be problematic in practice, especially when the distributions are estimated from data.</p>
<section id="static-shrödinger-bridge-problem" class="level3">
<h3 class="anchored" data-anchor-id="static-shrödinger-bridge-problem">Static Schrödinger Bridge Problem</h3>
<p>A standard way to address these issues is to add an entropic regularization term to the objective function. The resulting problem is known as the Schrödinger bridge problem and can be formulated as follows. Consider a reference joint distribution <img src="https://latex.codecogs.com/png.latex?%5Cmu_%7B%5Cmathrm%7Bref%7D%7D(x_0,x_1)"> over <img src="https://latex.codecogs.com/png.latex?%5B%5B1,n%5D%5D%20%5Ctimes%20%5B%5B1,n%5D%5D"> and find the coupling <img src="https://latex.codecogs.com/png.latex?%5Cgamma(x_0,x_1)"> that minimizes the Kullback-Leibler divergence to <img src="https://latex.codecogs.com/png.latex?%5Cmu_%7B%5Cmathrm%7Bref%7D%7D"> while matching the marginals <img src="https://latex.codecogs.com/png.latex?%5Cnu_0"> and <img src="https://latex.codecogs.com/png.latex?%5Cnu_1">:</p>
<p><span id="eq-kl-contrained"><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathop%7B%5Cmathrm%7Bargmin%7D%7D_%7B%5Cgamma%20%5Cin%20%5CPi(%5Cnu_0,%20%5Cnu_1)%7D%20%5C;%20%5Csum_%7Bx_0,x_1%7D%20%5Cgamma(x_0,x_1)%20%5Clog%20%5Cfrac%7B%5Cgamma(x_0,x_1)%7D%7B%5Cmu_%7B%5Cmathrm%7Bref%7D%7D(x_0,x_1)%7D.%0A%5Ctag%7B1%7D"></span></p>
<p><strong>Remark (invariance under separable rescaling).</strong> Only the “interaction structure” of <img src="https://latex.codecogs.com/png.latex?%5Cmu_%7B%5Cmathrm%7Bref%7D%7D"> matters: if one replaces <img src="https://latex.codecogs.com/png.latex?%5Cmu_%7B%5Cmathrm%7Bref%7D%7D"> by <img src="https://latex.codecogs.com/png.latex?%5Ctilde%5Cmu_%7B%5Cmathrm%7Bref%7D%7D(x_0,x_1)%5Cpropto%20a(x_0)%5C,%5Cmu_%7B%5Cmathrm%7Bref%7D%7D(x_0,x_1)%5C,b(x_1)"> with positive <img src="https://latex.codecogs.com/png.latex?a,b">, then the optimal coupling can still be written in the same form below, with the factors <img src="https://latex.codecogs.com/png.latex?a,b"> absorbed into the potentials. Equivalently, the solution depends on <img src="https://latex.codecogs.com/png.latex?%5Cmu_%7B%5Cmathrm%7Bref%7D%7D"> only through its kernel up to left/right diagonal scaling.</p>
<p>A common choice for the reference measure is a distribution of the form <img src="https://latex.codecogs.com/png.latex?%5Cmu_%7B%5Cmathrm%7Bref%7D%7D(x_0,x_1)%20%5Cpropto%20%5Cexp(-d(x_0,x_1)/%5Cvarepsilon)"> (optionally times separable factors). This choice encourages the coupling <img src="https://latex.codecogs.com/png.latex?%5Cgamma"> to put more mass on pairs <img src="https://latex.codecogs.com/png.latex?(x_0,x_1)"> that are close according to the distance <img src="https://latex.codecogs.com/png.latex?d">, and the resulting optimization problem can be rewritten (up to an additive constant) as:</p>
<p><span id="eq-entropic-ot"><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathop%7B%5Cmathrm%7Bargmin%7D%7D_%7B%5Cgamma%20%5Cin%20%5CPi(%5Cnu_0,%20%5Cnu_1)%7D%20%5C;%20%5Csum_%7Bx_0,x_1%7D%20d(x_0,x_1)%20%5C,%20%5Cgamma(x_0,x_1)%20%20%5Ctextcolor%7Bblue%7D%7B%5C;%20-%20%5C;%20%5Cvarepsilon%5C,%20%5Cmathrm%7BH%7D(%5Cgamma)%7D%0A%5Ctag%7B2%7D"></span></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5Cmathrm%7BH%7D(%5Cgamma)%20=%20-%20%5Csum_%7Bx_0,x_1%7D%20%5Cgamma(x_0,x_1)%20%5Clog%20%5Cgamma(x_0,x_1)"> is the <a href="https://en.wikipedia.org/wiki/Entropy_(information_theory)">entropy</a> of the coupling <img src="https://latex.codecogs.com/png.latex?%5Cgamma">. Note that since the marginals of <img src="https://latex.codecogs.com/png.latex?%5Cgamma"> are fixed, it is also equivalent (up to constants) to replacing the entropy term by the <a href="https://en.wikipedia.org/wiki/Kullback–Leibler_divergence">Kullback-Leibler</a> divergence to the independent coupling <img src="https://latex.codecogs.com/png.latex?%5Cnu_0(x_0)%20%5Cotimes%20%5Cnu_1(x_1)">,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathop%7B%5Cmathrm%7Bargmin%7D%7D_%7B%5Cgamma%20%5Cin%20%5CPi(%5Cnu_0,%20%5Cnu_1)%7D%20%5C;%20%5Csum_%7Bx_0,x_1%7D%20d(x_0,x_1)%20%5C,%20%5Cgamma(x_0,x_1)%20%20%5Ctextcolor%7Bblue%7D%7B%5C;%20+%20%5C;%20%5Cvarepsilon%5C,%20D_%7B%5Ctext%7BKL%7D%7D(%5Cgamma%20%5Cmid%20%5Cnu_0%20%5Cotimes%20%5Cnu_1)%7D.%0A"></p>
<p>Writing the <a href="https://en.wikipedia.org/wiki/Duality_(optimization)">Lagrange dual</a> formulation of the problem Equation&nbsp;2 provides fast algorithms to solve it such as the <a href="https://en.wikipedia.org/wiki/Iterative_proportional_fitting">iterative proportional fitting</a> procedure (IPFP), also known as the Sinkhorn-Knopp algorithm in the optimal transport literature <span class="citation" data-cites="cuturi2013sinkhorn">(Cuturi 2013)</span>. Importantly, it also almost immediately shows that the optimal coupling has a very simple form,</p>
<p><span id="eq-schrodinger-solution"><img src="https://latex.codecogs.com/png.latex?%0A%5Cgamma(x_0,x_1)%20=%20f(x_0)%20%5C,%20%5Cmu_%7B%5Cmathrm%7Bref%7D%7D(x_0,x_1)%20%5C,%20g(x_1).%0A%5Ctag%7B3%7D"></span></p>
<p>These potentials must satisfy the marginal constraints, which take the form of the <strong>Schrödinger/Sinkhorn scaling equations</strong>:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Baligned%7D%0A%5Cnu_0(x_0)%20&amp;=%20%5Csum_%7Bx_1%7D%20f(x_0)%5C,%20%5Cmu_%7B%5Cmathrm%7Bref%7D%7D(x_0,x_1)%5C,%20g(x_1),%5C%5C%0A%5Cnu_1(x_1)%20&amp;=%20%5Csum_%7Bx_0%7D%20f(x_0)%5C,%20%5Cmu_%7B%5Cmathrm%7Bref%7D%7D(x_0,x_1)%5C,%20g(x_1).%0A%5Cend%7Baligned%7D%0A%5Cright.%0A"></p>
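These scaling equations can be solved by alternately enforcing each one in turn, which is exactly the IPFP/Sinkhorn iteration mentioned above. A self-contained NumPy sketch follows; the marginals, the cost matrix, and the regularization level are arbitrary toy choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, eps = 6, 0.5

# Toy marginals and cost; reference kernel mu_ref(x0, x1) ∝ exp(-d(x0, x1) / eps).
nu0 = rng.random(n); nu0 /= nu0.sum()
nu1 = rng.random(n); nu1 /= nu1.sum()
d = np.abs(np.subtract.outer(np.arange(n), np.arange(n))) / n
K = np.exp(-d / eps)                      # reference kernel (up to separable rescaling)

# IPFP / Sinkhorn: alternately rescale f and g to enforce each marginal constraint.
f, g = np.ones(n), np.ones(n)
for _ in range(500):
    f = nu0 / (K @ g)                     # enforce the nu0 (row) constraint
    g = nu1 / (K.T @ f)                   # enforce the nu1 (column) constraint

gamma = f[:, None] * K * g[None, :]       # optimal coupling gamma = f * mu_ref * g
print(np.abs(gamma.sum(axis=1) - nu0).max(), np.abs(gamma.sum(axis=0) - nu1).max())
```

Each update solves one scaling equation exactly, so at convergence both marginals of the coupling match the prescribed ones.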
<details>
<summary>
Some details:
</summary>
<p style="color: blue;">
The Lagrangian of the constrained convex optimization Equation&nbsp;1 reads: <img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign*%7D%0A%5Csum_%7Bx_0,x_1%7D%20%5Cgamma(x_0,x_1)%20%5Clog%20%5Cfrac%7B%5Cgamma(x_0,x_1)%7D%7B%5Cmu_%7B%5Cmathrm%7Bref%7D%7D(x_0,x_1)%7D%0A&amp;+%20%5Csum_%7Bx_0%7D%20%5Calpha(x_0)%20%5Cleft(%20%5Cnu_0(x_0)%20-%20%5Csum_%7Bx_1%7D%20%5Cgamma(x_0,x_1)%20%5Cright)%5C%5C%0A&amp;+%20%5Csum_%7Bx_1%7D%20%5Cbeta(x_1)%20%5Cleft(%20%5Cnu_1(x_1)%20-%20%5Csum_%7Bx_0%7D%20%5Cgamma(x_0,x_1)%20%5Cright).%0A%5Cend%7Balign*%7D%0A"> There is no need to impose a constraint on the total mass of <img src="https://latex.codecogs.com/png.latex?%5Cgamma"> since it is automatically satisfied by the marginal constraints. For a fixed set of dual variables <img src="https://latex.codecogs.com/png.latex?%5Calpha"> and <img src="https://latex.codecogs.com/png.latex?%5Cbeta">, minimizing the Lagrangian with respect to <img src="https://latex.codecogs.com/png.latex?%5Cgamma"> leads to the optimum <img src="https://latex.codecogs.com/png.latex?%5Cgamma_%7B%5Calpha,%20%5Cbeta%7D(x_0,x_1)%20=%20%5Cmu_%7B%5Cmathrm%7Bref%7D%7D(x_0,x_1)%20%5C,%20e%5E%7B%5Calpha(x_0)%20+%20%5Cbeta(x_1)%20-1%7D">, hence proving Equation&nbsp;3 with <img src="https://latex.codecogs.com/png.latex?f=e%5E%7B%5Calpha-1%7D"> and <img src="https://latex.codecogs.com/png.latex?g=e%5E%5Cbeta"> (up to scaling). Plugging this expression back into the Lagrangian gives: <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7BL%7D(%5Calpha,%20%5Cbeta)%20=%20%5Csum_%7Bx_0%7D%20%5Calpha(x_0)%20%5C,%20%5Cnu_0(x_0)%20+%20%5Csum_%7Bx_1%7D%20%5Cbeta(x_1)%20%5C,%20%5Cnu_1(x_1)%20-%20%5Csum_%7Bx_0,x_1%7D%20%5Cgamma_%7B%5Calpha,%20%5Cbeta%7D(x_0,x_1).%0A"> Alternating maximization in <img src="https://latex.codecogs.com/png.latex?(%5Calpha,%5Cbeta)"> corresponds to alternately enforcing the two marginal constraints, i.e.&nbsp;IPFP / Sinkhorn scaling. 
Directly maximizing the Lagrange dual <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7D(%5Calpha,%20%5Cbeta)"> using gradient ascent methods such as <a href="https://en.wikipedia.org/wiki/Limited-memory_BFGS">L-BFGS</a> is also possible and can lead to faster convergence in some cases. Note that the dual problem only has <img src="https://latex.codecogs.com/png.latex?2n"> variables, while the primal problem has <img src="https://latex.codecogs.com/png.latex?n%5E2"> variables!
</p>
</details>
</section>
<section id="dynamic-schrödinger-bridge-problem" class="level3">
<h3 class="anchored" data-anchor-id="dynamic-schrödinger-bridge-problem">Dynamic Schrödinger Bridge Problem</h3>
<p>Now, suppose that the reference distribution <img src="https://latex.codecogs.com/png.latex?%5Cmu_%7B%5Cmathrm%7Bref%7D%7D(x_0,x_1)"> is given as the two-time marginal of a Markov process <img src="https://latex.codecogs.com/png.latex?(X_t)_%7Bt%20%5Cin%20%5B0,1%5D%7D"> with transition kernels <img src="https://latex.codecogs.com/png.latex?p%5E%7B%5Ctext%7Bref%7D%7D_%7Bs,t%7D(x_s,x_t)"> and path measure <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D%5E%7B%5Ctext%7Bref%7D%7D"> on trajectories <img src="https://latex.codecogs.com/png.latex?x_%7B%5B0,1%5D%7D">.</p>
<p>The dynamic Schrödinger bridge problem consists in finding a new path measure <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BQ%7D"> on trajectories <img src="https://latex.codecogs.com/png.latex?x_%7B%5B0,1%5D%7D"> such that the starting and ending marginals are <img src="https://latex.codecogs.com/png.latex?%5Cnu_0"> and <img src="https://latex.codecogs.com/png.latex?%5Cnu_1">, while minimizing the Kullback-Leibler divergence to the reference path distribution <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D%5E%7B%5Ctext%7Bref%7D%7D">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathop%7B%5Cmathrm%7Bargmin%7D%7D_%7B%5Cmathbb%7BQ%7D:%5C%20%5Cmathbb%7BQ%7D_0=%5Cnu_0,%5C%20%5Cmathbb%7BQ%7D_1=%5Cnu_1%7D%5C;%20%5Cmathrm%7BKL%7D(%5Cmathbb%7BQ%7D%5Cmid%20%5Cmathbb%7BP%7D%5E%7B%5Ctext%7Bref%7D%7D).%0A"></p>
<p>When the reference is Markov, the chain rule for KL (disintegration with respect to <img src="https://latex.codecogs.com/png.latex?(X_0,X_1)">) shows this is equivalent to the static Schrödinger bridge problem for the two-time marginal of <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D%5E%7B%5Ctext%7Bref%7D%7D">: one first solves for the optimal endpoint coupling <img src="https://latex.codecogs.com/png.latex?%5Cgamma%5E%5Cstar(x_0,x_1)">, and then fills in intermediate times using the <strong>reference bridges</strong> <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D%5E%7B%5Ctext%7Bref%7D%7D(%5Ccdot%5Cmid%20X_0=x_0,X_1=x_1)">:</p>
<ol type="1">
<li>Sample the endpoints <img src="https://latex.codecogs.com/png.latex?(X_0,%20X_1)%20%5Csim%20%5Cgamma%5E%5Cstar(x_0,x_1)">.<br>
</li>
<li>Sample the intermediate points according to the conditional law of the reference process given the endpoints, i.e.&nbsp;<img src="https://latex.codecogs.com/png.latex?(X_t)_%7Bt%20%5Cin%20(0,1)%7D%20%5Csim%20%5Cmathbb%7BP%7D%5E%7B%5Ctext%7Bref%7D%7D(%5Ccdot%20%5Cmid%20X_0,%20X_1)">.</li>
</ol>
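As a concrete instance of this two-step recipe (a sketch under simplifying assumptions), take the reference process to be a Brownian motion with volatility σ: its bridges are Gaussian, so step 2 can be carried out sequentially in closed form. The fixed endpoints below are placeholders standing in for a draw from the optimal coupling of step 1.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 0.5

def sample_bridge_path(x0, x1, ts):
    """Step 2: fill in a Brownian bridge between the endpoints at increasing
    times ts in (0, 1). Given the current point (t_prev, x_prev) and the
    endpoint (1, x1), the next value at time t is Gaussian with mean
    a * x_prev + (1 - a) * x1 where a = (1 - t) / (1 - t_prev), and variance
    sigma^2 (t - t_prev)(1 - t)/(1 - t_prev)."""
    path = [x0]
    t_prev, x_prev = 0.0, x0
    for t in ts:
        a = (1.0 - t) / (1.0 - t_prev)
        mean = a * x_prev + (1.0 - a) * x1
        var = sigma**2 * (t - t_prev) * (1.0 - t) / (1.0 - t_prev)
        x_prev = mean + np.sqrt(var) * rng.standard_normal()
        path.append(x_prev)
        t_prev = t
    path.append(x1)
    return np.array(path)

# Step 1 (assumed done): endpoints drawn from the optimal coupling; fixed here.
x0, x1 = 0.0, 1.0
path = sample_bridge_path(x0, x1, ts=np.linspace(0.1, 0.9, 9))
print(path[0], path[-1])
```

The sequential construction is exact because the reference bridge is itself Markov: conditioning the remaining path on the current point and the endpoint again yields a Brownian bridge.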
<p><strong>Continuous-Time Stochastic Processes:</strong></p>
<p>A typical setting is when the reference process <img src="https://latex.codecogs.com/png.latex?(X_t)"> is given as the solution of a stochastic differential equation of the form</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AdX_t%20=%20b_t(X_t)%20%5C,%20dt%20+%20%5Csigma%20%5C,%20dW_t,%0A"></p>
<p>started from some initial distribution (one may take it to be <img src="https://latex.codecogs.com/png.latex?%5Cnu_0"> without loss of generality, since any mismatch can be absorbed into the endpoint tilt below). The solution to the dynamic Schrödinger bridge problem is given by the twisted path distribution:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7Bd%5Cmathbb%7BQ%7D%5E%5Cstar%7D%7Bd%5Cmathbb%7BP%7D%5E%7B%5Ctext%7Bref%7D%7D%7D(x_%7B%5B0,1%5D%7D)%0A%5Cpropto%20f(X_0)%5C,g(X_1),%0A"></p>
<p>for endpoint potentials <img src="https://latex.codecogs.com/png.latex?f"> and <img src="https://latex.codecogs.com/png.latex?g"> satisfying the marginal constraints at <img src="https://latex.codecogs.com/png.latex?t=0"> and <img src="https://latex.codecogs.com/png.latex?t=1">.</p>
<p>The marginal density at intermediate time <img src="https://latex.codecogs.com/png.latex?t%5Cin%5B0,1%5D"> factorizes as:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Aq_t(x)%20=%20p_t%5E%7B%5Ctext%7Bref%7D%7D(x)%5C,%5Cwidehat%7B%5Cvarphi%7D_t(x)%5C,%5Cvarphi_t(x),%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?p_t%5E%7B%5Ctext%7Bref%7D%7D"> is the time-<img src="https://latex.codecogs.com/png.latex?t"> marginal density of the reference process, and where the time-dependent Schrödinger potentials are defined by:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Baligned%7D%0A%5Cvarphi_t(x)%20&amp;=%20%5Cmathbb%7BE%7D_%7B%5Cmathbb%7BP%7D%5E%7B%5Ctext%7Bref%7D%7D%7D%5Bg(X_1)%5Cmid%20X_t=x%5D,%5C%5C%0A%5Cwidehat%7B%5Cvarphi%7D_t(x)%20&amp;=%20%5Cmathbb%7BE%7D_%7B%5Cmathbb%7BP%7D%5E%7B%5Ctext%7Bref%7D%7D%7D%5Bf(X_0)%5Cmid%20X_t=x%5D,%0A%5Cend%7Baligned%7D%0A%5Cright.%0A"></p>
<p>(with <img src="https://latex.codecogs.com/png.latex?%5Cvarphi_1%20%5Cequiv%20g"> and <img src="https://latex.codecogs.com/png.latex?%5Cwidehat%7B%5Cvarphi%7D_0%20%5Cequiv%20f">, up to normalization). Naturally, the dynamics of the Schrödinger bridge process can be described as a new stochastic differential equation obtained by applying a <a href="../../notes/doob_transforms/doob.html">Doob h-transform</a> to the reference SDE,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AdX_t%20=%20b_t(X_t)%5C,dt%20+%20%20%5Ctextcolor%7Bblue%7D%7B%5Csigma%20%5Csigma%5E%5Ctop%20%5Cnabla_x%20%5Clog%20%5Cvarphi_t(%20X_t)%5C,dt%7D%20+%20%5Csigma%5C,dW_t,%0A"></p>
<p>where, as explained in these previous <a href="../../notes/doob_transforms/doob.html">notes</a>, the function <img src="https://latex.codecogs.com/png.latex?%5Cvarphi_t(x)"> is the harmonic extension of the terminal potential <img src="https://latex.codecogs.com/png.latex?g"> defined above.</p>
<p><strong>Discrete-Time Markov Chains:</strong></p>
<p>It is often useful to state the Schrödinger bridge dynamics in a discrete setting. Let the reference process be a Markov chain <img src="https://latex.codecogs.com/png.latex?X_0,%20%5Cldots,%20X_T"> with one-step kernels <img src="https://latex.codecogs.com/png.latex?M_k(x,dy)">, i.e. <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BP%7D%5E%7B%5Ctext%7Bref%7D%7D(X_%7Bk+1%7D%5Cin%20dy%5Cmid%20X_k=x)=M_k(x,dy).%0A"> Let <img src="https://latex.codecogs.com/png.latex?g"> be the terminal potential at time <img src="https://latex.codecogs.com/png.latex?T"> and define the backward (harmonic) potentials by <img src="https://latex.codecogs.com/png.latex?%5Cvarphi_T%20%5Cequiv%20g"> and the recursion: <img src="https://latex.codecogs.com/png.latex?%0A%5Cvarphi_k(x)%20=%20%5Cint%20M_k(x,dy)%5C,%5Cvarphi_%7Bk+1%7D(y).%0A"></p>
<p>Then the Schrödinger bridge <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BQ%7D%5E%5Cstar"> is Markov and its <strong>forward transition kernels</strong> are given by the discrete Doob <img src="https://latex.codecogs.com/png.latex?h">-transform: <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BQ%7D%5E%5Cstar(X_%7Bk+1%7D%5Cin%20dy%5Cmid%20X_k=x)%0A=%5Cfrac%7BM_k(x,dy)%5C,%5Cvarphi_%7Bk+1%7D(y)%7D%7B%5Cvarphi_k(x)%7D.%0A"> This is the discrete-time counterpart of the continuous-time drift correction <img src="https://latex.codecogs.com/png.latex?b_t%20%5Cmapsto%20b_t%20+%20%5Csigma%5Csigma%5E%5Ctop%5Cnabla%5Clog%5Cvarphi_t(%5Ccdot)">.</p>
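For a finite state space, the backward recursion and the h-transformed kernels take a few lines of NumPy. The sketch below (random toy kernels, hypothetical names) verifies that the transformed kernels are valid transition matrices, as guaranteed by the normalization built into the recursion for the potentials.

```python
import numpy as np

rng = np.random.default_rng(4)
n, T = 5, 4

# Reference chain: random one-step transition matrices M_k (rows sum to 1).
M = [rng.random((n, n)) + 0.1 for _ in range(T)]
M = [m / m.sum(axis=1, keepdims=True) for m in M]

# Terminal potential g > 0 and backward recursion phi_k = M_k phi_{k+1}.
g = rng.random(n) + 0.1
phi = [None] * (T + 1)
phi[T] = g
for k in range(T - 1, -1, -1):
    phi[k] = M[k] @ phi[k + 1]

# Discrete Doob h-transform: Q_k(x, y) = M_k(x, y) phi_{k+1}(y) / phi_k(x).
Q = [M[k] * phi[k + 1][None, :] / phi[k][:, None] for k in range(T)]
print(max(np.abs(q.sum(axis=1) - 1.0).max() for q in Q))
```

Rows sum to one by construction, since dividing by the potential at the current state exactly renormalizes each row.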
<p>In general, Schrödinger bridge problems are difficult since the endpoint potentials <img src="https://latex.codecogs.com/png.latex?f"> and <img src="https://latex.codecogs.com/png.latex?g"> must be found such that the marginal constraints are satisfied. Recent years have seen the development of many numerical methods to solve this problem approximately, especially in the machine learning community, e.g.&nbsp;<span class="citation" data-cites="shi2023diffusion">(Shi et al. 2023)</span>; <span class="citation" data-cites="de2021diffusion">(De Bortoli et al. 2021)</span>; <span class="citation" data-cites="vargas2021solving">(Vargas et al. 2021)</span>; <span class="citation" data-cites="chen2021likelihood">(Chen, Liu, and Theodorou 2021)</span>.</p>



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0">
<div id="ref-chen2021likelihood" class="csl-entry">
Chen, Tianrong, Guan-Horng Liu, and Evangelos A Theodorou. 2021. <span>“Likelihood Training of Schrodinger Bridge Using Forward-Backward Sdes Theory.”</span> <em>arXiv Preprint arXiv:2110.11291</em>.
</div>
<div id="ref-cuturi2013sinkhorn" class="csl-entry">
Cuturi, Marco. 2013. <span>“Sinkhorn Distances: Lightspeed Computation of Optimal Transport.”</span> <em>Advances in Neural Information Processing Systems</em> 26.
</div>
<div id="ref-de2021diffusion" class="csl-entry">
De Bortoli, Valentin, James Thornton, Jeremy Heng, and Arnaud Doucet. 2021. <span>“Diffusion Schrodinger Bridge with Applications to Score-Based Generative Modeling.”</span> <em>Advances in Neural Information Processing Systems</em> 34: 17695–709.
</div>
<div id="ref-shi2023diffusion" class="csl-entry">
Shi, Yuyang, Valentin De Bortoli, Andrew Campbell, and Arnaud Doucet. 2023. <span>“Diffusion Schrodinger Bridge Matching.”</span> <em>Advances in Neural Information Processing Systems</em> 36: 62183–223.
</div>
<div id="ref-vargas2021solving" class="csl-entry">
Vargas, Francisco, Pierre Thodoroff, Austen Lamacraft, and Neil Lawrence. 2021. <span>“Solving Schrodinger Bridges via Maximum Likelihood.”</span> <em>Entropy</em> 23 (9). MDPI: 1134.
</div>
</div></section></div> ]]></description>
  <category>SDE</category>
  <category>markov</category>
  <guid>https://alexxthiery.github.io/notes/shrodinger_bridge/shrodinger.html</guid>
  <pubDate>Sat, 18 Oct 2025 17:00:00 GMT</pubDate>
</item>
<item>
  <title>Infinite Products</title>
  <link>https://alexxthiery.github.io/notes/infinite_products/inft_prod.html</link>
  <description><![CDATA[ 





<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/infinite_products/hadamard.jpg" class="img-fluid figure-img" style="width:60.0%"></p>
<figcaption><a href="https://en.wikipedia.org/wiki/Jacques_Hadamard">Jacques Hadamard</a> (1865 – 1963)</figcaption>
</figure>
</div>
</div>
<p>Any polynomial <img src="https://latex.codecogs.com/png.latex?P(z)"> can be expressed as a product over its zeros, <img src="https://latex.codecogs.com/png.latex?P(z)%20=%20c%20%5Cprod_%7Bk=1%7D%5En%20(z%20-%20z_k)">. Now, consider an entire function <img src="https://latex.codecogs.com/png.latex?f(z)"> with an infinite number of zeros <img src="https://latex.codecogs.com/png.latex?z_k">. Necessarily, the zeros must accumulate only at infinity, and one could be tempted to compare <img src="https://latex.codecogs.com/png.latex?f"> to <img src="https://latex.codecogs.com/png.latex?c%20%5C,%20%5Cprod_%7Bk%20%5Cgeq%201%7D%20(z%20-%20z_k)">. Unfortunately, this does not work since there is no hope for the product to converge. Instead, it seems more reasonable to consider <img src="https://latex.codecogs.com/png.latex?c%20%5C,%20%5Cprod_%7Bk%20%5Cgeq%201%7D%20(1%20-%20z/z_k)"> since, for a given <img src="https://latex.codecogs.com/png.latex?z">, this product has a better chance to converge because <img src="https://latex.codecogs.com/png.latex?(1%20-%20z/z_k)%20%5Cto%201"> as <img src="https://latex.codecogs.com/png.latex?k%20%5Cto%20%5Cinfty">. For simplicity, one can assume that the <img src="https://latex.codecogs.com/png.latex?z_k"> are non-zero, since otherwise, one can just add a factor <img src="https://latex.codecogs.com/png.latex?z%5Em"> to the product, where <img src="https://latex.codecogs.com/png.latex?m"> is the multiplicity of the zero at <img src="https://latex.codecogs.com/png.latex?0">.</p>
<p>There are indeed a few issues. First, one needs the condition <img src="https://latex.codecogs.com/png.latex?%5Csum_%7Bk%20%5Cgeq%201%7D%201/%7Cz_k%7C%20%3C%20%5Cinfty"> to ensure convergence. Second, even if this condition is satisfied, any function of the type <img src="https://latex.codecogs.com/png.latex?e%5E%7Bg(z)%7D%20%5C,%20%5Cprod_%7Bk%20%5Cgeq%201%7D%20(1%20-%20z/z_k)">, where <img src="https://latex.codecogs.com/png.latex?g"> is an entire function, would also share the same zeros. The first issue is quite easily taken care of. Instead of considering terms of the type <img src="https://latex.codecogs.com/png.latex?(1%20-%20z/z_k)">, one needs to consider terms that converge much faster to <img src="https://latex.codecogs.com/png.latex?1"> as <img src="https://latex.codecogs.com/png.latex?z_k%20%5Cto%20%5Cinfty">, and only vanish at <img src="https://latex.codecogs.com/png.latex?z_k">. A natural choice is <img src="https://latex.codecogs.com/png.latex?E_p(z/z_k)"> with</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AE_p(z)%20=%20(1%20-%20z)%20%5C,%20%5Cexp%5Cleft(z%20+%20%5Cfrac%7Bz%5E2%7D%7B2%7D%20+%20%5Ccdots%20+%20%5Cfrac%7Bz%5Ep%7D%7Bp%7D%5Cright)%20%5Capprox_0%201%20-%20%5Cfrac%7Bz%5E%7Bp+1%7D%7D%7Bp+1%7D.%0A"></p>
<p>It is then easy to see, for example, that the product <img src="https://latex.codecogs.com/png.latex?%5Cprod_k%20E_k(z/z_k)"> is well defined for <img src="https://latex.codecogs.com/png.latex?z%20%5Cin%20%5Cmathbb%7BC%7D"> and precisely vanishes at the <img src="https://latex.codecogs.com/png.latex?z_k">. If one knew that the zeros satisfied <img src="https://latex.codecogs.com/png.latex?%5Csum_%7Bk%20%5Cgeq%201%7D%201/%7Cz_k%7C%5E%7Bp+1%7D%20%3C%20%5Cinfty">, then one could use instead <img src="https://latex.codecogs.com/png.latex?%5Cprod_k%20E_p(z/z_k)">. However, this approach is often of limited use since, as mentioned above, one can always multiply by an entire function <img src="https://latex.codecogs.com/png.latex?e%5E%7Bg(z)%7D"> to obtain another entire function with the same zeros.</p>
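<p>As a quick numerical illustration (a Python sketch, not part of the original argument): take zeros at the positive integers, <code>z_k = k</code>, for which <code>sum 1/k</code> diverges but <code>sum 1/k^2</code> converges. The plain partial products of <code>(1 - z/k)</code> drift to zero like <code>N^{-z}</code>, while the partial products of the elementary factors <code>E_1(z/k)</code> stabilize:</p>

```python
import math

def plain_partial(z, N):
    # partial product of (1 - z/k): behaves like N^{-z}, so it drifts to 0
    p = 1.0
    for k in range(1, N + 1):
        p *= 1.0 - z / k
    return p

def e1_partial(z, N):
    # partial product of elementary factors E_1(z/k) = (1 - z/k) exp(z/k)
    p = 1.0
    for k in range(1, N + 1):
        p *= (1.0 - z / k) * math.exp(z / k)
    return p

z = 0.5
print(plain_partial(z, 10**3), plain_partial(z, 10**5))  # keeps shrinking
print(e1_partial(z, 10**3), e1_partial(z, 10**5))        # stabilizes
```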
<p>To make progress, one can impose some growth condition on the entire function <img src="https://latex.codecogs.com/png.latex?f(z)">. For example, an entire function <img src="https://latex.codecogs.com/png.latex?f(z)"> is said to be of order <img src="https://latex.codecogs.com/png.latex?%5Crho"> if</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%7Cf(z)%7C%20%5Cleq%20C_%7B%5Cvarepsilon%7D%20%5C,%20%5Cexp%5Cleft(%7Cz%7C%5E%7B%5Crho%20+%20%5Cvarepsilon%7D%5Cright)%20%5Cqquad%20%5Ctext%7Bfor%20all%20%7D%20%5Cvarepsilon%3E%200.%0A"></p>
<p>For example, one can readily see that the sine function is of order <img src="https://latex.codecogs.com/png.latex?1">, and any polynomial is of order <img src="https://latex.codecogs.com/png.latex?0">. If one knows that the entire function <img src="https://latex.codecogs.com/png.latex?f(z)"> is of order <img src="https://latex.codecogs.com/png.latex?%5Crho"> and has zeros <img src="https://latex.codecogs.com/png.latex?z_k"> (counted with multiplicity), then <a href="https://en.wikipedia.org/wiki/Hadamard_factorization_theorem">Hadamard’s factorization theorem</a> states that, in fact, the function <img src="https://latex.codecogs.com/png.latex?f"> can be expressed as</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Af(z)%20=%20e%5E%7BP(z)%7D%5C,%20z%5Em%20%5C,%20%5Cprod_%7Bk%20%5Cgeq%201%7D%20E_%7Bd%7D%20%5Cleft(%5Cfrac%7Bz%7D%7Bz_k%7D%5Cright)%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?m"> is the multiplicity of the zero at <img src="https://latex.codecogs.com/png.latex?0">, <img src="https://latex.codecogs.com/png.latex?d%20=%20%5Clfloor%20%5Crho%20%5Crfloor">, and <img src="https://latex.codecogs.com/png.latex?P(z)"> is a polynomial of degree at most <img src="https://latex.codecogs.com/png.latex?d">.</p>
<section id="some-natural-examples" class="level3">
<h3 class="anchored" data-anchor-id="some-natural-examples">Some natural examples</h3>
<p>One can then ask oneself which natural entire functions vanish at some predetermined set of zeros. For example, what functions vanish on all the integers? Such a function cannot be of order less than one since otherwise it could be written as <img src="https://latex.codecogs.com/png.latex?z%20%5C,%20%5Cprod_%7Bk%20%5Cgeq%201%7D%20(1-z/k)(1+z/k)">, but this product does not converge. Any entire function of order <img src="https://latex.codecogs.com/png.latex?1"> that has a simple zero at each integer is of the form</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0Af(z)%0A&amp;=%20e%5E%7Baz%20+%20b%7D%20%5C,%20z%20%5C,%20%5Cprod_%7Bk%20%5Cgeq%201%7D(1-z/k)(1+z/k)%20e%5E%7Bz/z_k%7D%20%5C,%20e%5E%7B-z/z_k%7D%5C%5C%0A&amp;=%20e%5E%7Baz%20+%20b%7D%20%5C,%20z%20%5C,%20%5Cprod_%7Bk%20%5Cgeq%201%7D%20(1-z%5E2/k%5E2).%0A%5Cend%7Balign%7D%0A"></p>
<p>Checking the Taylor expansion of sine at zero, one finds the celebrated formula</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Csin(%5Cpi%20z)%20=%20%5Cpi%20%5C,%20z%20%5C,%20%5Cprod_%7Bk%20%5Cgeq%201%7D%20%5Cleft(1%20-%20%5Cfrac%7Bz%5E2%7D%7Bk%5E2%7D%5Cright)%0A"></p>
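<p>This product formula is easy to check numerically; here is a short illustrative Python sketch (not part of the original derivation; the truncation level <code>N</code> is an arbitrary choice, and the convergence is only algebraic, at rate roughly <code>z^2/N</code>):</p>

```python
import math

def sin_product(z, N=20000):
    # truncated product: pi * z * prod_{k=1}^N (1 - z^2 / k^2)
    p = math.pi * z
    for k in range(1, N + 1):
        p *= 1.0 - (z * z) / (k * k)
    return p

for z in (0.3, 1.5, -2.7):
    print(z, sin_product(z), math.sin(math.pi * z))
```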
<p>A similarly interesting example is obtained by taking the derivative of <img src="https://latex.codecogs.com/png.latex?%5Clog%20%5Csin(%5Cpi%20z)">. Now, what about a function that vanishes on all the negative integers? Again, such a function needs to be of order at least one. Hadamard’s factorization theorem then tells us that such a function is, up to a multiplicative constant, of the form</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ag(z)%20=%20e%5E%7Baz%7D%20%5C,%20z%20%5C,%20%5Cprod_%7Bk%20%5Cgeq%201%7D%20(1%20+%20z/k)%20e%5E%7B-z/k%7D.%0A"></p>
<p>Naturally, since such a function vanishes on all the negative integers, one of the first things one would like to try is to look at <img src="https://latex.codecogs.com/png.latex?g(z+1)"> and relate it to <img src="https://latex.codecogs.com/png.latex?g"> itself. Since <img src="https://latex.codecogs.com/png.latex?g(z+1)"> vanishes on <img src="https://latex.codecogs.com/png.latex?%5C%7B-1,%20-2,%20%5Cldots%5C%7D">, one knows that it can be expressed as <img src="https://latex.codecogs.com/png.latex?e%5E%7Ba'z%20+%20b'%7D%20%5C,%20z%20%5C,%20%5Cprod_%7Bk%20%5Cgeq%201%7D%20(1%20+%20z/k)%20e%5E%7B-z/k%7D"> so that <img src="https://latex.codecogs.com/png.latex?z%20%5C,%20g(z+1)"> is almost the same as <img src="https://latex.codecogs.com/png.latex?g(z)">. One can then do some algebra to choose the constant <img src="https://latex.codecogs.com/png.latex?a"> so that <img src="https://latex.codecogs.com/png.latex?z%20%5C,%20g(z+1)%20=%20g(z)">. One finds that the correct choice is the <a href="https://en.wikipedia.org/wiki/Euler's_constant">Euler-Mascheroni constant</a>,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Aa%20=%20%5Cgamma%20=%20%5Clim_%7Bn%20%5Cto%20%5Cinfty%7D%20%5Cleft(%20%5Csum_%7Bk=1%7D%5En%20%5Cfrac%7B1%7D%7Bk%7D%20-%20%5Clog%20n%20%5Cright).%0A"></p>
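<p>As a quick sanity check of this limit (an illustrative Python sketch; the convergence is logarithmically slow, with error of order <code>1/(2n)</code>):</p>

```python
import math

def euler_gamma_partial(n):
    # H_n - log(n), which converges to the Euler-Mascheroni constant
    return sum(1.0 / k for k in range(1, n + 1)) - math.log(n)

print(euler_gamma_partial(10**6))  # close to 0.5772156649...
```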
<p>This gives the final expression for <img src="https://latex.codecogs.com/png.latex?g(z)"> as</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ag(z)%20=%20e%5E%7B%5Cgamma%20z%7D%20%5C,%20z%20%5C,%20%5Cprod_%7Bk%20%5Cgeq%201%7D%20(1%20+%20z/k)%20e%5E%7B-z/k%7D.%0A"></p>
<p>Furthermore, since <img src="https://latex.codecogs.com/png.latex?g(z)/z%20%5Cto%201"> as <img src="https://latex.codecogs.com/png.latex?z%20%5Cto%200">, it follows that <img src="https://latex.codecogs.com/png.latex?g(1)=1">, from which the identity <img src="https://latex.codecogs.com/png.latex?z%20%5C,%20g(z+1)%20=%20g(z)"> gives that <img src="https://latex.codecogs.com/png.latex?1/g(n+1)%20=%20n!">. In other words, <img src="https://latex.codecogs.com/png.latex?g(z)"> is an analytic continuation on the whole complex plane of the inverse of the <a href="https://en.wikipedia.org/wiki/Gamma_function">Gamma function</a>,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5CGamma(z)%20=%20e%5E%7B-%5Cgamma%20z%7D%20%5C,%20z%5E%7B-1%7D%20%5C,%20%5Cprod_%7Bk%20%5Cgeq%201%7D%20(1%20+%20z/k)%5E%7B-1%7D%20e%5E%7Bz/k%7D.%0A"></p>
<p>This generalizes the usual definition <img src="https://latex.codecogs.com/png.latex?%5CGamma(z)%20=%20%5Cint_0%5E%5Cinfty%20t%5E%7Bz-1%7D%20e%5E%7B-t%7D%20%5C,%20dt"> valid for <img src="https://latex.codecogs.com/png.latex?%5CRe(z)%20%3E%200"> to the whole complex plane.</p>
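<p>The Weierstrass product above is concrete enough to evaluate directly. Here is an illustrative Python sketch (truncating the product at <code>N</code> factors, an arbitrary choice), which reproduces the values of the standard Gamma function:</p>

```python
import math

EULER_GAMMA = 0.5772156649015329

def gamma_weierstrass(z, N=10**5):
    # Gamma(z) = e^{-gamma z} z^{-1} prod_{k>=1} (1 + z/k)^{-1} e^{z/k},
    # computed here via its (truncated) inverse
    inv = math.exp(EULER_GAMMA * z) * z
    for k in range(1, N + 1):
        inv *= (1.0 + z / k) * math.exp(-z / k)
    return 1.0 / inv

for z in (0.5, 1.0, 3.3):
    print(z, gamma_weierstrass(z), math.gamma(z))
```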
<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/infinite_products/Gamma_abs_3D.png" class="img-fluid figure-img" style="width:60.0%"></p>
<figcaption><a href="https://en.wikipedia.org/wiki/Gamma_function">Gamma function</a></figcaption>
</figure>
</div>
</div>
<p>The connection to the sine function is also almost immediate. Indeed, the function <img src="https://latex.codecogs.com/png.latex?g(z)%20g(-z)"> vanishes on all the integers and the infinite product immediately shows that <img src="https://latex.codecogs.com/png.latex?g(z)g(-z)%20=%20-z%20%5C,%20%5Csin(%5Cpi%20z)%20/%20%5Cpi">. But since <img src="https://latex.codecogs.com/png.latex?g(-z)/(-z)%20=%20g(1-z)">, one obtains that <img src="https://latex.codecogs.com/png.latex?g(z)%20g(1-z)%20=%20%5Csin(%5Cpi%20z)%20/%20%5Cpi">, i.e.</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5CGamma(z)%20%5C,%20%5CGamma(1-z)%20=%20%5Cfrac%7B%5Cpi%7D%7B%5Csin(%5Cpi%20z)%7D.%0A"> This is the celebrated <a href="https://en.wikipedia.org/wiki/Reflection_formula">Euler reflection formula</a>, from which it also follows that <img src="https://latex.codecogs.com/png.latex?%5CGamma(1/2)%20=%20%5Csqrt%7B%5Cpi%7D">.</p>
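<p>Both identities are immediate to verify numerically with the standard library:</p>

```python
import math

# Euler reflection formula: Gamma(z) * Gamma(1 - z) = pi / sin(pi z)
for z in (0.1, 0.25, 0.5, 0.9):
    lhs = math.gamma(z) * math.gamma(1.0 - z)
    rhs = math.pi / math.sin(math.pi * z)
    print(z, lhs, rhs)

# special case z = 1/2: Gamma(1/2) = sqrt(pi)
print(math.gamma(0.5), math.sqrt(math.pi))
```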
</section>
<section id="zeta-function" class="level3">
<h3 class="anchored" data-anchor-id="zeta-function">Zeta function</h3>
<p>Let’s conclude these notes by deriving the analytic continuation of the Riemann <a href="https://en.wikipedia.org/wiki/Riemann_zeta_function">zeta function</a> <img src="https://latex.codecogs.com/png.latex?%5Czeta(s)%20=%20%5Csum_%7Bn=1%7D%5E%5Cinfty%20n%5E%7B-s%7D">, originally defined for <img src="https://latex.codecogs.com/png.latex?%5CRe(s)%20%3E%201">, to the whole complex plane. I have always found this proof very elegant. A common trick that is used in many places is to do a change of variable in the definition of the Gamma function to obtain that</p>
<p><img src="https://latex.codecogs.com/png.latex?n%5E%7B-s%7D%20=%20%5CGamma(s)%5E%7B-1%7D%20%5C,%20%5Cint_0%5E%5Cinfty%20e%5E%7B-nt%7D%20t%5E%7Bs-1%7D%20%5C,%20dt."></p>
<p>It is valid for <img src="https://latex.codecogs.com/png.latex?%5CRe(s)%20%3E%200">. This allows one to express the zeta function as:</p>
<p><span id="eq-boring"><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0A%5Czeta(s)%0A&amp;=%20%5CGamma(s)%5E%7B-1%7D%20%5Cint_0%5E%5Cinfty%20%20%7B%5Cleft%5C%7B%20%20%5Csum_%7Bn=1%7D%5E%5Cinfty%20e%5E%7B-n%20t%7D%20%20%5Cright%5C%7D%7D%20%20%5C,%20t%5E%7Bs-1%7D%20%5C,%20dt%5C%5C%0A&amp;=%20%5CGamma(s)%5E%7B-1%7D%20%5Cint_0%5E%5Cinfty%20%5Cfrac%7B1%7D%7Be%5E%7Bt%7D%20-%201%7D%20%5C,%20t%5E%7Bs-1%7D%20%5C,%20dt.%0A%5Cend%7Balign%7D%0A%5Ctag%7B1%7D"></span></p>
<p>From the previous discussion, one knows that <img src="https://latex.codecogs.com/png.latex?%5CGamma(s)%5E%7B-1%7D"> is an entire function of order <img src="https://latex.codecogs.com/png.latex?1">; the integral itself, however, only converges for <img src="https://latex.codecogs.com/png.latex?%5CRe(s)%20%3E%201">, since things get nasty near <img src="https://latex.codecogs.com/png.latex?t=0">. To continue <img src="https://latex.codecogs.com/png.latex?%5Czeta(s)"> beyond this domain, the standard approach consists in splitting the integral into two parts. The part <img src="https://latex.codecogs.com/png.latex?%5Cint_1%5E%5Cinfty%20%5Cfrac%7B1%7D%7Be%5Et%20-%201%7D%20%5C,%20t%5E%7Bs-1%7D%20%5C,%20dt"> defines an entire function, and one only needs to take care of the integral <img src="https://latex.codecogs.com/png.latex?%5Cint_0%5E1%20%5Cfrac%7B1%7D%7Be%5Et%20-%201%7D%20%5C,%20t%5E%7Bs-1%7D%20%5C,%20dt">. This can be done by expressing <img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B1%7D%7Be%5Et%20-%201%7D%20=%20%5Csum_%7Bm=0%7D%5E%7B%5Cinfty%7D%20B_m%20%5C,%20t%5E%7Bm-1%7D%20/%20m!">, where <img src="https://latex.codecogs.com/png.latex?B_m"> are the <a href="https://en.wikipedia.org/wiki/Bernoulli_number">Bernoulli numbers</a>, and integrating term by term, but the way <a href="https://en.wikipedia.org/wiki/Bernhard_Riemann">Bernhard Riemann</a> did it is way more fun. The idea is not to use the boring <img src="https://latex.codecogs.com/png.latex?%5Csum_%7Bn=1%7D%5E%5Cinfty%20e%5E%7B-n%20t%7D"> but instead to introduce the far more interesting function</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Ctheta(t)%20=%20%5Csum_%7Bn%20%5Cin%20%5Cmathbb%7BZ%7D%7D%20e%5E%7B-%5Cpi%20n%5E2%20t%7D%20=%201%20+%202%20%5C,%20J(t)%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?J(t)%20=%20%5Csum_%7Bn=1%7D%5E%5Cinfty%20e%5E%7B-%5Cpi%20n%5E2%20t%7D">, defined for <img src="https://latex.codecogs.com/png.latex?t%3E0">. Note that the function <img src="https://latex.codecogs.com/png.latex?J(t)"> decreases exponentially rapidly to <img src="https://latex.codecogs.com/png.latex?0"> as <img src="https://latex.codecogs.com/png.latex?t%20%5Cto%20%5Cinfty">. Instead of Equation&nbsp;1, one can just as easily write</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7B1%7D%7B%5Cpi%5E%7Bs/2%7D%20%5C,%20n%5Es%7D%20=%20%5Cfrac%7B1%7D%7B%5CGamma(s/2)%7D%20%5Cint_0%5E%5Cinfty%20e%5E%7B-%5Cpi%20n%5E2%20t%7D%20%5C,%20t%5E%7B%7Bs/2%7D-1%7D%20%5C,%20dt.%0A"></p>
<p>This slight change of parametrization allows one to write the zeta function as</p>
<p><span id="eq-less-boring"><img src="https://latex.codecogs.com/png.latex?%0A%5Czeta(s)%20=%20%5Cfrac%7B%5Cpi%5E%7Bs/2%7D%7D%7B%5CGamma(s/2)%7D%20%5Cint_0%5E%5Cinfty%20J(t)%20%5C,%20t%5E%7B%7Bs/2%7D-1%7D%20%5C,%20dt.%0A%5Ctag%7B2%7D"></span></p>
<p>At first sight, one has not gained much since there is still an issue at <img src="https://latex.codecogs.com/png.latex?t=0">, where <img src="https://latex.codecogs.com/png.latex?J(t)"> diverges. However, the Jacobi theta function <img src="https://latex.codecogs.com/png.latex?%5Ctheta(t)"> enjoys some interesting symmetries. Crucially, the <a href="https://en.wikipedia.org/wiki/Poisson_summation_formula">Poisson summation</a> formula applied to the Gaussian function <img src="https://latex.codecogs.com/png.latex?x%20%5Cmapsto%20e%5E%7B-%5Cpi%20x%5E2%20t%7D"> gives that <img src="https://latex.codecogs.com/png.latex?%5Ctheta(t)"> satisfies the modular&nbsp;inversion&nbsp;symmetry:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Ctheta(t)%20=%20%5Cfrac%7B1%7D%7B%5Csqrt%7Bt%7D%7D%20%5C,%20%5Ctheta(1/t).%0A"></p>
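<p>The modular inversion symmetry can be checked numerically with a heavily truncated sum (an illustrative sketch; the truncation level <code>N</code> is arbitrary since the series converges extremely fast):</p>

```python
import math

def theta(t, N=50):
    # Jacobi theta: sum over n in Z of exp(-pi n^2 t), truncated at |n| <= N
    return 1.0 + 2.0 * sum(math.exp(-math.pi * n * n * t) for n in range(1, N + 1))

# modular inversion symmetry: theta(t) = theta(1/t) / sqrt(t)
for t in (0.37, 1.0, 2.5):
    print(t, theta(t), theta(1.0 / t) / math.sqrt(t))
```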
<p>This means that splitting <img src="https://latex.codecogs.com/png.latex?%5Cint_0%5E%7B%5Cinfty%7D%20=%20%5Cint_0%5E1%20+%20%5Cint_1%5E%7B%5Cinfty%7D"> in Equation&nbsp;2, using the change of variable <img src="https://latex.codecogs.com/png.latex?t%20%5Cmapsto%201/t"> to map <img src="https://latex.codecogs.com/png.latex?%5Cint_0%5E1"> to <img src="https://latex.codecogs.com/png.latex?%5Cint_1%5E%7B%5Cinfty%7D">, and finally using the modular inversion symmetry of the theta function leads, after standard algebra, to:</p>
<p><span id="eq-zeta-expansion"><img src="https://latex.codecogs.com/png.latex?%0A%5Czeta(s)%20=%20%5Cfrac%7B%5Cpi%5E%7Bs/2%7D%7D%7B%5CGamma(s/2)%7D%20%20%7B%5Cleft%5C%7B%0A%5Cunderbrace%7B%5Cint_%7B1%7D%5E%7B%5Cinfty%7D%20J(t)%20%5C,%20%20%7B%5Cleft(%20t%5E%7B(1-s)/2%7D%20+%20t%5E%7Bs/2%7D%20%5Cright)%7D%20%20%5C,%20%5Cfrac%7Bdt%7D%7Bt%7D%20-%20%5Cfrac%7B1%7D%7B1-s%7D%20-%20%5Cfrac%7B1%7D%7Bs%7D%20%7D_%7B%5CLambda(s)%7D%0A%5Cright%5C%7D%7D%20.%0A%5Ctag%7B3%7D"></span></p>
<p>First, one can note that since <img src="https://latex.codecogs.com/png.latex?J(t)"> decreases exponentially quickly to <img src="https://latex.codecogs.com/png.latex?0"> as <img src="https://latex.codecogs.com/png.latex?t%20%5Cto%20%5Cinfty">, the integral above defines an entire function. This means that the expression above defines a meromorphic continuation of <img src="https://latex.codecogs.com/png.latex?%5Czeta(s)"> to the whole complex plane with a simple pole at <img src="https://latex.codecogs.com/png.latex?s%20=%201">. There is no pole at <img src="https://latex.codecogs.com/png.latex?s=0"> since the simple zero of <img src="https://latex.codecogs.com/png.latex?%5CGamma(s/2)%5E%7B-1%7D"> takes care of it and gives the value <img src="https://latex.codecogs.com/png.latex?%5Czeta(0)%20=%20-1/2">. This also shows that the <img src="https://latex.codecogs.com/png.latex?%5Czeta"> function inherits from <img src="https://latex.codecogs.com/png.latex?%5CGamma(s/2)%5E%7B-1%7D"> a simple zero at all the negative even integers, i.e.&nbsp;<img src="https://latex.codecogs.com/png.latex?%5Czeta(-2n)%20=%200"> for <img src="https://latex.codecogs.com/png.latex?n%20%5Cgeq%201">. And there are indeed a few other zeros, as the plot below shows… and they seem to be located on the critical line <img src="https://latex.codecogs.com/png.latex?%5CRe(s)%20=%201/2">…</p>
<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/infinite_products/zeta_plot.jpg" class="img-fluid figure-img" style="width:60.0%"></p>
<figcaption>Zeta function, where are the zeros?</figcaption>
</figure>
</div>
</div>
<p>What is remarkable is that the term inside the curly brackets of Equation&nbsp;3 is symmetric in <img src="https://latex.codecogs.com/png.latex?s"> and <img src="https://latex.codecogs.com/png.latex?1-s">, i.e.&nbsp;symmetric with respect to the vertical line <img src="https://latex.codecogs.com/png.latex?%5CRe(s)%20=%201/2"> in the complex plane. This means that the function <img src="https://latex.codecogs.com/png.latex?%5CLambda(s)%20=%20%5Czeta(s)%20%5C,%20%5CGamma(s/2)%20%5C,%20%5Cpi%5E%7B-s/2%7D"> satisfies the functional equation</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5CLambda(s)%20=%20%5CLambda(1-s).%0A"></p>
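<p>Equation 3 is concrete enough to evaluate numerically for real <code>s</code>. The following illustrative Python sketch (midpoint quadrature on a truncated domain; the truncation <code>T</code> and the number of quadrature steps are arbitrary choices) recovers, for example, <code>zeta(2) = pi^2 / 6</code>; note also that the implementation of the braced term is manifestly symmetric in <code>s</code> and <code>1 - s</code>:</p>

```python
import math

def J(t, N=8):
    # J(t) = sum_{n >= 1} exp(-pi n^2 t); converges extremely fast for t >= 1
    return sum(math.exp(-math.pi * n * n * t) for n in range(1, N + 1))

def Lambda(s, T=50.0, steps=20000):
    # braced term of Equation 3, symmetric under s <-> 1 - s by construction
    h = (T - 1.0) / steps
    integral = 0.0
    for i in range(steps):
        t = 1.0 + (i + 0.5) * h  # midpoint rule on [1, T]
        integral += J(t) * (t ** ((1.0 - s) / 2.0) + t ** (s / 2.0)) / t * h
    return integral - 1.0 / (1.0 - s) - 1.0 / s

def zeta(s):
    return math.pi ** (s / 2.0) / math.gamma(s / 2.0) * Lambda(s)

print(zeta(2.0), math.pi ** 2 / 6.0)
```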


</section>

 ]]></description>
  <category>complex_analysis</category>
  <guid>https://alexxthiery.github.io/notes/infinite_products/inft_prod.html</guid>
  <pubDate>Sat, 14 Jun 2025 17:00:00 GMT</pubDate>
</item>
<item>
  <title>Fisher-Rao Geometry</title>
  <link>https://alexxthiery.github.io/notes/fisher-rao/distance.html</link>
  <description><![CDATA[ 





<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/fisher-rao/rao.jpg" class="img-fluid figure-img" style="width:60.0%"></p>
<figcaption><a href="https://en.wikipedia.org/wiki/C._R._Rao">Calyampudi Radhakrishna Rao</a> (1920 – 2023)</figcaption>
</figure>
</div>
</div>
<section id="fisher-rao-metric" class="level3">
<h3 class="anchored" data-anchor-id="fisher-rao-metric">Fisher-Rao metric</h3>
<p>Suppose we want to define a distance on the space of probability densities on <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BR%7D%5Ed">. A natural but naive approach is to use an <img src="https://latex.codecogs.com/png.latex?L%5E2">-type distance:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ad(%5Crho_1,%20%5Crho_2)%5E2%20=%20%5Cint_%7B%5Cmathbb%7BR%7D%5Ed%7D%20%7C%5Crho_1(x)%20-%20%5Crho_2(x)%7C%5E2%20%5C,%20dx,%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5Crho_i"> are densities with respect to the Lebesgue measure. However, this definition has several shortcomings. For instance, if we change the base measure to <img src="https://latex.codecogs.com/png.latex?%5Cmu(x)%20%5C,%20dx"> for some positive density <img src="https://latex.codecogs.com/png.latex?%5Cmu">, and define the distance as</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cint%20%5Cleft%7C%5Cfrac%7B%5Crho_1(x)%7D%7B%5Cmu(x)%7D%20-%20%5Cfrac%7B%5Crho_2(x)%7D%7B%5Cmu(x)%7D%5Cright%7C%5E2%20%5Cmu(x)%20%5C,%20dx,%0A"></p>
<p>we obtain a different value. Perhaps more troubling, the distance is not invariant under reparametrizations. Let <img src="https://latex.codecogs.com/png.latex?T"> be a diffeomorphism of <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BR%7D%5Ed">, and set <img src="https://latex.codecogs.com/png.latex?y%20=%20T(x)">. Then the transformed densities become <img src="https://latex.codecogs.com/png.latex?%5Crho%5EY_i(y)%20=%20%5Crho%5EX_i(x)%20%5C,%20%7CJ_T(x)%7C%5E%7B-1%7D">, where <img src="https://latex.codecogs.com/png.latex?J_T"> is the Jacobian determinant. In general,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cint%20%7C%5Crho%5EY_1(y)%20-%20%5Crho%5EY_2(y)%7C%5E2%20%5C,%20dy%20%5Cneq%20%5Cint%20%7C%5Crho%5EX_1(x)%20-%20%5Crho%5EX_2(x)%7C%5E2%20%5C,%20dx,%0A"></p>
<p>so the distance depends on the choice of coordinates. That is, measuring in Cartesian or polar coordinates yields different results—an undesirable feature. Ideally, we seek a distance that is invariant under reparametrizations and changes of base measure, such as the <a href="https://en.wikipedia.org/wiki/Total_variation_distance_of_probability_measures">total variation distance</a>,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ad_%7BTV%7D(%5Crho_1,%20%5Crho_2)%20=%20%5Cint_%7B%5Cmathbb%7BR%7D%5Ed%7D%20%7C%5Crho_1(x)%20-%20%5Crho_2(x)%7C%20%5C,%20dx.%0A"></p>
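<p>As a quick numerical illustration of this coordinate dependence (an illustrative Python sketch; the two Gaussian densities and the diffeomorphism <code>T(x) = x/2</code> are arbitrary choices): the <code>L^2</code> distance changes under the change of variables, while the total variation distance does not.</p>

```python
import math

def gauss(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

rho1 = lambda x: gauss(x, 0.0)
rho2 = lambda x: gauss(x, 1.0)
# push-forward through T(x) = x / 2: densities are rescaled by |J_T|^{-1} = 2
rho1_y = lambda y: 2.0 * rho1(2.0 * y)
rho2_y = lambda y: 2.0 * rho2(2.0 * y)

def integrate(f, a=-12.0, b=12.0, n=20000):
    # simple midpoint quadrature
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

l2_x = integrate(lambda x: (rho1(x) - rho2(x)) ** 2)
l2_y = integrate(lambda y: (rho1_y(y) - rho2_y(y)) ** 2)
tv_x = integrate(lambda x: abs(rho1(x) - rho2(x)))
tv_y = integrate(lambda y: abs(rho1_y(y) - rho2_y(y)))
print(l2_x, l2_y)  # differ: the L^2 distance is coordinate dependent
print(tv_x, tv_y)  # agree: total variation is invariant
```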
<p>One potential drawback of the total variation distance is that it is not differentiable, which can make it difficult to use in optimization problems. An alternative is to consider <a href="https://en.wikipedia.org/wiki/F-divergence"><img src="https://latex.codecogs.com/png.latex?f">-divergences</a>, defined as</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ad_f(%5Crho_1,%20%5Crho_2)%20=%20%5Cint_%7B%5Cmathbb%7BR%7D%5Ed%7D%20f%20%5Cleft(%20%5Cfrac%7Bd%5Crho_1%7D%7Bd%5Crho_2%7D%20%5Cright)%20%5Crho_2(dx),%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?f"> is a convex function with <img src="https://latex.codecogs.com/png.latex?f(1)%20=%200">. These divergences are differentiable and invariant under reparametrizations and changes of base measure, although they are not symmetric and thus not true distances. Locally, however, all <img src="https://latex.codecogs.com/png.latex?f">-divergences are equivalent, as a second-order Taylor expansion shows:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ad_f(%5Crho%20+%20d%20%5Crho,%20%5C,%20%5Crho)%20=%20%20%5Ctextrm%7B(cst)%7D%20%5Ctimes%20%5Cint%20%5Cleft(%20%5Cfrac%7Bd%20%5Crho%7D%7B%5Crho%7D%20%5Cright)%5E2%20%5C,%20%5Crho(dx)%20+%20o(%5C%7Cd%20%5Crho%5C%7C%5E2).%0A"></p>
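<p>This local equivalence is easy to observe numerically on a discrete distribution (an illustrative sketch; the base distribution and the perturbation are arbitrary choices). For the KL divergence, <code>f(x) = x log x</code> and <code>f''(1) = 1</code>; for the squared Hellinger distance, <code>f(x) = (1 - sqrt(x))^2</code> and <code>f''(1) = 1/2</code>. Both ratios below therefore tend to one as the perturbation shrinks:</p>

```python
import math

rho = [0.2, 0.3, 0.1, 0.4]        # base distribution
u = [0.05, -0.02, 0.01, -0.04]    # perturbation direction, sums to zero

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def hellinger_sq(p, q):
    return sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))

def fisher_quad(p, v):
    # the local quadratic form: int (d rho / rho)^2 rho
    return sum(vi * vi / pi for vi, pi in zip(v, p))

for eps in (0.5, 0.1, 0.01):
    p = [ri + eps * ui for ri, ui in zip(rho, u)]
    quad = eps ** 2 * fisher_quad(rho, u)
    print(eps, kl(p, rho) / (0.5 * quad), hellinger_sq(p, rho) / (0.25 * quad))
```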
<p>This means that all these divergences describe the same local geometry, defined by the Fisher-Rao information metric. Furthermore, it is relatively straightforward to derive the global geometry induced by the <a href="https://en.wikipedia.org/wiki/Fisher_information_metric">Fisher information metric</a>. Consider the mapping <img src="https://latex.codecogs.com/png.latex?%5Crho%20%5Cmapsto%20%5Csqrt%7B%5Crho%7D">, which maps a density <img src="https://latex.codecogs.com/png.latex?%5Crho"> to an element of the unit sphere <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BS%7D"> of <img src="https://latex.codecogs.com/png.latex?L%5E2(%5Cmathbb%7BR%7D%5Ed)">. Since</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5C%7C%20%5Csqrt%7B%5Crho_1%7D%20-%20%5Csqrt%7B%5Crho_2%7D%20%5C%7C_%7BL%5E2%7D%5E2%20=%20d_f(%5Crho_1,%20%5Crho_2)%0A"></p>
<p>for <img src="https://latex.codecogs.com/png.latex?f(x)%20=%20%7C1%20-%20%5Csqrt%7Bx%7D%7C%5E2">, and we have just seen that any <img src="https://latex.codecogs.com/png.latex?f">-divergence is locally equivalent to the Fisher-Rao metric, it follows that the geometry induced by the Fisher-Rao information metric is the same as the geometry induced by the <img src="https://latex.codecogs.com/png.latex?L%5E2">-norm on the unit sphere of <img src="https://latex.codecogs.com/png.latex?L%5E2(%5Cmathbb%7BR%7D%5Ed)">. This implies that the geodesic distance between two densities <img src="https://latex.codecogs.com/png.latex?%5Crho_1"> and <img src="https://latex.codecogs.com/png.latex?%5Crho_2"> is given (up to an irrelevant constant) by the <img src="https://latex.codecogs.com/png.latex?L%5E2">-geodesic distance between the points <img src="https://latex.codecogs.com/png.latex?%5Csqrt%7B%5Crho_1%7D%20%5Cin%20%5Cmathcal%7BS%7D"> and <img src="https://latex.codecogs.com/png.latex?%5Csqrt%7B%5Crho_2%7D%20%5Cin%20%5Cmathcal%7BS%7D">. In other words, the geodesic distance, i.e.&nbsp;the Fisher-Rao distance, is given by</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ad_%7BFR%7D(%5Crho_1,%20%5Crho_2)%20=%20%5Carccos%20%5Cleft(%20%5Clangle%20%5Csqrt%7B%5Crho_1%7D,%20%5Csqrt%7B%5Crho_2%7D%20%5Crangle_%7BL%5E2%7D%20%5Cright).%0A"></p>
<p>The geodesic path is a great circle, <img src="https://latex.codecogs.com/png.latex?t%20%5Cmapsto%20%5Crho_t">, where <img src="https://latex.codecogs.com/png.latex?%5Crho_t%20%5Cpropto%20%5Cleft((1-t)%20%5Csqrt%7B%5Crho_1%7D%20+%20t%20%5Csqrt%7B%5Crho_2%7D%20%5Cright)%5E2"> for <img src="https://latex.codecogs.com/png.latex?t%20%5Cin%20%5B0,1%5D">. This shows, for example, that the Fisher-Rao geodesic between two Gaussian densities is composed of densities that are Gaussian mixtures; i.e., the geodesic is not composed of Gaussian densities in general. In other words, probability mass is not transported along the geodesic but reshaped, unlike the Wasserstein metric, which describes transport of probability mass. Note in passing that the <a href="https://en.wikipedia.org/wiki/Hellinger_distance">Hellinger distance</a>, defined as</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ad_H(%5Crho_1,%20%5Crho_2)%5E2%20=%20%5Cint%20%5Cleft(%20%5Csqrt%7B%5Crho_1(x)%7D%20-%20%5Csqrt%7B%5Crho_2(x)%7D%20%5Cright)%5E2%20%5C,%20dx,%0A"></p>
<p>is just a monotone reparametrization of the Fisher-Rao distance: the two are related by <img src="https://latex.codecogs.com/png.latex?d_H%20=%20%5Csqrt%7B2(1-%5Ccos(d_%7BFR%7D))%7D">. In this sense, the Hellinger distance is equivalent to the Fisher-Rao distance, and both describe a “correct” way to measure distances between probability densities for many applications.</p>
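<p>On a finite state space, both distances are one-liners, and the relation between them holds exactly (illustrative sketch with arbitrary distributions):</p>

```python
import math

def fisher_rao(p, q):
    # d_FR = arccos( <sqrt(p), sqrt(q)> )
    inner = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    return math.acos(min(1.0, inner))

def hellinger(p, q):
    return math.sqrt(sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q)))

p = [0.1, 0.2, 0.3, 0.4]
q = [0.4, 0.3, 0.2, 0.1]
d_fr = fisher_rao(p, q)
d_h = hellinger(p, q)
print(d_fr, d_h, math.sqrt(2.0 * (1.0 - math.cos(d_fr))))  # last two agree
```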
<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/fisher-rao/FR_geodesic.gif" class="img-fluid figure-img" style="width:80.0%"></p>
<figcaption>Here is how a geodesic path looks under the Fisher-Rao metric</figcaption>
</figure>
</div>
</div>
</section>
<section id="gradient-flow" class="level3">
<h3 class="anchored" data-anchor-id="gradient-flow">Gradient flow</h3>
<p>What do gradient flows look like in this Fisher-Rao geometry? For example, for a given distribution <img src="https://latex.codecogs.com/png.latex?%5Cpi">, the gradient flow of <img src="https://latex.codecogs.com/png.latex?%5Crho%20%5Cmapsto%20%5Cmathrm%7BKL%7D(%5Crho,%20%5Cpi)"> under the Wasserstein metric is given by the <a href="https://en.wikipedia.org/wiki/Continuity_equation">transport equation</a>:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cpartial_t%20%5Crho%20=%20-%5Cnabla%20%5Ccdot%20%5Cleft(%20%5Crho%20%5C,%20%5Cnabla%20%5Clog%20%5Cfrac%7B%5Cpi%7D%7B%5Crho%7D%20%5Cright),%0A"></p>
<p>which describes the evolution of the law of the Langevin diffusion <img src="https://latex.codecogs.com/png.latex?dX%20=%20%5Cnabla%20%5Clog%20%5Cpi(X)%20%5C,%20dt%20+%20%5Csqrt%7B2%7D%20%5C,%20dB">, as informally described in these <a href="../../notes/wasserstein_langevin/wasserstein_langevin.html">notes</a>. So what does the gradient flow of <img src="https://latex.codecogs.com/png.latex?%5Crho%20%5Cmapsto%20%5Cmathrm%7BKL%7D(%5Crho,%20%5Cpi)"> look like in the Fisher-Rao geometry?</p>
<p>To answer this, one can consider the square-root mapping <img src="https://latex.codecogs.com/png.latex?%5Crho%20%5Cmapsto%20%5Csqrt%7B%5Crho%7D%20%5Cequiv%20%5CPhi(%5Crho)%20%5Cin%20%5Cmathcal%7BS%7D">, express everything in terms of <img src="https://latex.codecogs.com/png.latex?%5CPhi(%5Crho)">, compute the <img src="https://latex.codecogs.com/png.latex?L%5E2">-gradient on the unit sphere <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BS%7D"> (which is straightforward), and finally map back to the density <img src="https://latex.codecogs.com/png.latex?%5Crho"> using the inverse mapping <img src="https://latex.codecogs.com/png.latex?%5CPhi%5E%7B-1%7D">. One readily finds that the gradient flow is described by:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cpartial_t%20%5Crho%20=%20%5Crho%20%5C,%20%5Cleft(%20%5Clog%20%5Cfrac%7B%5Cpi%7D%7B%5Crho%7D%20-%20%5Cmathbb%7BE%7D_%5Crho%20%5Cleft%5B%20%5Clog%20%5Cfrac%7B%5Cpi%7D%7B%5Crho%7D%20%5Cright%5D%20%5Cright).%0A"></p>
<p>This is quite intuitive: the flow tries to increase <img src="https://latex.codecogs.com/png.latex?%5Crho"> in regions where <img src="https://latex.codecogs.com/png.latex?%5Crho%20%5Cll%20%5Cpi"> and decrease <img src="https://latex.codecogs.com/png.latex?%5Crho"> in regions where <img src="https://latex.codecogs.com/png.latex?%5Crho%20%5Cgg%20%5Cpi">. Discretizing this flow can naturally be done using sampling-based methods. If <img src="https://latex.codecogs.com/png.latex?%5Csum_%7Bi=1%7D%5EN%20w_i%20%5C,%20%5Cdelta(x_i)"> is a system of <img src="https://latex.codecogs.com/png.latex?N"> weighted particles approximating <img src="https://latex.codecogs.com/png.latex?%5Crho">, following the Fisher-Rao gradient flow corresponds to updating the weights as</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Aw_i%20%5Cmapsto%20%5Cfrac%7B%20w_i%20%5C,%20(%5Cpi(x_i)%20/%20%5Crho(x_i))%5E%7B%5Cvarepsilon%7D%7D%7B%5Csum_%7Bj=1%7D%5EN%20w_j%20%5C,%20(%5Cpi(x_j)%20/%20%5Crho(x_j))%5E%7B%5Cvarepsilon%7D%7D%0A"></p>
<p>for a small <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon%3E%200"> time-step. Indeed, this is because any evolution of the form <img src="https://latex.codecogs.com/png.latex?%5Cpartial_t%20%5Crho(x)%20=%20%5Crho(x)%20%5C,%20v(x)">, where <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D_%7B%5Crho%7D%5Bv%5D=0">, can be discretized by updating the weights as <img src="https://latex.codecogs.com/png.latex?w_i%20%5Cmapsto%20w_i%20%5C,%20%5Cexp%5B%5Cvarepsilon%5C,%20v(x_i)%5D%20/%20Z">. This is very much related to the resampling step in sequential Monte Carlo methods, and the recent article <span class="citation" data-cites="crucinio2025sequential">(Crucinio and Pathiraja 2025)</span> makes these connections explicit.</p>
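<p>Here is a minimal sketch of this weight update in the simplest possible setting (a finite state space, so that the atoms can be identified with the states and the weight vector with the density itself; the target, the step size and the number of iterations are arbitrary choices). The KL divergence to the target decreases along the iterations:</p>

```python
import math

def kl(w, p):
    return sum(wi * math.log(wi / pi) for wi, pi in zip(w, p))

def fisher_rao_step(w, target, eps):
    # w_i <- w_i * (pi_i / w_i)^eps, renormalized: a geometric step towards target
    new = [wi * (pi / wi) ** eps for wi, pi in zip(w, target)]
    z = sum(new)
    return [wi / z for wi in new]

target = [0.5, 0.25, 0.125, 0.125]
w = [0.25, 0.25, 0.25, 0.25]    # initial (uniform) weights
history = [kl(w, target)]
for _ in range(50):
    w = fisher_rao_step(w, target, eps=0.2)
    history.append(kl(w, target))
print(history[0], history[-1])  # KL decreases towards 0
```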
<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/fisher-rao/FR_gradient_flow.gif" class="img-fluid figure-img" style="width:80.0%"></p>
<figcaption>Minimising <img src="https://latex.codecogs.com/png.latex?D_%7B%5Ctext%7BKL%7D%7D(%5Crho,%20%5Cpi)"> with Fisher-Rao gradient flow</figcaption>
</figure>
</div>
</div>



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0">
<div id="ref-crucinio2025sequential" class="csl-entry">
Crucinio, Francesca R, and Sahani Pathiraja. 2025. <span>“Sequential Monte Carlo Approximations of Wasserstein–Fisher–Rao Gradient Flows.”</span> <em>arXiv Preprint arXiv:2506.05905</em>.
</div>
</div></section></div> ]]></description>
  <category>probability</category>
  <category>information-geometry</category>
  <guid>https://alexxthiery.github.io/notes/fisher-rao/distance.html</guid>
  <pubDate>Thu, 12 Jun 2025 17:00:00 GMT</pubDate>
</item>
<item>
  <title>Adjoint method for sensitivities</title>
  <link>https://alexxthiery.github.io/notes/adjoint_method/adjoint.html</link>
  <description><![CDATA[ 





<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/adjoint_method/pontryagin.jpg" class="img-fluid figure-img" style="width:60.0%"></p>
<figcaption><a href="https://en.wikipedia.org/wiki/Lev_Pontryagin">Lev Pontryagin</a> (1908 - 1988)</figcaption>
</figure>
</div>
</div>
<section id="table-of-contents" class="level2">
<h2 class="anchored" data-anchor-id="table-of-contents">Table of Contents</h2>
<ul>
<li>Linear Systems</li>
<li>Adjoint Method</li>
<li>PDE Inverse Problems</li>
<li>Controlled Diffusions</li>
</ul>
<p>The adjoint method is, at its core, the same idea as <a href="https://en.wikipedia.org/wiki/Backpropagation">backpropagation</a> or <a href="https://en.wikipedia.org/wiki/Automatic_differentiation">reverse-mode</a> automatic differentiation. In practice, though, it’s often helpful to understand how it works under the hood. Basic implementations of backprop can be sub-optimal or impractical (e.g. memory-intensive), especially in settings like <a href="https://en.wikipedia.org/wiki/PDE-constrained_optimization">PDE-constrained optimization</a> or stochastic optimal control.</p>
<section id="linear-systems" class="level3">
<h3 class="anchored" data-anchor-id="linear-systems">Linear Systems</h3>
<p>For a parameter <img src="https://latex.codecogs.com/png.latex?%5Ctheta%20%5Cin%20%5Cmathbb%7BR%7D%5E%7Bd_%5Ctheta%7D">, consider <img src="https://latex.codecogs.com/png.latex?x%20=%20x(%5Ctheta)"> the solution of the linear system</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AA%20x%20=%20b,%0A"></p>
<p>where both the matrix <img src="https://latex.codecogs.com/png.latex?A%20=%20A(%5Ctheta)%20%5Cin%20%5Cmathbb%7BR%7D%5E%7Bd_x%20%5Ctimes%20d_x%7D"> and the vector <img src="https://latex.codecogs.com/png.latex?b%20=%20b(%5Ctheta)%20%5Cin%20%5Cmathbb%7BR%7D%5E%7Bd_x%7D"> depend on <img src="https://latex.codecogs.com/png.latex?%5Ctheta">. This setup is typical when discretizing PDEs: <img src="https://latex.codecogs.com/png.latex?A"> arises from the differential operator, and <img src="https://latex.codecogs.com/png.latex?b"> from the source term. We are interested in a loss function of the type</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7BL%7D(%5Ctheta)%20=%20F(%5Ctheta,%20x(%5Ctheta)),%0A"></p>
<p>with <img src="https://latex.codecogs.com/png.latex?F:%20%5Cmathbb%7BR%7D%5E%7Bd_%5Ctheta%7D%20%5Ctimes%20%5Cmathbb%7BR%7D%5E%7Bd_x%7D%20%5Cto%20%5Cmathbb%7BR%7D">. We aim to compute the derivative of <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7D"> with respect to <img src="https://latex.codecogs.com/png.latex?%5Ctheta">. The chain rule gives</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AD_%5Ctheta%20%5Cmathcal%7BL%7D=%20F_%5Ctheta%20-%20F_x%20A%5E%7B-1%7D%20%5Cleft(%20A_%5Ctheta%20%5C,%20x%20-%20b_%5Ctheta%20%5Cright)%20%5C;%20%5Cin%20%5Cmathbb%7BR%7D%5E%7B1,%20d_%5Ctheta%7D.%0A"></p>
<p>The notation <img src="https://latex.codecogs.com/png.latex?F_%5Ctheta%20=%20%5Cnabla_%5Ctheta%20F%5E%5Ctop%20%5Cin%20%5Cmathbb%7BR%7D%5E%7B1,%20d_%5Ctheta%7D"> denotes the Jacobian of <img src="https://latex.codecogs.com/png.latex?F"> with respect to <img src="https://latex.codecogs.com/png.latex?%5Ctheta">, and similar notations are used for <img src="https://latex.codecogs.com/png.latex?F_x">, <img src="https://latex.codecogs.com/png.latex?A_%5Ctheta">, and <img src="https://latex.codecogs.com/png.latex?b_%5Ctheta">. As usual, the Jacobian of a scalar function can be thought of as the transpose of its gradient. When <img src="https://latex.codecogs.com/png.latex?d_x%20%5Cgg%201"> and <img src="https://latex.codecogs.com/png.latex?d_%5Ctheta%20%5Cgg%201">, directly computing <img src="https://latex.codecogs.com/png.latex?A%5E%7B-1%7D"> is not feasible. Naively evaluating <img src="https://latex.codecogs.com/png.latex?A%5E%7B-1%7D%20(A_%5Ctheta%20x%20-%20b_%5Ctheta)"> would require <img src="https://latex.codecogs.com/png.latex?d_%5Ctheta"> linear solves, each with complexity cubic in <img src="https://latex.codecogs.com/png.latex?d_x">. A better approach is to first compute <img src="https://latex.codecogs.com/png.latex?%5Clambda%5E%5Ctop%20=%20F_x%20A%5E%7B-1%7D"> by solving the so-called adjoint system</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AA%5E%5Ctop%20%5Clambda%20=%20F_x%5E%5Ctop.%0A"></p>
<p>This requires only one linear solve. Once <img src="https://latex.codecogs.com/png.latex?%5Clambda%20%5Cin%20%5Cmathbb%7BR%7D%5E%7Bd_x%7D"> is computed, the Jacobian simplifies to</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AD_%5Ctheta%20%5Cmathcal%7BL%7D=%20F_%5Ctheta%20-%20%5Clambda%5E%5Ctop%20(A_%5Ctheta%20x%20-%20b_%5Ctheta).%0A"></p>
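<p>A minimal numerical sketch of this adjoint trick, using a hypothetical affine parameterization <code>A(θ) = A0 + Σ_k θ_k A_k</code>, <code>b(θ) = b0 + B θ</code>, and the linear loss <code>F(θ, x) = cᵀx</code> (so that <code>F_θ = 0</code>); the adjoint gradient is checked against central finite differences. All matrices here are placeholder random data.</p>

```python
import numpy as np

rng = np.random.default_rng(1)
dx, dth = 30, 4

# Hypothetical affine parameterization: A(theta) = A0 + sum_k theta_k A_k, b(theta) = b0 + B theta
A0 = np.eye(dx) * 3.0 + 0.1 * rng.standard_normal((dx, dx))
Ak = 0.05 * rng.standard_normal((dth, dx, dx))
b0, B = rng.standard_normal(dx), rng.standard_normal((dx, dth))
c = rng.standard_normal(dx)                       # loss F(theta, x) = c^T x, so F_theta = 0

def loss(theta):
    A = A0 + np.einsum('k,kij->ij', theta, Ak)
    x = np.linalg.solve(A, b0 + B @ theta)
    return c @ x

def grad_adjoint(theta):
    A = A0 + np.einsum('k,kij->ij', theta, Ak)
    x = np.linalg.solve(A, b0 + B @ theta)
    lam = np.linalg.solve(A.T, c)                 # adjoint system: A^T lambda = F_x^T
    # D_theta L = F_theta - lambda^T (A_theta x - b_theta), with F_theta = 0 here
    Ath_x = np.einsum('kij,j->ki', Ak, x)         # row k: (dA/dtheta_k) x
    return -(Ath_x - B.T) @ lam

theta = rng.standard_normal(dth)
g = grad_adjoint(theta)

eps = 1e-6
g_fd = np.array([(loss(theta + eps * e) - loss(theta - eps * e)) / (2 * eps)
                 for e in np.eye(dth)])
```

One linear solve for <code>x</code> and one transposed solve for <code>λ</code> replace the <code>d_θ</code> solves of the naive approach.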
</section>
<section id="adjoint-method" class="level3">
<h3 class="anchored" data-anchor-id="adjoint-method">Adjoint Method</h3>
<p>Now consider a more general situation where <img src="https://latex.codecogs.com/png.latex?x%20%5Cin%20%5Cmathbb%7BR%7D%5E%7Bd_x%7D"> and <img src="https://latex.codecogs.com/png.latex?%5Ctheta%20%5Cin%20%5Cmathbb%7BR%7D%5E%7Bd_%5Ctheta%7D"> are related by an implicit equation</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5CPhi(x,%20%5Ctheta)%20=%200%0A"></p>
<p>for some function <img src="https://latex.codecogs.com/png.latex?%5CPhi:%20%5Cmathbb%7BR%7D%5E%7Bd_x%7D%20%5Ctimes%20%5Cmathbb%7BR%7D%5E%7Bd_%5Ctheta%7D%20%5Cto%20%5Cmathbb%7BR%7D%5E%7Bd_x%7D"> that satisfies the usual conditions for the implicit function <a href="https://en.wikipedia.org/wiki/Implicit_function_theorem">theorem</a> to hold. Differentiating gives <img src="https://latex.codecogs.com/png.latex?x_%5Ctheta%20=%20-%5CPhi_x%5E%7B-1%7D%20%5CPhi_%5Ctheta">. As before, we want the sensitivity with respect to <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> of <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7D(%5Ctheta)%20=%20F(x(%5Ctheta),%20%5Ctheta)">. It equals <img src="https://latex.codecogs.com/png.latex?D_%5Ctheta%20%5Cmathcal%7BL%7D=%20F_%5Ctheta%20-%20F_x%20%5CPhi_x%5E%7B-1%7D%20%5CPhi_%5Ctheta"> and can also be expressed as <img src="https://latex.codecogs.com/png.latex?D_%5Ctheta%20%5Cmathcal%7BL%7D=%20F_%5Ctheta%20-%20%5Clambda%5E%5Ctop%20%5CPhi_%5Ctheta"> where <img src="https://latex.codecogs.com/png.latex?%5Clambda%20%5Cin%20%5Cmathbb%7BR%7D%5E%7Bd_x%7D"> is the solution of the adjoint system</p>
<p><span id="eq-adjoint"><img src="https://latex.codecogs.com/png.latex?%0A%5CPhi_x%5E%5Ctop%20%5Clambda%20=%20F_x%5E%5Ctop.%0A%5Ctag%7B1%7D"></span></p>
<p>Another way to present this computation is to note that, for <strong>any</strong> vector <img src="https://latex.codecogs.com/png.latex?%5Clambda%20%5Cin%20%5Cmathbb%7BR%7D%5E%7Bd_x%7D">, we have</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7BL%7D(%5Ctheta)%20=%20F(x(%5Ctheta),%20%5Ctheta)%20-%20%5Clambda%5E%5Ctop%20%5CPhi(x(%5Ctheta),%20%5Ctheta)%0A"></p>
<p>since <img src="https://latex.codecogs.com/png.latex?%5CPhi(x(%5Ctheta),%20%5Ctheta)%20%5Cequiv%200">. As will soon become clear, introducing the “adjoint” variable <img src="https://latex.codecogs.com/png.latex?%5Clambda"> allows one to eliminate cumbersome terms when computing <img src="https://latex.codecogs.com/png.latex?D_%5Ctheta%20%5Cmathcal%7BL%7D">. Differentiation with respect to <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> gives</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AD_%5Ctheta%20%5Cmathcal%7BL%7D=%20F_%5Ctheta%20-%20%5Clambda%5E%5Ctop%20%5CPhi_%5Ctheta%20+%20%20%7B%5Cleft(%20%20F_x%20-%20%5Clambda%5E%5Ctop%20%5CPhi_x%20%5Cright)%7D%20%20x_%5Ctheta.%0A"></p>
<p>The term <img src="https://latex.codecogs.com/png.latex?x_%5Ctheta%20=%20-%5CPhi_x%5E%7B-1%7D%20%5CPhi_%5Ctheta%20%5Cin%20%5Cmathbb%7BR%7D%5E%7Bd_x,%20d_%5Ctheta%7D"> is cumbersome (e.g. intractable in high-dimensional settings), and we would like to eliminate it. To this end, it suffices to choose <img src="https://latex.codecogs.com/png.latex?%5Clambda"> so that the term <img src="https://latex.codecogs.com/png.latex?F_x%20-%20%5Clambda%5E%5Ctop%20%5CPhi_x"> vanishes; this is exactly the adjoint system, Equation&nbsp;1.</p>
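<p>The same recipe works unchanged for nonlinear implicit equations. Here is a sketch with the hypothetical choices <code>Φ(x, θ) = M x + tanh(x) − θ</code> and <code>F(x, θ) = ‖x‖²/2</code>, the forward equation solved by Newton iteration and the gradient checked against finite differences.</p>

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8
M = np.eye(d) * 2.0 + 0.1 * rng.standard_normal((d, d))   # placeholder matrix

def solve_x(theta, iters=50):
    # Newton iteration for Phi(x, theta) = M x + tanh(x) - theta = 0
    x = np.zeros(d)
    for _ in range(iters):
        Phi   = M @ x + np.tanh(x) - theta
        Phi_x = M + np.diag(1.0 - np.tanh(x)**2)
        x = x - np.linalg.solve(Phi_x, Phi)
    return x

def loss(theta):
    x = solve_x(theta)
    return 0.5 * x @ x                 # F(x, theta) = ||x||^2 / 2

def grad_adjoint(theta):
    x = solve_x(theta)
    Phi_x = M + np.diag(1.0 - np.tanh(x)**2)
    lam = np.linalg.solve(Phi_x.T, x)  # adjoint system Phi_x^T lambda = F_x^T
    # D_theta L = F_theta - lambda^T Phi_theta; here F_theta = 0 and Phi_theta = -I
    return lam

theta = rng.standard_normal(d)
g = grad_adjoint(theta)
eps = 1e-6
g_fd = np.array([(loss(theta + eps * e) - loss(theta - eps * e)) / (2 * eps)
                 for e in np.eye(d)])
```

Note that the adjoint solve only needs the Jacobian <code>Φ_x</code> at the solution, regardless of how the implicit equation was solved.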
</section>
<section id="pde-inverse-problems" class="level3">
<h3 class="anchored" data-anchor-id="pde-inverse-problems">PDE Inverse Problems</h3>
<p>Let us see how this works in the context of PDE-constrained optimization. Let <img src="https://latex.codecogs.com/png.latex?%5COmega%20%5Csubset%20%5Cmathbb%7BR%7D%5Ed"> be a domain and let <img src="https://latex.codecogs.com/png.latex?%5Ckappa:%20%5COmega%20%5Cto%20%5Cmathbb%7BR%7D"> be a scalar field. Consider the PDE</p>
<p><span id="eq-elliptic"><img src="https://latex.codecogs.com/png.latex?%0A%5Cnabla%20%5Ccdot%20%20%7B%5Cleft(%20%20e%5E%7B%5Ckappa(x)%7D%20%5Cnabla%20u%20%20%5Cright)%7D%20%20=%20f,%0A%5Ctag%7B2%7D"></span></p>
<p>where <img src="https://latex.codecogs.com/png.latex?f"> is a given source term. The field <img src="https://latex.codecogs.com/png.latex?%5Ckappa"> describes the diffusion, or permeability, properties of the medium. We are interested in the solution <img src="https://latex.codecogs.com/png.latex?u"> of the PDE on a bounded domain <img src="https://latex.codecogs.com/png.latex?%5COmega%20%5Csubset%20%5Cmathbb%7BR%7D%5Ed"> with Dirichlet boundary conditions <img src="https://latex.codecogs.com/png.latex?u(x)%20=%200"> for <img src="https://latex.codecogs.com/png.latex?x%20%5Cin%20%5Cpartial%20%5COmega">. For each field <img src="https://latex.codecogs.com/png.latex?%5Ckappa">, the elliptic PDE determines a unique solution <img src="https://latex.codecogs.com/png.latex?u">. We are interested in minimizing the quantity</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7BL%7D(%5Ckappa)%20=%20%5Cint_%5COmega%20F(u(x))%20%5C,%20dx,%0A"></p>
<p>for some given function <img src="https://latex.codecogs.com/png.latex?F:%20%5Cmathbb%7BR%7D%20%5Cto%20%5Cmathbb%7BR%7D">. A common case is</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7BL%7D(%5Ckappa)%20=%20%5Cfrac%7B1%7D%7B2%7D%20%5C,%20%5Cint_%5COmega%20%5Cleft%7C%20u(x)%20-%20u%5E%5Cstar(x)%20%5Cright%7C%5E2%20%5C,%20w(x)%20%5C,%20dx,%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?u%5E%5Cstar"> is a target solution and <img src="https://latex.codecogs.com/png.latex?w(x)%3E0"> is a weight. The goal is to adjust the field <img src="https://latex.codecogs.com/png.latex?%5Ckappa"> so that <img src="https://latex.codecogs.com/png.latex?u"> matches <img src="https://latex.codecogs.com/png.latex?u%5E%5Cstar"> as closely as possible. To carry out the minimization of <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7D"> with respect to <img src="https://latex.codecogs.com/png.latex?%5Ckappa">, one needs the derivative of <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7D"> with respect to <img src="https://latex.codecogs.com/png.latex?%5Ckappa">. To that end, define the augmented functional</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7BL%7D%20=%20%5Cint_%5COmega%20F(u(x))%20%5C,%20dx%20-%20%5Cint_%5COmega%20%5Clambda%20%5Cleft(%20%5Cnabla%20%5Ccdot%20(e%5E%7B%5Ckappa%7D%20%5Cnabla%20u)%20-%20f%20%5Cright)%20%5C,%20dx,%0A"></p>
<p>for an auxiliary field <img src="https://latex.codecogs.com/png.latex?%5Clambda%20:%20%5COmega%20%5Cto%20%5Cmathbb%7BR%7D"> that will be chosen later. As before, a good choice of <img src="https://latex.codecogs.com/png.latex?%5Clambda"> can simplify the computations. Let <img src="https://latex.codecogs.com/png.latex?%5Cdelta%20%5Ckappa"> be a perturbation of <img src="https://latex.codecogs.com/png.latex?%5Ckappa">. This induces a perturbation <img src="https://latex.codecogs.com/png.latex?u%20+%20%5Cdelta%20u"> in the solution and the first order variation of <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7D"> reads:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cdelta%20%5Cmathcal%7BL%7D%20=%20%5Cint_%5COmega%20F'(u)%20%5C,%20%5Cdelta%20u%20%5C,%20dx%20-%20%5Cint_%5COmega%20%5Clambda%20%5C,%20%5Cnabla%20%5Ccdot%20(e%5E%7B%5Ckappa%7D%20%5Cnabla%20%5Cdelta%20u)%20%5C,%20dx%20-%20%5Cint_%5COmega%20%5Clambda%20%5C,%20%5Cnabla%20%5Ccdot%20(e%5E%7B%5Ckappa%7D%20%5Cdelta%20%5Ckappa%20%5Cnabla%20u)%20%5C,%20dx.%0A"></p>
<p>The term involving <img src="https://latex.codecogs.com/png.latex?%5Cdelta%20u"> is inconvenient. Assuming <img src="https://latex.codecogs.com/png.latex?%5Clambda"> also satisfies Dirichlet boundary conditions, which we can indeed assume since we are free to define <img src="https://latex.codecogs.com/png.latex?%5Clambda"> in any manner we want, we integrate by parts:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cdelta%20%5Cmathcal%7BL%7D%20=%20%5Cint_%5COmega%20%5Cleft(%20F'(u)%20-%20%5Cnabla%20%5Ccdot%20(e%5E%7B%5Ckappa%7D%20%5Cnabla%20%5Clambda)%20%5Cright)%20%5Cdelta%20u%20%5C,%20dx%20-%20%5Cint_%5COmega%20%5Clambda%20%5C,%20%5Cnabla%20%5Ccdot%20(e%5E%7B%5Ckappa%7D%20%5Cdelta%20%5Ckappa%20%5Cnabla%20u)%20%5C,%20dx.%0A"></p>
<p>To eliminate the <img src="https://latex.codecogs.com/png.latex?%5Cdelta%20u"> term, choose <img src="https://latex.codecogs.com/png.latex?%5Clambda"> to satisfy the adjoint equation</p>
<p><span id="eq-adjoint-elliptic"><img src="https://latex.codecogs.com/png.latex?%0A%5Cnabla%20%5Ccdot%20(e%5E%7B%5Ckappa%7D%20%5Cnabla%20%5Clambda)%20=%20F'(u),%0A%5Ctag%7B3%7D"></span></p>
<p>with Dirichlet boundary conditions. Then, an integration by parts gives</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign*%7D%0A%5Cdelta%20%5Cmathcal%7BL%7D%0A&amp;=%20-%5Cint_%5COmega%20%5Clambda%20%5C,%20%5Cnabla%20%5Ccdot%20(e%5E%7B%5Ckappa%7D%20%5Cdelta%20%5Ckappa%20%5Cnabla%20u)%20%5C,%20dx%5C%5C%0A&amp;=%20%5Cint_%5COmega%20e%5E%7B%5Ckappa%7D%20%5C,%20%5Cleft%3C%20%20%5Cnabla%20u,%20%5Cnabla%20%5Clambda%20%20%5Cright%3E%20%5C,%20%5Cdelta%20%5Ckappa%20%5C,%20dx%0A=%20%5Cleft%3C%20g,%20%5Cdelta%20%5Ckappa%20%5Cright%3E_%7BL%5E2(%5COmega)%7D.%0A%5Cend%7Balign*%7D%0A"></p>
<p>This means that the <img src="https://latex.codecogs.com/png.latex?L%5E2"> gradient of <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7D"> with respect to <img src="https://latex.codecogs.com/png.latex?%5Ckappa"> is given by:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ag%20=%20e%5E%7B%5Ckappa%7D%20%5C,%20%5Cleft%3C%20%20%5Cnabla%20u,%20%5Cnabla%20%5Clambda%20%20%5Cright%3E,%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5Clambda:%20%5COmega%20%5Cto%20%5Cmathbb%7BR%7D"> solves the adjoint system Equation&nbsp;3. This shows that the gradient of the objective can be computed at the same computational cost as the solution of the original PDE Equation&nbsp;2. This expression can be used directly in gradient-based optimization schemes.</p>
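<p>A one-dimensional sketch of this computation, with the PDE discretized by finite differences on <code>(0, 1)</code>; the grid size, source <code>f</code>, target <code>u_star</code>, and test field <code>κ</code> are all placeholder choices. The adjoint solve reuses the same (symmetric) operator as the forward solve, and the resulting gradient is verified against finite differences in <code>κ</code>.</p>

```python
import numpy as np

n = 60                                   # grid x_j = j/n with Dirichlet u_0 = u_n = 0
h = 1.0 / n
xs  = np.linspace(0.0, 1.0, n + 1)
mid = 0.5 * (xs[1:] + xs[:-1])           # kappa lives at the n cell midpoints

# D maps interior node values (n-1,) to midpoint gradients (n,)
D = (np.diff(np.eye(n + 1), axis=0) / h)[:, 1:-1]

f      = np.ones(n - 1)                  # source term at interior nodes (placeholder)
u_star = np.sin(np.pi * xs[1:-1])        # target solution (placeholder)

def solve(kappa):
    a = np.exp(kappa)
    K = h * D.T @ (a[:, None] * D)       # stiffness matrix for -div(e^kappa grad .)
    u = np.linalg.solve(K, -h * f)       # discretized div(e^kappa grad u) = f
    return u, K

def loss(kappa):
    u, _ = solve(kappa)
    return 0.5 * h * np.sum((u - u_star)**2)

def grad(kappa):
    u, K = solve(kappa)
    lam = np.linalg.solve(K, h * (u - u_star))   # adjoint PDE: same operator, source F'(u)
    # discrete analogue of g = e^kappa <grad u, grad lambda>
    return -h * np.exp(kappa) * (D @ u) * (D @ lam)

kappa = 0.1 * np.sin(2 * np.pi * mid)
g = grad(kappa)
eps = 1e-6
g_fd = np.array([(loss(kappa + eps * e) - loss(kappa - eps * e)) / (2 * eps)
                 for e in np.eye(n)])
```

One forward solve and one adjoint solve give the full gradient with respect to all <code>n</code> values of <code>κ</code>.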
</section>
<section id="controlled-diffusions" class="level3">
<h3 class="anchored" data-anchor-id="controlled-diffusions">Controlled Diffusions</h3>
<p>Consider the <a href="https://en.wikipedia.org/wiki/Ordinary_differential_equation">ODE</a> on <img src="https://latex.codecogs.com/png.latex?%5B0,%20T%5D">:</p>
<p><span id="eq-ode-forward"><img src="https://latex.codecogs.com/png.latex?%0A%5Cdot%7Bx%7D%20=%20b(t,%20%5Ctheta,%20x),%0A%5Ctag%7B4%7D"></span></p>
<p>with initial condition <img src="https://latex.codecogs.com/png.latex?x(0)%20=%20%5Cmu(%5Ctheta)">, where <img src="https://latex.codecogs.com/png.latex?x%20%5Cin%20%5Cmathbb%7BR%7D%5E%7Bd_x%7D"> and <img src="https://latex.codecogs.com/png.latex?%5Ctheta%20%5Cin%20%5Cmathbb%7BR%7D%5E%7Bd_%5Ctheta%7D">. The drift term <img src="https://latex.codecogs.com/png.latex?b(t,%20%5Ctheta,%20x)%20%5Cin%20%5Cmathbb%7BR%7D%5E%7Bd_x%7D"> is parameterized by <img src="https://latex.codecogs.com/png.latex?%5Ctheta">. We want the sensitivity of the functional</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7BL%7D(%5Ctheta)%20=%20%5Cint_0%5ET%20f(t,%20%5Ctheta,%20x(t))%20%5C,%20dt%20+%20g(%5Ctheta,%20x(T)),%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?f"> and <img src="https://latex.codecogs.com/png.latex?g"> are given functions. As before, it is often helpful to introduce an auxiliary function <img src="https://latex.codecogs.com/png.latex?%5Clambda(t)%20%5Cin%20%5Cmathbb%7BR%7D%5E%7Bd_x%7D"> and write:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7BL%7D(%5Ctheta)%20=%20%5Cmathcal%7BL%7D(%5Ctheta)%20-%20%5Cint_0%5ET%20%5Clambda%5E%5Ctop%20%5C,%20%5Cunderbrace%7B%20%20%7B%5Cleft(%20%20%5Cdot%7Bx%7D%20-%20b%20%20%5Cright)%7D%20%20%7D_%7B%5Cequiv%200%7D%20%5C,%20dt.%0A"></p>
<p>Differentiating with respect to <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> and integrating by parts gives:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Baligned%7D%0AD_%5Ctheta%20%5Cmathcal%7BL%7D%0A&amp;=%20%5Cint_0%5ET%20%5Cleft(%20f_%5Ctheta%20+%20%5Clambda%5E%5Ctop%20b_%5Ctheta%20%5Cright)%20%5C,%20dt%20+%20g_%5Ctheta(%5Ctheta,%20x(T))%20+%20%5Clambda%5E%5Ctop(0)%20%5Cmu_%5Ctheta%20%5C%5C%0A&amp;%5Cquad%20+%20%5Cleft(%20g_x%20-%20%5Clambda%5E%5Ctop(T)%20%5Cright)%20%5C,%20x_%5Ctheta(T)%0A+%20%5Cint_0%5ET%20%5Cleft(%20f_x%20+%20%5Cdot%7B%5Clambda%7D%5E%5Ctop%20+%20%5Clambda%5E%5Ctop%20b_x%20%5Cright)%20x_%5Ctheta(t)%20%5C,%20dt.%0A%5Cend%7Baligned%7D%0A"></p>
<p>To eliminate the dependence on <img src="https://latex.codecogs.com/png.latex?x_%5Ctheta(t)">, choose <img src="https://latex.codecogs.com/png.latex?%5Clambda(t)"> to satisfy the adjoint system:</p>
<p><span id="eq-adjoint-ode"><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Bcases%7D%0A%5Cdot%7B%5Clambda%7D(t)%20=%20-%5Cnabla_x%20f%20-%20b_x%5E%5Ctop%20%5Clambda(t),%20%5C%5C%0A%5Clambda(T)%20=%20%5Cnabla_x%20g.%0A%5Cend%7Bcases%7D%0A%5Ctag%7B5%7D"></span></p>
<p>This is a linear ODE with a terminal condition <img src="https://latex.codecogs.com/png.latex?%5Clambda(T)%20=%20%5Cnabla_x%20g"> that needs to be solved backward in time. This means that for computing the derivative of <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7D">, one can first solve the forward ODE Equation&nbsp;4 to obtain <img src="https://latex.codecogs.com/png.latex?x(t)">, and then solve the adjoint system Equation&nbsp;5 backward in time to obtain <img src="https://latex.codecogs.com/png.latex?%5Clambda(t)">. The Jacobian (i.e.&nbsp;transpose of the gradient) of <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7D"> is then given by</p>
<p><span id="eq-gradient-ode"><img src="https://latex.codecogs.com/png.latex?%0AD_%5Ctheta%20%5Cmathcal%7BL%7D%0A=%20%5Cint_0%5ET%20%5Cleft(%20f_%5Ctheta%20+%20%5Clambda%5E%5Ctop%20b_%5Ctheta%20%5Cright)%20dt%20+%20g_%5Ctheta(%5Ctheta,%20x(T))%20+%20%5Clambda%5E%5Ctop(0)%20%5Cmu_%5Ctheta.%0A%5Ctag%7B6%7D"></span></p>
<p>The term <img src="https://latex.codecogs.com/png.latex?%5Clambda%5E%5Ctop%20b_%5Ctheta"> is a vector-Jacobian product, and can be computed efficiently. This formulation is often referred to as the “continuous adjoint method”, “adjoint sensitivity analysis”, or “optimize-then-discretize”, and dates back to the work of <span class="citation" data-cites="pontryagin2018mathematical">(Pontryagin 1962)</span>. A naive implementation of <a href="https://en.wikipedia.org/wiki/Backpropagation">backpropagation</a> can be inefficient memory-wise since quantities such as <img src="https://latex.codecogs.com/png.latex?f_%5Ctheta"> would typically be stored during the forward pass. When <img src="https://latex.codecogs.com/png.latex?d_%5Ctheta%20%5Cgg%201">, as is for example the case when the drift is parameterized by a neural network, this can be impractical. Instead, it may be more efficient to store only the forward trajectory <img src="https://latex.codecogs.com/png.latex?x(t)"> and recompute all the other quantities during the backward pass; there is a slight computational cost but potentially very large memory savings. In machine-learning settings, this often makes it possible to use much larger batch sizes. Similarly, if implicit methods are used to solve the ODE instead of a simple <a href="https://en.wikipedia.org/wiki/Euler–Maruyama_method">Euler-Maruyama</a> scheme, backpropagation through the implicit solver can be tricky.</p>
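<p>A fixed-step sketch of the forward/backward procedure, for the hypothetical linear drift <code>b(t, θ, x) = A(θ) x</code> with <code>f ≡ 0</code> and <code>g(x) = ‖x‖²/2</code>; both the forward ODE and the adjoint ODE Equation&nbsp;5 are integrated with RK4, the gradient integral in Equation&nbsp;6 is evaluated with the trapezoid rule, and the result is compared with finite differences.</p>

```python
import numpy as np

def A(theta):
    # hypothetical drift parameterization: b(t, theta, x) = A(theta) x
    return np.array([[-theta[0], 1.0], [-1.0, -theta[1]]])

T, nsteps = 1.0, 2000
dt = T / nsteps
x0 = np.array([1.0, 0.0])

def rk4(f, y, h):
    k1 = f(y); k2 = f(y + 0.5 * h * k1); k3 = f(y + 0.5 * h * k2); k4 = f(y + h * k3)
    return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

def forward(theta):
    Ath = A(theta)
    xs = [x0]
    for _ in range(nsteps):
        xs.append(rk4(lambda x: Ath @ x, xs[-1], dt))
    return np.array(xs)

def loss(theta):
    xs = forward(theta)
    return 0.5 * xs[-1] @ xs[-1]            # g(x(T)) = ||x(T)||^2 / 2, f = 0

def grad_adjoint(theta):
    Ath = A(theta)
    xs = forward(theta)                     # store the forward trajectory
    lam = xs[-1].copy()                     # terminal condition lambda(T) = grad_x g
    integrand = []                          # lambda^T b_theta along the trajectory
    for k in range(nsteps, -1, -1):
        x = xs[k]
        # b_theta columns: db/dtheta_0 = (-x[0], 0), db/dtheta_1 = (0, -x[1])
        integrand.append([-lam[0] * x[0], -lam[1] * x[1]])
        if k > 0:                           # backward step: lambda' = -b_x^T lambda
            lam = rk4(lambda l: -(Ath.T @ l), lam, -dt)
    vals = np.array(integrand[::-1])
    return dt * (0.5 * vals[0] + vals[1:-1].sum(axis=0) + 0.5 * vals[-1])

theta = np.array([0.5, 0.3])
g = grad_adjoint(theta)
eps = 1e-6
g_fd = np.array([(loss(theta + eps * e) - loss(theta - eps * e)) / (2 * eps)
                 for e in np.eye(2)])
```

Only the trajectory <code>x(t)</code> is stored here; everything needed by the backward pass is recomputed from it.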
<p>Nothing really changes when considering a <a href="https://en.wikipedia.org/wiki/Stochastic_differential_equation">SDE</a> with additive noise instead,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Adx%20=%20b(t,%20%5Ctheta,%20x)%20%5C,%20dt%20+%20%5Csigma(t)%20%5C,%20dW_t.%0A"></p>
<p>Informally, one can apply the same reasoning as previously to the controlled ODE: <img src="https://latex.codecogs.com/png.latex?%0A%5Cdot%7Bx%7D%20%5C,%20=%20%5C,%20b(t,%20%5Ctheta,%20x)%20+%20%5Csigma(t)%20%5C,%20dW_t/dt.%0A"></p>
<p>Again, it suffices to solve the SDE forward in time to obtain <img src="https://latex.codecogs.com/png.latex?x(t)"> and then solve the exact same adjoint ODE Equation&nbsp;5 backward in time to obtain <img src="https://latex.codecogs.com/png.latex?%5Clambda(t)">: it is still an ordinary differential equation. The derivative of <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7D"> with respect to <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> is then given by the same expression Equation&nbsp;6. For SDEs with multiplicative noise, the adjoint system is slightly more complicated, but this hardly changes the overall picture. Finally, note that in the case where the two functions <img src="https://latex.codecogs.com/png.latex?f"> and <img src="https://latex.codecogs.com/png.latex?g"> do not depend on <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> and one chooses <img src="https://latex.codecogs.com/png.latex?%5Ctheta%20=%20x_0"> and the initial condition <img src="https://latex.codecogs.com/png.latex?%5Cmu_%7Bx_0%7D%20=%20x_0">, Equation&nbsp;6 shows that <img src="https://latex.codecogs.com/png.latex?D_%7Bx_0%7D%20%5Cmathcal%7BL%7D=%20%20%7B%5Cleft(%20%5Cnabla_%7Bx_0%7D%20%5Cmathcal%7BL%7D%20%5Cright)%7D%20%5E%5Ctop%20=%20%5Clambda(0)%5E%5Ctop">. More generally, this shows that:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Clambda(t)%20=%20%5Cnabla_%7Bx(t)%7D%20%5C,%20%20%7B%5Cleft%5C%7B%20%20%5Cint_t%5ET%20f(s,%20%5Ctheta,%20x(s))%20%5C,%20ds%20+%20g(%5Ctheta,%20x(T))%20%5Cright%5C%7D%7D%20.%0A"></p>



</section>
</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0">
<div id="ref-pontryagin2018mathematical" class="csl-entry">
Pontryagin, Lev Semenovich. 1962. <em>Mathematical Theory of Optimal Processes</em>.
</div>
</div></section></div> ]]></description>
  <category>ODE</category>
  <category>PDE</category>
  <category>Adjoint</category>
  <guid>https://alexxthiery.github.io/notes/adjoint_method/adjoint.html</guid>
  <pubDate>Fri, 09 May 2025 17:00:00 GMT</pubDate>
</item>
<item>
  <title>Sparse GP</title>
  <link>https://alexxthiery.github.io/notes/sparse_GP/sparse_gp.html</link>
  <description><![CDATA[ 





<p><em>These notes are mainly for my own reference; I’m pretty clueless about GPs at the moment, and that needs to change. Read at your own risk; typos and mistakes are likely.</em></p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><video src="SparseGP.mp4" class="img-fluid quarto-figure quarto-figure-center" style="width:100.0%" controls=""><a href="SparseGP.mp4">Video</a></video></p>
</figure>
</div>
<p>Let <img src="https://latex.codecogs.com/png.latex?f(%5Ccdot)%20%5Csim%20%5Cmathrm%7BGP%7D(m,%20K)"> denote a Gaussian Process (GP) prior with zero mean <img src="https://latex.codecogs.com/png.latex?m=0"> and covariance kernel <img src="https://latex.codecogs.com/png.latex?K">. Assume we observe <img src="https://latex.codecogs.com/png.latex?n%20%5Cgg%201"> noisy measurements <img src="https://latex.codecogs.com/png.latex?y%20=%20(y_i)_%7Bi=1%7D%5En"> of <img src="https://latex.codecogs.com/png.latex?f_i%20=%20f(x_i)"> at input locations <img src="https://latex.codecogs.com/png.latex?x%20=%20(x_i)_%7Bi=1%7D%5En">. The goal is to compute the posterior of <img src="https://latex.codecogs.com/png.latex?f%20=%20(f_i)_%7Bi=1%7D%5En"> and to infer GP hyperparameters. The main challenge with GP models is the cubic complexity of the matrix inversion required by many of the posterior computations.</p>
<p>Sparse GPs are a class of approaches that aim to reduce this complexity by approximating the full GP posterior with a smaller set of so-called inducing variables <img src="https://latex.codecogs.com/png.latex?u=(u_1,%20%5Cldots,%20u_m)"> that entirely describe an approximate posterior distribution. Consider <img src="https://latex.codecogs.com/png.latex?m%20%5Cll%20n"> locations <img src="https://latex.codecogs.com/png.latex?z=(z_i)_%7Bi=1%7D%5Em"> called inducing points and set <img src="https://latex.codecogs.com/png.latex?u_i%20=%20f(z_i)"> for the latent function values at the inducing points. The Gaussian random variables <img src="https://latex.codecogs.com/png.latex?u_i"> can be used as inducing random variables; each choice of inducing points <img src="https://latex.codecogs.com/png.latex?z"> defines a different set of inducing variables <img src="https://latex.codecogs.com/png.latex?u">. In this setting, optimizing the inducing variables simply means optimizing the locations of the inducing points <img src="https://latex.codecogs.com/png.latex?z">. The strategy is to approximate the posterior of <img src="https://latex.codecogs.com/png.latex?(u,f)"> with a tractable distribution <img src="https://latex.codecogs.com/png.latex?q(u,f)">,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ap(u,f%20%5Cmid%20y)%20%5C,%20=%20%5C,%20%5Cfrac%7Bp(u)%20%5C,%20p(f%20%5Cmid%20u)%20%5C,%20p(y%20%5Cmid%20f)%7D%7Bp(y)%7D%20%5C;%20%5Capprox%20%5C;%20q(u,f).%0A"></p>
<p>Later, we will see that setting <img src="https://latex.codecogs.com/png.latex?u_i%20=%20f(z_i)"> is indeed not the only choice of inducing variables, but let’s keep it to this for now. We have <img src="https://latex.codecogs.com/png.latex?p(u)%20=%20N(0,K_u)"> and <img src="https://latex.codecogs.com/png.latex?p(f%20%5Cmid%20u)%20=%20N(%5Cmu_%7Bf%7Cu%7D,%20K_%7Bf%7Cu%7D)"> where</p>
<p><span id="eq-conditionals"><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Balign*%7D%0A%5Cmu_%7Bf%7Cu%7D%20&amp;=%20K_%7Bfu%7D%20K_u%5E%7B-1%7D%20u%5C%5C%0AK_%7Bf%7Cu%7D%20&amp;=%20K_f%20-%20K_%7Bfu%7D%20K_u%5E%7B-1%7D%20K_%7Buf%7D.%0A%5Cend%7Balign*%7D%0A%5Cright.%0A%5Ctag%7B1%7D"></span></p>
<p>where <img src="https://latex.codecogs.com/png.latex?K_u"> is the covariance matrix of the inducing variables <img src="https://latex.codecogs.com/png.latex?u"> and <img src="https://latex.codecogs.com/png.latex?K_%7Bfu%7D"> is the covariance matrix between <img src="https://latex.codecogs.com/png.latex?f"> and <img src="https://latex.codecogs.com/png.latex?u">. Evaluating <img src="https://latex.codecogs.com/png.latex?p(f%20%5Cmid%20u)"> involves inverting <img src="https://latex.codecogs.com/png.latex?K_%7Bf%7Cu%7D%20%5Cin%20%5Cmathbb%7BR%7D%5E%7Bn,n%7D">, which typically scales as <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BO%7D(n%5E3)">, hence intractable for large <img src="https://latex.codecogs.com/png.latex?n">. To approximate <img src="https://latex.codecogs.com/png.latex?p(u,f%20%5Cmid%20y)"> with another distribution <img src="https://latex.codecogs.com/png.latex?q(u,f)">, one can minimize the Kullback-Leibler divergence <img src="https://latex.codecogs.com/png.latex?D_%7B%5Ctext%7BKL%7D%7D%5Bq(u,f)%20%5C%7C%20p(u,f%20%5Cmid%20y)%5D">, i.e.</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cint%20q(u)%20%5C,%20q(f%7Cu)%20%5C,%20%5Clog%20%20%7B%5Cleft%5C%7B%20%20%5Cfrac%7Bq(u)%20%5C,%20q(f%7Cu)%7D%7Bp(u)%20%5C,%20%20%5Ctextcolor%7Bred%7D%7Bp(f%20%5Cmid%20u)%7D%20%5C,%20p(y%20%5Cmid%20f)%7D%20%20%5Cright%5C%7D%7D%20%20%5C,%20du%20%5C,%20df%20%5C,%20+%20%5C,%20%5Clog%20p(y).%0A"></p>
<p>This is not immediately helpful, since the intractable term <img src="https://latex.codecogs.com/png.latex?%20%5Ctextcolor%7Bred%7D%7Bp(f%20%5Cmid%20u)%7D"> is still present. However, <span class="citation" data-cites="titsias2009variational">(Titsias 2009)</span> proposes to set <img src="https://latex.codecogs.com/png.latex?q(f%7Cu)%20=%20p(f%7Cu)">, i.e.&nbsp;to consider an approximate posterior of the form:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Aq(u,f)%20=%20q(u)%20%5C,%20p(f%20%5Cmid%20u).%0A"></p>
<p>Note that the correct posterior distribution is typically not of this form, although when the number of inducing points is large enough, this approximation becomes increasingly accurate. With this class of approximate posterior, the expectation <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D%5B%5CPhi(f)%20%5Cmid%20y%5D"> of some functional <img src="https://latex.codecogs.com/png.latex?%5CPhi"> is approximated as <img src="https://latex.codecogs.com/png.latex?%5Cint%20%5Cmathbb%7BE%7D%5B%5CPhi(f)%20%5Cmid%20u%5D%20%5C,%20q(u)%20%5C,%20du">. For example, if <img src="https://latex.codecogs.com/png.latex?q(u)%20=%20%5Cmathcal%7BN%7D(%5Cmu_q,%20K_q)"> is a Gaussian variational distribution, the posterior distribution of <img src="https://latex.codecogs.com/png.latex?f_%5Cstar%20=%20f(x_%5Cstar)"> at a new location <img src="https://latex.codecogs.com/png.latex?x_%5Cstar"> is approximated as <img src="https://latex.codecogs.com/png.latex?K_%7B%5Cstar,u%7D%20%5C,%20K_%7Bu%7D%5E%7B-1%7D%20%5Cmathcal%7BN%7D(%5Cmu_q,%20K_q)%20+%20K_%7B%5Cstar%7Cu%7D">; it is a Gaussian with</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Balign*%7D%0A%5Ctextrm%7Bmean%7D%20&amp;=%20K_%7B%5Cstar,u%7D%20%5C,%20K_%7Bu%7D%5E%7B-1%7D%20%5Cmu_q%5C%5C%0A%5Ctextrm%7Bcov%7D%20&amp;=%20K_%7B%5Cstar,u%7D%20%5C,%20K_%7Bu%7D%5E%7B-1%7D%20%5C,%20K_q%20%5C,%20K_%7Bu%7D%5E%7B-1%7D%20K_%7Bu,%5Cstar%7D%20+%20K_%7B%5Cstar%7Cu%7D%0A%5Cend%7Balign*%7D%0A%5Cright.%0A"></p>
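<p>These predictive formulas are straightforward to implement. Here is a minimal sketch with a squared-exponential kernel; the inducing points, lengthscale, and variational parameters <code>(mu_q, K_q)</code> are all placeholder values, not optimized.</p>

```python
import numpy as np

def k(a, b, ell=0.5):
    # squared-exponential kernel with lengthscale ell (placeholder choice)
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

z  = np.linspace(-2, 2, 7)               # inducing points (placeholder)
Ku = k(z, z) + 1e-10 * np.eye(len(z))    # K_u with a small jitter

# placeholder variational distribution q(u) = N(mu_q, K_q)
rng  = np.random.default_rng(3)
mu_q = rng.standard_normal(len(z))
Kq   = 0.1 * np.eye(len(z))

xstar = np.linspace(-3, 3, 50)           # prediction locations
Ksu   = k(xstar, z)                          # K_{*,u}
A     = np.linalg.solve(Ku, Ksu.T).T         # K_{*,u} K_u^{-1}
Kcond = k(xstar, xstar) - A @ Ksu.T          # K_{*|u}

mean = A @ mu_q                              # K_{*,u} K_u^{-1} mu_q
cov  = A @ Kq @ A.T + Kcond                  # K_{*,u} K_u^{-1} K_q K_u^{-1} K_{u,*} + K_{*|u}
```

Only the <code>m × m</code> matrix <code>K_u</code> is ever inverted, so predictions cost <code>O(m³)</code> rather than <code>O(n³)</code>.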
<p>Optimizing the inducing variables is equivalent to minimizing the free energy quantity</p>
<p><span id="eq-variational"><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7BF%7D%5C;%20%5Cequiv%20%5C;%20%5Cint%20q(u)%20%5C,%20p(f%7Cu)%20%5C,%20%5Clog%20%20%7B%5Cleft%5C%7B%20%20%5Cfrac%7Bq(u)%7D%7Bp(u)%20%5C,%20p(y%20%5Cmid%20f)%7D%20%20%5Cright%5C%7D%7D%20%20%5C,%20du%20%5C,%20df,%0A%5Ctag%7B2%7D"></span></p>
<p>over the variational distribution <img src="https://latex.codecogs.com/png.latex?q(u)"> and the choice of inducing variables. For a fixed set of inducing variables (e.g. a fixed set of inducing points), it is clear that the optimal variational distribution is given by</p>
<p><span id="eq-optimal-variational"><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign*%7D%0Aq_%7B%5Cstar%7D(u)%0A&amp;=%20p(u)%20%5Cexp%20%7B%5Cleft%5C%7B%20%20%5Cint%20p(f%7Cu)%20%5C,%20%5Clog%20%20%7B%5Cleft(%20p(y%20%5Cmid%20f)%20%5Cright)%7D%20%20%5C,%20df%20%20%5Cright%5C%7D%7D%20%20/%20%5Cmathcal%7BZ%7D%5C%5C%0A&amp;=%20p(u)%20%5Cexp%20%7B%5Cleft%5C%7B%20%20%5Cmathbb%7BE%7D%5B%5Clog%20p(y%20%5Cmid%20f)%20%5Cmid%20u%5D%20%20%5Cright%5C%7D%7D%20%20/%20%5Cmathcal%7BZ%7D%0A%5Cend%7Balign*%7D%0A%5Ctag%7B3%7D"></span></p>
<p>for some normalization constant <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BZ%7D=%20%5Cint%20p(u)%20%5Cexp%20%7B%5Cleft%5C%7B%20%20%5Cmathbb%7BE%7D%5B%5Clog%20p(y%20%5Cmid%20f)%20%5Cmid%20u%5D%20%20%5Cright%5C%7D%7D%20%20%5C,%20du">; this can be seen by expressing Equation&nbsp;2 as a KL divergence, as is done, for example, when deriving the Coordinate Ascent Variational Inference (CAVI) method,</p>
<p><span id="eq-free-energy"><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7BF%7D=%20D_%7B%5Ctext%7BKL%7D%7D%5Bq(u)%20%5Cmid%20q_%7B%5Cstar%7D(u)%5D%20%5C,%20-%20%5C,%20%5Clog%20%5Cmathcal%7BZ%7D.%0A%5Ctag%7B4%7D"></span></p>
<p>Equation&nbsp;3 shows that <img src="https://latex.codecogs.com/png.latex?q_%7B%5Cstar%7D(u)"> is the prior <img src="https://latex.codecogs.com/png.latex?p(u)"> weighted by a term that is large when the observations <img src="https://latex.codecogs.com/png.latex?y"> are likely given <img src="https://latex.codecogs.com/png.latex?u">, i.e.&nbsp;when <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D%5B%5Clog%20p(y%20%5Cmid%20f)%20%5Cmid%20u%5D"> is large.</p>
<section id="nystrom-approximation" class="level3">
<h3 class="anchored" data-anchor-id="nystrom-approximation">Nyström approximation</h3>
<p>Before turning to the simplest and most important case of additive Gaussian noise, let us briefly recall the Nyström approximation. The distribution of <img src="https://latex.codecogs.com/png.latex?f%20%5Cmid%20u"> is Gaussian with mean <img src="https://latex.codecogs.com/png.latex?%5Cmu_%7Bf%7Cu%7D%20=%20K_%7Bfu%7D%20K_u%5E%7B-1%7D%20u"> and covariance <img src="https://latex.codecogs.com/png.latex?K_%7Bf%7Cu%7D">. This means that <img src="https://latex.codecogs.com/png.latex?K_%7Bfu%7D%20K_u%5E%7B-1%7D%20u%20+%20%5Cmathcal%7BN%7D(0,%20K_%7Bf%7Cu%7D)"> has the same distribution as the unconditional <img src="https://latex.codecogs.com/png.latex?f%20%5Csim%20%5Cmathcal%7BN%7D(0,%20K_f)">. In particular:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign*%7D%0AK_f%0A&amp;=%20K_%7Bfu%7D%20K_u%5E%7B-1%7D%20K_u%20K_u%5E%7B-1%7D%20K_%7Buf%7D%20+%20K_%7Bf%7Cu%7D%20%5C%5C%0A&amp;=%20%20%5Ctextcolor%7Bred%7D%7BK_%7Bfu%7D%20K_u%5E%7B-1%7D%20K_%7Buf%7D%7D%20+%20K_%7Bf%7Cu%7D%20%5C%5C%0A&amp;=%20%20%5Ctextcolor%7Bblue%7D%7B%5Cwidehat%7BK%7D%5E%7Bu%7D_%7Bf%7D%7D%20+%20K_%7Bf%7Cu%7D%0A%5Cend%7Balign*%7D%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%20%5Ctextcolor%7Bblue%7D%7B%5Cwidehat%7BK%7D%5E%7Bu%7D_%7Bf%7D%7D%20%5Cequiv%20K_%7Bfu%7D%20K_u%5E%7B-1%7D%20K_%7Buf%7D"> is the so-called <a href="https://en.wikipedia.org/wiki/Low-rank_matrix_approximations">Nyström approximation</a> of the covariance matrix <img src="https://latex.codecogs.com/png.latex?K_f"> based on the inducing variable <img src="https://latex.codecogs.com/png.latex?u">. This shows that the Nyström approximation <img src="https://latex.codecogs.com/png.latex?%5Cwidehat%7BK%7D%5E%7Bu%7D_%7Bf%7D"> simply amounts to ignoring the conditional variance term <img src="https://latex.codecogs.com/png.latex?K_%7Bf%7Cu%7D">, and is thus an underestimate of the covariance matrix <img src="https://latex.codecogs.com/png.latex?K_f">. Furthermore, if <img src="https://latex.codecogs.com/png.latex?u"> is very informative about <img src="https://latex.codecogs.com/png.latex?f">, then <img src="https://latex.codecogs.com/png.latex?K_%7Bf%7Cu%7D"> is small and the Nyström approximation is accurate.</p>
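<p>The decomposition above is easy to check numerically. Here is a small numpy sketch (kernel, data locations, and inducing locations are toy choices of mine) verifying that the conditional covariance is positive semi-definite, i.e. that the Nyström approximation underestimates the covariance in the PSD order.</p>

```python
import numpy as np

def rbf(a, b, ell=0.7):
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

x = np.linspace(0.0, 3.0, 8)       # locations of f (toy)
z = np.array([0.5, 1.5, 2.5])      # inducing locations (toy)

K_f  = rbf(x, x)
K_fu = rbf(x, z)
K_u  = rbf(z, z) + 1e-9 * np.eye(len(z))

K_hat = K_fu @ np.linalg.solve(K_u, K_fu.T)   # Nystrom approximation of K_f
K_cond = K_f - K_hat                           # conditional covariance K_{f|u}

# K_cond is a Schur complement, hence (numerically) symmetric PSD
eigvals = np.linalg.eigvalsh((K_cond + K_cond.T) / 2)
```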
</section>
<section id="observation-with-additive-gaussian-noise" class="level3">
<h3 class="anchored" data-anchor-id="observation-with-additive-gaussian-noise">Observation with additive Gaussian noise</h3>
<p>The case of additive Gaussian noise is particularly simple. Assume that</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ay_i%20=%20f_i%20+%20%5Cvarepsilon_i%0A%5Cqquad%20%5Ctext%7Bwith%7D%20%5Cqquad%0A%5Cvarepsilon_i%20%5Csim%20%5Cmathcal%7BN%7D(0,%20%5Csigma%5E2)%0A"></p>
<p>where the noise terms <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon_i"> are independent. Since <img src="https://latex.codecogs.com/png.latex?f%7Cu%20%5Csim%20%5Cmathcal%7BN%7D(%5Cmu_%7Bf%7Cu%7D,%20K_%7Bf%7Cu%7D)">, algebra gives that</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D%5B%5Clog%20p(y%20%5Cmid%20f)%20%5Cmid%20u%5D%20=%0A%5Clog%5B%20%5Cmathcal%7BN%7D(y;%20%5Cmu_%7Bf%7Cu%7D,%20%5Csigma%5E2%20%5C,%20I)%20%5D%20-%20%5Cfrac%7B1%7D%7B2%20%5Csigma%5E2%7D%20%5C,%20%5Cmathop%7B%5Cmathrm%7BTr%7D%7D(%20K_%7Bf%7Cu%7D%20)"></p>
<p>Using that <img src="https://latex.codecogs.com/png.latex?%5Cmu_%7Bf%7Cu%7D%20=%20K_%7Bfu%7D%20K_u%5E%7B-1%7D%20u"> and the <a href="https://en.wikipedia.org/wiki/Woodbury_matrix_identity">matrix inversion lemma</a>, it quickly follows that the optimal variational distribution is <img src="https://latex.codecogs.com/png.latex?q_%7B%5Cstar%7D(u)%20=%20%5Cmathcal%7BN%7D(%5Cmu_%7B%5Cstar%7D,%20K_%7B%5Cstar%7D)"> with</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Baligned%7D%0A%5Cmu_%7B%5Cstar%7D%20&amp;=%20K_%7Buf%7D%20%5C,%20%20%7B%5Cleft(%20%5Cwidehat%7BK%7D%5E%7Bu%7D_%7Bf%7D%20+%20%5Csigma%5E2%20I%20%5Cright)%7D%20%5E%7B-1%7D%20%5C,%20y%5C%5C%0AK_%7B%5Cstar%7D%20&amp;=%20K_u%20-%20K_%7Buf%7D%20%5C,%20%20%7B%5Cleft(%20%5Cwidehat%7BK%7D%5E%7Bu%7D_%7Bf%7D%20+%20%5Csigma%5E2%20I%20%5Cright)%7D%20%5E%7B-1%7D%20%5C,%20K_%7Bfu%7D.%0A%5Cend%7Baligned%7D%0A%5Cright.%0A"></p>
<p>These formulas are approximations of the exact conditional moments,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Baligned%7D%0A%5Cmu_%7Bu%7Cy%7D%20&amp;=%20K_%7Buf%7D%20%5C,%20%20%7B%5Cleft(%20K_%7Bf%7D%20+%20%5Csigma%5E2%20I%20%5Cright)%7D%20%5E%7B-1%7D%20%5C,%20y%5C%5C%0AK_%7Bu%7Cy%7D%20&amp;=%20K_u%20-%20K_%7Buf%7D%20%5C,%20%20%7B%5Cleft(%20K_%7Bf%7D%20+%20%5Csigma%5E2%20I%20%5Cright)%7D%20%5E%7B-1%7D%20%5C,%20K_%7Bfu%7D.%0A%5Cend%7Baligned%7D%0A%5Cright.%0A"></p>
<p>where the Nyström approximation <img src="https://latex.codecogs.com/png.latex?%5Cwidehat%7BK%7D%5E%7Bu%7D_%7Bf%7D%20%5Capprox%20K_f"> is used instead. One then finds that <img src="https://latex.codecogs.com/png.latex?%5Clog%20%5Cmathcal%7BZ%7D=%20%5Clog%20%5Cmathcal%7BN%7D(y;%200,%20%5Cwidehat%7BK%7D%5Eu_f%20+%20%5Csigma%5E2%20I)%20-%20%5Cfrac%7B1%7D%7B2%5Csigma%5E2%7D%20%5Cmathop%7B%5Cmathrm%7BTr%7D%7D(K_%7Bf%7Cu%7D)">. With the optimal variational distribution <img src="https://latex.codecogs.com/png.latex?q_%5Cstar(u)">, Equation&nbsp;4 gives that the free energy is:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7BF%7D%0A=%20-%5Clog%20%5Cmathcal%7BN%7D%20%7B%5Cleft(%20y;%200,%20%5C;%20%5Cwidehat%7BK%7D%5E%7Bu%7D_%7Bf%7D%20+%20%5Csigma%5E2%20I%20%5Cright)%7D%20%20+%20%5Cfrac%7B1%7D%7B2%5Csigma%5E2%7D%20%5Cmathop%7B%5Cmathrm%7BTr%7D%7DK_%7Bf%7Cu%7D%20%5C%5C%0A"></p>
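<p>For concreteness, the moments of the optimal Gaussian variational distribution derived above can be computed directly with numpy; the kernel and data below are toy choices of mine, not from the note.</p>

```python
import numpy as np

def rbf(a, b, ell=0.5):
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

def optimal_q(x, z, y, sigma2, jitter=1e-9):
    """Moments of the optimal Gaussian q_*(u) under additive Gaussian noise."""
    K_u = rbf(z, z) + jitter * np.eye(len(z))
    K_fu = rbf(x, z)
    K_hat = K_fu @ np.linalg.solve(K_u, K_fu.T)       # Nystrom approx of K_f
    S = np.linalg.inv(K_hat + sigma2 * np.eye(len(x)))
    mu_star = K_fu.T @ S @ y
    K_star = K_u - K_fu.T @ S @ K_fu
    return mu_star, K_star
```

<p>When the inducing points coincide with the data locations, the Nyström approximation is exact and these moments reduce to the exact conditional moments.</p>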
<p>Furthermore, note that the exact likelihood of the observations is <img src="https://latex.codecogs.com/png.latex?p(y)%20=%20%5Cmathcal%7BN%7D%20%7B%5Cleft(%20y;%200,%20%5C;%20K_f%20+%20%5Csigma%5E2%20I%20%5Cright)%7D%20"> so that the free energy can be expressed as</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7BF%7D%0A%5C;%20=%20%5C;%0A-%5Clog%20%5Cwidehat%7Bp%7D%5Eu(y)%20+%20%5Cfrac%7B1%7D%7B2%5Csigma%5E2%7D%20%5Cmathop%7B%5Cmathrm%7BTr%7D%7DK_%7Bf%7Cu%7D%20%5C%5C%0A"></p>
<p>for pseudo-likelihood <img src="https://latex.codecogs.com/png.latex?%5Cwidehat%7Bp%7D%5Eu(y)%20=%20%5Cmathcal%7BN%7D%20%7B%5Cleft(%20y;%200,%20%5C;%20%5Cwidehat%7BK%7D%5E%7Bu%7D_%7Bf%7D%20+%20%5Csigma%5E2%20I%20%5Cright)%7D%20">. This shows that <img src="https://latex.codecogs.com/png.latex?D_%7B%5Ctext%7BKL%7D%7D"> is given by: <img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign*%7D%0AD_%7B%5Ctext%7BKL%7D%7D&amp;%5Bq(u,f)%20%5Cmid%20p(u,f%20%5Cmid%20y)%5D%0A=%0A%5Cmathcal%7BF%7D+%20%5Clog%20p(y)%20%5C%5C%0A&amp;=%0A%5Clog%20%5Cfrac%7Bp(y)%7D%7B%5Cwidehat%7Bp%7D%5Eu(y)%7D%0A+%0A%5Cfrac%7B1%7D%7B2%5Csigma%5E2%7D%20%5Cmathop%7B%5Cmathrm%7BTr%7D%7DK_%7Bf%7Cu%7D.%0A%5Cend%7Balign*%7D%0A"></p>
<p>The term <img src="https://latex.codecogs.com/png.latex?%5Cmathop%7B%5Cmathrm%7BTr%7D%7DK_%7Bf%7Cu%7D"> is just the sum of the conditional variances <img src="https://latex.codecogs.com/png.latex?%5Cmathop%7B%5Cmathrm%7BVar%7D%7D%20%7B%5Cleft(%20f_i%20%7C%20u%20%5Cright)%7D%20"> and can be thought of as a regularization term,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AR%20=%20%5Cfrac%7B1%7D%7B2%5Csigma%5E2%7D%20%5Cmathop%7B%5Cmathrm%7BTr%7D%7DK_%7Bf%7Cu%7D%20=%0A%5Cfrac12%20%5C,%20%5Csum_%7Bi=1%7D%5En%20%5Cfrac%7B%5Cmathop%7B%5Cmathrm%7BVar%7D%7D%20%7B%5Cleft(%20f_i%20%7C%20u%20%5Cright)%7D%20%7D%7B%5Csigma%5E2%7D.%0A"></p>
<p>As the number of inducing variables <img src="https://latex.codecogs.com/png.latex?m"> increases, the pseudo-likelihood becomes more accurate <img src="https://latex.codecogs.com/png.latex?%5Cwidehat%7Bp%7D%5Eu(y)%20%5Cto%20p(y)">, the conditional variances <img src="https://latex.codecogs.com/png.latex?%5Cmathop%7B%5Cmathrm%7BVar%7D%7D%20%7B%5Cleft(%20f_i%20%7C%20u%20%5Cright)%7D%20%20%5Cto%200"> shrink to zero, and the KL divergence <img src="https://latex.codecogs.com/png.latex?D_%7B%5Ctext%7BKL%7D%7D%5Bq(u,f)%20%5Cmid%20p(u,f%20%5Cmid%20y)%5D"> approaches zero.</p>
<p>The animation at the start of this note illustrates the effect of optimizing the location of the inducing points <img src="https://latex.codecogs.com/png.latex?z"> with a very simple gradient descent. A few experiments show that it is worth being careful with the initial choice of inducing points. Inducing points chosen very far from the data essentially remain fixed during the optimization (i.e., the gradient is very small). Initializing with <a href="https://en.wikipedia.org/wiki/K-means%2B%2B">k-means++</a> clustering of the data points seems to be a robust strategy and gives an almost optimal choice of inducing points.</p>
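<p>For reference, the k-means++ seeding step (just the seeding, not the full k-means iteration) is only a few lines of numpy; this is a minimal sketch with names of my own choosing.</p>

```python
import numpy as np

def kmeanspp_init(X, m, rng):
    """k-means++ seeding: pick m rows of X, each new center chosen with
    probability proportional to its squared distance to the chosen ones."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(m - 1):
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```

<p>Because already-chosen points have zero selection probability, the m seeds are distinct, and they tend to spread out over the data, which is exactly what one wants from initial inducing points.</p>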



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0">
<div id="ref-titsias2009variational" class="csl-entry">
Titsias, Michalis. 2009. <span>“Variational Learning of Inducing Variables in Sparse Gaussian Processes.”</span> In <em>Artificial Intelligence and Statistics</em>, 567–74. PMLR.
</div>
</div></section></div> ]]></description>
  <category>GP</category>
  <guid>https://alexxthiery.github.io/notes/sparse_GP/sparse_gp.html</guid>
  <pubDate>Thu, 17 Apr 2025 17:00:00 GMT</pubDate>
</item>
<item>
  <title>Self Avoiding Walks</title>
  <link>https://alexxthiery.github.io/notes/SAW/SAW.html</link>
  <description><![CDATA[ 





<!-- \begin{figure}[h]
\centering
\includegraphics[width=0.3\textwidth]{polymer-selfavoiding.png}
\caption{A 2D self-avoiding walk}
\end{figure} -->
<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/SAW/polymer-selfavoiding.png" class="img-fluid figure-img" style="width:35.0%"></p>
<figcaption><a href="https://en.wikipedia.org/wiki/Self-avoiding_walk">A 2D self-avoiding walk</a></figcaption>
</figure>
</div>
</div>
<p><em>These notes present comments on the “Self-avoiding walks” assignment given to the “ST3247: Simulations” class. Most of the drafts that have been submitted so far describe variations of importance sampling. The purpose of these notes is to suggest directions for slightly more advanced Monte Carlo methods that can be used to estimate the connective constant <img src="https://latex.codecogs.com/png.latex?%5Cmu"> of self-avoiding walks. These are only pointers and suggestions.</em></p>
<section id="the-problems-and-notations" class="level3">
<h3 class="anchored" data-anchor-id="the-problems-and-notations">The problems and notations</h3>
<p>Recall that we are trying to estimate the <a href="https://en.wikipedia.org/wiki/Connective_constant">connective constant</a> <img src="https://latex.codecogs.com/png.latex?%5Cmu"> of self-avoiding walks (SAW) in the 2D lattice <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BZ%7D%5E2">. If <img src="https://latex.codecogs.com/png.latex?c_L"> denotes the number of SAWs of length <img src="https://latex.codecogs.com/png.latex?L">, we have the following asymptotic behavior:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ac_L%20%5C;%20%5Csim%20%5C;%20A%20%5C,%20%5Cmu%5EL%20%5C,%20L%5E%7B%5Cgamma%7D%0A"></p>
<p>for some unknown constants <img src="https://latex.codecogs.com/png.latex?A">, <img src="https://latex.codecogs.com/png.latex?%5Cmu">, and <img src="https://latex.codecogs.com/png.latex?%5Cgamma">. The main objective of the assignment is to estimate <img src="https://latex.codecogs.com/png.latex?%5Cmu">, which can also be expressed as the limit of <img src="https://latex.codecogs.com/png.latex?c_L%5E%7B1/L%7D"> as <img src="https://latex.codecogs.com/png.latex?L%20%5Cto%20%5Cinfty">. As of today, the <a href="https://en.wikipedia.org/wiki/Connective_constant">best known estimate</a> is <img src="https://latex.codecogs.com/png.latex?%5Cmu%20%5Capprox%202.638158533032790(3)">, which required several tens of thousands of hours of CPU time to compute. A good estimate of <img src="https://latex.codecogs.com/png.latex?%5Cmu"> therefore requires approximating the number of SAWs of length <img src="https://latex.codecogs.com/png.latex?L"> starting at the origin for large values of <img src="https://latex.codecogs.com/png.latex?L">.</p>
<p>Consider a sequence <img src="https://latex.codecogs.com/png.latex?z_%7B0:L%7D%20=%20(z_0,%20z_1,%20%5Cdots,%20z_L)"> of <img src="https://latex.codecogs.com/png.latex?L+1"> distinct vertices in <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BZ%7D%5E2"> with <img src="https://latex.codecogs.com/png.latex?z_0%20=%20(0,0)"> and <img src="https://latex.codecogs.com/png.latex?%5C%7Cz_%7Bk+1%7D%20-%20z_k%5C%7C=1"> for all <img src="https://latex.codecogs.com/png.latex?0%20%5Cleq%20k%20%5Cleq%20L-1">, i.e., a walk of length <img src="https://latex.codecogs.com/png.latex?L">. For notational convenience, let us introduce the function <img src="https://latex.codecogs.com/png.latex?%5Cvarphi%5E%7B%5Ctextrm%7Bwalk%7D%7D(z_%7B:L%7D)"> that returns one if <img src="https://latex.codecogs.com/png.latex?z_%7B0:L%7D"> is a correct walk of length <img src="https://latex.codecogs.com/png.latex?L">, and zero otherwise. In particular, this function returns zero if two consecutive vertices are the same, or if the walk does not start at zero. Similarly, introduce the function <img src="https://latex.codecogs.com/png.latex?%5Cvarphi%5E%7B%5Ctextrm%7BSAW%7D%7D(z_%7B:L%7D)"> that returns one if <img src="https://latex.codecogs.com/png.latex?z_%7B0:L%7D"> is a SAW of length <img src="https://latex.codecogs.com/png.latex?L">. One can define two important probability mass functions:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ap%5E%7B%5Ctextrm%7Bwalk%7D%7D_%7BL%7D(z_%7B0:L%7D)%20=%20%5Cfrac%7B%5Cvarphi_L%5E%7B%5Ctextrm%7Bwalk%7D%7D(z_%7B0:L%7D)%7D%7B4%5EL%7D%0A%5Cqquad%20%5Ctextrm%7Band%7D%20%5Cqquad%0Ap%5E%7B%5Ctextrm%7BSAW%7D%7D_%7BL%7D(z_%7B0:L%7D)%20=%20%5Cfrac%7B%5Cvarphi_L%5E%7B%5Ctextrm%7BSAW%7D%7D(z_%7B0:L%7D)%7D%7Bc_L%7D.%0A"></p>
<p>They describe the uniform distributions on all the walks of length <img src="https://latex.codecogs.com/png.latex?L"> and all the SAWs of length <img src="https://latex.codecogs.com/png.latex?L">, respectively.</p>
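<p>For small lengths, the counts c_L can be obtained by exhaustive depth-first enumeration; a minimal sketch (exponential in L, so only usable for short walks; the function name is mine):</p>

```python
def count_saws(L):
    """Number of self-avoiding walks of length L on Z^2 starting at the origin."""
    def extend(path, visited, steps_left):
        if steps_left == 0:
            return 1
        x, y = path[-1]
        total = 0
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nxt not in visited:     # only extend to unvisited neighbours
                visited.add(nxt)
                path.append(nxt)
                total += extend(path, visited, steps_left - 1)
                path.pop()
                visited.remove(nxt)
        return total
    return extend([(0, 0)], {(0, 0)}, L)
```

<p>This reproduces the known small-L counts (4, 12, 36, 100, 284, 780, ...) and can be used to check the Monte Carlo estimators below.</p>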
</section>
<section id="importance-sampling" class="level3">
<h3 class="anchored" data-anchor-id="importance-sampling">Importance sampling</h3>
<p>One can approximate <img src="https://latex.codecogs.com/png.latex?c_L"> with naive Monte Carlo by estimating the proportion <img src="https://latex.codecogs.com/png.latex?p_L"> of walks that are SAWs,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ap_L%20=%20%5Cmathbb%7BE%7D_%7Bp%5E%7B%5Ctextrm%7Bwalk%7D%7D_%7BL%7D%7D%20%5Cleft%5B%20%5Cvarphi_L%5E%7B%5Ctextrm%7BSAW%7D%7D(z_%7B0:L%7D)%20%5Cright%5D%0A=%0A%5Cfrac%7B1%7D%7B4%5EL%7D%20%5Csum_%7Bz_%7B0:L%7D%7D%20%5Cvarphi_L%5E%7B%5Ctextrm%7BSAW%7D%7D(z_%7B0:L%7D).%0A"></p>
<p>This is an absolute disaster since the proportion of SAWs among all walks is extremely small. One can do significantly better using importance sampling. For this, consider a proposal distribution that starts at the origin and continues by choosing uniformly among the four neighbors of the last vertex that have not been visited yet. If there are no unvisited neighbors, the walk continues by standing still until length <img src="https://latex.codecogs.com/png.latex?L"> is reached: the resulting path is not even a valid walk, so <img src="https://latex.codecogs.com/png.latex?p%5E%7B%5Ctextrm%7Bwalk%7D%7D_%7BL%7D(z_%7B0:L%7D)%20=%200"> as well as <img src="https://latex.codecogs.com/png.latex?p%5E%7B%5Ctextrm%7BSAW%7D%7D_%7BL%7D(z_%7B0:L%7D)%20=%200">. The probability mass function of the proposal distribution is easy to compute, so estimating <img src="https://latex.codecogs.com/png.latex?p_L"> with importance sampling is straightforward. This is usually called the Rosenbluth method <span class="citation" data-cites="rosenbluth1955monte">(Rosenbluth and Rosenbluth 1955)</span>. <em>[<strong>Note to students</strong>: make it much clearer in your report that the Rosenbluth method is just importance sampling. Do note that even the “rejected” walks have to be taken into account!]</em></p>
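<p>A minimal sketch of the Rosenbluth estimator in plain Python (names are mine): grow each walk by picking uniformly among the unvisited neighbours, keep the product of the number of available choices as the importance weight (a dead end contributes weight zero but still counts), and average the weights to estimate c_L.</p>

```python
import random

def rosenbluth_estimate(L, n_samples, rng):
    """Importance-sampling (Rosenbluth) estimate of c_L."""
    total = 0.0
    for _ in range(n_samples):
        pos, visited, w = (0, 0), {(0, 0)}, 1.0
        for _ in range(L):
            x, y = pos
            free = [p for p in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1))
                    if p not in visited]
            if not free:       # dead end: the walk still counts, with weight 0
                w = 0.0
                break
            w *= len(free)     # inverse of the proposal probability at this step
            pos = rng.choice(free)
            visited.add(pos)
        total += w
    return total / n_samples
```

<p>The estimator is unbiased, and for moderate L it works well (e.g. it typically gets within a few percent of c_10 = 44100 with 20,000 samples), but the weight variance explodes as L grows, which is the degradation visible in the figure below.</p>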
<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/SAW/SAW_SMC.png" class="img-fluid figure-img" style="width:95.0%"></p>
<figcaption>Importance Sampling (Rosenbluth method)</figcaption>
</figure>
</div>
</div>
<p>As one can see, the quality quickly deteriorates as <img src="https://latex.codecogs.com/png.latex?L"> increases. This is because the number of accepted walks is very small, and, among them, the importance weights are highly unequal.<br>
<em>[<strong>Note to students</strong>: you should explain this much more clearly, and possibly explore this more quantitatively. The reason it is failing is not only that the number of accepted walks is small]</em></p>
</section>
<section id="recursive-formulation" class="level3">
<h3 class="anchored" data-anchor-id="recursive-formulation">Recursive formulation</h3>
<p>We have just seen that importance sampling will not be able to estimate <img src="https://latex.codecogs.com/png.latex?c_L"> for large values of <img src="https://latex.codecogs.com/png.latex?L">. This makes accurate estimates of <img src="https://latex.codecogs.com/png.latex?%5Cmu"> difficult to obtain this way.</p>
<p>To make progress, one can exploit the recursive structure of the problem. Let us define the concatenation of two walks. Given a first walk <img src="https://latex.codecogs.com/png.latex?z%5E%7B(A)%7D_%7B0:L_A%7D"> and a second walk <img src="https://latex.codecogs.com/png.latex?z%5E%7B(B)%7D_%7B0:L_B%7D">, one can define a new walk of length <img src="https://latex.codecogs.com/png.latex?L_A%20+%20L_B"> by starting at the origin, following the <img src="https://latex.codecogs.com/png.latex?L_A"> increments of the first walk, then the <img src="https://latex.codecogs.com/png.latex?L_B"> increments of the second. The concatenation of two SAWs is not always a SAW. However, it is not hard to prove the following. Define <img src="https://latex.codecogs.com/png.latex?B(L_A,%20L_B)%20%5Cin%20(0,1)"> as the probability that, when sampling SAWs <img src="https://latex.codecogs.com/png.latex?z%5E%7B(A)%7D_%7B0:L_A%7D"> and <img src="https://latex.codecogs.com/png.latex?z%5E%7B(B)%7D_%7B0:L_B%7D"> independently and uniformly at random, their concatenation is still a SAW. Then:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AB(L_A,%20L_B)%20%5C;%20=%20%5C;%20%5Cfrac%7Bc_%7BL_A%20+%20L_B%7D%7D%7Bc_%7BL_A%7D%20%5C,%20c_%7BL_B%7D%7D.%0A"></p>
<p><em>[<strong>Note to students</strong>: it is OK for you to use this fact. It’s even better if you can prove it, but not absolutely necessary.]</em></p>
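<p>The identity is easy to verify by brute force for small lengths (exponential-time enumeration, for illustration only; function names are mine):</p>

```python
def all_saws(L):
    """All self-avoiding walks of length L on Z^2 starting at the origin."""
    walks = []
    def extend(path, visited, steps_left):
        if steps_left == 0:
            walks.append(list(path))
            return
        x, y = path[-1]
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nxt not in visited:
                visited.add(nxt)
                path.append(nxt)
                extend(path, visited, steps_left - 1)
                path.pop()
                visited.remove(nxt)
    extend([(0, 0)], {(0, 0)}, L)
    return walks

def concat_is_saw(wa, wb):
    """Follow wa, then the increments of wb; True if the result is a SAW."""
    pos, visited = wa[-1], set(wa)
    for (x0, y0), (x1, y1) in zip(wb, wb[1:]):
        pos = (pos[0] + x1 - x0, pos[1] + y1 - y0)
        if pos in visited:
            return False
        visited.add(pos)
    return True

# counting the SAW pairs of lengths (L_A, L_B) whose concatenation is again
# a SAW recovers c_{L_A + L_B}, i.e. B(L_A, L_B) = c_{L_A+L_B} / (c_{L_A} c_{L_B})
```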
<p>Assuming one can generate SAWs of length <img src="https://latex.codecogs.com/png.latex?L"> uniformly at random (a problem that will be discussed later), we can estimate <img src="https://latex.codecogs.com/png.latex?%5Cmu"> in several ways:</p>
<ol type="1">
<li><p>For small values of <img src="https://latex.codecogs.com/png.latex?L_1">, the number of SAWs <img src="https://latex.codecogs.com/png.latex?c_%7BL_1%7D"> is known exactly (e.g., <img src="https://latex.codecogs.com/png.latex?c_1%20=%204">, <img src="https://latex.codecogs.com/png.latex?c_%7B10%7D%20=%2044100">). Suppose one can generate SAWs of length <img src="https://latex.codecogs.com/png.latex?L_2%20%5Cgg%201">. One can then estimate <img src="https://latex.codecogs.com/png.latex?B(L_1,%20L_2)"> empirically. Since <img src="https://latex.codecogs.com/png.latex?c_L%20%5C;%20%5Csim%20%5C;%20A%20%5C,%20%5Cmu%5EL%20%5C,%20L%5E%7B%5Cgamma%7D">, it follows that, for <img src="https://latex.codecogs.com/png.latex?L_1"> fixed and <img src="https://latex.codecogs.com/png.latex?L_2%20%5Cto%20%5Cinfty">, <img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7Bc_%7BL_1+L_2%7D%7D%7Bc_%7BL_2%7D%7D%20%5Capprox%20%5Cmu%5E%7BL_1%7D.%0A"> Using the fact that <img src="https://latex.codecogs.com/png.latex?B(L_1,%20L_2)%20=%20c_%7BL_1+L_2%7D%20/%20(c_%7BL_2%7D%20c_%7BL_1%7D)">, one can then estimate <img src="https://latex.codecogs.com/png.latex?%5Cmu"> from the estimate of <img src="https://latex.codecogs.com/png.latex?B(L_1,%20L_2)">.</p></li>
<li><p>Alternatively, one can estimate <img src="https://latex.codecogs.com/png.latex?c_L"> for large <img src="https://latex.codecogs.com/png.latex?L"> recursively. For example, starting from <img src="https://latex.codecogs.com/png.latex?c_%7B10%7D%20=%2044100">, estimate <img src="https://latex.codecogs.com/png.latex?B(10,10)"> to compute <img src="https://latex.codecogs.com/png.latex?c_%7B20%7D">, then use <img src="https://latex.codecogs.com/png.latex?B(20,20)"> to compute <img src="https://latex.codecogs.com/png.latex?c_%7B40%7D">, and so on. Using this method and about <img src="https://latex.codecogs.com/png.latex?5"> hours of CPU time (see below for details) with <img src="https://latex.codecogs.com/png.latex?10,000"> SAWs of lengths <img src="https://latex.codecogs.com/png.latex?10,%2020,%20%5Cdots,%202560">, I obtained <img src="https://latex.codecogs.com/png.latex?%5Cmu%20%5Capprox%202.643">.</p></li>
</ol>
</section>
<section id="generating-saws" class="level3">
<h3 class="anchored" data-anchor-id="generating-saws">Generating SAWs</h3>
<p>The previous discussion shows that, once we know how to generate uniform SAWs, we can estimate <img src="https://latex.codecogs.com/png.latex?%5Cmu"> relatively easily. One of the most common methods is the pivot algorithm: see <a href="https://clisby.net/projects/sm_simulator/">here</a> for a nice visualization. The principle is simple: given a SAW, randomly select a pivot site and apply a symmetry operation (like rotation or reflection) to one part of the walk. If the resulting walk remains self-avoiding, accept it; otherwise, reject it. Repeating this process generates diverse, approximately uniform SAWs.<br>
<em>[<strong>Note to students</strong>: explain this much more clearly if you decide to use it]</em></p>
<p>In short, the pivot algorithm updates a SAW by applying a symmetry operation to a subpath. Given a SAW <img src="https://latex.codecogs.com/png.latex?z_%7B0:L%7D">, one can obtain another SAW by applying to it the pivot algorithm a (large) number of times. To obtain a nearly independent SAW of length <img src="https://latex.codecogs.com/png.latex?L"> starting from <img src="https://latex.codecogs.com/png.latex?z_%7B0:L%7D">, one typically needs to apply about <img src="https://latex.codecogs.com/png.latex?L"> pivot steps. While it can be slow for large <img src="https://latex.codecogs.com/png.latex?L">, it is far more efficient than naive importance sampling.<br>
<em>[<strong>Note to students</strong>: efficiently implementing the pivot algorithm is non-trivial, but LLM assistants can help a lot, and are actually quite useful for code optimization]</em></p>
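<p>A deliberately naive sketch of one pivot move (O(L) per step; the efficient implementations mentioned above use specialised data structures; names are mine): pick an interior pivot site, apply a random lattice symmetry to the tail, and accept only if the result is still self-avoiding.</p>

```python
import random

# the eight lattice symmetries of Z^2 (rotations and reflections)
SYMMETRIES = [
    lambda x, y: (x, y),   lambda x, y: (-y, x),
    lambda x, y: (-x, -y), lambda x, y: (y, -x),
    lambda x, y: (-x, y),  lambda x, y: (x, -y),
    lambda x, y: (y, x),   lambda x, y: (-y, -x),
]

def pivot_step(walk, rng):
    """One pivot move; returns the new walk (the old walk if rejected)."""
    L = len(walk) - 1
    k = rng.randrange(1, L)           # interior pivot site
    g = rng.choice(SYMMETRIES)
    px, py = walk[k]
    tail = [(px + gx, py + gy)
            for gx, gy in (g(x - px, y - py) for x, y in walk[k + 1:])]
    new = walk[: k + 1] + tail
    # lattice symmetries preserve step lengths, so only self-avoidance
    # needs to be re-checked
    return new if len(set(new)) == len(new) else walk
```

<p>Starting from a straight walk and iterating this move preserves all the SAW invariants (origin, length, unit steps, self-avoidance) while quickly decorrelating the configuration.</p>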
</section>
<section id="sequential-monte-carlo" class="level3">
<h3 class="anchored" data-anchor-id="sequential-monte-carlo">Sequential Monte Carlo</h3>
<p>To estimate <img src="https://latex.codecogs.com/png.latex?c_L"> for large <img src="https://latex.codecogs.com/png.latex?L">, one can use Sequential Monte Carlo (SMC). The idea is to grow a population of <img src="https://latex.codecogs.com/png.latex?N"> SAWs in parallel and estimate <img src="https://latex.codecogs.com/png.latex?c_L"> by recursively estimating the ratios <img src="https://latex.codecogs.com/png.latex?c_%7BL+1%7D/c_L">. Suppose you have <img src="https://latex.codecogs.com/png.latex?N"> SAWs of length <img src="https://latex.codecogs.com/png.latex?L">. Try to extend each SAW by choosing a neighbor of the last vertex that has not been visited yet. This is a form of importance sampling, giving <img src="https://latex.codecogs.com/png.latex?N"> new walks of length <img src="https://latex.codecogs.com/png.latex?L+1"> with associated weights (some of them being non-valid walks!). Then, <em>resample</em> <img src="https://latex.codecogs.com/png.latex?N"> times from this weighted set to get <img src="https://latex.codecogs.com/png.latex?N"> new SAWs of length <img src="https://latex.codecogs.com/png.latex?L+1"> (with possible duplicates). Apply the pivot algorithm to eliminate these duplicates and generate more diverse SAWs.<br>
<em>[<strong>Note to students</strong>: if you decide to use SMC, explain it much more clearly. It’s not entirely straightforward to understand or implement, but it is one of the most powerful and versatile Monte Carlo methods to this day. A good investment of your time if you decide to understand SMC]</em></p>
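<p>A minimal SMC sketch along these lines (plain Python, multinomial resampling, and without the pivot moves, so diversity is limited; names and parameters are mine). At each step, the ratio c_{L+1}/c_L is estimated by the average number of free neighbours over the current population, and c_L by the product of these ratios.</p>

```python
import math
import random

def smc_count(L, N, rng):
    """SMC estimate of c_L: maintain N SAWs, estimate each ratio
    c_{t+1}/c_t by the average number of free neighbours, then resample."""
    particles = [([(0, 0)], {(0, 0)})] * N
    log_c = 0.0
    for _ in range(L):
        frees = []
        for path, visited in particles:
            x, y = path[-1]
            frees.append([p for p in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1))
                          if p not in visited])
        weights = [len(f) for f in frees]
        if sum(weights) == 0:                  # every particle is stuck
            return 0.0
        log_c += math.log(sum(weights) / N)    # estimate of c_{t+1} / c_t
        # multinomial resampling proportional to the number of extensions,
        # then extend each survivor by one uniformly chosen free neighbour
        idx = rng.choices(range(N), weights=weights, k=N)
        new = []
        for i in idx:
            path, visited = particles[i]
            step = rng.choice(frees[i])
            new.append((path + [step], visited | {step}))
        particles = new
    return math.exp(log_c)
```

<p>Because duplicates accumulate after resampling, interleaving pivot moves, as suggested above, greatly improves the diversity of the population.</p>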
<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/SAW/SMC_mu_estimates.png" class="img-fluid figure-img" style="width:95.0%"></p>
<figcaption>Sequential Monte Carlo</figcaption>
</figure>
</div>
</div>
</section>
<section id="improving-the-estimation-of-mu" class="level3">
<h3 class="anchored" data-anchor-id="improving-the-estimation-of-mu">Improving the estimation of <img src="https://latex.codecogs.com/png.latex?%5Cmu"></h3>
<p>Suppose you have estimates of <img src="https://latex.codecogs.com/png.latex?(%5Clog%20c_L)/L"> for various <img src="https://latex.codecogs.com/png.latex?L">. Since</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7B%5Clog%20c_L%7D%7BL%7D%20%5Capprox%20%5Clog%20A%20%5Ccdot%20%5Cfrac%7B1%7D%7BL%7D%20+%20%5Clog%20%5Cmu%20+%20%5Cgamma%20%5Ccdot%20%5Cfrac%7B%5Clog%20L%7D%7BL%7D,%0A"></p>
<p>you can fit a linear regression to estimate <img src="https://latex.codecogs.com/png.latex?%5Clog%20A">, <img src="https://latex.codecogs.com/png.latex?%5Clog%20%5Cmu">, and <img src="https://latex.codecogs.com/png.latex?%5Cgamma">. I tried this approach using a naive and non-optimized SMC implementation with <img src="https://latex.codecogs.com/png.latex?N=1000"> and <img src="https://latex.codecogs.com/png.latex?L=1000">, running for 10 hours on a free (and bad) online CPU, and obtained <img src="https://latex.codecogs.com/png.latex?%5Cmu%20%5Capprox%202.6366">.<br>
<em>[<strong>Note to students</strong>: can you do much better?]</em></p>
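<p>As an illustration of the regression itself (not of the Monte Carlo part), one can fit this model to the exact counts c_1, ..., c_10, which are known; with such short walks the asymptotic form is only approximate, so the recovered value of mu is rough, but it already lands in the right ballpark.</p>

```python
import numpy as np

# exact counts c_1, ..., c_10 of 2D SAWs (OEIS A001411)
c = np.array([4, 12, 36, 100, 284, 780, 2172, 5916, 16268, 44100], float)
L = np.arange(1, 11, dtype=float)

# (log c_L)/L  ~  log(A)/L + log(mu) + gamma * log(L)/L
X = np.column_stack([1.0 / L, np.ones_like(L), np.log(L) / L])
coef, *_ = np.linalg.lstsq(X, np.log(c) / L, rcond=None)
log_A, log_mu, gamma = coef
mu_hat = float(np.exp(log_mu))
```

<p>Feeding the same regression with Monte Carlo estimates of c_L for much larger L is what sharpens the estimate.</p>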
</section>
<section id="running-long-simulations" class="level3">
<h3 class="anchored" data-anchor-id="running-long-simulations">Running long simulations</h3>
<p>The best known estimate of <img src="https://latex.codecogs.com/png.latex?%5Cmu"> required several tens of thousands of CPU hours. While writing these notes, I was able to run simulations easily and for free using <a href="https://deepnote.com">deepNote</a>: it was my first time using it, and it was very user friendly. This allowed me to run simulations for 8 hours on a (free but slow) CPU without issue. Launch simulations in the evening and let them run overnight. <em>[<strong>Note to students</strong>: for the more motivated ones, you can try writing GPU-friendly code to run simulations, possibly on Google Colab]</em></p>



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0">
<div id="ref-rosenbluth1955monte" class="csl-entry">
Rosenbluth, Marshall N, and Arianna W Rosenbluth. 1955. <span>“Monte Carlo Calculation of the Average Extension of Molecular Chains.”</span> <em>The Journal of Chemical Physics</em> 23 (2). American Institute of Physics: 356–59.
</div>
</div></section></div> ]]></description>
  <category>monte-carlo</category>
  <guid>https://alexxthiery.github.io/notes/SAW/SAW.html</guid>
  <pubDate>Fri, 04 Apr 2025 17:00:00 GMT</pubDate>
</item>
<item>
  <title>Jarzynski and Crooks</title>
  <link>https://alexxthiery.github.io/notes/jarzynski/jarzynski.html</link>
  <description><![CDATA[ 





<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/jarzynski/jarzynski_crooks.png" class="img-fluid figure-img" style="width:80.0%"></p>
<figcaption><a href="https://en.wikipedia.org/wiki/Christopher_Jarzynski">Christopher Jarzynski</a> and <a href="https://en.wikipedia.org/wiki/Gavin_E._Crooks">Gavin Crooks</a></figcaption>
</figure>
</div>
</div>
<p>Consider a sequence of densities on <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BR%7D%5ED"> indexed by time parameter <img src="https://latex.codecogs.com/png.latex?t%20%5Cin%20%5B0,T%5D">,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cpi_t(x)%20%5C;%20=%20%5C;%20%5Cfrac%7B%20e%5E%7B-U_t(x)%7D%7D%7BZ_t%7D%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?U_t:%20%5Cmathbb%7BR%7D%5ED%20%5Cto%20%5Cmathbb%7BR%7D"> is a time-dependent potential function and <img src="https://latex.codecogs.com/png.latex?Z_t"> is the normalizing constant. In fact, we are really interested in the final density <img src="https://latex.codecogs.com/png.latex?%5Cpi_T">; the bridging sequence of densities <img src="https://latex.codecogs.com/png.latex?%5Cpi_t"> is just a tool to get there, starting from an initial and tractable density <img src="https://latex.codecogs.com/png.latex?%5Cpi_0">. If one initializes a particle <img src="https://latex.codecogs.com/png.latex?X_0%20%5Csim%20%5Cpi_0"> and evolves it according to the <a href="https://en.wikipedia.org/wiki/Langevin_equation">Langevin</a> dynamics</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AdX_t%20%5C;%20=%20%5C;%20-%5Cnabla%20U_t(X_t)%20%5C,%20dt%20+%20%5Csqrt%7B2%7D%20%5C,%20dW_t%0A"></p>
<p>one can hope that the distribution of <img src="https://latex.codecogs.com/png.latex?X_T"> will be close to <img src="https://latex.codecogs.com/png.latex?%5Cpi_T">. This would be the case if one evolved the particle according to <img src="https://latex.codecogs.com/png.latex?dX_t%20%5C;%20=%20%5C;%20-%5Cgamma%20%5Cnabla%20U_t(X_t)%20%5C,%20dt%20+%20%5Csqrt%7B2%20%5Cgamma%7D%20%5C,%20dW_t"> and let <img src="https://latex.codecogs.com/png.latex?%5Cgamma%20%5Cto%20%5Cinfty">, since in that case <img src="https://latex.codecogs.com/png.latex?X_t"> would be distributed according to <img src="https://latex.codecogs.com/png.latex?%5Cpi_t"> for all <img src="https://latex.codecogs.com/png.latex?t">. Can one correct the distribution of <img src="https://latex.codecogs.com/png.latex?X_T"> with importance sampling weights?</p>
<p>I like the approach presented in <span class="citation" data-cites="vargas2023transport">(Vargas et al. 2024)</span> and these notes are my attempt to understand it. One very fruitful idea that has been used in a number of works in the Monte-Carlo literature is to look at a probability distribution of interest as the marginal of a joint distribution and to carry out computations and build numerical methods on the joint distribution <span class="citation" data-cites="del2006sequential">(Del Moral, Doucet, and Jasra 2006)</span>. Indeed, there is a lot of flexibility in the choice of the joint distribution.</p>
<p>Here, we can also consider the diffusion process <img src="https://latex.codecogs.com/png.latex?Y_t"> that runs backward in time, is initialized according to <img src="https://latex.codecogs.com/png.latex?%5Cpi_T">, and follows the same Langevin dynamics (backward in time). Again, one expects the distribution of <img src="https://latex.codecogs.com/png.latex?Y_t"> to be close to <img src="https://latex.codecogs.com/png.latex?%5Cpi_t">. It is more intuitive to discuss discretized versions of these processes. For a time discretization <img src="https://latex.codecogs.com/png.latex?%5Cdelta%20=%20T/N">, we consider the forward Markov chain <img src="https://latex.codecogs.com/png.latex?X_t"> defined as:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Baligned%7D%0Ax_%7Bt%20+%20%5Cdelta%7D%20&amp;=%20x_t%20-%20%5Cnabla%20U_t(x_t)%20%5C,%20%5Cdelta%20+%20%5Csqrt%7B2%20%5Cdelta%7D%20%5C,%20%5Cxi_t%5C%5C%0Ax_0%20&amp;%5Csim%20%5Cpi_0%0A%5Cend%7Baligned%7D%0A%5Cright.%0A"></p>
<p>as well as the backward Markov chain <img src="https://latex.codecogs.com/png.latex?Y_t"> defined as <img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Baligned%7D%0Ay_%7Bt%7D%20&amp;=%20y_%7Bt%20+%20%5Cdelta%7D%20-%20%5Cnabla%20U_%7Bt%20+%20%5Cdelta%7D(y_%7Bt%20+%20%5Cdelta%7D)%20%5C,%20%5Cdelta%20+%20%5Csqrt%7B2%20%5Cdelta%7D%20%5C,%20%5Cxi_t%5C%5C%0Ay_T%20&amp;%5Csim%20%5Cpi_T.%0A%5Cend%7Baligned%7D%0A%5Cright.%0A"></p>
<p>The quantities <img src="https://latex.codecogs.com/png.latex?%5Cxi_t%20%5Csim%20%5Cmathcal%7BN%7D(0,I)"> are i.i.d. standard Gaussian random variables. Let us continue with these discretized versions and denote by <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D%5E%7BX%7D"> and <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D%5E%7BY%7D"> the probability measures associated with the discretized processes. For a discretized path <img src="https://latex.codecogs.com/png.latex?%5Cunderline%7Bz%7D%20=%20(z_0,%20z_%7B%5Cdelta%7D,%20%5Cldots,%20z_%7BT%7D)"> and notation <img src="https://latex.codecogs.com/png.latex?t_k%20=%20k%20%5Cdelta">, we have:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Baligned%7D%0A%5Cmathbb%7BP%7D%5EX(%5Cunderline%7Bz%7D)%20&amp;=%0A%5Cpi_0(z_0)%20%5C,%20%5Cexp%20%7B%5Cleft%5C%7B%20-%5Cfrac%7B1%7D%7B4%20%5Cdelta%7D%20%5Csum_%7Bk=0%7D%5E%7BN-1%7D%20%5C%7Cz_%7Bt_%7Bk+1%7D%7D%20-%20%5Bz_%7Bt_k%7D%20-%20%5Cnabla%20U_%7Bt_k%7D(z_%7Bt_k%7D)%5C,%5Cdelta%5D%5C%7C%5E2%20%5Cright%5C%7D%7D%20%5C%5C%0A%5Cmathbb%7BP%7D%5EY(%5Cunderline%7Bz%7D)%20&amp;=%0A%5Cpi_T(z_T)%20%5C,%20%5Cexp%20%7B%5Cleft%5C%7B%20-%5Cfrac%7B1%7D%7B4%20%5Cdelta%7D%20%5Csum_%7Bk=0%7D%5E%7BN-1%7D%20%5C%7Cz_%7Bt_%7Bk%7D%7D%20-%20%5Bz_%7Bt_%7Bk+1%7D%7D%20-%20%5Cnabla%20U_%7Bt_%7Bk+1%7D%7D(z_%7Bt_%7Bk+1%7D%7D)%5C,%5Cdelta%5D%5C%7C%5E2%20%5Cright%5C%7D%7D%20.%0A%5Cend%7Baligned%7D%0A"></p>
<p>One can compute the ratio <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D%5EY(z)%20/%20%5Cmathbb%7BP%7D%5EX(z)"> and examine its limit as <img src="https://latex.codecogs.com/png.latex?N%20%5Cto%20%5Cinfty">. Algebra gives:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7B%5Cmathbb%7BP%7D%5EY%7D%7B%5Cmathbb%7BP%7D%5EX%7D(%5Cunderline%7Bz%7D)%20=%0A%5Cfrac%7B%5Cpi_T(z_T)%7D%7B%5Cpi_0(z_0)%7D%20%5C,%0A%5Cexp%20%7B%5Cleft%5C%7B%20%5Csum_%7Bk=0%7D%5E%7BN-1%7D%20%5Cleft%3C%20z_%7Bt_%7Bk+1%7D%7D%20-%20z_%7Bt_k%7D,%20%5Cfrac%7B%5Cnabla%20U_%7Bt_k%7D(z_%7Bt_k%7D)%20+%20%5Cnabla%20U_%7Bt_%7Bk+1%7D%7D(z_%7Bt_%7Bk+1%7D%7D)%7D%7B2%7D%20%20%5Cright%3E%20%5Cright%5C%7D%7D%20%20%20+%20%5Cmathcal%7BO%7D(%5Cdelta%5E%7B1/2%7D).%0A"></p>
<p>One could probably use some <a href="https://en.wikipedia.org/wiki/Stratonovich_integral">Stratonovich</a> calculus to study this, but I always forget these things, so let’s use Ito instead. Write</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B%5Cnabla%20U_%7Bt_k%7D(z_%7Bt_k%7D)%20+%20%5Cnabla%20U_%7Bt_%7Bk+1%7D%7D(z_%7Bt_%7Bk+1%7D%7D)%7D%7B2%7D%0A=%0A%5Cnabla%20U_%7Bt_k%7D(z_%7Bt_k%7D)%20+%20%5Cfrac%7B1%7D%7B2%7D%20%5Cmathrm%7BHess%7D_%7BU_%7Bt_%7Bk%7D%7D%7D%20(z_%7Bt_%7Bk%7D%7D)%20(z_%7Bt_%7Bk+1%7D%7D%20-%20z_%7Bt_k%7D)%0A+%0A%5Cmathcal%7BO%7D(%5Cdelta).%0A"></p>
<p>Consequently, in the limit <img src="https://latex.codecogs.com/png.latex?N%20%5Cto%20%5Cinfty">, the ratio converges to:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7Bd%5Cmathbb%7BP%7D%5EY%7D%7Bd%5Cmathbb%7BP%7D%5EX%7D(%5Cunderline%7Bz%7D)%20%5C;%20=%20%5C;%0A%5Cfrac%7B%5Cpi_T(z_T)%7D%7B%5Cpi_0(z_0)%7D%20%5C,%0A%5Cexp%20%7B%5Cleft%5C%7B%20%20%5Cint_%7Bt=0%7D%5ET%20%5Cnabla%20U_t(z_t)%20%5C,%20dz_t%0A+%0A%5Cfrac%7B1%7D%7B2%7D%20%5Cint_0%5ET%20%5Cleft%3C%20dz_t,%20%5Cmathrm%7BHess%7D_%7BU_t%7D(z_t)%20%5C,%20dz_t%20%5Cright%3E%0A%5Cright%5C%7D%7D%20.%0A"></p>
<p>One can obtain a slightly simpler formula using Ito’s lemma. Since <img src="https://latex.codecogs.com/png.latex?d%20U_t(z_t)%20=%20%5Cpartial_t%20U_t(z_t)%20%5C,%20dt%20+%20%5Cleft%3C%20%5Cnabla%20U_t(z_t),%20dz_t%20%5Cright%3E%20+%20%5Cfrac%7B1%7D%7B2%7D%20%5Cleft%3C%20dz_t,%20%5Cmathrm%7BHess%7D_%7BU_t%7D(z_t)%20%5C,%20dz_t%20%5Cright%3E">, we also have:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7Bd%20%5Cmathbb%7BP%7D%5EY%7D%7Bd%20%5Cmathbb%7BP%7D%5EX%7D(%5Cunderline%7Bz%7D)%0A=%0A%5Cfrac%7B%5Cpi_T(z_T)%7D%7B%5Cpi_0(z_0)%7D%20%5C,%0A%5Cexp%20%7B%5Cleft%5C%7B%20%20U_T(z_T)%20-%20U_0(z_0)%20-%20%5Cint_0%5ET%20%5Cpartial_t%20U_t(z_t)%20%5C,%20dt%20%5Cright%5C%7D%7D%20.%0A"></p>
<p>Since <img src="https://latex.codecogs.com/png.latex?%5Cpi_t(z_t)%20=%20%5Cexp(-U_t(z_t))%20/%20Z_t">, this gives the <a href="https://en.wikipedia.org/wiki/Crooks_fluctuation_theorem">Crooks relation</a>:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7Bd%20%5Cmathbb%7BP%7D%5EY%7D%7Bd%20%5Cmathbb%7BP%7D%5EX%7D(%5Cunderline%7Bz%7D)%0A=%0A%5Cfrac%7BZ_0%7D%7BZ_T%7D%20%5C,%0A%5Cexp%20%7B%5Cleft%5C%7B%20%20-%20%5Cint_0%5ET%20%5Cpartial_t%20U_t(z_t)%20%5C,%20dt%20%5Cright%5C%7D%7D%20.%0A"></p>
<p>Integrating over trajectories of <img src="https://latex.codecogs.com/png.latex?X_t">, since <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D_%7BX%7D%5B(d%20%5Cmathbb%7BP%7D%5EY%20/%20d%20%5Cmathbb%7BP%7D%5EX)(X)%5D%20=%201">, one obtains the <a href="https://en.wikipedia.org/wiki/Jarzynski_equality">Jarzynski equality</a> <img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7BZ_T%7D%7BZ_0%7D%20%5C;%20=%20%5C;%20%5Cmathbb%7BE%7D_%7BX%7D%20%20%7B%5Cleft%5B%20%20%5Cexp%20%7B%5Cleft%5C%7B%20%20-%20%5Cint_0%5ET%20%5Cpartial_t%20U_t(X_t)%20%5C,%20dt%20%5Cright%5C%7D%7D%20%20%5Cright%5D%7D%0A"></p>
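<p>To make the Jarzynski equality concrete, here is a minimal Monte-Carlo sketch on an assumed toy example (not taken from the references): a Gaussian family with potential U_t(x) = (1+t) x^2/2, for which the exact ratio is Z_T/Z_0 = 1/sqrt(2). The forward chain is exactly the Euler discretization written above.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy example (not from the note): pi_t = N(0, 1/(1+t)), so that
# U_t(x) = (1+t) x^2 / 2,  dU_t/dt (x) = x^2 / 2,  and Z_T/Z_0 = sqrt(1/2).
T, N, n_particles = 1.0, 1000, 100_000
delta = T / N

x = rng.standard_normal(n_particles)       # X_0 ~ pi_0 = N(0, 1)
work = np.zeros(n_particles)               # accumulates int_0^T dU_t/dt (X_t) dt
for k in range(N):
    t = k * delta
    work += 0.5 * x**2 * delta
    # Euler step of the (uncontrolled) Langevin dynamics
    x += -(1.0 + t) * x * delta + np.sqrt(2 * delta) * rng.standard_normal(n_particles)

estimate = np.exp(-work).mean()            # Jarzynski estimate of Z_T / Z_0
print(estimate, np.sqrt(0.5))              # the two values should be close
```

<p>The empirical average of exp(-work) matches the exact ratio up to Monte-Carlo noise and a discretization bias of order delta.</p>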
<p>which is indeed also central to sequential Monte-Carlo methods. As described in <span class="citation" data-cites="vargas2023transport">(Vargas et al. 2024)</span>, the same approach can be used to slightly generalize the Crooks relation. Indeed, suppose that one instead considers the dynamics:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AdX_t%20%5C;%20=%20%5C;%20-%5Cnabla%20U_t(X_t)%20%5C,%20dt%20%20%5Ctextcolor%7Bblue%7D%7B+%20b_t(X_t)%20%5C,%20dt%7D%20+%20%5Csqrt%7B2%7D%20%5C,%20dW_t%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?b:%20%5Cmathbb%7BR%7D%5ED%20%5Cto%20%5Cmathbb%7BR%7D%5ED"> is a control function. One can consider the backward dynamics <img src="https://latex.codecogs.com/png.latex?Y_t"> that is initialized according to <img src="https://latex.codecogs.com/png.latex?%5Cpi_T"> and follows the dynamics <img src="https://latex.codecogs.com/png.latex?dY_t%20=%20-%5Cnabla%20U_t(Y_t)%20%5C,%20dt%20%20%5Ctextcolor%7Bred%7D%7B-%7D%20b(Y_t)%20%5C,%20dt%20+%20%5Csqrt%7B2%7D%20%5C,%20dW_t"> backward in time, i.e.</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Baligned%7D%0Ax_%7Bt%20+%20%5Cdelta%7D%20&amp;=%20x_t%20-%20%5Cnabla%20U_t(x_t)%20%5C,%20%5Cdelta%20+%20b_t(x_t)%20%5C,%20%5Cdelta%20+%20%5Csqrt%7B2%20%5Cdelta%7D%20%5C,%20%5Cxi_t%5C%5C%0Ay_%7Bt%7D%20&amp;=%20y_%7Bt%20+%20%5Cdelta%7D%20-%20%5Cnabla%20U_%7Bt%20+%20%5Cdelta%7D(y_%7Bt%20+%20%5Cdelta%7D)%20%5C,%20%5Cdelta%20%20%5Ctextcolor%7Bred%7D%7B-%7D%20b_%7Bt%20+%20%5Cdelta%7D(y_%7Bt%20+%20%5Cdelta%7D)%20%5C,%20%5Cdelta%20+%20%5Csqrt%7B2%20%5Cdelta%7D%20%5C,%20%5Cxi_t.%0A%5Cend%7Baligned%7D%0A%5Cright.%0A"></p>
<p>The minus sign for the backward dynamics <img src="https://latex.codecogs.com/png.latex?Y_t"> is natural. Indeed, one would like the dynamics of <img src="https://latex.codecogs.com/png.latex?Y_t"> to be as close as possible to the time-reversal of the dynamics of <img src="https://latex.codecogs.com/png.latex?X_t">. Furthermore, one knows that the <a href="../../notes/reverse_and_tweedie/reverse_and_tweedie.html">backward dynamics</a> of <img src="https://latex.codecogs.com/png.latex?X_t"> is given by</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AdY_t%20=%20%20%5Ctextcolor%7Bred%7D%7B+%7D%20%5Cnabla%20U_t(Y_t)%20%5C,%20dt%20%20%5Ctextcolor%7Bred%7D%7B-%7D%20b_t(Y_t)%20%5C,%20dt%20+%202%20%5Cnabla%20%5Clog%20p_t(Y_t)%20%5C,%20dt%20+%20%5Csqrt%7B2%7D%20%5C,%20dW_t%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?p_t"> is the marginal distribution of <img src="https://latex.codecogs.com/png.latex?X_t"> at time <img src="https://latex.codecogs.com/png.latex?t">. Since we would like <img src="https://latex.codecogs.com/png.latex?p_t%20=%20%5Cpi_t%20=%20%5Cexp(-U_t)%20/%20Z_t">, this gives</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AdY_t%20=%20-%5Cnabla%20U_t(Y_t)%20%5C,%20dt%20%20%5Ctextcolor%7Bred%7D%7B-%7D%20b_t(Y_t)%20%5C,%20dt%20+%20%5Csqrt%7B2%7D%20%5C,%20dW_t,%0A"></p>
<p>which is exactly the dynamics we chose for <img src="https://latex.codecogs.com/png.latex?Y_t">. One can then follow the exact same steps as previously, using that the quadratic variation is <img src="https://latex.codecogs.com/png.latex?%5Cleft%3C%20dz_t,%20dz_t%20%5Cright%3E%20=%202%20%5C,%20dt">, to obtain that</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7Bd%20%5Cmathbb%7BP%7D%5EY%7D%7Bd%20%5Cmathbb%7BP%7D%5EX%7D(%5Cunderline%7Bz%7D)%0A=%0A%5Cfrac%7BZ_0%7D%7BZ_T%7D%20%5C,%0A%5Cexp%20%7B%5Cleft%5C%7B%20%20%5Cint_0%5ET%20-%5Cpartial_t%20U_t(z_t)%20%20%5Ctextcolor%7Bblue%7D%7B+%20%5Cnabla%20%5Ccdot%20b_t(z_t)%20-%20%5Cleft%3C%20%5Cnabla%20U_t(z_t),%20b_t(z_t)%20%5Cright%3E%7D%20%5C,%20dt%20%5Cright%5C%7D%7D%20.%0A"></p>
<p>This for example shows that, for <img src="https://latex.codecogs.com/png.latex?dX_t%20%5C;%20=%20%5C;%20-%5Cnabla%20U_t(X_t)%20%5C,%20dt%20%20%5Ctextcolor%7Bblue%7D%7B+%20b_t(X_t)%20%5C,%20dt%7D%20+%20%5Csqrt%7B2%7D%20%5C,%20dW_t"> initialized according to <img src="https://latex.codecogs.com/png.latex?%5Cpi_0">, we have:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7BZ_T%7D%7BZ_0%7D%20%5C;%20=%20%5C;%20%5Cmathbb%7BE%7D_%7BX%7D%20%20%7B%5Cleft%5C%7B%20%20%5Cexp%20%7B%5Cleft%5C%7B%20%20%5Cint_0%5ET%20-%5Cpartial_t%20U_t(X_t)%20%20%5Ctextcolor%7Bblue%7D%7B+%20%5Cnabla%20%5Ccdot%20b_t(X_t)%20-%20%5Cleft%3C%20%5Cnabla%20U_t(X_t),%20b_t(X_t)%20%5Cright%3E%7D%20%5C,%20dt%20%5Cright%5C%7D%7D%20%20%5Cright%5C%7D%7D%20.%0A"></p>
<p>This generalization of the Crooks relation is also explored in <span class="citation" data-cites="albergo2024nets">(Albergo and Vanden-Eijnden 2024)</span>, where an alternative derivation is given by directly exploiting the <a href="https://en.wikipedia.org/wiki/Fokker–Planck_equation">Fokker-Planck</a> equation. Crucially, <span class="citation" data-cites="albergo2024nets">(Albergo and Vanden-Eijnden 2024)</span> note that, if the control function <img src="https://latex.codecogs.com/png.latex?b_t:%20%5Cmathbb%7BR%7D%5ED%20%5Cto%20%5Cmathbb%7BR%7D%5ED"> is chosen so that</p>
<p><span id="eq-nets"><img src="https://latex.codecogs.com/png.latex?%0A-%5Cpartial_t%20U_t(x)%20+%20%5Cnabla%20%5Ccdot%20b_t(x)%20-%20%5Cleft%3C%20%5Cnabla%20U_t(x),%20b_t(x)%20%5Cright%3E%0A=%0A%5Cfrac%7Bd%7D%7Bdt%7D%20%5C,%20%5Clog%20Z_t%0A%5Ctag%7B1%7D"></span></p>
<p>then the term <img src="https://latex.codecogs.com/png.latex?%5Cint_0%5ET%20-%5Cpartial_t%20U_t(X_t)%20+%20%5Cnabla%20%5Ccdot%20b_t(X_t)%20-%20%5Cleft%3C%20%5Cnabla%20U_t(X_t),%20b_t(X_t)%20%5Cright%3E%20%5C,%20dt"> is deterministic (the same for every trajectory), which gives a zero-variance estimator of the free energy difference <img src="https://latex.codecogs.com/png.latex?%5Clog(Z_T/Z_0)">. Of course, solving the high-dimensional PDE in Equation&nbsp;1 is a formidable challenge, and <span class="citation" data-cites="albergo2024nets">(Albergo and Vanden-Eijnden 2024)</span> propose interesting <a href="https://en.wikipedia.org/wiki/Physics-informed_neural_networks">PINNs</a>-based methods to do so.</p>
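<p>In simple cases, Equation&nbsp;1 can be solved by hand, which makes the zero-variance claim easy to verify numerically. In the assumed Gaussian example U_t(x) = (1+t) x^2/2, so that d/dt log Z_t = -1/(2(1+t)), one checks that the linear control b_t(x) = -x/(2(1+t)) satisfies Equation&nbsp;1 exactly; this choice of b_t is specific to this toy example, not a recipe from the papers. A short simulation then shows that every particle carries the same weight:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy example: pi_t = N(0, 1/(1+t)), U_t(x) = (1+t) x^2 / 2.
# The control b_t(x) = -x / (2 (1+t)) solves Equation 1, whose right-hand
# side is d/dt log Z_t = -1 / (2 (1+t)).
T, N, n_particles = 1.0, 1000, 10_000
delta = T / N

x = rng.standard_normal(n_particles)           # X_0 ~ pi_0 = N(0, 1)
work = np.zeros(n_particles)
for k in range(N):
    a = 1.0 + k * delta
    b = -x / (2 * a)                           # control solving Equation 1
    div_b = -1 / (2 * a)                       # divergence of b_t (D = 1)
    grad_U = a * x
    # integrand:  -dU_t/dt + div(b_t) - <grad U_t, b_t>
    work += (-0.5 * x**2 + div_b - grad_U * b) * delta
    x += (-grad_U + b) * delta + np.sqrt(2 * delta) * rng.standard_normal(n_particles)

weights = np.exp(work)                         # estimates of Z_T / Z_0 = sqrt(1/2)
print(weights.mean(), weights.std(), np.sqrt(0.5))
```

<p>The standard deviation of the weights is zero up to floating-point noise: the integrand collapses to the deterministic function -1/(2(1+t)), as predicted.</p>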
<section id="some-references" class="level3">
<h3 class="anchored" data-anchor-id="some-references">Some References:</h3>
<ul>
<li>The original papers by Jarzynski <span class="citation" data-cites="jarzynski1997nonequilibrium">(Jarzynski 1997)</span> and Crooks <span class="citation" data-cites="crooks1999entropy">(Crooks 1999)</span>.</li>
<li>The book <span class="citation" data-cites="stoltz2010free">(Stoltz, Rousset, et al. 2010)</span> is excellent!</li>
<li>The two papers that prompted these notes: <span class="citation" data-cites="vargas2023transport">(Vargas et al. 2024)</span> and <span class="citation" data-cites="albergo2024nets">(Albergo and Vanden-Eijnden 2024)</span>.</li>
</ul>



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0">
<div id="ref-albergo2024nets" class="csl-entry">
Albergo, Michael S, and Eric Vanden-Eijnden. 2024. <span>“Nets: A Non-Equilibrium Transport Sampler.”</span> <em>arXiv Preprint arXiv:2410.02711</em>.
</div>
<div id="ref-crooks1999entropy" class="csl-entry">
Crooks, Gavin E. 1999. <span>“Entropy Production Fluctuation Theorem and the Nonequilibrium Work Relation for Free Energy Differences.”</span> <em>Physical Review E</em> 60 (3). APS: 2721.
</div>
<div id="ref-del2006sequential" class="csl-entry">
Del Moral, Pierre, Arnaud Doucet, and Ajay Jasra. 2006. <span>“Sequential Monte Carlo Samplers.”</span> <em>Journal of the Royal Statistical Society Series B: Statistical Methodology</em> 68 (3). Oxford University Press: 411–36.
</div>
<div id="ref-jarzynski1997nonequilibrium" class="csl-entry">
Jarzynski, Christopher. 1997. <span>“Nonequilibrium Equality for Free Energy Differences.”</span> <em>Physical Review Letters</em> 78 (14). APS: 2690.
</div>
<div id="ref-stoltz2010free" class="csl-entry">
Stoltz, Gabriel, Mathias Rousset, et al. 2010. <em>Free Energy Computations: A Mathematical Perspective</em>. World Scientific.
</div>
<div id="ref-vargas2023transport" class="csl-entry">
Vargas, Francisco, Shreyas Padhy, Denis Blessing, and Nikolas Nüsken. 2024. <span>“Transport Meets Variational Inference: Controlled Monte Carlo Diffusions.”</span> <em>ICLR 2024</em>.
</div>
</div></section></div> ]]></description>
  <category>SDE</category>
  <category>markov</category>
  <guid>https://alexxthiery.github.io/notes/jarzynski/jarzynski.html</guid>
  <pubDate>Fri, 21 Feb 2025 17:00:00 GMT</pubDate>
</item>
<item>
  <title>Basic Conformal Inference</title>
  <link>https://alexxthiery.github.io/notes/conformal_inference/conformal.html</link>
  <description><![CDATA[ 





<p>Unfortunately, I am totally ignorant about <a href="https://en.wikipedia.org/wiki/Conformal_prediction">conformal inference</a>. However, in today’s seminar, I attended a very interesting talk on the topic, and I think it’s time I try implementing the most basic version of it. It seems like a useful concept, and I might even explain it next semester in my simulation class. What I’ll describe below is the simplest version of conformal inference. There appear to be many extensions and variations of it, most of which I don’t yet understand. For now, I just want to spend a few minutes implementing it myself to ensure I grasp the basic idea.</p>
<p>Consider the (simulated) 1D dataset <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BD%7D%20=%20%5C%7Bx_i,%20y_i%5C%7D_%7Bi=1%7D%5EN"> below; our goal is to build prediction confidence intervals <img src="https://latex.codecogs.com/png.latex?%5BL(x),%20U(x)%5D"> for the target variable <img src="https://latex.codecogs.com/png.latex?y"> given a new input <img src="https://latex.codecogs.com/png.latex?x">. Crucially, we would like these predictions to be well-calibrated in the sense that <img src="https://latex.codecogs.com/png.latex?y%20%5Cin%20%5BL(x),%20U(x)%5D"> with probability <img src="https://latex.codecogs.com/png.latex?90%5C%25">, say.</p>
<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/conformal_inference/data.png" class="img-fluid figure-img" style="width:80.0%"></p>
<figcaption>1D regression dataset</figcaption>
</figure>
</div>
</div>
<p>I am lazy so I will be using a simple KNN regressor to predict the target variable <img src="https://latex.codecogs.com/png.latex?y"> given a new input <img src="https://latex.codecogs.com/png.latex?x">. For this purpose, split the dataset <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BD%7D"> into two parts <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BD%7D_%7B%5Ctext%7Btrain%7D%7D"> and <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BD%7D_%7B%5Ctext%7BCal%7D%7D">. The regressor is fitted on <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BD%7D_%7B%5Ctext%7Btrain%7D%7D">. To calibrate the prediction intervals, compute the residuals <img src="https://latex.codecogs.com/png.latex?r_i%20=%20%7Cy_i%20-%20%5Chat%7By%7D_i%7C"> on the calibration set <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BD%7D_%7B%5Ctext%7BCal%7D%7D">, where <img src="https://latex.codecogs.com/png.latex?%5Chat%7By%7D_i%20=%20%5Chat%7By%7D(x_i)"> is the prediction of the regressor on <img src="https://latex.codecogs.com/png.latex?x_i">. One can then compute the <img src="https://latex.codecogs.com/png.latex?90%5C%25"> quantile <img src="https://latex.codecogs.com/png.latex?%5Cgamma_%7B90%5C%25%7D"> of the residuals: with probability <img src="https://latex.codecogs.com/png.latex?90%5C%25"> we have that <img src="https://latex.codecogs.com/png.latex?y_i%20%5Cin%20%5B%5Chat%7By%7D_i%20-%20%5Cgamma_%7B90%5C%25%7D,%20%5Chat%7By%7D_i%20+%20%5Cgamma_%7B90%5C%25%7D%5D"> on the calibration set, and this can be used to build the prediction intervals, as displayed below:</p>
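<p>The whole recipe fits in a few lines. Below is a minimal sketch with simulated data and a hand-rolled KNN regressor in plain NumPy; the dataset, the noise model and the regressor are assumptions of this sketch, not the ones used for the figures:</p>

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated heteroscedastic 1D dataset (an assumption of this sketch)
n = 2000
x = rng.uniform(0, 5, n)
y = np.sin(2 * x) + 0.3 * (1 + x) * rng.standard_normal(n)

# Split into a training half and a calibration half
idx = rng.permutation(n)
train, cal = idx[: n // 2], idx[n // 2:]

def knn_predict(x_query, x_ref, y_ref, k=30):
    """Naive k-nearest-neighbour regression in 1D."""
    d = np.abs(x_query[:, None] - x_ref[None, :])
    nn = np.argpartition(d, k, axis=1)[:, :k]
    return y_ref[nn].mean(axis=1)

# 90% quantile of the absolute residuals on the calibration set
residuals = np.abs(y[cal] - knn_predict(x[cal], x[train], y[train]))
gamma = np.quantile(residuals, 0.90)

# Prediction interval for a new input: [y_hat - gamma, y_hat + gamma].
# Sanity check: marginal coverage on fresh data should be close to 90%.
x_test = rng.uniform(0, 5, 2000)
y_test = np.sin(2 * x_test) + 0.3 * (1 + x_test) * rng.standard_normal(2000)
covered = np.abs(y_test - knn_predict(x_test, x[train], y[train])) <= gamma
print(covered.mean())    # marginal coverage, close to 0.90
```

<p>Note that the coverage guarantee holds marginally over fresh exchangeable pairs regardless of how bad the KNN regressor is; only the interval width suffers from a poor fit.</p>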
<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/conformal_inference/conformal_basic.png" class="img-fluid figure-img" style="width:80.0%"></p>
<figcaption>1D regression dataset: basic conformal inference</figcaption>
</figure>
</div>
</div>
<p>Not terribly impressive, but at least it is entirely straightforward to implement and it has the correct (marginal) coverage: for a new pair <img src="https://latex.codecogs.com/png.latex?(X,Y)"> coming from the same distribution as the training data, the probability that <img src="https://latex.codecogs.com/png.latex?Y"> falls within the prediction interval is indeed <img src="https://latex.codecogs.com/png.latex?90%5C%25">, up to a bit of nitpicking. Note that it is much, much less impressive than saying that</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D%20%7B%5Cleft(%20Y%20%5Cin%20%5B%5Chat%7By%7D(x)%20-%20%5Cgamma_%7B90%5C%25%7D,%20%5Chat%7By%7D(x)%20+%20%5Cgamma_%7B90%5C%25%7D%5D%20%5C;%20%7C%20%5C;%20X=x%20%5Cright)%7D%20%20=%2090%5C%25,"></p>
<p>which is clearly not true as can be seen from the figure above, but it is a good start. As a matter of fact, I’ve learned today from the very nice talk that without other assumptions, it is impossible to design a procedure that would guarantee the above so-called conditional coverage <span class="citation" data-cites="lei2014distribution">(Lei and Wasserman 2014)</span>. But let’s face it, the figure above is terribly unimpressive. Nevertheless, one can indeed make it slightly less useless by calibrating using a different strategy. For example, I can use the training set to estimate the Mean Absolute Deviation (MAD) of the residuals <img src="https://latex.codecogs.com/png.latex?%5Csigma(x)%20=%20%5Cmathbb%7BE%7D%5B%20%7CY%20-%20%5Chat%7By%7D(x)%20%7C%20%5C;%20%7C%20%5C;%20X=x%5D"> (again with a naive KNN regressor) and use the calibration set to estimate the <img src="https://latex.codecogs.com/png.latex?90%5C%25"> quantile <img src="https://latex.codecogs.com/png.latex?%5Cgamma_%7B90%5C%25%7D"> of the quantities <img src="https://latex.codecogs.com/png.latex?%7Cy_i%20-%20%5Chat%7By%7D_i%7C%20/%20%5Csigma(x_i)">. This allows one to produce calibrated prediction intervals of the type <img src="https://latex.codecogs.com/png.latex?%5B%5Chat%7By%7D_i%20-%20%5Cgamma_%7B90%5C%25%7D%20%5Csigma(x_i),%20%5Chat%7By%7D_i%20+%20%5Cgamma_%7B90%5C%25%7D%20%5Csigma(x_i)%5D">, which are displayed below:</p>
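<p>The adaptive version needs only one extra ingredient: an estimate of the conditional MAD sigma(x), obtained here by a second naive KNN regression on the absolute training residuals. As before, the simulated data and the NumPy regressor are assumptions of this sketch:</p>

```python
import numpy as np

rng = np.random.default_rng(2)

# Same kind of assumed heteroscedastic dataset as in the basic sketch
n = 3000
x = rng.uniform(0, 5, n)
y = np.sin(2 * x) + 0.3 * (1 + x) * rng.standard_normal(n)
idx = rng.permutation(n)
train, cal = idx[: n // 2], idx[n // 2:]

def knn_mean(x_query, x_ref, y_ref, k=30):
    """Naive k-nearest-neighbour regression in 1D."""
    d = np.abs(x_query[:, None] - x_ref[None, :])
    nn = np.argpartition(d, k, axis=1)[:, :k]
    return y_ref[nn].mean(axis=1)

# sigma(x): KNN regression of the absolute residuals on the training set
# (in-sample residuals, slightly optimistic -- the calibration step corrects it)
res_train = np.abs(y[train] - knn_mean(x[train], x[train], y[train]))

# 90% quantile of the *normalized* calibration residuals
scores = np.abs(y[cal] - knn_mean(x[cal], x[train], y[train]))
scores /= knn_mean(x[cal], x[train], res_train)
gamma = np.quantile(scores, 0.90)

# Adaptive interval: [y_hat - gamma * sigma(x), y_hat + gamma * sigma(x)].
# Marginal coverage on fresh data is again close to 90%, but the interval
# is now wider where the noise is larger.
x_test = rng.uniform(0, 5, 2000)
y_test = np.sin(2 * x_test) + 0.3 * (1 + x_test) * rng.standard_normal(2000)
half_width = gamma * knn_mean(x_test, x[train], res_train)
covered = np.abs(y_test - knn_mean(x_test, x[train], y[train])) <= half_width
print(covered.mean())
```
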
<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/conformal_inference/conformal.png" class="img-fluid figure-img" style="width:80.0%"></p>
<figcaption>1D regression dataset: less useless conformal inference</figcaption>
</figure>
</div>
</div>
<p>It is slightly more useful, and it is again surprisingly straightforward to implement, literally 5 lines of code. I think I will have to read more about this in the future and I am pretty sure I will introduce the idea to the next batch of students!</p>
<section id="readings" class="level4">
<h4 class="anchored" data-anchor-id="readings">Readings:</h4>
<ul>
<li>The introduction paper <span class="citation" data-cites="lei2018distribution">(Lei et al. 2018)</span> is really good</li>
<li>I am really curious about <span class="citation" data-cites="gibbs2023conformal">(Gibbs, Cherian, and Candès 2023)</span> and it’s next on my reading list</li>
</ul>



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0">
<div id="ref-gibbs2023conformal" class="csl-entry">
Gibbs, Isaac, John J Cherian, and Emmanuel J Candès. 2023. <span>“Conformal Prediction with Conditional Guarantees.”</span> <em>arXiv Preprint arXiv:2305.12616</em>.
</div>
<div id="ref-lei2018distribution" class="csl-entry">
Lei, Jing, Max G’Sell, Alessandro Rinaldo, Ryan J Tibshirani, and Larry Wasserman. 2018. <span>“Distribution-Free Predictive Inference for Regression.”</span> <em>Journal of the American Statistical Association</em> 113 (523). Taylor &amp; Francis: 1094–1111.
</div>
<div id="ref-lei2014distribution" class="csl-entry">
Lei, Jing, and Larry Wasserman. 2014. <span>“Distribution-Free Prediction Bands for Non-Parametric Regression.”</span> <em>Journal of the Royal Statistical Society Series B: Statistical Methodology</em> 76 (1). Oxford University Press: 71–96.
</div>
</div></section></div> ]]></description>
  <category>conformal</category>
  <guid>https://alexxthiery.github.io/notes/conformal_inference/conformal.html</guid>
  <pubDate>Sat, 07 Dec 2024 17:00:00 GMT</pubDate>
</item>
<item>
  <title>VIASM mini-course on diffusions and flows</title>
  <link>https://alexxthiery.github.io/notes/VIASM_2024/VIASM_2024.html</link>
  <description><![CDATA[ 





<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/VIASM_2024/viasm_2024.jpg" class="img-fluid quarto-figure quarto-figure-center figure-img" style="width:95.0%"></p>
</figure>
</div>
</div>
<p>The slides for this short course on diffusion models (denoising diffusions, probability flows) and other flow methods (stochastic interpolants, flow-matching) are available <a href="https://alexxthiery.github.io/viasm_2024/">here</a>. There are a few animations, so loading the slides may be slow…</p>



 ]]></description>
  <category>diffusion</category>
  <guid>https://alexxthiery.github.io/notes/VIASM_2024/VIASM_2024.html</guid>
  <pubDate>Mon, 29 Jul 2024 17:00:00 GMT</pubDate>
</item>
<item>
  <title>Doob, Girsanov and Bellman</title>
  <link>https://alexxthiery.github.io/notes/HJB/HJB.html</link>
  <description><![CDATA[ 





<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/HJB/bellman.jpg" class="img-fluid figure-img" style="width:35.0%"></p>
<figcaption><a href="https://en.wikipedia.org/wiki/Richard_E._Bellman">Richard Bellman</a> (1920 – 1984)</figcaption>
</figure>
</div>
</div>
<section id="change-of-measure" class="level3">
<h3 class="anchored" data-anchor-id="change-of-measure">Change of measure</h3>
<p>Consider a diffusion in <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BR%7D%5ED"> given by:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Balign*%7D%0AdX_t%20&amp;=%20b(X_t)dt%20+%20%5Csigma(X_t)%20%5C,%20dW_t%5C%5C%0AX_0%20&amp;%5Csim%20p_0(x_0)%0A%5Cend%7Balign*%7D%0A%5Cright.%0A"></p>
<p>for an initial distribution <img src="https://latex.codecogs.com/png.latex?p_0"> and for drift and volatility functions <img src="https://latex.codecogs.com/png.latex?b:%20%5Cmathbb%7BR%7D%5ED%20%5Cto%20%5Cmathbb%7BR%7D%5ED"> and <img src="https://latex.codecogs.com/png.latex?%5Csigma:%20%5Cmathbb%7BR%7D%5ED%20%5Cto%20%5Cmathbb%7BR%7D%5E%7BD%20%5Ctimes%20D%7D">. On the time interval <img src="https://latex.codecogs.com/png.latex?%5B0,T%5D">, this defines a probability <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D"> on the path-space <img src="https://latex.codecogs.com/png.latex?C(%5B0,T%5D;%5Cmathbb%7BR%7D%5ED)">. For two functions <img src="https://latex.codecogs.com/png.latex?f:%20%5B0,T%5D%20%5Ctimes%20%5Cmathbb%7BR%7D%5ED%20%5Cto%20%5Cmathbb%7BR%7D"> and <img src="https://latex.codecogs.com/png.latex?g:%20%5Cmathbb%7BR%7D%5ED%20%5Cto%20%5Cmathbb%7BR%7D">, consider the probability distribution <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BQ%7D"> defined as</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7Bd%20%5Cmathbb%7BQ%7D%7D%7Bd%20%5Cmathbb%7BP%7D%7D%20=%20%5Cfrac%7B1%7D%7B%5Cmathcal%7BZ%7D%7D%20%5Cexp%20%20%7B%5Cleft%5C%7B%20%20%5Cint_0%5ET%20f(X_s)%20%5C,%20ds%20+%20g(X_T)%20%20%5Cright%5C%7D%7D%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BZ%7D"> denotes the normalizing constant <span id="eq-normalizing-constant"><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7BZ%7D%5C;%20=%20%5C;%20%5Cmathbb%7BE%7D%20%7B%5Cleft%5B%20%20%5Cexp%20%20%7B%5Cleft%5C%7B%20%20%5Cint_0%5ET%20f(X_s)%20%5C,%20ds%20+%20g(X_T)%20%20%5Cright%5C%7D%7D%20%20%20%5Cright%5D%7D%20.%0A%5Ctag%7B1%7D"></span></p>
<p>The distribution <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BQ%7D"> places more probability mass on trajectories such that <img src="https://latex.codecogs.com/png.latex?%5Cint_0%5ET%20f(X_s)%20%5C,%20ds%20+%20g(X_T)"> is large. As described in these notes on <a href="../../notes/doob_transforms/doob.html">Doob h-transforms</a>, the path distribution <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BQ%7D"> can be described by a diffusion process <img src="https://latex.codecogs.com/png.latex?X%5E%5Cstar"> with dynamics</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Balign*%7D%0AdX%5E%5Cstar%20&amp;=%20b(X%5E%5Cstar)dt%20+%20%5Csigma(X%5E%5Cstar)%20%5C,%20%20%7B%5Cleft%5C%7B%20%20dW_t%20+%20%20%5Ctextcolor%7Bblue%7D%7Bu%5E%5Cstar(t,%20X%5E%5Cstar)%7D%20%5C,%20dt%20%20%5Cright%5C%7D%7D%20%5C%5C%0AX%5E%5Cstar_0%20&amp;%5Csim%20q_0(x_0)%20=%20p_0(x_0)%20%5C,%20h(0,x_0)%20/%20%5Cmathcal%7BZ%7D%0A%5Cend%7Balign*%7D%0A%5Cright.%0A"></p>
<p>The control function <img src="https://latex.codecogs.com/png.latex?%20%5Ctextcolor%7Bblue%7D%7Bu%5E%5Cstar:%20%5B0,T%5D%20%5Ctimes%20%5Cmathbb%7BR%7D%5ED%20%5Cto%20%5Cmathbb%7BR%7D%5ED%7D"> is of the gradient form</p>
<p><span id="eq-u-star"><img src="https://latex.codecogs.com/png.latex?%0A%5Ctextcolor%7Bblue%7D%7Bu%5E%5Cstar(t,%20x)%7D%20%5C;%20=%20%5C;%20%5Csigma%5E%5Ctop(x)%20%5C,%20%5Cnabla%20%5Clog%5B%20%20%5Ctextcolor%7Bgreen%7D%7Bh(t,x)%7D%20%5D%0A%5Ctag%7B2%7D"></span></p>
<p>and the function <img src="https://latex.codecogs.com/png.latex?%20%5Ctextcolor%7Bgreen%7D%7Bh(t,x)%7D"> is described by the conditional expectation,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Ctextcolor%7Bgreen%7D%7Bh(t,x)%20=%20%5Cmathbb%7BE%7D%20%7B%5Cleft%5B%20%20%5Cexp%20%20%7B%5Cleft%5C%7B%20%20%5Cint_t%5ET%20f(X_s)%20%5C,%20ds%20+%20g(X_T)%20%20%5Cright%5C%7D%7D%20%20%20%5Cmid%20X_t%20=%20x%20%20%5Cright%5D%7D%20%7D.%0A"></p>
<p>The expression <img src="https://latex.codecogs.com/png.latex?%20%5Ctextcolor%7Bblue%7D%7Bu%5E%5Cstar(t,%20x)%7D%20%5C;%20=%20%5C;%20%5Csigma%5E%5Ctop(x)%20%5C,%20%5Cnabla%20%5Clog%5B%20%20%5Ctextcolor%7Bgreen%7D%7Bh(t,x)%7D%20%5D"> is intuitive; to describe the tilted measure <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BQ%7D"> that places more probability mass on trajectories such that <img src="https://latex.codecogs.com/png.latex?%5Cexp%20%7B%5Cleft%5C%7B%20%20%5Cint_0%5ET%20f(X_s)%20%5C,%20ds%20+%20g(X_T)%20%20%5Cright%5C%7D%7D%20"> is large, the optimal control <img src="https://latex.codecogs.com/png.latex?u%5E%5Cstar(t,x)"> should point towards promising states, i.e.&nbsp;states such that the expected “reward-to-go” quantity <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D%20%7B%5Cleft%5B%20%20%5Cexp%20%20%7B%5Cleft%5C%7B%20%20%5Cint_t%5ET%20f(X_s)%20%5C,%20ds%20+%20g(X_T)%20%20%5Cright%5C%7D%7D%20%20%20%5Cmid%20X_t%20=%20x%20%20%5Cright%5D%7D%20"> is large.</p>
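<p>When no closed form is available, h(t,x), and hence the control, can always be approximated by brute-force Monte Carlo over the uncontrolled dynamics. The sketch below uses an assumed toy case (not from the note): X a standard Brownian motion (b = 0, sigma = 1), f = 0 and g(x) = -x^2/2, for which the Gaussian integral gives h(t,x) = exp(-x^2/(2(1+T-t)))/sqrt(1+T-t) and u*(t,x) = -x/(1+T-t), so the Monte-Carlo estimate can be checked exactly:</p>

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed toy case: b = 0, sigma = 1 (standard Brownian motion),
# f = 0 and g(x) = -x**2 / 2.  Since X_T | X_t = x  ~  N(x, T - t),
# the Gaussian integral gives h and u* in closed form.
T, t, x = 1.0, 0.3, 0.7
tau = T - t                                           # remaining time

samples = x + np.sqrt(tau) * rng.standard_normal(1_000_000)
h_mc = np.exp(-samples**2 / 2).mean()                 # Monte Carlo h(t, x)

h_exact = np.exp(-x**2 / (2 * (1 + tau))) / np.sqrt(1 + tau)
u_star = -x / (1 + tau)                               # sigma * d/dx log h(t, x)
print(h_mc, h_exact, u_star)
```

<p>The Monte-Carlo and closed-form values of h agree to a few decimals; differentiating log h in x then recovers the control pointing back towards the origin, where the terminal reward g is largest.</p>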
</section>
<section id="variational-formulation" class="level3">
<h3 class="anchored" data-anchor-id="variational-formulation">Variational Formulation</h3>
<p>To obtain a variational description of the optimal control function <img src="https://latex.codecogs.com/png.latex?%20%5Ctextcolor%7Bblue%7D%7Bu%5E%5Cstar%7D">, it suffices to express it as the solution of an optimization problem of the form</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Au%5E%5Cstar%20%5C;%20=%20%5C;%20%5Cmathop%7B%5Cmathrm%7Bargmin%7D%7D%5C;%20%5Ctextrm%7BDist%7D%20%7B%5Cleft(%20q%20%5Cotimes%20%5Cmathbb%7BP%7D%5E%7Bu%7D%20%5C,%20,%20%5C,%20%5Cmathbb%7BQ%7D%20%5Cright)%7D%0A"></p>
<p>for an appropriately chosen distance. Here <img src="https://latex.codecogs.com/png.latex?q%20%5Cotimes%20%5Cmathbb%7BP%7D%5E%7Bu%7D"> is the probability distribution of the controlled diffusion <img src="https://latex.codecogs.com/png.latex?X%5Eu"> with dynamics</p>
<p><span id="eq-Xu"><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Balign*%7D%0AdX%5Eu%20&amp;=%20b(X%5Eu)dt%20+%20%5Csigma(X%5Eu)%20%5C,%20%20%7B%5Cleft%5C%7B%20%20dW_t%20+%20%20%5Ctextcolor%7Bblue%7D%7Bu(t,%20X%5Eu)%7D%20%5C,%20dt%20%20%5Cright%5C%7D%7D%20%5C%5C%0AX%5Eu_0%20&amp;%5Csim%20q(x_0)%0A%5Cend%7Balign*%7D%0A%5Cright.%0A%5Ctag%7B3%7D"></span></p>
<p>for some control function <img src="https://latex.codecogs.com/png.latex?u:%20%5B0,T%5D%20%5Ctimes%20%5Cmathbb%7BR%7D%5ED%20%5Cto%20%5Cmathbb%7BR%7D%5ED"> and initial distribution <img src="https://latex.codecogs.com/png.latex?q(x_0)">. Note that <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BQ%7D=%20q_0%20%5Cotimes%20%5Cmathbb%7BP%7D%5E%7Bu%5E%5Cstar%7D">. The KL-divergence is a natural (pseudo-)distance since it deals elegantly with the intractable constant <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BZ%7D"> and the ratio <img src="https://latex.codecogs.com/png.latex?d%20%5Cmathbb%7BP%7D%5E%7Bu%7D%20/%20d%20%5Cmathbb%7BQ%7D"> is easy to compute. The <a href="../../notes/girsanov/girsanov.html">Girsanov theorem</a> gives that</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign*%7D%0A%5Cfrac%7Bd%5Cmathbb%7BQ%7D%7D%7Bd%5Bq%20%5Cotimes%20%5Cmathbb%7BP%7D%5Eu%5D%7D(X%5Eu)%20&amp;=%0A%5Cfrac%7Bp_0(X%5Eu_0)%7D%7Bq(X%5Eu_0)%7D%20%5C,%20%5Cexp%5CBig%5C%7B%20%5Cint_%7B0%7D%5E%7BT%7D%20(f-%5Ctfrac12%20%5C%7Cu%5C%7C%5E2)(s,%20X%5Eu_s)%20%20%5C,%20ds%5C%5C%0A&amp;-%20%5Cint_%7B0%7D%5E%7BT%7D%20u(s,X%5Eu_s)%5E%5Ctop%20%5C,%20dW_s%20+%20g(X%5Eu_T)%20%5CBig%5C%7D%20/%20%5Cmathcal%7BZ%7D.%0A%5Cend%7Balign*%7D%0A"></p>
<p>From this expression, one can readily write down <img src="https://latex.codecogs.com/png.latex?D_%7B%5Ctext%7BKL%7D%7D(q%20%5Cotimes%20%5Cmathbb%7BP%7D%5Eu,%5Cmathbb%7BQ%7D)">. Minimizing <img src="https://latex.codecogs.com/png.latex?D_%7B%5Ctext%7BKL%7D%7D(q%20%5Cotimes%20%5Cmathbb%7BP%7D%5Eu,%5Cmathbb%7BQ%7D)"> over the control <img src="https://latex.codecogs.com/png.latex?u"> and the initial distribution <img src="https://latex.codecogs.com/png.latex?q"> shows that the optimal pair is:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A(q_0,%20u%5E%5Cstar)%0A%5C;%20=%20%5C;%0A%5Cmathop%7B%5Cmathrm%7Bargmin%7D%7D_%7Bq,u%7D%20%5C;%20D_%7B%5Ctext%7BKL%7D%7D(q,%20p_0)%20+%0A%5Cmathbb%7BE%7D%20%7B%5Cleft%5B%20%20%5Cint_%7B0%7D%5E%7BT%7D%20(%5Ctfrac%2012%20%5C%7Cu%5C%7C%5E2%20-%20f)(s,%20X%5Eu_s)%20%20%5C,%20ds%20-%20g(X%5Eu_T)%20%20%5Cright%5D%7D%20.%0A"></p>
<p>Minimizing this loss seeks a control that makes the quantity <img src="https://latex.codecogs.com/png.latex?%5Cint_%7B0%7D%5E%7BT%7D%20f(X%5Eu_s)%20%5C,%20ds%20+%20g(X%5Eu_T)"> large while keeping the control effort <img src="https://latex.codecogs.com/png.latex?%5Cint_%7B0%7D%5E%7BT%7D%20%5C%7Cu(s,X%5Eu_s)%5C%7C%5E2%20%5C,%20ds"> small. Equivalently, this can be expressed as a maximization problem,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A(q_0,%20%5C,%20u%5E%5Cstar)%0A%5C;%20=%20%5C;%0A%5Cmathop%7B%5Cmathrm%7Bargmax%7D%7D_%7Bq,u%7D%20%5C;%0A%5Cmathbb%7BE%7D%20%7B%5Cleft%5B%20%20%5Cint_%7B0%7D%5E%7BT%7D%20(f-%5Ctfrac%2012%20%5C%7Cu%5C%7C%5E2)(s,%20X%5Eu_s)%20%20%5C,%20ds%20+%20g(X%5Eu_T)%20%20%5Cright%5D%7D%20%20-%20D_%7B%5Ctext%7BKL%7D%7D(q,%20p_0).%0A"></p>
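<p>This maximization can be checked numerically. The sketch below is a minimal Python/NumPy illustration with choices of my own (not imposed by the theory): zero drift, unit volatility, no running reward, quadratic terminal reward, and a deterministic start so that the KL term vanishes. It estimates the objective by Euler–Maruyama simulation for two controls.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choices (mine, not from the note): b = 0, sigma = 1, f = 0,
# terminal reward g(x) = -x**2 / 2, deterministic start q = p0 = delta_{x0}.
T, N, M, x0 = 1.0, 200, 50_000, 0.5
dt = T / N
g = lambda x: -0.5 * x**2

def objective(u_fn):
    """Euler-Maruyama estimate of E[ int (f - 0.5*||u||^2) dt + g(X^u_T) ]."""
    x = np.full(M, x0)
    running = np.zeros(M)
    for k in range(N):
        u = u_fn(k * dt, x)
        dW = rng.normal(scale=np.sqrt(dt), size=M)
        running += -0.5 * u**2 * dt
        x = x + u * dt + dW        # dX = sigma * (dW + u dt) with b = 0, sigma = 1
    return np.mean(running + g(x))

# Closed form here: X_T ~ N(x0, T) under P, so log Z = log E[exp(g(X_T))].
log_Z = -0.5 * np.log(1 + T) - 0.5 * x0**2 / (1 + T)

elbo_zero = objective(lambda t, x: 0.0 * x)           # u = 0: strict lower bound
elbo_opt = objective(lambda t, x: -x / (1 + T - t))   # u* = grad log h for this g
print(elbo_zero, elbo_opt, log_Z)
```

<p>The uncontrolled value lands strictly below <code>log_Z</code>, while the optimal feedback recovers it up to discretization and Monte-Carlo error.</p>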
<p>Note that since <img src="https://latex.codecogs.com/png.latex?q_0%20%5Cotimes%20%5Cmathbb%7BP%7D%5E%7Bu%5E%5Cstar%7D%20=%20%5Cmathbb%7BQ%7D=%20%5Cmathbb%7BP%7D%5C,%20e%5E%7B%5Cint_0%5ET%20f%20%5C,%20ds%20+%20g%7D%20/%20%5Cmathcal%7BZ%7D">, the optimal control <img src="https://latex.codecogs.com/png.latex?u%5E%5Cstar"> is such that for <strong>any trajectory</strong> we have:</p>
<p><span id="eq-logZ-pathwise"><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign*%7D%0A%5Clog%20%5Cmathcal%7BZ%7D%0A=%20%5Clog%20%5Cfrac%7Bp_0(X%5E%7Bu%5E%7B%5Cstar%7D%7D_0)%7D%7Bq_0(X%5E%7Bu%5E%7B%5Cstar%7D%7D_0)%7D%20&amp;+%20%5C,%20%5Cint_%7B0%7D%5E%7BT%7D%20(f-%5Ctfrac12%20%5C%7Cu%5E%5Cstar%5C%7C%5E2)(s,%20X%5E%7Bu%5E%7B%5Cstar%7D%7D_s)%20%5C,%20ds%20+%20g(X%5E%7Bu%5E%7B%5Cstar%7D%7D_T)%20%5C%5C%0A&amp;%5Cquad%20-%20%5Cint_%7B0%7D%5E%7BT%7D%20u%5E%7B%5Cstar%7D(s,X%5E%7Bu%5E%7B%5Cstar%7D%7D_s)%5E%5Ctop%20%5C,%20dW_s%20.%0A%5Cend%7Balign*%7D%0A%5Ctag%7B4%7D"></span></p>
</section>
<section id="stochastic-control" class="level3">
<h3 class="anchored" data-anchor-id="stochastic-control">Stochastic Control</h3>
<p>In the previous section, there was nothing special about the starting point <img src="https://latex.codecogs.com/png.latex?x_0"> and the time horizon <img src="https://latex.codecogs.com/png.latex?T%3E0">. This means that the same derivation gives the solution to the following stochastic optimal control problem. Consider the reward-to-go function (a.k.a. value function) defined as</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AV(t,x)%20=%20%5Csup_u%20%5C;%20%5Cmathbb%7BE%7D%20%7B%5Cleft%5B%20%20%5Cint_%7Bt%7D%5E%7BT%7D(f-%5Ctfrac%2012%20%5C%7Cu%5C%7C%5E2)(s,%20X%5Eu_s)%20%5C,%20ds%20+%20g(X%5Eu_T)%20%5Cmid%20X_t%20=%20x%20%5Cright%5D%7D%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?X%5Eu"> is the controlled diffusion of Equation&nbsp;3. We have that</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0AV(t,x)%0A&amp;=%20%5Clog%20%5Cmathbb%7BE%7D%20%7B%5Cleft%5B%20%20%5Cexp%20%20%7B%5Cleft%5C%7B%20%20%5Cint_t%5ET%20f(X_s)%20%5C,%20ds%20+%20g(X_T)%20%20%5Cright%5C%7D%7D%20%20%5Cmid%20X_t%20=%20x%20%5Cright%5D%7D%20%5C%5C%0A&amp;=%20%5Clog%5B%20%20%5Ctextcolor%7Bgreen%7D%7Bh(t,%20x)%7D%20%5D.%0A%5Cend%7Balign%7D%0A"></p>
<p>This shows that the optimal control <img src="https://latex.codecogs.com/png.latex?u%5E%5Cstar"> can also be expressed as</p>
<p><span id="eq-u-star-V"><img src="https://latex.codecogs.com/png.latex?%0Au%5E%5Cstar(t,x)%20=%20%5Csigma%5E%5Ctop(x)%20%5Cnabla%20%5Clog%5B%20%20%5Ctextcolor%7Bgreen%7D%7B%20h(t,x)%20%7D%5D%0A=%20%5Csigma%5E%5Ctop(x)%20%5C,%20%5Cnabla%20V(t,x)%20.%0A%5Ctag%7B5%7D"></span></p>
<p>The expression <img src="https://latex.codecogs.com/png.latex?%5Csigma%5E%5Ctop(x)%20%5C,%20%5Cnabla%20V(t,x)"> is intuitive: since we are trying to maximize the reward-to-go function, the optimal control should push in the direction of its gradient.</p>
</section>
<section id="hamilton-jacobi-bellman" class="level3">
<h3 class="anchored" data-anchor-id="hamilton-jacobi-bellman">Hamilton-Jacobi-Bellman</h3>
<p>Finally, let us mention that one can easily derive the <a href="https://en.wikipedia.org/wiki/Hamilton–Jacobi–Bellman_equation">Hamilton-Jacobi-Bellman</a> equation for the reward-to-go function <img src="https://latex.codecogs.com/png.latex?V(t,x)">. We have</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AV(t,x)%20=%20%5Csup_u%20%5C;%20%5Cmathbb%7BE%7D%20%7B%5Cleft%5B%20%20%5Cint_%7Bt%7D%5ET%20C_s%20%5C,%20ds%20+%20g(X%5Eu_T)%20%5Cright%5D%7D%0A"></p>
<p>with <img src="https://latex.codecogs.com/png.latex?C_s%20=%20-%5Ctfrac12%20%5C%7Cu(s,X%5Eu_s)%5C%7C%5E2%20+%20f(X%5Eu_s)">. For <img src="https://latex.codecogs.com/png.latex?%5Cdelta%20%5Cll%201">, we have</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0AV(t,x)%0A&amp;%5C;%20=%20%5C;%0A%5Csup_u%20%5C;%20%20%7B%5Cleft%5C%7B%20%20C_t%20%5C,%20%5Cdelta%20+%20%5Cmathbb%7BE%7D%20%7B%5Cleft%5B%20%20V(t+%5Cdelta,%20X%5Eu_%7Bt+%5Cdelta%7D)%20%5Cmid%20X%5Eu_t=x%20%5Cright%5D%7D%20%20%20%5Cright%5C%7D%7D%20%20+%20o(%5Cdelta)%5C%5C%0A&amp;%5C;%20=%20%5C;%0AV(t,x)%20+%20%5Cdelta%20%5C,%20%5Csup_%7Bu(t,x)%7D%20%5C;%20%20%7B%5Cleft%5C%7B%20%20C_t%20+%20(%5Cpartial_t%20+%20%5Cmathcal%7BL%7D+%20%5Csigma(x)%20%5C,%20u(t,x)%20%5C,%20%5Cnabla)%20%5C,%20V(t,x)%20%5Cright%5C%7D%7D%20%20+%20o(%5Cdelta)%0A%5Cend%7Balign%7D%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7D=%20b%20%5Cnabla%20+%20%5Ctfrac12%20%5C,%20%5Csigma%20%5Csigma%5E%5Ctop%20:%20%5Cnabla%5E2"> is the generator of the uncontrolled diffusion. Since <img src="https://latex.codecogs.com/png.latex?C_t%20=%20-%5Ctfrac12%20%5C%7Cu(t,x)%5C%7C%5E2%20+%20f(x)"> is a simple quadratic function of the control, the supremum over <img src="https://latex.codecogs.com/png.latex?u(t,x)"> can be computed in closed form,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0Au%5E%5Cstar(t,x)%0A&amp;=%20%5Cmathop%7B%5Cmathrm%7Bargmax%7D%7D_%7Bz%20%5Cin%20%5Cmathbb%7BR%7D%5ED%7D%20%5C;%20-%5Ctfrac12%20%5C%7Cz%5C%7C%5E2%20+%20%5Cleft%3C%20z,%20%5Csigma%5E%5Ctop(x)%20%5Cnabla%20V(t,x)%20%20%5Cright%3E%5C%5C%0A&amp;=%20%5Csigma%5E%5Ctop(x)%20%5C,%20%5Cnabla%20V(t,x),%0A%5Cend%7Balign%7D%0A"></p>
<p>as we already knew from Equation&nbsp;5. This implies that the reward-to-go function <img src="https://latex.codecogs.com/png.latex?V(t,x)"> satisfies the HJB equation</p>
<p><span id="eq-hjb"><img src="https://latex.codecogs.com/png.latex?%0A%7B%5Cleft(%20%5Cpartial_t%20+%20%5Cmathcal%7BL%7D%20%5Cright)%7D%20V%20+%20%5Cfrac12%20%5C%7C%20%5Csigma%5E%5Ctop%20%5Cnabla%20V%20%5C%7C%5E2%20+%20f%20=%200%0A%5Ctag%7B6%7D"></span></p>
<p>with terminal condition <img src="https://latex.codecogs.com/png.latex?V(T,x)%20=%20g(x)">. Another route to derive Equation&nbsp;6 is to use the fact that <img src="https://latex.codecogs.com/png.latex?V(t,x)%20=%20%5Clog%20h(t,x)">: since the <a href="https://en.wikipedia.org/wiki/Feynman–Kac_formula">Feynman-Kac</a> formula gives that the function <img src="https://latex.codecogs.com/png.latex?h(t,x)"> satisfies <img src="https://latex.codecogs.com/png.latex?(%5Cpartial_t%20+%20%5Cmathcal%7BL%7D+%20f)%20h%20=%200">, the conclusion follows from a few lines of algebra, starting by writing <img src="https://latex.codecogs.com/png.latex?%5Cpartial_t%20V%20=%20h%5E%7B-1%7D%20%5C,%20%5Cpartial_t%20h%20=%20-h%5E%7B-1%7D(%5Cmathcal%7BL%7D+%20f)%5Bh%5D">, expanding <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7Dh">, and expressing everything back in terms of <img src="https://latex.codecogs.com/png.latex?V">. The term <img src="https://latex.codecogs.com/png.latex?%5C%7C%5Csigma%5E%5Ctop%20%20%5Cnabla%20V%5C%7C%5E2"> arises naturally when expressing the diffusion term <img src="https://latex.codecogs.com/png.latex?%5Csigma%20%5Csigma%5E%5Ctop%20:%20%5Cnabla%5E2%20h"> in terms of the second derivatives of <img src="https://latex.codecogs.com/png.latex?V">; this is the idea behind the standard <a href="https://en.wikipedia.org/wiki/Cole–Hopf_transformation">Cole-Hopf transformation</a>.</p>
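<p>Equation&nbsp;6 is easy to check in a tractable scalar case. With the illustrative choices (mine, not from the note) of zero drift, unit volatility, no running reward and linear terminal reward <code>g(x) = a*x</code>, Feynman-Kac gives <code>h(t,x) = exp(a*x + a**2*(T-t)/2)</code> for Brownian motion, and the HJB residual of <code>V = log h</code> vanishes; the sketch below verifies this by finite differences.</p>

```python
# Illustrative scalar case (my choice): b = 0, sigma = 1, f = 0, g(x) = a*x.
# For Brownian motion, Feynman-Kac gives h(t,x) = exp(a*x + a**2*(T - t)/2),
# so the value function is V(t,x) = log h(t,x) = a*x + a**2*(T - t)/2.
a, T = 0.8, 1.0
V = lambda t, x: a * x + a**2 * (T - t) / 2

# Check the HJB equation by central finite differences:
# dV/dt + (1/2) d2V/dx2 + (1/2) (dV/dx)**2 + f = 0, with L = (1/2) d2/dx2 here.
t0, x0, e = 0.4, -1.3, 1e-4
dV_dt = (V(t0 + e, x0) - V(t0 - e, x0)) / (2 * e)
dV_dx = (V(t0, x0 + e) - V(t0, x0 - e)) / (2 * e)
d2V_dx2 = (V(t0, x0 + e) - 2 * V(t0, x0) + V(t0, x0 - e)) / e**2
residual = dV_dt + 0.5 * d2V_dx2 + 0.5 * dV_dx**2
print(residual)  # numerically zero
```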
<p>Finally, Ito’s lemma and Equation&nbsp;6 give that for <img src="https://latex.codecogs.com/png.latex?t_1%20%3C%20t_2">, the optimally controlled diffusion <img src="https://latex.codecogs.com/png.latex?X%5E%7Bu%5E%5Cstar%7D"> satisfies:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign*%7D%0AV(t_2,%20X%5E%7Bu%5E%5Cstar%7D_%7Bt_2%7D)%0A-%0AV(t_1,%20X%5E%7Bu%5E%5Cstar%7D_%7Bt_1%7D)%20&amp;=%0A%5Cint_%7Bt_1%7D%5E%7Bt_2%7D%20(%5Ctfrac12%20%5C,%20%5C%7Cu%5E%5Cstar%5C%7C%5E2%20-%20f)(s,X%5E%7Bu%5E%5Cstar%7D_s)%20%5C,%20ds%5C%5C%0A&amp;%5Cquad%20+%20%5Cint_%7Bt_1%7D%5E%7Bt_2%7D%20u%5E%5Cstar(s,X%5E%7Bu%5E%5Cstar%7D_s)%5E%5Ctop%20%5C,%20dW_s.%0A%5Cend%7Balign*%7D%0A"></p>
<p>Since <img src="https://latex.codecogs.com/png.latex?V(T,x_T)%20=%20g(x_T)"> and <img src="https://latex.codecogs.com/png.latex?V(0,x_0)%20=%20%5Clog%20%5Cmathcal%7BZ%7D+%20%5Clog%20%5Cfrac%7Bq_0(x_0)%7D%7Bp_0(x_0)%7D">, applying this identity between times <img src="https://latex.codecogs.com/png.latex?t_1=0"> and <img src="https://latex.codecogs.com/png.latex?t_2=T"> recovers the formula Equation&nbsp;4 for the log-normalizing constant <img src="https://latex.codecogs.com/png.latex?%5Clog%20%5Cmathcal%7BZ%7D">.</p>
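<p>The striking feature of Equation&nbsp;4, namely that the same value is obtained on <strong>every</strong> trajectory, can be seen numerically. In the tractable scalar case (an illustrative choice of mine: zero drift, unit volatility, no running reward, <code>g(x) = a*x</code>, deterministic start <code>q0 = p0</code>), one has <code>u*(t,x) = a</code> and <code>log Z = a*x0 + a**2*T/2</code>; the discretized pathwise estimator below has exactly zero variance.</p>

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative tractable case (my choice): b = 0, sigma = 1, f = 0, g(x) = a*x,
# deterministic start q0 = p0 = delta_{x0}.  Then h(t,x) = exp(a*x + a^2(T-t)/2),
# so u*(t,x) = a is constant and log Z = a*x0 + a**2*T/2.
a, x0, T, N, M = 0.7, -0.3, 1.0, 100, 10
dt = T / N

log_Z_paths = np.zeros(M)
for i in range(M):
    x, stoch_int = x0, 0.0
    for _ in range(N):
        dW = rng.normal() * np.sqrt(dt)
        stoch_int += a * dW              # int u* dW
        x += a * dt + dW                 # dX = sigma * (dW + u* dt)
    # right-hand side of Equation 4 (here f = 0 and q0 = p0)
    log_Z_paths[i] = -0.5 * a**2 * T + a * x - stoch_int

print(log_Z_paths)                # identical on every trajectory
print(a * x0 + a**2 * T / 2)      # the common value: log Z
```

<p>The Brownian increments cancel exactly in the pathwise identity, so even the Euler discretization returns the same number on every path, up to floating-point error.</p>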


</section>

 ]]></description>
  <category>SDE</category>
  <category>markov</category>
  <guid>https://alexxthiery.github.io/notes/HJB/HJB.html</guid>
  <pubDate>Mon, 10 Jun 2024 17:00:00 GMT</pubDate>
</item>
<item>
  <title>Girsanov and Importance Sampling</title>
  <link>https://alexxthiery.github.io/notes/girsanov/girsanov.html</link>
  <description><![CDATA[ 





<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/girsanov/girsanov_portrait.jpg" class="img-fluid figure-img" style="width:35.0%"></p>
<figcaption><a href="https://en.wikipedia.org/wiki/Igor_Girsanov">Igor Girsanov</a> (1934 – 1967)</figcaption>
</figure>
</div>
</div>
<p>Let <img src="https://latex.codecogs.com/png.latex?q(dx)%20%5Cequiv%20%5Cmathcal%7BN%7D(%5Cmu,%5CGamma)"> be the Gaussian distribution with mean <img src="https://latex.codecogs.com/png.latex?%5Cmu%20%5Cin%20%5Cmathbb%7BR%7D%5ED"> and covariance matrix <img src="https://latex.codecogs.com/png.latex?%5CGamma%20%5Cin%20%5Cmathbb%7BR%7D%5E%7BD%20%5Ctimes%20D%7D">. For a direction <img src="https://latex.codecogs.com/png.latex?u%20%5Cin%20%5Cmathbb%7BR%7D%5ED">, consider the distribution <img src="https://latex.codecogs.com/png.latex?q%5E%7Bu%7D(dx)%20%5Cequiv%20%5Cmathcal%7BN%7D(%5Cmu%20+%20%5CGamma%5E%7B1/2%7D%20%5C,%20u,%20%5CGamma)">, i.e.&nbsp;the same Gaussian distribution shifted by an amount <img src="https://latex.codecogs.com/png.latex?%5CGamma%5E%7B1/2%7D%20%5C,%20u">. Direct algebra gives that</p>
<p><span id="eq-girsanov"><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7Bq%5E%7Bu%7D(x)%7D%7Bq(x)%7D%0A=%0A%5Cexp%20%7B%5Cleft%5C%7B%20-%20%5Cfrac%7B1%7D%7B2%7D%20%5C%7C%20u%5C%7C%5E2%20+%20%5Cleft%3C%20u,%20%5C,%20%5CGamma%5E%7B-1/2%7D(x-%5Cmu)%20%5Cright%3E%20%5Cright%5C%7D%7D%20.%0A%5Ctag%7B1%7D"></span></p>
<p>We will see that, not very surprisingly, a similar change-of-probability result holds in continuous time. On the time interval <img src="https://latex.codecogs.com/png.latex?%5B0,T%5D">, let <img src="https://latex.codecogs.com/png.latex?W_t"> be a standard Brownian motion in <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BR%7D%5ED"> and <img src="https://latex.codecogs.com/png.latex?X_t"> be the solution to the <a href="https://en.wikipedia.org/wiki/Stochastic_differential_equation">SDE</a></p>
<p><span id="eq-sde-original"><img src="https://latex.codecogs.com/png.latex?%0AdX_t%20%5C;%20=%20%5C;%20b(X_t)%20%5C,%20dt%20+%20%5Csigma(X_t)%20%5C,%20dW_t%0A%5Ctag%7B2%7D"></span></p>
<p>for some drift <img src="https://latex.codecogs.com/png.latex?b:%20%5Cmathbb%7BR%7D%5ED%20%5Cto%20%5Cmathbb%7BR%7D%5ED">, diffusion coefficient <img src="https://latex.codecogs.com/png.latex?%5Csigma:%20%5Cmathbb%7BR%7D%5ED%20%5Cto%20%5Cmathbb%7BR%7D%5E%7BD%20%5Ctimes%20D%7D">, and initial distribution <img src="https://latex.codecogs.com/png.latex?%5Cmu_0(dx_0)">. This SDE defines a probability measure <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D"> on the <a href="https://en.wikipedia.org/wiki/Classical_Wiener_space">path-space</a> <img src="https://latex.codecogs.com/png.latex?C(%5B0,T%5D;%20%5Cmathbb%7BR%7D%5ED)">, the space of continuous functions from <img src="https://latex.codecogs.com/png.latex?%5B0,T%5D"> to <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BR%7D%5ED">. Consider a perturbation drift function <img src="https://latex.codecogs.com/png.latex?u:%20%5Cmathbb%7BR%7D%5ED%20%20%5Cto%20%5Cmathbb%7BR%7D%5ED"> and the associated perturbed SDE given by</p>
<p><span id="eq-sde-perturbed"><img src="https://latex.codecogs.com/png.latex?%0AdX_t%5Eu%20%5C;%20=%20%5C;%20b(X_t%5Eu)%20%5C,%20dt%20+%20%5Csigma(X_t%5Eu)%20%5C,%20%20%7B%5Cleft%5C%7B%20%20dW_t%20+%20%20%5Ctextcolor%7Bblue%7D%7Bu(X_t%5Eu)%20%5C,%20dt%7D%20%20%5Cright%5C%7D%7D%20.%0A%5Ctag%7B3%7D"></span></p>
<p>This perturbed SDE, started from the same initial distribution <img src="https://latex.codecogs.com/png.latex?%5Cmu_0(dx_0)">, defines a probability measure <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D%5Eu"> on the path-space <img src="https://latex.codecogs.com/png.latex?C(%5B0,T%5D;%20%5Cmathbb%7BR%7D%5ED)">, and it is often useful to understand the Radon-Nikodym derivative of <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D%5Eu"> with respect to <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D">. I have never really liked the way this is <a href="https://en.wikipedia.org/wiki/Girsanov_theorem">usually</a> derived, and also never really remember the result. It takes only a few lines of algebra to re-derive these results, at least informally. For this purpose, consider a simple <a href="https://en.wikipedia.org/wiki/Euler–Maruyama_method">Euler discretization</a> of the SDE with time step <img src="https://latex.codecogs.com/png.latex?%5Cdelta%20=%20T/N"> for <img src="https://latex.codecogs.com/png.latex?N%20%5Cgg%201">. Consider a discretized path <img src="https://latex.codecogs.com/png.latex?(x_0,%20x_%7B%5Cdelta%7D,%20%5Cldots,%20x_%7BT%7D)"> of Equation&nbsp;2 obtained by iterating the update</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ax_%7Bt_%7Bk+1%7D%7D%20%5C;%20=%20%5C;%20x_%7Bt_k%7D%20+%20b(x_%7Bt_k%7D)%5C,%5Cdelta%20+%20%5Csigma(x_%7Bt_k%7D)%20%5C,%20(%5CDelta%20W_%7Bt_k%7D)%0A"></p>
<p>with <img src="https://latex.codecogs.com/png.latex?t_k%20=%20k%5Cdelta"> and <img src="https://latex.codecogs.com/png.latex?%5CDelta%20W_%7Bt_k%7D%20=%20W_%7Bt_%7Bk+1%7D%7D%20-%20W_%7Bt_k%7D">. The probability of observing such a path reads <img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7B1%7D%7B%5Cmathcal%7BZ%7D%7D%20%5C,%20%5Cmu_0(x_0)%20%5C,%20%5Cexp%20%7B%5Cleft%5C%7B%20-%5Cfrac%7B1%7D%7B2%20%5Cdelta%7D%20%5Csum_%7Bk=0%7D%5E%7BN-1%7D%0A%5C%7Cx_%7Bt_%7Bk+1%7D%7D%20-%20%5Bx_%7Bt_k%7D%20+%20b(x_%7Bt_k%7D)%5C,%5Cdelta%5D%5C%7C%5E2_%7B%5CGamma%5E%7B-1%7D(x_%7Bt_k%7D)%20%7D%20%5Cright%5C%7D%7D%0A"></p>
<p>with <img src="https://latex.codecogs.com/png.latex?%5CGamma(x)%20%5Cequiv%20%5Csigma(x)%20%5Csigma%5E%5Ctop(x)"> the volatility matrix and an irrelevant multiplicative constant <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BZ%7D">. One obtains a similar expression for a discretized path of the perturbed SDE Equation&nbsp;3 and the ratio of these two quantities equals</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7Bd%20%5Cwidetilde%7B%5Cmathbb%7BP%7D%7D%5E%7Bu%7D%7D%7Bd%20%5Cwidetilde%7B%5Cmathbb%7BP%7D%7D%7D(x)%20=%20%5Cexp%20%7B%5Cleft%5C%7B%20%5Csum_%7Bk=0%7D%5E%7BN-1%7D%20-%5Cfrac%7B%5Cdelta%7D%7B2%7D%20%5C%7Cu(x_%7Bt_k%7D)%5C%7C%5E2%20%20+%0A%5Cleft%3C%20x_%7Bt_%7Bk+1%7D%7D-x_%7Bt_k%7D-b(x_%7Bt_k%7D)%5Cdelta,%20%5Csigma(x_%7Bt_k%7D)%20%5C,%20u(x_%7Bt_k%7D)%20%5Cright%3E_%7B%5CGamma%5E%7B-1%7D(x_%7Bt_k%7D)%7D%20%20%5Cright%5C%7D%7D%20.%0A"></p>
<p>where the tilde notation denotes the discretized version of the measures. Since</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ax_%7Bt_%7Bk+1%7D%7D-x_%7Bt_k%7D-b(x_%7Bt_k%7D)%5Cdelta%20=%20%5Csigma(x_%7Bt_k%7D)%20%5C,%20%5CDelta%20W_%7Bt_k%7D,%0A"> for a path <img src="https://latex.codecogs.com/png.latex?dx_t%20%5C;%20=%20%5C;%20b(x_t)%20%5C,%20dt%20+%20%5Csigma(x_t)%20%5C,%20dW_t">, taking the limit <img src="https://latex.codecogs.com/png.latex?N%20%5Cto%20%5Cinfty"> gives</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7Bd%20%5Cmathbb%7BP%7D%5E%7Bu%7D%7D%7Bd%20%5Cmathbb%7BP%7D%7D(x)%20%5C;%20=%20%5C;%20%5Cexp%20%7B%5Cleft%5C%7B%0A-%5Cfrac%2012%20%5C,%20%5Cint_0%5ET%20%5C%7Cu(x_t)%5C%7C%5E2%20%5C,%20dt%20+%20%5Cint_%7B0%7D%5ET%20u%5E%5Ctop(x_t)%20%5C,%20dW_t%0A%5Cright%5C%7D%7D%20.%0A"></p>
<p>Similarly, for a path <img src="https://latex.codecogs.com/png.latex?dx%5E%7Bu%7D_t%20%5C;%20=%20%5C;%20b(x%5Eu_t)%20%5C,%20dt%20+%20%5Csigma(x%5Eu_t)%20%5C,%20%20%7B%5Cleft(%20%20dW_t%20+%20u(x%5Eu_t)%20%5Cright)%7D%20">, we have</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7Bd%20%5Cmathbb%7BP%7D%7D%7Bd%20%5Cmathbb%7BP%7D%5Eu%7D(x%5Eu)%20%5C;%20=%20%5C;%20%5Cexp%20%7B%5Cleft%5C%7B%0A-%5Cfrac%2012%20%5C,%20%5Cint_0%5ET%20%5C%7Cu(x%5Eu_t)%5C%7C%5E2%20%5C,%20dt%20-%20%5Cint_%7B0%7D%5ET%20u%5E%5Ctop(x%5Eu_t)%20%5C,%20dW_t%0A%5Cright%5C%7D%7D%20.%0A"></p>
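<p>As an informal numerical counterpart of the discretization argument, the sketch below (Python/NumPy; the drift <code>b(x) = -x</code> and perturbation <code>u(x) = cos(x)</code> are arbitrary illustrative choices of mine) simulates paths under <code>P</code>, accumulates the discretized weight <code>dP^u/dP</code>, and checks that its expectation under <code>P</code> is 1, as it must be for a probability ratio.</p>

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative choices (not from the note): drift b(x) = -x, sigma = 1,
# perturbation u(x) = cos(x).
T, N, M = 1.0, 200, 200_000
dt = T / N
b = lambda x: -x
u = lambda x: np.cos(x)

# Simulate M Euler-Maruyama paths of dX = b(X) dt + dW under P, accumulating
# the discretized log-weight: log dP^u/dP = -0.5 int ||u||^2 dt + int u^T dW.
x = np.zeros(M)
log_w = np.zeros(M)
for _ in range(N):
    dW = rng.normal(scale=np.sqrt(dt), size=M)
    log_w += -0.5 * u(x) ** 2 * dt + u(x) * dW
    x += b(x) * dt + dW

mean_weight = np.mean(np.exp(log_w))
print(mean_weight)  # close to 1
```

<p>Because the drift term at each step is measurable with respect to the past, the discretized weight is exactly a martingale, so its mean is 1 without any discretization bias.</p>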
<p>These results remain identical for time-dependent drift and volatility functions, as is clear from this non-rigorous argument. The above two formulas for <img src="https://latex.codecogs.com/png.latex?d%5Cmathbb%7BP%7D%5Eu/d%5Cmathbb%7BP%7D(x)"> and <img src="https://latex.codecogs.com/png.latex?d%5Cmathbb%7BP%7D/d%5Cmathbb%7BP%7D%5Eu(x)"> may be slightly confusing since they are not immediately recognizable as inverses of each other. Furthermore, these probability ratios, evaluated along a path <img src="https://latex.codecogs.com/png.latex?x"> or <img src="https://latex.codecogs.com/png.latex?x%5Eu">, are expressed in terms of the Brownian trajectory that drives them, which adds to the confusion. In short, it would be better to express <img src="https://latex.codecogs.com/png.latex?d%5Cmathbb%7BP%7D%5Eu/d%5Cmathbb%7BP%7D(x)"> directly in terms of the path <img src="https://latex.codecogs.com/png.latex?x"> rather than the Brownian motion <img src="https://latex.codecogs.com/png.latex?W_t">, even though the two descriptions are equivalent. For these reasons, it is often convenient to use the following equivalent expressions:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Baligned%7D%0A%5Cfrac%7Bd%20%5Cmathbb%7BP%7D%5E%7Bu%7D%7D%7Bd%20%5Cmathbb%7BP%7D%7D(x)%20&amp;=%20%5Cexp%20%7B%5Cleft%5C%7B%0A%5Ctextcolor%7Bblue%7D%7B-%7D%5Cfrac%2012%20%5C,%20%5Cint_0%5ET%20%5C%7Cu(x_t)%5C%7C%5E2%20%5C,%20dt%20%20%5Ctextcolor%7Bblue%7D%7B+%7D%20%5Cint_%7B0%7D%5ET%20u%5E%5Ctop(x_t)%20%5C,%20%5Cfrac%7Bdx_t%20-%20b(x_t)%20dt%7D%7B%5Csigma(x_t)%7D%20%5Cright%5C%7D%7D%20%5C%5C%0A%5Cfrac%7Bd%20%5Cmathbb%7BP%7D%7D%7Bd%20%5Cmathbb%7BP%7D%5E%7Bu%7D%7D(x)%20&amp;=%20%5Cexp%20%7B%5Cleft%5C%7B%0A%5Ctextcolor%7Bblue%7D%7B+%7D%5Cfrac%2012%20%5C,%20%5Cint_0%5ET%20%5C%7Cu(x_t)%5C%7C%5E2%20%5C,%20dt%20%20%5Ctextcolor%7Bblue%7D%7B-%7D%20%5Cint_%7B0%7D%5ET%20u%5E%5Ctop(x_t)%20%5C,%20%5Cfrac%7Bdx_t%20-%20b(x_t)%20dt%7D%7B%5Csigma(x_t)%7D%20%5Cright%5C%7D%7D%0A%5Cend%7Baligned%7D%0A%5Cright.%0A"></p>
<p>From these expressions, it is clear that <img src="https://latex.codecogs.com/png.latex?d%5Cmathbb%7BP%7D%5Eu/d%5Cmathbb%7BP%7D(x)"> and <img src="https://latex.codecogs.com/png.latex?d%5Cmathbb%7BP%7D/d%5Cmathbb%7BP%7D%5Eu(x)"> are indeed inverses of each other. Another, entirely equivalent and slightly more symmetric, formulation is as follows. Consider the two measures <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D%5E%7B(1)%7D"> and <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D%5E%7B(2)%7D"> associated to</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AdX_t%5E%7B(i)%7D%20%5C;%20=%20%5C;%20b%5E%7B(i)%7D(X_t)%20%5C,%20dt%20+%20%5Csigma(X_t)%20%5C,%20dW_t%0A"></p>
<p>for two drift functions <img src="https://latex.codecogs.com/png.latex?b%5E%7B(1)%7D:%20%5Cmathbb%7BR%7D%5ED%20%5Cto%20%5Cmathbb%7BR%7D%5ED"> and <img src="https://latex.codecogs.com/png.latex?b%5E%7B(2)%7D:%20%5Cmathbb%7BR%7D%5ED%20%5Cto%20%5Cmathbb%7BR%7D%5ED">. Then, the Radon-Nikodym derivative between these two measures is given by</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7Bd%20%5Cmathbb%7BP%7D%5E%7B(2)%7D%7D%7Bd%20%5Cmathbb%7BP%7D%5E%7B(1)%7D%7D(x)%20=%20%5Cexp%20%7B%5Cleft%5C%7B%0A-%5Cfrac%7B1%7D%7B2%7D%5Cint_%7B0%7D%5ET%20%20%7B%5Cleft(%20%5Cfrac%7B%5C%7Cb%5E%7B(2)%7D_t%5C%7C%5E2%20-%20%5C%7Cb%5E%7B(1)%7D_t%5C%7C%5E2%7D%7B%5Csigma%5E2_t%7D%20%5Cright)%7D%20%20%5C,%20dt%0A+%0A%5Cint_%7B0%7D%5ET%20%5Cleft%3C%20%20%5Cfrac%7Bb%5E%7B(2)%7D_t%20-%20b%5E%7B(1)%7D_t%7D%7B%5Csigma_t%5E2%7D,%20dx_t%20%5Cright%3E%0A%5Cright%5C%7D%7D%0A"></p>
<p>with the shorthands <img src="https://latex.codecogs.com/png.latex?b%5E%7B(i)%7D_t%20=%20b%5E%7B(i)%7D(x_t)"> and <img src="https://latex.codecogs.com/png.latex?%5Csigma_t%20=%20%5Csigma(x_t)"> and <img src="https://latex.codecogs.com/png.latex?%5C%7Cv%5C%7C%5E2/%5Csigma%5E2%20=%20%5Cleft%3C%20v,%20%5B%5Csigma%20%5Csigma%5E%5Ctop%5D%5E%7B-1%7D%20v%20%5Cright%3E"> and <img src="https://latex.codecogs.com/png.latex?%5Cleft%3C%20u,v%20%5Cright%3E%20/%20%5Csigma%5E2%20=%20%5Cleft%3C%20u,%20%5B%5Csigma%20%5Csigma%5E%5Ctop%5D%5E%7B-1%7D%20v%20%5Cright%3E">. Again, this follows immediately from a discretized version of the SDEs. As described below, these change-of-variables formulae are often useful when performing importance sampling on path-space. As a sanity check, note that in the case of a scalar Brownian motion <img src="https://latex.codecogs.com/png.latex?dX%20=%20%5Csigma%20%5C,%20dW"> and a drifted version of it <img src="https://latex.codecogs.com/png.latex?dX%5Eu%20=%20%5Csigma%20%5C,%20dW%20+%20u%20%5C,%20dt">, the ratio <img src="https://latex.codecogs.com/png.latex?d%5Cmathbb%7BP%7D%5Eu/d%5Cmathbb%7BP%7D(x)"> indeed has unit expectation under <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D">: this is equivalent to the identity <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D%5B%5Cexp(%5Csigma%20%5C,%20%5Cxi)%5D%20=%20%5Cexp(%5Csigma%5E2/2)"> for a standard Gaussian random variable <img src="https://latex.codecogs.com/png.latex?%5Cxi">. Finally, note that the <a href="https://en.wikipedia.org/wiki/Kullback–Leibler_divergence">Kullback-Leibler divergence</a> between <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D"> and <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D%5Eu"> has a particularly simple form. Since <img src="https://latex.codecogs.com/png.latex?D_%7B%5Ctext%7BKL%7D%7D(%5Cmathbb%7BP%7D,%20%5Cmathbb%7BP%7D%5Eu)%20=%20%5Cmathbb%7BE%7D_%7B%5Cmathbb%7BP%7D%7D%20%7B%5Cleft%5B%20-%5Clog%20%7B%5Cleft%5C%7B%20%5Cfrac%7Bd%20%5Cmathbb%7BP%7D%5E%7Bu%7D%7D%7Bd%20%5Cmathbb%7BP%7D%7D(X)%20%5Cright%5C%7D%7D%20%20%5Cright%5D%7D%20">, one obtains</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AD_%7B%5Ctext%7BKL%7D%7D(%5Cmathbb%7BP%7D,%20%5Cmathbb%7BP%7D%5Eu)%20=%20%5Cfrac12%20%5Cmathbb%7BE%7D%20%7B%5Cleft%5B%20%20%5Cint_0%5ET%20%5C%7Cu(X_t)%5C%7C%5E2%20%5C,%20dt%20%20%5Cright%5D%7D%20.%0A"></p>
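<p>This KL formula can be checked by Monte-Carlo: along <code>P</code>-trajectories the stochastic-integral part of <code>-log(dP^u/dP)</code> averages to zero, so its empirical mean matches the empirical mean of <code>0.5 * int ||u||^2 dt</code>. A sketch with arbitrary illustrative choices of drift and perturbation (mine, not from the note):</p>

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative choices (not from the note): dX = -X dt + dW under P, with
# perturbation drift u(x) = cos(x).
T, N, M = 1.0, 200, 100_000
dt = T / N
u = lambda x: np.cos(x)

x = np.zeros(M)
neg_log_w = np.zeros(M)     # -log(dP^u/dP) accumulated along each P-path
half_int_u2 = np.zeros(M)   # 0.5 * int ||u(X_t)||^2 dt along the same path
for _ in range(N):
    dW = rng.normal(scale=np.sqrt(dt), size=M)
    neg_log_w += 0.5 * u(x) ** 2 * dt - u(x) * dW
    half_int_u2 += 0.5 * u(x) ** 2 * dt
    x += -x * dt + dW

kl_direct = np.mean(neg_log_w)      # E_P[-log dP^u/dP]
kl_formula = np.mean(half_int_u2)   # 0.5 * E[int ||u||^2 dt]
print(kl_direct, kl_formula)        # the two estimates agree
```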
<section id="importance-sampling-on-path-space" class="level3">
<h3 class="anchored" data-anchor-id="importance-sampling-on-path-space">Importance Sampling on path-space</h3>
<p>Consider a functional <img src="https://latex.codecogs.com/png.latex?%5CPhi:%20C(%5B0,T%5D;%20%5Cmathbb%7BR%7D%5ED)%20%5Cto%20%5Cmathbb%7BR%7D"> on path-space; a typical example is</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5CPhi(x)%20=%20%5Cexp%20%7B%5Cleft%5C%7B%20%5Cint_0%5ET%20f(X_t)%20%5C,%20dt%20%5C,%20+%20%5C,%20g(X_T)%20%5Cright%5C%7D%7D%20.%0A"></p>
<p>Suppose that we would like to evaluate the expectation of <img src="https://latex.codecogs.com/png.latex?%5CPhi"> under the measure <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D">. Naive Monte-Carlo (MC) would require sampling <img src="https://latex.codecogs.com/png.latex?M"> trajectories from Equation&nbsp;2 and computing the average of <img src="https://latex.codecogs.com/png.latex?%5CPhi"> on these trajectories. To reduce the variance of this naive MC estimator, one can also use importance sampling by sampling <img src="https://latex.codecogs.com/png.latex?M"> trajectories <img src="https://latex.codecogs.com/png.latex?x%5E%7B1,u%7D,%20%5Cldots,%20x%5E%7BM,u%7D"> from the measure <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D%5Eu"> and computing the average</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7B1%7D%7BM%7D%20%5C,%20%5Csum_%7Bi=1%7D%5EM%20%5CPhi(x%5E%7Bi,u%7D)%20%5C,%20W(x%5E%7Bi,u%7D)%0A"></p>
<p>with weights given by the Radon-Nikodym derivative</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AW(x%5E%7Bi,u%7D)%20%5C;%20=%20%5C;%20%5Cexp%20%7B%5Cleft%5C%7B%0A-%5Cfrac%2012%20%5C,%20%5Cint_0%5ET%20%5C%7Cu(x%5E%7Bi,u%7D_t)%5C%7C%5E2%20%5C,%20dt%20-%20%5Cint_%7B0%7D%5ET%20u%5E%5Ctop(x%5E%7Bi,u%7D_t)%20%5C,%20dW_t%0A%5Cright%5C%7D%7D%20.%0A"></p>
<p>Choosing the optimal “control” function <img src="https://latex.codecogs.com/png.latex?u"> that minimizes the variance of the estimator is not entirely straightforward, although this <a href="../../notes/doob_transforms/doob.html">previous note</a> already gives the answer. More on this in <a href="../../notes/HJB/HJB.html">another note</a>.</p>
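<p>To make the variance reduction concrete, the sketch below uses an illustrative setup of my choosing: <code>dX = dW</code> started at 0 and <code>Phi(x) = exp(a * x_T)</code>, so that the exact answer <code>exp(a**2 * T / 2)</code> is available. The constant tilted drift <code>u = a</code> happens to be the optimal choice for this <code>Phi</code>, and the weighted estimator then has zero variance.</p>

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative setup (my choice): dX = dW started at 0, Phi(x) = exp(a * x_T),
# so that E_P[Phi] = exp(a**2 * T / 2) in closed form.
a, T, N, M = 1.5, 1.0, 100, 20_000
dt = T / N

x_naive = np.zeros(M)     # paths under P
x_is = np.zeros(M)        # paths under P^u with constant drift u = a
log_w = np.zeros(M)       # log dP/dP^u accumulated along the controlled paths
for _ in range(N):
    x_naive += rng.normal(scale=np.sqrt(dt), size=M)
    dW = rng.normal(scale=np.sqrt(dt), size=M)
    x_is += a * dt + dW
    log_w += -0.5 * a**2 * dt - a * dW

naive = np.exp(a * x_naive)                  # naive MC samples of Phi
weighted = np.exp(a * x_is + log_w)          # importance-sampling samples
print(np.mean(naive), np.std(naive))         # noisy estimator
print(np.mean(weighted), np.std(weighted))   # zero variance in this case
print(np.exp(a**2 * T / 2))                  # exact value
```

<p>The Brownian increments cancel exactly between the path and the weight, which is the zero-variance phenomenon discussed in the notes on Doob transforms and stochastic control.</p>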


</section>

 ]]></description>
  <category>SDE</category>
  <category>markov</category>
  <guid>https://alexxthiery.github.io/notes/girsanov/girsanov.html</guid>
  <pubDate>Sun, 02 Jun 2024 17:00:00 GMT</pubDate>
</item>
<item>
  <title>Joe Doob &amp; Change of measures on path-space</title>
  <link>https://alexxthiery.github.io/notes/doob_transforms/doob.html</link>
  <description><![CDATA[ 





<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/doob_transforms/joseph_doob.jpg" class="img-fluid figure-img" style="width:35.0%"></p>
<figcaption><a href="https://en.wikipedia.org/wiki/Joseph_L._Doob">Joseph Doob</a> (1910 – 2004)</figcaption>
</figure>
</div>
</div>
<p>Consider a continuous-time Markov process <img src="https://latex.codecogs.com/png.latex?X_t"> on the time interval <img src="https://latex.codecogs.com/png.latex?%5B0,T%5D">, taking values in the state space <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BX%7D">. This defines a probability measure <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D"> on the set of <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BX%7D">-valued paths. It is often the case that one has to consider a perturbed, sometimes called “twisted”, probability distribution <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BQ%7D"> defined as</p>
<p><span id="eq-change-pb"><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7Bd%5Cmathbb%7BQ%7D%7D%7Bd%5Cmathbb%7BP%7D%7D(x_%7B%5B0,T%5D%7D)%20=%20%5Cfrac%7B1%7D%7B%5Cmathcal%7BZ%7D%7D%20%5C,%20%5Cexp%5Bg(X_T)%5D%0A%5Ctag%7B1%7D"></span></p>
<p>for a normalization constant <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BZ%7D"> and some function <img src="https://latex.codecogs.com/png.latex?g:%20%5Cmathcal%7BX%7D%5Cto%20%5Cmathbb%7BR%7D">. For example, collecting a noisy observation <img src="https://latex.codecogs.com/png.latex?y_T%20%5Csim%20%5Cmathcal%7BF%7D(X_T)%20+%20%5Ctextrm%7B(noise)%7D"> at time <img src="https://latex.codecogs.com/png.latex?T">, the distribution <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BQ%7D"> defined with the log-likelihood function <img src="https://latex.codecogs.com/png.latex?g(x)%20=%20%5Clog%20%5Cmathbb%7BP%7D(y_T%20%5Cmid%20X_T=x)"> describes the dynamics of the Markov process <img src="https://latex.codecogs.com/png.latex?X_t"> conditioned on the observation <img src="https://latex.codecogs.com/png.latex?y_T">; we adopt this interpretation in what follows since it is the most common use case and the most intuitive. It is then clear that the normalization constant <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BZ%7D"> is the model evidence <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D(y_T)">, and that the initial distribution <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BQ%7D(x_0)"> of the conditioned process is the conditional law of <img src="https://latex.codecogs.com/png.latex?X_0%20%5Cmid%20y_T">, given by <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D(x_0)%20%5C,%20%5Cmathbb%7BP%7D(y_T%20%5Cmid%20x_0)%20/%20%5Cmathbb%7BP%7D(y_T)">. Doob h-transforms are a powerful tool for describing the dynamics of the conditioned process.</p>
<p>For convenience, let us use the notation <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D_x%5B%5Cldots%5D%20%5Cequiv%20%5Cmathbb%7BE%7D%5B%5Cldots%20%5Cmid%20x_t=x%5D">. For a test function <img src="https://latex.codecogs.com/png.latex?%5Cvarphi:%20%5Cmathcal%7BX%7D%5Cto%20%5Cmathbb%7BR%7D"> and a time increment <img src="https://latex.codecogs.com/png.latex?%5Cdelta%20%3E%200">, we have</p>
<p><span id="eq-def"><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign*%7D%0A%5Cmathbb%7BE%7D%5B%5Cvarphi(x_%7Bt+%5Cdelta%7D)%20%7C%20x_t,%20y_T%5D%20&amp;=%20%5Cmathbb%7BE%7D_%7Bx_t%7D%5B%5Cvarphi(x_%7Bt+%5Cdelta%7D)%20%5C,%20%5Cexp(g(x_T))%20%5D%20%5C,%20/%20%5C,%20%5Cmathbb%7BE%7D_%7Bx_t%7D%5B%5Cexp(g(x_T))%5D%5C%5C%0A&amp;=%20%5Cfrac%7B%20%5Cmathbb%7BE%7D_%7Bx_t%7D%5B%5Cvarphi(x_%7Bt+%5Cdelta%7D)%20%5C,%20h(t+%5Cdelta,%20x_%7Bt+%5Cdelta%7D)%5D%20%7D%7Bh(t,%20x)%7D.%0A%5Cend%7Balign*%7D%0A%5Ctag%7B2%7D"></span></p>
<p>We have introduced the important function <img src="https://latex.codecogs.com/png.latex?h:%5B0,T%5D%20%5Ctimes%20%5Cmathcal%7BX%7D%5Cto%20%5Cmathbb%7BR%7D"> defined as</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ah(t,%20x)%20%5C;%20=%20%5C;%20%5Cmathbb%7BE%7D%20%7B%5Cleft%5B%20%20%5Cexp%5Bg(x_T)%5D%20%5Cmid%20x_t%20=%20x%20%5Cright%5D%7D%20%20%5C;%20=%20%5C;%20%5Cmathbb%7BP%7D(y_T%20%5Cmid%20x_t%20=%20x).%0A"></p>
<p>Denoting by <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7D"> the infinitesimal generator of the Markov process <img src="https://latex.codecogs.com/png.latex?X_t">, one can readily check that the function <img src="https://latex.codecogs.com/png.latex?h"> satisfies the <a href="https://en.wikipedia.org/wiki/Kolmogorov_equations">Kolmogorov equation</a></p>
<p><img src="https://latex.codecogs.com/png.latex?%0A(%5Cpartial_t%20+%20%5Cmathcal%7BL%7D)%20%5C,%20h%20=%200%0A"></p>
<p>with terminal condition <img src="https://latex.codecogs.com/png.latex?h(T,x)%20=%20%5Cexp%5Bg(x)%5D">. Furthermore, we have:</p>
<p><span id="eq-kolmogorov"><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign*%7D%0A%5Cmathbb%7BE%7D_%7Bx_t%7D%5B%5Cvarphi(x_%7Bt+%5Cdelta%7D)%20&amp;%20h(t+%5Cdelta,%20x_%7Bt+%5Cdelta%7D)%20%5D%0A%5C;%20=%20%5C;%0A%5Cvarphi(x_t)%20h(t,%20x_t)%20%5C%5C%0A&amp;+%20%5C;%20%5Cdelta%20%5C,%20(%5Cpartial_t%20+%20%5Cmathcal%7BL%7D)%5Bh%20%5C,%20%5Cvarphi%5D%20%5C,%20(t,%20x_t)%0A%5C;%20+%20%5C;%20o(%5Cdelta).%0A%5Cend%7Balign*%7D%0A%5Ctag%7B3%7D"></span></p>
<p>The infinitesimal generator <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7D%5E%7B%5Cstar%7D"> of the conditioned process is</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7BL%7D%5E%7B%5Cstar%7D%20%5Cvarphi(t,%20x_t)%20=%20%5Clim_%7B%5Cdelta%20%5Cto%200%5E+%7D%20%5C;%20%5Cfrac%7B%5Cmathbb%7BE%7D%5B%5Cvarphi(x_%7Bt+%5Cdelta%7D)%20%7C%20x_t,%20y_T%5D%20-%20%5Cvarphi(x_t)%7D%7B%5Cdelta%7D.%0A"></p>
<p>Plugging Equation&nbsp;3 into Equation&nbsp;2 directly gives</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7BL%7D%5E%7B%5Cstar%7D%20%5Cvarphi%0A%5C;%20=%20%5C;%0A%5Cfrac%7B(%5Cpartial_t%20+%20%5Cmathcal%7BL%7D)%5Bh%20%5C,%20%5Cvarphi%5D%7D%7Bh%7D%0A%5C;%20=%20%5C;%0A%5Cfrac%7B%5Cmathcal%7BL%7D%5B%5Cvarphi%5C,%20h%5D%7D%7Bh%7D%20+%20%5Cvarphi%5Cfrac%7B%5Cpartial_t%20h%7D%7Bh%7D.%0A"></p>
<details>
<summary>
Some details:
</summary>
<p style="color: blue;">
<img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign*%7D%0A%5Cmathbb%7BE%7D%5B%5Cvarphi(x_%7Bt+%5Cdelta%7D)%20%7C%20x_t,%20y_T%5D%0A&amp;=%0A%5Cfrac%7B%20%5Cmathbb%7BE%7D_%7Bx_t%7D%5B%5Cvarphi(x_%7Bt+%5Cdelta%7D)%20%5C,%20h(t+%5Cdelta,%20x_%7Bt+%5Cdelta%7D)%5D%20%7D%7Bh(t,%20x)%7D%5C%5C%0A&amp;=%0A%5Cfrac%7B%20%5Cvarphi(x_t)%20h(t,%20x_t)%20+%20%5Cdelta%20(%5Cpartial_t%20+%20%5Cmathcal%7BL%7D)%5Bh%20%5C,%20%5Cvarphi%5D%20%5C,%20(t,%20x_t)%7D%7Bh(t,%20x_t)%7D%20%20+%20o(%5Cdelta)%5C%5C%0A&amp;=%0A%5Cvarphi(x_t)%20+%20%5Cdelta%20%5C,%20%5Cfrac%7B(%5Cpartial_t%20+%20%5Cmathcal%7BL%7D)%5Bh%20%5C,%20%5Cvarphi%5D%20%5C,%20(t,%20x_t)%7D%7Bh(t,%20x_t)%7D%20+%20o(%5Cdelta).%0A%5Cend%7Balign*%7D%0A"> Since <img src="https://latex.codecogs.com/png.latex?%5Cvarphi"> does not depend on time, we have <img src="https://latex.codecogs.com/png.latex?(%5Cpartial_t%20+%20%5Cmathcal%7BL%7D)%5Bh%20%5C,%20%5Cvarphi%5D%20=%20%5Cvarphi%5C,%20%5Cpartial_t%20h%20+%20%5Cmathcal%7BL%7D%5Bh%20%5C,%20%5Cvarphi%5D"> and this gives the announced result.
</p>
</details>
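<p>This change of measure is easy to verify numerically on a finite state space, where <img src="https://latex.codecogs.com/png.latex?h"> can be computed with a matrix exponential. The following Python sketch (the generator, the observation log-likelihood and the sample sizes are illustrative choices, not taken from the text) compares the marginal of the conditioned chain at an intermediate time, computed through <img src="https://latex.codecogs.com/png.latex?h">, with a Monte-Carlo estimate obtained by reweighting unconditioned paths by <img src="https://latex.codecogs.com/png.latex?%5Cexp%5Bg(x_T)%5D">.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# A 3-state symmetric CTMC with generator Q; condition on a terminal
# log-likelihood g via the change of measure dQ/dP proportional to exp(g(X_T)).
Q = np.array([[-2.0, 1.0, 1.0],
              [1.0, -2.0, 1.0],
              [1.0, 1.0, -2.0]])
g = np.array([0.0, 1.0, -1.0])
T, s, x0 = 1.0, 0.5, 0

w_eig, V = np.linalg.eigh(Q)                 # Q is symmetric here
expmQ = lambda t: (V * np.exp(w_eig * t)) @ V.T

# Exact twisted marginal at time s: Q(X_s = k) = P(X_s = k) h(s, k) / h(0, x0),
# with h(t, x) = E[exp(g(X_T)) | X_t = x] = [expm(Q (T - t)) exp(g)]_x.
h_s = expmQ(T - s) @ np.exp(g)
p_s = expmQ(s)[x0]                           # unconditioned law of X_s
exact = p_s * h_s / (p_s @ h_s)

# Monte-Carlo check: simulate unconditioned paths, weight by exp(g(X_T)).
n_paths = 100_000
X_s = np.empty(n_paths, dtype=int)
X_T = np.empty(n_paths, dtype=int)
for i in range(n_paths):
    x, t, xs = x0, 0.0, x0
    while t < T:
        dt = rng.exponential(0.5)            # holding time: total jump rate is 2
        if t <= s < t + dt:
            xs = x                           # state occupied at time s
        t += dt
        if t < T:
            x = (x + rng.integers(1, 3)) % 3  # jump uniformly to another state
    X_s[i], X_T[i] = xs, x

w = np.exp(g[X_T])                           # importance weights exp(g(X_T))
mc = np.array([(w * (X_s == k)).sum() for k in range(3)]) / w.sum()
print(exact)
print(mc)    # agrees with `exact` up to Monte-Carlo error
```

<p>The two marginals agree up to Monte-Carlo error, as expected.</p>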
<p>The generator <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7D%5E%7B%5Cstar%7D"> describes the dynamics of the conditioned process. In fact, the same computation holds with a more general change of measure of the type <span id="eq-change-gen"><img src="https://latex.codecogs.com/png.latex?%0A%5Ctextcolor%7Bgreen%7D%7B%5Cfrac%7Bd%5Cmathbb%7BQ%7D%7D%7Bd%5Cmathbb%7BP%7D%7D(x_%7B%5B0,T%5D%7D)%20=%20%5Cfrac%7B1%7D%7B%5Cmathcal%7BZ%7D%7D%20%5C,%20%5Cexp%20%7B%5Cleft%5C%7B%20%5Cint_%7B0%7D%5ET%20f(s,%20X_s)%20%5C,%20ds%20+%20g(x_T)%20%5Cright%5C%7D%7D%20%20%7D%0A%5Ctag%7B4%7D"></span></p>
<p>for some function <img src="https://latex.codecogs.com/png.latex?f:%5B0,T%5D%20%5Ctimes%20%5Cmathcal%7BX%7D%5Cto%20%5Cmathbb%7BR%7D">. One can define the function <img src="https://latex.codecogs.com/png.latex?h"> similarly as</p>
<p><span id="eq-h-gen"><img src="https://latex.codecogs.com/png.latex?%0A%5Ctextcolor%7Bgreen%7D%7B%20h(t,%20x_t)%20%5C;%20=%20%5C;%20%5Cmathbb%7BE%7D%20%7B%5Cleft%5B%20%20%5Cexp%20%7B%5Cleft%5C%7B%20%5Cint_%7Bt%7D%5ET%20f(s,%20X_s)%20%5C,%20ds%20+%20g(x_T)%20%5Cright%5C%7D%7D%20%20%5Cmid%20x_t%20%5Cright%5D%7D%20%7D.%0A%5Ctag%7B5%7D"></span></p>
<p>By the <a href="https://en.wikipedia.org/wiki/Feynman–Kac_formula">Feynman-Kac formula</a>, this function satisfies <img src="https://latex.codecogs.com/png.latex?(%5Cpartial_t%20+%20%5Cmathcal%7BL%7D+%20f)%20%5C,%20h%20=%200">, and an entirely similar computation shows that the probability distribution <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BQ%7D"> describes a Markov process with infinitesimal generator</p>
<p><span id="eq-doob-transforms"><img src="https://latex.codecogs.com/png.latex?%0A%5Ctextcolor%7Bgreen%7D%7B%5Cmathcal%7BL%7D%5E%7B%5Cstar%7D%20%5Cvarphi%0A%5C;%20=%20%5C;%0A%5Cfrac%7B(%5Cpartial_t%20+%20%5Cmathcal%7BL%7D+%20f)%5Bh%20%5C,%20%5Cvarphi%5D%7D%7Bh%7D%0A%5C;%20=%20%5C;%0A%5Cfrac%7B%5Cmathcal%7BL%7D%5Bh%20%5C,%20%5Cvarphi%5D%7D%7Bh%7D%20+%20%20%7B%5Cleft(%20%20%5Cfrac%7B%5Cpartial_t%20h%7D%20%7Bh%7D%20+%20f%20%5Cright)%7D%20%20%5C,%20%5Cvarphi.%7D%0A%5Ctag%7B6%7D"></span></p>
<details>
<summary>
Some details:
</summary>
<p style="color: blue;">
An interpretation of the conditioned process is as follows. Suppose for example that every <img src="https://latex.codecogs.com/png.latex?%5Cdelta"> units of time, one observes a noisy measurement <img src="https://latex.codecogs.com/png.latex?y_t"> of the state <img src="https://latex.codecogs.com/png.latex?x_t"> with log-likelihood <img src="https://latex.codecogs.com/png.latex?f(t,%20x_t)%20%5C,%20%5Cdelta">, as well as a final observation at time <img src="https://latex.codecogs.com/png.latex?T"> with log-likelihood <img src="https://latex.codecogs.com/png.latex?g(x_T)">. For example, this could be a stream of noisy measurements with Gaussian noise of variance proportional to <img src="https://latex.codecogs.com/png.latex?1/%5Cdelta"> concluded at time <img src="https://latex.codecogs.com/png.latex?T"> with a final measurement; in other words, one very frequently observes the state with very noisy measurements and finally makes a more precise observation at time <img src="https://latex.codecogs.com/png.latex?T">. In the regime <img src="https://latex.codecogs.com/png.latex?%5Cdelta%20%5Cto%200">, the log-likelihood of that stream <img src="https://latex.codecogs.com/png.latex?y_%7B0:T%7D"> of observations is precisely: <img src="https://latex.codecogs.com/png.latex?%0A%5Cint_0%5ET%20f(s,%20X_s)%20%5C,%20ds%20+%20g(X_T).%0A"> Conditioning on these frequent observations then leads to the change of measure in Equation&nbsp;4. The computations of the conditioned generator then follow similarly as before. 
We have: <img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign*%7D%0A%5Cmathbb%7BE%7D%5B%5Cvarphi(x_%7Bt+%5Cdelta%7D)%20%7C%20x_t,%20y_%7B0:T%7D%5D%20&amp;=%20%5Cmathbb%7BE%7D%5B%5Cvarphi(x_%7Bt+%5Cdelta%7D)%20%7C%20x_t,%20y_%7Bt:T%7D%5D%5C%5C%0A&amp;=%0A%5Cfrac%7B%20%5Cmathbb%7BE%7D_%7Bx_t%7D%5B%5Cvarphi(x_%7Bt+%5Cdelta%7D)%20%5C,%20%5Cexp%20%7B%5Cleft%5C%7B%20%5Cint_%7Bt%7D%5ET%20f(s,%20X_s)%20%5C,%20ds%20+%20g(x_T)%20%5Cright%5C%7D%7D%20%5D%20%7D%7Bh(t,%20x)%7D%0A%5Cend%7Balign*%7D%0A"> where <img src="https://latex.codecogs.com/png.latex?h"> is defined in Equation&nbsp;5. The rest of the computation follows similarly as before. First, express <img src="https://latex.codecogs.com/png.latex?%5Cexp%20%7B%5Cleft%5C%7B%20%5Cint_%7Bt%7D%5ET%20f(s,%20X_s)%20%5C,%20ds%20+%20g(x_T)%20%5Cright%5C%7D%7D%20"> as <img src="https://latex.codecogs.com/png.latex?%0A%5Cexp%20%7B%5Cleft%5C%7B%20%5Cint_%7Bt%7D%5E%7Bt+%5Cdelta%7D%20f(s,%20X_s)%20%5C,%20ds%20%5Cright%5C%7D%7D%20%20%5C,%20%5Cexp%20%7B%5Cleft%5C%7B%20%5Cint_%7Bt+%5Cdelta%7D%5ET%20f(s,%20X_s)%20%5C,%20ds%20+%20g(x_T)%20%5Cright%5C%7D%7D%20,%0A"> and write that <img src="https://latex.codecogs.com/png.latex?%5Cexp%20%7B%5Cleft%5C%7B%20%5Cint_%7Bt%7D%5E%7Bt+%5Cdelta%7D%20f(s,%20X_s)%20%5C,%20ds%20%5Cright%5C%7D%7D%20%20=%201%20+%20f(t,%20x_t)%20%5C,%20%5Cdelta%20+%20o(%5Cdelta)"> for small <img src="https://latex.codecogs.com/png.latex?%5Cdelta">. 
Then condition on <img src="https://latex.codecogs.com/png.latex?x_%7Bt+%5Cdelta%7D"> to obtain: <img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign*%7D%0A%5Cmathbb%7BE%7D%5B%5Cvarphi(x_%7Bt+%5Cdelta%7D)%20%7C%20x_t,%20y_%7Bt:T%7D%5D%0A&amp;=%0A%5Cfrac%7B%20%5Cmathbb%7BE%7D_%7Bx_t%7D%5B%5Cvarphi(x_%7Bt+%5Cdelta%7D)%20%5C,%20(1%20+%20f(t,%20x_t)%20%5C,%20%5Cdelta)%20%5C,%20h(t+%5Cdelta,%20x_%7Bt+%5Cdelta%7D)%5D%20%7D%7Bh(t,%20x)%7D%20+%20o(%5Cdelta)%5C%5C%0A&amp;=%0A%5Cfrac%7B%20%5Cmathbb%7BE%7D_%7Bx_t%7D%5B%5Cvarphi(x_%7Bt+%5Cdelta%7D)%20%5C,%20h(t+%5Cdelta,%20x_%7Bt+%5Cdelta%7D)%5D%20%7D%7Bh(t,%20x)%7D%20+%20%5Cdelta%20%5C,%20%5Cvarphi(x_t)%20%5C,%20f(t,%20x_t)%20+%20o(%5Cdelta).%0A%5Cend%7Balign*%7D%0A"> The rest of the computations are then identical to before.
</p>
</details>
<p>To see how this works, let us consider a few examples:</p>
<section id="general-diffusions" class="level3">
<h3 class="anchored" data-anchor-id="general-diffusions">General diffusions</h3>
<p>Consider a diffusion process</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AdX%20=%20b(X)%20%5C,%20dt%20+%20%5Csigma(X)%20%5C,%20dW%0A"></p>
<p>with generator <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7D%5Cvarphi=%20b%20%5Cnabla%20%5Cvarphi+%20%5Ctfrac12%20%5C,%20%5Csigma%20%5Csigma%5E%5Ctop%20:%20%5Cnabla%5E2%20%5Cvarphi"> and initial distribution <img src="https://latex.codecogs.com/png.latex?%5Cmu_0(dx)">. We are interested in describing the dynamics of the “conditioned” process given by the probability distribution <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BQ%7D"> defined in Equation&nbsp;4. A direct computation from Equation&nbsp;6 then shows that</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7BL%7D%5E%5Cstar%20%5Cvarphi%5C;%20=%20%5C;%20%5Cmathcal%7BL%7D%5Cvarphi+%20%5Cunderbrace%7B%5Cfrac%7B%5Cvarphi%5C,%20(%5Cpartial_t%20+%20%5Cmathcal%7BL%7D+%20f)%5Bh%5D%7D%7Bh%7D%7D_%7B=0%7D%0A+%20%5Csigma%20%5C,%20%5Csigma%5E%5Ctop%20%5C,%20(%5Cnabla%20%5Clog%20h)%20%5C,%20%5Cnabla%20%5Cvarphi%0A"></p>
<p>where the function <img src="https://latex.codecogs.com/png.latex?h"> is described in Equation&nbsp;5. Since <img src="https://latex.codecogs.com/png.latex?(%5Cpartial_t%20+%20%5Cmathcal%7BL%7D+%20f)%20%5C,%20h%20=%200">, this reveals that the probability distribution <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BQ%7D"> describes a diffusion process <img src="https://latex.codecogs.com/png.latex?X%5E%5Cstar"> with dynamics</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AdX%5E%5Cstar%20=%20b(X%5E%5Cstar)%20%5C,%20dt%20+%20%5Csigma(X%5E%5Cstar)%20%5C,%20%20%7B%5Cleft%5C%7B%20%20dW%20+%20%20%5Ctextcolor%7Bblue%7D%7Bu%5E%5Cstar(t,%20X%5E%5Cstar)%7D%20%5C,%20dt%20%5Cright%5C%7D%7D%20.%0A"></p>
<p>The additional drift term <img src="https://latex.codecogs.com/png.latex?%5Csigma(X%5E%5Cstar)%20%5C,%20%20%5Ctextcolor%7Bblue%7D%7Bu%5E%5Cstar(t,%20X%5E%5Cstar)%7D%20%5C,%20dt"> involves a “control” <img src="https://latex.codecogs.com/png.latex?%20%5Ctextcolor%7Bblue%7D%7Bu%5E%5Cstar(t,%20X%5E%5Cstar)%7D"> with <span id="eq-diffusion-doob"><img src="https://latex.codecogs.com/png.latex?%0A%5Ctextcolor%7Bblue%7D%7Bu%5E%5Cstar(t,%20x)%20=%20%5Csigma%5E%5Ctop(x)%20%5C,%20%5Cnabla%20%5Clog%20h(t,%20x)%7D.%0A%5Ctag%7B7%7D"></span></p>
<p>Note that the initial distribution of the conditioned process is</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmu_0%5E%5Cstar(dx)%20=%20%5Cfrac%7B1%7D%7B%5Cmathcal%7BZ%7D%7D%20%5C,%20%5Cmu_0(dx)%20%5C,%20h(0,x).%0A"></p>
<p>Unfortunately, apart from a few straightforward cases such as a Brownian motion or an Ornstein-Uhlenbeck process, the function <img src="https://latex.codecogs.com/png.latex?h"> is generally intractable. Several numerical methods are nonetheless available to approximate it.</p>
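<p>One of the tractable cases is worth making concrete. The sketch below assumes a standard scalar Brownian motion started at zero and a Gaussian observation y ~ N(X_T, s2) (the values of y, s2 and the discretization are illustrative choices); in that case h(t, x) = N(y; x, s2 + T - t), so the control is u*(t, x) = (y - x) / (s2 + T - t), and the terminal marginal of the simulated conditioned process can be compared with the exact Gaussian posterior.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
T, s2, y = 1.0, 1.0, 2.0        # horizon, observation noise variance, observation
n_paths, n_steps = 20_000, 1_000
dt = T / n_steps

# Doob drift for a standard Brownian motion started at 0 with a Gaussian
# observation y ~ N(X_T, s2): here h(t, x) = N(y; x, s2 + T - t), so that
# u*(t, x) = grad log h(t, x) = (y - x) / (s2 + T - t).
x = np.zeros(n_paths)
for k in range(n_steps):
    t = k * dt
    x += (y - x) / (s2 + T - t) * dt + np.sqrt(dt) * rng.standard_normal(n_paths)

# Exact conditional law: X_T | y ~ N(y T / (T + s2), T s2 / (T + s2)).
print(x.mean(), x.var())   # close to 1.0 and 0.5 for these parameter values
```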
</section>
<section id="brownian-bridge" class="level3">
<h3 class="anchored" data-anchor-id="brownian-bridge">Brownian bridge</h3>
<p>What about a Brownian motion in <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BR%7D%5ED"> conditioned to hit the state <img src="https://latex.codecogs.com/png.latex?x_%5Cstar%20%5Cin%20%5Cmathbb%7BR%7D%5ED"> at time <img src="https://latex.codecogs.com/png.latex?t=T">, i.e.&nbsp;a <a href="https://en.wikipedia.org/wiki/Brownian_bridge">Brownian bridge</a>? In that case, the function <img src="https://latex.codecogs.com/png.latex?h"> is given by</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ah(t,x)%20=%20%5Cmathbb%7BP%7D(B_T%20=%20x_%5Cstar%20%5Cmid%20B_t%20=%20x)%0A=%0A%5Cexp%20%7B%5Cleft%5C%7B%20-%5Cfrac%7B%5C%7Cx-x_%5Cstar%5C%7C%5E2%7D%7B2%20%5C,%20(T-t)%7D%20%5Cright%5C%7D%7D%20%20/%20%5Cmathcal%7BZ%7D_%7BT-t%7D%0A"></p>
<p>for some irrelevant normalization constant <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BZ%7D_%7BT-t%7D"> that only depends on <img src="https://latex.codecogs.com/png.latex?T-t">. Plugging this into Equation&nbsp;7 gives that the conditioned Brownian motion <img src="https://latex.codecogs.com/png.latex?X%5E%7B%5Cstar%7D"> has dynamics</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AdX%5E%5Cstar%20%5C;=%5C;%20%20%5Ctextcolor%7Bblue%7D%7B-%20%5Cfrac%7BX%5E%5Cstar%20-%20x_%5Cstar%7D%7BT-t%7D%20%5C,%20dt%7D%20+%20dB.%0A"></p>
<p>The additional drift term <img src="https://latex.codecogs.com/png.latex?-(X%5E%5Cstar%20-%20x_%5Cstar)/(T-t)"> is intuitive: it points towards <img src="https://latex.codecogs.com/png.latex?x_%5Cstar"> and becomes increasingly strong as <img src="https://latex.codecogs.com/png.latex?t%20%5Cto%20T">.</p>
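<p>This drift is easy to test numerically. The following sketch (a plain Euler-Maruyama discretization; the endpoint, horizon, step size and sample counts are illustrative choices) simulates the bridge dynamics and checks that every path is pinned at the target at time <img src="https://latex.codecogs.com/png.latex?T">, and that the marginal at the mid-time matches the known bridge law N(x_star/2, T/4) for a bridge started at zero.</p>

```python
import numpy as np

rng = np.random.default_rng(1)
T, x_star = 1.0, 1.5
n_paths, n_steps = 20_000, 1_000
dt = T / n_steps

# Brownian bridge dynamics: dX = -(X - x_star) / (T - t) dt + dB, from X_0 = 0.
x = np.zeros(n_paths)
mid = None
for k in range(n_steps):
    t = k * dt
    x += -(x - x_star) / (T - t) * dt + np.sqrt(dt) * rng.standard_normal(n_paths)
    if k + 1 == n_steps // 2:
        mid = x.copy()           # snapshot of the marginal at t = T/2

print((x - x_star).std())        # ~ sqrt(dt): every path is pinned at x_star
print(mid.mean(), mid.var())     # bridge marginal at T/2: N(x_star/2, T/4)
```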
</section>
<section id="positive-brownian-motion" class="level3">
<h3 class="anchored" data-anchor-id="positive-brownian-motion">Positive Brownian motion</h3>
<p>What about a scalar Brownian motion conditioned to stay positive at all times? Let us fix a time horizon <img src="https://latex.codecogs.com/png.latex?T">, condition first on the event that the Brownian motion stays positive within <img src="https://latex.codecogs.com/png.latex?%5B0,T%5D">, and later consider the limit <img src="https://latex.codecogs.com/png.latex?T%20%5Cto%20%5Cinfty">. The function <img src="https://latex.codecogs.com/png.latex?h"> reads</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ah(t,x)%20=%20%5Cmathbb%7BP%7D%20%7B%5Cleft(%20%5Ctext%7B$B_t$%20stays%20$%3E0$%20on%20$%5Bt,T%5D$%7D%20%5Cmid%20B_t=x%20%5Cright)%7D%20.%0A"></p>
<p>This can easily be calculated with the <a href="https://en.wikipedia.org/wiki/Reflection_principle_(Wiener_process)">reflection principle</a>. It equals</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ah(t,x)%20=%201%20-%202%20%5C,%20%5Cmathbb%7BP%7D(B_T%20%3C%200%20%5Cmid%20B_t%20=%20x)%0A=%0A%5Cmathbb%7BP%7D(%5Csqrt%7BT-t%7D%20%5C,%20%5C%7C%20%5Cxi%20%5C%7C%20%3C%20x)%0A"></p>
<p>for a standard Gaussian <img src="https://latex.codecogs.com/png.latex?%5Cxi%20%5Csim%20%5Cmathcal%7BN%7D(0,1)">. Plugging this into Equation&nbsp;7 gives that the additional drift term is</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cnabla%20%5Clog%20h(t,x)%20=%20%5Cfrac%7B%5Cexp%20%7B%5Cleft(%20-x%5E2%20/%20(2%20%5C,%20(T-t))%20%5Cright)%7D%20%7D%7Bx%7D%20%5Cquad%20%5Cto%20%5Cquad%20%5Cfrac%7B1%7D%7Bx%7D%0A"></p>
<p>as <img src="https://latex.codecogs.com/png.latex?T%20%5Cto%20%5Cinfty">. This shows that a Brownian motion conditioned to stay positive at all times has an upward drift of size <img src="https://latex.codecogs.com/png.latex?1/x">,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AdX%5E%5Cstar%20%5C;=%5C;%20%5Cfrac%7B1%7D%7BX%5E%7B%5Cstar%7D%7D%20%5C,%20dt%20+%20dB.%0A"></p>
<p>Incidentally, these are the dynamics of a <a href="https://en.wikipedia.org/wiki/Bessel_process">Bessel process</a> of dimension <img src="https://latex.codecogs.com/png.latex?d=3">, i.e.&nbsp;the law of the modulus of a three-dimensional Brownian motion. More generally, if one conditions a Brownian motion to stay within a closed domain <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BD%7D">, the conditioned dynamics exhibit a repulsive drift term of size about <img src="https://latex.codecogs.com/png.latex?1/%5Ctextrm%7Bdist%7D(x,%20%5Cpartial%20%5Cmathcal%7BD%7D)"> near the boundary <img src="https://latex.codecogs.com/png.latex?%5Cpartial%20%5Cmathcal%7BD%7D"> of the domain, as described below.</p>
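<p>The identification with the Bessel process can be checked by simulation. The sketch below (Euler-Maruyama with an absolute-value guard against rare discretization overshoots through zero; the starting point, horizon and step sizes are illustrative choices) compares the conditioned diffusion with the modulus of a three-dimensional Brownian motion started at a point of the same norm.</p>

```python
import numpy as np

rng = np.random.default_rng(2)
T, x0 = 1.0, 1.0
n_paths, n_steps = 20_000, 1_000
dt = T / n_steps

# Brownian motion conditioned to stay positive: dX = dt / X + dB.
x = np.full(n_paths, x0)
for _ in range(n_steps):
    x += dt / x + np.sqrt(dt) * rng.standard_normal(n_paths)
    x = np.abs(x)        # guard against rare discretization overshoots below 0

# Same law as the modulus of a 3D Brownian motion started at (x0, 0, 0).
w = np.array([x0, 0.0, 0.0]) + np.sqrt(T) * rng.standard_normal((n_paths, 3))
print(x.mean(), np.linalg.norm(w, axis=1).mean())   # the two means agree
```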
</section>
<section id="brownian-motion-staying-in-a-domain" class="level3">
<h3 class="anchored" data-anchor-id="brownian-motion-staying-in-a-domain">Brownian motion staying in a domain</h3>
<p>What about a Brownian motion conditioned to stay within a domain <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BD%7D"> forever? As before, consider a time horizon <img src="https://latex.codecogs.com/png.latex?T"> and define the function <img src="https://latex.codecogs.com/png.latex?h"> as</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ah(t,x)%20=%20%5Cmathbb%7BP%7D%20%7B%5Cleft(%20%5Ctext%7B$B_t$%20stays%20in%20$%5Cmathcal%7BD%7D$%20on%20$%5Bt,T%5D$%7D%20%5Cmid%20B_t=x%20%5Cright)%7D%20.%0A"></p>
<p>One can see that the function <img src="https://latex.codecogs.com/png.latex?h"> satisfies the PDE</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A(%5Cpartial_t%20+%20%5Ctfrac12%20%5C,%20%5CDelta)%20%5C,%20h%20=%200%0A"></p>
<p>and equals zero on the boundary <img src="https://latex.codecogs.com/png.latex?%5Cpartial%20%5Cmathcal%7BD%7D"> of the domain. Furthermore, <img src="https://latex.codecogs.com/png.latex?h(t,x)%20%5Cto%201"> as <img src="https://latex.codecogs.com/png.latex?t%20%5Cto%20T"> for all <img src="https://latex.codecogs.com/png.latex?x%20%5Cin%20%5Cmathcal%7BD%7D">. Consider the eigenfunctions <img src="https://latex.codecogs.com/png.latex?%5Cpsi_k"> of the negative Laplacian <img src="https://latex.codecogs.com/png.latex?-%5CDelta"> with Dirichlet boundary conditions on <img src="https://latex.codecogs.com/png.latex?%5Cpartial%20%5Cmathcal%7BD%7D">. Recall that <img src="https://latex.codecogs.com/png.latex?-%5CDelta"> is a positive operator with a discrete spectrum <img src="https://latex.codecogs.com/png.latex?%5Clambda_1%20%5Cleq%20%5Clambda_2%20%5Cleq%20%5Cldots"> of non-negative eigenvalues. The eigenfunction <img src="https://latex.codecogs.com/png.latex?%5Cpsi_1"> corresponding to the smallest eigenvalue <img src="https://latex.codecogs.com/png.latex?%5Clambda_1"> is called the principal eigenfunction; it is standard that it is positive within the domain <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BD%7D">, as a “slight” generalization of the <a href="https://en.wikipedia.org/wiki/Perron–Frobenius_theorem">Perron-Frobenius theorem</a> from linear algebra shows. Expanding <img src="https://latex.codecogs.com/png.latex?h"> in the basis of eigenfunctions <img src="https://latex.codecogs.com/png.latex?%5Cpsi_k"> gives</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ah(t,x)%20=%20%5Cunderbrace%7Bc_1%20%5C,%20e%5E%7B-%5Clambda_1%20%5C,%20(T-t)/2%7D%20%5C,%20%5Cpsi_1(x)%7D_%7B%5Ctextrm%7Bdominant%20contribution%7D%7D%20+%20%5Csum_%7Bk%20%5Cgeq%202%7D%20c_k%20%5C,%20e%5E%7B-%5Clambda_k%20%5C,%20(T-t)/2%7D%20%5C,%20%5Cpsi_k(x).%0A"></p>
<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/doob_transforms/eigenfunctions.jpg" class="img-fluid figure-img" style="width:90.0%"></p>
<figcaption>Eigenfunctions of the Laplacian</figcaption>
</figure>
</div>
</div>
<p>Since we are interested in the regime <img src="https://latex.codecogs.com/png.latex?T%20%5Cto%20%5Cinfty">, it holds that</p>
<p><img src="https://latex.codecogs.com/png.latex?%20%5Cnabla_x%20%5Clog%20h(t,x)%20%5C;%20%5Cto%20%5C;%20%5Cnabla%20%5Clog%20%5Cpsi_1(x)."></p>
<p>This shows that the conditioned Brownian motion has a drift term expressed in terms of the principal eigenfunction <img src="https://latex.codecogs.com/png.latex?%5Cpsi_1"> of the Laplacian:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AdX%5E%5Cstar%20%5C;=%5C;%20%20%5Ctextcolor%7Bblue%7D%7B%20%5Cnabla%20%5Clog%20%5Cpsi_1(X%5E%5Cstar)%20%5C,%20dt%7D%20+%20dB.%0A"></p>
<p>For example, if <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BD%7D%5Cequiv%20%5B0,L%5D"> for a 1D Brownian motion, the principal eigenfunction is <img src="https://latex.codecogs.com/png.latex?%5Cpsi_1(x)%20=%20%5Csin(%5Cpi%20%5C,%20x%20/L)">. This shows that there is an upward drift of size <img src="https://latex.codecogs.com/png.latex?%5Csim%201/x"> near <img src="https://latex.codecogs.com/png.latex?x%20%5Capprox%200"> and a downward drift of size <img src="https://latex.codecogs.com/png.latex?%5Csim%201/(L-x)"> near <img src="https://latex.codecogs.com/png.latex?x%20%5Capprox%20L">.</p>
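<p>A short simulation illustrates this last example. The sketch below runs the conditioned diffusion on <img src="https://latex.codecogs.com/png.latex?%5B0,L%5D"> with drift grad log psi_1(x) = (pi/L) / tan(pi x / L); the drift cap and the boundary reflection are purely numerical safeguards against the blow-up of the drift near the boundary, not part of the exact dynamics, and all numerical values are illustrative choices.</p>

```python
import numpy as np

rng = np.random.default_rng(3)
L, T = 1.0, 2.0
n_paths, dt = 4_000, 1e-3
n_steps = int(T / dt)

# Brownian motion conditioned to stay in [0, L] forever:
# drift = grad log psi_1(x) with psi_1(x) = sin(pi x / L).
x = np.full(n_paths, L / 2)
for _ in range(n_steps):
    drift = (np.pi / L) / np.tan(np.pi * x / L)
    drift = np.clip(drift, -100.0, 100.0)   # tame the boundary blow-up
    x += drift * dt + np.sqrt(dt) * rng.standard_normal(n_paths)
    x = np.abs(x)                           # reflect rare overshoots below 0
    x = L - np.abs(L - x)                   # reflect rare overshoots above L

# The conditioned process equilibrates to the density (2/L) sin^2(pi x / L).
print(x.min(), x.max())     # all paths remain inside [0, L]
print(x.mean())             # close to L/2 by symmetry
```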


</section>

 ]]></description>
  <category>SDE</category>
  <category>markov</category>
  <guid>https://alexxthiery.github.io/notes/doob_transforms/doob.html</guid>
  <pubDate>Mon, 13 May 2024 17:00:00 GMT</pubDate>
</item>
<item>
  <title>RWM &amp; HMC on manifolds</title>
  <link>https://alexxthiery.github.io/notes/MCMC_on_manifold/mcmc_manifold.html</link>
  <description><![CDATA[ 





<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/MCMC_on_manifold/mcmc_manifold.jpg" class="img-fluid quarto-figure quarto-figure-center figure-img" style="width:80.0%"></p>
</figure>
</div>
</div>
<p>Consider a smooth manifold <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BM%7D%5Csubset%20%5Cmathbb%7BR%7D%5En"> of dimension <img src="https://latex.codecogs.com/png.latex?d_%7B%5Cmathcal%7BM%7D%7D%20=%20(n-d)"> defined as the zero set of a well-behaved “constraint” function <img src="https://latex.codecogs.com/png.latex?C:%20%5Cmathbb%7BR%7D%5En%20%5Cto%20%5Cmathbb%7BR%7D%5Ed">,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7BM%7D=%20%5C%7B%20x%20%5Cin%20%5Cmathbb%7BR%7D%5En%20%5C;%20%5Ctext%7Bsuch%20that%7D%20%5C;%20C(x)%20=%200%20%5C%7D.%0A"></p>
<p>We would like to use MCMC to sample from a probability distribution supported on <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BM%7D"> with density <img src="https://latex.codecogs.com/png.latex?%5Cpi(x)"> with respect to the uniform <a href="https://en.wikipedia.org/wiki/Hausdorff_measure">Hausdorff measure</a> on <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BM%7D">. It is relatively straightforward to adapt standard MCMC methods when dealing with simple manifolds such as a sphere or a torus since their geodesics and several other geometric quantities are analytically tractable. Perhaps surprisingly, it is also relatively straightforward to design MCMC samplers on general implicitly defined manifolds such as <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BM%7D">. The article <span class="citation" data-cites="zappa2018monte">(Zappa, Holmes-Cerfon, and Goodman 2018)</span> explains these ideas beautifully.</p>
<section id="manifold-random-walk-metropolis-hastings" class="level3">
<h3 class="anchored" data-anchor-id="manifold-random-walk-metropolis-hastings">Manifold Random Walk Metropolis-Hastings</h3>
<p>Assume that <img src="https://latex.codecogs.com/png.latex?x_n%20%5Cin%20%5Cmathcal%7BM%7D"> is the current position of the MCMC chain. To generate a proposal <img src="https://latex.codecogs.com/png.latex?y_n%20%5Cin%20%5Cmathcal%7BM%7D"> that will eventually be accepted or rejected, one can proceed very similarly to the standard RWM algorithm with Gaussian perturbations with variance <img src="https://latex.codecogs.com/png.latex?%5Csigma%5E2">. First, generate a vector <img src="https://latex.codecogs.com/png.latex?v%20%5Cin%20T_%7Bx_n%7D"> from a centred Gaussian distribution with covariance <img src="https://latex.codecogs.com/png.latex?%5Csigma%5E2%20%5C,%20I"> on the tangent space <img src="https://latex.codecogs.com/png.latex?T_%7Bx_n%7D"> to <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BM%7D"> at <img src="https://latex.codecogs.com/png.latex?x_n">. To do so, it suffices for example to generate a Gaussian vector <img src="https://latex.codecogs.com/png.latex?z%20%5Csim%20%5Cmathcal%7BN%7D(0,%20%5Csigma%5E2%20I_n)"> in <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BR%7D%5En"> and orthogonally project it onto <img src="https://latex.codecogs.com/png.latex?T_%7Bx_n%7D">. One cannot, however, simply define the proposal as <img src="https://latex.codecogs.com/png.latex?x_n%20+%20v"> since it would not necessarily lie on <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BM%7D">. Instead, one projects <img src="https://latex.codecogs.com/png.latex?x_n%20+%20v"> back to <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BM%7D">. To do so, one needs to choose the direction used for the projection; the manifold RWM algorithm uses <img src="https://latex.codecogs.com/png.latex?T_%7Bx_n%7D%5E%5Cperp">, for reasons that will become clear later. 
In other words, the proposal <img src="https://latex.codecogs.com/png.latex?y_n"> is obtained by seeking a vector <img src="https://latex.codecogs.com/png.latex?w%20%5Cin%20T_%7Bx_n%7D%5E%7B%5Cperp%7D"> such that <img src="https://latex.codecogs.com/png.latex?x_n%20+%20v%20+%20w%20%5Cin%20%5Cmathcal%7BM%7D">.</p>
<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/MCMC_on_manifold/projection_onto_M.jpg" class="img-fluid figure-img" style="width:80.0%"></p>
<figcaption>Projection onto <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BM%7D"> from <span class="citation" data-cites="zappa2018monte">(Zappa, Holmes-Cerfon, and Goodman 2018)</span></figcaption>
</figure>
</div>
</div>
<p>If one calls <img src="https://latex.codecogs.com/png.latex?J_%7Bx_n%7D"> the Jacobian matrix of <img src="https://latex.codecogs.com/png.latex?C"> at <img src="https://latex.codecogs.com/png.latex?x_n">, i.e.&nbsp;the matrix whose <strong>rows</strong> are the gradients of the components of <img src="https://latex.codecogs.com/png.latex?C">, this projection operation boils down to finding a vector <img src="https://latex.codecogs.com/png.latex?%5Clambda%20%5Cin%20%5Cmathbb%7BR%7D%5Ed"> such that</p>
<p><span id="eq-projection"><img src="https://latex.codecogs.com/png.latex?%0AC(%20%5C,%20x_n%20+%20v%20+%20J_%7Bx_n%7D%5E%5Ctop%20%5Clambda)%20=%200%20%5Cin%20%5Cmathbb%7BR%7D%5Ed.%0A%5Ctag%7B1%7D"></span></p>
<p>Note that Equation&nbsp;1 is a non-linear equation in <img src="https://latex.codecogs.com/png.latex?%5Clambda"> that can have no solution, one solution or many solutions – this can seem like a fundamental roadblock to the design of a valid MCMC algorithm, but we will see that it is not! Before discussing the resolution of Equation&nbsp;1 in slightly more detail, assume that a standard root-finding algorithm takes the pair <img src="https://latex.codecogs.com/png.latex?(x_n+v,%20J_%7Bx_n%7D)"> as input and attempts to produce the projection <img src="https://latex.codecogs.com/png.latex?y_n">,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Ctext%7BProj%7D:%20%5Cquad%20(x_n+v,%20J_%7Bx_n%7D)%20%5C;%20%5Cunderbrace%7B%5Cmapsto%7D_%7B%5Ctext%7Broot-finding%7D%7D%20%5C;%20y_n%20%5Cin%20%5Cmathcal%7BM%7D.%0A"></p>
<p>The algorithm will either converge to one of the possible solutions or fail. If the algorithm fails to converge, one simply sets <img src="https://latex.codecogs.com/png.latex?y_n%20=%20%5Ctext%7B(Failed)%7D">, rejects the proposal, and sets <img src="https://latex.codecogs.com/png.latex?x_%7Bn+1%7D%20=%20x_n">. If the algorithm converges, this defines a valid proposal <img src="https://latex.codecogs.com/png.latex?y_n%20%5Cin%20%5Cmathcal%7BM%7D">. To ensure reversibility – and this is one of the main novelties of the article <span class="citation" data-cites="zappa2018monte">(Zappa, Holmes-Cerfon, and Goodman 2018)</span> – one needs to verify that the reverse proposal <img src="https://latex.codecogs.com/png.latex?y_n%20%5Cmapsto%20x_n"> is possible.</p>
<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/MCMC_on_manifold/reverse_mcmc.jpg" class="img-fluid figure-img" style="width:80.0%"></p>
<figcaption>Reversibility check <span class="citation" data-cites="zappa2018monte">(Zappa, Holmes-Cerfon, and Goodman 2018)</span></figcaption>
</figure>
</div>
</div>
<p>To do so, note that the only possibility for the reverse move <img src="https://latex.codecogs.com/png.latex?y_n%20%5Cto%20x_n"> to happen is if <img src="https://latex.codecogs.com/png.latex?x_n%20=%20%5Ctext%7BProj%7D(y_n%20+%20v',%20J_%7By_n%7D)"> where</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ax_n-y_n%20%5C;=%5C;%20%5Cunderbrace%7Bv'%7D_%7B%5Cin%20T_%7By_n%7D%7D%20%20%5C,%20+%20%5C,%20%5Cunderbrace%7Bw'%7D_%7B%5Cin%20T_%7By_n%7D%5E%7B%5Cperp%7D%7D.%0A"></p>
<p>The uniqueness follows from the decomposition <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BR%7D%5En%20%5Cequiv%20T_%7By_n%7D%20%5Coplus%20T_%7By_n%7D%5E%7B%5Cperp%7D">. The reverse move is consequently possible if and only if the following <strong>reversibility check</strong> condition is satisfied,</p>
<p><span id="eq-reversibility"><img src="https://latex.codecogs.com/png.latex?%0Ax_n%20=%20%5Ctext%7BProj%7D(y_n%20+%20v',%20J_%7By_n%7D).%0A%5Ctag%7B2%7D"></span></p>
<p>This reversibility check is necessary as it is not guaranteed that the root-finding algorithm started from <img src="https://latex.codecogs.com/png.latex?y_n%20+%20v'"> converges at all, or converges to <img src="https://latex.codecogs.com/png.latex?x_n"> in the case when there are several solutions. If Equation&nbsp;2 is not satisfied, the proposal <img src="https://latex.codecogs.com/png.latex?y_n"> is rejected and one sets <img src="https://latex.codecogs.com/png.latex?x_%7Bn+1%7D%20=%20x_n">. On the other hand, if Equation&nbsp;2 is satisfied, the proposal <img src="https://latex.codecogs.com/png.latex?y_n"> is accepted with the usual Metropolis-Hastings probability</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmin%20%5Cleft%5C%7B1,%20%5Cfrac%7B%5Cpi(y_n)%20%5C,%20p(v'%7Cy_n)%7D%7B%5Cpi(x_n)%20%5C,%20p(v%7Cx_n)%7D%20%5Cright%5C%7D%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?p(v%7Cx)%20=%20Z%5E%7B-1%7D%20%5C,%20%5Cexp(-%5C%7Cv%5C%7C%5E2%20/%202%20%5Csigma%5E2)"> denotes the Gaussian density on the tangent space <img src="https://latex.codecogs.com/png.latex?T_%7Bx_n%7D">. The above description defines a valid MCMC algorithm on <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BM%7D"> that is reversible with respect to the target distribution <img src="https://latex.codecogs.com/png.latex?%5Cpi(x)">.</p>
</section>
<section id="projection-onto-the-manifold" class="level3">
<h3 class="anchored" data-anchor-id="projection-onto-the-manifold">Projection onto the manifold</h3>
<p>As described above, the main difficulty is to solve the non-linear Equation&nbsp;1 describing the projection of the proposal <img src="https://latex.codecogs.com/png.latex?(x_n%20+%20v)"> back onto the manifold <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BM%7D">. The projection is along the space spanned by the columns of <img src="https://latex.codecogs.com/png.latex?J_%7Bx_n%7D%5E%5Ctop%20%5Cin%20%5Cmathbb%7BR%7D%5E%7Bn,d%7D">, i.e.&nbsp;one needs to find a vector <img src="https://latex.codecogs.com/png.latex?%5Clambda%20%5Cin%20%5Cmathbb%7BR%7D%5Ed"> such that</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5CPhi(%5Clambda)%20=%20C(%20%5C,%20x_n%20+%20v%20+%20J_%7Bx_n%7D%5E%5Ctop%20%5Clambda)%20=%200%20%5Cin%20%5Cmathbb%7BR%7D%5Ed.%0A"></p>
<p>One can use a standard Newton method, started from <img src="https://latex.codecogs.com/png.latex?%5Clambda_0=0">, to solve this equation. Setting for notational convenience <img src="https://latex.codecogs.com/png.latex?q(%5Clambda)%20=%20x_n%20+%20v%20+%20J_%7Bx_n%7D%5ET%20%5C,%20%5Clambda">, this boils down to iterating</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Clambda_%7Bk+1%7D%20-%20%5Clambda_%7Bk%7D%0A=%0A-%20%5Cleft(%20J_%7Bq(%5Clambda_k)%7D%20%5C,%20J_%7Bx_n%7D%5E%5Ctop%20%5Cright)%5E%7B-1%7D%20%5C,%20%5CPhi(%5Clambda_k).%0A"></p>
<p>As described in <span class="citation" data-cites="barth1995algorithms">(Barth et al. 1995)</span>, it can sometimes be computationally advantageous to use a quasi-Newton method and iterate instead</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Clambda_%7Bk+1%7D%20-%20%5Clambda_%7Bk%7D%0A=%0A-%20G%5E%7B-1%7D%20%5C,%20%5CPhi(%5Clambda_k)%0A"></p>
<p>with <strong>fixed</strong> positive definite matrix <img src="https://latex.codecogs.com/png.latex?G%20=%20J_%7Bx_n%7D%20%5C,%20J_%7Bx_n%7D%5E%5Ctop"> since one can then pre-compute a decomposition of <img src="https://latex.codecogs.com/png.latex?G"> and use it to solve the linear systems at each iteration. In some recent and related work <span class="citation" data-cites="au2020manifold">(Au, Graham, and Thiery 2022)</span>, we observed that the standard Newton method performed well in the settings we considered and that there was usually no computational advantage to using a quasi-Newton method. In practice, the main computational bottleneck is computing the Jacobian matrix <img src="https://latex.codecogs.com/png.latex?J_%7Bx_n%7D">, although this is problem-dependent and some structure can typically be exploited. Typically, only a relatively small number of iterations is performed and the root-finding algorithm is stopped as soon as <img src="https://latex.codecogs.com/png.latex?%5C%7C%5CPhi(%5Clambda_k)%5C%7C"> falls below a certain threshold. If the step-size is small, i.e.&nbsp;<img src="https://latex.codecogs.com/png.latex?%5C%7Cv%5C%7C%20%5Cll%201">, Newton’s method will typically converge to a solution in only a very small number of iterations – indeed, <a href="https://en.wikipedia.org/wiki/Newton%27s_method">Newton’s method</a> is quadratically convergent when close to a solution.</p>
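<p>The Newton iteration above is straightforward to implement. Here is a minimal plain-NumPy sketch (the experiments in this note use JAX; the helper names <code>project</code>, <code>C</code> and <code>jac</code> are illustrative, not from a library):</p>

```python
import numpy as np

def project(C, jac, x, v, max_iter=25, tol=1e-10):
    """Newton projection of x + v back onto the manifold {C = 0}.

    jac(q) returns the (d, n) Jacobian of C at q; the search is constrained
    to the span of the columns of jac(x)^T, as in Equation 1.
    Returns (point, converged). Illustrative sketch, not library code.
    """
    Jx = np.atleast_2d(jac(x))
    lam = np.zeros(Jx.shape[0])
    z = x + v
    for _ in range(max_iter):
        q = z + Jx.T @ lam
        Phi = np.atleast_1d(C(q))
        if np.linalg.norm(Phi) < tol:
            return q, True
        # Newton update: lam <- lam - (J_q J_x^T)^{-1} Phi(lam)
        Jq = np.atleast_2d(jac(q))
        lam = lam - np.linalg.solve(Jq @ Jx.T, Phi)
    return z + Jx.T @ lam, False
```

<p>Replacing <code>jac(q)</code> inside the loop by the fixed <code>jac(x)</code> gives the quasi-Newton variant with the pre-computable matrix <code>G</code>.</p>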
<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/MCMC_on_manifold/RWM_double_torus.gif" class="img-fluid figure-img" style="width:90.0%"></p>
<figcaption>30k RWM chains run in parallel to explore a double torus.</figcaption>
</figure>
</div>
</div>
<p>In the figure above, I have used the RWM algorithm described earlier to sample from the uniform distribution supported on a double torus defined by the constraint function <img src="https://latex.codecogs.com/png.latex?C:%20%5Cmathbb%7BR%7D%5E3%20%5Cto%20%5Cmathbb%7BR%7D"> given as</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AC(x,y,z)%20=%20(x%5E2%20%5C,%20(x%5E2%20-%201)%20+%20y%5E2)%5E2+z%5E2-0.03.%0A"></p>
<p>The figure shows <img src="https://latex.codecogs.com/png.latex?30,000"> chains run in parallel, which is straightforward to implement in practice with JAX <span class="citation" data-cites="jax2018github">(Bradbury et al. 2018)</span>. All the chains are initialized from the same position so that one can visualize the evolution of the density of particles.</p>
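<p>For reference, the constraint function and its gradient can be written out explicitly; this is a plain-NumPy sketch with a hand-written gradient (the animation itself relies on JAX, which obtains the Jacobian by automatic differentiation):</p>

```python
import numpy as np

def C(p):
    """Double-torus constraint C(x, y, z) from the note."""
    x, y, z = p
    u = x**2 * (x**2 - 1) + y**2
    return u**2 + z**2 - 0.03

def grad_C(p):
    """Gradient of C, written by hand via the chain rule on u."""
    x, y, z = p
    u = x**2 * (x**2 - 1) + y**2
    return np.array([2 * u * (4 * x**3 - 2 * x), 4 * u * y, 2 * z])
```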
<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/MCMC_on_manifold/RWM_manifold_tuning.png" class="img-fluid figure-img" style="width:100.0%"></p>
<figcaption>Tuning of manifold-RWM</figcaption>
</figure>
</div>
</div>
<p>One can for example monitor the usual expected squared jump distance</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Ctextrm%7B(ESJD)%7D%20%5Cequiv%20%5Cmathbb%7BE%7D%5B%5C%7CX_%7Bn+1%7D%20-%20X_n%5C%7C%5E2%5D%0A"></p>
<p>and maximize it to tune the RWM step-size; it would probably make slightly more sense to monitor the squared geodesic distances instead of the naive squared norm <img src="https://latex.codecogs.com/png.latex?%5C%7CX_%7Bn+1%7D%20-%20X_n%5C%7C%5E2">, but that’s way too much hassle and would probably make only a negligible difference. In the figure above, I have plotted the expected squared jump distance as a function of the acceptance rate for different step-sizes. It is interesting to see a pattern extremely similar to the one observed in the standard RWM algorithm <span class="citation" data-cites="roberts2001optimal">(Roberts and Rosenthal 2001)</span>: in this double torus example, the optimal acceptance rate is around <img src="https://latex.codecogs.com/png.latex?25%5C%25">. Note that since the target distribution is uniform, the rate of acceptance is only very slightly lower than the proportion of successful reversibility checks.</p>
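<p>The expected squared jump distance is easy to estimate from a stored chain; a minimal sketch, assuming the chain is stored as a <code>(T, d)</code> array:</p>

```python
import numpy as np

def esjd(chain):
    """Monte Carlo estimate of E[||X_{n+1} - X_n||^2] from a (T, d) array."""
    diffs = np.diff(chain, axis=0)
    return float(np.mean(np.sum(diffs**2, axis=1)))
```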
</section>
<section id="hamiltonian-monte-carlo-hmc-on-manifolds" class="level3">
<h3 class="anchored" data-anchor-id="hamiltonian-monte-carlo-hmc-on-manifolds">Hamiltonian Monte Carlo (HMC) on manifolds</h3>
<p>While the Random Walk Metropolis-Hastings algorithm is interesting, exploiting gradient information is often necessary to design efficient MCMC samplers. Consider a single iteration of a standard <a href="https://en.wikipedia.org/wiki/Hamiltonian_Monte_Carlo">Hamiltonian Monte Carlo (HMC)</a> sampler targeting a density <img src="https://latex.codecogs.com/png.latex?%5Cpi(q)"> on <img src="https://latex.codecogs.com/png.latex?q%20%5Cin%20%5Cmathbb%7BR%7D%5En">. The method proceeds by simulating from a dynamics that is reversible with respect to an extended target density <img src="https://latex.codecogs.com/png.latex?%5Cbar%7B%5Cpi%7D(q,p)"> on <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BR%7D%5En%20%5Ctimes%20%5Cmathbb%7BR%7D%5En"> defined as</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Baligned%7D%0A%5Cbar%7B%5Cpi%7D(q,p)%0A&amp;%5Cpropto%20%5Cpi(q)%20%5C,%20%5Cexp%20%5Cleft%5C%7B%20-%5Cfrac%7B1%7D%7B2m%7D%20%5C%7Cp%5C%7C%5E2%20%5Cright%5C%7D%5C%5C%0A&amp;=%20%5Cexp%5Cleft%5C%7B%20%5Clog%20%5Cpi(q)%20-%20K(p)%20%5Cright%5C%7D%0A%5Cend%7Baligned%7D%0A"></p>
<p>for a user-defined mass parameter <img src="https://latex.codecogs.com/png.latex?m%20%3E%200">. In general, the mass parameter is a positive definite <strong>matrix</strong> but generalizing this to manifolds is slightly less useful in practice. For a time-discretization step <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon%3E%200"> and a current position <img src="https://latex.codecogs.com/png.latex?(q_n,p_n)">, the method proceeds by generating a proposal <img src="https://latex.codecogs.com/png.latex?(q_%7B*%7D,p_%7B*%7D)"> defined as</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Baligned%7D%0Ap_%7Bn+1/2%7D%20&amp;=%20p_n%20+%20%5Cfrac%7B%5Cvarepsilon%7D%7B2%7D%20%5Cnabla%20%5Clog%20%5Cpi(q_n)%5C%5C%0Aq_%7B*%7D%20&amp;=%20q_n%20+%20%5Cvarepsilon%5C,%20m%5E%7B-1%7D%20%5C,%20p_%7Bn+1/2%7D%5C%5C%0Ap_%7B*%7D%20&amp;=%20p_%7Bn+1/2%7D%20+%20%5Cfrac%7B%5Cvarepsilon%7D%7B2%7D%20%5Cnabla%20%5Clog%20%5Cpi(q_%7B*%7D).%0A%5Cend%7Baligned%7D%0A%5Cright.%0A"></p>
<p>This proposal is accepted with probability <img src="https://latex.codecogs.com/png.latex?%5Cmin%5Cleft(%201,%20%5Cbar%7B%5Cpi%7D(q_*,%20p_*)/%5Cbar%7B%5Cpi%7D(q_n,%20p_n)%20%5Cright)">. In standard implementations, several leapfrog steps are performed instead of a single one. One can also choose to perform a single leapfrog step as above and only do a partial refreshment of the momentum after each leapfrog step – this may be more efficient or easier to implement when running a large number of HMC chains in parallel on a GPU for example. To adapt the HMC algorithm to sample from a density <img src="https://latex.codecogs.com/png.latex?%5Cpi(q)"> supported on a manifold <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BM%7D">, one can proceed similarly to the RWM algorithm by interleaving additional projection steps. These projections are needed to ensure that the momentum vectors <img src="https://latex.codecogs.com/png.latex?p_n"> remain in the right tangent spaces and the position vectors <img src="https://latex.codecogs.com/png.latex?q_n"> remain on the manifold <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BM%7D">,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A(q_n,%20p_n)%20%5C;%20%5Cin%20%5C;%20%5Cmathcal%7BM%7D%5Ctimes%20T_%7Bq_n%7D.%0A"></p>
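<p>For concreteness, the unconstrained leapfrog update above reads as follows in code, a minimal sketch with scalar mass <code>m</code>:</p>

```python
import numpy as np

def leapfrog(q, p, grad_log_pi, eps, m=1.0):
    """One leapfrog step for standard (unconstrained) HMC."""
    p = p + 0.5 * eps * grad_log_pi(q)   # half momentum kick
    q = q + eps * p / m                  # full position drift
    p = p + 0.5 * eps * grad_log_pi(q)   # second half kick
    return q, p
```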
<p>As in the RWM algorithm, reversibility checks need to be performed to ensure that the overall algorithm is reversible with respect to the target distribution <img src="https://latex.codecogs.com/png.latex?%5Coverline%7B%5Cpi%7D(q,p)">. The resulting algorithm for generating a proposal <img src="https://latex.codecogs.com/png.latex?(q_n,%20p_n)%20%5Cmapsto%20(q_*,%20p_*)"> reads as follows:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Baligned%7D%0A%5Cwidetilde%7Bp%7D_%7Bn+1/2%7D%20&amp;=%20p_n%20+%20%5Cfrac%7B%5Cvarepsilon%7D%7B2%7D%20%5Cnabla%20%5Clog%20%5Cpi(q_n)%5C%5C%0Ap_%7Bn+1/2%7D%20&amp;=%20%20%5Ctextcolor%7Bred%7D%7B%5Ctext%7Borthogonal%20project%20$%5Cwidetilde%7Bp%7D_%7Bn+1/2%7D$%20onto%20$T_%7Bq_n%7D$%7D%7D%20%5C%5C%0A%5Cwidetilde%7Bq%7D_%7B*%7D%20&amp;=%20q_n%20+%20%5Cvarepsilon%5C,%20m%5E%7B-1%7D%20%5C,%20p_%7Bn+1/2%7D%5C%5C%0Aq_%7B*%7D%20&amp;=%20%20%5Ctextcolor%7Bred%7D%7B%5Ctext%7BProj$(%5Cwidetilde%7Bq%7D_%7B*%7D,%20J_%7Bq_n%7D)$%7D%7D%5C%5C%0A%5Coverline%7Bp%7D_%7Bn+1/2%7D%20&amp;=%20%20%5Ctextcolor%7Bred%7D%7B%5Ctext%7Borthogonal%20project%20$(q_%7B*%7D-q_n)%20%5C,%20m%20/%20%5Cvarepsilon$%20onto%20$T_%7Bq_%7B*%7D%7D$%7D%7D%20%5C%5C%0A%5Cwidetilde%7Bp%7D_%7B*%7D%20&amp;=%20%5Coverline%7Bp%7D_%7Bn+1/2%7D%20+%20%5Cfrac%7B%5Cvarepsilon%7D%7B2%7D%20%5Cnabla%20%5Clog%20%5Cpi(q_%7B*%7D)%5C%5C%0Ap_%7B*%7D%20&amp;=%20%20%5Ctextcolor%7Bred%7D%7B%5Ctext%7Borthogonal%20project%20$%5Cwidetilde%7Bp%7D_%7B*%7D$%20onto%20$T_%7Bq_%7B*%7D%7D$%7D%7D.%0A%5Cend%7Baligned%7D%0A%5Cright.%0A"></p>
<p>If any of the projection operations fail, the proposal is rejected. If no failure occurs, a reversibility check is performed by running the algorithm backward starting from <img src="https://latex.codecogs.com/png.latex?(q_*,%20-p_*)">. If the reversibility check is successful, the proposal is accepted with the usual Metropolis-Hastings probability <img src="https://latex.codecogs.com/png.latex?%5Cmin%5Cleft(%201,%20%5Cbar%7B%5Cpi%7D(q_*,%20p_*)/%5Cbar%7B%5Cpi%7D(q_n,%20p_n)%20%5Cright)">.</p>
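<p>Each of the red projection steps onto a tangent space is a standard linear-algebra operation; a minimal NumPy sketch, where <code>Jq</code> denotes the constraint Jacobian at the relevant point:</p>

```python
import numpy as np

def tangent_project(p, Jq):
    """Orthogonal projection of p onto the tangent space {v : Jq @ v = 0}."""
    Jq = np.atleast_2d(Jq)
    # coefficients of p on the normal directions span(Jq^T)
    coef = np.linalg.solve(Jq @ Jq.T, Jq @ p)
    return p - Jq.T @ coef
```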
<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/MCMC_on_manifold/HMC_double_torus_compressed.gif" class="img-fluid figure-img" style="width:90.0%"></p>
<figcaption>5k HMC chains run in parallel: the momentum is not refreshed</figcaption>
</figure>
</div>
</div>
<p>The article <span class="citation" data-cites="lelievre2019hybrid">(Lelièvre, Rousset, and Stoltz 2019)</span> provides a detailed description of several of these ideas, along with careful analysis and extensions.</p>



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0">
<div id="ref-au2020manifold" class="csl-entry">
Au, Khai Xiang, Matthew M Graham, and Alexandre H Thiery. 2022. <span>“Manifold Lifting: Scaling MCMC to the Vanishing Noise Regime.”</span> <em>Journal of the Royal Statistical Society: Series B</em>. <a href="https://arxiv.org/abs/2003.03950">https://arxiv.org/abs/2003.03950</a>.
</div>
<div id="ref-barth1995algorithms" class="csl-entry">
Barth, Eric, Krzysztof Kuczera, Benedict Leimkuhler, and Robert D Skeel. 1995. <span>“Algorithms for Constrained Molecular Dynamics.”</span> <em>Journal of Computational Chemistry</em> 16 (10). Wiley Online Library: 1192–1209. <a href="https://doi.org/10.1002/jcc.540161003">https://doi.org/10.1002/jcc.540161003</a>.
</div>
<div id="ref-jax2018github" class="csl-entry">
Bradbury, James, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, et al. 2018. <span>“<span>JAX</span>: Composable Transformations of <span>P</span>ython+<span>N</span>um<span>P</span>y Programs.”</span> <a href="http://github.com/google/jax">http://github.com/google/jax</a>.
</div>
<div id="ref-lelievre2019hybrid" class="csl-entry">
Lelièvre, Tony, Mathias Rousset, and Gabriel Stoltz. 2019. <span>“Hybrid Monte Carlo Methods for Sampling Probability Measures on Submanifolds.”</span> <em>Numerische Mathematik</em> 143 (2). Springer: 379–421. <a href="https://arxiv.org/abs/1807.02356">https://arxiv.org/abs/1807.02356</a>.
</div>
<div id="ref-roberts2001optimal" class="csl-entry">
Roberts, Gareth O, and Jeffrey S Rosenthal. 2001. <span>“Optimal Scaling for Various Metropolis-Hastings Algorithms.”</span> <em>Statistical Science</em> 16 (4). Institute of Mathematical Statistics: 351–67. <a href="https://doi.org/10.1214/ss/1015346320">https://doi.org/10.1214/ss/1015346320</a>.
</div>
<div id="ref-zappa2018monte" class="csl-entry">
Zappa, Emilio, Miranda Holmes-Cerfon, and Jonathan Goodman. 2018. <span>“Monte Carlo on Manifolds: Sampling Densities and Integrating Functions.”</span> <em>Communications on Pure and Applied Mathematics</em> 71 (12). Wiley Online Library: 2609–47. <a href="https://arxiv.org/abs/1702.08446">https://arxiv.org/abs/1702.08446</a>.
</div>
</div></section></div> ]]></description>
  <category>MCMC</category>
  <category>manifold</category>
  <guid>https://alexxthiery.github.io/notes/MCMC_on_manifold/mcmc_manifold.html</guid>
  <pubDate>Fri, 08 Mar 2024 17:00:00 GMT</pubDate>
</item>
<item>
  <title>Metropolis-Hastings ratio with deterministic proposals</title>
  <link>https://alexxthiery.github.io/notes/MCMC_deterministic_proposals/MCMC_deterministic.html</link>
  <description><![CDATA[ 





<p>Consider a probability density <img src="https://latex.codecogs.com/png.latex?%5Cpi(x)"> on <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BR%7D%5Ed"> and a (deterministic) function <img src="https://latex.codecogs.com/png.latex?F:%20%5Cmathbb%7BR%7D%5Ed%20%5Cto%20%5Cmathbb%7BR%7D%5Ed">. Assume further that <img src="https://latex.codecogs.com/png.latex?F"> is an <a href="https://en.wikipedia.org/wiki/Involution_(mathematics)">involution</a> in the sense that</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AF(F(x))%20=%20x%0A"></p>
<p>for all <img src="https://latex.codecogs.com/png.latex?x%20%5Cin%20%5Cmathbb%7BR%7D%5Ed">. To keep things simple, since it is not really the point of this short note, suppose that <img src="https://latex.codecogs.com/png.latex?%5Cpi(x)%3E0"> everywhere and that <img src="https://latex.codecogs.com/png.latex?F"> is smooth. This type of transformation can be used to define Markov Chain Monte Carlo algorithms, e.g.&nbsp;the standard <a href="https://en.wikipedia.org/wiki/Hamiltonian_Monte_Carlo">Hamiltonian Monte Carlo (HMC)</a> algorithm. To design an MCMC scheme with this involution <img src="https://latex.codecogs.com/png.latex?F">, one needs to answer the following basic question: suppose that <img src="https://latex.codecogs.com/png.latex?X%20%5Csim%20%5Cpi(dx)"> and the proposal <img src="https://latex.codecogs.com/png.latex?Y%20=%20F(X)"> is constructed and accepted with probability <img src="https://latex.codecogs.com/png.latex?%5Calpha(X)">, how should the acceptance probability function <img src="https://latex.codecogs.com/png.latex?%5Calpha:%20%5Cmathbb%7BR%7D%5Ed%20%5Cto%20%5B0,1%5D"> be chosen so that the resulting random variable <img src="https://latex.codecogs.com/png.latex?Z%20%5C;%20=%20%5C;%20Y%20%5C,%20B%20+%20(1-B)%20%5C,%20X"> is also distributed according to <img src="https://latex.codecogs.com/png.latex?%5Cpi(dx)">? The Bernoulli random variable <img src="https://latex.codecogs.com/png.latex?B"> is such that <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D(B=1%7CX=x)=%5Calpha(x)">. In other words, for any test function <img src="https://latex.codecogs.com/png.latex?%5Cvarphi:%20%5Cmathbb%7BR%7D%5Ed%20%5Cto%20%5Cmathbb%7BR%7D">, we would like <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D%5B%5Cvarphi(Z)%5D%20=%20%5Cmathbb%7BE%7D%5B%5Cvarphi(X)%5D">, which means that</p>
<p><span id="eq-necessary"><img src="https://latex.codecogs.com/png.latex?%0A%5Cint%20%20%7B%5Cleft%5C%7B%20%20%5Cvarphi(F(x))%20%5C,%20%5Calpha(x)%20+%20%5Cvarphi(x)%20%5C,%20(1-%5Calpha(x))%20%20%5Cright%5C%7D%7D%20%20%5C,%20%5Cpi(dx)%20=%20%5Cint%20%5Cvarphi(x)%20%5C,%20%5Cpi(dx).%0A%5Ctag%7B1%7D"></span></p>
<p>Requiring for Equation&nbsp;1 to hold for any test function <img src="https://latex.codecogs.com/png.latex?%5Cvarphi"> is easily seen to be equivalent to asking for the equation</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Calpha(x)%20%5C,%20%5Cpi(x)%20%5C;%20=%20%5C;%20%5Calpha(y)%20%5C,%20%5Cpi(y)%20%5C,%20%7CJ_F(x)%7C%0A"></p>
<p>to hold for any <img src="https://latex.codecogs.com/png.latex?x%20%5Cin%20%5Cmathbb%7BR%7D%5Ed"> where <img src="https://latex.codecogs.com/png.latex?y=F(x)"> and <img src="https://latex.codecogs.com/png.latex?J_F(x)"> is the Jacobian of <img src="https://latex.codecogs.com/png.latex?F"> at <img src="https://latex.codecogs.com/png.latex?x">. Since <img src="https://latex.codecogs.com/png.latex?%7CJ_F(y)%7C%20%5Ctimes%20%7CJ_F(x)%7C%20=%201"> because the function <img src="https://latex.codecogs.com/png.latex?F"> is an involution, this also reads</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Calpha(x)%20%5C,%20%5Cfrac%7B%5Cpi(x)%20%7D%7B%7CJ_F(x)%7C%5E%7B1/2%7D%7D%20%5C;%20=%20%5C;%0A%5Calpha(y)%20%5C,%20%5Cfrac%7B%5Cpi(y)%20%7D%7B%7CJ_F(y)%7C%5E%7B1/2%7D%7D.%0A"></p>
<p>At this point, it becomes clear to anyone familiar with the correctness-proof of the usual <a href="https://en.wikipedia.org/wiki/Metropolis–Hastings_algorithm">Metropolis-Hastings algorithm</a> that a possible solution is</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Calpha(x)%20%5C;%20=%20%5C;%20%5Cmin%20%7B%5Cleft%5C%7B%201,%20%5Cfrac%7B%5Cpi(y)%20/%20%7CJ_F(y)%7C%5E%7B1/2%7D%7D%7B%5Cpi(x)%20/%20%7CJ_F(x)%7C%5E%7B1/2%7D%7D%20%5Cright%5C%7D%7D%0A"></p>
<p>although there are indeed many other possible solutions. Since <img src="https://latex.codecogs.com/png.latex?%7CJ_F(y)%7C%20%5Ctimes%20%7CJ_F(x)%7C%20=%201">, this also reads</p>
<p><span id="eq-MH"><img src="https://latex.codecogs.com/png.latex?%0A%5Calpha(x)%20%5C;%20=%20%5C;%20%5Cmin%20%7B%5Cleft%5C%7B%201,%20%5Cfrac%7B%5Cpi(y)%7D%7B%5Cpi(x)%7D%20%5C,%20%7CJ_F(x)%7C%20%5Cright%5C%7D%7D%20.%0A%5Ctag%7B2%7D"></span></p>
<p>One can reach a similar conclusion by looking at the Radon-Nikodym ratio <img src="https://latex.codecogs.com/png.latex?%5B%5Cpi(dx)%20%5Cotimes%20q(x,dy)%5D%20/%20%5B%5Cpi(dy)%20%5Cotimes%20q(y,dx)%5D"> where <img src="https://latex.codecogs.com/png.latex?q(x,dy)"> is the Markov kernel describing the deterministic transformation <span class="citation" data-cites="green1995reversible">(Green 1995)</span>, but I do not find this approach significantly simpler. The very neat article <span class="citation" data-cites="andrieu2020general">(Andrieu, Lee, and Livingstone 2020)</span> describes much more sophisticated and interesting generalizations. In practice, Equation&nbsp;2 is often used in the simpler case when <img src="https://latex.codecogs.com/png.latex?F"> is volume preserving, i.e.&nbsp;<img src="https://latex.codecogs.com/png.latex?%7CJ_F(x)%7C=1">, as is the case for the <a href="https://en.wikipedia.org/wiki/Hamiltonian_Monte_Carlo">Hamiltonian Monte Carlo (HMC)</a>. The discussion above was prompted by a student implementing a variant of this but with the wrong acceptance ratio <img src="https://latex.codecogs.com/png.latex?%5Calpha(x)%20=%20%5Cmin%20%7B%5Cleft%5C%7B%201,%20%5Cfrac%7B%5Cpi(y)%7D%7B%5Cpi(x)%7D%20%5C,%20%5Cfrac%7B%7CJ_F(x)%7C%7D%7B%7CJ_F(y)%7C%7D%20%5Cright%5C%7D%7D%20"> and us taking quite a bit of time to find the bug…</p>
<p>Note that there are interesting and practical situations when the function <img src="https://latex.codecogs.com/png.latex?F"> satisfies the involution property <img src="https://latex.codecogs.com/png.latex?F(F(x))=x"> only when <img src="https://latex.codecogs.com/png.latex?x"> belongs to a subset of the state-space. For instance, this can happen when implementing MCMC on a manifold <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BM%7D%20%5Csubset%20%5Cmathbb%7BR%7D%5Ed"> and the function <img src="https://latex.codecogs.com/png.latex?F"> involves a “projection” on the manifold <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BM%7D">, as for example described in the really interesting article <span class="citation" data-cites="zappa2018monte">(Zappa, Holmes-Cerfon, and Goodman 2018)</span>. In that case, it suffices to add a “reversibility check”, i.e.&nbsp;make sure that when applying <img src="https://latex.codecogs.com/png.latex?F"> to the proposal <img src="https://latex.codecogs.com/png.latex?y=F(x)">, one goes back to <img src="https://latex.codecogs.com/png.latex?x"> in the sense that <img src="https://latex.codecogs.com/png.latex?F(y)=x">. The acceptance probability in that case should be amended and expressed as</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Calpha(x)%20%5C;%20=%20%5C;%20%5Cmin%20%7B%5Cleft%5C%7B%201,%20%5Cfrac%7B%5Cpi(y)%7D%7B%5Cpi(x)%7D%20%5C,%20%7CJ_F(x)%7C%20%5Cright%5C%7D%7D%20%20%5C,%20%5Cmathbf%7B1%7D%20%7B%5Cleft(%20F(y)=x%20%5Cright)%7D%20.%0A"></p>
<p>In other words, if applying <img src="https://latex.codecogs.com/png.latex?F"> to the proposal <img src="https://latex.codecogs.com/png.latex?y=F(x)"> does not lead back to <img src="https://latex.codecogs.com/png.latex?x">, the proposal is always rejected.</p>
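<p>Putting the pieces together, one Metropolis step with a deterministic involutive proposal, including the reversibility check, can be sketched as follows (a minimal sketch; the function names are illustrative):</p>

```python
import numpy as np

def involution_step(x, F, jac_det, log_pi, rng, tol=1e-9):
    """One Metropolis step with the deterministic proposal y = F(x).

    Accepts with probability min(1, pi(y)/pi(x) * |J_F(x)|), after first
    checking that F(F(x)) returns to x (the 'reversibility check').
    """
    y = F(x)
    if not np.allclose(F(y), x, atol=tol):
        return x  # reversibility check failed: always reject
    log_ratio = log_pi(y) - log_pi(x) + np.log(abs(jac_det(x)))
    if np.log(rng.uniform()) < log_ratio:
        return y
    return x
```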
<section id="same-but-without-involution" class="level3">
<h3 class="anchored" data-anchor-id="same-but-without-involution">Same, but without involution</h3>
<p>In some situations, the requirement for <img src="https://latex.codecogs.com/png.latex?F"> to be an involution can seem cumbersome. What if we consider the more general situation of a smooth bijection <img src="https://latex.codecogs.com/png.latex?T:%20%5Cmathbb%7BR%7D%5Ed%20%5Cto%20%5Cmathbb%7BR%7D%5Ed"> and its inverse <img src="https://latex.codecogs.com/png.latex?T%5E%7B-1%7D">? In that case, one can directly apply what has been described in the previous section: it suffices to consider an extended state-space <img src="https://latex.codecogs.com/png.latex?(x,%5Cvarepsilon)"> obtained by including an index <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon%5Cin%20%5C%7B-1,1%5C%7D"> and the involution <img src="https://latex.codecogs.com/png.latex?F"> defined as</p>
<p><span id="eq-extended"><img src="https://latex.codecogs.com/png.latex?%0AF:%0A%5Cleft%5C%7B%0A%5Cbegin%7Balign%7D%0A(x,%5Cvarepsilon=+1)%20&amp;%5Cmapsto%20(T(x),%20%5Cvarepsilon=-1)%5C%5C%0A(x,%5Cvarepsilon=-1)%20&amp;%5Cmapsto%20(T%5E%7B-1%7D(x),%20%5Cvarepsilon=+1).%0A%5Cend%7Balign%7D%0A%5Cright.%0A%5Ctag%7B3%7D"></span></p>
<p>This allows one to define a Markov kernel that leaves the distribution <img src="https://latex.codecogs.com/png.latex?%5Coverline%7B%5Cpi%7D(x,%20%5Cvarepsilon)%20=%20%5Cpi(dx)/2"> invariant. Things get a bit more interesting if a deterministic “flip” <img src="https://latex.codecogs.com/png.latex?(x,%20%5Cvarepsilon)%20%5Cmapsto%20(x,%20-%5Cvarepsilon)"> is applied after each application of the Markov kernel described above: doing so avoids immediately coming back to <img src="https://latex.codecogs.com/png.latex?x"> in the event the move <img src="https://latex.codecogs.com/png.latex?(x,%5Cvarepsilon)%20%5Cmapsto%20(T%5E%7B%5Cvarepsilon%7D(x),%20-%5Cvarepsilon)"> is accepted. Quite a few papers exploit this type of idea.</p>
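<p>The extended involution of Equation&nbsp;3 is only a few lines of code; a small sketch (names illustrative):</p>

```python
def extended_involution(T, T_inv):
    """Turn a bijection T into the involution of Equation 3 on the
    extended state-space: (x, +1) -> (T(x), -1), (x, -1) -> (T_inv(x), +1)."""
    def F(state):
        x, eps = state
        return (T(x), -1) if eps == +1 else (T_inv(x), +1)
    return F
```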
</section>
<section id="a-mixture-of-deterministic-transformations" class="level3">
<h3 class="anchored" data-anchor-id="a-mixture-of-deterministic-transformations">A mixture of deterministic transformations?</h3>
<p>To conclude these notes, here is a small riddle whose answer I do not have. One can check that for any <img src="https://latex.codecogs.com/png.latex?c%20%5Cin%20%5Cmathbb%7BR%7D">, the function <img src="https://latex.codecogs.com/png.latex?F_%7Bc%7D(x)%20=%20c%20+%201/(x-c)"> is an involution of the real line. This means that for any target density <img src="https://latex.codecogs.com/png.latex?%5Cpi(x)"> on the real line, one can build the associated Markov kernel <img src="https://latex.codecogs.com/png.latex?M_c"> defined as</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AM_c(x,%20dy)%20=%20%5Calpha_c(x)%20%5C,%20%5Cdelta_%7BF_c(x)%7D(dy)%20+%20(1-%5Calpha_c(x))%20%5C,%20%5Cdelta_x(dy)%0A"></p>
<p>for an acceptance probability <img src="https://latex.codecogs.com/png.latex?%5Calpha_c(x)"> described as above,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Calpha_c(x)%20=%20%5Cmin%20%7B%5Cleft%5C%7B%201,%20%5Cfrac%7B%5Cpi%5BF_c(x)%5D%7D%7B%5Cpi(x)%7D%20%7CF'_c(x)%7C%20%5Cright%5C%7D%7D%20.%0A"></p>
<p>Finally, choose <img src="https://latex.codecogs.com/png.latex?N%20%5Cgeq%202"> values <img src="https://latex.codecogs.com/png.latex?c_1,%20%5Cldots,%20c_N%20%5Cin%20%5Cmathbb%7BR%7D"> and consider the mixture of Markov kernels</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AM(x,dy)%20%5C;%20=%20%5C;%20%5Cfrac%7B1%7D%7BN%7D%20%5Csum_%7Bi=1%7D%5EN%20M_%7Bc_i%7D(x,%20dy).%0A"></p>
<p>The Markov kernel <img src="https://latex.codecogs.com/png.latex?M(x,%20dy)"> leaves the distribution <img src="https://latex.codecogs.com/png.latex?%5Cpi"> invariant since each Markov kernel <img src="https://latex.codecogs.com/png.latex?M_%7Bc_i%7D(x,%20dy)"> does, but it is not clear at all (to me) under what conditions the associated MCMC algorithm does converge to <img src="https://latex.codecogs.com/png.latex?%5Cpi">. One can empirically check that if <img src="https://latex.codecogs.com/png.latex?N"> is very small, things can break down quite easily. On the other hand, for <img src="https://latex.codecogs.com/png.latex?N"> large, the mixture of Markov kernels <img src="https://latex.codecogs.com/png.latex?M(x,dy)"> empirically seems to behave as if it were ergodic with respect to <img src="https://latex.codecogs.com/png.latex?%5Cpi">.</p>
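<p>One step of the mixture kernel is easy to implement, using <code>|F_c'(x)| = 1/(x - c)^2</code>; a minimal sketch (names illustrative):</p>

```python
import numpy as np

def mobius_step(x, cs, log_pi, rng):
    """One step of the mixture kernel M: pick c uniformly among cs, propose
    y = F_c(x) = c + 1/(x - c), and accept with the Jacobian-corrected
    Metropolis ratio min(1, pi(y)/pi(x) * |F_c'(x)|)."""
    c = rng.choice(cs)
    y = c + 1.0 / (x - c)
    # log |F_c'(x)| = -2 log |x - c|
    log_ratio = log_pi(y) - log_pi(x) - 2.0 * np.log(abs(x - c))
    return y if np.log(rng.uniform()) < log_ratio else x
```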
<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/MCMC_deterministic_proposals/mcmc_deterministic.png" class="img-fluid quarto-figure quarto-figure-center figure-img" style="width:80.0%"></p>
</figure>
</div>
</div>
<p>For <img src="https://latex.codecogs.com/png.latex?N=5"> values <img src="https://latex.codecogs.com/png.latex?c_1,%20%5Cldots,%20c_5%20%5Cin%20%5Cmathbb%7BR%7D"> chosen at random, the illustration above shows the empirical distribution of the associated Markov chain run for <img src="https://latex.codecogs.com/png.latex?T=10%5E6"> iterations and targeting the standard Gaussian distribution <img src="https://latex.codecogs.com/png.latex?%5Cpi(dx)%20%5Cequiv%20%5Cmathcal%7BN%7D(0,1)">: the fit seems almost perfect.</p>



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0">
<div id="ref-andrieu2020general" class="csl-entry">
Andrieu, Christophe, Anthony Lee, and Sam Livingstone. 2020. <span>“A General Perspective on the Metropolis-Hastings Kernel.”</span> <em>arXiv Preprint arXiv:2012.14881</em>.
</div>
<div id="ref-green1995reversible" class="csl-entry">
Green, Peter J. 1995. <span>“Reversible Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination.”</span> <em>Biometrika</em> 82 (4). Oxford University Press: 711–32.
</div>
<div id="ref-zappa2018monte" class="csl-entry">
Zappa, Emilio, Miranda Holmes-Cerfon, and Jonathan Goodman. 2018. <span>“Monte Carlo on Manifolds: Sampling Densities and Integrating Functions.”</span> <em>Communications on Pure and Applied Mathematics</em> 71 (12). Wiley Online Library: 2609–47.
</div>
</div></section></div> ]]></description>
  <category>auxiliary-variable</category>
  <guid>https://alexxthiery.github.io/notes/MCMC_deterministic_proposals/MCMC_deterministic.html</guid>
  <pubDate>Sun, 17 Dec 2023 17:00:00 GMT</pubDate>
</item>
<item>
  <title>Averaging and homogenization</title>
  <link>https://alexxthiery.github.io/notes/averaging_homogenization/averaging_homogenization.html</link>
  <description><![CDATA[ 





<section id="averaging" class="level3">
<h3 class="anchored" data-anchor-id="averaging">Averaging</h3>
<p>Consider a pair of (coupled) Markov processes <img src="https://latex.codecogs.com/png.latex?X_t%20%5Cin%20%5Cmathcal%7BX%7D"> and <img src="https://latex.codecogs.com/png.latex?Y_t%20%5Cin%20%5Cmathcal%7BY%7D"> with dynamics that can informally be described as</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Balign%7D%0AdX%5E%7B%5Cvarepsilon%7D/dt%20&amp;=%20F(X%5E%7B%5Cvarepsilon%7D,Y%5E%7B%5Cvarepsilon%7D,%20W%5EX)%5C%5C%0AdY%5E%7B%5Cvarepsilon%7D/dt%20&amp;=%20%20%5Ctextcolor%7Bred%7D%7B%5Cvarepsilon%5E%7B-1%7D%7D%20%5C,%20G(X%5E%7B%5Cvarepsilon%7D,Y%5E%7B%5Cvarepsilon%7D,%20W%5EY)%5C%5C%0A%5Cend%7Balign%7D%0A%5Cright.%0A"></p>
<p>for two independent “noise” terms <img src="https://latex.codecogs.com/png.latex?W%5EX"> and <img src="https://latex.codecogs.com/png.latex?W%5EY"> and a time-scale parameter <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon%5Cll%201">. We assume that <img src="https://latex.codecogs.com/png.latex?X"> is a <strong>slow component</strong> that moves by <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BO%7D(%5Cdelta)"> on the time interval <img src="https://latex.codecogs.com/png.latex?%5Bt,%20t+%5Cdelta%5D">. The scaling <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon%5E%7B-1%7D"> in the dynamics of the <strong>fast process</strong> <img src="https://latex.codecogs.com/png.latex?Y%5E%7B%5Cvarepsilon%7D"> indicates that we expect the process <img src="https://latex.codecogs.com/png.latex?Y"> to evolve on a time scale of order <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BO%7D(%5Cvarepsilon)">. We are interested in the limit <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon%5Cto%200"> and hope to “average out” the fast process <img src="https://latex.codecogs.com/png.latex?Y%5E%7B%5Cvarepsilon%7D"> and be able to describe the slow (and interesting) process <img src="https://latex.codecogs.com/png.latex?X%5E%7B%5Cvarepsilon%7D"> without referring to the fast process. Informally, we would like to describe the process <img src="https://latex.codecogs.com/png.latex?X%5E%7B%5Cvarepsilon%7D">, in the limit <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon%5Cto%200">, as following an effective Markovian dynamics</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AdX/dt%20=%20%5Coverline%7BF%7D(X,%20W%5EX).%0A"></p>
<p>To describe the averaging phenomenon, we typically assume some ergodicity conditions on the fast process <img src="https://latex.codecogs.com/png.latex?Y">. Here, we assume that for each fixed <img src="https://latex.codecogs.com/png.latex?x%20%5Cin%20%5Cmathcal%7BX%7D">, the fast process <img src="https://latex.codecogs.com/png.latex?Y%5E%7B%5Bx%5D%7D"> with the slow component frozen at <img src="https://latex.codecogs.com/png.latex?x%20%5Cin%20%5Cmathcal%7BX%7D">, i.e.</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AdY%5E%7B%5Bx%5D%7D/dt%20=%20G(x,Y%5E%7B%5Bx%5D%7D,%20W%5EY)%0A"></p>
<p>is ergodic with respect to some probability distribution <img src="https://latex.codecogs.com/png.latex?%5Crho_x(dy)">. Although the averaging phenomenon is quite general, it is somewhat easier to illustrate it for diffusion processes. In this case, let us assume that the slow process is given by</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AdX%5E%7B%5Cvarepsilon%7D%20=%20%5Cmu(X%5E%7B%5Cvarepsilon%7D,%20Y%5E%7B%5Cvarepsilon%7D)%20%5C,%20dt%20+%20%5Csigma(X%5E%7B%5Cvarepsilon%7D,%20Y%5E%7B%5Cvarepsilon%7D)%20%5C,%20dW%5Ex.%0A"></p>
<p>For <img src="https://latex.codecogs.com/png.latex?X%5E%7B%5Cvarepsilon%7D_%7Bt%7D%20=%20x"> and for a time increment <img src="https://latex.codecogs.com/png.latex?%5Cdelta%20%5Cll%201">, since the process <img src="https://latex.codecogs.com/png.latex?X%5E%7B%5Cvarepsilon%7D"> can be considered constant on this interval, we have</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0AX%5E%7B%5Cvarepsilon%7D_%7Bt+%5Cdelta%7D%20-%20x%0A&amp;%5Capprox%20%5C;%0A%7B%5Cleft(%20%20%5Cfrac%7B%20%5Cint_%7Bt%7D%5E%7Bt+%5Cdelta%7D%20%5Cmu(x,%20Y%5E%7B%5Cvarepsilon%7D)%20%5C,%20dt%7D%7B%5Cdelta%7D%20%20%5Cright)%7D%20%20%5C,%20%5Cdelta%20+%20%5C%0A%7B%5Cleft(%20%20%5Cfrac%7B%20%5Cint_%7Bt%7D%5E%7Bt+%5Cdelta%7D%20%5Csigma%5E2(x,%20Y%5E%7B%5Cvarepsilon%7D)%20%5C,%20dt%20%7D%7B%5Cdelta%7D%20%5Cright)%7D%20%5E%7B1/2%7D%20%5C,%20%5Cmathcal%7BN%7D(0,%20%5Cdelta).%0A%5Cend%7Balign%7D%0A"></p>
<p>This can be regarded as a time-discretization of the <strong>averaged process</strong></p>
<p><img src="https://latex.codecogs.com/png.latex?%0AdX%20%5C,%20=%20%5C;%20%5Coverline%7B%5Cmu%7D(X)%20%5C,%20dt%20+%20%5Coverline%7B%5Csigma%7D(X)%20%5C,%20dW%0A"></p>
<p>for averaged drift and volatility functions given by</p>
<p><span id="eq-av-diff"><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Balign%7D%0A%5Coverline%7B%5Cmu%7D(x)%0A&amp;=%0A%5Cint%20%5Cmu(x,y)%20%5C,%20%5Crho_x(dy)%20%5C%5C%0A%5Coverline%7B%5Csigma%7D%5E2(x)%0A&amp;=%20%5Cint%20%5Csigma%5E2(x,y)%20%5C,%20%5Crho_x(dy).%0A%5Cend%7Balign%7D%0A%5Cright.%0A%5Ctag%7B1%7D"></span></p>
<p>One standard approach for proving this type of result is to write the Kolmogorov equations</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cfrac%7Bd%7D%7Bdt%7D%5Cvarphi%5E%7B%5Cvarepsilon%7D(x,y,t)%20=%20%5Cmathcal%7BL%7D%5E%7B%5Cvarepsilon%7D%20%5Cvarphi%5E%7B%5Cvarepsilon%7D(x,y,t)"> for <img src="https://latex.codecogs.com/png.latex?%5Cvarphi%5E%7B%5Cvarepsilon%7D(x,y,t)%20=%20%5Cmathbb%7BE%7D%5B%5Cvarphi(X%5E%7B%5Cvarepsilon%7D_%7Bt%7D,%20Y%5E%7B%5Cvarepsilon%7D_%7Bt%7D,%20t)%20%7C%20X%5E%7B%5Cvarepsilon%7D_%7B0%7D=x,%20Y%5E%7B%5Cvarepsilon%7D_%7B0%7D=y%5D"> and perform a <a href="https://en.wikipedia.org/wiki/Method_of_matched_asymptotic_expansions">multiscale expansion</a> <span class="citation" data-cites="hinch_1991">(Hinch 1991)</span> <span class="citation" data-cites="pavliotis2008multiscale">(Pavliotis and Stuart 2008)</span> <span class="citation" data-cites="weinan2011principles">(Weinan 2011)</span></p>
<p><span id="eq-multiscale"><img src="https://latex.codecogs.com/png.latex?%0A%5Cvarphi%5E%7B%5Cvarepsilon%7D(x,y,t)%0A=%0AA(x,t)%20+%20%5Cvarepsilon%20B(x,y,t)%20+%20%5Cmathcal%7BO%7D(%5Cvarepsilon%5E2).%0A%5Ctag%7B2%7D"></span></p>
<p>Indeed, the first order term <img src="https://latex.codecogs.com/png.latex?A(x,t)"> is expected to not depend on the initial condition <img src="https://latex.codecogs.com/png.latex?y"> since the process <img src="https://latex.codecogs.com/png.latex?(X%5E%7B%5Cvarepsilon%7D_t,%20Y%5E%7B%5Cvarepsilon%7D_t)"> forgets <img src="https://latex.codecogs.com/png.latex?Y%5E%7B%5Cvarepsilon%7D_0%20=%20y"> on time scales of order <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon"> and we are interested in the regime <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon%5Cto%200">. From Equation&nbsp;2 one can obtain the dynamics of the averaged process described by the function <img src="https://latex.codecogs.com/png.latex?A(x,t)">. One finds that <img src="https://latex.codecogs.com/png.latex?A"> is described by the averaged generator of the slow component, i.e.&nbsp;averaging <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7D%5E%7BX%5E%7B%5Cvarepsilon%7D%7D"> under <img src="https://latex.codecogs.com/png.latex?%5Crho_x(dy)">; this exactly gives Equation&nbsp;1 in the case of diffusions. A typical example could be as follows:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Balign%7D%0AdX%5E%7B%5Cvarepsilon%7D%20&amp;=%20%5Cmu(X%5E%7B%5Cvarepsilon%7D,%20Y%5E%7B%5Cvarepsilon%7D)%20dt%20+%20%5Csigma(X%5E%7B%5Cvarepsilon%7D,%20Y%5E%7B%5Cvarepsilon%7D)%20%5C,%20dW%5EX%5C%5C%0AdY%5E%7B%5Cvarepsilon%7D%20&amp;=%20-%20%5Ctextcolor%7Bred%7D%7B%5Cvarepsilon%5E%7B-1%7D%7D%20%5Cfrac%7B%20(Y%5E%7B%5Cvarepsilon%7D%20-%20X%5E%7B%5Cvarepsilon%7D)%20%7D%7B%5Csigma%5E2%7D%20%5C,%20dt%20+%20%5Csqrt%7B2%20%20%5Ctextcolor%7Bred%7D%7B%5Cvarepsilon%5E%7B-1%7D%7D%20%7D%20%5C,%20dW%5EY.%5C%5C%0A%5Cend%7Balign%7D%0A%5Cright.%0A"></p>
<p>The fast process <img src="https://latex.codecogs.com/png.latex?Y%5E%7B%5Cvarepsilon%7D_t"> is an <a href="https://en.wikipedia.org/wiki/Ornstein–Uhlenbeck_process">Ornstein-Uhlenbeck</a> process sped up by a factor <img src="https://latex.codecogs.com/png.latex?1/%5Cvarepsilon"> that rapidly oscillates around <img src="https://latex.codecogs.com/png.latex?X%5E%7B%5Cvarepsilon%7D_t">, with Gaussian fluctuations of variance <img src="https://latex.codecogs.com/png.latex?%5Csigma%5E2%3E0">, i.e.:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Crho_x(dy)%20%5C;%20=%20%5C;%20%5Cfrac%7B%20e%5E%7B-(y-x)%5E2/(2%5Csigma%5E2)%7D%20%7D%7B%5Csqrt%7B2%5Cpi%20%5Csigma%5E2%7D%7D%5C,%20dy.%0A"></p>
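<p>As a numerical sanity check of this averaging example, the sketch below simulates the fast-slow pair and compares the long-run variance of the slow component with that of the averaged dynamics. The concrete coefficients are hypothetical choices made for the test (they are not specified above): slow drift μ(x, y) = -y, slow volatility √2, and σ = 1 in the fast OU process, so that ρ_x = N(x, 1), the averaged drift is μ̄(x) = -x, and the averaged SDE dX = -X dt + √2 dW has stationary law N(0, 1).</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# time-scale separation and integration step (dt must resolve the fast scale eps)
eps, dt, T = 1e-2, 1e-3, 200.0
n = int(T / dt)
xi = rng.standard_normal((n, 2))   # pre-drawn unit-variance Brownian increments

# hypothetical coefficients: mu(x, y) = -y, slow volatility sqrt(2), sigma = 1,
# so the averaged SDE is dX = -X dt + sqrt(2) dW with stationary law N(0, 1)
x, y = 0.0, 0.0
xs = np.empty(n)
sq = np.sqrt(dt)
for i in range(n):
    x += -y * dt + np.sqrt(2.0) * sq * xi[i, 0]
    y += -(y - x) * dt / eps + np.sqrt(2.0 / eps) * sq * xi[i, 1]
    xs[i] = x

var_x = xs[n // 4:].var()   # discard burn-in; predicted stationary variance is 1
```

For finite ε the stationary variance of the coupled pair can be computed exactly and equals 1 + 2ε here, so the empirical variance should sit close to one.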
<p>This averaging phenomenon is relatively straightforward and not extremely surprising. More interesting is the homogenization phenomenon described in the next Section.</p>
</section>
<section id="homogenization" class="level3">
<h3 class="anchored" data-anchor-id="homogenization">Homogenization</h3>
<p>Consider the presence of an additional intermediate time scale <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon%5E%7B-1/2%7D">, <img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Balign%7D%0AdX%5E%7B%5Cvarepsilon%7D/dt%20&amp;=%20%20%5Ctextcolor%7Bblue%7D%7B%5Cvarepsilon%5E%7B-1/2%7D%20H(X%5E%7B%5Cvarepsilon%7D,Y%5E%7B%5Cvarepsilon%7D)%7D%20%5C,%20+%20%5C,F(X%5E%7B%5Cvarepsilon%7D,Y%5E%7B%5Cvarepsilon%7D,%20W%5EX)%5C%5C%0AdY%5E%7B%5Cvarepsilon%7D/dt%20&amp;=%20%20%5Ctextcolor%7Bred%7D%7B%5Cvarepsilon%5E%7B-1%7D%7D%20%5C,%20G(X%5E%7B%5Cvarepsilon%7D,Y%5E%7B%5Cvarepsilon%7D,%20W%5EY)%5C%5C%0A%5Cend%7Balign%7D%0A%5Cright.%0A"> with the same assumption that for any fixed <img src="https://latex.codecogs.com/png.latex?x%20%5Cin%20%5Cmathcal%7BX%7D"> the process <img src="https://latex.codecogs.com/png.latex?dY%5E%7B%5Bx%5D%7D/dt%20=%20G(x,Y%5E%7B%5Bx%5D%7D,%20W%5EY)"> is ergodic with respect to the probability distribution <img src="https://latex.codecogs.com/png.latex?%5Crho_x(dy)">. The same reasoning as in the averaging case shows that averaging the term <img src="https://latex.codecogs.com/png.latex?F(X%5E%7B%5Cvarepsilon%7D,Y%5E%7B%5Cvarepsilon%7D,%20W%5EX)"> is relatively straightforward and has the exact same expression: it suffices to average under <img src="https://latex.codecogs.com/png.latex?%5Crho_x(dy)">. This means that one can study instead</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Balign%7D%0AdX%5E%7B%5Cvarepsilon%7D/dt%20&amp;=%20%20%5Ctextcolor%7Bblue%7D%7B%5Cvarepsilon%5E%7B-1/2%7D%20H(X%5E%7B%5Cvarepsilon%7D,Y%5E%7B%5Cvarepsilon%7D)%7D%20%5C,%20+%20%5C,%20%5Coverline%7BF%7D(X%5E%7B%5Cvarepsilon%7D,%20W%5EX)%5C%5C%0AdY%5E%7B%5Cvarepsilon%7D/dt%20&amp;=%20%20%5Ctextcolor%7Bred%7D%7B%5Cvarepsilon%5E%7B-1%7D%7D%20%5C,%20G(X%5E%7B%5Cvarepsilon%7D,Y%5E%7B%5Cvarepsilon%7D,%20W%5EY)%5C%5C%0A%5Cend%7Balign%7D%0A%5Cright.%0A"></p>
<p>with, informally, <img src="https://latex.codecogs.com/png.latex?%5Coverline%7BF%7D(x,w)%20=%20%5Cint%20F(x,y,w)%20%5C,%20%5Crho_x(dy)">. The new and interesting phenomenon comes from the intermediate time scale <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon%5E%7B-1/2%7D">. Contrary to the averaging phenomenon of the previous section, which relied only on a Law of Large Numbers, dealing with the intermediate time scale requires a CLT and a quantification of the mixing rate of the fast process <img src="https://latex.codecogs.com/png.latex?Y%5E%7B%5Bx%5D%7D">. Note that since <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon%5E%7B-1/2%7D%20%5Cgg%201">, for the dynamics not to explode one needs the <strong>centering condition</strong>:</p>
<p><span id="eq-centering"><img src="https://latex.codecogs.com/png.latex?%0A%5Cint_%7B%5Cmathcal%7BY%7D%7D%20H(x,y)%20%5C,%20%5Crho_x(dy)%20=%200%0A%5Cqquad%20%5Ctextrm%7Bfor%20all%20%7D%20x%20%5Cin%20%5Cmathcal%7BX%7D.%0A%5Ctag%7B3%7D"></span></p>
<p>Because of the centering condition, the term <img src="https://latex.codecogs.com/png.latex?%20%5Ctextcolor%7Bblue%7D%7B%5Cvarepsilon%5E%7B-1/2%7D%20H(X%5E%7B%5Cvarepsilon%7D,Y%5E%7B%5Cvarepsilon%7D)%7D"> will contribute an additional noise term to the effective dynamics of the slow process. To describe this additional noise term, assume an ergodic central limit theorem (CLT) for the fast process <img src="https://latex.codecogs.com/png.latex?dY%5E%7B%5Bx%5D%7D/dt%20=%20G(x,Y%5E%7B%5Bx%5D%7D,%20W%5EY)">: for a test function <img src="https://latex.codecogs.com/png.latex?%5Cvarphi:%20%5Cmathcal%7BY%7D%5Cto%20%5Cmathbb%7BR%7D"> with zero expectation under <img src="https://latex.codecogs.com/png.latex?%5Crho_x(dy)"> we have:</p>
<p><span id="eq-CLT"><img src="https://latex.codecogs.com/png.latex?%0A%5Clim_%7BT%20%5Cto%20%5Cinfty%7D%20%5C;%20T%5E%7B-1/2%7D%0A%5Cint_%7Bt=0%7D%5ET%20%5C,%20%5Cvarphi(Y%5E%7B%5Bx%5D%7D_t)%20%5C,%20dt%0A%5C;%20=%20%5C;%20%5Cmathcal%7BN%7D(0,%20V_x%5B%5Cvarphi%5D)%0A%5Ctag%7B4%7D"></span></p>
<p>for asymptotic variance <img src="https://latex.codecogs.com/png.latex?V_x%5B%5Cvarphi%5D%20%5Cgeq%200">. For a time increment <img src="https://latex.codecogs.com/png.latex?%5Cdelta%20%3E%200"> and assuming <img src="https://latex.codecogs.com/png.latex?X%5E%7B%5Cvarepsilon%7D_%7Bt%7D=x"> we have</p>
<p><span id="eq-split"><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0AX%5E%7B%5Cvarepsilon%7D_%7Bt+%5Cdelta%7D%20-%20x%0A&amp;%5Capprox%20%20%5Ctextcolor%7Bblue%7D%7B%20%5Cvarepsilon%5E%7B-1/2%7D%20%5C,%20%5Cint_%7Bu=t%7D%5E%7Bt+%5Cdelta%7D%20H(X%5E%7B%5Cvarepsilon%7D_u,Y%5E%7B%5Cvarepsilon%7D_u)%7D%20%5C,%20du%20%5C,%20+%20%5C,%20%5Cint_%7Bu=t%7D%5E%7Bt+%5Cdelta%7D%20%5Coverline%7BF%7D(x,%20W%5EX_u)%20%5C,%20du.%0A%5Cend%7Balign%7D%0A%5Ctag%7B5%7D"></span></p>
<p>The second integral term is an averaging term that can be treated easily. Approximating the process <img src="https://latex.codecogs.com/png.latex?t%20%5Cmapsto%20Y%5E%7B%5Cvarepsilon%7D_t"> by <img src="https://latex.codecogs.com/png.latex?t%20%5Cmapsto%20Y%5E%7B%5Bx%5D%7D_%7Bt%20%5Cvarepsilon%5E%7B-1%7D%7D">, the first integral on the RHS of Equation&nbsp;5 can be approximated as</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0A%5Cunderbrace%7B%5Cvarepsilon%5E%7B-1/2%7D%20%5Cint_%7Bu=t%7D%5E%7Bt+%5Cdelta%7D%20%20H(x,Y%5E%7B%5Bx%5D%7D_%7Bu%20%5C,%20%5Cvarepsilon%5E%7B-1%7D%7D)%20%5C,%20du%7D_%7B%5Ctextrm%7BCLT%7D%7D%20%5C,%0A+%0A%5Cunderbrace%7B%5Cint_%7Bu=t%7D%5E%7Bt+%5Cdelta%7D%20%5Cvarepsilon%5E%7B-1/2%7D%20%5Cpartial_x%20H(x,Y%5E%7B%5Bx%5D%7D_%7Bu%20%5C,%20%5Cvarepsilon%5E%7B-1%7D%7D)%20%5C,%20(X%5E%7B%5Cvarepsilon%7D_u%20-%20x)%20%5C,%20%5C,%20du%7D_%7B%5Ctextrm%7B(drift)%7D%7D.%0A%5Cend%7Balign%7D%0A"></p>
<p>After a time-rescaling, one can readily see that the first term is described by the CLT of Equation&nbsp;4,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cvarepsilon%5E%7B-1/2%7D%20%5Cint_%7Bu=t%7D%5E%7Bt+%5Cdelta%7D%20%20H(x,Y%5E%7B%5Bx%5D%7D_%7Bu%20%5C,%20%5Cvarepsilon%5E%7B-1%7D%7D)%20%5C,%20du%0A%5Capprox%20V_x%5BH(x,%20%5Ccdot)%5D%5E%7B1/2%7D%20%5Cmathcal%7BN%7D(0,%20%5Cdelta).%0A"></p>
<p>The second term is further approximated as</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0A%5Cvarepsilon%5E%7B-1%7D%20%5C,%20&amp;%5Cint_%7Bu=t%7D%5E%7Bt+%5Cdelta%7D%5Cint_%7Bv=t%7D%5E%7Bt+%5Cdelta%7D%0A%5Cpartial_x%20H(x,Y%5E%7B%5Bx%5D%7D_%7Bu%20%5C,%20%5Cvarepsilon%5E%7B-1%7D%7D)%20%5C,%20H(x,Y%5E%7B%5Bx%5D%7D_%7Bv%20%5C,%20%5Cvarepsilon%5E%7B-1%7D%7D)%20%5C,%201_%7Bv%3Cu%7D%20%5C,%20du%20%5C,%20dv%5C%5C%0A&amp;=%20%20%7B%5Cleft(%20%20%5Cfrac%7B1%7D%7B%5Cdelta%20%5Cvarepsilon%5E%7B-1%7D%7D%20%5Cint_%7Bu=t%7D%5E%7Bt+%5Cdelta%20%5Cvarepsilon%5E%7B-1%7D%7D%5Cint_%7Bv=t%7D%5E%7Bt+%5Cdelta%20%5Cvarepsilon%5E%7B-1%7D%7D%20%5Cpartial_x%20H(x,Y%5E%7B%5Bx%5D%7D_%7Bu%7D)%20%5C,%20H(x,Y%5E%7B%5Bx%5D%7D_%7Bv%7D)%20%5C,%201_%7Bv%3Cu%7D%20%20%5Cright)%7D%20%20%5C,%20%5Cdelta,%0A%5Cend%7Balign%7D%0A"></p>
<p>where the second equality comes from the time-rescaling <img src="https://latex.codecogs.com/png.latex?t%20%5Cmapsto%20t%20%5Cvarepsilon">. The process <img src="https://latex.codecogs.com/png.latex?Y%5E%7B%5Bx%5D%7D"> mixes on time scales of order <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BO%7D(1)"> so that the term inside the brackets <img src="https://latex.codecogs.com/png.latex?%20%7B%5Cleft(%20%5Cldots%20%5Cright)%7D%20"> converges to its expectation. Setting <img src="https://latex.codecogs.com/png.latex?T%20=%20%5Cdelta%20%5C,%20%5Cvarepsilon%5E%7B-1%7D%20%5Cto%20%5Cinfty">, one obtains</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0AI(x)%20&amp;=%0A%5Cfrac%7B1%7D%7BT%7D%20%5Ciint_%7B%5B0,T%5D%5E2%7D%0A%5Cpartial_x%20H(x,Y%5E%7B%5Bx%5D%7D_%7Bu%7D)%20%5C,%20H(x,Y%5E%7B%5Bx%5D%7D_%7Bv%7D)%20%5C,%201_%7Bv%3Cu%7D%20%20%5C,%20du%20%5C,%20dv%20%5C%5C%0A&amp;%5Cto%0A%5Cint%20%5Crho_x(dy)%20%5C,%20H(x,y)%20%5C,%20%20%20%7B%5Cleft%5C%7B%20%20%5Cint_%7Bs=0%7D%5E%7B%5Cinfty%7D%20%5Cmathbb%7BE%7D%5B%5Cpartial_x%20H(%5Chat%7Bx%7D,%20Y%5E%7B%5Bx%5D%7D_s)%20%5C,%20%7C%20%20Y%5E%7B%5Bx%5D%7D_0=y%5D%20%5C,%20ds%20%20%5Cright%5C%7D%7D%20.%0A%5Cend%7Balign%7D%0A"></p>
<p>In conclusion, the fast-slow system</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Balign%7D%0AdX%5E%7B%5Cvarepsilon%7D%20&amp;=%20%20%5Ctextcolor%7Bblue%7D%7B%5Cvarepsilon%5E%7B-1/2%7D%20%5C,%20H(X%5E%7B%5Cvarepsilon%7D,%20Y%5E%7B%5Cvarepsilon%7D)%7D%20%5C,%20dt%20+%20%5Cmu(X%5E%7B%5Cvarepsilon%7D,%20Y%5E%7B%5Cvarepsilon%7D)%20%5C,%20dt%20+%20%5Csigma(X%5E%7B%5Cvarepsilon%7D,%20Y%5E%7B%5Cvarepsilon%7D)%20%5C,%20dW%5Ex%5C%5C%0AdY%5E%7B%5Cvarepsilon%7D%20&amp;=%20%20%5Ctextcolor%7Bred%7D%7B%5Cvarepsilon%5E%7B-1%7D%7D%20%5C,%20G(X%5E%7B%5Cvarepsilon%7D,Y%5E%7B%5Cvarepsilon%7D,%20W%5EY)%20%5C,%20dt%0A%5Cend%7Balign%7D%0A%5Cright.%0A"></p>
<p>can be described in the regime <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon%5Cto%200"> by the effective dynamics</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AdX%20=%20%20%5Ctextcolor%7Bblue%7D%7BI(X)%20%5C,%20dt%20+%20%5CGamma%5E%7B1/2%7D(X)%20%5C,%20dW%5E%7BH%7D%7D%0A+%0A%5Coverline%7B%5Cmu%7D(X)%20%5C,%20dt%20+%20%5Coverline%7B%5Csigma%7D(X)%20%5C,%20dW%5EX.%0A"></p>
<p>for two independent Brownian motions <img src="https://latex.codecogs.com/png.latex?W%5EX"> and <img src="https://latex.codecogs.com/png.latex?W%5EH">. The volatility term <img src="https://latex.codecogs.com/png.latex?%20%5Ctextcolor%7Bblue%7D%7B%5CGamma(x)%7D"> comes from the CLT and the drift term <img src="https://latex.codecogs.com/png.latex?%20%5Ctextcolor%7Bblue%7D%7BI(x)%7D"> comes from the self-interaction term:</p>
<p><span id="eq-homogenized-terms"><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Balign%7D%0A%5CGamma(x)%0A&amp;=%20%5Clim_%7BT%20%5Cto%20%5Cinfty%7D%20%5C;%20%5Cmathop%7B%5Cmathrm%7BVar%7D%7D%20%7B%5Cleft%5C%7B%20T%5E%7B-1/2%7D%20%5Cint_%7B0%7D%5ET%20H(x,%20Y%5E%7B%5Bx%5D%7D_t)%20%5C,%20dt%20%5Cright%5C%7D%7D%20%5C%5C%0A%25%0AI(x)%0A&amp;=%20%5Clim_%7BT%20%5Cto%20%5Cinfty%7D%20%5C;%20%5Cfrac%7B1%7D%7BT%7D%20%5Ciint_%7B0%3Cu%3Cv%3CT%7D%20H(x,%20Y%5E%7B%5Bx%5D%7D_u)%20%5C,%20%5Cpartial_x%20H(x,%20Y%5E%7B%5Bx%5D%7D_v)%20%5C,%20du%20%5C,%20dv.%0A%5Cend%7Balign%7D%0A%5Cright.%0A%5Ctag%7B6%7D"></span></p>
<p>For the drift function, the scaling <img src="https://latex.codecogs.com/png.latex?T%5E%7B-1%7D%20%5Cint_%7B0%3Cu%3Cv%3CT%7D%20(%5Cldots)"> may look a bit surprising at first sight as one may expect <img src="https://latex.codecogs.com/png.latex?T%5E%7B-2%7D%20%5Cint_%7B0%3Cu%3Cv%3CT%7D%20(%5Cldots)"> instead. Note that since the process <img src="https://latex.codecogs.com/png.latex?Y%5E%7B%5Bx%5D%7D"> mixes on a time scale <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BO%7D(1)"> and the centering condition <img src="https://latex.codecogs.com/png.latex?%5Cint%20H(x,%20y)%20%5Crho_x(dy)=0"> holds, the expectation <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D%5BH(x,%20Y%5E%7B%5Bx%5D%7D_u)%20%5C,%20%5Cpartial_x%20H(x,%20Y%5E%7B%5Bx%5D%7D_v)%5D"> goes to zero as soon as <img src="https://latex.codecogs.com/png.latex?%7Cu-v%7C%20%5Cgg%201">. This means that only the subset <img src="https://latex.codecogs.com/png.latex?%7Cu-v%7C%20=%20%5Cmathcal%7BO%7D(1)"> of <img src="https://latex.codecogs.com/png.latex?%5B0,T%5D%5E2"> really matters in that double integral, hence the <img src="https://latex.codecogs.com/png.latex?(1/T)"> normalization factor.</p>
</section>
<section id="closed-form-solution-poisson-equation" class="level3">
<h3 class="anchored" data-anchor-id="closed-form-solution-poisson-equation">Closed form solution &amp; Poisson equation:</h3>
<p>The drift and volatility terms <img src="https://latex.codecogs.com/png.latex?%5CGamma(x)"> and <img src="https://latex.codecogs.com/png.latex?I(x)"> quantify the mixing properties of the fast process <img src="https://latex.codecogs.com/png.latex?Y%5E%7B%5Bx%5D%7D">. While the formulas in Equation&nbsp;6 are intuitive, they can be difficult to work with if one needs exact expressions for the drift and volatility functions. Instead, these can also be expressed in terms of the solution to an appropriate <a href="../../notes/Poisson_Eq_Asymp_Var/Poisson_Eq_Asymp_Var.html">Poisson equation</a>.</p>
<p><span id="eq-poisson"><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0AI(x)%20&amp;=%0A%5Cfrac%7B1%7D%7BT%7D%20%5Ciint_%7B%5B0,T%5D%5E2%7D%0AH(x,Y%5E%7B%5Bx%5D%7D_%7Bv%7D)%20%5C,%20%5Cpartial_x%20H(x,Y%5E%7B%5Bx%5D%7D_%7Bu%7D)%20%5C,%201_%7Bv%3Cu%7D%20%20%5C,%20du%20%5C,%20dv%20%5C%5C%0A&amp;%5Cto%0A%5Cint%20%5Crho_x(dy)%20%5C,%20H(x,y)%20%5C,%20%5Cpartial_%7B%5Chat%7Bx%7D%7D%20%20%7B%5Cleft%5C%7B%20%20%5Cint_%7Bs=0%7D%5E%7B%5Cinfty%7D%20%5Cmathbb%7BE%7D%5BH(%5Chat%7Bx%7D,%20Y%5E%7B%5Bx%5D%7D_s)%20%5C,%20%7CY%5E%7B%5Bx%5D%7D_0=y%5D%20%5C,%20ds%20%20%5Cright%5C%7D%7D%20%5C%5C%0A&amp;=%0A-%5Cint%20%5Crho_x(dy)%20%5C,%20H(x,y)%20%5C,%20%5Cpartial_%7Bx%7D%20%5CPhi(x,y)%5C%5C%0A&amp;=%20-%5Cleft%3C%20H(x,%20%5Ccdot),%20%5Cpartial_x%20%5CPhi(x,%20%5Ccdot)%20%5Cright%3E_%7B%5Crho_x%7D%0A%5Cend%7Balign%7D%0A%5Ctag%7B7%7D"></span></p>
<p>where the function <img src="https://latex.codecogs.com/png.latex?%5CPhi(x,y)"> is the solution to the <a href="../../notes/Poisson_Eq_Asymp_Var/Poisson_Eq_Asymp_Var.html">Poisson equation</a></p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7BL%7D%5E%7BY%5E%7B%5Bx%5D%7D%7D%20%5CPhi(x,%20%5Ccdot)%20=%20H(x,%20%5Ccdot)%0A"></p>
<p>for all <img src="https://latex.codecogs.com/png.latex?x%20%5Cin%20%5Cmathcal%7BX%7D"> and <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7D%5E%7BY%5E%7B%5Bx%5D%7D%7D"> is the generator of the fast process <img src="https://latex.codecogs.com/png.latex?dY%5E%7B%5Bx%5D%7D/dt%20=%20G(x,Y%5E%7B%5Bx%5D%7D,%20W%5EY)">. The last equality in Equation&nbsp;7 follows from the <a href="../../notes/Poisson_Eq_Asymp_Var/Poisson_Eq_Asymp_Var.html">integral representation</a> of the Poisson equation. Similarly, and also as explained <a href="../../notes/Poisson_Eq_Asymp_Var/Poisson_Eq_Asymp_Var.html">here</a>, the asymptotic variance term can also be expressed in terms of the function <img src="https://latex.codecogs.com/png.latex?%5CPhi">,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0A%5CGamma(x)%0A&amp;=%20%5Clim_%7BT%20%5Cto%20%5Cinfty%7D%20%5C;%20%5Cmathop%7B%5Cmathrm%7BVar%7D%7D%20%7B%5Cleft%5C%7B%20T%5E%7B-1/2%7D%20%5Cint_%7B0%7D%5ET%20H(x,%20Y%5E%7B%5Bx%5D%7D_t)%20%5C,%20dt%20%5Cright%5C%7D%7D%20%5C%5C%0A&amp;=%20-2%20%5Cint_%7B%5Cmathcal%7BY%7D%7D%20%5CPhi(x,%20y)%20%5C,%20H(x,%20y)%20%5C,%20%5Crho_x(dy)%5C%5C%0A&amp;=%20-2%20%5Cleft%3C%20%5CPhi,%20%5Cmathcal%7BL%7D%5E%7BY%5E%7B%5Bx%5D%7D%7D%20%5CPhi%20%5Cright%3E_%7B%5Crho_x%7D.%0A%5Cend%7Balign%7D%0A"></p>
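<p>These Poisson-equation formulas are easy to check in the OU setting used in the examples below. For the generator Lφ = λ(-y φ' + φ'') with invariant law N(0, 1) and the observable H(y) = y, the Poisson equation LΦ = H is solved by Φ(y) = -y/λ, and the formula Γ = -2⟨Φ, H⟩_ρ should give 2/λ. A small sketch evaluating the Gaussian integral by Gauss-Hermite quadrature (the value of λ is arbitrary):</p>

```python
import numpy as np

lam = 1.5                      # hypothetical mixing rate lambda > 0
phi = lambda y: -y / lam       # Poisson solution: L phi = lam * (-y) * (-1/lam) = y = H(y)

# Gamma = -2 <Phi, H>_rho with rho = N(0, 1), via Gauss-Hermite quadrature:
# E_{N(0,1)}[g(Y)] = sum_i w_i g(sqrt(2) x_i) / sqrt(pi)
nodes, weights = np.polynomial.hermite.hermgauss(40)
y = np.sqrt(2.0) * nodes                 # map quadrature nodes to a standard normal
w = weights / np.sqrt(np.pi)
gamma = -2.0 * np.sum(w * phi(y) * y)    # predicted value: 2 / lam
```

Since the integrand is a degree-two polynomial, the quadrature is exact up to floating-point error.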
</section>
<section id="example-integrated-ou-process" class="level3">
<h3 class="anchored" data-anchor-id="example-integrated-ou-process">Example: integrated OU process</h3>
<p>Consider a slow process obtained by integrating an OU process,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Balign%7D%0AdX%5E%7B%5Cvarepsilon%7D%20&amp;=%20%5Cvarepsilon%5E%7B-1/2%7D%20%5C,%20Y%5E%7B%5Cvarepsilon%7D%20%5C,%20dt%5C%5C%0AdY%5E%7B%5Cvarepsilon%7D%20&amp;=%20-%5Clambda%20%5Cvarepsilon%5E%7B-1%7D%5C,%20Y%5E%7B%5Cvarepsilon%7D%20%5C,%20dt%20+%20%5Csqrt%7B2%20%5Clambda/%5Cvarepsilon%7D%20%5C,%20dW%5EY,%0A%5Cend%7Balign%7D%0A%5Cright.%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5Clambda%20%3E%200"> is just a fixed time-scaling parameter. The fast OU process mixes on time scales of order <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BO%7D(%5Cvarepsilon)"> and has a standard Gaussian distribution as invariant distribution. Homogenization gives that in the regime <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon%5Cto%200">, the slow process can be approximated as</p>
<p><span id="eq-integ-OU"><img src="https://latex.codecogs.com/png.latex?%0AdX%20=%20%5Csqrt%7B2/%5Clambda%7D%20%5C,%20dW%0A%5Ctag%7B8%7D"></span></p>
<p>since the asymptotic variance is</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathop%7B%5Cmathrm%7BVar%7D%7D%20%7B%5Cleft%5C%7B%20%20T%5E%7B-1/2%7D%20%5Cint_%7Bt=0%7D%5E%7BT%7D%20Y_t%20%5C,%20dt%20%5Cright%5C%7D%7D%0A%5Cto%0A2%20%5C,%20%5Cint_%7B0%7D%5E%7B%5Cinfty%7D%20C(r)%20%5C,%20dr%20=%202/%5Clambda%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?C(r)%20=%20%5Cmathbb%7BE%7D%5BY_t%20Y_%7Bt+r%7D%5D%20=%20%5Cexp%5B-%5Clambda%20r%5D"> is the autocorrelation function of the fast OU process, as explained <a href="../../notes/Poisson_Eq_Asymp_Var/Poisson_Eq_Asymp_Var.html">here</a>. The fact that the effective diffusion is (twice) the integrated autocorrelation of the fast process is an example of <a href="https://en.wikipedia.org/wiki/Green–Kubo_relations">Green-Kubo relations</a>.</p>
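<p>A direct Monte-Carlo check of Equation 8: simulating many independent copies of the fast-slow pair with a small ε, the variance of the slow component at time T should be close to (2/λ) T. A sketch, with all parameter values arbitrary choices for illustration:</p>

```python
import numpy as np

rng = np.random.default_rng(2)

lam, eps = 2.0, 1e-2
dt, T = 2e-4, 1.0              # dt must resolve the fast time scale eps / lam
n, reps = int(T / dt), 5000

x = np.zeros(reps)
y = rng.standard_normal(reps)  # start the fast OU process at stationarity
for _ in range(n):
    x += y * dt / np.sqrt(eps)
    y += -(lam / eps) * y * dt + np.sqrt(2.0 * lam * dt / eps) * rng.standard_normal(reps)

var_T = x.var()                # homogenization predicts (2 / lam) * T = 1.0 here
```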
</section>
<section id="example-overdamped-langevin-dynamics" class="level3">
<h3 class="anchored" data-anchor-id="example-overdamped-langevin-dynamics">Example: Overdamped Langevin Dynamics</h3>
<p>This example does not exactly fall within the homogenization result described in the previous section, but almost. Consider a potential <img src="https://latex.codecogs.com/png.latex?U"> and the slow-fast dynamics:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Balign%7D%0AdX%5E%7B%5Cvarepsilon%7D%20&amp;=%20%5Cvarepsilon%5E%7B-1/2%7D%20%5C,%20Y%5E%7B%5Cvarepsilon%7D%20%5C,%20dt%5C%5C%0AdY%5E%7B%5Cvarepsilon%7D%20&amp;=%20-%20%5Cvarepsilon%5E%7B-1%7D%5C,%20%5BY%5E%7B%5Cvarepsilon%7D+%5Cvarepsilon%5E%7B1/2%7D%5Cnabla%20U(X%5E%7B%5Cvarepsilon%7D)%5D%20%5C,%20dt%20+%20%5Csqrt%7B2%20/%5Cvarepsilon%7D%20%5C,%20dW%5EY.%0A%5Cend%7Balign%7D%0A%5Cright.%0A"></p>
<p>For any fixed value of <img src="https://latex.codecogs.com/png.latex?x%20%5Cin%20%5Cmathcal%7BX%7D">, the fast OU-dynamics</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AdY%20=%20-%5BY+%5Cvarepsilon%5E%7B1/2%7D%5Cnabla%20U(x)%5D%20%5C,%20dt%20+%20%5Csqrt%7B2%7D%20%5C,%20dW%5EY%0A"></p>
<p>converges to a Gaussian distribution with mean <img src="https://latex.codecogs.com/png.latex?-%5Cvarepsilon%5E%7B1/2%7D%20%5C,%20%5Cnabla%20U(x)"> and unit variance. The same arguments as in the previous section immediately give that, starting from <img src="https://latex.codecogs.com/png.latex?X%5E%7B%5Cvarepsilon%7D_0=x">, we have</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cvarepsilon%5E%7B-1/2%7D%20%5C,%20%5Cint_%7B0%7D%5E%7B%5Cdelta%7D%20Y%5E%7B%5Cvarepsilon%7D%20%5C,%20dt%0A%5C;%20%5Cto%20%5C;%0A-%5Cnabla%20U(x)%20%5C,%20%5Cdelta%20+%20%5Csqrt%7B2%20%5Cdelta%7D%20%5C,%20%5Cmathcal%7BN%7D(0,1).%0A"></p>
<p>The <img src="https://latex.codecogs.com/png.latex?%5Csqrt%7B2%7D"> term comes from the OU asymptotic variance. This shows that the slow process converges as <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon%5Cto%200"> to the <a href="https://en.wikipedia.org/wiki/Brownian_dynamics">overdamped Langevin dynamics</a></p>
<p><img src="https://latex.codecogs.com/png.latex?%0AdX%20=%20-%5Cnabla%20U(X)%20%5C;%20+%20%5C;%20%5Csqrt%7B2%7D%20%5C,%20dW.%0A"></p>
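<p>As an illustration, one can simulate this slow-fast system for the quadratic potential U(x) = x²/2 (a choice made here purely for testing, since the limiting overdamped Langevin dynamics dX = -X dt + √2 dW then has stationary law N(0, 1)) and check the long-run variance of the slow component:</p>

```python
import numpy as np

rng = np.random.default_rng(3)

eps, dt, T = 1e-2, 1e-3, 400.0
n = int(T / dt)
xi = rng.standard_normal(n)        # Brownian increments driving the fast channel
grad_U = lambda x: x               # U(x) = x^2 / 2 (illustrative choice)

x, y = 0.0, 0.0
xs = np.empty(n)
for i in range(n):
    x += y * dt / np.sqrt(eps)
    y += -(y + np.sqrt(eps) * grad_U(x)) * dt / eps + np.sqrt(2.0 * dt / eps) * xi[i]
    xs[i] = x

var_x = xs[n // 4:].var()          # limiting overdamped Langevin stationary law is N(0, 1)
```

For this linear case the stationary variance of the slow component is in fact exactly one for every ε, so the check does not even rely on ε being very small.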
</section>
<section id="example-stratonovich-corrections" class="level3">
<h3 class="anchored" data-anchor-id="example-stratonovich-corrections">Example: Stratonovich Corrections</h3>
<p>Consider a function <img src="https://latex.codecogs.com/png.latex?f:%20%5Cmathbb%7BR%7D%5Cto%20%5Cmathbb%7BR%7D"> and the slow-fast system</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AdX%5E%7B%5Cvarepsilon%7D%20=%20%5Cvarepsilon%5E%7B-1/2%7D%20%5C,%20f(X%5E%7B%5Cvarepsilon%7D)%20%5C,%20Y%5E%7B%5Cvarepsilon%7D%20%5C,%20dt%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?dY%5E%7B%5Cvarepsilon%7D%20=%20-(%5Clambda/%5Cvarepsilon)%20Y%5E%7B%5Cvarepsilon%7D%20%5C,%20dt%20+%20%5Csqrt%7B2%20%5Clambda%20/%20%5Cvarepsilon%7D%20%5C,%20dW%5EY"> is a fast OU process mixing on scales of order <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BO%7D(%5Cvarepsilon)"> and with standard centred Gaussian invariant distribution <img src="https://latex.codecogs.com/png.latex?%5Crho(dy)">. The discussion leading to Equation&nbsp;8 suggests that the term <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon%5E%7B-1/2%7D%20%5C,%20Y%5E%7B%5Cvarepsilon%7D%20%5C,%20dt"> can heuristically be thought of as <img src="https://latex.codecogs.com/png.latex?(2/%5Clambda)%5E%7B1/2%7D%20%5C,%20dW">, which would imply that the effective dynamics for the slow process is</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AdX%20=%20%5Csqrt%7B2/%5Clambda%7D%20%5C,%20f(X)%20%5C,%20dW.%0A"></p>
<p>We will see that this heuristic is <strong>wrong</strong>! To obtain the effective dynamics of the slow process as <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon%5Cto%200">, note that the generator of the fast OU process reads <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7D%5Cvarphi=%20%5Clambda%20%5B%20-y%5C,%5Cvarphi_y%20+%20%5Cvarphi_%7Byy%7D%5D">, so that the Poisson equation <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7D%5CPhi(x,y)%20=%20f(x)y"> is solved by <img src="https://latex.codecogs.com/png.latex?%5CPhi(x,y)%20=%20-f(x)y/%5Clambda">. One already knows that <img src="https://latex.codecogs.com/png.latex?%5Cmathop%7B%5Cmathrm%7BVar%7D%7D%5BT%5E%7B-1/2%7D%5Cint_%7B%5B0,T%5D%7D%20Y_t%20%5C,%20dt%5D%20%5Cto%202/%5Clambda">. The drift term is given by</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0AI(x)%20&amp;=%20-%5Cint%20f(x)%20%5C,%20y%20%5C,%20%5Cpartial_x%20%5CPhi(x,y)%20%5C,%20%5Crho(dy)%5C%5C%0A&amp;=%20%5Clambda%5E%7B-1%7D%20%5Cint%20f(x)%20f'(x)%20y%5E2%20%5C,%20%5Crho(dy)%5C%5C%0A&amp;=%20%5Clambda%5E%7B-1%7D%20f(x)%20f'(x).%0A%5Cend%7Balign%7D%0A"></p>
<p>Putting everything together gives that the effective slow dynamics reads</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0AdX%20&amp;=%20%20%5Ctextcolor%7Bblue%7D%7B%20%5Clambda%5E%7B-1%7D%20f'(X)%20f(X)%20%5C,%20dt%20%7D%20+%20%5Csqrt%7B2/%5Clambda%7D%20%5C,%20f(X)%20%5C,%20dW%5C%5C%0A&amp;=%20%5Csqrt%7B2/%5Clambda%7D%20%5C,%20f(X)%20%20%5Ctextcolor%7Bred%7D%7B%5Ccirc%7D%20dW%0A%5Cend%7Balign%7D%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%20%5Ctextcolor%7Bred%7D%7B%5Ccirc%7D"> denotes <a href="https://en.wikipedia.org/wiki/Stratonovich_integral">Stratonovich integration</a>.</p>
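<p>The Itô/Stratonovich distinction is visible in simulation. With the illustrative choices f(x) = x, λ = 1 and X₀ = 1 (made here only for the test), the Stratonovich limit dX = √2 X ∘ dW gives E[X_T] = e^T, whereas the naive Itô reading dX = √2 X dW would predict E[X_T] = 1. A Monte-Carlo sketch:</p>

```python
import numpy as np

rng = np.random.default_rng(4)

lam, eps = 1.0, 1e-3
dt, T = 1e-4, 0.5
n, reps = int(T / dt), 4000

x = np.ones(reps)                      # X_0 = 1 for every replica
y = rng.standard_normal(reps)          # fast OU process started at stationarity
for _ in range(n):
    x *= 1.0 + y * dt / np.sqrt(eps)   # dX = eps^{-1/2} f(X) Y dt with f(x) = x
    y += -(lam / eps) * y * dt + np.sqrt(2.0 * lam * dt / eps) * rng.standard_normal(reps)

mean_T = x.mean()   # Stratonovich limit predicts exp(T) ~ 1.65; naive Ito would give 1.0
```

The empirical mean clearly favors the Stratonovich prediction, confirming that the heuristic Itô substitution misses the drift correction λ⁻¹ f'(X) f(X).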
</section>
<section id="readings" class="level3">
<h3 class="anchored" data-anchor-id="readings">Readings</h3>
<p>The book <span class="citation" data-cites="pavliotis2008multiscale">(Pavliotis and Stuart 2008)</span> is beautiful, and I quite like the section on multiscale expansion in <span class="citation" data-cites="weinan2011principles">(Weinan 2011)</span>. For proving this type of results with the “martingale problem” approach <span class="citation" data-cites="stroock1997multidimensional">(Stroock and Varadhan 1997)</span>, the lectures <span class="citation" data-cites="papanicolaou1977martingale">(Papanicolaou 1977)</span> are nicely done.</p>



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0">
<div id="ref-hinch_1991" class="csl-entry">
Hinch, E. J. 1991. <em>Perturbation Methods</em>. Cambridge University Press.
</div>
<div id="ref-papanicolaou1977martingale" class="csl-entry">
Papanicolaou, George. 1977. <span>“Martingale Approach to Some Limit Theorems.”</span> In <em>Papers from the Duke Turbulence Conference, Duke Univ., Durham, NC, 1977</em>.
</div>
<div id="ref-pavliotis2008multiscale" class="csl-entry">
Pavliotis, Grigoris, and Andrew Stuart. 2008. <em>Multiscale Methods: Averaging and Homogenization</em>. Springer Science &amp; Business Media.
</div>
<div id="ref-stroock1997multidimensional" class="csl-entry">
Stroock, Daniel W, and SR Srinivasa Varadhan. 1997. <em>Multidimensional Diffusion Processes</em>. Vol. 233. Springer Science &amp; Business Media.
</div>
<div id="ref-weinan2011principles" class="csl-entry">
Weinan, E. 2011. <em>Principles of Multiscale Modeling</em>. Cambridge University Press.
</div>
</div></section></div> ]]></description>
  <category>diffusion</category>
  <guid>https://alexxthiery.github.io/notes/averaging_homogenization/averaging_homogenization.html</guid>
  <pubDate>Mon, 27 Nov 2023 17:00:00 GMT</pubDate>
</item>
<item>
  <title>Ensemble Kalman Smoother (EnKS)</title>
  <link>https://alexxthiery.github.io/notes/Gaussian_Assimilation/gaussian_assimilation_smoothing.html</link>
  <description><![CDATA[ 





<p>Consider a linear-Gaussian state space model with <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BR%7D%5E%7BD_x%7D">-valued dynamics <img src="https://latex.codecogs.com/png.latex?X_%7Bt+1%7D%20%5Csim%20F%20%5C,%20X_t%20+%20%5Cmathcal%7BN%7D(0,Q)"> and <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BR%7D%5E%7BD_y%7D">-valued observations <img src="https://latex.codecogs.com/png.latex?Y_t%20%5Csim%20H%20%20X_t%20+%20%5Cmathcal%7BN%7D(0,R)">. Assuming a Gaussian initial distribution, the <strong>filtering distributions</strong> <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D(X_t%20%5Cin%20dx%20%5C,%20%7C%20Y_%7B1:t%7D)%20%5Cequiv%20%5Cmathcal%7BN%7D(%5Cmu_%7Bt%7Ct%7D,%20P_%7Bt%7Ct%7D)"> are Gaussian and can be sequentially computed with the <a href="https://en.wikipedia.org/wiki/Kalman_filter">Kalman Filter</a>. Similarly, the <strong>predictive distributions</strong> <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D(X_%7Bt+1%7D%20%5Cin%20dx%20%5C,%20%7C%20Y_%7B1:t%7D)%20%5Cequiv%20%5Cmathcal%7BN%7D(%5Cmu_%7Bt+1%7Ct%7D,%20P_%7Bt+1%7Ct%7D)"> are straightforward to obtain from the filtering distributions: <img src="https://latex.codecogs.com/png.latex?%5Cmu_%7Bt+1%7Ct%7D%20=%20F%20%5C,%20%5Cmu_%7Bt%7Ct%7D"> and <img src="https://latex.codecogs.com/png.latex?P_%7Bt+1%7Ct%7D%20=%20F%20%5C,%20P_%7Bt%7Ct%7D%20%5C,%20F%5E%5Ctop%20+%20Q">. Given observations <img src="https://latex.codecogs.com/png.latex?y_%7B1:T%7D%20%5Cequiv%20(y_1,%20%5Cldots,%20y_T)"> and <img src="https://latex.codecogs.com/png.latex?1%20%5Cleq%20t%20%5Cleq%20T">, the <strong>smoothing distributions</strong> <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D(X_t%20%5Cin%20dx%20%5C,%20%7C%20Y_%7B1:T%7D)%20%5Cequiv%20%5Cmathcal%7BN%7D(%5Cmu_%7Bt%7CT%7D,%20P_%7Bt%7CT%7D)"> can be computed by performing a “backward pass”. 
Since everything is linear and Gaussian, it is just an exercise in Linear Algebra &amp; Gaussian-conditioning, as described by the Rauch-Tung-Striebel <span class="citation" data-cites="rauch1965maximum">(Rauch, Tung, and Striebel 1965)</span> smoothing recursions. The backward recursion reads</p>
<p><span id="eq-RTS"><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Baligned%7D%0A%5Cmu_%7Bt%7CT%7D%0A&amp;=%20%5Cmu_%7Bt%7Ct%7D%20+%20B_t%20%5C,%20%20%7B%5Cleft(%20%5Cmu_%7Bt+1%7CT%7D%20-%20%5Cmu_%7Bt+1%7Ct%7D%20%5Cright)%7D%20%5C%5C%0AP_%7Bt%7CT%7D%0A&amp;=%0AP_%7Bt%7Ct%7D%20+%20B_t%20%20%7B%5Cleft(%20%20P_%7Bt+1%7CT%7D%20-%20P_%7Bt+1%7Ct%7D%20%20%5Cright)%7D%20%20B%5E%5Ctop_%7Bt%7D%0A%5Cend%7Baligned%7D%0A%5Cright.%0A%5Ctag%7B1%7D"></span></p>
<p>and allows one to compute the smoothing means and covariance matrices <img src="https://latex.codecogs.com/png.latex?(%5Cmu_%7Bt%7CT%7D,%20P_%7Bt%7CT%7D)"> for <img src="https://latex.codecogs.com/png.latex?1%20%5Cleq%20t%20%5Cleq%20T"> starting from the knowledge of <img src="https://latex.codecogs.com/png.latex?(%5Cmu_%7BT%7CT%7D,%20P_%7BT%7CT%7D)">. In Equation&nbsp;1, the <strong>smoothing gain matrix</strong> <img src="https://latex.codecogs.com/png.latex?B_t"> is given by</p>
<p><span id="eq-B-cond"><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0AB_t%20&amp;=%0A%5Cmathop%7B%5Cmathrm%7BCov%7D%7D(X_t,%20X_%7Bt+1%7D%20%5C,%20%7C%20y_%7B1:t%7D)%20%5C,%20%5Cmathop%7B%5Cmathrm%7BVar%7D%7D(X_%7Bt+1%7D%20%5C,%20%7C%20y_%7B1:t%7D)%5E%7B-1%7D%20%5C%5C%0A&amp;=%0AP_%7Bt%7Ct%7D%20F%5E%5Ctop%20%5C,%20%20%7B%5Cleft(%20F%20%5C,%20P_%7Bt%7Ct%7D%20%5C,%20F%5E%5Ctop%20+%20Q%20%5Cright)%7D%20%5E%7B-1%7D.%0A%5Cend%7Balign%7D%0A%5Ctag%7B2%7D"></span></p>
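<p>The backward recursion Equation&nbsp;1, together with the gain formula Equation&nbsp;2, can be sketched in a few lines of NumPy. This is only an illustrative sketch, not part of the original derivation: the function name <code>rts_smoother</code> and its interface are made up here, and the filtering moments are assumed to have been produced beforehand by a forward Kalman filter pass.</p>

```python
import numpy as np

def rts_smoother(mus_f, Ps_f, F, Q):
    """Rauch-Tung-Striebel backward pass (illustrative sketch).

    mus_f: (T, Dx) filtering means mu_{t|t}
    Ps_f:  (T, Dx, Dx) filtering covariances P_{t|t}
    F, Q:  transition matrix and transition-noise covariance
    Returns the smoothing means mu_{t|T} and covariances P_{t|T}.
    """
    T, _ = mus_f.shape
    mus_s, Ps_s = mus_f.copy(), Ps_f.copy()  # at t = T, smoothing = filtering
    for t in range(T - 2, -1, -1):
        # predictive moments: mu_{t+1|t} = F mu_{t|t},  P_{t+1|t} = F P_{t|t} F' + Q
        mu_p = F @ mus_f[t]
        P_p = F @ Ps_f[t] @ F.T + Q
        # smoothing gain of Equation 2: B_t = P_{t|t} F' (F P_{t|t} F' + Q)^{-1}
        B = Ps_f[t] @ F.T @ np.linalg.inv(P_p)
        # backward recursion of Equation 1
        mus_s[t] = mus_f[t] + B @ (mus_s[t + 1] - mu_p)
        Ps_s[t] = Ps_f[t] + B @ (Ps_s[t + 1] - P_p) @ B.T
    return mus_s, Ps_s
```

In practice one would solve a linear system rather than form the explicit inverse; it is written this way only to mirror Equation&nbsp;2.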
<p><a href="../../notes/Gaussian_Assimilation/gaussian_assimilation.html">The Ensemble Kalman Filter</a> (EnKF) is a non-linear equivalent of the Kalman filter, and the purpose of these notes is to derive the equivalent “ensemble version” of the backward recursion Equation&nbsp;1. For this purpose, it helps to understand the role of the smoothing gain matrix <img src="https://latex.codecogs.com/png.latex?B_t"> a little better. Consider the pair of random variables <img src="https://latex.codecogs.com/png.latex?(X%5Ef_t,%20X%5Ep_%7Bt+1%7D)"> distributed according to the joint distribution of the filtering distribution at time <img src="https://latex.codecogs.com/png.latex?t"> and the predictive distribution at time <img src="https://latex.codecogs.com/png.latex?t+1">, in the sense that</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A(X%5Ef_t,%20X%5Ep_%7Bt+1%7D)%20%5C;%20%5Cunderbrace%7B=%7D_%7B%5Ctext%7B(law)%7D%7D%5C;%20(X_t,%20X_%7Bt+1%7D%20%5C,%20%5Cmid%20%5C,%20y_%7B1:t%7D).%0A"></p>
<p>This means that <img src="https://latex.codecogs.com/png.latex?X%5Ef_t%20%5Csim%20%5Cmathcal%7BN%7D(%5Cmu_%7Bt%7Ct%7D,%20P_%7Bt%7Ct%7D)"> and <img src="https://latex.codecogs.com/png.latex?X%5Ep_%7Bt+1%7D%20%5Csim%20%5Cmathcal%7BN%7D(%5Cmu_%7Bt+1%7Ct%7D,%20P_%7Bt+1%7Ct%7D)">, with <img src="https://latex.codecogs.com/png.latex?X%5Ep_%7Bt+1%7D%20=%20F%20%5C,%20X%5Ef_t%20+%20%5Cmathcal%7BN%7D(0,%20Q)">. Furthermore, Equation&nbsp;2 and the standard <a href="https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Conditional_distributions">Gaussian conditioning</a> formulas give the conditional mean and covariance:</p>
<p><span id="eq-gauss-cond"><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Balign%7D%0A%5Ctextrm%7BMean%7D%20%7B%5Cleft(%20X%5Ef_t%20%7C%20(X%5Ep_%7Bt+1%7D=x_%7Bt+1%7D)%20%5Cright)%7D%20%20%0A%5C;%20&amp;=%20%5C;%0A%5Cmu_%7Bt%7Ct%7D%20+%20B_t%20(x_%7Bt+1%7D%20-%20%5Cmu_%7Bt+1%7Ct%7D)%20%5C%5C%0A%5Ctextrm%7BCov%7D%20%7B%5Cleft(%20X%5Ef_t%20%7C%20(X%5Ep_%7Bt+1%7D=x_%7Bt+1%7D)%20%5Cright)%7D%20%20%0A%5C;%20&amp;=%20%5C;%0AP_%7Bt%7Ct%7D%20-%20B_t%20%5C,%20P_%7Bt+1%7Ct%7D%20%5C,%20B_t%5E%5Ctop.%0A%5Cend%7Balign%7D%0A%5Cright.%0A%5Ctag%7B3%7D"></span></p>
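<p>Equations&nbsp;2 and&nbsp;3 can be checked numerically: sampling a large number of pairs <img src="https://latex.codecogs.com/png.latex?(X%5Ef_t,%20X%5Ep_%7Bt+1%7D)"> and regressing one on the other recovers the closed-form gain. The two-dimensional model below is purely illustrative (all matrices are made up for the sake of the experiment):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# an arbitrary 2-dimensional linear-Gaussian model (illustrative numbers)
F = np.array([[0.9, 0.1], [0.0, 0.8]])
Q = 0.3 * np.eye(2)
mu, P = np.zeros(2), np.array([[1.0, 0.2], [0.2, 0.5]])

# closed-form smoothing gain of Equation 2: B = P F' (F P F' + Q)^{-1}
B_exact = P @ F.T @ np.linalg.inv(F @ P @ F.T + Q)

# Monte Carlo: sample (X^f, X^p) jointly, with X^p = F X^f + noise
N = 100_000
Xf = rng.multivariate_normal(mu, P, size=N)
Xp = Xf @ F.T + rng.multivariate_normal(np.zeros(2), Q, size=N)

# least-squares regression of X^f on X^p using centred anomalies
Af, Ap = Xf - Xf.mean(axis=0), Xp - Xp.mean(axis=0)
B_mc = (Af.T @ Ap) @ np.linalg.inv(Ap.T @ Ap)

# B_mc agrees with B_exact up to O(1/sqrt(N)) Monte Carlo error
```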
<p>The above expression for the conditional mean also shows that the matrix <img src="https://latex.codecogs.com/png.latex?B_t"> is a minimizer of the loss</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AM%20%5C;%20%5Cmapsto%20%5C;%0A%5Cmathbb%7BE%7D%20%7B%5Cleft(%20%20%5Cleft%5C%7C%20(X%5Ef_t%20-%20%5Cmu_%7Bt%7Ct%7D)%20-%20M%20(X%5Ep_%7Bt+1%7D%20-%20%5Cmu_%7Bt+1%7Ct%7D)%20%5Cright%5C%7C%5E2%20%20%5Cright)%7D%0A"></p>
<p>over all matrices <img src="https://latex.codecogs.com/png.latex?M%20%5Cin%20%5Cmathbb%7BR%7D%5E%7BD_x%20%5Ctimes%20D_x%7D">. Heuristically, this shows that the smoothing gain matrix <img src="https://latex.codecogs.com/png.latex?B_t"> can easily be computed by <strong>regressing</strong> <img src="https://latex.codecogs.com/png.latex?X%5Ef_t"> against <img src="https://latex.codecogs.com/png.latex?X%5Ep_%7Bt+1%7D">. We can use this remark to build an ensemble version of the backward recursion Equation&nbsp;1. Recall that when running an EnKF for filtering the observations <img src="https://latex.codecogs.com/png.latex?y_%7B1:T%7D">, the final stage proceeds in two steps:</p>
<ol type="1">
<li>Obtain an ensemble of particles <img src="https://latex.codecogs.com/png.latex?X%5E%7Bi,p%7D_%7BT%7D%20=%20F%20%5C,%20X%5E%7Bi,f%7D_%7BT-1%7D%20+%20%5Cmathcal%7BN%7D(0,Q)"> that approximate the predictive distribution <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D(X_T%20%7C%20y_%7B1:T-1%7D)">.<br>
</li>
<li><a href="../../notes/Gaussian_Assimilation/gaussian_assimilation.html">Assimilate</a> the last observation <img src="https://latex.codecogs.com/png.latex?y_T"> using the Kalman gain matrix <img src="https://latex.codecogs.com/png.latex?K_T"> and the correction <img src="https://latex.codecogs.com/png.latex?%5CDelta_T%5Ei%20=%20K_T%20%5C,%20(%5Ctilde%7By%7D_%7Bi,%5Cstar%7D%20-%20H%20%5C,%20X%5E%7Bi,p%7D_T)"> by setting <span id="eq-pred-perturb"><img src="https://latex.codecogs.com/png.latex?%0AX%5E%7Bi,s%7D_T%20=%20X%5E%7Bi,p%7D_T%20+%20%5CDelta_T%5Ei.%0A%5Ctag%7B4%7D"></span> The particles <img src="https://latex.codecogs.com/png.latex?X%5E%7Bi,s%7D_T"> approximate the smoothing distribution <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D(X_T%20%7C%20y_%7B1:T%7D)">.</li>
</ol>
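<p>Step 2 above, i.e.&nbsp;the perturbed-observation assimilation of Equation&nbsp;4, can be sketched as follows. This is only a stochastic-EnKF sketch under the linear observation model of these notes; the function name <code>enkf_assimilate</code> and its interface are illustrative.</p>

```python
import numpy as np

def enkf_assimilate(X_p, y, H, R, rng):
    """Perturbed-observation EnKF update (Equation 4), illustrative sketch.

    X_p: (N, Dx) predictive ensemble, y: (Dy,) observation.
    Returns the analysis ensemble and the corrections Delta.
    """
    N, Dx = X_p.shape
    # ensemble covariance and Kalman gain K = P H' (H P H' + R)^{-1}
    P = np.cov(X_p.T).reshape(Dx, Dx)
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
    # perturbed observations y~_{i,*} = y + N(0, R), one per particle
    y_pert = y + rng.multivariate_normal(np.zeros(y.shape[0]), R, size=N)
    # corrections Delta_i = K (y~_{i,*} - H X^{i,p})
    Delta = (y_pert - X_p @ H.T) @ K.T
    return X_p + Delta, Delta
```

The corrections <img src="https://latex.codecogs.com/png.latex?%5CDelta%5Ei_T"> returned here are exactly the quantities that are pulled backward in Equation&nbsp;5.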
<p>Following our discussion of the smoothing gain matrix <img src="https://latex.codecogs.com/png.latex?B_%7Bt%7D"> and Equation&nbsp;4, it seems sensible to set</p>
<p><span id="eq-rec-ens"><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0AX%5E%7Bi,s%7D_%7BT-1%7D%0A&amp;=%20X%5E%7Bi,f%7D_%7BT-1%7D%20+%20B_%7BT-1%7D%20%5C,%20%5CDelta%5Ei_T%5C%5C%0A&amp;=%20X%5E%7Bi,f%7D_%7BT-1%7D%20+%20B_%7BT-1%7D%20%5C,%20(X%5E%7Bi,s%7D_%7BT%7D%20-%20X%5E%7Bi,p%7D_%7BT%7D)%0A%5Cend%7Balign%7D%0A%5Ctag%7B5%7D"></span></p>
<p>and hope that the ensemble of updated particles <img src="https://latex.codecogs.com/png.latex?X%5E%7Bi,s%7D_%7BT-1%7D"> approximates the smoothing distribution <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D(X_%7BT-1%7D%20%7C%20y_%7B1:T%7D)">. In words, the particle <img src="https://latex.codecogs.com/png.latex?X%5E%7Bi,s%7D_%7BT-1%7D"> is obtained by “pulling” the correction term <img src="https://latex.codecogs.com/png.latex?%5CDelta%5Ei_%7BT%7D%20=%20X%5E%7Bi,s%7D_%7BT%7D%20-%20X%5E%7Bi,p%7D_%7BT%7D"> back to <img src="https://latex.codecogs.com/png.latex?X%5E%7Bi,f%7D_%7BT-1%7D"> through the “regression” smoothing gain matrix <img src="https://latex.codecogs.com/png.latex?B_%7BT-1%7D">. To check that the particles <img src="https://latex.codecogs.com/png.latex?X%5E%7Bi,s%7D_%7BT-1%7D"> indeed approximate the smoothing distribution <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D(X_%7BT-1%7D%20%5C,%7Cy_%7B1:T%7D)">, it suffices to compute the mean and covariance and verify that they match those given by Equation&nbsp;1. Recall that Equation&nbsp;3 gives that the filtering/predictive distributions satisfy</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AX%5Ef_%7BT-1%7D%20=%20%5Cmu_%7BT-1%7CT-1%7D%20+%20B_%7BT-1%7D%20%5C,%20(X%5Ep_%7BT%7D%20-%20%5Cmu_%7BT%7CT-1%7D)%20+%20%5Cvarepsilon_%7BT-1%7D%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon_%7BT-1%7D%20%5Csim%20%5Cmathcal%7BN%7D(0,%20P_%7BT-1%7CT-1%7D%20-%20B_%7BT-1%7D%20%5C,%20P_%7BT%7CT-1%7D%20%5C,%20B_%7BT-1%7D%5E%5Ctop)"> is independent of all other sources of randomness. Plugging this into Equation&nbsp;5 gives that</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AX%5E%7Bi,s%7D_%7BT-1%7D%0A=%0A%5Cmu_%7BT-1%7CT-1%7D%20+%20B_%7BT-1%7D%20%5C,%20(X%5E%7Bi,s%7D_%7BT%7D%20-%20%5Cmu_%7BT%7CT-1%7D)%20+%20%5Cvarepsilon_%7BT-1%7D.%0A"></p>
<p>Since the <img src="https://latex.codecogs.com/png.latex?X%5E%7Bi,s%7D_%7BT%7D"> are distributed according to the smoothing distribution, i.e.&nbsp;<img src="https://latex.codecogs.com/png.latex?X%5E%7Bi,s%7D_%7BT%7D%20%5Csim%20%5Cmathcal%7BN%7D(%5Cmu_%7BT%7CT%7D,%20P_%7BT%7CT%7D)">, this immediately shows that <img src="https://latex.codecogs.com/png.latex?X%5E%7Bi,s%7D_%7BT-1%7D"> is Gaussian with</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Balign%7D%0A%5Ctextrm%7BMean%7D%20&amp;=%20%5Cmu_%7BT-1%7CT%7D%20=%20%5Cmu_%7BT-1%7CT-1%7D%20+%20B_%7BT-1%7D%20%5C,%20%20%7B%5Cleft(%20%5Cmu_%7BT%7CT%7D%20-%20%5Cmu_%7BT%7CT-1%7D%20%5Cright)%7D%20%5C%5C%0A%5Ctextrm%7BCovariance%7D%20&amp;=%20P_%7BT-1%7CT%7D%20=%20P_%7BT-1%7CT-1%7D%20+%20B_%7BT-1%7D%20%20%7B%5Cleft(%20%20P_%7BT%7CT%7D%20-%20P_%7BT%7CT-1%7D%20%20%5Cright)%7D%20%20B%5E%5Ctop_%7BT-1%7D,%0A%5Cend%7Balign%7D%0A%5Cright.%0A"></p>
<p>as it should. One can then iterate this construction to obtain particle approximations of the smoothing distributions <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D(X_t%20%7C%20y_%7B1:T%7D)"> for <img src="https://latex.codecogs.com/png.latex?1%20%5Cleq%20t%20%5Cleq%20T"> by running a backward pass and recursively setting</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AX%5E%7Bi,s%7D_t%20%5C;%20=%20%5C;%20X%5E%7Bi,f%7D_t%20+%20B_t%20%5C,%20%20%7B%5Cleft(%20X%5E%7Bi,s%7D_%7Bt+1%7D%20-%20X%5E%7Bi,p%7D_%7Bt+1%7D%20%5Cright)%7D%20.%0A"></p>
<p>The ensemble of particles <img src="https://latex.codecogs.com/png.latex?X%5E%7Bi,s%7D_t"> approximates the smoothing distribution <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D(X_t%20%7C%20y_%7B1:T%7D)">. In a nonlinear setting, one can approximate the smoothing gain matrices with the empirical estimates</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cwidehat%7BB%7D_t%20=%20%5Cmathop%7B%5Cmathrm%7BCov%7D%7D%20%7B%5Cleft(%20%20x%5Ef_%7Bt,i%7D,%20x%5Ep_%7Bt+1,i%7D%20%20%5Cright)%7D%20%20%5C,%20%5Cmathop%7B%5Cmathrm%7BVar%7D%7D%20%7B%5Cleft(%20%20x%5Ep_%7Bt+1,i%7D%20%20%5Cright)%7D%20%5E%7B-1%7D.%0A"></p>
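<p>Putting everything together, the full ensemble backward pass can be sketched as below. Again, this is a sketch rather than a reference implementation: it assumes the forward EnKF stored, for each time, the filtering ensemble and the predictive ensemble generated from it (so that <code>Xs_p[t+1]</code> is paired particle-by-particle with <code>Xs_f[t]</code>), and the names are illustrative.</p>

```python
import numpy as np

def ensemble_rts_pass(Xs_f, Xs_p):
    """Ensemble backward smoothing pass with empirically estimated gains.

    Xs_f: list of (N, Dx) filtering ensembles X^{i,f}_t, t = 0, ..., T-1
    Xs_p: list of (N, Dx) predictive ensembles; Xs_p[t+1] was generated
          from Xs_f[t], so the two are paired particle-by-particle
          (Xs_p[0] is unused and may be None).
    Returns the list of smoothing ensembles X^{i,s}_t.
    """
    T = len(Xs_f)
    Xs_s = [None] * T
    Xs_s[-1] = Xs_f[-1].copy()  # at t = T, smoothing = filtering
    for t in range(T - 2, -1, -1):
        Xf, Xp = Xs_f[t], Xs_p[t + 1]
        Af = Xf - Xf.mean(axis=0)  # centred anomalies
        Ap = Xp - Xp.mean(axis=0)
        # empirical gain: hat(B)_t = Cov(X^f_t, X^p_{t+1}) Var(X^p_{t+1})^{-1}
        B = (Af.T @ Ap) @ np.linalg.inv(Ap.T @ Ap)
        # pull the correction X^{i,s}_{t+1} - X^{i,p}_{t+1} backward
        Xs_s[t] = Xf + (Xs_s[t + 1] - Xp) @ B.T
    return Xs_s
```

In the linear-Gaussian case, the empirical gain converges to the exact <img src="https://latex.codecogs.com/png.latex?B_t"> of Equation&nbsp;2 as the ensemble size grows.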
<p>[Experiments: TODO]</p>




<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0">
<div id="ref-rauch1965maximum" class="csl-entry">
Rauch, Herbert E, F Tung, and Charlotte T Striebel. 1965. <span>“Maximum Likelihood Estimates of Linear Dynamic Systems.”</span> <em>AIAA Journal</em> 3 (8): 1445–50.
</div>
</div></section></div> ]]></description>
  <category>enkf</category>
  <category>data-assimilation</category>
  <guid>https://alexxthiery.github.io/notes/Gaussian_Assimilation/gaussian_assimilation_smoothing.html</guid>
  <pubDate>Fri, 17 Nov 2023 17:00:00 GMT</pubDate>
</item>
</channel>
</rss>
