The slides for this short course on diffusion models (denoising diffusions, probability flows) and other flow methods (stochastic interpolants, flow-matching) are available here. There are a few animations, so loading the slides may be slow…

Consider a diffusion in with deterministic starting position and dynamics

for drift and volatility functions and . On the time interval , this defines a probability on the path-space . For two functions and , consider the probability distribution defined as

where denotes the normalizing constant

The distribution places more probability mass on trajectories such that is large. As described in these notes on Doob h-transforms, the tilted probability distribution can be described by a diffusion process with dynamics

The control function is of the gradient form

and the function is described by the conditional expectation,

The expression is quite intuitive; in order to describe the tilted measure that places more probability mass on trajectories such that is large, the optimal control should be in the direction of states such that the “reward-to-go” quantity is large.

To obtain a variational description of the optimal control function , it suffices to express it as the solution of an optimization problem. It turns out that KL-divergences between diffusion processes are the right tool for this: we will write as the minimizer of for a class of tractable probability distributions described by controlled diffusions. As described in these notes on the Girsanov Theorem, for any control function , the controlled diffusion with dynamics

and started at induces a probability distribution on path-space given by

This allows one to write down explicitly the expression for the negative KL divergence

between and the tilted distribution . The notation denotes the expectation with respect to the controlled diffusion . The negative KL is, up to a constant, the usual Evidence Lower Bound (ELBO) used in variational inference. Since the quantity can be expressed as

it follows from Equation 3 that equals

Since the KL divergence is non-negative and the optimal control in Equation 2 drives the KL divergence to zero, we have that

where the minimization is over all (reasonably well-behaved) control functions and

For maximizing the ELBO, the control needs to drive the trajectories to regions where is large while at the same time keeping the control effort small. The optimal control is given by Equation 2 and Equation 1 gives that
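As a small sanity check of this picture, here is a sketch under hypothetical choices not taken from these notes: for a Brownian motion dX_t = sigma dW_t started at zero with linear terminal reward r(x) = lambda x, the conditional expectation is h(t, x) = exp(lambda x + lambda^2 sigma^2 (T - t) / 2), so the optimal control u* = sigma^2 grad log h = sigma^2 lambda is constant, and the tilted law of X_T is the Gaussian N(sigma^2 lambda T, sigma^2 T). One can then check numerically that the exponentially reweighted uncontrolled diffusion and the optimally controlled diffusion give the same terminal mean:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, lam, T = 1.0, 0.7, 1.0          # hypothetical parameters
n_steps, n_paths = 100, 50_000
dt = T / n_steps

# Uncontrolled paths: dX = sigma dW, X_0 = 0.
noise = rng.standard_normal((n_paths, n_steps)) * np.sqrt(dt)
X_T = sigma * noise.sum(axis=1)

# Tilted mean of X_T via self-normalized importance weights exp(lam * X_T).
w = np.exp(lam * X_T - lam * X_T.max())    # stabilized weights
tilted_mean = (w * X_T).sum() / w.sum()

# Controlled paths: dX = sigma^2 lam dt + sigma dW (constant optimal control),
# reusing the same Brownian increments for a coupled comparison.
Y_T = sigma**2 * lam * T + sigma * noise.sum(axis=1)

# Both estimates should be close to the exact tilted mean sigma^2 lam T = 0.7.
print(tilted_mean, Y_T.mean())
```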

Since there was nothing really special about the starting point and the time horizon , the above derivation gives the solution to the following stochastic optimal control problem. It is written as a maximization problem although a large part of the control and physics literature writes it as an equivalent minimization problem. Consider the reward-to-go function (a.k.a. value function) defined as

We have that

This shows that the optimal control can also be expressed as

The expression is intuitive: since we are trying to maximize the reward-to-go function, the optimal control should be in the direction of the gradient of the reward-to-go function.

Finally, let us mention that one can easily derive the Hamilton-Jacobi-Bellman equation for the reward-to-go function . We have

with . For , we have

where is the generator of the uncontrolled diffusion. Since is a simple quadratic function, the supremum over the control can be computed in closed form,

as we already knew from Equation 4. This implies that the reward-to-go function satisfies the HJB equation

with terminal condition . Another route to derive Equation 5 is to simply use the fact that ; since the Feynman-Kac formula gives that the function satisfies , the conclusion follows from a few lines of algebra: write , expand, and express everything back in terms of . The term naturally arises when expressing the diffusion term as a function of the second derivative of ; this is the idea behind the standard Cole-Hopf transformation.

Let be the Gaussian distribution with mean and covariance . For a direction , consider the distribution , i.e. the same Gaussian distribution but shifted by an amount . Algebra directly gives that

We will see that, not very surprisingly, a similar change-of-probability result holds in continuous time. On the time interval , let be a standard Brownian motion in and be the solution to the SDE

for some drift and diffusion and initial distribution . This SDE defines a probability measure on the path-space , the space of continuous functions from to . Consider a perturbation drift function and associated perturbed SDE given by

This perturbed SDE, started from the same initial distribution , defines a probability measure on the path-space and it is often useful to understand the Radon-Nikodym derivative of with respect to . I have never really liked the way this is usually derived, and I also never really remember the result. It takes only a few lines of algebra to re-derive it, at least informally. For this purpose, consider a simple Euler discretization of the SDE with time-discretization for . Consider a discretized path of Equation 2 obtained by iterating the update

with and . The probability of observing such a path reads

with the volatility matrix and an irrelevant multiplicative constant . One obtains a similar expression for a discretized path of the perturbed SDE Equation 3 and the ratio of these two quantities equals

where the tilde notation denotes the discretized version of the measures. Since

under and as , for a path , we have

Similarly, under and as , for a path , we have

These results remain identical for time-dependent drift and volatility functions, as is clear from this non-rigorous argument. The above two formulas for and may be slightly confusing since they are not immediately recognizable as inverses of each other. Another way to write these results, very similar to Equation 1 and often used in the physics literature, is as follows:

From this expression, it is slightly easier to see the relationship between and . As described below, these change-of-variables formulae are often useful when performing importance sampling on path-space. As a sanity check, one can verify that in the case of a scalar Brownian motion and a drifted version of it , the quantity indeed has unit expectation under , since this is equivalent to the fact that for a standard Gaussian random variable . Finally, note that the Kullback-Leibler divergence between and has a particularly simple form. Since one obtains

Consider a functional on path-space; a typical example is

Suppose that we would like to evaluate the expectation of under the measure . Naive Monte-Carlo (MC) would require sampling trajectories from Equation 2 and computing the average of on these trajectories. To reduce the variance of this naive MC estimator, one can also use importance sampling by sampling trajectories from the measure and computing the average

with weights given by the Radon-Nikodym derivative

Choosing the optimal “control” function that minimizes the variance of the estimator is not entirely straightforward, although this previous note already gives the answer. More on this in another note.
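A minimal numerical illustration of these path-space weights, with hypothetical choices: estimate P(X_T > a) for a driftless diffusion dX = sigma dW by sampling from the drifted proposal dX = c dt + sigma dW and weighting each discretized path with the discretized Radon-Nikodym derivative, accumulated as a sum of per-increment log-weights exactly as in the Euler argument above:

```python
import math
import numpy as np

rng = np.random.default_rng(1)
sigma, c, a = 1.0, 2.0, 2.0               # hypothetical diffusion, proposal drift, threshold
T, n_steps, n_paths = 1.0, 200, 50_000
dt = T / n_steps

# Sample discretized paths from the proposal Q: dX = c dt + sigma dW.
dX = c * dt + sigma * np.sqrt(dt) * rng.standard_normal((n_paths, n_steps))
X_T = dX.sum(axis=1)

# Discretized log dP/dQ: each Euler increment contributes
# -(c / sigma^2) dX_k + c^2 dt / (2 sigma^2).
log_w = (-(c / sigma**2) * dX + c**2 * dt / (2 * sigma**2)).sum(axis=1)

# Importance sampling estimate of P(X_T > a) under the driftless measure P.
est = np.mean(np.exp(log_w) * (X_T > a))
exact = 0.5 * math.erfc(a / (sigma * math.sqrt(2 * T)))   # Gaussian tail
print(est, exact)
```

Since the event is rare under the driftless dynamics, the drifted proposal with the Girsanov correction gives a far lower-variance estimator than naive MC.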

Consider a continuous time Markov process on the time interval and with value in the state space . This defines a probability on the set of -valued paths. Now, it is often the case that one has to consider a perturbed probability distribution defined as

for a (typically unknown) normalization constant and some function . For example, collecting a noisy observation at time , the distribution defined with the log-likelihood function describes the dynamics of the Markov process conditioned on the observation ; we will use this viewpoint in the following since it is the most common use case and gives the most intuition. Doob h-transforms are a powerful tool to describe the dynamics of the conditioned process.

For convenience, let us use the notation . For a test function and a time increment , we have

We have introduced the important function defined as

One can readily check that the function satisfies the Kolmogorov equation

with boundary condition . Furthermore, denoting by the infinitesimal generator of the Markov process , we have:

The infinitesimal generator of the conditioned process is

Plugging Equation 3 into Equation 2 directly gives that

The generator describes the dynamics of the conditioned process. In fact, the same computation holds with a more general change of measure of the type

for some function . One can define the function similarly as

This function satisfies the Feynman-Kac formula and one obtains entirely similarly that the probability distribution describes a Markov process with infinitesimal generator

To see how this works, let us look at a few examples:

Consider a diffusion process

with generator and initial distribution . We are interested in describing the dynamics of the “conditioned” process given by the probability distribution defined in Equation 4. Algebra applied to Equation 5 then shows that

where the function is described in Equation 5. Since , this reveals that the probability distribution describes a diffusion process with dynamics

The additional drift term involves a “control” with

Note that the initial distribution of the conditioned process is

Unfortunately, apart from a few straightforward cases such as a Brownian motion or an Ornstein-Uhlenbeck process, the function is generally intractable. However, several numerical methods are available to approximate it effectively.

What about a Brownian motion in conditioned to hit the state at time , i.e. a Brownian bridge? In that case, the function is given by

for some irrelevant normalization constant that only depends on . Plugging this into Equation 7 gives that the conditioned Brownian motion has dynamics

The additional drift term is intuitive: it points in the direction of and gets increasingly large as .
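The Doob h-transform drift of the Brownian bridge is the classical (b - X_t) / (T - t). A quick Euler simulation (with hypothetical parameter choices) confirms that this drift pins the paths at the target state b at the final time:

```python
import numpy as np

rng = np.random.default_rng(2)
b, T = 1.0, 1.0                           # hypothetical endpoint and horizon
n_steps, n_paths = 1000, 2000
dt = T / n_steps

X = np.zeros(n_paths)                     # bridge from 0 to b on [0, T]
for k in range(n_steps):
    t = k * dt
    drift = (b - X) / (T - t)             # Doob h-transform drift of the bridge
    X = X + drift * dt + np.sqrt(dt) * rng.standard_normal(n_paths)

print(X.mean(), X.std())                  # mean close to b, tiny spread
```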

What about a scalar Brownian motion conditioned to stay positive at all times? Let us consider and condition first on the event that the Brownian motion stays positive within , and later consider the limit . The function reads

This can easily be calculated with the reflection principle. It equals

for a standard Gaussian . Plugging this into Equation 7 gives that the additional drift term is

as . This shows that a Brownian motion conditioned to stay positive at all times has an upward drift of size ,

Incidentally, these are the dynamics of a Bessel process of dimension , i.e. the law of the modulus of a three-dimensional Brownian motion. More generally, if one conditions a Brownian motion to stay within a closed domain , the conditioned dynamics exhibit a repulsive drift term of size about near the boundary of the domain, as described below.
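A quick numerical check of the Bessel(3) claim, with hypothetical parameters: simulate the conditioned dynamics dX = dt / X + dW started from x0 > 0 and compare the terminal marginal with the modulus of a three-dimensional Brownian motion started from a point of norm x0:

```python
import numpy as np

rng = np.random.default_rng(3)
x0, T = 1.0, 1.0                          # hypothetical start and horizon
n_steps, n_paths = 2000, 20_000
dt = T / n_steps

# Conditioned Brownian motion: dX = dt / X + dW.
X = np.full(n_paths, x0)
for _ in range(n_steps):
    X = X + dt / X + np.sqrt(dt) * rng.standard_normal(n_paths)
    X = np.maximum(np.abs(X), 1e-8)       # numerical safeguard near zero

# Modulus of a 3D Brownian motion started at (x0, 0, 0).
B = np.zeros((n_paths, 3)); B[:, 0] = x0
B += np.sqrt(T) * rng.standard_normal((n_paths, 3))
R = np.linalg.norm(B, axis=1)

print(X.mean(), R.mean())                 # the two marginals agree
```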

What about a Brownian motion conditioned to stay within a domain forever? As before, consider a time horizon and define the function as

One can see that the function satisfies the PDE

and equals zero on the boundary of the domain. Furthermore, as for all . Consider the eigenfunctions of the negative Laplacian with Dirichlet boundary conditions on . Recall that is a positive operator with a discrete spectrum of non-negative eigenvalues. The eigenfunction corresponding to the smallest eigenvalue is the principal eigenfunction; it is standard that it is positive within the domain , as a “slight” generalization of the Perron-Frobenius theorem from linear algebra shows. Expanding in the basis of eigenfunctions gives that

Since we are interested in the regime , it holds that

This shows that the conditioned Brownian motion has a drift term expressed in terms of the principal eigenfunction of the Laplacian:

For example, if for a 1D Brownian motion, the principal eigenfunction is . This shows that there is an upward drift of size near and a downward drift of size near .
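Here is a sketch for the hypothetical domain D = (0, 1): the principal Dirichlet eigenfunction is phi(x) = sin(pi x), the conditioned drift is grad log phi = pi cot(pi x), and the ground-state transform implies the conditioned process has invariant density proportional to sin^2(pi x), whose mean is 1/2:

```python
import numpy as np

rng = np.random.default_rng(4)
dt, n_steps, n_paths = 2.5e-4, 20_000, 2000    # hypothetical discretization

X = np.full(n_paths, 0.5)
for _ in range(n_steps):
    drift = np.pi / np.tan(np.pi * X)          # grad log sin(pi x) = pi cot(pi x)
    X = X + drift * dt + np.sqrt(dt) * rng.standard_normal(n_paths)
    X = np.clip(X, 1e-3, 1 - 1e-3)             # numerical safeguard at the boundary

# Invariant density of the conditioned process is 2 sin^2(pi x) on (0, 1).
print(X.mean(), X.min(), X.max())
```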

Consider a smooth manifold of dimension defined as the zero set of a well-behaved “constraint” function ,

We would like to use MCMC to sample from a probability distribution supported on with density with respect to the uniform Hausdorff measure on . It is relatively straightforward to adapt standard MCMC methods to simple manifolds such as a sphere or a torus, since their geodesics and several other geometric quantities are analytically tractable. Maybe surprisingly, it is in fact also relatively straightforward to design MCMC samplers on general implicitly defined manifolds such as . The article (Zappa, Holmes-Cerfon, and Goodman 2018) explains these ideas beautifully.

Assume that is the current position of the MCMC chain. To generate a proposal that will eventually be accepted or rejected, one can proceed very similarly to the standard RWM algorithm with Gaussian perturbations of variance . First, generate a vector from a centred Gaussian distribution with covariance on the tangent space to at . To do so, it suffices for example to generate a standard Gaussian vector in and orthogonally project it onto . One cannot simply define the proposal as , since it would not necessarily lie on . Instead, one projects back to . For this, one needs to define the direction used for the projection, and the manifold RWM algorithm uses , for reasons that will become clear later. In other words, the proposal is obtained by seeking a vector such that .

If one calls the Jacobian matrix of at , i.e. the matrix whose **rows** are the gradients of the components of , this projection operation boils down to finding a vector such that

Note that Equation 1 is a non-linear equation in that can have no solution, one solution, or many solutions – this can seem like a fundamental roadblock to the design of a valid MCMC algorithm, but we will see that it is not! Before discussing the resolution of Equation 1 in slightly more detail, assume that a standard root-finding algorithm takes the pair as input and attempts to produce the projection ,

The algorithm will either converge to one of the possible solutions or fail. If the algorithm fails to converge, one can simply reject the proposal and set . If the algorithm converges, this defines a valid proposal . To ensure reversibility – and this is one of the main novelties of the article (Zappa, Holmes-Cerfon, and Goodman 2018) – one needs to verify that the reverse proposal is possible.

To do so, note that the only possibility for the reverse move to happen is if where

The uniqueness follows from the decomposition . The reverse move is consequently possible if and only if the following **reversibility check** condition is satisfied,

This reversibility check is necessary as it is not guaranteed that the root-finding algorithm started from converges at all, or converges to in the case when there are several solutions. If Equation 2 is not satisfied, the proposal is rejected and one sets . On the other hand, if Equation 2 is satisfied, the proposal is accepted with the usual Metropolis-Hastings probability

where denotes the Gaussian density on the tangent space . The above description defines a valid MCMC algorithm on that is reversible with respect to the target distribution .

As described above, the main difficulty is to solve the non-linear equation Equation 1 describing the projection of the proposal back to the manifold . The projection is along the space spanned by the columns of , i.e. one seeks a vector such that

One can use a standard Newton’s method to solve this equation started from . Setting for notational convenience , this boils down to iterating

As described in (Barth et al. 1995), it can sometimes be computationally advantageous to use a quasi-Newton method and use instead

with a **fixed** positive definite matrix , since one can then pre-compute a decomposition of and use it to solve the linear systems at each iteration. In some recent and related work (Au, Graham, and Thiery 2022), we observed that the standard Newton method performed well in the settings we considered, and most of the time there was no computational advantage to using a quasi-Newton method. In practice, the main computational bottleneck is computing the Jacobian matrix , although this is problem-dependent and some structure can typically be exploited. Only a relatively small number of iterations are performed, and the root-finding algorithm is stopped as soon as is below a certain threshold. If the stepsize is small, i.e. , Newton’s method typically converges to a solution in only a very small number of iterations – indeed, Newton’s method is quadratically convergent close to a solution.
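A minimal sketch of this projection step for a single scalar constraint, using the unit sphere q(x) = |x|^2 - 1 as a hypothetical example (the helper name newton_project is mine, not from the article):

```python
import numpy as np

def newton_project(q, grad_q, x, v, n_iter=20, tol=1e-10):
    """Solve q(x + v + a * grad_q(x)) = 0 for the scalar a with Newton's method."""
    g = grad_q(x)                       # fixed projection direction
    a = 0.0
    for _ in range(n_iter):
        y = x + v + a * g
        val = q(y)
        if abs(val) < tol:
            return y                    # converged: y lies on the manifold
        slope = grad_q(y) @ g           # derivative of a -> q(x + v + a g)
        a -= val / slope                # Newton update
    return None                         # failure: reject the proposal

# Hypothetical example: unit sphere in R^3.
q = lambda x: x @ x - 1.0
grad_q = lambda x: 2.0 * x

x = np.array([1.0, 0.0, 0.0])           # current point on the sphere
xi = np.random.default_rng(5).normal(size=3) * 0.2
v = xi - (xi @ x) * x                   # tangential component of the perturbation
y = newton_project(q, grad_q, x, v)
print(y, q(y))                          # y is back on the sphere: q(y) ~ 0
```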

In the figure above, I have implemented the RWM algorithm described above to sample from the uniform distribution supported on a double torus described by the constraint function given as

The figure shows chains ran in parallel, which is straightforward to implement in practice with JAX (Bradbury et al. 2018). All the chains are initialized from the same position so that one can visualize the evolution of the density of particles.

One can for example monitor the usual expected squared jump distance

and maximize it to tune the RWM step-size; it would probably make slightly more sense to monitor the squared geodesic distances instead of the naive squared norm , but that’s way too much hassle and would probably make only a negligible difference. In the figure above, I have plotted the expected squared jump distance as a function of the acceptance rate for different step-sizes. It is interesting to see a pattern extremely similar to the one observed in the standard RWM algorithm (Roberts and Rosenthal 2001): in this double torus example, the optimal acceptance rate is around . Note that since the target distribution is uniform, the acceptance rate is only very slightly lower than the proportion of successful reversibility checks.

While the Random Walk Metropolis-Hastings algorithm is interesting, exploiting gradient information is often necessary to design efficient MCMC samplers. Consider a single iteration of a standard Hamiltonian Monte Carlo (HMC) sampler targeting a density on . The method proceeds by simulating from a dynamics that is reversible with respect to an extended target density on defined as

for a user-defined mass parameter . In general, the mass parameter is a positive definite **matrix** but generalizing this to manifolds is slightly less useful in practice. For a time-discretization step and a current position , the method proceeds by generating a proposal defined as

This proposal is accepted with probability . In standard implementations, several leapfrog steps are performed instead of a single one. One can also choose to perform a single leapfrog step as above and only partially refresh the momentum after each leapfrog step – this may be more efficient, or easier to implement, when running a large number of HMC chains in parallel on a GPU for example. To adapt the HMC algorithm to sample from a density supported on a manifold , one can proceed similarly to the RWM algorithm by interleaving additional projection steps. These projections are needed to ensure that the momentum vectors remain in the correct tangent spaces and the position vectors remain on the manifold ,

As in the RWM algorithm, reversibility checks need to be performed to ensure that the overall algorithm is reversible with respect to the target distribution . The resulting algorithm for generating a proposal reads as follows:

If any of the projection operations fail, the proposal is rejected. If no failure occurs, a reversibility check is performed by running the algorithm backward starting from . If the reversibility check is successful, the proposal is accepted with the usual Metropolis-Hastings probability .

The article (Lelièvre, Rousset, and Stoltz 2019) provides a detailed description of several of these ideas along with detailed analysis and extensions.

Au, Khai Xiang, Matthew M Graham, and Alexandre H Thiery. 2022. “Manifold Lifting: Scaling MCMC to the Vanishing Noise Regime.” *Journal of the Royal Statistical Society: Series B*. https://arxiv.org/abs/2003.03950.

Barth, Eric, Krzysztof Kuczera, Benedict Leimkuhler, and Robert D Skeel. 1995. “Algorithms for Constrained Molecular Dynamics.” *Journal of Computational Chemistry* 16 (10). Wiley Online Library: 1192–1209. https://doi.org/10.1002/jcc.540161003.

Bradbury, James, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, et al. 2018. “JAX: Composable Transformations of Python+NumPy Programs.” http://github.com/google/jax.

Lelièvre, Tony, Mathias Rousset, and Gabriel Stoltz. 2019. “Hybrid Monte Carlo Methods for Sampling Probability Measures on Submanifolds.” *Numerische Mathematik* 143 (2). Springer: 379–421. https://arxiv.org/abs/1807.02356.

Roberts, Gareth O, and Jeffrey S Rosenthal. 2001. “Optimal Scaling for Various Metropolis-Hastings Algorithms.” *Statistical Science* 16 (4). Institute of Mathematical Statistics: 351–67. https://doi.org/10.1214/ss/1015346320.

Zappa, Emilio, Miranda Holmes-Cerfon, and Jonathan Goodman. 2018. “Monte Carlo on Manifolds: Sampling Densities and Integrating Functions.” *Communications on Pure and Applied Mathematics* 71 (12). Wiley Online Library: 2609–47. https://arxiv.org/abs/1702.08446.

for all . To keep things simple, since it is not really the point of this short note, suppose that everywhere and that is smooth. This type of transformation can be used to define Markov Chain Monte Carlo algorithms, e.g. the standard Hamiltonian Monte Carlo (HMC) algorithm. To design an MCMC scheme with this involution , one needs to answer the following basic question: suppose that and the proposal is constructed and accepted with probability ; how should the acceptance probability function be chosen so that the resulting random variable is also distributed according to ? The Bernoulli random variable is such that . In other words, for any test function , we would like , which means that

Requiring Equation 1 to hold for any test function is easily seen to be equivalent to asking for the equation

to hold for any where and is the Jacobian of at . Since because the function is an involution, this also reads

At this point, it becomes clear to anyone familiar with the correctness proof of the usual Metropolis-Hastings algorithm that a possible solution is

although there are indeed many other possible solutions. Since , this also reads

One can reach a similar conclusion by looking at the Radon-Nikodym ratio , where is the Markov kernel describing the deterministic transformation (Green 1995), but I do not find this approach significantly simpler. The very neat article (Andrieu, Lee, and Livingstone 2020) describes much more sophisticated and interesting generalizations. In practice, Equation 2 is often used in the simpler case when is volume preserving, i.e. , as is the case for Hamiltonian Monte Carlo (HMC). The discussion above was prompted by a student implementing a variant of this but with the wrong acceptance ratio, and us taking quite a bit of time to find the bug…
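Here is a small numerical check of the balance condition behind this acceptance ratio, with a hypothetical involution f(x) = c / x (an involution of the punctured real line, not from the note) and a Gaussian target. The identity being verified pointwise is pi(x) a(x) = pi(f(x)) a(f(x)) |f'(x)|:

```python
import numpy as np

pi = lambda x: np.exp(-0.5 * x**2)        # unnormalized N(0, 1) target density
c = 1.7                                   # hypothetical parameter
f = lambda x: c / x                       # involution of R \ {0}: f(f(x)) = x
jac = lambda x: abs(c) / x**2             # |f'(x)|

def accept(x):
    """Acceptance probability for the deterministic involutive proposal."""
    return min(1.0, pi(f(x)) * jac(x) / pi(x))

# Check the balance identity pi(x) a(x) = pi(f(x)) a(f(x)) |f'(x)|.
checks = [(pi(x) * accept(x), pi(f(x)) * accept(f(x)) * jac(x))
          for x in (0.3, -1.2, 2.5, 4.0)]
for lhs, rhs in checks:
    print(lhs, rhs)                       # the two columns agree
```

The identity holds because |f'(f(x))| |f'(x)| = 1 for any smooth involution, which is exactly the cancellation used in the derivation above.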

Note that there are interesting and practical situations when the function satisfies the involution property only when belongs to a subset of the state-space. For instance, this can happen when implementing MCMC on a manifold and the function involves a “projection” on the manifold , as for example described in the really interesting article (Zappa, Holmes-Cerfon, and Goodman 2018). In that case, it suffices to add a “reversibility check”, i.e. make sure that when applying to the proposal , one goes back to in the sense that . The acceptance probability in that case should be amended and expressed as

In other words, if applying to the proposal does not lead back to , the proposal is always rejected.

In some situations, the requirement for to be an involution can seem cumbersome. What if we consider the more general situation of a smooth bijection and its inverse ? In that case, one can directly apply what has been described in the previous section: it suffices to consider an extended state-space obtained by including an index and the involution defined as

This allows one to define a Markov kernel that leaves the distribution invariant. Things can even start to get a bit more interesting if a deterministic “flip” is applied after each application of the Markov kernel described above: doing so avoids immediately coming back to in the event the move is accepted. There are indeed quite a few papers exploiting this type of idea.

To conclude these notes, here is a small riddle whose answer I do not have. One can check that for any , the function is an involution of the real line. This means that for any target density on the real line, one can build the associated Markov kernel defined as

for an acceptance probability described as above,

Finally, choose values and consider the mixture of Markov kernels

The Markov kernel leaves the distribution invariant since each Markov kernel does, but it is not at all clear (to me) under what conditions the associated MCMC algorithm converges to . One can empirically check that if is very small, things can break down quite easily. On the other hand, for large, the mixture of Markov kernels empirically seems to behave as if it were ergodic with respect to .

For values chosen at random, the illustration above shows the empirical distribution of the associated Markov chain run for iterations and targeting the standard Gaussian distribution : the fit seems almost perfect.

Andrieu, Christophe, Anthony Lee, and Sam Livingstone. 2020. “A General Perspective on the Metropolis-Hastings Kernel.” *arXiv Preprint arXiv:2012.14881*.

Green, Peter J. 1995. “Reversible Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination.” *Biometrika* 82 (4). Oxford University Press: 711–32.

Zappa, Emilio, Miranda Holmes-Cerfon, and Jonathan Goodman. 2018. “Monte Carlo on Manifolds: Sampling Densities and Integrating Functions.” *Communications on Pure and Applied Mathematics* 71 (12). Wiley Online Library: 2609–47.

Consider a pair of (coupled) Markov processes and with dynamics that can informally be described as

for two independent “noise” terms and and a time-scale parameter . We assume that is a **slow component** that moves by in on the time interval . The scaling in the dynamics of the **fast process** indicates that we expect this process to evolve on a time scale of order . We are interested in the limit and hope to “average out” the fast process, i.e. to describe the slow (and interesting) process without referring to the fast process. Informally, we would like to describe the process , in the limit , as following an effective Markovian dynamics

For describing the averaging phenomenon, we typically assume some ergodicity conditions on the fast process . Here, we assume that for each fixed , the fast process with fixed slow component , i.e.

is ergodic with respect to some probability distribution . Although the averaging phenomenon is quite general, it is somewhat easier to illustrate it for diffusion processes. In this case, let us assume that the slow process is given by

For and for a time increment , since the process can be considered constant, we have

This can be regarded as a time-discretization of the **averaged process**

for averaged drift and volatility functions given by

One standard approach for proving this type of results is to write the Kolmogorov equations

for and perform a multiscale expansion (Hinch 1991; Pavliotis and Stuart 2008; Weinan 2011)

Indeed, the first-order term is expected not to depend on the initial condition , since the process forgets on time scales of order and we are interested in the regime . From Equation 2, one can obtain the dynamics of the averaged process described by the function . One finds that is described by the averaged generator of the slow component, i.e. averaged under ; this exactly gives Equation 1 in the case of diffusions. A typical example could be as follows:

The fast process is an Ornstein-Uhlenbeck process sped up by a factor that will very rapidly oscillate around , with Gaussian fluctuations of variance , i.e.:

This averaging phenomenon is relatively straightforward and not extremely surprising. More interesting is the homogenization phenomenon described in the next section.
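A numerical illustration of averaging with hypothetical coefficients: take the slow dynamics dX = (-X + Y) dt coupled to the fast OU process dY = -(Y / eps) dt + sqrt(2 / eps) dW, whose invariant distribution is N(0, 1). Averaging the drift under this invariant distribution predicts the effective ODE dX = -X dt, i.e. X_T close to x0 exp(-T):

```python
import numpy as np

rng = np.random.default_rng(6)
eps, T, x0 = 1e-3, 1.0, 1.0               # hypothetical scales
dt = eps / 20                             # resolve the fast time scale
n_steps = int(T / dt)
n_paths = 200

X = np.full(n_paths, x0)
Y = rng.standard_normal(n_paths)          # start the fast process at stationarity
for _ in range(n_steps):
    X = X + (-X + Y) * dt                 # slow component
    Y = Y - (Y / eps) * dt + np.sqrt(2 * dt / eps) * rng.standard_normal(n_paths)

print(X.mean(), x0 * np.exp(-T))          # averaged ODE prediction: dX = -X dt
```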

Consider the presence of an additional intermediate time scale , with the same assumption that for any fixed the process is ergodic with respect to the probability distribution . The same reasoning as in the averaging case shows that averaging the term is relatively straightforward and has the exact same expression: it suffices to average under . This means that one can study instead

with, informally, . The new interesting phenomenon comes from the intermediate time scale . Contrary to the averaging phenomenon of the previous section, which relied only on a Law of Large Numbers, dealing with the intermediate time-scale requires exploiting a CLT and quantifying the rate of mixing of the fast process . Note that since , for the dynamics not to explode one needs the **centering condition**:

Because of the centering condition, the term will contribute an additional noise term in the effective dynamics of the slow process. To describe this additional noise term, assume an ergodic central limit theorem (CLT) for the fast process : for a test function with zero expectation under , we have:

for asymptotic variance . For a time increment and assuming , we have

The second integral term is an averaging term that can be treated easily. Approximating the process by , the first integral on the RHS of Equation 5 can be approximated as

After a time-rescaling, one can readily see that the first term is described by the CLT of Equation 4,

The second term is further approximated as

the second equality coming from the time-rescaling . The process mixes on scale so that the term inside the bracket converges to its expectation. Setting , one obtains

In conclusion, the fast-slow system

can be described in the regime by the effective dynamics

for two independent Brownian motions and . The volatility term comes from the CLT and the drift term comes from the self-interaction term:

For the drift function, the scaling may look a bit surprising at first sight, as one may expect instead. Note that since the process mixes on a time scale and the centering condition holds, the expectation goes to zero as soon as . This means that only the subset of really matters in that double integral, hence the normalization factor.

The drift and volatility terms and quantify the mixing properties of the fast process . While the formulas in Equation 6 are intuitive, they can be difficult to work with if one needs exact expressions for the drift and volatility functions. Instead, they can also be expressed in terms of the solution to appropriate Poisson equations.

where the function is the solution to the Poisson equation

for all and is the generator of the fast process . The last equality in Equation 7 follows from the integral representation of the Poisson equation. Similarly, and also as explained here, the asymptotic variance term can also be expressed in terms of the function ,

Consider a slow process obtained by integrating an OU process,

where is just a fixed time-scaling parameter. The fast OU process mixes on time scales of order and has a standard Gaussian distribution as invariant distribution. Homogenization gives that in the regime , the slow process can be approximated as

since the asymptotic variance is

where is the autocorrelation function of the fast OU process, as explained here. The fact that the effective diffusion is (twice) the integrated autocorrelation of the fast process is an example of Green-Kubo relations.
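A numerical check of this Green-Kubo prediction, with hypothetical parameters: take dX = (Y / eps) dt driven by the fast OU process dY = -(alpha / eps^2) Y dt + (sqrt(2 alpha) / eps) dW, whose invariant distribution is N(0, 1) and whose autocorrelation is exp(-alpha t / eps^2). Homogenization predicts X_T approximately sqrt(2 / alpha) W_T, i.e. Var(X_T) close to 2 T / alpha:

```python
import numpy as np

rng = np.random.default_rng(7)
eps, alpha, T = 0.05, 1.0, 1.0            # hypothetical scales
dt = eps**2 / 20                          # resolve the fast time scale eps^2
n_steps = int(T / dt)
n_paths = 5000

X = np.zeros(n_paths)
Y = rng.standard_normal(n_paths)          # stationary start for the fast OU
for _ in range(n_steps):
    X = X + (Y / eps) * dt
    Y = (Y - (alpha / eps**2) * Y * dt
         + (np.sqrt(2 * alpha) / eps) * np.sqrt(dt) * rng.standard_normal(n_paths))

print(X.var(), 2 * T / alpha)             # Green-Kubo: Var(X_T) -> 2 T / alpha
```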

This example does not exactly fall within the homogenization result described in the previous section, but almost. Consider a potential and the slow-fast dynamics:

For any fixed value of , the fast OU-dynamics

converges to a Gaussian distribution with mean and unit variance. The same arguments as in the previous section immediately give that, starting from , we have

The term comes from the OU asymptotic variance. This shows that the slow process converges as to the overdamped Langevin dynamics

Consider a function and the slow-fast system

where is a fast OU process mixing on scales of order and with standard centred Gaussian invariant distribution . The discussion leading to Equation 8 suggests that the term can heuristically be thought of as , which would imply that the effective dynamics for the slow process is

We will see that this heuristic is **wrong**! In order to obtain the effective dynamics of the slow process as , since the generator of the fast-OU reads , one can solve the Poisson equation to obtain that . One already knows that . The drift term is given by

Putting everything together gives that the effective slow dynamics reads

where denotes Stratonovich integration.

The book (Pavliotis and Stuart 2008) is beautiful, and I quite like the section on multiscale expansions in (Weinan 2011). For proving this type of result with the “martingale problem” approach (Stroock and Varadhan 1997), the lectures (Papanicolaou 1977) are nicely done.

Hinch, E. J. 1991. *Perturbation Methods*. Cambridge University Press.

Papanicolaou, George. 1977. “Martingale Approach to Some Limit Theorems.” In *Papers from the Duke Turbulence Conference, Duke Univ., Durham, NC, 1977*.

Pavliotis, Grigoris, and Andrew Stuart. 2008. *Multiscale Methods: Averaging and Homogenization*. Springer Science & Business Media.

Stroock, Daniel W, and SR Srinivasa Varadhan. 1997. *Multidimensional Diffusion Processes*. Vol. 233. Springer Science & Business Media.

Weinan, E. 2011. *Principles of Multiscale Modeling*. Cambridge University Press.

and allows one to compute the smoothing means and covariance matrices for starting from the knowledge of . In Equation 1, the **smoothing gain matrix** is given by

The Ensemble Kalman Filter (EnKF) is a non-linear equivalent of the Kalman filter, and the purpose of these notes is to derive the equivalent “ensemble version” of the backward recursion Equation 1. For this purpose, it is important to understand slightly better the role of the smoothing gain matrix . Consider the pair of random variables distributed according to the joint distribution between the filtering distribution at time and the predictive distribution at time in the sense that

This means that and and . Furthermore, Equation 2 and the standard Gaussian conditioning formulas give that the conditional means and covariances are given by

The above expression for the conditional mean also shows that the matrix is a minimizer of the loss

over all matrices . Heuristically, this shows that the smoothing gain matrix can easily be computed by **regressing** against . We can use this remark to build an ensemble version of the backward recursion Equation 1. Recall that when running an EnKF for filtering the observations , the final stage proceeds in two steps:

- Obtain an ensemble of particles that approximate the predictive distribution .

- Assimilate the last observation using the Kalman gain matrix and the correction by setting . The particles approximate the smoothing distribution .

Following our discussion of the smoothing gain matrix and Equation 4, it seems sensible to set

and hope that the ensemble of updated particles approximates the smoothing distribution . In words, the particle is obtained by “pulling” the correction term back to through the “regression” smoothing gain matrix . To check that the particles indeed approximate the smoothing distribution , it suffices to compute the mean/variance and verify that they match the ones given by Equation 1. Recall that Equation 3 gives that the filtering/predictive distributions satisfy

where is independent from all other sources of randomness. Plugging this into Equation 5 gives that

Since the are distributed according to the smoothing distribution, i.e. , this immediately shows that is Gaussian with

as it should. One can then iterate this construction to obtain particle approximations of the smoothing distributions for by running a backward pass and recursively setting

The ensemble of particles approximates the smoothing distribution . In a nonlinear setting, it suffices to approximate the smoothing gain matrices with

[Experiments: TODO]
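As a placeholder for those experiments, here is a minimal scalar sketch of the backward ensemble update. All model numbers (`a`, `q`, `mf`, `Pf`, `ms`, `Ps`) are made up for the test; the regression-based gain and the updated ensemble are compared against the exact Rauch-Tung-Striebel moments:

```python
# Toy check of the ensemble backward (RTS-style) update: the smoothing gain is
# obtained by regressing filtering particles on predictive particles, and the
# smoothed correction is pulled back through it. Scalar linear-Gaussian sketch.
import numpy as np

rng = np.random.default_rng(1)
N = 200_000
a, q = 0.9, 0.5                    # transition x_{t+1} = a x_t + N(0, q)
mf, Pf = 1.0, 2.0                  # filtering moments at time t

xf = mf + np.sqrt(Pf) * rng.standard_normal(N)          # filtering ensemble
xp = a * xf + np.sqrt(q) * rng.standard_normal(N)       # predictive ensemble at t+1
ms, Ps = 0.3, 0.4                  # smoothing moments at time t+1 (given)
xs_next = ms + np.sqrt(Ps) * rng.standard_normal(N)     # smoothed ensemble at t+1

# smoothing gain by regression of the filtering against the predictive particles
J = np.cov(xf, xp)[0, 1] / xp.var()
xs = xf + J * (xs_next - xp)       # backward ensemble update

# exact Rauch-Tung-Striebel moments for comparison
mp, Pp = a * mf, a**2 * Pf + q
J_exact = a * Pf / Pp
m_exact = mf + J_exact * (ms - mp)
P_exact = Pf + J_exact**2 * (Ps - Pp)
print(xs.mean(), m_exact, xs.var(), P_exact)
```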

Rauch, Herbert E, F Tung, and Charlotte T Striebel. 1965. “Maximum Likelihood Estimates of Linear Dynamic Systems.” *AIAA Journal* 3 (8): 1445–50.

Consider a continuous time Markov process on that is ergodic with respect to the probability distribution . A Langevin diffusion is a typical example. Call the generator of this process so that for a test function we have

Now, assume further that and that a Central Limit Theorem holds,

How can one estimate the asymptotic variance ?

One can directly try to compute the second moment of Equation 2 and obtain that

Since falls quickly to zero as and defining the auto-covariance at lag as

one obtains an expression of the asymptotic variance as the integrated autocovariance function,

In the MCMC literature, this relation is often expressed as

where the **integrated autocorrelation** function is defined as

for autocorrelation at lag defined as . The slower the autocorrelation function falls to zero as , the larger the asymptotic variance . Although Equation 3 is very intuitive, it can be difficult to estimate the autocorrelation function.

Under relatively general and mild conditions, since the expectation of under the invariant distribution is zero and the Markov process is ergodic with respect to , there exists a function such that

Equation 4 is called a Poisson equation since is often a Laplacian-like operator (e.g. for diffusion-type processes). Equation 1 gives that

where is a martingale and typically vanishes as and can be neglected. For computing the asymptotic variance, it thus suffices to estimate ; by the martingale property, this equals . Also, since , algebra gives that

where the so-called **carré du champ** is defined as

This shows that the asymptotic variance satisfies

Finally, since , this can equivalently be written as

where is the so-called Dirichlet form. In summary, we have just shown that the asymptotic variance of the additive functional is given by two times the Dirichlet form where is solution to the Poisson equation . Note that this implies that the generator is a negative operator in the sense that for a test function we have that

where we have used the dot-product notation .

It is often useful to think of the generator as an infinite dimensional equivalent of a standard negative definite symmetric matrix/operator . And since , as can be seen by diagonalizing , one can expect the following equation to hold,

That is just another way of writing that the solution to the Poisson equation , with the **centering condition** for picking one solution out of the many possible solutions to the Poisson equation differing from each other by an additive constant, can be expressed as

Equation 6 is easily proved with Equation 1 by writing

and by taking expectations on both sides and noticing that thanks to the assumed centering condition . Note that this remark allows one to give another derivation of Equation 5 starting from the integrated autocovariance formulation Equation 3. Indeed, note that

Consider an OU process that is ergodic with respect to the standard Gaussian density ,

That’s a standard OU process accelerated by a factor . Its generator reads

The function is such that and a solution to the Poisson equation is . This shows that the asymptotic variance is

As expected, accelerating the OU process by a factor means reducing the variance by a factor .
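This can be checked with a quick Monte-Carlo experiment; the sketch below uses an assumed speed factor `c = 2` for the accelerated OU process and estimates the variance of the rescaled additive functional:

```python
# Monte-Carlo sanity check: for the OU process dX = -c X dt + sqrt(2c) dW
# (invariant N(0,1)), the additive functional (1/sqrt(T)) * int_0^T X_t dt is
# asymptotically N(0, 2/c). Sketch with an assumed speed factor c.
import numpy as np

def asymptotic_variance_ou(c=2.0, T=50.0, dt=0.01, n_paths=2000, seed=3):
    rng = np.random.default_rng(seed)
    n_steps = int(T / dt)
    x = rng.standard_normal(n_paths)          # start at stationarity
    s = np.zeros(n_paths)                     # running integral of X_t
    for _ in range(n_steps):
        s += x * dt
        x += -c * x * dt + np.sqrt(2 * c * dt) * rng.standard_normal(n_paths)
    return (s / np.sqrt(T)).var()             # should approach 2 / c

var_hat = asymptotic_variance_ou()
print(var_hat)   # close to 2 / c = 1.0
```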

Assume a Gaussian prior distribution and a noisy observation with

where is an unknown quantity of interest. The posterior distribution is Gaussian and is given by

as standard Gaussian conditioning shows. This can also be written as

for **Kalman Gain Matrix** defined as

which can also be expressed as

for and ; this point of view can be useful for establishing generalizations to non-linear scenarios. The important remark is that the posterior covariance matrix and the posterior mean can also be expressed as

This shows that is indeed positive semi-definite. More importantly, this gives a mechanism for transforming samples from the prior distribution into samples from the posterior distribution. Indeed, consider iid samples from the prior distribution, , and set

for iid noise terms . From Equation 2 it is clear that are iid samples from the Gaussian posterior distribution . It is more intuitive to write this as

where are **fake observations** that are obtained by perturbing the actual observation with additive Gaussian noise terms with covariance ,
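A scalar numerical illustration of this prior-to-posterior particle transform, with made-up values for the prior moments, the observation and its noise level:

```python
# Sketch of the "perturbed observation" trick: prior samples are turned into
# posterior samples via the Kalman gain and fake observations. Scalar Gaussian
# example with made-up numbers (m0, P0, R, y) and identity observation operator.
import numpy as np

rng = np.random.default_rng(7)
N = 100_000
m0, P0 = 0.0, 1.0          # prior N(m0, P0)
R, y = 0.5, 1.0            # observation y = x + N(0, R)

x_prior = m0 + np.sqrt(P0) * rng.standard_normal(N)
K = P0 / (P0 + R)                                 # Kalman gain (H = 1)
y_fake = y + np.sqrt(R) * rng.standard_normal(N)  # fake (perturbed) observations
x_post = x_prior + K * (y_fake - x_prior)

# exact posterior here: mean K*y = 2/3, variance (1 - K)*P0 = 1/3
print(x_post.mean(), x_post.var())
```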

Suppose that we would like to estimate from the noisy observation

and a possibly-nonlinear observation operator . Assume that we also have samples generated from some (unknown) prior distribution. For example, these samples could come from another numerical procedure. In order to obtain approximate samples from the posterior distribution, one can set

for fake observations . The approximate Kalman gain matrix is obtained by noting that in Equation 1 giving

we have and for . This means that an approximate Kalman matrix can be obtained using empirical estimates of these covariance matrices:

These updates form the basis of the Ensemble Kalman Filter (EnKF), a very successful and scalable approach to data assimilation in high-dimensional dynamical systems. This method is operationally employed at various weather forecasting centers across the globe.

Interestingly enough, the remarks above can be used to design, in a relatively principled manner, a derivative-free optimizer (Huang et al. 2022). For example, assume one would like to minimize a functional of the type

One can start with a cloud of particles and keep updating them by assuming that one assimilates the noisy observation generated from a postulated observation process

for and a “step-size” . Each assimilation step steers the cloud of points in the right direction. A careful choice of the step-size is often crucial, as in any optimization procedure. The method is indeed related to Information-Geometric Optimization algorithms (IGO): the article (Ollivier et al. 2017) is beautiful!
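A heuristic sketch of such a derivative-free optimizer; the residual function `h` and the noise level `gamma` below are made up, and this is only an illustration of the mechanism, not the scheme of (Huang et al. 2022):

```python
# Heuristic sketch of ensemble-Kalman-based derivative-free optimization:
# treat the residual h(u) as a noisy observation of 0 and repeatedly apply an
# ensemble Kalman update with empirical covariances. No gradients of h are used.
import numpy as np

def h(u):                      # residual whose zero we seek: minimizes (u - 3)^2
    return u - 3.0

rng = np.random.default_rng(0)
u = rng.normal(0.0, 2.0, size=200)     # initial cloud of particles
gamma = 0.1                            # postulated observation noise ("step-size")

for _ in range(50):
    hu = h(u)
    c_uh = np.mean((u - u.mean()) * (hu - hu.mean()))   # empirical cross-covariance
    c_hh = hu.var()
    K = c_uh / (c_hh + gamma)                           # ensemble Kalman gain
    noise = np.sqrt(gamma) * rng.standard_normal(u.size)
    u = u + K * (0.0 + noise - hu)                      # assimilate "y = 0"

print(u.mean())    # the cloud collapses near the minimizer u = 3
```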

Huang, Daniel Zhengyu, Jiaoyang Huang, Sebastian Reich, and Andrew M Stuart. 2022. “Efficient Derivative-Free Bayesian Inference for Large-Scale Inverse Problems.” *Inverse Problems* 38 (12). IOP Publishing: 125006.

Ollivier, Yann, Ludovic Arnold, Anne Auger, and Nikolaus Hansen. 2017. “Information-Geometric Optimization Algorithms: A Unifying Picture via Invariance Principles.” *The Journal of Machine Learning Research* 18 (1). JMLR. org: 564–628.

Consider a target density in . Since the Langevin diffusion

is reversible with respect to , it is natural to use an Euler-Maruyama discretization of Equation 1 to build MCMC proposals: in an MCMC simulation and for a time-discretization parameter , if the current position is , a proposal can be generated as

with , before being accepted or rejected according to the usual Metropolis-Hastings ratio. This MCMC method, first proposed by Julian Besag in 1994, is commonly referred to as the Metropolis-Adjusted Langevin Algorithm (MALA). But how can one come up with this proposal mechanism without knowing beforehand the existence of the reversible Langevin diffusion Equation 1? While it is intuitively clear that following the direction of is not such a bad idea, i.e. one would like to move towards areas of “high probability mass”, where does this come from? Naturally, one could look at proposals of the type for some free parameter and study the behavior of the Metropolis-Hastings ratio in the regime : as simple as it sounds, it is not entirely straightforward and requires quite a bit of algebra (do it!). Instead, I very much like the type of approach described in (Titsias and Papaspiliopoulos 2018). To summarize, we would like to generate an MCMC proposal that stays in the vicinity of the current position while exploiting the knowledge of . One cannot simply approximate the target distribution as and sample from this approximation since it typically does not define a probability distribution. Instead, consider the following extended target distribution

In other words, the Gaussian auxiliary variable is centred at and at a distance of about from it. Now, given the current position , to generate a proposal that stays in the vicinity of , one can proceed in two steps, in the spirit of a Gibbs-sampling approach:

First, generate

Second, sample from .

Unfortunately, the second step is typically not tractable. Nevertheless, the conditional density is concentrated in a -neighborhood of and a simple Gaussian approximation around should be enough for our purpose. We have:

This shows that the conditional can be approximated by a Gaussian distribution centred at and with variance . This means that the final proposal can be generated as where . But that is equivalent to setting

with since . This is exactly the MALA proposal. Naturally, one can also try to be slightly more clever and use an extended distribution

for some appropriate positive-definite “mass” matrix . Indeed, this immediately leads to preconditioned MALA methods. I really like this approach since it can be adapted and generalized to quite a few other situations!
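For concreteness, here is a minimal sketch of the resulting MALA algorithm, with a standard Gaussian used as a stand-in target (`log_pi` and its gradient are the assumed inputs):

```python
# Minimal MALA sketch for a target known up to a normalizing constant.
import numpy as np

def log_pi(x):       return -0.5 * x**2      # unnormalized log-density (stand-in)
def grad_log_pi(x):  return -x

def mala(n_iter=20_000, h=0.5, seed=0):
    rng = np.random.default_rng(seed)
    x, chain = 0.0, []
    for _ in range(n_iter):
        # Euler-Maruyama proposal: x' = x + h * grad + sqrt(2h) * xi
        xp = x + h * grad_log_pi(x) + np.sqrt(2 * h) * rng.standard_normal()

        def log_q(a, b):   # log-density (up to constants) of proposing b from a
            return -((b - a - h * grad_log_pi(a)) ** 2) / (4 * h)

        # Metropolis-Hastings log-ratio with the asymmetric Gaussian proposal
        log_alpha = log_pi(xp) + log_q(xp, x) - log_pi(x) - log_q(x, xp)
        if np.log(rng.uniform()) < log_alpha:
            x = xp
        chain.append(x)
    return np.array(chain)

chain = mala()
print(chain.mean(), chain.var())   # approximately 0 and 1 for the Gaussian target
```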

Titsias, Michalis K, and Omiros Papaspiliopoulos. 2018. “Auxiliary Gradient-Based Sampling Algorithms.” *Journal of the Royal Statistical Society Series B: Statistical Methodology* 80 (4). Oxford University Press: 749–67.

Consider a target probability density on that is known up to a normalizing constant . We also have a different probability density . The goal is to gradually tweak so that it eventually matches . More concretely, we aim to perform a gradient descent on the space of probability distributions to reduce the functional

This approach can be discretized: assume particles forming an empirical distribution that approximates ,

Define where denotes a time-discretization parameter and is a “drift” function. Finding a suitable drift function is the main problem. According to the Fokker-Planck equation, the computed empirical distribution

approximates given by

What is the optimal drift function that ensures that comes as close as possible to ? Typically, we select such that the quantity is minimized, provided that is not drastically different from . One method is to use the Wasserstein distance and assume the constraint

for a parameter . More pragmatically, it is generally easier (eg. proximal methods) to minimize the joint objective

Based on Equation 1 and Equation 2, a first-order expansion shows that the joint objective Equation 3 can be approximated by

a relatively straightforward quadratic function of the drift function . The optimal drift function, i.e. the minimizer of Equation 4, is given by

Put simply, this suggests that we should select the drift function proportional to . To implement this scheme, we begin by sampling particles and let each particle evolve according to the following differential equation

where is the density of the set of particles at time . It is the usual diffusion-ODE trick for describing the evolution of the density of an overdamped Langevin diffusion,

This can be shown by writing down the associated Fokker-Planck equation. This heuristic discussion shows that minimizing by following a gradient flow in the space of probability distributions equipped with the Wasserstein metric essentially produces a standard overdamped Langevin diffusion. Transforming this heuristic discussion into a formal statement is not trivial: the constructive proof in (Jordan, Kinderlehrer, and Otto 1998) is now usually referred to as the JKO scheme.
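A quick particle illustration of this heuristic: unadjusted Langevin particles, started far from a stand-in Gaussian target, drift toward it:

```python
# Particle sketch of the Wasserstein gradient flow of KL(. || pi): the particles
# follow an (unadjusted) overdamped Langevin discretization and their empirical
# distribution drifts toward pi. The target here is an assumed standard Gaussian.
import numpy as np

rng = np.random.default_rng(42)
grad_log_pi = lambda x: -x                   # pi = N(0, 1), made-up target
dt, n_steps = 0.01, 2000
x = rng.normal(5.0, 0.1, size=5000)          # particles start far from pi

for _ in range(n_steps):
    x += dt * grad_log_pi(x) + np.sqrt(2 * dt) * rng.standard_normal(x.size)

print(x.mean(), x.var())   # approximately 0 and 1
```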

The above derivation shows that the Wasserstein distance plays particularly well with minimizing functionals on the space of probability distributions. The same heuristic discussion shows that minimizing a functional of the type

for some cost function and distribution leads to choosing a drift function minimizing

This can be approached identically to what has been done in the case of minimizing .

Jordan, Richard, David Kinderlehrer, and Felix Otto. 1998. “The Variational Formulation of the Fokker–Planck Equation.” *SIAM Journal on Mathematical Analysis* 29 (1). SIAM: 1–17.

Consider a random variable on the finite alphabet with . For , consider a sequence obtained by sampling times independently from and set

the proportion of within this sequence. In other words, the empirical distribution obtained from the samples reads

The LLN indicates that as , and it is important to estimate the probability that significantly deviates from . To this end, note that for another probability vector the probability that

is straightforward to compute and reads

Stirling’s approximation then gives that

where is the Kullback–Leibler divergence of from . In other words, as soon as , the probability of observing falls exponentially quickly to zero. With the language of Large Deviations, one can make this statement slightly more precise, rigorous and general, but it is essentially the content of Sanov’s Theorem.
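The exponential decay rate can be checked directly on a small alphabet; the sketch below compares the exact multinomial log-probability (per sample) against the Kullback-Leibler divergence, for made-up distributions `p` and `q`:

```python
# Numerical check of the Sanov-type estimate: the exact multinomial probability
# of observing empirical frequencies q under iid sampling from p decays like
# exp(-n * KL(q || p)), up to polynomial factors. Alphabet and values made up.
from math import comb, log

def log_multinomial_prob(counts, p):
    # exact log-probability of observing these counts under iid sampling from p
    n = sum(counts)
    out, rem = 0.0, n
    for c, pi in zip(counts, p):
        out += log(comb(rem, c)) + c * log(pi)
        rem -= c
    return out

def kl(q, p):
    return sum(qi * log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]           # atypical empirical distribution
n = 300
counts = [int(n * qi) for qi in q]
exact = log_multinomial_prob(counts, p) / n   # exact rate per sample
stirling = -kl(q, p)                          # Stirling / Sanov prediction
print(exact, stirling)        # the two rates agree up to O(log(n) / n)
```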

Given a list of mutually exclusive events and the knowledge that at least one of these events has taken place, the probability that the event was the one that happened is . The implication is that if all the events are rare, that is , and it is known that one event has indeed occurred, there is a high probability that the event with the smallest value was the one that happened: the rare event took place in the least unlikely manner.

Consider an iid sequence of discrete real-valued random variables with and mean . Suppose one observes the rare event

for some level significantly above . Naturally, the least unlikely way for this to happen is if . Furthermore, one may be interested in the empirical distribution associated to the sequence when the rare event Equation 1 does happen. The least unlikely empirical distribution is the one that minimizes under the constraint that

The function is convex and the introduction of Lagrange multipliers shows that the solution to this constrained minimization problem is given by the Boltzmann distribution defined as

The parameter is chosen so that the constraint Equation 2 is satisfied; the minus sign follows the “physics” convention. Note in passing that the joint function is, in fact, convex! As usual, if one defines the log-partition function as , with

one obtains that the constraint is equivalent to requiring . Furthermore, since is smooth and strictly concave, the function is convex and the condition is equivalent to setting

Naturally, one can now also estimate the probability of the event happening since one now knows that it is equivalent (on a log scale) to . Algebra gives

As a sanity check, note that since , we have that , as required. The statement that

with a (Large Deviation) rate function given by

is more or less the content of Cramér's Theorem. The rate function and the function are related by a Legendre transform.

Now, to illustrate the above discussion, consider iid uniform random variables on the interval . It is straightforward to simulate these uniforms conditioned on the event that their mean exceeds the level , which is a relatively rare event. When this is done repeatedly and the empirical distribution is plotted, the resulting distribution is as follows:

Indeed, the distribution in blue is (very close to) the Boltzmann distribution with density with chosen so that .
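The tilt parameter in this uniform example can be computed by a simple root-find; the level `0.8` below is a made-up illustration value (the actual level used for the figure is not reproduced here):

```python
# Sketch of the exponential tilting in Cramér's theorem for Uniform(0,1)
# samples conditioned on a large mean: solve for the tilt theta so that the
# Boltzmann density proportional to exp(theta * x) on [0, 1] has a given mean.
from math import exp

def tilted_mean(theta):
    # mean of the density proportional to exp(theta * x) on [0, 1]
    if abs(theta) < 1e-8:
        return 0.5
    return 1.0 / (1.0 - exp(-theta)) - 1.0 / theta

def solve_theta(level, lo=-50.0, hi=50.0, n_iter=100):
    # bisection: tilted_mean is increasing in theta
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if tilted_mean(mid) < level:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

theta = solve_theta(0.8)       # made-up conditioning level
print(theta, tilted_mean(theta))
```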

for a “complicated” function and a simpler one . In some situations, it is possible to introduce an auxiliary random variable and an extended probability distribution on the extended space ,

with a tractable conditional probability . This extended target distribution can often be easier to explore, for example when is continuous while is discrete, or to analyze, since the “complicated” term has disappeared. Furthermore, there are a number of scenarios where the variable can be averaged out of the extended distribution, i.e. the distribution

can be evaluated exactly.

Consider a set of edges on a graph with vertices . The Ising model is defined as

for spin configurations . The term couples the two spins and for each edge . The idea of the Swendsen-Wang algorithm is to introduce an auxiliary variable for each edge that is uniformly distributed on the interval , i.e.

It follows that the extended distribution on reads

for and : the coupling term has disappeared. Furthermore, it is straightforward to sample from the conditional distribution and, perhaps surprisingly, it is also relatively straightforward to sample from the other conditional distribution – this boils down to finding the connected components of the graph on , with an edge present if , and flipping a fair coin to set the spins of each connected component to . This leads to an efficient Gibbs sampling scheme to sample from the extended distribution.
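A minimal sketch of one such Gibbs (Swendsen-Wang) update on a toy periodic Ising chain; the equivalent bond-activation probability `1 - exp(-2*beta)` for aligned spins is used in place of the explicit uniform auxiliary variables:

```python
# One Swendsen-Wang update for a small Ising model: activate each edge between
# aligned spins with probability 1 - exp(-2*beta), find the connected
# components (union-find), and flip each component with a fair coin.
import numpy as np

def swendsen_wang_step(spins, edges, beta, rng):
    n = len(spins)
    parent = list(range(n))
    def find(i):                       # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for (i, j) in edges:
        if spins[i] == spins[j] and rng.uniform() < 1 - np.exp(-2 * beta):
            parent[find(i)] = find(j)  # activate bond: merge components
    # assign a fair +-1 coin flip to every connected component
    flip = {r: rng.choice([-1, 1]) for r in {find(i) for i in range(n)}}
    return np.array([spins[i] * flip[find(i)] for i in range(n)])

rng = np.random.default_rng(0)
n = 10
edges = [(i, (i + 1) % n) for i in range(n)]      # toy periodic 1-d chain
spins = np.ones(n, dtype=int)
for _ in range(100):
    spins = swendsen_wang_step(spins, edges, 0.5, rng)
print(spins)    # a valid +-1 configuration after 100 cluster updates
```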

For an inverse temperature , consider the distribution on

where the magnetization of the system of spins is defined as

The distribution for favours configurations with a magnetization close to or . The normalization constant (i.e. partition function) is a sum of terms,

It is not difficult to estimate as with combinatorial arguments. Nevertheless, another way to proceed is as follows. One can introduce a Gaussian auxiliary random variable with mean and variance : the parameters and can then be judiciously chosen to cancel the bothersome term . This approach is often called the Hubbard-Stratonovich transformation. The bothersome “coupling” term disappears when choosing . With such a choice, it follows that

Averaging out the gives that the partition function reads

In order to use the method of steepest descent, it would be useful to have an integrand of the type . One can choose and . This gives

from which one directly obtains that:

Consider the distribution on

where the are some fixed weights with . We assume that the matrix is positive definite: this can be achieved by adding to it if necessary, which does not change the distribution . As described in (Zhang et al. 2012), although I would not be surprised if similar ideas appeared in the physics literature much earlier, one can introduce a Gaussian auxiliary random variable so that has mean and covariance . In other words,

In order to cancel out the it suffices to make sure that . There are a number of possibilities, the simplest approaches being perhaps

In any case, the joint distribution reads

Indeed, one can implement Gibbs-style updates in order to explore this joint distribution since both and are straightforward to sample from: it is indeed related to Restricted Boltzmann Machine models. One can also average out the spins and obtain that

where is the density of a centred Gaussian distribution with covariance . [**TODO:** add SMC experiments to estimate ].

Zhang, Yichuan, Zoubin Ghahramani, Amos J Storkey, and Charles Sutton. 2012. “Continuous Relaxations for Discrete Hamiltonian Monte Carlo.” *Advances in Neural Information Processing Systems* 25.

Instead, consider an integer and a family of subsets of such that any index appears in at least of these subsets. Note that for a subset with we have

Since each index appears in at least of the subsets, summing Equation 1 over all the subsets yields

This means that the following inequality holds,

Indeed, the standard sub-additivity property of the entropy corresponds to the set for and .

Consider a measurable set and call the projection of onto the hyperplane . A theorem of Loomis and Whitney (Loomis and Whitney 1949) states that the Lebesgue measure of the set satisfies

In other words, if all the projections of the set are small then, necessarily, the set itself is small. To proceed, one can approximate this set with a union of small cubes of side centred on . If one can prove the statement for , the result follows from a standard approximation argument (i.e. outer measure). Now, each cube can be indexed with a -uple of integers , and one can consider the random variable that is uniformly distributed on the set of cube coordinates. Because and etc…, choosing the subsets and in Shearer's Lemma immediately gives the conclusion.

Chung, Fan RK, Ronald L Graham, Peter Frankl, and James B Shearer. 1986. “Some Intersection Theorems for Ordered Sets and Graphs.” *Journal of Combinatorial Theory, Series A* 43 (1). Academic Press: 23–37.

Loomis, Lynn H, and Hassler Whitney. 1949. “An Inequality Related to the Isoperimetric Inequality.”

- “Elements of information theory” by T. M. Cover and J. A. Thomas – *perfect intro book to the topic.*
- “Information Theory, Inference, and Learning Algorithms” by David J.C. MacKay
- “Information Theory From Coding to Learning” by Yury Polyanskiy and Yihong Wu

- “A Mathematical Theory of Communication” by C. Shannon (1948) – *entertaining and readable, even 70+ years later!*
- “Lecture Notes on Statistics and Information Theory” by John Duchi
- “Information-theoretic methods for high-dimensional statistics” by Yihong Wu

Consider a scenario involving a “noisy channel,” where a message expressed in an alphabet is transmitted before being received as a potentially different and corrupted message expressed using a potentially different alphabet . One can assume that letter is transformed into with probability , so that the matrix has rows summing up to one, and that the “letters” of the message are transmitted one by one and independently from each other (i.e. there is no memory effect in the channel).

Now, imagine I have a text that needs to be transmitted through this channel. Assume that the text is represented using bits. My goal is to encode it in a way that introduces redundancy, utilizing the alphabet , so that it can be transmitted through the noisy channel. Eventually, I want it to be successfully decoded, allowing the original message to be recovered with minimal errors.

If transmitting each letter from the alphabet takes unit of time, I need to estimate the overall time it will take to transmit the entire text of bits. In cases where the channel does not completely destroy the information, one can use about any non-idiotic encoding and transmit the same text multiple times to increase the chances of accurately recovering the original message. This can be achieved by employing techniques such as majority voting to decode the received messages, which are all distorted versions of the original text.

The transmission rate represents the inverse of the time required to transfer a single bit of information:

In other words, it takes about units of time to transfer a text of bits. Moreover, the error rate refers to the percentage of errors in the decoded message, indicating the fraction of erroneous bits among the decoded bits. Naturally, there exists a tradeoff, and it is evident that one can reach a vanishing error rate if one is willing to allow an arbitrarily slow transmission rate (e.g. majority voting after transmitting the same text a very large number of times). For example, if and bits are flipped with probability , transmitting the text times would lead to a transmission rate of and an error rate approximately equal to .

The groundbreaking discovery made by Shannon is that it is possible to achieve a vanishing error rate even when transmitting at a finite transmission rate. He also managed to identify this optimal transmission rate. Shannon's paper (Shannon 1948) is beautifully written and surprisingly readable for a text written more than 70 years ago.

Let’s imagine that we have a piece of information encoded in a variable, . We send through a noisy channel, and at the other end we receive a somewhat distorted message, . So, how much of our original information actually was transmitted? To reconstruct our original message, , using our received message, , we require an average of additional bits of information. On average, contains bits of information. So, if we encode bits of useful information in , the variable that is correlated with still holds bits of that original information. The quantity is the mutual information between the random variables and . In a noisy channel that transmits one “letter” at a time, the conditional probabilities are fixed. However, we can optimize the distribution of incoming messages. For instance, we can choose to transmit letters that are less likely to be corrupted. This discussion suggests that on average, transmitting symbols through the channel can provide up to bits of information, where , the maximization being over the distribution of while keeping the conditional probabilities fixed. It may seem that this implies a noisy channel cannot transmit information at a rate higher than . This hypothesis was precisely proven by Claude Shannon, who further established that this transmission rate can indeed be reached.

To prove that this transmission rate is achievable, Shannon's idea was to simultaneously encode blocks of letters. To put it simply, consider the feasible blocks of binary letters. Each block has binary letters, . Associate to each block a **codeword** of size in the alphabet . The set of these codewords is usually called the **codebook**,

To transmit a block of letters from the original text, this block is first transformed into its associated codeword . This codeword is then sent through the noisy channel, resulting in a received message . The objective is to design a codebook with enough redundancy so that one can reconstruct the original codeword from the received message : the higher the ratio , the larger the redundancy and the easier it should be to achieve this goal. The transmission rate is defined as since transmitting a binary text of length with vanishing errors takes units of time.

For generating the codebook in Equation 1, Shannon adopted a simple approach consisting in generating each for and independently at random from some (encoding) distribution . The choice of this encoding distribution can be optimized at a later stage.

Consider the codeword . After being transmitted through the noisy channel, this gives rise to a message . The codeword can be easily recovered if is typical while all the other pairs for are atypical. Since there are about elements such that is typical, and each codeword was chosen approximately uniformly at random within its typical set of size , the probability for a random codeword to be atypical is about

Consequently, the probability that all the other pairs for are atypical is

This probability converges to one as as soon as . Furthermore, remembering that one was free to optimize the encoding distribution , a vanishing error rate is possible as soon as the transmission rate is lower than

To sum up, consider the success rate of the codebook , i.e. the probability that a random codeword of is successfully decoded when passing through the noisy channel. The reasoning above shows that the averaged success rate , i.e. averaging over all possible codebooks , converges to one as long as the transmission rate is below the channel capacity . This means that one can find at least one codebook that works well! This reasoning is an example of the “probabilistic method”… Indeed, one also expects **most** random codebooks to work well!

To demonstrate that transmission at vanishing error-rate is impossible when the transmission rate exceeds the channel capacity, , we can utilize Fano’s inequality.

Imagine selecting a message uniformly at random within and encoding this message into the sequence . We send through a channel with capacity and receive a corresponding, though somewhat distorted, signal . Finally, we decode this received message into , an estimate of our original message:

Fano’s inequality points out that the error probability, is such that

Applying the data-processing inequality to proves:

To wrap up, recall that each received letter in the message $(Y_1, \ldots, Y_K)$ depends solely on the corresponding letter of the message sent through the channel. This implies that . This yields:

This reveals that for the probability of error to go to zero, i.e. as , the transmission rate must be lower than .

Consider the Binary Symmetric Channel (BSC) that randomly flips and with equal probability . The capacity of this channel is easily computed and equals where is the binary entropy function: the optimal encoding distribution is
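As a quick numerical sanity check, the capacity formula can be evaluated in a few lines (a minimal sketch; the function names below are mine, purely for illustration):

```python
import math

def binary_entropy(p: float) -> float:
    """Binary entropy h(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p: float) -> float:
    """Capacity 1 - h(p) of the binary symmetric channel with flip probability p."""
    return 1.0 - binary_entropy(p)

print(bsc_capacity(0.1))   # about 0.531 bits per channel use
```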

For a flipping rate of the channel capacity equals . To estimate the performance of the random Shannon codebook strategy, I chose and several values of . This means generating a random codebook of size consisting of random binary vectors of size . For a randomly chosen codeword , a received message is generated by flipping each of the coordinates of independently with probability . In the BSC setting, it is easily seen that the codeword of that was the most likely to have originated is

The nearest neighbor can be computed relatively efficiently with a nearest-neighbor routine (e.g. FAISS). The figure below reports the probability of error (i.e. the “Block Error Rate”),

when the codeword is chosen uniformly at random within the codebook.

It can be seen that, although the error rate does go to zero for low transmission rates, the choice of where is the channel capacity still yields a relatively large block error rate. This indicates that the block size is still far too low for the “law of large numbers” arguments presented in the previous section to kick in. I did try for and a codebook of and the performance was still not impressive. This shows that even though the Shannon codebook approach is an elegant construction, it is far from practically useful: it requires a very large codebook of size , and decoding requires a nearest-neighbors search that can become slow as increases.
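For concreteness, the whole experiment fits in a few lines. This sketch uses brute-force Hamming decoding instead of a FAISS nearest-neighbor search, and the parameters are deliberately tiny compared to the ones reported above:

```python
import numpy as np

rng = np.random.default_rng(0)

def block_error_rate(K: int, R: float, p: float, trials: int = 200) -> float:
    """Estimate the block error rate of a random codebook of rate R over a
    BSC(p), decoding with the nearest codeword in Hamming distance."""
    M = 2 ** int(R * K)                                   # number of codewords
    codebook = rng.integers(0, 2, size=(M, K), dtype=np.uint8)
    errors = 0
    for _ in range(trials):
        i = rng.integers(M)                               # codeword sent
        noise = (rng.random(K) < p).astype(np.uint8)      # flip each bit w.p. p
        received = codebook[i] ^ noise
        distances = (codebook ^ received).sum(axis=1)     # Hamming distances
        errors += distances.argmin() != i
    return errors / trials

# rate far below capacity, little noise: decoding almost always succeeds
print(block_error_rate(K=16, R=0.25, p=0.01))
# rate far above capacity: decoding almost always fails
print(block_error_rate(K=16, R=0.75, p=0.4))
```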

Shannon, Claude Elwood. 1948. “A Mathematical Theory of Communication.” *The Bell System Technical Journal* 27 (3). Nokia Bell Labs: 379–423.

Consider three random variables forming a Markov chain,

in the sense that and . Typical situations include:

We select a parameter for a probabilistic model . Afterward, we collect data from this model, and our goal is to estimate the parameter solely from the data .

We generate data , compress this data into , and then attempt to recover the original data as closely as we can.

Since each step in Equation 1 destroys some information (e.g. data processing), it is important to measure how accurately estimates the initial input, . In other words, we want to know how much more information (expressed as ‘bits’) we need to reconstruct using knowledge of alone, i.e. we would like to upper-bound . For this purpose, imagine an “error” variable that indicates whether perfectly matches ,

The probability of error is and . To estimate from , we can start by learning if equals , which costs us ‘bits’ of information. If it turns out that , we are done asking. If we find that , however, we need to ask additional questions. Crucially, , but also since can take any value in except when . Writing this reasoning quantitatively gives Fano’s inequality:
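The inequality is easy to check numerically. In the small example below (a symmetric three-symbol channel, chosen purely for illustration), the MAP estimator's error probability in fact saturates Fano's bound exactly:

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy (in bits) of a pmf given as an array."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Toy joint law: X uniform on {0,1,2}; the channel keeps X with prob. 0.8
# and otherwise outputs one of the two other symbols uniformly at random.
m, eps = 3, 0.2
joint = np.full((m, m), (1 / m) * eps / (m - 1))    # joint[x, y] = P(X=x, Y=y)
np.fill_diagonal(joint, (1 / m) * (1 - eps))

# conditional entropy H(X|Y) = H(X,Y) - H(Y)
H_cond = entropy_bits(joint.ravel()) - entropy_bits(joint.sum(axis=0))

# MAP estimate of X from Y, and its error probability
xhat = joint.argmax(axis=0)
P_err = 1.0 - joint[xhat, np.arange(m)].sum()

# Fano: H(X|Y) <= h(P_err) + P_err * log2(m - 1)
bound = entropy_bits(np.array([P_err, 1 - P_err])) + P_err * np.log2(m - 1)
assert H_cond <= bound + 1e-12
```

Because the confusion is uniform over the wrong symbols, this example meets Fano's inequality with equality.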

Apparently, this inequality was first derived by Robert Fano in the 50s while teaching a Ph.D. seminar at MIT. In words: a large means that offers insufficient information about , and as a result, the probability of error must be high.

- Converse of Shannon’s coding theorem

If Alice chooses a number uniformly at random from the set , Bob can use a simple “dichotomy” strategy to ask Alice binary Yes/No questions and correctly identify the number. This result can be generalized to non-uniform distributions (cf. Huffman codes, and also Kraft-McMillan inequality). If Alice chooses a number from with probabilities , Bob can design a deterministic strategy to find the answer using, on average, about

binary questions, i.e. bits. To be more precise, there are strategies that require at most questions on average, and none that can require fewer than . Note that applying this remark to an iid sequence and using the fact that , this shows that one can exactly determine the sequence with at most binary questions on average. The quantity defined in Equation 1 is known as the **Shannon Entropy** of the distribution . Since a sequence of binary questions can be thought of as a binary tree, this also implies that there are strategies that can encode each integer as a binary string of length (i.e. with bits), with the expected length approximately equal to .
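This optimal questioning strategy is essentially Huffman coding. A minimal sketch (the helper below is mine, not from a library) computes the Huffman codeword lengths and checks that the average length L satisfies H ≤ L < H + 1:

```python
import heapq
import math

def huffman_lengths(probs):
    """Codeword lengths of a binary Huffman code for the distribution `probs`."""
    # heap entries: (total probability, tie-breaker, symbols in the subtree)
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1            # every merge adds one bit to the subtree
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths

probs = [0.5, 0.25, 0.125, 0.125]      # dyadic, so the code is exactly optimal
lengths = huffman_lengths(probs)
H = -sum(p * math.log2(p) for p in probs)
L = sum(p, l) if False else sum(p * l for p, l in zip(probs, lengths))
print(lengths, H, L)                   # here the average length L equals H
```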

This remark can be used for compression. Imagine a very long sequence of iid samples from . Encoding each with bits, one should be able to encode the resulting sequence with

bits. Can the usual **zip compression** algorithm do this? To test this, choose a probability distribution on , generate an iid sequence of length , compress it using the command, and finally look at the size of the resulting file (in bits). I have done this a few times, with a few different values of and a few random distributions on , and with . The plot of the size of the compressed files versus the Shannon entropy looks as below:

It seems like the zip algorithm works almost optimally for compressing iid sequences.
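This experiment is easy to reproduce with, say, Python's `zlib` as a stand-in for the zip compressor (the helper name and parameters below are illustrative):

```python
import math
import zlib

import numpy as np

rng = np.random.default_rng(1)

def compressed_bits_vs_entropy(probs, n=100_000):
    """Compress n iid draws from `probs` with zlib and compare the size
    in bits to the Shannon lower bound n * H(probs)."""
    symbols = rng.choice(len(probs), size=n, p=probs).astype(np.uint8)
    n_bits = 8 * len(zlib.compress(symbols.tobytes(), level=9))
    H = -sum(p * math.log2(p) for p in probs if p > 0)
    return n_bits, n * H

n_bits, bound = compressed_bits_vs_entropy([1 / 8] * 8)
print(n_bits / bound)    # close to 1: zlib is near-optimal on this iid data
```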

Now consider a pair of discrete random variables . If Alice draws samples from this pair of rvs, one can ask binary questions on average to exactly find out these values. To do that, one can ask questions to estimate , and once is estimated, one can then ask about to estimate . This strategy requires on average binary questions and is actually optimal, showing that

where we have defined .

Indeed, one can generalize these concepts to more than two random variables. Iterating Equation 2 shows that the trajectory of a stationary ergodic Markov chain can be estimated on average with binary questions where

Here, is the equilibrium distribution of the Markov chain and are the transition probabilities.
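Computing this entropy rate numerically is straightforward (a sketch; it assumes the chain is ergodic so that the stationary distribution is well-defined):

```python
import numpy as np

def entropy_rate(P):
    """Entropy rate in bits/step of an ergodic Markov chain with transition
    matrix P: sum_x pi(x) * H(P(x, .)), with pi the stationary distribution."""
    evals, evecs = np.linalg.eig(P.T)
    pi = np.real(evecs[:, np.argmax(np.real(evals))])   # Perron eigenvector
    pi = pi / pi.sum()
    logP = np.log2(np.where(P > 0, P, 1.0))             # log 1 = 0 kills zeros
    row_entropies = -(P * logP).sum(axis=1)
    return float(pi @ row_entropies)

# two-state chain that stays put w.p. 0.9: the entropy rate is h(0.1) bits/step
P = np.array([[0.9, 0.1], [0.1, 0.9]])
print(entropy_rate(P))
```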

Can the zip algorithm compress Markovian trajectories and roughly achieve that level of compression? Indeed, one can test this by generating random Markov transition matrices (which are, with very high probability, ergodic with respect to an equilibrium distribution ). Doing this with trajectories of length (i.e. quite short, because the experiment is quite slow) on with , one gets the following results:

In red is the entropy estimated without using the Markovian structure, i.e. assuming that the are iid samples. One can clearly see that the dependencies between successive samples are exploited for better compression. Nevertheless, the optimal compression rate given by the Shannon entropy is not reached. Indeed, is not an optimal algorithm – it cannot even compress well enough the sequence !

The AEP is a simple remark that gives a convenient way of reasoning about long sequences of random variables with . For example, assuming that the random variables are independent and identically distributed as the random variable , the law of large numbers (LLN) gives that

This means that **any** “typical” sequence has a probability of about of occurring, which also means that there are about such “typical” sequences. Indeed, one could use the LLN for Markov chains and obtain a similar statement for Markovian sequences. All this can be made rigorous with large deviation theory, for example. This also establishes the link with the “statistical physics” definition of entropy as the logarithm of the number of configurations… Sets of the type

are usually called **typical sets**: for any , the probability that belongs to goes to one as . For these reasons, it is often a good heuristic to think of a draw of as uniformly distributed on the associated typical set. For example, if are iid draws from a Bernoulli distribution with , the set of sequences such that has elements, where is the entropy of a random variable.
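The Bernoulli count is easy to check by brute force over binomial coefficients (the block length and tolerance below are arbitrary choices):

```python
import math

n, p, eps = 200, 0.25, 0.02
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)   # entropy of Bernoulli(p)

# number of binary sequences of length n whose fraction of ones is within
# eps of p; there should be roughly 2**(n*H) of them
count = sum(math.comb(n, k) for k in range(n + 1) if abs(k / n - p) <= eps)
print(math.log2(count) / n, H)   # the two numbers are close
```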

Consider a pair of random variables . Assuming that stores (on average) bits of useful information, how much of this information can be extracted from ? Let us call this quantity , since we will see in a second that it is symmetric. If is independent from , no useful information about is contained in and . On the contrary, if , the knowledge of already contains all the information about and . If one knows , one needs on average binary questions (i.e. bits of additional information) to determine with certainty and recover all the information contained in . This means that the knowledge of already contains useful bits of information about ! This quantity is called the **mutual information** of the two random variables and , and it has the good taste of being symmetric:

Naturally, one can define a conditional version of it by setting , where has the law of conditioned on . Since is the reduction in uncertainty of due to when is given, there are indeed situations where is larger than – this is to be contrasted with the intuitive inequality , which is indeed true. A standard such example is when and are independent random variables and : a short computation gives that while, indeed, . This definition of conditional mutual information leads to a chain-rule property,

which can indeed be generalized to any number of variables. Furthermore, if the are conditionally independent given (e.g. if and only depend on ), then the sub-additivity of the entropy readily gives that

Importantly, algebra shows that can also be expressed as the Kullback-Leibler divergence between the joint distribution and the product of the marginals ,
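Numerically, this KL formulation gives a compact way to compute the mutual information from a joint probability table (a sketch; the joint laws below are made up for illustration):

```python
import numpy as np

def mutual_information_bits(joint):
    """I(X;Y) = KL(p_xy || p_x p_y), in bits, from a joint pmf table."""
    px = joint.sum(axis=1, keepdims=True)      # marginal of X (column vector)
    py = joint.sum(axis=0, keepdims=True)      # marginal of Y (row vector)
    mask = joint > 0
    return float((joint[mask] * np.log2(joint[mask] / (px * py)[mask])).sum())

# independent variables carry no mutual information (zero up to rounding)...
print(mutual_information_bits(np.outer([0.3, 0.7], [0.4, 0.6])))
# ...while a fair bit observed without noise carries exactly one bit
print(mutual_information_bits(np.diag([0.5, 0.5])))
```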

This diagram from (MacKay 2003) nicely illustrates the different fundamental quantities , , , , and :

Naturally, if one considers three random variables forming a “Markov chain”, we have the so-called **data-processing** inequality,

The first inequality is clear since all the useful information contained in must be coming from , and only contains bits about . For the second inequality, note that if contains bits about , and contains bits about , then cannot contain more than bits of :
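The first inequality is easy to check numerically by chaining two binary symmetric channels (a sketch; the small helper recomputes the mutual information directly from a joint table):

```python
import numpy as np

def mi_bits(joint):
    """Mutual information, in bits, between the coordinates of a joint pmf."""
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log2(joint[mask] / (px * py)[mask])).sum())

a = 0.1
channel = np.array([[1 - a, a], [a, 1 - a]])   # p(y|x) of a BSC(a)
px = np.array([0.5, 0.5])                      # X is a fair bit

joint_xy = px[:, None] * channel               # X -> Y
joint_xz = px[:, None] * (channel @ channel)   # X -> Y -> Z: two BSCs chained

print(mi_bits(joint_xy), mi_bits(joint_xz))    # the second one is smaller
```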

MacKay, David JC. 2003. *Information Theory, Inference and Learning Algorithms*. Cambridge university press.

Consider an empirical data distribution . In order to simulate approximate samples from , Denoising Diffusion Probabilistic Models (DDPM) simulate a forward diffusion process on an interval . The diffusion is initialized at the data distribution, i.e. , and is chosen so that the distribution of is very close to a known and tractable reference distribution , e.g. a Gaussian distribution. Denote by the marginal distribution at time , i.e. . By choosing the forward distribution with simple and tractable transition probabilities, e.g. an Ornstein-Uhlenbeck process, it is relatively easy to estimate from simulated data: this can be formulated as a simple regression problem. This allows one to simulate the diffusion backward in time and generate approximate samples from . Why this is useful is another question…

The fact that the mapping from data samples at time to (approximate) Gaussian samples at time is stochastic and described by diffusion processes is cumbersome. It would be much more convenient to build a deterministic mapping between the data distribution and the Gaussian reference distribution : this would allow one to associate a likelihood to data samples and to easily “encode”/“decode” data samples. To do this, one can try to replace diffusions by Ordinary Differential Equations.

Consider an arbitrary diffusion process with associated distribution at time . The Fokker-Planck equation that describes the evolution of reads

If there were no diffusion term and were instead describing the evolution of the differential equation , the associated evolution of the density of would simply read

If one can find a vector field such that

then one can basically replace diffusions by ODEs. The diffusion-ODE trick is the simple remark that

does exactly this, as algebra immediately shows. The additional term is intuitive. The coefficient is there because one is trying to match the term in the Fokker-Planck equation. And the overall term drives the ODE in directions where the probability density is small, i.e. it follows the negative gradient of the log-density: it is exactly trying to imitate the diffusion term .

What this means is that a diffusion process started from and with marginal distribution can be imitated by an ODE process started from . At any time , the marginal distributions of and both exactly equal .
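For a one-dimensional OU example this agreement of marginals can be checked by simulation: with a Gaussian initial condition the score is analytic, and the SDE and probability-flow ODE ensembles display the same variance at every time (a sketch with arbitrary parameters):

```python
import numpy as np

rng = np.random.default_rng(0)

# OU diffusion dX = -X dt + sqrt(2) dW with X_0 ~ N(0, v0): the marginal
# stays Gaussian N(0, v(t)) with v(t) = 1 + (v0 - 1) * exp(-2 t)
T, n_steps, n_particles, v0 = 1.0, 500, 100_000, 4.0
dt = T / n_steps
v = lambda t: 1.0 + (v0 - 1.0) * np.exp(-2.0 * t)

x_sde = rng.normal(0.0, np.sqrt(v0), n_particles)   # Euler-Maruyama ensemble
x_ode = x_sde.copy()                                # probability-flow ensemble
for k in range(n_steps):
    t = k * dt
    x_sde = x_sde - x_sde * dt + np.sqrt(2 * dt) * rng.normal(size=n_particles)
    # ODE field: drift - (sigma^2/2) * score = -x + x / v(t)
    x_ode = x_ode + x_ode * (1.0 / v(t) - 1.0) * dt

print(x_sde.var(), x_ode.var(), v(T))   # all three agree closely
```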

Consider a DDPM with forward dynamics given by an Ornstein-Uhlenbeck (OU) process

and initial condition . As explained in these notes, it is relatively straightforward to estimate the score function

from data. This means that the forward OU process can be replaced by the forward ODE

with . Similarly, the reverse diffusion (i.e. the “denoising” diffusion) defined as follows the dynamics

As described for the first time in the beautiful article (Song et al. 2020), the diffusion-ODE trick now shows that the denoising diffusion can be replaced by a denoising ODE with dynamics

Interestingly [and I do not know whether there was an obvious way of seeing this from the start], this shows that the forward and backward ODEs are actually the same, but run forward and backward in time. They correspond to the ODE described by the vector field

The animation below displays the denoising ODE and the associated vector field Equation 2.

With the diffusion-ODE trick, we have just seen that it is possible to build a vector field such that the *forward* ODE

and the *backward* ODE defined as

are such that and .

In general, consider a vector field and a bunch of particles distributed according to a distribution at time . If each particle follows the vector field for an amount of time , the particles that were in the vicinity of some at time end up in the vicinity of at time . At the same time, a volume element around gets stretched by a factor while following the vector field , which means that the density of particles at time and around equals . In other words, . This means that if one follows a trajectory of , one gets

That is the Lagrangian description of the density of particles. Indeed, one could directly obtain this identity by differentiating with respect to time while using the continuity Equation 1. When applied to the DDPM, this gives a way to assign a likelihood to data samples, namely

where is the trajectory of the forward ODE Equation 3 initialized at . Note that in high-dimensional settings, it may be computationally expensive to compute the divergence term since it typically is times slower than a gradient computation; for this reason, it is often advocated to use the Hutchinson trace estimator to obtain an unbiased estimate of it at a much lower computational cost.
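A sketch of the Hutchinson estimator on a toy linear vector field, where the divergence is just tr(A) and the Jacobian-vector products are approximated by central finite differences (all names below are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def hutchinson_divergence(f, x, n_probes=20_000, eps=1e-5):
    """Unbiased estimate of div f(x) = tr(J_f(x)): average v^T J v over
    Rademacher probes v, with J v computed by central finite differences."""
    total = 0.0
    for _ in range(n_probes):
        v = rng.choice([-1.0, 1.0], size=x.shape)
        jvp = (f(x + eps * v) - f(x - eps * v)) / (2 * eps)
        total += v @ jvp
    return total / n_probes

d = 20
A = rng.normal(size=(d, d))
f = lambda x: A @ x                     # linear field: divergence is tr(A)
x = rng.normal(size=d)

print(hutchinson_divergence(f, x), np.trace(A))   # close to each other
```

In practice one replaces the finite differences with exact Jacobian-vector products from automatic differentiation, which is the one-gradient-cost trick alluded to above.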

Song, Yang, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. 2020. “Score-Based Generative Modeling Through Stochastic Differential Equations.” *ICLR 2021*. https://arxiv.org/abs/2011.13456.