Consider a target density \(\pi(x)\) in \(\mathbb{R}^D\). Since the Langevin diffusion
\[ dX_t = \nabla \log \pi(X_t) \, dt + \sqrt{2} \, dW_t \tag{1}\]
is reversible with respect to \(\pi\), it is natural to use an Euler-Maruyama discretization of Equation 1 to build MCMC proposals: in an MCMC simulation and for a time-discretization parameter \(\varepsilon > 0\), if the current position is \(x \in \mathbb{R}^D\), a proposal \(y \in \mathbb{R}^D\) can be generated as
\[ y = x + \varepsilon\, \nabla \log \pi(x) + \sqrt{2 \varepsilon} \, \xi \]
with \(\xi \sim \mathcal{N}(0,I)\), before being accepted or rejected according to the usual Metropolis-Hastings ratio. This MCMC method, first proposed by Julian Besag in 1994, is commonly referred to as the Metropolis-Adjusted Langevin Algorithm (MALA). But how could one come up with this proposal mechanism without knowing beforehand the existence of the reversible Langevin diffusion Equation 1? While it is intuitively clear that following the direction of \(\nabla \log \pi\) is not such a bad idea, i.e. one would like to move towards areas of “high probability mass”, where does this \(\sqrt{2}\) come from?

Naturally, one could look at proposals of the type \(y = x + \varepsilon\, \nabla \log \pi(x) + \lambda \, \xi\) for some free parameter \(\lambda > 0\) and study the behavior of the Metropolis-Hastings ratio in the regime \(\varepsilon \to 0\): as simple as it sounds, it is not entirely straightforward and requires quite a bit of algebra (do it!). Instead, I very much like the type of approach described in (Titsias and Papaspiliopoulos 2018). To summarize, we would like to generate an MCMC proposal \(y \in \mathbb{R}^D\) that stays in the vicinity of the current position \(x \in \mathbb{R}^D\) while exploiting the knowledge of \(\nabla \log \pi(x)\). One cannot simply approximate the target distribution as \(\pi(x) \approx \pi(x_k) \, e^{\left< \nabla \log \pi(x_k), x-x_k \right>}\) and sample from this approximation, since it typically does not define a probability distribution. Instead, consider the following extended target distribution
\[ \overline{\pi}(x,z) \, \propto \pi(x) \, \exp {\left\{ -\frac{1}{2\varepsilon}\|z-x\|^2 \right\}} . \]
In other words, the Gaussian auxiliary variable \(z \in \mathbb{R}^D\) is centred at \(x\), at a distance of about \(\sqrt{\varepsilon}\) from it. Now, given the current position \(x_k\), to generate a proposal \(y_\star\) that stays in the vicinity of \(x_k\), one can proceed in two steps, in the spirit of a Gibbs-sampling approach:
1. First, generate \(z_\star \sim \overline{\pi}(dz | x_k) = \mathcal{N}(x_k, \varepsilon\, I)\).
2. Second, sample \(y_\star \sim \overline{\pi}(dx | z_\star)\).
Unfortunately, the second step is typically not tractable. Nevertheless, the conditional density \(\overline{\pi}(dx | z_\star)\) is concentrated in a \(\sqrt{\varepsilon}\)-neighborhood of \(z_\star\), so a simple Gaussian approximation, obtained by linearizing \(\log \pi\) around the current position \(x_k\), should be enough for our purpose. We have:
\[ \begin{align} \log \overline{\pi}(x | z_\star) &= \log \pi(x) - \frac{1}{2 \varepsilon} \|x - z_\star\|^2 + \textrm{(Cst)}\\ &\approx \left< \nabla \log \pi(x_k), x-x_k \right> - \frac{1}{2 \varepsilon} \|x - z_\star\|^2 + \textrm{(Cst)}\\ &= - \frac{1}{2 \varepsilon} \|x - [z_\star + \varepsilon\, \nabla \log \pi(x_k)]\|^2 + \textrm{(Cst)}. \end{align} \]
This shows that the conditional \(\overline{\pi}(dx | z_\star)\) can be approximated by a Gaussian distribution centred at \(z_\star + \varepsilon\, \nabla \log \pi(x_k)\) with variance \(\varepsilon\, I\). This means that the final proposal \(y \in \mathbb{R}^D\) can be generated as \(y = z_\star + \varepsilon\, \nabla \log \pi(x_k) + \xi\) where \(\xi \sim \mathcal{N}(0,\varepsilon\, I)\). But that is equivalent to setting
\[ y = x_k + \varepsilon\, \nabla \log \pi(x_k) + \sqrt{2 \varepsilon} \, \xi \]
with \(\xi \sim \mathcal{N}(0,I)\): since \(z_\star \sim \mathcal{N}(x_k, \varepsilon\, I)\), the two independent \(\mathcal{N}(0, \varepsilon\, I)\) noise terms add up to a single Gaussian noise with variance \(2 \varepsilon\, I\), which is where the \(\sqrt{2}\) comes from. This is exactly the MALA proposal.
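To make the mechanism concrete, here is a minimal sketch of one MALA iteration in Python; the callables `log_pi` and `grad_log_pi`, the step size `eps`, and the random generator `rng` are placeholders for whatever target is being sampled, not anything prescribed by the derivation above.

```python
import numpy as np

def mala_step(x, log_pi, grad_log_pi, eps, rng):
    """One MALA iteration: Langevin proposal plus Metropolis-Hastings correction."""

    def log_q(y_to, x_from):
        # Log-density, up to an additive constant, of the proposal
        # y_to ~ N(x_from + eps * grad_log_pi(x_from), 2 * eps * I).
        mu = x_from + eps * grad_log_pi(x_from)
        return -np.sum((y_to - mu) ** 2) / (4.0 * eps)

    # Proposal: y = x + eps * grad log pi(x) + sqrt(2 eps) * xi with xi ~ N(0, I).
    y = x + eps * grad_log_pi(x) + np.sqrt(2.0 * eps) * rng.standard_normal(x.shape)

    # The proposal is not symmetric, so the forward and backward
    # proposal densities do not cancel in the Metropolis-Hastings ratio.
    log_alpha = log_pi(y) - log_pi(x) + log_q(x, y) - log_q(y, x)
    if np.log(rng.uniform()) < log_alpha:
        return y, True
    return x, False
```

For instance, for a standard Gaussian target one would pass `log_pi = lambda x: -0.5 * np.sum(x**2)` and `grad_log_pi = lambda x: -x`, together with `rng = np.random.default_rng()`.

Naturally, one can also try to be slightly more clever and use an extended distribution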
\[ \overline{\pi}(x,z) \, \propto \, \pi(x) \, \exp {\left\{ -\frac{1}{2\varepsilon} \left< (z-x), M^{-1} \, (z-x) \right> \right\}} \]
for some appropriate positive-definite “mass” matrix \(M \in \mathbb{R}^{D \times D}\). Indeed, this immediately leads to preconditioned MALA methods. I really like this approach since it can be adapted and generalized to quite a few other situations!
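Repeating the same computation with this extended target gives \(z | x \sim \mathcal{N}(x, \varepsilon\, M)\) and, after linearizing \(\log \pi\) around \(x_k\), the approximate conditional \(\mathcal{N}(z_\star + \varepsilon\, M \nabla \log \pi(x_k), \varepsilon\, M)\); combining the two noise terms yields the preconditioned proposal \(y = x_k + \varepsilon\, M \nabla \log \pi(x_k) + \sqrt{2 \varepsilon}\, L\, \xi\) with \(L L^\top = M\) and \(\xi \sim \mathcal{N}(0, I)\). A minimal sketch along the same lines as before (again with placeholder `log_pi` and `grad_log_pi`, and a user-chosen fixed `M`):

```python
import numpy as np

def pmala_step(x, log_pi, grad_log_pi, eps, M, rng):
    """One preconditioned-MALA iteration with a fixed positive-definite matrix M."""
    L = np.linalg.cholesky(M)  # M = L L^T

    def log_q(y_to, x_from):
        # Log-density, up to an additive constant, of the proposal
        # y_to ~ N(x_from + eps * M * grad_log_pi(x_from), 2 * eps * M).
        d = y_to - (x_from + eps * M @ grad_log_pi(x_from))
        return -d @ np.linalg.solve(M, d) / (4.0 * eps)

    # Proposal: y = x + eps * M * grad log pi(x) + sqrt(2 eps) * L * xi.
    y = x + eps * M @ grad_log_pi(x) + np.sqrt(2.0 * eps) * L @ rng.standard_normal(x.shape)
    log_alpha = log_pi(y) - log_pi(x) + log_q(x, y) - log_q(y, x)
    if np.log(rng.uniform()) < log_alpha:
        return y, True
    return x, False
```

Taking \(M = I\) recovers the plain MALA step above.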