Consider a complicated distribution on the state space \(x \in \mathcal{X}\) given by
\[ \pi(x) = \frac{1}{\mathcal{Z}} \, e^{C(x) + B(x)} \]
for a “complicated” function \(C(x)\) and a simpler one \(B(x)\). In some situations, it is possible to introduce an auxiliary random variable \(a \in \mathcal{A}\) and an extended probability distribution \(\pi(x,a)\) on the extended space \(\mathcal{X}\times \mathcal{A}\),
\[ \pi(x,a) = \pi(x) \, \textcolor{red}{\pi(a | x)} = \frac{1}{\mathcal{Z}} e^{C(x) + B(x)} \, \textcolor{red}{e^{-C(x) + D(x, a)}}, \]
with a tractable conditional probability \(\pi(a | x)\). This extended target distribution \(\pi(x,a) = (1/\mathcal{Z}) \, \exp[B(x) + D(x,a)]\) can often be easier to explore, for example when \(a\) is continuous while \(x\) is discrete, or to analyze, since the “complicated” term \(C(x)\) has disappeared. Furthermore, there are a number of scenarios where the variable \(x\) can be averaged out of the extended distribution, i.e. the marginal distribution
\[ \pi(a) = \frac{1}{\mathcal{Z}} \, \int_{x \in \mathcal{X}} e^{B(x) + D(x,a)} \, dx \]
can be evaluated exactly.
Swendsen–Wang algorithm
Consider a set of edges \(\mathcal{E}\) on a graph with vertices \(\{1, \ldots, N\}\). The Ising model is defined as \[ \pi(x) \propto \exp {\left\{ \sum_{(i,j) \in \mathcal{E}} \beta x_i x_j \right\}} \]
for spin configurations \(x=(x_1, \ldots, x_N) \in \{-1,1\}^N\). The term \(\exp[\beta x_i x_j]\) couples the two spins \(x_i\) and \(x_j\) for each edge \((i,j) \in \mathcal{E}\). The idea of the Swendsen–Wang algorithm is to introduce an auxiliary variable \(u_{i,j}\) for each edge \((i,j) \in \mathcal{E}\) that is uniformly distributed on the interval \([0, \exp(\beta x_i x_j)]\), i.e.
\[ \pi(u_{i,j} | x) \; = \; \frac{ \mathbf{1} {\left\{ 0 < u_{i,j} < \exp[\beta x_i x_j] \right\}} }{\exp[\beta x_i x_j] } \]
It follows that the extended distribution on \(\{-1,1\}^N \times (0,\infty)^{|\mathcal{E}|}\) reads
\[ \pi(x,u) = \frac{1}{\mathcal{Z}} \prod_{(i,j) \in \mathcal{E}} \; \mathbf{1} {\left\{ 0 < u_{i,j} < \exp[\beta x_i x_j] \right\}} \]
for \(x=(x_1, \ldots, x_N)\) and \(u = (u_{i,j})_{(i,j) \in \mathcal{E}}\): the coupling term \(\exp[\beta x_i x_j]\) has disappeared. Furthermore, it is straightforward to sample from the conditional distribution \(\pi(u | x)\) and, perhaps surprisingly, it is also relatively straightforward to sample from the other conditional distribution \(\pi(x | u)\) – this boils down to finding the connected components of the graph on \(\{1, \ldots, N\}\) with an edge \(i \sim j\) present if \(u_{i,j} > e^{-\beta}\) and flipping a fair coin to set each connected component to \(\pm 1\). This leads to an efficient Gibbs sampling scheme to sample from the extended distribution.
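As an illustration (my own sketch, not part of the original description), here is one such Gibbs sweep in Python, assuming the spins are stored in a NumPy array and the edges as a list of index pairs; the auxiliary variables \(u_{i,j}\) only enter through the event \(\{u_{i,j} > e^{-\beta}\}\), which has probability \(1 - e^{-2\beta}\) when \(x_i = x_j\) and probability zero otherwise.

```python
import numpy as np

def swendsen_wang_step(x, edges, beta, rng):
    """One Swendsen-Wang sweep: sample u | x, then x | u.

    x     : NumPy array of spins in {-1, +1}, shape (N,)
    edges : iterable of (i, j) index pairs
    beta  : inverse temperature, beta > 0
    rng   : numpy.random.Generator
    """
    N = len(x)
    parent = list(range(N))                 # union-find structure for connected components

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    # Sampling u_{ij} | x ~ Uniform(0, exp(beta * x_i * x_j)) and keeping only the
    # event {u_{ij} > exp(-beta)}: it occurs with probability 1 - exp(-2*beta)
    # when x_i == x_j, and cannot occur when x_i != x_j.
    p_bond = 1.0 - np.exp(-2.0 * beta)
    for (i, j) in edges:
        if x[i] == x[j] and rng.random() < p_bond:
            parent[find(i)] = find(j)       # bond the two spins

    # Sampling x | u: a fair +/-1 coin for each connected component.
    new_spin = {}
    x_new = np.empty_like(x)
    for i in range(N):
        r = find(i)
        if r not in new_spin:
            new_spin[r] = rng.choice((-1, 1))
        x_new[i] = new_spin[r]
    return x_new
```

Iterating this step alternates the two conditional updates and therefore targets the extended distribution \(\pi(x, u)\).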
Gaussian integral trick: Curie–Weiss model
For an inverse temperature \(\beta > 0\), consider the distribution on \(x \in \{-1,1\}^N\)
\[ \pi(x) = \frac{1}{\mathcal{Z}(\beta)} \, e^{\beta \, N \, m^2} \]
where the magnetization of the system of spins \(x=(x_1, \ldots, x_N)\) is defined as
\[ m = \frac{x_1 + \ldots + x_N}{N}. \]
The distribution \(\pi(x)\) for \(\beta \gg 1\) favours configurations with a magnetization close to \(+1\) or \(-1\). The normalization constant (i.e. partition function) \(\mathcal{Z}(\beta)\) is a sum of \(2^N\) terms,
\[ \mathcal{Z}(\beta) = \sum_{x_1 \in \{ \pm 1\} } \ldots \sum_{x_N \in \{ \pm 1\} } \exp {\left\{ \frac{\beta}{N} {\left( \sum_{i=1}^N x_i \right)} ^2 \right\}} . \]
It is not difficult to estimate \(\log \mathcal{Z}(\beta)\) as \(N \to \infty\) with combinatorial arguments. Nevertheless, another way to proceed is as follows. One can introduce a Gaussian auxiliary random variable \(a\) with conditional distribution \(\pi(a | x) = \mathcal{N}(\alpha \, {\left( \sum_i x_i \right)} , \sigma^2)\), i.e. with mean \(\mu = \alpha \, {\left( \sum_i x_i \right)} \) and variance \(\sigma^2\): the parameters \(\alpha\) and \(\sigma^2 > 0\) can then be judiciously chosen to cancel the bothersome term \(\exp[\beta \, N \, m^2]\). This approach is often called a Hubbard–Stratonovich transformation. The bothersome “coupling” term disappears when choosing \(\frac{\alpha^2}{2 \sigma^2} = \frac{\beta}{N}\). With such a choice, it follows that
\[ \pi(x, a) = \frac{1}{\mathcal{Z}(\beta)} \, \frac{1}{\sqrt{2 \pi \sigma^2}} \exp {\left\{ - \frac{a^2}{2 \sigma^2} + \frac{\alpha}{\sigma^2} a \, {\left( \sum_i x_i \right)} \right\}} . \]
Averaging out the \(x_i \in \{-1, +1\}\), i.e. using \(\sum_{x_i = \pm 1} e^{c \, x_i} = 2 \cosh(c)\) for each spin, gives that the partition function reads
\[ \mathcal{Z}(\beta) \; = \; \frac{1}{\sqrt{2 \pi \sigma^2}} \, \int_{a = -\infty}^{a=\infty} \, \exp {\left\{ -\frac{a^2}{2 \sigma^2} + \textcolor{red}{N} \, \log[ 2 \, \cosh(\alpha a / \sigma^2)] \right\}} \, da . \]
In order to use the method of steepest descent, it would be useful to have an integrand of the type \(\exp[N \times (\ldots)]\). One can choose \(1/(2 \, \sigma^2) = \beta \, N\) and \(\alpha = 1/N\). This gives
\[ \mathcal{Z}(\beta) \; = \; \frac{1}{\sqrt{2 \pi \sigma^2}} \, \int_{a = -\infty}^{a=\infty} \, \exp {\left\{ \textcolor{red}{N} {\left[ -\beta \, a^2 + \log[ 2 \, \cosh(2 \beta a )] \right]} \right\}} \, da \]
from which one directly obtains that:
\[ \lim_{N \to \infty} \; -\frac{\log \mathcal{Z}(\beta)}{N} \; = \; \min_{a \in \mathbb{R}} \; \Big\{ \beta a^2 - \log[2 \, \cosh(2 \beta a)] \Big\}. \]
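As a quick numerical illustration (my own addition, not from the text), one can compare \(-\log \mathcal{Z}(\beta)/N\), computed exactly by summing over the magnetization, with the variational formula above; the Python sketch below uses a log-sum-exp for the exact sum and a crude grid search for the minimization.

```python
import numpy as np
from math import lgamma

def log_Z_exact(beta, N):
    """Brute-force log Z(beta) for the Curie-Weiss model: group the 2^N
    configurations by the number k of +1 spins, so that sum_i x_i = 2k - N
    and there are C(N, k) such configurations."""
    log_binom = lambda n, k: lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    log_terms = np.array([log_binom(N, k) + (beta / N) * (2 * k - N) ** 2
                          for k in range(N + 1)])
    m = log_terms.max()
    return m + np.log(np.exp(log_terms - m).sum())      # log-sum-exp

def free_energy_limit(beta):
    """min_a { beta a^2 - log[2 cosh(2 beta a)] } by a crude grid search."""
    a = np.linspace(-3.0, 3.0, 200001)
    return (beta * a ** 2 - np.log(2.0 * np.cosh(2.0 * beta * a))).min()

beta = 1.2
for N in (100, 1000, 10000):
    print(N, -log_Z_exact(beta, N) / N, free_energy_limit(beta))
```

The two values should approach each other as \(N\) grows, the gap vanishing in the limit \(N \to \infty\).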
Sherrington–Kirkpatrick model
Consider the distribution on \(x \in \{-1,1\}^N\)
\[ \pi(x) = \frac{1}{\mathcal{Z}(\beta)} \, \exp {\left\{ \frac 12 \, \sum_{i,j} W_{ij} x_i x_j \right\}} = \frac{1}{\mathcal{Z}(\beta)} \, \exp {\left\{ \frac 12 \, \left< x, W x \right> \right\}} \]
where the \(W_{ij}\) are some fixed weights with \(W_{ij} = W_{ji}\). We assume that the matrix \(W = [W_{ij}]_{ij}\) is positive definite: this can be achieved by adding \(\lambda \, I_N\) to it if necessary, which does not change the distribution \(\pi\) since \(x_i^2 = 1\). As described in (Zhang et al. 2012), although I would not be surprised if similar ideas appeared in the physics literature much earlier, one can introduce a Gaussian auxiliary random variable \(a = (a_1, \ldots, a_N)\) so that \(\pi(a | x)\) has mean \(Fx\) and covariance \(\Gamma\). In other words,
\[ \pi(a | x) = \frac{1}{(2 \pi)^{N/2} \, |\Gamma|^{1/2}} \, \exp {\left\{ -\frac 12 \left< (a - Fx), \Gamma^{-1} (a - Fx) \right> \right\}} . \]
In order to cancel out the term \(\left< x, W x \right>\), it suffices to make sure that \(F^\top \, \Gamma^{-1} \, F = W\). There are a number of possibilities, perhaps the simplest being
\[ (F,\Gamma) = (W^{1/2}, I_N) \quad \text{or} \quad (F,\Gamma) = (I, W^{-1}) \quad \text{or} \quad (F,\Gamma) = (W, W). \]
In any case, the joint distribution reads
\[ \pi(x,a) \; = \; \frac{1}{\mathcal{Z}} \, \frac{1}{(2 \pi)^{N/2} \, |\Gamma|^{1/2}} \, \exp {\left\{ -\frac{1}{2} \left< a, \Gamma^{-1} a \right> + \left< x, F^\top \Gamma^{-1} \, a \right> \right\}} . \]
One can implement a Gibbs-style update to explore this joint distribution, since both \(\pi(x|a)\) and \(\pi(a|x)\) are straightforward to sample from: this is closely related to Restricted Boltzmann Machine models (see the sketch at the end of this section). One can also average out the spins \(x_i \in \{-1,1\}\) and obtain that
\[ \mathcal{Z}= \int_{\mathbb{R}^N} \exp {\left\{ \sum_{i=1}^N \log {\left( 2 \, \cosh([F^\top \Gamma^{-1} \, a]_i) \right)} \right\}} \, \mathcal{D}_{\Gamma}(da) \]
where \(\mathcal{D}_{\Gamma}\) is the density of a centred Gaussian distribution with covariance \(\Gamma\). [TODO: add SMC experiments to estimate \(\mathcal{Z}\)].
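Below is a minimal Python sketch of the blocked Gibbs sampler alluded to above; it is my own illustration rather than the implementation of (Zhang et al. 2012), and it assumes the convenient choice \((F, \Gamma) = (I, W^{-1})\), so that \(F^\top \Gamma^{-1} = W\): the conditional \(\pi(a | x)\) is then Gaussian with mean \(x\) and covariance \(W^{-1}\), and given \(a\) the spins are independent with \(\pi(x_i = +1 \,|\, a) = (1 + e^{-2 [W a]_i})^{-1}\).

```python
import numpy as np

def sk_gibbs(W, n_iters, rng):
    """Blocked Gibbs sampler for pi(x, a) with (F, Gamma) = (I, W^{-1}).

    W   : symmetric positive-definite (N, N) matrix (add lambda * I beforehand if needed)
    rng : numpy.random.Generator
    """
    N = W.shape[0]
    L = np.linalg.cholesky(np.linalg.inv(W))      # square root of Gamma = W^{-1}
    x = rng.choice((-1, 1), size=N)
    samples = []
    for _ in range(n_iters):
        # a | x ~ N(F x, Gamma) = N(x, W^{-1})
        a = x + L @ rng.standard_normal(N)
        # x | a : independent coordinates with pi(x_i = +1 | a) = sigmoid(2 [W a]_i)
        h = W @ a
        x = np.where(rng.random(N) < 1.0 / (1.0 + np.exp(-2.0 * h)), 1, -1)
        samples.append(x)
    return np.array(samples)
```

In the same spirit, and while waiting for proper SMC experiments, the following naive Monte Carlo baseline (again an illustrative sketch of my own, not the SMC estimator of the TODO above) estimates \(\log \mathcal{Z}\) by drawing \(a \sim \mathcal{D}_{\Gamma}\) and averaging the integrand:

```python
import numpy as np

def log_Z_naive_mc(W, n_samples, rng):
    """Naive Monte Carlo estimate of log Z from the representation
    Z = E_{a ~ N(0, Gamma)}[ prod_i 2 cosh([F^T Gamma^{-1} a]_i) ],
    with (F, Gamma) = (I, W^{-1}) so that F^T Gamma^{-1} a = W a."""
    N = W.shape[0]
    L = np.linalg.cholesky(np.linalg.inv(W))            # Gamma^{1/2}
    a = rng.standard_normal((n_samples, N)) @ L.T       # rows are draws from N(0, Gamma)
    h = a @ W                                           # rows contain W a (W is symmetric)
    log_w = np.logaddexp(h, -h).sum(axis=1)             # sum_i log(2 cosh(h_i))
    m = log_w.max()
    return m + np.log(np.exp(log_w - m).mean())         # log of the empirical mean
```

One should expect the variance of this naive estimator to deteriorate quickly as \(N\) grows, which is exactly why a sequential Monte Carlo scheme would be the natural next step.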