Sparse GP

GP

Published: 18 04 2025 · Modified: 18 04 2025

These notes are mainly for my own reference; I’m pretty clueless about GPs at the moment, and that needs to change. Read at your own risk; typos and mistakes are likely.

Let \(f(\cdot) \sim \mathrm{GP}(m, K)\) denote a Gaussian Process (GP) prior with zero mean \(m=0\) and covariance kernel \(K\). Assume we observe \(n \gg 1\) noisy measurements \(y = (y_i)_{i=1}^n\) of \(f_i = f(x_i)\) at input locations \(x = (x_i)_{i=1}^n\). The goal is to compute the posterior of \(f = (f_i)_{i=1}^n\) and to infer GP hyperparameters. The main challenge with GP models is the cubic complexity of the matrix inversion required for many of the posterior computations.
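To make the cost concrete, here is a minimal NumPy sketch (my own illustration, not from the note) of the exact posterior mean under i.i.d. Gaussian observation noise, with a made-up RBF kernel and toy data; the Cholesky factorization of the \(n \times n\) matrix is the \(\mathcal{O}(n^3)\) bottleneck.

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0):
    # Toy squared-exponential kernel k(x, x') = exp(-(x - x')^2 / (2 l^2)).
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / lengthscale**2)

rng = np.random.default_rng(0)
n, sigma2 = 500, 0.1
x = np.sort(rng.uniform(-3.0, 3.0, n))
y = np.sin(2.0 * x) + np.sqrt(sigma2) * rng.normal(size=n)

# Exact posterior mean E[f | y] = K_f (K_f + sigma^2 I)^{-1} y.
# Factorizing the n x n matrix costs O(n^3); this is what sparse GPs avoid.
K_f = rbf_kernel(x, x)
L = np.linalg.cholesky(K_f + sigma2 * np.eye(n))
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
posterior_mean = K_f @ alpha
```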

Sparse GPs are a class of approaches that aim to reduce this complexity by approximating the full GP posterior with a smaller set of so-called inducing variables \(u=(u_1, \ldots, u_m)\) that entirely describe an approximate posterior distribution. Consider \(m \ll n\) locations \(z=(z_i)_{i=1}^m\), called inducing points, and set \(u_i = f(z_i)\), the latent function values at the inducing points. The Gaussian random variables \(u_i\) can be used as inducing variables; each choice of inducing points \(z\) defines a different set of inducing variables \(u\). In this setting, optimizing the inducing variables simply means optimizing the locations of the inducing points \(z\). The strategy is to approximate the posterior of \((u,f)\) with a tractable distribution \(q(u,f)\),

\[ p(u,f \mid y) \, = \, \frac{p(u) \, p(f \mid u) \, p(y \mid f)}{p(y)} \; \approx \; q(u,f). \]

Later, we will see that setting \(u_i = f(z_i)\) is not the only possible choice of inducing variables, but let’s stick with it for now. We have \(p(u) = \mathcal{N}(0,K_u)\) and \(p(f \mid u) = \mathcal{N}(\mu_{f|u}, K_{f|u})\) where

\[ \left\{ \begin{align*} \mu_{f|u} &= K_{fu} K_u^{-1} u\\ K_{f|u} &= K_f - K_{fu} K_u^{-1} K_{uf}. \end{align*} \right. \tag{1}\]

where \(K_u\) is the covariance matrix of the inducing variables \(u\) and \(K_{fu}\) is the covariance matrix between \(f\) and \(u\). Evaluating the density \(p(f \mid u)\) involves inverting \(K_{f|u} \in \mathbb{R}^{n \times n}\), which typically scales as \(\mathcal{O}(n^3)\) and is therefore intractable for large \(n\).
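As a minimal sketch (my own, with a toy RBF kernel and arbitrary inducing locations), the moments in Equation 1 only require inverting the small \(m \times m\) matrix \(K_u\):

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0):
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / lengthscale**2)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-3.0, 3.0, 200))         # n = 200 input locations
z = np.linspace(-3.0, 3.0, 10)                   # m = 10 inducing points, m << n

K_f = rbf_kernel(x, x)
K_u = rbf_kernel(z, z) + 1e-8 * np.eye(len(z))   # jitter for numerical stability
K_fu = rbf_kernel(x, z)

# Equation (1): moments of f | u; only the m x m matrix K_u is inverted.
A = np.linalg.solve(K_u, K_fu.T).T                    # A = K_fu K_u^{-1}
u = rng.multivariate_normal(np.zeros(len(z)), K_u)    # one draw of u from its prior
mu_f_given_u = A @ u                                  # K_fu K_u^{-1} u
K_f_given_u = K_f - A @ K_fu.T                        # K_f - K_fu K_u^{-1} K_uf
```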

To approximate \(p(u,f \mid y)\) with another distribution \(q(u,f)\), one can minimize the Kullback-Leibler divergence \(D_{\text{KL}}[q(u,f) \| p(u,f \mid y)]\), which equals

\[ \int q(u) \, q(f|u) \, \log {\left\{ \frac{q(u) \, q(f|u)}{p(u) \, \textcolor{red}{p(f \mid u)} \, p(y \mid f)} \right\}} \, du \, df \, + \, \log p(y). \]

This is not immediately helpful, since the intractable term \( \textcolor{red}{p(f \mid u)}\) is still present. However, (Titsias 2009) proposes to set \(q(f|u) = p(f|u)\), i.e. to consider an approximate posterior of the form:

\[ q(u,f) = q(u) \, p(f \mid u). \]

Note that the correct posterior distribution is typically not of this form, although the approximation becomes increasingly accurate as the number of inducing points grows. With this class of approximate posteriors, the expectation \(\mathbb{E}[\Phi(f) \mid y]\) of some functional \(\Phi\) is approximated as \(\int \mathbb{E}[\Phi(f) \mid u] \, q(u) \, du\). For example, if \(q(u) = \mathcal{N}(\mu_q, K_q)\) is a Gaussian variational distribution, the posterior distribution of \(f_\star = f(x_\star)\) at a new location \(x_\star\) is approximated as \(K_{\star,u} \, K_{u}^{-1} \mathcal{N}(\mu_q, K_q) + K_{\star|u}\) (in a slight abuse of notation); it is a Gaussian with

\[ \left\{ \begin{align*} \textrm{mean} &= K_{\star,u} \, K_{u}^{-1} \mu_q\\ \textrm{cov} &= K_{\star,u} \, K_{u}^{-1} \, K_q \, K_{u}^{-1} K_{u,\star} + K_{\star|u} \end{align*} \right. \]
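A hedged sketch of these predictive moments, assuming a toy RBF kernel and a user-supplied Gaussian variational distribution \(q(u) = \mathcal{N}(\mu_q, K_q)\) (the helper name `sparse_gp_predict` is mine):

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0):
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / lengthscale**2)

def sparse_gp_predict(x_star, z, mu_q, K_q, lengthscale=1.0):
    """Predictive moments at x_star under q(u, f) = q(u) p(f | u)."""
    K_u = rbf_kernel(z, z, lengthscale) + 1e-8 * np.eye(len(z))
    K_su = rbf_kernel(x_star, z, lengthscale)        # K_{*,u}
    K_ss = rbf_kernel(x_star, x_star, lengthscale)   # K_*

    A = np.linalg.solve(K_u, K_su.T).T               # K_{*,u} K_u^{-1}
    mean = A @ mu_q                                  # K_{*,u} K_u^{-1} mu_q
    K_star_given_u = K_ss - A @ K_su.T               # conditional covariance K_{*|u}
    cov = A @ K_q @ A.T + K_star_given_u             # propagate the variational covariance
    return mean, cov
```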

Optimizing the inducing variables is equivalent to minimizing the free energy quantity

\[ \mathcal{F}\; \equiv \; \int q(u) \, p(f|u) \, \log {\left\{ \frac{q(u)}{p(u) \, p(y \mid f)} \right\}} \, du \, df, \tag{2}\]

over the variational distribution \(q(u)\) and the choice of inducing variables. For a fixed set of inducing variables (e.g. a fixed set of inducing points), it is clear that the optimal variational distribution is given by

\[ \begin{align*} q_{\star}(u) &= p(u) \exp {\left\{ \int p(f|u) \, \log {\left( p(y \mid f) \right)} \, df \right\}} / \mathcal{Z}\\ &= p(u) \exp {\left\{ \mathbb{E}[\log p(y \mid f) \mid u] \right\}} / \mathcal{Z} \end{align*} \tag{3}\]

for some normalization constant \(\mathcal{Z}= \int p(u) \exp {\left\{ \mathbb{E}[\log p(y \mid f) \mid u] \right\}} \, du\); this can be seen by expressing Equation 2 as a KL divergence, similarly to what is done when deriving the Coordinate Ascent Variational Inference (CAVI) method,

\[ \mathcal{F}= D_{\text{KL}}[q(u) \mid q_{\star}(u)] \, - \, \log \mathcal{Z}. \tag{4}\]
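In more detail (a short derivation I am adding for completeness), integrate \(f\) out of Equation 2 and substitute the definition of \(q_{\star}(u)\) from Equation 3:

\[ \begin{align*} \mathcal{F} &= \int q(u) \, {\left\{ \log q(u) - \log p(u) - \mathbb{E}[\log p(y \mid f) \mid u] \right\}} \, du \\ &= \int q(u) \, \log {\left\{ \frac{q(u)}{p(u) \, \exp {\left\{ \mathbb{E}[\log p(y \mid f) \mid u] \right\}} } \right\}} \, du \\ &= \int q(u) \, \log {\left\{ \frac{q(u)}{\mathcal{Z}\, q_{\star}(u)} \right\}} \, du \; = \; D_{\text{KL}}[q(u) \mid q_{\star}(u)] \, - \, \log \mathcal{Z}. \end{align*} \]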

Equation 3 shows that \(q_{\star}(u)\) is the prior \(p(u)\) weighted by a term that is large when the observations \(y\) are likely given \(u\), i.e. when \(\mathbb{E}[\log p(y \mid f) \mid u]\) is large.

Nystrom approximation

Before describing the simplest and most important case of additive Gaussian noise, let’s give a brief reminder on the Nystrom approximation. The distribution of \(f \mid u\) is Gaussian with mean \(\mu_{f|u} = K_{fu} K_u^{-1} u\) and covariance \(K_{f|u}\). This means that \(K_{fu} K_u^{-1} u + \mathcal{N}(0, K_{f|u})\) has the same distribution as the unconditional \(f \sim \mathcal{N}(0, K_f)\). In particular:

\[ \begin{align*} K_f &= K_{fu} K_u^{-1} K_u K_u^{-1} K_{uf} + K_{f|u} \\ &= \textcolor{red}{K_{fu} K_u^{-1} K_{uf}} + K_{f|u} \\ &= \textcolor{blue}{\widehat{K}^{u}_{f}} + K_{f|u} \end{align*} \]

where \( \textcolor{blue}{\widehat{K}^{u}_{f}} \equiv K_{fu} K_u^{-1} K_{uf}\) is the so-called Nystrom approximation of the covariance matrix \(K_f\) based on the inducing variable \(u\). This shows that the Nystrom approximation \(\widehat{K}^{u}_{f}\) simply consists in ignoring the conditional variance term \(K_{f|u}\), and is thus an underestimate of the covariance matrix \(K_f\). Furthermore, if \(u\) is very informative of \(f\), then \(K_{f|u}\) is small and the Nystrom approximation is accurate.
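A quick numerical check of this claim (a sketch with a toy RBF kernel): the residual \(K_f - \widehat{K}^u_f\) equals \(K_{f|u}\) and is therefore positive semi-definite.

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0):
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / lengthscale**2)

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(-3.0, 3.0, 300))
z = np.linspace(-3.0, 3.0, 15)

K_f = rbf_kernel(x, x)
K_u = rbf_kernel(z, z) + 1e-8 * np.eye(len(z))
K_fu = rbf_kernel(x, z)

# Nystrom approximation of K_f based on the inducing variables u.
K_nystrom = K_fu @ np.linalg.solve(K_u, K_fu.T)

# The residual is the conditional covariance K_{f|u}: positive semi-definite,
# so the Nystrom matrix underestimates K_f (up to floating-point rounding).
residual = K_f - K_nystrom
print(np.linalg.eigvalsh(residual).min())
```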

Observations with additive Gaussian noise

The case of additive Gaussian noise is particularly simple. Assume that

\[ y_i = f_i + \varepsilon_i \qquad \text{with} \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2) \]

where the noise terms \(\varepsilon_i\) are independent. Since \(f|u \sim \mathcal{N}(\mu_{f|u}, K_{f|u})\), algebra gives that

\[\mathbb{E}[\log p(y \mid f) \mid u] = \log[ \mathcal{N}(y; \mu_{f|u}, \sigma^2 \, I) ] - \frac{1}{2 \sigma^2} \, \mathop{\mathrm{Tr}}( K_{f|u} )\]
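A quick Monte Carlo sanity check of this identity (my own illustration, with arbitrary made-up conditional moments standing in for \(\mu_{f|u}\) and \(K_{f|u}\)):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(4)
n, sigma2 = 5, 0.1
y = rng.normal(size=n)
mu_f_given_u = rng.normal(size=n)              # stand-in for K_fu K_u^{-1} u
A = rng.normal(size=(n, n))
K_f_given_u = A @ A.T + 1e-6 * np.eye(n)       # stand-in conditional covariance

# Closed form: log N(y; mu_{f|u}, sigma^2 I) - Tr(K_{f|u}) / (2 sigma^2).
closed_form = (multivariate_normal.logpdf(y, mean=mu_f_given_u, cov=sigma2 * np.eye(n))
               - np.trace(K_f_given_u) / (2.0 * sigma2))

# Monte Carlo estimate of E[log p(y | f) | u] with f ~ N(mu_{f|u}, K_{f|u}).
f = rng.multivariate_normal(mu_f_given_u, K_f_given_u, size=200_000)
log_lik = -0.5 * n * np.log(2.0 * np.pi * sigma2) \
          - 0.5 * np.sum((y - f) ** 2, axis=1) / sigma2
print(closed_form, log_lik.mean())             # should agree up to Monte Carlo error
```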

Using that \(\mu_{f|u} = K_{fu} K_u^{-1} u\) and the matrix inversion lemma, it quickly follows that the optimal variational distribution is \(q_{\star}(u) = \mathcal{N}(\mu_{\star}, K_{\star})\) with

\[ \left\{ \begin{aligned} \mu_{\star} &= K_{uf} \, {\left( \widehat{K}^{u}_{f} + \sigma^2 I \right)} ^{-1} \, y\\ K_{\star} &= K_u - K_{uf} \, {\left( \widehat{K}^{u}_{f} + \sigma^2 I \right)} ^{-1} \, K_{fu}. \end{aligned} \right. \]

Indeed, these are approximations of the exact conditional moments,

\[ \left\{ \begin{aligned} \mu_{u|y} &= K_{uf} \, {\left( K_{f} + \sigma^2 I \right)} ^{-1} \, y\\ K_{u|y} &= K_u - K_{uf} \, {\left( K_{f} + \sigma^2 I \right)} ^{-1} \, K_{fu}. \end{aligned} \right. \]

where the Nystrom approximation \(\widehat{K}^{u}_{f} \approx K_f\) is used in place of \(K_f\). One then finds that \(\log \mathcal{Z}= \log \mathcal{N}(y; 0, \widehat{K}^u_f + \sigma^2 I) - \frac{1}{2\sigma^2} \mathop{\mathrm{Tr}}(K_{f|u})\).
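A direct sketch of these optimal moments with a toy RBF kernel (for clarity the \(n \times n\) system is solved naively; in practice one would use the matrix inversion lemma to keep the cost at \(\mathcal{O}(n m^2)\)):

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0):
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / lengthscale**2)

rng = np.random.default_rng(2)
n, m, sigma2 = 400, 12, 0.05
x = np.sort(rng.uniform(-3.0, 3.0, n))
y = np.sin(2.0 * x) + np.sqrt(sigma2) * rng.normal(size=n)
z = np.linspace(-3.0, 3.0, m)

K_u = rbf_kernel(z, z) + 1e-8 * np.eye(m)
K_fu = rbf_kernel(x, z)
K_uf = K_fu.T

# Nystrom matrix and the shared n x n linear system.
K_nystrom = K_fu @ np.linalg.solve(K_u, K_uf)
B = K_nystrom + sigma2 * np.eye(n)

# Optimal variational distribution q_*(u) = N(mu_star, K_star).
mu_star = K_uf @ np.linalg.solve(B, y)
K_star = K_u - K_uf @ np.linalg.solve(B, K_fu)
```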

With the optimal variational distribution \(q_\star(u)\), Equation 4 gives that the free energy is

\[ \mathcal{F} = -\log \mathcal{N} {\left( y; 0, \; \widehat{K}^{u}_{f} + \sigma^2 I \right)} + \frac{1}{2\sigma^2} \mathop{\mathrm{Tr}}K_{f|u}. \]

Furthermore, note that the exact likelihood of the observations is \(p(y) = \mathcal{N} {\left( y; 0, \; K_f + \sigma^2 I \right)} \), so that the free energy can be expressed as

\[ \mathcal{F} \; = \; -\log \widehat{p}^u(y) + \frac{1}{2\sigma^2} \mathop{\mathrm{Tr}}K_{f|u} \]

for the pseudo-likelihood \(\widehat{p}^u(y) = \mathcal{N} {\left( y; 0, \; \widehat{K}^{u}_{f} + \sigma^2 I \right)} \). This shows that the KL divergence to the exact posterior is given by: \[ \begin{align*} D_{\text{KL}}&[q(u,f) \mid p(u,f \mid y)] = \mathcal{F}+ \log p(y) \\ &= \log \frac{p(y)}{\widehat{p}^u(y)} + \frac{1}{2\sigma^2} \mathop{\mathrm{Tr}}K_{f|u}. \end{align*} \]

The term \(\mathop{\mathrm{Tr}}K_{f|u}\) is just the sum of the conditional variances \(\mathop{\mathrm{Var}} {\left( f_i | u \right)} \) and can be thought of as a regularization term,

\[ R = \frac{1}{2\sigma^2} \mathop{\mathrm{Tr}}K_{f|u} = \frac12 \, \sum_{i=1}^n \frac{\mathop{\mathrm{Var}} {\left( f_i | u \right)} }{\sigma^2}. \]
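Putting the pieces together, here is a sketch of the collapsed objective \(\mathcal{F}\) for a toy RBF kernel (the helper name `free_energy` is mine); minimizing it over the inducing locations \(z\) and, typically, the kernel hyperparameters and \(\sigma^2\), is the sparse GP training problem.

```python
import numpy as np
from scipy.stats import multivariate_normal

def rbf_kernel(x1, x2, lengthscale=1.0):
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / lengthscale**2)

def free_energy(x, y, z, sigma2, lengthscale=1.0):
    # F = -log N(y; 0, \hat{K}^u_f + sigma^2 I) + Tr(K_{f|u}) / (2 sigma^2).
    n, m = len(x), len(z)
    K_u = rbf_kernel(z, z, lengthscale) + 1e-8 * np.eye(m)
    K_fu = rbf_kernel(x, z, lengthscale)
    K_nystrom = K_fu @ np.linalg.solve(K_u, K_fu.T)

    pseudo_loglik = multivariate_normal.logpdf(
        y, mean=np.zeros(n), cov=K_nystrom + sigma2 * np.eye(n))
    # Tr(K_{f|u}) = sum_i [k(x_i, x_i) - \hat{K}^u_f[i, i]]; here k(x, x) = 1.
    trace_term = np.sum(1.0 - np.diag(K_nystrom))
    return -pseudo_loglik + trace_term / (2.0 * sigma2)
```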

As the number of inducing variables \(m\) increases, the pseudo-likelihood becomes more accurate, \(\widehat{p}^u(y) \to p(y)\), the conditional variances \(\mathop{\mathrm{Var}} {\left( f_i | u \right)} \) shrink to zero, and the KL divergence \(D_{\text{KL}}[q(u,f) \mid p(u,f \mid y)]\) approaches zero.

The animation at the start of this note illustrates the effect of optimizing the locations of the inducing points \(z\) with a very simple gradient descent. A few experiments show that it is worth being careful with the initial choice of inducing points: inducing points chosen very far from the data essentially remain fixed during the optimization (i.e. the gradient is very small). Initializing with k-means++ clustering of the data points seems to be a robust strategy and gives an almost optimal choice of inducing points.
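For reference, a minimal sketch of that initialization with scikit-learn’s k-means++ (the gradient descent refining \(z\) by minimizing \(\mathcal{F}\) is omitted):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
x = rng.uniform(-3.0, 3.0, 1000)     # toy 1-d training inputs
m = 20                               # number of inducing points

# k-means++ seeded clustering of the inputs; the cluster centres serve as
# the initial inducing point locations z, later refined by gradient descent.
km = KMeans(n_clusters=m, init="k-means++", n_init=10, random_state=0)
km.fit(x.reshape(-1, 1))
z_init = np.sort(km.cluster_centers_.ravel())
```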

References

Titsias, Michalis. 2009. “Variational Learning of Inducing Variables in Sparse Gaussian Processes.” In Artificial Intelligence and Statistics, 567–74. PMLR.