These notes are mainly for my own reference; I’m pretty clueless about GPs at the moment, and that needs to change. Read at your own risk; typos and mistakes are likely.
Let $f \sim \mathcal{GP}(0, k)$ denote a Gaussian Process (GP) prior with zero mean and covariance kernel $k(\cdot, \cdot)$. Assume we observe noisy measurements $y_{1:N}$ of $f$ at input locations $x_{1:N}$. The goal is to compute the posterior of $f$ and to infer GP hyperparameters. The main challenge with GP models is the cubic complexity $\mathcal{O}(N^3)$ of the matrix inversion required for many of the posterior computations.
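To make the cubic cost concrete, here is a minimal numpy sketch of exact GP regression; the RBF kernel, the `noise_var` default, and all names are illustrative assumptions rather than anything from this note. The Cholesky factorization of the $N \times N$ matrix $K_{ff} + \sigma^2 I$ is the $\mathcal{O}(N^3)$ step that sparse methods try to avoid.

```python
import numpy as np
from scipy.linalg import cholesky, cho_solve

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel k(a, b) = variance * exp(-||a - b||^2 / (2 * lengthscale^2))."""
    sq_dists = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)

def exact_gp_posterior(X, y, X_star, noise_var=0.1):
    """Exact GP posterior at X_star; the Cholesky of the N x N matrix is the O(N^3) bottleneck."""
    K_ff = rbf_kernel(X, X)
    K_sf = rbf_kernel(X_star, X)
    K_ss = rbf_kernel(X_star, X_star)
    L = cholesky(K_ff + noise_var * np.eye(len(X)), lower=True)   # O(N^3) step
    alpha = cho_solve((L, True), y)                               # (K_ff + sigma^2 I)^{-1} y
    mean = K_sf @ alpha
    cov = K_ss - K_sf @ cho_solve((L, True), K_sf.T)
    return mean, cov
```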
Sparse GPs are a class of approaches that aim to reduce this complexity by approximating the full GP posterior with a smaller set of so-called inducing variables that entirely describe an approximate posterior distribution. Consider $M \ll N$ locations $z_{1:M}$ called inducing points and set $u_m = f(z_m)$ for the latent function values at the inducing points. The Gaussian random variables $u = (u_1, \ldots, u_M)$ can be used as inducing variables; a different choice of inducing points $z_{1:M}$ defines a different set of inducing variables $u$. In this setting, optimizing the inducing variables simply means optimizing the locations of the inducing points $z_{1:M}$. The strategy is to approximate the posterior $p(f, u \mid y_{1:N})$ of the latent values $f = f(x_{1:N})$ and $u$ with a tractable distribution $q(f, u)$.
Later, we will see that setting $u_m = f(z_m)$ is not the only possible choice of inducing variables, but let's keep it this way for now. We have $u \sim \mathcal{N}(0, K_{uu})$ and $f \mid u \sim \mathcal{N}\big(\mu_{f \mid u}, \Sigma_{f \mid u}\big)$ where

$$
\mu_{f \mid u} \;=\; K_{fu}\, K_{uu}^{-1}\, u,
\qquad
\Sigma_{f \mid u} \;=\; K_{ff} - K_{fu}\, K_{uu}^{-1}\, K_{uf}.
$$
Here $K_{uu}$ is the covariance matrix of the inducing variables $u$ and $K_{fu} = K_{uf}^\top$ is the covariance matrix between $f$ and $u$ (similarly, $K_{ff}$ is the covariance matrix of $f$). Evaluating the exact posterior $p(f, u \mid y_{1:N})$ involves inverting an $N \times N$ covariance matrix, which typically scales as $\mathcal{O}(N^3)$ and is hence intractable for large $N$. To approximate $p(f, u \mid y_{1:N})$ with another distribution $q(f, u)$, one can minimize the Kullback-Leibler divergence $\mathrm{KL}\big(q(f, u) \,\|\, p(f, u \mid y_{1:N})\big)$, i.e.

$$
q^\star \;=\; \operatorname*{argmin}_{q}\; \mathbb{E}_{q}\!\left[ \log \frac{q(f, u)}{p(f, u \mid y_{1:N})} \right].
$$
This is not extremely helpful since the intractable posterior $p(f, u \mid y_{1:N})$ is still present. However, (Titsias 2009) proposes to set $q(f, u) = p(f \mid u)\, \phi(u)$, i.e. to consider an approximate posterior of the form:

$$
q(f, u) \;=\; p(f \mid u)\, \phi(u)
\tag{1}
$$

for some variational distribution $\phi(u)$ over the inducing variables.
Note that the correct posterior distribution is typically not of this form, although when the number of inducing points is large enough, this approximation becomes increasingly accurate. With this class of approximate posterior, the expectation of some functional $h(f)$ is approximated as $\mathbb{E}\big[h(f) \mid y_{1:N}\big] \approx \iint h(f)\, p(f \mid u)\, \phi(u)\, \mathrm{d}f\, \mathrm{d}u$. For example, if $\phi(u) = \mathcal{N}(u; m, S)$ is a Gaussian variational distribution, the posterior distribution of $f$ at a new location $x_\star$ is approximated as $q(f_\star) = \int p(f_\star \mid u)\, \phi(u)\, \mathrm{d}u$; it is a Gaussian with

$$
\mathbb{E}\big[f_\star\big] \;=\; K_{\star u}\, K_{uu}^{-1}\, m,
\qquad
\mathrm{Var}\big[f_\star\big] \;=\; K_{\star\star} \;-\; K_{\star u}\, K_{uu}^{-1}\,\big(K_{uu} - S\big)\, K_{uu}^{-1}\, K_{u\star},
$$

where $f_\star = f(x_\star)$, $K_{\star u}$ is the covariance between $f_\star$ and $u$, and $K_{\star\star} = k(x_\star, x_\star)$.
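As a sanity check on these formulas, here is a small numpy sketch of the approximate predictive moments for a Gaussian $\phi(u) = \mathcal{N}(m, S)$; the function name is illustrative and the `kernel` argument is assumed to behave like the `rbf_kernel` sketch above.

```python
import numpy as np

def sparse_gp_predict(X_star, Z, m, S, kernel):
    """Predictive mean/covariance of f at X_star under the integral of p(f*|u) N(u; m, S) du."""
    K_uu = kernel(Z, Z)
    K_su = kernel(X_star, Z)
    K_ss = kernel(X_star, X_star)
    A = np.linalg.solve(K_uu, K_su.T).T          # A = K_su K_uu^{-1}
    mean = A @ m
    cov = K_ss - A @ (K_uu - S) @ A.T            # K_ss - K_su K_uu^{-1} (K_uu - S) K_uu^{-1} K_us
    return mean, cov
```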
Optimizing the inducing variables is equivalent to minimizing the free energy

$$
\mathcal{F}\big(\phi, z_{1:M}\big)
\;=\;
-\,\mathbb{E}_{q(f, u)}\!\left[ \log \frac{p(y_{1:N} \mid f)\, p(f \mid u)\, p(u)}{q(f, u)} \right]
\;=\;
\mathrm{KL}\big(q(f, u) \,\big\|\, p(f, u \mid y_{1:N})\big) \;-\; \log p(y_{1:N})
\tag{2}
$$
over the variational distribution $\phi$ and the choice of inducing variables. For a fixed set of inducing variables (e.g. a fixed set of inducing points), it is clear that the optimal variational distribution is given by

$$
\phi^\star(u) \;=\; \frac{1}{Z}\; p(u)\, \exp\!\Big( \mathbb{E}_{p(f \mid u)}\big[ \log p(y_{1:N} \mid f) \big] \Big)
\tag{3}
$$
for some normalization constant $Z$; this can be seen by expressing Equation 2 as a KL divergence, as is similarly done, for example, when deriving the Coordinate Ascent Variational Inference (CAVI) method,

$$
\mathcal{F}\big(\phi, z_{1:M}\big) \;=\; \mathrm{KL}\big(\phi(u) \,\big\|\, \phi^\star(u)\big) \;-\; \log Z.
\tag{4}
$$
Equation 3 shows that $\phi^\star(u)$ is the prior $p(u)$ weighted by a term that is large when the observations are likely given $u$, i.e. when $\mathbb{E}_{p(f \mid u)}\big[\log p(y_{1:N} \mid f)\big]$ is large.
Nystrom approximation
Before describing the simple and most important case of additive Gaussian noise, let's give a brief reminder on the Nystrom approximation. The distribution of $f$ given $u$ is Gaussian with mean $K_{fu} K_{uu}^{-1} u$ and covariance $K_{ff} - K_{fu} K_{uu}^{-1} K_{uf}$. This means that $f = K_{fu} K_{uu}^{-1} u + \varepsilon$, with $\varepsilon \sim \mathcal{N}\big(0,\, K_{ff} - K_{fu} K_{uu}^{-1} K_{uf}\big)$ independent of $u$, is distributed as the unconditional distribution $\mathcal{N}(0, K_{ff})$. In particular:

$$
K_{ff}
\;=\;
\underbrace{K_{fu}\, K_{uu}^{-1}\, K_{uf}}_{\textstyle Q_{ff}}
\;+\;
\big(K_{ff} - K_{fu}\, K_{uu}^{-1}\, K_{uf}\big),
$$
where $Q_{ff} = K_{fu}\, K_{uu}^{-1}\, K_{uf}$ is the so-called Nystrom approximation of the covariance matrix $K_{ff}$ based on the inducing variables $u$. This shows that the Nystrom approximation simply consists in ignoring the conditional covariance term $K_{ff} - Q_{ff} = \mathrm{Cov}(f \mid u)$, and is thus an underestimate of the covariance matrix $K_{ff}$ (in the sense that $K_{ff} - Q_{ff}$ is positive semi-definite). Furthermore, if $u$ is very informative about $f$, then $\mathrm{Cov}(f \mid u)$ is small and the Nystrom approximation is accurate.
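A quick numerical illustration of the Nystrom approximation, reusing the illustrative `rbf_kernel` from above; the eigenvalue check simply confirms that $K_{ff} - Q_{ff}$, i.e. the conditional covariance $\mathrm{Cov}(f \mid u)$, is positive semi-definite.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(200, 1))   # training inputs
Z = rng.uniform(-3.0, 3.0, size=(15, 1))    # inducing point locations

K_ff = rbf_kernel(X, X)
K_uu = rbf_kernel(Z, Z) + 1e-8 * np.eye(len(Z))   # small jitter for numerical stability
K_fu = rbf_kernel(X, Z)

Q_ff = K_fu @ np.linalg.solve(K_uu, K_fu.T)       # Nystrom approximation of K_ff

# K_ff - Q_ff equals the conditional covariance Cov(f | u): its smallest
# eigenvalue should be non-negative up to numerical error.
print(np.linalg.eigvalsh(K_ff - Q_ff).min())
```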
Observation with additive Gaussian noise
The case of additive Gaussian noise is particularly simple. Assume that

$$
y_n \;=\; f(x_n) + \epsilon_n,
\qquad
\epsilon_n \sim \mathcal{N}(0, \sigma^2),
\qquad 1 \le n \le N,
$$
where the noise terms $\epsilon_{1:N}$ are independent. Since $p(y_{1:N} \mid f) = \prod_{n=1}^{N} \mathcal{N}\big(y_n;\, f(x_n),\, \sigma^2\big)$, algebra gives that

$$
\mathbb{E}_{p(f \mid u)}\big[\log p(y_{1:N} \mid f)\big]
\;=\;
\log \mathcal{N}\big(y_{1:N};\; K_{fu} K_{uu}^{-1} u,\; \sigma^2 I\big)
\;-\;
\frac{1}{2\sigma^2}\,\mathrm{Tr}\big(K_{ff} - Q_{ff}\big).
$$
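This identity is easy to check numerically. The sketch below (illustrative names, small jitter added for sampling, `rbf_kernel` from the earlier sketch) compares the closed form with a Monte Carlo average of $\log p(y_{1:N} \mid f)$ over $f \sim p(f \mid u)$.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
X = rng.uniform(-3.0, 3.0, size=(5, 1))   # keep N tiny so the Monte Carlo check is fast
Z = rng.uniform(-3.0, 3.0, size=(3, 1))
noise_var = 0.1

K_uu = rbf_kernel(Z, Z) + 1e-8 * np.eye(len(Z))
K_fu = rbf_kernel(X, Z)
K_ff = rbf_kernel(X, X)
A = np.linalg.solve(K_uu, K_fu.T).T            # K_fu K_uu^{-1}
Q_ff = A @ K_fu.T                              # Nystrom approximation

u = rng.multivariate_normal(np.zeros(len(Z)), K_uu)   # a draw of the inducing variables
y = rng.normal(size=len(X))                           # arbitrary observations for the check

# Closed form: log N(y; K_fu K_uu^{-1} u, sigma^2 I) - Tr(K_ff - Q_ff) / (2 sigma^2)
closed = multivariate_normal.logpdf(y, A @ u, noise_var * np.eye(len(X))) \
         - 0.5 * np.trace(K_ff - Q_ff) / noise_var

# Monte Carlo: sample f ~ p(f | u) and average log p(y | f)
f = rng.multivariate_normal(A @ u, K_ff - Q_ff + 1e-10 * np.eye(len(X)), size=200_000)
mc = np.mean(-0.5 * np.sum((y - f) ** 2, axis=1) / noise_var
             - 0.5 * len(X) * np.log(2 * np.pi * noise_var))

print(closed, mc)   # the two values should agree up to Monte Carlo error
```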
Using that $\phi^\star(u) \propto p(u)\, \exp\!\big( \mathbb{E}_{p(f \mid u)}[\log p(y_{1:N} \mid f)] \big)$ and the matrix inversion lemma, it quickly follows that the optimal variational distribution is Gaussian, $\phi^\star(u) = \mathcal{N}(u;\, m^\star,\, S^\star)$, with

$$
m^\star \;=\; K_{uf}\, \big(Q_{ff} + \sigma^2 I\big)^{-1}\, y_{1:N},
\qquad
S^\star \;=\; K_{uu} \;-\; K_{uf}\, \big(Q_{ff} + \sigma^2 I\big)^{-1}\, K_{fu}.
$$
Indeed, these are approximations of the exact conditional moments,

$$
\mathbb{E}\big[u \mid y_{1:N}\big] \;=\; K_{uf}\, \big(K_{ff} + \sigma^2 I\big)^{-1}\, y_{1:N},
\qquad
\mathrm{Cov}\big(u \mid y_{1:N}\big) \;=\; K_{uu} \;-\; K_{uf}\, \big(K_{ff} + \sigma^2 I\big)^{-1}\, K_{fu},
$$
where the Nystrom approximation $Q_{ff}$ is used instead of the full covariance matrix $K_{ff}$. One then finds that $Z = \mathcal{N}\big(y_{1:N};\, 0,\, Q_{ff} + \sigma^2 I\big)\, \exp\!\big(-\tfrac{1}{2\sigma^2}\, \mathrm{Tr}(K_{ff} - Q_{ff})\big)$. With the optimal variational distribution $\phi^\star$, Equation 4 gives that the free energy is:

$$
\mathcal{F}\big(\phi^\star, z_{1:M}\big)
\;=\;
-\log Z
\;=\;
-\log \mathcal{N}\big(y_{1:N};\; 0,\; Q_{ff} + \sigma^2 I\big)
\;+\;
\frac{1}{2\sigma^2}\, \mathrm{Tr}\big(K_{ff} - Q_{ff}\big).
$$
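For reference, here is a hedged numpy sketch of this collapsed free energy; it reuses the illustrative `rbf_kernel` above and evaluates the bound naively. In practice one would use the Woodbury and matrix determinant lemmas so that the evaluation costs $\mathcal{O}(N M^2)$ instead of $\mathcal{O}(N^3)$.

```python
import numpy as np
from scipy.stats import multivariate_normal

def collapsed_free_energy(X, y, Z, noise_var, kernel):
    """F(phi*, z) = -log N(y; 0, Q_ff + sigma^2 I) + Tr(K_ff - Q_ff) / (2 sigma^2).

    Naive O(N^3) evaluation for clarity; a practical implementation would use the
    Woodbury and matrix determinant lemmas to bring the cost down to O(N M^2).
    """
    K_uu = kernel(Z, Z) + 1e-8 * np.eye(len(Z))
    K_fu = kernel(X, Z)
    K_ff = kernel(X, X)
    Q_ff = K_fu @ np.linalg.solve(K_uu, K_fu.T)
    log_pseudo_lik = multivariate_normal.logpdf(y, np.zeros(len(X)), Q_ff + noise_var * np.eye(len(X)))
    regularizer = 0.5 * np.trace(K_ff - Q_ff) / noise_var
    return -log_pseudo_lik + regularizer
```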
Furthermore, note that the exact likelihood of the observations is $p(y_{1:N}) = \mathcal{N}\big(y_{1:N};\; 0,\; K_{ff} + \sigma^2 I\big)$, so that the free energy can be expressed as

$$
\mathcal{F}\big(\phi^\star, z_{1:M}\big)
\;=\;
-\log \widetilde{p}\big(y_{1:N}\big)
\;+\;
\frac{1}{2\sigma^2}\, \mathrm{Tr}\big(K_{ff} - Q_{ff}\big)
$$

for the pseudo-likelihood $\widetilde{p}(y_{1:N})$. This shows that $\widetilde{p}(y_{1:N})$ is given by:

$$
\widetilde{p}\big(y_{1:N}\big) \;=\; \mathcal{N}\big(y_{1:N};\; 0,\; Q_{ff} + \sigma^2 I\big),
$$

i.e. the exact likelihood with the covariance matrix $K_{ff}$ replaced by its Nystrom approximation $Q_{ff}$.
The term $\frac{1}{2\sigma^2}\, \mathrm{Tr}\big(K_{ff} - Q_{ff}\big)$ is just the (scaled) sum of the conditional variances and can be thought of as a regularization term,

$$
\frac{1}{2\sigma^2}\, \mathrm{Tr}\big(K_{ff} - Q_{ff}\big)
\;=\;
\frac{1}{2\sigma^2}\, \sum_{n=1}^{N} \mathrm{Var}\big(f(x_n) \mid u\big).
$$
As the number $M$ of inducing variables increases, the pseudo-likelihood $\widetilde{p}(y_{1:N})$ becomes a more accurate approximation of the exact likelihood $p(y_{1:N})$, the conditional variances shrink to zero, and the KL divergence $\mathrm{KL}\big(q(f, u) \,\|\, p(f, u \mid y_{1:N})\big)$ approaches zero.
The animation at the start of this note illustrates the effect of optimizing the locations of the inducing points with a very simple gradient descent. A few experiments show that it is worth being careful with the initial choice of inducing points. Inducing points chosen very far from the data essentially remain fixed during the optimization (i.e. the gradient is very small). Initializing with k-means++ clustering of the data points seems to be a robust strategy and gives an almost optimal choice of inducing points.
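A sketch of this initialization strategy with scikit-learn (the helper name and the choice of M are illustrative): the k-means++ seeded cluster centers are used as the initial inducing point locations, which gradient descent then refines.

```python
import numpy as np
from sklearn.cluster import KMeans

def init_inducing_points(X, M, seed=0):
    """Initialize M inducing point locations as k-means cluster centers (k-means++ seeding)."""
    km = KMeans(n_clusters=M, init="k-means++", n_init=10, random_state=seed).fit(X)
    return km.cluster_centers_

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1))          # training inputs
Z0 = init_inducing_points(X, M=15)     # initial inducing points, refined afterwards by gradient descent
```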
References
Titsias, Michalis. 2009. “Variational Learning of Inducing Variables in Sparse Gaussian Processes.” In Artificial Intelligence and Statistics, 567–74. PMLR.