Sparse GP

GP
Published: 18 04 2025

Modified: 18 04 2025

These notes are mainly for my own reference; I’m pretty clueless about GPs at the moment, and that needs to change. Read at your own risk; typos and mistakes are likely.

Let $f(\cdot) \sim \mathcal{GP}(m, K)$ denote a Gaussian Process (GP) prior with zero mean $m = 0$ and covariance kernel $K$. Assume we observe $n \gg 1$ noisy measurements $y = (y_i)_{i=1}^n$ of $f_i = f(x_i)$ at input locations $x = (x_i)_{i=1}^n$. The goal is to compute the posterior of $f = (f_i)_{i=1}^n$ and to infer the GP hyperparameters. The main challenge with GP models is the cubic complexity of the matrix inversions required for many of the posterior computations.

Sparse GPs are a class of approaches that aim to reduce this complexity by approximating the full GP posterior with a smaller set of so-called inducing variables $u = (u_1, \ldots, u_m)$ that entirely describe an approximate posterior distribution. Consider $m \ll n$ locations $z = (z_i)_{i=1}^m$ called inducing points and set $u_i = f(z_i)$, the latent function values at the inducing points. The Gaussian random variables $u_i$ can be used as inducing random variables; each choice of inducing points $z$ defines a different set of inducing variables $u$. In this setting, optimizing the inducing variables simply means optimizing the locations of the inducing points $z$. The strategy is to approximate the posterior of $(u, f)$ with a tractable distribution $q(u, f)$,

$$
p(u, f \mid y) = \frac{p(u)\, p(f \mid u)\, p(y \mid f)}{p(y)} \approx q(u, f).
$$

Later, we will see that setting $u_i = f(z_i)$ is not the only possible choice of inducing variables, but let us stick with this choice for now. We have $p(u) = \mathcal{N}(0, K_u)$ and $p(f \mid u) = \mathcal{N}(\mu_{f|u}, K_{f|u})$ where

$$
\begin{cases}
\mu_{f|u} = K_{fu} K_u^{-1} u\\[2pt]
K_{f|u} = K_f - K_{fu} K_u^{-1} K_{uf},
\end{cases}
\tag{1}
$$

where $K_u$ is the covariance matrix of the inducing variables $u$ and $K_{fu}$ is the covariance matrix between $f$ and $u$. Evaluating $p(f \mid u)$ involves inverting $K_{f|u} \in \mathbb{R}^{n \times n}$, which typically scales as $\mathcal{O}(n^3)$ and is hence intractable for large $n$. To approximate $p(u, f \mid y)$ with another distribution $q(u, f)$, one can minimize the Kullback-Leibler divergence $D_{\mathrm{KL}}[q(u, f) \,\|\, p(u, f \mid y)]$, i.e.

$$
\int q(u)\, q(f \mid u) \log\left\{\frac{q(u)\, q(f \mid u)}{p(u)\, p(f \mid u)\, p(y \mid f)}\right\} \mathrm{d}u\, \mathrm{d}f + \log p(y).
$$

This is not extremely helpful yet, since the intractable term $p(f \mid u)$ is still present. However, Titsias (2009) proposes to set $q(f \mid u) = p(f \mid u)$, i.e. to consider an approximate posterior of the form:

$$
q(u, f) = q(u)\, p(f \mid u).
$$

Note that the correct posterior distribution is typically not of this form, although when the number of inducing points is large enough, this approximation becomes increasingly accurate. With this class of approximate posteriors, the expectation $\mathbb{E}[\Phi(f) \mid y]$ of some functional $\Phi$ is approximated as $\int \mathbb{E}[\Phi(f) \mid u]\, q(u)\, \mathrm{d}u$. For example, if $q(u) = \mathcal{N}(\mu_q, K_q)$ is a Gaussian variational distribution, the posterior distribution of $f_\star = f(x_\star)$ at a new location $x_\star$ is approximated as $K_{\star u} K_u^{-1}\, \mathcal{N}(\mu_q, K_q) + K_{\star|u}$; it is a Gaussian with

$$
\begin{cases}
\text{mean} = K_{\star u} K_u^{-1} \mu_q\\[2pt]
\text{cov} = K_{\star u} K_u^{-1} K_q K_u^{-1} K_{u \star} + K_{\star|u}.
\end{cases}
$$
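
To make these formulas concrete, here is a minimal NumPy sketch of the approximate predictive moments. The squared-exponential kernel `rbf`, the toy inducing locations `z`, and the variational moments `mu_q`, `K_q` are placeholder assumptions of mine, not quantities defined in this note.

```python
import numpy as np

def rbf(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix k(A, B) (assumed kernel choice)."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def sparse_gp_predict(x_star, z, mu_q, K_q, jitter=1e-8):
    """Approximate posterior mean and covariance of f(x_star) given q(u) = N(mu_q, K_q)."""
    m = len(z)
    K_u = rbf(z, z) + jitter * np.eye(m)         # K_u (jitter for numerical stability)
    K_su = rbf(x_star, z)                        # K_{*,u}
    K_ss = rbf(x_star, x_star)                   # K_*
    A = K_su @ np.linalg.solve(K_u, np.eye(m))   # K_{*,u} K_u^{-1}
    mean = A @ mu_q                              # K_{*,u} K_u^{-1} mu_q
    cov = A @ K_q @ A.T + (K_ss - A @ K_su.T)    # ... + K_{*|u}
    return mean, cov

# Toy usage with arbitrary, made-up numbers:
z = np.linspace(0.0, 1.0, 5)[:, None]
mu_q, K_q = np.zeros(5), 0.1 * np.eye(5)
mean, cov = sparse_gp_predict(np.linspace(0.0, 1.0, 50)[:, None], z, mu_q, K_q)
```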

Optimizing the inducing variables is equivalent to minimizing the free energy quantity

$$
F \equiv \int q(u)\, p(f \mid u) \log\left\{\frac{q(u)}{p(u)\, p(y \mid f)}\right\} \mathrm{d}u\, \mathrm{d}f,
\tag{2}
$$

over the variational distribution $q(u)$ and the choice of inducing variables. For a fixed set of inducing variables (e.g. a fixed set of inducing points), it is clear that the optimal variational distribution is given by

$$
q^\star(u) = p(u) \exp\left\{\int p(f \mid u) \log p(y \mid f)\, \mathrm{d}f\right\} / Z = p(u) \exp\left\{\mathbb{E}[\log p(y \mid f) \mid u]\right\} / Z
\tag{3}
$$

for the normalization constant $Z = \int p(u) \exp\{\mathbb{E}[\log p(y \mid f) \mid u]\}\, \mathrm{d}u$; this can be seen by expressing Equation 2 as a KL divergence, similarly to the derivation of the Coordinate Ascent Variational Inference (CAVI) method,

$$
F = D_{\mathrm{KL}}[q(u) \,\|\, q^\star(u)] - \log Z.
\tag{4}
$$

Equation 3 shows that $q^\star(u)$ is the prior $p(u)$ weighted by a term that is large when the observations $y$ are likely given $u$, i.e. when $\mathbb{E}[\log p(y \mid f) \mid u]$ is large.

Nyström approximation

Before describing the simplest and most important case of additive Gaussian noise, let's give a brief reminder on the Nyström approximation. The distribution of $f \mid u$ is Gaussian with mean $\mu_{f|u} = K_{fu} K_u^{-1} u$ and covariance $K_{f|u}$. This means that $K_{fu} K_u^{-1} u + \mathcal{N}(0, K_{f|u})$ has the same distribution as the unconditional $f \sim \mathcal{N}(0, K_f)$. In particular:

$$
K_f = K_{fu} K_u^{-1} K_u K_u^{-1} K_{uf} + K_{f|u} = K_{fu} K_u^{-1} K_{uf} + K_{f|u} = \widehat{K}_f^u + K_{f|u},
$$

where $\widehat{K}_f^u \equiv K_{fu} K_u^{-1} K_{uf}$ is the so-called Nyström approximation of the covariance matrix $K_f$ based on the inducing variables $u$. This shows that the Nyström approximation $\widehat{K}_f^u$ simply amounts to ignoring the conditional covariance term $K_{f|u}$, and is thus an underestimate of the covariance matrix $K_f$ (the difference $K_f - \widehat{K}_f^u = K_{f|u}$ is positive semi-definite). Furthermore, if $u$ is very informative about $f$, then $K_{f|u}$ is small and the Nyström approximation is accurate.
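
A quick numerical illustration (my own toy example, assuming an RBF kernel): the Nyström matrix can be formed directly and its residual inspected.

```python
import numpy as np

def rbf(A, B, lengthscale=0.2):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / lengthscale**2)

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=(200, 1))              # input locations
z = np.linspace(0, 1, 15)[:, None]                # inducing points

K_f = rbf(x, x)
K_fu = rbf(x, z)
K_u = rbf(z, z) + 1e-8 * np.eye(len(z))           # jitter for numerical stability

K_nystrom = K_fu @ np.linalg.solve(K_u, K_fu.T)   # \hat{K}_f^u = K_fu K_u^{-1} K_uf
K_f_given_u = K_f - K_nystrom                     # conditional covariance K_{f|u}

# The residual K_{f|u} is positive semi-definite (up to round-off), so the
# Nyström matrix underestimates K_f; its trace reappears below as the
# regularization term.
print(np.linalg.eigvalsh(K_f_given_u).min())      # close to zero
print(np.trace(K_f_given_u))
```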

Observations with additive Gaussian noise

The case of additive Gaussian noise is particularly simple. Assume that

$$
y_i = f_i + \varepsilon_i \qquad \text{with} \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2),
$$

where the noise terms $\varepsilon_i$ are independent. Since $f \mid u \sim \mathcal{N}(\mu_{f|u}, K_{f|u})$, a bit of algebra gives that

$$
\mathbb{E}[\log p(y \mid f) \mid u] = \log\left[\mathcal{N}(y; \mu_{f|u}, \sigma^2 I)\right] - \frac{1}{2\sigma^2} \operatorname{Tr}(K_{f|u}).
$$
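
As a sanity check, this identity can be verified by Monte Carlo on a toy example; the RBF kernel and the arbitrary values of $u$, $y$ and $\sigma^2$ below are my own placeholders.

```python
import numpy as np

def rbf(A, B, lengthscale=0.5):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / lengthscale**2)

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=(30, 1))                 # input locations
z = np.linspace(0, 1, 5)[:, None]                   # inducing points
u = rng.normal(size=5)                              # an arbitrary value of u
y = rng.normal(size=30)                             # arbitrary observations
sigma2 = 0.25                                       # noise variance

K_u = rbf(z, z) + 1e-8 * np.eye(5)
K_fu = rbf(x, z)
mu = K_fu @ np.linalg.solve(K_u, u)                          # mu_{f|u}
K_cond = rbf(x, x) - K_fu @ np.linalg.solve(K_u, K_fu.T)     # K_{f|u}

def log_lik(f):                                     # log N(y; f, sigma^2 I)
    return -0.5 * np.sum((y - f) ** 2) / sigma2 - 0.5 * len(y) * np.log(2 * np.pi * sigma2)

L = np.linalg.cholesky(K_cond + 1e-8 * np.eye(30))
samples = mu + (L @ rng.normal(size=(30, 20000))).T          # draws of f | u
monte_carlo = np.mean([log_lik(f) for f in samples])
closed_form = log_lik(mu) - np.trace(K_cond) / (2 * sigma2)
print(monte_carlo, closed_form)                     # the two values should roughly agree
```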

Using $\mu_{f|u} = K_{fu} K_u^{-1} u$ and the matrix inversion lemma, it quickly follows that the optimal variational distribution is $q^\star(u) = \mathcal{N}(\mu^\star, K^\star)$ with

$$
\begin{cases}
\mu^\star = K_{uf} (\widehat{K}_f^u + \sigma^2 I)^{-1} y\\[2pt]
K^\star = K_u - K_{uf} (\widehat{K}_f^u + \sigma^2 I)^{-1} K_{fu}.
\end{cases}
$$
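
Here is a small NumPy sketch of these optimal moments on synthetic data (the kernel, data set and noise level are made up for illustration). It uses the matrix-inversion-lemma rewriting of the formulas above, which only requires solving $m \times m$ systems, so the cost is $\mathcal{O}(nm^2)$ rather than $\mathcal{O}(n^3)$.

```python
import numpy as np

def rbf(A, B, lengthscale=0.2):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / lengthscale**2)

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=(500, 1))                           # toy inputs
y = np.sin(2 * np.pi * x[:, 0]) + 0.1 * rng.normal(size=500)   # toy observations
z = np.linspace(0, 1, 15)[:, None]                             # inducing points
sigma2 = 0.1 ** 2                                              # noise variance

K_u = rbf(z, z) + 1e-8 * np.eye(len(z))
K_uf = rbf(z, x)                                               # m x n cross-covariance

# mu* = sigma^{-2} K_u Sigma^{-1} K_uf y  and  K* = K_u Sigma^{-1} K_u,
# with Sigma = K_u + sigma^{-2} K_uf K_fu; this is the matrix-inversion-lemma
# form of the expressions above and only involves m x m solves.
Sigma = K_u + (K_uf @ K_uf.T) / sigma2
mu_star = K_u @ np.linalg.solve(Sigma, K_uf @ y) / sigma2
K_star = K_u @ np.linalg.solve(Sigma, K_u)
```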

Indeed, these are approximations of the exact conditional moments,

$$
\begin{cases}
\mu_{u|y} = K_{uf} (K_f + \sigma^2 I)^{-1} y\\[2pt]
K_{u|y} = K_u - K_{uf} (K_f + \sigma^2 I)^{-1} K_{fu},
\end{cases}
$$

where the Nyström approximation $\widehat{K}_f^u \approx K_f$ is used instead. One then finds that $\log Z = \log \mathcal{N}(y; 0, \widehat{K}_f^u + \sigma^2 I) - \frac{1}{2\sigma^2} \operatorname{Tr}(K_{f|u})$. With the optimal variational distribution $q^\star(u)$, Equation 4 gives that the free energy is:

$$
F = -\log \mathcal{N}(y; 0, \widehat{K}_f^u + \sigma^2 I) + \frac{1}{2\sigma^2} \operatorname{Tr} K_{f|u}.
$$

Furthermore, note that the exact likelihood of the observations is $p(y) = \mathcal{N}(y; 0, K_f + \sigma^2 I)$, so that the free energy can be expressed as

$$
F = -\log \widehat{p}_u(y) + \frac{1}{2\sigma^2} \operatorname{Tr} K_{f|u},
$$

for the pseudo-likelihood $\widehat{p}_u(y) = \mathcal{N}(y; 0, \widehat{K}_f^u + \sigma^2 I)$. This shows that the KL divergence is given by:
$$
D_{\mathrm{KL}}[q(u, f) \,\|\, p(u, f \mid y)] = F + \log p(y) = \log \frac{p(y)}{\widehat{p}_u(y)} + \frac{1}{2\sigma^2} \operatorname{Tr} K_{f|u}.
$$

The term $\operatorname{Tr} K_{f|u}$ is just the sum of the conditional variances $\operatorname{Var}(f_i \mid u)$ and can be thought of as a regularization term,

$$
R = \frac{1}{2\sigma^2} \operatorname{Tr} K_{f|u} = \frac{1}{2} \sum_{i=1}^n \frac{\operatorname{Var}(f_i \mid u)}{\sigma^2}.
$$

As the number of inducing variables $m$ increases, the pseudo-likelihood becomes more accurate, $\widehat{p}_u(y) \approx p(y)$, the conditional variances $\operatorname{Var}(f_i \mid u) \to 0$ shrink to zero, and the KL divergence $D_{\mathrm{KL}}[q(u, f) \,\|\, p(u, f \mid y)]$ approaches zero.
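
The free energy above can be computed directly on a toy problem. The sketch below (my own, with a made-up RBF kernel, data set and noise level) forms the $n \times n$ matrices explicitly for clarity, which an efficient implementation would of course avoid.

```python
import numpy as np

def rbf(A, B, lengthscale=0.2):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / lengthscale**2)

def free_energy(x, y, z, sigma2, jitter=1e-8):
    """F = -log N(y; 0, K_hat + sigma^2 I) + Tr(K_{f|u}) / (2 sigma^2)."""
    n, m = len(x), len(z)
    K_f = rbf(x, x)
    K_fu = rbf(x, z)
    K_u = rbf(z, z) + jitter * np.eye(m)
    K_nystrom = K_fu @ np.linalg.solve(K_u, K_fu.T)          # \hat{K}_f^u
    reg = np.trace(K_f - K_nystrom) / (2 * sigma2)           # regularization term R
    # log N(y; 0, K_nystrom + sigma^2 I) via a Cholesky factorization
    L = np.linalg.cholesky(K_nystrom + sigma2 * np.eye(n))
    alpha = np.linalg.solve(L, y)
    log_pseudo = -0.5 * alpha @ alpha - np.log(np.diag(L)).sum() - 0.5 * n * np.log(2 * np.pi)
    return -log_pseudo + reg

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, size=(150, 1))
y = np.sin(2 * np.pi * x[:, 0]) + 0.1 * rng.normal(size=150)
for m in (2, 5, 10, 20):                                     # F should shrink as m grows
    z = np.linspace(0, 1, m)[:, None]
    print(m, free_energy(x, y, z, sigma2=0.01))
```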

The animation at the start of this note illustrates the effect of optimizing the locations of the inducing points $z$ with a very simple gradient descent. A few experiments show that it is worth being careful with the initial choice of inducing points: inducing points chosen very far from the data essentially remain fixed during the optimization (i.e. the gradient is very small). Initializing with k-means++ clustering of the data points seems to be a robust strategy and gives an almost optimal choice of inducing points; a small sketch is given below.
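
For reference, here is a short sketch of the k-means++ initialization, assuming scikit-learn is available (my own illustration, not the code behind the animation).

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, size=(500, 2))            # toy 2-d input locations

m = 20                                           # number of inducing points
kmeans = KMeans(n_clusters=m, init="k-means++", n_init=10, random_state=0).fit(x)
z_init = kmeans.cluster_centers_                 # initial inducing points, shape (m, 2)
```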

References

Titsias, Michalis. 2009. “Variational Learning of Inducing Variables in Sparse Gaussian Processes.” In Artificial Intelligence and Statistics, 567–74. PMLR.