Ensemble Kalman Updates
Assume a Gaussian prior distribution \(\pi_0 \equiv \mathcal{N}(m_0,P_0)\) and a noisy observation \(y_\star \in \mathbb{R}^{D_y}\) with
\[ y_\star = H x + \xi \qquad \text{with} \qquad \xi \sim \mathcal{N}(0,R) \]
where \(x \in \mathbb{R}^{D_x}\) is an unknown quantity of interest. The posterior distribution \(\pi \equiv \mathcal{N}(m,P)\) is Gaussian and is given by
\[ \begin{align} \left\{ \begin{aligned} m &= m_0 + P_0 H^\top {\left( H P_0 H^\top + R \right)} ^{-1}(y_\star - H m_0)\\ P &= P_0 - P_0 H^\top {\left( H P_0 H^\top + R \right)} ^{-1} \, H P_0, \end{aligned} \right. \end{align} \]
as follows from standard Gaussian conditioning. This can also be written as
\[ \begin{align} \left\{ \begin{aligned} m &= m_0 + K \, (y_\star - H m_0)\\ P &= (I - K \, H) P_0, \end{aligned} \right. \end{align} \]
for the Kalman gain matrix \(K \in \mathbb{R}^{D_x,D_y}\) defined as
\[ K \; = \; P_0 H^\top {\left( H P_0 H^\top + R \right)} ^{-1}. \tag{1}\]
This gain can also be expressed as
\[ K \; = \; \mathop{\mathrm{Cov}}(X, HX) \, \mathop{\mathrm{Var}}(Y)^{-1} \]
for \(X \sim \pi_0\) and \(Y = HX + \mathcal{N}(0,R)\); this point of view can be useful for establishing generalizations to non-linear scenarios. The important remark is that the posterior covariance matrix \(P \in \mathbb{R}^{D_x,D_x}\) and the posterior mean \(m \in \mathbb{R}^{D_x}\) can also be expressed as
\[ \begin{align} \left\{ \begin{aligned} m &= \textcolor{red}{(I - K \, H)} \, m_0 + \textcolor{blue}{K} \, y_\star\\ P &= \; \textcolor{red}{(I - K \, H)} \, P_0 \, \textcolor{red}{(I - K \, H)^\top} + \textcolor{blue}{K} R \textcolor{blue}{K^\top}. \end{aligned} \right. \end{align} \tag{2}\]
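As a quick sanity check that Equation 2 agrees with the earlier expressions: the formula for \(m\) is simply the update \(m = m_0 + K(y_\star - H m_0)\) with the terms regrouped, and for the covariance, Equation 1 gives \(K \left( H P_0 H^\top + R \right) = P_0 H^\top\), hence
\[ (I - K \, H) \, P_0 H^\top \;=\; P_0 H^\top - K \left( H P_0 H^\top + R \right) + K R \;=\; K R, \]
so that
\[ (I - K \, H) \, P_0 \, (I - K \, H)^\top + K R K^\top \;=\; (I - K \, H) \, P_0 - K R \, K^\top + K R K^\top \;=\; (I - K \, H) \, P_0 \;=\; P. \]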
Equation 2 shows that \(P\) is indeed positive semi-definite. More importantly, it gives a mechanism for transforming samples from the prior distribution into samples from the posterior distribution. Indeed, consider \(N\) iid samples from the prior distribution, \(x_1, \ldots, x_N \sim \pi_0(dx)\), and set
\[ \widetilde{x}_i = \textcolor{red}{(I - K \, H)} \, x_i + \textcolor{blue}{K}(y_\star + \xi_i) \]
for iid noise terms \(\xi_i \sim \mathcal{N}(0,R)\). From Equation 2 it is clear that \(\widetilde{x}_1, \ldots, \widetilde{x}_N\) are iid samples from the Gaussian posterior distribution \(\pi = \mathcal{N}(m,P)\). It is more intuitive to write this as
\[ \widetilde{x}_i = x_i + K \, ( \textcolor{green}{ \widetilde{y}_{i,\star} } - H \, x_i) \]
where \(\widetilde{y}_{i,\star} \in \mathbb{R}^{D_y}\) are fake observations that are obtained by perturbing the actual observation \(y_\star\) with additive Gaussian noise terms with covariance \(R\),
\[ \textcolor{green}{\widetilde{y}_{i,\star} \, = \, y_\star + \xi_i}. \]
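As a concrete illustration, here is a minimal NumPy sketch of this transformation of prior samples into posterior samples; the dimensions and the matrices \(H\), \(P_0\), \(R\) below are arbitrary illustrative choices, not tied to any particular application.

```python
import numpy as np

rng = np.random.default_rng(0)
Dx, Dy, N = 3, 2, 100_000

# illustrative prior, observation operator and noise covariance
m0 = np.zeros(Dx)
P0 = np.eye(Dx)
H = rng.normal(size=(Dy, Dx))
R = 0.5 * np.eye(Dy)
y_star = rng.normal(size=Dy)                 # the (fixed) observation

# Kalman gain K = P0 H^T (H P0 H^T + R)^{-1}  (Equation 1)
K = P0 @ H.T @ np.linalg.inv(H @ P0 @ H.T + R)

# closed-form posterior moments, for comparison
m = m0 + K @ (y_star - H @ m0)
P = (np.eye(Dx) - K @ H) @ P0

# prior samples and perturbed ("fake") observations y_star + xi_i
xs = rng.multivariate_normal(m0, P0, size=N)
ys = y_star + rng.multivariate_normal(np.zeros(Dy), R, size=N)

# update  x_i -> x_i + K (y_tilde_i - H x_i)
xs_post = xs + (ys - xs @ H.T) @ K.T

# empirical moments of the transformed samples should be close to (m, P)
print(np.abs(xs_post.mean(axis=0) - m).max())
print(np.abs(np.cov(xs_post.T) - P).max())
```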
Empirical version: non-linearity and non-Gaussianity
Suppose that we would like to estimate \(x \in \mathbb{R}^{D_x}\) from the noisy observation
\[ y_\star = \mathcal{H}(x) + \xi \qquad \text{with} \qquad \xi \sim \mathcal{N}(0,R) \]
and possibly-nonlinear observation operator \(\mathcal{H}: \mathbb{R}^{D_x} \to \mathbb{R}^{D_y}\). Assume that we also have \(N\) samples \(x_1, \ldots, x_N\) generated from some (unknown) prior distribution. For example, these samples could come from another numerical procedure. In order to obtain \(N\) approximate samples from the posterior distribution, one can set
\[ \widetilde{x}_i = x_i + \widehat{K} \, [ \textcolor{green}{ \widetilde{y}_{i,\star} } - \mathcal{H}(x_i)] \]
for fake observations \( \textcolor{green}{\widetilde{y}_{i,\star} } = y_\star + \xi_i\). The approximate Kalman gain matrix \(\widehat{K}\) is obtained by noting that, in the exact gain of Equation 1,
\[ K \; = \; P_0 H^\top {\left( H P_0 H^\top + R \right)} ^{-1} \]
we have \(P_0 H^\top = \mathop{\mathrm{Cov}}(X,HX)\) and \(H P_0 H^\top = \mathop{\mathrm{Var}}(HX)\) for \(X \sim \pi_0\). This means that an approximate Kalman gain matrix can be obtained using empirical estimates of these covariance matrices:
\[ \widehat{K} \; = \; \widehat{\mathop{\mathrm{Cov}}}([x_i]_i, [\mathcal{H}(x_i)]_i) \, {\left( \widehat{\mathop{\mathrm{Var}}}([\mathcal{H}(x_i)]_i) + R \right)} ^{-1}. \]
These updates form the basis of the Ensemble Kalman filter (EnKF), a very successful and scalable approach to data assimilation in high-dimensional dynamical systems. The method is operationally employed at weather forecasting centers across the globe.
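As an illustration, here is a minimal NumPy sketch of this empirical update; the routine name `enkf_update`, the toy observation operator and all dimensions below are illustrative assumptions rather than the interface of any particular library.

```python
import numpy as np

def enkf_update(xs, y_star, H_op, R, rng):
    """One ensemble Kalman update of the ensemble xs (shape (N, Dx))."""
    N, Dy = xs.shape[0], len(y_star)
    Hx = np.apply_along_axis(H_op, 1, xs)          # H(x_i), shape (N, Dy)
    # empirical Cov(x, H(x)) and Var(H(x))
    x_c, Hx_c = xs - xs.mean(axis=0), Hx - Hx.mean(axis=0)
    C_xH = x_c.T @ Hx_c / (N - 1)
    C_HH = Hx_c.T @ Hx_c / (N - 1)
    K_hat = C_xH @ np.linalg.inv(C_HH + R)         # approximate Kalman gain
    # fake observations y_star + xi_i and update x_i + K_hat (y_tilde_i - H(x_i))
    ys = y_star + rng.multivariate_normal(np.zeros(Dy), R, size=N)
    return xs + (ys - Hx) @ K_hat.T

# illustrative usage with a toy non-linear observation operator
rng = np.random.default_rng(1)
H_op = lambda x: np.array([np.exp(x[0]), x[1] + x[2]])
xs = rng.normal(size=(1_000, 3))                   # samples from some prior
R = 0.1 * np.eye(2)
y_star = np.array([2.0, 0.5])
xs_post = enkf_update(xs, y_star, H_op, R, rng)    # approximate posterior samples
```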
Derivative-free optimization
Interestingly enough, the remarks above can be used to design, in a relatively principled manner, a derivative-free optimizer (Huang et al. 2022). For example, assume one would like to minimize a functional of the type
\[ \Psi(x) \; = \; \|y_\star - \mathcal{H}(x)\|^2_{R^{-1}}. \]
One can start with a cloud of particles \(x_1, \ldots, x_N\) and keep updating them by repeatedly assimilating the noisy observation \(y_\star\), as if it were generated from the postulated observation process
\[ y_\star = \mathcal{H}(x) + \varepsilon^{-1} \, \xi \]
for \(\xi \sim \mathcal{N}(0,R)\) and \(\varepsilon\ll 1\) a “step-size”. Each assimilation step steers the cloud of points in the right direction. Careful choice of the step-size \(\varepsilon\) is often crucial, as in any optimization procedure. This approach is indeed related to Information-Geometric Optimization Algorithms (IGO): the article (Ollivier et al. 2017) is beautiful!
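Below is a rough, self-contained NumPy sketch of this idea: the cloud of particles is repeatedly updated with the empirical Kalman update above, using the inflated noise \(\varepsilon^{-1}\xi\) (effective covariance \(\varepsilon^{-2} R\)). The forward model, the step-size and the number of particles and iterations are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
H_op = lambda x: np.array([np.exp(x[0]), x[1] + x[2]])   # toy forward model H(x)
y_star = np.array([2.0, 0.5])
R = 0.1 * np.eye(2)
eps = 0.1                                                # "step-size"
R_eff = R / eps**2                                       # covariance of eps^{-1} xi

xs = rng.normal(size=(200, 3))                           # initial cloud of particles
for _ in range(500):
    # one empirical Kalman update with the inflated noise covariance R_eff
    Hx = np.apply_along_axis(H_op, 1, xs)
    x_c, Hx_c = xs - xs.mean(axis=0), Hx - Hx.mean(axis=0)
    K_hat = (x_c.T @ Hx_c) @ np.linalg.inv(Hx_c.T @ Hx_c + (len(xs) - 1) * R_eff)
    ys = y_star + rng.multivariate_normal(np.zeros(2), R_eff, size=len(xs))
    xs = xs + (ys - Hx) @ K_hat.T

# the particle mean should make Psi(x) = |y_star - H(x)|^2_{R^{-1}} small
x_bar = xs.mean(axis=0)
r = y_star - H_op(x_bar)
print(r @ np.linalg.solve(R, r))
```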