Natural Gradient
Consider a parametric family of probability distributions \(q_{\theta}(x)\) indexed by \(\theta \in \Theta \subseteq \mathbb{R}^d\). We are interested in minimizing a functional \(\theta \mapsto J(q_{\theta})\), typically the Kullback-Leibler divergence between \(q_{\theta}\) and a target distribution \(\pi(x)\). One could set \(F(\theta) = J(q_{\theta})\) and use a variant of the gradient descent algorithm to minimize \(F\), i.e. \(\theta_{t+1} = \theta_t - \alpha_t \nabla F(\theta_t)\) for some step size \(\alpha_t\). Although simple, this approach has the important drawback of depending heavily on the choice of parametrization of the family \(q_{\theta}\). Instead, one could for example consider an update of the type:
\[ \theta_{t+1} = \mathop{\mathrm{argmin}}_{\theta \in \Theta} \; {\left\{ \textcolor{blue}{J(q_{\theta})} + \frac{1}{\alpha_t} \, D(q_{\theta} \| q_{\theta_t}) \right\}} \]
for step size \(\alpha_t\) and some divergence \(D\), often taken to be the Kullback-Leibler divergence. This has the advantage of being entirely independent of the parametrization of the family \(q_{\theta}\). Unfortunately, the right-hand side of the above equation is not tractable in general. A middle ground consists in considering
\[ \theta_{t+1} = \mathop{\mathrm{argmin}}_{\theta \in \Theta} \; {\left\{ \textcolor{blue}{F(\theta_t) + \left< \theta - \theta_t, \nabla_{\theta} F(\theta_t) \right>} + \frac{1}{\alpha_t} \, D(q_{\theta} \| q_{\theta_t}) \right\}} . \tag{1}\]
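To make the parametrization issue concrete, here is a minimal numerical sketch (the quadratic objective \(F(\theta) = \theta^2\) and the rescaled parametrization are illustrative choices, not from the text): a single plain gradient step on the same objective, started at the same point, lands at different places depending on how the family is parametrized.

```python
# Plain gradient descent is parametrization dependent: one step on F(theta) = theta^2
# from the same starting point theta = 1, under two parametrizations.
alpha = 0.1

# Parametrization 1: optimize theta directly; dF/dtheta = 2 * theta.
theta = 1.0
theta -= alpha * 2 * theta           # -> 0.8

# Parametrization 2: theta = eta / 10, start at eta = 10 (i.e. theta = 1).
# Chain rule: dF/deta = 2 * (eta / 10) * (1 / 10).
eta = 10.0
eta -= alpha * 2 * (eta / 10) / 10
theta_via_eta = eta / 10             # -> 0.998, a much smaller effective step

print(theta, theta_via_eta)
```

A divergence-penalized update measures step length in distribution space rather than parameter space, which is exactly what removes this dependence.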
Furthermore, in the case where \(D\) is the Kullback-Leibler divergence, a Taylor expansion of the divergence around \(\theta_t\) gives (the zeroth- and first-order terms vanish since \(D_{\text{KL}}(q_{\theta_t} \| q_{\theta_t}) = 0\) and the score has zero mean under \(q_{\theta_t}\))
\[ D_{\text{KL}}(q_{\theta} \| q_{\theta_t}) = \frac{1}{2} \left< \theta - \theta_t, \mathcal{I}(\theta_t) (\theta - \theta_t) \right> \, + \, \mathcal{O}(\|\theta - \theta_t\|^3) \]
where \(\mathcal{I}(\theta_t)\) is the Fisher information matrix at \(\theta_t\):
\[ \begin{align*} \mathcal{I}(\theta_t) &= \mathbb{E}_{q_{\theta_t}} {\left[ \nabla_{\theta} \log q_{\theta_t}(X) \, \nabla_{\theta} \log q_{\theta_t}(X)^\top \right]} \\ &= - \mathbb{E}_{q_{\theta_t}} {\left[ \nabla_{\theta}^2 \log q_{\theta_t}(X) \right]} . \end{align*} \]
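As a quick numerical sanity check of the equivalence above (a toy sketch; the Bernoulli family and the value \(\theta = 0.3\) are illustrative), both the outer-product form and the negative-Hessian form recover the well-known Bernoulli Fisher information \(1/(\theta(1-\theta))\):

```python
import numpy as np

theta = 0.3                              # illustrative Bernoulli parameter
xs = np.array([0.0, 1.0])                # support of the Bernoulli distribution
probs = np.array([1 - theta, theta])     # q_theta(x)

# Score: d/dtheta log q_theta(x) = x/theta - (1 - x)/(1 - theta)
score = xs / theta - (1 - xs) / (1 - theta)
fisher_outer = np.sum(probs * score**2)            # E[score^2]

# Second derivative: d^2/dtheta^2 log q_theta(x) = -x/theta^2 - (1 - x)/(1 - theta)^2
hess = -xs / theta**2 - (1 - xs) / (1 - theta)**2
fisher_hess = -np.sum(probs * hess)                # -E[Hessian of log-likelihood]

closed_form = 1.0 / (theta * (1 - theta))
print(fisher_outer, fisher_hess, closed_form)      # all three agree
```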
Replacing \(D(q_{\theta} \| q_{\theta_t})\) by \(\frac{1}{2} \left< \theta - \theta_t, \mathcal{I}(\theta_t) (\theta - \theta_t) \right>\) in Equation 1 yields a quadratic problem whose first-order optimality condition reads
\[ \nabla_{\theta} F(\theta_t) + \frac{1}{\alpha_t} \, \mathcal{I}(\theta_t) \, (\theta_{t+1} - \theta_t) = 0 , \]
which gives the so-called natural gradient update rule (Amari 1998):
\[ \theta_{t+1} = \theta_t - \alpha_t \, \mathcal{I}(\theta_t)^{-1} \, \nabla_{\theta} F(\theta_t). \]
In other words, the inverse of the Fisher information matrix acts as a preconditioner.
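Putting everything together, here is a small sketch of the natural gradient update for a univariate Gaussian family \(q_\theta = \mathcal{N}(\mu, \sigma^2)\) with \(\theta = (\mu, \rho)\), \(\sigma = e^\rho\), minimizing \(F(\theta) = D_{\text{KL}}(q_\theta \| \pi)\) against a Gaussian target \(\pi = \mathcal{N}(m, s^2)\). The target, step size, and iteration count are illustrative choices; the closed-form KL gradients and the diagonal Fisher matrix \(\mathrm{diag}(1/\sigma^2, 2)\) are standard for this parametrization.

```python
import numpy as np

# Target pi = N(m, s^2); variational family q_theta = N(mu, sigma^2), sigma = exp(rho).
m, s = 2.0, 0.5                      # illustrative target
mu, rho = 0.0, 0.0                   # initial parameters
alpha = 0.1                          # step size

for _ in range(200):
    sigma = np.exp(rho)
    # Closed-form gradient of F(theta) = KL(q_theta || pi):
    #   dF/dmu  = (mu - m) / s^2,   dF/drho = sigma^2 / s^2 - 1
    grad = np.array([(mu - m) / s**2, sigma**2 / s**2 - 1.0])
    # Fisher matrix in (mu, rho) coordinates is diag(1/sigma^2, 2), so applying
    # its inverse as a preconditioner is an elementwise product.
    inv_fisher_diag = np.array([sigma**2, 0.5])
    mu, rho = np.array([mu, rho]) - alpha * inv_fisher_diag * grad

print(mu, np.exp(rho))  # converges to approximately (m, s) = (2.0, 0.5)
```

Note how the \(\sigma^2\) factor in the preconditioner automatically rescales the mean update: when the variational distribution is wide, larger moves in \(\mu\) correspond to small moves in distribution space, and the update takes advantage of that.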