Adjoint method for sensitivities

Lev Pontryagin (1908–1988)

The adjoint method is, at its core, the same idea as backpropagation or reverse-mode automatic differentiation. Even so, it is often helpful to understand how it works under the hood: basic implementations of backprop can be sub-optimal or impractical (e.g. memory-intensive), especially in settings like PDE-constrained optimization or stochastic optimal control.

Linear Systems

For a parameter \(\theta \in \mathbb{R}^{d_\theta}\), let \(x = x(\theta)\) denote the solution of the linear system

\[ A x = b, \]

where both the matrix \(A = A(\theta) \in \mathbb{R}^{d_x \times d_x}\) and the vector \(b = b(\theta) \in \mathbb{R}^{d_x}\) depend on \(\theta\). This setup is typical when discretizing PDEs: \(A\) arises from the differential operator, and \(b\) from the source term. We are interested in a loss function of the type

\[ \mathcal{L}(\theta) = F(\theta, x(\theta)), \]

with \(F: \mathbb{R}^{d_\theta} \times \mathbb{R}^{d_x} \to \mathbb{R}\). We aim to compute the derivative of \(\mathcal{L}\) with respect to \(\theta\). The chain rule gives

\[ D_\theta \mathcal{L}= F_\theta - F_x A^{-1} \left( A_\theta \, x - b_\theta \right) \; \in \mathbb{R}^{1, d_\theta}. \]

The notation \(F_\theta = \nabla_\theta F^\top \in \mathbb{R}^{1, d_\theta}\) denotes the Jacobian of \(F\) with respect to \(\theta\), and similar notation is used for \(F_x\), \(A_\theta\), and \(b_\theta\). As usual, the Jacobian of a scalar function can be thought of as the transpose of its gradient. The expression follows from differentiating \(A x = b\), which gives \(A_\theta x + A x_\theta = b_\theta\) and hence \(x_\theta = -A^{-1}(A_\theta x - b_\theta)\). When \(d_x \gg 1\) and \(d_\theta \gg 1\), directly computing \(A^{-1}\) is not feasible. Naively evaluating \(A^{-1} (A_\theta x - b_\theta)\) would require \(d_\theta\) linear solves, each of cubic complexity in \(d_x\). A better approach is to first compute \(\lambda^\top = F_x A^{-1}\) by solving the so-called adjoint system

\[ A^\top \lambda = F_x^\top. \]

This requires only one linear solve. Once \(\lambda \in \mathbb{R}^{d_x}\) is computed, the Jacobian simplifies to

\[ D_\theta \mathcal{L}= F_\theta - \lambda^\top (A_\theta x - b_\theta). \]
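To make this concrete, here is a minimal NumPy sketch on a toy problem; the affine parameterization of \(A\) and \(b\) and the loss \(F(\theta, x) = \frac{1}{2}\|x\|^2\) are made up purely for illustration. One adjoint solve replaces \(d_\theta\) forward solves, and a finite difference serves as a sanity check.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_theta = 50, 200

# Hypothetical affine parameterization, chosen only so that A_theta and
# b_theta have closed forms: A(theta) = A0 + sum_k theta_k A_k, b = b0 + B theta.
A0 = rng.standard_normal((d_x, d_x)) + d_x * np.eye(d_x)
Ak = rng.standard_normal((d_theta, d_x, d_x)) / d_theta
b0 = rng.standard_normal(d_x)
B = rng.standard_normal((d_x, d_theta))

def solve_forward(theta):
    A = A0 + np.einsum("k,kij->ij", theta, Ak)
    return A, np.linalg.solve(A, b0 + B @ theta)

# Loss F(theta, x) = 0.5 ||x||^2, so that F_theta = 0 and F_x = x^T.
theta = rng.standard_normal(d_theta)
A, x = solve_forward(theta)

# Single adjoint solve: A^T lambda = F_x^T.
lam = np.linalg.solve(A.T, x)

# D_theta L = -lambda^T (A_theta x - b_theta), one entry per parameter.
grad = np.array([-lam @ (Ak[k] @ x - B[:, k]) for k in range(d_theta)])

# Finite-difference sanity check on one coordinate.
eps, k = 1e-6, 3
theta_eps = theta.copy(); theta_eps[k] += eps
_, x_eps = solve_forward(theta_eps)
print(grad[k], (0.5 * x_eps @ x_eps - 0.5 * x @ x) / eps)  # should roughly agree
```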

Adjoint Method

Now consider a more general situation where \(x \in \mathbb{R}^{d_x}\) and \(\theta \in \mathbb{R}^{d_\theta}\) are related by an implicit equation

\[ \Phi(x, \theta) = 0 \]

for some function \(\Phi: \mathbb{R}^{d_x} \times \mathbb{R}^{d_\theta} \to \mathbb{R}^{d_x}\) that satisfies the usual conditions for the implicit function theorem to hold. Differentiating gives \(x_\theta = -\Phi_x^{-1} \Phi_\theta\). As before, we want the sensitivity with respect to \(\theta\) of \(\mathcal{L}(\theta) = F(x(\theta), \theta)\). It equals \(D_\theta \mathcal{L}= F_\theta - F_x \Phi_x^{-1} \Phi_\theta\) and can also be expressed as \(D_\theta \mathcal{L}= F_\theta - \lambda^\top \Phi_\theta\) where \(\lambda \in \mathbb{R}^{d_x}\) is the solution of the adjoint system

\[ \Phi_x^\top \lambda = F_x^\top. \tag{1}\]

Another way to present this computation is to note that, for any vector \(\lambda \in \mathbb{R}^{d_x}\), we have

\[ \mathcal{L}(\theta) = F(x(\theta), \theta) - \lambda^\top \Phi(x(\theta), \theta) \]

since \(\Phi(x(\theta), \theta) \equiv 0\). As will soon become clear, introducing the “adjoint” variable \(\lambda\) allows one to eliminate cumbersome terms when computing \(D_\theta \mathcal{L}\). Differentiation with respect to \(\theta\) gives

\[ D_\theta \mathcal{L}= F_\theta - \lambda^\top \Phi_\theta + {\left( F_x - \lambda^\top \Phi_x \right)} x_\theta. \]

The term \(x_\theta = -\Phi_x^{-1} \Phi_\theta \in \mathbb{R}^{d_x, d_\theta}\) is cumbersome (e.g. intractable in high-dimensional settings), and we would like to eliminate it. To this end, it suffices to choose \(\lambda\) so that the term \(F_x - \lambda^\top \Phi_x\) vanishes; this is exactly the adjoint system Equation 1.
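The same mechanics apply to a nonlinear \(\Phi\). Below is a small sketch with a made-up implicit equation \(\Phi(x, \theta) = A x + \tanh(x) - \theta\) (chosen only so that \(\Phi_x\) and \(\Phi_\theta\) have closed forms) and \(F(x) = \frac{1}{2}\|x - x^\star\|^2\); the forward problem is handled by a generic root-finder and the sensitivity by a single adjoint solve.

```python
import numpy as np
from scipy.optimize import fsolve

rng = np.random.default_rng(1)
d = 20

# Made-up implicit equation Phi(x, theta) = A x + tanh(x) - theta = 0.
A = rng.standard_normal((d, d)) / np.sqrt(d) + 3.0 * np.eye(d)
x_star = rng.standard_normal(d)

def phi(x, theta):
    return A @ x + np.tanh(x) - theta

theta = rng.standard_normal(d)
x = fsolve(lambda z: phi(z, theta), np.zeros(d))   # forward solve

# Loss F(x) = 0.5 ||x - x_star||^2, so F_x = (x - x_star)^T and F_theta = 0.
# Here Phi_x = A + diag(1 - tanh(x)^2) and Phi_theta = -I.
Phi_x = A + np.diag(1.0 - np.tanh(x) ** 2)
lam = np.linalg.solve(Phi_x.T, x - x_star)         # adjoint system (1)

grad = lam  # D_theta L = F_theta - lam^T Phi_theta = lam^T

# Finite-difference sanity check on one coordinate.
eps, k = 1e-6, 5
theta_eps = theta.copy(); theta_eps[k] += eps
x_eps = fsolve(lambda z: phi(z, theta_eps), x)
print(grad[k],
      0.5 * (np.sum((x_eps - x_star) ** 2) - np.sum((x - x_star) ** 2)) / eps)
```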

PDE Inverse Problems

Let us see how this works in the context of PDE-constrained optimization. Let \(\Omega \subset \mathbb{R}^d\) be a bounded domain and let \(\kappa: \Omega \to \mathbb{R}\) be a scalar field. Consider the PDE

\[ \nabla \cdot {\left( e^{\kappa(x)} \nabla u \right)} = f, \tag{2}\]

where \(f\) is a given source term. The field \(\kappa\) describes the diffusion, or permeability, properties of the medium. We are interested in the solution \(u\) of the PDE on \(\Omega\) with Dirichlet boundary conditions \(u(x) = 0\) for \(x \in \partial \Omega\); for each field \(\kappa\), this elliptic PDE determines a unique solution \(u\). We aim to minimize the quantity

\[ \mathcal{L}(\kappa) = \int_\Omega F(u(x)) \, dx, \]

for some given function \(F: \mathbb{R} \to \mathbb{R}\). A common case is

\[ \mathcal{L}(\kappa) = \frac{1}{2} \, \int_\Omega \left| u(x) - u^\star(x) \right|^2 \, w(x) \, dx, \]

where \(u^\star\) is a target solution and \(w(x)>0\) is a weight. The goal is to adjust the field \(\kappa\) so that \(u\) matches \(u^\star\) as closely as possible. Carrying out this minimization requires the derivative of \(\mathcal{L}\) with respect to \(\kappa\). To that end, define the augmented functional

\[ \mathcal{L} = \int_\Omega F(u(x)) \, dx - \int_\Omega \lambda \left( \nabla \cdot (e^{\kappa} \nabla u) - f \right) \, dx, \]

for an auxiliary field \(\lambda : \Omega \to \mathbb{R}\) that will be chosen later. As before, a good choice of \(\lambda\) can simplify the computations. Let \(\delta \kappa\) be a perturbation of \(\kappa\). This induces a perturbation \(u + \delta u\) in the solution and the first order variation of \(\mathcal{L}\) reads:

\[ \delta \mathcal{L} = \int_\Omega F'(u) \, \delta u \, dx - \int_\Omega \lambda \, \nabla \cdot (e^{\kappa} \nabla \delta u) \, dx - \int_\Omega \lambda \, \nabla \cdot (e^{\kappa} \delta \kappa \nabla u) \, dx. \]

The term involving \(\delta u\) is inconvenient. Assuming that \(\lambda\) also satisfies homogeneous Dirichlet boundary conditions, which we may since we are free to choose \(\lambda\) however we want, we integrate by parts twice (the boundary terms vanish because \(\delta u\) and \(\lambda\) are zero on \(\partial \Omega\)):

\[ \delta \mathcal{L} = \int_\Omega \left( F'(u) - \nabla \cdot (e^{\kappa} \nabla \lambda) \right) \delta u \, dx - \int_\Omega \lambda \, \nabla \cdot (e^{\kappa} \delta \kappa \nabla u) \, dx. \]

To eliminate the \(\delta u\) term, choose \(\lambda\) to satisfy the adjoint equation

\[ \nabla \cdot (e^{\kappa} \nabla \lambda) = F'(u), \tag{3}\]

with Dirichlet boundary conditions. Then, an integration by parts gives

\[ \begin{align*} \delta \mathcal{L} &= -\int_\Omega \lambda \, \nabla \cdot (e^{\kappa} \delta \kappa \nabla u) \, dx\\ &= \int_\Omega e^{\kappa} \, \left< \nabla u, \nabla \lambda \right> \, \delta \kappa \, dx = \left< g, \delta \kappa \right>_{L^2(\Omega)}. \end{align*} \]

This means that the \(L^2\) gradient of \(\mathcal{L}\) with respect to \(\kappa\) is given by:

\[ g = e^{\kappa} \, \left< \nabla u, \nabla \lambda \right>, \]

where \(\lambda: \Omega \to \mathbb{R}\) solves the adjoint system Equation 3. This shows that the gradient of the objective can be computed at essentially the same computational cost as the solution of the original PDE Equation 2: one additional linear solve with the same (self-adjoint) elliptic operator. This expression can be used directly in gradient-based optimization schemes.
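Here is a one-dimensional finite-difference sketch of the full recipe (forward solve, adjoint solve, pointwise gradient formula); the grid, source term, and target \(u^\star\) are arbitrary illustrative choices, and the discretization is minimal rather than production-grade.

```python
import numpy as np

# One-dimensional sketch on (0, 1) with homogeneous Dirichlet conditions.
n = 200                                  # interior grid points
h = 1.0 / (n + 1)
xs = np.linspace(h, 1.0 - h, n)

def operator(kappa):
    # Tridiagonal discretization of u -> div(e^kappa grad u), with e^kappa
    # evaluated at cell midpoints (kappa given at the grid points).
    k_ext = np.concatenate(([kappa[0]], kappa, [kappa[-1]]))
    e_half = np.exp(0.5 * (k_ext[:-1] + k_ext[1:]))     # n+1 midpoint values
    L = np.zeros((n, n))
    for i in range(n):
        L[i, i] = -(e_half[i] + e_half[i + 1]) / h**2
        if i > 0:
            L[i, i - 1] = e_half[i] / h**2
        if i < n - 1:
            L[i, i + 1] = e_half[i + 1] / h**2
    return L, e_half

f = -np.ones(n)                          # source term
u_star = np.sin(np.pi * xs)              # target solution
kappa = np.zeros(n)

L, e_half = operator(kappa)
u = np.linalg.solve(L, f)                # forward solve, Equation 2
lam = np.linalg.solve(L.T, u - u_star)   # adjoint solve, Equation 3: F'(u) = u - u*

# Gradient g = e^kappa <grad u, grad lambda>: midpoint slopes averaged back
# to grid points (zero boundary values appended for the Dirichlet conditions).
du = np.diff(np.concatenate(([0.0], u, [0.0]))) / h
dl = np.diff(np.concatenate(([0.0], lam, [0.0]))) / h
g_half = e_half * du * dl
g = 0.5 * (g_half[:-1] + g_half[1:])

# Finite-difference check: dL/dkappa_i ~ h * g_i (L2 inner-product weight).
def loss(kappa):
    Lk, _ = operator(kappa)
    return 0.5 * h * np.sum((np.linalg.solve(Lk, f) - u_star) ** 2)

eps, i = 1e-6, n // 3
kappa_eps = kappa.copy(); kappa_eps[i] += eps
print(h * g[i], (loss(kappa_eps) - loss(kappa)) / eps)  # should agree closely
```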

Controlled Diffusions

Consider the ODE on \([0, T]\):

\[ \dot{x} = b(t, \theta, x), \tag{4}\]

with initial condition \(x(0) = \mu(\theta)\), where \(x \in \mathbb{R}^{d_x}\) and \(\theta \in \mathbb{R}^{d_\theta}\). The drift term \(b(t, \theta, x) \in \mathbb{R}^{d_x}\) is parameterized by \(\theta\). We want the sensitivity of the functional

\[ \mathcal{L}(\theta) = \int_0^T f(t, \theta, x(t)) \, dt + g(\theta, x(T)), \]

where \(f\) and \(g\) are given functions. As before, it is often helpful to introduce an auxiliary function \(\lambda(t) \in \mathbb{R}^{d_x}\) and write:

\[ \mathcal{L}(\theta) = \mathcal{L}(\theta) - \int_0^T \lambda^\top \, \underbrace{ {\left( \dot{x} - b \right)} }_{\equiv 0} \, dt. \]

Differentiating with respect to \(\theta\) and integrating by parts gives:

\[ \begin{aligned} D_\theta \mathcal{L} &= \int_0^T \left( f_\theta + \lambda^\top b_\theta \right) \, dt + g_\theta(\theta, x(T)) + \lambda^\top(0) \mu_\theta \\ &\quad + \left( g_x - \lambda^\top(T) \right) \, x_\theta(T) + \int_0^T \left( f_x + \dot{\lambda}^\top + \lambda^\top b_x \right) x_\theta(t) \, dt. \end{aligned} \]

To eliminate the dependence on \(x_\theta(t)\), choose \(\lambda(t)\) to satisfy the adjoint system:

\[ \begin{cases} \dot{\lambda}(t) = -\nabla_x f - b_x^\top \lambda(t), \\ \lambda(T) = \nabla_x g. \end{cases} \tag{5}\]

This is a linear ODE with terminal condition \(\lambda(T) = \nabla_x g\) that needs to be solved backward in time. Thus, to compute the derivative of \(\mathcal{L}\), one first solves the forward ODE Equation 4 to obtain \(x(t)\), and then solves the adjoint system Equation 5 backward in time to obtain \(\lambda(t)\). The Jacobian (i.e. the transpose of the gradient) of \(\mathcal{L}\) is then given by

\[ D_\theta \mathcal{L} = \int_0^T \left( f_\theta + \lambda^\top b_\theta \right) dt + g_\theta(\theta, x(T)) + \lambda^\top(0) \mu_\theta. \tag{6}\]

The term \(\lambda^\top b_\theta\) is a vector-Jacobian product, and can be computed efficiently. This formulation is often referred to as the “continuous adjoint method”, “adjoint sensitivity analysis”, or “optimize-then-discretize”, and dates back to the work of Pontryagin (Pontryagin 1962). A naive implementation of backpropagation can be memory-inefficient since quantities such as \(f_\theta\) would typically be stored along the forward pass. When \(d_\theta \gg 1\), as is for example the case when the drift is parameterized by a neural network, this can be impractical. Instead, it may be more efficient to store only the forward trajectory \(x(t)\) and recompute all the other quantities during the backward pass; this incurs a slight additional computational cost but can bring very large memory savings. In machine-learning settings, this often makes much larger batch sizes possible. Similarly, if implicit methods are used to solve the ODE instead of a simple explicit Euler scheme, backpropagating through the implicit solver can be tricky; the continuous adjoint avoids differentiating through the solver altogether.
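Here is a minimal sketch of this forward/backward procedure on a scalar toy problem, with drift \(b(t, \theta, x) = \theta_1 x + \theta_2 \sin t\), \(f = 0\), and \(g(x(T)) = \frac{1}{2}(x(T) - 1)^2\), all hypothetical choices; only the state trajectory is stored during the forward pass, and \(\lambda^\top b_\theta\) is recomputed on the fly during the backward pass.

```python
import numpy as np

# Toy setup: b(t, theta, x) = theta[0]*x + theta[1]*sin(t), f = 0,
# g(x(T)) = 0.5*(x(T) - 1)^2, fixed initial condition x(0) = x0.
T, N, x0 = 2.0, 2000, 0.5
dt = T / N

def b(t, theta, x):
    return theta[0] * x + theta[1] * np.sin(t)

def loss_and_grad(theta):
    # Forward pass (explicit Euler): store only the state trajectory.
    xs = np.empty(N + 1)
    xs[0] = x0
    for n in range(N):
        xs[n + 1] = xs[n] + dt * b(n * dt, theta, xs[n])

    # Backward pass: adjoint ODE (5), lam' = -b_x^T lam with lam(T) = g_x;
    # here b_x = theta[0], and lam^T b_theta is recomputed on the fly.
    lam = xs[-1] - 1.0
    grad = np.zeros(2)
    for n in reversed(range(N)):
        lam += dt * theta[0] * lam                            # Euler step, backward in time
        grad += dt * lam * np.array([xs[n], np.sin(n * dt)])  # accumulate Equation 6
    return 0.5 * (xs[-1] - 1.0) ** 2, grad

theta = np.array([-0.3, 1.2])
loss, grad = loss_and_grad(theta)

# Finite-difference check; agreement is up to the O(dt) discretization mismatch.
eps = 1e-6
for k in range(2):
    theta_eps = theta.copy(); theta_eps[k] += eps
    loss_eps, _ = loss_and_grad(theta_eps)
    print(grad[k], (loss_eps - loss) / eps)
```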

Nothing really changes when considering an SDE with additive noise instead,

\[ dx = b(t, \theta, x) \, dt + \sigma(t) \, dW_t. \]

Informally, one can apply the same reasoning as before to the controlled ODE: \[ \dot{x} \, = \, b(t, \theta, x) + \sigma(t) \, dW_t/dt. \]
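Concretely, only the forward pass of the previous toy sketch changes. The variant below reuses \(b\), \(N\), \(dt\), and \(x_0\) from that sketch and freezes the Brownian increments, so the computed gradient is a pathwise derivative for one noise realization:

```python
# Additive-noise variant of the ODE sketch above: only the forward pass
# changes (Euler-Maruyama step); the backward adjoint pass is identical.
sigma = 0.3
dW = np.random.default_rng(2).standard_normal(N) * np.sqrt(dt)  # frozen noise path

def loss_and_grad_sde(theta):
    xs = np.empty(N + 1)
    xs[0] = x0
    for n in range(N):
        xs[n + 1] = xs[n] + dt * b(n * dt, theta, xs[n]) + sigma * dW[n]
    lam = xs[-1] - 1.0                       # lam(T) = g_x, unchanged
    grad = np.zeros(2)
    for n in reversed(range(N)):
        lam += dt * theta[0] * lam           # same adjoint ODE (5), backward in time
        grad += dt * lam * np.array([xs[n], np.sin(n * dt)])
    return 0.5 * (xs[-1] - 1.0) ** 2, grad
```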

Again, it suffices to solve the SDE forward in time to obtain \(x(t)\) and then solve the exact same adjoint ODE Equation 5 backward in time to obtain \(\lambda(t)\): it is still an ordinary differential equation. The derivative of \(\mathcal{L}\) with respect to \(\theta\) is then given by the same expression Equation 6. For SDEs with multiplicative noise, the adjoint system is slightly more complicated, but this hardly changes the overall picture. Finally, note that in the case where the two functions \(f\) and \(g\) do not depend on \(\theta\) and one chooses \(\theta = x_0\) with initial condition \(\mu(x_0) = x_0\) (so that \(\mu_{x_0}\) is the identity), Equation 6 shows that \(D_{x_0} \mathcal{L}= {\left( \nabla_{x_0} \mathcal{L} \right)} ^\top = \lambda(0)^\top\). More generally, this shows that:

\[ \lambda(t) = \nabla_{x(t)} \, {\left\{ \int_t^T f(s, \theta, x(s)) \, ds + g(\theta, x(T)) \right\}} . \]

References

Pontryagin, Lev Semenovich. 1962. The Mathematical Theory of Optimal Processes.