Setup
Consider a pre-trained generative model that produces samples \(X_1\) by simulating the SDE
\[ dX_t = b(X_t, t) \, dt + \sigma(t) \, dW_t, \qquad X_0 \sim \mathcal{N}(0, I) \tag{1}\]
on \([0,1]\), where \(b\) encodes a learned score or velocity field. The path distribution of this process is \(\mathbb{P}\) and the terminal distribution is \(p_1(x)\). Given a reward function \(r: \mathbb{R}^D \to \mathbb{R}\), we want to fine-tune the model so that its output distribution becomes the tilted distribution
\[ \pi(x) \propto p_1(x) \, \exp(r(x)). \tag{2}\]
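In one dimension, the tilted distribution in Equation 2 can be sanity-checked with self-normalized importance sampling. The choices below are illustrative, not from the setup above: \(p_1 = \mathcal{N}(0,1)\) and \(r(x) = -x^2\), for which the tilted density is \(\mathcal{N}(0, 1/3)\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choices: base terminal distribution p_1 = N(0, 1) and
# reward r(x) = -x^2.  The tilted density is then
#   pi(x) ∝ exp(-x^2/2) * exp(-x^2) = exp(-3 x^2 / 2),  i.e. N(0, 1/3).
x = rng.standard_normal(500_000)     # samples from p_1
w = np.exp(-x**2)                    # unnormalized weights exp(r(x))
w /= w.sum()                         # self-normalize

var_tilted = np.sum(w * x**2)        # E_pi[x^2]  (the mean is 0 by symmetry)
print(var_tilted)                    # close to 1/3
```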
Introduce a control \(u(t,x)\) and the controlled SDE
\[ dX^u_t = {\left[ b(X^u_t, t) + \sigma(t) \, \textcolor{blue}{u(t, X^u_t)} \right]} \, dt + \sigma(t) \, dW_t. \tag{3}\]
The Girsanov theorem gives \(D_{\text{KL}}(\mathbb{P}^u \| \mathbb{P}) = \frac{1}{2} \mathbb{E}_{\mathbb{P}^u} \int_0^1 \|u(t, X^u_t)\|^2 \, dt\), so the stochastic optimal control problem
\[ \min_u \; \mathcal{J}(u) = \mathbb{E}_{\mathbb{P}^u} {\left[ \int_0^1 \tfrac{1}{2} \|u(t, X^u_t)\|^2 \, dt - r(X^u_1) \right]} \tag{4}\]
maximizes the terminal reward while penalizing deviations from the base process. Equivalently, we maximize \(\mathbb{E}{\left[ r(X^u_1) - \int_0^1 \tfrac{1}{2} \|u(t, X^u_t)\|^2 \, dt \right]}\), which is the SOC problem from the HJB notes with \(f = 0\) and \(g = r\).
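The Girsanov identity above can be checked on a discretized toy. Per Euler step, the controlled and base chains have Gaussian transition kernels, and the accumulated log-ratio has expectation \(\frac{1}{2}\int_0^1 \|u\|^2 \, dt\) under \(\mathbb{P}^u\). The drift \(b(x,t) = -x\), constant \(\sigma = 1\), and control \(u(t,x) = 0.5\,x\) below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Monte Carlo check of D_KL(P^u || P) = 1/2 E_{P^u} ∫ ||u||^2 dt on an
# Euler-Maruyama discretization.  All modeling choices are illustrative:
# b(x,t) = -x, sigma = 1, u(t,x) = 0.5 x.
N, B, dt = 50, 20_000, 1.0 / 50
sigma = 1.0
x = rng.standard_normal(B)           # X_0 ~ N(0, 1)
log_ratio = np.zeros(B)              # accumulates log dP^u/dP along the path
half_int_u2 = np.zeros(B)            # accumulates 1/2 ∫ ||u||^2 dt

for _ in range(N):
    u = 0.5 * x
    dx = (-x + sigma * u) * dt + sigma * np.sqrt(dt) * rng.standard_normal(B)
    m_u, m_0, v = (-x + sigma * u) * dt, -x * dt, sigma**2 * dt
    # Gaussian transition log-ratio: log N(dx; m_u, v) - log N(dx; m_0, v)
    log_ratio += (-(dx - m_u)**2 + (dx - m_0)**2) / (2 * v)
    half_int_u2 += 0.5 * u**2 * dt
    x = x + dx

print(log_ratio.mean(), half_int_u2.mean())   # the two estimates agree
```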
Value function bias
The HJB notes show that the optimal control is \(u^\star(t,x) = \sigma(t)^\top \nabla V(t,x)\) where \(V(t,x) = \log h(t,x)\) is the value function and
\[ h(t,x) = \mathbb{E} {\left[ \exp(r(X_1)) \mid X_t = x \right]} \]
under the base process. The HJB notes derive the Radon-Nikodym derivative between the optimally controlled path \(\mathbb{P}^\star\) and the base \(\mathbb{P}\) via Girsanov. For \(f = 0\) and \(g = r\), the optimal path distribution satisfies
\[ \frac{d\mathbb{P}^\star}{d\mathbb{P}}(X_{[0,1]}) \propto \exp {\left( r(X_1) + V(0, X_0) \right)} . \tag{5}\]
The \(V(0, X_0)\) term comes from the initial distribution shift: the optimally controlled process starts from \(q_0(x_0) \propto p_0(x_0) \, h(0, x_0) = p_0(x_0) \, \exp(V(0, x_0))\), not from \(p_0\).
Marginalizing the joint over \(X_0\):
\[ \begin{align} p^\star(X_0, X_1) &\propto p^{\text{base}}(X_0, X_1) \, \exp {\left( r(X_1) + V(0, X_0) \right)} \\ p^\star(X_1) &= \int p^{\text{base}}(X_0, X_1) \, \exp {\left( r(X_1) + V(0, X_0) \right)} \, dX_0. \end{align} \tag{6}\]
When \(X_0\) and \(X_1\) are correlated under \(\mathbb{P}\), the factor \(\exp(V(0, X_0))\) cannot be pulled out of the integral. The marginal \(p^\star(X_1)\) is not proportional to \(p_1(X_1) \exp(r(X_1))\). The SOC solution tilts both the terminal and the initial distribution, and the fine-tuned model does not sample from Equation 2.
As a concrete example, take \(\sigma(t) = 0\) (standard flow matching). The ODE \(\dot{X}_t = b(X_t, t)\) is deterministic: \(X_0\) fully determines \(X_1 = \Phi(X_0)\) for a diffeomorphism \(\Phi\). Adding a control \(u\) in Equation 3 has zero effect since \(\sigma = 0\) kills both the noise and the control term. There is nothing to optimize. The bias is maximal.
Memoryless noise schedule
Look again at Equation 6. If \(X_0 \perp X_1\) under the base process, then \(p^{\text{base}}(X_0, X_1) = p^{\text{base}}(X_0) \, p_1(X_1)\) and
\[ \begin{align} p^\star(X_1) &= p_1(X_1) \, \exp(r(X_1)) \int p^{\text{base}}(X_0) \, \exp(V(0, X_0)) \, dX_0\\ &\propto p_1(X_1) \, \exp(r(X_1)). \end{align} \tag{7}\]
The integral over \(X_0\) collapses to a constant. The bias disappears.
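A two-variable Gaussian toy makes Equations 6 and 7 concrete. Take \((X_0, X_1)\) standard bivariate normal with correlation \(\rho\) and a linear reward \(r(x) = \lambda x\) (both illustrative choices); then \(V(0, x_0) = \lambda \rho x_0 + \text{const}\), and the SOC marginal from Equation 6 can be computed by grid integration:

```python
import numpy as np

# Jointly Gaussian toy (illustrative): (X_0, X_1) standard bivariate normal
# with correlation rho, reward r(x) = lam * x.  Then
#   V(0, x0) = log E[exp(lam X_1) | X_0 = x0] = lam*rho*x0 + const,
# so Equation 6 tilts the joint by exp(lam*x1 + lam*rho*x0).
def soc_terminal_mean(rho, lam=1.0, n=1201, lim=8.0):
    g = np.linspace(-lim, lim + 2 * lam, n)
    x0, x1 = np.meshgrid(g, g, indexing="ij")
    log_p = -(x0**2 - 2 * rho * x0 * x1 + x1**2) / (2 * (1 - rho**2))
    w = np.exp(log_p + lam * x1 + lam * rho * x0)   # unnormalized p*(x0, x1)
    return (w * x1).sum() / w.sum()                 # mean of X_1 under p*

# The correct tilted target pi(x) ∝ N(x; 0, 1) exp(lam x) has mean lam = 1.
print(soc_terminal_mean(rho=0.8))   # biased: lam * (1 + rho^2) = 1.64
print(soc_terminal_mean(rho=0.0))   # memoryless: exactly lam = 1.0
```

With \(\rho = 0.8\) the terminal mean is \(\lambda(1+\rho^2)\) instead of the correct \(\lambda\); with \(\rho = 0\) the extra factor is constant in \(x_0\) and the bias vanishes, exactly as in Equation 7.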
A base process Equation 1 is memoryless when \(X_0 \perp X_1\). For the family of generative SDEs that arise in flow matching and diffusion models, this condition pins down \(\sigma(t)\).
The generative SDE Equation 1 is built to have the same time marginals as the reference interpolation \(\bar{X}_t = \beta_t \bar{X}_0 + \alpha_t \bar{X}_1\) with \(\bar{X}_0 \sim \mathcal{N}(0,I)\) independent of \(\bar{X}_1\). Conditionally on \(\bar{X}_1\), this reference has \(\bar{X}_t \mid \bar{X}_1 \sim \mathcal{N}(\alpha_t \bar{X}_1, \beta_t^2 I)\): the conditional variance is \(\beta_t^2\), which starts at \(\beta_0^2 = 1\) (pure noise) and decays to \(\beta_1^2 = 0\) (deterministic). The unified drift from Domingo-Enrich et al. (2024) is
\[ b(x,t) = \kappa_t \, x + {\left( \tfrac{\sigma(t)^2}{2} + \eta_t \right)} \nabla \log p_t(x) \tag{8}\]
with \(\kappa_t = \dot{\alpha}_t / \alpha_t\) and \(\eta_t = \beta_t(\dot{\alpha}_t \beta_t / \alpha_t - \dot{\beta}_t)\). For the SDE to produce independent \(X_0\) and \(X_1\), it needs to inject enough noise to erase memory of \(X_0\) by time 1. The conditional variance \(\beta_t^2\) decays at a rate controlled by \(\eta_t\), and the diffusion coefficient \(\sigma(t)^2/2\) adds noise. Matching these exactly, so that the SDE’s transition kernel reproduces the conditional Gaussian structure \(X_t \mid X_1 \sim \mathcal{N}(\alpha_t X_1, \beta_t^2 I)\), requires
\[ \sigma(t) = \sqrt{2 \eta_t}. \tag{9}\]
Near \(t=0\), \(\eta_t \to \infty\) so \(\sigma(t) \to \infty\): the process mixes aggressively, erasing memory of \(X_0\). Near \(t=1\), \(\eta_t \to 0\) so \(\sigma(t) \to 0\): the process stabilizes around \(X_1\).
Checking the flow matching schedule explicitly:
Take \(\alpha_t = t\) and \(\beta_t = 1-t\). Then \(\dot{\alpha}_t = 1\), \(\dot{\beta}_t = -1\), \(\kappa_t = 1/t\), and \[ \eta_t = (1-t) {\left( \frac{1}{t}(1-t) - (-1) \right)} = (1-t) {\left( \frac{1-t}{t} + 1 \right)} = \frac{(1-t)}{t}. \] So the memoryless schedule is \(\sigma(t) = \sqrt{2(1-t)/t}\), which blows up as \(t \to 0\) (aggressive mixing near the noise) and vanishes as \(t \to 1\) (stabilizing near the sample).
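The memorylessness claim can be tested by simulating Equation 8 directly. Assuming (for tractability) one-dimensional Gaussian data \(X_1 \sim \mathcal{N}(0,1)\), the marginal is \(p_t = \mathcal{N}(0, t^2 + (1-t)^2)\) with an analytic score; the correlation between the starting point and the terminal sample should vanish under the memoryless schedule but stay at one under \(\sigma = 0\). This is an illustrative sketch, with a small start time \(t_0\) to avoid the singularity at \(t = 0\):

```python
import numpy as np

rng = np.random.default_rng(2)

# Euler-Maruyama simulation of Equation 8 for alpha_t = t, beta_t = 1 - t,
# assuming Gaussian data X_1 ~ N(0, 1): p_t = N(0, s_t) with
# s_t = t^2 + (1-t)^2 and score(x, t) = -x / s_t.
def simulate(sigma_fn, B=20_000, t0=0.01, dt=1e-3):
    s0 = t0**2 + (1.0 - t0)**2
    x = np.sqrt(s0) * rng.standard_normal(B)   # start from the marginal p_{t0}
    x_start = x.copy()
    for t in np.arange(t0, 1.0, dt):
        s = t**2 + (1.0 - t)**2
        eta = (1.0 - t) / t                    # fixed by the interpolation
        sig = sigma_fn(t)
        b = x / t + (0.5 * sig**2 + eta) * (-x / s)
        x = x + b * dt + sig * np.sqrt(dt) * rng.standard_normal(B)
    return np.corrcoef(x_start, x)[0, 1]

corr_mem = simulate(lambda t: np.sqrt(2.0 * (1.0 - t) / t))  # memoryless
corr_ode = simulate(lambda t: 0.0)                           # sigma = 0 (ODE)
print(corr_mem, corr_ode)   # near 0 vs. essentially 1
```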
For DDIM with \(\alpha_t = \sqrt{\bar{\alpha}_t}\), \(\beta_t = \sqrt{1 - \bar{\alpha}_t}\), the memoryless schedule gives \(\sigma(t) = \sqrt{\dot{\bar{\alpha}}_t / \bar{\alpha}_t}\), which is exactly the DDPM noise schedule. This gives retrospective justification for DDPM’s noise choice, which was previously just one heuristic among others.
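A quick numeric spot-check of this reduction, using the concrete (illustrative) choice \(\bar{\alpha}_t = t^2\), so \(\alpha_t = t\) and \(\beta_t = \sqrt{1 - t^2}\):

```python
import numpy as np

# Spot-check sigma(t)^2 = 2 eta_t = abar'(t) / abar(t) for abar(t) = t^2
# (an illustrative choice): alpha_t = t, beta_t = sqrt(1 - t^2).
t = np.linspace(0.05, 0.95, 91)
abar, dabar = t**2, 2.0 * t
alpha, dalpha = t, np.ones_like(t)
beta = np.sqrt(1.0 - t**2)
dbeta = -t / np.sqrt(1.0 - t**2)

eta = beta * (dalpha * beta / alpha - dbeta)
assert np.allclose(2.0 * eta, dabar / abar)   # sigma(t)^2 = abar'/abar
```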
The result of Domingo-Enrich et al. (2024) is stronger than sufficiency: within the family of generative SDEs sharing time marginals as \(\bar{X}_t\), the memoryless schedule is the only one guaranteeing convergence to Equation 2 for arbitrary reward functions. The memoryless schedule is the unique choice that preserves the relationship between velocity and score, enabling conversion to arbitrary noise schedules after fine-tuning.
The memoryless schedule is only needed during fine-tuning. After learning the control \(u\), the fine-tuned model (base drift plus control) can be converted back to a velocity field and sampled with any \(\sigma(t)\), including \(\sigma(t) = 0\).
Adjoint Matching
With the memoryless schedule, we need to solve Equation 4. The adjoint method gives the gradient of the cost \(\mathcal{J}\) along a trajectory. The adjoint state is the sensitivity of the future cost to perturbations at time \(t\):
\[ a(t) = \nabla_{X_t} {\left[ \int_t^1 \tfrac{1}{2} \|u(s, X_s)\|^2 \, ds - r(X_1) \right]} . \tag{10}\]
Applying the adjoint ODE from the adjoint notes to the controlled drift \(b + \sigma u\) with running cost \(\frac{1}{2}\|u\|^2\) and terminal cost \(-r\):
\[ \dot{a}(t) = - {\left[ \nabla_x b + \sigma \, \nabla_x u \right]} ^\top a(t) - \underbrace{(\nabla_x u)^\top u}_{\nabla_x(\frac{1}{2}\|u\|^2)}, \qquad a(1) = -\nabla r(X_1). \tag{11}\]
The conditional expectation \(\mathbb{E}[a(t) \mid X_t = x] = \nabla_x \mathcal{J}(u; x, t)\) is the gradient of the cost functional. Since \(u^\star = \sigma^\top \nabla V\) and \(V(t,x) = -\mathcal{J}(u^\star; x, t)\) at optimality, we get
\[ u^\star(t, x) = -\sigma(t)^\top \mathbb{E} {\left[ a(t; X, u^\star) \mid X_t = x \right]} . \tag{12}\]
This is a fixed-point condition: \(u^\star\) is the unique control satisfying Equation 12.
Turn this into a regression. For a current control \(\bar{u}\) (parameters frozen), simulate \(X \sim \mathbb{P}^{\bar{u}}\), solve the adjoint ODE Equation 11 backward, and regress:
\[ \mathcal{L}_{\text{basic}}(\theta) = \frac{1}{2} \int_0^1 \| u_\theta(t, X_t) + \sigma(t)^\top a(t; X, \bar{u}) \|^2 \, dt, \qquad X \sim \mathbb{P}^{\bar{u}}, \quad \bar{u} = \texttt{sg}(u_\theta). \tag{13}\]
Here \(\texttt{sg}\) denotes stop-gradient: the trajectory and adjoint use frozen parameters; gradients only flow through the \(u_\theta\) term. Expanding the square and differentiating with respect to \(\theta\) recovers exactly the continuous adjoint gradient from Equation 11. So this basic version produces the same parameter updates as the standard adjoint method.
The lean adjoint
The full adjoint Equation 11 contains terms involving \(\nabla_x u\). At optimality, these terms have conditional expectation zero. To see why, note that the minimizer of the regression Equation 13 is the conditional expectation:
\[ u^\star(t, x) = -\sigma(t)^\top \mathbb{E} {\left[ a(t) \mid X_t = x \right]} . \]
Left-multiply both sides by the transposed Jacobian \((\nabla_x u^\star(t, x))^\top\):
\[ (\nabla_x u^\star)^\top u^\star = -(\nabla_x u^\star)^\top \sigma(t)^\top \, \mathbb{E} {\left[ a(t) \mid X_t = x \right]} . \]
Rearranging and applying the tower property:
\[ \mathbb{E} {\left[ (\nabla_x u^\star)^\top u^\star + (\nabla_x u^\star)^\top \sigma(t)^\top a(t) \mid X_t = x \right]} = 0. \tag{14}\]
The terms inside the expectation appear in the full adjoint ODE Equation 11 (they are the \(u\)-dependent pieces: \((\nabla_x u)^\top \sigma^\top a\) from the bracket and \(\nabla_x(\frac{1}{2}\|u\|^2) = (\nabla_x u)^\top u\)). Since they vanish in expectation at optimality, removing them does not change the fixed point. This gives the lean adjoint \(\tilde{a}(t)\):
\[ \dot{\tilde{a}}(t) = -\nabla_x b(X_t, t)^\top \tilde{a}(t), \qquad \tilde{a}(1) = -\nabla r(X_1). \tag{15}\]
Compare with Equation 11: every \(u\)-dependent term is gone. The lean adjoint depends only on the base drift \(b\), not on the control. This has two practical effects: (i) no need to compute \(\nabla_x u\), which is expensive for neural network controls, and (ii) the lean adjoint has smaller magnitude, reducing gradient variance. Intuitively, the removed terms are large in magnitude but cancel in expectation, so they only add noise to the gradient estimator.
Note the connection to the adjoint notes: the adjoint ODE for a system \(\dot{x} = b(t, x)\) with loss \(g(x(T))\) and no running cost is \(\dot{\lambda} = -b_x^\top \lambda\), \(\lambda(T) = \nabla g\). The lean adjoint is precisely this equation. Stripping the \(u\)-dependent terms from the full adjoint recovers the “pure” adjoint through the base dynamics only.
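For a linear base drift the lean adjoint ODE has a closed form, so a minimal numerical sketch is easy to verify. The drift \(b(x,t) = -x\), the reward \(r(x) = -\|x - 1\|^2\), and the terminal point below are illustrative assumptions:

```python
import numpy as np

# Lean adjoint solve (Equation 15) for the illustrative linear base drift
# b(x, t) = -x: grad_x b = -I, so a_dot = a_tilde, with analytic solution
# a_tilde(t) = a_tilde(1) * exp(t - 1).
N, dt = 1000, 1.0 / 1000
x1 = np.array([0.3, -1.2])           # a hypothetical terminal sample X_1
grad_r = -2.0 * (x1 - 1.0)           # for the toy reward r(x) = -||x - 1||^2
a = -grad_r                          # terminal condition a_tilde(1) = -grad r(X_1)

for _ in range(N):                   # integrate backward from t = 1 to t = 0
    a = a - dt * a                   # Euler step for a_dot = -(grad_x b)^T a = a
a0 = a

print(a0, -grad_r * np.exp(-1.0))    # numeric vs. analytic a_tilde(0)
```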
The Adjoint Matching loss replaces the full adjoint with the lean one:
\[ \mathcal{L}_{\text{AM}}(\theta) = \frac{1}{2} \int_0^1 \| u_\theta(t, X_t) + \sigma(t)^\top \tilde{a}(t; X, \bar{u}) \|^2 \, dt, \qquad X \sim \mathbb{P}^{\bar{u}}. \tag{16}\]
The unique critical point of \(\mathbb{E}[\mathcal{L}_{\text{AM}}]\) is \(u^\star\). Unlike the basic version, the lean adjoint produces a different gradient than the standard adjoint method away from optimality (the removed terms have expectation zero only at \(u^\star\), not elsewhere). Empirically, this leads to more stable convergence.
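Putting the pieces together, one Adjoint Matching round on a toy problem might look as follows. All modeling choices are illustrative: base drift \(b(x,t) = -x\), a constant \(\sigma\) standing in for the memoryless schedule, reward \(r(x) = -(x-1)^2\), and a linear control instead of a neural network.

```python
import numpy as np

rng = np.random.default_rng(3)

# One Adjoint Matching round on a 1D toy.  Illustrative assumptions:
# b(x,t) = -x, constant sigma = 1 (a real run would use the memoryless
# schedule), r(x) = -(x-1)^2, and u_theta(t,x) = theta[0] + theta[1]*x.
N, B, dt, sigma = 100, 512, 1.0 / 100, 1.0
theta = np.zeros(2)

def u(th, x):
    return th[0] + th[1] * x

# 1) Simulate X ~ P^{u_bar} with frozen parameters (the stop-gradient copy).
X = np.zeros((N + 1, B))
X[0] = rng.standard_normal(B)
for k in range(N):
    X[k + 1] = X[k] + (-X[k] + sigma * u(theta, X[k])) * dt \
             + sigma * np.sqrt(dt) * rng.standard_normal(B)

# 2) Lean adjoint backward (Equation 15): grad_x b = -1, so a_dot = a.
a = np.zeros((N + 1, B))
a[N] = 2.0 * (X[N] - 1.0)            # a_tilde(1) = -grad r(X_1)
for k in range(N, 0, -1):
    a[k - 1] = a[k] - dt * a[k]

# 3) Regress u_theta onto the targets -sigma * a (Equation 16).
def loss_grad(th):
    resid = u(th, X[:-1]) + sigma * a[:-1]
    return 0.5 * np.mean(resid**2), np.array([np.mean(resid),
                                              np.mean(resid * X[:-1])])

loss0, _ = loss_grad(theta)
for _ in range(200):                 # gradient descent on the quadratic loss
    _, g = loss_grad(theta)
    theta -= 0.1 * g
loss1, _ = loss_grad(theta)
print(loss0, loss1)                  # the regression loss drops
```

A full run would repeat these three steps, re-simulating trajectories under the updated (re-frozen) control each round.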
After fine-tuning
The memoryless schedule \(\sigma(t) = \sqrt{2\eta_t}\) is only used during training. After convergence, the fine-tuned model can switch to any noise schedule for sampling, including \(\sigma(t) = 0\) for deterministic generation. This works because the memoryless schedule is the unique one preserving the velocity-score relationship: a model trained under it yields a valid velocity field \(v^{\text{finetune}}\) that can be plugged directly into the flow matching ODE or any other SDE sampler from the same family.