Setup
Consider a pre-trained generative model that produces samples \(X_1\) by simulating the SDE
\[ dX_t = b(X_t, t) \, dt + \sigma_t \, dW_t, \qquad X_0 \sim \mathcal{N}(0, I) \tag{1}\]
on \([0,1]\). We restrict to scalar, state-independent \(\sigma_t > 0\) throughout, matching the generative model setting. The drift \(b\) encodes a learned score or velocity field. The path distribution of this process is \(\mathbb{P}\) and the terminal distribution is \(p_1(x)\). Given a reward function \(r: \mathbb{R}^D \to \mathbb{R}\), we want to fine-tune the model so that its output distribution becomes the tilted distribution
\[ \pi(x) \propto p_1(x) \, \exp(r(x)). \tag{2}\]
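As a sanity check on what Equation 2 asks for, the tilt can be computed on a 1-D grid. Everything below is an illustrative assumption: \(p_1\) is a standard normal and the reward is \(r(x) = -(x-1)^2/2\), for which the tilted target is \(\mathcal{N}(1/2, 1/2)\) in closed form.

```python
import numpy as np

xs = np.linspace(-6.0, 6.0, 2001)
dx = xs[1] - xs[0]

p1 = np.exp(-xs**2 / 2) / np.sqrt(2 * np.pi)  # base terminal density (assumed N(0,1))
r = -(xs - 1.0) ** 2 / 2                      # hypothetical reward

tilt = p1 * np.exp(r)                         # Equation 2, unnormalized
pi = tilt / (tilt.sum() * dx)                 # normalize on the grid

mean = (xs * pi).sum() * dx
var = ((xs - mean) ** 2 * pi).sum() * dx
print(mean, var)   # closed form: mean 1/2, variance 1/2
```

Tilting a Gaussian by a quadratic reward stays Gaussian, which is what makes this toy checkable; a learned \(p_1\) has no such closed form.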
We assume \(r\) is smooth and that all expectations below are well-defined. Introduce a control \(u(t,x)\) and the controlled SDE
\[ dX^u_t = {\left[ b(X^u_t, t) + \sigma_t \, \textcolor{blue}{u(t, X^u_t)} \right]} \, dt + \sigma_t \, dW_t. \tag{3}\]
The Girsanov theorem gives \(D_{\text{KL}}(\mathbb{P}^u \| \mathbb{P}) = \frac{1}{2} \mathbb{E}_{\mathbb{P}^u} \int_0^1 \|u(t, X^u_t)\|^2 \, dt\) (the Girsanov notes derive \(D_{\text{KL}}(\mathbb{P} \| \mathbb{P}^u)\) with expectation under \(\mathbb{P}\); the same computation with the roles of the measures swapped gives the expression above, since the stochastic integral \(\int u^\top dW\) has zero expectation under the measure whose Brownian motion drives it), so the stochastic optimal control problem
\[ \min_u \; J(u) = \mathbb{E}_{\mathbb{P}^u} {\left[ \int_0^1 \tfrac{1}{2} \|u(t, X^u_t)\|^2 \, dt - r(X^u_1) \right]} \tag{4}\]
maximizes the terminal reward while penalizing deviations from the base process. Equivalently, we maximize \(\mathbb{E}[-\frac{1}{2}\|u\|^2 + r(X_1)]\), which is the SOC problem from the HJB notes with \(f = 0\) and \(g = r\). Here we fix \(X_0 \sim \mathcal{N}(0,I)\) and optimize only over \(u\), unlike the joint \((q,u)\) optimization in the HJB notes.
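The objective Equation 4 and the Girsanov KL term can both be estimated by Euler–Maruyama simulation of Equation 3. Everything below is an illustrative stand-in (toy drift \(b(x,t) = -x\), constant \(\sigma\), quadratic reward, constant control), not a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
n_paths, n_steps = 20000, 200
dt = 1.0 / n_steps

b = lambda x, t: -x                     # toy base drift (assumption)
sigma = lambda t: 1.0                   # toy constant noise schedule
r = lambda x: -(x - 1.0) ** 2           # hypothetical reward
u = lambda t, x: 0.5 * np.ones_like(x)  # a fixed candidate control

x = rng.standard_normal(n_paths)        # X_0 ~ N(0, I), here D = 1
run_cost = np.zeros(n_paths)
for k in range(n_steps):
    t = k * dt
    uu = u(t, x)
    run_cost += 0.5 * uu**2 * dt        # pathwise running cost ∫ ||u||²/2 dt
    x = x + (b(x, t) + sigma(t) * uu) * dt \
          + sigma(t) * np.sqrt(dt) * rng.standard_normal(n_paths)

kl = run_cost.mean()                    # D_KL(P^u || P) by Girsanov
J = (run_cost - r(x)).mean()            # Equation 4, Monte Carlo
print(kl, J)                            # kl = 0.125 exactly for this u
```

For the constant control the running cost is deterministic, so the KL estimate is exact; a state-dependent \(u\) would make it a genuine Monte Carlo average.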
Value function bias
The HJB notes show that the optimal control is \(u^\star(t,x) = \sigma_t^\top \nabla V(t,x)\) where \(V(t,x) = \log h(t,x)\) is the value function and
\[ h(t,x) = \mathbb{E} {\left[ \exp(r(X_1)) \mid X_t = x \right]} \]
under the base process. Combining Ito’s formula on \(V(t,X_t)\) with the HJB equation and Girsanov, the bivariate marginal of the optimal path measure satisfies
\[ p^\star(X_0, X_1) \propto p^{\text{base}}(X_0, X_1) \, \exp {\left( r(X_1) - V(0, X_0) \right)} . \tag{5}\]
The HJB notes derive (final displayed equation) the pathwise identity \(V(t_2, X_{t_2}) - V(t_1, X_{t_1}) = \int_{t_1}^{t_2} \frac{1}{2}\|u^\star\|^2 ds + \text{martingale}\) via Ito’s formula and the HJB equation. Plugging this into the Girsanov path-level RN derivative \(d\mathbb{P}^{u^\star}/d\mathbb{P}= \exp(-\frac{1}{2}\int \|u^\star\|^2 ds + \int u^{\star\top} dW_s)\) and taking expectations (which kills the martingale), the path-dependent terms reduce to endpoint functions, yielding Equation 5.
The factor \(\exp(-V(0, X_0)) = 1/h(0, X_0)\) is the conditional normalizing constant: given \(X_0\), the RN derivative of the optimal path conditioned on \(X_0\) is \(\exp(r(X_1))/h(0, X_0)\).
Marginalizing the joint over \(X_0\):
\[ \begin{align} p^\star(X_0, X_1) &\propto p^{\text{base}}(X_0, X_1) \, \exp {\left( r(X_1) - V(0, X_0) \right)} \\ p^\star(X_1) &= \int p^{\text{base}}(X_0, X_1) \, \exp {\left( r(X_1) - V(0, X_0) \right)} \, dX_0. \end{align} \tag{6}\]
When \(X_0\) and \(X_1\) are correlated under \(\mathbb{P}\), the factor \(\exp(-V(0, X_0))\) cannot be pulled out of the integral. The marginal \(p^\star(X_1)\) is not proportional to \(p_1(X_1) \exp(r(X_1))\). The SOC solution tilts both the terminal and the initial distribution, and the fine-tuned model does not sample from Equation 2.
As a concrete example, take \(\sigma_t = 0\) (standard flow matching). The ODE \(\dot{X}_t = b(X_t, t)\) is deterministic: \(X_0\) fully determines \(X_1 = \Phi(X_0)\) for a diffeomorphism \(\Phi\). Adding a control \(u\) in Equation 3 has no effect: the control enters as \(\sigma_t u\), so it vanishes together with the noise (this is specific to the control-affine parameterization). The optimal control is trivially \(u = 0\), since any nonzero \(u\) adds running cost \(\frac{1}{2}\|u\|^2\) without affecting the trajectory. The bias is maximal.
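The bias is easy to exhibit numerically. The sketch below takes a bivariate Gaussian base joint for \((X_0, X_1)\) with correlation \(\rho\) (an assumption standing in for the SDE's joint law) and a hypothetical reward \(r(x) = -(x-1)^2/2\), computes \(p^\star(X_1)\) via Equation 6 on a grid, and measures its total-variation gap to the naive tilt \(p_1 \exp(r)\):

```python
import numpy as np

xs = np.linspace(-6, 6, 401)
dx = xs[1] - xs[0]
x0, x1 = np.meshgrid(xs, xs, indexing="ij")   # axis 0: X_0, axis 1: X_1
r = -(xs - 1.0) ** 2 / 2                      # hypothetical reward on X_1

def soc_marginal(rho):
    """p*(X_1) from Equation 6 for a standard bivariate normal base joint."""
    z = (x0**2 - 2 * rho * x0 * x1 + x1**2) / (2 * (1 - rho**2))
    p_joint = np.exp(-z) / (2 * np.pi * np.sqrt(1 - rho**2))
    # h(0, x0) = E[exp(r(X_1)) | X_0 = x0] under the base joint
    p_cond = p_joint / (p_joint.sum(axis=1, keepdims=True) * dx)
    h0 = (p_cond * np.exp(r)[None, :]).sum(axis=1) * dx
    # Equation 6: marginalize the tilted joint over X_0
    tilted = p_joint * np.exp(r)[None, :] / h0[:, None]
    p_star = tilted.sum(axis=0) * dx
    return p_star / (p_star.sum() * dx)

p1 = np.exp(-xs**2 / 2) / np.sqrt(2 * np.pi)
naive = p1 * np.exp(r)
naive /= naive.sum() * dx                     # the Equation 2 target

gap = lambda rho: 0.5 * np.sum(np.abs(soc_marginal(rho) - naive)) * dx
print(gap(0.0), gap(0.9))   # ~0 when X_0 ⊥ X_1, clearly positive otherwise
```

At \(\rho = 0\) the gap vanishes, previewing the memoryless condition of the next section.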
Memoryless noise schedule
Look again at Equation 6. If \(X_0 \perp X_1\) under the base process, then \(p^{\text{base}}(X_0, X_1) = p^{\text{base}}(X_0) \, p_1(X_1)\) and
\[ \begin{align} p^\star(X_1) &= p_1(X_1) \, \exp(r(X_1)) \int p^{\text{base}}(X_0) \, \exp(-V(0, X_0)) \, dX_0\\ &\propto p_1(X_1) \, \exp(r(X_1)). \end{align} \tag{7}\]
The integral over \(X_0\) collapses to a constant. The bias disappears.
A base process Equation 1 is memoryless when \(X_0 \perp X_1\). For the family of generative SDEs that arise in flow matching and diffusion models, this condition pins down \(\sigma_t\).
The generative SDE Equation 1 is built to have the same time marginals as the reference interpolation \(\bar{X}_t = \beta_t \bar{X}_0 + \alpha_t \bar{X}_1\) with \(\bar{X}_0 \sim \mathcal{N}(0,I)\) independent of \(\bar{X}_1\). Conditionally on \(\bar{X}_1\), this reference has \(\bar{X}_t \mid \bar{X}_1 \sim \mathcal{N}(\alpha_t \bar{X}_1, \beta_t^2 I)\): the conditional variance \(\beta_t^2\) starts at \(\beta_0^2 = 1\) (pure noise) and decays to \(\beta_1^2 = 0\) (deterministic). The unified drift from (Domingo-Enrich et al., 2024) is
\[ b(x,t) = \kappa_t \, x + {\left( \tfrac{\sigma_t^2}{2} + \eta_t \right)} \nabla \log p_t(x) \tag{8}\]
This drift is constructed so that the SDE has the same time marginals as the reference interpolation for any choice of \(\sigma_t\): the \(\kappa_t x\) term handles the deterministic scaling and the score term corrects for the injected noise. Here \(\kappa_t = \dot{\alpha}_t / \alpha_t\) and \(\eta_t = \beta_t(\dot{\alpha}_t \beta_t / \alpha_t - \dot{\beta}_t)\). For \(X_0 \perp X_1\) to hold, the SDE must inject enough noise to erase memory of its initial condition by time 1. The memoryless noise level \(\sigma_t = \sqrt{2\eta_t}\) is the unique choice that makes the conditional law \(X_t \mid X_1\) under the SDE match the reference Gaussian structure \(\mathcal{N}(\alpha_t X_1, \beta_t^2 I)\) for all \(t\). The paper (Proposition in Section 12.1 of (Domingo-Enrich et al., 2024)) proves this by analyzing the time-reversed SDE and showing that the coefficient of \(X_0\) in the explicit solution vanishes if and only if \(\sigma_t^2 \geq 2\eta_t\), with \(\sigma_t^2 = 2\eta_t\) the minimal such choice. This requires
\[ \textcolor{blue}{\sigma_t = \sqrt{2 \eta_t}}. \tag{9}\]
Near \(t=0\), \(\eta_t \to \infty\) so \(\sigma_t \to \infty\): the process mixes aggressively, erasing memory of \(X_0\). Near \(t=1\), \(\eta_t \to 0\) so \(\sigma_t \to 0\): the process stabilizes around \(X_1\).
Why \(\sigma_t = \sqrt{2\eta_t}\) is the right condition (sketch):
The time-reversed SDE for \(\vec{X}_t = X_{1-t}\) is a linear SDE whose explicit solution expresses \(\vec{X}_1 = X_0\) as a function of \(\vec{X}_0 = X_1\). The coefficient of \(X_1\) in this expression involves \(\exp(-\int_0^1 \sigma_s^2/(2\beta_s^2) \, ds)\). For \(X_0 \perp X_1\), this coefficient must vanish, requiring \(\int_0^1 \sigma_s^2/(2\beta_s^2) \, ds = \infty\). The threshold is \(\sigma_s^2 = 2\eta_s\) (the paper proves this in Appendix 12.1). Below the threshold, correlation between \(X_0\) and \(X_1\) persists; above it, independence still holds, but the conditional law \(X_t \mid X_1\) no longer matches the reference Gaussian structure, so \(\sigma_s^2 = 2\eta_s\) is the minimal, canonical choice.

Checking the flow matching schedule explicitly:
Take \(\alpha_t = t\) and \(\beta_t = 1-t\). Then \(\dot{\alpha}_t = 1\), \(\dot{\beta}_t = -1\), \(\kappa_t = 1/t\), and \[ \eta_t = (1-t) {\left( \frac{1}{t}(1-t) - (-1) \right)} = (1-t) {\left( \frac{1-t}{t} + 1 \right)} = \frac{(1-t)}{t}. \] So the memoryless schedule is \(\sigma_t = \sqrt{2(1-t)/t}\), which blows up as \(t \to 0\) (aggressive mixing near the noise) and vanishes as \(t \to 1\) (stabilizing near the sample).
(In practice, the time interval is truncated to \([\varepsilon, 1]\) to avoid the singularity at \(t = 0\).)
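The \(\eta_t\) computation above can be checked numerically with central finite differences (pure stdlib); the closed form \((1-t)/t\) is the one just derived:

```python
import math

# alpha_t = t, beta_t = 1 - t is the flow matching schedule from the text.
def eta(t, alpha=lambda s: s, beta=lambda s: 1 - s, h=1e-6):
    a, bt = alpha(t), beta(t)
    adot = (alpha(t + h) - alpha(t - h)) / (2 * h)
    bdot = (beta(t + h) - beta(t - h)) / (2 * h)
    return bt * (adot * bt / a - bdot)   # eta_t = beta(alphadot*beta/alpha - betadot)

for t in (0.1, 0.25, 0.5, 0.9):
    closed_form = (1 - t) / t            # the value derived above
    assert abs(eta(t) - closed_form) < 1e-6
    print(t, math.sqrt(2 * eta(t)))      # memoryless sigma_t = sqrt(2(1-t)/t)
```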
For DDIM with \(\alpha_t = \sqrt{\bar{\alpha}_t}\), \(\beta_t = \sqrt{1 - \bar{\alpha}_t}\), the memoryless schedule gives \(\sigma_t = \sqrt{\dot{\bar{\alpha}}_t / \bar{\alpha}_t}\), which is exactly the DDPM noise schedule (verified in the details block above for flow matching; for the DDPM verification, see Table 1 of the paper). The DDPM noise schedule is the memoryless schedule for DDIM.
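The DDIM reduction can be checked the same way. The schedule \(\bar{\alpha}_t = t^2\) below is a hypothetical increasing schedule in generative time, chosen only so the finite-difference check runs; it is not the actual DDPM schedule:

```python
import math

abar = lambda t: t**2      # hypothetical increasing schedule (assumption)
h = 1e-6

def eta(t):
    # DDIM parameterization: alpha = sqrt(abar), beta = sqrt(1 - abar)
    a = math.sqrt(abar(t))
    bt = math.sqrt(1 - abar(t))
    adot = (math.sqrt(abar(t + h)) - math.sqrt(abar(t - h))) / (2 * h)
    bdot = (math.sqrt(1 - abar(t + h)) - math.sqrt(1 - abar(t - h))) / (2 * h)
    return bt * (adot * bt / a - bdot)

for t in (0.2, 0.5, 0.8):
    abar_dot = (abar(t + h) - abar(t - h)) / (2 * h)
    lhs = math.sqrt(2 * eta(t))              # memoryless sigma_t
    rhs = math.sqrt(abar_dot / abar(t))      # claimed closed form
    assert abs(lhs - rhs) < 1e-4
    print(t, lhs)
```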
The paper proves (Theorem 2) that within the family of SDEs sharing marginals with the reference flow, the memoryless schedule is not just sufficient but necessary for the fine-tuned velocity to be convertible to arbitrary noise schedules. The memoryless schedule is the unique choice that preserves the velocity-score relationship: a model trained under the memoryless schedule yields consistent velocity and score fields, \(v_\theta(t,x) = b(x,t) + \sigma_t u_\theta(t,x)\) and \(s_\theta(t,x) = -u_\theta(t,x)/\sigma_t + \nabla \log p_t(x)\), enabling conversion to arbitrary noise schedules after fine-tuning.
The memoryless schedule is only needed during fine-tuning. After learning the control \(u\), the fine-tuned model (base drift plus control) can be converted back to a velocity field and sampled with any \(\sigma_t\), including \(\sigma_t = 0\).
Adjoint Matching
With the memoryless schedule, we need to solve Equation 4. The adjoint method gives the gradient of the cost \(J\) along a trajectory. The adjoint state is the sensitivity of the future cost to perturbations at time \(t\):
\[ a(t) = \nabla_{X_t} {\left[ \int_t^1 \tfrac{1}{2} \|u(s, X_s)\|^2 \, ds - r(X_1) \right]} . \tag{10}\]
Applying the adjoint ODE from the adjoint notes to the controlled drift \(b + \sigma_t u\) with running cost \(\frac{1}{2}\|u\|^2\) and terminal cost \(-r\):
\[ \dot{a}(t) = - {\left[ \nabla_x b + \sigma_t \, \nabla_x u \right]} ^\top a(t) - \underbrace{(\nabla_x u)^\top u}_{\nabla_x(\frac{1}{2}\|u\|^2) \text{ (chain rule)}}, \qquad a(1) = -\nabla r(X_1). \tag{11}\]
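A minimal sketch of solving Equation 11 backward along a stored trajectory with explicit Euler; the 1-D drift, control, and reward (and their hand-coded derivatives) are illustrative stand-ins:

```python
import numpy as np

def b(x, t):      return -x              # toy base drift (assumption)
def db_dx(x, t):  return -1.0
def sigma(t):     return 1.0             # toy noise schedule (assumption)
def u(t, x):      return 0.3 * x         # some fixed control (assumption)
def du_dx(t, x):  return 0.3
def grad_r(x):    return -2 * (x - 1.0)  # r(x) = -(x - 1)^2

def full_adjoint(path, n_steps):
    """Solve Equation 11 backward with explicit Euler along `path`."""
    dt = 1.0 / n_steps
    a = -grad_r(path[-1])                # a(1) = -∇r(X_1)
    for k in reversed(range(n_steps)):
        t, x = k * dt, path[k]
        # a' = -(∇_x b + σ ∇_x u)ᵀ a - (∇_x u)ᵀ u
        adot = -(db_dx(x, t) + sigma(t) * du_dx(t, x)) * a \
               - du_dx(t, x) * u(t, x)
        a = a - adot * dt                # step from t+dt down to t
    return a

# Forward pass: simulate and store one controlled trajectory.
rng = np.random.default_rng(1)
n_steps = 200
dt = 1.0 / n_steps
path = np.empty(n_steps + 1)
path[0] = rng.standard_normal()
for k in range(n_steps):
    t = k * dt
    path[k + 1] = path[k] + (b(path[k], t) + sigma(t) * u(t, path[k])) * dt \
                  + sigma(t) * np.sqrt(dt) * rng.standard_normal()

print(full_adjoint(path, n_steps))       # a(0) for this sampled path
```

For a neural control, the \(\nabla_x u\) terms would require a vector–Jacobian product per step; this is exactly the cost the lean adjoint removes.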
The conditional expectation \(\mathbb{E}[a(t) \mid X_t = x] = \nabla_x J(u; x, t)\) is the gradient of the cost-to-go; this follows, heuristically, from the SDE extension described in the adjoint notes (the interchange of gradient and conditional expectation requires regularity of \(r\) and integrability of \(\nabla r \cdot \exp(r)\)). The paper defines \(V\) as cost-to-go (minimization); we use the HJB notes' convention \(V = \log h\) (reward-to-go, maximization), so \(V_{\text{here}} = -V_{\text{paper}}\). Since \(u^\star = \sigma_t^\top \nabla V\) and \(V(t,x) = -J(u^\star; x, t)\) at optimality, we get
\[ u^\star(t, x) = -\sigma_t^\top \mathbb{E} {\left[ a(t; X, u^\star) \mid X_t = x \right]} . \tag{12}\]
This is a fixed-point condition: \(u^\star\) is the unique control satisfying Equation 12.
Turn this into a regression. For a current control \(\bar{u}\) (parameters frozen), simulate \(X \sim \mathbb{P}^{\bar{u}}\), solve the adjoint ODE Equation 11 backward, and regress:
\[ L_{\text{basic}}(\theta) = \frac{1}{2} \int_0^1 \| u_\theta(X_t, t) + \sigma_t^\top a(t; X, \bar{u}) \|^2 \, dt, \qquad X \sim \mathbb{P}^{\bar{u}}, \quad \bar{u} = \texttt{sg}(u_\theta). \tag{13}\]
Here \(\texttt{sg}\) denotes stop-gradient: the trajectory and adjoint use frozen parameters; gradients only flow through the \(u_\theta\) term. Expanding: \(\nabla_\theta L_{\text{basic}} = \mathbb{E}[(\nabla_\theta u_\theta)^\top (u_\theta + \sigma_t a)]\). At \(\bar{u} = u_\theta\) (stop-gradient), the \(a\) term produces the continuous adjoint gradient from Equation 11. So this basic version produces the same parameter updates as the standard adjoint method.
The lean adjoint
The full adjoint Equation 11 contains terms involving \(\nabla_x u\). At optimality, these terms have conditional expectation zero.
Removing the \(u\)-dependent terms:
Both sides are column vectors in \(\mathbb{R}^D\). The minimizer of the regression Equation 13 is the conditional expectation:
\[ u^\star(t, x) = -\sigma_t^\top \mathbb{E} {\left[ a(t) \mid X_t = x \right]} . \]
Transpose both sides and right-multiply by the Jacobian \(\nabla_x u^\star(t, x)\):
\[ u^\star(t, x)^\top \nabla_x u^\star(t, x) = -\mathbb{E} {\left[ a(t) \mid X_t = x \right]} ^\top \sigma_t \, \nabla_x u^\star(t, x). \]
Since \(u^\star(t,x)\) and \(\nabla_x u^\star(t,x)\) are deterministic functions of \((t,x)\), they can be pulled in and out of \(\mathbb{E}[\cdot | X_t = x]\) freely. Rearranging and applying the tower property:
\[ \mathbb{E} {\left[ u^\star(t, x)^\top \nabla_x u^\star(t, x) + a(t)^\top \sigma_t \, \nabla_x u^\star(t, x) \mid X_t = x \right]} = 0. \tag{14}\]
The terms inside the expectation appear in the full adjoint ODE Equation 11 (they are the \(u\)-dependent pieces: \(\sigma_t \, (\nabla_x u)^\top a\) and \(\nabla_x(\frac{1}{2}\|u\|^2) = (\nabla_x u)^\top u\)). They vanish in conditional expectation at optimality. This gives the lean adjoint \(\tilde{a}(t)\):
\[ \dot{\tilde{a}}(t) = -\nabla_x b(X_t, t)^\top \tilde{a}(t), \qquad \tilde{a}(1) = -\nabla r(X_1). \tag{15}\]
Compare with Equation 11: every \(u\)-dependent term is gone. The lean adjoint depends only on the base drift \(b\), not on the control. No need to compute \(\nabla_x u\), which is expensive for neural networks. The lean adjoint also has smaller magnitude since the removed terms cancel in expectation but add gradient variance.
The lean adjoint is the adjoint ODE from the adjoint notes applied to the base dynamics \(\dot{x} = b(t,x)\) with terminal cost \(-r\).
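A sketch of Equation 15 under the same toy conventions: with an assumed base drift \(b(x,t) = -x\), \(\nabla_x b = -1\) and the lean adjoint has the closed form \(\tilde{a}(t) = \tilde{a}(1)\, e^{t-1}\), which the backward Euler solve reproduces:

```python
import numpy as np

def db_dx(x, t):
    return -1.0                          # b(x, t) = -x (toy assumption)

def grad_r(x):
    return -2 * (x - 1.0)                # r(x) = -(x - 1)^2 (toy assumption)

def lean_adjoint(path, n_steps):
    dt = 1.0 / n_steps
    a = -grad_r(path[-1])                # ã(1) = -∇r(X_1)
    traj = [a]
    for k in reversed(range(n_steps)):
        # Equation 15: ã' = -(∇_x b)ᵀ ã — no control Jacobian needed
        a = a + db_dx(path[k], k * dt) * a * dt
        traj.append(a)
    return np.array(traj[::-1])          # ã(t_k) for k = 0..n_steps

# With ∇_x b = -1 the ODE is ã' = ã, so ã(0) = ã(1) e^{-1} exactly.
n_steps = 100
path = np.zeros(n_steps + 1)             # ∇_x b is state-independent here,
path[-1] = 2.0                           # so only the endpoint matters
a = lean_adjoint(path, n_steps)
print(a[0], -grad_r(2.0) * np.exp(-1))   # Euler vs exact e^{-1} decay
```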
The Adjoint Matching loss replaces the full adjoint with the lean one:
\[ L_{\text{AM}}(\theta) = \frac{1}{2} \int_0^1 \| u_\theta(X_t, t) + \sigma_t^\top \tilde{a}(t; X, \bar{u}) \|^2 \, dt, \qquad X \sim \mathbb{P}^{\bar{u}}. \tag{16}\]
The unique critical point of \(\mathbb{E}[L_{\text{AM}}]\) is \(u^\star\). The argument above shows that \(\mathbb{E}[a(t)^\top \sigma_t \nabla_x u^\star + (\nabla_x u^\star)^\top u^\star \mid X_t] = 0\) at optimality. This does NOT mean the lean adjoint \(\tilde{a}(t)\) equals \(\mathbb{E}[a(t) \mid X_t]\) pointwise; it means that replacing \(a\) by \(\tilde{a}\) in the regression loss produces the same gradient at \(u = u^\star\). To see why: the gradient of \(L_{\text{AM}}(\theta) = \frac{1}{2}\mathbb{E}\|u_\theta + \sigma_t \tilde{a}\|^2\) with respect to \(\theta\) is \(\mathbb{E}[(\nabla_\theta u_\theta)^\top(u_\theta + \sigma_t \tilde{a})]\). At \(u_\theta = u^\star\), this equals \(\mathbb{E}[(\nabla_\theta u^\star)^\top(u^\star + \sigma_t \tilde{a})]\). Adding back the removed terms: \(u^\star + \sigma_t a = u^\star + \sigma_t \tilde{a} + \sigma_t(\text{removed terms})\). Since the removed terms have zero conditional expectation given \(X_t\) and \(\nabla_\theta u^\star\) is a function of \(X_t\), the tower property gives \(\mathbb{E}[(\nabla_\theta u^\star)^\top \sigma_t(\text{removed})] = 0\). So the gradients of the lean and basic losses agree at \(u^\star\), and by the uniqueness of the critical point of \(\mathbb{E}[L_{\text{basic}}]\) (Appendix 13.3 of the paper), \(u^\star\) is also the unique critical point of \(\mathbb{E}[L_{\text{AM}}]\).
Unlike the basic version, the lean adjoint produces a different gradient than the standard adjoint method away from optimality (the removed terms have expectation zero only at \(u^\star\), not elsewhere). The paper reports more stable convergence in practice.
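The full loop of Equations 15 and 16 fits in a few lines in a toy 1-D case. All modeling choices below are assumptions: base drift \(b(x,t) = -x\), \(\sigma_t = 1\) (skipping the memoryless construction), linear reward \(r(x) = x\), and a per-time-bin lookup table for \(u_\theta\). With a linear reward the lean adjoint is path-independent, so the fixed point \(u^\star(t,x) = e^{t-1}\) is known in closed form and the regression recovers it:

```python
import numpy as np

rng = np.random.default_rng(0)
n_steps, n_paths, sigma = 50, 2000, 1.0
dt = 1.0 / n_steps
ts = np.arange(n_steps) * dt

def lean_adjoint():
    # Equation 15 with b(x, t) = -x: ã' = ã, ã(1) = -∇r = -1.
    a = -1.0
    out = np.empty(n_steps)
    for k in reversed(range(n_steps)):
        a *= 1 - dt                     # Euler step from t+dt down to t
        out[k] = a
    return out

theta = np.zeros(n_steps)               # u_theta(t, x) = theta[k] on bin k
for it in range(5):
    u_bar = theta.copy()                # stop-gradient: frozen control
    x = rng.standard_normal(n_paths)    # X_0 ~ N(0, 1)
    for k in range(n_steps):            # simulate X ~ P^{u_bar}
        x = x + (-x + sigma * u_bar[k]) * dt \
              + sigma * np.sqrt(dt) * rng.standard_normal(n_paths)
    a_lean = lean_adjoint()
    # Regression step of Equation 16: target -sigma * ã(t), constant in x,
    # so the per-bin least-squares solution is the target itself.
    theta = -sigma * a_lean

err = np.max(np.abs(theta - np.exp(ts - 1)))
print(err)                              # small: Euler discretization only
```

In this toy the regression target is constant in \(x\), so the simulated paths do not influence the fit; they are kept to show the stop-gradient loop structure of Equation 16.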
After fine-tuning
The memoryless schedule \(\sigma_t = \sqrt{2\eta_t}\) is only used during training. After convergence, the fine-tuned drift can be converted to a velocity field and sampled with any noise schedule, including \(\sigma_t = 0\) for deterministic generation.