The HJB notes left us with a clean picture and a frustrating bottleneck. To sample from a target \(\pi(x) \propto \exp(-E(x))\), start a Brownian motion at \(X_0 = 0\), add a control drift \(u(t,x)\), and choose \(u\) to minimize the stochastic optimal control cost \[ J(u) = \mathbb{E}_{\mathbb{P}^u} {\left[ \int_0^1 \tfrac{1}{2}\|u(t,X_t)\|^2 \, dt + g(X_1) \right]} , \tag{1}\] where \(g(x) = \log p^\text{base}_1(x) - \log \pi(x)\) is the terminal cost. (Sign convention: the HJB notes use a maximization convention with reward \(g_\text{HJB}\); here we use minimization with cost \(g = -g_\text{HJB}\), so \(V = -J\) and the optimal control takes the same form \(u^\star = \sigma_t \nabla V\) in both conventions.) The controlled SDE is \[ dX_t = \sigma_t \, u(t,X_t) \, dt + \sigma_t \, dW_t, \qquad X_0 = 0, \tag{2}\] with scalar noise schedule \(\sigma_t > 0\) (when \(\sigma\) is a matrix, \(u^\star = \sigma_t^\top \nabla V\); the scalar case simplifies this to \(\sigma_t \nabla V\)). The base process (\(u \equiv 0\)) gives \(X_t \sim \mathcal{N}(0, \nu_t I)\) with cumulative variance \(\nu_t = \int_0^t \sigma_s^2 \, ds\). By Girsanov, \(D_{\text{KL}}(\mathbb{P}^u \| \mathbb{P}) = \frac{1}{2} \mathbb{E}_{\mathbb{P}^u} \int_0^1 \|u\|^2 \, dt\) (the Girsanov notes derive \(D_{\text{KL}}(\mathbb{P}, \mathbb{P}^u) = \frac{1}{2}\mathbb{E}_\mathbb{P}[\int \|u\|^2 dt]\); the formula with \(\mathbb{E}_{\mathbb{P}^u}\) follows from an analogous calculation, or by substituting the Girsanov weight), so the SOC trades terminal cost against control effort. The optimal control is \(u^\star(t,x) = \sigma_t \nabla V(t,x)\) where \(V = \log h\) and \[ h(t,x) = \mathbb{E} {\left[ \exp(-g(X_1)) \mid X_t = x \right]} \tag{3}\] under the base process. This is the Doob h-transform: \(h\) tilts the path measure toward low-cost terminal states. The optimal control \(u^\star\) points in the direction of \(\nabla \log h\), i.e. 
toward regions where \(h\) is large, where the expected future cost-to-go is small, where the energy \(E\) is low. At optimality the controlled path measure equals the Schrodinger bridge: \(\mathbb{P}^{u^\star}(\boldsymbol{X}) = \mathbb{P}(\boldsymbol{X} \mid X_1) \, \pi(X_1)\). The SB notes show that the SB minimizing \(D_{\text{KL}}(\mathbb{Q}\| \mathbb{P})\) with \(\mathbb{Q}_0 = \delta_0\) and \(\mathbb{Q}_1 = \pi\) has \(d\mathbb{Q}/d\mathbb{P}\propto \varphi_1(X_1)\) (no initial tilt, since \(\delta_0\) is already the base initial distribution). This means \(\mathbb{Q}(\boldsymbol{X}) = \mathbb{P}(\boldsymbol{X} | X_1) \pi(X_1)\) after normalizing the terminal potential so that \(\int \varphi_1(x) p_1^{\text{base}}(x) dx = 1\). The optimal cost is \(J(u^\star) = -\log h(0,0)\). The formula \(u^\star = \sigma_t \nabla V\) is elegant; computing \(V\) is not. The function \(h\) in Equation 3 requires averaging \(\exp(-g)\) over all possible futures of a Brownian motion, which is as hard as the original sampling problem.
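To fix ideas before going further, here is a minimal Euler–Maruyama discretization of Equation 2 (a NumPy sketch; the helper `simulate` and its signature are ours, not from any referenced paper). With \(u \equiv 0\) and a constant schedule \(\sigma_t = 1\), the terminal law should be \(\mathcal{N}(0, \nu_1 I)\) with \(\nu_1 = 1\):

```python
import numpy as np

def simulate(u, sigma, n_paths=20_000, n_steps=100, dim=1, seed=0):
    """Euler-Maruyama for dX = sigma_t u(t, X) dt + sigma_t dW, X_0 = 0."""
    rng = np.random.default_rng(seed)
    dt = 1.0 / n_steps
    X = np.zeros((n_paths, dim))
    for k in range(n_steps):
        t, s = k * dt, sigma(k * dt)
        X = X + s * u(t, X) * dt + s * np.sqrt(dt) * rng.standard_normal(X.shape)
    return X

sigma = lambda t: 1.0                    # constant schedule, so nu_t = t
base  = lambda t, x: np.zeros_like(x)    # u = 0: the base process
X1 = simulate(base, sigma)
print(X1.mean(), X1.var())               # ~ (0, nu_1) = (0, 1)
```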
Three natural questions arise.
Can I learn \(u^\star\) without computing \(h\)?
The adjoint notes provide a tool for exactly this situation. For an ODE \(\dot{x} = b(t,x)\) with running cost \(f\) and terminal cost \(g\), the adjoint variable \(\lambda(t) = \nabla_{X_t}[\text{future cost}]\) satisfies \[ \dot{\lambda}(t) = -\nabla_x f - (\nabla_x b)^\top \lambda(t), \qquad \lambda(1) = \nabla g(X_1). \] The full adjoint ODE for a controlled SDE with drift \(b + \sigma_t u\) also contains terms involving \(\nabla_x u\): \[ \dot{a}(t) = - {\left[ \nabla_x b + \sigma_t \nabla_x u \right]} ^\top a(t) - (\nabla_x u)^\top u, \qquad a(1) = \nabla g(X_1). \] As described in the adjoint matching note, the \(u\)-dependent terms (\(\sigma_t (\nabla_x u)^\top a\) and \((\nabla_x u)^\top u\)) have zero conditional expectation at optimality. Dropping them does not change the fixed point but reduces gradient variance. The resulting lean adjoint \(\tilde{a}(t)\) satisfies \[ \dot{\tilde{a}}(t) = -(\nabla_x b)^\top \tilde{a}(t), \qquad \tilde{a}(1) = \nabla g(X_1), \] which depends only on the base drift \(b\), not on the control \(u\). This is a practical win: no need to compute \(\nabla_x u\), which is expensive for neural network parameterizations.
Now set \( \textcolor{blue}{b = 0}\) and \( \textcolor{blue}{f = 0}\): the base process is pure scaled Brownian motion, and the running cost is just the control penalty (which does not enter the lean adjoint). Both forcing terms vanish. The lean adjoint ODE becomes \(\dot{\tilde{a}} = 0\) with terminal condition \(\tilde{a}(1) = \nabla g(X_1)\). So \[ \textcolor{blue}{\tilde{a}(t; \boldsymbol{X}) = \nabla g(X_1)}, \qquad \text{for all } t \in [0,1]. \tag{4}\] No backward ODE to solve. The regression target at every time \(t\) is just the terminal gradient \(\nabla g(X_1)\).
The intuition is clean. Under the base process (\(u = 0\)), \(X_1 = X_t + \int_t^1 \sigma_s \, dW_s\), and the two pieces \(X_t\) and \(\int_t^1 \sigma_s \, dW_s\) are independent: Brownian increments do not remember the past. Holding the future noise realization fixed and perturbing \(X_t\) by a small displacement \(\delta X_t\), the terminal state shifts by the same amount: \(X_1 \mapsto X_1 + \delta X_t\). The sensitivity of the terminal cost to the intermediate state is therefore \(\nabla_{X_t} g(X_1) = \nabla g(X_1)\) along each base trajectory, confirming Equation 4. Note that this decomposition fails along controlled trajectories: there \(X_1\) depends nonlinearly on \(X_t\) through \(u\), and the full adjoint \(a(t) \neq \nabla g(X_1)\). The lean adjoint is constant because it is defined using the base dynamics (\(b = 0\)), not the controlled dynamics; its virtue is that it avoids computing \(\nabla_x u\) while sharing the fixed point of the full adjoint. This is the simplest possible adjoint: the base dynamics offer no resistance to perturbations, so the terminal gradient propagates unchanged backward through time.
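The pathwise argument can be checked numerically: freeze the future noise, nudge \(X_t\), and compare the finite-difference sensitivity of \(g(X_1)\) with \(\nabla g(X_1)\). A NumPy sketch (the quartic \(g\) is an arbitrary stand-in for a smooth terminal cost):

```python
import numpy as np

rng = np.random.default_rng(1)
g  = lambda x: np.sum(x**4)     # an arbitrary smooth terminal cost
dg = lambda x: 4 * x**3         # its gradient

t  = 0.5                        # constant sigma = 1, so nu_t = t
Xt = np.sqrt(t) * rng.standard_normal(3)
Z  = np.sqrt(1 - t) * rng.standard_normal(3)   # frozen future noise
X1 = Xt + Z

# Perturb X_t with the future noise held fixed: X_1 shifts by the same amount,
# so the pathwise sensitivity of g(X_1) to X_t is just grad g(X_1).
eps = 1e-6
e0  = np.array([1.0, 0.0, 0.0])
fd  = (g(Xt + eps * e0 + Z) - g(Xt + Z)) / eps
print(fd, dg(X1)[0])            # agree to first order in eps
```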
The adjoint matching loss with target Equation 4 is \[ L_\text{AM}(u) = \mathbb{E}_{\boldsymbol{X} \sim \mathbb{P}^{\bar{u}}} {\left[ \int_0^1 \tfrac{1}{2}\|u(t,X_t) + \sigma_t \, \nabla g(X_1)\|^2 \, dt \right]} , \tag{5}\] where \(\bar{u} = \texttt{stopgrad}(u)\): trajectories are simulated with the current (frozen) parameters, and gradients only flow through the \(u_\theta\) term inside the square. Concretely, for each trajectory the computational graph is: simulate \(X_0 \to X_1\) using \(\bar{u}\) (no gradient), compute \(\nabla g(X_1)\) (one energy gradient call), then for each saved \((t, X_t)\) along the path, evaluate \(\|u_\theta(t, X_t) + \sigma_t \nabla g(X_1)\|^2\) and backpropagate through \(u_\theta\) only. This regresses \(u(t, X_t)\) onto \(-\sigma_t \nabla g(X_1)\) at each \((t, X_t)\) pair along the trajectory. The unique fixed point is \(u^\star(t,x) = -\sigma_t \, \mathbb{E}_{\mathbb{P}^{u^\star}}[\nabla g(X_1) \mid X_t = x]\) (exchanging \(\nabla\) and \(\mathbb{E}\), valid under regularity of \(g\)): the conditional expectation, under the optimally controlled process, of the terminal gradient. Comparing with \(u^\star = \sigma_t \nabla \log h\), the adjoint target \(\nabla g(X_1)\) is a stochastic estimate of \(-\nabla V(t, X_t)\); the regression averages many noisy terminal gradients to recover the smooth gradient of the value function.
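The fixed point can be checked on a toy problem where everything is analytic (a NumPy sketch; all names are ours). Take \(d = 1\), constant \(\sigma\), and target \(\pi = \mathcal{N}(m, \sigma^2)\), so that \(\nu_1 = \sigma^2\) and \(\nabla g(x) = -x/\sigma^2 + (x - m)/\sigma^2 = -m/\sigma^2\) is constant. A single regression step then recovers the constant optimal control \(u^\star = m/\sigma\), which transports \(\delta_0\) onto \(\pi\):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, m = 2.0, 3.0             # constant schedule; nu_1 = sigma^2; pi = N(m, sigma^2)
# g = log p1_base - log pi gives grad g(x) = -x/sigma^2 + (x - m)/sigma^2 = -m/sigma^2
grad_g = lambda x: -m / sigma**2 * np.ones_like(x)

# One adjoint-matching step: simulate with the frozen control, then least-squares
# fit of a constant control theta onto the target -sigma * grad g(X1).
theta, n_paths, n_steps = 0.0, 5_000, 100
dt = 1.0 / n_steps
X = np.zeros(n_paths)
for _ in range(n_steps):
    X += sigma * theta * dt + sigma * np.sqrt(dt) * rng.standard_normal(n_paths)
theta = (-sigma * grad_g(X)).mean()     # best constant fit; = m / sigma exactly
print(theta)                            # 1.5

# Simulating with the fitted control transports delta_0 onto pi = N(3, 4).
X = np.zeros(n_paths)
for _ in range(n_steps):
    X += sigma * theta * dt + sigma * np.sqrt(dt) * rng.standard_normal(n_paths)
print(X.mean(), X.var())                # ~ (3, 4)
```

Here the target is constant along every trajectory, so one regression suffices; for a general energy the target varies with \(X_1\) and the regression must average many trajectories.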
This is just mean-squared-error regression. No backward ODE, no Hessians, no importance weights. For each trajectory, evaluate \(\nabla g\) once at the endpoint, and you have training signal at every intermediate time. But there is still a computational bottleneck: each gradient step on Equation 5 requires simulating the controlled SDE forward (to produce trajectories), and evaluating the energy function at \(X_1\). Both operations are expensive. The next section removes the first cost almost entirely.
Can I avoid simulating the SDE to generate training data?
The loss Equation 5 only depends on the pair \((X_t, X_1)\), not the full trajectory: \[ L_\text{AM}(u) = \int_0^1 \mathbb{E}_{(X_t, X_1) \sim p^{\bar{u}}_{t,1}} {\left[ \tfrac{1}{2}\|u(t,X_t) + \sigma_t \nabla g(X_1)\|^2 \right]} \, dt. \tag{6}\] Sampling these pairs still requires simulating the controlled SDE forward, which is the expensive part. The key practical innovation is to replace the joint \(p^{\bar{u}}_{t,1}(x_t, x_1)\) with the product \[ p^\text{base}_{t \mid 1}(x_t \mid x_1) \cdot p^{\bar{u}}_1(x_1). \tag{7}\] Keep the terminal marginal from the current control, but fill in the interior with the base Brownian bridge. This decouples the expensive step (generating \(X_1\)) from the cheap step (sampling \(X_t\) given \(X_1\)).
The bridge formula. Under the base process, \(X_t \sim \mathcal{N}(0, \nu_t I)\) and \(X_1 = X_t + Z\) where \(Z = \int_t^1 \sigma_s \, dW_s \sim \mathcal{N}(0, (\nu_1 - \nu_t)I)\) is independent of \(X_t\). The pair \((X_t, X_1)\) is jointly Gaussian with \(\mathbb{E}[X_t] = 0\), \(\mathbb{E}[X_1] = 0\), \(\mathop{\mathrm{Var}}(X_t) = \nu_t I\), \(\mathop{\mathrm{Var}}(X_1) = \nu_1 I\), and \(\mathop{\mathrm{Cov}}(X_t, X_1) = \nu_t I\) (since \(\mathop{\mathrm{Cov}}(X_t, X_t + Z) = \mathop{\mathrm{Var}}(X_t)\)). Standard Gaussian conditioning (cf. the Brownian bridge in the Doob notes) gives \[ X_t \mid X_1 = x_1 \;\sim\; \mathcal{N} {\left( \frac{\nu_t}{\nu_1} \, x_1, \;\; \frac{\nu_t(\nu_1 - \nu_t)}{\nu_1} \, I \right)} . \tag{8}\] Sampling from Equation 8 is a single Gaussian draw: \(X_t = \frac{\nu_t}{\nu_1} x_1 + \sqrt{\frac{\nu_t(\nu_1 - \nu_t)}{\nu_1}} \, \xi\) with \(\xi \sim \mathcal{N}(0, I)\). For constant \(\sigma_t = \sigma\), we have \(\nu_t = \sigma^2 t\) and Equation 8 simplifies to \(X_t \mid X_1 \sim \mathcal{N}(t \, x_1, \sigma^2 t(1-t) I)\), the classical Brownian bridge.
The Gaussian conditioning, spelled out.
Write \(X_1 = X_t + Z\) with \(Z \sim \mathcal{N}(0, (\nu_1 - \nu_t)I)\) independent of \(X_t\). The joint \((X_t, X_1)\) is Gaussian, and the conditional \(X_t \mid X_1 = x_1\) is also Gaussian with \[ \mathbb{E}[X_t \mid X_1 = x_1] = \mathop{\mathrm{Cov}}(X_t, X_1) \, \mathop{\mathrm{Var}}(X_1)^{-1} \, x_1 = \frac{\nu_t}{\nu_1} \, x_1, \] \[ \mathop{\mathrm{Var}}(X_t \mid X_1) = \mathop{\mathrm{Var}}(X_t) - \mathop{\mathrm{Cov}}(X_t, X_1) \, \mathop{\mathrm{Var}}(X_1)^{-1} \, \mathop{\mathrm{Cov}}(X_1, X_t) = \nu_t {\left( 1 - \frac{\nu_t}{\nu_1} \right)} I = \frac{\nu_t(\nu_1 - \nu_t)}{\nu_1} \, I. \] The conditional mean is a linear interpolation between \(0\) and \(x_1\), weighted by \(\nu_t / \nu_1\). The conditional variance peaks at the midpoint and vanishes at \(t = 0\) (since \(\nu_0 = 0\), we know \(X_0 = 0\) with certainty) and at \(t = 1\) (since we conditioned on \(X_1 = x_1\)).
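Equation 8 can be sanity-checked against brute-force conditioning: draw joint base pairs \((X_t, X_1)\), keep those with \(X_1\) near a chosen \(x_1\), and compare the empirical conditional moments with the bridge formula (NumPy sketch, constant \(\sigma = 1\)):

```python
import numpy as np

rng = np.random.default_rng(0)
t, x1 = 0.3, 1.0                 # constant sigma = 1: nu_t = t, nu_1 = 1

# Brute force: draw joint base pairs (X_t, X_1), keep paths with X_1 near x1.
n  = 400_000
Xt = np.sqrt(t) * rng.standard_normal(n)
X1 = Xt + np.sqrt(1 - t) * rng.standard_normal(n)
sel = np.abs(X1 - x1) < 0.05
emp_mean, emp_var = Xt[sel].mean(), Xt[sel].var()

# Equation 8 predicts X_t | X_1 = x1 ~ N(t * x1, t * (1 - t)).
print(emp_mean, t * x1)          # ~ 0.30
print(emp_var, t * (1 - t))      # ~ 0.21
```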
Substituting Equation 7 into Equation 6 gives the Reciprocal Adjoint Matching (RAM) loss: \[ L_\text{RAM}(u) = \int_0^1 \lambda(t) \, \mathbb{E}_{\substack{X_1 \sim p^{\bar{u}}_1 \\ X_t \sim p^\text{base}_{t \mid 1}(\cdot \mid X_1)}} {\left[ \tfrac{1}{2}\|u(t,X_t) + \sigma_t \nabla g(X_1)\|^2 \right]} \, dt, \tag{9}\] where \(\lambda(t) = 1/\sigma_t^2\) is a time-weighting that normalizes the regression target magnitude across time (does not change the optimum). Sampling the training pair \((X_t, X_1)\) now proceeds as: (1) generate \(X_1\) by running the controlled SDE forward (one expensive simulation), (2) for each \(X_1\), sample \(t \sim \text{Uniform}([0,1])\) and draw \(X_t\) from Equation 8 (one Gaussian draw). One energy evaluation of \(\nabla g(X_1)\) gives training signal for arbitrarily many \((t, X_t)\) pairs.
Why does this substitution work? At optimality, \(\mathbb{P}^{u^\star} = \mathbb{P}^\star\) is the Schrodinger bridge, which factorizes as \(\mathbb{P}(\boldsymbol{X} \mid X_1) \cdot \pi(X_1)\). Marginalizing over intermediate times, \(p^{u^\star}_{t,1}(x_t, x_1) = p^\text{base}_{t \mid 1}(x_t \mid x_1) \cdot \pi(x_1)\). At the fixed point, the replacement Equation 7 is exact. Away from optimality, the replacement defines a reciprocal projection \(\Pi(u)\): the Schrodinger bridge with the same terminal marginal \(p^u_1\) as the current control.
The projection never increases the SOC cost.
Write \(\text{SB}_{p^u_1}\) for the Schrodinger bridge with terminal marginal \(p^u_1\), i.e. the path measure \(\mathbb{P}(\boldsymbol{X} \mid X_1) \, p^u_1(X_1)\). Its Radon-Nikodym derivative against \(\mathbb{P}\) is \[ \frac{d[\text{SB}_{p^u_1}]}{d\mathbb{P}}(\boldsymbol{X}) = \frac{p^u_1(X_1)}{p^\text{base}_1(X_1)}, \] since \(\mathbb{P}(\boldsymbol{X} \mid X_1) = \mathbb{P}(\boldsymbol{X}) / p^\text{base}_1(X_1)\). The path-space KL from \(\mathbb{P}^u\) to \(\text{SB}_{p^u_1}\) is then \[ \begin{aligned} D_{\text{KL}}(\mathbb{P}^u \| \text{SB}_{p^u_1}) &= \mathbb{E}_{\mathbb{P}^u} {\left[ \log \frac{d\mathbb{P}^u}{d\mathbb{P}} + \log \frac{d\mathbb{P}}{d[\text{SB}_{p^u_1}]} \right]} \\ &= \underbrace{D_{\text{KL}}(\mathbb{P}^u \| \mathbb{P})}_{\frac{1}{2}\mathbb{E}_{\mathbb{P}^u}\int \|u\|^2 \, dt} + \mathbb{E}_{\mathbb{P}^u} {\left[ \log \frac{p^\text{base}_1(X_1)}{p^u_1(X_1)} \right]} . \end{aligned} \] The first term is the Girsanov KL (Girsanov notes). Now decompose the SOC cost using \(g = \log(p^\text{base}_1 / \pi) = \log(p^\text{base}_1 / p^u_1) + \log(p^u_1 / \pi)\): \[ \begin{aligned} J(u) &= \mathbb{E}_{\mathbb{P}^u} {\left[ \tfrac{1}{2}\|u\|^2 + g(X_1) \right]} \\ &= \underbrace{\mathbb{E}_{\mathbb{P}^u} {\left[ \tfrac{1}{2}\|u\|^2 + \log \frac{p^\text{base}_1(X_1)}{p^u_1(X_1)} \right]} }_{D_{\text{KL}}(\mathbb{P}^u \| \text{SB}_{p^u_1})} + \underbrace{\mathbb{E}_{p^u_1} {\left[ \log \frac{p^u_1(X_1)}{\pi(X_1)} \right]} }_{D_{\text{KL}}(p^u_1 \| \pi)}. \end{aligned} \] The first term is the path-space KL computed above; the second is the marginal KL. The projection \(\Pi(u)\) minimizes the first term over all controls sharing terminal marginal \(p^u_1\), driving the path-space KL to zero (this requires \(p^u_1\) to be absolutely continuous with respect to \(p_1^{\text{base}}\), which holds for typical neural network controls). 
This minimum is achieved by the SB with terminal marginal \(p^u_1\), which has drift \(\sigma_t^2 \nabla \log \varphi_t\) for the appropriate Doob h-transform. Since the projection preserves the terminal marginal, the marginal KL term \(D_{\text{KL}}(p^u_1 \| \pi)\) is unchanged. Hence \(J(\Pi(u)) \leq J(u)\).
After projection, RAM and AM coincide: \(\Pi(u)\) is a SB with terminal marginal \(p^u_1\), so its path measure factorizes as \(\mathbb{P}(\cdot | X_1) p^u_1(X_1)\), and the reciprocal projection is the identity. The two losses have the same fixed point and the same critical points on the projected space. In other words, we lose nothing by using the cheaper loss, and we gain a monotone improvement guarantee.
The algorithm. The full Adjoint Sampling algorithm (havens2025adjoint?) alternates between two phases. The outer loop (expensive) simulates the controlled SDE forward to produce terminal samples \(\{X_1^{(i)}\}\), evaluates \(\nabla g(X_1^{(i)})\), and stores the pairs in a replay buffer \(\mathcal{B}\). The inner loop (cheap) draws \((X_1, \nabla g)\) from the buffer, samples \(t \sim \text{Uniform}([0,1])\) and \(X_t\) from the bridge Equation 8, and updates \(u\) by gradient descent on the RAM loss Equation 9. Many gradient steps per energy evaluation, since the inner loop never touches \(E\).
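A minimal sketch of this outer/inner structure on a toy problem (1D, constant \(\sigma\), target \(\pi = \mathcal{N}(m, \sigma^2)\) so that \(\nabla g \equiv -m/\sigma^2\) in closed form and the optimal control is the constant \(m/\sigma\); all names are ours, and the replay buffer is a plain list rather than a real buffer implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, m = 2.0, 3.0                        # pi = N(m, sigma^2); nu_t = sigma^2 t
grad_g = lambda x: -m / sigma**2 * np.ones_like(x)   # closed form for this toy
theta = 0.0                                # constant control u(t, x) = theta

for outer in range(3):                     # expensive phase: SDE + grad g calls
    X1 = np.zeros(256)
    for _ in range(100):                   # Euler-Maruyama with the frozen control
        X1 += sigma * theta * 0.01 + sigma * 0.1 * rng.standard_normal(256)
    buffer = list(zip(X1, grad_g(X1)))     # replay buffer of (X_1, grad g(X_1))

    for inner in range(500):               # cheap phase: no SDE, no energy calls
        x1, dg = buffer[rng.integers(len(buffer))]
        t = rng.uniform()
        xt = t * x1 + sigma * np.sqrt(t * (1 - t)) * rng.standard_normal()
        # A neural u_theta would be evaluated at (t, xt); the constant control
        # ignores them.  SGD on lambda(t)/2 (theta + sigma*dg)^2, lambda = 1/sigma^2:
        theta -= 0.1 * (theta + sigma * dg) / sigma**2

print(theta)                               # -> fixed point m / sigma = 1.5
```

Note the ratio: three SDE simulations and \(\nabla g\) batches, fifteen hundred gradient updates.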
This scheme has a clean fixed-point interpretation. Each outer step implicitly performs the reciprocal projection \(\Pi\) (by freezing the terminal samples), and each inner step minimizes the AM loss on the projected control. If \(u_i\) denotes the current control and the inner loop descends the AM loss using \(X_1\) samples from \(p^{u_i}_1\), the combined update is, schematically, \[ u_{i+1} = \Pi(u_i) - \frac{\delta L_\text{AM}}{\delta u}(\Pi(u_i)). \] The fixed point \(u = \Pi(u)\) with \(\frac{\delta L_\text{AM}}{\delta u}(u) = 0\) is exactly the optimal control \(u^\star\). In practice the inner loop is not converged fully: the buffer is refreshed when the control has drifted sufficiently from the stored samples.
The computational savings are significant. Previous SOC-based samplers require simulating the SDE and backpropagating through it at every gradient step, plus at least one energy evaluation per step. Adjoint Sampling decouples sampling from optimization: the outer loop produces \(N\) terminal samples per buffer refresh (each requiring one SDE simulation + one \(\nabla g\) evaluation), while the inner loop does \(K \gg N\) gradient updates using stored buffer entries (each requiring one Gaussian draw from Equation 8 + one forward pass of the neural network). No SDE simulation, no energy call during the inner loop. The ratio \(K/N\) of gradient updates to energy evaluations can be in the hundreds. For expensive energy functions like molecular force fields, this is the difference between feasibility and infeasibility.
The replay buffer also provides a form of experience replay: older samples from earlier stages of training remain in the buffer alongside fresh samples, smoothing the optimization. Buffer entries are weighted uniformly; more sophisticated prioritization schemes are possible but have not been explored.
What if the prior is not a Dirac?
The setup \(X_0 = 0\) is restrictive. For molecular systems, a harmonic oscillator prior \(\mu\) that starts particles near equilibrium bond lengths is a far better starting point: the transport cost is lower and convergence is faster. Even for non-molecular problems, starting from a Gaussian \(\mu = \mathcal{N}(0, I)\) and using a moderate noise schedule \(\sigma_t\) is more natural than pumping all the stochasticity through the Brownian motion from a single point.
The base process \(\mathbb{P}\) now starts from \(X_0 \sim \mu\) (not \(\delta_0\)), and \(\nu_t\), \(p_t^{\mathbb{P}}\), and the Brownian bridge all depend on this initial distribution. With a non-trivial prior, \(X_0\) and \(X_1\) are coupled under the base process, and the initial value function bias from the adjoint matching note reappears. The SOC optimal terminal marginal is \(p^\star(X_1) \propto \int \mathbb{P}(X_0, X_1) \, e^{-g(X_1) + V_0(X_0)} \, dX_0\), which is not proportional to \(\pi(X_1)\) because the \(V_0(X_0)\) factor cannot be pulled out of the integral. The memoryless condition (\(X_0 \perp X_1\) under \(\mathbb{P}\)) eliminates this bias by making the integral over \(X_0\) collapse to a constant, but it also forces \(X_0 = 0\) (or equivalently, requires a noise schedule \(\sigma_t\) that injects enough noise to erase all memory of \(X_0\) by time 1, as discussed in the adjoint matching note).
The fix comes from the Schrodinger bridge formulation. Instead of solving a plain SOC (which only imposes \(\mathbb{Q}_1 \approx \pi\) via a terminal cost), solve the full SB problem that imposes both marginal constraints \(\mathbb{Q}_0 = \mu\) and \(\mathbb{Q}_1 = \pi\) simultaneously. The SB \(\mathbb{Q}^\star\) minimizing \(D_{\text{KL}}(\mathbb{Q}\| \mathbb{P})\) subject to these two constraints has path measure \[ \frac{d\mathbb{Q}^\star}{d\mathbb{P}}(\boldsymbol{X}) \;\propto\; \widehat{\varphi}_0(X_0) \, \varphi_1(X_1), \tag{10}\] with time-dependent SB potentials \(\varphi_t(x) = \mathbb{E}_\mathbb{P}[\varphi_1(X_1) \mid X_t = x]\) and \(\widehat{\varphi}_t(x) = \mathbb{E}_\mathbb{P}[\widehat{\varphi}_0(X_0) \mid X_t = x]\) (in the SB notes, these are defined with endpoint potentials \(f, g\); here \(\varphi_1\) corresponds to \(g\) and \(\widehat{\varphi}_0\) to \(f\)). The forward potential \(\varphi_t\) is the Doob h-transform: the SB process has drift \(b_t + \sigma_t^2 \nabla \log \varphi_t\), steering trajectories toward the target. The backward potential \(\widehat{\varphi}_t\) propagates information about the prior \(\mu\) forward in time.
The SB \(\mathbb{Q}^\star\) is also a controlled diffusion (its drift is the base drift plus \(\sigma_t^2 \nabla \log \varphi_t\)), so it must solve some SOC problem with an appropriate terminal cost. What terminal cost? The SOC optimal joint from the HJB notes is \(p^\star(X_0, X_1) \propto \mathbb{P}(X_0, X_1) \, e^{-g(X_1) + V_0(X_0)}\), while the SB joint from Equation 10 is \(p^\star(X_0, X_1) \propto \mathbb{P}(X_0, X_1) \, \widehat{\varphi}_0(X_0) \, \varphi_1(X_1)\). The SOC here optimizes both the drift \(u\) and the initial distribution (the HJB formulation in Equation 1 jointly optimizes \(q_0\) and \(u\)). The SB constraint \(\mathbb{Q}_0 = \mu\) pins the initial distribution, and \(\widehat{\varphi}_0\) enters through this constraint. Matching the terminal tilt requires \(e^{-g(X_1)} \propto \varphi_1(X_1)\). The SB marginal constraint at \(t = 1\) requires \(\pi(x) = p^\mathbb{P}_1(x) \, \widehat{\varphi}_1(x) \, \varphi_1(x)\), giving \(\varphi_1(x) = \pi(x)/(p^\mathbb{P}_1(x) \, \widehat{\varphi}_1(x))\). Substituting into \(g = -\log \varphi_1 + \text{const}\) identifies the modified terminal cost: \[ \textcolor{blue}{g(x) = \log \frac{p^\mathbb{P}_1(x) \, \widehat{\varphi}_1(x)}{\pi(x)}} + \text{const}. \tag{11}\] Compared to the original \(g = \log(p^\mathbb{P}_1 / \pi)\), the base marginal \(p^\mathbb{P}_1\) is multiplied by \( \textcolor{blue}{\widehat{\varphi}_1}\): a corrector that accounts for the coupling between \(X_0\) and \(X_1\). This is the central insight from (liu2025adjoint?): every SB problem can be recast as an SOC problem with a modified terminal cost that absorbs the prior coupling.
Does this corrector actually remove the bias? Marginalize the SB joint over \(X_0\): \[ \begin{aligned} p^\star(X_1) &\propto \varphi_1(X_1) \int \mathbb{P}(X_1 \mid X_0) \, \widehat{\varphi}_0(X_0) \, \mu(X_0) \, dX_0\\ &= \varphi_1(X_1) \, \widehat{\varphi}_1(X_1) \, p^\mathbb{P}_1(X_1) = \pi(X_1). \end{aligned} \tag{12}\] The second step uses \(\widehat{\varphi}_1(x) \, p^\mathbb{P}_1(x) = \int \mathbb{P}_{1|0}(x \mid y) \, \widehat{\varphi}_0(y) \, \mu(y) \, dy\) (the definition of \(\widehat{\varphi}_1\) combined with Bayes’ rule), and the last step uses the SB marginal constraint \(\varphi_1 \, \widehat{\varphi}_1 \, p^\mathbb{P}_1 = \pi\). This is a consistency check: IF the SB potentials satisfy the Sinkhorn equations, THEN the terminal marginal is \(\pi\). The existence of potentials satisfying these constraints follows from the Sinkhorn/IPFP theory in the SB notes (this requires \(D_{\text{KL}}(\mathbb{Q}\| \mathbb{P}) < \infty\) for some \(\mathbb{Q}\) satisfying the marginal constraints; see the SB notes for conditions). The corrector \(\widehat{\varphi}_1\) cancels the initial value function bias exactly. Every SB problem decomposes as: a Doob h-transform (the forward potential \(\varphi_t\), which steers trajectories toward the target) plus a corrector (the backward potential \(\widehat{\varphi}_t\), which absorbs the prior bias). The adjoint method handles the first piece; we now need a way to learn the second.
Corrector matching
The modified AM loss with terminal cost Equation 11 needs the corrector score \(\nabla \log [p^\mathbb{P}_1 \widehat{\varphi}_1]\), which is unknown. Since \(b = 0\), the lean adjoint is still constant (the same argument from Equation 4 applies), and the AM loss becomes \[ L_\text{AM}(u) = \mathbb{E} {\left[ \tfrac{1}{2}\big\|u(t, X_t) + \sigma_t {\left( \nabla E + \textcolor{blue}{\nabla \log [p^\mathbb{P}_1 \widehat{\varphi}_1]} \right)} (X_1)\big\|^2 \right]} \tag{13}\] where the expectation is over bridge pairs \((X_t, X_1)\). The corrector score adds one extra term alongside the energy gradient \(\nabla E\). When \(\mu = \delta_0\), we have \(\widehat{\varphi}_0 = \text{const}\) (no prior information to propagate), hence \(\widehat{\varphi}_t = \text{const}\) for all \(t\); the corrector score collapses to the known base score \(\nabla \log p^\mathbb{P}_1(x) = -x/\nu_1\), and Equation 13 reduces to the plain AM loss Equation 5.
To derive a learning objective for the corrector score \(\nabla \log [p^\mathbb{P}_1 \widehat{\varphi}_1]\), start from the definition \(\widehat{\varphi}_t(x) = \mathbb{E}_\mathbb{P}[\widehat{\varphi}_0(X_0) \mid X_t = x]\). Converting from a backward conditional to a forward integral via Bayes’ rule (\(\mathbb{P}_{0|t}(y|x) = \mathbb{P}_{t|0}(x|y) \mu(y) / p_t^\mathbb{P}(x)\)): \[ \widehat{\varphi}_t(x) = \int \widehat{\varphi}_0(y) \, \mathbb{P}_{0|t}(y | x) \, dy = \frac{\int \widehat{\varphi}_0(y) \, \mathbb{P}_{t|0}(x|y) \, \mu(y) \, dy}{p_t^\mathbb{P}(x)}. \] Define \(\tilde{\varphi}_0(y) = \widehat{\varphi}_0(y) \cdot \mu(y)\) to absorb the prior. Then \[ \widehat{\varphi}_t(x) \, p_t^\mathbb{P}(x) = \int \mathbb{P}_{t|0}(x|y) \, \tilde{\varphi}_0(y) \, dy. \tag{14}\] Take \(\nabla_x \log\) of both sides (exchanging \(\nabla\) and \(\int\), valid under standard regularity): \[ \nabla_x \log \widehat{\varphi}_t(x) + \nabla_x \log p_t^\mathbb{P}(x) = \frac{\int \nabla_x \mathbb{P}_{t|0}(x|y) \, \tilde{\varphi}_0(y) \, dy}{\widehat{\varphi}_t(x) \, p_t^\mathbb{P}(x)}. \] Rewrite the numerator using \(\nabla_x \mathbb{P}_{t|0} = (\nabla_x \log \mathbb{P}_{t|0}) \, \mathbb{P}_{t|0}\): \[ \nabla_x \log[\widehat{\varphi}_t(x) \, p_t^\mathbb{P}(x)] = \frac{\int \nabla_x \log \mathbb{P}_{t|0}(x|y) \, \mathbb{P}_{t|0}(x|y) \, \tilde{\varphi}_0(y) \, dy}{\widehat{\varphi}_t(x) \, p_t^\mathbb{P}(x)}. \tag{15}\] The ratio \(\mathbb{P}_{t|0}(x|y) \, \tilde{\varphi}_0(y) / [\widehat{\varphi}_t(x) \, p_t^\mathbb{P}(x)]\) is a probability distribution over \(y\), and it is exactly the SB posterior \(p^\star(X_0 = y \mid X_t = x)\).
The Bayes’ rule step.
The SB joint at times \(0\) and \(t\) factorizes as \(p^\star(X_0, X_t) \propto \widehat{\varphi}_0(X_0) \, \mu(X_0) \, \mathbb{P}_{t|0}(X_t | X_0) \, \varphi_t(X_t)\). The posterior is \(p^\star(X_0 = y \mid X_t = x) \propto \widehat{\varphi}_0(y) \, \mu(y) \, \mathbb{P}_{t|0}(x | y) = \tilde{\varphi}_0(y) \, \mathbb{P}_{t|0}(x|y)\). Normalizing over \(y\) gives exactly the denominator \(\widehat{\varphi}_t(x) \, p_t^\mathbb{P}(x) = \int \mathbb{P}_{t|0}(x|y) \tilde{\varphi}_0(y) \, dy\) from Equation 14. So the ratio in Equation 15 is \(p^\star(X_0 = y \mid X_t = x)\), as claimed.
Substituting back into Equation 15: \[ \nabla_x \log[\widehat{\varphi}_t(x) \, p_t^\mathbb{P}(x)] = \mathbb{E}_{p^\star} {\left[ \nabla_x \log \mathbb{P}_{t \mid 0}(x \mid X_0) \mid X_t = x \right]} . \tag{16}\] The left side is \(\nabla \log \widehat{\varphi}_t + \nabla \log p_t^\mathbb{P}\): exactly the combined corrector score that appears, at \(t = 1\), in the AM loss Equation 13. There is no need to separate the two factors. Indeed, with a non-Dirac prior the base marginal \(p_t^\mathbb{P} = \mu * \mathcal{N}(0, \nu_t I)\) is generally not available in closed form, but the combined score is a conditional expectation, and conditional expectations can be learned by regression. Equation 16 is a Tweedie-type formula: the score of the product \(\widehat{\varphi}_t \cdot p_t^\mathbb{P}\) at \((t,x)\) is the conditional expectation of the transition score \(\nabla_x \log \mathbb{P}_{t \mid 0}(x \mid X_0)\) given \(X_t = x\). The structure is identical to the denoising score matching identity: the score of a mixture equals the expected score of the component, averaged over the posterior on the component index.
Since any conditional expectation minimizes a least-squares loss, Equation 16 at \(t = 1\) gives: \[ \nabla \log[\widehat{\varphi}_1 \cdot p_1^\mathbb{P}] = \mathop{\mathrm{argmin}}_h \; \mathbb{E}_{(X_0, X_1) \sim p^\star_{0,1}} {\left[ \big\|h(X_1) - \nabla_{X_1} \log \mathbb{P}(X_1 \mid X_0)\big\|^2 \right]} . \tag{17}\] This is the corrector matching (CM) loss: regress \(h\) onto the base transition score at endpoint pairs. In practice, the expectation is over the current model’s endpoint pairs \((X_0, X_1) \sim p^{u^{(k)}}_{0,1}\), not the unknown SB joint \(p^\star_{0,1}\). Each CM step finds the best approximation to \(\nabla \log[\widehat{\varphi}_1 \cdot p_1^{\mathbb{P}}]\) at the current iterate; alternation converges by the IPFP contraction argument. For the Brownian base, \(\mathbb{P}(X_1 \mid X_0) = \mathcal{N}(X_0, \nu_1 I)\) (recall \(\nu_0 = 0\)), and the transition score is just \(\nabla_{X_1} \log \mathbb{P}(X_1 \mid X_0) = -(X_1 - X_0)/\nu_1\), known in closed form. The regression target is a Gaussian score evaluated at the endpoints; no energy evaluation needed. The minimizer of Equation 17 is exactly the combined corrector score \(\nabla \log[\widehat{\varphi}_1 \cdot p_1^\mathbb{P}]\) that Equation 13 needs; no separate estimate of \(\nabla \log p_1^\mathbb{P}\) is required.
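A numerical check of Equation 17 in the simplest case (NumPy sketch, names ours): draw endpoint pairs from the untilted base process, where \(\widehat{\varphi}_1\) is constant, so the least-squares minimizer must reduce to the base marginal score. For a Gaussian prior \(\mu = \mathcal{N}(m_0, s_0^2)\) and base variance \(\nu\), that score is \(-(x - m_0)/(s_0^2 + \nu)\), and the conditional expectation is affine, so an affine fit suffices:

```python
import numpy as np

rng = np.random.default_rng(0)
m0, s0, nu = 1.0, 0.5, 1.0           # prior N(m0, s0^2); base variance nu_1 = nu

# Endpoint pairs from the untilted base process (phi-hat_1 = const here).
n  = 200_000
X0 = m0 + s0 * rng.standard_normal(n)
X1 = X0 + np.sqrt(nu) * rng.standard_normal(n)
target = -(X1 - X0) / nu             # transition score: Gaussian, closed form

# Least-squares affine fit h(x) = a x + b.
A = np.stack([X1, np.ones(n)], axis=1)
(a, b), *_ = np.linalg.lstsq(A, target, rcond=None)

# With phi-hat_1 constant the minimizer is the score of N(m0, s0^2 + nu):
print(a, -1 / (s0**2 + nu))          # slope     ~ -0.8
print(b, m0 / (s0**2 + nu))          # intercept ~  0.8
```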
Note the parallel with the AM loss. Both are mean-squared-error regressions. AM regresses the control \(u\) onto the energy gradient \(\nabla g\) at pairs \((X_t, X_1)\) from the bridge. CM regresses the corrector \(h\) onto the transition score at endpoint pairs \((X_0, X_1)\) from the current process. Both are self-consistent: the expectation is over the model’s own samples. Both have a conditional expectation as their unique fixed point.
The CM loss Equation 17 should not be confused with the bridge matching objectives in data-driven SB methods like (de2021diffusion?), where \(X_1\) must be drawn from the target \(\pi\). Here, both AM and CM use only on-policy samples from the current control \(u\). This is what makes the approach scalable: no target samples are needed.
Alternating AM + CM = IPFP
The AM loss Equation 13 needs \(\nabla \log \widehat{\varphi}_1\). The CM loss Equation 17 needs samples from \(p^\star\). Neither can be solved alone. The fix: alternate.
- AM step: Solve Equation 13 with \(\nabla \log [p^\mathbb{P}_1 \widehat{\varphi}_1] \approx h^{(k-1)}\) to get \(u^{(k)}\).
- CM step: Solve Equation 17 with endpoint pairs from \(u^{(k)}\) to get \(h^{(k)} \approx \nabla \log[\widehat{\varphi}_1 \cdot p_1^\mathbb{P}]\); the minimizer is the corrector score itself, so no post-processing is needed.
Initialize with \(h^{(0)} = 0\). The first AM stage then uses terminal cost \(g(x) = E(x) + \text{const}\) (dropping the corrector score entirely) and regresses the control onto \(-\sigma_t \nabla E(X_1)\): pure energy-guided transport with no corrector. This recovers the Adjoint Sampling algorithm, which is the right starting point when no better guess for the corrector is available. The first CM stage then learns \(h^{(1)} \approx \nabla \log [\widehat{\varphi}_1 \, p^\mathbb{P}_1]\) from the endpoint pairs produced by \(u^{(1)}\). Subsequent stages progressively refine both the control and the corrector.
The alternation has a variational interpretation. The AM step solves a forward half-bridge: \(\min_{\mathbb{Q}: \mathbb{Q}_0 = \mu} D_{\text{KL}}(\mathbb{Q}\| \mathbb{Q}^{\text{bwd}})\), where \(\mathbb{Q}^{\text{bwd}}\) is the backward SDE determined by the previous corrector \(h^{(k-1)}\). The CM step solves a backward half-bridge: \(\min_{\mathbb{Q}: \mathbb{Q}_1 = \pi} D_{\text{KL}}(\mathbb{P}^{u^{(k)}} \| \mathbb{Q})\), updating the corrector to absorb whatever bias remains. Alternating between forward and backward half-bridge projections is exactly IPFP/Sinkhorn on path-space (de2021diffusion?).
Recall from the SB notes that the static Sinkhorn algorithm alternates between projecting the coupling onto the set \(\{\gamma : \gamma_0 = \nu_0\}\) and the set \(\{\gamma : \gamma_1 = \nu_1\}\). Here, we do the same thing on path-space: the forward half-bridge imposes \(\mathbb{Q}_0 = \mu\) but lets \(\mathbb{Q}_1\) float; the backward half-bridge imposes \(\mathbb{Q}_1 = \pi\) but lets \(\mathbb{Q}_0\) float. Each projection is itself a KL minimization over a convex constraint set, so each step reduces the KL to the SB solution. Convergence to \(\mathbb{Q}^\star\) follows from standard IPFP contraction (de2021diffusion?), provided each inner optimization is solved exactly (which never holds in practice with finite gradient steps; the practical version uses a few optimization steps per stage). In practice, 3-5 outer stages suffice; each stage consists of many AM or CM gradient steps.
| Method | Prior | Terminal cost \(g(x)\) | Corrector |
|---|---|---|---|
| Adjoint Sampling (havens2025adjoint?) | \(\delta_0\) (memoryless) | \(\log \frac{p^\mathbb{P}_1(x)}{\pi(x)}\) | None (\(\widehat{\varphi}_1 = \text{const}\)) |
| ASBS (liu2025adjoint?) | Arbitrary \(\mu\) | \(\log \frac{p^\mathbb{P}_1(x) \, \widehat{\varphi}_1(x)}{\pi(x)}\) | Learned \(\nabla \log [p^\mathbb{P}_1 \widehat{\varphi}_1]\) |
ASBS reduces to Adjoint Sampling when \(\mu = \delta_0\): the factor \(\widehat{\varphi}_1\) becomes constant, the corrector score collapses to the known Gaussian base score, and the CM step is trivially solved. A concrete example of a useful non-Dirac prior: for molecular conformer generation, one can use a harmonic oscillator prior \(\mu(x) \propto \exp(-\frac{\alpha}{2} \sum_{i < j} (\|x_i - x_j\| - r^0_{ij})^2)\) where \(r^0_{ij}\) are equilibrium bond lengths read from the molecular graph. Particles start in a physically reasonable arrangement rather than at the origin, reducing the transport cost substantially.
Every piece was already in the earlier notes. The adjoint method gives the regression target (\(\nabla g\) is constant along trajectories when \(b = 0\)). The Brownian bridge gives cheap training pairs (Gaussian conditioning, one draw per sample). The Schrodinger bridge gives the corrector (endpoint potentials that debias the prior). Adjoint sampling (havens2025adjoint?) assembles the first two; ASBS (liu2025adjoint?) adds the third. The reader who has followed the series could have assembled this themselves.
The overall picture: sampling from an unnormalized Boltzmann distribution \(\pi \propto e^{-E}\) reduces to solving a stochastic optimal control problem, which reduces to a sequence of mean-squared-error regressions on cheap Gaussian training pairs. The only expensive operation (energy evaluation at terminal samples) is amortized over many gradient updates. With a Dirac prior, a single matching objective suffices. With a non-trivial prior, two alternating objectives (AM + CM) converge to the Schrodinger bridge via IPFP.
The ideas generalize beyond the \(b = 0\) setting, though the simplifications are less dramatic. With a non-zero base drift (e.g. a pre-trained diffusion model as in the adjoint matching note), the lean adjoint ODE is no longer trivially constant; it must be solved backward along each trajectory. The reciprocal projection still applies in principle, but the bridge distribution \(p^\text{base}_{t \mid 1}\) is no longer a simple Gaussian, and must be approximated (e.g. by linearizing the base drift). The \(b = 0\) case is the sweet spot: everything is closed-form, the adjoint is constant, the bridge is Gaussian, and the only approximation is the neural network parameterization of \(u\).
A note on related methods. The RAM loss Equation 9 is closely related to the training objectives in PDDS (phillips2024particle?) and Target Score Matching (de2024target?), where the same squared-error expression appears inside the expectation. The difference is in what the expectation is taken over: PDDS and TSM sample \(X_1\) from the target distribution \(\pi\) (requiring importance sampling or SMC), while Adjoint Sampling samples \(X_1\) from the current model \(p^{\bar{u}}_1\). This makes the loss a moving target, which is harder to analyze but far more practical: no target samples needed, no importance weights, no resampling.