Setup
We work with a continuous-time Markov chain (CTMC) on a finite state space \(\mathcal{X}\). Background on CTMCs, rate matrices, and time-reversal is in the discrete diffusion notes. The CTMC is characterized by its transition rate \(u_t(y,x)\): the infinitesimal probability per unit time of jumping from state \(x\) to state \(y \neq x\). A reference process with rate \(r_t(y,x)\) defines the base path measure \(\mathbb{P}\). Adding a control means replacing \(r_t\) by \(u_t\), producing a controlled path measure \(\mathbb{P}^u\).
The path-space KL between \(\mathbb{P}^u\) and \(\mathbb{P}\) is (see the Girsanov notes for the continuous analogue):
\[ D_{\text{KL}}(\mathbb{P}^u \| \mathbb{P}) = \mathbb{E}_{\mathbb{P}^u} \int_0^1 \sum_{y \neq X_t} {\left[ u_t(y, X_t) \log \frac{u_t(y, X_t)}{r_t(y, X_t)} - u_t(y, X_t) + r_t(y, X_t) \right]} dt. \tag{1}\]
Each summand \(u \log(u/r) - u + r\) is the generalized KL divergence between rates at a single pair \((y, x)\). It is non-negative, equals zero when \(u = r\), and plays the role of \(\frac{1}{2}\|u\|^2\) from the continuous Girsanov penalty.
The discrete stochastic optimal control (SOC) problem (cf. the HJB notes) is:
\[ \min_u \; J(u) = D_{\text{KL}}(\mathbb{P}^u \| \mathbb{P}) + \mathbb{E}_{\mathbb{P}^u} {\left[ g(X_1) \right]} , \tag{2}\]
where \(g: \mathcal{X}\to \mathbb{R}\) is a terminal cost. For sampling from \(\pi\), the terminal cost is \(g(x) = -\log \pi(x) + \text{const}\) (when the reference has uniform stationary distribution). By the Doob h-transform (discrete version), the optimal rate is
\[ \textcolor{blue}{u^\star_t(y,x) = r_t(y,x) \cdot \frac{\varphi_t(y)}{\varphi_t(x)}}, \tag{3}\]
where \(\varphi_t(x) = \mathbb{E}_\mathbb{P} {\left[ e^{-g(X_1)} \mid X_t = x \right]} \) is the discrete Doob \(h\)-function. Writing \(V_t(x) = \log \varphi_t(x)\) (consistent with the HJB notes):
\[ u^\star_t(y,x) = r_t(y,x) \cdot e^{V_t(y) - V_t(x)}. \tag{4}\]
The structure here is multiplicative: the optimal rate tilts the reference rate by \(\varphi(y)/\varphi(x)\). In the continuous case, \(u^\star = \sigma_t \nabla V\) is an additive perturbation of the drift. The DAM paper defines \(V\) as cost-to-go (\(V_{\text{DAM}} = -V_{\text{here}}\)); the formulas agree since both give \(u^\star = r \cdot \varphi(y)/\varphi(x)\).
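The h-transform can be checked numerically on a toy chain. The sketch below assumes a one-dimensional uniform chain on \(\mathbb{Z}_5\) with a made-up terminal cost; the truncated-series `expm` helper is adequate only because the generator norm is small here. It builds \(\varphi_t\), forms the tilted rate of Equation 3, and verifies the terminal condition \(\varphi_1 = e^{-g}\) and the backward equation \(\frac{d}{dt}\varphi_t = -Q\varphi_t\):

```python
import numpy as np

def expm(A, terms=60):
    """Matrix exponential by truncated Taylor series (fine for small ||A||)."""
    out, term = np.eye(len(A)), np.eye(len(A))
    for k in range(1, terms):
        term = term @ A / k
        out = out + term
    return out

N, gamma = 5, 1.0
# Generator of the uniform reference chain on Z_N (D = 1): rate gamma/N to every
# other state. Rows index the current state x, columns the target y.
Q = np.full((N, N), gamma / N)
np.fill_diagonal(Q, -gamma * (N - 1) / N)

g = np.array([0.0, 1.0, 0.5, 2.0, 0.3])          # toy terminal cost
phi = lambda t: expm((1 - t) * Q) @ np.exp(-g)   # phi_t(x) = E[e^{-g(X_1)} | X_t = x]

# Optimal rate (Eq. 3): u*(y, x) = r(y, x) * phi_t(y) / phi_t(x)
t = 0.4
p = phi(t)
u_star = (gamma / N) * np.outer(1.0 / p, p)      # u_star[x, y] = rate from x to y
np.fill_diagonal(u_star, 0.0)

# Sanity checks: phi_1 = e^{-g}, and phi solves d/dt phi_t = -Q phi_t.
assert np.allclose(phi(1.0), np.exp(-g))
h = 1e-5
assert np.allclose((phi(t + h) - phi(t - h)) / (2 * h), -Q @ phi(t), atol=1e-6)
```

The backward-equation check is the finite-state Feynman–Kac statement behind \(\varphi_t(x) = \mathbb{E}_\mathbb{P}[e^{-g(X_1)} \mid X_t = x]\).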
Additive noise on cyclic groups
Continuous adjoint matching relies on translation invariance of Gaussian noise: the reference transition kernel \(p^r_{1|t}(y \mid x) = q_t(y - x)\) depends only on the displacement \(y - x\).
For CTMCs on a finite state space, there is no obvious analogue. The key observation from (liu2025dasbs?; so2026dam?) is to equip the state space with a group structure. Take \(\mathcal{X}= \mathbb{Z}_N^D\), the \(D\)-fold product of the cyclic group of integers modulo \(N\). Each state is \(x = (x^1, \ldots, x^D)\) with \(x^d \in \{0, 1, \ldots, N-1\}\), and addition is component-wise modulo \(N\).
Choose a uniform reference rate:
\[ r_t(y, x) = \frac{\gamma_t}{N} \quad \text{for all } y \neq x \text{ with } d_H(y,x) = 1, \tag{5}\]
where \(d_H\) is the Hamming distance (differing in exactly one coordinate) and \(\gamma_t > 0\) is a noise schedule. Restricting to single-coordinate jumps (\(d_H(y,x) = 1\)) keeps the rate matrix sparse and the computation per step \(O(DN)\). The resulting transition kernel \(p^r_{1|t}(y \mid x)\) depends only on \(y - x\) (subtraction in \(\mathbb{Z}_N^D\)):
\[ p^r_{1|t}(y \mid x) = q_t(y - x). \tag{6}\]
The argument is simple. The uniform rate is invariant under any translation \(x \mapsto x + a\) on \(\mathbb{Z}_N^D\): the jump rate from \(x + a\) to \(y + a\) equals the rate from \(x\) to \(y\). Starting from \(x\) or \(x + a\) and running the same Markov chain, the displacement \(X_1 - X_0\) has the same distribution. This is the discrete analogue of adding mean-zero Gaussian noise.
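The claim in Equation 6 can be verified directly at small scale. The sketch below assumes toy sizes \(N = 3\), \(D = 2\) (the series `expm` is fine for this generator norm): it assembles the uniform single-coordinate generator of Equation 5, exponentiates it to get the kernel over \([0, 1]\), and asserts that \(p^r_{1|t}(y \mid x)\) depends only on \((y - x) \bmod N\):

```python
import numpy as np
from itertools import product

def expm(A, terms=80):
    """Matrix exponential by truncated Taylor series (fine for small ||A||)."""
    out, term = np.eye(len(A)), np.eye(len(A))
    for k in range(1, terms):
        term = term @ A / k
        out = out + term
    return out

N, D, gamma = 3, 2, 1.0
states = list(product(range(N), repeat=D))   # Z_N^D, here 9 states
idx = {s: i for i, s in enumerate(states)}

# Uniform reference rate (Eq. 5): gamma/N for each single-coordinate jump.
Q = np.zeros((len(states), len(states)))
for x in states:
    for d in range(D):
        for v in range(N):
            if v != x[d]:
                y = x[:d] + (v,) + x[d + 1:]
                Q[idx[x], idx[y]] = gamma / N
np.fill_diagonal(Q, -Q.sum(axis=1))

P = expm(Q)   # reference transition kernel over the whole interval

# Translation invariance (Eq. 6): p(y | x) depends only on the displacement.
origin = idx[(0,) * D]
for x in states:
    for y in states:
        disp = tuple((yi - xi) % N for xi, yi in zip(x, y))
        assert np.isclose(P[idx[x], idx[y]], P[origin, idx[disp]])
```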
Discrete adjoint estimator
From gradients to exponential differences
The continuous lean adjoint and its simplification for \(b=0\) are derived in the adjoint matching and adjoint sampling notes. In the discrete case, derivatives do not exist. The optimal control is multiplicative (Equation 4), so the relevant quantity is not \(V(y) - V(x)\) but \(e^{V(y) - V(x)}\). The analogue of \(\nabla g\) is the exponential difference \(e^{g(X_1) - g(y)}\) for each possible target state \(y\). The discrete adjoint terminal condition is:
\[ \tilde{a}_1(y; X_1) = e^{g(X_1) - g(y)}. \tag{7}\]
The lean adjoint ODE
The full discrete adjoint satisfies a backward ODE driven by the rate matrix. At optimality, the \(u\)-dependent terms vanish (the same mechanism as in the continuous case, see adjoint matching), leaving the lean adjoint. We write \(r_t(z, y)\) for the full generator, including the diagonal \(r_t(y,y) = -\sum_{z \neq y} r_t(z,y)\):
\[ -\frac{d}{dt} \tilde{a}_t(y) = \sum_{z} r_t(z, y) \, \tilde{a}_t(z), \qquad \tilde{a}_1(y; X_1) = e^{g(X_1) - g(y)}. \tag{8}\]
Because the reference rates \(r_t(z,y)\) do not depend on the current state \(X_t\), this ODE is deterministic and can be solved offline. In the continuous case, \(\nabla_x b\) depends on \(X_t\), making the lean adjoint a stochastic ODE that must be integrated along each trajectory.
Closed-form solution
Because Equation 8 is linear with coefficients determined by the reference rate matrix, its solution can be expressed in terms of the reference transition kernel \(p^r_{1|t}(z \mid y)\):
\[ \tilde{a}_t(y; X_1) = \sum_{z \in \mathcal{X}} p^r_{1|t}(z \mid y) \, e^{g(X_1) - g(z)}. \tag{9}\]
This is a weighted average over states \(z\) that the reference process can reach from \(y\) by time 1, weighted by \(e^{g(X_1) - g(z)}\).
Verifying the closed form.
Write \(P^r_{1|t}(z \mid y)\) for the reference transition probability. The function \(\tilde{a}_t(y) = \sum_z P^r_{1|t}(z \mid y) \, e^{g(X_1) - g(z)}\) needs to satisfy Equation 8. Differentiating with respect to \(t\): \[ -\frac{d}{dt} \tilde{a}_t(y) = -\sum_z \frac{d}{dt} P^r_{1|t}(z \mid y) \, e^{g(X_1) - g(z)}. \] The Kolmogorov backward equation gives \(\frac{d}{dt} P^r_{1|t}(z \mid y) = -\sum_{w} r_t(w, y) P^r_{1|t}(z \mid w)\) (the generator acts on the starting state \(y\)). Substituting: \[ -\frac{d}{dt} \tilde{a}_t(y) = \sum_z \sum_w r_t(w, y) P^r_{1|t}(z \mid w) \, e^{g(X_1) - g(z)} = \sum_w r_t(w, y) \tilde{a}_t(w). \] This matches Equation 8. At \(t = 1\), \(P^r_{1|1}(z \mid y) = \delta_{z,y}\), so \(\tilde{a}_1(y) = e^{g(X_1) - g(y)}\), matching the terminal condition.
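The same verification works numerically: integrate Equation 8 backward with explicit Euler from the terminal condition and compare against the closed form of Equation 9. A sketch on a toy \(\mathbb{Z}_5\) uniform chain with arbitrary cost values and a fixed endpoint \(X_1 = 2\):

```python
import numpy as np

def expm(A, terms=60):
    """Matrix exponential by truncated Taylor series (fine for small ||A||)."""
    out, term = np.eye(len(A)), np.eye(len(A))
    for k in range(1, terms):
        term = term @ A / k
        out = out + term
    return out

N, gamma = 5, 1.0
Q = np.full((N, N), gamma / N)               # uniform reference generator on Z_5
np.fill_diagonal(Q, -gamma * (N - 1) / N)

g = np.array([0.0, 1.0, 0.5, 2.0, 0.3])      # toy terminal cost
x1 = 2                                       # fixed trajectory endpoint X_1
a1 = np.exp(g[x1] - g)                       # terminal condition (Eq. 7)

# Closed form (Eq. 9): a_t = P^r_{1|t} a_1 with P^r_{1|t} = expm((1 - t) Q).
t = 0.3
a_closed = expm((1 - t) * Q) @ a1

# Explicit Euler integration of Eq. 8 backward from t = 1 down to t:
# -da/dt = Q a  =>  a(t - h) ~ a(t) + h Q a(t).
a, h = a1.copy(), 1e-4
for _ in range(int(round((1 - t) / h))):
    a = a + h * (Q @ a)

assert np.allclose(a, a_closed, atol=1e-3)
```

Because the ODE is deterministic, the whole backward solve happens once, not per trajectory.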
Key identity
The discrete adjoint recovers the value function ratio in conditional expectation. Under the optimal path measure \(\mathbb{P}^{u^\star}\):
\[ \mathbb{E}_{\mathbb{P}^{u^\star}} {\left[ \tilde{a}_t(y; X_1) \mid X_t = x \right]} = e^{V_t(y) - V_t(x)}. \tag{10}\]
At optimality, the conditional expectation of the discrete adjoint recovers the exponential value difference, exactly the multiplicative factor in Equation 4.
Deriving Equation 10.
Under \(\mathbb{P}^{u^\star}\), the Doob h-transform tilts the transition kernel: \(p^{u^\star}_{1|t}(z \mid x) = p^r_{1|t}(z \mid x) \cdot e^{-g(z)}/\varphi_t(x)\), where \(\varphi_t(x) = \sum_z p^r_{1|t}(z|x) e^{-g(z)}\) is the Doob \(h\)-function from Equation 3. Then: \[ \mathbb{E}_{\mathbb{P}^{u^\star}} {\left[ \tilde{a}_t(y; X_1) \mid X_t = x \right]} = \sum_{x_1} p^{u^\star}_{1|t}(x_1 \mid x) \sum_z p^r_{1|t}(z \mid y) \, e^{g(x_1) - g(z)}. \] Substitute \(p^{u^\star}_{1|t}(x_1 \mid x) = p^r_{1|t}(x_1 \mid x) e^{-g(x_1)} / \varphi_t(x)\): \[ = \frac{1}{\varphi_t(x)} \sum_{x_1} p^r_{1|t}(x_1 \mid x) \sum_z p^r_{1|t}(z \mid y) e^{-g(z)} = \frac{\varphi_t(y)}{\varphi_t(x)} = e^{V_t(y) - V_t(x)}. \] The \(x_1\) sum gives \(\sum_{x_1} p^r_{1|t}(x_1 \mid x) = 1\), and the \(z\) sum gives \(\varphi_t(y) = \sum_z p^r_{1|t}(z \mid y) e^{-g(z)}\).
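Equation 10 can be checked numerically on the same toy chain: form the tilted kernel, build \(\tilde{a}_t(y; x_1)\) for all pairs via Equation 9, and compare the conditional expectation against \(\varphi_t(y)/\varphi_t(x)\). A sketch (uniform \(\mathbb{Z}_5\) chain, made-up cost):

```python
import numpy as np

def expm(A, terms=60):
    """Matrix exponential by truncated Taylor series (fine for small ||A||)."""
    out, term = np.eye(len(A)), np.eye(len(A))
    for k in range(1, terms):
        term = term @ A / k
        out = out + term
    return out

N, gamma = 5, 1.0
Q = np.full((N, N), gamma / N)               # uniform reference generator on Z_5
np.fill_diagonal(Q, -gamma * (N - 1) / N)
g = np.array([0.0, 1.0, 0.5, 2.0, 0.3])

t = 0.4
P = expm((1 - t) * Q)                        # P[x, z] = p^r_{1|t}(z | x)
phi = P @ np.exp(-g)                         # Doob h-function phi_t

# Tilted optimal kernel: p^{u*}_{1|t}(x1 | x) = p^r(x1 | x) e^{-g(x1)} / phi(x).
P_star = P * np.exp(-g)[None, :] / phi[:, None]
assert np.allclose(P_star.sum(axis=1), 1.0)  # rows are probability vectors

# Adjoint values A[y, x1] = sum_z P[y, z] e^{g(x1) - g(z)}  (Eq. 9).
A = (P @ np.exp(-g))[:, None] * np.exp(g)[None, :]

# Eq. 10: E[a_t(y; X_1) | X_t = x] under P^{u*} equals phi_t(y) / phi_t(x).
lhs = P_star @ A.T                           # lhs[x, y]
rhs = phi[None, :] / phi[:, None]
assert np.allclose(lhs, rhs)
```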
Matching loss
Why generalized KL, not \(L^2\)
The continuous AM loss uses \(L^2\) regression: \(\|u_\theta - \text{target}\|^2\). Rates must be positive, but \(L^2\) regression does not enforce positivity: the regressed values can go negative. The natural divergence for positive quantities is the generalized KL:
\[ D_{\text{gKL}}(a, b) = \sum_{y \neq x} {\left[ a(y,x) \log \frac{a(y,x)}{b(y,x)} - a(y,x) + b(y,x) \right]} . \tag{11}\]
This is the Bregman divergence generated by \(\varphi(a) = a\log a - a\), with \(\varphi'(a) = \log a\). It is non-negative, equals zero when \(a = b\), and the minimizer of \(\mathbb{E} {\left[ D_{\text{gKL}}(W, c) \right]} \) over \(c\) is \(c = \mathbb{E}[W]\) (the arithmetic mean, the standard Bregman mean property).
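The mean property is easy to confirm empirically: grid-minimizing the sampled gKL loss over a constant prediction \(c\) recovers the arithmetic mean of the targets. A sketch with lognormal samples (an arbitrary positive distribution chosen for illustration):

```python
import numpy as np

def gkl(a, b):
    """Generalized KL divergence between positive values (summand of Eq. 11)."""
    return a * np.log(a / b) - a + b

rng = np.random.default_rng(0)
W = rng.lognormal(mean=0.0, sigma=0.5, size=10_000)   # positive regression targets

# Grid-minimize the empirical gKL loss over a constant prediction c.
cs = np.linspace(0.5, 2.5, 2001)
losses = np.array([gkl(W, c).mean() for c in cs])
c_best = cs[np.argmin(losses)]

# Bregman mean property: the minimizer is the arithmetic mean of W.
assert abs(c_best - W.mean()) < 1e-2
```

The loss is convex in \(c\) (its second derivative is \(\mathbb{E}[W]/c^2 > 0\)), so the stationary point \(c = \mathbb{E}[W]\) is the unique minimizer.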
The DAM loss
Combining Equation 3 with Equation 10, the regression target for the rate \(u_\theta(y, x, t)\) is \(r_t(y, x) \cdot \tilde{a}_t(y; X_1)\). The Discrete Adjoint Matching loss:
\[ L_{\text{DAM}}(\theta) = \mathbb{E}_{t, \, X_t \sim \mathbb{P}^{\bar{u}}} {\left[ D_{\text{gKL}} {\left( r_t(\cdot, X_t) \cdot \tilde{a}_t(\cdot; X_1), \;\; u_\theta(\cdot, X_t, t) \right)} \right]} , \tag{12}\]
where \(\bar{u} = \texttt{sg}(u_\theta)\) (stop-gradient: trajectories are simulated with frozen parameters, gradients only flow through the \(u_\theta\) term). The Bregman mean property gives the unique fixed point: the minimizer over \(u_\theta\) of \(\mathbb{E}[D_{\text{gKL}}(r \cdot \tilde{a}, u_\theta)]\) is \(u_\theta = \mathbb{E}[r \cdot \tilde{a}] = r \cdot \mathbb{E}[\tilde{a}]\). By Equation 10, \(r_t \cdot \mathbb{E}[\tilde{a}_t \mid X_t] = u^\star_t\), so the unique fixed point is \(u^\star\).
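A minimal sketch of the loss and its fixed point, assuming tabular (array-valued) rates rather than a network; the helper `dam_loss` and the shapes are illustrative choices, not the paper's implementation:

```python
import numpy as np

def gkl(a, b):
    """Generalized KL divergence between positive values (summand of Eq. 11)."""
    return a * np.log(a / b) - a + b

def dam_loss(u_theta, r, a_tilde):
    """Batch DAM loss (Eq. 12): gKL from the target r * a_tilde to the model
    rate u_theta, averaged over the batch. Arguments are positive arrays (or
    scalars) broadcasting over (batch, n_target_states)."""
    return gkl(r * a_tilde, u_theta).mean()

rng = np.random.default_rng(0)
r = 0.2                                        # reference rate gamma / N
a_tilde = rng.lognormal(size=(1000, 1))        # stand-in adjoint samples

# Fixed point: the minimizing constant rate is r * E[a_tilde]; by Eq. 10 this
# equals u* when a_tilde is sampled under the optimal path measure.
u_opt = r * a_tilde.mean()
assert dam_loss(u_opt, r, a_tilde) < dam_loss(1.1 * u_opt, r, a_tilde)
assert dam_loss(u_opt, r, a_tilde) < dam_loss(0.9 * u_opt, r, a_tilde)
```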
When the reference has the additive property (Equation 6, uniform rate on \(\mathbb{Z}_N^D\)), the target simplifies. Since \(r_t(y, x) = \gamma_t / N\) is constant in \(y\), the regression target at each \((y, x)\) pair only depends on \(y\) through \(\tilde{a}_t(y; X_1)\), which uses the additive transition kernel \(p^r_{1|t}(z \mid y) = q_t(z - y)\).
The additive structure enters the closed-form Equation 9 as follows. Substituting \(p^r_{1|t}(z \mid y) = q_t(z - y)\):
\[ \tilde{a}_t(y; X_1) = \sum_z q_t(z - y) \, e^{g(X_1) - g(z)} = \mathbb{E}_{Z \sim q_t(\cdot)} {\left[ e^{g(X_1) - g(y + Z)} \right]} . \]
The regression target depends on \(y\) only through \(g(y + Z)\), where \(Z\) is a random displacement drawn from the reference noise distribution. When \(y\) differs from \(x\) in a single coordinate \(d\) (Hamming distance 1), \(y + Z\) and \(x + Z\) differ in the same coordinate. This coordinate-wise factorization makes the computation scale as \(O(DN)\) per state, not \(O(N^D)\).
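A sketch of this shared-displacement estimator; the helper `neighbor_targets`, its signature, and the convention that `g` accepts a batch of states are illustrative assumptions, not an API from the papers:

```python
import numpy as np

def neighbor_targets(x, x1, g, sample_Z, N, n_samples=512, rng=None):
    """Monte Carlo estimate of a_t(y; x1) = E_Z[e^{g(x1) - g(y + Z)}] for every
    single-coordinate neighbor y of x, reusing one set of displacement draws Z.
    g must accept an (n, D) array of states (broadcasting over leading axes)
    and return (n,) costs; sample_Z(rng, n) draws n displacements from q_t."""
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x)
    D = len(x)
    Z = sample_Z(rng, n_samples)                 # shared (n_samples, D) draws
    targets = {}
    for d in range(D):                           # D * (N - 1) neighbors in total
        for v in range(N):
            if v == x[d]:
                continue
            y = x.copy()
            y[d] = v
            vals = np.exp(g(x1) - g((y + Z) % N))
            targets[(d, v)] = vals.mean()
    return targets
```

With a degenerate displacement distribution (\(Z \equiv 0\)) the estimate reduces to the exact terminal difference \(e^{g(x_1) - g(y)}\), which makes a convenient unit test.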
When the base model uses a masking or sparsity structure (as in discrete diffusion), the optimal rate inherits it: by Equation 4, \(u^\star_t = r_t \cdot e^{V_t(y) - V_t(x)}\) vanishes wherever \(r_t\) does, so if \(r_t\) only allows single-coordinate jumps, \(u^\star_t\) also only allows single-coordinate jumps. This reduces the modeling complexity from \(O(N^D)\) to \(O(DN)\).
Dropping the memoryless condition: DASBS
The value function bias from adjoint matching and the SB decomposition from adjoint sampling apply here verbatim: without the memoryless condition, the SOC terminal marginal is not \(\pi\). The fix is the same: decompose the Schrödinger bridge into a controller (Doob h-transform via \(\varphi_t\)) plus a corrector (backward potential \(\widehat{\varphi}_t\)), giving
\[ u^\star_t(y, x) = r_t(y, x) \cdot \frac{\varphi_t(y)}{\varphi_t(x)}, \]
with modified terminal cost \(g(x) = \log(\widehat{\varphi}_1(x)/\pi(x))\) that absorbs the prior coupling.
Alternating scheme
Controller step: fix the corrector \(\widehat{\varphi}_1\), solve the modified SOC with terminal cost \(g(x) = \log(\widehat{\varphi}_1(x)/\pi(x))\). This is discrete AM (Equation 12) with the modified terminal cost. The controller learns the ratio \(\varphi_t(y)/\varphi_t(x)\).
Corrector step: fix the controller, learn \(\widehat{\varphi}_1\) by matching the backward transition rates. Two options:
- Adjoint matching for the corrector: uses the additive property to express \(\widehat{\varphi}_1(z)/\widehat{\varphi}_1(y)\) as a conditional expectation under the current path measure, with target \(\widehat{\varphi}_0(x - y + z)/\widehat{\varphi}_0(x)\) and the boundary condition \(\widehat{\varphi}_0 = \mu / \varphi_0\).
- Denoising matching for the corrector: regresses onto \(p^r_{1|t}(x_1^{d \gets n} \mid x) / p^r_{1|t}(x_1 \mid x)\), the ratio of reference transition probabilities, where \(x_1^{d \gets n}\) denotes \(x_1\) with coordinate \(d\) set to value \(n\). This does not require the additive property and works for general reference kernels.
This alternation is IPFP/Sinkhorn in disguise: each half-step solves a half-bridge problem. Convergence follows from the contraction argument in the SB notes; see (liu2025dasbs?) for the discrete convergence proof.
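The static analogue of this alternation is easy to exhibit: classic IPFP/Sinkhorn scaling of a reference coupling until both marginals match. A sketch on a random 5-state kernel (toy data; the dynamic scheme above alternates learned half-bridges instead of exact scalings):

```python
import numpy as np

N = 5
rng = np.random.default_rng(1)
P = 1.0 + rng.random((N, N))                 # strictly positive reference kernel
P /= P.sum(axis=1, keepdims=True)
mu = np.full(N, 1.0 / N)                     # initial marginal
pi = rng.random(N)
pi /= pi.sum()                               # target marginal

# IPFP / Sinkhorn on the static bridge: scale the reference coupling mu(x)P(x,y)
# until both marginals match; each update is one "half-bridge".
K = mu[:, None] * P
a, b = np.ones(N), np.ones(N)
for _ in range(100):
    a = mu / (K @ b)                         # controller half-step: fix start marginal
    b = pi / (K.T @ a)                       # corrector half-step: fix end marginal
coupling = a[:, None] * K * b[None, :]

assert np.allclose(coupling.sum(axis=0), pi)              # exact after the b-update
assert np.allclose(coupling.sum(axis=1), mu, atol=1e-6)   # converged
```

The scaling vectors `a` and `b` play the role of \(\varphi\) and \(\widehat{\varphi}\): the converged coupling is the reference tilted by a product of forward and backward potentials.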
Empirical finding from (liu2025dasbs?): non-memoryless reference schedules outperform memoryless ones in discrete settings. The memoryless condition forces aggressive mixing that wastes computation; relaxing it allows more efficient transport.
Summary: continuous vs discrete
| | Continuous | Discrete |
|---|---|---|
| State space | \(\mathbb{R}^D\) | \(\mathbb{Z}_N^D\) |
| Base dynamics | SDE, Brownian motion | CTMC, uniform rate |
| Additive noise | Gaussian: \(p_{1|t}(y\mid x) = q_t(y-x)\) | Cyclic group: \(p_{1|t}(y\mid x) = q_t(y - x)\) |
| Optimal control | \(u^\star = \sigma_t \nabla V\) (additive) | \(u^\star = r_t \cdot e^{V(y)-V(x)}\) (multiplicative) |
| Adjoint terminal | \(\nabla g(X_1)\) (gradient) | \(e^{g(X_1) - g(y)}\) (exponential difference) |
| Adjoint ODE | Stochastic (\(\nabla_x b\) depends on \(X_t\)) | Deterministic (admits closed form) |
| Matching loss | \(L^2\) regression | Generalized KL divergence |
| Corrector | \(\nabla \log \widehat{\varphi}_1\) (score) | \(\widehat{\varphi}_1(y)/\widehat{\varphi}_1(x)\) (ratio) |
The discrete analogue of Nelson’s relation from the BMS notes is detailed balance: \(u^\star_t(y,x)/u^{\star,\text{rev}}_t(x,y) = p^\star_t(y)/p^\star_t(x)\).