Critique: adjoint_matching.qmd

Overall Assessment

The note covers the right ideas but reads like a competent summary rather than a re-derivation in the author’s distinctive voice. It describes results instead of computing them, lacks the Euler-discretization-and-Bayes’-rule reasoning that makes HJB.qmd and doob.qmd come alive, and misses several key insights from the paper. The tone is closer to a textbook overview than to the existing notes’ dense, conversational, compute-first style.

Major Issues

  1. The note describes but never derives. The existing notes (doob.qmd, HJB.qmd, girsanov.qmd) build results from scratch using informal Euler discretization arguments. This note simply states results: “The Girsanov theorem gives…” (line 39), “The optimal control is \(u^\star = ...\)” (line 45), “The adjoint state satisfies…” (line 117). The reader is told what to believe rather than shown why. The value function bias, the memoryless condition, and the lean adjoint derivation should all be computed step by step, not announced.

  2. The value function bias section (lines 48-67) is the central insight but is underdeveloped. The note writes down the joint optimal distribution ?@eq-joint-optimal but never explains where it comes from. In HJB.qmd, this is derived via the variational formulation and Girsanov. Here it appears from nowhere. The note should derive the optimal path distribution (or at minimum cross-reference the exact result in HJB.qmd) and then show concretely why marginalization fails. The sentence “This is the value function bias” (line 66) names the problem but the reader hasn’t felt it.

  3. The memoryless condition derivation is entirely missing. Lines 82-86 state the formula \(\sigma(t) = \sqrt{2\eta_t}\), but the actual derivation is absent. The paper’s key insight is that memorylessness = independence of \(X_0\) and \(X_1\), which pins down \(\sigma(t)\). This should be derived (at least heuristically) using the conditional Gaussian structure of \(\bar{X}_t | \bar{X}_1\). The <details> block on lines 88-93 is hand-wavy and doesn’t actually show why \(\sigma(t) = \sqrt{2\eta_t}\) works.

  4. The lean adjoint derivation skips the key computational step. The passage from ?@eq-adjoint-ode to ?@eq-lean-adjoint is the core algorithmic contribution, but the note treats it as “remove the zero-expectation terms.” The reader needs to see: (a) the full adjoint ODE written out, (b) the identification of which terms cancel at optimality, (c) the lean adjoint as a consequence. The <details> block on lines 145-154 is a sketch, not a derivation.

  5. ?@eq-joint-optimal is wrong or at least non-standard. The formula \(p^\star(X_0, X_1) = p^{\text{base}}(X_0, X_1) \exp(r(X_1) + V(X_0, 0))\) is missing its normalizing constant. It should read \(p^\star(X_0, X_1) \propto p^{\text{base}}(X_0, X_1) \exp(r(X_1) + V(X_0, 0))\), or, more precisely, the unnormalized form with a clear explanation of the normalizing constant. The paper writes it carefully with proper conditioning. As written, the right-hand side is not normalized, so the equality cannot hold between densities.

  6. No portrait image. Every existing note has a historical figure portrait at the top. The TODO comment on lines 13-17 suggests Carles Domingo-Enrich but no image is included.

Style Issues

  1. Line 21: “A pre-trained generative model (flow matching or diffusion) produces samples \(X_1\) by simulating an SDE on \([0,1]\):” – This is expository preamble. Compare doob.qmd, which opens immediately: “Consider a continuous time Markov process \(X_t\)…” The note should start with the SDE and the problem, not with a description of what we’re about to do.

  2. Line 27: “The drift \(b\) encodes the pre-trained model (score function or velocity field), and the resulting terminal distribution is \(p^{\text{base}}(X_1)\).” – Describing rather than computing. The existing notes would write this as an inline statement alongside the SDE definition.

  3. Line 33: “As described in the stochastic optimal control notes, the natural framework for this is a KL-regularized control problem.” – “The natural framework for this” is vague connective tissue. In the existing notes, the connection to KL would be shown via Girsanov, not declared “natural.”

  4. Line 50: “There is a subtlety that makes this problem different from the SOC setting for language models, and it comes from the fact that the generative process is dynamical: the initial noise \(X_0\) and the generated sample \(X_1\) can be correlated.” – This reads like an AI-written topic sentence. The existing notes would jump straight to the computation: write down \(p^\star(X_0, X_1)\), try to marginalize, and show it doesn’t factor.

  5. Line 66: “For deterministic processes (\(\sigma(t) = 0\), as in standard flow matching), \(X_0\) fully determines \(X_1\), and the bias is maximal: the control literally cannot change anything because there is no noise to steer.” – “the control literally cannot change anything because there is no noise to steer” is good intuition but it’s buried in a paragraph of description. This should be a standalone one-liner after a computation.

  6. Line 99: “An important practical point: the memoryless schedule is only needed during fine-tuning.” – “An important practical point” is filler. Just state it.

  7. Line 106: “With the memoryless schedule in hand, we need an algorithm to solve the SOC problem ?@eq-soc. Three approaches exist, each with trade-offs.” – Classic AI writing pattern. “Three approaches exist, each with trade-offs” is the kind of signposting the existing notes never do. They just present the approaches.

  8. Lines 108-123: The three-approach comparison reads like a literature review, not a mathematical note. The existing notes would skip the adjoint method and SOCM summaries (since the adjoint method is already in adjoint.qmd) and go straight to the Adjoint Matching idea. The reader doesn’t need a survey paragraph about SOCM.

  9. Lines 173-186: “The complete fine-tuning recipe” numbered list is a style mismatch. The existing notes avoid algorithmic-recipe-style numbered lists. This would be better as a short paragraph or omitted entirely.

  10. Line 189: “The connection between Adjoint Matching and the adjoint method is worth spelling out.” – “is worth spelling out” is hedging/announcing. Just spell it out.

Missing Content

  1. The paper’s Theorem on necessity of the memoryless schedule. The note mentions sufficiency in the text and hints at necessity on line 101, but never states the result clearly. The paper proves that within the family of generative SDEs sharing time marginals, memoryless is the only schedule guaranteeing convergence to \(p^\star\) for arbitrary rewards. This is a stronger and more interesting result than mere sufficiency.

  2. Connection to DDPM. The paper’s striking observation that the memoryless schedule for DDIM recovers DDPM is mentioned on line 97 but not developed. This deserves more attention: it gives retrospective justification for DDPM’s noise schedule, which was previously considered just one heuristic choice among many.

  3. The fixed-point interpretation of Adjoint Matching. The paper emphasizes that \(u^\star\) is the unique fixed point of \(u(x,t) = -\sigma(t)^\top \E[a(t; X, u) | X_t = x]\). This is mentioned briefly in the <details> block (lines 134-137) but should be a central conceptual point. It’s what makes Adjoint Matching conceptually different from standard adjoint optimization.

  4. Variance reduction from the lean adjoint. The paper discusses extensively that removing zero-expectation terms reduces variance of the gradient estimator. The note mentions this (line 162) but doesn’t explain why the lean adjoint has lower variance. An informal argument (the full adjoint contains large-magnitude terms that cancel in expectation but add noise) would be valuable.

  5. The relationship between the score and velocity after fine-tuning. Lines 99 and 186 claim the memoryless schedule “preserves the relationship between velocity and score.” This is a non-trivial claim from the paper (Theorem 2) and deserves at least a sentence of explanation.

  6. Concrete example with flow matching schedule. The note gives the formula \(\sigma(t) = \sqrt{2(1-t)/t}\) for the standard flow matching schedule (line 95), but doesn’t show the computation. A two-line derivation from \(\alpha_t = t\), \(\beta_t = 1-t\) would make the formula concrete and verifiable.
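
The fixed-point and lean-adjoint points (items 3 and 4) are concrete enough to sanity-check numerically. Below is a minimal sketch, under toy assumptions that are mine, not the paper’s: a 1-D Ornstein–Uhlenbeck base drift, a quadratic reward, a linear control, and a constant noise schedule (not the memoryless one), with the usual adjoint sign convention \(d\tilde{a}/dt = -(\partial b/\partial x)^\top \tilde{a}\). It simulates one controlled trajectory, solves the lean adjoint backward using only the base drift, and evaluates the regression loss that enforces \(u \approx -\sigma^\top \tilde{a}\):

```python
import numpy as np

# Toy choices (assumptions for illustration): 1-D OU base drift b(x,t) = -x,
# reward r(x) = -x^2/2, linear control u(x,t) = theta*x, constant sigma.
rng = np.random.default_rng(0)
N, dt = 100, 1.0 / 100
sigma, theta = 1.0, 0.3

b = lambda x, t: -x           # base drift
db_dx = lambda x, t: -1.0     # its spatial Jacobian (a scalar in 1-D)
u = lambda x, t: theta * x    # current control iterate
grad_r = lambda x: -x         # gradient of the reward r(x) = -x^2/2

# 1) Euler-Maruyama simulation of the controlled SDE dX = (b + sigma*u) dt + sigma dW.
x = np.zeros(N + 1)
x[0] = rng.standard_normal()
for k in range(N):
    t = k * dt
    x[k + 1] = x[k] + (b(x[k], t) + sigma * u(x[k], t)) * dt \
               + sigma * np.sqrt(dt) * rng.standard_normal()

# 2) Lean adjoint, solved backward: terminal condition a(1) = -grad r(X_1),
#    and d a/dt = -(db/dx)^T a -- only the BASE drift appears, no u-dependent terms.
a = np.zeros(N + 1)
a[N] = -grad_r(x[N])
for k in range(N, 0, -1):
    a[k - 1] = a[k] + db_dx(x[k], k * dt) * a[k] * dt

# 3) Adjoint Matching regression loss along the (stop-gradient) trajectory:
#    fit u(X_t, t) to -sigma^T a_tilde(t).
loss = sum((u(x[k], k * dt) + sigma * a[k]) ** 2 * dt for k in range(N))
```

In the full algorithm the trajectory and adjoint are computed with parameters frozen (the \(\texttt{sg}\) of ?@eq-basic-am) and only the regression term is differentiated; the sketch stops before that step.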

Unnecessary Content

  1. Lines 108-123: The overview of adjoint method and SOCM can be heavily compressed. The adjoint method is already covered in adjoint.qmd. A single sentence cross-referencing it, plus one sentence on why SOCM fails (exponential variance), suffices. The current text reads like a related-work section.

  2. Lines 189-197: “Connection to the adjoint method” section. This section restates what was already said (lean adjoint = adjoint ODE with \(f=0\)). The observation is already made on line 162 (“the lean adjoint depends only on the base drift”). This section adds no new content and can be folded into a sentence near ?@eq-lean-adjoint.

  3. Lines 173-186: Algorithm recipe. The numbered list is both unnecessary and stylistically out of place. The note already defines all the pieces. A one-paragraph summary would suffice.

Line-by-Line Notes

  • Line 27: p^{\text{base}}(X_1) notation. The existing notes use \bbP for path measures and lowercase \(p\) for densities. Be more precise about whether this is a density or a distribution.

  • Line 39: The KL formula \(\kl(\bbP^u \| \bbP) = \frac{1}{2} \E \int_0^1 \|u\|^2 dt\) uses the \kl macro correctly but the expectation inside uses \E without specifying under which measure. Should be \(\E_{\bbP^u}\) or \(\E^u\) to be precise (girsanov.qmd is explicit about this).

  • Line 45: $u^\star(t,x) = \sigma(t)^\top \nabla V(t,x)$ has a sign issue relative to the paper. The paper defines the value function with a minimization convention, so \(u^\star = -\sigma^\top \nabla V\). The note seems to be using a maximization convention (since the SOC objective ?@eq-soc is a minimization of costs = maximization of reward). Check that the sign is consistent with HJB.qmd, which uses \(u^\star = \sigma^\top \nabla V\) (where \(V = \log h\) and \(h\) is the exponential conditional expectation). This needs to be verified carefully; currently the sign conventions between ?@eq-soc and the formula for \(u^\star\) are potentially inconsistent.

  • Line 55: $V(X_0, 0)$ should be $V(0, X_0)$ if using the convention \(V(t,x)\) from line 45. The argument order is swapped.

  • Line 82: The expression for the drift \(b(x,t) = \kappa_t x + (\sigma(t)^2/2 + \eta_t) \mathfrak{s}(x,t)\) is taken directly from the paper but uses the notation \(\mathfrak{s}\) (Fraktur s) for the score, which is not used anywhere in the existing notes. The existing notes use \(\nabla \log p_t(x)\) or just “the score.” Use consistent notation.

  • Line 95: The specific schedule \(\sigma(t) = \sqrt{2(1-t)/t}\) should be derived, not stated. Show: \(\eta_t = \beta_t(\dot{\alpha}_t \beta_t / \alpha_t - \dot{\beta}_t) = (1-t)(1 \cdot (1-t)/t - (-1)) = (1-t)/t\).

  • Line 117: ?@eq-adjoint-ode uses compact notation \(\nabla_x (b + \sigma u)^\top a(t)\) that obscures what’s happening. Write it out more explicitly so the reader can see which terms depend on \(u\) and which don’t.

  • Line 127: ?@eq-basic-am uses $\texttt{sg}$ for stop-gradient. This is fine for ML audiences but should get a brief note about what it means mathematically (freeze parameters when computing the trajectory and adjoint, differentiate only through the regression term).

  • Line 142: ?@eq-zero-terms writes \(u^\star(x,t)^\top \nabla_x u^\star(x,t)\) but this is a row-vector times a Jacobian matrix. Clarify the dimensions or use trace notation.

  • Line 159: The lean adjoint ?@eq-lean-adjoint terminal condition has \(\tilde{a}(1) = -\nabla r(X_1)\), but the full adjoint ?@eq-adjoint-ode has \(a(1) = -\nabla r(X_1)\). This is correct (same terminal condition), but the sign should be verified against the convention in ?@eq-soc where \(g = -r\).
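
The schedule computation quoted in the line-95 note is short enough to verify symbolically. A quick check of the flow matching case \(\alpha_t = t\), \(\beta_t = 1-t\), using the \(\eta_t\) formula from that note:

```python
import sympy as sp

t = sp.symbols('t', positive=True)
alpha, beta = t, 1 - t                       # flow matching interpolation coefficients
# eta_t = beta_t * (alpha_dot_t * beta_t / alpha_t - beta_dot_t)
eta = beta * (sp.diff(alpha, t) * beta / alpha - sp.diff(beta, t))
assert sp.simplify(eta - (1 - t) / t) == 0   # eta_t = (1 - t)/t
sigma_mem = sp.sqrt(2 * eta)                 # memoryless schedule sigma(t) = sqrt(2*eta_t)
assert sp.simplify(sigma_mem**2 - 2 * (1 - t) / t) == 0
```

This confirms \(\sigma(t) = \sqrt{2(1-t)/t}\), the formula stated on line 95 of the note.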

Suggested Structure for v2

  1. Setup (short). SDE, controlled SDE, SOC objective. Cross-reference HJB.qmd and Girsanov.qmd for the KL interpretation. One paragraph, three equations. No “the natural framework” type connectors.

  2. Value function bias (the core problem). Start from the optimal path distribution (derive it using the variational formulation from HJB.qmd, or at minimum state it with a precise cross-reference and equation number). Write down the joint \(p^\star(X_0, X_1)\). Try to marginalize. Show the \(\exp(V(X_0,0))\) factor prevents factoring. Compute the deterministic case \(\sigma = 0\) as a concrete example where the bias is maximal.

  3. Memoryless schedule (the fix). State the independence condition \(X_0 \perp X_1\). Show that under independence, the marginal over \(X_0\) becomes a constant. Then derive the noise schedule: use the conditional Gaussian structure \(\bar{X}_t | \bar{X}_1 \sim \normal(\alpha_t \bar{X}_1, \beta_t^2 I)\) to show \(\sigma(t) = \sqrt{2\eta_t}\) is the unique schedule matching this conditional structure. Work out the flow matching example (\(\alpha_t = t\), \(\beta_t = 1-t\)) explicitly. State the necessity result. Note the DDPM recovery.

  4. Adjoint Matching (the algorithm). Start from the fixed-point characterization: the optimal control satisfies \(u^\star(t,x) = -\sigma(t)^\top \E[a(t) | X_t = x]\). This is the key observation. Write the basic AM loss as a regression to enforce this fixed point. Show (compute!) that expanding the square and taking the gradient recovers the continuous adjoint gradient. Then derive the lean adjoint: write the full adjoint ODE, identify the \(u\)-dependent terms, show they have zero conditional expectation at optimality (using the conditional-expectation characterization), remove them. State the final AM loss. Explain the variance reduction intuitively.

  5. After fine-tuning (short). One paragraph on why the fine-tuned model can switch back to any noise schedule.
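
For item 3, the conditional-Gaussian step could be spelled out in essentially two lines (a sketch, using the interpolation convention from the line-by-line notes):

```latex
\bar{X}_t = \alpha_t \bar{X}_1 + \beta_t \varepsilon,
\quad \varepsilon \sim \mathcal{N}(0, I) \text{ independent of } \bar{X}_1
\;\Longrightarrow\;
\bar{X}_t \mid \bar{X}_1 \sim \mathcal{N}\!\left(\alpha_t \bar{X}_1,\, \beta_t^2 I\right),
```

and \(\sigma(t) = \sqrt{2\eta_t}\), with \(\eta_t = \beta_t(\dot{\alpha}_t \beta_t / \alpha_t - \dot{\beta}_t)\), is the unique schedule whose generative SDE reproduces exactly this conditional law.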

This restructuring prioritizes computation over description, puts the “aha moments” front and center, and matches the style of the existing notes.