Second-Pass Critique: adjoint_matching_v2.qmd

Verdict

Needs minor-to-moderate fixes: one genuine sign error in the Radon-Nikodym formula, a macro collision with existing notes, and a handful of style/consistency issues.

Sign Convention Issues

1. Wrong sign in ?@eq-radon-nikodym (line 59)

The note defines \(V = \log h\) with \(h(t,x) = \E[\exp(r(X_1)) | X_t = x]\) (lines 50–53), matching the HJB.qmd maximization convention where \(V > 0\) for positive rewards. But at line 59 it writes:

\[\frac{d\bbP^\star}{d\bbP}(X_{[0,1]}) \propto \exp(r(X_1) + V(0, X_0))\]

This formula comes from the original paper (Eq. 18), where \(V^{\text{paper}}(x,t) = -\log \E[\exp(r(X_1))|X_t=x]\) is the minimization value function (negative of the note’s \(V\)). In the paper’s convention, \(r(X_1) + V^{\text{paper}} = r(X_1) - \log \E[\exp(r)|X_0]\), which is correct. But in the note’s convention (\(V = \log h\)), the formula should be:

\[\frac{d\bbP^\star}{d\bbP}(X_{[0,1]}) \propto \exp\big(r(X_1) - V(0, X_0)\big)\]

since the conditional RN derivative is \(d\bbP^{u^\star}(\cdot|X_0) / d\bbP(\cdot|X_0) = \exp(r(X_1)) / h(0,X_0) = \exp(r(X_1) - V(0,X_0))\).
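The corrected sign is easy to sanity-check numerically: given \(X_0\), the weights \(\exp(r(X_1) - V(0, X_0))\) must average to one under the base transition. A minimal toy sketch with a Gaussian base transition and a linear reward (the model and all names here are illustrative, not from the note):

```python
import numpy as np

rng = np.random.default_rng(0)
c, x0 = 0.7, 1.3                        # linear reward r(x) = c*x; condition on X_0 = x0
x1 = x0 + rng.standard_normal(200_000)  # toy base transition X_1 | X_0 ~ N(x0, 1)
V0 = c * x0 + c**2 / 2                  # V(0, x0) = log E[exp(c X_1) | X_0 = x0]
w = np.exp(c * x1 - V0)                 # exp(r(X_1) - V(0, X_0))
print(abs(w.mean() - 1.0) < 0.02)       # conditional RN weights normalize: E[w | X_0] = 1
```

With the sign flipped to \(+V(0, X_0)\), the same average comes out \(e^{2V_0}\) times too large, which is a quick way to catch the error.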

The qualitative argument is unaffected (the bias discussion and memoryless fix work identically with either sign, because the factor involving \(V(0,X_0)\) still couples \(X_0\) and \(X_1\)). But the formula is internally inconsistent with the note’s own conventions.

Fix: change line 59 to \(\propto \exp(r(X_1) - V(0, X_0))\).

2. Cascading error in ?@eq-marginal-bad (lines 68–70)

Same sign error propagates. The joint should read:

\[p^\star(X_0, X_1) \propto p^{\text{base}}(X_0, X_1) \exp(r(X_1) - V(0, X_0))\]

In ?@eq-marginal-good (lines 83–86), the conclusion remains correct either way: \(p^\star(X_1) \propto p_1(X_1) \exp(r(X_1))\) under independence. But fix the intermediate formula.

3. Line 62: initial distribution description is misleading

“The \(V(0, X_0)\) term comes from the initial distribution shift: the optimally controlled process starts from \(q_0(x_0) \propto p_0(x_0) h(0, x_0) = p_0(x_0) \exp(V(0, x_0))\), not from \(p_0\).”

This describes the initial distribution of the Doob-tilted process \(\bbQ\) from HJB.qmd (which jointly optimizes \(q\) and \(u\)). But the adjoint matching setup at line 24 fixes \(X_0 \sim \normal(0,I)\) and only optimizes \(u\). The controlled process \(\bbP^{u^\star}\) still starts from \(p_0\), not from \(q_0\). The \(V(0,X_0)\) term in the RN derivative is the conditional normalizing constant \(h(0,X_0)\), not an initial distribution shift.

Fix: Rewrite line 62 to say something like: “The factor \(\exp(-V(0, X_0)) = 1/h(0, X_0)\) is the conditional normalizing constant: given \(X_0\), the conditional path measure satisfies \(d\bbP^{u^\star}(\cdot | X_0) / d\bbP(\cdot|X_0) = \exp(r(X_1)) / h(0, X_0)\).”

4. SOC problem matches HJB.qmd (correct)

Line 42: \(\min_u \cJ(u) = \E[\int \frac{1}{2}\|u\|^2 dt - r(X_1)]\), line 45 maps to HJB with \(f=0\), \(g=r\). Line 50: \(u^\star = \sigma^\top \nabla V\) with \(V = \log h\). Line 53: \(h = \E[\exp(r(X_1))|X_t=x]\). All consistent with HJB.qmd.

5. Adjoint terminal condition (correct)

Line 136: \(a(1) = -\nabla r(X_1)\). Since the cost is \(-r(X_1)\), we have \(\nabla_{X_1}(-r) = -\nabla r\). Correct.

6. Fixed-point condition sign (correct)

Line 142: \(u^\star = -\sigma^\top \E[a | X_t]\). Since \(\E[a|X_t] = \nabla_x \cJ\) and \(\nabla V = -\nabla \cJ\), this gives \(u^\star = \sigma^\top \nabla V\). Correct.

Notation Inconsistencies (cross-note)

1. \cL used for loss (all 4 v2 notes) vs generator (existing notes)

In HJB.qmd, doob.qmd, and girsanov.qmd, \(\cL\) is the infinitesimal generator of the diffusion. In all four v2 notes, \(\cL\) (or \(\cL_\text{AM}\), \(\cL_\text{RAM}\), etc.) denotes a loss/objective function. A reader who has just finished HJB.qmd will see \(\cL_\text{AM}\) and parse it as “generator subscript AM.”

Fix across all 4 v2 notes: use a different symbol for the loss, e.g., \(L_\text{AM}\) or \(\ell_\text{AM}\), reserving \(\cL\) for the generator. (Or use \(\cJ\) for the cost, which is already used in adjoint_matching_v2 for the SOC cost but not for the regression loss.)

2. \sigma(t) in adjoint_matching vs \sigma_t in other 3 notes

adjoint_matching_v2 uses functional notation \(\sigma(t)\) (lines 24, 36, 96, etc.). The other three v2 notes (adjoint_sampling, ASBS, BMS) use subscript notation \(\sigma_t\). Not a mathematical error, but a cosmetic inconsistency.

Fix: pick one convention for all 4 notes. \(\sigma_t\) is shorter and matches the \(\kappa_t\), \(\eta_t\) subscript style already used at line 99.

3. Trajectory notation

adjoint_matching_v2 uses \(X^u_t\) (superscript \(u\), subscript \(t\)) at line 36 and then drops the \(u\) superscript in the adjoint section. adjoint_sampling_v2 uses \(\boldsymbol{X}\) for full trajectories. BMS_v2 uses \((X_0, X_t, X_T)\) tuple notation. These are all reasonable per-note choices and not confusing.

4. Time horizon

adjoint_matching_v2 uses \([0,1]\). BMS_v2 uses \([0,T]\). The others use \([0,1]\). Minor cosmetic issue (BMS is the outlier). No fix needed in adjoint_matching.

Pedagogy Gaps

1. Lines 45–46: the HJB mapping

“Equivalently, we maximize \(\E[-\frac{1}{2}\|u\|^2 + r(X_1)]\), which is the SOC problem from the HJB notes with \(f=0\) and \(g=r\).”

A grad student will stumble. The HJB notes maximize \(\E[\int (f - \frac{1}{2}\|u\|^2) ds + g(X_T)]\). With \(f=0\), \(g=r\), this is \(\E[-\int \frac{1}{2}\|u\|^2 \, ds + r(X_1)]\). Fine, but the HJB notes also jointly optimize over the initial distribution \(q\), which the adjoint matching note does not. A one-sentence clarification would help: “Here we fix \(X_0 \sim \normal(0,I)\) and optimize only over \(u\), unlike the joint \((q,u)\) optimization in the HJB notes.”

2. Lines 56–62: Radon-Nikodym formula appears from nowhere

The note claims “The HJB notes derive the Radon-Nikodym derivative… the optimal path distribution satisfies ?@eq-radon-nikodym.” But HJB.qmd derives \(d\bbQ/d\bbP \propto \exp(\int f ds + g(X_T))\) for the jointly optimized measure. The specific formula with \(V(0,X_0)\) for the fixed-initial-distribution setting is not explicitly in HJB.qmd. The reader is told to look at HJB.qmd but won’t find this specific formula there.

Fix: Either derive the formula from scratch (3 lines using conditional Girsanov) or point to the precise equation in HJB.qmd and explain the modification for fixed \(p_0\).

3. Lines 99–103: memoryless schedule derivation is a hand-wave

The key claim “Matching these exactly… requires \(\sigma(t) = \sqrt{2\eta_t}\)” (line 102) is stated without proof. The derivation says: the conditional variance \(\beta_t^2\) decays at rate \(\eta_t\), the diffusion adds noise at rate \(\sigma^2/2\), “matching these exactly” gives \(\sigma^2/2 = \eta_t\).

A grad student will ask: matching WHAT exactly? The note never writes the equation that is being matched. The paper proves this via Fokker-Planck analysis of the conditional distribution \(X_t | X_1\). The note should either reproduce the key step (show that \(X_t | X_1 \sim \normal(\alpha_t X_1, \beta_t^2 I)\) requires the SDE’s conditional variance to reproduce \(\beta_t^2\), giving \(\sigma^2 = 2\eta_t\)) or admit it’s skipping the proof and cite the paper.

4. Lines 157–173: zero-expectation argument

The derivation is correct but dense. The jump from “right-multiply both sides by \(\nabla_x u^\star\)” to the conclusion at ?@eq-zero-terms requires the reader to track row-vector vs column-vector transposes. Adding one sentence (“Both sides are row vectors in \(\bbR^{1 \times D}\)”) would help.

5. Lines 127–133: adjoint state definition

The adjoint state \(a(t)\) is defined as the gradient of the “future cost” (line 130), and then the adjoint ODE is stated (line 136) with the claim “Applying the adjoint ODE from the adjoint notes.” The adjoint notes derive \(\dot{\lambda} = -\nabla_x f - b_x^\top \lambda\), \(\lambda(T) = \nabla_x g\) for \(\cL = \int f dt + g\). Mapping: the “drift” for the controlled system is \(b + \sigma u\); the “running cost” is \(\frac{1}{2}\|u\|^2\) (no \(f\) in the running state cost); the “terminal cost” is \(-r\). The resulting adjoint should be \(\dot{a} = -\nabla_x(\frac{1}{2}\|u\|^2) - (\nabla_x(b+\sigma u))^\top a\), \(a(1) = -\nabla r\). Line 136 is correct, but a grad student needs the explicit mapping to verify it.
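For reference, the explicit mapping could be displayed in one line (a sketch in the adjoint notes' column-vector convention, with drift \(b + \sigma u\), running cost \(\frac{1}{2}\|u\|^2\), terminal cost \(-r\)):

\[
\dot{a}(t) = -\big(\nabla_x(b + \sigma u)\big)^\top a - (\nabla_x u)^\top u,
\qquad a(1) = \nabla_x\big({-r(X_1)}\big) = -\nabla r(X_1),
\]

where \((\nabla_x u)^\top u = \nabla_x(\frac{1}{2}\|u\|^2)\) is the gradient of the running cost.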

Remaining Style Issues

1. “Note the connection to the adjoint notes” (line 183)

“Note the connection” is a mild instance of the forbidden pattern. Rewrite: “The lean adjoint is precisely the adjoint ODE from the adjoint notes applied to the base dynamics \(\dot{x} = b(t,x)\) with terminal cost \(-r\).”

2. “This has two practical effects” (line 181)

The “(i)… and (ii)…” enumeration reads like a survey. Rewrite as prose: “No need to compute \(\nabla_x u\), which is expensive for neural networks. The lean adjoint also has smaller magnitude since the removed terms are large but cancel in expectation, reducing gradient variance.”

3. “The Adjoint Matching loss replaces…” (line 185)

Bold-naming a result by announcement. The existing notes would just write the formula and let the reader recognize it.

4. Missing \blue{} and \red{} annotations

The existing style uses \(\blue{\ldots}\) to highlight control terms and \(\green{\ldots}\) for generalizations. In adjoint_matching_v2:

  • Line 36: \(\blue{u(t, X^u_t)}\) is correctly blue. Good.
  • Line 178: the lean adjoint is the key result but has no color highlighting. Consider making \(\dot{\tilde{a}} = -\nabla_x b^\top \tilde{a}\) blue (or the removed terms red with strikethrough).
  • Line 102: \(\sigma(t) = \sqrt{2\eta_t}\) is the main formula of the section and has no color.

5. Missing <details> blocks

The zero-expectation argument (lines 157–173) is a side computation that interrupts the main flow. In the style of the existing notes, it should be in a <details><summary> block.

6. Line 196: survey-like closing paragraph

“This works because the memoryless schedule is the unique one preserving the velocity-score relationship…” reads like a paper abstract. The existing notes would either derive this claim or omit it.

7. TODO: missing historical figure portrait (lines 13–16)

All existing notes have a historical portrait. This one has a placeholder comment. Suggested: Pontryagin (used in adjoint.qmd) or Bellman (used in HJB.qmd). Since both are taken, perhaps Rudolf Kalman (1930–2016) for his work on optimal control, or keep blank and add later.

Mathematical Errors or Concerns

1. Flow matching schedule check (lines 110–114)

\(\alpha_t = t\), \(\beta_t = 1-t\). Computed: \(\eta_t = (1-t)[(1-t)/t + 1] = (1-t)/t\). So \(\sigma(t) = \sqrt{2(1-t)/t}\).

Check: \(\kappa_t = \dot\alpha_t/\alpha_t = 1/t\), \(\eta_t = \beta_t(\kappa_t \beta_t - \dot\beta_t) = (1-t)((1-t)/t - (-1)) = (1-t)((1-t)/t + 1) = (1-t) \cdot (1/t) = (1-t)/t\). Correct.
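The arithmetic also checks out symbolically; a throwaway sympy verification (variable names are mine):

```python
import sympy as sp

t = sp.symbols('t', positive=True)
alpha, beta = t, 1 - t                          # flow matching schedule
kappa = sp.diff(alpha, t) / alpha               # kappa_t = alpha_t' / alpha_t
eta = beta * (kappa * beta - sp.diff(beta, t))  # eta_t = beta_t (kappa_t beta_t - beta_t')
print(sp.simplify(eta - (1 - t) / t) == 0)      # eta_t = (1-t)/t, so sigma(t) = sqrt(2(1-t)/t)
```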

2. Lean adjoint: are the zero-expectation terms correctly identified? (lines 157–175)

Yes. The terms removed from the full adjoint are \(\sigma(\nabla_x u)^\top a + (\nabla_x u)^\top u = (\nabla_x u)^\top(\sigma a + u)\). At optimality, \(u^\star = -\sigma \E[a|X_t]\), so \(\sigma a + u^\star = \sigma(a - \E[a|X_t])\), which has conditional expectation zero. Hence \(\E[(\nabla_x u^\star)^\top \sigma(a - \E[a|X_t]) | X_t] = 0\) because \(\nabla_x u^\star\) is a function of \(X_t\). Correct.
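The tower-property step can be illustrated with a toy simulation (everything here is a stand-in, not the note's actual objects): any function of \(X_t\) is orthogonal in expectation to \(a - \E[a|X_t]\).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000
x = rng.standard_normal(n)             # stand-in for X_t
z = rng.standard_normal(n)             # independent noise driving the path after time t
a = x**2 + z                           # toy adjoint: E[a | X_t] = X_t**2
resid = a - x**2                       # a - E[a | X_t]
f = np.sin(x)                          # any function of X_t (stand-in for rows of grad_x u*)
print(abs((f * resid).mean()) < 0.01)  # E[f(X_t)(a - E[a|X_t])] = 0 by the tower property
```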

3. Value function bias: does the marginalization work? (lines 80–86)

Under independence, \(p^{\text{base}}(X_0, X_1) = p_0(X_0) p_1(X_1)\). With the corrected sign: \(p^\star(X_1) \propto p_1(X_1) \exp(r(X_1)) \int p_0(X_0) / h(0,X_0) \, dX_0 \propto p_1(X_1) \exp(r(X_1))\). The integral over \(X_0\) is a constant. Correct.
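The marginalization is easy to verify by self-normalized importance sampling; a toy 1-D sketch (the linear reward \(r(x) = cx\) is my choice, so the tilted marginal should be \(\normal(c, 1)\)):

```python
import numpy as np

rng = np.random.default_rng(2)
c, n = 0.5, 400_000
x0 = rng.standard_normal(n)         # p_0 = N(0, 1)
x1 = rng.standard_normal(n)         # p_1 = N(0, 1), independent of X_0
# under independence h(0, x0) = E[exp(c X_1)] is constant, so exp(-V(0, X_0))
# is absorbed into the normalizer; only the exp(r(X_1)) tilt survives
w = np.exp(c * x1)
mean_tilted = (w * x1).sum() / w.sum()
print(abs(mean_tilted - c) < 0.02)  # p*(X_1) ∝ p_1(X_1) e^{r(X_1)} = N(c, 1), mean c
```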

4. DDIM schedule claim (line 118)

“For DDIM with \(\alpha_t = \sqrt{\bar\alpha_t}\), \(\beta_t = \sqrt{1-\bar\alpha_t}\), the memoryless schedule gives \(\sigma(t) = \sqrt{\dot{\bar\alpha}_t / \bar\alpha_t}\), which is exactly the DDPM noise schedule.”

Check: \(\kappa_t = \dot\alpha/\alpha = \dot{\bar\alpha}/(2\bar\alpha)\). \(\eta_t = \beta(\kappa \beta - \dot\beta) = \sqrt{1-\bar\alpha}\big[\frac{\dot{\bar\alpha}}{2\bar\alpha}\sqrt{1-\bar\alpha} + \frac{\dot{\bar\alpha}}{2\sqrt{1-\bar\alpha}}\big] = \frac{\dot{\bar\alpha}}{2\bar\alpha}(1-\bar\alpha) + \frac{\dot{\bar\alpha}}{2} = \frac{\dot{\bar\alpha}}{2\bar\alpha}\). So \(2\eta_t = \dot{\bar\alpha}/\bar\alpha\) and \(\sigma = \sqrt{\dot{\bar\alpha}/\bar\alpha}\). This matches the paper’s Table 1. Correct.
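The same symbolic check works for DDIM, treating \(\bar\alpha_t\) and its time derivative as free symbols with the chain rule applied by hand (variable names are mine):

```python
import sympy as sp

a, da = sp.symbols('abar dabar', positive=True)  # \bar\alpha_t and \dot{\bar\alpha}_t
alpha, beta = sp.sqrt(a), sp.sqrt(1 - a)
alpha_dot = sp.diff(alpha, a) * da               # chain rule: d/dt f(abar) = f'(abar) * dabar
beta_dot = sp.diff(beta, a) * da
kappa = alpha_dot / alpha
eta = beta * (kappa * beta - beta_dot)
print(sp.simplify(eta - da / (2 * a)) == 0)      # 2 eta_t = dot{abar}/abar, so sigma = sqrt(dot{abar}/abar)
```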

5. ?@eq-adjoint-ode running cost term (line 136)

The note writes \(\underbrace{u^\top \nabla_x u}_{\nabla_x(\frac{1}{2}\|u\|^2)}\). Check: \(\nabla_x(\frac{1}{2}\|u(t,x)\|^2) = (\nabla_x u)^\top u\), a column vector in \(\bbR^{D \times 1}\), whereas the displayed \(u^\top \nabla_x u\) is a row vector in \(\bbR^{1 \times D}\). The adjoint ODE acts on column vectors (\(a \in \bbR^D\)), so the correct term is \((\nabla_x u)^\top u\). The brace annotation is right (by convention \(\nabla_x\) produces a column vector), but the displayed expression has the wrong shape. Minor, since the note treats all vectors implicitly as columns, but it could trip up a careful reader.
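The identity \(\nabla_x(\frac{1}{2}\|u\|^2) = (\nabla_x u)^\top u\) is easy to confirm numerically with a made-up vector field and central differences:

```python
import numpy as np

def u(x):                                   # arbitrary toy vector field u: R^3 -> R^3
    return np.array([x[0] * x[1], np.sin(x[2]), x[0] ** 2])

x, eps = np.array([0.3, -1.2, 0.7]), 1e-6
I = np.eye(3)
# Jacobian (grad_x u)_{ij} = du_i/dx_j by central differences
J = np.stack([(u(x + eps * e) - u(x - eps * e)) / (2 * eps) for e in I], axis=1)
# gradient of (1/2)||u||^2, a column vector
g = np.array([(np.sum(u(x + eps * e) ** 2) - np.sum(u(x - eps * e) ** 2)) / (4 * eps)
              for e in I])
print(np.allclose(g, J.T @ u(x), atol=1e-5))  # grad (1/2)||u||^2 = (grad u)^T u, not u^T grad u
```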

Minor Issues

  1. Line 93: sentence starting “The generative SDE ?@eq-base-sde is built…” This sentence is 3 lines long and quite dense. Consider splitting after “The unified drift from [@domingoenrich2024adjoint] is” to improve readability.

  2. Line 120: “The result of [@domingoenrich2024adjoint] is stronger than sufficiency” The claim that memoryless is the ONLY schedule guaranteeing convergence to ?@eq-tilted “for arbitrary reward functions” is stated but not proven or even sketched. Either add a one-line argument or flag it as a theorem from the paper.

  3. Line 122: orphaned short paragraph. “The memoryless schedule is only needed during fine-tuning…” is a 2-sentence paragraph that could be folded into the “After fine-tuning” section.

  4. Missing bibliography. The BibTeX entry for [@domingoenrich2024adjoint] is in an HTML comment at the bottom (lines 199–207) but likely not added to ref.bib. The citation will render as “?” unless the bib entry exists.

  5. No ## References section. The existing notes (e.g., girsanov.qmd) don’t have one either, so this may be fine if Quarto auto-generates it from the bibliography.

  6. Line 153: “Expanding the square and differentiating with respect to \(\theta\) recovers exactly the continuous adjoint gradient from ?@eq-adjoint-ode.” This is stated but not shown. A <details> block with the 3-line computation would strengthen the claim.

  7. Equation label #eq-adjoint-state missing. Line 131 defines \(a(t)\) with label #eq-adjoint-state but this label is never referenced in the text. Either reference it or drop the label.

  8. Line 118: “This gives retrospective justification for DDPM’s noise choice, which was previously just one heuristic among others.” This is an interpretation/opinion that reads slightly editorial. Consider: “The DDPM noise schedule is the memoryless schedule for DDIM.”

  9. Missing cross-reference to Schrodinger bridges. The memoryless condition \(X_0 \perp X_1\) is closely related to the Schrodinger bridge discussion (the independent coupling from BMS_v2). No cross-reference is made. Consider adding one at line 91.