# Tightness Audit: adjoint_matching.qmd
## Verdict
Mostly tight with 7 substantive issues: 2 sign convention inconsistencies with the cross-referenced notes, 1 algebraically incomplete derivation, 2 claims that are stated more strongly than what is shown, and 2 hidden assumptions that should be flagged.
## Derivation Gaps
Line 50–53: Sign mismatch between the note’s value function and the HJB notes. The note says \(u^\star(t,x) = \sigma_t^\top \nabla V(t,x)\) with \(V = \log h\) and \(h(t,x) = E[\exp(r(X_1)) \mid X_t = x]\). This is consistent with the HJB notes (eq-u-star-V), where the SOC is a maximization and \(V = \log h\). But the paper defines \(V(x,t) = \min_u J(u;x,t)\) (a cost to minimize, with \(g = -r\)), giving \(V_{\text{paper}} = -\log h\) and \(u^\star = -\sigma^\top \nabla V_{\text{paper}}\). The note then on line 141 writes \(V(t,x) = -J(u^\star; x, t)\), mixing the paper’s cost-functional \(J\) with the HJB notes’ reward-based \(V\). This works out (\(-J = \log h\) when \(f=0, g=-r\) in the paper’s convention), but the reader sees “\(V(t,x) = -J(u^\star; x, t)\)” and needs to mentally reconcile the paper’s \(J\) (which includes \(+g = -r\)) with the note’s \(V = \log h\). Fix: Add one sentence: “The paper defines \(V\) as cost-to-go (minimization); we use the HJB notes’ convention \(V = \log h\) (reward-to-go, maximization). The two are negatives of each other: \(V_{\text{here}} = -V_{\text{paper}}\).”
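The suggested sentence can even carry a one-line check (a sketch, in the paper's convention \(f = 0\), \(g = -r\)):

$$
V_{\text{paper}}(t,x) = \min_u\, \E\Big[\int_t^1 \tfrac12 \|u_s\|^2\, ds - r(X_1)\,\Big|\, X_t = x\Big] = -\log h(t,x) = -V_{\text{here}}(t,x),
$$

so \(u^\star = -\sigma_t^\top \nabla V_{\text{paper}} = \sigma_t^\top \nabla V_{\text{here}}\): both conventions produce the same optimal control, and \(-J(u^\star; x, t) = \log h(t,x)\), consistent with what the note writes on line 141.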
Line 56–62: The bivariate marginal formula ?@eq-radon-nikodym is asserted, not derived. The note says “Combining Ito’s formula on \(V(t,X_t)\) with the HJB equation and Girsanov” gives \(p^\star(X_0, X_1) \propto p^{\text{base}}(X_0, X_1) \exp(r(X_1) - V(0,X_0))\), and then in the following paragraph says “the path-level Radon-Nikodym derivative from Girsanov involves additional path-dependent terms (\(\int \|u^\star\|^2 ds\)), but after marginalizing over intermediate times, these reduce to a function of the endpoints only, via Ito’s formula and the HJB equation.” This is the right idea, but the actual calculation is not shown. A grad student would want to see: (a) write the Girsanov RN derivative \(d\bbP^{u^\star}/d\bbP\) along the full path, (b) apply Ito to \(V(t,X_t^{u^\star})\) using the HJB equation to simplify \(\int_0^1 \frac{1}{2}\|u^\star\|^2 ds\) into endpoint terms, (c) marginalize. The HJB notes do carry out a closely related calculation (the Ito + HJB step leading to eq-logZ-pathwise), so the cross-reference is legitimate. But the note should either show this three-line calculation or say explicitly: “The HJB notes derive \(V(t_2, X_{t_2}) - V(t_1, X_{t_1}) = \int \frac{1}{2}\|u^\star\|^2 ds + \text{martingale}\) (final displayed equation). Plugging this into the Girsanov RN derivative and taking expectations kills the martingale, yielding ?@eq-radon-nikodym.” Fix: Replace the hand-wavy paragraph (line 62) with the explicit three-step argument referencing the specific equation in HJB.qmd.
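For reference, one way the missing calculation could run, in the HJB notes' conventions (\(V = \log h\), terminal condition \(V(1,\cdot) = r\), \(u^\star = \sigma_t^\top \nabla V\)); this is a sketch, not the note's own derivation. Girsanov gives, with \(\tilde B\) a \(\bbP^{u^\star}\)-Brownian motion,

$$
\log \frac{d\bbP^{u^\star}}{d\bbP} = \int_0^1 (u^\star)^\top d\tilde B + \frac{1}{2}\int_0^1 \|u^\star\|^2\, dt.
$$

Ito's formula on \(V(t, X_t)\) along the controlled path, simplified with the HJB equation \(\partial_t V + b^\top \nabla V + \tfrac12 \mathrm{tr}(\sigma\sigma^\top \nabla^2 V) + \tfrac12\|\sigma^\top \nabla V\|^2 = 0\), gives

$$
dV(t, X_t) = \frac{1}{2}\|u^\star\|^2\, dt + (u^\star)^\top d\tilde B,
$$

so the Girsanov exponent telescopes to an endpoint difference:

$$
\log \frac{d\bbP^{u^\star}}{d\bbP} = V(1, X_1) - V(0, X_0) = r(X_1) - V(0, X_0).
$$

The path-level RN derivative is therefore already a function of the endpoints alone, and marginalizing over intermediate times is immediate, yielding ?@eq-radon-nikodym.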
Line 77: The example \(\sigma_t = 0\) is slightly misleading. The note says “Adding a control \(u\) in ?@eq-controlled-sde has zero effect since \(\sigma = 0\) kills both the noise and the control term (the control enters as \(\sigma_t u\), so it vanishes with \(\sigma_t\)).” Careful: the controlled drift is \(b + \sigma_t u\), so the control term is \(\sigma_t u \, dt\). When \(\sigma_t = 0\), this is zero. But the cost functional \(J(u) = E[\frac{1}{2}\|u\|^2 - r(X_1)]\) is still well-defined; any nonzero \(u\) increases cost without affecting the trajectory. So the minimizer is \(u = 0\) and nothing changes. The logic is correct, but “there is nothing to optimize” is misleading: there is an optimization problem, it just has a trivial solution. Fix: Minor wording; say “the optimal control is trivially \(u=0\) since any nonzero \(u\) adds cost without affecting the trajectory.”
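The trivial-minimizer claim is easy to check numerically. Below is a minimal 1-D sketch; all names and the choices \(b(x) = -x\), \(r(x) = -x^2\) are made up for illustration and are not from the note:

```python
# With sigma_t = 0 the control never enters the dynamics, so any nonzero u
# only adds running cost: the optimal control is trivially u = 0.

def simulate(u, n=1000, x0=0.3):
    """Euler-integrate dX = b(X) dt with b(x) = -x and sigma_t = 0.
    Returns (X_1, running cost 0.5 * int |u|^2 dt)."""
    dt = 1.0 / n
    x, cost = x0, 0.0
    for k in range(n):
        t = k * dt
        cost += 0.5 * u(t, x) ** 2 * dt
        x += -x * dt  # the control term sigma_t * u dt is identically zero
    return x, cost

def J(u, r=lambda x: -x * x):
    """Cost J(u) = 0.5 int |u|^2 dt - r(X_1) (deterministic since sigma = 0)."""
    x1, cost = simulate(u)
    return cost - r(x1)

# The trajectory, hence r(X_1), is the same for every u; only the cost differs.
assert simulate(lambda t, x: 0.0)[0] == simulate(lambda t, x: 1.0)[0]
assert J(lambda t, x: 0.0) < J(lambda t, x: 1.0)
```

The point the fix should make is visible in the asserts: the optimization problem exists and is well-posed, but its solution is trivially \(u = 0\).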
Line 101: The memoryless schedule derivation defers too much. The note says the paper “proves this by analyzing the time-reversed SDE and showing that the coefficient of \(X_0\) in the explicit solution vanishes if and only if \(\sigma_t^2 \geq 2\eta_t\).” This is accurate, but the reader has no way to verify the claim \(\sigma_t = \sqrt{2\eta_t}\) without reading 150 lines of the paper’s Appendix 12.1. Since the note derives everything else from scratch, this is a conspicuous gap. The note could at least sketch the idea: the time-reversed SDE for \(\vec{X}_t := X_{1-t}\) has an explicit solution (a linear SDE in disguise). The coefficient of \(\vec{X}_0 = X_1\) in the solution for \(\vec{X}_1 = X_0\) involves \(\exp(-\int \sigma^2/(2\beta^2)\, ds)\). For this to vanish (making \(X_0\) independent of \(X_1\)), one needs \(\int \sigma^2/(2\beta^2)\, ds = \infty\) near \(t=0\), which holds iff \(\sigma^2 \geq 2\eta_t\). Fix: Add a 3–4 line sketch of the mechanism inside a `<details>` block, similar to the flow matching check already present.
Line 141: “\(\E[a(t) \mid X_t = x] = \nabla_x J(u; x, t)\)” needs more care. The note says this interchange of gradient and conditional expectation “holds under standard regularity conditions on \(r\).” This is an important step and the conditions are not standard; they require: (i) the adjoint ODE has a unique solution along the SDE trajectory (pathwise, with probability one), (ii) the gradient \(\nabla_{X_t}\) of the future cost exists and equals \(a(t)\) (this is the content of the adjoint method, which is derived for ODEs in adjoint.qmd but only heuristically extended to SDEs), and (iii) the conditional expectation of the pathwise gradient equals the gradient of the conditional expectation. The adjoint notes (lines 200–210) explicitly say “informally, one can apply the same reasoning” for SDEs with additive noise. The note should acknowledge that this is the heuristic SDE extension, not a rigorous result. Fix: Replace “which holds under standard regularity conditions on \(r\)” with “heuristically, following the SDE extension described in the adjoint notes.”
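Schematically, the chain the note asserts looks like the following (a sketch; \(\Phi := \int_t^1 \frac12\|u_s\|^2\, ds - r(X_1)\) denotes the pathwise future cost, an assumption about the note's setup with \(f = \frac12\|u\|^2\), \(g = -r\)):

$$
\nabla_x J(u; x, t) = \nabla_x\, \E[\Phi \mid X_t = x] \overset{(\ast)}{=} \E[\nabla_{X_t} \Phi \mid X_t = x] = \E[a(t) \mid X_t = x],
$$

where the last equality is the adjoint-method identity \(a(t) = \nabla_{X_t}\Phi\) (conditions (i)–(ii) above) and \((\ast)\) is the gradient–expectation interchange (condition (iii)).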
Lines 159–183: The “removing \(u\)-dependent terms” argument has a gap in the logic. The argument goes: at optimality, \(u^\star = -\sigma_t^\top E[a(t) \mid X_t]\). Right-multiply both sides by \(\nabla_x u^\star\). Rearrange to get ?@eq-zero-terms. The issue: the right-multiplication step (line 170–173) is done before taking conditional expectations. Both sides are evaluated at a fixed \(x\), so \(u^\star(t,x)\) and \(\nabla_x u^\star(t,x)\) are deterministic given \(X_t = x\). The step from line 173 to line 179 wraps everything in \(E[\cdot \mid X_t = x]\). Since \(u^\star(t,x)\) and \(\nabla_x u^\star(t,x)\) are deterministic functions of \(x\), they can be pulled out of the conditional expectation, and the conclusion follows. This is correct but the presentation makes it look like you are taking a conditional expectation of something that was derived without conditioning. A grad student will pause here. Fix: Add one clarifying sentence: “Since \(u^\star(t,x)\) and \(\nabla_x u^\star(t,x)\) are deterministic functions of \((t,x)\), they can be pulled inside or outside \(E[\cdot \mid X_t = x]\) freely.”
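Concretely, the tower-property bookkeeping the suggested sentence licenses looks like this (schematic only; the exact shape of ?@eq-zero-terms may differ):

$$
\E\big[\nabla_x u^\star(t,x)^\top u^\star(t,x) \,\big|\, X_t = x\big]
= \nabla_x u^\star(t,x)^\top u^\star(t,x)
= -\nabla_x u^\star(t,x)^\top \sigma_t^\top\, \E[a(t) \mid X_t = x]
= \E\big[-\nabla_x u^\star(t,x)^\top \sigma_t^\top a(t) \,\big|\, X_t = x\big],
$$

where the first and last equalities use only that \(u^\star(t,x)\) and \(\nabla_x u^\star(t,x)\) are deterministic given \(X_t = x\), and the middle one is the optimality relation.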
Line 201: The uniqueness argument for the critical point of \(E[L_{\text{AM}}]\) is compressed to the point of being hard to verify. The note claims: (a) the gradient of \(L_{\text{AM}}\) at \(u = u^\star\) equals zero because the removed terms have zero conditional expectation, and (b) uniqueness of the critical point follows from the uniqueness of the critical point of \(L_{\text{basic}}\), proven in Appendix 13.3 of the paper. Step (a) is shown. Step (b) is the non-trivial part. The actual argument (from the paper’s Appendix 13.3) is: if \(\hat{u}\) is a critical point of \(E[L_{\text{AM}}]\), then \(\hat{u} = -\sigma^\top E[\tilde{a} \mid X_t]\). At a critical point, the canceled terms (?@eq-zero-terms applied to \(\hat{u}\)) are zero, which means \(E[\tilde{a} \mid X_t] = E[a(\cdot, \hat{u}) \mid X_t]\) (they satisfy the same integral equation, which has a unique solution). So \(\hat{u}\) is also a critical point of \(E[L_{\text{basic}}]\), which has a unique critical point \(u^\star\). The note says “by the uniqueness of the critical point of \(E[L_{\text{basic}}]\) (proven in the adjoint notes and Appendix 13.3).” But the adjoint notes do NOT prove uniqueness of the critical point of \(L_{\text{basic}}\); they derive the adjoint ODE. Uniqueness of the critical point requires showing that \(u = -\sigma^\top E[a(\cdot, u) \mid X_t]\) has a unique fixed point, which is proven in Appendix 13.3 of the paper, not in the adjoint notes. Fix: Remove the reference to “the adjoint notes” for the uniqueness claim. Keep only the paper reference.
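Laid out as a chain (a compression of the paper's Appendix 13.3 argument as summarized above, using the note's symbols):

$$
\nabla_u\, \E[L_{\text{AM}}](\hat u) = 0
\;\Longrightarrow\; \hat u = -\sigma^\top \E[\tilde a \mid X_t]
\;\Longrightarrow\; \E[\tilde a \mid X_t] = \E[a(\cdot, \hat u) \mid X_t]
\;\Longrightarrow\; \nabla_u\, \E[L_{\text{basic}}](\hat u) = 0
\;\Longrightarrow\; \hat u = u^\star.
$$

The third implication (the two conditional expectations satisfy the same integral equation, which has a unique solution) and the last one (uniqueness for \(L_{\text{basic}}\)) are exactly the steps that live in Appendix 13.3, not in the adjoint notes.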
## Carpet Sweeping
Lines 95–101: The unified drift formula ?@eq-unified-drift appears from nowhere. The note says “The unified drift from [paper]” and writes down \(b(x,t) = \kappa_t x + (\sigma_t^2/2 + \eta_t) \nabla \log p_t(x)\) without any derivation or intuition. The existing notes (HJB, Girsanov) work with a generic drift \(b(x)\), not this specific form. A reader encountering this for the first time has no basis to trust it. Fix: Add one sentence of context: “This drift is constructed so that the SDE has the same time marginals as the reference interpolation; the \(\kappa_t x\) term handles the deterministic scaling and the score term corrects for the noise.”
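The suggested sentence can be backed by a short Fokker-Planck argument (a sketch; the identification of the reference-flow drift as \(v = \kappa_t x + \eta_t \nabla\log p_t\) is a guess for illustration, not something the note or the existing notes state). If a probability-flow ODE \(\dot x = v(x,t)\) transports the marginals \(p_t\), then for any \(\sigma_t\) the SDE with drift \(v + \frac{\sigma_t^2}{2}\nabla\log p_t\) and diffusion \(\sigma_t\) has the same marginals, since \(p_t \nabla\log p_t = \nabla p_t\) gives

$$
-\nabla \cdot \Big(\big(v + \tfrac{\sigma_t^2}{2}\nabla\log p_t\big)\, p_t\Big) + \tfrac{\sigma_t^2}{2} \Delta p_t
= -\nabla \cdot (v\, p_t) = \partial_t p_t.
$$

With the hypothesized \(v = \kappa_t x + \eta_t \nabla \log p_t\), this reproduces \(b = \kappa_t x + (\sigma_t^2/2 + \eta_t)\nabla\log p_t\).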
Line 122: “The memoryless schedule is the unique choice that preserves the velocity-score relationship.” This is a claim from Theorem 2 of the paper (the “fine-tuning recipe” theorem). The note states it as fact without indicating it requires a separate proof. This is carpet-sweeping: the necessity direction of the memoryless condition is harder than the sufficiency direction and involves HJB-based arguments about the score-velocity conversion (Appendix 12.2 of the paper). Fix: Flag that this is a deeper result: “The paper proves (Theorem 2) that within the family of SDEs sharing marginals with the reference flow, the memoryless schedule is not just sufficient but necessary for the fine-tuned velocity to be convertible to arbitrary noise schedules.”
## Cross-Reference Verification
| Ref in note | Target | Verdict |
|---|---|---|
| Girsanov theorem for \(\kl(\bbP^u \| \bbP)\) formula (line 39) | girsanov.qmd derives \(\kl(\bbP, \bbP^u) = \frac{1}{2} E[\int \|u\|^2 dt]\) | Direction mismatch: the note writes \(\kl(\bbP^u \| \bbP)\) but the Girsanov notes derive \(\kl(\bbP, \bbP^u)\). The note acknowledges this parenthetically (“the Girsanov notes derive the analogous formula in the other direction”). For the control-affine case with the same initial condition, both directions reduce to a formula of the same form, \(\frac{1}{2} E[\int \|u\|^2 dt]\), because the cross-term \(\int u^\top dW\) has zero expectation under the relevant measure; note, however, that the expectation is taken under the first argument of the KL in each case, so the two values need not coincide for a feedback control. The note should either state this or cite the direction that matches. |
| HJB notes for \(u^\star = \sigma_t^\top \nabla V\) (line 50) | HJB.qmd eq-u-star-V: \(u^\star = \sigma^\top \nabla V\) | Correct |
| HJB notes for SOC with \(f=0\), \(g=r\) (line 45) | HJB.qmd variational formulation section | Correct, though HJB notes use general \(f,g\); specializing to \(f=0\), \(g=r\) is immediate |
| Adjoint notes for adjoint ODE (lines 129, 135) | adjoint.qmd eq-adjoint-ode: \(\dot{\lambda} = -\nabla_x f - b_x^\top \lambda\), \(\lambda(T) = \nabla_x g\) | Consistent but requires translation: adjoint.qmd uses \(\lambda\) where the note uses \(a\), and adjoint.qmd’s ODE is for the base drift \(b(t,\theta,x)\) only. The note’s full adjoint ?@eq-adjoint-ode adds terms for the controlled drift \(b + \sigma_t u\) and controlled running cost \(\frac{1}{2}\|u\|^2\). This is a standard extension but it is not derived in adjoint.qmd. The note should say “applying the adjoint ODE from the adjoint notes to the controlled system.” |
| Adjoint notes for lean adjoint (line 193) | adjoint.qmd eq-adjoint-ode with \(f = 0\) and only base drift | Correct: setting \(\nabla_x f = 0\) and using base drift \(b\) in adjoint.qmd’s formula gives \(\dot{\lambda} = -b_x^\top \lambda\), \(\lambda(T) = \nabla_x g\), which matches ?@eq-lean-adjoint with \(g = -r\). |
| Adjoint notes for uniqueness of critical point (line 201) | adjoint.qmd does not prove any uniqueness result | Broken reference: See Derivation Gap #7. |
## Grad Student Sticking Points
Line 39: KL direction. The note writes \(\kl(\bbP^u \| \bbP)\) but the Girsanov notes derive \(\kl(\bbP, \bbP^u)\). A student would spend 20 minutes figuring out whether these agree: both reduce to the same formula \(\frac{1}{2} E[\int \|u\|^2 dt]\), because \(\int u^\top dW\) has zero expectation under the relevant measure by the martingale property, though the expectation is taken under the first argument of the KL in each case.
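For the student's benefit, the two computations side by side (a sketch; \(B\) denotes a \(\bbP\)-Brownian motion and \(\tilde B = B - \int_0^\cdot u\, ds\) the corresponding \(\bbP^u\)-Brownian motion). Girsanov gives \(\log\frac{d\bbP^u}{d\bbP} = \int_0^1 u^\top dB - \frac12\int_0^1\|u\|^2\, dt = \int_0^1 u^\top d\tilde B + \frac12\int_0^1\|u\|^2\, dt\), so

$$
\kl(\bbP^u \,\|\, \bbP) = \E_{\bbP^u}\Big[\int_0^1 u^\top d\tilde B + \tfrac12 \int_0^1 \|u\|^2\, dt\Big] = \tfrac12\, \E_{\bbP^u}\Big[\int_0^1 \|u\|^2\, dt\Big],
$$

$$
\kl(\bbP \,\|\, \bbP^u) = \E_{\bbP}\Big[-\int_0^1 u^\top dB + \tfrac12 \int_0^1 \|u\|^2\, dt\Big] = \tfrac12\, \E_{\bbP}\Big[\int_0^1 \|u\|^2\, dt\Big],
$$

since each stochastic integral is a mean-zero martingale under the measure the expectation is taken with respect to.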
Lines 56–62: The bivariate marginal. The key equation of the “value function bias” section appears without derivation. A student who wants to verify it needs to combine Girsanov’s formula with Ito applied to \(V(t, X_t)\) along the optimal path, use the HJB equation to simplify, then marginalize. The note says to combine three things but does not show the combination.
Lines 95–101: The unified drift and the memoryless condition. A student encountering \(\kappa_t\), \(\eta_t\), and the unified drift for the first time has no intuition for why these quantities arise or what they mean geometrically. The note provides the formula but not the “why.”
Line 141: \(E[a(t) \mid X_t] = \nabla_x J\). This is the central identity connecting the pathwise adjoint to the value function gradient. A student would want to see: (a) \(a(t)\) is the pathwise gradient of future cost w.r.t. \(X_t\), (b) taking conditional expectation averages over future randomness, giving the gradient of the expected future cost, which is \(\nabla_x J\). The note states this but the justification (“assuming the interchange is valid”) is thin.
Line 201: The uniqueness argument. The compressed paragraph starting “To see why:” is dense. A student would need to unpack: (a) the gradient of the lean loss at \(u^\star\), (b) why adding zero-expectation terms does not change the gradient, (c) why the critical point is unique. Three non-trivial facts in one paragraph.
## What the Note Gets Right
The value function bias section (lines 48–77) is pedagogically excellent. The core insight – that the SOC solution tilts the joint \((X_0, X_1)\) distribution, not just the marginal – is explained clearly. The deterministic flow example (\(\sigma = 0\)) is a perfect illustration.
The memoryless schedule section (lines 80–124) is well-structured. The logic “if \(X_0 \perp X_1\) then the bias disappears” into “what makes them independent?” into “the memoryless schedule” flows naturally. The explicit flow matching check in the `<details>` block is a nice touch.
The lean adjoint motivation (lines 157–203) correctly captures the key idea. The argument “the \(u\)-dependent terms cancel in conditional expectation at optimality, so remove them” is the core insight of the paper, and the note presents it cleanly.
Notation is unified and consistent with the existing notes. The note uses \(\bbP^u\), \(\kl\), \(\sigma_t\), colored annotations, and equation labeling consistently.
The note is honest about what it does not prove. It defers the memoryless proof to the paper, flags the SDE extension as heuristic (partially), and correctly identifies the lean adjoint as producing different gradients away from optimality. This is better than many expositions that would just say “it can be shown.”
The distinction between basic and lean Adjoint Matching is clear. The note correctly explains that the basic version gives the same gradients as the continuous adjoint method, while the lean version is genuinely different (and better in practice).