Tightness Audit: adjoint_sampling.qmd
Verdict
Strong pedagogical note with correct high-level structure, but several derivation gaps that range from unstated assumptions to an algebraic step that needs more justification. Two broken cross-reference links. The corrector matching section is the weakest: it compresses multiple non-trivial steps into single sentences.
Derivation Gaps
Line 27: Girsanov KL direction. The note claims \(\kl(\bbP^u \| \bbP) = \frac{1}{2} \E \int_0^1 \|u\|^2 \, dt\) and cites the Girsanov notes. The Girsanov notes (girsanov.qmd, around line 109) derive \(\kl(\bbP, \bbP^u) = \frac{1}{2} \E[\int_0^T \|u\|^2 \, dt]\) with expectation under \(\bbP\), not \(\bbP^u\). Both equalities are true (the KL in both directions equals \(\frac{1}{2}\E_{\text{respective measure}}[\int \|u\|^2]\)) but the one with \(\E_{\bbP^u}\) is slightly less immediate from Girsanov alone; it requires substituting the Girsanov weight and simplifying. The note should either (a) note that the expectation is under \(\bbP^u\) and state why, or (b) write it under \(\bbP\) to match the cited source. As written, a careful reader following the citation will see a different formula.
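The missing step the note should supply can be sketched in two lines (using the note's notation, with \(W^u\) the \(\bbP^u\)-Brownian motion):

```latex
\log \frac{d\bbP^u}{d\bbP}
  = \int_0^1 u \cdot dW_t - \frac{1}{2} \int_0^1 \|u\|^2 \, dt
  = \int_0^1 u \cdot dW^u_t + \frac{1}{2} \int_0^1 \|u\|^2 \, dt ,
```

where the second equality substitutes \(dW_t = dW^u_t + u \, dt\). Taking \(\E_{\bbP^u}\) kills the stochastic integral (a \(\bbP^u\)-martingale with zero mean), leaving \(\kl(\bbP^u \| \bbP) = \frac{1}{2} \E_{\bbP^u} \int_0^1 \|u\|^2 \, dt\).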
Line 31: “At optimality the controlled path measure equals the Schrodinger bridge: \(\bbP^{u^\star}(\boldsymbol{X}) = \bbP(\boldsymbol{X} \mid X_1) \, \pi(X_1)\).” This is stated as a consequence of the KL formulation but never derived. The claim is that \(\kl(\bbP^u \| \bbP^\star) = 0\) implies this factorization. It requires showing that \(\bbP^\star(\boldsymbol{X}) = \bbP(\boldsymbol{X} \mid X_1) \pi(X_1)\), which itself requires the \(X_0 = 0\) (Dirac) assumption to get the SB with these marginals. The parenthetical “(this follows from the KL formulation… see the HJB notes)” is insufficient; the HJB notes do not contain this specific SB factorization. The SB notes (shrodinger.qmd) contain the general factorization but not specialized to this case. Suggested fix: Add a 2-3 line argument: the SB minimizing \(\kl(\bbQ \| \bbP)\) with \(\bbQ_0 = \delta_0\) and \(\bbQ_1 = \pi\) has \(d\bbQ/d\bbP \propto \varphi_1(X_1)\) (no initial tilt since \(\delta_0\) is already the base initial), so \(\bbQ = \bbP(\cdot \mid X_1) \pi(X_1)\).
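The suggested 2–3 line argument could read as follows (a sketch, assuming the Dirac initial condition \(\bbQ_0 = \delta_0\) and writing \(p_1^\bbP\) for the base terminal marginal):

```latex
\bbQ(\boldsymbol{X})
  = \bbP(\boldsymbol{X}) \, \frac{\varphi_1(X_1)}{Z}
  = \bbP(\boldsymbol{X} \mid X_1) \, \frac{p_1^\bbP(X_1) \, \varphi_1(X_1)}{Z},
\qquad
\bbQ_1 = \pi \;\Longrightarrow\; \frac{p_1^\bbP \, \varphi_1}{Z} = \pi ,
```

which gives \(\bbQ(\boldsymbol{X}) = \bbP(\boldsymbol{X} \mid X_1) \, \pi(X_1)\) directly.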
Line 58: Intuition for constant lean adjoint. The note says “Holding the future noise realization fixed and perturbing \(X_t\) by \(\delta X_t\), the terminal state shifts by the same amount: \(X_1 \mapsto X_1 + \delta X_t\).” This is correct for \(b = 0\) base dynamics, but the lean adjoint is evaluated along controlled trajectories (where \(b + \sigma_t u \neq 0\)). The note does acknowledge this (“the lean adjoint is defined using the base dynamics, not the controlled dynamics”) but the physical intuition paragraph conflates the two. For the controlled process, \(X_1\) does depend nonlinearly on \(X_t\), and it is the lean adjoint that is constant, not the full adjoint. The paragraph is pedagogically fine but could mislead a reader into thinking \(\nabla_{X_t} g(X_1) = \nabla g(X_1)\) holds generally along controlled trajectories. It does not; it holds along base trajectories.
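The base-vs-controlled distinction is easy to check numerically. The sketch below (hypothetical, not from the note) propagates a perturbation of \(X_t\) to \(X_1\) along a fixed future noise realization under an Euler scheme: with zero drift the sensitivity is exactly 1, while a nonzero state-dependent drift (an illustrative OU drift here) damps it.

```python
import numpy as np

def terminal_state(x_t, noise, drift, dt):
    """Propagate x_t to x_1 under a fixed future noise realization (Euler scheme)."""
    x = x_t
    for z in noise:
        x = x + drift(x) * dt + z
    return x

rng = np.random.default_rng(0)
n_steps = 100
dt = 1.0 / n_steps
noise = rng.normal(scale=np.sqrt(dt), size=n_steps)  # fixed noise path

eps = 1e-6
for name, drift in [("base (b = 0)", lambda x: 0.0),
                    ("OU drift (b = -x)", lambda x: -x)]:
    x1 = terminal_state(0.0, noise, drift, dt)
    x1_pert = terminal_state(eps, noise, drift, dt)
    sens = (x1_pert - x1) / eps  # dX_1 / dX_t along the fixed noise path
    print(f"{name}: sensitivity = {sens:.4f}")
# prints sensitivity ≈ 1.0000 for b = 0 and ≈ 0.3660 ≈ (1 - dt)^100 for the OU drift
```

This is exactly the audit's point: \(\nabla_{X_t} g(X_1) = \nabla g(X_1)\) holds along base trajectories (sensitivity 1) but not along trajectories with nonzero drift.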
Lines 107–131: The projection never increases the SOC cost (details block). The decomposition \(J(u) = \kl(\bbP^u \| \text{SB}_{p^u_1}) + \kl(p^u_1 \| \pi)\) is correct and clean. However, the claim “driving the path-space KL to zero” assumes the SB with terminal marginal \(p^u_1\) exists and is achievable. This requires \(p^u_1\) to be absolutely continuous with respect to \(p_1^\text{base}\), which is silently assumed. For typical neural network controls this is fine, but it is an assumption.
Lines 159–161: SB Radon-Nikodym derivative. The note writes \(d\bbQ^\star / d\bbP \propto \widehat{\varphi}_0(X_0) \varphi_1(X_1)\) and defines \(\varphi_t(x) = \E_\bbP[\varphi_1(X_1) \mid X_t = x]\) and \(\widehat{\varphi}_t(x) = \E_\bbP[\widehat{\varphi}_0(X_0) \mid X_t = x]\). This is the standard Schrodinger bridge factorization, but the note does not verify it against the SB notes. The SB notes (shrodinger.qmd, lines 112–133) define the potentials differently: \(\varphi_t(x) = \E[g(X_1) \mid X_t = x]\) (harmonic extension of terminal potential \(g\)). The two definitions are consistent (\(\varphi_1\) in the SB notes is \(g\), the terminal potential), but the notation clash between “\(g\)” in the SB notes (endpoint potential) and “\(g\)” in this note (terminal cost) could confuse a reader who checks the cross-reference.
Line 163: Matching the SOC joint to the SB joint. The note states: “Matching the terminal tilt requires \(e^{-g(X_1)} \propto \varphi_1(X_1)\).” This identification is done by comparing \(\bbP(X_0, X_1) e^{-g(X_1) + V_0(X_0)}\) with \(\bbP(X_0, X_1) \widehat{\varphi}_0(X_0) \varphi_1(X_1)\). Matching the terminal parts requires \(e^{-g} \propto \varphi_1\); matching the initial parts requires \(e^{V_0} \propto \widehat{\varphi}_0\). The note never states or verifies the second requirement, and in fact it fails in general: with \(g = -\log \varphi_1 + C\), the value function is \(V_0 = \log \E[\varphi_1(X_1) \mid X_0] + C' = \log \varphi_0 + C'\), so \(e^{V_0} \propto \varphi_0\), not \(\widehat{\varphi}_0\), and \(\varphi_0 \propto \widehat{\varphi}_0\) is not generally true. The resolution is that the SB and SOC problems are not being “matched” in the sense of having identical joint distributions; rather, the SOC with the modified terminal cost produces the SB forward drift \(\sigma_t^2 \nabla \log \varphi_t\), and the initial distribution is separately handled by the SB constraint \(\bbQ_0 = \mu\), which modifies \(q_0\) in the SOC. The note’s argument at line 163 is hand-wavy on this point.
Suggested fix: Clarify that the SOC problem here optimizes both \(u\) and \(q_0\), not just \(u\), and that \(\widehat{\varphi}_0\) enters through the initial distribution constraint.
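The computation behind the mismatch can be made explicit (a sketch in the note's notation, using \(g = -\log \varphi_1 + C\)):

```latex
V_0(x)
  = \log \E_\bbP\!\left[ e^{-g(X_1)} \,\middle|\, X_0 = x \right]
  = \log \E_\bbP\!\left[ \varphi_1(X_1) \,\middle|\, X_0 = x \right] + C'
  = \log \varphi_0(x) + C' ,
```

so the SOC initial tilt is \(\varphi_0\), not \(\widehat{\varphi}_0\); the backward potential \(\widehat{\varphi}_0\) can only enter through the optimized initial distribution.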
Lines 169–177: Debiasing verification (?@eq-debias). The second step claims \(\int \bbP(X_1 \mid X_0) \widehat{\varphi}_0(X_0) \mu(X_0) dX_0 = \widehat{\varphi}_1(X_1) p_1^\bbP(X_1)\). The parenthetical says this uses “the definition of \(\widehat{\varphi}_1\) combined with Bayes’ rule,” but it is really just the definition, and only under a hidden assumption. By definition, \(\widehat{\varphi}_1(x) = \E_\bbP[\widehat{\varphi}_0(X_0) \mid X_1 = x]\), where \(X_0\) is drawn from the base initial distribution. Hence \(\widehat{\varphi}_1(x) p_1^\bbP(x) = \int \bbP_{1|0}(x|y) \widehat{\varphi}_0(y) \mu(y) dy\) holds exactly when the base process starts from \(\mu\); if the base initial distribution differs from \(\mu\), the integral is not \(\widehat{\varphi}_1(x) p_1^\bbP(x)\). The note silently assumes that the base process starts from \(\mu\), which is the SB setup but was never stated. For the Dirac case (\(\mu = \delta_0\), \(X_0 = 0\)), the base process trivially starts from \(\mu\). For the general prior case, this needs to be stated: the base \(\bbP\) must have \(X_0 \sim \mu\).
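Under that assumption the step is pure definition; a sketch:

```latex
\widehat{\varphi}_1(x) \, p_1^\bbP(x)
  = \int \bbP_{0,1}(y, x) \, \widehat{\varphi}_0(y) \, dy
  = \int \bbP_{1|0}(x \mid y) \, \mu(y) \, \widehat{\varphi}_0(y) \, dy ,
```

where the second equality uses \(\bbP_{0,1}(y, x) = \mu(y) \, \bbP_{1|0}(x \mid y)\), valid only when the base initial law is \(\mu\).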
Lines 188–204: Corrector matching derivation. The Bayes’ rule step \(\bbP_{0|t}(y|x) = \bbP_{t|0}(x|y) \mu(y) / p_t^\bbP(x)\) at line 188 uses the base process with \(X_0 \sim \mu\). This is consistent with the assumption above but makes it even more important to state it. The derivation from ?@eq-phi-integral to ?@eq-log-grad-product is correct but compressed. Taking \(\nabla_x \log\) of both sides of ?@eq-phi-integral: the left side is \(\nabla_x \log[\widehat{\varphi}_t \cdot p_t^\bbP]\) and the right side is \(\nabla_x \log[\int \bbP_{t|0}(x|y) \tilde{\varphi}_0(y) dy]\). The chain rule gives \(\frac{\int \nabla_x \bbP_{t|0} \tilde{\varphi}_0 dy}{\int \bbP_{t|0} \tilde{\varphi}_0 dy}\). The denominator equals \(\widehat{\varphi}_t p_t^\bbP\) by ?@eq-phi-integral. This checks out.
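The compressed chain-rule step, written out as one display (a sketch, with \(\tilde{\varphi}_0\) as in the note):

```latex
\nabla_x \log \int \bbP_{t|0}(x \mid y) \, \tilde{\varphi}_0(y) \, dy
  = \frac{\int \nabla_x \bbP_{t|0}(x \mid y) \, \tilde{\varphi}_0(y) \, dy}
         {\widehat{\varphi}_t(x) \, p_t^\bbP(x)}
  = \int \frac{\bbP_{t|0}(x \mid y) \, \tilde{\varphi}_0(y)}{\widehat{\varphi}_t(x) \, p_t^\bbP(x)}
    \, \nabla_x \log \bbP_{t|0}(x \mid y) \, dy ,
```

using \(\nabla_x \bbP_{t|0} = \bbP_{t|0} \nabla_x \log \bbP_{t|0}\) in the numerator.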
Lines 224–228: Corrector matching loss ?@eq-cm. The CM loss is stated as minimizing over pairs \((X_0, X_1) \sim p^\star_{0,1}\), but then the text says “In practice, the expectation is over the current model’s endpoint pairs \((X_0, X_1) \sim p^{u^{(k)}}_{0,1}\), not the unknown SB joint \(p^\star_{0,1}\).” This is a significant gap. The loss ?@eq-cm with \(p^\star_{0,1}\) has \(\nabla \log[\widehat{\varphi}_1 \cdot p_1^\bbP]\) as its minimizer. Replacing \(p^\star_{0,1}\) with \(p^{u^{(k)}}_{0,1}\) changes the minimizer, and the note does not justify why the alternation still converges. The IPFP argument (lines 244–246) is invoked but not worked through. Suggested fix: State that each CM step finds the minimizer under the current model’s samples, which gives the best approximation to \(\nabla \log \widehat{\varphi}_1\) at the current iterate, and that alternation converges by the IPFP contraction argument from the cited references.
Line 182: “Since \(b = 0\), the lean adjoint is still constant.” This is correct for the base dynamics, but the modified terminal cost ?@eq-modified-terminal changes \(g\), not \(b\). The lean adjoint terminal condition is \(\tilde{a}(1) = \nabla g(X_1) = \nabla E(X_1) + \nabla \log \widehat{\varphi}_1(X_1)\). Since \(\dot{\tilde{a}} = 0\) (because \(b = 0\)), we get \(\tilde{a}(t) = \nabla g(X_1)\) for all \(t\), which includes both terms. This is stated correctly in ?@eq-am-corrector but could be made more explicit: the lean adjoint is constant and equals the full modified terminal gradient.
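The suggested more explicit statement would be a one-liner:

```latex
\dot{\tilde{a}}(t) = 0 \ \text{(since } b = 0\text{)}
\quad\Longrightarrow\quad
\tilde{a}(t) \equiv \tilde{a}(1) = \nabla E(X_1) + \nabla \log \widehat{\varphi}_1(X_1)
\quad \text{for all } t .
```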
Carpet Sweeping
“This follows from the KL formulation” (line 31). The claim that \(\bbP^{u^\star} = \bbP(\cdot \mid X_1) \pi(X_1)\) is attributed to the HJB notes via a parenthetical. The HJB notes do not contain this SB factorization; they derive the optimal control and the value function but not the path-measure factorization into bridge times marginal.
“the projection never increases the SOC cost” (details block, lines 109–133). The proof is complete and correct, but the final step (“driving the path-space KL to zero… which has drift \(\sigma_t^2 \nabla \log \varphi_t\) for the appropriate Doob h-transform”) invokes the Doob h-transform without derivation. Fair enough in a details block, but a reader might want to see why the SB with a given terminal marginal has zero path-space KL.
“Alternating between forward and backward half-bridge projections is exactly IPFP/Sinkhorn on path-space” (line 244). This is the central convergence argument but is stated as a fact with a citation. The connection between the AM/CM alternation and the half-bridge projections is non-trivial and occupies a full theorem in the ASBS paper. The note compresses it to a single sentence.
Line 64: fixed-point characterization of \(u^\star\). The claim that “the unique fixed point is \(u^\star(t,x) = -\sigma_t \E_{\bbP^{u^\star}}[\nabla g(X_1) | X_t = x]\)” is stated without derivation. This follows from the lean adjoint argument in the adjoint matching note, but the note should say more than just “exchanging \(\nabla\) and \(\E\).”
Cross-Reference Verification
1. HJB notes: path resolves to notes/HJB/HJB.qmd. EXISTS. Content matches: contains the variational formulation, value function, optimal control \(u^\star = \sigma^\top \nabla V\), and the HJB equation. The sign convention (maximization) is correctly noted. However, the HJB notes do NOT contain the SB factorization \(\bbP^{u^\star} = \bbP(\cdot|X_1)\pi(X_1)\) claimed at line 31.
2. Girsanov notes: path resolves to notes/girsanov/girsanov.qmd. EXISTS. Contains the KL formula \(\kl(\bbP, \bbP^u) = \frac{1}{2}\E[\int \|u\|^2 dt]\) but with expectation under \(\bbP\) (the other direction). See Derivation Gap #1.
3. Doob h-transform: path resolves to notes/doob_transforms/doob.qmd. EXISTS. Contains the h-function, conditioned generator, optimal control formula, and Brownian bridge example. Correctly cited.
4. Adjoint notes: path resolves to notes/adjoint/adjoint.qmd. DOES NOT EXIST. The adjoint notes live at notes/adjoint_method/adjoint.qmd. Broken link.
5. Schrodinger bridge: path resolves to notes/shrodinger/shrodinger.qmd. DOES NOT EXIST. The SB notes live at notes/shrodinger_bridge/shrodinger.qmd. Broken link.
6. Tweedie-type formula: path resolves to notes/reverse_and_tweedie/reverse_and_tweedie.qmd. EXISTS. The Tweedie formula and denoising score matching identity are correctly cited. The claim that ?@eq-cm-tweedie has “identical structure” is accurate.
7. Adjoint matching: local file in the same directory. EXISTS. The lean adjoint derivation, zero conditional expectation argument, and loss function are all correctly cited. This cross-reference is accurate.
SB notes (line 177, 244, 246): Same broken link as #5 above.
Grad Student Sticking Points
Line 23: Sign convention parenthetical. A student comparing this note to the HJB note will get confused by \(g = -g_\text{HJB}\), \(V = -J\), and the claim that “the optimal control takes the same form in both conventions.” Working this out requires: \(u^\star_\text{HJB} = \sigma \nabla V_\text{HJB} = \sigma \nabla \log h\), and here \(u^\star = \sigma \nabla V = \sigma \nabla \log h\) with the same \(h\). But the note uses \(V = \log h\) while the HJB notes use \(V = \log h\) with a different sign on \(g\) inside \(h\). A student needs to check that \(h_\text{here} = h_\text{HJB}\), which requires \(\exp(-g) = \exp(g_\text{HJB})\), i.e., \(g = -g_\text{HJB}\). The check works but takes 5 minutes of confusion.
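The five-minute check, spelled out (a sketch under the conventions quoted above):

```latex
h_{\text{here}}(t,x)
  = \E\!\left[ e^{-g(X_1)} \,\middle|\, X_t = x \right]
  = \E\!\left[ e^{g_{\text{HJB}}(X_1)} \,\middle|\, X_t = x \right]
  = h_{\text{HJB}}(t,x) ,
```

since \(g = -g_\text{HJB}\); hence \(u^\star = \sigma \nabla \log h\) is literally the same function in both conventions.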
Line 64: “The conditional expectation, under the optimally controlled process, of the terminal gradient.” This is the fixed-point condition. A student will ask: why is \(u^\star(t,x) = -\sigma_t \E_{\bbP^{u^\star}}[\nabla g(X_1) \mid X_t = x]\) and not \(-\sigma_t \E_\bbP[\nabla g(X_1) \mid X_t = x]\) (under the base process)? The answer is that the regression ?@eq-am-loss regresses against samples from \(\bbP^{\bar{u}}\), so the conditional expectation is under \(\bbP^{u^\star}\) at the fixed point. But \(u^\star\) is also \(-\sigma_t \nabla V = -\sigma_t \nabla \log h\), and \(h\) is defined under the base process, so the two expressions need reconciling, which is non-obvious. We have \(\nabla V(t,x) = \nabla \log \E_\bbP[\exp(-g(X_1)) \mid X_t = x]\), which is not \(\E_{\bbP^{u^\star}}[-\nabla g(X_1) \mid X_t = x]\) by inspection. It equals \(\E_\bbP[\exp(-g(X_1)) \nabla(-g)(X_1) \mid X_t = x] / \E_\bbP[\exp(-g(X_1)) \mid X_t = x]\), which is \(\E_{\bbQ}[-\nabla g(X_1) \mid X_t = x]\) where \(\bbQ\) is the tilted measure. So the note’s claim requires that \(\bbP^{u^\star}\) produces the same conditional expectations as \(\bbQ\). This is true because \(\bbP^{u^\star} = \bbQ\), but a student will struggle with this.
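The reconciliation can be written as a single display (a sketch; the interchange of \(\nabla\) and \(\E\) uses \(b = 0\), so \(\partial X_1 / \partial x = I\) along base trajectories):

```latex
\nabla V(t,x)
  = \nabla \log \E_\bbP\!\left[ e^{-g(X_1)} \,\middle|\, X_t = x \right]
  = \frac{\E_\bbP\!\left[ e^{-g(X_1)} \, (-\nabla g)(X_1) \,\middle|\, X_t = x \right]}
         {\E_\bbP\!\left[ e^{-g(X_1)} \,\middle|\, X_t = x \right]}
  = -\E_\bbQ\!\left[ \nabla g(X_1) \,\middle|\, X_t = x \right],
```

with \(\bbQ\) the tilted measure; the claimed fixed point then follows from \(\bbP^{u^\star} = \bbQ\).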
Lines 159–167: Matching SOC to SB. The derivation of ?@eq-modified-terminal requires simultaneously: (a) the SB factorization, (b) the SOC optimal joint formula from the HJB notes, (c) the SB marginal constraint, and (d) identifying which piece maps to which. A student following line by line will find the argument dense. The sentence “Matching the terminal tilt requires \(e^{-g} \propto \varphi_1\)” is clear, but the next step (using the marginal constraint to get \(\varphi_1 = \pi/(p_1 \widehat{\varphi}_1)\)) requires knowing the Sinkhorn structure from the SB notes.
Lines 200–228: From ?@eq-log-grad-product to ?@eq-cm. The identification of the ratio as the SB posterior, the substitution, and the conversion to a least-squares loss happens in about 25 lines. Each step is individually correct but the chain is long. A student who loses track of which measure the expectation is over (base? SB? current model?) will get lost.
Papers’ Sins Reproduced
The Adjoint Sampling paper (Havens et al.) has a clean structure but does not adequately justify the replacement of \(p^u_{t,1}\) with \(p^\text{base}_{t|1} \cdot p^u_1\) away from optimality. The note reproduces this by saying “At the fixed point, the replacement is exact” (line 107) and deferring the off-fixed-point analysis to a details block. The details block gives the projection argument, which is better than the paper, but the core concern remains: why does a loss with the wrong joint (away from the fixed point) converge to the right fixed point? The note’s answer (“After projection, RAM and AM coincide”, line 135) is correct but could be more explicit about the fact that convergence requires the outer loop to eventually produce correct terminal samples.
The ASBS paper (Liu et al.) introduces the corrector matching objective but does not make clear that it is a self-consistent equation, not a standard regression. The note partially reproduces this: ?@eq-cm is stated as a regression loss, but the expectation is over \(p^\star_{0,1}\), which depends on the unknown optimal control. The note does flag this (“In practice, the expectation is over the current model’s endpoint pairs”), but the transition from the exact loss to the practical loss is abrupt.
Both papers claim IPFP convergence without discussing the effect of finite inner optimization steps. The note reproduces this at line 246: it mentions the caveat but does not address it substantively.
The ASBS paper does not clearly state that the base process must start from \(\mu\) (not from some other distribution). The note inherits this ambiguity; the setup starts with \(X_0 = 0\) and only later discusses non-Dirac priors, but never explicitly redefines the base process to start from \(\mu\).
What the Note Gets Right
Sign convention handling. The explicit parenthetical at line 23 comparing minimization (this note) to maximization (HJB notes) prevents a common source of confusion. This is better than either paper; both silently adopt their own conventions.
The constant lean adjoint derivation (lines 52–58). Setting \(b = 0\) and observing that \(\dot{\tilde{a}} = 0\) is clean, correct, and immediate. The physical intuition (Brownian increments are independent of the past) is well-stated. This is the key simplification and the note nails it.
The bridge formula and Gaussian conditioning (lines 81–99). The derivation of ?@eq-bridge is complete, self-contained, and includes the details block. The connection to the classical Brownian bridge is made explicit. No hidden steps.
The SOC cost decomposition (details block, lines 109–133). Splitting \(J(u)\) into path-space KL plus marginal KL is a clean argument not present (this explicitly) in the original paper.
The corrector matching derivation from Bayes’ rule (lines 188–228). Despite being compressed, the chain of equalities is correct. The identification of the ratio as the SB posterior is a nice touch that connects back to the SB notes.
The summary table (lines 248–252). Concisely shows how ASBS generalizes Adjoint Sampling. Accurate.
The comparison with PDDS/TSM (lines 261–264). Correctly identifies that the difference is in the sampling distribution (\(p^{\bar{u}}_1\) vs \(\pi\)), which is the key practical distinction. This comparison is more clearly stated here than in the original paper.
Pedagogical arc. The three-question structure (“Can I learn \(u^\star\) without computing \(h\)?”, “Can I avoid simulating the SDE?”, “What if the prior is not a Dirac?”) organizes the material from simple to complex. Each section builds on the previous one. A reader who understands the first section can skip the third without losing the thread.