Tightness Audit: BMS.qmd
Verdict
Solid expository note with correct high-level structure; the main derivations are right at the heuristic level adopted. There are three genuine gaps (the backward-drift description, the TSI integration-by-parts sign chain, and a missing verification of the sign in the general regression target), several hidden assumptions, and one place where a fixed-point claim is stated without justification. No fatal errors, but a careful reader will get stuck at 4–5 points.
Derivation Gaps
1. Nelson’s relation: linearization of \(p_t(x)\) around \(y\) (lines 42–45)
Claim: Expanding \(p_t(x) \approx p_t(y) \exp(\langle \nabla \log p_t(y), x - y \rangle)\) and completing the square yields the backward conditional mean \(\E[X_t \mid X_{t+\delta} = y] = y - \sigma_t u \delta + \sigma_t^2 \nabla \log p_t(y) \delta\).
Issue: The expansion is first-order in \((x-y)\), but \(x - y = O(\sqrt{\delta})\) (Brownian scaling), so \((x-y)^2 = O(\delta)\). The second-order term of \(\log p_t(x)\) around \(y\) contributes at the same \(O(\delta)\) order as the linear term squared in the Gaussian exponent. The derivation implicitly assumes the Hessian \(\nabla^2 \log p_t\) contribution is absorbed into the \(O(\delta^2)\) remainder, which is correct only after averaging (the Hessian term shifts the variance, not the mean, to leading order). This is standard Euler-level heuristics, but the note does not flag that the first-order expansion of \(\log p_t\) is actually sufficient for the mean but not for higher moments. A one-line parenthetical would help.
Fix: Add a remark that the second-order correction to \(\log p_t\) affects the conditional variance at \(O(\delta)\) but not the conditional mean.
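The suggested remark can also be checked numerically in a fully Gaussian toy model (my construction, not from the note): take \(p_t = \mathcal{N}(0, s^2)\) and a constant drift \(u\), so the Euler step makes \((X_t, X_{t+\delta})\) jointly Gaussian and the backward conditional mean is computable exactly; the gap to Nelson's first-order formula is then \(O(\delta^2)\), confirming that the Hessian correction does not enter the mean at \(O(\delta)\).

```python
import math

# Toy check of the backward conditional mean at first order in delta.
# Setup (an assumption, not from the note): X_t ~ N(0, s^2), constant drift u,
# forward Euler step X_{t+delta} = X_t + sigma*u*delta + sigma*sqrt(delta)*Z.
# Then (X_t, X_{t+delta}) is jointly Gaussian, so E[X_t | X_{t+delta} = y] is exact:
#   E[X_t | y] = s^2 / (s^2 + sigma^2*delta) * (y - sigma*u*delta).
# Nelson's first-order formula predicts y - sigma*u*delta + sigma^2*delta*score(y),
# with score(y) = grad log p_t(y) = -y / s^2.

def exact_mean(y, s2, sigma, u, delta):
    return s2 / (s2 + sigma**2 * delta) * (y - sigma * u * delta)

def nelson_mean(y, s2, sigma, u, delta):
    score = -y / s2
    return y - sigma * u * delta + sigma**2 * score * delta

y, s2, sigma, u = 0.7, 1.0, 0.5, 0.3
for delta in (1e-2, 1e-3, 1e-4):
    err = abs(exact_mean(y, s2, sigma, u, delta) - nelson_mean(y, s2, sigma, u, delta))
    print(delta, err)  # error shrinks like delta^2
```

The error contracting by roughly 100x per decade of \(\delta\) is exactly the \(O(\delta^2)\) remainder the remark should flag.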
2. Backward drift description for a reciprocal process (line 176–182)
Claim (line 176): “the bridge from \(X_0\) to \(X_T\) has backward drift \(\sigma_t \nabla_{X_t} \log \mathbb{P}_{t|0}(X_t | X_0)\) (the score of the forward transition kernel, pointing back toward \(X_0\)).”
Issue: This statement is not quite right. The drift of the bridge \(\mathbb{P}_{|0,T}\) is \(\sigma_t \nabla_{X_t} \log \mathbb{P}_{t|0,T}(X_t | X_0, X_T)\), not \(\sigma_t \nabla_{X_t} \log \mathbb{P}_{t|0}(X_t | X_0)\). The bridge drift involves conditioning on both endpoints. What the note means is the Markovian backward drift \(v^\star\) for the process \(\Pi^\star\), obtained by averaging the bridge’s backward score \(\nabla_{X_t} \log \mathbb{P}_{t|0}(X_t | X_0)\) over \(X_0 \mid X_t\). But the bridge kernel \(\mathbb{P}_{t|0,T}\) decomposes as \(\mathbb{P}_{t|0} \cdot \mathbb{P}_{T|t} / \mathbb{P}_{T|0}\), and the forward score \(\nabla_{X_t} \log \mathbb{P}_{t|0}\) is only one piece of the bridge drift (the other piece is \(\nabla_{X_t} \log \mathbb{P}_{T|t}\), which points toward \(X_T\)). The note then correctly uses only the \(\nabla_{X_t} \log \mathbb{P}_{t|0}\) piece in the formula for \(v^\star\), which is right (this is Proposition “uv general coupling” in the paper). But the English description on line 176 conflates “bridge drift” with “backward Markovian drift piece”.
Fix: Replace “the bridge from \(X_0\) to \(X_T\) has backward drift \(\sigma_t \nabla_{X_t} \log \mathbb{P}_{t|0}(X_t | X_0)\)” with something like: “From the decomposition of the bridge score into forward and backward pieces (the bridge drift at \(X_t\) is \(\nabla_{X_t} \log \mathbb{P}_{t|0}(X_t | X_0) + \nabla_{X_t} \log \mathbb{P}_{T|t}(X_T | X_t)\)), the backward Markovian drift \(v^\star\) extracts the forward-transition piece \(\sigma_t \nabla_{X_t} \log \mathbb{P}_{t|0}(X_t | X_0)\) and averages it over \(X_0 \mid X_t\).”
3. TSI integration by parts: sign and direction (lines 140–165)
Claim (line 143): \(\nabla_x \mathbb{P}_{t|0,T} = -\frac{1}{1-\gamma_t} \nabla_{x_0} \mathbb{P}_{t|0,T}\).
Verification: The bridge mean is \((1-\gamma_t)x_0 + \gamma_t x_T\). The score w.r.t. \(x\) is \(-(x - (1-\gamma_t)x_0 - \gamma_t x_T) / [\nu_T \gamma_t(1-\gamma_t)]\). The score w.r.t. \(x_0\) is \(+(1-\gamma_t)(x - (1-\gamma_t)x_0 - \gamma_t x_T) / [\nu_T \gamma_t(1-\gamma_t)]\). So indeed \(\nabla_x \log \mathbb{P}_{t|0,T} = -\frac{1}{1-\gamma_t} \nabla_{x_0} \log \mathbb{P}_{t|0,T}\), and at the density level \(\nabla_x \mathbb{P}_{t|0,T} = -\frac{1}{1-\gamma_t} \nabla_{x_0} \mathbb{P}_{t|0,T}\). Correct.
After integration by parts (moving \(\nabla_{x_0}\) from \(\mathbb{P}_{t|0,T}\) onto \(\Pi^\star_{0,T}\)), the result on line 148 picks up a factor \(+1/(1-\gamma_t)\) in front of \(\nabla_{x_0} \log \Pi^\star_{0,T}\). Let me verify. Starting from:
\[\nabla_x \Pi^\star_t(x) = \int \nabla_x \mathbb{P}_{t|0,T}(x|x_0,x_T) \Pi^\star_{0,T}(x_0,x_T) dx_0 dx_T\]
Substituting \(\nabla_x \mathbb{P} = -\frac{1}{1-\gamma_t} \nabla_{x_0} \mathbb{P}\):
\[= -\frac{1}{1-\gamma_t} \int \nabla_{x_0} \mathbb{P}_{t|0,T} \cdot \Pi^\star_{0,T} \, dx_0 dx_T\]
Integration by parts in \(x_0\) (boundary terms vanish):
\[= +\frac{1}{1-\gamma_t} \int \mathbb{P}_{t|0,T} \cdot \nabla_{x_0} \Pi^\star_{0,T} \, dx_0 dx_T\]
Dividing by \(\Pi^\star_t(x)\) and recognizing the conditional expectation:
\[\nabla_x \log \Pi^\star_t(x) = \frac{1}{1-\gamma_t} \E_{\Pi^\star_{0,T|t}}[\nabla_{X_0} \log \Pi^\star_{0,T}(X_0,X_T) \mid X_t = x]\]
This matches line 154. The derivation is correct. However, the details block could be clearer about the sign flip from integration by parts. The note says “we moved \(\nabla_{x_0}\) from \(\mathbb{P}_{t|0,T}\) onto \(\Pi^\star_{0,T}\)” without explicitly showing the sign change from integration by parts (minus times minus = plus). A reader following with pen and paper might stumble.
Fix: Minor. Add explicit mention of the sign change from integration by parts.
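As a sanity check on the gradient swap itself (a toy verification of mine, not part of the note), the score identity can be confirmed by finite differences on the scalar Gaussian bridge kernel with mean \((1-\gamma_t)x_0 + \gamma_t x_T\) and variance \(\nu_T \gamma_t (1-\gamma_t)\):

```python
import math

# Finite-difference check of the identity
#   d/dx log P_{t|0,T}(x | x0, xT) = -1/(1-gamma) * d/dx0 log P_{t|0,T}(x | x0, xT)
# for the scalar Gaussian bridge kernel.

def log_bridge(x, x0, xT, gamma, nuT):
    mean = (1 - gamma) * x0 + gamma * xT
    var = nuT * gamma * (1 - gamma)
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

x, x0, xT, gamma, nuT, h = 0.3, -0.5, 1.2, 0.4, 2.0, 1e-6
grad_x = (log_bridge(x + h, x0, xT, gamma, nuT)
          - log_bridge(x - h, x0, xT, gamma, nuT)) / (2 * h)
grad_x0 = (log_bridge(x, x0 + h, xT, gamma, nuT)
           - log_bridge(x, x0 - h, xT, gamma, nuT)) / (2 * h)
print(grad_x, -grad_x0 / (1 - gamma))  # the two numbers agree
```

With these values the closed forms give \(\nabla_x \log \mathbb{P} = -0.25\) and \(\nabla_{x_0} \log \mathbb{P} = 0.15\), consistent with the factor \(-1/(1-\gamma_t)\).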
4. General regression target: missing negative sign verification (lines 184–192)
Claim (line 187): The regression target is
\[\sigma_t^{-1} \xi = \frac{1-c(t)}{1-\gamma_t} \nabla_{X_0} \log \Pi^\star_{0,T} + \frac{c(t)}{\gamma_t} \nabla_{X_T} \log \Pi^\star_{0,T} - \nabla_{X_t} \log \mathbb{P}_{t|0}(X_t | X_0).\]
Verification: Nelson gives \(u^\star = \sigma_t \nabla \log \Pi^\star_t - v^\star\). The TSI gives \(\sigma_t \nabla \log \Pi^\star_t = \sigma_t \E[\text{TSI integrand} | X_t]\). Proposition “uv general coupling” gives \(v^\star = \sigma_t \E[\nabla_{X_t} \log \mathbb{P}_{t|0}(X_t|X_0) | X_t]\). Combining: \(u^\star = \sigma_t \E[\text{TSI integrand} - \nabla_{X_t} \log \mathbb{P}_{t|0} | X_t]\). The \(\xi\) integrand is the quantity inside the expectation, rescaled by \(\sigma_t\).
The note writes \(\sigma_t^{-1} \xi = \text{TSI integrand} - \nabla_{X_t} \log \mathbb{P}_{t|0}\). This is \(u^\star / \sigma_t = \nabla \log \Pi^\star_t - v^\star / \sigma_t\), with \(v^\star / \sigma_t = \E[\nabla_{X_t} \log \mathbb{P}_{t|0} | X_t]\). Before conditioning, the integrand for \(v^\star/\sigma_t\) is \(\nabla_{X_t} \log \mathbb{P}_{t|0}(X_t | X_0)\). So \(\xi / \sigma_t = \text{TSI integrand} - \nabla_{X_t} \log \mathbb{P}_{t|0}(X_t | X_0)\), and \(\E[\xi | X_t] = u^\star\). Correct.
However, the note does not verify that the conditional expectation of this \(\xi\) actually equals \(u^\star\) (it just says “their combination, before conditioning, is a valid \(\xi\)”). Combining two separate conditional expectations into one by taking integrands before conditioning requires that both expectations are taken over the same conditional distribution \(\Pi^\star_{0,T|t}\). The TSI integrand is conditioned on \((X_0, X_T) \mid X_t\) while the backward-drift integrand is conditioned on \(X_0 \mid X_t\). These are compatible: both are expectations under \(\Pi^\star\) conditioned on \(X_t\), the first depending on \((X_0, X_T)\) and the second on \(X_0\) alone. Since \((X_0, X_T) \mid X_t\) marginalizes to \(X_0 \mid X_t\) by dropping \(X_T\), the subtraction is valid. The note should mention this compatibility.
Fix: Add one sentence: “Both conditional expectations are with respect to \(\Pi^\star\) conditioned on \(X_t = x\); the TSI integrand depends on \((X_0, X_T)\) and the backward drift integrand on \(X_0\) alone, but both live in the same conditional probability space, so the subtraction of integrands before conditioning is valid.”
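Spelled out as one displayed chain (just restating the verification above in the note's notation, combining Nelson, the TSI, and the backward-drift proposition):
\[
u^\star = \sigma_t \nabla \log \Pi^\star_t - v^\star
= \sigma_t \E\big[\text{TSI integrand} \mid X_t\big] - \sigma_t \E\big[\nabla_{X_t} \log \mathbb{P}_{t|0}(X_t \mid X_0) \mid X_t\big]
= \E\big[\xi \mid X_t\big],
\]
where both conditional expectations are under \(\Pi^\star_{0,T|t}\), so the integrands may be subtracted before conditioning.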
5. AS reduction: the \(\nabla_{X_0}\) term for a Dirac coupling (lines 199–226)
Claim (line 204): “The \(\nabla_{X_0}\) term drops (Dirac at \(0\)).”
Issue: This is a distributional statement. For \(\Pi^\star_{0,T} = \delta_0 \otimes \pi\), we have \(\nabla_{X_0} \log \Pi^\star_{0,T} = \nabla_{X_0} \log \delta_0(X_0) + \nabla_{X_0} \log \pi(X_T)\). The second term is zero (no \(X_0\) dependence). The first term is \(\nabla \log \delta_0\), which is not well-defined in the classical sense. The note handwaves this as “Dirac has no gradient to speak of.” In reality, what happens is that \(X_0 = 0\) a.s. under the coupling, so the term involving \(\nabla_{X_0}\) has no variance; it is a constant integrated against a Dirac, and the whole term vanishes after expectation. The note’s explanation is correct in spirit but sloppy in detail.
More precisely: with \(c(t) = \gamma_t\), the coefficient \((1-c(t))/(1-\gamma_t) = (1-\gamma_t)/(1-\gamma_t) = 1\). The term is \(\nabla_{X_0} \log \delta_0(X_0)\), which is ill-defined. The correct argument (as in the paper’s Appendix C.2) proceeds differently: one does not use the general formula with \(\nabla_{X_0} \log \Pi^\star_{0,T}\) at all, but instead directly computes the score \(\nabla \log \Pi^\star_t\) for the half-bridge coupling using the TSI with \(c(t) = \gamma_t\), which kills the \(\nabla_{X_0}\) integration-by-parts term entirely (not just the integrand). The note’s details block (lines 204–225) actually does the correct computation: it goes back to the concrete expressions and shows the cancellation. The surrounding text (line 204) is just misleading.
Fix: Replace “The \(\nabla_{X_0}\) term drops (Dirac at \(0\))” with “The TSI is used with \(c(t) = \gamma_t\), and since \(\Pi^\star_0 = \delta_0\), the integration-by-parts in \(X_0\) that generates the \(\nabla_{X_0} \log \Pi^\star_{0,T}\) term is not needed; we use only the \(X_T\) branch of the TSI.”
6. Damped iteration: the Pythagorean identity (lines 288–305)
Claim (line 291): “Apply the bias-variance decomposition (Pythagorean identity from the Markovian projection) to the first term.”
Verification: This is the standard \(L^2\) decomposition: for \(\Phi(u_i) = \E_{\Pi^i}[\xi | X_t]\), we have \(\E[\|\xi - u\|^2] = \E[\|\xi - \Phi\|^2] + \E[\|\Phi - u\|^2]\). The cross-term vanishes by the tower property. Correct.
The rest of the derivation (pointwise FOC, substitution of \(\eta = (1-\alpha)/\alpha\)) is algebra that checks out. No issues.
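The decomposition can be verified exactly on a four-point probability space (a toy example of mine, not from the note), where \(\Phi(X) = \E[\xi \mid X]\) is computable by hand:

```python
from itertools import product

# Exact check of E[(xi - u)^2] = E[(xi - Phi)^2] + E[(Phi - u)^2]
# with Phi(X) = E[xi | X]. Toy space: X, W independent fair coins,
# xi = X + 2W, so Phi(X) = X + 1; u is an arbitrary function of X, here u(X) = 3X.

outcomes = list(product((0, 1), (0, 1)))  # (X, W), each with probability 1/4
xi = {(x, w): x + 2 * w for x, w in outcomes}
phi = {x: sum(xi[(x, w)] for w in (0, 1)) / 2 for x in (0, 1)}  # E[xi | X]
u = {0: 0, 1: 3}

lhs = sum((xi[o] - u[o[0]]) ** 2 for o in outcomes) / 4
rhs = (sum((xi[o] - phi[o[0]]) ** 2 for o in outcomes) / 4
       + sum((phi[o[0]] - u[o[0]]) ** 2 for o in outcomes) / 4)
print(lhs, rhs)  # both equal 2.0
```

The cross-term vanishes by the tower property, exactly as invoked in the note, and the identity holds with equality (not just approximately) on any finite space.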
Carpet Sweeping
C1. “The matching loss is a forward KL objective” (line 318)
The note claims: “The matching loss ?@eq-matching-loss is a forward KL objective: \(u^\star = \argmin_u \text{KL}(\Pi^\star | \mathbb{P}^u)\). This follows because the Girsanov KL between \(\Pi^\star\) and \(\mathbb{P}^u\) decomposes as an irreducible term plus the matching loss (see Lemma 1 of Brunick 2013).”
This is the right idea but the attribution is wrong. Brunick (2013) proves the Markovian projection preserves marginals; the KL decomposition is a separate (though standard) result. More importantly, the decomposition requires \(\Pi^\star \ll \mathbb{P}^u\) for the KL to be finite, which holds when \(\Pi^\star\) is in the reciprocal class of \(\mathbb{P}\) and \(u\) is admissible, but the note does not verify this.
C2. “Forward KL is mode-covering” (line 318)
The claim that forward KL \(\text{KL}(\Pi^\star | \mathbb{P}^u)\) is mode-covering is stated without qualification. This is the standard intuition (minimizing forward KL penalizes placing zero mass where \(\Pi^\star\) has mass), but it applies to the path-space KL, not directly to the terminal marginal. The mode-covering property for the terminal distribution requires additional argument (that the Markovian projection preserves marginals, hence mode coverage at the path level implies mode coverage at the marginal level). The note waves at this but does not spell it out.
C3. “The Schrodinger bridge coupling minimizes path-space KL” (line 267)
Stated as a fact without derivation or citation. The note should cross-reference the SB note or at minimum say “as shown in the SB note.”
Cross-Reference Verification
XR1. Reverse diffusions note (lines 36, 61)
Claim: The backward conditional mean computation “exactly as in the reverse diffusions note.”
Verification: The reverse_and_tweedie.qmd note (lines 39–58) performs the same Bayes’ rule + Gaussian completion-of-the-square argument for a general drift \(\mu(X)\) and derives the time-reversal formula. The BMS note adapts this to the \(\sigma_t u\) drift form. Valid cross-reference.
XR2. Doob h-transforms note for Brownian bridge (line 82)
Claim: “see the Schrodinger bridges note” for the reciprocal class definition.
Verification: The shrodinger.qmd note (lines 86–99) defines the dynamic SB problem as finding \(\mathbb{Q}\) with given marginals minimizing KL to the reference, and describes the reference bridge process. The reciprocal class concept is not explicitly defined by name in that note, but the construction \(\Pi = \Pi_{0,T} \mathbb{P}_{|0,T}\) is implicit. Adequate but imprecise cross-reference.
XR3. Girsanov note (line 318)
Claim: “the Girsanov KL between \(\Pi^\star\) and \(\mathbb{P}^u\) decomposes…”
Verification: The girsanov.qmd note (lines 106–110) derives \(\text{KL}(\mathbb{P}, \mathbb{P}^u) = \frac{1}{2} \E \int \|u\|^2 dt\), which is the special case of KL between the reference and a controlled process. The general KL between two path measures with different drifts is also given (lines 98–104). The decomposition claimed in BMS.qmd (matching loss = KL minus irreducible noise) is not explicitly stated in the Girsanov note. Overclaimed cross-reference.
XR4. Adjoint sampling note (lines 25, 199, 234)
Claim: Links to adjoint_sampling.qmd for the AS algorithm.
Verification: The adjoint_sampling.qmd note contains the AS algorithm, the corrector matching loss, and the IPFP alternation. The BMS note’s description of AS as “Dirac prior, memoryless condition” matches. The corrector/ASBS description on line 234 matches the adjoint_sampling.qmd content. Valid cross-references.
XR5. Schrodinger bridges note (line 82, 234, 267)
Claim: References to the SB coupling \(\hat\varphi_0 \cdot \mathbb{P}_{T|0} \cdot \varphi_T\).
Verification: The shrodinger.qmd note (lines 109–116) describes the SB solution as \(d\mathbb{Q}^\star / d\mathbb{P}^{\text{ref}} \propto f(X_0) g(X_1)\) with potentials \(\varphi_t\) and \(\hat\varphi_t\) satisfying the Sinkhorn equations. The BMS note’s formulation on line 234 uses \(\hat\varphi_0, \varphi_T\), which matches (with the understanding that \(\hat\varphi\) is the backward potential). Valid cross-reference.
Grad Student Sticking Points
G1. Why Nelson’s relation involves \(p_t\) and not \(\Pi^\star_t\)
The note derives Nelson’s relation for \(\mathbb{P}^u\) with marginals \(p_t\) (line 72), then later applies it with \(\Pi^\star_t\) (line 184). The connection is: \(u^\star\) is the Markovian projection of \(\Pi^\star\), so \(\mathbb{P}^{u^\star}\) has the same marginals as \(\Pi^\star\), hence Nelson applies with \(p_t = \Pi^\star_t\). This is not stated.
G2. The backward drift \(v^\star\) computation
Line 179 states a formula for \(v^\star\) as a conditional expectation of \(\sigma_t \nabla_{X_t} \log \mathbb{P}_{t|0}(X_t | X_0)\). A grad student would wonder: where does this come from? The note says “the bridge from \(X_0\) to \(X_T\) has backward drift…” (line 176), but this requires decomposing the bridge score \(\nabla \log \mathbb{P}_{t|0,T}\) into \(\nabla \log \mathbb{P}_{t|0} + \nabla \log \mathbb{P}_{T|t}\) and then recognizing that the \(\nabla \log \mathbb{P}_{T|t}\) piece, when averaged over \(X_T | X_t\), gives \(u^\star\), so by Nelson, \(v^\star\) is the remaining piece averaged over \(X_0 | X_t\). This chain of reasoning is not shown.
G3. Why does independent coupling give \(\mathbb{P}^{u^\star}_T = \pi\)?
Lines 257–265 argue that the independent coupling has terminal marginal \(\pi\) (by direct computation, which is clear), and then say “the Markovian projection preserves time marginals, so \(\mathbb{P}^{u^\star}_T = \pi\).” This is the key property of Markovian projections (Brunick 2013), but a student unfamiliar with this result would want to know why. A pointer to the relevant theorem would help.
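For completeness, the standard mimicking argument behind marginal preservation (a sketch, not reproduced in the note): if \(\Pi^\star\) has drift integrand \(b_t\) and time marginals \(\rho_t\), then \(\rho_t\) solves the Fokker–Planck equation driven by the conditional drift,
\[
\partial_t \rho_t = -\nabla \cdot \big(\hat{b}(\cdot, t)\, \rho_t\big) + \tfrac{1}{2}\sigma_t^2 \Delta \rho_t,
\qquad \hat{b}(x, t) = \E[\, b_t \mid X_t = x \,],
\]
and the Markovian SDE with drift \(\hat{b}\) produces marginals solving the same equation; under uniqueness for this PDE, the two families of marginals coincide. This is the content of the Brunick (2013) result the note invokes.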
G4. Evaluating ?@eq-xi-bms in practice
Line 252 gives the BMS regression target with three terms: \(\nabla \log p_0(X_0)\), \(\nabla \log \pi(X_T)\), and \((X_t - X_0)/\nu_t\). A student implementing this would ask: where do \((X_0, X_t, X_T)\) come from? The answer (line 269–271) is: simulate the SDE to get \(X_0, X_T\) pairs, independently resample them, then draw \(X_t\) from the bridge ?@eq-bridge-marginal. This workflow is described but could be made more concrete (e.g., pseudocode or a numbered algorithm).
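A minimal one-dimensional sketch of that workflow (the zero reference drift, \(p_0 = \mathcal{N}(0,1)\), constant \(\sigma\), and the helper names are all illustrative assumptions of mine; the note and paper leave these free):

```python
import math, random

random.seed(0)

def simulate_endpoint(x0, sigma, T, n_steps, u=lambda x, t: 0.0):
    # Euler-Maruyama for dX = sigma*u(X,t) dt + sigma dB; u = 0 is the reference process.
    dt = T / n_steps
    x = x0
    for k in range(n_steps):
        x += sigma * u(x, k * dt) * dt + sigma * math.sqrt(dt) * random.gauss(0.0, 1.0)
    return x

def bms_training_triples(n, sigma=1.0, T=1.0, n_steps=100):
    # 1) simulate (X0, XT) pairs under the current (here: reference) dynamics
    x0s = [random.gauss(0.0, 1.0) for _ in range(n)]  # p_0 = N(0, 1), an assumption
    xTs = [simulate_endpoint(x0, sigma, T, n_steps) for x0 in x0s]
    # 2) independently resample: shuffle XT against X0 to form the product coupling
    random.shuffle(xTs)
    # 3) draw t uniformly and X_t from the Brownian bridge between X0 and XT;
    #    the variance sigma^2 * t*(T-t)/T is nu_T * gamma_t * (1 - gamma_t)
    #    with nu_T = sigma^2 * T and gamma_t = t / T.
    triples = []
    for x0, xT in zip(x0s, xTs):
        t = random.uniform(1e-3, T - 1e-3)
        gamma = t / T
        mean = (1 - gamma) * x0 + gamma * xT
        var = sigma**2 * t * (T - t) / T
        xt = random.gauss(mean, math.sqrt(var))
        triples.append((t, x0, xt, xT))
    return triples

data = bms_training_triples(1000)
```

Each triple \((t, X_0, X_t, X_T)\) is then plugged into the regression target; the neural regression itself is omitted here.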
Papers’ Sins Reproduced
P1. Nelson’s relation derivation is self-consistent: YES
The note derives Nelson’s relation from Euler discretization + Bayes’ rule (lines 30–73), then verifies it against the time-reversal formula (lines 61–77). The two routes agree. No sin here.
P2. The TSI integration by parts: is it correct? YES, but terse
The integration by parts in the details block (lines 140–165) is correct. The paper’s proof (Appendix B, proof of TSI general coupling) goes through the same steps. The note’s treatment is more compressed but does not introduce errors. The sign is right: the gradient swap \(\nabla_x \to -\frac{1}{1-\gamma_t}\nabla_{x_0}\) combined with IBP sign flip gives \(+\frac{1}{1-\gamma_t}\).
P3. The general regression target: are the signs right? YES
Verified above (Derivation Gap 4). The signs in ?@eq-xi-general are correct.
P4. The AS specialization: is the algebra shown? PARTIALLY
The details block (lines 202–225) shows the reduction, but the argument that \(X_t/\nu_t\) cancels to give \(X_T/\nu_T\) after Markovianization involves computing \(\E[X_t | X_T]\) under the half-bridge, which is shown. The final step “\(\nabla \log \mathbb{P}_T(X_T) = -X_T/\nu_T\)” is stated without derivation (it follows from \(\mathbb{P}_T = \mathcal{N}(0, \nu_T I)\), which is a one-liner, but should be stated explicitly).
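The one-liner in question, for the record:
\[
\nabla_x \log \mathcal{N}(x; 0, \nu_T I) = \nabla_x \Big( -\tfrac{\|x\|^2}{2\nu_T} + \text{const} \Big) = -\frac{x}{\nu_T},
\]
so \(\nabla \log \mathbb{P}_T(X_T) = -X_T/\nu_T\) follows immediately from \(\mathbb{P}_T = \mathcal{N}(0, \nu_T I)\).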
P5. The independent coupling fixed point: is it justified? PARTIALLY
Lines 257–265 verify that the terminal marginal is \(\pi\) and invoke marginal preservation. The note does not verify that \(u^\star\) is the unique fixed point, nor that the iteration converges; the BMS paper itself leaves convergence as an open problem. The note avoids overclaiming only by omission; an explicit caveat would be better.
P6. The damped iteration: is it verified? YES
Lines 288–305 derive ?@eq-damped from ?@eq-damped-var via the Pythagorean identity and pointwise FOC. The algebra is correct and complete.
What the Note Gets Right
Nelson’s relation derivation. Clean, self-contained, with a good sanity check against the time-reversal formula. The Euler discretization approach matches the style of the existing notes.
Unified perspective across three algorithms. The table on line 312 and the systematic progression from AS to AS+corrector to BMS, all through the lens of different couplings, are pedagogically effective.
The completing-the-square details block (lines 48–59) is well done and gives the reader a concrete computation to follow.
The damped iteration derivation (lines 288–305) is the cleanest part of the note: every step follows, the Pythagorean identity is correctly invoked, and the substitution \(\eta = (1-\alpha)/\alpha\) is verified.
The “why it works” section (lines 257–271) for the independent coupling gives an honest, direct argument for why \(\Pi^\star_T = \pi\) without hiding anything.
The general \(\xi\) formula (line 187) correctly combines TSI and backward drift integrands. The note is explicit that this is “before conditioning” and that Markovianization recovers \(u^\star\) via regression.
Cross-references to the series are well-placed and mostly accurate, giving the reader a clear map of which results come from which note.
The note is honest about what the independent coupling sacrifices. Line 267: “The Schrodinger bridge coupling minimizes path-space KL. The independent coupling sacrifices this optimality for a fully tractable regression target.” This is a fair summary.