Critique: BMS.qmd

Overall Assessment

The note covers the right material and the table at the end is a useful summary. However, it reads more like a compressed survey than a worked-through derivation in the style of the existing notes (HJB.qmd, doob.qmd). The key problem: Nelson’s relation, the central idea of the paper, is stated but never derived or made intuitive. The note tells the reader what things are rather than computing them from scratch.

Major Issues

  1. Nelson’s relation is stated, not derived. This is the single most important equation in the note. In doob.qmd and HJB.qmd, every key formula is derived from Euler discretization or Bayes’ rule arguments. Nelson’s relation should get the same treatment: start from the Fokker-Planck equation for \(\Pi^*_t\), write the forward and backward drifts, sum them, and show that the score pops out. The current note just says “Nelson’s relation states that” (line 76) and moves on. This is the pedagogical core of the paper and it is missing entirely.

  2. The TSI (target score identity) derivation is too compressed. The <details> block (lines 104-113) sketches the derivation in 5 lines. The key step, integration by parts to swap the gradient from \(\bbP_{t|0,T}\) onto \(\Pi^*_{0,T}\), deserves a full worked-out computation. This is where the independent coupling becomes tractable, and the reader needs to see why. Currently the sketch says “one can differentiate under the integral and integrate by parts” without showing it.

  3. The backward drift formula (?@eq-backward-drift) appears from nowhere. Line 83 says “By applying a Doob h-transform, the backward drift has the form…” but never shows the computation. In the existing notes, Doob h-transforms are always derived by computing the conditional expectation explicitly. At minimum, cross-reference the specific result from doob.qmd that gives this formula; ideally, show the one-line computation.

  4. No Euler discretization arguments anywhere. The signature move of these notes is to derive continuous-time results by first writing the Euler discretization, applying Bayes’ rule, and taking the limit. The reverse_and_tweedie.qmd note does exactly this for time reversal. BMS.qmd uses none of this machinery. Nelson’s relation, the bridge marginal formula, and the TSI should all be motivated or at least sanity-checked via discrete-time arguments.

  5. The connection between ?@eq-xi-general and the individual algorithm drifts is hand-waved. Lines 128-163 specialize the general drift to three couplings, but the algebra is skipped. For the independent coupling especially, show the substitution step by step: plug \(\Pi^*_{0,T} = p_0 \otimes \pi\) into ?@eq-xi-general, observe that \(\nabla_{X_0} \log(p_0 \otimes \pi) = \nabla \log p_0(X_0)\), etc.

  6. Missing: why the independent coupling is a valid fixed point. The note says “by construction, the target control \(u^*\) is a fixed point” (line 67) but never explains why the independent coupling \(p_0 \otimes \pi\) actually yields marginals matching the converged process. This is non-obvious: the Schrodinger bridge coupling is the KL-optimal one, so why does the independent coupling also work? The paper addresses this (the boundary constraints are satisfied by construction), and the note should reproduce that argument.
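
For issue 1, the requested derivation could be sketched roughly as follows. This is a sketch under assumptions, not the paper’s argument: \(p_t\) stands for the time-\(t\) marginal of \(\Pi^*\), \(v^*\) for the backward drift, and the sign convention for the reverse-time SDE is a guess that should be matched to BMS.qmd.

```latex
% Forward process, forward time t:
%   dX_t = \sigma_t u^*(X_t,t)\,dt + \sigma_t\,dW_t
\partial_t p_t = -\nabla\cdot\big(\sigma_t u^* p_t\big) + \tfrac{1}{2}\sigma_t^2\,\Delta p_t .

% Backward process (assumed convention): Y_s = X_{T-s} solves
%   dY_s = \sigma_{T-s} v^*(Y_s,T-s)\,ds + \sigma_{T-s}\,d\bar W_s ,
% with marginals q_s = p_{T-s}, hence \partial_s q_s = -\partial_t p_t at t = T-s:
-\partial_t p_t = -\nabla\cdot\big(\sigma_t v^* p_t\big) + \tfrac{1}{2}\sigma_t^2\,\Delta p_t .

% Add the two equations and use \Delta p_t = \nabla\cdot(p_t \nabla\log p_t):
0 = -\nabla\cdot\Big(\sigma_t\, p_t\,\big[\,u^* + v^* - \sigma_t\nabla\log p_t\,\big]\Big).
```

The two Fokker-Planck equations alone only force the bracketed field, weighted by \(\sigma_t p_t\), to be divergence-free. The time-reversal argument (for instance the discrete-time Bayes’ rule computation attributed above to reverse_and_tweedie.qmd) is what selects the solution where the bracket vanishes, giving \(u^* + v^* = \sigma_t \nabla \log p_t\).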

Style Issues

  1. Opening paragraph (lines 19-27) is a summary, not a computation. Compare with HJB.qmd which opens with “Consider a diffusion in \(\bbR^D\)…” and immediately starts computing. The BMS note opens with two paragraphs describing what the paper does. Rewrite to start with the mathematical setup.

  2. “The key mathematical ingredient is Nelson’s relation, which connects…” (line 27). This is the “In this note, we will…” style of phrasing, which the style guide explicitly forbids. Don’t announce what you’ll do; do it.

  3. “Recall the setup from the Schrodinger bridges note.” (line 32). This is fine for a brief callback but should not be the opening move of the first real section. The existing notes set up their own notation and derive what they need, cross-referencing for proofs but not for definitions.

  4. “By a classical result (Gyongy, 1986)” (line 44). The existing notes don’t appeal to authority this way. Either derive the Markovian projection result (it is a standard \(L^2\) projection argument, 3-4 lines) or just state it as obvious from the tower property, which is what the note already half-does on line 56.

  5. “The connection to Girsanov is also natural:” (line 204). The word “natural” is filler. Either make the connection explicit (write the formula) or cut it.

  6. “This has a clean variational interpretation.” (line 194). “Clean” is an empty adjective. Show the interpretation; don’t describe it.

  7. Lines 220-228 (“Connection to score matching and IPFP”). This entire section reads like a literature review paragraph, not a computation. It states connections without showing them. Either derive the IPFP connection explicitly (showing the two half-steps) or cut this section and put a one-line cross-reference to shrodinger.qmd.

  8. “The paper demonstrates sampling from multimodal densities in \(d=2500\) dimensions.” (line 217). The existing notes never summarize experimental results from papers. This line should be cut.

  9. Indirect “It is…” and “This is…” constructions throughout. “It is simply the score…” (line 88), “This is what makes the reciprocal projection computationally cheap” (line 175). The existing notes prefer a concrete subject and an active verb: “The score of the forward transition gives…” or “Sampling from the bridge is cheap because…”

  10. “The practical algorithm stores endpoint pairs…” (line 217). This reads like a methods section of a paper, not the mathematical blog note style. The existing notes describe algorithms through their mathematical content, not their implementation.
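
For style issue 4, the tower-property argument is short enough to show in full. The sketch below is generic: \(\bar\xi\) is a symbol introduced here for the conditional expectation, and whether the regression target is \(\xi\) or \(\sigma_t^{-1}\xi\) depends on the normalization used in BMS.qmd, which is not reconstructed here.

```latex
% Path-dependent target \xi(X,t), Markovian candidate u(X_t,t).
% Define \bar\xi(x,t) := \mathbb{E}_{\Pi^*}\big[\xi(X,t) \mid X_t = x\big].
\mathbb{E}_{\Pi^*}\big\|u(X_t,t) - \xi(X,t)\big\|^2
  = \mathbb{E}_{\Pi^*}\big\|u(X_t,t) - \bar\xi(X_t,t)\big\|^2
  + \mathbb{E}_{\Pi^*}\big\|\bar\xi(X_t,t) - \xi(X,t)\big\|^2 .

% The cross term vanishes by the tower property:
\mathbb{E}_{\Pi^*}\big\langle u - \bar\xi,\ \bar\xi - \xi \big\rangle
  = \mathbb{E}_{\Pi^*}\Big\langle u(X_t,t) - \bar\xi(X_t,t),\
      \mathbb{E}_{\Pi^*}\big[\bar\xi(X_t,t) - \xi(X,t) \mid X_t\big] \Big\rangle = 0 .
```

The second term on the right does not depend on \(u\), so the minimizer over Markovian fields is \(u = \bar\xi\), with no appeal to an external theorem.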

Missing Content

  1. Derivation of Nelson’s relation from Fokker-Planck. This is the heart of the paper. Write the Fokker-Planck equation for the forward process and the one for the backward process, equate the time derivatives, and extract \(u^* + v^* = \sigma_t \nabla \log p_t\). This can be done in 5-6 lines of computation. Alternatively, derive it from the discrete-time identity in reverse_and_tweedie.qmd.

  2. Why the Brownian bridge has the stated marginals. The formula ?@eq-bridge-marginal is stated without derivation. Since the existing notes derive the Brownian bridge drift from first principles (doob.qmd), the marginal formula should at least be shown to follow from the bridge SDE or from Gaussian conditioning.

  3. The “why” of the independent coupling. The paper’s key insight is that independent coupling works despite being suboptimal. The note should explain: the Schrodinger bridge coupling minimizes KL, but any coupling satisfying the boundary constraints yields a valid fixed point. The independent coupling sacrifices optimality of the coupling for tractability of the regression target. Currently this trade-off is mentioned in one sentence (line 215) but not explained.

  4. Mode-covering property of forward KL. Line 228 mentions “mode-covering” in passing. This is actually an important practical consequence. Explain in 2-3 sentences why \(\min_u \kl(\Pi^* | \bbP^u)\) (forward KL) is mode-covering, connecting to the standard forward-vs-reverse KL discussion.

  5. The free parameter \(c(t)\) in the TSI. The note mentions \(c(t) \in (0,1]\) but never discusses what choices are useful or why the freedom exists. The paper discusses choosing \(c(t) = \gamma(t)\) to simplify terms. This deserves a sentence or two.
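
For item 2, a quick numerical sanity check by Gaussian conditioning could accompany the derivation. The sketch below assumes a reference process \(dX_t = \sigma_t dW_t\) started at a fixed \(x_0\), with \(\kappa(t) = \int_0^t \sigma_s^2 ds\) and \(\gamma(t) = \kappa(t)/\kappa(T)\). The schedule \(\sigma_t = 1 + t\) and all function names are hypothetical choices made for illustration.

```python
import numpy as np


def kappa(t):
    # Accumulated variance int_0^t sigma_s^2 ds for the hypothetical
    # schedule sigma_t = 1 + t.
    return t + t**2 + t**3 / 3.0


def bridge_moments(x0, xT, t, T):
    # Candidate bridge marginal: mean (1 - gamma) x0 + gamma xT and
    # variance kappa(T) * gamma * (1 - gamma), with gamma = kappa(t) / kappa(T).
    gamma = kappa(t) / kappa(T)
    mean = (1.0 - gamma) * x0 + gamma * xT
    var = kappa(T) * gamma * (1.0 - gamma)
    return mean, var


def empirical_bridge_fit(x0, t, T, n=200_000, seed=0):
    # Sample (X_t, X_T) from the reference process started at x0, then
    # regress X_t on X_T. For jointly Gaussian variables the fitted line is
    # the conditional mean and the residual variance is the conditional variance.
    rng = np.random.default_rng(seed)
    x_t = x0 + np.sqrt(kappa(t)) * rng.standard_normal(n)
    x_T = x_t + np.sqrt(kappa(T) - kappa(t)) * rng.standard_normal(n)
    slope, intercept = np.polyfit(x_T, x_t, 1)
    resid_var = np.var(x_t - (slope * x_T + intercept))
    return slope, intercept, resid_var


if __name__ == "__main__":
    x0, t, T = 1.5, 0.4, 1.0
    slope, intercept, resid_var = empirical_bridge_fit(x0, t, T)
    gamma = kappa(t) / kappa(T)
    print("slope vs gamma:", slope, gamma)
    print("intercept vs (1 - gamma) x0:", intercept, (1.0 - gamma) * x0)
    print("residual variance vs formula:", resid_var, bridge_moments(x0, 0.0, t, T)[1])
```

The regression is only a check: the fitted slope should approach \(\gamma(t)\) and the intercept \((1-\gamma(t))x_0\), matching the bridge mean, while the residual variance should approach the bridge variance. It does not replace the derivation from the bridge SDE that the note should contain.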

Unnecessary Content

  1. The summary table (lines 207-215) is useful but could be more compact. The “Requires” column is vague. Replace it with the actual regression target formula for each method.

  2. Lines 220-228 (Connection to score matching and IPFP). This section adds no computation and restates things already said. Cut it or replace it with one cross-reference sentence.

  3. Lines 231-261 (BibTeX comment block). Not a content issue per se, but the BibTeX entries should be moved to ref.bib. The comment says “to add to ref.bib” but this should be done, not left as a TODO.

  4. “This is a proximal point method: each step takes a conservative update…” (line 200). Naming the method adds nothing. The variational formula ?@eq-damped-var already shows what is happening.

Line-by-Line Notes

  • Line 22: The SDE has \(\sigma_t u(X_t, t) dt + \sigma_t dW_t\). In the existing notes (HJB.qmd line 81), the convention is \(b(X^u)dt + \sigma(X^u)(dW_t + u dt)\). Decide which convention to use and be consistent across all 4 new notes.
  • Line 32: The cross-reference path is ../shrodinger/shrodinger.qmd, but the file appears to be in the same directory, in which case the path should be shrodinger.qmd. Confirm the actual file location and check all cross-reference paths.
  • Line 44: \(\xi(X,t)\) is introduced for the path-dependent drift, but line 47 writes \(\xi(X,t)\) as the bridge drift of \(\Pi^*\). Clarify: \(\xi\) is the non-Markovian drift of the process whose law is \(\Pi^*\), not the “bridge drift” generically.
  • Line 77: \(\sigma_t \nabla \log \Pi^*_t(x)\) on the RHS. The notation mixes \(\sigma_t\) (scalar schedule) with \(\sigma(t)\) used elsewhere. Pick one.
  • Line 85: \(\Pi^*_{0|t}\) notation is used without definition. This is the conditional law of \(X_0\) given \(X_t\) under \(\Pi^*\). Define it.
  • Line 96: \(\gamma(t) = \kappa(t)/\kappa(T)\) is defined but \(\kappa(T)\) depends on \(T\) which is the terminal time. Check consistency: is \(T\) always fixed?
  • Line 118: ?@eq-xi-general has \(\sigma_t^{-1} \xi(X,t)\) on the LHS but the three terms on the RHS involve gradients w.r.t. different variables (\(X_0\), \(X_T\), \(X_t\)). This is correct but confusing without an explicit statement that \(X_0, X_t, X_T\) are all determined by the path \(X\).
  • Lines 128-136: The adjoint sampling coupling is written as \(p_0 \otimes \Pi^*_T\) where \(\Pi^*_T = \pi\), then the drift is \(\nabla_{X_T} \log[\pi(X_T)/\bbP_T(X_T)]\). This \(\bbP_T\) comes from the forward transition marginal under the reference, but it is not defined in the BMS note. Define it or cross-reference.
  • Line 141: \(\hat\varphi_T(X_T)\) appears for the ASBS drift. The notation \(\hat\varphi\) is used in shrodinger.qmd but with different subscript conventions. Unify.
  • Line 177: The formula for \(\nabla_{X_t} \log \bbP_{t|0}\) uses \(\kappa(t)\), but \(\bbP_{t|0}\) was not explicitly defined as a Gaussian with variance \(\kappa(t)\). State this.
  • Line 192: \(\alpha \in (0,1]\) is introduced for damping. The macro file doesn’t have \(\alpha\) as a special symbol, which is fine, but make sure it doesn’t clash with any other use in the note.
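
For the Line 177 note, the missing statement might read as below. It assumes the reference process is \(dX_t = \sigma_t dW_t\), so that \(\kappa(t)\) is the accumulated variance. If BMS.qmd uses a different reference, the mean and variance change accordingly.

```latex
% Assumed reference: dX_t = \sigma_t\,dW_t, so X_t given X_0 is Gaussian.
\kappa(t) := \int_0^t \sigma_s^2\,ds, \qquad
\bbP_{t|0}(x_t \mid x_0) = \mathcal{N}\big(x_t;\ x_0,\ \kappa(t)\,I\big),

\nabla_{x_t}\log \bbP_{t|0}(x_t \mid x_0) = -\frac{x_t - x_0}{\kappa(t)} .
```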

Suggested Structure for v2

  1. Opening: Start with the mathematical setup directly. “Consider a controlled SDE \(dX_t = \sigma_t u(X_t,t) dt + \sigma_t dW_t\) with \(X_0 \sim p_0\). The goal is to find \(u\) such that \(X_T \sim \pi\).” One sentence mentioning that adjoint sampling and ASBS address this with different couplings, and BMS unifies them.

  2. Nelson’s relation (the main idea): Derive Nelson’s relation from scratch using Fokker-Planck or Euler discretization. Show \(u^* + v^* = \sigma_t \nabla \log p_t\). Give the one-line intuition: forward drift pushes toward target, backward drift pushes toward prior, their sum is the score.

  3. Reciprocal class and Markovian projection: Brief setup (2-3 equations), cross-referencing shrodinger.qmd for the reciprocal class. Derive the Markovian projection as \(L^2\) conditional expectation (show the tower property argument in 3 lines). State the matching loss.

  4. The fixed-point iteration: State the three steps (simulate, reciprocal project, regress). Explain why \(u^*\) is a fixed point.

  5. Target score identity: Derive the TSI properly. Start from \(\Pi^*_t(x) = \int \bbP_{t|0,T}(x|x_0,x_T) \Pi^*_{0,T}(x_0,x_T) dx_0 dx_T\). Differentiate, integrate by parts, show both forms. Then combine with Nelson’s relation and the backward drift to get the general \(\xi\) formula.

  6. Three couplings: For each coupling (AS, ASBS, BMS), show the substitution into the general formula explicitly. Keep it compact (3-5 lines each). Highlight what is tractable and what is not.

  7. The independent coupling: This is the punchline. Emphasize that everything in the regression target is known. Show the bridge sampling formula. Explain the trade-off: KL-suboptimal coupling but single objective, no alternation.

  8. Damped iteration: State the formula. Show the variational characterization in 2 lines. Give the proximal-point intuition in one sentence. Connect to mode preservation.

  9. Summary table: Keep it, but make it more informative.

Cut: the “Connection to score matching and IPFP” section, the experimental results mention, and the implementation details paragraph.
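
For step 5 of the suggested structure, the integration-by-parts step could be sketched as follows. The sketch assumes a Gaussian bridge kernel with mean \((1-\gamma_t)x_0 + \gamma_t x_T\), where \(\gamma_t = \gamma(t)\), and assumes boundary terms vanish. It shows one form only and does not reconstruct how the paper’s \(c(t)\) enters.

```latex
% Assumed bridge kernel: depends on (x, x_0, x_T) only through
% x - (1-\gamma_t) x_0 - \gamma_t x_T.
\bbP_{t|0,T}(x \mid x_0, x_T)
  = \mathcal{N}\big(x;\ (1-\gamma_t)x_0 + \gamma_t x_T,\ \kappa(T)\gamma_t(1-\gamma_t)\,I\big)
\ \Longrightarrow\
\nabla_x \bbP_{t|0,T} = -\gamma_t^{-1}\,\nabla_{x_T}\bbP_{t|0,T} .

% Differentiate under the integral, swap the gradient, integrate by parts in x_T:
\nabla_x \Pi^*_t(x)
  = \int \nabla_x \bbP_{t|0,T}(x \mid x_0,x_T)\,\Pi^*_{0,T}(x_0,x_T)\,dx_0\,dx_T
  = \gamma_t^{-1}\int \bbP_{t|0,T}(x \mid x_0,x_T)\,\nabla_{x_T}\Pi^*_{0,T}(x_0,x_T)\,dx_0\,dx_T .

% Divide by \Pi^*_t(x); \bbP_{t|0,T}\,\Pi^*_{0,T}/\Pi^*_t is the law of (X_0,X_T) given X_t = x:
\nabla\log\Pi^*_t(x)
  = \gamma_t^{-1}\,\mathbb{E}_{\Pi^*}\big[\nabla_{x_T}\log\Pi^*_{0,T}(X_0,X_T) \mid X_t = x\big] .
```

The same computation with \(x_0\) in place of \(x_T\) gives the second form, with prefactor \((1-\gamma_t)^{-1}\) and gradient \(\nabla_{x_0}\). For the independent coupling \(\Pi^*_{0,T} = p_0 \otimes \pi\), the gradient inside the expectation reduces to \(\nabla \log \pi(X_T)\), which is why the regression target becomes tractable.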