Critique: adjoint_sampling.qmd

Overall Assessment

The note covers the right ideas in the right order, but it reads like a well-organized summary of the paper rather than a note in the author’s voice. The existing notes (HJB.qmd, doob.qmd, girsanov.qmd) derive results from scratch using Euler discretization and Bayes’ rule arguments; this note instead announces results and explains them verbally. Several key ideas from the paper are missing or underexplained, and the theoretical justification section is too vague to be useful.

Major Issues

  1. No derivation from scratch. The hallmark of the existing notes is that every result is computed, not stated. HJB.qmd derives the optimal control by doing a one-step Bellman argument. doob.qmd derives the conditioned generator from the definition. girsanov.qmd derives Girsanov from discrete path probabilities. This note states ?@eq-soc-obj without showing where it comes from (just says “as shown in the Girsanov and SOC notes”). The Brownian bridge distribution ?@eq-bridge-gaussian is derived in a details block, which is good, but the main derivations (the SOC objective, the adjoint simplification, the reciprocal projection) are asserted, not computed.

  2. The reciprocal projection is hand-waved. Section “Reciprocal Adjoint Matching” is the core contribution of the paper, but the note says “we can project the path measure onto the Schrodinger bridge” without explaining what this means computationally. The key insight is simple: at optimality the joint \((X_t, X_1)\) factorizes as bridge times marginal, so even away from optimality we can sample \((X_t, X_1)\) by first sampling \(X_1\) from the current model and then drawing \(X_t\) from the Brownian bridge. The note says this but buries it. A clean one-paragraph derivation showing why this replacement is valid (and why it is an improvement) is missing.

  3. The “theoretical justification” section (lines 177–192) is too abstract. It defines a projection operator \(\bbP(u)\) and states two properties without proof or intuition. The reader cannot verify these claims or understand why they hold. The proof in the paper (Appendix C) is actually short and instructive: it uses the factorization \(g(X_1) = \log(p^{\text{base}}_1/p^u_1)(X_1) + \log(p^u_1/\mu)(X_1)\) and the definition of the projection as a Schrödinger bridge with terminal marginal \(p^u_1\). This should be shown, at least in a details block.

  4. Missing: why the alternating scheme converges. The paper’s main theoretical result (Theorem 1 / Theorem in Appendix C.2) shows that each step of Adjoint Sampling is equivalent to a projection step plus an AM gradient step. This is the key structural insight. The note mentions it vaguely (“implicitly performs this projection at every outer step”) but does not state the result precisely or give any intuition for why it holds. At minimum: state that minimizing RAM with fixed \(X_1\) samples is equivalent to minimizing AM on the projected control, and that the fixed point is \(u^\star\).

  5. No portrait image. Every existing note opens with a historical figure. The TODO comment is still there (line 13). Pick someone (Kolmogorov, or more relevantly, one of the authors, or perhaps Rudolf Clausius for the connection to free energy / Jarzynski).
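To make issue 2 concrete: the pair-sampling step the note should spell out fits in a few lines. A minimal sketch, assuming \(X_0 = 0\) and the accumulated variance \(\nu_t = \int_0^t \sigma_s^2 \, ds\) from line 33; the function name and signature are illustrative, not the paper's code:

```python
import numpy as np

def sample_bridge_pair(sample_x1, nu_t, nu_1, rng):
    """Draw a pair (X_t, X_1): X_1 from the current model, then X_t from
    the Brownian bridge p^base(X_t | X_0 = 0, X_1).

    Gaussian conditioning of the base marginal N(0, nu_t I) on the
    terminal value gives
        X_t | X_1 ~ N((nu_t / nu_1) X_1, nu_t (nu_1 - nu_t) / nu_1 I),
    so no SDE simulation is needed for X_t.
    """
    x1 = sample_x1()                    # expensive: one model rollout
    mean = (nu_t / nu_1) * x1           # bridge mean interpolates toward X_1
    var = nu_t * (nu_1 - nu_t) / nu_1   # vanishes at t = 0 and t = 1
    x_t = mean + np.sqrt(var) * rng.standard_normal(x1.shape)
    return x_t, x1
```

The point for the note: the expensive part (rolling out the model, evaluating \(\nabla g\) at \(X_1\)) is done once per trajectory, while fresh \((t, X_t)\) pairs are then essentially free.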

Style Issues

  1. Lines 19–20 (opening paragraph). “This note covers Adjoint Sampling [@havens2025adjoint], which adapts the adjoint matching framework to the problem of sampling from Boltzmann distributions.” This is an “In this note, we will…” opening. The existing notes never do this. doob.qmd opens with “Consider a continuous time Markov process…”. HJB.qmd opens with “Consider a diffusion in \(\bbR^D\)…”. girsanov.qmd opens with a concrete Gaussian calculation. The note should dive straight into the math. Suggested rewrite: cut the entire first paragraph. Start at “The sampling problem as stochastic optimal control” and let the reader discover what the note is about from the content.

  2. Line 54. “We already know from the SOC note that \(u^\star(t,x) = \sigma_t \nabla V(t,x)\)…” This is fine as a cross-reference, but the phrasing “We already know” is slightly textbook-ish. The existing notes use phrasing like “As described in these notes on [Doob h-transforms]…” or “The [Girsanov Theorem] gives that…”.

  3. Line 79. “This is a significant simplification.” This is commentary. The existing notes would just show the simplification and let it speak for itself, perhaps adding an intuitive gloss. Suggested: delete the sentence and move the explanation inline.

  4. Line 85. “The interpretation: for each intermediate state \(X_t\) along a trajectory, regress the control \(u(t, X_t)\) onto \(-\sigma_t \nabla g(X_1)\).” This is good but should come before the formula, not after. Build intuition first.

  5. Lines 88–96 (section “Why is the adjoint constant?”). “It is worth pausing to understand…” is a verbal filler. The section itself is well done (especially the details block), but the opening sentence should be cut. Just start with “Recall from the adjoint method note that…”.

  6. Line 127. “This is the reciprocal projection, following the terminology of Schrodinger bridges.” This is describing, not computing. The existing notes would show what the projection does via a short calculation.

  7. Lines 177–178. “Why does the reciprocal projection not hurt?” This rhetorical question is fine, but the answer that follows is too abstract. The existing notes answer such questions by computing, not by listing properties.

  8. Lines 200–209 (Summary section). Bulleted summary lists are explicitly called out in the style guide as something to avoid unless summarizing key takeaways. This summary repeats what was already said. Cut it entirely; the note is short enough that it does not need a recap.

Missing Content

  1. The connection between RAM and score matching / bridge matching. The paper (Section 3 and Appendix D) shows that the RAM objective is related to Bridge Matching: if one had samples from the target, RAM with those samples would recover Bridge Matching. This connection to the Schrödinger bridge literature (already covered in shrodinger.qmd) would strengthen the note and provide a cross-reference.

  2. The explicit form of the optimal control. At optimality, \(u^\star(t,x) = \sigma_t \nabla \log h(t,x)\) where \(h(t,x) = \E[\exp(-g(X_1)) \mid X_t = x]\) under the base process. This is derived in HJB.qmd. The note should state it explicitly and connect: the value function \(V = \log h\) satisfies the HJB equation with \(f = 0\), and the adjoint \(\tilde{a}(t) = \nabla g(X_1)\) is a stochastic estimate of \(-\nabla_x V\) (this is the fixed-point condition; the sign follows from \(u^\star = \sigma_t \nabla V\) together with \(u^\star = -\sigma_t \tilde{a}\) in expectation). This would close the loop between the SOC theory and the regression formulation.

  3. The replay buffer as approximate expectation. The paper discusses (Section 3.1 and after Theorem 1) that in practice the buffer stores samples from multiple prior iterations and the RAM step is not run to convergence. This is important practically and has a smoothing interpretation. Worth a sentence or two.

  4. The time weighting \(\lambda(t) = 1/\sigma_t^2\). Line 135 mentions it but does not explain why it helps. The paper notes it de-correlates samples across time and stabilizes the loss when \(\sigma_t\) is small. Worth a brief remark.

  5. Geometric extensions. The paper’s molecular conformer experiments rely on SE(3) equivariance and the center-of-mass projection, plus torsional angle sampling on the flat torus. These are not deep theory, but the CoM projection modifying \(p^{\text{base}}\) to a singular Gaussian is a clean one-paragraph addition. The torus extension (wrapped Gaussian, sampling via \(k\)-augmentation) is also neat. At minimum, mention these exist and point to the paper.

  6. Comparison with off-policy methods. The paper makes a strong point that Adjoint Sampling is on-policy but highly efficient because it reuses samples through the reciprocal projection. This distinction (on-policy with replay buffer vs. off-policy) is conceptually important and missing from the note.
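Returning to item 2: the addition is a single display plus one connecting sentence. Something like the following, mirroring the note's notation (a sketch of the statement to add, not a derivation):

\[
h(t, x) = \E\big[\, e^{-g(X_1)} \;\big|\; X_t = x \,\big] \quad \text{(under the base process)},
\qquad
u^\star(t, x) = \sigma_t \nabla_x \log h(t, x),
\]

with the remark that \(V = \log h\) solves the HJB equation with running cost \(f = 0\).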

Unnecessary Content

  1. The “Why is the adjoint constant?” section (lines 88–105). The main text (lines 88–96) explains this three different ways: (i) the ODE argument, (ii) the physical interpretation, (iii) the direct argument in the details block. Two of these are redundant. Keep the ODE argument (one line: \(b=0\) so \(\dot{\tilde{a}} = 0\)) and the details block. Cut the physical interpretation (“Each infinitesimal perturbation at time \(t\) propagates straight to \(X_1\)…”), which adds words but not understanding beyond what the ODE already says.

  2. Lines 195–198 (Relation to existing methods). This section is thin: it says PDDS and TSM use the same formula but with different expectations, then says Adjoint Sampling is on-policy. This could be folded into a single sentence earlier in the note, perhaps when introducing the RAM loss. As a standalone section, it does not carry enough weight.

  3. The summary section (lines 200–209). As noted above, cut or drastically compress.

Line-by-Line Notes

  • Line 30, ?@eq-controlled-sde. The drift is written as \(\sigma_t u(t, X_t)\) (control multiplied by \(\sigma_t\)). This matches the paper but differs from HJB.qmd which writes \(\sigma(X) u(t,X)\) with the control inside the diffusion coefficient. Be explicit about this convention choice. Also: the existing notes use \(b(X_t) dt + \sigma(X_t) (dW_t + u \, dt)\), which puts the control as a drift perturbation of the Brownian motion. Using \(\sigma_t u \, dt + \sigma_t dW_t\) is equivalent but looks different. Consider matching the HJB convention for consistency.

  • Line 33. “\(p^\text{base}_t(x) = \normal(x; 0, \nu_t I)\) where \(\nu_t = \int_0^t \sigma_s^2 \, ds\).” This is correct but uses \(\normal\) as a density function (not a distribution). The macros define \(\normal = \mathcal{N}\), which is standard. Fine.

  • Line 38, ?@eq-sb-target. “\(\bbP^\star(\boldsymbol{X}) = \bbP(\boldsymbol{X} \mid X_1) \pi(X_1)\).” The notation \(\bbP(\boldsymbol{X} \mid X_1)\) means the path measure of the base process conditioned on the terminal state. This is a Brownian bridge measure. Worth noting explicitly that this is a path measure, not a density, to avoid confusion.

  • Line 44, ?@eq-soc-obj. The sign convention: \(g(x) = \log p_1^{\text{base}}(x) + E(x)/\tau\). In HJB.qmd, the terminal cost \(g\) appears with a negative sign in the objective (minimize \(-g\)). Here the objective has \(+g\). Make sure the sign convention is consistent. Looking more carefully: in HJB.qmd, the SOC maximizes \(\E[\int f + g]\) while here we minimize \(\E[\int \frac{1}{2}\|u\|^2 + g]\). These are equivalent with \(g_{\text{here}} = -g_{\text{HJB}} = \log p_1^{\text{base}} + E/\tau\). This is consistent but potentially confusing. A one-line remark about the sign convention relative to HJB.qmd would help.

  • Line 51, ?@eq-terminal-cost. The temperature \(\tau\) divides \(E(x)\). The Boltzmann distribution is \(\pi \propto \exp(-E/\tau)\). Then \(g(x) = \log p_1^{\text{base}}(x) - \log \pi(x) = \log p_1^{\text{base}}(x) + E(x)/\tau + \log \cZ\). But the note writes \(g(x) = \log p_1^{\text{base}}(x) + E(x)/\tau\) (dropping \(\log \cZ\)). This is correct since \(\log \cZ\) is a constant that does not affect the optimal control, but should be noted.

  • Line 62, ?@eq-lean-adjoint-general. Uses \(\nabla_x b(X_t, t)\) in the adjoint ODE. In the adjoint.qmd note, the adjoint ODE is \(\dot{\lambda} = -\nabla_x f - b_x^\top \lambda\). The term \(b_x^\top \lambda\) matches \(\tilde{a}^\top \nabla_x b\) only after transposing. The note writes \(\tilde{a}(t)^\top \nabla_x b\), which is the row-vector form. This is consistent with adjoint.qmd if \(\tilde{a}\) is a row vector. Check that the convention is stated clearly.

  • Line 68, ?@eq-am-loss. The loss has \(u(t, X_t) + \sigma_t \tilde{a}(t)\). At the optimum, \(u^\star = -\sigma_t \tilde{a}\) in expectation. The sign is correct. But note that adjoint_matching.qmd writes it as \(u + \sigma(t)^\top \tilde{a}\) with \(\sigma(t)^\top\) (transposed scalar, which is just \(\sigma(t)\)). Consistent.

  • Line 100, details block. “A perturbation \(\delta X_t\) at time \(t\) propagates to the terminal state as \(\delta X_1 = \delta X_t + O(\delta X_t)\)…” The \(O(\delta X_t)\) notation is slightly sloppy. It means the perturbation to \(X_1\) equals \(\delta X_t\) plus higher-order terms from the nonlinear feedback through \(u\). This is fine for an informal argument but could be cleaner: just say “along a fixed trajectory, \(X_1 = X_t + \int_t^1 \sigma_s dW_s + \int_t^1 \sigma_s u(s, X_s) ds\), and since the trajectory is fixed, \(\partial X_1 / \partial X_t = 1\).”

  • Line 124, ?@eq-sb-target usage. The joint factorization \(p^{u^\star}_{t,1}(X_t, X_1) = p^{\text{base}}_{t|1}(X_t \mid X_1) \pi(X_1)\) follows from ?@eq-sb-target but is stated without derivation. Worth a one-line justification: at optimality \(\bbP^{u^\star} = \bbP^\star\), and marginalizing \(\bbP^\star\) over times \(\neq t\) gives this factorization.

  • Line 135. “\(\lambda(t) = 1/\sigma_t^2\)” appears out of nowhere. Is this a weighting? Why \(1/\sigma_t^2\)? The paper explains this normalizes the loss so that the regression target has roughly constant magnitude across time. State this.

  • Line 182, ?@eq-projection. The notation \(\bbP(u)\) for the projection operator clashes with \(\bbP\) for the base path measure. The paper uses \(\mathcal{P}\) or \(\mathbb{P}\) with different formatting. Consider using \(\Pi(u)\) or \(\text{Proj}(u)\) to avoid collision.
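To make the Line 68 and Line 135 remarks concrete together, the weighted per-sample loss the note should display is small enough to sketch (hedged; `ram_loss` and its arguments are illustrative names, with `grad_g_x1` standing in for \(\nabla g(X_1)\)):

```python
import numpy as np

def ram_loss(u_pred, grad_g_x1, sigma_t):
    """Per-sample regression loss with weighting lambda(t) = 1 / sigma_t^2.

    The unweighted residual is u(t, X_t) + sigma_t * grad g(X_1); dividing
    by sigma_t^2 rescales it to u / sigma_t + grad g(X_1), whose magnitude
    no longer shrinks with sigma_t -- the stabilization the paper describes.
    """
    residual = u_pred + sigma_t * grad_g_x1
    return np.sum(residual**2) / sigma_t**2
```

The residual vanishes when \(u = -\sigma_t \nabla g(X_1)\), matching the sign noted at Line 68.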

Suggested Structure for v2

  1. Opening (no preamble, straight to the problem). Start with the controlled SDE, \(b = 0\), terminal cost \(g\). Cross-reference HJB.qmd for the SOC formulation. One paragraph.

  2. The adjoint is constant. Derive \(\tilde{a}(t) = \nabla g(X_1)\) from the adjoint ODE. Cross-reference adjoint.qmd. State the AM loss. Two paragraphs + one details block.

  3. From trajectories to pairs. Show the loss only needs \((X_t, X_1)\). Short, one paragraph.

  4. Reciprocal projection: the key trick. This is the heart of the note. Start with the intuition: at optimality, the joint factorizes as bridge times target marginal. Even away from optimality, we can replace the joint with bridge times current marginal. Derive the Brownian bridge distribution (Gaussian conditioning, as in the current details block, but promote to main text). Write the RAM loss. Explain what this buys: decouple expensive (generate \(X_1\), evaluate \(\nabla g\)) from cheap (sample \(X_t\) from bridge). Three to four paragraphs + bridge derivation.

  5. Why the projection helps. Show (or sketch) that \(J(u) \geq J(\Pi(u))\) using the SOC factorization. Put the full proof in a details block. Two paragraphs.

  6. The algorithm and its fixed-point structure. State the alternating scheme. State the convergence result: each iteration is a projection plus an AM gradient step, fixed point is \(u^\star\). One to two paragraphs.

  7. Connection to PDDS/TSM. One paragraph, folded in naturally, comparing the on-policy vs. off-policy expectations.

  8. (Optional) Extensions. Brief mention of SE(3) equivariance and torus extensions. One to two paragraphs.
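For step 5, the calculation I would expect is the chain rule for KL under disintegration into terminal marginal and bridges. Writing \(\Pi(u)\) for the projection (per the notation suggested at Line 182) and \(\mu\) for the target, and assuming \(\Pi(u)\) is realized by a Markov control so that \(J(\Pi(u))\) is defined:

\[
J(u) - \text{const} = \mathrm{KL}\big(\bbP^u \,\big\|\, \bbP^\star\big)
= \mathrm{KL}\big(\bbP^u \,\big\|\, \Pi(u)\big) + \mathrm{KL}\big(p^u_1 \,\big\|\, \mu\big),
\]

since \(\bbP^\star\) and \(\Pi(u)\) share the same base bridges and \(\Pi(u)\) has terminal marginal \(p^u_1\). The same identity applied to \(\Pi(u)\) itself has a vanishing bridge term, so \(J(\Pi(u)) - \text{const} = \mathrm{KL}(p^u_1 \| \mu)\), hence \(J(u) \geq J(\Pi(u))\), with equality iff \(\bbP^u\) already has base bridges.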
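And for step 6, even a toy loop would clarify what alternates with what. A hedged one-dimensional sketch (illustrative names, not the paper's implementation; the RAM inner loop is elided):

```python
import numpy as np

def outer_step(u, grad_g, sigmas, dt, n_samples, rng):
    """One outer iteration, sketched: simulate the current control
    on-policy to get terminal samples, then cache (X_1, grad g(X_1)).

    The RAM inner loop would then repeatedly draw bridge pairs (X_t, X_1)
    from this buffer and take regression steps on u, with no further
    simulation or energy evaluations.
    """
    x = np.zeros(n_samples)
    for k, s in enumerate(sigmas):  # Euler-Maruyama: dX = s*u dt + s dW
        t = k * dt
        x = x + s * u(t, x) * dt + s * np.sqrt(dt) * rng.standard_normal(n_samples)
    return [(x1, grad_g(x1)) for x1 in x]  # expensive part, done once
```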

No summary section. The note should be dense enough that a summary is unnecessary.