Audit: discrete_adjoint.qmd
Critical Issues (must fix)
gKL argument order in the DAM loss may be wrong. Line 166 writes:
\(D_\text{gKL}\big(u_\theta(\cdot, X_t, t),\; r_t(\cdot, X_t) \cdot \tilde{a}_t(\cdot\,; X_1)\big)\)
With the standard Bregman/gKL convention \(D_f(a, b) = a\log(a/b) - a + b\) (which the note uses at line 156), the minimizer of \(\mathbb{E}[D_f(c, W)]\) over the first argument \(c\) satisfies \(f'(c) = \mathbb{E}[f'(W)]\), which for \(f = a\log a - a\) gives \(c = \exp(\mathbb{E}[\log W])\) (the geometric mean), not \(c = \mathbb{E}[W]\) (the arithmetic mean). To recover \(u^\star = r \cdot \mathbb{E}[\tilde{a}]\), one needs to minimize over the SECOND argument: \(\min_c \mathbb{E}[D_f(W, c)]\), which gives \(c = \mathbb{E}[W]\). The DAM paper (eq. dam) places \(u_\theta\) in the first argument too, but their stated property “\(\mathbb{E}\xi = \arg\min_c \mathbb{E} D_f(\xi \| c)\)” is about optimizing the second argument.
Fix: Either swap the argument order to \(D_\text{gKL}(r_t \cdot \tilde{a}_t, u_\theta)\), or add a sentence explaining why the fixed-point argument holds without the Bregman mean property. Note that a direct verification that \(u^\star\) is a critical point is not immediate: \(\tilde{a}_t\) does not become deterministic when conditioned on \(X_t\) at optimality, since \(X_1\) is still random. This needs careful thought. The DASBS paper avoids this issue by stating the result as a variational characterization with the correct Bregman convention. Cross-check with DASBS eq. (varphi_star_tsm), which optimizes the second argument of \(D_f\).
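To make the argument-order issue concrete, here is a quick numerical check (the sample distribution is hypothetical; `gkl` transcribes the convention \(D_f(a,b) = a\log(a/b) - a + b\) from line 156):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical positive "target" samples W.
rng = np.random.default_rng(0)
W = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)

def gkl(a, b):
    # D_f(a, b) = a log(a/b) - a + b with f(a) = a log a - a
    return a * np.log(a / b) - a + b

# Minimizing over the FIRST argument recovers the geometric mean of W ...
c1 = minimize_scalar(lambda c: gkl(c, W).mean(), bounds=(1e-2, 10.0), method="bounded").x
# ... while minimizing over the SECOND argument recovers the arithmetic mean.
c2 = minimize_scalar(lambda c: gkl(W, c).mean(), bounds=(1e-2, 10.0), method="bounded").x

geo = np.exp(np.log(W).mean())   # geometric mean of the samples
ari = W.mean()                   # arithmetic mean of the samples
```

The two minimizers coincide only when \(W\) is deterministic, so the argument order in the DAM loss genuinely matters.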
Notation clash: \(\phi_t\) vs \(\varphi_t\). The note uses \(\phi_t\) for the SB potential (lines 35, 38, 136, 192), while the continuous notes (adjoint_sampling.qmd, BMS.qmd) consistently use \(\varphi_t\) and \(\widehat{\varphi}_t\). Both source papers (DASBS and DAM) also use \(\varphi_t\). Using \(\phi_t\) in the discrete note breaks the unified notation promise from the project CLAUDE.md. Worse, \(\phi\) already appears as the convex function generating the Bregman divergence (line 159: “the convex function \(\phi(a) = a \log a - a\)”), creating ambiguity.
Fix: Replace all instances of \(\phi_t\) / \(\widehat{\phi}_t\) with \(\varphi_t\) / \(\widehat{\varphi}_t\) throughout the note.
Implicit normalization in the Doob h-transform transition kernel (line 136). The note writes:
\(p^{u^\star}_{1|t}(z \mid x) = p^r_{1|t}(z \mid x) \cdot e^{-g(z)}/\phi_t(x)\)
This is correct provided \(\phi_t(x) = \sum_z p^r_{1|t}(z \mid x)\, e^{-g(z)}\). The definition at line 38, \(\phi_t(x) = \mathbb{E}_\mathbb{P}[e^{-g(X_1)} \mid X_t = x]\), is exactly this quantity, so the formula is consistent; but the connection is left implicit and the reader must verify it. Fix: add a sentence noting that the normalizer here is the \(\phi_t\) defined at line 38.
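The consistency is easy to spot-check numerically; a minimal sketch with a random reference kernel (all names hypothetical):

```python
import numpy as np

# Hypothetical finite-state check: the tilted kernel is exactly normalized
# when phi(x) = sum_z p_ref(z|x) e^{-g(z)} = E[e^{-g(X_1)} | X_t = x].
rng = np.random.default_rng(0)
S = 6                                        # number of states
P_ref = rng.random((S, S))
P_ref /= P_ref.sum(axis=1, keepdims=True)    # reference kernel p^r_{1|t}(z | x), rows sum to 1
g = rng.normal(size=S)                       # terminal cost g(z)

phi = P_ref @ np.exp(-g)                     # phi_t(x) = E[e^{-g(X_1)} | X_t = x]
P_opt = P_ref * np.exp(-g)[None, :] / phi[:, None]   # Doob h-transform of the kernel

assert np.allclose(P_opt.sum(axis=1), 1.0)   # each row is a probability distribution
```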
Sign Convention / Notation Issues
Sign of \(g\) is context-dependent and potentially confusing across notes. In adjoint_sampling.qmd, \(g(x) = \log p_1^{\text{base}}(x) - \log \pi(x)\) (a cost, minimized). In adjoint_matching.qmd, \(g\) plays the role of \(-r\) (negative reward). In discrete_adjoint.qmd, \(g\) is introduced generically as “terminal cost” (line 32) with no explicit formula relating it to \(\pi\). Later (line 192), the modified cost \(g(x) = \log(\widehat{\phi}_1(x)/\pi(x))\) appears, matching the ASBS formulation. But the relationship between \(g\) and the sampling target \(\pi\) is never stated for the pure SOC case (lines 28–44). A reader coming from the adjoint sampling note expects \(g = \log(p_1^{\text{base}}/\pi)\); a reader coming from the adjoint matching note expects \(g = -r\). The note should state which convention it uses.
Fix: Add one sentence after line 32 linking \(g\) to the sampling problem, e.g., “For sampling from \(\pi\), the terminal cost is \(g(x) = -\log \pi(x) + \text{const}\) when the reference stationary distribution is uniform.”
\(V_t\) sign is internally consistent but opposite to the DAM paper. The note defines \(V_t = \log \phi_t\) (line 38), a “reward-to-go” convention matching the HJB notes. The DAM paper defines \(V_t(x) = -\log \sum_z p^{\text{base}}_{1|t}(z|x) e^{-g(z)}\), which is \(-\log \phi_t\), a “cost-to-go” convention. The resulting formulas are the same (\(u^\star = r \cdot e^{V(y)-V(x)}\) in the blog vs \(u^\star = r \cdot e^{-V(y)+V(x)}\) in DAM, both giving \(r \cdot \phi(y)/\phi(x)\)). This is fine, but a reader cross-checking the DAM paper will be confused by the sign flip. Add a one-line remark noting the sign difference, similar to the remark in adjoint_matching.qmd line 148.
Line 79: exponent direction in the discrete adjoint terminal condition. The note writes \(\tilde{a}_1(y; X_1) = e^{g(X_1) - g(y)}\). Cross-checking: the DAM paper (eq. lean-adj-dyn) gives terminal condition \(a_1(y) = e^{-g(y) + g(X_1)}\), which is the same. The DASBS paper gives \(\varphi_1(x_1^{d \gets \triangle})/\varphi_1(x_1)\), which at \(t = 1\) should equal \(e^{-g(\text{target}) + g(X_1)}\) via \(\varphi_1 \propto e^{-g}\). Consistent. No issue here.
Line 125: sign in the continuous identity. The note says the continuous analogue is “\(\mathbb{E}[\nabla g(X_1) \mid X_t = x] = -\nabla V(t,x)\)”. In adjoint_matching.qmd, the lean adjoint terminal condition is \(\tilde{a}(1) = -\nabla r(X_1)\) with \(r\) being the reward, so \(\tilde{a}(1) = \nabla g(X_1)\) when \(g = -r\). Then \(\mathbb{E}[\tilde{a}(t) \mid X_t = x]\) at optimality equals \(-\nabla V\) (since \(u^\star = \sigma_t \nabla V\) and \(u^\star = -\sigma_t \mathbb{E}[\tilde{a} \mid X_t]\)). This is correct. But writing “\(-\nabla V\)” may confuse readers since the note uses the reward-to-go convention where \(V\) is positive; a reader might expect \(+\nabla V\). Adding the intermediate step \(u^\star = \sigma_t \nabla V = -\sigma_t \mathbb{E}[\tilde{a}]\) would clarify.
Redundancy
The Setup section (lines 16–44) recapitulates material from the HJB notes and Doob notes. The path-space KL formula, the SOC objective, the Doob h-transform, and the optimal rate \(u^\star = r \cdot \phi(y)/\phi(x)\) are all derived in those notes. The note adds “each summand is the generalized KL divergence” (line 24) and the remark about multiplicative vs additive structure (line 44), which are new. Consider trimming: state the path-space KL and optimal rate as results with cross-references, keep only the new observations.
The DASBS section (lines 182–204) partially repeats the value function bias discussion from adjoint_matching.qmd (lines 49–77) and the SB decomposition from adjoint_sampling.qmd (lines 151–177). The bias explanation (line 184), the SB path measure decomposition (line 189), and the boundary conditions (line 192) are all covered in the continuous notes. The new content is the alternating scheme specifics (lines 195–203) and the empirical finding about memoryless schedules (line 204). Consider tightening: one sentence citing the continuous notes for the bias/decomposition, then jump to the alternating scheme.
Lines 70–75 restate the continuous adjoint from adjoint_matching.qmd and adjoint_sampling.qmd. The terminal condition \(\tilde{a}(1) = \nabla g(X_1)\) and the zero-base-drift simplification are already derived in adjoint_sampling.qmd (lines 46–58). One cross-reference sentence suffices.
Overselling / Complexity
Line 90: “This linearity is a structural bonus” overstates the advantage. The continuous lean adjoint ODE \(\dot{\tilde{a}} = -(\nabla_x b)^\top \tilde{a}\) is also linear in \(\tilde{a}\) (it is a linear ODE with time-dependent coefficients). The difference is that in the continuous case the coefficients depend on \(X_t\) (through \(\nabla_x b(X_t, t)\)), making it a stochastic linear ODE that must be solved along each trajectory. In the discrete case, \(r_t(z, y)\) does not depend on the current state \(X_t\), so the ODE is deterministic. The bonus is determinism, not linearity per se.
Fix: Replace “This linearity is a structural bonus: in the continuous case, the ODE is generally nonlinear through the drift Jacobian” with something like “Because the reference rates \(r_t(z,y)\) do not depend on the current state \(X_t\), this ODE is deterministic and can be solved offline. In the continuous case, \(\nabla_x b\) depends on \(X_t\), making the lean adjoint a stochastic ODE that must be integrated along each trajectory.”
The Dynkin’s formula details block (lines 128–146) is more complex than needed. The “more directly” derivation (lines 136–145) is clean and self-contained. The Dynkin route (lines 130–134) is vague (“produces telescoping cancellations”) and does not add clarity. Consider dropping the Dynkin paragraph and keeping only the direct computation.
Line 159: “the Bregman divergence property” is stated without proof or reference. The claim that \(\min_u \mathbb{E}[D_\text{gKL}(u, W)] = \mathbb{E}[W]\) is presented as if obvious. As noted in Critical Issue 1, this property holds when minimizing over the SECOND argument. The note optimizes the FIRST argument. This needs either a proof or a correction.
Missing Connections
No connection to masked diffusion models. The DiscreteDiff.qmd note develops CTMCs with a masking structure. The DAM paper’s practical algorithm (Section “Adaptation to masked diffusion models”) shows that the optimal rate inherits the masking structure from the base rate, reducing the modeling complexity from \(O(M^N)\) to \(O(MN)\). This is a major practical insight that the blog note omits entirely. Adding even a brief remark would help the reader see how the abstract theory connects to the masked LLM fine-tuning application.
No mention of how sampling from \(p^r_{1|t}(\cdot \mid y)\) works in practice. The closed form (line 97) requires the reference transition kernel \(p^r_{1|t}(z \mid y)\). The DASBS paper (Proposition 9) gives the closed form for the uniform rate on \(\mathbb{Z}_N^D\): a product of independent categorical distributions, one per coordinate, with explicit formulas. The note mentions “efficient computation” (line 67) but never gives the explicit kernel or explains why it factorizes across coordinates.
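The factorization can be illustrated numerically. A minimal sketch for the uniform rate on \(\mathbb{Z}_N\) with \(D = 2\) coordinates (the rates and sizes are hypothetical; this is not the DASBS Proposition 9 formula itself, only the reason it exists: single-coordinate jumps make the generator a Kronecker sum, so the kernel is a product over coordinates):

```python
import numpy as np
from scipy.linalg import expm

# Hypothetical uniform-rate CTMC on Z_N: jump to any other value at rate lam/(N-1).
N, lam, t = 4, 1.3, 0.7
Q1 = lam / (N - 1) * (np.ones((N, N)) - np.eye(N))
Q1 -= np.diag(Q1.sum(axis=1))            # row-generator convention: rows sum to 0

# On Z_N^2 with single-coordinate jumps the generator is the Kronecker sum
# Q1 (+) Q1 = Q1 (x) I + I (x) Q1; the two terms commute, so expm factorizes.
I = np.eye(N)
Q2 = np.kron(Q1, I) + np.kron(I, Q1)
P2 = expm(t * Q2)                        # joint transition kernel on Z_N^2
P1 = expm(t * Q1)                        # single-coordinate kernel on Z_N

assert np.allclose(P2, np.kron(P1, P1))  # product of independent per-coordinate kernels
```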
No connection to Nelson’s relation in the discrete setting. The BMS.qmd note develops Nelson’s relation \(u + v = \sigma_t \nabla \log p_t\) for the continuous case. The discrete analogue, relating forward and backward rates through the marginal distribution, is \(u^\star_t(y,x)/u^{\star,\text{rev}}_t(x,y) = p^\star_t(y)/p^\star_t(x)\): this is just detailed balance. Mentioning this connection would unify the discrete and continuous perspectives.
The comparison table (lines 209–218) could include the “corrector target” row for the DASBS case. Currently the corrector row says “\(\widehat{\phi}_1(y)/\widehat{\phi}_1(x)\) (ratio)” but does not distinguish between the adjoint matching and denoising matching approaches for learning it. The DASBS paper emphasizes that the denoising matching approach works without the additive property.
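A minimal numerical sketch of the detailed-balance reading, on a hypothetical reversible birth–death chain (the stationary case, where the forward/backward rate ratio equals the ratio of stationary probabilities):

```python
import numpy as np

# Hypothetical birth-death chain on {0,...,4}; such chains are reversible.
S = 5
Q = np.zeros((S, S))
birth = np.array([1.0, 0.8, 0.6, 0.4])   # rate x -> x+1
death = np.array([0.5, 0.7, 0.9, 1.1])   # rate x+1 -> x
for x in range(S - 1):
    Q[x, x + 1] = birth[x]
    Q[x + 1, x] = death[x]
Q -= np.diag(Q.sum(axis=1))              # row-generator convention

# Stationary distribution pi: left null vector of Q.
w, V = np.linalg.eig(Q.T)
pi = np.real(V[:, np.argmin(np.abs(w))])
pi /= pi.sum()

# Detailed balance: pi(x) q(x, x+1) = pi(x+1) q(x+1, x),
# i.e. the forward/backward rate ratio equals pi(x+1)/pi(x).
for x in range(S - 1):
    assert np.isclose(pi[x] * Q[x, x + 1], pi[x + 1] * Q[x + 1, x])
```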
Missing: why generalized KL is a Bregman divergence. Line 159 states this but does not show it. The connection is: \(D_\text{gKL}(u, w) = \phi(u) - \phi(w) - (u - w)\phi'(w)\) with \(\phi(a) = a\log a - a\), so \(\phi'(a) = \log a\). One line of algebra would make this concrete.
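For reference, the missing line of algebra, using \(\phi(a) = a\log a - a\) and \(\phi'(a) = \log a\):

```latex
D_\text{gKL}(u, w)
  = \phi(u) - \phi(w) - (u - w)\,\phi'(w)
  = (u\log u - u) - (w\log w - w) - (u - w)\log w
  = u\log\frac{u}{w} - u + w .
```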
Structural Feedback
The note lacks a historical figure portrait at the top. All other notes in the series have one (or a TODO placeholder for one). Consider adding a portrait of e.g., Andrey Kolmogorov (for the backward equation used in the verification) or Joseph Doob (for the h-transform that drives the whole construction).
The “Additive noise on cyclic groups” section could be merged with or placed after the adjoint estimator section. Currently the flow is: Setup -> Additive noise -> Adjoint estimator -> Matching loss -> DASBS. The additive noise section (lines 47–68) is a technical ingredient used only in the matching loss section (lines 171–179). Moving it there would tighten the narrative: Setup -> Adjoint estimator -> Matching loss (which introduces the additive property when needed) -> DASBS.
The DASBS section (lines 182–204) is thin relative to the source paper. The alternating scheme gets 10 lines. The convergence guarantee (Theorem 4 in the DASBS paper, analogous to IPFP convergence) is mentioned as “convergence follows from the same contraction argument” without stating the result. Either flesh this out or explicitly flag it as “see the DASBS paper for the convergence proof.”
No bibliography entries. Lines 221–233 contain commented-out BibTeX with placeholder arXiv IDs (arXiv:2506.xxxxx). These need to be filled in and added to ref.bib before the note can be published.
Minor / Stylistic
Line 51: “each state \(x = (x^1, \ldots, x^D)\) with \(x^d \in \{0, 1, \ldots, N-1\}\)” is a sentence fragment. Should read “each state is \(x = (x^1, \ldots, x^D)\) with \(x^d \in \{0, 1, \ldots, N-1\}\)” or similar.
Line 56: Hamming distance restriction is introduced without motivation. Why restrict to \(d_H(y,x) = 1\)? This is standard in discrete diffusion (one-coordinate-at-a-time jumps), but a reader unfamiliar with the convention would benefit from a brief justification. Even “restricting to single-coordinate jumps keeps the rate matrix sparse and the computation per step \(O(DN)\)” would suffice.
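The sparsity claim is easy to quantify; a hypothetical count for \(\mathbb{Z}_N^D\):

```python
# Hypothetical sizes: with single-coordinate (Hamming-1) jumps, each state has
# D * (N - 1) neighbors instead of N**D - 1, so a step costs O(DN) not O(N**D).
N, D = 5, 8
dense_neighbors = N**D - 1       # unrestricted jumps: every other state is reachable
sparse_neighbors = D * (N - 1)   # Hamming-1 jumps: change one coordinate at a time
```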
Line 159: \(\phi\) is overloaded. The convex function generating the Bregman divergence is called \(\phi(a) = a \log a - a\), while the SB potential is also called \(\phi_t\). Rename one. Since the SB potential should be \(\varphi_t\) anyway (Issue 2), fixing Issue 2 resolves this.
Line 169: \(\bar{u} = \texttt{sg}(u_\theta)\). The notation \(\texttt{sg}\) appears without definition in this note (it is defined in adjoint_matching.qmd as stop-gradient). Add “(stop-gradient)” on first use, as is done in adjoint_sampling.qmd.
Line 87: the sum \(\sum_z r_t(z,y)\) should specify \(z \neq y\) or clarify that \(r_t\) includes the diagonal. The rate matrix convention matters: if \(r_t(z, y)\) for \(z \neq y\) is the off-diagonal rate, then the sum should be \(\sum_{z \neq y}\). If \(r_t\) is the full generator (with negative diagonal), then \(\sum_z\) is correct. The note uses off-diagonal rates at line 16 (“the infinitesimal probability per unit time of jumping from state \(x\) to state \(y \neq x\)”), so line 87 should sum over \(z \neq y\) for consistency, or clarify that the full rate matrix (including diagonal) is used here. The verification block (line 109) uses \(\sum_w r_t(w,y)\) with \(w\) ranging over all states (including \(w = y\)), relying on the full generator convention. This inconsistency in convention should be resolved.
Fix: Either (a) explicitly state at line 87 that \(r_t(z,y)\) here denotes the full generator (with \(r_t(y,y) = -\sum_{z \neq y} r_t(z,y)\)), or (b) write \(\sum_{z \neq y}\) and add the diagonal term explicitly.
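A tiny numerical sketch of option (a), with hypothetical rates (column \(y\) is the “from” state, matching the \(r_t(z, y)\) index order):

```python
import numpy as np

# Hypothetical off-diagonal rates r(z, y): column y is the "from" state.
rng = np.random.default_rng(1)
S = 5
R = rng.random((S, S))
np.fill_diagonal(R, 0.0)

# Full-generator convention: the diagonal absorbs the total outflow,
# r(y, y) = -sum_{z != y} r(z, y), so every column of Q sums to zero.
Q = R - np.diag(R.sum(axis=0))

assert np.allclose(Q.sum(axis=0), 0.0)      # probability conservation
y = 2
assert np.isclose(R[:, y].sum(), -Q[y, y])  # off-diagonal sum = minus the diagonal
```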
The comparison table (line 214) writes “Cyclic group: \(p_{1|t}(y\mid x) = q_t(y - x \bmod N)\)” but the state space is \(\mathbb{Z}_N^D\), not \(\mathbb{Z}_N\). The modular arithmetic is component-wise. Minor but worth fixing: “\(q_t(y - x)\)” with the understanding that subtraction is in \(\mathbb{Z}_N^D\).
Missing cross-reference to DiscreteDiff.qmd in the summary table. The note mentions the discrete diffusion notes at line 16 but the summary table (lines 207–218) does not link back to it.