Masked Discrete Diffusion

DDPM, score

Published: 21 10 2025
Modified: 21 10 2025

Masked Discrete Diffusion

We consider a finite state space \[ \mathcal{X} = \{M, 1,2, \ldots, V\}, \] where \(M\) is a special masked state and \(1, \ldots, V\) correspond to token values in a vocabulary of size \(V\). On the time interval \([0,T]\), we define a continuous-time Markov chain with initial distribution \(p_0\) and time-dependent infinitesimal rate matrix \(Q_t \in \mathbb{R}^{(V+1)\times(V+1)}\), such that, for \(j \neq i\), \[ \mathbb{P}(X_{t+h} = j \mid X_t = i) = Q_t(i,j) \, h + o(h), \] with diagonal entries \(Q_t(i,i) = -\sum_{j \neq i} Q_t(i,j)\), so that each row sums to zero.

Bayes’ rule implies that the time-reversal of this Markov chain is itself Markov, with infinitesimal rate matrix \(Q_t^{\star}\) satisfying, for \(j \neq i\), \[ Q_t^{\star}(i,j) = \frac{p_t(j)}{p_t(i)} \, Q_t(j,i), \] where \(p_t\) is the marginal distribution of \(X_t\). Equivalently, \[ \mathbb{P}(X_{t-h} = j \mid X_t = i) = Q_t^{\star}(i,j) \, h + o(h). \]
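As a quick numerical sanity check of this reversal identity, here is a minimal sketch on a small time-homogeneous chain; the 3-state generator, time point, and step size below are made up for illustration.

```python
# Sanity check of Q*(i,j) = p_t(j)/p_t(i) * Q(j,i) on a toy 3-state,
# time-homogeneous chain (the generator and numbers are illustrative only).
import numpy as np
from scipy.linalg import expm

Q = np.array([[-1.0,  0.7, 0.3],
              [ 0.2, -0.5, 0.3],
              [ 0.0,  0.0, 0.0]])   # rows sum to zero; state 2 is absorbing
p0 = np.array([0.5, 0.5, 0.0])      # initial distribution
t, h = 0.8, 1e-4

p_t   = p0 @ expm(t * Q)            # marginal at time t
p_tmh = p0 @ expm((t - h) * Q)      # marginal at time t - h

i, j = 2, 0                         # a reverse jump out of the absorbing state
joint    = p_tmh[j] * expm(h * Q)[j, i]   # P(X_{t-h} = j, X_t = i)
backward = joint / p_t[i]                 # P(X_{t-h} = j | X_t = i)

print(backward / h, p_t[j] / p_t[i] * Q[j, i])   # the two values should nearly agree
```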

We are interested in modeling a Markov chain that progressively masks the initial value into the masked state \(M\) as time \(t\) goes from \(0\) to \(T\). Transitions are only allowed from any token \(i \in \{1,\dots,V\}\) to the masked state \(M\), and once in \(M\) the process remains there. Thus, outside the diagonal, the only nonzero entries of \(Q_t\) are \(Q_t(i,M)\). For later use, we denote by \(\tau\) the jump time to \(M\), and we assume \(\tau < T\) almost surely, so that \(X_T = M\) with probability one.
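To make the forward dynamics concrete, here is a minimal simulation sketch. It assumes the common linear schedule \(\mathbb{P}(\tau < t) = t/T\), i.e. \(Q_t(i,M) = 1/(T-t)\); this schedule is one possible choice that makes \(\tau < T\) almost surely, not one prescribed above.

```python
# Forward masking of a single token, assuming the linear schedule
# P(tau < t) = t / T, i.e. masking rate Q_t(i, M) = 1 / (T - t).
import numpy as np

MASK = 0        # index used for the masked state M; token values are 1..V
T = 1.0

def forward_mask(x0: int, t: float, rng: np.random.Generator) -> int:
    """Sample X_t given X_0 = x0: the token jumps to MASK at time tau ~ Uniform(0, T)."""
    tau = T * rng.uniform()          # inverse-CDF sample of the jump time
    return x0 if t < tau else MASK

rng = np.random.default_rng(0)
print([forward_mask(5, 0.3, rng) for _ in range(10)])   # about 70% of draws remain 5
```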

Extension to Sequences

We are interested in modeling sequences of length \(L\) (or binary images, genomic sequences, etc.), each coordinate taking values in \(\{1,2,\ldots,V\}\). We denote by \(\overline{p}_0\) the data distribution over such sequences. For this purpose, we consider \(L\) independent copies of the above Markov chain, one per coordinate: \[ \overline{X}_t = (X_t^1, \ldots, X_t^L), \] each with rate matrix \(Q_t\). At time \(T\), the process reaches the fully masked sequence \(\overline{X}_T = (M, \ldots, M)\) with probability one. Since the jump times \(\tau^1, \ldots, \tau^L\) of the coordinates are almost surely distinct, the infinitesimal rate matrix \(\overline{Q}_t\) of the joint process is nonzero only when a single coordinate changes. If \(x,\widehat{x} \in \mathcal{X}^L\) differ at a single coordinate \(i\), we have \[ \overline{Q}_t(x,\widehat{x}) = Q_t(x_i, \widehat{x}_i). \]
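Continuing the single-token sketch above, the sequence version simply masks each coordinate independently; at time \(t\), each position is masked with probability \(\mathbb{P}(\tau < t)\) (here \(t/T\) under the assumed linear schedule).

```python
# Coordinate-wise forward masking of a sequence, reusing MASK, T and rng
# from the single-token sketch above (linear schedule assumed).
def forward_mask_sequence(x0, t, rng):
    x0 = np.asarray(x0)
    jumped = rng.uniform(size=x0.shape) < t / T    # coordinates whose tau < t
    return np.where(jumped, MASK, x0)

print(forward_mask_sequence([3, 1, 4, 1, 5, 9, 2, 6], 0.5, rng))  # roughly half masked
```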

As before, the time-reversal has infinitesimal rate matrix \(\overline{Q}_t^{\star}\) satisfying \[ \overline{Q}_t^{\star}(x,\widehat{x}) = \frac{\overline{p}_t(\widehat{x})}{\overline{p}_t(x)} \, Q_t(\widehat{x}_i, x_i), \]

where \(\overline{p}_t\) is the marginal distribution of \(\overline{X}_t\). Since \(x\) and \(\widehat{x}\) differ at coordinate \(i\), for \(\overline{Q}_t^{\star}(x,\widehat{x})\) to be non-zero, necessarily \(x_i = M\) and \(\widehat{x}_i \in \{1,\ldots,V\}\). Let \(S = \{j : x_j \neq M\}\) be the set of unmasked coordinates in \(x\). To observe configuration \(x\) at time \(t\), the jump times of the \((L - |S|)\) masked coordinates must satisfy \(\tau < t\), and those of the \(|S|\) unmasked ones \(\tau \ge t\), so that \[ \overline{p}_t(x) = \mathbb{P}(\tau < t)^{L-|S|} \, \mathbb{P}(\tau \ge t)^{|S|} \, \overline{p}_0(x_S), \] where \(\overline{p}_0(x_S)\) denotes the marginal of \(\overline{p}_0\) on the coordinates in \(S\).

Similarly, \[ \overline{p}_t(\widehat{x}) = \mathbb{P}(\tau < t)^{L-|S|-1} \, \mathbb{P}(\tau \ge t)^{|S|+1} \, \overline{p}_0(x_S)\, \overline{p}_0(\widehat{x}_i \mid x_S). \]
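As a quick check of the first factorization, the sketch below compares it against a Monte Carlo estimate for a made-up data distribution over length-2 sequences with \(V = 2\), still under the assumed linear schedule.

```python
# Monte Carlo check of p_t(x) = P(tau<t)^{L-|S|} P(tau>=t)^{|S|} p0(x_S)
# for a toy joint distribution p0 over length-2 sequences (illustrative numbers).
import numpy as np

rng = np.random.default_rng(1)
T, t, MASK = 1.0, 0.4, 0
p0 = {(1, 1): 0.4, (1, 2): 0.1, (2, 1): 0.2, (2, 2): 0.3}
keys, weights = list(p0), [p0[k] for k in p0]

def forward(x, t):
    # each coordinate jumps to MASK before time t with probability t / T
    return tuple(MASK if rng.uniform() < t / T else xi for xi in x)

target = (MASK, 1)                    # first coordinate masked, second equal to 1
n = 200_000
hits = sum(forward(keys[rng.choice(len(keys), p=weights)], t) == target for _ in range(n))

closed_form = (t / T) * (1 - t / T) * (p0[(1, 1)] + p0[(2, 1)])  # marginal p0(x_2 = 1) = 0.6
print(hits / n, closed_form)          # should agree up to Monte Carlo error
```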

Taking the ratio of these two expressions, the common factor \(\overline{p}_0(x_S)\) cancels and, since \(x_i = M\), the time-reversal rate matrix becomes \[ \overline{Q}_t^{\star}(x,\widehat{x}) = R(t)\, \overline{p}_0(\widehat{x}_i \mid x_S)\, Q_t(\widehat{x}_i, M), \]

with the time-dependent scalar \(R(t) = \frac{\mathbb{P}(\tau \ge t)}{\mathbb{P}(\tau < t)}\). To simulate the reverse process that progressively unmasks a fully masked sequence, one therefore only needs to model the conditional distribution \(\overline{p}_0(\widehat{x}_i \mid x_S)\) of the data distribution \(\overline{p}_0\). This is precisely the prediction task of masked language models such as BERT, which estimate token probabilities conditioned on the visible context. In conclusion, discrete diffusion models with one-way masking are mathematically almost identical to masked language models, so their similar behavior and performance on text generation tasks are not coincidental. This line of research was developed in a very interesting stream of papers, including (Ou et al. 2024), (Sahoo et al. 2024) and (Shi et al. 2024).
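To illustrate the resulting sampler, here is a minimal sketch of the reverse (unmasking) process. It assumes a BERT-style predictor `predict_logits(x)` returning an \((L, V)\) array of logits approximating \(\overline{p}_0(\widehat{x}_i \mid x_S)\) at every position; that function, its signature, and the choice to reveal exactly one coordinate per step are illustrative assumptions rather than something specified above.

```python
# Reverse-process sketch: unmask a fully masked sequence one coordinate at a
# time, each revealed token drawn from the model's estimate of p0(x_i | x_S).
# `predict_logits` is a placeholder for a trained masked language model.
import numpy as np

MASK, V, L = 0, 1000, 16   # masked-state index, vocabulary size, sequence length

def reverse_sample(predict_logits, rng):
    x = np.full(L, MASK)               # start from the fully masked sequence X_T
    # With a shared schedule across coordinates, every masked position has the
    # same unmasking rate, so positions are revealed in a uniformly random order.
    for i in rng.permutation(L):
        logits = predict_logits(x)[i]          # logits over the V token values at position i
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        x[i] = 1 + rng.choice(V, p=probs)      # token values are 1..V
    return x
```

In practice, `predict_logits` would wrap the forward pass of a trained masked-prediction network on the partially unmasked sequence, i.e. an estimate of the conditional \(\overline{p}_0(\widehat{x}_i \mid x_S)\) discussed above.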

References

Ou, Jingyang, Shen Nie, Kaiwen Xue, Fengqi Zhu, Jiacheng Sun, Zhenguo Li, and Chongxuan Li. 2024. “Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data.” arXiv Preprint arXiv:2406.03736.
Sahoo, Subham, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin Chiu, Alexander Rush, and Volodymyr Kuleshov. 2024. “Simple and Effective Masked Diffusion Language Models.” Advances in Neural Information Processing Systems 37: 130136–84.
Shi, Jiaxin, Kehang Han, Zhe Wang, Arnaud Doucet, and Michalis Titsias. 2024. “Simplified and Generalized Masked Diffusion for Discrete Data.” Advances in Neural Information Processing Systems 37: 103131–67.