Masked Discrete Diffusion
We consider a finite state space \[ \mathcal{X} = \{M, 1,2, \ldots, V\}, \] where \(M\) is a special masked state and \(1, \ldots, V\) correspond to token values in a vocabulary of size \(V\). On the time interval \([0,T]\), we define a continuous-time Markov chain with initial distribution \(p_0\) and time-dependent infinitesimal rate matrix \(Q_t \in \mathbb{R}^{(V+1)\times(V+1)}\), such that, for \(j \neq i\), \[ \mathbb{P}(X_{t+h} = j \mid X_t = i) = Q_t(i,j) \, h + o(h), \] with diagonal entries \(Q_t(i,i) = -\sum_{j \neq i} Q_t(i,j)\) so that each row sums to zero.
Bayes’ rule implies that the time-reversal of this Markov chain is itself Markov, with infinitesimal rate matrix \(Q_t^{\star}\) satisfying \[ Q_t^{\star}(i,j) = \frac{p_t(j)}{p_t(i)} \, Q_t(j,i), \] where \(p_t\) is the marginal distribution of \(X_t\). We have: \[ \mathbb{P}(X_{t-h} = j \mid X_t = i) = Q_t^{\star}(i,j) \, h + o(h). \]
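Before specializing to masking, this identity is easy to sanity-check numerically. The sketch below uses an arbitrary three-state chain with a time-homogeneous rate matrix (both purely illustrative assumptions) and compares \(\mathbb{P}(X_{t-h} = j \mid X_t = i)\), computed from the joint law, with \(Q^{\star}(i,j)\, h\).

```python
# Numerical check of the time-reversal identity, assuming a time-homogeneous
# rate matrix Q for simplicity; all names here are illustrative.
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)

# Random 3-state rate matrix: non-negative off-diagonal entries, rows sum to zero.
Q = rng.uniform(0.1, 1.0, size=(3, 3))
np.fill_diagonal(Q, 0.0)
np.fill_diagonal(Q, -Q.sum(axis=1))

p0 = np.array([0.5, 0.3, 0.2])    # initial distribution
t, h = 1.0, 1e-4                  # time and small backward step

p_t = p0 @ expm(t * Q)            # marginal at time t
p_tmh = p0 @ expm((t - h) * Q)    # marginal at time t - h

# Left-hand side: P(X_{t-h} = j | X_t = i) via the joint law over [t-h, t].
P_h = expm(h * Q)                 # P_h[j, i] = P(X_t = i | X_{t-h} = j)
joint = p_tmh[:, None] * P_h      # joint[j, i] = P(X_{t-h} = j, X_t = i)
lhs = joint / p_t[None, :]        # lhs[j, i] = P(X_{t-h} = j | X_t = i)

# Right-hand side: Q*(i, j) h, with Q*(i, j) = p_t(j) / p_t(i) * Q(j, i).
Q_star = (p_t[None, :] / p_t[:, None]) * Q.T
rhs = Q_star * h

i, j = 0, 1
print(lhs[j, i], rhs[i, j])       # agree up to o(h)
```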
We are interested in modeling a Markov chain that progressively masks the initial value into the masked state \(M\) as time \(t\) goes from \(0\) to \(T\). Transitions are only allowed from a token \(i \in \{1,\dots,V\}\) to the masked state \(M\), and once in \(M\) the process remains there. Thus, outside the diagonal, the only nonzero entries of \(Q_t\) are \(Q_t(i,M)\) for \(i \in \{1,\ldots,V\}\). Since it will be useful later, we denote by \(\tau\) the jump time to \(M\) and we assume \(\tau < T\) almost surely, so that \(X_T = M\) with probability one.
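As a concrete sketch, one way to instantiate this masking chain is with the illustrative schedule \(\beta(t) = 1/(T-t)\) for the masking rate \(Q_t(i,M)\); this particular choice (an assumption, not the only option) gives \(\mathbb{P}(\tau < t) = t/T\) and ensures \(\tau < T\) almost surely.

```python
# A minimal sketch of the masking chain, assuming the illustrative schedule
# beta(t) = 1 / (T - t); all names here are assumptions for illustration.
import numpy as np

V, T = 4, 1.0        # vocabulary size and time horizon
MASK = 0             # index of the masked state M; token values are 1..V

def beta(t):
    # Rate of jumping to M; it blows up as t -> T so that tau < T almost surely.
    return 1.0 / (T - t)

def rate_matrix(t):
    Q = np.zeros((V + 1, V + 1))
    b = beta(t)
    for i in range(1, V + 1):
        Q[i, MASK] = b       # only allowed jump: token i -> M
        Q[i, i] = -b         # diagonal entry so that the row sums to zero
    return Q                 # row MASK stays zero: M is absorbing

def p_masked(t):
    # P(tau < t) = 1 - exp(- integral of beta from 0 to t) = t / T for this schedule.
    return t / T

print(rate_matrix(0.5))
print(p_masked(0.5))         # 0.5: a coordinate is masked by t = T/2 with probability 1/2
```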
Extension to Sequences
We are interested in modeling sequences of length \(L\) (or binary images, genomic sequences, etc.), each coordinate taking values in \(\{1,2,\ldots,V\}\). We denote by \(\overline{p}_0\) the data distribution over such sequences. For this purpose, we consider \(L\) independent copies of the above Markov chain, one per coordinate: \[ \overline{X}_t = (X_t^1, \ldots, X_t^L), \] each with rate matrix \(Q_t\). At time \(T\), the process reaches the fully masked sequence \(\overline{X}_T = (M, \ldots, M)\) with probability one. Since the jump times \(\tau_i\) of the coordinates are almost surely distinct, the infinitesimal rate matrix \(\overline{Q}_t\) of the joint process is nonzero off the diagonal only between states that differ at a single coordinate. If \(x,\widehat{x} \in \mathcal{X}^L\) differ only at coordinate \(i\), we have \[ \overline{Q}_t(x,\widehat{x}) = Q_t(x_i, \widehat{x}_i). \]
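Because the coordinates evolve independently, sampling the forward (noising) process at time \(t\) only requires masking each coordinate independently with probability \(\mathbb{P}(\tau < t)\). A minimal sketch, assuming the linear schedule \(\mathbb{P}(\tau < t) = t/T\) from above (names such as `mask_id` are illustrative):

```python
# Forward (noising) sampling for a length-L sequence, assuming the linear
# schedule P(tau < t) = t / T; names are illustrative.
import numpy as np

def forward_sample(x0, t, T=1.0, mask_id=0, rng=None):
    """Mask each coordinate of x0 independently with probability P(tau < t) = t / T."""
    rng = rng if rng is not None else np.random.default_rng()
    x0 = np.asarray(x0)
    masked = rng.uniform(size=x0.shape) < t / T
    return np.where(masked, mask_id, x0)

x0 = np.array([3, 1, 4, 2, 1, 3])   # toy data sequence with tokens in {1, ..., V}
print(forward_sample(x0, t=0.5, rng=np.random.default_rng(1)))
```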
As before, the time-reversal has infinitesimal rate matrix \(\overline{Q}_t^{\star}\) satisfying \[ \overline{Q}_t^{\star}(x,\widehat{x}) = \frac{\overline{p}_t(\widehat{x})}{\overline{p}_t(x)} \, Q_t(\widehat{x}_i, x_i), \]
where \(\overline{p}_t\) is the marginal distribution of \(\overline{X}_t\). Since \(x\) and \(\widehat{x}\) differ at coordinate \(i\), for \(\overline{Q}_t^{\star}(x,\widehat{x})\) to be non-zero, necessarily \(x_i = M\) and \(\widehat{x}_i \in \{1,\ldots,V\}\). Let \(S = \{j : x_j \neq M\}\) be the set of unmasked coordinates in \(x\). To observe configuration \(x\) at time \(t\), the \((L - |S|)\) masked coordinates must have \(\tau < t\) and the \(|S|\) unmasked ones \(\tau \ge t\): \[ \overline{p}_t(x) = \mathbb{P}(\tau < t)^{L-|S|} \, \mathbb{P}(\tau \ge t)^{|S|} \, \overline{p}_0(x_S). \]
Similarly, \[ \overline{p}_t(\widehat{x}) = \mathbb{P}(\tau < t)^{L-|S|-1} \, \mathbb{P}(\tau \ge t)^{|S|+1} \, \overline{p}_0(x_S)\, \overline{p}_0(\widehat{x}_i \mid x_S). \]
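Taking the ratio of these two expressions, the factor \(\overline{p}_0(x_S)\) and all but one of the masking probabilities cancel: \[ \frac{\overline{p}_t(\widehat{x})}{\overline{p}_t(x)} = \frac{\mathbb{P}(\tau \ge t)}{\mathbb{P}(\tau < t)} \, \overline{p}_0(\widehat{x}_i \mid x_S). \]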
This shows that the time-reversal rate matrix becomes \[ \overline{Q}_t^{\star}(x,\widehat{x}) = R(t)\, \overline{p}_0(\widehat{x}_i \mid x_S)\, Q_t(\widehat{x}_i, M), \]
with the time-dependent scalar \(R(t) = \frac{\mathbb{P}(\tau \ge t)}{\mathbb{P}(\tau < t)}\). To simulate the reverse process that progressively unmasks a fully masked sequence, one therefore only needs to model the conditional distributions \(\overline{p}_0(\widehat{x}_i \mid x_S)\) under the data distribution \(\overline{p}_0\). This is precisely the prediction task of masked language models such as BERT, which estimate token probabilities conditioned on the visible context. In conclusion, discrete diffusion models with one-way masking are mathematically almost identical to masked language models. Hence their similar behavior and performance on text generation tasks are not coincidental. This line of research was developed in a very interesting stream of papers, including (Ou et al. 2024), (Sahoo et al. 2024), and (Shi et al. 2024).
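To make this concrete, here is a minimal sketch of such a reverse sampler, again assuming the linear schedule \(\mathbb{P}(\tau < t) = t/T\) and a hypothetical `model` standing in for a learned masked predictor of \(\overline{p}_0(\cdot \mid x_S)\) (here just a uniform placeholder); all names, shapes, and the schedule are assumptions for illustration. Under this schedule, a coordinate still masked at time \(t\) unmasks during a reverse step \([t-h, t]\) with probability \(\big(\mathbb{P}(\tau < t) - \mathbb{P}(\tau < t-h)\big)/\mathbb{P}(\tau < t) = h/t\), and its value is drawn from the model's conditional.

```python
# A minimal sketch of the reverse (unmasking) sampler, assuming the linear
# schedule P(tau < t) = t / T and a hypothetical `model` standing in for a
# learned masked predictor of p0(value | x_S). All names are illustrative.
import numpy as np

MASK, V, T = 0, 4, 1.0

def model(x):
    """Placeholder for a learned masked predictor (BERT-style).

    Returns an array of shape (len(x), V) whose row i approximates
    p0(x_i = v | x_S) for v = 1..V; here it is just a uniform stand-in.
    """
    return np.full((len(x), V), 1.0 / V)

def reverse_sample(L, n_steps=100, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    x = np.full(L, MASK)                     # start from the fully masked sequence
    ts = np.linspace(T, 0.0, n_steps + 1)    # go backward from T to 0
    for t, t_next in zip(ts[:-1], ts[1:]):
        probs = model(x)                     # conditionals given the current context
        for i in np.flatnonzero(x == MASK):
            # A still-masked coordinate unmasks on [t_next, t] with probability
            # (P(tau < t) - P(tau < t_next)) / P(tau < t) = (t - t_next) / t.
            if rng.uniform() < (t - t_next) / t:
                x[i] = 1 + rng.choice(V, p=probs[i])   # draw its value in {1, ..., V}
        # (Re-evaluating `model` after each unmasking within a step only changes
        # the result at second order in the step size.)
    return x

print(reverse_sample(L=6, rng=np.random.default_rng(0)))
```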