Tags: diffusion, replica, generative

Published: 05 12 2025 · Modified: 05 12 2025

Why Diffusion Models Generalize Before They Memorize

A Spectral Story via Random Features and Random Matrix Theory

Diffusion models trained by score matching behave in a way that is empirically very robust but theoretically nontrivial: they reach good generation quality early, and only much later start to memorize individual training samples. The goal of this note is to explain this phenomenon purely from training dynamics, within a random-features model where all relevant quantities can be analyzed analytically.

We will see that the key mechanism is spectral: the training operator has two distinct eigenvalue scales, corresponding to “population” and “sample-specific” directions. Gradient descent learns the population modes quickly, and the sample-specific modes only on a time scale that grows with the dataset size \(n\). Memorization is therefore dynamically delayed.


1. Score matching at fixed diffusion time

We begin by specifying the data and the score-matching objective.

1.1 Data and diffusion

Let \(x \in \mathbb{R}^d\) be a data vector drawn from a distribution \[ x \sim P_x \quad\text{with}\quad \mathbb{E}[x] = 0, \qquad \mathbb{E}[x x^\top] = \Sigma. \]

The forward diffusion process is defined by \[ x(t,\xi) = e^{-t} x + \sqrt{\Delta_t}\,\xi, \qquad \Delta_t = 1-e^{-2t}, \qquad \xi \sim \mathcal{N}(0,I_d). \tag{1}\]

The marginal distribution of \(x(t,\xi)\) is a smoothed version of \(P_x\); its density is denoted \(P_t\). The (Stein) score at time \(t\) is \[ s^\star_t(x) = \nabla_x \log P_t(x). \]

A parametric model \(s_\theta(x,t)\) is trained to approximate \(s^\star_t(x)\) via denoising score matching. At fixed \(t\), the population objective is \[ \mathcal{L}_t(\theta) = \frac{1}{d} \mathbb{E}_{x\sim P_x} \mathbb{E}_{\xi\sim\mathcal{N}(0,I_d)} \Big\| \sqrt{\Delta_t}\, s_\theta(x(t,\xi),t) + \xi \Big\|^2. \tag{2}\]

Given a training set \(\{x_\nu\}_{\nu=1}^n\), the empirical loss is \[ L_{\mathrm{train}}(\theta) = \frac{1}{dn} \sum_{\nu=1}^n \mathbb{E}_\xi \Big\| \sqrt{\Delta_t}\, s_\theta(x_\nu(t,\xi),t) + \xi \Big\|^2. \tag{3}\]
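
To make these objects concrete, here is a minimal NumPy sketch of the forward process in Equation 1 and of a Monte Carlo estimate of the empirical loss in Equation 3, for a generic score model passed in as a callable. The function names, shapes, and sample counts are illustrative choices, not from the original analysis.

```python
import numpy as np

def diffuse(x, t, rng):
    """Forward process of Eq. (1): x(t, xi) = e^{-t} x + sqrt(1 - e^{-2t}) xi."""
    delta_t = 1.0 - np.exp(-2.0 * t)
    xi = rng.standard_normal(x.shape)
    return np.exp(-t) * x + np.sqrt(delta_t) * xi, xi, delta_t

def empirical_dsm_loss(score_fn, X_train, t, n_noise=64, rng=None):
    """Monte Carlo estimate of the empirical loss of Eq. (3).

    score_fn : callable mapping a (d,) noisy input to a (d,) score estimate
    X_train  : (n, d) array of training samples x_nu
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    n, d = X_train.shape
    total = 0.0
    for x_nu in X_train:
        for _ in range(n_noise):
            x_t, xi, delta_t = diffuse(x_nu, t, rng)
            resid = np.sqrt(delta_t) * score_fn(x_t) + xi
            total += np.sum(resid ** 2)
    return total / (d * n * n_noise)
```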

We are interested in the training dynamics of \(L_{\mathrm{train}}(\theta)\) for a particular model class.


2. Random-features score model

Directly analyzing a deep U-Net is out of reach. We therefore simplify the architecture while preserving three key ingredients:

  1. Score matching at a fixed diffusion time.
  2. Overparameterization.
  3. Nonlinearity in the data.

This leads to the random-features score model.

2.1 Pre-activations and features

Let \[ W \in \mathbb{R}^{p\times d} \] have i.i.d. entries \(W_{ij} \sim \mathcal{N}(0,1)\). For an input \(x\in\mathbb{R}^d\) we define:

  • The pre-activation vector \[ h(x) = \frac{W x}{\sqrt{d}} \in \mathbb{R}^p, \tag{4}\]
  • The feature vector \[ \varphi(x) = \sigma\big(h(x)\big) \in \mathbb{R}^p, \tag{5}\] where \(\sigma:\mathbb{R}\to\mathbb{R}\) is a fixed nonlinearity applied coordinate-wise.

For a noisy training sample \(x_\nu(t,\xi)\), we will write \[ \varphi_\nu(\xi) = \varphi\big(x_\nu(t,\xi)\big) = \sigma\!\left(\frac{W x_\nu(t,\xi)}{\sqrt{d}}\right). \tag{6}\]

2.2 Random-features score network

The score model is linear in the feature space: \[ s_A(x) = \frac{A}{\sqrt{p}} \,\varphi(x), \qquad A \in \mathbb{R}^{d\times p}. \tag{7}\]

Here:

  • \(W\) is random and fixed (first layer),
  • \(A\) is trainable (second layer),
  • The overall model remains nonlinear in \(x\) through \(\sigma\).

Plugging Equation 7 into the empirical loss Equation 3 at fixed time \(t\) yields \[ L_{\mathrm{train}}(A) = \frac{1}{dn} \sum_{\nu=1}^n \mathbb{E}_\xi \Big\| \sqrt{\Delta_t}\, \frac{A}{\sqrt{p}} \,\varphi_\nu(\xi) + \xi \Big\|^2. \tag{8}\]

At this point, the loss is quadratic in \(A\), which is the main structural simplification: gradient descent dynamics will be linear in \(A\).
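
As a concrete illustration, here is a minimal sketch of the random-features score model of Equations 4 to 7, again in NumPy; the class name and the choice \(\sigma = \tanh\) are arbitrary. An instance can be passed as `score_fn` to the `empirical_dsm_loss` sketch above to evaluate Equation 8.

```python
import numpy as np

class RandomFeatureScore:
    """Random-features score model s_A(x) = A sigma(W x / sqrt(d)) / sqrt(p), Eqs. (4)-(7)."""

    def __init__(self, d, p, sigma=np.tanh, rng=None):
        rng = rng if rng is not None else np.random.default_rng(0)
        self.d, self.p, self.sigma = d, p, sigma
        self.W = rng.standard_normal((p, d))  # fixed random first layer
        self.A = np.zeros((d, p))             # trainable readout

    def features(self, x):
        # phi(x) = sigma(W x / sqrt(d)), Eq. (5)
        return self.sigma(self.W @ x / np.sqrt(self.d))

    def __call__(self, x):
        # s_A(x) = A phi(x) / sqrt(p), Eq. (7)
        return self.A @ self.features(x) / np.sqrt(self.p)
```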


3. From the loss to a linear ODE in \(A\)

We briefly compute the gradient of Equation 8 with respect to \(A\).

Write, for each \((\nu,\xi)\), \[ \ell_{\nu,\xi}(A) = \frac{1}{d} \Big\| \sqrt{\Delta_t}\, \frac{A}{\sqrt{p}} \,\varphi_\nu(\xi) + \xi \Big\|^2. \]

Expanding the square, \[ \ell_{\nu,\xi}(A) = \frac{\Delta_t}{dp}\, \varphi_\nu(\xi)^\top A^\top A \,\varphi_\nu(\xi) + \frac{2\sqrt{\Delta_t}}{d\sqrt{p}}\, \xi^\top A \,\varphi_\nu(\xi) + \frac{1}{d}\|\xi\|^2. \]

The gradient with respect to \(A\) is \[ \nabla_A \ell_{\nu,\xi}(A) = \frac{2\Delta_t}{dp}\, A \,\varphi_\nu(\xi)\varphi_\nu(\xi)^\top + \frac{2\sqrt{\Delta_t}}{d\sqrt{p}}\, \xi \,\varphi_\nu(\xi)^\top. \]

Averaging over \(\nu\) and \(\xi\) yields \[ \nabla_A L_{\mathrm{train}}(A) = \frac{2\Delta_t}{dp}\, A U + \frac{2\sqrt{\Delta_t}}{d\sqrt{p}}\, V^\top, \tag{9}\] where we have defined (a Monte Carlo sketch of both objects appears right after this list)

  • the Gram matrix \[ U = \frac{1}{n} \sum_{\nu=1}^n \mathbb{E}_\xi \big[ \varphi_\nu(\xi)\varphi_\nu(\xi)^\top \big] \in \mathbb{R}^{p\times p}, \tag{10}\]
  • and the “noise-feature” coupling \[ V = \frac{1}{n} \sum_{\nu=1}^n \mathbb{E}_\xi \big[ \varphi_\nu(\xi)\xi^\top \big] \in \mathbb{R}^{p\times d}. \tag{11}\]
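
Both objects can be estimated by plain Monte Carlo over the diffusion noise; here is a minimal sketch, assuming a model exposing the `features` method from the earlier `RandomFeatureScore` sketch (the sample counts are arbitrary choices).

```python
import numpy as np

def estimate_U_V(model, X_train, t, n_noise=200, rng=None):
    """Monte Carlo estimates of the Gram matrix U (Eq. 10) and the
    noise-feature coupling V (Eq. 11), for a model exposing .features()."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n, d = X_train.shape
    delta_t = 1.0 - np.exp(-2.0 * t)
    U = np.zeros((model.p, model.p))
    V = np.zeros((model.p, d))
    for x_nu in X_train:
        for _ in range(n_noise):
            xi = rng.standard_normal(d)
            phi = model.features(np.exp(-t) * x_nu + np.sqrt(delta_t) * xi)
            U += np.outer(phi, phi)
            V += np.outer(phi, xi)
    return U / (n * n_noise), V / (n * n_noise)
```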

The (full-batch) gradient descent update is then \[ A^{(k+1)} = A^{(k)} - \eta \,\nabla_A L_{\mathrm{train}}(A^{(k)}). \]

In the small learning-rate regime, we take a continuum limit with the rescaled time \[ \tau = \frac{k\eta}{d^2}, \] so that \(\frac{d}{d\tau}A(\tau) = -\,d^2\,\nabla_A L_{\mathrm{train}}(A(\tau))\). With Equation 9, this gives the ODE \[ \frac{d}{d\tau} A(\tau) = - \frac{2\Delta_t}{\psi_p}\,A(\tau)\,U - \frac{2\sqrt{\Delta_t}\,d}{\sqrt{p}}\,V^\top, \qquad \psi_p = \frac{p}{d}. \tag{12}\]

This is a linear ODE in \(A\), driven by the random matrix \(U\).


4. Training dynamics in the eigenbasis of \(U\)

To understand the time scales of learning, we diagonalize the Gram matrix: \[ U = Q \Lambda Q^\top, \] where \[ \Lambda = \operatorname{diag}(\lambda_1,\dots,\lambda_p). \]

Let \[ B(\tau) = A(\tau) Q \in \mathbb{R}^{d\times p}, \qquad \tilde V^\top = V^\top Q. \]

Then Equation 12 becomes \[ \frac{d}{d\tau} B(\tau) = - \frac{2\Delta_t}{\psi_p}\,B(\tau)\,\Lambda - \frac{2\sqrt{\Delta_t}\,d}{\sqrt{p}}\, \tilde V^\top. \tag{13}\]

For each column \(b_i(\tau) \in \mathbb{R}^d\) of \(B(\tau)\), we obtain a decoupled linear ODE: \[ \frac{d}{d\tau} b_i(\tau) = - \frac{2\Delta_t\lambda_i}{\psi_p}\, b_i(\tau) - \frac{2\sqrt{\Delta_t}\,d}{\sqrt{p}}\, \tilde v_i, \] where \(\tilde v_i\) is the \(i\)-th column of \(\tilde V^\top\). This is solved explicitly as \[ b_i(\tau) = e^{-(2\Delta_t\lambda_i/\psi_p)\,\tau}\, b_i(0) - \frac{2\sqrt{\Delta_t}\,d}{\sqrt{p}} \int_0^\tau e^{-(2\Delta_t\lambda_i/\psi_p)\,(\tau-s)}\, \tilde v_i\, ds. \]

Ignoring the forcing term for the moment (or thinking about relaxation to equilibrium), the homogeneous solution decays as \[ \|b_i(\tau)\| \approx \|b_i(0)\| \exp\!\Big(-\frac{2\Delta_t\lambda_i}{\psi_p}\,\tau\Big). \]

This suggests defining the relaxation time associated with eigenvalue \(\lambda\): \[ \tau(\lambda) = \frac{\psi_p}{2\Delta_t\lambda}. \tag{14}\]

Modes with larger \(\lambda\) are learned faster; modes with small \(\lambda\) take longer to relax. Therefore:

The time scale at which different directions in parameter space (and thus in function space) are learned is determined by the eigenvalues of the Gram matrix \(U\).

Our task reduces to understanding the spectrum of \(U\) in the high-dimensional limit.
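
Before turning to the spectrum itself, here is a small numerical illustration of Equation 14: diagonalize an empirical Gram matrix (for instance one returned by the `estimate_U_V` sketch above) and read off one relaxation time per eigenvalue. This is only a sketch; the eigenvalue cutoff is an arbitrary numerical choice.

```python
import numpy as np

def relaxation_times(U, d, delta_t):
    """Eigenvalues of U and the relaxation times tau(lambda) = psi_p / (2 delta_t lambda), Eq. (14)."""
    p = U.shape[0]
    psi_p = p / d
    lam = np.linalg.eigvalsh(U)
    lam = lam[lam > 1e-12]                 # discard numerically zero modes
    return lam, psi_p / (2.0 * delta_t * lam)

# Example (with U from the estimate_U_V sketch above):
# lam, tau = relaxation_times(U, d=X_train.shape[1], delta_t=1 - np.exp(-2 * t))
# print("slowest / fastest relaxation time ratio:", tau.max() / tau.min())
```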


5. The spectral problem: what do we actually need?

Let \[ \lambda_1,\dots,\lambda_p \] be the eigenvalues of \(U\). The empirical spectral density is \[ \rho_U^{(p)}(\lambda) = \frac{1}{p} \sum_{i=1}^p \delta(\lambda - \lambda_i). \]

We are interested in its limit \(\rho_U(\lambda)\) as \(d,p,n\to\infty\) with the ratios \(\psi_p = p/d\) and \(\psi_n = n/d\) held fixed.

More specifically, we need:

  1. The location and scale of the bulk (or bulks) of \(\rho_U(\lambda)\).
  2. The behavior of the smallest nonzero eigenvalues \(\lambda_{\min}\), since Equation 14 shows that \[ \tau_{\mathrm{slow}} \sim \frac{\psi_p}{\Delta_t \lambda_{\min}}. \]

The central question thus becomes:

What is the limiting spectral distribution of \(U\), and how do its bulk edges scale with \(\psi_p = p/d\) and \(\psi_n = n/d\)?


6. The resolvent (Stieltjes transform) of \(U\)

Random matrix theory typically approaches such questions via the resolvent or Stieltjes transform: \[ q(z) = \frac{1}{p}\operatorname{Tr}(U - zI)^{-1} = \frac{1}{p}\sum_{i=1}^p \frac{1}{\lambda_i - z}, \qquad z \in \mathbb{C}\setminus \mathbb{R}. \tag{15}\]

The spectral density is recovered from \(q(z)\) via \[ \rho_U(\lambda) = \frac{1}{\pi} \lim_{\varepsilon\to 0^+} \operatorname{Im}\, q(\lambda + i\varepsilon). \tag{16}\]
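
For a finite-size \(U\), both Equation 15 and a smoothed version of Equation 16 are easy to evaluate directly from the eigenvalues; here is a minimal sketch (the smoothing width `eps` is an arbitrary choice).

```python
import numpy as np

def stieltjes(U, z):
    """Empirical Stieltjes transform q(z) = (1/p) Tr (U - z I)^{-1}, Eq. (15)."""
    lam = np.linalg.eigvalsh(U)
    return np.mean(1.0 / (lam - z))

def smoothed_density(U, grid, eps=1e-2):
    """Approximate spectral density (1/pi) Im q(lambda + i*eps), Eq. (16).
    A finite eps returns a Cauchy-smoothed version of rho_U."""
    return np.array([stieltjes(U, x + 1j * eps).imag / np.pi for x in grid])
```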

In classical settings, for instance when \[ U = \frac{1}{n} X X^\top \] with i.i.d. entries, the resolvent \(q(z)\) satisfies an explicit deterministic equation (the Marchenko–Pastur equation), from which one obtains the limiting spectral density. More generally, with a population covariance \(\Sigma\), one obtains the Silverstein–Pastur equations.

In our case, however, \(U\) is not a linear sample covariance: its entries involve the nonlinear map \(\sigma\), the random weights \(W\), and the diffusion noise. No classical closed-form resolvent equation is available.

We will now outline how to obtain such an equation using two key tools:

  1. Gaussian equivalence for the pre-activations.
  2. Replica methods applied to a Gaussian integral representation of the resolvent.

7. Gaussian equivalence and Hermite expansion

We first simplify the distribution of the pre-activations \(h(x) = W x / \sqrt{d}\).

7.1 Gaussian pre-activations

For fixed \(x\), each coordinate \(h_i(x) = W_i x / \sqrt{d}\) is Gaussian with \[ \mathbb{E}[h_i(x)] = 0, \qquad \mathbb{E}[h_i(x)^2] = \frac{1}{d}\|x\|^2. \]

For two inputs \(x,x'\), the pair \((h_i(x), h_i(x'))\) is Gaussian with covariance \[ \mathbb{E}[h_i(x) h_i(x')] = \frac{1}{d} x^\top x'. \]

For our noisy data \(x_\nu(t,\xi)\), this gives a Gaussian field \(H\) of pre-activations: \[ H_{\nu i}(\xi) = h_i(x_\nu(t,\xi)) \] with a covariance that can be expressed in terms of \(\Sigma\) and \(t\).

Gaussian equivalence (in this context) means that for many observables of interest (including spectral densities and generalization errors), we may replace the exact distribution of the feature matrix by a Gaussian process with the same covariance structure. Intuitively, the pre-activations are sums of many independent contributions, and high-dimensional central-limit-theorem-type universality effects apply.

7.2 Hermite expansion of the nonlinearity

Once the inputs to \(\sigma\) are Gaussian, the nonlinearity can be analyzed via its Hermite expansion: \[ \sigma(z) = \sum_{k=0}^\infty \frac{\alpha_k}{\sqrt{k!}}\, \mathrm{He}_k(z), \qquad \alpha_k = \frac{1}{\sqrt{k!}}\,\mathbb{E}_{Z\sim\mathcal{N}(0,1)}\big[\sigma(Z)\,\mathrm{He}_k(Z)\big], \tag{17}\] where \(\{\mathrm{He}_k\}\) are the (probabilists’) Hermite polynomials orthogonal with respect to \(\mathcal{N}(0,1)\).

For a pair \((Z_1,Z_2)\) of jointly Gaussian variables with correlation \(\rho\), \[ \mathbb{E}[\sigma(Z_1)\sigma(Z_2)] = \sum_{k=0}^\infty \alpha_k^2 \rho^k. \tag{18}\]
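
As a sanity check, Equations 17 and 18 can be verified numerically for any reasonable \(\sigma\); here is a minimal sketch using Gauss–Hermite quadrature for the coefficients and Monte Carlo for the left-hand side (the choice \(\sigma = \tanh\), the truncation order, and the sample size are arbitrary).

```python
import numpy as np
from numpy.polynomial import hermite_e as He
from math import factorial, sqrt, pi

def hermite_coeffs(sigma, k_max=10, quad_deg=80):
    """Coefficients alpha_k = E[sigma(Z) He_k(Z)] / sqrt(k!), Z ~ N(0,1),
    so that sigma(z) = sum_k alpha_k He_k(z) / sqrt(k!) as in Eq. (17)."""
    x, w = He.hermegauss(quad_deg)                   # quadrature for weight exp(-x^2/2)
    alphas = []
    for k in range(k_max + 1):
        basis = He.hermeval(x, [0] * k + [1])        # He_k(x)
        moment = np.sum(w * sigma(x) * basis) / sqrt(2 * pi)  # E[sigma(Z) He_k(Z)]
        alphas.append(moment / sqrt(factorial(k)))
    return np.array(alphas)

def check_identity(sigma, rho, k_max=10, n_mc=200_000, seed=0):
    """Compare E[sigma(Z1) sigma(Z2)] with sum_k alpha_k^2 rho^k, Eq. (18)."""
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal(n_mc)
    z2 = rho * z1 + np.sqrt(1 - rho ** 2) * rng.standard_normal(n_mc)
    lhs = np.mean(sigma(z1) * sigma(z2))
    alphas = hermite_coeffs(sigma, k_max)
    rhs = np.sum(alphas ** 2 * rho ** np.arange(k_max + 1))
    return lhs, rhs

# check_identity(np.tanh, rho=0.3)
# The two numbers agree up to Monte Carlo and truncation error.
```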

In our setting, \(\rho\) will be a function of inner products like \(x_\nu^\top x_{\nu'}/d\) and of the diffusion time \(t\).

The key point is that the entries of \(U\) in Equation 10 can be expressed in terms of a finite number of Gaussian expectations such as Equation 18. These are summarized in four scalar parameters \[ a_t,\; b_t,\; v_t,\; s_t, \] which are explicit integrals involving \(\sigma\), the noise level \(t\), and the data covariance \(\Sigma\).

As a result, the Gram matrix can be decomposed schematically as \[ U \approx s_t^2 I_p + \underbrace{\frac{v_t^2}{n}\sum_{\nu=1}^n g_\nu g_\nu^\top}_{\text{sample-specific part}} + \underbrace{\tilde U}_{\text{population part}}, \tag{19}\] where:

  • \(g_\nu \in \mathbb{R}^p\) are Gaussian vectors (the high-dimensional limit of the features),
  • \(\tilde U\) is a deterministic “population covariance” term depending on \(\Sigma\) and \((a_t,b_t,v_t,s_t)\).

Equation 19 is schematic but captures the structure: \(U\) is a sum of a diagonal term, a sample-specific random covariance, and a population-level covariance. All randomness is now Gaussian, and the nonlinearity is encoded in a small number of scalars.


8. Why we need a log-determinant representation

We want to compute the quenched resolvent \[ \bar q(z) = \mathbb{E}_{W,X} q(z) = \frac{1}{p}\,\mathbb{E}_{W,X} \operatorname{Tr}(U - zI)^{-1}. \]

Directly averaging \((U-zI)^{-1}\) over the randomness of \(U\) is intractable:

  • The inverse is a highly nonlinear function of all matrix entries.
  • \(U\) has complicated correlations due to its nonlinear construction.
  • Classical results (e.g. for \(XX^\top\)-type matrices) do not apply.

A standard tool in random matrix theory and statistical physics is to transform the inverse problem into a log-determinant problem, which can then be represented as a Gaussian integral.

8.1 From the resolvent to \(\log\det\)

For a symmetric matrix \(M\) and any \(z\) outside its spectrum, we have \[ \frac{\partial}{\partial z} \log \det(M - zI) = - \operatorname{Tr}(M - zI)^{-1}. \]

Applying this with \(M = U\), we obtain \[ q(z) = \frac{1}{p} \operatorname{Tr}(U - zI)^{-1} = - \frac{1}{p} \frac{\partial}{\partial z} \log \det(U - zI). \tag{20}\]

Therefore, understanding \(\bar q(z)\) is equivalent to understanding the average of \(\log\det(U - zI)\).

8.2 From \(\log\det\) to a Gaussian partition function

A second standard identity is the Gaussian integral \[ \int_{\mathbb{R}^p} \exp\!\left( -\tfrac12 x^\top M x \right) dx \propto (\det M)^{-1/2}, \quad M \succ 0. \]

Define the partition function \[ Z(z;U) = \int_{\mathbb{R}^p} \exp\!\left( - \tfrac12 \varphi^\top (U - zI)\varphi \right) d\varphi, \tag{21}\] understood via analytic continuation when \(U - zI\) is not positive definite (in particular for complex \(z\)).

Then \[ \log Z(z;U) = -\tfrac12 \log \det(U - zI) + \text{const}, \] and using Equation 20: \[ q(z) = \frac{2}{p} \frac{\partial}{\partial z} \log Z(z;U). \tag{22}\]

We have therefore transformed the resolvent into the derivative of the log of a Gaussian integral.
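
Equation 20 is easy to check numerically on a finite matrix, which is a useful sanity test before trusting the formal manipulations that follow; here is a minimal sketch (the finite-difference step and the example value of \(z\) are arbitrary choices).

```python
import numpy as np

def resolvent_direct(U, z):
    """q(z) = (1/p) Tr (U - z I)^{-1}, computed from the matrix inverse."""
    p = U.shape[0]
    return np.trace(np.linalg.inv(U - z * np.eye(p))) / p

def resolvent_from_logdet(U, z, h=1e-6):
    """q(z) = -(1/p) d/dz log det(U - z I), Eq. (20), via a central finite difference.
    The log-determinant is computed from the eigenvalues of U to avoid branch issues."""
    lam = np.linalg.eigvalsh(U)
    logdet = lambda zz: np.sum(np.log(lam - zz))
    return -(logdet(z + h) - logdet(z - h)) / (2 * h * U.shape[0])

# For z away from the real spectrum, e.g. z = -0.5 + 0.1j, the two values agree
# up to finite-difference error.
```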


9. Quenched vs annealed averages and the replica idea

We are interested in the typical behavior of the spectrum, corresponding to the quenched average of \(\log Z\): \[ \mathbb{E}\log Z(z;U). \]

This differs from the annealed quantity \(\log \mathbb{E}Z(z;U)\). In large systems, \(\log Z\) behaves like a sum of many weakly dependent contributions and concentrates around its mean (self-averaging). Therefore, the typical spectral density corresponds to \[ \frac{1}{p} \mathbb{E}\log Z(z;U), \] not to \(\frac{1}{p}\log \mathbb{E}Z\).

The replica trick provides a formal way to compute \(\mathbb{E}\log Z\): \[ \mathbb{E}\log Z = \lim_{s\to0} \frac{1}{s} \log \mathbb{E}Z^s. \tag{23}\]

For integer \(s\), \[ Z^s = \int \prod_{a=1}^s d\varphi^a \exp\!\left( - \frac12 \sum_{a=1}^s \varphi^{a\top}(U - zI)\varphi^a \right). \tag{24}\]

The integrand now depends on \(U\) only through quadratic forms in the replicated fields \(\varphi^a\).


10. Disorder average and overlap order parameters

We now average \(Z^s\) over the randomness of \(U\), which itself comes from the randomness of \(W\) and the data \(\{x_\nu\}\). Using the Gaussian structure Equation 19 and the fact that everything is quadratic in \(\varphi^a\), the disorder average can be performed explicitly.

A standard step is to introduce overlap matrices: \[ Q_{ab} = \frac{1}{p} \varphi^{a\top}\varphi^b, \qquad a,b=1,\dots,s, \] and related order parameters coupling to the population covariance \(\Sigma\). After integrating out the Gaussian fields and applying Gaussian equivalence, one arrives at an effective representation \[ \mathbb{E}Z^s = \int dQ \, dR \, dS \; \exp\big( p\,\mathcal{S}(Q,R,S; z) \big), \tag{25}\] where:

  • \(Q,R,S\) are finite-dimensional order parameters summarizing the overlaps,
  • \(\mathcal{S}\) is an explicit “effective action” depending on \(\psi_p,\psi_n,\rho_\Sigma,a_t,b_t,v_t,s_t\).

In the limit \(p\to\infty\), the integral Equation 25 is dominated by saddle points of \(\mathcal{S}\), that is by solutions of \[ \frac{\partial \mathcal{S}}{\partial Q} = 0, \quad \frac{\partial \mathcal{S}}{\partial R} = 0, \quad \frac{\partial \mathcal{S}}{\partial S} = 0. \]

Assuming replica symmetry (no reason for symmetry breaking in this setting), the saddle point can be parametrized by three scalar functions of \(z\):

  • the resolvent \(q(z)\),
  • a projected resolvent \(r(z)\) aligned with \(\Sigma\),
  • an auxiliary scalar \(s(z)\).

These satisfy a closed system of self-consistent equations of the form \[ q(z) = F_q(q,r,s;z), \quad r(z) = F_r(q,r,s;z), \quad s(z) = F_s(q,r,s;z). \tag{26}\]

One can think of Equation 26 as a generalized Silverstein–Pastur equation for this nonlinear random-feature ensemble.

The paper’s main theorem can be read as giving an explicit expression for \(F_q,F_r,F_s\) in terms of \(\rho_\Sigma\) and the scalars \(a_t,b_t,v_t,s_t\). The resolvent \(q(z)\) is then defined implicitly as the solution of Equation 26.


11. Asymptotic spectrum and separation of scales

The self-consistent equations Equation 26 are complicated in general but become tractable in the regime \[ \psi_p = \frac{p}{d} \gg 1, \qquad \psi_n = \frac{n}{d} \gg 1. \]

By expanding the equations in this limit, one finds:

  1. A population bulk \(\rho_2(\lambda)\) at scale \[ \lambda = O(\psi_p), \] essentially matching the spectrum of the deterministic population part \(\tilde U\) from Equation 19.

  2. A sample-specific bulk \(\rho_1(\lambda)\) at the smaller scale \[ \lambda = O(\psi_p/\psi_n). \]

  3. A residual delta mass at \(\lambda = s_t^2\), which does not affect training/test losses.

More concretely, the support of \(\rho_1(\lambda)\) behaves (schematically) like \[ \text{supp}(\rho_1) \approx \Big[ s_t^2 + c_1 v_t^2 (1 - \sqrt{\psi_p/\psi_n})^2,\; s_t^2 + c_1 v_t^2 (1 + \sqrt{\psi_p/\psi_n})^2 \Big], \] for some constant \(c_1\) depending on the details of the model. The important point is the scaling: \[ \lambda_{\min}^{(1)} \sim O\!\left(\frac{\psi_p}{\psi_n}\right), \qquad \lambda_{\max}^{(2)} \sim O(\psi_p). \]

Thus, in the large-\(\psi_p,\psi_n\) limit, the spectrum splits into two well-separated bulks.
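
The two-bulk structure is easy to visualize on a toy surrogate with the schematic form of Equation 19. The sketch below uses made-up scalars and i.i.d. Gaussian stand-ins for the feature vectors and population directions, so it is not the exact ensemble of the analysis, but it reproduces the separation of scales: a population bulk at scale \(\psi_p\), a sample-specific bulk at scale \(\psi_p/\psi_n\), and a residual mass near \(s_t^2\).

```python
import numpy as np

# Toy surrogate with the schematic structure of Eq. (19): a diagonal shift s2*I,
# a sample-specific Wishart part built from n Gaussian vectors in R^p, and a
# rank-d "population" part. The scalars s2, v2, pop are made up for illustration.
rng = np.random.default_rng(0)
d = 25
psi_p, psi_n = 100, 10            # p/d and n/d, both large, with psi_p >> psi_n
p, n = psi_p * d, psi_n * d
s2, v2, pop = 0.1, 1.0, 1.0

G = rng.standard_normal((p, n))   # stand-ins for the Gaussian feature vectors g_nu
F = rng.standard_normal((p, d))   # stand-in for the population directions
U = s2 * np.eye(p) + (v2 / n) * G @ G.T + (pop / d) * F @ F.T

lam = np.sort(np.linalg.eigvalsh(U))
pop_bulk    = lam[-d:]            # top d eigenvalues: population bulk, scale ~ pop * psi_p
sample_bulk = lam[-(d + n):-d]    # next n eigenvalues: sample-specific bulk, ~ v2 * psi_p / psi_n
residual    = lam[:-(d + n)]      # remaining mass near s2
print("population bulk:      [%.1f, %.1f]" % (pop_bulk.min(), pop_bulk.max()))
print("sample-specific bulk: [%.1f, %.1f]" % (sample_bulk.min(), sample_bulk.max()))
print("residual mass near s2 = %.1f: [%.2f, %.2f]" % (s2, residual.min(), residual.max()))
```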


12. From spectral edges to training time scales

Recall from Equation 14 that the relaxation time associated with an eigenvalue \(\lambda\) is \[ \tau(\lambda) = \frac{\psi_p}{2\Delta_t \lambda}. \]

Using the scaling of the two bulks:

  • For population modes (\(\rho_2\), \(\lambda \sim O(\psi_p)\)): \[ \tau_{\mathrm{gen}} \sim \frac{\psi_p}{\Delta_t\,O(\psi_p)} = O(1). \]

  • For sample-specific modes (\(\rho_1\), \(\lambda_{\min} \sim O(\psi_p/\psi_n)\)): \[ \tau_{\mathrm{mem}} \sim \frac{\psi_p}{\Delta_t \lambda_{\min}} \sim \frac{\psi_p}{\Delta_t (\psi_p/\psi_n)} = \frac{\psi_n}{\Delta_t} \propto n. \]

We obtain two distinct time scales:

  • A fast, dataset-size–independent time \(\tau_{\mathrm{gen}}\) at which population-level structure is learned.
  • A slow, dataset-size–proportional time \(\tau_{\mathrm{mem}}\) at which sample-specific details (and thus memorization) are learned.

This matches the empirical observation: diffusion models generalize early and memorize only later, with the onset of memorization pushed out as \(n\) increases.


13. Conceptual summary

Starting from diffusion score matching and a random-features score model, we:

  1. Expressed the training dynamics as a linear ODE in the readout weights \(A\).
  2. Identified the Gram matrix \(U\) as the operator controlling time scales.
  3. Used Gaussian equivalence and Hermite expansions to reduce \(U\) to a Gaussian random matrix parameterized by a few scalars.
  4. Introduced the resolvent and expressed it via a log-determinant and a Gaussian partition function.
  5. Used replica and saddle-point techniques to derive self-consistent equations for the resolvent.
  6. Analyzed these equations in the overparameterized regime and found a two-bulk spectrum, with eigenvalues at scales \(O(\psi_p)\) and \(O(\psi_p/\psi_n)\).
  7. Translated the spectral edges into relaxation times, obtaining: \[ \tau_{\mathrm{gen}} = O(1), \qquad \tau_{\mathrm{mem}} = O(n). \]

The key insight is that dynamical regularization arises from a spectral separation between population modes and sample-specific modes:

  • Large eigenvalues correspond to directions that capture the underlying data distribution; they are learned quickly.
  • Small eigenvalues correspond to directions that encode high-frequency, dataset-specific variations; they are learned slowly, on a time scale that grows with \(n\).

Even when the model is expressive enough to memorize the data, and even in the absence of explicit regularization, the spectral geometry of the training operator creates a large window of training time where the model has generalized but not yet memorized.

This is the spectral mechanism behind “generalization before memorization” in (this random-feature approximation of) diffusion models.