Tags: diffusion, replica, generative

Published: 05 12 2025 · Modified: 05 12 2025

Why Diffusion Models Generalize Before They Memorize

A Spectral Story via Random Features and Random Matrix Theory

Diffusion models trained by score matching behave in a way that is empirically very robust but theoretically nontrivial: they reach good generation quality early, and only much later start to memorize individual training samples. The goal of this note is to explain this phenomenon purely from training dynamics, within a random-features model where all relevant quantities can be analyzed analytically.

We will see that the key mechanism is spectral: the training operator has two distinct eigenvalue scales, corresponding to “population” and “sample-specific” directions. Gradient descent learns the population modes quickly, and the sample-specific modes only on a time scale that grows with the dataset size \(n\). Memorization is therefore dynamically delayed.


1. Score matching at fixed diffusion time

We begin by specifying the data and the score-matching objective.

1.1 Data and diffusion

Let \(x \in \mathbb{R}^d\) be a data vector drawn from a distribution \[ x \sim P_x \quad\text{with}\quad \mathbb{E}[x] = 0, \qquad \mathbb{E}[x x^\top] = \Sigma. \]

The forward diffusion process is defined by \[ x(t,\xi) = e^{-t} x + \sqrt{\Delta_t}\,\xi, \qquad \Delta_t = 1-e^{-2t}, \qquad \xi \sim \mathcal{N}(0,I_d). \tag{1}\]

The marginal distribution of \(x(t,\xi)\) is a smoothed version of \(P_x\); its density is denoted \(P_t\). The (Stein) score at time \(t\) is \[ s^\star_t(x) = \nabla_x \log P_t(x). \]

A parametric model \(s_\theta(x,t)\) is trained to approximate \(s^\star_t(x)\) via denoising score matching. At fixed \(t\), the population objective is \[ \mathcal{L}_t(\theta) = \frac{1}{d} \mathbb{E}_{x\sim P_x} \mathbb{E}_{\xi\sim\mathcal{N}(0,I_d)} \Big\| \sqrt{\Delta_t}\, s_\theta(x(t,\xi),t) + \xi \Big\|^2. \tag{2}\]

Given a training set \(\{x_\nu\}_{\nu=1}^n\), the empirical loss is \[ L_{\mathrm{train}}(\theta) = \frac{1}{dn} \sum_{\nu=1}^n \mathbb{E}_\xi \Big\| \sqrt{\Delta_t}\, s_\theta(x_\nu(t,\xi),t) + \xi \Big\|^2. \tag{3}\]
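
To make these objects concrete, here is a minimal NumPy sketch of the forward process in Equation 1 and of a Monte Carlo estimate of the empirical loss in Equation 3, for a generic score model passed in as a callable. The function names, shapes, and sample counts are illustrative choices, not from the original analysis.

```python
import numpy as np

def diffuse(x, t, rng):
    """Forward process of Eq. (1): x(t, xi) = e^{-t} x + sqrt(1 - e^{-2t}) xi."""
    delta_t = 1.0 - np.exp(-2.0 * t)
    xi = rng.standard_normal(x.shape)
    return np.exp(-t) * x + np.sqrt(delta_t) * xi, xi, delta_t

def empirical_dsm_loss(score_fn, X_train, t, n_noise=64, rng=None):
    """Monte Carlo estimate of the empirical loss of Eq. (3).

    score_fn : callable mapping a (d,) noisy input to a (d,) score estimate
    X_train  : (n, d) array of training samples x_nu
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    n, d = X_train.shape
    total = 0.0
    for x_nu in X_train:
        for _ in range(n_noise):
            x_t, xi, delta_t = diffuse(x_nu, t, rng)
            resid = np.sqrt(delta_t) * score_fn(x_t) + xi
            total += np.sum(resid ** 2)
    return total / (d * n * n_noise)
```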

We are interested in the training dynamics of \(L_{\mathrm{train}}(\theta)\) for a particular model class.


2. Random-features score model

Directly analyzing a deep U-Net is out of reach. We therefore simplify the architecture while preserving three key ingredients:

  1. Score matching at a fixed diffusion time.
  2. Overparameterization.
  3. Nonlinearity in the data.

This leads to the random-features score model.

2.1 Pre-activations and features

Let \[ W \in \mathbb{R}^{p\times d} \] have i.i.d. entries \(W_{ij} \sim \mathcal{N}(0,1)\). For an input \(x\in\mathbb{R}^d\) we define:

  • The pre-activation vector \[ h(x) = \frac{W x}{\sqrt{d}} \in \mathbb{R}^p, \tag{4}\]
  • The feature vector \[ \varphi(x) = \sigma\big(h(x)\big) \in \mathbb{R}^p, \tag{5}\] where \(\sigma:\mathbb{R}\to\mathbb{R}\) is a fixed nonlinearity applied coordinate-wise.

For a noisy training sample \(x_\nu(t,\xi)\), we will write \[ \varphi_\nu(\xi) = \varphi\big(x_\nu(t,\xi)\big) = \sigma\!\left(\frac{W x_\nu(t,\xi)}{\sqrt{d}}\right). \tag{6}\]

2.2 Random-features score network

The score model is linear in the feature space: \[ s_A(x) = \frac{A}{\sqrt{p}} \,\varphi(x), \qquad A \in \mathbb{R}^{d\times p}. \tag{7}\]

Here:

  • \(W\) is random and fixed (first layer),
  • \(A\) is trainable (second layer),
  • The overall model remains nonlinear in \(x\) through \(\sigma\).

Plugging Equation 7 into the empirical loss Equation 3 at fixed time \(t\) yields \[ L_{\mathrm{train}}(A) = \frac{1}{dn} \sum_{\nu=1}^n \mathbb{E}_\xi \Big\| \sqrt{\Delta_t}\, \frac{A}{\sqrt{p}} \,\varphi_\nu(\xi) + \xi \Big\|^2. \tag{8}\]

At this point, the loss is quadratic in \(A\), which is the main structural simplification: gradient descent dynamics will be linear in \(A\).
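
As a concrete illustration, here is a minimal sketch of the random-features score model of Equations 4 to 7, again in NumPy; the class name and the choice \(\sigma = \tanh\) are arbitrary. An instance can be passed as `score_fn` to the `empirical_dsm_loss` sketch above to evaluate Equation 8.

```python
import numpy as np

class RandomFeatureScore:
    """Random-features score model s_A(x) = A sigma(W x / sqrt(d)) / sqrt(p), Eqs. (4)-(7)."""

    def __init__(self, d, p, sigma=np.tanh, rng=None):
        rng = rng if rng is not None else np.random.default_rng(0)
        self.d, self.p, self.sigma = d, p, sigma
        self.W = rng.standard_normal((p, d))  # fixed random first layer
        self.A = np.zeros((d, p))             # trainable readout

    def features(self, x):
        # phi(x) = sigma(W x / sqrt(d)), Eq. (5)
        return self.sigma(self.W @ x / np.sqrt(self.d))

    def __call__(self, x):
        # s_A(x) = A phi(x) / sqrt(p), Eq. (7)
        return self.A @ self.features(x) / np.sqrt(self.p)
```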


3. From the loss to a linear ODE in \(A\)

We briefly compute the gradient of Equation 8 with respect to \(A\).

Write, for each \((\nu,\xi)\), \[ \ell_{\nu,\xi}(A) = \frac{1}{d} \Big\| \sqrt{\Delta_t}\, \frac{A}{\sqrt{p}} \,\varphi_\nu(\xi) + \xi \Big\|^2. \]

Expanding the square, \[ \ell_{\nu,\xi}(A) = \frac{\Delta_t}{dp}\, \varphi_\nu(\xi)^\top A^\top A \,\varphi_\nu(\xi) + \frac{2\sqrt{\Delta_t}}{d\sqrt{p}}\, \xi^\top A \,\varphi_\nu(\xi) + \frac{1}{d}\|\xi\|^2. \]

The gradient with respect to \(A\) is \[ \nabla_A \ell_{\nu,\xi}(A) = \frac{2\Delta_t}{dp}\, A \,\varphi_\nu(\xi)\varphi_\nu(\xi)^\top + \frac{2\sqrt{\Delta_t}}{d\sqrt{p}}\, \xi \,\varphi_\nu(\xi)^\top. \]

Averaging over \(\nu\) and \(\xi\) yields \[ \nabla_A L_{\mathrm{train}}(A) = \frac{2\Delta_t}{dp}\, A U + \frac{2\sqrt{\Delta_t}}{d\sqrt{p}}\, V^\top, \tag{9}\] where we have defined (a Monte Carlo sketch of both objects appears right after this list)

  • the Gram matrix \[ U = \frac{1}{n} \sum_{\nu=1}^n \mathbb{E}_\xi \big[ \varphi_\nu(\xi)\varphi_\nu(\xi)^\top \big] \in \mathbb{R}^{p\times p}, \tag{10}\]
  • and the “noise-feature” coupling \[ V = \frac{1}{n} \sum_{\nu=1}^n \mathbb{E}_\xi \big[ \varphi_\nu(\xi)\xi^\top \big] \in \mathbb{R}^{p\times d}. \tag{11}\]
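
Both objects can be estimated by plain Monte Carlo over the diffusion noise; here is a minimal sketch, assuming a model exposing the `features` method from the earlier `RandomFeatureScore` sketch (the sample counts are arbitrary choices).

```python
import numpy as np

def estimate_U_V(model, X_train, t, n_noise=200, rng=None):
    """Monte Carlo estimates of the Gram matrix U (Eq. 10) and the
    noise-feature coupling V (Eq. 11), for a model exposing .features()."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n, d = X_train.shape
    delta_t = 1.0 - np.exp(-2.0 * t)
    U = np.zeros((model.p, model.p))
    V = np.zeros((model.p, d))
    for x_nu in X_train:
        for _ in range(n_noise):
            xi = rng.standard_normal(d)
            phi = model.features(np.exp(-t) * x_nu + np.sqrt(delta_t) * xi)
            U += np.outer(phi, phi)
            V += np.outer(phi, xi)
    return U / (n * n_noise), V / (n * n_noise)
```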

The (full-batch) gradient descent update is then \[ A^{(k+1)} = A^{(k)} - \eta \,\nabla_A L_{\mathrm{train}}(A^{(k)}). \]

In the small learning-rate regime, we take a continuum limit with the rescaled time \[ \tau = \frac{k\eta}{d^2}, \] so that \(\frac{d}{d\tau}A(\tau) = -\,d^2\,\nabla_A L_{\mathrm{train}}(A(\tau))\). With Equation 9, this gives the ODE \[ \frac{d}{d\tau} A(\tau) = - \frac{2\Delta_t}{\psi_p}\,A(\tau)\,U - \frac{2\sqrt{\Delta_t}\,d}{\sqrt{p}}\,V^\top, \qquad \psi_p = \frac{p}{d}. \tag{12}\]

This is a linear ODE in \(A\), driven by the random matrix \(U\).


4. Training dynamics in the eigenbasis of \(U\)

To understand the time scales of learning, we diagonalize the Gram matrix: \[ U = Q \Lambda Q^\top, \] where \[ \Lambda = \operatorname{diag}(\lambda_1,\dots,\lambda_p). \]

Let \[ B(\tau) = A(\tau) Q \in \mathbb{R}^{d\times p}, \qquad \tilde V^\top = V^\top Q. \]

Then Equation 12 becomes \[ \frac{d}{d\tau} B(\tau) = - \frac{2\Delta_t}{\psi_p}\,B(\tau)\,\Lambda - \frac{2\sqrt{\Delta_t}\,d}{\sqrt{p}}\, \tilde V^\top. \tag{13}\]

For each column \(b_i(\tau) \in \mathbb{R}^d\) of \(B(\tau)\), we obtain a decoupled linear ODE: \[ \frac{d}{d\tau} b_i(\tau) = - \frac{2\Delta_t\lambda_i}{\psi_p}\, b_i(\tau) - \frac{2\sqrt{\Delta_t}\,d}{\sqrt{p}}\, \tilde v_i, \] where \(\tilde v_i\) is the \(i\)-th column of \(\tilde V^\top\). This is solved explicitly as \[ b_i(\tau) = e^{-(2\Delta_t\lambda_i/\psi_p)\,\tau}\, b_i(0) - \frac{2\sqrt{\Delta_t}\,d}{\sqrt{p}} \int_0^\tau e^{-(2\Delta_t\lambda_i/\psi_p)\,(\tau-s)}\, \tilde v_i\, ds. \]

Ignoring the forcing term for the moment (or thinking about relaxation to equilibrium), the homogeneous solution decays as \[ \|b_i(\tau)\| \approx \|b_i(0)\| \exp\!\Big(-\frac{2\Delta_t\lambda_i}{\psi_p}\,\tau\Big). \]

This suggests defining the relaxation time associated with eigenvalue \(\lambda\): \[ \tau(\lambda) = \frac{\psi_p}{2\Delta_t\lambda}. \tag{14}\]

Modes with larger \(\lambda\) are learned faster; modes with small \(\lambda\) take longer to relax. Therefore:

The time scale at which different directions in parameter space (and thus in function space) are learned is determined by the eigenvalues of the Gram matrix \(U\).

Our task reduces to understanding the spectrum of \(U\) in the high-dimensional limit.
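
Before turning to the spectrum itself, here is a small numerical illustration of Equation 14: diagonalize an empirical Gram matrix (for instance one returned by the `estimate_U_V` sketch above) and read off one relaxation time per eigenvalue. This is only a sketch; the eigenvalue cutoff is an arbitrary numerical choice.

```python
import numpy as np

def relaxation_times(U, d, delta_t):
    """Eigenvalues of U and the relaxation times tau(lambda) = psi_p / (2 delta_t lambda), Eq. (14)."""
    p = U.shape[0]
    psi_p = p / d
    lam = np.linalg.eigvalsh(U)
    lam = lam[lam > 1e-12]                 # discard numerically zero modes
    return lam, psi_p / (2.0 * delta_t * lam)

# Example (with U from the estimate_U_V sketch above):
# lam, tau = relaxation_times(U, d=X_train.shape[1], delta_t=1 - np.exp(-2 * t))
# print("slowest / fastest relaxation time ratio:", tau.max() / tau.min())
```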


5. The spectral problem: what do we actually need?

Let \[ \lambda_1,\dots,\lambda_p \] be the eigenvalues of \(U\). The empirical spectral density is \[ \rho_U^{(p)}(\lambda) = \frac{1}{p} \sum_{i=1}^p \delta(\lambda - \lambda_i). \]

We are interested in its limit \(\rho_U(\lambda)\) as \(d,p,n\to\infty\) with the ratios \(\psi_p = p/d\) and \(\psi_n = n/d\) held fixed.

More specifically, we need:

  1. The location and scale of the bulk (or bulks) of \(\rho_U(\lambda)\).
  2. The behavior of the smallest nonzero eigenvalues \(\lambda_{\min}\), since Equation 14 shows that \[ \tau_{\mathrm{slow}} \sim \frac{\psi_p}{\Delta_t \lambda_{\min}}. \]

The central question thus becomes:

What is the limiting spectral distribution of \(U\), and how do its bulk edges scale with \(\psi_p = p/d\) and \(\psi_n = n/d\)?


6. The resolvent (Stieltjes transform) of \(U\)

Random matrix theory typically approaches such questions via the resolvent or Stieltjes transform: \[ q(z) = \frac{1}{p}\operatorname{Tr}(U - zI)^{-1} = \frac{1}{p}\sum_{i=1}^p \frac{1}{\lambda_i - z}, \qquad z \in \mathbb{C}\setminus \mathbb{R}. \tag{15}\]

The spectral density is recovered from \(q(z)\) via \[ \rho_U(\lambda) = \frac{1}{\pi} \lim_{\varepsilon\to 0^+} \operatorname{Im}\, q(\lambda + i\varepsilon). \tag{16}\]
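
For a finite-size \(U\), both Equation 15 and a smoothed version of Equation 16 are easy to evaluate directly from the eigenvalues; here is a minimal sketch (the smoothing width `eps` is an arbitrary choice).

```python
import numpy as np

def stieltjes(U, z):
    """Empirical Stieltjes transform q(z) = (1/p) Tr (U - z I)^{-1}, Eq. (15)."""
    lam = np.linalg.eigvalsh(U)
    return np.mean(1.0 / (lam - z))

def smoothed_density(U, grid, eps=1e-2):
    """Approximate spectral density (1/pi) Im q(lambda + i*eps), Eq. (16).
    A finite eps returns a Cauchy-smoothed version of rho_U."""
    return np.array([stieltjes(U, x + 1j * eps).imag / np.pi for x in grid])
```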

In classical settings, for instance when \[ U = \frac{1}{n} X X^\top \] with i.i.d. entries, the resolvent \(q(z)\) satisfies an explicit deterministic equation (the Marchenko–Pastur equation), from which one obtains the limiting spectral density. More generally, with a population covariance \(\Sigma\), one obtains the Silverstein–Pastur equations.

In our case, however, \(U\) is not a linear sample covariance: its entries involve the nonlinear map \(\sigma\), the random weights \(W\), and the diffusion noise. No classical closed-form resolvent equation is available.

We will now outline how to obtain such an equation using two key tools:

  1. Gaussian equivalence for the pre-activations.
  2. Replica methods applied to a Gaussian integral representation of the resolvent.

7. Gaussian equivalence and Hermite expansion

We first simplify the distribution of the pre-activations \(h(x) = W x / \sqrt{d}\).

7.1 Gaussian pre-activations

For fixed \(x\), each coordinate \(h_i(x) = W_i x / \sqrt{d}\) is Gaussian with \[ \mathbb{E}[h_i(x)] = 0, \qquad \mathbb{E}[h_i(x)^2] = \frac{1}{d}\|x\|^2. \]

For two inputs \(x,x'\), the pair \((h_i(x), h_i(x'))\) is Gaussian with covariance \[ \mathbb{E}[h_i(x) h_i(x')] = \frac{1}{d} x^\top x'. \]

For our noisy data \(x_\nu(t,\xi)\), this gives a Gaussian field \(H\) of pre-activations: \[ H_{\nu i}(\xi) = h_i(x_\nu(t,\xi)) \] with a covariance that can be expressed in terms of \(\Sigma\) and \(t\).

Gaussian equivalence (in this context) means that for many observables of interest (including spectral densities and generalization errors), we may replace the exact distribution of the feature matrix by a Gaussian process with the same covariance structure. Intuitively, the pre-activations are sums of many independent contributions, and high-dimensional central-limit-theorem-type universality effects apply.

7.2 Hermite expansion of the nonlinearity

Once the inputs to \(\sigma\) are Gaussian, the nonlinearity can be analyzed via its Hermite expansion: \[ \sigma(z) = \sum_{k=0}^\infty \frac{\alpha_k}{\sqrt{k!}}\, \mathrm{He}_k(z), \qquad \alpha_k = \frac{1}{\sqrt{k!}}\,\mathbb{E}_{Z\sim\mathcal{N}(0,1)}\big[\sigma(Z)\,\mathrm{He}_k(Z)\big], \tag{17}\] where \(\{\mathrm{He}_k\}\) are the (probabilists’) Hermite polynomials orthogonal with respect to \(\mathcal{N}(0,1)\).

For a pair \((Z_1,Z_2)\) of jointly Gaussian variables with correlation \(\rho\), \[ \mathbb{E}[\sigma(Z_1)\sigma(Z_2)] = \sum_{k=0}^\infty \alpha_k^2 \rho^k. \tag{18}\]
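
As a sanity check, Equations 17 and 18 can be verified numerically for any reasonable \(\sigma\); here is a minimal sketch using Gauss–Hermite quadrature for the coefficients and Monte Carlo for the left-hand side (the choice \(\sigma = \tanh\), the truncation order, and the sample size are arbitrary).

```python
import numpy as np
from numpy.polynomial import hermite_e as He
from math import factorial, sqrt, pi

def hermite_coeffs(sigma, k_max=10, quad_deg=80):
    """Coefficients alpha_k = E[sigma(Z) He_k(Z)] / sqrt(k!), Z ~ N(0,1),
    so that sigma(z) = sum_k alpha_k He_k(z) / sqrt(k!) as in Eq. (17)."""
    x, w = He.hermegauss(quad_deg)                   # quadrature for weight exp(-x^2/2)
    alphas = []
    for k in range(k_max + 1):
        basis = He.hermeval(x, [0] * k + [1])        # He_k(x)
        moment = np.sum(w * sigma(x) * basis) / sqrt(2 * pi)  # E[sigma(Z) He_k(Z)]
        alphas.append(moment / sqrt(factorial(k)))
    return np.array(alphas)

def check_identity(sigma, rho, k_max=10, n_mc=200_000, seed=0):
    """Compare E[sigma(Z1) sigma(Z2)] with sum_k alpha_k^2 rho^k, Eq. (18)."""
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal(n_mc)
    z2 = rho * z1 + np.sqrt(1 - rho ** 2) * rng.standard_normal(n_mc)
    lhs = np.mean(sigma(z1) * sigma(z2))
    alphas = hermite_coeffs(sigma, k_max)
    rhs = np.sum(alphas ** 2 * rho ** np.arange(k_max + 1))
    return lhs, rhs

# check_identity(np.tanh, rho=0.3)
# The two numbers agree up to Monte Carlo and truncation error.
```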

In our setting, \(\rho\) will be a function of inner products like \(x_\nu^\top x_{\nu'}/d\) and of the diffusion time \(t\).

The key point is that the entries of \(U\) in Equation 10 can be expressed in terms of a finite number of Gaussian expectations such as Equation 18. These are summarized in four scalar parameters \[ a_t,\; b_t,\; v_t,\; s_t, \] which are explicit integrals involving \(\sigma\), the noise level \(t\), and the data covariance \(\Sigma\).

As a result, the Gram matrix can be decomposed schematically as \[ U \approx s_t^2 I_p + \underbrace{\frac{v_t^2}{n}\sum_{\nu=1}^n g_\nu g_\nu^\top}_{\text{sample-specific part}} + \underbrace{\tilde U}_{\text{population part}}, \tag{19}\] where:

  • \(g_\nu \in \mathbb{R}^p\) are Gaussian vectors (the high-dimensional limit of the features),
  • \(\tilde U\) is a deterministic “population covariance” term depending on \(\Sigma\) and \((a_t,b_t,v_t,s_t)\).

Equation 19 is schematic but captures the structure: \(U\) is a sum of a diagonal term, a sample-specific random covariance, and a population-level covariance. All randomness is now Gaussian, and the nonlinearity is encoded in a small number of scalars.


8. Why we need a log-determinant representation

We want to compute the quenched resolvent \[ \bar q(z) = \mathbb{E}_{W,X} q(z) = \frac{1}{p}\,\mathbb{E}_{W,X} \operatorname{Tr}(U - zI)^{-1}. \]

Directly averaging \((U-zI)^{-1}\) over the randomness of \(U\) is intractable:

  • The inverse is a highly nonlinear function of all matrix entries.
  • \(U\) has complicated correlations due to its nonlinear construction.
  • Classical results (e.g. for \(XX^\top\)-type matrices) do not apply.

A standard tool in random matrix theory and statistical physics is to transform the inverse problem into a log-determinant problem, which can then be represented as a Gaussian integral.

8.1 From the resolvent to \(\log\det\)

For a symmetric matrix \(M\) and any \(z\) outside its spectrum, we have \[ \frac{\partial}{\partial z} \log \det(M - zI) = - \operatorname{Tr}(M - zI)^{-1}. \]

Applying this with \(M = U\), we obtain \[ q(z) = \frac{1}{p} \operatorname{Tr}(U - zI)^{-1} = - \frac{1}{p} \frac{\partial}{\partial z} \log \det(U - zI). \tag{20}\]

Therefore, understanding \(\bar q(z)\) is equivalent to understanding the average of \(\log\det(U - zI)\).

8.2 From \(\log\det\) to a Gaussian partition function

A second standard identity is the Gaussian integral \[ \int_{\mathbb{R}^p} \exp\!\left( -\tfrac12 x^\top M x \right) dx \propto (\det M)^{-1/2}, \quad M \succ 0. \]

Define the partition function \[ Z(z;U) = \int_{\mathbb{R}^p} \exp\!\left( - \tfrac12 \varphi^\top (U - zI)\varphi \right) d\varphi, \tag{21}\] understood via analytic continuation when \(U - zI\) is not positive definite (in particular for complex \(z\)).

Then \[ \log Z(z;U) = -\tfrac12 \log \det(U - zI) + \text{const}, \] and using Equation 20: \[ q(z) = \frac{2}{p} \frac{\partial}{\partial z} \log Z(z;U). \tag{22}\]

We have therefore transformed the resolvent into the derivative of the log of a Gaussian integral.
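
Equation 20 is easy to check numerically on a finite matrix, which is a useful sanity test before trusting the formal manipulations that follow; here is a minimal sketch (the finite-difference step and the example value of \(z\) are arbitrary choices).

```python
import numpy as np

def resolvent_direct(U, z):
    """q(z) = (1/p) Tr (U - z I)^{-1}, computed from the matrix inverse."""
    p = U.shape[0]
    return np.trace(np.linalg.inv(U - z * np.eye(p))) / p

def resolvent_from_logdet(U, z, h=1e-6):
    """q(z) = -(1/p) d/dz log det(U - z I), Eq. (20), via a central finite difference.
    The log-determinant is computed from the eigenvalues of U to avoid branch issues."""
    lam = np.linalg.eigvalsh(U)
    logdet = lambda zz: np.sum(np.log(lam - zz))
    return -(logdet(z + h) - logdet(z - h)) / (2 * h * U.shape[0])

# For z away from the real spectrum, e.g. z = -0.5 + 0.1j, the two values agree
# up to finite-difference error.
```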


9. Quenched vs annealed averages and the replica idea

We are interested in the typical behavior of the spectrum, corresponding to the quenched average of \(\log Z\): \[ \mathbb{E}\log Z(z;U). \]

This differs from the annealed quantity \(\log \mathbb{E}Z(z;U)\). In large systems, \(\log Z\) behaves like a sum of many weakly dependent contributions and concentrates around its mean (self-averaging). Therefore, the typical spectral density corresponds to \[ \frac{1}{p} \mathbb{E}\log Z(z;U), \] not to \(\frac{1}{p}\log \mathbb{E}Z\).

The replica trick provides a formal way to compute \(\mathbb{E}\log Z\): \[ \mathbb{E}\log Z = \lim_{s\to0} \frac{1}{s} \log \mathbb{E}Z^s. \tag{23}\]

For integer \(s\), \[ Z^s = \int \prod_{a=1}^s d\varphi^a \exp\!\left( - \frac12 \sum_{a=1}^s \varphi^{a\top}(U - zI)\varphi^a \right). \tag{24}\]

The integrand now depends on \(U\) only through quadratic forms in the replicated fields \(\varphi^a\).


10. Disorder average and overlap order parameters

We now average \(Z^s\) over the randomness of \(U\), which itself comes from the randomness of \(W\) and the data \(\{x_\nu\}\). Using the Gaussian structure Equation 19 and the fact that everything is quadratic in \(\varphi^a\), the disorder average can be performed explicitly.

A standard step is to introduce overlap matrices: \[ Q_{ab} = \frac{1}{p} \varphi^{a\top}\varphi^b, \qquad a,b=1,\dots,s, \] and related order parameters coupling to the population covariance \(\Sigma\). After integrating out the Gaussian fields and applying Gaussian equivalence, one arrives at an effective representation \[ \mathbb{E}Z^s = \int dQ \, dR \, dS \; \exp\big( p\,\mathcal{S}(Q,R,S; z) \big), \tag{25}\] where:

  • \(Q,R,S\) are finite-dimensional order parameters summarizing the overlaps,
  • \(\mathcal{S}\) is an explicit “effective action” depending on \(\psi_p,\psi_n,\rho_\Sigma,a_t,b_t,v_t,s_t\).

In the limit \(p\to\infty\), the integral Equation 25 is dominated by saddle points of \(\mathcal{S}\), that is by solutions of \[ \frac{\partial \mathcal{S}}{\partial Q} = 0, \quad \frac{\partial \mathcal{S}}{\partial R} = 0, \quad \frac{\partial \mathcal{S}}{\partial S} = 0. \]

Assuming replica symmetry (no reason for symmetry breaking in this setting), the saddle point can be parametrized by three scalar functions of \(z\):

  • the resolvent \(q(z)\),
  • a projected resolvent \(r(z)\) aligned with \(\Sigma\),
  • an auxiliary scalar \(s(z)\).

These satisfy a closed system of self-consistent equations of the form \[ q(z) = F_q(q,r,s;z), \quad r(z) = F_r(q,r,s;z), \quad s(z) = F_s(q,r,s;z). \tag{26}\]

One can think of Equation 26 as a generalized Silverstein–Pastur equation for this nonlinear random-feature ensemble.

The paper’s main theorem can be read as giving an explicit expression for \(F_q,F_r,F_s\) in terms of \(\rho_\Sigma\) and the scalars \(a_t,b_t,v_t,s_t\). The resolvent \(q(z)\) is then defined implicitly as the solution of Equation 26.


11. Asymptotic spectrum and separation of scales

The self-consistent equations Equation 26 are complicated in general but become tractable in the regime \[ \psi_p = \frac{p}{d} \gg 1, \qquad \psi_n = \frac{n}{d} \gg 1. \]

By expanding the equations in this limit, one finds:

  1. A population bulk \(\rho_2(\lambda)\) at scale \[ \lambda = O(\psi_p), \] essentially matching the spectrum of the deterministic population part \(\tilde U\) from Equation 19.

  2. A sample-specific bulk \(\rho_1(\lambda)\) at the smaller scale \[ \lambda = O(\psi_p/\psi_n). \]

  3. A residual delta mass at \(\lambda = s_t^2\), which does not affect training/test losses.

More concretely, the support of \(\rho_1(\lambda)\) behaves (schematically) like \[ \text{supp}(\rho_1) \approx \Big[ s_t^2 + c_1 v_t^2 (1 - \sqrt{\psi_p/\psi_n})^2,\; s_t^2 + c_1 v_t^2 (1 + \sqrt{\psi_p/\psi_n})^2 \Big], \] for some constant \(c_1\) depending on the details of the model. The important point is the scaling: \[ \lambda_{\min}^{(1)} \sim O\!\left(\frac{\psi_p}{\psi_n}\right), \qquad \lambda_{\max}^{(2)} \sim O(\psi_p). \]

Thus, in the large-\(\psi_p,\psi_n\) limit, the spectrum splits into two well-separated bulks.
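
The two-bulk structure is easy to visualize on a toy surrogate with the schematic form of Equation 19. The sketch below uses made-up scalars and i.i.d. Gaussian stand-ins for the feature vectors and population directions, so it is not the exact ensemble of the analysis, but it reproduces the separation of scales: a population bulk at scale \(\psi_p\), a sample-specific bulk at scale \(\psi_p/\psi_n\), and a residual mass near \(s_t^2\).

```python
import numpy as np

# Toy surrogate with the schematic structure of Eq. (19): a diagonal shift s2*I,
# a sample-specific Wishart part built from n Gaussian vectors in R^p, and a
# rank-d "population" part. The scalars s2, v2, pop are made up for illustration.
rng = np.random.default_rng(0)
d = 25
psi_p, psi_n = 100, 10            # p/d and n/d, both large, with psi_p >> psi_n
p, n = psi_p * d, psi_n * d
s2, v2, pop = 0.1, 1.0, 1.0

G = rng.standard_normal((p, n))   # stand-ins for the Gaussian feature vectors g_nu
F = rng.standard_normal((p, d))   # stand-in for the population directions
U = s2 * np.eye(p) + (v2 / n) * G @ G.T + (pop / d) * F @ F.T

lam = np.sort(np.linalg.eigvalsh(U))
pop_bulk    = lam[-d:]            # top d eigenvalues: population bulk, scale ~ pop * psi_p
sample_bulk = lam[-(d + n):-d]    # next n eigenvalues: sample-specific bulk, ~ v2 * psi_p / psi_n
residual    = lam[:-(d + n)]      # remaining mass near s2
print("population bulk:      [%.1f, %.1f]" % (pop_bulk.min(), pop_bulk.max()))
print("sample-specific bulk: [%.1f, %.1f]" % (sample_bulk.min(), sample_bulk.max()))
print("residual mass near s2 = %.1f: [%.2f, %.2f]" % (s2, residual.min(), residual.max()))
```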


12. From spectral edges to training time scales

Recall from Equation 14 that the relaxation time associated with an eigenvalue \(\lambda\) is \[ \tau(\lambda) = \frac{\psi_p}{2\Delta_t \lambda}. \]

Using the scaling of the two bulks:

  • For population modes (\(\rho_2\), \(\lambda \sim O(\psi_p)\)): \[ \tau_{\mathrm{gen}} \sim \frac{\psi_p}{\Delta_t\,O(\psi_p)} = O(1). \]

  • For sample-specific modes (\(\rho_1\), \(\lambda_{\min} \sim O(\psi_p/\psi_n)\)): \[ \tau_{\mathrm{mem}} \sim \frac{\psi_p}{\Delta_t \lambda_{\min}} \sim \frac{\psi_p}{\Delta_t (\psi_p/\psi_n)} = \frac{\psi_n}{\Delta_t} \propto n. \]

We obtain two distinct time scales:

  • A fast, dataset-size–independent time \(\tau_{\mathrm{gen}}\) at which population-level structure is learned.
  • A slow, dataset-size–proportional time \(\tau_{\mathrm{mem}}\) at which sample-specific details (and thus memorization) are learned.

This matches the empirical observation: diffusion models generalize early and memorize only later, with the onset of memorization pushed out as \(n\) increases.


13. Conceptual summary

Starting from diffusion score matching and a random-features score model, we:

  1. Expressed the training dynamics as a linear ODE in the readout weights \(A\).
  2. Identified the Gram matrix \(U\) as the operator controlling time scales.
  3. Used Gaussian equivalence and Hermite expansions to reduce \(U\) to a Gaussian random matrix parameterized by a few scalars.
  4. Introduced the resolvent and expressed it via a log-determinant and a Gaussian partition function.
  5. Used replica and saddle-point techniques to derive self-consistent equations for the resolvent.
  6. Analyzed these equations in the overparameterized regime and found a two-bulk spectrum, with eigenvalues at scales \(O(\psi_p)\) and \(O(\psi_p/\psi_n)\).
  7. Translated the spectral edges into relaxation times, obtaining: \[ \tau_{\mathrm{gen}} = O(1), \qquad \tau_{\mathrm{mem}} = O(n). \]

The key insight is that dynamical regularization arises from a spectral separation between population modes and sample-specific modes:

  • Large eigenvalues correspond to directions that capture the underlying data distribution; they are learned quickly.
  • Small eigenvalues correspond to directions that encode high-frequency, dataset-specific variations; they are learned slowly, on a time scale that grows with \(n\).

Even when the model is expressive enough to memorize the data, and even in the absence of explicit regularization, the spectral geometry of the training operator creates a large window of training time where the model has generalized but not yet memorized.

This is the spectral mechanism behind “generalization before memorization” in (this random-feature approximation of) diffusion models.