Consider a controlled SDE on \([0,T]\) with noise schedule \(\sigma_t > 0\),
\[ dX_t = \sigma_t \, u(X_t, t) \, dt + \sigma_t \, dW_t, \quad X_0 \sim p_0. \tag{1}\]
The goal: find \(u\) such that \(X_T \sim \pi\), where \(\pi(x) = \rho(x)/\mathcal{Z}\) is a target density known up to normalization. Adjoint sampling and ASBS solve this via different endpoint couplings, but both require either a memoryless prior or alternating optimization. The Bridge Matching Sampler (BMS) identifies a single coupling, the independent coupling, that makes the regression target fully tractable and removes the need for alternation.
The central tool is Nelson’s relation, which we derive next.
Nelson’s relation
Let \(\mathbb{P}^u\) denote the path measure of Equation 1 with time marginals \(p_t\). Write the Euler discretization with step \(\delta\):
\[ X_{t+\delta} = X_t + \sigma_t \, u(X_t,t) \, \delta + \sigma_t \sqrt{\delta} \, \mathbf{n}, \qquad \mathbf{n}\sim \mathcal{N}(0,I). \tag{2}\]
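The Euler scheme in Equation 2 is a few lines of code. A minimal NumPy sketch, assuming a constant \(\sigma_t\) and a user-supplied control `u` (the zero control here, so that \(X_T \sim \mathcal{N}(X_0, \kappa(T))\) with \(\kappa(T) = \sigma^2 T\)):

```python
import numpy as np

def euler_maruyama(u, x0, sigma, T, n_steps, rng):
    """Simulate dX = sigma*u(X,t) dt + sigma dW via the Euler scheme of Equation 2."""
    delta = T / n_steps
    x = x0.copy()
    for k in range(n_steps):
        t = k * delta
        noise = rng.standard_normal(x.shape)
        x = x + sigma * u(x, t) * delta + sigma * np.sqrt(delta) * noise
    return x

rng = np.random.default_rng(0)
x0 = np.zeros(100_000)
xT = euler_maruyama(lambda x, t: np.zeros_like(x), x0, sigma=1.0, T=1.0,
                    n_steps=100, rng=rng)
# With u = 0 the terminal law is N(0, kappa(T)) = N(0, 1).
```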
The forward conditional mean is \(\mathbb{E}[X_{t+\delta} \mid X_t = x] = x + \sigma_t \, u(x,t) \, \delta\). Now compute the backward conditional mean \(\mathbb{E}[X_t \mid X_{t+\delta} = y]\) using Bayes’ rule, exactly as in the reverse diffusions note. For \(\delta \ll 1\):
\[ \mathbb{P}(X_t \in dx \mid X_{t+\delta} = y) \;\propto\; p_t(x) \, \exp {\left\{ -\frac{\|y - x - \sigma_t \, u(x,t) \, \delta\|^2}{2 \sigma_t^2 \, \delta} \right\}} . \]
Expanding \(p_t(x) \approx p_t(y) \exp(\left< \nabla \log p_t(y), x - y \right>)\) and completing the square, the conditional mean is
\[ \mathbb{E}[X_t \mid X_{t+\delta} = y] = y - \sigma_t \, u(y,t) \, \delta + \sigma_t^2 \, \nabla \log p_t(y) \, \delta + O(\delta^2). \]
So if \(v(y,t)\) denotes the backward drift (the drift of the time-reversed process, so that \(\mathbb{E}[X_t \mid X_{t+\delta} = y] = y + \sigma_t \, v(y,t) \, \delta\)), we read off
\[ \sigma_t \, v(y,t) = -\sigma_t \, u(y,t) + \sigma_t^2 \, \nabla \log p_t(y), \]
which gives Nelson’s relation:
\[ \textcolor{blue}{u(x,t) + v(x,t) = \sigma_t \, \nabla \log p_t(x).} \tag{3}\]
Forward drift pushes mass toward the target, backward drift pushes it toward the prior, and their sum is the diffusion coefficient times the score of the time marginal. This holds for any Markov diffusion of the form Equation 1 with marginals \(p_t\).
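Nelson's relation can be sanity-checked numerically. A sketch, assuming the uncontrolled case \(u = 0\) with \(\sigma_t \equiv 1\) and \(X_0 \sim \mathcal{N}(0, s^2)\), so that \(p_t = \mathcal{N}(0, s^2 + t)\) and Equation 3 predicts \(v(y,t) = -y/(s^2+t)\). The backward conditional mean is then \(y(1 - \delta/(s^2+t))\) to first order in \(\delta\), and an OLS regression of \(X_t\) on \(X_{t+\delta}\) recovers this slope (the exact Gaussian value is \((s^2+t)/(s^2+t+\delta)\)):

```python
import numpy as np

rng = np.random.default_rng(1)
s2, t, delta = 0.5, 0.5, 0.01          # prior variance, time, step size
n = 1_000_000
x_t = rng.standard_normal(n) * np.sqrt(s2 + t)          # X_t ~ N(0, s^2 + t)
x_next = x_t + np.sqrt(delta) * rng.standard_normal(n)  # one Euler step, u = 0

# OLS slope of X_t on X_{t+delta}: estimates the backward conditional mean.
slope = np.cov(x_t, x_next)[0, 1] / x_next.var()
predicted = 1.0 - delta / (s2 + t)     # y + v(y,t)*delta with v = -y/(s^2+t)
exact = (s2 + t) / (s2 + t + delta)
```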
Reciprocal class and Markovian projection
Let \(\mathbb{P}\) denote the reference (uncontrolled) process \(dX_t = \sigma_t \, dW_t\). A path measure \(\Pi\) belongs to the reciprocal class \(\mathcal{R}(\mathbb{P})\) if it has the form \(\Pi = \Pi_{0,T} \, \mathbb{P}_{|0,T}\), where \(\Pi_{0,T}\) is an endpoint coupling and \(\mathbb{P}_{|0,T}\) is the reference bridge (Brownian bridge for Brownian \(\mathbb{P}\)); see the Schrodinger bridges note.
A reciprocal measure is generally non-Markovian: the bridge drift depends on \(X_T\). The Markovian projection finds a Markovian drift \(u^\star\) whose time marginals match those of \(\Pi^\star\). This is an \(L^2\) projection: if \(\xi(X,t)\) is the path-dependent drift of \(\Pi^\star\), then
\[ u^\star(x,t) = \mathbb{E}_{\Pi^\star} {\left[ \xi(X,t) \mid X_t = x \right]} . \tag{4}\]
Why? For any Markovian \(u\), expand \(\mathbb{E}_{\Pi^\star}[\|\xi - u(X_t,t)\|^2]\) and use the tower property:
\[ \mathbb{E}_{\Pi^\star} {\left[ \|\xi - u\|^2 \right]} = \mathbb{E}_{\Pi^\star} {\left[ \|\xi - u^\star\|^2 \right]} + \mathbb{E}_{\Pi^\star} {\left[ \|u^\star - u\|^2 \right]} . \]
The cross-term vanishes because \(\mathbb{E}_{\Pi^\star}[\xi - u^\star \mid X_t] = 0\) by definition of \(u^\star\). So \(u^\star\) minimizes the matching loss
\[ u^\star = \mathop{\mathrm{argmin}}_{u} \; \mathbb{E}_{\Pi^\star} {\left[ \int_0^T \frac{1}{2} \| \xi(X,t) - u(X_t,t) \|^2 \, dt \right]} . \tag{5}\]
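The decomposition above is just the Pythagorean identity for conditional expectation, and it is easy to confirm by Monte Carlo. A toy check with illustrative choices: \(\xi = 2X_t + \varepsilon\), so that \(u^\star(x) = 2x\), against an arbitrary competitor \(u(x) = 3x\):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000
x = rng.standard_normal(n)
xi = 2.0 * x + 0.3 * rng.standard_normal(n)  # path-dependent target, E[xi|x] = 2x

u_star = 2.0 * x    # conditional mean: the Markovian projection
u = 3.0 * x         # any other Markovian drift

lhs = np.mean((xi - u) ** 2)
rhs = np.mean((xi - u_star) ** 2) + np.mean((u_star - u) ** 2)
# Cross-term E[(xi - u_star)(u_star - u)] vanishes by the tower property.
```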
Fixed-point iteration
All three methods (adjoint sampling, ASBS, BMS) follow the same template. Starting from some control \(u_0\):
1. Simulate the current SDE \(\mathbb{P}^{u_i}\) to generate endpoint pairs \((X_0, X_T)\).
2. Reciprocal projection: form a coupling \(\Pi^i_{0,T}\) from the endpoints and define \(\Pi^i = \Pi^i_{0,T} \, \mathbb{P}_{|0,T}\).
3. Markovianize: update \(u_{i+1}\) by regressing onto the bridge drift via Equation 5.
If \(u_i = u^\star\), then \(\Pi^i = \Pi^\star\) and \(u_{i+1} = u^\star\), so \(u^\star\) is a fixed point. What distinguishes the methods is the coupling in step 2, which determines the regression target \(\xi\).
Target score identity
To get a tractable \(\xi\), we need the score \(\nabla \log \Pi^\star_t(x)\). Define the cumulative variance \(\kappa(t) = \int_0^t \sigma_s^2 \, ds\) and \(\gamma(t) = \kappa(t)/\kappa(T)\). The bridge \(\mathbb{P}_{|0,T}\) is Gaussian: for \(t \in (0,T)\),
\[ X_t \mid (X_0, X_T) \;\sim\; \mathcal{N} {\left( (1-\gamma(t)) X_0 + \gamma(t) X_T, \;\; \kappa(T) \gamma(t)(1-\gamma(t)) \, I \right)} . \tag{6}\]
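Equation 6 makes bridge sampling a single Gaussian draw. A sketch, assuming \(\sigma_t \equiv 1\) so that \(\kappa(t) = t\) and \(\gamma(t) = t/T\):

```python
import numpy as np

def sample_bridge(x0, xT, t, T, rng):
    """Draw X_t | (X_0, X_T) from Equation 6, with sigma_t = 1 so kappa(t) = t."""
    gamma = t / T
    mean = (1.0 - gamma) * x0 + gamma * xT
    var = T * gamma * (1.0 - gamma)
    return mean + np.sqrt(var) * rng.standard_normal(np.shape(x0))

rng = np.random.default_rng(3)
x0, xT, t, T = 0.0, 2.0, 0.25, 1.0
samples = sample_bridge(np.full(500_000, x0), np.full(500_000, xT), t, T, rng)
# Equation 6 predicts mean 0.75*0 + 0.25*2 = 0.5 and variance 1*0.25*0.75 = 0.1875.
```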
The marginal density of \(\Pi^\star\) at time \(t\) is
\[ \Pi^\star_t(x) = \int \mathbb{P}_{t|0,T}(x \mid x_0, x_T) \, \Pi^\star_{0,T}(x_0, x_T) \, dx_0 \, dx_T. \tag{7}\]
Differentiate \(\log \Pi^\star_t(x)\):
\[ \nabla_x \log \Pi^\star_t(x) = \frac{\int \nabla_x \mathbb{P}_{t|0,T}(x \mid x_0, x_T) \, \Pi^\star_{0,T}(x_0, x_T) \, dx_0 \, dx_T}{\Pi^\star_t(x)}. \]
Since \(\mathbb{P}_{t|0,T}\) is Gaussian with mean \((1-\gamma)x_0 + \gamma \, x_T\),
\[ \nabla_x \log \mathbb{P}_{t|0,T}(x \mid x_0, x_T) = -\frac{x - (1-\gamma)x_0 - \gamma \, x_T}{\kappa(T)\gamma(1-\gamma)}. \]
Note that \(\nabla_x \mathbb{P}_{t|0,T} = -\frac{1}{1-\gamma} \nabla_{x_0} \mathbb{P}_{t|0,T}\) (both gradients shift the Gaussian mean), so
\[ \nabla_x \log \Pi^\star_t(x) = \frac{1}{1-\gamma} \frac{\int \mathbb{P}_{t|0,T}(x \mid x_0, x_T) \, \nabla_{x_0} \Pi^\star_{0,T}(x_0, x_T) \, dx_0 \, dx_T}{\Pi^\star_t(x)}, \]
where we integrated by parts, moving \(\nabla_{x_0}\) from \(\mathbb{P}_{t|0,T}\) onto \(\Pi^\star_{0,T}\) (boundary terms vanish). Recognizing the conditional expectation:
\[ \nabla_x \log \Pi^\star_t(x) = \mathbb{E}_{\Pi^\star_{0,T|t}} {\left[ \frac{1}{1-\gamma(t)} \nabla_{X_0} \log \Pi^\star_{0,T}(X_0,X_T) \;\Big|\; X_t = x \right]} . \]
The same argument with integration by parts in \(x_T\) gives
\[ \nabla_x \log \Pi^\star_t(x) = \mathbb{E}_{\Pi^\star_{0,T|t}} {\left[ \frac{1}{\gamma(t)} \nabla_{X_T} \log \Pi^\star_{0,T}(X_0,X_T) \;\Big|\; X_t = x \right]} . \]
Since both expressions equal the same score, any convex combination of them, with weight \(1-c(t)\) on the first and \(c(t)\) on the second, is also valid. This gives the generalized target score identity: for any \(c(t) \in (0,1]\),
\[ \nabla \log \Pi^\star_t(x) = \mathbb{E}_{\Pi^\star_{0,T|t}} {\left[ \frac{1-c(t)}{1-\gamma(t)} \nabla_{X_0} \log \Pi^\star_{0,T} + \frac{c(t)}{\gamma(t)} \nabla_{X_T} \log \Pi^\star_{0,T} \;\Big|\; X_t = x \right]} . \tag{8}\]
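Equation 8 can be checked by Monte Carlo in a Gaussian toy case: take \(p_0 = \mathcal{N}(0, s^2)\) and \(\pi = \mathcal{N}(m, \tau^2)\) under the independent coupling, so \(\Pi^\star_t\) is Gaussian with mean \(\gamma m\) and variance \(V_t = (1-\gamma)^2 s^2 + \gamma^2 \tau^2 + \kappa(T)\gamma(1-\gamma)\), and the score is \(-(x - \gamma m)/V_t\). Everything is jointly Gaussian, so the conditional expectation on the right-hand side is linear in \(x\) and an OLS regression recovers it. A sketch (parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
s, tau, m = 0.7, 1.3, 0.5        # prior std, target std, target mean
kT, gamma, c = 1.0, 0.4, 0.3     # kappa(T), gamma(t), mixing weight c(t)
n = 1_000_000

x0 = s * rng.standard_normal(n)                       # X_0 ~ p_0
xT = m + tau * rng.standard_normal(n)                 # X_T ~ pi, independent
xt = ((1 - gamma) * x0 + gamma * xT
      + np.sqrt(kT * gamma * (1 - gamma)) * rng.standard_normal(n))  # Equation 6

# Right-hand side of Equation 8, before conditioning on X_t.
g = (1 - c) / (1 - gamma) * (-x0 / s**2) + (c / gamma) * (-(xT - m) / tau**2)

# OLS of g on x_t gives E[g | X_t = x] (linear in the Gaussian case).
slope = np.cov(g, xt)[0, 1] / xt.var()
intercept = g.mean() - slope * xt.mean()

V_t = (1 - gamma)**2 * s**2 + gamma**2 * tau**2 + kT * gamma * (1 - gamma)
# Equation 8 predicts E[g | X_t = x] = -(x - gamma*m)/V_t.
```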
General regression target
Now combine everything. The backward drift of a process in \(\mathcal{R}(\mathbb{P})\) has the form (from the Doob h-transform and the bridge structure)
\[ v^\star(x,t) = \mathbb{E}_{\Pi^\star_{0|t}} {\left[ \sigma_t \, \nabla_{X_t} \log \mathbb{P}_{t|0}(X_t \mid X_0) \;\Big|\; X_t = x \right]} . \tag{9}\]
Since \(\mathbb{P}_{t|0}\) is Gaussian \(\mathcal{N}(X_0, \kappa(t) I)\), this is \(-\sigma_t(X_t - X_0)/\kappa(t)\) averaged over \(X_0 \mid X_t\). Using Nelson’s relation \(u^\star = \sigma_t \nabla \log \Pi^\star_t - v^\star\) from Equation 3, and substituting Equation 8 for the score and Equation 9 for \(v^\star\), the non-Markovian drift \(\xi\) that we need to regress onto satisfies
\[ \sigma_t^{-1} \, \xi(X,t) = \frac{1-c(t)}{1-\gamma(t)} \nabla_{X_0} \log \Pi^\star_{0,T}(X_0,X_T) + \frac{c(t)}{\gamma(t)} \nabla_{X_T} \log \Pi^\star_{0,T}(X_0,X_T) - \nabla_{X_t} \log \mathbb{P}_{t|0}(X_t \mid X_0). \tag{10}\]
Here \((X_0, X_t, X_T)\) are all determined by the bridge path: \(X_0, X_T\) from the coupling, \(X_t\) from Equation 6. The Markovianization step fits \(u(X_t,t)\) to \(\mathbb{E}[\xi(X,t) \mid X_t]\) via Equation 5.
The tractability of Equation 10 depends entirely on the coupling scores \(\nabla \log \Pi^\star_{0,T}\).
Three couplings, three algorithms
Half-bridge / adjoint sampling. Set \(\Pi^\star_{0,T} = \delta_{x_0} \otimes \pi\) (Dirac prior, memoryless condition). Then \(\nabla_{X_0} \log \Pi^\star_{0,T}\) is undefined (Dirac), and \(\nabla_{X_T} \log \Pi^\star_{0,T} = \nabla \log \pi(X_T)\). Taking \(c(t) = 1\), which zeroes out the coefficient of the undefined term, and using \(x_0 = 0\), the general formula Equation 10 simplifies to
\[ \sigma_t^{-1} \, \xi(X,t) = \nabla_{X_T} \log \frac{\pi(X_T)}{\mathbb{P}_T(X_T)}, \tag{11}\]
where \(\mathbb{P}_T\) is the terminal marginal of the reference. Simple, but requires Dirac prior and large \(\sigma_t\) for exploration.
Full Schrodinger bridge / ASBS. Set \(\Pi^\star_{0,T} = \hat\varphi_0(x_0) \, \mathbb{P}_{T|0}(x_T \mid x_0) \, \varphi_T(x_T)\), the Schrodinger bridge coupling. The drift becomes
\[ \sigma_t^{-1} \, \xi(X,t) = \nabla_{X_T} \log \frac{\pi(X_T)}{\hat\varphi_T(X_T)}, \tag{12}\]
where \(\hat\varphi_T\) is the backward Schrodinger potential. This allows arbitrary priors, but \(\hat\varphi_T\) is unknown and must be learned alongside \(u\), requiring alternating IPF-style updates.
Independent coupling / BMS. Set
\[ \Pi^\star_{0,T} = p_0 \otimes \pi. \tag{13}\]
Plug into Equation 10. The coupling scores factor trivially: \(\nabla_{X_0} \log \Pi^\star_{0,T} = \nabla \log p_0(X_0)\) and \(\nabla_{X_T} \log \Pi^\star_{0,T} = \nabla \log \pi(X_T)\). The regression target becomes
\[ \sigma_t^{-1} \, \xi(X,t) = \frac{1-c(t)}{1-\gamma(t)} \nabla \log p_0(X_0) + \frac{c(t)}{\gamma(t)} \nabla \log \pi(X_T) + \frac{X_t - X_0}{\kappa(t)}. \tag{14}\]
Every term on the right is known: \(\nabla \log p_0\) is the prior score (Gaussian), \(\nabla \log \pi = \nabla \log \rho\) is the target score (computable), and \((X_t - X_0)/\kappa(t)\) is minus the Gaussian transition score \(\nabla_{X_t} \log \mathbb{P}_{t|0}\). No unknown potentials, no alternation.
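The regression target is then a few lines of code. A sketch for \(\sigma_t \equiv 1\) (so \(\kappa(t) = t\), \(\gamma(t) = t/T\)) with a standard-normal prior; `target_score` stands in for \(\nabla \log \rho\) and `c` for the mixing schedule, both illustrative choices here. The transition term enters with sign opposite to \(\nabla_{X_t} \log \mathbb{P}_{t|0} = -(X_t - X_0)/\kappa(t)\), since Equation 10 subtracts that score:

```python
import numpy as np

def bms_target(x0, xT, xt, t, T, prior_score, target_score, c):
    """Regression target xi of Equation 14, with sigma_t = 1 so kappa(t) = t."""
    gamma = t / T
    return ((1 - c(t)) / (1 - gamma) * prior_score(x0)
            + c(t) / gamma * target_score(xT)
            - (-(xt - x0) / t))   # minus the transition score grad log P_{t|0}

prior_score = lambda x: -x    # p_0 = N(0, I)
target_score = lambda x: -x   # illustrative: pi = N(0, I) as well
c = lambda t: t               # illustrative schedule c(t) = gamma(t)

xi = bms_target(x0=0.0, xT=1.0, xt=0.5, t=0.5, T=1.0,
                prior_score=prior_score, target_score=target_score, c=c)
```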
The independent coupling: why it works
The independent coupling \(p_0 \otimes \pi\) satisfies the boundary constraints by construction: marginalizing over \(X_T\) gives \(p_0\), marginalizing over \(X_0\) gives \(\pi\). Any coupling satisfying these boundary constraints yields a valid fixed point \(u^\star\) that transports \(p_0\) to \(\pi\). The Schrodinger bridge coupling minimizes the path-space KL to the reference, so it is optimal in that sense. The independent coupling sacrifices this optimality for a fully tractable regression target.
At each iteration, the coupling is \(\Pi^i_{0,T} = \mathbb{P}^{u_i}_0 \otimes \mathbb{P}^{u_i}_T\): independently resample \(X_0\) and \(X_T\) from their marginals under the current SDE. In practice, simulate trajectories, then randomly pair the initial and terminal samples.
Sampling the bridge is cheap: given \((x_0, x_T)\), draw \(X_t\) from Equation 6 and evaluate Equation 14. No full trajectory simulation needed during regression.
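One BMS data pass, putting the pieces together for a 1-D toy problem: simulate the current SDE, independently re-pair the endpoints, and draw bridge points to form training inputs. Everything below assumes \(\sigma_t \equiv 1\) and uses the zero control as a stand-in for the current iterate; in practice `u` would be the learned drift:

```python
import numpy as np

rng = np.random.default_rng(5)
T, n_steps, n_paths = 1.0, 50, 10_000
delta = T / n_steps

# 1. Simulate the current SDE (zero control as stand-in) from p_0 = N(0, I).
x = rng.standard_normal(n_paths)
x0 = x.copy()
for k in range(n_steps):
    x = x + np.sqrt(delta) * rng.standard_normal(n_paths)  # u = 0 stand-in
xT = x

# 2. Independent coupling: randomly re-pair initial and terminal samples.
xT = xT[rng.permutation(n_paths)]

# 3. Sample bridge points X_t | (X_0, X_T) via Equation 6 (kappa(t) = t).
t = rng.uniform(0.05, 0.95, size=n_paths)
gamma = t / T
xt = ((1 - gamma) * x0 + gamma * xT
      + np.sqrt(T * gamma * (1 - gamma)) * rng.standard_normal(n_paths))
# (xt, t), with the Equation 14 target evaluated at (x0, xT, xt),
# form the regression dataset for the Markovianization step.
```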
Damped iteration
The undamped iteration \(u_{i+1} = \Phi(u_i)\) can overshoot in high dimensions. The damped version uses step size \(\alpha \in (0,1]\):
\[ u_{i+1} = \alpha \, \Phi(u_i) + (1-\alpha) \, u_i. \tag{15}\]
Setting \(\eta = (1-\alpha)/\alpha\), this solves
\[ u_{i+1} = \mathop{\mathrm{argmin}}_u \; {\left\{ \mathbb{E}_{\Pi^i} {\left[ \int_0^T \frac{1}{2} \| \xi - u(X_t,t) \|^2 \, dt \right]} \;+\; \textcolor{blue}{\eta \, \mathbb{E}_{\Pi^i} {\left[ \int_0^T \frac{1}{2} \| u_i(X_t,t) - u(X_t,t) \|^2 \, dt \right]} } \right\}} . \tag{16}\]
The \( \textcolor{blue}{\text{second term}}\) penalizes deviation from the previous iterate. Each step balances fitting new bridge data against staying close to \(u_i\), preventing mode collapse from aggressive updates.
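The equivalence between Equation 15 and Equation 16 is pointwise: for fixed \((X_t, t)\) the penalized objective is quadratic in \(u\), and its minimizer is exactly the damped average. A quick numeric confirmation on scalars (values arbitrary):

```python
import numpy as np

xi, u_prev, alpha = 1.7, -0.4, 0.3   # target, previous iterate, step size
eta = (1 - alpha) / alpha

# Minimize 0.5*(xi - u)^2 + eta*0.5*(u_prev - u)^2 over a fine grid.
grid = np.linspace(-2.0, 2.0, 400_001)
loss = 0.5 * (xi - grid) ** 2 + eta * 0.5 * (u_prev - grid) ** 2
u_grid = grid[np.argmin(loss)]

u_closed = alpha * xi + (1 - alpha) * u_prev   # Equation 15
```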
Summary
| Method | Coupling \(\Pi^i_{0,T}\) | Regression target \(\sigma_t^{-1} \xi\) | Limitation |
|---|---|---|---|
| AS | \(\delta_{x_0} \otimes \mathbb{P}^{u_i}_T\) | \(\nabla \log [\pi/\mathbb{P}_T](X_T)\) | Dirac prior |
| ASBS | \(\mathbb{P}^{u_i}_{0,T}\) | \(\nabla \log [\pi/\hat\varphi_T](X_T)\) | Alternating opt. |
| BMS | \(\mathbb{P}^{u_i}_0 \otimes \mathbb{P}^{u_i}_T\) | Equation 14 | None (single obj.) |
All three converge to a fixed point \(u^\star\) transporting \(p_0\) to \(\pi\). The matching loss Equation 5 is a forward KL objective: \(u^\star = \mathop{\mathrm{argmin}}_u D_{\text{KL}}(\Pi^\star \,\|\, \mathbb{P}^u)\). Forward KL is mode-covering (it penalizes placing zero mass where \(\Pi^\star\) has mass), which partly explains the good mode diversity in practice.