Schrödinger Bridges

SDE
markov
Published

19 10 2025

Modified

18 02 2026

Erwin Schrödinger (1887 – 1961)

Consider two discrete probability distributions \(\nu_0\) and \(\nu_1\) over \([[1,n]]\). We would like to find a coupling \(\gamma(x_0,x_1)\) between \(\nu_0\) and \(\nu_1\) such that, under this coupling \((X_0,X_1) \sim \gamma\), the distance \(d(X_0,X_1)\) between \(X_0\) and \(X_1\) is small. Naturally, this can be formulated as the following optimal transport problem:

\[ \mathop{\mathrm{argmin}}_{\gamma \in \Pi(\nu_0, \nu_1)} \; \sum_{x_0,x_1} d(x_0,x_1) \, \gamma(x_0,x_1), \]

where \(\Pi(\nu_0, \nu_1)\) is the set of couplings between \(\nu_0\) and \(\nu_1\). It is a linear program and can be solved efficiently when \(n\) is not too large. However, the optimal transport plan is often very sparse since it puts mass on at most \((2n-1)\) pairs \((x_0,x_1)\). This can be undesirable in some applications. Furthermore, small changes in the distributions \(\nu_0\) and \(\nu_1\) can lead to large changes in the optimal transport plan. This sensitivity can be problematic in practice, especially when the distributions are estimated from data.
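As a concrete illustration (not part of the original discussion), the linear program can be set up directly with `scipy.optimize.linprog`; the marginals and the distance matrix below are arbitrary synthetic choices for the example:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n = 5
nu0 = rng.random(n); nu0 /= nu0.sum()   # synthetic source marginal
nu1 = rng.random(n); nu1 /= nu1.sum()   # synthetic target marginal
# Distance between points 1..n on a line (illustrative choice).
d = np.abs(np.subtract.outer(np.arange(n), np.arange(n))).astype(float)

# Equality constraints: row sums of gamma equal nu0, column sums equal nu1.
A_eq = np.zeros((2 * n, n * n))
for i in range(n):
    A_eq[i, i * n:(i + 1) * n] = 1.0   # i-th row-sum constraint
    A_eq[n + i, i::n] = 1.0            # i-th column-sum constraint
b_eq = np.concatenate([nu0, nu1])

res = linprog(d.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
gamma = res.x.reshape(n, n)
# A vertex of the transportation polytope has at most 2n - 1 nonzero entries,
# which is the sparsity phenomenon mentioned above.
```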

Static Schrödinger Bridge Problem


A standard way to address these issues is to add an entropic regularization term to the objective function. The resulting problem is known as the Schrödinger bridge problem and can be formulated as follows. Consider a reference joint distribution \(\mu_{\mathrm{ref}}(x_0,x_1)\) over \([[1,n]] \times [[1,n]]\) and find the coupling \(\gamma(x_0,x_1)\) that minimizes the Kullback-Leibler divergence to \(\mu_{\mathrm{ref}}\) while matching the marginals \(\nu_0\) and \(\nu_1\):

\[ \mathop{\mathrm{argmin}}_{\gamma \in \Pi(\nu_0, \nu_1)} \; \sum_{x_0,x_1} \gamma(x_0,x_1) \log \frac{\gamma(x_0,x_1)}{\mu_{\mathrm{ref}}(x_0,x_1)}. \tag{1}\]

Remark (invariance under separable rescaling). Only the “interaction structure” of \(\mu_{\mathrm{ref}}\) matters: if one replaces \(\mu_{\mathrm{ref}}\) by \(\tilde\mu_{\mathrm{ref}}(x_0,x_1)\propto a(x_0)\,\mu_{\mathrm{ref}}(x_0,x_1)\,b(x_1)\) with positive \(a,b\), then the optimal coupling can still be written in the same form below, with the factors \(a,b\) absorbed into the potentials. Equivalently, the solution depends on \(\mu_{\mathrm{ref}}\) only through its kernel up to left/right diagonal scaling.

A common choice for the reference measure is a distribution of the form \(\mu_{\mathrm{ref}}(x_0,x_1) \propto \exp(-d(x_0,x_1)/\varepsilon)\) (optionally times separable factors). This choice encourages the coupling \(\gamma\) to put more mass on pairs \((x_0,x_1)\) that are close according to the distance \(d\), and the resulting optimization problem can be rewritten (up to an additive constant) as:

\[ \mathop{\mathrm{argmin}}_{\gamma \in \Pi(\nu_0, \nu_1)} \; \sum_{x_0,x_1} d(x_0,x_1) \, \gamma(x_0,x_1) \textcolor{blue}{\; - \; \varepsilon\, \mathrm{H}(\gamma)} \tag{2}\]

where \(\mathrm{H}(\gamma) = - \sum_{x_0,x_1} \gamma(x_0,x_1) \log \gamma(x_0,x_1)\) is the entropy of the coupling \(\gamma\). Note that since the marginals of \(\gamma\) are fixed, it is also equivalent (up to constants) to replacing the entropy term by the Kullback-Leibler divergence to the independent coupling \(\nu_0(x_0) \otimes \nu_1(x_1)\),

\[ \mathop{\mathrm{argmin}}_{\gamma \in \Pi(\nu_0, \nu_1)} \; \sum_{x_0,x_1} d(x_0,x_1) \, \gamma(x_0,x_1) \textcolor{blue}{\; + \; \varepsilon\, D_{\text{KL}}(\gamma \mid \nu_0 \otimes \nu_1)}. \]
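To see this equivalence, expand the Kullback-Leibler term and use the fact that the marginals of \(\gamma\) are fixed:

\[ D_{\text{KL}}(\gamma \mid \nu_0 \otimes \nu_1) = \sum_{x_0,x_1} \gamma(x_0,x_1) \log \gamma(x_0,x_1) - \sum_{x_0} \nu_0(x_0) \log \nu_0(x_0) - \sum_{x_1} \nu_1(x_1) \log \nu_1(x_1) = -\mathrm{H}(\gamma) + \mathrm{H}(\nu_0) + \mathrm{H}(\nu_1), \]

so the two objectives differ only by the constant \(\mathrm{H}(\nu_0) + \mathrm{H}(\nu_1)\), which does not depend on \(\gamma\).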

Writing the Lagrange dual formulation of Equation 2 yields fast algorithms for solving it, such as the iterative proportional fitting procedure (IPFP), also known as the Sinkhorn-Knopp algorithm in the optimal transport literature (Cuturi 2013). Importantly, it also almost immediately shows that the optimal coupling has a very simple form,

\[ \gamma(x_0,x_1) = f(x_0) \, \mu_{\mathrm{ref}}(x_0,x_1) \, g(x_1). \tag{3}\]

These potentials must satisfy the marginal constraints, equivalently the Schrödinger/Sinkhorn scaling equations:

\[ \left\{ \begin{aligned} \nu_0(x_0) &= \sum_{x_1} f(x_0)\, \mu_{\mathrm{ref}}(x_0,x_1)\, g(x_1),\\ \nu_1(x_1) &= \sum_{x_0} f(x_0)\, \mu_{\mathrm{ref}}(x_0,x_1)\, g(x_1). \end{aligned} \right. \]
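The IPFP / Sinkhorn iteration alternately enforces the two scaling equations. A minimal numpy sketch on synthetic marginals (the grid, cost, and regularization \(\varepsilon\) are arbitrary choices for the example):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
nu0 = rng.random(n); nu0 /= nu0.sum()
nu1 = rng.random(n); nu1 /= nu1.sum()
x = np.linspace(0.0, 1.0, n)
d = (x[:, None] - x[None, :]) ** 2        # squared distance on a grid
eps = 0.5
K = np.exp(-d / eps)                      # reference kernel mu_ref, up to scaling

f = np.ones(n)
g = np.ones(n)
for _ in range(500):
    f = nu0 / (K @ g)                     # enforce the first marginal equation
    g = nu1 / (K.T @ f)                   # enforce the second marginal equation

gamma = f[:, None] * K * g[None, :]       # optimal coupling of Equation 3
```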

Some details:

The Lagrangian of the constrained convex optimization Equation 1 reads: \[ \begin{align*} \sum_{x_0,x_1} \gamma(x_0,x_1) \log \frac{\gamma(x_0,x_1)}{\mu_{\mathrm{ref}}(x_0,x_1)} &+ \sum_{x_0} \alpha(x_0) \left( \nu_0(x_0) - \sum_{x_1} \gamma(x_0,x_1) \right)\\ &+ \sum_{x_1} \beta(x_1) \left( \nu_1(x_1) - \sum_{x_0} \gamma(x_0,x_1) \right). \end{align*} \] There is no need to impose a constraint on the total mass of \(\gamma\) since it is automatically satisfied by the marginal constraints. For a fixed set of dual variables \(\alpha\) and \(\beta\), minimizing the Lagrangian with respect to \(\gamma\) leads to the optimum \(\gamma_{\alpha, \beta}(x_0,x_1) = \mu_{\mathrm{ref}}(x_0,x_1) \, e^{\alpha(x_0) + \beta(x_1) -1}\), hence proving Equation 3 with \(f=e^{\alpha-1}\) and \(g=e^\beta\) (up to scaling). Plugging this expression back into the Lagrangian gives: \[ \mathcal{L}(\alpha, \beta) = \sum_{x_0} \alpha(x_0) \, \nu_0(x_0) + \sum_{x_1} \beta(x_1) \, \nu_1(x_1) - \sum_{x_0,x_1} \gamma_{\alpha, \beta}(x_0,x_1). \] Alternating maximization in \((\alpha,\beta)\) corresponds to alternately enforcing the two marginal constraints, i.e. IPFP / Sinkhorn scaling. Directly maximizing the Lagrange dual \(\mathcal{L}(\alpha, \beta)\) using gradient ascent methods such as L-BFGS is also possible and can lead to faster convergence in some cases. Note that the dual problem only has \(2n\) variables, while the primal problem has \(n^2\) variables!
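The direct dual-ascent route mentioned above can be sketched with `scipy.optimize.minimize` and L-BFGS-B, maximizing \(\mathcal{L}(\alpha, \beta)\) (i.e. minimizing its negative) over the \(2n\) dual variables; the reference kernel and marginals below are synthetic:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n = 30
nu0 = rng.random(n); nu0 /= nu0.sum()
nu1 = rng.random(n); nu1 /= nu1.sum()
x = np.linspace(0.0, 1.0, n)
mu = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.5)
mu /= mu.sum()                            # synthetic reference measure

def neg_dual(z):
    """Negative dual objective and its gradient (marginal residuals)."""
    a, b = z[:n], z[n:]
    gam = mu * np.exp(a[:, None] + b[None, :] - 1.0)
    val = a @ nu0 + b @ nu1 - gam.sum()
    grad = np.concatenate([nu0 - gam.sum(axis=1), nu1 - gam.sum(axis=0)])
    return -val, -grad

res = minimize(neg_dual, np.zeros(2 * n), jac=True, method="L-BFGS-B",
               options={"maxiter": 1000, "ftol": 1e-14, "gtol": 1e-9})
a, b = res.x[:n], res.x[n:]
gamma = mu * np.exp(a[:, None] + b[None, :] - 1.0)
```

Note that the gradient of the dual is exactly the vector of marginal residuals, so the stopping tolerance directly controls how well the constraints are met.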

Dynamic Schrödinger Bridge Problem

Now, suppose that the reference distribution \(\mu_{\mathrm{ref}}(x_0,x_1)\) is given as the two-time marginal of a Markov process \((X_t)_{t \in [0,1]}\) with transition kernels \(p^{\text{ref}}_{s,t}(x_s,x_t)\) and path measure \(\mathbb{P}^{\text{ref}}\) on trajectories \(x_{[0,1]}\).

The dynamic Schrödinger bridge problem consists in finding a new path measure \(\mathbb{Q}\) on trajectories \(x_{[0,1]}\) such that the starting and ending marginals are \(\nu_0\) and \(\nu_1\), while minimizing the Kullback-Leibler divergence to the reference path distribution \(\mathbb{P}^{\text{ref}}\):

\[ \mathop{\mathrm{argmin}}_{\mathbb{Q}:\ \mathbb{Q}_0=\nu_0,\ \mathbb{Q}_1=\nu_1}\; \mathrm{KL}(\mathbb{Q}\mid \mathbb{P}^{\text{ref}}). \]

When the reference is Markov, the chain rule for KL (disintegration with respect to \((X_0,X_1)\)) shows this is equivalent to the static Schrödinger bridge problem for the two-time marginal of \(\mathbb{P}^{\text{ref}}\): one first solves for the optimal endpoint coupling \(\gamma^\star(x_0,x_1)\), and then fills in intermediate times using the reference bridges \(\mathbb{P}^{\text{ref}}(\cdot\mid X_0=x_0,X_1=x_1)\):

  1. Sample the endpoints \((X_0, X_1) \sim \gamma^\star(x_0,x_1)\).
  2. Sample the intermediate points according to the conditional law of the reference process given the endpoints, i.e. \((X_t)_{t \in (0,1)} \sim \mathbb{P}^{\text{ref}}(\cdot \mid X_0, X_1)\).
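The two-step sampling procedure above can be sketched for the simplest reference process, Brownian motion with diffusion \(\sigma\) (an illustrative assumption, not the general case): conditioned on its endpoints, the reference process is a Brownian bridge, which can be sampled sequentially with Gaussian draws.

```python
import numpy as np

def sample_brownian_bridge(x0, x1, times, rng, sigma=1.0):
    """Sample a Brownian-motion reference path pinned at X_0 = x0, X_1 = x1.

    Sequential conditioning: given X_s = xs and the endpoint X_1 = x1,
    X_t (s < t < 1) is Gaussian with mean linearly interpolating xs and x1
    and variance sigma^2 (t - s)(1 - t) / (1 - s).
    """
    path = [x0]
    s, xs = 0.0, x0
    for t in times:
        mean = xs + (t - s) / (1.0 - s) * (x1 - xs)
        var = sigma ** 2 * (t - s) * (1.0 - t) / (1.0 - s)
        xs = mean + np.sqrt(var) * rng.standard_normal()
        path.append(xs)
        s = t
    path.append(x1)
    return np.array(path)

rng = np.random.default_rng(3)
# Endpoints would come from the static optimal coupling gamma^*;
# here they are fixed by hand for illustration.
path = sample_brownian_bridge(0.0, 2.0, np.linspace(0.1, 0.9, 9), rng)
```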

Continuous-Time Stochastic Processes:

A typical setting is when the reference process \((X_t)\) is given as the solution of a stochastic differential equation of the form

\[ dX_t = b_t(X_t) \, dt + \sigma \, dW_t, \]

started from some initial distribution (one may take it to be \(\nu_0\) without loss of generality, since any mismatch can be absorbed into the endpoint tilt below). The solution to the dynamic Schrödinger bridge problem is given by the twisted path distribution:

\[ \frac{d\mathbb{Q}^\star}{d\mathbb{P}^{\text{ref}}}(x_{[0,1]}) \propto f(X_0)\,g(X_1), \]

for endpoint potentials \(f\) and \(g\) satisfying the marginal constraints at \(t=0\) and \(t=1\).

The marginal density at intermediate time \(t\in[0,1]\) factorizes as:

\[ q_t(x) = p_t^{\text{ref}}(x)\,\widehat{\varphi}_t(x)\,\varphi_t(x), \]

where \(p_t^{\text{ref}}\) is the time-\(t\) marginal density of the reference process, and where the time-dependent Schrödinger potentials are defined by:

\[ \left\{ \begin{aligned} \varphi_t(x) &= \mathbb{E}_{\mathbb{P}^{\text{ref}}}[g(X_1)\mid X_t=x],\\ \widehat{\varphi}_t(x) &= \mathbb{E}_{\mathbb{P}^{\text{ref}}}[f(X_0)\mid X_t=x], \end{aligned} \right. \]

(with \(\varphi_1 \equiv g\) and \(\widehat{\varphi}_0 \equiv f\), up to normalization). Naturally, the dynamics of the Schrödinger bridge process can be described as a new stochastic differential equation obtained by applying a Doob h-transform to the reference SDE,

\[ dX_t = b_t(X_t)\,dt + \textcolor{blue}{\sigma \sigma^\top \nabla_x \log \varphi_t( X_t)\,dt} + \sigma\,dW_t, \]

where, as explained in these previous notes, the function \(\varphi_t(x)\) is the harmonic extension of the terminal potential \(g\) defined above.

Discrete-Time Markov Chains:

It is often useful to state the Schrödinger bridge dynamics in a discrete setting. Let the reference process be a Markov chain \(X_0, \ldots, X_T\) with one-step kernels \(M_k(x,dy)\), i.e. \[ \mathbb{P}^{\text{ref}}(X_{k+1}\in dy\mid X_k=x)=M_k(x,dy). \] Let \(g\) be the terminal potential at time \(T\) and define the backward (harmonic) potentials defined by \(\varphi_T \equiv g\) and recursively as: \[ \varphi_k(x) = \int M_k(x,dy)\,\varphi_{k+1}(y). \]

Then the Schrödinger bridge \(\mathbb{Q}^\star\) is Markov and its forward transition kernels are given by the discrete Doob \(h\)-transform: \[ \mathbb{Q}^\star(X_{k+1}\in dy\mid X_k=x) =\frac{M_k(x,dy)\,\varphi_{k+1}(y)}{\varphi_k(x)}. \] This is the discrete-time counterpart of the continuous-time drift correction \(b_t \mapsto b_t + \sigma\sigma^\top\nabla\log\varphi_t(\cdot)\).
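The backward recursion and the discrete \(h\)-transform can be sketched in a few lines of numpy on a finite state space; the random reference kernels and terminal potential are synthetic:

```python
import numpy as np

rng = np.random.default_rng(4)
n, T = 6, 4
# Reference chain: random row-stochastic one-step kernels M_0, ..., M_{T-1}.
M = rng.random((T, n, n))
M /= M.sum(axis=2, keepdims=True)

g = rng.random(n)                 # terminal potential at time T

# Backward recursion: phi_k(x) = sum_y M_k(x, y) phi_{k+1}(y), phi_T = g.
phi = np.empty((T + 1, n))
phi[T] = g
for k in range(T - 1, -1, -1):
    phi[k] = M[k] @ phi[k + 1]

# h-transformed forward kernels:
# Q_k(x, y) = M_k(x, y) phi_{k+1}(y) / phi_k(x).
Q = M * phi[1:, None, :] / phi[:-1, :, None]
```

By construction each \(Q_k\) is row-stochastic: summing over \(y\) reproduces \(\varphi_k(x)/\varphi_k(x) = 1\).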

In general, Schrödinger bridge problems are difficult since the endpoint potentials \(f\) and \(g\) must be found so that the marginal constraints are satisfied. Recent years have seen the development of many numerical methods to solve this problem approximately, especially in the machine learning community, e.g. (Shi et al. 2023); (De Bortoli et al. 2021); (Vargas et al. 2021); (Chen, Liu, and Theodorou 2021).

References

Chen, Tianrong, Guan-Horng Liu, and Evangelos A. Theodorou. 2021. “Likelihood Training of Schrödinger Bridge Using Forward-Backward SDEs Theory.” arXiv preprint arXiv:2110.11291.
Cuturi, Marco. 2013. “Sinkhorn Distances: Lightspeed Computation of Optimal Transport.” Advances in Neural Information Processing Systems 26.
De Bortoli, Valentin, James Thornton, Jeremy Heng, and Arnaud Doucet. 2021. “Diffusion Schrödinger Bridge with Applications to Score-Based Generative Modeling.” Advances in Neural Information Processing Systems 34: 17695–709.
Shi, Yuyang, Valentin De Bortoli, Andrew Campbell, and Arnaud Doucet. 2023. “Diffusion Schrödinger Bridge Matching.” Advances in Neural Information Processing Systems 36: 62183–223.
Vargas, Francisco, Pierre Thodoroff, Austen Lamacraft, and Neil Lawrence. 2021. “Solving Schrödinger Bridges via Maximum Likelihood.” Entropy 23 (9): 1134.