<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Alexandre Thiéry</title>
<link>https://alexxthiery.github.io/notes/index_notes_as_list.html</link>
<atom:link href="https://alexxthiery.github.io/notes/index_notes_as_list.xml" rel="self" type="application/rss+xml"/>
<description>Alex Thiery Notes</description>
<generator>quarto-1.7.32</generator>
<lastBuildDate>Mon, 23 Feb 2026 17:00:00 GMT</lastBuildDate>
<item>
  <title>The mean-field Potts model</title>
  <link>https://alexxthiery.github.io/notes/potts_transition/potts.html</link>
  <description><![CDATA[ 





<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/potts_transition/potts_scatter.gif" class="img-fluid figure-img" style="width:85.0%"></p>
<figcaption>Mean-field Potts model with <img src="https://latex.codecogs.com/png.latex?q=3">, displayed in barycentric coordinates</figcaption>
</figure>
</div>
</div>
<p>The mean-field <a href="https://en.wikipedia.org/wiki/Potts_model">Potts model</a> is a very simple model with a first-order phase transition; for Monte Carlo simulation, it is especially interesting because this first-order <a href="https://en.wikipedia.org/wiki/Phase_transition">phase transition</a> implies that standard tempering strategies such as <a href="https://en.wikipedia.org/wiki/Parallel_tempering">parallel tempering</a> or sequential Monte Carlo are inefficient: increasing the number of intermediate temperatures does not help (much) in producing samples at low temperatures <span class="citation" data-cites="woodard2009sufficient bhatnagar2004torpid">(Woodard, Schmidler, and Huber 2009; Bhatnagar and Randall 2004)</span>.</p>
<section id="potts-model" class="level3">
<h3 class="anchored" data-anchor-id="potts-model">Potts model</h3>
<p>Consider the Potts model with <img src="https://latex.codecogs.com/png.latex?q"> colors on the complete graph with <img src="https://latex.codecogs.com/png.latex?N"> vertices. A configuration is <img src="https://latex.codecogs.com/png.latex?%5Csigma=(%5Csigma_1,%5Cdots,%5Csigma_N)"> with <img src="https://latex.codecogs.com/png.latex?%5Csigma_i%5Cin%5C%7B1,%5Cdots,q%5C%7D">. Define the energy <img src="https://latex.codecogs.com/png.latex?%0AE(%5Csigma)%5C;=%5C;-%5Cfrac%7B1%7D%7B2N%7D%5Csum_%7Bi,j=1%7D%5EN%20%5Cmathbf%201%5C%7B%5Csigma_i=%5Csigma_j%5C%7D.%0A"> At inverse temperature <img src="https://latex.codecogs.com/png.latex?%5Cbeta">, the Boltzmann distribution is <img src="https://latex.codecogs.com/png.latex?%5Cmu_%5Cbeta(%5Csigma)=e%5E%7B-%5Cbeta%20E(%5Csigma)%7D%20/%20Z_%5Cbeta">. On the complete graph, the only relevant macroscopic variable is the empirical proportions vector <img src="https://latex.codecogs.com/png.latex?%0A%5Crho=(%5Crho_1,%5Cdots,%5Crho_q),%5Cqquad%0A%5Crho_a=%5Cfrac1N%5Cbigl%7C%5C%7Bi:%5Csigma_i=a%5C%7D%5Cbigr%7C.%0A"></p>
<p>so that <img src="https://latex.codecogs.com/png.latex?%5Csum_%7Ba=1%7D%5Eq%20%5Crho_a=1">. A short computation rewrites the energy in terms of <img src="https://latex.codecogs.com/png.latex?%5Crho">: <img src="https://latex.codecogs.com/png.latex?%0AE(%5Crho)=%20-%5Cfrac%7BN%7D%7B2%7D%5Csum_%7Ba=1%7D%5Eq%20%5Crho_a%5E2,%0A"> up to an additive constant irrelevant for Gibbs weights. The number of configurations with a given <img src="https://latex.codecogs.com/png.latex?%5Crho"> is <img src="https://latex.codecogs.com/png.latex?%5Cexp%5C%7BN%20%5C,%20H(%5Crho)+o(N)%5C%7D">, where <img src="https://latex.codecogs.com/png.latex?H(%5Crho)=-%5Csum_a%20%5Crho_a%5Clog%5Crho_a"> is the <a href="https://en.wikipedia.org/wiki/Entropy_(information_theory)">Shannon entropy</a>. Putting energy and entropy together, typical samples from <img src="https://latex.codecogs.com/png.latex?%5Cmu_%5Cbeta"> concentrate near minimizers of the mean-field free-energy functional: <img src="https://latex.codecogs.com/png.latex?%0A%5CPhi_%5Cbeta(%5Crho)%0A=%5Csum_%7Ba=1%7D%5Eq%20%5Crho_a%5Clog%5Crho_a%0A-%5Cfrac%7B%5Cbeta%7D%7B2%7D%5Csum_%7Ba=1%7D%5Eq%20%5Crho_a%5E2%0A"></p>
<p>constrained to the probability simplex. Everything that follows is geometry: how the minima of <img src="https://latex.codecogs.com/png.latex?%5CPhi_%5Cbeta"> evolve as the inverse temperature <img src="https://latex.codecogs.com/png.latex?%5Cbeta"> varies.</p>
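<p>The free-energy functional above is straightforward to evaluate numerically. The following minimal sketch (plain numpy; not part of the original notes) computes it on the simplex and illustrates the color-permutation symmetry exploited below:</p>

```python
import numpy as np

def potts_free_energy(rho, beta):
    """Phi_beta(rho) = sum_a rho_a log(rho_a) - (beta/2) sum_a rho_a^2."""
    rho = np.asarray(rho, dtype=float)
    return np.sum(rho * np.log(rho)) - 0.5 * beta * np.sum(rho**2)

q, beta = 3, 2.5
rho = np.array([0.5, 0.3, 0.2])
# permuting the colors leaves Phi_beta unchanged
print(potts_free_energy(rho, beta), potts_free_energy(rho[[2, 0, 1]], beta))
# value at the disordered point: log(1/q) - beta/(2q)
print(potts_free_energy(np.full(q, 1 / q), beta))
```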
<p><strong>Local minima:</strong> Two features make the analysis almost trivial. First, permuting colors leaves <img src="https://latex.codecogs.com/png.latex?%5CPhi_%5Cbeta"> unchanged. Second, a stationary point under the simplex constraint satisfies a Lagrange-multiplier condition <img src="https://latex.codecogs.com/png.latex?%0A%5Clog%5Crho_a+1-%5Cbeta%5Crho_a=%5Clambda%0A"> for <img src="https://latex.codecogs.com/png.latex?a=1,%5Cdots,q">. Hence each coordinate <img src="https://latex.codecogs.com/png.latex?%5Crho_a"> must solve the same scalar equation. From this, one finds that all local minima are necessarily of the following two types:</p>
<ul>
<li><p><strong>Disordered point</strong> (uniform): <img src="https://latex.codecogs.com/png.latex?%0A%5Crho%5E%7B%5Cmathrm%7Bdis%7D%7D=%5CBigl(%5Cfrac1q,%5Cdots,%5Cfrac1q%5CBigr),%0A"> always a stationary point.</p></li>
<li><p><strong>Ordered points</strong> (one dominant color, the rest equal): <img src="https://latex.codecogs.com/png.latex?%0A%5Crho%5E%7B%5Cmathrm%7Bord%7D%7D(r)=%5CBigl(r,%5C%20%5Cunderbrace%7B%5Cfrac%7B1-r%7D%7Bq-1%7D,%5Cdots,%5Cfrac%7B1-r%7D%7Bq-1%7D%7D_%7Bq-1%5C%20%5Ctext%7Btimes%7D%7D%5CBigr)%0A"> for some dominant proportion <img src="https://latex.codecogs.com/png.latex?r%3E1/q">. There are <img src="https://latex.codecogs.com/png.latex?q"> such points by choosing which color is dominant.</p></li>
</ul>
<p>The remaining work is algebra: determine, as a function of <img src="https://latex.codecogs.com/png.latex?%5Cbeta">, when ordered stationary points exist, and which of the stationary points are local minima versus saddles. The details are routine and add little insight.</p>
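<p>The routine algebra can also be checked numerically. A sketch (numpy assumed; not from the original notes): restrict the free energy to the ordered slice and scan the inverse temperature for the first appearance of an interior local minimum with a dominant proportion above <img src="https://latex.codecogs.com/png.latex?1/q">:</p>

```python
import numpy as np

q = 3
rs = np.linspace(1 / q + 0.02, 0.999, 4000)

def phi_ordered(r, beta):
    """Phi_beta along the ordered slice rho = (r, (1-r)/(q-1), ..., (1-r)/(q-1))."""
    rest = (1 - r) / (q - 1)
    entropy = r * np.log(r) + (q - 1) * rest * np.log(rest)
    return entropy - 0.5 * beta * (r**2 + (q - 1) * rest**2)

def has_ordered_minimum(beta):
    """Does Phi_beta have an interior local minimum with r > 1/q on the slice?"""
    v = phi_ordered(rs, beta)
    return bool(np.any((v[1:-1] < v[:-2]) & (v[1:-1] < v[2:])))

# smallest beta (on a grid) at which ordered local minima exist, for q = 3
beta_first = next(b for b in np.arange(2.6, 2.8, 0.001) if has_ordered_minimum(b))
print(beta_first)
```

For <img src="https://latex.codecogs.com/png.latex?q=3">, the scan locates the appearance of ordered minima strictly below the coexistence value <img src="https://latex.codecogs.com/png.latex?4%5Clog%202%20%5Capprox%202.77"> from the next section.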
</section>
<section id="the-phase-diagram-in-beta" class="level2">
<h2 class="anchored" data-anchor-id="the-phase-diagram-in-beta">The phase diagram in <img src="https://latex.codecogs.com/png.latex?%5Cbeta"></h2>
<p>Assume <img src="https://latex.codecogs.com/png.latex?q%5Cge%203">. Then the model exhibits a first-order transition, and metastability appears on both sides. There are two key inverse temperatures:</p>
<ul>
<li><strong>Spinodal threshold <img src="https://latex.codecogs.com/png.latex?%5Cbeta_s"></strong>: the inverse temperature at which the ordered stationary points first appear as local minima.</li>
<li><strong>Coexistence threshold <img src="https://latex.codecogs.com/png.latex?%5Cbeta_c"></strong>: the disordered minimum and the ordered minimum have equal free energy.</li>
</ul>
<p>For the normalization above, the coexistence point is <img src="https://latex.codecogs.com/png.latex?%0A%5Cbeta_c=%5Cfrac%7B2(q-1)%5Clog(q-1)%7D%7Bq-2%7D.%0A"></p>
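<p>This formula is easy to verify numerically: at the coexistence point, the disordered free energy and the minimum of the free energy along the ordered slice should coincide. A quick check (numpy; a sketch, not from the original notes):</p>

```python
import numpy as np

def coexistence_gap(q):
    """Phi at the disordered point minus the slice minimum of Phi, at beta_c."""
    beta_c = 2 * (q - 1) * np.log(q - 1) / (q - 2)
    r = np.linspace(1 / q + 1e-4, 1 - 1e-4, 200001)
    rest = (1 - r) / (q - 1)
    phi_ord = (r * np.log(r) + (q - 1) * rest * np.log(rest)
               - 0.5 * beta_c * (r**2 + (q - 1) * rest**2))
    phi_dis = np.log(1 / q) - 0.5 * beta_c / q
    return phi_dis - phi_ord.min()

for q in (3, 5, 10):
    print(q, coexistence_gap(q))   # gaps vanish up to grid resolution
```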
<p>The geometry of <img src="https://latex.codecogs.com/png.latex?%5CPhi_%5Cbeta"> splits into three regimes.</p>
<section id="high-temperature-betabeta_s" class="level3">
<h3 class="anchored" data-anchor-id="high-temperature-betabeta_s">High temperature: <img src="https://latex.codecogs.com/png.latex?%5Cbeta%3C%5Cbeta_s"></h3>
<ul>
<li><strong>Critical points:</strong> only the uniform point <img src="https://latex.codecogs.com/png.latex?%5Crho%5E%7B%5Cmathrm%7Bdis%7D%7D">.</li>
<li><strong>Landscape:</strong> <img src="https://latex.codecogs.com/png.latex?%5CPhi_%5Cbeta"> is strictly minimized at <img src="https://latex.codecogs.com/png.latex?%5Crho%5E%7B%5Cmathrm%7Bdis%7D%7D">.</li>
<li><strong>Meaning:</strong> entropy dominates; colors mix.</li>
</ul>
</section>
<section id="metastable-coexistence-beta_sbetabeta_c" class="level3">
<h3 class="anchored" data-anchor-id="metastable-coexistence-beta_sbetabeta_c">Metastable coexistence: <img src="https://latex.codecogs.com/png.latex?%5Cbeta_s%3C%5Cbeta%3C%5Cbeta_c"></h3>
<ul>
<li><strong>Critical points:</strong> the uniform point remains a (global) minimum, but there are also <img src="https://latex.codecogs.com/png.latex?q"> ordered local minima <img src="https://latex.codecogs.com/png.latex?%5Crho%5E%7B%5Cmathrm%7Bord%7D%7D"> and corresponding saddles separating them from the uniform basin.</li>
<li><strong>Landscape:</strong> multiple basins exist, but the uniform basin is lowest.</li>
</ul>
</section>
<section id="low-temperature-betabeta_c" class="level3">
<h3 class="anchored" data-anchor-id="low-temperature-betabeta_c">Low temperature: <img src="https://latex.codecogs.com/png.latex?%5Cbeta%3E%5Cbeta_c"></h3>
<ul>
<li><strong>Critical points:</strong> the ordered minima become the global minima; the uniform point persists as a local minimum (until it eventually disappears at a second spinodal on the low-temperature side).</li>
<li><strong>Landscape:</strong> the roles swap: ordered basins are deepest; the uniform basin becomes metastable.</li>
<li><strong>Meaning:</strong> energy dominates; one color wins.</li>
</ul>
<p>At <img src="https://latex.codecogs.com/png.latex?%5Cbeta=%5Cbeta_c">, the global minimizer changes discontinuously (as can be seen in the animation at the start of these notes): the equilibrium state jumps from <img src="https://latex.codecogs.com/png.latex?%5Crho%5E%7B%5Cmathrm%7Bdis%7D%7D"> to an ordered vector with one strictly larger coordinate. This is in contrast with the (mean-field) Curie–Weiss Ising model (<img src="https://latex.codecogs.com/png.latex?q=2">), where the ordered minimizers bifurcate continuously from the disordered one, i.e.&nbsp;a second-order transition. For <img src="https://latex.codecogs.com/png.latex?q%5Cge%203">, ordered minima appear while the uniform minimum is still globally optimal, and the eventual swap of global minima happens with a jump. That single change in the geometry of <img src="https://latex.codecogs.com/png.latex?%5CPhi_%5Cbeta"> is the entire origin of the first-order transition and the metastable window.</p>
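<p>The jump can be traced numerically by following the global minimizer along the ordered slice as the inverse temperature crosses the coexistence point (a sketch for <img src="https://latex.codecogs.com/png.latex?q=3">; numpy assumed, not from the original notes):</p>

```python
import numpy as np

q = 3
beta_c = 2 * (q - 1) * np.log(q - 1) / (q - 2)   # = 4 log 2 for q = 3
r = np.linspace(1 / q, 1 - 1e-6, 400001)
rest = (1 - r) / (q - 1)
entropy = r * np.log(r) + (q - 1) * rest * np.log(rest)
sum_sq = r**2 + (q - 1) * rest**2

def equilibrium_r(beta):
    """Dominant proportion of the global minimizer on the ordered slice."""
    return r[np.argmin(entropy - 0.5 * beta * sum_sq)]

print(equilibrium_r(beta_c - 0.05))   # disordered: r = 1/q
print(equilibrium_r(beta_c + 0.05))   # ordered: one dominant color
```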



</section>
</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0">
<div id="ref-bhatnagar2004torpid" class="csl-entry">
Bhatnagar, Nayantara, and Dana Randall. 2004. <span>“Torpid Mixing of Simulated Tempering on the Potts Model.”</span> In <em>SODA</em>, 4:478–87.
</div>
<div id="ref-woodard2009sufficient" class="csl-entry">
Woodard, Dawn, Scott Schmidler, and Mark Huber. 2009. <span>“Sufficient Conditions for Torpid Mixing of Parallel and Simulated Tempering.”</span>
</div>
</div></section></div> ]]></description>
  <category>analysis</category>
  <guid>https://alexxthiery.github.io/notes/potts_transition/potts.html</guid>
  <pubDate>Mon, 23 Feb 2026 17:00:00 GMT</pubDate>
</item>
<item>
  <title>Gegenbauer Polynomials and the Laplacian</title>
  <link>https://alexxthiery.github.io/notes/gegenbauer/gegenbauer.html</link>
  <description><![CDATA[ 





<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/gegenbauer/gegenbauer.jpg" class="img-fluid figure-img" style="width:35.0%"></p>
<figcaption><a href="https://en.wikipedia.org/wiki/Leopold_Gegenbauer">Leopold Gegenbauer</a> (1849 – 1903)</figcaption>
</figure>
</div>
</div>
<section id="zonal-spherical-harmonics" class="level3">
<h3 class="anchored" data-anchor-id="zonal-spherical-harmonics">Zonal spherical harmonics</h3>
<p>Understanding the eigenfunctions of the spherical Laplacian is a central task in analysis on the sphere: these eigenfunctions form the building blocks of harmonic analysis on spheres.</p>
<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/gegenbauer/Spherical_Harmonics.png" class="img-fluid figure-img" style="width:85.0%"></p>
<figcaption>Spherical Harmonics</figcaption>
</figure>
</div>
</div>
<p>A particularly important situation arises when the function of interest depends only on the angular separation from a fixed axis.</p>
<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/gegenbauer/zonal.jpg" class="img-fluid figure-img" style="width:35.0%"></p>
<figcaption>Zonal Spherical Harmonics</figcaption>
</figure>
</div>
</div>
<p>Such zonal functions retain full rotational symmetry around that axis and reduce the spherical Laplacian to a one-dimensional operator. Its eigenfunctions turn out to be polynomials on the interval <img src="https://latex.codecogs.com/png.latex?%5B-1,1%5D"> with a very specific weight. These polynomials are the <a href="https://en.wikipedia.org/wiki/Gegenbauer_polynomials">Gegenbauer polynomials</a>, which generalize several classical families of orthogonal polynomials, including the Legendre and Chebyshev polynomials.</p>
</section>
<section id="reminder-on-differential-operators" class="level2">
<h2 class="anchored" data-anchor-id="reminder-on-differential-operators">Reminder on differential operators</h2>
<p>Many problems require the ability to compute gradients, divergences, and Laplacians in arbitrary coordinate systems. Before specializing to spherical coordinates in higher dimensions, it is useful to recall the geometric meaning of these operators and the minimal formalism needed to manipulate them. Suppose that a point in space is described by a collection of coordinates <img src="https://latex.codecogs.com/png.latex?%0Aq%20=%20(q%5E1,%20%5Cdots,%20q%5En).%0A"> If we make a small change <img src="https://latex.codecogs.com/png.latex?(dq%5E1,%5Cdots,dq%5En)">, the physical displacement has a length denoted by <img src="https://latex.codecogs.com/png.latex?ds">. In an orthogonal coordinate system, each coordinate direction comes with a scale factor <img src="https://latex.codecogs.com/png.latex?h_i(q)"> such that a change <img src="https://latex.codecogs.com/png.latex?dq%5Ei"> corresponds to a “physical” displacement of length <img src="https://latex.codecogs.com/png.latex?h_i(q)%5C,%20dq%5Ei"> along the unit vector <img src="https://latex.codecogs.com/png.latex?e_i">. These functions <img src="https://latex.codecogs.com/png.latex?h_i"> encode how the coordinate system stretches or compresses distances along each axis. Once the scale factors are known, the rest of the differential operators follow directly.</p>
<p><strong>The gradient:</strong> For a scalar function <img src="https://latex.codecogs.com/png.latex?f(q)">, the change produced by varying only <img src="https://latex.codecogs.com/png.latex?q%5Ei"> is approximately <img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B%5Cpartial%20f%7D%7B%5Cpartial%20q%5Ei%7D%20%5C,%20dq%5Ei">. The physical distance traveled in this move is <img src="https://latex.codecogs.com/png.latex?h_i%20%5C,%20dq%5Ei">. The rate of increase of <img src="https://latex.codecogs.com/png.latex?f"> per unit distance in the direction <img src="https://latex.codecogs.com/png.latex?e_i"> is therefore <img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B1%7D%7Bh_i%7D%20%5Cfrac%7B%5Cpartial%20f%7D%7B%5Cpartial%20q%5Ei%7D">. This is the <img src="https://latex.codecogs.com/png.latex?i">th component of the gradient. Summing over all directions gives <img src="https://latex.codecogs.com/png.latex?%0A%5Cnabla%20f%20=%20%5Csum_%7Bi=1%7D%5En%20%5Cleft(%20%5Cfrac%7B1%7D%7Bh_i%7D%5Cfrac%7B%5Cpartial%20f%7D%7B%5Cpartial%20q%5Ei%7D%20%5Cright)%20e_i.%0A"></p>
<p><strong>The divergence</strong> describes the net outflow of a vector field from an infinitesimal volume around a point. Consider a vector field <img src="https://latex.codecogs.com/png.latex?v%20=%20%5Csum_i%20v%5Ei%20e_i">. Place a tiny coordinate-aligned box at a point. Its physical edge lengths are <img src="https://latex.codecogs.com/png.latex?h_i%20%5C,%20dq%5Ei">, so its infinitesimal volume is: <img src="https://latex.codecogs.com/png.latex?%0AdV%20=%20%5Cleft(%20%5Cprod_%7Bi=1%7D%5En%20h_i(q)%20%5Cright)%20dq%5E1%20%5Ccdots%20dq%5En.%0A"> The flux of <img src="https://latex.codecogs.com/png.latex?v"> through the pair of faces orthogonal to <img src="https://latex.codecogs.com/png.latex?e_i"> is <img src="https://latex.codecogs.com/png.latex?v%5Ei"> multiplied by the physical area of those faces: <img src="https://latex.codecogs.com/png.latex?%5Cprod_%7Bj%5Cne%20i%7D%20h_j">. The divergence is the total outward flux divided by the infinitesimal volume: <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathop%7B%5Cmathrm%7Bdiv%7D%7Dv%0A=%0A%5Cfrac%7B1%7D%7B%5Cprod_j%20h_j%7D%0A%5Csum_%7Bi=1%7D%5En%0A%5Cfrac%7B%5Cpartial%7D%7B%5Cpartial%20q%5Ei%7D%0A%5Cleft(%0Av%5Ei%20%5C,%20%5Cfrac%7B%5Cprod_j%20h_j%7D%7Bh_i%7D%0A%5Cright).%0A"></p>
<p>This expression is simply the “flux out through the face in direction <img src="https://latex.codecogs.com/png.latex?i">” minus the “flux in through the opposite face”, summed over directions and normalized by the physical volume.</p>
<p><strong>The Laplacian</strong> is the divergence of the gradient. It describes how a scalar function curves around a point since it measures the net outflow of the gradient field. It also encodes how <img src="https://latex.codecogs.com/png.latex?f"> compares to its local averages over small spheres. In orthogonal coordinates, inserting the expression for the gradient into the divergence formula yields: <span id="eq-master-laplacian"><img src="https://latex.codecogs.com/png.latex?%0A%5CDelta%20f%0A=%0A%5Cfrac%7B1%7D%7B%5Cprod_j%20h_j%7D%0A%5Csum_%7Bi=1%7D%5En%0A%5Cfrac%7B%5Cpartial%7D%7B%5Cpartial%20q%5Ei%7D%0A%5Cleft(%0A%5Cfrac%7B%5Cprod_j%20h_j%7D%7Bh_i%5E2%7D%0A%5Cfrac%7B%5Cpartial%20f%7D%7B%5Cpartial%20q%5Ei%7D%0A%5Cright).%0A%5Ctag%7B1%7D"></span></p>
<p>This is the expression we will apply to spherical coordinates in the next section.</p>
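<p>As a quick sanity check (not in the original notes; plain numpy with finite differences), one can verify Equation&nbsp;1 in the familiar case of plane polar coordinates, where <img src="https://latex.codecogs.com/png.latex?h_r%20=%201"> and <img src="https://latex.codecogs.com/png.latex?h_%5Ctheta%20=%20r">:</p>

```python
import numpy as np

# Check Equation 1 in plane polar coordinates (h_r = 1, h_theta = r)
# for f(x, y) = x^2 * y, whose Cartesian Laplacian is 2 * y.

def f_polar(rr, th):
    x, y = rr * np.cos(th), rr * np.sin(th)
    return x**2 * y

h = 1e-4
r0, th0 = 1.3, 0.7

def d_r(g, rr, th):
    """Central finite difference of g in the radial variable."""
    return (g(rr + h, th) - g(rr - h, th)) / (2 * h)

# radial part of Equation 1: (1/r) d/dr ( r df/dr )
radial = d_r(lambda rr, th: rr * d_r(f_polar, rr, th), r0, th0) / r0
# angular part: (1/r^2) d^2 f / d theta^2
angular = (f_polar(r0, th0 + h) - 2 * f_polar(r0, th0)
           + f_polar(r0, th0 - h)) / h**2 / r0**2

exact = 2 * r0 * np.sin(th0)   # 2 * y at the same point
print(radial + angular, exact)
```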
<section id="spherical-coordinates-and-the-laplacian" class="level3">
<h3 class="anchored" data-anchor-id="spherical-coordinates-and-the-laplacian">Spherical coordinates and the Laplacian</h3>
<p>To analyze rotationally symmetric functions, we now specialize the general formulas from the previous section to spherical coordinates in <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BR%7D%5Ed">. A point <img src="https://latex.codecogs.com/png.latex?x%20%5Cin%20%5Cmathbb%7BR%7D%5Ed"> is described by a radius <img src="https://latex.codecogs.com/png.latex?r%20%5Cge%200"> together with <img src="https://latex.codecogs.com/png.latex?(d-1)"> angular coordinates <img src="https://latex.codecogs.com/png.latex?%0A(r,%5Ctheta_1,%5Cdots,%5Ctheta_%7Bd-1%7D).%0A"> The radius <img src="https://latex.codecogs.com/png.latex?r%20=%20%5C%7Cx%5C%7C"> determines the sphere on which the point lies, and the angles <img src="https://latex.codecogs.com/png.latex?%5Ctheta_1,%5Cdots,%5Ctheta_%7Bd-1%7D"> specify a direction on the unit sphere <img src="https://latex.codecogs.com/png.latex?S%5E%7Bd-1%7D">. Geometrically, the construction is recursive: fixing <img src="https://latex.codecogs.com/png.latex?%5Ctheta_1"> leaves an <img src="https://latex.codecogs.com/png.latex?S%5E%7Bd-2%7D">; choosing <img src="https://latex.codecogs.com/png.latex?%5Ctheta_2"> then fixes a point on that <img src="https://latex.codecogs.com/png.latex?S%5E%7Bd-2%7D">, and so on until the last coordinate <img src="https://latex.codecogs.com/png.latex?%5Ctheta_%7Bd-1%7D">, which parameterizes a circle. In these coordinates, motion in the radial direction has physical length <img src="https://latex.codecogs.com/png.latex?dr"> so that <img src="https://latex.codecogs.com/png.latex?h_r%20=%201">. Motion in the direction of <img src="https://latex.codecogs.com/png.latex?%5Ctheta_1"> traces a circle of radius <img src="https://latex.codecogs.com/png.latex?r"> so that <img src="https://latex.codecogs.com/png.latex?h_%7B%5Ctheta_1%7D%20=%20r">. 
Holding <img src="https://latex.codecogs.com/png.latex?r"> and <img src="https://latex.codecogs.com/png.latex?%5Ctheta_1"> fixed while varying <img src="https://latex.codecogs.com/png.latex?%5Ctheta_2"> traces a circle of radius <img src="https://latex.codecogs.com/png.latex?r%20%5Csin%20%5Ctheta_1">, hence <img src="https://latex.codecogs.com/png.latex?h_%7B%5Ctheta_2%7D%20=%20r%20%5Csin%20%5Ctheta_1">. Continuing in this way, the general pattern is <img src="https://latex.codecogs.com/png.latex?%0Ah_%7B%5Ctheta_k%7D%0A=%20r%20%5Csin%20%5Ctheta_1%20%5Ccdots%20%5Csin%20%5Ctheta_%7Bk-1%7D,%0A%5Cqquad%20k%20=%201,%5Cdots,d-1.%0A"> These scale factors reflect the fact that angular displacements correspond to motion along circles whose radii depend on the previously chosen angles. The physical volume of an infinitesimal coordinate box is the product of all scale factors, giving <img src="https://latex.codecogs.com/png.latex?%0AdV%0A=%20r%5E%7Bd-1%7D%5C,%0A(%5Csin%5Ctheta_1)%5E%7Bd-2%7D%0A(%5Csin%5Ctheta_2)%5E%7Bd-3%7D%0A%5Ccdots%0A(%5Csin%5Ctheta_%7Bd-2%7D)%5C,%0Adr%5C,%20d%5Ctheta_1%20%5Ccdots%20d%5Ctheta_%7Bd-1%7D.%0A"> The factor <img src="https://latex.codecogs.com/png.latex?r%5E%7Bd-1%7D"> is the familiar scaling of the surface area of a sphere. The remaining sine powers encode the intrinsic geometry of <img src="https://latex.codecogs.com/png.latex?S%5E%7Bd-1%7D">.</p>
<p><strong>Spherical Laplacian:</strong> Inserting these scale factors into the general expression Equation&nbsp;1 gives the Laplacian in spherical coordinates: <img src="https://latex.codecogs.com/png.latex?%0A%5CDelta%20f%0A=%0A%5Cbig%5B%20%5Cpartial_r%5E2%20f%20+%20%5Cfrac%7Bd-1%7D%7Br%7D%5Cpartial_r%20f%20%5Cbig%5D%0A+%20%5Cfrac%7B1%7D%7Br%5E2%7D%5C,%5CDelta_%7BS%5E%7Bd-1%7D%7D%20f.%0A"> The operator <img src="https://latex.codecogs.com/png.latex?%5CDelta_%7BS%5E%7Bd-1%7D%7D"> is the Laplacian intrinsic to the unit sphere <img src="https://latex.codecogs.com/png.latex?S%5E%7Bd-1%7D">, whose expression is not particularly enlightening or useful for our purposes here. For example, the first term reads: <img src="https://latex.codecogs.com/png.latex?%0A%5CDelta_%7BS%5E%7Bd-1%7D%7D%20f%0A=%0A%5Cfrac%7B1%7D%7B(%5Csin%5Ctheta_1)%5E%7Bd-2%7D%7D%5C,%5Cpartial_%7B%5Ctheta_1%7D%0A%5Cleft(%0A(%5Csin%5Ctheta_1)%5E%7Bd-2%7D%20%5C,%5Cpartial_%7B%5Ctheta_1%7D%20f%0A%5Cright)%0A+%20%5Ccdots%0A"> What matters is that it acts only on the angular variables <img src="https://latex.codecogs.com/png.latex?%5Ctheta_1,%5Cdots,%5Ctheta_%7Bd-1%7D">, treating <img src="https://latex.codecogs.com/png.latex?r"> as a constant. The factor <img src="https://latex.codecogs.com/png.latex?(d-1)/r"> arises because the surface area of spheres grows like <img src="https://latex.codecogs.com/png.latex?r%5E%7Bd-1%7D">, while the factor <img src="https://latex.codecogs.com/png.latex?1/r%5E2"> preceding <img src="https://latex.codecogs.com/png.latex?%5CDelta_%7BS%5E%7Bd-1%7D%7D"> reflects the fact that angular motion takes place along circles of radius <img src="https://latex.codecogs.com/png.latex?r">. Furthermore, this scaling is clear by dimensional analysis: the Laplacian has units of inverse length squared, and <img src="https://latex.codecogs.com/png.latex?%5CDelta_%7BS%5E%7Bd-1%7D%7D"> is dimensionless since it acts on the unit sphere. 
In the next section, we restrict <img src="https://latex.codecogs.com/png.latex?%5CDelta_%7BS%5E%7Bd-1%7D%7D"> to zonal functions, which depend only on the polar angle <img src="https://latex.codecogs.com/png.latex?%5Ctheta_1">.</p>
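<p>Before doing so, a quick numerical sanity check of the radial part of this decomposition (a sketch with numpy and finite differences; not from the original notes). For a purely radial function the angular term drops out entirely:</p>

```python
import numpy as np

# For the radial function f(x) = exp(-|x|^2 / 2) on R^3, the decomposition
# reduces to Delta f = f''(r) + ((d-1)/r) f'(r) with d = 3.

def f(x):
    return np.exp(-np.dot(x, x) / 2)

x0 = np.array([0.5, 0.4, 0.3])
r = np.linalg.norm(x0)

# f'(r) = -r e^{-r^2/2}, f''(r) = (r^2 - 1) e^{-r^2/2}, so
# Delta f = f'' + (2/r) f' = (r^2 - 3) e^{-r^2/2}
radial_formula = (r**2 - 3) * np.exp(-r**2 / 2)

# compare with a Cartesian finite-difference Laplacian
h = 1e-4
cartesian = sum((f(x0 + h * e) - 2 * f(x0) + f(x0 - h * e)) / h**2
                for e in np.eye(3))
print(radial_formula, cartesian)
```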
</section>
<section id="zonal-functions" class="level3">
<h3 class="anchored" data-anchor-id="zonal-functions">Zonal functions</h3>
<p>We now study the angular part of the Laplacian on the sphere. A particularly important class of functions is the class of zonal functions, which depend only on the angle with a fixed direction. Fix a unit vector <img src="https://latex.codecogs.com/png.latex?e%20%5Cin%20S%5E%7Bd-1%7D">; typically, one takes <img src="https://latex.codecogs.com/png.latex?e"> to be the “north pole” <img src="https://latex.codecogs.com/png.latex?e%20=%20(1,0,%5Cldots,0)"> and we will do so here. A function <img src="https://latex.codecogs.com/png.latex?f:%20S%5E%7Bd-1%7D%20%5Cto%20%5Cmathbb%7BR%7D"> is called zonal (with respect to <img src="https://latex.codecogs.com/png.latex?e">) if it only depends on the inner product <img src="https://latex.codecogs.com/png.latex?x%20%5Ccdot%20e%20=%20%5Ccos%20%5Ctheta_1">; this means that there are functions <img src="https://latex.codecogs.com/png.latex?F:%20%5B0,%5Cpi%5D%20%5Cto%20%5Cmathbb%7BR%7D"> and <img src="https://latex.codecogs.com/png.latex?G:%20%5B-1,1%5D%20%5Cto%20%5Cmathbb%7BR%7D"> such that <img src="https://latex.codecogs.com/png.latex?%0Af(x)%20=%20F(%5Ctheta_1)%20=%20G(z)%0A"> where we set <img src="https://latex.codecogs.com/png.latex?z%20=%20%5Ccos%20%5Ctheta_1%20=%20x%20%5Ccdot%20e">. To keep notation simple, we will often conflate <img src="https://latex.codecogs.com/png.latex?f">, <img src="https://latex.codecogs.com/png.latex?F">, and <img src="https://latex.codecogs.com/png.latex?G"> and write <img src="https://latex.codecogs.com/png.latex?f(x)">, <img src="https://latex.codecogs.com/png.latex?f(%5Ctheta_1)">, or <img src="https://latex.codecogs.com/png.latex?f(z)"> depending on context. The zonal functions describe rotational symmetry around the axis spanned by <img src="https://latex.codecogs.com/png.latex?e"> and arise naturally in problems where only the angular separation between two points matters. 
If <img src="https://latex.codecogs.com/png.latex?f"> depends only on the polar angle <img src="https://latex.codecogs.com/png.latex?%5Ctheta_1">, all derivatives with respect to <img src="https://latex.codecogs.com/png.latex?%5Ctheta_2,%5Cdots,%5Ctheta_%7Bd-1%7D"> vanish. The spherical Laplacian therefore reduces to the one-dimensional operator <img src="https://latex.codecogs.com/png.latex?%0A%5CDelta_%7BS%5E%7Bd-1%7D%7D%20f(x)%0A=%0A%5Cfrac%7B1%7D%7B%5Csin%5E%7Bd-2%7D%5Ctheta_1%7D%5C,%0A%5Cpartial_%7B%5Ctheta_1%7D%5C!%0A%5Cleft(%0A%5Csin%5E%7Bd-2%7D%5Ctheta_1%20%5C,%0A%5Cpartial_%7B%5Ctheta_1%7D%20f(%5Ctheta_1)%0A%5Cright).%0A"> Using <img src="https://latex.codecogs.com/png.latex?%5Csin%5Ctheta_1%20=%20%5Csqrt%7B1-z%5E2%7D"> and the chain rule, simple but tedious algebraic manipulations lead to the expression: <img src="https://latex.codecogs.com/png.latex?%0A%5CDelta_%7BS%5E%7Bd-1%7D%7D%20f(x)%0A=%0A(1-z%5E2)%20f''(z)%20-%20(d-1)%20z%20f'(z).%0A"> A convenient notation for the zonal part of the spherical Laplacian is <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7BL%7D_d%20f(z)%0A=%0A(1-z%5E2)%5C,%20f''(z)%20-%20(d-1)%5C,%20z%5C,%20f'(z)%0A"> for <img src="https://latex.codecogs.com/png.latex?z%20%5Cin%20%5B-1,1%5D">. This operator is simply the restriction of the spherical Laplacian <img src="https://latex.codecogs.com/png.latex?%5CDelta_%7BS%5E%7Bd-1%7D%7D"> to zonal functions. 
If <img src="https://latex.codecogs.com/png.latex?d%5Csigma"> denotes the surface measure on <img src="https://latex.codecogs.com/png.latex?S%5E%7Bd-1%7D">, then for any two zonal functions <img src="https://latex.codecogs.com/png.latex?f,g:%20S%5E%7Bd-1%7D%20%5Cto%20%5Cmathbb%7BR%7D">, one has, up to a constant factor coming from the angles that are integrated out: <img src="https://latex.codecogs.com/png.latex?%5Cint_%7BS%5E%7Bd-1%7D%7D%20f(x)%20%5C,%20g(x)%20%5C,%20d%5Csigma(x)%0A=%0A%5Cint_%7B-1%7D%5E1%20f(z)%20%5C,%20g(z)%20%5C,%20w_d(z)%5C,%20dz,%0A"> where integrating out all angles but the first gives the weight: <img src="https://latex.codecogs.com/png.latex?%0Aw_d(z)%20=%20(1-z%5E2)%5E%7B%5Cfrac%7Bd-3%7D%7B2%7D%7D.%0A"> Furthermore, since the spherical Laplacian is self-adjoint on <img src="https://latex.codecogs.com/png.latex?S%5E%7Bd-1%7D"> with respect to the standard inner product, so is <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7D_d"> with respect to this weighted inner product: <img src="https://latex.codecogs.com/png.latex?%5Cint_%7B-1%7D%5E1%20(%5Cmathcal%7BL%7D_d%20f)(z)%5C,%20g(z)%5C,%20w_d(z)%5C,%20dz%0A=%0A%5Cint_%7B-1%7D%5E1%20f(z)%5C,%20(%5Cmathcal%7BL%7D_d%20g)(z)%5C,%20w_d(z)%5C,%20dz.%0A"> Now, suppose we look for eigenfunctions of the spherical Laplacian on <img src="https://latex.codecogs.com/png.latex?S%5E%7Bd-1%7D">, i.e.&nbsp;functions <img src="https://latex.codecogs.com/png.latex?f"> satisfying <img src="https://latex.codecogs.com/png.latex?-%5CDelta_%7BS%5E%7Bd-1%7D%7D%20f%20=%20%5Clambda%20f">, that are zonal: <a href="https://en.wikipedia.org/wiki/Zonal_spherical_harmonics">zonal spherical harmonics</a>.</p>
<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/gegenbauer/zonal_spherical.jpg" class="img-fluid figure-img" style="width:35.0%"></p>
<figcaption><a href="https://en.wikipedia.org/wiki/Zonal_spherical_harmonics">Zonal spherical harmonics</a></figcaption>
</figure>
</div>
</div>
<p>This eigenvalue equation becomes <img src="https://latex.codecogs.com/png.latex?-%5Cmathcal%7BL%7D_d%20f%20=%20%5Clambda%20f">, which reads: <span id="eq-gegenbauer-ode"><img src="https://latex.codecogs.com/png.latex?%0A(1-z%5E2)%20f''(z)%20-%20(d-1)%20z%20f'(z)%20+%20%5Clambda%20f(z)%20=%200.%0A%5Ctag%7B2%7D"></span> Since the eigenvalues of the spherical Laplacian on <img src="https://latex.codecogs.com/png.latex?S%5E%7Bd-1%7D"> are given by: <img src="https://latex.codecogs.com/png.latex?%5Clambda%20=%20n(n+d-2)"> for integers <img src="https://latex.codecogs.com/png.latex?n%20%5Cge%200">, we set <img src="https://latex.codecogs.com/png.latex?%5Clambda%20=%20n(n+d-2)"> in the following.</p>
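<p>The weighted self-adjointness used above can be tested numerically with two arbitrary smooth functions on <img src="https://latex.codecogs.com/png.latex?%5B-1,1%5D"> (a sketch with <img src="https://latex.codecogs.com/png.latex?d=4"> and a hand-coded trapezoid rule; numpy assumed, not from the original notes):</p>

```python
import numpy as np

d = 4   # weight w_d(z) = (1 - z^2)**((d-3)/2) = sqrt(1 - z^2)

def L(df, d2f, z):
    """Zonal operator L_d f = (1 - z^2) f'' - (d - 1) z f'."""
    return (1 - z**2) * d2f(z) - (d - 1) * z * df(z)

# two arbitrary smooth test functions with their first two derivatives
f,  df,  d2f = (lambda z: z**3), (lambda z: 3 * z**2), (lambda z: 6 * z)
g,  dg,  d2g = (lambda z: z**2 + z), (lambda z: 2 * z + 1), (lambda z: 0 * z + 2)

z = np.linspace(-1.0, 1.0, 400001)
w = (1 - z**2) ** ((d - 3) / 2)
dz = z[1] - z[0]

def integrate(y):
    """Trapezoid rule on the grid z."""
    return dz * (0.5 * y[0] + 0.5 * y[-1] + y[1:-1].sum())

lhs = integrate(L(df, d2f, z) * g(z) * w)
rhs = integrate(f(z) * L(dg, d2g, z) * w)
print(lhs, rhs)   # the two integrals agree
```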
</section>
<section id="gegenbauer-polynomials." class="level3">
<h3 class="anchored" data-anchor-id="gegenbauer-polynomials.">Gegenbauer polynomials.</h3>
<p>Equation&nbsp;2 is a second-order ordinary differential equation on the interval <img src="https://latex.codecogs.com/png.latex?%5B-1,1%5D"> and it is customary to parametrize it by <img src="https://latex.codecogs.com/png.latex?%5Calpha%20%3E%20-%5Ctfrac12"> by setting <img src="https://latex.codecogs.com/png.latex?%5Calpha%20=%20d/2-1">. Since the eigenvalues of the spherical Laplacian in dimension <img src="https://latex.codecogs.com/png.latex?d"> are <img src="https://latex.codecogs.com/png.latex?%5Clambda%20=%20n(n+d-2)">, the equation becomes: <span id="eq-gegenbauer-ode-alpha"><img src="https://latex.codecogs.com/png.latex?(1-z%5E2)%20y''(z)%20-%20(2%5Calpha%20+%201)%20z%20y'(z)%20+%20n(n+2%5Calpha)%5C,%20y(z)%20=%200.%0A%5Ctag%7B3%7D"></span> One can then show that this equation admits polynomial solutions of degree <img src="https://latex.codecogs.com/png.latex?n">, called the <a href="https://en.wikipedia.org/wiki/Gegenbauer_polynomials">Gegenbauer polynomials</a> and usually denoted by <img src="https://latex.codecogs.com/png.latex?C_n%5E%7B(%5Calpha)%7D(z)">. Furthermore, if we insist that the solutions be regular on the entire interval <img src="https://latex.codecogs.com/png.latex?%5B-1,1%5D"> and can be lifted to smooth functions on the sphere, then these polynomial solutions are the only ones! For a given dimension <img src="https://latex.codecogs.com/png.latex?d=2%5Calpha+2"> and eigenvalue <img src="https://latex.codecogs.com/png.latex?%5Clambda%20=%20n(n+d-2)"> of the spherical Laplacian, up to rotational symmetry and normalization, there is a unique eigenfunction and it is described by the Gegenbauer polynomial <img src="https://latex.codecogs.com/png.latex?C_n%5E%7B(%5Calpha)%7D(z)">. 
They can be defined recursively as <img src="https://latex.codecogs.com/png.latex?C_0%5E%7B(%5Calpha)%7D(z)%20=%201"> and <img src="https://latex.codecogs.com/png.latex?C_1%5E%7B(%5Calpha)%7D(z)%20=%202%5Calpha%20z">, together with the three-term recurrence relation: <img src="https://latex.codecogs.com/png.latex?%0A(n+1)%20%5C,%20C_%7Bn+1%7D%5E%7B(%5Calpha)%7D(z)%0A=%202%20(n+%5Calpha)%20z%20%5C,%20C_n%5E%7B(%5Calpha)%7D(z)%0A-%20(n+2%5Calpha%20-1)%20%5C,%20C_%7Bn-1%7D%5E%7B(%5Calpha)%7D(z).%0A"></p>
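<p>As a quick numerical sanity check (a sketch assuming SciPy is available; <code>scipy.special.eval_gegenbauer</code> provides reference values), the three-term recurrence can be implemented directly:</p>

```python
# Numerical check of the Gegenbauer three-term recurrence against SciPy.
import numpy as np
from scipy.special import eval_gegenbauer

def gegenbauer_recurrence(n, alpha, z):
    """Evaluate C_n^{(alpha)}(z) via the three-term recurrence."""
    c_prev, c_curr = np.ones_like(z), 2.0 * alpha * z  # C_0 and C_1
    if n == 0:
        return c_prev
    for k in range(1, n):
        # (k+1) C_{k+1} = 2 (k+alpha) z C_k - (k + 2 alpha - 1) C_{k-1}
        c_prev, c_curr = c_curr, (2.0 * (k + alpha) * z * c_curr
                                  - (k + 2.0 * alpha - 1.0) * c_prev) / (k + 1.0)
    return c_curr

z = np.linspace(-1.0, 1.0, 201)
for n in range(6):
    assert np.allclose(gegenbauer_recurrence(n, 3.0, z),
                       eval_gegenbauer(n, 3.0, z))
```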
<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/gegenbauer/poly.png" class="img-fluid figure-img" style="width:85.0%"></p>
<figcaption>Gegenbauer Polynomials: <img src="https://latex.codecogs.com/png.latex?%5Calpha=3"></figcaption>
</figure>
</div>
</div>
<p>By construction, for any vector <img src="https://latex.codecogs.com/png.latex?e%20%5Cin%20S%5E%7Bd-1%7D">, the function defined on the sphere by <img src="https://latex.codecogs.com/png.latex?Y_n(x)%20=%20C_n%5E%7B(d/2-1)%7D(x%20%5Ccdot%20e)"> is a zonal spherical harmonic of degree <img src="https://latex.codecogs.com/png.latex?n"> in dimension <img src="https://latex.codecogs.com/png.latex?d"> with corresponding eigenvalue <img src="https://latex.codecogs.com/png.latex?%5Clambda%20=%20n(n+d-2)">. <a href="https://en.wikipedia.org/wiki/Legendre_polynomials">Legendre polynomials</a> are a special case of Gegenbauer polynomials obtained by setting <img src="https://latex.codecogs.com/png.latex?%5Calpha%20=%20%5Ctfrac12">, i.e.&nbsp;<img src="https://latex.codecogs.com/png.latex?d=3">. In that case, the Legendre polynomials are orthogonal with respect to the uniform weight on <img src="https://latex.codecogs.com/png.latex?%5B-1,1%5D"> since <img src="https://latex.codecogs.com/png.latex?w_3(z)%20=%201">. Similarly, setting <img src="https://latex.codecogs.com/png.latex?%5Calpha%20=%200"> (i.e.&nbsp;<img src="https://latex.codecogs.com/png.latex?d=2">) gives, up to normalization, the <a href="https://en.wikipedia.org/wiki/Chebyshev_polynomials">Chebyshev polynomials</a> of the first kind, which are orthogonal with respect to the weight <img src="https://latex.codecogs.com/png.latex?w_2(z)%20=%20(1-z%5E2)%5E%7B-1/2%7D">. This also illuminates why the Chebyshev polynomials are defined on the interval <img src="https://latex.codecogs.com/png.latex?%5B-1,1%5D"> with that specific weight and satisfy <img src="https://latex.codecogs.com/png.latex?T_n(%5Ccos%20%5Ctheta)%20=%20%5Ccos(n%20%5Ctheta)">: they are simply the zonal spherical harmonics in dimension <img src="https://latex.codecogs.com/png.latex?d=2">.</p>
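<p>Both special cases can be checked numerically (a sketch assuming SciPy; since the standard Gegenbauer normalization degenerates at <img src="https://latex.codecogs.com/png.latex?%5Calpha%20=%200">, the Chebyshev identity is checked directly):</p>

```python
# Special cases of the Gegenbauer family: Legendre (alpha = 1/2) and the
# Chebyshev identity T_n(cos theta) = cos(n theta) for d = 2.
import numpy as np
from scipy.special import eval_gegenbauer, eval_legendre, eval_chebyt

z = np.linspace(-1.0, 1.0, 101)
for n in range(5):
    # alpha = 1/2 (d = 3): Gegenbauer coincides with Legendre.
    assert np.allclose(eval_gegenbauer(n, 0.5, z), eval_legendre(n, z))

theta = np.linspace(0.0, np.pi, 101)
for n in range(5):
    # d = 2: the zonal harmonics are T_n(cos theta) = cos(n theta).
    assert np.allclose(eval_chebyt(n, np.cos(theta)), np.cos(n * theta))
```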


</section>
</section>

 ]]></description>
  <category>analysis</category>
  <guid>https://alexxthiery.github.io/notes/gegenbauer/gegenbauer.html</guid>
  <pubDate>Sun, 23 Nov 2025 17:00:00 GMT</pubDate>
</item>
<item>
  <title>Scoring Rules for Probabilistic Forecasts</title>
  <link>https://alexxthiery.github.io/notes/scoring_rules/scoring.html</link>
  <description><![CDATA[ 





<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/scoring_rules/Savage.jpg" class="img-fluid figure-img" style="width:35.0%"></p>
<figcaption><a href="https://en.wikipedia.org/wiki/Leonard_Jimmie_Savage">Leonard Jimmie Savage</a> (1917 – 1971)</figcaption>
</figure>
</div>
</div>
<p>When working with probabilistic models, predictions are expressed as full distributions rather than point estimates. To keep things simple, we focus on the case where the outcome <img src="https://latex.codecogs.com/png.latex?Y"> to be predicted takes one of <img src="https://latex.codecogs.com/png.latex?n"> possible values, labeled <img src="https://latex.codecogs.com/png.latex?%5B1:n%5D%20=%20%5C%7B1,2,%5Cldots,n%5C%7D">. A probabilistic forecast then takes the form of a vector <img src="https://latex.codecogs.com/png.latex?%5Cpi=(%5Cpi_1,%5Cpi_2,%5Cldots,%5Cpi_n)"> in the probability simplex <img src="https://latex.codecogs.com/png.latex?%5CDelta%5E%7Bn-1%7D">. Each coordinate <img src="https://latex.codecogs.com/png.latex?%5Cpi_i"> represents the predicted probability that outcome <img src="https://latex.codecogs.com/png.latex?i"> occurs. How should we evaluate the quality of such probabilistic forecasts?</p>
<p>A scoring rule assigns a numerical reward <img src="https://latex.codecogs.com/png.latex?s(%5Cpi,Y)"> to the probabilistic forecast <img src="https://latex.codecogs.com/png.latex?%5Cpi"> when outcome <img src="https://latex.codecogs.com/png.latex?Y"> occurs. If the true distribution of the outcome <img src="https://latex.codecogs.com/png.latex?Y"> is <img src="https://latex.codecogs.com/png.latex?p">, the expected reward for reporting <img src="https://latex.codecogs.com/png.latex?%5Cpi"> is <span id="eq-S"><img src="https://latex.codecogs.com/png.latex?%0AS(%5Cpi,p)%20%5Cequiv%20%5Csum_%7Bi=1%7D%5En%20p_i%20%5C,%20s(%5Cpi,i).%0A%5Ctag%7B1%7D"></span></p>
<p>Although the function <img src="https://latex.codecogs.com/png.latex?s:%20%5CDelta%5E%7Bn-1%7D%20%5Ctimes%20%5B1:n%5D%20%5Cto%20%5Cmathbb%7BR%7D"> is generally non-linear, the function <img src="https://latex.codecogs.com/png.latex?S"> can be extended to a function from <img src="https://latex.codecogs.com/png.latex?%5CDelta%5E%7Bn-1%7D%20%5Ctimes%20%5Cmathbb%7BR%7D%5En"> to <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BR%7D">, linear in its second argument, through Equation&nbsp;1 by interpreting <img src="https://latex.codecogs.com/png.latex?p"> as a vector in <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BR%7D%5En">. The fact that <img src="https://latex.codecogs.com/png.latex?S"> is linear in its second argument will prove very useful later. Furthermore, if one denotes by <img src="https://latex.codecogs.com/png.latex?%5Cdelta_i%20=%20(0,%5Cldots,0,1,0,%5Cldots,0)%20%5Cin%20%5CDelta%5E%7Bn-1%7D"> the Dirac measure at <img src="https://latex.codecogs.com/png.latex?i">, then the scoring rule can be recovered from the expected reward via</p>
<p><img src="https://latex.codecogs.com/png.latex?s(%5Cpi,i)%20=%20S(%5Cpi,%5Cdelta_i)."></p>
<p>The central requirement for the design of scoring rules is that a forecaster has no incentive to misreport their beliefs. This means that if the forecaster’s belief about the distribution of <img src="https://latex.codecogs.com/png.latex?Y"> is given by the probability distribution <img src="https://latex.codecogs.com/png.latex?%5Cpi%20%5Cin%5CDelta%5E%7Bn-1%7D">, then reporting <img src="https://latex.codecogs.com/png.latex?%5Cpi"> should maximize their expected reward. There are a number of situations in which such a design is desirable. For example, in market settings where agents are asked to provide probabilistic forecasts, proper scoring rules incentivize truthful reporting of beliefs. A scoring rule is called proper if the mapping <img src="https://latex.codecogs.com/png.latex?%5Cpi%20%5Cmapsto%20S(%5Cpi,%20p)"> attains its maximum at <img src="https://latex.codecogs.com/png.latex?%5Cpi=p">. Formally, this means that for any two distributions <img src="https://latex.codecogs.com/png.latex?p,%5Cpi%20%5Cin%20%5CDelta%5E%7Bn-1%7D">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AS(p,p)%20%5Cge%20S(%5Cpi,%20p).%0A"></p>
<p>This condition ensures that the best strategy, in expectation, is to report one’s genuine probabilities. If the inequality is strict whenever <img src="https://latex.codecogs.com/png.latex?%5Cpi%20%5Cne%20p">, the scoring rule is called strictly proper. Proper scoring rules have a long history in statistics and decision theory. The natural question arises: what do proper scoring rules look like, and how can we construct them? What functional forms can we use for <img src="https://latex.codecogs.com/png.latex?s(%5Cpi,i)"> that ensure properness?</p>
<p>For each distribution <img src="https://latex.codecogs.com/png.latex?p">, define its self-expected score <img src="https://latex.codecogs.com/png.latex?%0AH(p)=S(p,p)=%5Csum_%7Bi=1%7D%5En%20p_i%20%5C,%20s(p,i).%0A"></p>
<p>It is the average reward a forecaster receives when their reported distribution matches the true distribution. Crucially, the affine function <img src="https://latex.codecogs.com/png.latex?p%20%5Cmapsto%20S(%5Cpi,p)"> describes a supporting hyperplane to the function <img src="https://latex.codecogs.com/png.latex?H"> at the point <img src="https://latex.codecogs.com/png.latex?%5Cpi">: it is linear in <img src="https://latex.codecogs.com/png.latex?p">, matches <img src="https://latex.codecogs.com/png.latex?H"> at <img src="https://latex.codecogs.com/png.latex?p=%5Cpi">, and never exceeds it elsewhere. If one knew that <img src="https://latex.codecogs.com/png.latex?H"> were convex and differentiable, by uniqueness of supporting hyperplanes to convex and differentiable functions, one could immediately write down a representation for <img src="https://latex.codecogs.com/png.latex?S(%5Cpi,p)"> in terms of <img src="https://latex.codecogs.com/png.latex?H">. And <img src="https://latex.codecogs.com/png.latex?H"> is indeed convex since it is the pointwise maximum of the family of affine functions <img src="https://latex.codecogs.com/png.latex?p%20%5Cmapsto%20S(%5Cpi,p)"> indexed by <img src="https://latex.codecogs.com/png.latex?%5Cpi">. Assuming differentiability to keep a few technicalities at bay, this shows that:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign*%7D%0As(%5Cpi,i)%0A&amp;=%20S(%5Cpi,%5Cdelta_i)%0A=%20S(%5Cpi,%20%5Cpi)%20+%20%5Cleft%3C%20%20%5Cnabla%20H(%5Cpi),%20%5Cdelta_i%20-%20%5Cpi%20%20%5Cright%3E%5C%5C%0A&amp;=%20H(%5Cpi)%20+%20%5Cleft%3C%20%20%5Cnabla%20H(%5Cpi),%20%5Cdelta_i%20-%20%5Cpi%20%20%5Cright%3E.%0A%5Cend%7Balign*%7D%0A"></p>
<p>Without assuming differentiability, one can use subgradients instead of gradients to obtain a similar representation. This shows that proper scoring rules <img src="https://latex.codecogs.com/png.latex?s(%5Cpi,i)"> are in one-to-one correspondence with convex functions <img src="https://latex.codecogs.com/png.latex?H(%5Cpi)"> on the probability simplex <img src="https://latex.codecogs.com/png.latex?%5Cpi%20%5Cin%20%5CDelta%5E%7Bn-1%7D">. Similarly, strictly proper scoring rules correspond to strictly convex functions. Extension to continuous sample spaces is possible through the use of functional derivatives instead of gradients or subgradients; see <span class="citation" data-cites="gneiting2007strictly">(Gneiting and Raftery 2007)</span> for details.</p>
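<p>This correspondence can be sketched in code: starting from a differentiable convex <img src="https://latex.codecogs.com/png.latex?H">, define the induced score (the helper names below are illustrative, not from any library) and check that the negative Shannon entropy recovers the logarithmic score:</p>

```python
# Sketch of the correspondence: build s(pi, i) from a convex H via
# s(pi, i) = H(pi) + <grad H(pi), delta_i - pi>, then verify that
# H(p) = sum_i p_i log p_i yields the logarithmic score.
import numpy as np

def score_from_H(H, gradH, pi, i):
    """Scoring rule induced by a differentiable convex H on the simplex."""
    delta = np.zeros_like(pi)
    delta[i] = 1.0
    return H(pi) + gradH(pi) @ (delta - pi)

H = lambda p: np.sum(p * np.log(p))   # negative Shannon entropy
gradH = lambda p: np.log(p) + 1.0     # componentwise gradient

pi = np.array([0.2, 0.3, 0.5])
for i in range(3):
    assert np.isclose(score_from_H(H, gradH, pi, i), np.log(pi[i]))
```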
<p>Let us look at some examples of proper scoring rules defined through this correspondence:</p>
<ol type="1">
<li><p><strong>Logarithmic Score</strong>: The logarithmic scoring rule is defined as <img src="https://latex.codecogs.com/png.latex?s(%5Cpi,i)%20=%20%5Clog(%5Cpi_i)">. The corresponding self-expected score is the negative Shannon entropy: <img src="https://latex.codecogs.com/png.latex?%0AH(p)%20=%20%5Csum_%7Bi=1%7D%5En%20p_i%20%5Clog(p_i).%0A"> It is interesting to note that the logarithmic scoring rule is essentially the only local proper scoring rule, i.e.&nbsp;one of the type <img src="https://latex.codecogs.com/png.latex?s(%5Cpi,i)%20=%20F(%5Cpi_i,%20i)"> for some function <img src="https://latex.codecogs.com/png.latex?F">. In other words, the score assigned to outcome <img src="https://latex.codecogs.com/png.latex?i"> depends only on the predicted probability <img src="https://latex.codecogs.com/png.latex?%5Cpi_i"> of that outcome, and not on the other predicted probabilities <img src="https://latex.codecogs.com/png.latex?%5Cpi_j"> for <img src="https://latex.codecogs.com/png.latex?j%20%5Cne%20i">. Indeed, assuming <img src="https://latex.codecogs.com/png.latex?F"> smooth for simplicity, the condition of properness easily implies that <img src="https://latex.codecogs.com/png.latex?%5Cpi_i%20%5C,%20%5Cpartial_%7B%5Cpi_i%7D%20F(%5Cpi_i,%20i)%20=%20%5Calpha"> for some constant <img src="https://latex.codecogs.com/png.latex?%5Calpha"> independent of <img src="https://latex.codecogs.com/png.latex?i">. Integrating this relation gives that <img src="https://latex.codecogs.com/png.latex?F(%5Cpi_i,%20i)%20=%20%5Calpha%20%5Clog(%5Cpi_i)%20+%20%5Cbeta_i">, where necessarily <img src="https://latex.codecogs.com/png.latex?%5Calpha%3E0"> for properness, and where the <img src="https://latex.codecogs.com/png.latex?%5Cbeta_i"> are arbitrary constants.</p></li>
<li><p><strong>Brier Score</strong>: The Brier scoring rule is given by <img src="https://latex.codecogs.com/png.latex?s(%5Cpi,i)%20=%20%5Cpi_i%20-%20%5Ctfrac12%20%5Csum_%7Bj=1%7D%5En%20%5Cpi_j%5E2">. The associated self-expected score is <img src="https://latex.codecogs.com/png.latex?%0AH(p)%20=%20%5Cfrac12%20%5C,%20%5Csum_%7Bi=1%7D%5En%20p_i%5E2.%0A"></p></li>
<li><p><strong>Spherical Score</strong>: The spherical scoring rule is defined as <img src="https://latex.codecogs.com/png.latex?s(%5Cpi,i)%20=%20%5Cfrac%7B%5Cpi_i%7D%7B%5C%7C%5Cpi%5C%7C_2%7D">. The corresponding self-expected score is <img src="https://latex.codecogs.com/png.latex?%0AH(p)%20=%20%5C%7Cp%5C%7C_2.%0A"></p></li>
<li><p><strong>Energy Score</strong>: consider a distance function <img src="https://latex.codecogs.com/png.latex?d:%20%5B1:n%5D%20%5Ctimes%20%5B1:n%5D%20%5Cto%20%5Cmathbb%7BR%7D_+">. The energy scoring rule is defined through expected distances: <img src="https://latex.codecogs.com/png.latex?%0As(%5Cpi,i)%20=%20-%20%20%7B%5Cleft(%20%20%5Cmathbb%7BE%7D%5Bd(X,i)%5D%20-%20%5Cfrac%2012%20%5C,%20%5Cmathbb%7BE%7D%5Bd(X,X')%5D%20%20%5Cright)%7D%20,%0A"> where <img src="https://latex.codecogs.com/png.latex?X,X'%20%5Csim%20%5Cpi"> are independent. The associated self-expected score is <img src="https://latex.codecogs.com/png.latex?%0AH(p)%20=%20-%5Cfrac12%20%5C,%20%5Cmathbb%7BE%7D%5Bd(X,X')%5D%20=%20-%20%5Cfrac12%20%5C,%5Csum_%7Bi,j=1%7D%5En%20p_i%20p_j%20%5C,%20d(i,j).%0A"> where <img src="https://latex.codecogs.com/png.latex?X,X'%20%5Csim%20p"> are independent. This function is convex in <img src="https://latex.codecogs.com/png.latex?p"> if the distance matrix <img src="https://latex.codecogs.com/png.latex?M_%7Bi,j%7D=d(i,j)"> is negative semi-definite on the subspace of zero-sum vectors, i.e., if for all vectors <img src="https://latex.codecogs.com/png.latex?z%20%5Cin%20%5Cmathbb%7BR%7D%5En"> with <img src="https://latex.codecogs.com/png.latex?%5Csum_%7Bi=1%7D%5En%20z_i%20=%200">, one has <img src="https://latex.codecogs.com/png.latex?%5Csum_%7Bi,j=1%7D%5En%20z_i%20z_j%20%5C,%20d(i,j)%20%5Cle%200">. Luckily, there are many such distances. For example, if the distance <img src="https://latex.codecogs.com/png.latex?d"> is of the form <img src="https://latex.codecogs.com/png.latex?d(i,j)%20=%20%5C%7C%5Cvarphi_i%20-%20%5Cvarphi_j%5C%7C_2%5E2"> for some (feature) vectors <img src="https://latex.codecogs.com/png.latex?%5Cvarphi_1,%5Cldots,%5Cvarphi_n%20%5Cin%20%5Cmathbb%7BR%7D%5Em">, then the corresponding distance matrix is negative semi-definite on the subspace of zero-sum vectors. 
In a continuous setting, for example when <img src="https://latex.codecogs.com/png.latex?%5B1:n%5D"> is replaced by <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BR%7D%5Ed">, one could take <img src="https://latex.codecogs.com/png.latex?%5Cvarphi(x)=x"> to obtain the energy score associated with the squared Euclidean distance; this leads to <img src="https://latex.codecogs.com/png.latex?H(%5Cpi)%20=%20-%20%5Cfrac12%20%5C,%20%5Cmathbb%7BE%7D%5B%5C%7CX-X'%5C%7C_2%5E2%5D%20=%20-%20%5Ctext%7BVar%7D(X)"> when <img src="https://latex.codecogs.com/png.latex?X,X'%20%5Csim%20%5Cpi"> are independent. This shows that in that case the function <img src="https://latex.codecogs.com/png.latex?H"> is <strong>not</strong> strictly convex since it depends only on the variance of the distribution, and is entirely flat on the subspace of distributions with fixed variance! For this reason, the energy score with squared Euclidean distance is proper but typically not strictly proper. In fact, one can check that for any <img src="https://latex.codecogs.com/png.latex?0%3C%20%5Cbeta%20%3C%202">, the distance defined as <img src="https://latex.codecogs.com/png.latex?d(i,j)%20=%20%5C%7C%5Cvarphi_i%20-%20%5Cvarphi_j%5C%7C_2%5E%5Cbeta"> also leads to a negative semi-definite distance matrix on the subspace of zero-sum vectors. But unlike the case <img src="https://latex.codecogs.com/png.latex?%5Cbeta=2"> of squared Euclidean distance, when choosing <img src="https://latex.codecogs.com/png.latex?0%3C%5Cbeta%3C2">, the associated energy score is strictly proper <span class="citation" data-cites="schoenberg1938metric">(Schoenberg 1938)</span> <span class="citation" data-cites="szekely2013energy">(Székely and Rizzo 2013)</span>. This includes, in particular, the case <img src="https://latex.codecogs.com/png.latex?%5Cbeta=1">, which corresponds to the standard Euclidean distance.</p></li>
</ol>
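<p>The examples above can be spot-checked numerically (an illustrative sketch): properness of the logarithmic, Brier, and spherical scores, and the conditional negative-definiteness underlying the energy score with <img src="https://latex.codecogs.com/png.latex?%5Cbeta=1">:</p>

```python
# Spot-check that the scoring rules above are proper: S(p, p) >= S(pi, p)
# for random draws p, pi from the simplex; then check the negative
# semi-definiteness of an energy-score distance matrix on zero-sum vectors.
import numpy as np

rng = np.random.default_rng(0)

def S(score, pi, p):
    """Expected reward of reporting pi when the outcome has distribution p."""
    return sum(p[i] * score(pi, i) for i in range(len(p)))

log_score = lambda pi, i: np.log(pi[i])
brier = lambda pi, i: pi[i] - 0.5 * np.sum(pi ** 2)
spherical = lambda pi, i: pi[i] / np.linalg.norm(pi)

for _ in range(1000):
    p, pi = rng.dirichlet(np.ones(4)), rng.dirichlet(np.ones(4))
    for score in (log_score, brier, spherical):
        assert S(score, p, p) >= S(score, pi, p) - 1e-12

# Energy score with d(i, j) = |phi_i - phi_j|^beta, beta = 1: the quadratic
# form sum_{i,j} z_i z_j d(i, j) should be <= 0 on zero-sum vectors z.
phi = rng.normal(size=6)
D = np.abs(phi[:, None] - phi[None, :])
for _ in range(1000):
    z = rng.normal(size=6)
    z -= z.mean()
    assert z @ D @ z <= 1e-8
```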




<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0">
<div id="ref-gneiting2007strictly" class="csl-entry">
Gneiting, Tilmann, and Adrian E Raftery. 2007. <span>“Strictly Proper Scoring Rules, Prediction, and Estimation.”</span> <em>Journal of the American Statistical Association</em> 102 (477). Taylor &amp; Francis: 359–78.
</div>
<div id="ref-schoenberg1938metric" class="csl-entry">
Schoenberg, Isaac J. 1938. <span>“Metric Spaces and Positive Definite Functions.”</span> <em>Transactions of the American Mathematical Society</em> 44 (3). JSTOR: 522–36.
</div>
<div id="ref-szekely2013energy" class="csl-entry">
Székely, Gábor J, and Maria L Rizzo. 2013. <span>“Energy Statistics: A Class of Statistics Based on Distances.”</span> <em>Journal of Statistical Planning and Inference</em> 143 (8). Elsevier: 1249–72.
</div>
</div></section></div> ]]></description>
  <category>diffusion</category>
  <guid>https://alexxthiery.github.io/notes/scoring_rules/scoring.html</guid>
  <pubDate>Fri, 21 Nov 2025 17:00:00 GMT</pubDate>
</item>
<item>
  <title>Masked Discrete Diffusion</title>
  <link>https://alexxthiery.github.io/notes/DiscreteDiff/DiscreteDiff.html</link>
  <description><![CDATA[ 





<section id="masked-discrete-diffusion" class="level3">
<h3 class="anchored" data-anchor-id="masked-discrete-diffusion">Masked Discrete Diffusion</h3>
<p>We consider a finite state space <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7BX%7D%20=%20%5C%7BM,%201,2,%20%5Cldots,%20V%5C%7D,%0A"> where <img src="https://latex.codecogs.com/png.latex?M"> is a special <strong>masked</strong> state and <img src="https://latex.codecogs.com/png.latex?1,%20%5Cldots,%20V"> correspond to token values in a vocabulary of size <img src="https://latex.codecogs.com/png.latex?V">. On the time interval <img src="https://latex.codecogs.com/png.latex?%5B0,T%5D">, we define a continuous-time Markov chain with initial distribution <img src="https://latex.codecogs.com/png.latex?p_0"> and time-dependent infinitesimal rate matrix <img src="https://latex.codecogs.com/png.latex?Q_t%20%5Cin%20%5Cmathbb%7BR%7D%5E%7B(V+1)%5Ctimes(V+1)%7D"> so that for any <img src="https://latex.codecogs.com/png.latex?x%20%5Cne%20y">, <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BP%7D(X_%7Bt+h%7D%20=%20y%20%5Cmid%20X_t%20=%20x)%20=%20Q_t(x,y)%20%5C,%20h%20+%20o(h).%0A"></p>
<p>If the <strong>total jump rate</strong> out of state <img src="https://latex.codecogs.com/png.latex?x"> is <img src="https://latex.codecogs.com/png.latex?J_t(x)%20=%20%5Csum_%7By%20%5Cne%20x%7D%20Q_t(x,y)">, then</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D(X_%7Bt+h%7D%20=%20x%20%5Cmid%20X_t%20=%20x)%20=%201%20-%20J_t(x)%20%5C,%20h%20+%20o(h)."></p>
<p>Bayes’ rule implies that the time-reversal of this Markov chain is itself Markov, with infinitesimal rate matrix <img src="https://latex.codecogs.com/png.latex?Q_t%5E%7B%5Cstar%7D"> satisfying <img src="https://latex.codecogs.com/png.latex?%0AQ_t%5E%7B%5Cstar%7D(x,y)%20=%20%5Cfrac%7Bp_t(y)%7D%7Bp_t(x)%7D%20%5C,%20Q_t(y,x),%0A"> where <img src="https://latex.codecogs.com/png.latex?p_t"> is the marginal distribution of <img src="https://latex.codecogs.com/png.latex?X_t"> at time <img src="https://latex.codecogs.com/png.latex?t">. We have: <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BP%7D(X_%7Bt-h%7D%20=%20y%20%5Cmid%20X_t%20=%20x)%20=%20Q_t%5E%7B%5Cstar%7D(x,y)%20%5C,%20h%20+%20o(h).%0A"></p>
<p>We are interested in modeling a Markov chain that progressively masks the initial value into the masked state <img src="https://latex.codecogs.com/png.latex?M"> as time <img src="https://latex.codecogs.com/png.latex?t"> goes from <img src="https://latex.codecogs.com/png.latex?0"> to <img src="https://latex.codecogs.com/png.latex?T">. Transitions are only allowed from any token <img src="https://latex.codecogs.com/png.latex?i%20%5Cin%20%5C%7B1,%5Cdots,V%5C%7D"> to the masked state <img src="https://latex.codecogs.com/png.latex?M">, and once in <img src="https://latex.codecogs.com/png.latex?M"> the process remains there. Thus, outside the diagonal, the only nonzero entries of <img src="https://latex.codecogs.com/png.latex?Q_t"> are <img src="https://latex.codecogs.com/png.latex?Q_t(x,M)">. As it will be useful later, we denote by <img src="https://latex.codecogs.com/png.latex?%5Ctau"> the jump time to <img src="https://latex.codecogs.com/png.latex?M"> and we assume <img src="https://latex.codecogs.com/png.latex?%5Ctau%20%3C%20T"> almost surely, so that <img src="https://latex.codecogs.com/png.latex?X_T%20=%20M"> with probability one, and that <img src="https://latex.codecogs.com/png.latex?%5Ctau"> has a continuous distribution. In words: the process starts at some token value and at a random time <img src="https://latex.codecogs.com/png.latex?%5Ctau"> jumps to the masked state <img src="https://latex.codecogs.com/png.latex?M">, where it remains until time <img src="https://latex.codecogs.com/png.latex?T">.</p>
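<p>As an illustrative sketch, take a constant masking rate <img src="https://latex.codecogs.com/png.latex?%5Cbeta">, so that the jump time is exponentially distributed; the requirement that the jump occurs before <img src="https://latex.codecogs.com/png.latex?T"> then only holds approximately for finite <img src="https://latex.codecogs.com/png.latex?T">:</p>

```python
# Minimal simulation of the forward masking chain under a constant masking
# rate beta: each chain sits at its token value and jumps to M at a random
# time tau ~ Exponential(beta), after which it stays masked.
import numpy as np

rng = np.random.default_rng(1)
beta, t, n_chains = 2.0, 0.7, 200_000

tau = rng.exponential(1.0 / beta, size=n_chains)  # jump times to state M
frac_masked = np.mean(tau < t)                    # empirical P(X_t = M)

# Compare with the exact marginal P(X_t = M) = 1 - exp(-beta * t).
assert abs(frac_masked - (1.0 - np.exp(-beta * t))) < 5e-3
```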
</section>
<section id="extension-to-sequences" class="level3">
<h3 class="anchored" data-anchor-id="extension-to-sequences">Extension to Sequences</h3>
<p>We are interested in modeling sequences of <img src="https://latex.codecogs.com/png.latex?L"> discrete tokens, e.g.&nbsp;binary images, genomic sequences, chemical compounds, or protein sequences. Each token takes values in <img src="https://latex.codecogs.com/png.latex?%5C%7B1,2,%5Cldots,V%5C%7D">. We denote by <img src="https://latex.codecogs.com/png.latex?%5Coverline%7Bp%7D_0"> the data distribution over such sequences. For this purpose, we consider <img src="https://latex.codecogs.com/png.latex?L"> independent copies of the above Markov chain, one per coordinate: <img src="https://latex.codecogs.com/png.latex?%0A%5Coverline%7BX%7D_t%20=%20(X_t%5E1,%20%5Cldots,%20X_t%5EL),%0A"> each with rate matrix <img src="https://latex.codecogs.com/png.latex?Q_t"> as defined previously. At time <img src="https://latex.codecogs.com/png.latex?T">, the process reaches the fully masked sequence <img src="https://latex.codecogs.com/png.latex?%5Coverline%7BX%7D_T%20=%20(M,%20%5Cldots,%20M)"> with probability one. Denote by <img src="https://latex.codecogs.com/png.latex?%5Ctau_i"> the jump time of coordinate <img src="https://latex.codecogs.com/png.latex?i">. Since the jump times <img src="https://latex.codecogs.com/png.latex?%5Ctau_i"> are almost surely distinct, the infinitesimal rate matrix <img src="https://latex.codecogs.com/png.latex?%5Coverline%7BQ%7D_t"> of the joint process is nonzero only when a single coordinate changes. If <img src="https://latex.codecogs.com/png.latex?x,%5Cwidehat%7Bx%7D%20%5Cin%20%5Cmathcal%7BX%7D%5EL"> differ by a single coordinate <img src="https://latex.codecogs.com/png.latex?i">, we have <img src="https://latex.codecogs.com/png.latex?%0A%5Coverline%7BQ%7D_t(x,%5Cwidehat%7Bx%7D)%20=%20Q_t(x%5Ei,%20%5Cwidehat%7Bx%7D%5Ei).%0A"></p>
<p>As before, the time-reversal has infinitesimal rate matrix <img src="https://latex.codecogs.com/png.latex?%5Coverline%7BQ%7D_t%5E%7B%5Cstar%7D"> satisfying <img src="https://latex.codecogs.com/png.latex?%0A%5Coverline%7BQ%7D_t%5E%7B%5Cstar%7D(%5Cwidehat%7Bx%7D,%20x)%20=%20%5Cfrac%7B%5Coverline%7Bp%7D_t(x)%7D%7B%5Coverline%7Bp%7D_t(%5Cwidehat%7Bx%7D)%7D%20%5C,%20Q_t(x%5Ei,%20%5Cwidehat%7Bx%7D%5Ei),%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5Coverline%7Bp%7D_t"> is the marginal distribution of <img src="https://latex.codecogs.com/png.latex?%5Coverline%7BX%7D_t">. Since <img src="https://latex.codecogs.com/png.latex?x"> and <img src="https://latex.codecogs.com/png.latex?%5Cwidehat%7Bx%7D"> differ at coordinate <img src="https://latex.codecogs.com/png.latex?i"> only, for <img src="https://latex.codecogs.com/png.latex?%5Coverline%7BQ%7D_t%5E%7B%5Cstar%7D(%5Cwidehat%7Bx%7D,%20x)"> to be non-zero, necessarily <img src="https://latex.codecogs.com/png.latex?%5Cwidehat%7Bx%7D%5Ei%20=%20M"> and <img src="https://latex.codecogs.com/png.latex?x%5Ei%20%5Cin%20%5C%7B1,%5Cldots,V%5C%7D">. Let <img src="https://latex.codecogs.com/png.latex?S%20=%20%5C%7Bj%20:%20%5Cwidehat%7Bx%7D%5Ej%20%5Cneq%20M%5C%7D"> be the set of unmasked coordinates in <img src="https://latex.codecogs.com/png.latex?%5Cwidehat%7Bx%7D">. To observe configuration <img src="https://latex.codecogs.com/png.latex?%5Cwidehat%7Bx%7D"> at time <img src="https://latex.codecogs.com/png.latex?t">, the <img src="https://latex.codecogs.com/png.latex?(L%20-%20%7CS%7C)"> masked coordinates must have <img src="https://latex.codecogs.com/png.latex?%5Ctau%20%3C%20t"> and the <img src="https://latex.codecogs.com/png.latex?%7CS%7C"> unmasked ones <img src="https://latex.codecogs.com/png.latex?%5Ctau%20%5Cge%20t">: <img src="https://latex.codecogs.com/png.latex?%0A%5Coverline%7Bp%7D_t(%5Cwidehat%7Bx%7D)%20=%20%5Cmathbb%7BP%7D(%5Ctau%20%3C%20t)%5E%7BL-%7CS%7C%7D%20%5C,%20%5Cmathbb%7BP%7D(%5Ctau%20%5Cge%20t)%5E%7B%7CS%7C%7D%20%5C,%20%5Coverline%7Bp%7D_0(%5Cwidehat%7Bx%7D%5E%7BS%7D).%0A"></p>
<p>Similarly, and since <img src="https://latex.codecogs.com/png.latex?x"> differs from <img src="https://latex.codecogs.com/png.latex?%5Cwidehat%7Bx%7D"> only at coordinate <img src="https://latex.codecogs.com/png.latex?i">: <img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign*%7D%0A%5Coverline%7Bp%7D_t(x)%0A&amp;=%20%5Cmathbb%7BP%7D(%5Ctau%20%3C%20t)%5E%7BL-%7CS%7C-1%7D%20%5C,%20%5Cmathbb%7BP%7D(%5Ctau%20%5Cge%20t)%5E%7B%7CS%7C+1%7D%20%5C,%0A%5Coverline%7Bp%7D_0(x%5E%7BS%20%5Ccup%20%5C%7Bi%5C%7D%7D)%5C%5C%0A&amp;=%20%5Cmathbb%7BP%7D(%5Ctau%20%3C%20t)%5E%7BL-%7CS%7C-1%7D%20%5C,%20%5Cmathbb%7BP%7D(%5Ctau%20%5Cge%20t)%5E%7B%7CS%7C+1%7D%20%5C,%0A%5Coverline%7Bp%7D_0(%5Cwidehat%7Bx%7D%5E%7BS%7D)%5C,%20%5Coverline%7Bp%7D_0(x%5Ei%20%5Cmid%20%5Cwidehat%7Bx%7D%5E%7BS%7D).%0A%5Cend%7Balign*%7D%0A"></p>
<p>This shows that the time-reversal rate matrix becomes <span id="eq-reverse-rate-matrix"><img src="https://latex.codecogs.com/png.latex?%0A%5Coverline%7BQ%7D_t%5E%7B%5Cstar%7D(%5Cwidehat%7Bx%7D,%20x)%20=%20R(t)%20%5C,%20%5Coverline%7Bp%7D_0(x%5Ei%20%5Cmid%20%5Cwidehat%7Bx%7D%5E%7BS%7D)%5C,%20Q_t(x%5Ei,%20M),%0A%5Ctag%7B1%7D"></span></p>
<p>with time-dependent scalar <img src="https://latex.codecogs.com/png.latex?R(t)%20=%20%5Cfrac%7B%5Cmathbb%7BP%7D(%5Ctau%20%5Cge%20t)%7D%7B%5Cmathbb%7BP%7D(%5Ctau%20%3C%20t)%7D">. To simulate the reverse process that progressively unmasks a fully masked sequence, one only needs to model the conditional distribution <img src="https://latex.codecogs.com/png.latex?%5Coverline%7Bp%7D_0(x%5Ei%20%5Cmid%20%5Cwidehat%7Bx%7D%5E%7BS%7D)"> of the data distribution <img src="https://latex.codecogs.com/png.latex?%5Coverline%7Bp%7D_0">. This is precisely the prediction task of masked language models such as BERT, which estimate token probabilities conditioned on visible context.</p>
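<p>A minimal sampler sketch: since the forward jump times are i.i.d.&nbsp;with a continuous distribution, the reverse process unmasks the coordinates in uniformly random order, filling each one from the conditional distribution given the already-unmasked tokens. Here <code>model(x, i)</code> is a hypothetical stand-in for a learned network approximating that conditional distribution:</p>

```python
# Sketch of the reverse (unmasking) sampler. `model(x, i)` is a hypothetical
# stand-in for a learned BERT-like network returning the V conditional
# probabilities of token i given the currently unmasked coordinates of x.
import numpy as np

M = -1  # marker for the masked state

def unmask(model, L, rng):
    x = np.full(L, M)
    # Forward jump times are i.i.d., so the reverse unmasking order is uniform.
    for i in rng.permutation(L):
        probs = model(x, i)                  # approx. p0(x^i | unmasked part)
        x[i] = rng.choice(len(probs), p=probs)
    return x

# Toy "model": uniform over a vocabulary of size V, ignoring context.
V = 5
uniform_model = lambda x, i: np.ones(V) / V
sample = unmask(uniform_model, L=8, rng=np.random.default_rng(2))
assert sample.shape == (8,) and np.all(sample != M)
```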
</section>
<section id="training" class="level3">
<h3 class="anchored" data-anchor-id="training">Training</h3>
<p>To train the denoising model, Equation&nbsp;1 shows that it is natural to parametrize the conditional distribution</p>
<p><img src="https://latex.codecogs.com/png.latex?f_%5Ctheta(x%5Ei%20%5Cmid%20%5Cwidehat%7Bx%7D%5E%7BS%7D)%20%5Capprox%20%5Coverline%7Bp%7D_0(x%5Ei%20%5Cmid%20%5Cwidehat%7Bx%7D%5E%7BS%7D)"></p>
<p>for all sets <img src="https://latex.codecogs.com/png.latex?S%20%5Csubset%20%5C%7B1,%5Cldots,L%5C%7D"> with <img src="https://latex.codecogs.com/png.latex?i%20%5Cnotin%20S">. Once done, one can define the rate matrix of the time-reversal process as:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Coverline%7BQ%7D_%7Bt,%5Ctheta%7D%5E%7B%5Cstar%7D(%5Cwidehat%7Bx%7D,%20x)%20=%20R(t)%20%5C,%20f_%5Ctheta(x%5Ei%20%5Cmid%20%5Cwidehat%7Bx%7D%5E%7BS%7D)%20%5C,%20Q_t(x%5Ei,%20M).%0A"></p>
<p>If one denotes by <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D"> the law of the forward noising process started from <img src="https://latex.codecogs.com/png.latex?%5Coverline%7Bp%7D_0">, and by <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D_%7B%5Ctheta%7D"> the law of the time-reversal process started from the fully masked sequence <img src="https://latex.codecogs.com/png.latex?(M,M,%20%5Cldots,%20M)"> at time <img src="https://latex.codecogs.com/png.latex?T"> and with learned denoising model <img src="https://latex.codecogs.com/png.latex?f_%5Ctheta">, one can train the model by minimizing</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AD_%7B%5Ctext%7BKL%7D%7D%20%7B%5Cleft(%20%20%5Cmathbb%7BP%7D%5C;%20%7C%7C%20%5C;%20%5Cmathbb%7BP%7D_%7B%5Ctheta%7D%20%20%5Cright)%7D%20%20=%20%5Cmathbb%7BE%7D_%7B%5Cmathbb%7BP%7D%7D%20%20%7B%5Cleft(%20%20%5Clog%20%5Cfrac%7B%5Cmathbb%7BP%7D%7D%7B%5Cmathbb%7BP%7D_%7B%5Ctheta%7D%7D%20%20%5Cright)%7D%20.%0A"></p>
<p>Consider a trajectory <img src="https://latex.codecogs.com/png.latex?x_%7B%5B0,T%5D%7D"> of the forward noising process. The jump at time <img src="https://latex.codecogs.com/png.latex?%5Ctau_i"> of the <img src="https://latex.codecogs.com/png.latex?i">-th coordinate is denoted by <img src="https://latex.codecogs.com/png.latex?%5CDelta_i:%20=%20(x_%7B%5Ctau_i%5E-%7D%5Ei,%20x_%7B%5Ctau_i%5E+%7D%5Ei)%20=%20(x_0%5Ei,%20M)">. To simplify notation, we denote the reverse jump by <img src="https://latex.codecogs.com/png.latex?%5CDelta_i%5E%7B%5Cstar%7D">. The log-likelihood ratio between the two processes is easily shown to be:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Clog%20%5Cfrac%7B%5Cmathbb%7BP%7D%7D%7B%5Cmathbb%7BP%7D_%7B%5Ctheta%7D%7D(x_%7B%5B0,T%5D%7D)%20=%20%5Clog%20%5Coverline%7Bp%7D_0(x_0)%20+%20%5Csum_i%20%5Clog%20%5Cfrac%7B%5Coverline%7BQ%7D_%7B%5Ctau_i%7D(%5CDelta_i)%7D%7B%5Coverline%7BQ%7D%5E%7B%5Cstar%7D_%7B%5Ctau_i,%20%5Ctheta%7D(%5CDelta_i%5E%7B%5Cstar%7D)%7D%20-%20%5Cint_0%5ET%20%20%7B%5Cleft%5C%7B%20%20%5Coverline%7BJ%7D_t(x_t)%20-%20%5Coverline%7BJ%7D%5E%7B%5Cstar%7D_%7Bt,%5Ctheta%7D(x_t)%20%20%5Cright%5C%7D%7D%20%20%5C,%20dt,%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5Coverline%7BJ%7D_t"> and <img src="https://latex.codecogs.com/png.latex?%5Coverline%7BJ%7D%5E%7B%5Cstar%7D_%7Bt,%5Ctheta%7D"> are the total jump rates of the forward and reverse processes respectively. Since <img src="https://latex.codecogs.com/png.latex?%5Coverline%7BJ%7D%5E%7B%5Cstar%7D_%7Bt,%5Ctheta%7D(x_t)"> in fact does not depend on <img src="https://latex.codecogs.com/png.latex?%5Ctheta">, minimizing the KL divergence is equivalent to minimizing:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A-%5Cmathbb%7BE%7D_%7B%5Cmathbb%7BP%7D%7D%20%20%7B%5Cleft(%20%20%5Csum_%7Bi=1%7D%5EL%20%5Clog%20f_%5Ctheta(X_0%5Ei%20%5Cmid%20X_%7B%5Ctau_i%7D%5E%7BS_%7B%7B%5Ctau_i%7D%7D%7D)%20%20%5Cright)%7D%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?S_t"> is the set of unmasked coordinates at time <img src="https://latex.codecogs.com/png.latex?t">. It is more convenient to rewrite this quantity as an integral over time so that one can sample a time <img src="https://latex.codecogs.com/png.latex?t"> uniformly in <img src="https://latex.codecogs.com/png.latex?%5B0,T%5D"> during training. With the Dirac delta function <img src="https://latex.codecogs.com/png.latex?%5Cdelta%5B%5Ctau_i%20=%20t%5D">, we can rewrite this expectation as:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign*%7D%0A-%5Cmathbb%7BE%7D_%7B%5Cmathbb%7BP%7D%7D%20%20%7B%5Cleft(%20%20%5Csum_%7Bi=1%7D%5EL%20%5Clog%20f_%5Ctheta(X_0%5Ei%20%5Cmid%20X_%7B%5Ctau_i%7D%5E%7BS_%7B%5Ctau_i%7D%7D)%20%20%5Cright)%7D%0A&amp;=%0A-%5Cmathbb%7BE%7D_%7B%5Cmathbb%7BP%7D%7D%20%5Cint_%7Bt=0%7D%5E%7BT%7D%20%5Csum_%7Bi=1%7D%5EL%20%5Cdelta%5B%5Ctau_i%20=%20t%5D%20%5C,%20%5Clog%20f_%5Ctheta(X_0%5Ei%20%5Cmid%20X_%7Bt%7D%5E%7BS_%7Bt%7D%7D)%20%5C,%20dt%5C%5C%0A&amp;=%0A-%5Cint_%7Bt=0%7D%5E%7BT%7D%20%5Cfrac%7B%5Cdot%7B%5Cbeta_t%7D%7D%7B%5Cbeta_t%7D%20%5C,%20%5Cmathbb%7BE%7D_%7B%5Cmathbb%7BP%7D%7D%20%7B%5Cleft(%20%5Csum_%7Bi:%20%5C,%20X_t%5Ei=M%7D%20%5Clog%20f_%5Ctheta(X_0%5Ei%20%5Cmid%20X_%7Bt%7D%5E%7BS_%7Bt%7D%7D)%20%5Cright)%7D%20%5C,%20dt%0A%5Cend%7Balign*%7D%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5Cbeta_t%20=%20%5Cmathbb%7BP%7D(%5Ctau%20%5Cle%20t)"> so that <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D(%5Ctau%20%5Cin%20dt%20%7C%20%5Ctau%20%5Cle%20t)%20=%20%5Cfrac%7B%5Cdot%7B%5Cbeta_t%7D%7D%7B%5Cbeta_t%7D%20%5C,%20dt">. For training, it suffices to sample <img src="https://latex.codecogs.com/png.latex?t"> uniformly in <img src="https://latex.codecogs.com/png.latex?%5B0,T%5D">, draw <img src="https://latex.codecogs.com/png.latex?X_0%20%5Csim%20%5Coverline%7Bp%7D_0">, sample the noised configuration <img src="https://latex.codecogs.com/png.latex?X_t"> according to the forward process, and evaluate the resulting unbiased estimate of the loss. The term <img src="https://latex.codecogs.com/png.latex?%5Cdot%7B%5Cbeta_t%7D/%5Cbeta_t"> is large for small <img src="https://latex.codecogs.com/png.latex?t">, counterbalancing the fact that the reconstruction task is much easier when only a few tokens are masked. For standard denoising <a href="../../notes/DDPM/DDPM.html">diffusion models</a>, there is a similar “signal-to-noise” weighting term that balances the easy and hard denoising tasks.</p>
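This training recipe can be sketched in a few lines. The snippet below is a hedged toy illustration: the linear schedule (so that the hazard rate is the reciprocal of time), the mask token, and the placeholder uniform predictor standing in for the learned model are all assumptions made for the example only.

```python
import numpy as np

rng = np.random.default_rng(1)
L, V, T = 8, 4, 1.0
MASK = V                                  # mask token index (assumption for this toy example)

def loss_estimate(f_theta, x0):
    """One unbiased sample of the masked-diffusion training loss, with a
    linear masking schedule beta_t = t / T so that beta_dot / beta = 1 / t.
    `f_theta(i, x_t)` is a hypothetical denoiser returning a length-V
    probability vector for coordinate i given the partially masked x_t."""
    t = rng.uniform(0.0, T)
    masked = rng.random(L) < t / T        # coordinate i is masked at time t iff tau_i < t
    x_t = np.where(masked, MASK, x0)
    weight = T / t                        # T * (beta_dot / beta): importance factor for uniform t
    ce = sum(-np.log(f_theta(i, x_t)[x0[i]]) for i in np.flatnonzero(masked))
    return weight * ce

# Placeholder "model": uniform prediction over the vocabulary.
uniform = lambda i, x_t: np.full(V, 1.0 / V)
x0 = rng.integers(0, V, size=L)
est = np.mean([loss_estimate(uniform, x0) for _ in range(1000)])
print(est)
```

In practice the placeholder predictor would be a neural network; the `T / t` factor is exactly the importance weight coming from sampling `t` uniformly rather than from the schedule.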
</section>
<section id="conclusion" class="level3">
<h3 class="anchored" data-anchor-id="conclusion">Conclusion</h3>
<p>Discrete diffusion models with one-way masking are mathematically almost identical to masked language models. Hence their similar behavior and performance on text generation tasks are not coincidental. The ideas summarized in these notes were developed in a very interesting stream of papers, including <span class="citation" data-cites="ou2024your">(Ou et al. 2024)</span>, <span class="citation" data-cites="sahoo2024simple">(Sahoo et al. 2024)</span>, <span class="citation" data-cites="shi2024simplified">(Shi et al. 2024)</span> and a number of more recent works. One of the potential drawbacks of such masked discrete diffusion models is that the support of the noising distribution is typically strictly smaller, and indeed often much smaller, than the whole state space. This means that when the denoising model is not perfect and wanders outside the support of the noising distribution, one can quickly end up in regions never seen during training. This can lead to poor sample quality and unstable behavior. Other discrete diffusion models, such as those reaching the uniform distribution over all tokens at time <img src="https://latex.codecogs.com/png.latex?T">, are not as badly affected by this issue, although they do suffer from other important computational and modeling challenges. Exciting research directions remain to be explored in this area!</p>



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0">
<div id="ref-ou2024your" class="csl-entry">
Ou, Jingyang, Shen Nie, Kaiwen Xue, Fengqi Zhu, Jiacheng Sun, Zhenguo Li, and Chongxuan Li. 2024. <span>“Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data.”</span> <em>arXiv Preprint arXiv:2406.03736</em>.
</div>
<div id="ref-sahoo2024simple" class="csl-entry">
Sahoo, Subham, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin Chiu, Alexander Rush, and Volodymyr Kuleshov. 2024. <span>“Simple and Effective Masked Diffusion Language Models.”</span> <em>Advances in Neural Information Processing Systems</em> 37: 130136–84.
</div>
<div id="ref-shi2024simplified" class="csl-entry">
Shi, Jiaxin, Kehang Han, Zhe Wang, Arnaud Doucet, and Michalis Titsias. 2024. <span>“Simplified and Generalized Masked Diffusion for Discrete Data.”</span> <em>Advances in Neural Information Processing Systems</em> 37: 103131–67.
</div>
</div></section></div> ]]></description>
  <category>DDPM</category>
  <category>score</category>
  <guid>https://alexxthiery.github.io/notes/DiscreteDiff/DiscreteDiff.html</guid>
  <pubDate>Mon, 20 Oct 2025 17:00:00 GMT</pubDate>
</item>
<item>
  <title>Schrödinger Bridges</title>
  <link>https://alexxthiery.github.io/notes/shrodinger_bridge/shrodinger.html</link>
  <description><![CDATA[ 





<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/shrodinger_bridge/erwin.jpg" class="img-fluid figure-img" style="width:35.0%"></p>
<figcaption><a href="https://en.wikipedia.org/wiki/Erwin_Schr%C3%B6dinger">Erwin Schrödinger</a> (1887 – 1961)</figcaption>
</figure>
</div>
</div>
<p>Consider two discrete probability distributions <img src="https://latex.codecogs.com/png.latex?%5Cnu_0"> and <img src="https://latex.codecogs.com/png.latex?%5Cnu_1"> over <img src="https://latex.codecogs.com/png.latex?%5B%5B1,n%5D%5D">. We would like to find a <a href="https://en.wikipedia.org/wiki/Coupling_(probability)">coupling</a> <img src="https://latex.codecogs.com/png.latex?%5Cgamma(x_0,x_1)"> between <img src="https://latex.codecogs.com/png.latex?%5Cnu_0"> and <img src="https://latex.codecogs.com/png.latex?%5Cnu_1"> such that, under this coupling <img src="https://latex.codecogs.com/png.latex?(X_0,X_1)%20%5Csim%20%5Cgamma">, the distance <img src="https://latex.codecogs.com/png.latex?d(X_0,X_1)"> between <img src="https://latex.codecogs.com/png.latex?X_0"> and <img src="https://latex.codecogs.com/png.latex?X_1"> is small. Naturally, this can be formulated as the following <a href="https://en.wikipedia.org/wiki/Transportation_theory_(mathematics)">optimal transport</a> problem:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathop%7B%5Cmathrm%7Bargmin%7D%7D_%7B%5Cgamma%20%5Cin%20%5CPi(%5Cnu_0,%20%5Cnu_1)%7D%20%5C;%20%5Csum_%7Bx_0,x_1%7D%20d(x_0,x_1)%20%5C,%20%5Cgamma(x_0,x_1),%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5CPi(%5Cnu_0,%20%5Cnu_1)"> is the set of couplings between <img src="https://latex.codecogs.com/png.latex?%5Cnu_0"> and <img src="https://latex.codecogs.com/png.latex?%5Cnu_1">. It is a <a href="https://en.wikipedia.org/wiki/Linear_programming">linear program</a> and can be solved efficiently when <img src="https://latex.codecogs.com/png.latex?n"> is not too large. However, the optimal transport plan is often very sparse since it puts mass on at most <img src="https://latex.codecogs.com/png.latex?(2n-1)"> pairs <img src="https://latex.codecogs.com/png.latex?(x_0,x_1)">. This can be undesirable in some applications. Furthermore, small changes in the distributions <img src="https://latex.codecogs.com/png.latex?%5Cnu_0"> and <img src="https://latex.codecogs.com/png.latex?%5Cnu_1"> can lead to large changes in the optimal transport plan. This sensitivity can be problematic in practice, especially when the distributions are estimated from data.</p>
<section id="static-shrödinger-bridge-problem" class="level3">
<h3 class="anchored" data-anchor-id="static-shrödinger-bridge-problem">Static Schrödinger Bridge Problem</h3>
<p>A standard way to address these issues is to add an entropic regularization term to the objective function. The resulting problem is known as the Schrödinger bridge problem and can be formulated as follows. Consider a reference joint distribution <img src="https://latex.codecogs.com/png.latex?%5Cmu_%7B%5Cmathrm%7Bref%7D%7D(x_0,x_1)"> over <img src="https://latex.codecogs.com/png.latex?%5B%5B1,n%5D%5D%20%5Ctimes%20%5B%5B1,n%5D%5D"> and find the coupling <img src="https://latex.codecogs.com/png.latex?%5Cgamma(x_0,x_1)"> that minimizes the Kullback-Leibler divergence to <img src="https://latex.codecogs.com/png.latex?%5Cmu_%7B%5Cmathrm%7Bref%7D%7D"> while matching the marginals <img src="https://latex.codecogs.com/png.latex?%5Cnu_0"> and <img src="https://latex.codecogs.com/png.latex?%5Cnu_1">:</p>
<p><span id="eq-kl-contrained"><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathop%7B%5Cmathrm%7Bargmin%7D%7D_%7B%5Cgamma%20%5Cin%20%5CPi(%5Cnu_0,%20%5Cnu_1)%7D%20%5C;%20%5Csum_%7Bx_0,x_1%7D%20%5Cgamma(x_0,x_1)%20%5Clog%20%5Cfrac%7B%5Cgamma(x_0,x_1)%7D%7B%5Cmu_%7B%5Cmathrm%7Bref%7D%7D(x_0,x_1)%7D.%0A%5Ctag%7B1%7D"></span></p>
<p><strong>Remark (invariance under separable rescaling).</strong> Only the “interaction structure” of <img src="https://latex.codecogs.com/png.latex?%5Cmu_%7B%5Cmathrm%7Bref%7D%7D"> matters: if one replaces <img src="https://latex.codecogs.com/png.latex?%5Cmu_%7B%5Cmathrm%7Bref%7D%7D"> by <img src="https://latex.codecogs.com/png.latex?%5Ctilde%5Cmu_%7B%5Cmathrm%7Bref%7D%7D(x_0,x_1)%5Cpropto%20a(x_0)%5C,%5Cmu_%7B%5Cmathrm%7Bref%7D%7D(x_0,x_1)%5C,b(x_1)"> with positive <img src="https://latex.codecogs.com/png.latex?a,b">, then the optimal coupling can still be written in the same form below, with the factors <img src="https://latex.codecogs.com/png.latex?a,b"> absorbed into the potentials. Equivalently, the solution depends on <img src="https://latex.codecogs.com/png.latex?%5Cmu_%7B%5Cmathrm%7Bref%7D%7D"> only through its kernel up to left/right diagonal scaling.</p>
<p>A common choice for the reference measure is a distribution of the form <img src="https://latex.codecogs.com/png.latex?%5Cmu_%7B%5Cmathrm%7Bref%7D%7D(x_0,x_1)%20%5Cpropto%20%5Cexp(-d(x_0,x_1)/%5Cvarepsilon)"> (optionally times separable factors). This choice encourages the coupling <img src="https://latex.codecogs.com/png.latex?%5Cgamma"> to put more mass on pairs <img src="https://latex.codecogs.com/png.latex?(x_0,x_1)"> that are close according to the distance <img src="https://latex.codecogs.com/png.latex?d">, and the resulting optimization problem can be rewritten (up to an additive constant) as:</p>
<p><span id="eq-entropic-ot"><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathop%7B%5Cmathrm%7Bargmin%7D%7D_%7B%5Cgamma%20%5Cin%20%5CPi(%5Cnu_0,%20%5Cnu_1)%7D%20%5C;%20%5Csum_%7Bx_0,x_1%7D%20d(x_0,x_1)%20%5C,%20%5Cgamma(x_0,x_1)%20%20%5Ctextcolor%7Bblue%7D%7B%5C;%20-%20%5C;%20%5Cvarepsilon%5C,%20%5Cmathrm%7BH%7D(%5Cgamma)%7D%0A%5Ctag%7B2%7D"></span></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5Cmathrm%7BH%7D(%5Cgamma)%20=%20-%20%5Csum_%7Bx_0,x_1%7D%20%5Cgamma(x_0,x_1)%20%5Clog%20%5Cgamma(x_0,x_1)"> is the <a href="https://en.wikipedia.org/wiki/Entropy_(information_theory)">entropy</a> of the coupling <img src="https://latex.codecogs.com/png.latex?%5Cgamma">. Note that since the marginals of <img src="https://latex.codecogs.com/png.latex?%5Cgamma"> are fixed, it is also equivalent (up to constants) to replacing the entropy term by the <a href="https://en.wikipedia.org/wiki/Kullback–Leibler_divergence">Kullback-Leibler</a> divergence to the independent coupling <img src="https://latex.codecogs.com/png.latex?%5Cnu_0(x_0)%20%5Cotimes%20%5Cnu_1(x_1)">,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathop%7B%5Cmathrm%7Bargmin%7D%7D_%7B%5Cgamma%20%5Cin%20%5CPi(%5Cnu_0,%20%5Cnu_1)%7D%20%5C;%20%5Csum_%7Bx_0,x_1%7D%20d(x_0,x_1)%20%5C,%20%5Cgamma(x_0,x_1)%20%20%5Ctextcolor%7Bblue%7D%7B%5C;%20+%20%5C;%20%5Cvarepsilon%5C,%20D_%7B%5Ctext%7BKL%7D%7D(%5Cgamma%20%5Cmid%20%5Cnu_0%20%5Cotimes%20%5Cnu_1)%7D.%0A"></p>
<p>Writing the <a href="https://en.wikipedia.org/wiki/Duality_(optimization)">Lagrange dual</a> formulation of the problem Equation&nbsp;2 provides fast algorithms to solve it such as the <a href="https://en.wikipedia.org/wiki/Iterative_proportional_fitting">iterative proportional fitting</a> procedure (IPFP), also known as the Sinkhorn-Knopp algorithm in the optimal transport literature <span class="citation" data-cites="cuturi2013sinkhorn">(Cuturi 2013)</span>. Importantly, it also almost immediately shows that the optimal coupling has a very simple form,</p>
<p><span id="eq-schrodinger-solution"><img src="https://latex.codecogs.com/png.latex?%0A%5Cgamma(x_0,x_1)%20=%20f(x_0)%20%5C,%20%5Cmu_%7B%5Cmathrm%7Bref%7D%7D(x_0,x_1)%20%5C,%20g(x_1).%0A%5Ctag%7B3%7D"></span></p>
<p>These potentials must satisfy the marginal constraints, which take the form of the <strong>Schrödinger/Sinkhorn scaling equations</strong>:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Baligned%7D%0A%5Cnu_0(x_0)%20&amp;=%20%5Csum_%7Bx_1%7D%20f(x_0)%5C,%20%5Cmu_%7B%5Cmathrm%7Bref%7D%7D(x_0,x_1)%5C,%20g(x_1),%5C%5C%0A%5Cnu_1(x_1)%20&amp;=%20%5Csum_%7Bx_0%7D%20f(x_0)%5C,%20%5Cmu_%7B%5Cmathrm%7Bref%7D%7D(x_0,x_1)%5C,%20g(x_1).%0A%5Cend%7Baligned%7D%0A%5Cright.%0A"></p>
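These scaling equations can be solved by alternately enforcing each one in turn, which is exactly the IPFP/Sinkhorn iteration mentioned above. A self-contained NumPy sketch follows; the marginals, the cost matrix, and the regularization level are arbitrary toy choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, eps = 6, 0.5

# Toy marginals and cost; reference kernel mu_ref(x0, x1) ∝ exp(-d(x0, x1) / eps).
nu0 = rng.random(n); nu0 /= nu0.sum()
nu1 = rng.random(n); nu1 /= nu1.sum()
d = np.abs(np.subtract.outer(np.arange(n), np.arange(n))) / n
K = np.exp(-d / eps)                      # reference kernel (up to separable rescaling)

# IPFP / Sinkhorn: alternately rescale f and g to enforce each marginal constraint.
f, g = np.ones(n), np.ones(n)
for _ in range(500):
    f = nu0 / (K @ g)                     # enforce the nu0 (row) constraint
    g = nu1 / (K.T @ f)                   # enforce the nu1 (column) constraint

gamma = f[:, None] * K * g[None, :]       # optimal coupling gamma = f * mu_ref * g
print(np.abs(gamma.sum(axis=1) - nu0).max(), np.abs(gamma.sum(axis=0) - nu1).max())
```

Each update solves one scaling equation exactly, so at convergence both marginals of the coupling match the prescribed ones.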
<details>
<summary>
Some details:
</summary>
<p style="color: blue;">
The Lagrangian of the constrained convex optimization Equation&nbsp;1 reads: <img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign*%7D%0A%5Csum_%7Bx_0,x_1%7D%20%5Cgamma(x_0,x_1)%20%5Clog%20%5Cfrac%7B%5Cgamma(x_0,x_1)%7D%7B%5Cmu_%7B%5Cmathrm%7Bref%7D%7D(x_0,x_1)%7D%0A&amp;+%20%5Csum_%7Bx_0%7D%20%5Calpha(x_0)%20%5Cleft(%20%5Cnu_0(x_0)%20-%20%5Csum_%7Bx_1%7D%20%5Cgamma(x_0,x_1)%20%5Cright)%5C%5C%0A&amp;+%20%5Csum_%7Bx_1%7D%20%5Cbeta(x_1)%20%5Cleft(%20%5Cnu_1(x_1)%20-%20%5Csum_%7Bx_0%7D%20%5Cgamma(x_0,x_1)%20%5Cright).%0A%5Cend%7Balign*%7D%0A"> There is no need to impose a constraint on the total mass of <img src="https://latex.codecogs.com/png.latex?%5Cgamma"> since it is automatically satisfied by the marginal constraints. For a fixed set of dual variables <img src="https://latex.codecogs.com/png.latex?%5Calpha"> and <img src="https://latex.codecogs.com/png.latex?%5Cbeta">, minimizing the Lagrangian with respect to <img src="https://latex.codecogs.com/png.latex?%5Cgamma"> leads to the optimum <img src="https://latex.codecogs.com/png.latex?%5Cgamma_%7B%5Calpha,%20%5Cbeta%7D(x_0,x_1)%20=%20%5Cmu_%7B%5Cmathrm%7Bref%7D%7D(x_0,x_1)%20%5C,%20e%5E%7B%5Calpha(x_0)%20+%20%5Cbeta(x_1)%20-1%7D">, hence proving Equation&nbsp;3 with <img src="https://latex.codecogs.com/png.latex?f=e%5E%7B%5Calpha-1%7D"> and <img src="https://latex.codecogs.com/png.latex?g=e%5E%5Cbeta"> (up to scaling). Plugging this expression back into the Lagrangian gives: <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7BL%7D(%5Calpha,%20%5Cbeta)%20=%20%5Csum_%7Bx_0%7D%20%5Calpha(x_0)%20%5C,%20%5Cnu_0(x_0)%20+%20%5Csum_%7Bx_1%7D%20%5Cbeta(x_1)%20%5C,%20%5Cnu_1(x_1)%20-%20%5Csum_%7Bx_0,x_1%7D%20%5Cgamma_%7B%5Calpha,%20%5Cbeta%7D(x_0,x_1).%0A"> Alternating maximization in <img src="https://latex.codecogs.com/png.latex?(%5Calpha,%5Cbeta)"> corresponds to alternately enforcing the two marginal constraints, i.e.&nbsp;IPFP / Sinkhorn scaling. 
Directly maximizing the Lagrange dual <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7D(%5Calpha,%20%5Cbeta)"> using gradient ascent methods such as <a href="https://en.wikipedia.org/wiki/Limited-memory_BFGS">L-BFGS</a> is also possible and can lead to faster convergence in some cases. Note that the dual problem only has <img src="https://latex.codecogs.com/png.latex?2n"> variables, while the primal problem has <img src="https://latex.codecogs.com/png.latex?n%5E2"> variables!
</p>
</details>
</section>
<section id="dynamic-schrödinger-bridge-problem" class="level3">
<h3 class="anchored" data-anchor-id="dynamic-schrödinger-bridge-problem">Dynamic Schrödinger Bridge Problem</h3>
<p>Now, suppose that the reference distribution <img src="https://latex.codecogs.com/png.latex?%5Cmu_%7B%5Cmathrm%7Bref%7D%7D(x_0,x_1)"> is given as the two-time marginal of a Markov process <img src="https://latex.codecogs.com/png.latex?(X_t)_%7Bt%20%5Cin%20%5B0,1%5D%7D"> with transition kernels <img src="https://latex.codecogs.com/png.latex?p%5E%7B%5Ctext%7Bref%7D%7D_%7Bs,t%7D(x_s,x_t)"> and path measure <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D%5E%7B%5Ctext%7Bref%7D%7D"> on trajectories <img src="https://latex.codecogs.com/png.latex?x_%7B%5B0,1%5D%7D">.</p>
<p>The dynamic Schrödinger bridge problem consists in finding a new path measure <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BQ%7D"> on trajectories <img src="https://latex.codecogs.com/png.latex?x_%7B%5B0,1%5D%7D"> such that the starting and ending marginals are <img src="https://latex.codecogs.com/png.latex?%5Cnu_0"> and <img src="https://latex.codecogs.com/png.latex?%5Cnu_1">, while minimizing the Kullback-Leibler divergence to the reference path distribution <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D%5E%7B%5Ctext%7Bref%7D%7D">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathop%7B%5Cmathrm%7Bargmin%7D%7D_%7B%5Cmathbb%7BQ%7D:%5C%20%5Cmathbb%7BQ%7D_0=%5Cnu_0,%5C%20%5Cmathbb%7BQ%7D_1=%5Cnu_1%7D%5C;%20%5Cmathrm%7BKL%7D(%5Cmathbb%7BQ%7D%5Cmid%20%5Cmathbb%7BP%7D%5E%7B%5Ctext%7Bref%7D%7D).%0A"></p>
<p>When the reference is Markov, the chain rule for KL (disintegration with respect to <img src="https://latex.codecogs.com/png.latex?(X_0,X_1)">) shows this is equivalent to the static Schrödinger bridge problem for the two-time marginal of <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D%5E%7B%5Ctext%7Bref%7D%7D">: one first solves for the optimal endpoint coupling <img src="https://latex.codecogs.com/png.latex?%5Cgamma%5E%5Cstar(x_0,x_1)">, and then fills in intermediate times using the <strong>reference bridges</strong> <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D%5E%7B%5Ctext%7Bref%7D%7D(%5Ccdot%5Cmid%20X_0=x_0,X_1=x_1)">:</p>
<ol type="1">
<li>Sample the endpoints <img src="https://latex.codecogs.com/png.latex?(X_0,%20X_1)%20%5Csim%20%5Cgamma%5E%5Cstar(x_0,x_1)">.<br>
</li>
<li>Sample the intermediate points according to the conditional law of the reference process given the endpoints, i.e.&nbsp;<img src="https://latex.codecogs.com/png.latex?(X_t)_%7Bt%20%5Cin%20(0,1)%7D%20%5Csim%20%5Cmathbb%7BP%7D%5E%7B%5Ctext%7Bref%7D%7D(%5Ccdot%20%5Cmid%20X_0,%20X_1)">.</li>
</ol>
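As a concrete instance of this two-step recipe (a sketch under simplifying assumptions), take the reference process to be a Brownian motion with volatility σ: its bridges are Gaussian, so step 2 can be carried out sequentially in closed form. The fixed endpoints below are placeholders standing in for a draw from the optimal coupling of step 1.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 0.5

def sample_bridge_path(x0, x1, ts):
    """Step 2: fill in a Brownian bridge between the endpoints at increasing
    times ts in (0, 1). Given the current point (t_prev, x_prev) and the
    endpoint (1, x1), the next value at time t is Gaussian with mean
    a * x_prev + (1 - a) * x1 where a = (1 - t) / (1 - t_prev), and variance
    sigma^2 (t - t_prev)(1 - t)/(1 - t_prev)."""
    path = [x0]
    t_prev, x_prev = 0.0, x0
    for t in ts:
        a = (1.0 - t) / (1.0 - t_prev)
        mean = a * x_prev + (1.0 - a) * x1
        var = sigma**2 * (t - t_prev) * (1.0 - t) / (1.0 - t_prev)
        x_prev = mean + np.sqrt(var) * rng.standard_normal()
        path.append(x_prev)
        t_prev = t
    path.append(x1)
    return np.array(path)

# Step 1 (assumed done): endpoints drawn from the optimal coupling; fixed here.
x0, x1 = 0.0, 1.0
path = sample_bridge_path(x0, x1, ts=np.linspace(0.1, 0.9, 9))
print(path[0], path[-1])
```

The sequential construction is exact because the reference bridge is itself Markov: conditioning the remaining path on the current point and the endpoint again yields a Brownian bridge.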
<p><strong>Continuous-Time Stochastic Processes:</strong></p>
<p>A typical setting is when the reference process <img src="https://latex.codecogs.com/png.latex?(X_t)"> is given as the solution of a stochastic differential equation of the form</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AdX_t%20=%20b_t(X_t)%20%5C,%20dt%20+%20%5Csigma%20%5C,%20dW_t,%0A"></p>
<p>started from some initial distribution (one may take it to be <img src="https://latex.codecogs.com/png.latex?%5Cnu_0"> without loss of generality, since any mismatch can be absorbed into the endpoint tilt below). The solution to the dynamic Schrödinger bridge problem is given by the twisted path distribution:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7Bd%5Cmathbb%7BQ%7D%5E%5Cstar%7D%7Bd%5Cmathbb%7BP%7D%5E%7B%5Ctext%7Bref%7D%7D%7D(x_%7B%5B0,1%5D%7D)%0A%5Cpropto%20f(X_0)%5C,g(X_1),%0A"></p>
<p>for endpoint potentials <img src="https://latex.codecogs.com/png.latex?f"> and <img src="https://latex.codecogs.com/png.latex?g"> satisfying the marginal constraints at <img src="https://latex.codecogs.com/png.latex?t=0"> and <img src="https://latex.codecogs.com/png.latex?t=1">.</p>
<p>The marginal density at intermediate time <img src="https://latex.codecogs.com/png.latex?t%5Cin%5B0,1%5D"> factorizes as:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Aq_t(x)%20=%20p_t%5E%7B%5Ctext%7Bref%7D%7D(x)%5C,%5Cwidehat%7B%5Cvarphi%7D_t(x)%5C,%5Cvarphi_t(x),%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?p_t%5E%7B%5Ctext%7Bref%7D%7D"> is the time-<img src="https://latex.codecogs.com/png.latex?t"> marginal density of the reference process, and where the time-dependent Schrödinger potentials are defined by:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Baligned%7D%0A%5Cvarphi_t(x)%20&amp;=%20%5Cmathbb%7BE%7D_%7B%5Cmathbb%7BP%7D%5E%7B%5Ctext%7Bref%7D%7D%7D%5Bg(X_1)%5Cmid%20X_t=x%5D,%5C%5C%0A%5Cwidehat%7B%5Cvarphi%7D_t(x)%20&amp;=%20%5Cmathbb%7BE%7D_%7B%5Cmathbb%7BP%7D%5E%7B%5Ctext%7Bref%7D%7D%7D%5Bf(X_0)%5Cmid%20X_t=x%5D,%0A%5Cend%7Baligned%7D%0A%5Cright.%0A"></p>
<p>(with <img src="https://latex.codecogs.com/png.latex?%5Cvarphi_1%20%5Cequiv%20g"> and <img src="https://latex.codecogs.com/png.latex?%5Cwidehat%7B%5Cvarphi%7D_0%20%5Cequiv%20f">, up to normalization). Naturally, the dynamics of the Schrödinger bridge process can be described as a new stochastic differential equation obtained by applying a <a href="../../notes/doob_transforms/doob.html">Doob h-transform</a> to the reference SDE,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AdX_t%20=%20b_t(X_t)%5C,dt%20+%20%20%5Ctextcolor%7Bblue%7D%7B%5Csigma%20%5Csigma%5E%5Ctop%20%5Cnabla_x%20%5Clog%20%5Cvarphi_t(%20X_t)%5C,dt%7D%20+%20%5Csigma%5C,dW_t,%0A"></p>
<p>where, as explained in these previous <a href="../../notes/doob_transforms/doob.html">notes</a>, the function <img src="https://latex.codecogs.com/png.latex?%5Cvarphi_t(x)"> is the harmonic extension of the terminal potential <img src="https://latex.codecogs.com/png.latex?g"> defined above.</p>
<p><strong>Discrete-Time Markov Chains:</strong></p>
<p>It is often useful to state the Schrödinger bridge dynamics in a discrete setting. Let the reference process be a Markov chain <img src="https://latex.codecogs.com/png.latex?X_0,%20%5Cldots,%20X_T"> with one-step kernels <img src="https://latex.codecogs.com/png.latex?M_k(x,dy)">, i.e. <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BP%7D%5E%7B%5Ctext%7Bref%7D%7D(X_%7Bk+1%7D%5Cin%20dy%5Cmid%20X_k=x)=M_k(x,dy).%0A"> Let <img src="https://latex.codecogs.com/png.latex?g"> be the terminal potential at time <img src="https://latex.codecogs.com/png.latex?T"> and define the backward (harmonic) potentials by <img src="https://latex.codecogs.com/png.latex?%5Cvarphi_T%20%5Cequiv%20g"> and the recursion: <img src="https://latex.codecogs.com/png.latex?%0A%5Cvarphi_k(x)%20=%20%5Cint%20M_k(x,dy)%5C,%5Cvarphi_%7Bk+1%7D(y).%0A"></p>
<p>Then the Schrödinger bridge <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BQ%7D%5E%5Cstar"> is Markov and its <strong>forward transition kernels</strong> are given by the discrete Doob <img src="https://latex.codecogs.com/png.latex?h">-transform: <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BQ%7D%5E%5Cstar(X_%7Bk+1%7D%5Cin%20dy%5Cmid%20X_k=x)%0A=%5Cfrac%7BM_k(x,dy)%5C,%5Cvarphi_%7Bk+1%7D(y)%7D%7B%5Cvarphi_k(x)%7D.%0A"> This is the discrete-time counterpart of the continuous-time drift correction <img src="https://latex.codecogs.com/png.latex?b_t%20%5Cmapsto%20b_t%20+%20%5Csigma%5Csigma%5E%5Ctop%5Cnabla%5Clog%5Cvarphi_t(%5Ccdot)">.</p>
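For a finite state space, the backward recursion and the h-transformed kernels take a few lines of NumPy. The sketch below (random toy kernels, hypothetical names) verifies that the transformed kernels are valid transition matrices, as guaranteed by the normalization built into the recursion for the potentials.

```python
import numpy as np

rng = np.random.default_rng(4)
n, T = 5, 4

# Reference chain: random one-step transition matrices M_k (rows sum to 1).
M = [rng.random((n, n)) + 0.1 for _ in range(T)]
M = [m / m.sum(axis=1, keepdims=True) for m in M]

# Terminal potential g > 0 and backward recursion phi_k = M_k phi_{k+1}.
g = rng.random(n) + 0.1
phi = [None] * (T + 1)
phi[T] = g
for k in range(T - 1, -1, -1):
    phi[k] = M[k] @ phi[k + 1]

# Discrete Doob h-transform: Q_k(x, y) = M_k(x, y) phi_{k+1}(y) / phi_k(x).
Q = [M[k] * phi[k + 1][None, :] / phi[k][:, None] for k in range(T)]
print(max(np.abs(q.sum(axis=1) - 1.0).max() for q in Q))
```

Rows sum to one by construction, since dividing by the potential at the current state exactly renormalizes each row.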
<p>In general, Schrödinger bridge problems are difficult since the endpoint potentials <img src="https://latex.codecogs.com/png.latex?f"> and <img src="https://latex.codecogs.com/png.latex?g"> must be found such that the marginal constraints are satisfied. Recent years have seen the development of many numerical methods to solve this problem approximately, especially in the machine learning community, e.g.&nbsp;<span class="citation" data-cites="shi2023diffusion">(Shi et al. 2023)</span>; <span class="citation" data-cites="de2021diffusion">(De Bortoli et al. 2021)</span>; <span class="citation" data-cites="vargas2021solving">(Vargas et al. 2021)</span>; <span class="citation" data-cites="chen2021likelihood">(Chen, Liu, and Theodorou 2021)</span>.</p>



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0">
<div id="ref-chen2021likelihood" class="csl-entry">
Chen, Tianrong, Guan-Horng Liu, and Evangelos A Theodorou. 2021. <span>“Likelihood Training of Schrodinger Bridge Using Forward-Backward Sdes Theory.”</span> <em>arXiv Preprint arXiv:2110.11291</em>.
</div>
<div id="ref-cuturi2013sinkhorn" class="csl-entry">
Cuturi, Marco. 2013. <span>“Sinkhorn Distances: Lightspeed Computation of Optimal Transport.”</span> <em>Advances in Neural Information Processing Systems</em> 26.
</div>
<div id="ref-de2021diffusion" class="csl-entry">
De Bortoli, Valentin, James Thornton, Jeremy Heng, and Arnaud Doucet. 2021. <span>“Diffusion Schrodinger Bridge with Applications to Score-Based Generative Modeling.”</span> <em>Advances in Neural Information Processing Systems</em> 34: 17695–709.
</div>
<div id="ref-shi2023diffusion" class="csl-entry">
Shi, Yuyang, Valentin De Bortoli, Andrew Campbell, and Arnaud Doucet. 2023. <span>“Diffusion Schrodinger Bridge Matching.”</span> <em>Advances in Neural Information Processing Systems</em> 36: 62183–223.
</div>
<div id="ref-vargas2021solving" class="csl-entry">
Vargas, Francisco, Pierre Thodoroff, Austen Lamacraft, and Neil Lawrence. 2021. <span>“Solving Schrodinger Bridges via Maximum Likelihood.”</span> <em>Entropy</em> 23 (9). MDPI: 1134.
</div>
</div></section></div> ]]></description>
  <category>SDE</category>
  <category>markov</category>
  <guid>https://alexxthiery.github.io/notes/shrodinger_bridge/shrodinger.html</guid>
  <pubDate>Sat, 18 Oct 2025 17:00:00 GMT</pubDate>
</item>
<item>
  <title>Infinite Products</title>
  <link>https://alexxthiery.github.io/notes/infinite_products/inft_prod.html</link>
  <description><![CDATA[ 





<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/infinite_products/hadamard.jpg" class="img-fluid figure-img" style="width:60.0%"></p>
<figcaption><a href="https://en.wikipedia.org/wiki/Jacques_Hadamard">Jacques Hadamard</a> (1865 – 1963)</figcaption>
</figure>
</div>
</div>
<p>Any polynomial <img src="https://latex.codecogs.com/png.latex?P(z)"> can be expressed as a product over its zeros, <img src="https://latex.codecogs.com/png.latex?P(z)%20=%20c%20%5Cprod_%7Bk=1%7D%5En%20(z%20-%20z_k)">. Now, consider an entire function <img src="https://latex.codecogs.com/png.latex?f(z)"> with an infinite number of zeros <img src="https://latex.codecogs.com/png.latex?z_k">. Necessarily, the zeros must accumulate only at infinity, and one could be tempted to compare <img src="https://latex.codecogs.com/png.latex?f"> to <img src="https://latex.codecogs.com/png.latex?c%20%5C,%20%5Cprod_%7Bk%20%5Cgeq%201%7D%20(z%20-%20z_k)">. Unfortunately, this does not work since there is no hope for the product to converge. Instead, it seems more reasonable to consider <img src="https://latex.codecogs.com/png.latex?c%20%5C,%20%5Cprod_%7Bk%20%5Cgeq%201%7D%20(1%20-%20z/z_k)"> since, for a given <img src="https://latex.codecogs.com/png.latex?z">, this product has a better chance to converge because <img src="https://latex.codecogs.com/png.latex?(1%20-%20z/z_k)%20%5Cto%201"> as <img src="https://latex.codecogs.com/png.latex?k%20%5Cto%20%5Cinfty">. For simplicity, one can assume that the <img src="https://latex.codecogs.com/png.latex?z_k"> are non-zero, since otherwise, one can just add a factor <img src="https://latex.codecogs.com/png.latex?z%5Em"> to the product, where <img src="https://latex.codecogs.com/png.latex?m"> is the multiplicity of the zero at <img src="https://latex.codecogs.com/png.latex?0">.</p>
<p>There are indeed a few issues. First, one needs the condition <img src="https://latex.codecogs.com/png.latex?%5Csum_%7Bk%20%5Cgeq%201%7D%201/%7Cz_k%7C%20%3C%20%5Cinfty"> to ensure convergence. Second, even if this condition is satisfied, any function of the type <img src="https://latex.codecogs.com/png.latex?e%5E%7Bg(z)%7D%20%5C,%20%5Cprod_%7Bk%20%5Cgeq%201%7D%20(1%20-%20z/z_k)">, where <img src="https://latex.codecogs.com/png.latex?g"> is an entire function, would also share the same zeros. The first issue is quite easily taken care of. Instead of considering terms of the type <img src="https://latex.codecogs.com/png.latex?(1%20-%20z/z_k)">, one needs to consider terms that converge much faster to <img src="https://latex.codecogs.com/png.latex?1"> as <img src="https://latex.codecogs.com/png.latex?z_k%20%5Cto%20%5Cinfty">, and only vanish at <img src="https://latex.codecogs.com/png.latex?z_k">. A natural choice is <img src="https://latex.codecogs.com/png.latex?E_p(z/z_k)"> with</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AE_p(z)%20=%20(1%20-%20z)%20%5C,%20%5Cexp%5Cleft(z%20+%20%5Cfrac%7Bz%5E2%7D%7B2%7D%20+%20%5Ccdots%20+%20%5Cfrac%7Bz%5Ep%7D%7Bp%7D%5Cright)%20%5Capprox_0%201%20-%20%5Cfrac%7Bz%5E%7Bp+1%7D%7D%7Bp+1%7D.%0A"></p>
<p>It is then easy to see, for example, that the product <img src="https://latex.codecogs.com/png.latex?%5Cprod_k%20E_k(z/z_k)"> is well defined for <img src="https://latex.codecogs.com/png.latex?z%20%5Cin%20%5Cmathbb%7BC%7D"> and precisely vanishes at the <img src="https://latex.codecogs.com/png.latex?z_k">. If one knew that the zeros satisfied <img src="https://latex.codecogs.com/png.latex?%5Csum_%7Bk%20%5Cgeq%201%7D%201/%7Cz_k%7C%5E%7Bp+1%7D%20%3C%20%5Cinfty">, then one could use instead <img src="https://latex.codecogs.com/png.latex?%5Cprod_k%20E_p(z/z_k)">. However, this approach is often of limited use since, as mentioned above, one can always multiply by an entire function <img src="https://latex.codecogs.com/png.latex?e%5E%7Bg(z)%7D"> to obtain another entire function with the same zeros.</p>
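<p>As a quick numerical illustration (a Python sketch, not part of the original argument): take zeros at the positive integers, <code>z_k = k</code>, for which <code>sum 1/k</code> diverges but <code>sum 1/k^2</code> converges. The plain partial products of <code>(1 - z/k)</code> drift to zero like <code>N^{-z}</code>, while the partial products of the elementary factors <code>E_1(z/k)</code> stabilize:</p>

```python
import math

def plain_partial(z, N):
    # partial product of (1 - z/k): behaves like N^{-z}, so it drifts to 0
    p = 1.0
    for k in range(1, N + 1):
        p *= 1.0 - z / k
    return p

def e1_partial(z, N):
    # partial product of elementary factors E_1(z/k) = (1 - z/k) exp(z/k)
    p = 1.0
    for k in range(1, N + 1):
        p *= (1.0 - z / k) * math.exp(z / k)
    return p

z = 0.5
print(plain_partial(z, 10**3), plain_partial(z, 10**5))  # keeps shrinking
print(e1_partial(z, 10**3), e1_partial(z, 10**5))        # stabilizes
```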
<p>To make progress, one can impose some growth condition on the entire function <img src="https://latex.codecogs.com/png.latex?f(z)">. For example, an entire function <img src="https://latex.codecogs.com/png.latex?f(z)"> is said to be of order <img src="https://latex.codecogs.com/png.latex?%5Crho"> if</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%7Cf(z)%7C%20%5Cleq%20C_%7B%5Cvarepsilon%7D%20%5C,%20%5Cexp%5Cleft(%7Cz%7C%5E%7B%5Crho%20+%20%5Cvarepsilon%7D%5Cright)%20%5Cqquad%20%5Ctext%7Bfor%20all%20%7D%20%5Cvarepsilon%3E%200.%0A"></p>
<p>For example, one can readily see that the sine function is of order <img src="https://latex.codecogs.com/png.latex?1">, and any polynomial is of order <img src="https://latex.codecogs.com/png.latex?0">. If one knows that the entire function <img src="https://latex.codecogs.com/png.latex?f(z)"> is of order <img src="https://latex.codecogs.com/png.latex?%5Crho"> and has zeros <img src="https://latex.codecogs.com/png.latex?z_k"> (counted with multiplicity), then <a href="https://en.wikipedia.org/wiki/Hadamard_factorization_theorem">Hadamard’s factorization theorem</a> states that, in fact, the function <img src="https://latex.codecogs.com/png.latex?f"> can be expressed as</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Af(z)%20=%20e%5E%7BP(z)%7D%5C,%20z%5Em%20%5C,%20%5Cprod_%7Bk%20%5Cgeq%201%7D%20E_%7Bd%7D%20%5Cleft(%5Cfrac%7Bz%7D%7Bz_k%7D%5Cright)%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?m"> is the multiplicity of the zero at <img src="https://latex.codecogs.com/png.latex?0">, <img src="https://latex.codecogs.com/png.latex?d%20=%20%5Clfloor%20%5Crho%20%5Crfloor">, and <img src="https://latex.codecogs.com/png.latex?P(z)"> is a polynomial of degree at most <img src="https://latex.codecogs.com/png.latex?d">.</p>
<section id="some-natural-examples" class="level3">
<h3 class="anchored" data-anchor-id="some-natural-examples">Some natural examples</h3>
<p>One can then ask oneself which natural entire functions vanish at some predetermined set of zeros. For example, what functions vanish on all the integers? Such a function cannot be of order less than one since otherwise it could be written as <img src="https://latex.codecogs.com/png.latex?z%20%5C,%20%5Cprod_%7Bk%20%5Cgeq%201%7D%20(1-z/k)(1+z/k)">, but this product does not converge. Any entire function of order <img src="https://latex.codecogs.com/png.latex?1"> that has a simple zero at each integer is of the form</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0Af(z)%0A&amp;=%20e%5E%7Baz%20+%20b%7D%20%5C,%20z%20%5C,%20%5Cprod_%7Bk%20%5Cgeq%201%7D(1-z/k)(1+z/k)%20e%5E%7Bz/z_k%7D%20%5C,%20e%5E%7B-z/z_k%7D%5C%5C%0A&amp;=%20e%5E%7Baz%20+%20b%7D%20%5C,%20z%20%5C,%20%5Cprod_%7Bk%20%5Cgeq%201%7D%20(1-z%5E2/k%5E2).%0A%5Cend%7Balign%7D%0A"></p>
<p>Checking the Taylor expansion of sine at zero, one finds the celebrated formula</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Csin(%5Cpi%20z)%20=%20%5Cpi%20%5C,%20z%20%5C,%20%5Cprod_%7Bk%20%5Cgeq%201%7D%20%5Cleft(1%20-%20%5Cfrac%7Bz%5E2%7D%7Bk%5E2%7D%5Cright)%0A"></p>
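<p>This product formula is easy to check numerically; here is a short illustrative Python sketch (not part of the original derivation; the truncation level <code>N</code> is an arbitrary choice, and the convergence is only algebraic, at rate roughly <code>z^2/N</code>):</p>

```python
import math

def sin_product(z, N=20000):
    # truncated product: pi * z * prod_{k=1}^N (1 - z^2 / k^2)
    p = math.pi * z
    for k in range(1, N + 1):
        p *= 1.0 - (z * z) / (k * k)
    return p

for z in (0.3, 1.5, -2.7):
    print(z, sin_product(z), math.sin(math.pi * z))
```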
<p>A similarly interesting example is obtained by taking the derivative of <img src="https://latex.codecogs.com/png.latex?%5Clog%20%5Csin(%5Cpi%20z)">. Now, what about a function that vanishes on all the negative integers? Again, such a function needs to be of order at least one. Hadamard’s factorization theorem then tells us that such a function is, up to a multiplicative constant, of the form</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ag(z)%20=%20e%5E%7Baz%7D%20%5C,%20z%20%5C,%20%5Cprod_%7Bk%20%5Cgeq%201%7D%20(1%20+%20z/k)%20e%5E%7B-z/k%7D.%0A"></p>
<p>Naturally, since such a function vanishes on all the negative integers, one of the first things one would like to try is to look at <img src="https://latex.codecogs.com/png.latex?g(z+1)"> and relate it to <img src="https://latex.codecogs.com/png.latex?g"> itself. Since <img src="https://latex.codecogs.com/png.latex?g(z+1)"> vanishes on <img src="https://latex.codecogs.com/png.latex?%5C%7B-1,%20-2,%20%5Cldots%5C%7D">, one knows that it can be expressed as <img src="https://latex.codecogs.com/png.latex?e%5E%7Ba'z%20+%20b'%7D%20%5C,%20z%20%5C,%20%5Cprod_%7Bk%20%5Cgeq%201%7D%20(1%20+%20z/k)%20e%5E%7B-z/k%7D"> so that <img src="https://latex.codecogs.com/png.latex?z%20%5C,%20g(z+1)"> is almost the same as <img src="https://latex.codecogs.com/png.latex?g(z)">. One can then do some algebra to choose the constant <img src="https://latex.codecogs.com/png.latex?a"> so that <img src="https://latex.codecogs.com/png.latex?z%20%5C,%20g(z+1)%20=%20g(z)">. One finds that the correct choice is the <a href="https://en.wikipedia.org/wiki/Euler's_constant">Euler-Mascheroni constant</a>,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Aa%20=%20%5Cgamma%20=%20%5Clim_%7Bn%20%5Cto%20%5Cinfty%7D%20%5Cleft(%20%5Csum_%7Bk=1%7D%5En%20%5Cfrac%7B1%7D%7Bk%7D%20-%20%5Clog%20n%20%5Cright).%0A"></p>
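<p>As a quick sanity check of this limit (an illustrative Python sketch; the convergence is logarithmically slow, with error of order <code>1/(2n)</code>):</p>

```python
import math

def euler_gamma_partial(n):
    # H_n - log(n), which converges to the Euler-Mascheroni constant
    return sum(1.0 / k for k in range(1, n + 1)) - math.log(n)

print(euler_gamma_partial(10**6))  # close to 0.5772156649...
```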
<p>This gives the final expression for <img src="https://latex.codecogs.com/png.latex?g(z)"> as</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ag(z)%20=%20e%5E%7B%5Cgamma%20z%7D%20%5C,%20z%20%5C,%20%5Cprod_%7Bk%20%5Cgeq%201%7D%20(1%20+%20z/k)%20e%5E%7B-z/k%7D.%0A"></p>
<p>Furthermore, since <img src="https://latex.codecogs.com/png.latex?g(z)/z%20%5Cto%201"> as <img src="https://latex.codecogs.com/png.latex?z%20%5Cto%200">, it follows that <img src="https://latex.codecogs.com/png.latex?g(1)=1">, from which the identity <img src="https://latex.codecogs.com/png.latex?z%20%5C,%20g(z+1)%20=%20g(z)"> gives that <img src="https://latex.codecogs.com/png.latex?1/g(n+1)%20=%20n!">. In other words, <img src="https://latex.codecogs.com/png.latex?g(z)"> is an analytic continuation on the whole complex plane of the inverse of the <a href="https://en.wikipedia.org/wiki/Gamma_function">Gamma function</a>,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5CGamma(z)%20=%20e%5E%7B-%5Cgamma%20z%7D%20%5C,%20z%5E%7B-1%7D%20%5C,%20%5Cprod_%7Bk%20%5Cgeq%201%7D%20(1%20+%20z/k)%5E%7B-1%7D%20e%5E%7Bz/k%7D.%0A"></p>
<p>This generalizes the usual definition <img src="https://latex.codecogs.com/png.latex?%5CGamma(z)%20=%20%5Cint_0%5E%5Cinfty%20t%5E%7Bz-1%7D%20e%5E%7B-t%7D%20%5C,%20dt"> valid for <img src="https://latex.codecogs.com/png.latex?%5CRe(z)%20%3E%200"> to the whole complex plane.</p>
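<p>The Weierstrass product above is concrete enough to evaluate directly. Here is an illustrative Python sketch (truncating the product at <code>N</code> factors, an arbitrary choice), which reproduces the values of the standard Gamma function:</p>

```python
import math

EULER_GAMMA = 0.5772156649015329

def gamma_weierstrass(z, N=10**5):
    # Gamma(z) = e^{-gamma z} z^{-1} prod_{k>=1} (1 + z/k)^{-1} e^{z/k},
    # computed here via its (truncated) inverse
    inv = math.exp(EULER_GAMMA * z) * z
    for k in range(1, N + 1):
        inv *= (1.0 + z / k) * math.exp(-z / k)
    return 1.0 / inv

for z in (0.5, 1.0, 3.3):
    print(z, gamma_weierstrass(z), math.gamma(z))
```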
<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/infinite_products/Gamma_abs_3D.png" class="img-fluid figure-img" style="width:60.0%"></p>
<figcaption><a href="https://en.wikipedia.org/wiki/Gamma_function">Gamma function</a></figcaption>
</figure>
</div>
</div>
<p>The connection to the sine function is also almost immediate. Indeed, the function <img src="https://latex.codecogs.com/png.latex?g(z)%20g(-z)"> vanishes on all the integers and the infinite product immediately shows that <img src="https://latex.codecogs.com/png.latex?g(z)g(-z)%20=%20-z%20%5C,%20%5Csin(%5Cpi%20z)%20/%20%5Cpi">. But since <img src="https://latex.codecogs.com/png.latex?g(-z)/(-z)%20=%20g(1-z)">, one obtains that <img src="https://latex.codecogs.com/png.latex?g(z)%20g(1-z)%20=%20%5Csin(%5Cpi%20z)%20/%20%5Cpi">, i.e.</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5CGamma(z)%20%5C,%20%5CGamma(1-z)%20=%20%5Cfrac%7B%5Cpi%7D%7B%5Csin(%5Cpi%20z)%7D.%0A"> This is the celebrated <a href="https://en.wikipedia.org/wiki/Reflection_formula">Euler reflection formula</a>, from which it also follows that <img src="https://latex.codecogs.com/png.latex?%5CGamma(1/2)%20=%20%5Csqrt%7B%5Cpi%7D">.</p>
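<p>Both identities are immediate to verify numerically with the standard library:</p>

```python
import math

# Euler reflection formula: Gamma(z) * Gamma(1 - z) = pi / sin(pi z)
for z in (0.1, 0.25, 0.5, 0.9):
    lhs = math.gamma(z) * math.gamma(1.0 - z)
    rhs = math.pi / math.sin(math.pi * z)
    print(z, lhs, rhs)

# special case z = 1/2: Gamma(1/2) = sqrt(pi)
print(math.gamma(0.5), math.sqrt(math.pi))
```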
</section>
<section id="zeta-function" class="level3">
<h3 class="anchored" data-anchor-id="zeta-function">Zeta function</h3>
<p>Let’s conclude these notes by deriving the analytic continuation of the Riemann <a href="https://en.wikipedia.org/wiki/Riemann_zeta_function">zeta function</a> <img src="https://latex.codecogs.com/png.latex?%5Czeta(s)%20=%20%5Csum_%7Bn=1%7D%5E%5Cinfty%20n%5E%7B-s%7D">, originally defined for <img src="https://latex.codecogs.com/png.latex?%5CRe(s)%20%3E%201">, to the whole complex plane. I have always found this proof very elegant. A common trick that is used in many places is to do a change of variable in the definition of the Gamma function to obtain that</p>
<p><img src="https://latex.codecogs.com/png.latex?n%5E%7B-s%7D%20=%20%5CGamma(s)%5E%7B-1%7D%20%5C,%20%5Cint_0%5E%5Cinfty%20e%5E%7B-nt%7D%20t%5E%7Bs-1%7D%20%5C,%20dt."></p>
<p>It is valid for <img src="https://latex.codecogs.com/png.latex?%5CRe(s)%20%3E%200">. This allows one to express the zeta function as:</p>
<p><span id="eq-boring"><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0A%5Czeta(s)%0A&amp;=%20%5CGamma(s)%5E%7B-1%7D%20%5Cint_0%5E%5Cinfty%20%20%7B%5Cleft%5C%7B%20%20%5Csum_%7Bn=1%7D%5E%5Cinfty%20e%5E%7B-n%20t%7D%20%20%5Cright%5C%7D%7D%20%20%5C,%20t%5E%7Bs-1%7D%20%5C,%20dt%5C%5C%0A&amp;=%20%5CGamma(s)%5E%7B-1%7D%20%5Cint_0%5E%5Cinfty%20%5Cfrac%7B1%7D%7Be%5E%7Bt%7D%20-%201%7D%20%5C,%20t%5E%7Bs-1%7D%20%5C,%20dt.%0A%5Cend%7Balign%7D%0A%5Ctag%7B1%7D"></span></p>
<p>From the previous discussion, one knows that <img src="https://latex.codecogs.com/png.latex?%5CGamma(s)%5E%7B-1%7D"> is an entire function of order <img src="https://latex.codecogs.com/png.latex?1">; the integral itself, however, only converges for <img src="https://latex.codecogs.com/png.latex?%5CRe(s)%20%3E%201">, since things get nasty near <img src="https://latex.codecogs.com/png.latex?t=0">. To continue <img src="https://latex.codecogs.com/png.latex?%5Czeta(s)"> beyond this domain, the standard approach consists in splitting the integral into two parts. The part <img src="https://latex.codecogs.com/png.latex?%5Cint_1%5E%5Cinfty%20%5Cfrac%7B1%7D%7Be%5Et%20-%201%7D%20%5C,%20t%5E%7Bs-1%7D%20%5C,%20dt"> defines an entire function, and one only needs to take care of the integral <img src="https://latex.codecogs.com/png.latex?%5Cint_0%5E1%20%5Cfrac%7B1%7D%7Be%5Et%20-%201%7D%20%5C,%20t%5E%7Bs-1%7D%20%5C,%20dt">. This can be done by expressing <img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B1%7D%7Be%5Et%20-%201%7D%20=%20%5Csum_%7Bm=0%7D%5E%7B%5Cinfty%7D%20B_m%20%5C,%20t%5E%7Bm-1%7D%20/%20m!">, where <img src="https://latex.codecogs.com/png.latex?B_m"> are the <a href="https://en.wikipedia.org/wiki/Bernoulli_number">Bernoulli numbers</a>, and integrating term by term, but the way <a href="https://en.wikipedia.org/wiki/Bernhard_Riemann">Bernhard Riemann</a> did it is way more fun. The idea is not to use the boring <img src="https://latex.codecogs.com/png.latex?%5Csum_%7Bn=1%7D%5E%5Cinfty%20e%5E%7B-n%20t%7D"> but instead to introduce the far more interesting function</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Ctheta(t)%20=%20%5Csum_%7Bn%20%5Cin%20%5Cmathbb%7BZ%7D%7D%20e%5E%7B-%5Cpi%20n%5E2%20t%7D%20=%201%20+%202%20%5C,%20J(t)%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?J(t)%20=%20%5Csum_%7Bn=1%7D%5E%5Cinfty%20e%5E%7B-%5Cpi%20n%5E2%20t%7D">, defined for <img src="https://latex.codecogs.com/png.latex?t%3E0">. Note that the function <img src="https://latex.codecogs.com/png.latex?J(t)"> decreases exponentially rapidly to <img src="https://latex.codecogs.com/png.latex?0"> as <img src="https://latex.codecogs.com/png.latex?t%20%5Cto%20%5Cinfty">. Instead of Equation&nbsp;1, one can just as easily write</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7B1%7D%7B%5Cpi%5E%7Bs/2%7D%20%5C,%20n%5Es%7D%20=%20%5Cfrac%7B1%7D%7B%5CGamma(s/2)%7D%20%5Cint_0%5E%5Cinfty%20e%5E%7B-%5Cpi%20n%5E2%20t%7D%20%5C,%20t%5E%7B%7Bs/2%7D-1%7D%20%5C,%20dt.%0A"></p>
<p>This slight change of parametrization allows one to write the zeta function as</p>
<p><span id="eq-less-boring"><img src="https://latex.codecogs.com/png.latex?%0A%5Czeta(s)%20=%20%5Cfrac%7B%5Cpi%5E%7Bs/2%7D%7D%7B%5CGamma(s/2)%7D%20%5Cint_0%5E%5Cinfty%20J(t)%20%5C,%20t%5E%7B%7Bs/2%7D-1%7D%20%5C,%20dt.%0A%5Ctag%7B2%7D"></span></p>
<p>At first sight, one has not gained much since there is still an issue at <img src="https://latex.codecogs.com/png.latex?t=0">, where <img src="https://latex.codecogs.com/png.latex?J(t)"> diverges. However, the Jacobi theta function <img src="https://latex.codecogs.com/png.latex?%5Ctheta(t)"> enjoys some interesting symmetries. Crucially, the <a href="https://en.wikipedia.org/wiki/Poisson_summation_formula">Poisson summation</a> formula applied to the Gaussian function <img src="https://latex.codecogs.com/png.latex?x%20%5Cmapsto%20e%5E%7B-%5Cpi%20x%5E2%20t%7D"> gives that <img src="https://latex.codecogs.com/png.latex?%5Ctheta(t)"> satisfies the modular&nbsp;inversion&nbsp;symmetry:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Ctheta(t)%20=%20%5Cfrac%7B1%7D%7B%5Csqrt%7Bt%7D%7D%20%5C,%20%5Ctheta(1/t).%0A"></p>
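<p>The modular inversion symmetry can be checked numerically with a heavily truncated sum (an illustrative sketch; the truncation level <code>N</code> is arbitrary since the series converges extremely fast):</p>

```python
import math

def theta(t, N=50):
    # Jacobi theta: sum over n in Z of exp(-pi n^2 t), truncated at |n| <= N
    return 1.0 + 2.0 * sum(math.exp(-math.pi * n * n * t) for n in range(1, N + 1))

# modular inversion symmetry: theta(t) = theta(1/t) / sqrt(t)
for t in (0.37, 1.0, 2.5):
    print(t, theta(t), theta(1.0 / t) / math.sqrt(t))
```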
<p>This means that splitting <img src="https://latex.codecogs.com/png.latex?%5Cint_0%5E%7B%5Cinfty%7D%20=%20%5Cint_0%5E1%20+%20%5Cint_1%5E%7B%5Cinfty%7D"> in Equation&nbsp;2, using the change of variable <img src="https://latex.codecogs.com/png.latex?t%20%5Cmapsto%201/t"> to map <img src="https://latex.codecogs.com/png.latex?%5Cint_0%5E1"> to <img src="https://latex.codecogs.com/png.latex?%5Cint_1%5E%7B%5Cinfty%7D">, and finally using the modular inversion symmetry of the theta function leads, after standard algebra, to:</p>
<p><span id="eq-zeta-expansion"><img src="https://latex.codecogs.com/png.latex?%0A%5Czeta(s)%20=%20%5Cfrac%7B%5Cpi%5E%7Bs/2%7D%7D%7B%5CGamma(s/2)%7D%20%20%7B%5Cleft%5C%7B%0A%5Cunderbrace%7B%5Cint_%7B1%7D%5E%7B%5Cinfty%7D%20J(t)%20%5C,%20%20%7B%5Cleft(%20t%5E%7B(1-s)/2%7D%20+%20t%5E%7Bs/2%7D%20%5Cright)%7D%20%20%5C,%20%5Cfrac%7Bdt%7D%7Bt%7D%20-%20%5Cfrac%7B1%7D%7B1-s%7D%20-%20%5Cfrac%7B1%7D%7Bs%7D%20%7D_%7B%5CLambda(s)%7D%0A%5Cright%5C%7D%7D%20.%0A%5Ctag%7B3%7D"></span></p>
<p>First, one can note that since <img src="https://latex.codecogs.com/png.latex?J(t)"> decreases exponentially quickly to <img src="https://latex.codecogs.com/png.latex?0"> as <img src="https://latex.codecogs.com/png.latex?t%20%5Cto%20%5Cinfty">, the integral above defines an entire function. This means that the expression above defines a meromorphic continuation of <img src="https://latex.codecogs.com/png.latex?%5Czeta(s)"> to the whole complex plane with a simple pole at <img src="https://latex.codecogs.com/png.latex?s%20=%201">. There is no pole at <img src="https://latex.codecogs.com/png.latex?s=0"> since the simple zero of <img src="https://latex.codecogs.com/png.latex?%5CGamma(s/2)%5E%7B-1%7D"> takes care of it and gives the value <img src="https://latex.codecogs.com/png.latex?%5Czeta(0)%20=%20-1/2">. This also shows that the <img src="https://latex.codecogs.com/png.latex?%5Czeta"> function inherits from <img src="https://latex.codecogs.com/png.latex?%5CGamma(s/2)%5E%7B-1%7D"> a simple zero at all the negative even integers, i.e.&nbsp;<img src="https://latex.codecogs.com/png.latex?%5Czeta(-2n)%20=%200"> for <img src="https://latex.codecogs.com/png.latex?n%20%5Cgeq%201">. And there are indeed a few other zeros, as the plot below shows… and they seem to be located on the critical line <img src="https://latex.codecogs.com/png.latex?%5CRe(s)%20=%201/2">…</p>
<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/infinite_products/zeta_plot.jpg" class="img-fluid figure-img" style="width:60.0%"></p>
<figcaption>Zeta function, where are the zeros?</figcaption>
</figure>
</div>
</div>
<p>What is remarkable is that the term inside the curly brackets of Equation&nbsp;3 is symmetric in <img src="https://latex.codecogs.com/png.latex?s"> and <img src="https://latex.codecogs.com/png.latex?1-s">, i.e.&nbsp;symmetric with respect to the vertical line <img src="https://latex.codecogs.com/png.latex?%5CRe(s)%20=%201/2"> in the complex plane. This means that the function <img src="https://latex.codecogs.com/png.latex?%5CLambda(s)%20=%20%5Czeta(s)%20%5C,%20%5CGamma(s/2)%20%5C,%20%5Cpi%5E%7B-s/2%7D"> satisfies the functional equation</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5CLambda(s)%20=%20%5CLambda(1-s).%0A"></p>
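<p>Equation 3 is concrete enough to evaluate numerically for real <code>s</code>. The following illustrative Python sketch (midpoint quadrature on a truncated domain; the truncation <code>T</code> and the number of quadrature steps are arbitrary choices) recovers, for example, <code>zeta(2) = pi^2 / 6</code>; note also that the implementation of the braced term is manifestly symmetric in <code>s</code> and <code>1 - s</code>:</p>

```python
import math

def J(t, N=8):
    # J(t) = sum_{n >= 1} exp(-pi n^2 t); converges extremely fast for t >= 1
    return sum(math.exp(-math.pi * n * n * t) for n in range(1, N + 1))

def Lambda(s, T=50.0, steps=20000):
    # braced term of Equation 3, symmetric under s <-> 1 - s by construction
    h = (T - 1.0) / steps
    integral = 0.0
    for i in range(steps):
        t = 1.0 + (i + 0.5) * h  # midpoint rule on [1, T]
        integral += J(t) * (t ** ((1.0 - s) / 2.0) + t ** (s / 2.0)) / t * h
    return integral - 1.0 / (1.0 - s) - 1.0 / s

def zeta(s):
    return math.pi ** (s / 2.0) / math.gamma(s / 2.0) * Lambda(s)

print(zeta(2.0), math.pi ** 2 / 6.0)
```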


</section>

 ]]></description>
  <category>complex_analysis</category>
  <guid>https://alexxthiery.github.io/notes/infinite_products/inft_prod.html</guid>
  <pubDate>Sat, 14 Jun 2025 17:00:00 GMT</pubDate>
</item>
<item>
  <title>Fisher-Rao Geometry</title>
  <link>https://alexxthiery.github.io/notes/fisher-rao/distance.html</link>
  <description><![CDATA[ 





<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/fisher-rao/rao.jpg" class="img-fluid figure-img" style="width:60.0%"></p>
<figcaption><a href="https://en.wikipedia.org/wiki/C._R._Rao">Calyampudi Radhakrishna Rao</a> (1920 – 2023)</figcaption>
</figure>
</div>
</div>
<section id="fisher-rao-metric" class="level3">
<h3 class="anchored" data-anchor-id="fisher-rao-metric">Fisher-Rao metric</h3>
<p>Suppose we want to define a distance on the space of probability densities on <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BR%7D%5Ed">. A natural but naive approach is to use an <img src="https://latex.codecogs.com/png.latex?L%5E2">-type distance:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ad(%5Crho_1,%20%5Crho_2)%5E2%20=%20%5Cint_%7B%5Cmathbb%7BR%7D%5Ed%7D%20%7C%5Crho_1(x)%20-%20%5Crho_2(x)%7C%5E2%20%5C,%20dx,%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5Crho_i"> are densities with respect to the Lebesgue measure. However, this definition has several shortcomings. For instance, if we change the base measure to <img src="https://latex.codecogs.com/png.latex?%5Cmu(x)%20%5C,%20dx"> for some positive density <img src="https://latex.codecogs.com/png.latex?%5Cmu">, and define the distance as</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cint%20%5Cleft%7C%5Cfrac%7B%5Crho_1(x)%7D%7B%5Cmu(x)%7D%20-%20%5Cfrac%7B%5Crho_2(x)%7D%7B%5Cmu(x)%7D%5Cright%7C%5E2%20%5Cmu(x)%20%5C,%20dx,%0A"></p>
<p>we obtain a different value. Perhaps more troubling, the distance is not invariant under reparametrizations. Let <img src="https://latex.codecogs.com/png.latex?T"> be a diffeomorphism of <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BR%7D%5Ed">, and set <img src="https://latex.codecogs.com/png.latex?y%20=%20T(x)">. Then the transformed densities become <img src="https://latex.codecogs.com/png.latex?%5Crho%5EY_i(y)%20=%20%5Crho%5EX_i(x)%20%5C,%20%7CJ_T(x)%7C%5E%7B-1%7D">, where <img src="https://latex.codecogs.com/png.latex?J_T"> is the Jacobian determinant. In general,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cint%20%7C%5Crho%5EY_1(y)%20-%20%5Crho%5EY_2(y)%7C%5E2%20%5C,%20dy%20%5Cneq%20%5Cint%20%7C%5Crho%5EX_1(x)%20-%20%5Crho%5EX_2(x)%7C%5E2%20%5C,%20dx,%0A"></p>
<p>so the distance depends on the choice of coordinates. That is, measuring in Cartesian or polar coordinates yields different results—an undesirable feature. Ideally, we seek a distance that is invariant under reparametrizations and changes of base measure, such as the <a href="https://en.wikipedia.org/wiki/Total_variation_distance_of_probability_measures">total variation distance</a>,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ad_%7BTV%7D(%5Crho_1,%20%5Crho_2)%20=%20%5Cint_%7B%5Cmathbb%7BR%7D%5Ed%7D%20%7C%5Crho_1(x)%20-%20%5Crho_2(x)%7C%20%5C,%20dx.%0A"></p>
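<p>As a quick numerical illustration of this coordinate dependence (an illustrative Python sketch; the two Gaussian densities and the diffeomorphism <code>T(x) = x/2</code> are arbitrary choices): the <code>L^2</code> distance changes under the change of variables, while the total variation distance does not.</p>

```python
import math

def gauss(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

rho1 = lambda x: gauss(x, 0.0)
rho2 = lambda x: gauss(x, 1.0)
# push-forward through T(x) = x / 2: densities are rescaled by |J_T|^{-1} = 2
rho1_y = lambda y: 2.0 * rho1(2.0 * y)
rho2_y = lambda y: 2.0 * rho2(2.0 * y)

def integrate(f, a=-12.0, b=12.0, n=20000):
    # simple midpoint quadrature
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

l2_x = integrate(lambda x: (rho1(x) - rho2(x)) ** 2)
l2_y = integrate(lambda y: (rho1_y(y) - rho2_y(y)) ** 2)
tv_x = integrate(lambda x: abs(rho1(x) - rho2(x)))
tv_y = integrate(lambda y: abs(rho1_y(y) - rho2_y(y)))
print(l2_x, l2_y)  # differ: the L^2 distance is coordinate dependent
print(tv_x, tv_y)  # agree: total variation is invariant
```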
<p>One potential drawback of the total variation distance is that it is not differentiable, which can make it difficult to use in optimization problems. An alternative is to consider <a href="https://en.wikipedia.org/wiki/F-divergence"><img src="https://latex.codecogs.com/png.latex?f">-divergences</a>, defined as</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ad_f(%5Crho_1,%20%5Crho_2)%20=%20%5Cint_%7B%5Cmathbb%7BR%7D%5Ed%7D%20f%20%5Cleft(%20%5Cfrac%7Bd%5Crho_1%7D%7Bd%5Crho_2%7D%20%5Cright)%20%5Crho_2(dx),%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?f"> is a convex function with <img src="https://latex.codecogs.com/png.latex?f(1)%20=%200">. These divergences are differentiable and invariant under reparametrizations and changes of base measure, although they are not symmetric and thus not true distances. Locally, however, all <img src="https://latex.codecogs.com/png.latex?f">-divergences are equivalent, as a second-order Taylor expansion shows:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ad_f(%5Crho%20+%20d%20%5Crho,%20%5C,%20%5Crho)%20=%20%20%5Ctextrm%7B(cst)%7D%20%5Ctimes%20%5Cint%20%5Cleft(%20%5Cfrac%7Bd%20%5Crho%7D%7B%5Crho%7D%20%5Cright)%5E2%20%5C,%20%5Crho(dx)%20+%20o(%5C%7Cd%20%5Crho%5C%7C%5E2).%0A"></p>
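<p>This local equivalence is easy to observe numerically on a discrete distribution (an illustrative sketch; the base distribution and the perturbation are arbitrary choices). For the KL divergence, <code>f(x) = x log x</code> and <code>f''(1) = 1</code>; for the squared Hellinger distance, <code>f(x) = (1 - sqrt(x))^2</code> and <code>f''(1) = 1/2</code>. Both ratios below therefore tend to one as the perturbation shrinks:</p>

```python
import math

rho = [0.2, 0.3, 0.1, 0.4]        # base distribution
u = [0.05, -0.02, 0.01, -0.04]    # perturbation direction, sums to zero

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def hellinger_sq(p, q):
    return sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))

def fisher_quad(p, v):
    # the local quadratic form: int (d rho / rho)^2 rho
    return sum(vi * vi / pi for vi, pi in zip(v, p))

for eps in (0.5, 0.1, 0.01):
    p = [ri + eps * ui for ri, ui in zip(rho, u)]
    quad = eps ** 2 * fisher_quad(rho, u)
    print(eps, kl(p, rho) / (0.5 * quad), hellinger_sq(p, rho) / (0.25 * quad))
```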
<p>This means that all these divergences describe the same local geometry, defined by the Fisher-Rao information metric. Furthermore, it is relatively straightforward to derive the global geometry induced by the <a href="https://en.wikipedia.org/wiki/Fisher_information_metric">Fisher information metric</a>. Consider the mapping <img src="https://latex.codecogs.com/png.latex?%5Crho%20%5Cmapsto%20%5Csqrt%7B%5Crho%7D">, which maps a density <img src="https://latex.codecogs.com/png.latex?%5Crho"> to an element of the unit sphere <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BS%7D"> of <img src="https://latex.codecogs.com/png.latex?L%5E2(%5Cmathbb%7BR%7D%5Ed)">. Since</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5C%7C%20%5Csqrt%7B%5Crho_1%7D%20-%20%5Csqrt%7B%5Crho_2%7D%20%5C%7C_%7BL%5E2%7D%5E2%20=%20d_f(%5Crho_1,%20%5Crho_2)%0A"></p>
<p>for <img src="https://latex.codecogs.com/png.latex?f(x)%20=%20%7C1%20-%20%5Csqrt%7Bx%7D%7C%5E2">, and we have just seen that any <img src="https://latex.codecogs.com/png.latex?f">-divergence is locally equivalent to the Fisher-Rao metric, it follows that the geometry induced by the Fisher-Rao information metric is the same as the geometry induced by the <img src="https://latex.codecogs.com/png.latex?L%5E2">-norm on the unit sphere of <img src="https://latex.codecogs.com/png.latex?L%5E2(%5Cmathbb%7BR%7D%5Ed)">. This implies that the geodesic distance between two densities <img src="https://latex.codecogs.com/png.latex?%5Crho_1"> and <img src="https://latex.codecogs.com/png.latex?%5Crho_2"> is given (up to an irrelevant constant) by the <img src="https://latex.codecogs.com/png.latex?L%5E2">-geodesic distance between the points <img src="https://latex.codecogs.com/png.latex?%5Csqrt%7B%5Crho_1%7D%20%5Cin%20%5Cmathcal%7BS%7D"> and <img src="https://latex.codecogs.com/png.latex?%5Csqrt%7B%5Crho_2%7D%20%5Cin%20%5Cmathcal%7BS%7D">. In other words, the geodesic distance, i.e.&nbsp;the Fisher-Rao distance, is given by</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ad_%7BFR%7D(%5Crho_1,%20%5Crho_2)%20=%20%5Carccos%20%5Cleft(%20%5Clangle%20%5Csqrt%7B%5Crho_1%7D,%20%5Csqrt%7B%5Crho_2%7D%20%5Crangle_%7BL%5E2%7D%20%5Cright).%0A"></p>
<p>The geodesic path is a great circle, <img src="https://latex.codecogs.com/png.latex?t%20%5Cmapsto%20%5Crho_t">, where <img src="https://latex.codecogs.com/png.latex?%5Crho_t%20%5Cpropto%20%5Cleft((1-t)%20%5Csqrt%7B%5Crho_1%7D%20+%20t%20%5Csqrt%7B%5Crho_2%7D%20%5Cright)%5E2"> for <img src="https://latex.codecogs.com/png.latex?t%20%5Cin%20%5B0,1%5D">. This shows, for example, that the Fisher-Rao geodesic between two Gaussian densities is composed of densities that are Gaussian mixtures; i.e., the geodesic is not composed of Gaussian densities in general. In other words, probability mass is not transported along the geodesic but reshaped, unlike the Wasserstein metric, which describes transport of probability mass. Note in passing that the <a href="https://en.wikipedia.org/wiki/Hellinger_distance">Hellinger distance</a>, defined as</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ad_H(%5Crho_1,%20%5Crho_2)%5E2%20=%20%5Cint%20%5Cleft(%20%5Csqrt%7B%5Crho_1(x)%7D%20-%20%5Csqrt%7B%5Crho_2(x)%7D%20%5Cright)%5E2%20%5C,%20dx,%0A"></p>
<p>is just a monotone reparametrization of the Fisher-Rao distance: the two are related by <img src="https://latex.codecogs.com/png.latex?d_H%20=%20%5Csqrt%7B2(1-%5Ccos(d_%7BFR%7D))%7D">. In this sense, the Hellinger distance is equivalent to the Fisher-Rao distance, and both describe a “correct” way to measure distances between probability densities for many applications.</p>
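<p>On a finite state space, both distances are one-liners, and the relation between them holds exactly (illustrative sketch with arbitrary distributions):</p>

```python
import math

def fisher_rao(p, q):
    # d_FR = arccos( <sqrt(p), sqrt(q)> )
    inner = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    return math.acos(min(1.0, inner))

def hellinger(p, q):
    return math.sqrt(sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q)))

p = [0.1, 0.2, 0.3, 0.4]
q = [0.4, 0.3, 0.2, 0.1]
d_fr = fisher_rao(p, q)
d_h = hellinger(p, q)
print(d_fr, d_h, math.sqrt(2.0 * (1.0 - math.cos(d_fr))))  # last two agree
```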
<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/fisher-rao/FR_geodesic.gif" class="img-fluid figure-img" style="width:80.0%"></p>
<figcaption>Here is how a geodesic path looks under the Fisher-Rao metric</figcaption>
</figure>
</div>
</div>
</section>
<section id="gradient-flow" class="level3">
<h3 class="anchored" data-anchor-id="gradient-flow">Gradient flow</h3>
<p>What do gradient flows look like in this Fisher-Rao geometry? For example, for a given distribution <img src="https://latex.codecogs.com/png.latex?%5Cpi">, the gradient flow of <img src="https://latex.codecogs.com/png.latex?%5Crho%20%5Cmapsto%20%5Cmathrm%7BKL%7D(%5Crho,%20%5Cpi)"> under the Wasserstein metric is given by the <a href="https://en.wikipedia.org/wiki/Continuity_equation">transport equation</a>:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cpartial_t%20%5Crho%20=%20-%5Cnabla%20%5Ccdot%20%5Cleft(%20%5Crho%20%5C,%20%5Cnabla%20%5Clog%20%5Cfrac%7B%5Cpi%7D%7B%5Crho%7D%20%5Cright),%0A"></p>
<p>which describes the evolution of the law of the Langevin diffusion <img src="https://latex.codecogs.com/png.latex?dX%20=%20%5Cnabla%20%5Clog%20%5Cpi(X)%20%5C,%20dt%20+%20%5Csqrt%7B2%7D%20%5C,%20dB">, as informally described in these <a href="../../notes/wasserstein_langevin/wasserstein_langevin.html">notes</a>. So what does the gradient flow of <img src="https://latex.codecogs.com/png.latex?%5Crho%20%5Cmapsto%20%5Cmathrm%7BKL%7D(%5Crho,%20%5Cpi)"> look like in the Fisher-Rao geometry?</p>
<p>To answer this, one can consider the square-root mapping <img src="https://latex.codecogs.com/png.latex?%5Crho%20%5Cmapsto%20%5Csqrt%7B%5Crho%7D%20%5Cequiv%20%5CPhi(%5Crho)%20%5Cin%20%5Cmathcal%7BS%7D">, express everything in terms of <img src="https://latex.codecogs.com/png.latex?%5CPhi(%5Crho)">, compute the <img src="https://latex.codecogs.com/png.latex?L%5E2">-gradient on the unit sphere <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BS%7D"> (which is straightforward), and finally map back to the density <img src="https://latex.codecogs.com/png.latex?%5Crho"> using the inverse mapping <img src="https://latex.codecogs.com/png.latex?%5CPhi%5E%7B-1%7D">. One readily finds that the gradient flow is described by:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cpartial_t%20%5Crho%20=%20%5Crho%20%5C,%20%5Cleft(%20%5Clog%20%5Cfrac%7B%5Cpi%7D%7B%5Crho%7D%20-%20%5Cmathbb%7BE%7D_%5Crho%20%5Cleft%5B%20%5Clog%20%5Cfrac%7B%5Cpi%7D%7B%5Crho%7D%20%5Cright%5D%20%5Cright).%0A"></p>
<p>This is quite intuitive: the flow tries to increase <img src="https://latex.codecogs.com/png.latex?%5Crho"> in regions where <img src="https://latex.codecogs.com/png.latex?%5Crho%20%5Cll%20%5Cpi"> and decrease <img src="https://latex.codecogs.com/png.latex?%5Crho"> in regions where <img src="https://latex.codecogs.com/png.latex?%5Crho%20%5Cgg%20%5Cpi">. Discretizing this flow can naturally be done using sampling-based methods. If <img src="https://latex.codecogs.com/png.latex?%5Csum_%7Bi=1%7D%5EN%20w_i%20%5C,%20%5Cdelta(x_i)"> is a system of <img src="https://latex.codecogs.com/png.latex?N"> weighted particles approximating <img src="https://latex.codecogs.com/png.latex?%5Crho">, following the Fisher-Rao gradient flow corresponds to updating the weights as</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Aw_i%20%5Cmapsto%20%5Cfrac%7B%20w_i%20%5C,%20(%5Cpi(x_i)%20/%20%5Crho(x_i))%5E%7B%5Cvarepsilon%7D%7D%7B%5Csum_%7Bj=1%7D%5EN%20w_j%20%5C,%20(%5Cpi(x_j)%20/%20%5Crho(x_j))%5E%7B%5Cvarepsilon%7D%7D%0A"></p>
<p>for a small <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon%3E%200"> time-step. Indeed, this is because any evolution of the form <img src="https://latex.codecogs.com/png.latex?%5Cpartial_t%20%5Crho(x)%20=%20%5Crho(x)%20%5C,%20v(x)">, where <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D_%7B%5Crho%7D%5Bv%5D=0">, can be discretized by updating the weights as <img src="https://latex.codecogs.com/png.latex?w_i%20%5Cmapsto%20w_i%20%5C,%20%5Cexp%5B%5Cvarepsilon%5C,%20v(x_i)%5D%20/%20Z">. This is very much related to the resampling step in sequential Monte Carlo methods, and the recent article <span class="citation" data-cites="crucinio2025sequential">(Crucinio and Pathiraja 2025)</span> makes these connections explicit.</p>
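<p>Here is a minimal sketch of this weight update in the simplest possible setting (a finite state space, so that the atoms can be identified with the states and the weight vector with the density itself; the target, the step size and the number of iterations are arbitrary choices). The KL divergence to the target decreases along the iterations:</p>

```python
import math

def kl(w, p):
    return sum(wi * math.log(wi / pi) for wi, pi in zip(w, p))

def fisher_rao_step(w, target, eps):
    # w_i <- w_i * (pi_i / w_i)^eps, renormalized: a geometric step towards target
    new = [wi * (pi / wi) ** eps for wi, pi in zip(w, target)]
    z = sum(new)
    return [wi / z for wi in new]

target = [0.5, 0.25, 0.125, 0.125]
w = [0.25, 0.25, 0.25, 0.25]    # initial (uniform) weights
history = [kl(w, target)]
for _ in range(50):
    w = fisher_rao_step(w, target, eps=0.2)
    history.append(kl(w, target))
print(history[0], history[-1])  # KL decreases towards 0
```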
<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/fisher-rao/FR_gradient_flow.gif" class="img-fluid figure-img" style="width:80.0%"></p>
<figcaption>Minimising <img src="https://latex.codecogs.com/png.latex?D_%7B%5Ctext%7BKL%7D%7D(%5Crho,%20%5Cpi)"> with Fisher-Rao gradient flow</figcaption>
</figure>
</div>
</div>



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0">
<div id="ref-crucinio2025sequential" class="csl-entry">
Crucinio, Francesca R, and Sahani Pathiraja. 2025. <span>“Sequential Monte Carlo Approximations of Wasserstein–Fisher–Rao Gradient Flows.”</span> <em>arXiv Preprint arXiv:2506.05905</em>.
</div>
</div></section></div> ]]></description>
  <category>probability</category>
  <category>information-geometry</category>
  <guid>https://alexxthiery.github.io/notes/fisher-rao/distance.html</guid>
  <pubDate>Thu, 12 Jun 2025 17:00:00 GMT</pubDate>
</item>
<item>
  <title>Adjoint method for sensitivities</title>
  <link>https://alexxthiery.github.io/notes/adjoint_method/adjoint.html</link>
  <description><![CDATA[ 





<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/adjoint_method/pontryagin.jpg" class="img-fluid figure-img" style="width:60.0%"></p>
<figcaption><a href="https://en.wikipedia.org/wiki/Lev_Pontryagin">Lev Pontryagin</a> (1908 - 1988)</figcaption>
</figure>
</div>
</div>
<section id="table-of-contents" class="level2">
<h2 class="anchored" data-anchor-id="table-of-contents">Table of Contents</h2>
<ul>
<li>Linear Systems</li>
<li>Adjoint Method</li>
<li>PDE Inverse Problems</li>
<li>Controlled Diffusions</li>
</ul>
<p>The adjoint method is, at its core, the same idea as <a href="https://en.wikipedia.org/wiki/Backpropagation">backpropagation</a> or <a href="https://en.wikipedia.org/wiki/Automatic_differentiation">reverse-mode</a> automatic differentiation. In practice, though, it’s often helpful to understand how it works under the hood. Basic implementations of backprop can be sub-optimal or impractical (e.g. memory-intensive), especially in settings like <a href="https://en.wikipedia.org/wiki/PDE-constrained_optimization">PDE-constrained optimization</a> or stochastic optimal control.</p>
<section id="linear-systems" class="level3">
<h3 class="anchored" data-anchor-id="linear-systems">Linear Systems</h3>
<p>For a parameter <img src="https://latex.codecogs.com/png.latex?%5Ctheta%20%5Cin%20%5Cmathbb%7BR%7D%5E%7Bd_%5Ctheta%7D">, consider <img src="https://latex.codecogs.com/png.latex?x%20=%20x(%5Ctheta)"> the solution of the linear system</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AA%20x%20=%20b,%0A"></p>
<p>where both the matrix <img src="https://latex.codecogs.com/png.latex?A%20=%20A(%5Ctheta)%20%5Cin%20%5Cmathbb%7BR%7D%5E%7Bd_x%20%5Ctimes%20d_x%7D"> and the vector <img src="https://latex.codecogs.com/png.latex?b%20=%20b(%5Ctheta)%20%5Cin%20%5Cmathbb%7BR%7D%5E%7Bd_x%7D"> depend on <img src="https://latex.codecogs.com/png.latex?%5Ctheta">. This setup is typical when discretizing PDEs: <img src="https://latex.codecogs.com/png.latex?A"> arises from the differential operator, and <img src="https://latex.codecogs.com/png.latex?b"> from the source term. We are interested in a loss function of the type</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7BL%7D(%5Ctheta)%20=%20F(%5Ctheta,%20x(%5Ctheta)),%0A"></p>
<p>with <img src="https://latex.codecogs.com/png.latex?F:%20%5Cmathbb%7BR%7D%5E%7Bd_%5Ctheta%7D%20%5Ctimes%20%5Cmathbb%7BR%7D%5E%7Bd_x%7D%20%5Cto%20%5Cmathbb%7BR%7D">. We aim to compute the derivative of <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7D"> with respect to <img src="https://latex.codecogs.com/png.latex?%5Ctheta">. The chain rule gives</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AD_%5Ctheta%20%5Cmathcal%7BL%7D=%20F_%5Ctheta%20-%20F_x%20A%5E%7B-1%7D%20%5Cleft(%20A_%5Ctheta%20%5C,%20x%20-%20b_%5Ctheta%20%5Cright)%20%5C;%20%5Cin%20%5Cmathbb%7BR%7D%5E%7B1,%20d_%5Ctheta%7D.%0A"></p>
<p>The notation <img src="https://latex.codecogs.com/png.latex?F_%5Ctheta%20=%20%5Cnabla_%5Ctheta%20F%5E%5Ctop%20%5Cin%20%5Cmathbb%7BR%7D%5E%7B1,%20d_%5Ctheta%7D"> denotes the Jacobian of <img src="https://latex.codecogs.com/png.latex?F"> with respect to <img src="https://latex.codecogs.com/png.latex?%5Ctheta">, and similar notations are used for <img src="https://latex.codecogs.com/png.latex?F_x">, <img src="https://latex.codecogs.com/png.latex?A_%5Ctheta">, and <img src="https://latex.codecogs.com/png.latex?b_%5Ctheta">. As usual, the Jacobian of a scalar function can be thought of as the transpose of its gradient. When <img src="https://latex.codecogs.com/png.latex?d_x%20%5Cgg%201"> and <img src="https://latex.codecogs.com/png.latex?d_%5Ctheta%20%5Cgg%201">, directly computing <img src="https://latex.codecogs.com/png.latex?A%5E%7B-1%7D"> is not feasible. Naively evaluating <img src="https://latex.codecogs.com/png.latex?A%5E%7B-1%7D%20(A_%5Ctheta%20x%20-%20b_%5Ctheta)"> would require <img src="https://latex.codecogs.com/png.latex?d_%5Ctheta"> linear solves, each with complexity cubic in <img src="https://latex.codecogs.com/png.latex?d_x">. A better approach is to first compute <img src="https://latex.codecogs.com/png.latex?%5Clambda%5E%5Ctop%20=%20F_x%20A%5E%7B-1%7D"> by solving the so-called adjoint system</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AA%5E%5Ctop%20%5Clambda%20=%20F_x%5E%5Ctop.%0A"></p>
<p>This requires only one linear solve. Once <img src="https://latex.codecogs.com/png.latex?%5Clambda%20%5Cin%20%5Cmathbb%7BR%7D%5E%7Bd_x%7D"> is computed, the Jacobian simplifies to</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AD_%5Ctheta%20%5Cmathcal%7BL%7D=%20F_%5Ctheta%20-%20%5Clambda%5E%5Ctop%20(A_%5Ctheta%20x%20-%20b_%5Ctheta).%0A"></p>
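<p>A minimal numerical sketch of this adjoint trick, using a hypothetical affine parameterization <code>A(θ) = A0 + Σ_k θ_k A_k</code>, <code>b(θ) = b0 + B θ</code>, and the linear loss <code>F(θ, x) = cᵀx</code> (so that <code>F_θ = 0</code>); the adjoint gradient is checked against central finite differences. All matrices here are placeholder random data.</p>

```python
import numpy as np

rng = np.random.default_rng(1)
dx, dth = 30, 4

# Hypothetical affine parameterization: A(theta) = A0 + sum_k theta_k A_k, b(theta) = b0 + B theta
A0 = np.eye(dx) * 3.0 + 0.1 * rng.standard_normal((dx, dx))
Ak = 0.05 * rng.standard_normal((dth, dx, dx))
b0, B = rng.standard_normal(dx), rng.standard_normal((dx, dth))
c = rng.standard_normal(dx)                       # loss F(theta, x) = c^T x, so F_theta = 0

def loss(theta):
    A = A0 + np.einsum('k,kij->ij', theta, Ak)
    x = np.linalg.solve(A, b0 + B @ theta)
    return c @ x

def grad_adjoint(theta):
    A = A0 + np.einsum('k,kij->ij', theta, Ak)
    x = np.linalg.solve(A, b0 + B @ theta)
    lam = np.linalg.solve(A.T, c)                 # adjoint system: A^T lambda = F_x^T
    # D_theta L = F_theta - lambda^T (A_theta x - b_theta), with F_theta = 0 here
    Ath_x = np.einsum('kij,j->ki', Ak, x)         # row k: (dA/dtheta_k) x
    return -(Ath_x - B.T) @ lam

theta = rng.standard_normal(dth)
g = grad_adjoint(theta)

eps = 1e-6
g_fd = np.array([(loss(theta + eps * e) - loss(theta - eps * e)) / (2 * eps)
                 for e in np.eye(dth)])
```

One linear solve for <code>x</code> and one transposed solve for <code>λ</code> replace the <code>d_θ</code> solves of the naive approach.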
</section>
<section id="adjoint-method" class="level3">
<h3 class="anchored" data-anchor-id="adjoint-method">Adjoint Method</h3>
<p>Now consider a more general situation where <img src="https://latex.codecogs.com/png.latex?x%20%5Cin%20%5Cmathbb%7BR%7D%5E%7Bd_x%7D"> and <img src="https://latex.codecogs.com/png.latex?%5Ctheta%20%5Cin%20%5Cmathbb%7BR%7D%5E%7Bd_%5Ctheta%7D"> are related by an implicit equation</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5CPhi(x,%20%5Ctheta)%20=%200%0A"></p>
<p>for some function <img src="https://latex.codecogs.com/png.latex?%5CPhi:%20%5Cmathbb%7BR%7D%5E%7Bd_x%7D%20%5Ctimes%20%5Cmathbb%7BR%7D%5E%7Bd_%5Ctheta%7D%20%5Cto%20%5Cmathbb%7BR%7D%5E%7Bd_x%7D"> that satisfies the usual conditions for the implicit function <a href="https://en.wikipedia.org/wiki/Implicit_function_theorem">theorem</a> to hold. Differentiating gives <img src="https://latex.codecogs.com/png.latex?x_%5Ctheta%20=%20-%5CPhi_x%5E%7B-1%7D%20%5CPhi_%5Ctheta">. As before, we want the sensitivity with respect to <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> of <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7D(%5Ctheta)%20=%20F(x(%5Ctheta),%20%5Ctheta)">. It equals <img src="https://latex.codecogs.com/png.latex?D_%5Ctheta%20%5Cmathcal%7BL%7D=%20F_%5Ctheta%20-%20F_x%20%5CPhi_x%5E%7B-1%7D%20%5CPhi_%5Ctheta"> and can also be expressed as <img src="https://latex.codecogs.com/png.latex?D_%5Ctheta%20%5Cmathcal%7BL%7D=%20F_%5Ctheta%20-%20%5Clambda%5E%5Ctop%20%5CPhi_%5Ctheta"> where <img src="https://latex.codecogs.com/png.latex?%5Clambda%20%5Cin%20%5Cmathbb%7BR%7D%5E%7Bd_x%7D"> is the solution of the adjoint system</p>
<p><span id="eq-adjoint"><img src="https://latex.codecogs.com/png.latex?%0A%5CPhi_x%5E%5Ctop%20%5Clambda%20=%20F_x%5E%5Ctop.%0A%5Ctag%7B1%7D"></span></p>
<p>Another way to present this computation is to note that, for <strong>any</strong> vector <img src="https://latex.codecogs.com/png.latex?%5Clambda%20%5Cin%20%5Cmathbb%7BR%7D%5E%7Bd_x%7D">, we have</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7BL%7D(%5Ctheta)%20=%20F(x(%5Ctheta),%20%5Ctheta)%20-%20%5Clambda%5E%5Ctop%20%5CPhi(x(%5Ctheta),%20%5Ctheta)%0A"></p>
<p>since <img src="https://latex.codecogs.com/png.latex?%5CPhi(x(%5Ctheta),%20%5Ctheta)%20%5Cequiv%200">. As will soon become clear, introducing the “adjoint” variable <img src="https://latex.codecogs.com/png.latex?%5Clambda"> allows one to eliminate cumbersome terms when computing <img src="https://latex.codecogs.com/png.latex?D_%5Ctheta%20%5Cmathcal%7BL%7D">. Differentiation with respect to <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> gives</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AD_%5Ctheta%20%5Cmathcal%7BL%7D=%20F_%5Ctheta%20-%20%5Clambda%5E%5Ctop%20%5CPhi_%5Ctheta%20+%20%20%7B%5Cleft(%20%20F_x%20-%20%5Clambda%5E%5Ctop%20%5CPhi_x%20%5Cright)%7D%20%20x_%5Ctheta.%0A"></p>
<p>The term <img src="https://latex.codecogs.com/png.latex?x_%5Ctheta%20=%20-%5CPhi_x%5E%7B-1%7D%20%5CPhi_%5Ctheta%20%5Cin%20%5Cmathbb%7BR%7D%5E%7Bd_x,%20d_%5Ctheta%7D"> is cumbersome (e.g. intractable in high-dimensional settings), and we would like to eliminate it. To this end, it suffices to choose <img src="https://latex.codecogs.com/png.latex?%5Clambda"> so that the term <img src="https://latex.codecogs.com/png.latex?F_x%20-%20%5Clambda%5E%5Ctop%20%5CPhi_x"> vanishes; this is exactly the adjoint system, Equation&nbsp;1.</p>
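<p>The same recipe works unchanged for nonlinear implicit equations. Here is a sketch with the hypothetical choices <code>Φ(x, θ) = M x + tanh(x) − θ</code> and <code>F(x, θ) = ‖x‖²/2</code>, the forward equation solved by Newton iteration and the gradient checked against finite differences.</p>

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8
M = np.eye(d) * 2.0 + 0.1 * rng.standard_normal((d, d))   # placeholder matrix

def solve_x(theta, iters=50):
    # Newton iteration for Phi(x, theta) = M x + tanh(x) - theta = 0
    x = np.zeros(d)
    for _ in range(iters):
        Phi   = M @ x + np.tanh(x) - theta
        Phi_x = M + np.diag(1.0 - np.tanh(x)**2)
        x = x - np.linalg.solve(Phi_x, Phi)
    return x

def loss(theta):
    x = solve_x(theta)
    return 0.5 * x @ x                 # F(x, theta) = ||x||^2 / 2

def grad_adjoint(theta):
    x = solve_x(theta)
    Phi_x = M + np.diag(1.0 - np.tanh(x)**2)
    lam = np.linalg.solve(Phi_x.T, x)  # adjoint system Phi_x^T lambda = F_x^T
    # D_theta L = F_theta - lambda^T Phi_theta; here F_theta = 0 and Phi_theta = -I
    return lam

theta = rng.standard_normal(d)
g = grad_adjoint(theta)
eps = 1e-6
g_fd = np.array([(loss(theta + eps * e) - loss(theta - eps * e)) / (2 * eps)
                 for e in np.eye(d)])
```

Note that the adjoint solve only needs the Jacobian <code>Φ_x</code> at the solution, regardless of how the implicit equation was solved.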
</section>
<section id="pde-inverse-problems" class="level3">
<h3 class="anchored" data-anchor-id="pde-inverse-problems">PDE Inverse Problems</h3>
<p>Let us see how this works in the context of PDE-constrained optimization. Let <img src="https://latex.codecogs.com/png.latex?%5COmega%20%5Csubset%20%5Cmathbb%7BR%7D%5Ed"> be a domain and let <img src="https://latex.codecogs.com/png.latex?%5Ckappa:%20%5COmega%20%5Cto%20%5Cmathbb%7BR%7D"> be a scalar field. Consider the PDE</p>
<p><span id="eq-elliptic"><img src="https://latex.codecogs.com/png.latex?%0A%5Cnabla%20%5Ccdot%20%20%7B%5Cleft(%20%20e%5E%7B%5Ckappa(x)%7D%20%5Cnabla%20u%20%20%5Cright)%7D%20%20=%20f,%0A%5Ctag%7B2%7D"></span></p>
<p>where <img src="https://latex.codecogs.com/png.latex?f"> is a given source term. The field <img src="https://latex.codecogs.com/png.latex?%5Ckappa"> describes the diffusion, or permeability, properties of the medium. We are interested in the solution <img src="https://latex.codecogs.com/png.latex?u"> of the PDE on a bounded domain <img src="https://latex.codecogs.com/png.latex?%5COmega%20%5Csubset%20%5Cmathbb%7BR%7D%5Ed"> with Dirichlet boundary conditions <img src="https://latex.codecogs.com/png.latex?u(x)%20=%200"> for <img src="https://latex.codecogs.com/png.latex?x%20%5Cin%20%5Cpartial%20%5COmega">. For each field <img src="https://latex.codecogs.com/png.latex?%5Ckappa">, the elliptic PDE determines a unique solution <img src="https://latex.codecogs.com/png.latex?u">. We are interested in minimizing the quantity</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7BL%7D(%5Ckappa)%20=%20%5Cint_%5COmega%20F(u(x))%20%5C,%20dx,%0A"></p>
<p>for some given function <img src="https://latex.codecogs.com/png.latex?F:%20%5Cmathbb%7BR%7D%20%5Cto%20%5Cmathbb%7BR%7D">. A common case is</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7BL%7D(%5Ckappa)%20=%20%5Cfrac%7B1%7D%7B2%7D%20%5C,%20%5Cint_%5COmega%20%5Cleft%7C%20u(x)%20-%20u%5E%5Cstar(x)%20%5Cright%7C%5E2%20%5C,%20w(x)%20%5C,%20dx,%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?u%5E%5Cstar"> is a target solution and <img src="https://latex.codecogs.com/png.latex?w(x)%3E0"> is a weight. The goal is to adjust the field <img src="https://latex.codecogs.com/png.latex?%5Ckappa"> so that <img src="https://latex.codecogs.com/png.latex?u"> matches <img src="https://latex.codecogs.com/png.latex?u%5E%5Cstar"> as closely as possible. To carry out the minimization of <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7D"> with respect to <img src="https://latex.codecogs.com/png.latex?%5Ckappa">, one needs the derivative of <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7D"> with respect to <img src="https://latex.codecogs.com/png.latex?%5Ckappa">. To that end, define the augmented functional</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7BL%7D%20=%20%5Cint_%5COmega%20F(u(x))%20%5C,%20dx%20-%20%5Cint_%5COmega%20%5Clambda%20%5Cleft(%20%5Cnabla%20%5Ccdot%20(e%5E%7B%5Ckappa%7D%20%5Cnabla%20u)%20-%20f%20%5Cright)%20%5C,%20dx,%0A"></p>
<p>for an auxiliary field <img src="https://latex.codecogs.com/png.latex?%5Clambda%20:%20%5COmega%20%5Cto%20%5Cmathbb%7BR%7D"> that will be chosen later. As before, a good choice of <img src="https://latex.codecogs.com/png.latex?%5Clambda"> can simplify the computations. Let <img src="https://latex.codecogs.com/png.latex?%5Cdelta%20%5Ckappa"> be a perturbation of <img src="https://latex.codecogs.com/png.latex?%5Ckappa">. This induces a perturbation <img src="https://latex.codecogs.com/png.latex?u%20+%20%5Cdelta%20u"> in the solution and the first order variation of <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7D"> reads:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cdelta%20%5Cmathcal%7BL%7D%20=%20%5Cint_%5COmega%20F'(u)%20%5C,%20%5Cdelta%20u%20%5C,%20dx%20-%20%5Cint_%5COmega%20%5Clambda%20%5C,%20%5Cnabla%20%5Ccdot%20(e%5E%7B%5Ckappa%7D%20%5Cnabla%20%5Cdelta%20u)%20%5C,%20dx%20-%20%5Cint_%5COmega%20%5Clambda%20%5C,%20%5Cnabla%20%5Ccdot%20(e%5E%7B%5Ckappa%7D%20%5Cdelta%20%5Ckappa%20%5Cnabla%20u)%20%5C,%20dx.%0A"></p>
<p>The term involving <img src="https://latex.codecogs.com/png.latex?%5Cdelta%20u"> is inconvenient. Assuming <img src="https://latex.codecogs.com/png.latex?%5Clambda"> also satisfies Dirichlet boundary conditions, which we can indeed assume since we are free to define <img src="https://latex.codecogs.com/png.latex?%5Clambda"> in any manner we want, we integrate by parts:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cdelta%20%5Cmathcal%7BL%7D%20=%20%5Cint_%5COmega%20%5Cleft(%20F'(u)%20-%20%5Cnabla%20%5Ccdot%20(e%5E%7B%5Ckappa%7D%20%5Cnabla%20%5Clambda)%20%5Cright)%20%5Cdelta%20u%20%5C,%20dx%20-%20%5Cint_%5COmega%20%5Clambda%20%5C,%20%5Cnabla%20%5Ccdot%20(e%5E%7B%5Ckappa%7D%20%5Cdelta%20%5Ckappa%20%5Cnabla%20u)%20%5C,%20dx.%0A"></p>
<p>To eliminate the <img src="https://latex.codecogs.com/png.latex?%5Cdelta%20u"> term, choose <img src="https://latex.codecogs.com/png.latex?%5Clambda"> to satisfy the adjoint equation</p>
<p><span id="eq-adjoint-elliptic"><img src="https://latex.codecogs.com/png.latex?%0A%5Cnabla%20%5Ccdot%20(e%5E%7B%5Ckappa%7D%20%5Cnabla%20%5Clambda)%20=%20F'(u),%0A%5Ctag%7B3%7D"></span></p>
<p>with Dirichlet boundary conditions. Then, an integration by parts gives</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign*%7D%0A%5Cdelta%20%5Cmathcal%7BL%7D%0A&amp;=%20-%5Cint_%5COmega%20%5Clambda%20%5C,%20%5Cnabla%20%5Ccdot%20(e%5E%7B%5Ckappa%7D%20%5Cdelta%20%5Ckappa%20%5Cnabla%20u)%20%5C,%20dx%5C%5C%0A&amp;=%20%5Cint_%5COmega%20e%5E%7B%5Ckappa%7D%20%5C,%20%5Cleft%3C%20%20%5Cnabla%20u,%20%5Cnabla%20%5Clambda%20%20%5Cright%3E%20%5C,%20%5Cdelta%20%5Ckappa%20%5C,%20dx%0A=%20%5Cleft%3C%20g,%20%5Cdelta%20%5Ckappa%20%5Cright%3E_%7BL%5E2(%5COmega)%7D.%0A%5Cend%7Balign*%7D%0A"></p>
<p>This means that the <img src="https://latex.codecogs.com/png.latex?L%5E2"> gradient of <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7D"> with respect to <img src="https://latex.codecogs.com/png.latex?%5Ckappa"> is given by:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ag%20=%20e%5E%7B%5Ckappa%7D%20%5C,%20%5Cleft%3C%20%20%5Cnabla%20u,%20%5Cnabla%20%5Clambda%20%20%5Cright%3E,%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5Clambda:%20%5COmega%20%5Cto%20%5Cmathbb%7BR%7D"> solves the adjoint system Equation&nbsp;3. This shows that the gradient of the objective can be computed at the same computational cost as the solution of the original PDE Equation&nbsp;2. This expression can be used directly in gradient-based optimization schemes.</p>
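<p>A one-dimensional sketch of this computation, with the PDE discretized by finite differences on <code>(0, 1)</code>; the grid size, source <code>f</code>, target <code>u_star</code>, and test field <code>κ</code> are all placeholder choices. The adjoint solve reuses the same (symmetric) operator as the forward solve, and the resulting gradient is verified against finite differences in <code>κ</code>.</p>

```python
import numpy as np

n = 60                                   # grid x_j = j/n with Dirichlet u_0 = u_n = 0
h = 1.0 / n
xs  = np.linspace(0.0, 1.0, n + 1)
mid = 0.5 * (xs[1:] + xs[:-1])           # kappa lives at the n cell midpoints

# D maps interior node values (n-1,) to midpoint gradients (n,)
D = (np.diff(np.eye(n + 1), axis=0) / h)[:, 1:-1]

f      = np.ones(n - 1)                  # source term at interior nodes (placeholder)
u_star = np.sin(np.pi * xs[1:-1])        # target solution (placeholder)

def solve(kappa):
    a = np.exp(kappa)
    K = h * D.T @ (a[:, None] * D)       # stiffness matrix for -div(e^kappa grad .)
    u = np.linalg.solve(K, -h * f)       # discretized div(e^kappa grad u) = f
    return u, K

def loss(kappa):
    u, _ = solve(kappa)
    return 0.5 * h * np.sum((u - u_star)**2)

def grad(kappa):
    u, K = solve(kappa)
    lam = np.linalg.solve(K, h * (u - u_star))   # adjoint PDE: same operator, source F'(u)
    # discrete analogue of g = e^kappa <grad u, grad lambda>
    return -h * np.exp(kappa) * (D @ u) * (D @ lam)

kappa = 0.1 * np.sin(2 * np.pi * mid)
g = grad(kappa)
eps = 1e-6
g_fd = np.array([(loss(kappa + eps * e) - loss(kappa - eps * e)) / (2 * eps)
                 for e in np.eye(n)])
```

One forward solve and one adjoint solve give the full gradient with respect to all <code>n</code> values of <code>κ</code>.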
</section>
<section id="controlled-diffusions" class="level3">
<h3 class="anchored" data-anchor-id="controlled-diffusions">Controlled Diffusions</h3>
<p>Consider the <a href="https://en.wikipedia.org/wiki/Ordinary_differential_equation">ODE</a> on <img src="https://latex.codecogs.com/png.latex?%5B0,%20T%5D">:</p>
<p><span id="eq-ode-forward"><img src="https://latex.codecogs.com/png.latex?%0A%5Cdot%7Bx%7D%20=%20b(t,%20%5Ctheta,%20x),%0A%5Ctag%7B4%7D"></span></p>
<p>with initial condition <img src="https://latex.codecogs.com/png.latex?x(0)%20=%20%5Cmu(%5Ctheta)">, where <img src="https://latex.codecogs.com/png.latex?x%20%5Cin%20%5Cmathbb%7BR%7D%5E%7Bd_x%7D"> and <img src="https://latex.codecogs.com/png.latex?%5Ctheta%20%5Cin%20%5Cmathbb%7BR%7D%5E%7Bd_%5Ctheta%7D">. The drift term <img src="https://latex.codecogs.com/png.latex?b(t,%20%5Ctheta,%20x)%20%5Cin%20%5Cmathbb%7BR%7D%5E%7Bd_x%7D"> is parameterized by <img src="https://latex.codecogs.com/png.latex?%5Ctheta">. We want the sensitivity of the functional</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7BL%7D(%5Ctheta)%20=%20%5Cint_0%5ET%20f(t,%20%5Ctheta,%20x(t))%20%5C,%20dt%20+%20g(%5Ctheta,%20x(T)),%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?f"> and <img src="https://latex.codecogs.com/png.latex?g"> are given functions. As before, it is often helpful to introduce an auxiliary function <img src="https://latex.codecogs.com/png.latex?%5Clambda(t)%20%5Cin%20%5Cmathbb%7BR%7D%5E%7Bd_x%7D"> and write:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7BL%7D(%5Ctheta)%20=%20%5Cmathcal%7BL%7D(%5Ctheta)%20-%20%5Cint_0%5ET%20%5Clambda%5E%5Ctop%20%5C,%20%5Cunderbrace%7B%20%20%7B%5Cleft(%20%20%5Cdot%7Bx%7D%20-%20b%20%20%5Cright)%7D%20%20%7D_%7B%5Cequiv%200%7D%20%5C,%20dt.%0A"></p>
<p>Differentiating with respect to <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> and integrating by parts gives:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Baligned%7D%0AD_%5Ctheta%20%5Cmathcal%7BL%7D%0A&amp;=%20%5Cint_0%5ET%20%5Cleft(%20f_%5Ctheta%20+%20%5Clambda%5E%5Ctop%20b_%5Ctheta%20%5Cright)%20%5C,%20dt%20+%20g_%5Ctheta(%5Ctheta,%20x(T))%20+%20%5Clambda%5E%5Ctop(0)%20%5Cmu_%5Ctheta%20%5C%5C%0A&amp;%5Cquad%20+%20%5Cleft(%20g_x%20-%20%5Clambda%5E%5Ctop(T)%20%5Cright)%20%5C,%20x_%5Ctheta(T)%0A+%20%5Cint_0%5ET%20%5Cleft(%20f_x%20+%20%5Cdot%7B%5Clambda%7D%5E%5Ctop%20+%20%5Clambda%5E%5Ctop%20b_x%20%5Cright)%20x_%5Ctheta(t)%20%5C,%20dt.%0A%5Cend%7Baligned%7D%0A"></p>
<p>To eliminate the dependence on <img src="https://latex.codecogs.com/png.latex?x_%5Ctheta(t)">, choose <img src="https://latex.codecogs.com/png.latex?%5Clambda(t)"> to satisfy the adjoint system:</p>
<p><span id="eq-adjoint-ode"><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Bcases%7D%0A%5Cdot%7B%5Clambda%7D(t)%20=%20-%5Cnabla_x%20f%20-%20b_x%5E%5Ctop%20%5Clambda(t),%20%5C%5C%0A%5Clambda(T)%20=%20%5Cnabla_x%20g.%0A%5Cend%7Bcases%7D%0A%5Ctag%7B5%7D"></span></p>
<p>This is a linear ODE with a terminal condition <img src="https://latex.codecogs.com/png.latex?%5Clambda(T)%20=%20%5Cnabla_x%20g"> that needs to be solved backward in time. This means that for computing the derivative of <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7D">, one can first solve the forward ODE Equation&nbsp;4 to obtain <img src="https://latex.codecogs.com/png.latex?x(t)">, and then solve the adjoint system Equation&nbsp;5 backward in time to obtain <img src="https://latex.codecogs.com/png.latex?%5Clambda(t)">. The Jacobian (i.e.&nbsp;transpose of the gradient) of <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7D"> is then given by</p>
<p><span id="eq-gradient-ode"><img src="https://latex.codecogs.com/png.latex?%0AD_%5Ctheta%20%5Cmathcal%7BL%7D%0A=%20%5Cint_0%5ET%20%5Cleft(%20f_%5Ctheta%20+%20%5Clambda%5E%5Ctop%20b_%5Ctheta%20%5Cright)%20dt%20+%20g_%5Ctheta(%5Ctheta,%20x(T))%20+%20%5Clambda%5E%5Ctop(0)%20%5Cmu_%5Ctheta.%0A%5Ctag%7B6%7D"></span></p>
<p>The term <img src="https://latex.codecogs.com/png.latex?%5Clambda%5E%5Ctop%20b_%5Ctheta"> is a vector-Jacobian product, and can be computed efficiently. This formulation is often referred to as the “continuous adjoint method”, “adjoint sensitivity analysis”, or “optimize-then-discretize”, and dates back to the work of <span class="citation" data-cites="pontryagin2018mathematical">(Pontryagin 1962)</span>. A naive implementation of <a href="https://en.wikipedia.org/wiki/Backpropagation">backpropagation</a> can be inefficient memory-wise since quantities such as <img src="https://latex.codecogs.com/png.latex?f_%5Ctheta"> would typically be stored during the forward pass. When <img src="https://latex.codecogs.com/png.latex?d_%5Ctheta%20%5Cgg%201">, as is for example the case when the drift is parameterized by a neural network, this can be impractical. Instead, it may be more efficient to store only the forward trajectory <img src="https://latex.codecogs.com/png.latex?x(t)"> and recompute all the other quantities during the backward pass; there is a slight computational cost but potentially very large memory savings. In machine-learning settings, this often makes it possible to use much larger batch sizes. Similarly, if implicit methods are used to solve the ODE instead of a simple <a href="https://en.wikipedia.org/wiki/Euler–Maruyama_method">Euler-Maruyama</a> scheme, backpropagation through the implicit solver can be tricky.</p>
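<p>A fixed-step sketch of the forward/backward procedure, for the hypothetical linear drift <code>b(t, θ, x) = A(θ) x</code> with <code>f ≡ 0</code> and <code>g(x) = ‖x‖²/2</code>; both the forward ODE and the adjoint ODE Equation&nbsp;5 are integrated with RK4, the gradient integral in Equation&nbsp;6 is evaluated with the trapezoid rule, and the result is compared with finite differences.</p>

```python
import numpy as np

def A(theta):
    # hypothetical drift parameterization: b(t, theta, x) = A(theta) x
    return np.array([[-theta[0], 1.0], [-1.0, -theta[1]]])

T, nsteps = 1.0, 2000
dt = T / nsteps
x0 = np.array([1.0, 0.0])

def rk4(f, y, h):
    k1 = f(y); k2 = f(y + 0.5 * h * k1); k3 = f(y + 0.5 * h * k2); k4 = f(y + h * k3)
    return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

def forward(theta):
    Ath = A(theta)
    xs = [x0]
    for _ in range(nsteps):
        xs.append(rk4(lambda x: Ath @ x, xs[-1], dt))
    return np.array(xs)

def loss(theta):
    xs = forward(theta)
    return 0.5 * xs[-1] @ xs[-1]            # g(x(T)) = ||x(T)||^2 / 2, f = 0

def grad_adjoint(theta):
    Ath = A(theta)
    xs = forward(theta)                     # store the forward trajectory
    lam = xs[-1].copy()                     # terminal condition lambda(T) = grad_x g
    integrand = []                          # lambda^T b_theta along the trajectory
    for k in range(nsteps, -1, -1):
        x = xs[k]
        # b_theta columns: db/dtheta_0 = (-x[0], 0), db/dtheta_1 = (0, -x[1])
        integrand.append([-lam[0] * x[0], -lam[1] * x[1]])
        if k > 0:                           # backward step: lambda' = -b_x^T lambda
            lam = rk4(lambda l: -(Ath.T @ l), lam, -dt)
    vals = np.array(integrand[::-1])
    return dt * (0.5 * vals[0] + vals[1:-1].sum(axis=0) + 0.5 * vals[-1])

theta = np.array([0.5, 0.3])
g = grad_adjoint(theta)
eps = 1e-6
g_fd = np.array([(loss(theta + eps * e) - loss(theta - eps * e)) / (2 * eps)
                 for e in np.eye(2)])
```

Only the trajectory <code>x(t)</code> is stored here; everything needed by the backward pass is recomputed from it.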
<p>Nothing really changes when considering a <a href="https://en.wikipedia.org/wiki/Stochastic_differential_equation">SDE</a> with additive noise instead,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Adx%20=%20b(t,%20%5Ctheta,%20x)%20%5C,%20dt%20+%20%5Csigma(t)%20%5C,%20dW_t.%0A"></p>
<p>Informally, one can apply the same reasoning as previously to the controlled ODE: <img src="https://latex.codecogs.com/png.latex?%0A%5Cdot%7Bx%7D%20%5C,%20=%20%5C,%20b(t,%20%5Ctheta,%20x)%20+%20%5Csigma(t)%20%5C,%20dW_t/dt.%0A"></p>
<p>Again, it suffices to solve the SDE forward in time to obtain <img src="https://latex.codecogs.com/png.latex?x(t)"> and then solve the exact same adjoint ODE Equation&nbsp;5 backward in time to obtain <img src="https://latex.codecogs.com/png.latex?%5Clambda(t)">: it is still an ordinary differential equation. The derivative of <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7D"> with respect to <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> is then given by the same expression Equation&nbsp;6. For SDEs with multiplicative noise, the adjoint system is slightly more complicated, but this hardly changes the overall picture. Finally, note that in the case where the two functions <img src="https://latex.codecogs.com/png.latex?f"> and <img src="https://latex.codecogs.com/png.latex?g"> do not depend on <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> and one chooses <img src="https://latex.codecogs.com/png.latex?%5Ctheta%20=%20x_0"> and the initial condition <img src="https://latex.codecogs.com/png.latex?%5Cmu_%7Bx_0%7D%20=%20x_0">, Equation&nbsp;6 shows that <img src="https://latex.codecogs.com/png.latex?D_%7Bx_0%7D%20%5Cmathcal%7BL%7D=%20%20%7B%5Cleft(%20%5Cnabla_%7Bx_0%7D%20%5Cmathcal%7BL%7D%20%5Cright)%7D%20%5E%5Ctop%20=%20%5Clambda(0)%5E%5Ctop">. More generally, this shows that:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Clambda(t)%20=%20%5Cnabla_%7Bx(t)%7D%20%5C,%20%20%7B%5Cleft%5C%7B%20%20%5Cint_t%5ET%20f(s,%20%5Ctheta,%20x(s))%20%5C,%20ds%20+%20g(%5Ctheta,%20x(T))%20%5Cright%5C%7D%7D%20.%0A"></p>



</section>
</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0">
<div id="ref-pontryagin2018mathematical" class="csl-entry">
Pontryagin, Lev Semenovich. 1962. <em>Mathematical Theory of Optimal Processes</em>.
</div>
</div></section></div> ]]></description>
  <category>ODE</category>
  <category>PDE</category>
  <category>Adjoint</category>
  <guid>https://alexxthiery.github.io/notes/adjoint_method/adjoint.html</guid>
  <pubDate>Fri, 09 May 2025 17:00:00 GMT</pubDate>
</item>
<item>
  <title>Sparse GP</title>
  <link>https://alexxthiery.github.io/notes/sparse_GP/sparse_gp.html</link>
  <description><![CDATA[ 





<p><em>These notes are mainly for my own reference; I’m pretty clueless about GPs at the moment, and that needs to change. Read at your own risk; typos and mistakes are likely.</em></p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><video src="SparseGP.mp4" class="img-fluid quarto-figure quarto-figure-center" style="width:100.0%" controls=""><a href="SparseGP.mp4">Video</a></video></p>
</figure>
</div>
<p>Let <img src="https://latex.codecogs.com/png.latex?f(%5Ccdot)%20%5Csim%20%5Cmathrm%7BGP%7D(m,%20K)"> denote a Gaussian Process (GP) prior with zero mean <img src="https://latex.codecogs.com/png.latex?m=0"> and covariance kernel <img src="https://latex.codecogs.com/png.latex?K">. Assume we observe <img src="https://latex.codecogs.com/png.latex?n%20%5Cgg%201"> noisy measurements <img src="https://latex.codecogs.com/png.latex?y%20=%20(y_i)_%7Bi=1%7D%5En"> of <img src="https://latex.codecogs.com/png.latex?f_i%20=%20f(x_i)"> at input locations <img src="https://latex.codecogs.com/png.latex?x%20=%20(x_i)_%7Bi=1%7D%5En">. The goal is to compute the posterior of <img src="https://latex.codecogs.com/png.latex?f%20=%20(f_i)_%7Bi=1%7D%5En"> and to infer GP hyperparameters. The main challenge with GP models is the cubic complexity of the matrix inversion required by many of the posterior computations.</p>
<p>Sparse GPs are a class of approaches that aim to reduce this complexity by approximating the full GP posterior with a smaller set of so-called inducing variables <img src="https://latex.codecogs.com/png.latex?u=(u_1,%20%5Cldots,%20u_m)"> that entirely describe an approximate posterior distribution. Consider <img src="https://latex.codecogs.com/png.latex?m%20%5Cll%20n"> locations <img src="https://latex.codecogs.com/png.latex?z=(z_i)_%7Bi=1%7D%5Em"> called inducing points and set <img src="https://latex.codecogs.com/png.latex?u_i%20=%20f(z_i)"> for the latent function values at the inducing points. The Gaussian random variables <img src="https://latex.codecogs.com/png.latex?u_i"> can be used as inducing random variables; each choice of inducing points <img src="https://latex.codecogs.com/png.latex?z"> defines a different set of inducing variables <img src="https://latex.codecogs.com/png.latex?u">. In this setting, optimizing the inducing variables simply means optimizing the locations of the inducing points <img src="https://latex.codecogs.com/png.latex?z">. The strategy is to approximate the posterior of <img src="https://latex.codecogs.com/png.latex?(u,f)"> with a tractable distribution <img src="https://latex.codecogs.com/png.latex?q(u,f)">,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ap(u,f%20%5Cmid%20y)%20%5C,%20=%20%5C,%20%5Cfrac%7Bp(u)%20%5C,%20p(f%20%5Cmid%20u)%20%5C,%20p(y%20%5Cmid%20f)%7D%7Bp(y)%7D%20%5C;%20%5Capprox%20%5C;%20q(u,f).%0A"></p>
<p>Later, we will see that setting <img src="https://latex.codecogs.com/png.latex?u_i%20=%20f(z_i)"> is indeed not the only choice of inducing variables, but let’s keep it to this for now. We have <img src="https://latex.codecogs.com/png.latex?p(u)%20=%20N(0,K_u)"> and <img src="https://latex.codecogs.com/png.latex?p(f%20%5Cmid%20u)%20=%20N(%5Cmu_%7Bf%7Cu%7D,%20K_%7Bf%7Cu%7D)"> where</p>
<p><span id="eq-conditionals"><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Balign*%7D%0A%5Cmu_%7Bf%7Cu%7D%20&amp;=%20K_%7Bfu%7D%20K_u%5E%7B-1%7D%20u%5C%5C%0AK_%7Bf%7Cu%7D%20&amp;=%20K_f%20-%20K_%7Bfu%7D%20K_u%5E%7B-1%7D%20K_%7Buf%7D.%0A%5Cend%7Balign*%7D%0A%5Cright.%0A%5Ctag%7B1%7D"></span></p>
<p>where <img src="https://latex.codecogs.com/png.latex?K_u"> is the covariance matrix of the inducing variables <img src="https://latex.codecogs.com/png.latex?u"> and <img src="https://latex.codecogs.com/png.latex?K_%7Bfu%7D"> is the covariance matrix between <img src="https://latex.codecogs.com/png.latex?f"> and <img src="https://latex.codecogs.com/png.latex?u">. Evaluating <img src="https://latex.codecogs.com/png.latex?p(f%20%5Cmid%20u)"> involves inverting <img src="https://latex.codecogs.com/png.latex?K_%7Bf%7Cu%7D%20%5Cin%20%5Cmathbb%7BR%7D%5E%7Bn,n%7D">, which typically scales as <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BO%7D(n%5E3)">, hence intractable for large <img src="https://latex.codecogs.com/png.latex?n">. To approximate <img src="https://latex.codecogs.com/png.latex?p(u,f%20%5Cmid%20y)"> with another distribution <img src="https://latex.codecogs.com/png.latex?q(u,f)">, one can minimize the Kullback-Leibler divergence <img src="https://latex.codecogs.com/png.latex?D_%7B%5Ctext%7BKL%7D%7D%5Bq(u,f)%20%5C%7C%20p(u,f%20%5Cmid%20y)%5D">, i.e.</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cint%20q(u)%20%5C,%20q(f%7Cu)%20%5C,%20%5Clog%20%20%7B%5Cleft%5C%7B%20%20%5Cfrac%7Bq(u)%20%5C,%20q(f%7Cu)%7D%7Bp(u)%20%5C,%20%20%5Ctextcolor%7Bred%7D%7Bp(f%20%5Cmid%20u)%7D%20%5C,%20p(y%20%5Cmid%20f)%7D%20%20%5Cright%5C%7D%7D%20%20%5C,%20du%20%5C,%20df%20%5C,%20+%20%5C,%20%5Clog%20p(y).%0A"></p>
<p>This is not immediately helpful, since the intractable term <img src="https://latex.codecogs.com/png.latex?%20%5Ctextcolor%7Bred%7D%7Bp(f%20%5Cmid%20u)%7D"> is still present. However, <span class="citation" data-cites="titsias2009variational">(Titsias 2009)</span> proposes to set <img src="https://latex.codecogs.com/png.latex?q(f%7Cu)%20=%20p(f%7Cu)">, i.e.&nbsp;to consider an approximate posterior of the form:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Aq(u,f)%20=%20q(u)%20%5C,%20p(f%20%5Cmid%20u).%0A"></p>
<p>Note that the correct posterior distribution is typically not of this form, although when the number of inducing points is large enough, this approximation becomes increasingly accurate. With this class of approximate posterior, the expectation <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D%5B%5CPhi(f)%20%5Cmid%20y%5D"> of some functional <img src="https://latex.codecogs.com/png.latex?%5CPhi"> is approximated as <img src="https://latex.codecogs.com/png.latex?%5Cint%20%5Cmathbb%7BE%7D%5B%5CPhi(f)%20%5Cmid%20u%5D%20%5C,%20q(u)%20%5C,%20du">. For example, if <img src="https://latex.codecogs.com/png.latex?q(u)%20=%20%5Cmathcal%7BN%7D(%5Cmu_q,%20K_q)"> is a Gaussian variational distribution, the posterior distribution of <img src="https://latex.codecogs.com/png.latex?f_%5Cstar%20=%20f(x_%5Cstar)"> at a new location <img src="https://latex.codecogs.com/png.latex?x_%5Cstar"> is approximated as <img src="https://latex.codecogs.com/png.latex?K_%7B%5Cstar,u%7D%20%5C,%20K_%7Bu%7D%5E%7B-1%7D%20%5Cmathcal%7BN%7D(%5Cmu_q,%20K_q)%20+%20K_%7B%5Cstar%7Cu%7D">; it is a Gaussian with</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Balign*%7D%0A%5Ctextrm%7Bmean%7D%20&amp;=%20K_%7B%5Cstar,u%7D%20%5C,%20K_%7Bu%7D%5E%7B-1%7D%20%5Cmu_q%5C%5C%0A%5Ctextrm%7Bcov%7D%20&amp;=%20K_%7B%5Cstar,u%7D%20%5C,%20K_%7Bu%7D%5E%7B-1%7D%20%5C,%20K_q%20%5C,%20K_%7Bu%7D%5E%7B-1%7D%20K_%7Bu,%5Cstar%7D%20+%20K_%7B%5Cstar%7Cu%7D%0A%5Cend%7Balign*%7D%0A%5Cright.%0A"></p>
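<p>These predictive formulas are straightforward to implement. Here is a minimal sketch with a squared-exponential kernel; the inducing points, lengthscale, and variational parameters <code>(mu_q, K_q)</code> are all placeholder values, not optimized.</p>

```python
import numpy as np

def k(a, b, ell=0.5):
    # squared-exponential kernel with lengthscale ell (placeholder choice)
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

z  = np.linspace(-2, 2, 7)               # inducing points (placeholder)
Ku = k(z, z) + 1e-10 * np.eye(len(z))    # K_u with a small jitter

# placeholder variational distribution q(u) = N(mu_q, K_q)
rng  = np.random.default_rng(3)
mu_q = rng.standard_normal(len(z))
Kq   = 0.1 * np.eye(len(z))

xstar = np.linspace(-3, 3, 50)           # prediction locations
Ksu   = k(xstar, z)                          # K_{*,u}
A     = np.linalg.solve(Ku, Ksu.T).T         # K_{*,u} K_u^{-1}
Kcond = k(xstar, xstar) - A @ Ksu.T          # K_{*|u}

mean = A @ mu_q                              # K_{*,u} K_u^{-1} mu_q
cov  = A @ Kq @ A.T + Kcond                  # K_{*,u} K_u^{-1} K_q K_u^{-1} K_{u,*} + K_{*|u}
```

Only the <code>m × m</code> matrix <code>K_u</code> is ever inverted, so predictions cost <code>O(m³)</code> rather than <code>O(n³)</code>.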
<p>Optimizing the inducing variables is equivalent to minimizing the free energy quantity</p>
<p><span id="eq-variational"><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7BF%7D%5C;%20%5Cequiv%20%5C;%20%5Cint%20q(u)%20%5C,%20p(f%7Cu)%20%5C,%20%5Clog%20%20%7B%5Cleft%5C%7B%20%20%5Cfrac%7Bq(u)%7D%7Bp(u)%20%5C,%20p(y%20%5Cmid%20f)%7D%20%20%5Cright%5C%7D%7D%20%20%5C,%20du%20%5C,%20df,%0A%5Ctag%7B2%7D"></span></p>
<p>over the variational distribution <img src="https://latex.codecogs.com/png.latex?q(u)"> and the choice of inducing variables. For a fixed set of inducing variables (e.g. a fixed set of inducing points), it is clear that the optimal variational distribution is given by</p>
<p><span id="eq-optimal-variational"><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign*%7D%0Aq_%7B%5Cstar%7D(u)%0A&amp;=%20p(u)%20%5Cexp%20%7B%5Cleft%5C%7B%20%20%5Cint%20p(f%7Cu)%20%5C,%20%5Clog%20%20%7B%5Cleft(%20p(y%20%5Cmid%20f)%20%5Cright)%7D%20%20%5C,%20df%20%20%5Cright%5C%7D%7D%20%20/%20%5Cmathcal%7BZ%7D%5C%5C%0A&amp;=%20p(u)%20%5Cexp%20%7B%5Cleft%5C%7B%20%20%5Cmathbb%7BE%7D%5B%5Clog%20p(y%20%5Cmid%20f)%20%5Cmid%20u%5D%20%20%5Cright%5C%7D%7D%20%20/%20%5Cmathcal%7BZ%7D%0A%5Cend%7Balign*%7D%0A%5Ctag%7B3%7D"></span></p>
<p>for some normalization constant <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BZ%7D=%20%5Cint%20p(u)%20%5Cexp%20%7B%5Cleft%5C%7B%20%20%5Cmathbb%7BE%7D%5B%5Clog%20p(y%20%5Cmid%20f)%20%5Cmid%20u%5D%20%20%5Cright%5C%7D%7D%20%20%5C,%20du">; this can be seen by expressing Equation&nbsp;2 as a KL divergence, as is done, for example, when deriving the Coordinate Ascent Variational Inference (CAVI) method,</p>
<p><span id="eq-free-energy"><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7BF%7D=%20D_%7B%5Ctext%7BKL%7D%7D%5Bq(u)%20%5Cmid%20q_%7B%5Cstar%7D(u)%5D%20%5C,%20-%20%5C,%20%5Clog%20%5Cmathcal%7BZ%7D.%0A%5Ctag%7B4%7D"></span></p>
<p>Equation&nbsp;3 shows that <img src="https://latex.codecogs.com/png.latex?q_%7B%5Cstar%7D(u)"> is the prior <img src="https://latex.codecogs.com/png.latex?p(u)"> weighted by a term that is large when the observations <img src="https://latex.codecogs.com/png.latex?y"> are likely given <img src="https://latex.codecogs.com/png.latex?u">, i.e.&nbsp;when <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D%5B%5Clog%20p(y%20%5Cmid%20f)%20%5Cmid%20u%5D"> is large.</p>
<section id="nystrom-approximation" class="level3">
<h3 class="anchored" data-anchor-id="nystrom-approximation">Nyström approximation</h3>
<p>Before turning to the simplest and most important case of additive Gaussian noise, let us briefly recall the Nyström approximation. The distribution of <img src="https://latex.codecogs.com/png.latex?f%20%5Cmid%20u"> is Gaussian with mean <img src="https://latex.codecogs.com/png.latex?%5Cmu_%7Bf%7Cu%7D%20=%20K_%7Bfu%7D%20K_u%5E%7B-1%7D%20u"> and covariance <img src="https://latex.codecogs.com/png.latex?K_%7Bf%7Cu%7D">. This means that <img src="https://latex.codecogs.com/png.latex?K_%7Bfu%7D%20K_u%5E%7B-1%7D%20u%20+%20%5Cmathcal%7BN%7D(0,%20K_%7Bf%7Cu%7D)"> has the same distribution as the unconditional <img src="https://latex.codecogs.com/png.latex?f%20%5Csim%20%5Cmathcal%7BN%7D(0,%20K_f)">. In particular:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign*%7D%0AK_f%0A&amp;=%20K_%7Bfu%7D%20K_u%5E%7B-1%7D%20K_u%20K_u%5E%7B-1%7D%20K_%7Buf%7D%20+%20K_%7Bf%7Cu%7D%20%5C%5C%0A&amp;=%20%20%5Ctextcolor%7Bred%7D%7BK_%7Bfu%7D%20K_u%5E%7B-1%7D%20K_%7Buf%7D%7D%20+%20K_%7Bf%7Cu%7D%20%5C%5C%0A&amp;=%20%20%5Ctextcolor%7Bblue%7D%7B%5Cwidehat%7BK%7D%5E%7Bu%7D_%7Bf%7D%7D%20+%20K_%7Bf%7Cu%7D%0A%5Cend%7Balign*%7D%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%20%5Ctextcolor%7Bblue%7D%7B%5Cwidehat%7BK%7D%5E%7Bu%7D_%7Bf%7D%7D%20%5Cequiv%20K_%7Bfu%7D%20K_u%5E%7B-1%7D%20K_%7Buf%7D"> is the so-called <a href="https://en.wikipedia.org/wiki/Low-rank_matrix_approximations">Nyström approximation</a> of the covariance matrix <img src="https://latex.codecogs.com/png.latex?K_f"> based on the inducing variable <img src="https://latex.codecogs.com/png.latex?u">. This shows that the Nyström approximation <img src="https://latex.codecogs.com/png.latex?%5Cwidehat%7BK%7D%5E%7Bu%7D_%7Bf%7D"> simply amounts to ignoring the conditional variance term <img src="https://latex.codecogs.com/png.latex?K_%7Bf%7Cu%7D">, and is thus an underestimate of the covariance matrix <img src="https://latex.codecogs.com/png.latex?K_f">. Furthermore, if <img src="https://latex.codecogs.com/png.latex?u"> is very informative about <img src="https://latex.codecogs.com/png.latex?f">, then <img src="https://latex.codecogs.com/png.latex?K_%7Bf%7Cu%7D"> is small and the Nyström approximation is accurate.</p>
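<p>The decomposition above is easy to check numerically. Here is a small numpy sketch (kernel, data locations, and inducing locations are toy choices of mine) verifying that the conditional covariance is positive semi-definite, i.e. that the Nyström approximation underestimates the covariance in the PSD order.</p>

```python
import numpy as np

def rbf(a, b, ell=0.7):
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

x = np.linspace(0.0, 3.0, 8)       # locations of f (toy)
z = np.array([0.5, 1.5, 2.5])      # inducing locations (toy)

K_f  = rbf(x, x)
K_fu = rbf(x, z)
K_u  = rbf(z, z) + 1e-9 * np.eye(len(z))

K_hat = K_fu @ np.linalg.solve(K_u, K_fu.T)   # Nystrom approximation of K_f
K_cond = K_f - K_hat                           # conditional covariance K_{f|u}

# K_cond is a Schur complement, hence (numerically) symmetric PSD
eigvals = np.linalg.eigvalsh((K_cond + K_cond.T) / 2)
```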
</section>
<section id="observation-with-additive-gaussian-noise" class="level3">
<h3 class="anchored" data-anchor-id="observation-with-additive-gaussian-noise">Observation with additive Gaussian noise</h3>
<p>The case of additive Gaussian noise is particularly simple. Assume that</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ay_i%20=%20f_i%20+%20%5Cvarepsilon_i%0A%5Cqquad%20%5Ctext%7Bwith%7D%20%5Cqquad%0A%5Cvarepsilon_i%20%5Csim%20%5Cmathcal%7BN%7D(0,%20%5Csigma%5E2)%0A"></p>
<p>where the noise terms <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon_i"> are independent. Since <img src="https://latex.codecogs.com/png.latex?f%7Cu%20%5Csim%20%5Cmathcal%7BN%7D(%5Cmu_%7Bf%7Cu%7D,%20K_%7Bf%7Cu%7D)">, algebra gives that</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D%5B%5Clog%20p(y%20%5Cmid%20f)%20%5Cmid%20u%5D%20=%0A%5Clog%5B%20%5Cmathcal%7BN%7D(y;%20%5Cmu_%7Bf%7Cu%7D,%20%5Csigma%5E2%20%5C,%20I)%20%5D%20-%20%5Cfrac%7B1%7D%7B2%20%5Csigma%5E2%7D%20%5C,%20%5Cmathop%7B%5Cmathrm%7BTr%7D%7D(%20K_%7Bf%7Cu%7D%20)"></p>
<p>Using that <img src="https://latex.codecogs.com/png.latex?%5Cmu_%7Bf%7Cu%7D%20=%20K_%7Bfu%7D%20K_u%5E%7B-1%7D%20u"> and the <a href="https://en.wikipedia.org/wiki/Woodbury_matrix_identity">matrix inversion lemma</a>, it quickly follows that the optimal variational distribution is <img src="https://latex.codecogs.com/png.latex?q_%7B%5Cstar%7D(u)%20=%20%5Cmathcal%7BN%7D(%5Cmu_%7B%5Cstar%7D,%20K_%7B%5Cstar%7D)"> with</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Baligned%7D%0A%5Cmu_%7B%5Cstar%7D%20&amp;=%20K_%7Buf%7D%20%5C,%20%20%7B%5Cleft(%20%5Cwidehat%7BK%7D%5E%7Bu%7D_%7Bf%7D%20+%20%5Csigma%5E2%20I%20%5Cright)%7D%20%5E%7B-1%7D%20%5C,%20y%5C%5C%0AK_%7B%5Cstar%7D%20&amp;=%20K_u%20-%20K_%7Buf%7D%20%5C,%20%20%7B%5Cleft(%20%5Cwidehat%7BK%7D%5E%7Bu%7D_%7Bf%7D%20+%20%5Csigma%5E2%20I%20%5Cright)%7D%20%5E%7B-1%7D%20%5C,%20K_%7Bfu%7D.%0A%5Cend%7Baligned%7D%0A%5Cright.%0A"></p>
<p>These formulas are approximations of the exact conditional moments,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Baligned%7D%0A%5Cmu_%7Bu%7Cy%7D%20&amp;=%20K_%7Buf%7D%20%5C,%20%20%7B%5Cleft(%20K_%7Bf%7D%20+%20%5Csigma%5E2%20I%20%5Cright)%7D%20%5E%7B-1%7D%20%5C,%20y%5C%5C%0AK_%7Bu%7Cy%7D%20&amp;=%20K_u%20-%20K_%7Buf%7D%20%5C,%20%20%7B%5Cleft(%20K_%7Bf%7D%20+%20%5Csigma%5E2%20I%20%5Cright)%7D%20%5E%7B-1%7D%20%5C,%20K_%7Bfu%7D.%0A%5Cend%7Baligned%7D%0A%5Cright.%0A"></p>
<p>where the Nyström approximation <img src="https://latex.codecogs.com/png.latex?%5Cwidehat%7BK%7D%5E%7Bu%7D_%7Bf%7D%20%5Capprox%20K_f"> is used instead. One then finds that <img src="https://latex.codecogs.com/png.latex?%5Clog%20%5Cmathcal%7BZ%7D=%20%5Clog%20%5Cmathcal%7BN%7D(y;%200,%20%5Cwidehat%7BK%7D%5Eu_f%20+%20%5Csigma%5E2%20I)%20-%20%5Cfrac%7B1%7D%7B2%5Csigma%5E2%7D%20%5Cmathop%7B%5Cmathrm%7BTr%7D%7D(K_%7Bf%7Cu%7D)">. With the optimal variational distribution <img src="https://latex.codecogs.com/png.latex?q_%5Cstar(u)">, Equation&nbsp;4 gives that the free energy is:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7BF%7D%0A=%20-%5Clog%20%5Cmathcal%7BN%7D%20%7B%5Cleft(%20y;%200,%20%5C;%20%5Cwidehat%7BK%7D%5E%7Bu%7D_%7Bf%7D%20+%20%5Csigma%5E2%20I%20%5Cright)%7D%20%20+%20%5Cfrac%7B1%7D%7B2%5Csigma%5E2%7D%20%5Cmathop%7B%5Cmathrm%7BTr%7D%7DK_%7Bf%7Cu%7D%20%5C%5C%0A"></p>
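<p>For concreteness, the moments of the optimal Gaussian variational distribution derived above can be computed directly with numpy; the kernel and data below are toy choices of mine, not from the note.</p>

```python
import numpy as np

def rbf(a, b, ell=0.5):
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

def optimal_q(x, z, y, sigma2, jitter=1e-9):
    """Moments of the optimal Gaussian q_*(u) under additive Gaussian noise."""
    K_u = rbf(z, z) + jitter * np.eye(len(z))
    K_fu = rbf(x, z)
    K_hat = K_fu @ np.linalg.solve(K_u, K_fu.T)       # Nystrom approx of K_f
    S = np.linalg.inv(K_hat + sigma2 * np.eye(len(x)))
    mu_star = K_fu.T @ S @ y
    K_star = K_u - K_fu.T @ S @ K_fu
    return mu_star, K_star
```

<p>When the inducing points coincide with the data locations, the Nyström approximation is exact and these moments reduce to the exact conditional moments.</p>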
<p>Furthermore, note that the exact likelihood of the observations is <img src="https://latex.codecogs.com/png.latex?p(y)%20=%20%5Cmathcal%7BN%7D%20%7B%5Cleft(%20y;%200,%20%5C;%20K_f%20+%20%5Csigma%5E2%20I%20%5Cright)%7D%20"> so that the free energy can be expressed as</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7BF%7D%0A%5C;%20=%20%5C;%0A-%5Clog%20%5Cwidehat%7Bp%7D%5Eu(y)%20+%20%5Cfrac%7B1%7D%7B2%5Csigma%5E2%7D%20%5Cmathop%7B%5Cmathrm%7BTr%7D%7DK_%7Bf%7Cu%7D%20%5C%5C%0A"></p>
<p>for pseudo-likelihood <img src="https://latex.codecogs.com/png.latex?%5Cwidehat%7Bp%7D%5Eu(y)%20=%20%5Cmathcal%7BN%7D%20%7B%5Cleft(%20y;%200,%20%5C;%20%5Cwidehat%7BK%7D%5E%7Bu%7D_%7Bf%7D%20+%20%5Csigma%5E2%20I%20%5Cright)%7D%20">. This shows that <img src="https://latex.codecogs.com/png.latex?D_%7B%5Ctext%7BKL%7D%7D"> is given by: <img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign*%7D%0AD_%7B%5Ctext%7BKL%7D%7D&amp;%5Bq(u,f)%20%5Cmid%20p(u,f%20%5Cmid%20y)%5D%0A=%0A%5Cmathcal%7BF%7D+%20%5Clog%20p(y)%20%5C%5C%0A&amp;=%0A%5Clog%20%5Cfrac%7Bp(y)%7D%7B%5Cwidehat%7Bp%7D%5Eu(y)%7D%0A+%0A%5Cfrac%7B1%7D%7B2%5Csigma%5E2%7D%20%5Cmathop%7B%5Cmathrm%7BTr%7D%7DK_%7Bf%7Cu%7D.%0A%5Cend%7Balign*%7D%0A"></p>
<p>The term <img src="https://latex.codecogs.com/png.latex?%5Cmathop%7B%5Cmathrm%7BTr%7D%7DK_%7Bf%7Cu%7D"> is just the sum of the conditional variances <img src="https://latex.codecogs.com/png.latex?%5Cmathop%7B%5Cmathrm%7BVar%7D%7D%20%7B%5Cleft(%20f_i%20%7C%20u%20%5Cright)%7D%20"> and can be thought of as a regularization term,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AR%20=%20%5Cfrac%7B1%7D%7B2%5Csigma%5E2%7D%20%5Cmathop%7B%5Cmathrm%7BTr%7D%7DK_%7Bf%7Cu%7D%20=%0A%5Cfrac12%20%5C,%20%5Csum_%7Bi=1%7D%5En%20%5Cfrac%7B%5Cmathop%7B%5Cmathrm%7BVar%7D%7D%20%7B%5Cleft(%20f_i%20%7C%20u%20%5Cright)%7D%20%7D%7B%5Csigma%5E2%7D.%0A"></p>
<p>As the number of inducing variables <img src="https://latex.codecogs.com/png.latex?m"> increases, the pseudo-likelihood becomes more accurate <img src="https://latex.codecogs.com/png.latex?%5Cwidehat%7Bp%7D%5Eu(y)%20%5Cto%20p(y)">, the conditional variances <img src="https://latex.codecogs.com/png.latex?%5Cmathop%7B%5Cmathrm%7BVar%7D%7D%20%7B%5Cleft(%20f_i%20%7C%20u%20%5Cright)%7D%20%20%5Cto%200"> shrink to zero, and the KL divergence <img src="https://latex.codecogs.com/png.latex?D_%7B%5Ctext%7BKL%7D%7D%5Bq(u,f)%20%5Cmid%20p(u,f%20%5Cmid%20y)%5D"> approaches zero.</p>
<p>The animation at the start of this note illustrates the effect of optimizing the location of the inducing points <img src="https://latex.codecogs.com/png.latex?z"> with a very simple gradient descent. A few experiments show that it is worth being careful with the initial choice of inducing points. Inducing points chosen very far from the data essentially remain fixed during the optimization (i.e., the gradient is very small). Initializing with <a href="https://en.wikipedia.org/wiki/K-means%2B%2B">k-means++</a> clustering of the data points seems to be a robust strategy and gives an almost optimal choice of inducing points.</p>
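<p>For reference, the k-means++ seeding step (just the seeding, not the full k-means iteration) is only a few lines of numpy; this is a minimal sketch with names of my own choosing.</p>

```python
import numpy as np

def kmeanspp_init(X, m, rng):
    """k-means++ seeding: pick m rows of X, each new center chosen with
    probability proportional to its squared distance to the chosen ones."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(m - 1):
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```

<p>Because already-chosen points have zero selection probability, the m seeds are distinct, and they tend to spread out over the data, which is exactly what one wants from initial inducing points.</p>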



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0">
<div id="ref-titsias2009variational" class="csl-entry">
Titsias, Michalis. 2009. <span>“Variational Learning of Inducing Variables in Sparse Gaussian Processes.”</span> In <em>Artificial Intelligence and Statistics</em>, 567–74. PMLR.
</div>
</div></section></div> ]]></description>
  <category>GP</category>
  <guid>https://alexxthiery.github.io/notes/sparse_GP/sparse_gp.html</guid>
  <pubDate>Thu, 17 Apr 2025 17:00:00 GMT</pubDate>
</item>
<item>
  <title>Self Avoiding Walks</title>
  <link>https://alexxthiery.github.io/notes/SAW/SAW.html</link>
  <description><![CDATA[ 





<!-- \begin{figure}[h]
\centering
\includegraphics[width=0.3\textwidth]{polymer-selfavoiding.png}
\caption{A 2D self-avoiding walk}
\end{figure} -->
<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/SAW/polymer-selfavoiding.png" class="img-fluid figure-img" style="width:35.0%"></p>
<figcaption><a href="https://en.wikipedia.org/wiki/Self-avoiding_walk">A 2D self-avoiding walk</a></figcaption>
</figure>
</div>
</div>
<p><em>These notes present comments on the “Self-avoiding walks” assignment given to the “ST3247: Simulations” class. Most of the drafts that have been submitted so far describe variations of importance sampling. The purpose of these notes is to suggest directions for slightly more advanced Monte Carlo methods that can be used to estimate the connective constant <img src="https://latex.codecogs.com/png.latex?%5Cmu"> of self-avoiding walks. These are only pointers and suggestions.</em></p>
<section id="the-problems-and-notations" class="level3">
<h3 class="anchored" data-anchor-id="the-problems-and-notations">The problems and notations</h3>
<p>Recall that we are trying to estimate the <a href="https://en.wikipedia.org/wiki/Connective_constant">connective constant</a> <img src="https://latex.codecogs.com/png.latex?%5Cmu"> of self-avoiding walks (SAW) in the 2D lattice <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BZ%7D%5E2">. If <img src="https://latex.codecogs.com/png.latex?c_L"> denotes the number of SAWs of length <img src="https://latex.codecogs.com/png.latex?L">, we have the following asymptotic behavior:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ac_L%20%5C;%20%5Csim%20%5C;%20A%20%5C,%20%5Cmu%5EL%20%5C,%20L%5E%7B%5Cgamma%7D%0A"></p>
<p>for some unknown constants <img src="https://latex.codecogs.com/png.latex?A">, <img src="https://latex.codecogs.com/png.latex?%5Cmu">, and <img src="https://latex.codecogs.com/png.latex?%5Cgamma">. The main objective of the assignment is to estimate <img src="https://latex.codecogs.com/png.latex?%5Cmu">, which can also be expressed as the limit of <img src="https://latex.codecogs.com/png.latex?c_L%5E%7B1/L%7D"> as <img src="https://latex.codecogs.com/png.latex?L%20%5Cto%20%5Cinfty">. As of today, the <a href="https://en.wikipedia.org/wiki/Connective_constant">best known estimate</a> is <img src="https://latex.codecogs.com/png.latex?%5Cmu%20%5Capprox%202.638158533032790(3)">, which required several tens of thousands of hours of CPU time to compute. A good estimate of <img src="https://latex.codecogs.com/png.latex?%5Cmu"> therefore requires approximating the number of SAWs of length <img src="https://latex.codecogs.com/png.latex?L"> starting at the origin for large values of <img src="https://latex.codecogs.com/png.latex?L">.</p>
<p>Consider a sequence <img src="https://latex.codecogs.com/png.latex?z_%7B0:L%7D%20=%20(z_0,%20z_1,%20%5Cdots,%20z_L)"> of <img src="https://latex.codecogs.com/png.latex?L+1"> distinct vertices in <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BZ%7D%5E2"> with <img src="https://latex.codecogs.com/png.latex?z_0%20=%20(0,0)"> and <img src="https://latex.codecogs.com/png.latex?%5C%7Cz_%7Bk+1%7D%20-%20z_k%5C%7C=1"> for all <img src="https://latex.codecogs.com/png.latex?0%20%5Cleq%20k%20%5Cleq%20L-1">, i.e., a walk of length <img src="https://latex.codecogs.com/png.latex?L">. For notational convenience, let us introduce the function <img src="https://latex.codecogs.com/png.latex?%5Cvarphi%5E%7B%5Ctextrm%7Bwalk%7D%7D(z_%7B:L%7D)"> that returns one if <img src="https://latex.codecogs.com/png.latex?z_%7B0:L%7D"> is a correct walk of length <img src="https://latex.codecogs.com/png.latex?L">, and zero otherwise. In particular, this function returns zero if two consecutive vertices are the same, or if the walk does not start at zero. Similarly, introduce the function <img src="https://latex.codecogs.com/png.latex?%5Cvarphi%5E%7B%5Ctextrm%7BSAW%7D%7D(z_%7B:L%7D)"> that returns one if <img src="https://latex.codecogs.com/png.latex?z_%7B0:L%7D"> is a SAW of length <img src="https://latex.codecogs.com/png.latex?L">. One can define two important probability mass functions:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ap%5E%7B%5Ctextrm%7Bwalk%7D%7D_%7BL%7D(z_%7B0:L%7D)%20=%20%5Cfrac%7B%5Cvarphi_L%5E%7B%5Ctextrm%7Bwalk%7D%7D(z_%7B0:L%7D)%7D%7B4%5EL%7D%0A%5Cqquad%20%5Ctextrm%7Band%7D%20%5Cqquad%0Ap%5E%7B%5Ctextrm%7BSAW%7D%7D_%7BL%7D(z_%7B0:L%7D)%20=%20%5Cfrac%7B%5Cvarphi_L%5E%7B%5Ctextrm%7BSAW%7D%7D(z_%7B0:L%7D)%7D%7Bc_L%7D.%0A"></p>
<p>They describe the uniform distributions on all the walks of length <img src="https://latex.codecogs.com/png.latex?L"> and all the SAWs of length <img src="https://latex.codecogs.com/png.latex?L">, respectively.</p>
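<p>For small lengths, the counts c_L can be obtained by exhaustive depth-first enumeration; a minimal sketch (exponential in L, so only usable for short walks; the function name is mine):</p>

```python
def count_saws(L):
    """Number of self-avoiding walks of length L on Z^2 starting at the origin."""
    def extend(path, visited, steps_left):
        if steps_left == 0:
            return 1
        x, y = path[-1]
        total = 0
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nxt not in visited:     # only extend to unvisited neighbours
                visited.add(nxt)
                path.append(nxt)
                total += extend(path, visited, steps_left - 1)
                path.pop()
                visited.remove(nxt)
        return total
    return extend([(0, 0)], {(0, 0)}, L)
```

<p>This reproduces the known small-L counts (4, 12, 36, 100, 284, 780, ...) and can be used to check the Monte Carlo estimators below.</p>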
</section>
<section id="importance-sampling" class="level3">
<h3 class="anchored" data-anchor-id="importance-sampling">Importance sampling</h3>
<p>One can approximate <img src="https://latex.codecogs.com/png.latex?c_L"> with naive Monte Carlo by estimating the proportion <img src="https://latex.codecogs.com/png.latex?p_L"> of walks that are SAWs,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ap_L%20=%20%5Cmathbb%7BE%7D_%7Bp%5E%7B%5Ctextrm%7Bwalk%7D%7D_%7BL%7D%7D%20%5Cleft%5B%20%5Cvarphi_L%5E%7B%5Ctextrm%7BSAW%7D%7D(z_%7B0:L%7D)%20%5Cright%5D%0A=%0A%5Cfrac%7B1%7D%7B4%5EL%7D%20%5Csum_%7Bz_%7B0:L%7D%7D%20%5Cvarphi_L%5E%7B%5Ctextrm%7BSAW%7D%7D(z_%7B0:L%7D).%0A"></p>
<p>This is an absolute disaster since the proportion of SAWs among all walks is extremely small. One can do significantly better using importance sampling. For this, consider a proposal distribution that starts at the origin and continues by choosing uniformly among the four neighbors of the last vertex that have not been visited yet. If there are no unvisited neighbors, the walk continues by standing still until length <img src="https://latex.codecogs.com/png.latex?L"> is reached: the resulting path is not even a valid walk, so <img src="https://latex.codecogs.com/png.latex?p%5E%7B%5Ctextrm%7Bwalk%7D%7D_%7BL%7D(z_%7B0:L%7D)%20=%200"> as well as <img src="https://latex.codecogs.com/png.latex?p%5E%7B%5Ctextrm%7BSAW%7D%7D_%7BL%7D(z_%7B0:L%7D)%20=%200">. The probability mass function of the proposal distribution is easy to compute, so estimating <img src="https://latex.codecogs.com/png.latex?p_L"> with importance sampling is straightforward. This is usually called the Rosenbluth method <span class="citation" data-cites="rosenbluth1955monte">(Rosenbluth and Rosenbluth 1955)</span>. <em>[<strong>Note to students</strong>: make it much clearer in your report that the Rosenbluth method is just importance sampling. Do note that even the “rejected” walks have to be taken into account!]</em></p>
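<p>A minimal sketch of the Rosenbluth estimator in plain Python (names are mine): grow each walk by picking uniformly among the unvisited neighbours, keep the product of the number of available choices as the importance weight (a dead end contributes weight zero but still counts), and average the weights to estimate c_L.</p>

```python
import random

def rosenbluth_estimate(L, n_samples, rng):
    """Importance-sampling (Rosenbluth) estimate of c_L."""
    total = 0.0
    for _ in range(n_samples):
        pos, visited, w = (0, 0), {(0, 0)}, 1.0
        for _ in range(L):
            x, y = pos
            free = [p for p in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1))
                    if p not in visited]
            if not free:       # dead end: the walk still counts, with weight 0
                w = 0.0
                break
            w *= len(free)     # inverse of the proposal probability at this step
            pos = rng.choice(free)
            visited.add(pos)
        total += w
    return total / n_samples
```

<p>The estimator is unbiased, and for moderate L it works well (e.g. it typically gets within a few percent of c_10 = 44100 with 20,000 samples), but the weight variance explodes as L grows, which is the degradation visible in the figure below.</p>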
<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/SAW/SAW_SMC.png" class="img-fluid figure-img" style="width:95.0%"></p>
<figcaption>Importance Sampling (Rosenbluth method)</figcaption>
</figure>
</div>
</div>
<p>As one can see, the quality quickly deteriorates as <img src="https://latex.codecogs.com/png.latex?L"> increases. This is because the number of accepted walks is very small, and, among them, the importance weights are highly unequal.<br>
<em>[<strong>Note to students</strong>: you should explain this much more clearly, and possibly explore this more quantitatively. The reason it is failing is not only that the number of accepted walks is small]</em></p>
</section>
<section id="recursive-formulation" class="level3">
<h3 class="anchored" data-anchor-id="recursive-formulation">Recursive formulation</h3>
<p>We have just seen that importance sampling will not be able to estimate <img src="https://latex.codecogs.com/png.latex?c_L"> for large values of <img src="https://latex.codecogs.com/png.latex?L">. This makes accurate estimates of <img src="https://latex.codecogs.com/png.latex?%5Cmu"> difficult to obtain this way.</p>
<p>To make progress, one can exploit the recursive structure of the problem. Let us define the concatenation of two walks. Given a first walk <img src="https://latex.codecogs.com/png.latex?z%5E%7B(A)%7D_%7B0:L_A%7D"> and a second walk <img src="https://latex.codecogs.com/png.latex?z%5E%7B(B)%7D_%7B0:L_B%7D">, one can define a new walk of length <img src="https://latex.codecogs.com/png.latex?L_A%20+%20L_B"> by starting at the origin, following the <img src="https://latex.codecogs.com/png.latex?L_A"> increments of the first walk, then the <img src="https://latex.codecogs.com/png.latex?L_B"> increments of the second. The concatenation of two SAWs is not always a SAW. However, it is not hard to prove the following. Define <img src="https://latex.codecogs.com/png.latex?B(L_A,%20L_B)%20%5Cin%20(0,1)"> as the probability that, when sampling SAWs <img src="https://latex.codecogs.com/png.latex?z%5E%7B(A)%7D_%7B0:L_A%7D"> and <img src="https://latex.codecogs.com/png.latex?z%5E%7B(B)%7D_%7B0:L_B%7D"> independently and uniformly at random, their concatenation is still a SAW. Then:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AB(L_A,%20L_B)%20%5C;%20=%20%5C;%20%5Cfrac%7Bc_%7BL_A%20+%20L_B%7D%7D%7Bc_%7BL_A%7D%20%5C,%20c_%7BL_B%7D%7D.%0A"></p>
<p><em>[<strong>Note to students</strong>: it is OK for you to use this fact. It’s even better if you can prove it, but not absolutely necessary.]</em></p>
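<p>The identity is easy to verify by brute force for small lengths (exponential-time enumeration, for illustration only; function names are mine):</p>

```python
def all_saws(L):
    """All self-avoiding walks of length L on Z^2 starting at the origin."""
    walks = []
    def extend(path, visited, steps_left):
        if steps_left == 0:
            walks.append(list(path))
            return
        x, y = path[-1]
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nxt not in visited:
                visited.add(nxt)
                path.append(nxt)
                extend(path, visited, steps_left - 1)
                path.pop()
                visited.remove(nxt)
    extend([(0, 0)], {(0, 0)}, L)
    return walks

def concat_is_saw(wa, wb):
    """Follow wa, then the increments of wb; True if the result is a SAW."""
    pos, visited = wa[-1], set(wa)
    for (x0, y0), (x1, y1) in zip(wb, wb[1:]):
        pos = (pos[0] + x1 - x0, pos[1] + y1 - y0)
        if pos in visited:
            return False
        visited.add(pos)
    return True

# counting the SAW pairs of lengths (L_A, L_B) whose concatenation is again
# a SAW recovers c_{L_A + L_B}, i.e. B(L_A, L_B) = c_{L_A+L_B} / (c_{L_A} c_{L_B})
```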
<p>Assuming one can generate SAWs of length <img src="https://latex.codecogs.com/png.latex?L"> uniformly at random (a problem that will be discussed later), we can estimate <img src="https://latex.codecogs.com/png.latex?%5Cmu"> in several ways:</p>
<ol type="1">
<li><p>For small values of <img src="https://latex.codecogs.com/png.latex?L_1">, the number of SAWs <img src="https://latex.codecogs.com/png.latex?c_%7BL_1%7D"> is known exactly (e.g., <img src="https://latex.codecogs.com/png.latex?c_1%20=%204">, <img src="https://latex.codecogs.com/png.latex?c_%7B10%7D%20=%2044100">). Suppose one can generate SAWs of length <img src="https://latex.codecogs.com/png.latex?L_2%20%5Cgg%201">. One can then estimate <img src="https://latex.codecogs.com/png.latex?B(L_1,%20L_2)"> empirically. Since <img src="https://latex.codecogs.com/png.latex?c_L%20%5C;%20%5Csim%20%5C;%20A%20%5C,%20%5Cmu%5EL%20%5C,%20L%5E%7B%5Cgamma%7D">, it follows that, for <img src="https://latex.codecogs.com/png.latex?L_1"> fixed and <img src="https://latex.codecogs.com/png.latex?L_2%20%5Cto%20%5Cinfty">, <img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7Bc_%7BL_1+L_2%7D%7D%7Bc_%7BL_2%7D%7D%20%5Capprox%20%5Cmu%5E%7BL_1%7D.%0A"> Using the fact that <img src="https://latex.codecogs.com/png.latex?B(L_1,%20L_2)%20=%20c_%7BL_1+L_2%7D%20/%20(c_%7BL_2%7D%20c_%7BL_1%7D)">, one can then estimate <img src="https://latex.codecogs.com/png.latex?%5Cmu"> from the estimate of <img src="https://latex.codecogs.com/png.latex?B(L_1,%20L_2)">.</p></li>
<li><p>Alternatively, one can estimate <img src="https://latex.codecogs.com/png.latex?c_L"> for large <img src="https://latex.codecogs.com/png.latex?L"> recursively. For example, starting from <img src="https://latex.codecogs.com/png.latex?c_%7B10%7D%20=%2044100">, estimate <img src="https://latex.codecogs.com/png.latex?B(10,10)"> to compute <img src="https://latex.codecogs.com/png.latex?c_%7B20%7D">, then use <img src="https://latex.codecogs.com/png.latex?B(20,20)"> to compute <img src="https://latex.codecogs.com/png.latex?c_%7B40%7D">, and so on. Using this method and about <img src="https://latex.codecogs.com/png.latex?5"> hours of CPU time (see below for details) with <img src="https://latex.codecogs.com/png.latex?10,000"> SAWs of lengths <img src="https://latex.codecogs.com/png.latex?10,%2020,%20%5Cdots,%202560">, I obtained <img src="https://latex.codecogs.com/png.latex?%5Cmu%20%5Capprox%202.643">.</p></li>
</ol>
</section>
<section id="generating-saws" class="level3">
<h3 class="anchored" data-anchor-id="generating-saws">Generating SAWs</h3>
<p>The previous discussion shows that, once we know how to generate uniform SAWs, we can estimate <img src="https://latex.codecogs.com/png.latex?%5Cmu"> relatively easily. One of the most common methods is the pivot algorithm: see <a href="https://clisby.net/projects/sm_simulator/">here</a> for a nice visualization. The principle is simple: given a SAW, randomly select a pivot site and apply a symmetry operation (like rotation or reflection) to one part of the walk. If the resulting walk remains self-avoiding, accept it; otherwise, reject it. Repeating this process generates diverse, approximately uniform SAWs.<br>
<em>[<strong>Note to students</strong>: explain this much more clearly if you decide to use it]</em></p>
<p>In short, the pivot algorithm updates a SAW by applying a symmetry operation to a subpath. Given a SAW <img src="https://latex.codecogs.com/png.latex?z_%7B0:L%7D">, one can obtain another SAW by applying to it the pivot algorithm a (large) number of times. To obtain a nearly independent SAW of length <img src="https://latex.codecogs.com/png.latex?L"> starting from <img src="https://latex.codecogs.com/png.latex?z_%7B0:L%7D">, one typically needs to apply about <img src="https://latex.codecogs.com/png.latex?L"> pivot steps. While it can be slow for large <img src="https://latex.codecogs.com/png.latex?L">, it is far more efficient than naive importance sampling.<br>
<em>[<strong>Note to students</strong>: efficiently implementing the pivot algorithm is non-trivial, but LLM assistants can help a lot, and are actually quite useful for code optimization]</em></p>
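<p>A deliberately naive sketch of one pivot move (O(L) per step; the efficient implementations mentioned above use specialised data structures; names are mine): pick an interior pivot site, apply a random lattice symmetry to the tail, and accept only if the result is still self-avoiding.</p>

```python
import random

# the eight lattice symmetries of Z^2 (rotations and reflections)
SYMMETRIES = [
    lambda x, y: (x, y),   lambda x, y: (-y, x),
    lambda x, y: (-x, -y), lambda x, y: (y, -x),
    lambda x, y: (-x, y),  lambda x, y: (x, -y),
    lambda x, y: (y, x),   lambda x, y: (-y, -x),
]

def pivot_step(walk, rng):
    """One pivot move; returns the new walk (the old walk if rejected)."""
    L = len(walk) - 1
    k = rng.randrange(1, L)           # interior pivot site
    g = rng.choice(SYMMETRIES)
    px, py = walk[k]
    tail = [(px + gx, py + gy)
            for gx, gy in (g(x - px, y - py) for x, y in walk[k + 1:])]
    new = walk[: k + 1] + tail
    # lattice symmetries preserve step lengths, so only self-avoidance
    # needs to be re-checked
    return new if len(set(new)) == len(new) else walk
```

<p>Starting from a straight walk and iterating this move preserves all the SAW invariants (origin, length, unit steps, self-avoidance) while quickly decorrelating the configuration.</p>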
</section>
<section id="sequential-monte-carlo" class="level3">
<h3 class="anchored" data-anchor-id="sequential-monte-carlo">Sequential Monte Carlo</h3>
<p>To estimate <img src="https://latex.codecogs.com/png.latex?c_L"> for large <img src="https://latex.codecogs.com/png.latex?L">, one can use Sequential Monte Carlo (SMC). The idea is to grow a population of <img src="https://latex.codecogs.com/png.latex?N"> SAWs in parallel and estimate <img src="https://latex.codecogs.com/png.latex?c_L"> by recursively estimating the ratios <img src="https://latex.codecogs.com/png.latex?c_%7BL+1%7D/c_L">. Suppose you have <img src="https://latex.codecogs.com/png.latex?N"> SAWs of length <img src="https://latex.codecogs.com/png.latex?L">. Try to extend each SAW by choosing a neighbor of the last vertex that has not been visited yet. This is a form of importance sampling, giving <img src="https://latex.codecogs.com/png.latex?N"> new walks of length <img src="https://latex.codecogs.com/png.latex?L+1"> with associated weights (some of them being non-valid walks!). Then, <em>resample</em> <img src="https://latex.codecogs.com/png.latex?N"> times from this weighted set to get <img src="https://latex.codecogs.com/png.latex?N"> new SAWs of length <img src="https://latex.codecogs.com/png.latex?L+1"> (with possible duplicates). Apply the pivot algorithm to eliminate these duplicates and generate more diverse SAWs.<br>
<em>[<strong>Note to students</strong>: if you decide to use SMC, explain it much more clearly. It’s not entirely straightforward to understand or implement, but it is one of the most powerful and versatile Monte Carlo methods to this day. A good investment of your time if you decide to understand SMC]</em></p>
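<p>A minimal SMC sketch along these lines (plain Python, multinomial resampling, and without the pivot moves, so diversity is limited; names and parameters are mine). At each step, the ratio c_{L+1}/c_L is estimated by the average number of free neighbours over the current population, and c_L by the product of these ratios.</p>

```python
import math
import random

def smc_count(L, N, rng):
    """SMC estimate of c_L: maintain N SAWs, estimate each ratio
    c_{t+1}/c_t by the average number of free neighbours, then resample."""
    particles = [([(0, 0)], {(0, 0)})] * N
    log_c = 0.0
    for _ in range(L):
        frees = []
        for path, visited in particles:
            x, y = path[-1]
            frees.append([p for p in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1))
                          if p not in visited])
        weights = [len(f) for f in frees]
        if sum(weights) == 0:                  # every particle is stuck
            return 0.0
        log_c += math.log(sum(weights) / N)    # estimate of c_{t+1} / c_t
        # multinomial resampling proportional to the number of extensions,
        # then extend each survivor by one uniformly chosen free neighbour
        idx = rng.choices(range(N), weights=weights, k=N)
        new = []
        for i in idx:
            path, visited = particles[i]
            step = rng.choice(frees[i])
            new.append((path + [step], visited | {step}))
        particles = new
    return math.exp(log_c)
```

<p>Because duplicates accumulate after resampling, interleaving pivot moves, as suggested above, greatly improves the diversity of the population.</p>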
<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/SAW/SMC_mu_estimates.png" class="img-fluid figure-img" style="width:95.0%"></p>
<figcaption>Sequential Monte Carlo</figcaption>
</figure>
</div>
</div>
</section>
<section id="improving-the-estimation-of-mu" class="level3">
<h3 class="anchored" data-anchor-id="improving-the-estimation-of-mu">Improving the estimation of <img src="https://latex.codecogs.com/png.latex?%5Cmu"></h3>
<p>Suppose you have estimates of <img src="https://latex.codecogs.com/png.latex?(%5Clog%20c_L)/L"> for various <img src="https://latex.codecogs.com/png.latex?L">. Since</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7B%5Clog%20c_L%7D%7BL%7D%20%5Capprox%20%5Clog%20A%20%5Ccdot%20%5Cfrac%7B1%7D%7BL%7D%20+%20%5Clog%20%5Cmu%20+%20%5Cgamma%20%5Ccdot%20%5Cfrac%7B%5Clog%20L%7D%7BL%7D,%0A"></p>
<p>you can fit a linear regression to estimate <img src="https://latex.codecogs.com/png.latex?%5Clog%20A">, <img src="https://latex.codecogs.com/png.latex?%5Clog%20%5Cmu">, and <img src="https://latex.codecogs.com/png.latex?%5Cgamma">. I tried this approach using a naive and non-optimized SMC implementation with <img src="https://latex.codecogs.com/png.latex?N=1000"> and <img src="https://latex.codecogs.com/png.latex?L=1000">, running for 10 hours on a free (and bad) online CPU, and obtained <img src="https://latex.codecogs.com/png.latex?%5Cmu%20%5Capprox%202.6366">.<br>
<em>[<strong>Note to students</strong>: can you do much better?]</em></p>
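<p>As an illustration of the regression itself (not of the Monte Carlo part), one can fit this model to the exact counts c_1, ..., c_10, which are known; with such short walks the asymptotic form is only approximate, so the recovered value of mu is rough, but it already lands in the right ballpark.</p>

```python
import numpy as np

# exact counts c_1, ..., c_10 of 2D SAWs (OEIS A001411)
c = np.array([4, 12, 36, 100, 284, 780, 2172, 5916, 16268, 44100], float)
L = np.arange(1, 11, dtype=float)

# (log c_L)/L  ~  log(A)/L + log(mu) + gamma * log(L)/L
X = np.column_stack([1.0 / L, np.ones_like(L), np.log(L) / L])
coef, *_ = np.linalg.lstsq(X, np.log(c) / L, rcond=None)
log_A, log_mu, gamma = coef
mu_hat = float(np.exp(log_mu))
```

<p>Feeding the same regression with Monte Carlo estimates of c_L for much larger L is what sharpens the estimate.</p>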
</section>
<section id="running-long-simulations" class="level3">
<h3 class="anchored" data-anchor-id="running-long-simulations">Running long simulations</h3>
<p>The best known estimate of <img src="https://latex.codecogs.com/png.latex?%5Cmu"> required several tens of thousands of CPU hours. While writing these notes, I was able to run simulations easily and for free using <a href="https://deepnote.com">deepNote</a>: it was my first time using it, and it was very user friendly. This allowed me to run simulations for 8 hours on a (free but slow) CPU without issue. Launch simulations in the evening and let them run overnight. <em>[<strong>Note to students</strong>: for the more motivated ones, you can try writing GPU-friendly code to run simulations, possibly on Google Colab]</em></p>



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0">
<div id="ref-rosenbluth1955monte" class="csl-entry">
Rosenbluth, Marshall N, and Arianna W Rosenbluth. 1955. <span>“Monte Carlo Calculation of the Average Extension of Molecular Chains.”</span> <em>The Journal of Chemical Physics</em> 23 (2). American Institute of Physics: 356–59.
</div>
</div></section></div> ]]></description>
  <category>monte-carlo</category>
  <guid>https://alexxthiery.github.io/notes/SAW/SAW.html</guid>
  <pubDate>Fri, 04 Apr 2025 17:00:00 GMT</pubDate>
</item>
<item>
  <title>Jarzynski and Crooks</title>
  <link>https://alexxthiery.github.io/notes/jarzynski/jarzynski.html</link>
  <description><![CDATA[ 





<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/jarzynski/jarzynski_crooks.png" class="img-fluid figure-img" style="width:80.0%"></p>
<figcaption><a href="https://en.wikipedia.org/wiki/Christopher_Jarzynski">Christopher Jarzynski</a> and <a href="https://en.wikipedia.org/wiki/Gavin_E._Crooks">Gavin Crooks</a></figcaption>
</figure>
</div>
</div>
<p>Consider a sequence of densities on <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BR%7D%5ED"> indexed by time parameter <img src="https://latex.codecogs.com/png.latex?t%20%5Cin%20%5B0,T%5D">,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cpi_t(x)%20%5C;%20=%20%5C;%20%5Cfrac%7B%20e%5E%7B-U_t(x)%7D%7D%7BZ_t%7D%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?U_t:%20%5Cmathbb%7BR%7D%5ED%20%5Cto%20%5Cmathbb%7BR%7D"> is a time-dependent potential function and <img src="https://latex.codecogs.com/png.latex?Z_t"> is the normalizing constant. In fact, we are really interested in the final density <img src="https://latex.codecogs.com/png.latex?%5Cpi_T">; the bridging sequence of densities <img src="https://latex.codecogs.com/png.latex?%5Cpi_t"> is just a tool to get there, starting from an initial and tractable density <img src="https://latex.codecogs.com/png.latex?%5Cpi_0">. If one initializes a particle <img src="https://latex.codecogs.com/png.latex?X_0%20%5Csim%20%5Cpi_0"> and evolves it according to the <a href="https://en.wikipedia.org/wiki/Langevin_equation">Langevin</a> dynamics</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AdX_t%20%5C;%20=%20%5C;%20-%5Cnabla%20U_t(X_t)%20%5C,%20dt%20+%20%5Csqrt%7B2%7D%20%5C,%20dW_t%0A"></p>
<p>one can hope that the distribution of <img src="https://latex.codecogs.com/png.latex?X_T"> will be close to <img src="https://latex.codecogs.com/png.latex?%5Cpi_T">. This would be the case if one evolved the particle according to <img src="https://latex.codecogs.com/png.latex?dX_t%20%5C;%20=%20%5C;%20-%5Cgamma%20%5Cnabla%20U_t(X_t)%20%5C,%20dt%20+%20%5Csqrt%7B2%20%5Cgamma%7D%20%5C,%20dW_t"> and let <img src="https://latex.codecogs.com/png.latex?%5Cgamma%20%5Cto%20%5Cinfty">, since in that case <img src="https://latex.codecogs.com/png.latex?X_t"> would be distributed according to <img src="https://latex.codecogs.com/png.latex?%5Cpi_t"> for all <img src="https://latex.codecogs.com/png.latex?t">. Can one correct the distribution of <img src="https://latex.codecogs.com/png.latex?X_T"> with importance sampling weights?</p>
<p>I like the approach presented in <span class="citation" data-cites="vargas2023transport">(Vargas et al. 2024)</span> and these notes are my attempt to understand it. One very fruitful idea that has been used in a number of works in the Monte-Carlo literature is to look at a probability distribution of interest as the marginal of a joint distribution and to carry out computations and build numerical methods on the joint distribution <span class="citation" data-cites="del2006sequential">(Del Moral, Doucet, and Jasra 2006)</span>. Indeed, there is a lot of flexibility in the choice of the joint distribution.</p>
<p>Here, we can also consider the diffusion process <img src="https://latex.codecogs.com/png.latex?Y_t"> that runs backward in time, is initialized according to <img src="https://latex.codecogs.com/png.latex?%5Cpi_T">, and follows the same Langevin dynamics (backward in time). Again, one expects the distribution of <img src="https://latex.codecogs.com/png.latex?Y_t"> to be close to <img src="https://latex.codecogs.com/png.latex?%5Cpi_t">. It is more intuitive to discuss discretized versions of these processes. For a time discretization <img src="https://latex.codecogs.com/png.latex?%5Cdelta%20=%20T/N">, we consider the forward Markov chain <img src="https://latex.codecogs.com/png.latex?X_t"> defined as:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Baligned%7D%0Ax_%7Bt%20+%20%5Cdelta%7D%20&amp;=%20x_t%20-%20%5Cnabla%20U_t(x_t)%20%5C,%20%5Cdelta%20+%20%5Csqrt%7B2%20%5Cdelta%7D%20%5C,%20%5Cxi_t%5C%5C%0Ax_0%20&amp;%5Csim%20%5Cpi_0%0A%5Cend%7Baligned%7D%0A%5Cright.%0A"></p>
<p>as well as the backward Markov chain <img src="https://latex.codecogs.com/png.latex?Y_t"> defined as <img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Baligned%7D%0Ay_%7Bt%7D%20&amp;=%20y_%7Bt%20+%20%5Cdelta%7D%20-%20%5Cnabla%20U_%7Bt%20+%20%5Cdelta%7D(y_%7Bt%20+%20%5Cdelta%7D)%20%5C,%20%5Cdelta%20+%20%5Csqrt%7B2%20%5Cdelta%7D%20%5C,%20%5Cxi_t%5C%5C%0Ay_T%20&amp;%5Csim%20%5Cpi_T.%0A%5Cend%7Baligned%7D%0A%5Cright.%0A"></p>
<p>The quantities <img src="https://latex.codecogs.com/png.latex?%5Cxi_t%20%5Csim%20%5Cmathcal%7BN%7D(0,I)"> are i.i.d. standard Gaussian random variables. Let us continue with these discretized versions and denote by <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D%5E%7BX%7D"> and <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D%5E%7BY%7D"> the probability measures associated with the discretized processes. For a discretized path <img src="https://latex.codecogs.com/png.latex?%5Cunderline%7Bz%7D%20=%20(z_0,%20z_%7B%5Cdelta%7D,%20%5Cldots,%20z_%7BT%7D)"> and notation <img src="https://latex.codecogs.com/png.latex?t_k%20=%20k%20%5Cdelta">, we have:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Baligned%7D%0A%5Cmathbb%7BP%7D%5EX(%5Cunderline%7Bz%7D)%20&amp;=%0A%5Cpi_0(z_0)%20%5C,%20%5Cexp%20%7B%5Cleft%5C%7B%20-%5Cfrac%7B1%7D%7B4%20%5Cdelta%7D%20%5Csum_%7Bk=0%7D%5E%7BN-1%7D%20%5C%7Cz_%7Bt_%7Bk+1%7D%7D%20-%20%5Bz_%7Bt_k%7D%20-%20%5Cnabla%20U_%7Bt_k%7D(z_%7Bt_k%7D)%5C,%5Cdelta%5D%5C%7C%5E2%20%5Cright%5C%7D%7D%20%5C%5C%0A%5Cmathbb%7BP%7D%5EY(%5Cunderline%7Bz%7D)%20&amp;=%0A%5Cpi_T(z_T)%20%5C,%20%5Cexp%20%7B%5Cleft%5C%7B%20-%5Cfrac%7B1%7D%7B4%20%5Cdelta%7D%20%5Csum_%7Bk=0%7D%5E%7BN-1%7D%20%5C%7Cz_%7Bt_%7Bk%7D%7D%20-%20%5Bz_%7Bt_%7Bk+1%7D%7D%20-%20%5Cnabla%20U_%7Bt_%7Bk+1%7D%7D(z_%7Bt_%7Bk+1%7D%7D)%5C,%5Cdelta%5D%5C%7C%5E2%20%5Cright%5C%7D%7D%20.%0A%5Cend%7Baligned%7D%0A"></p>
<p>One can compute the ratio <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D%5EY(z)%20/%20%5Cmathbb%7BP%7D%5EX(z)"> and examine its limit as <img src="https://latex.codecogs.com/png.latex?N%20%5Cto%20%5Cinfty">. Algebra gives:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7B%5Cmathbb%7BP%7D%5EY%7D%7B%5Cmathbb%7BP%7D%5EX%7D(%5Cunderline%7Bz%7D)%20=%0A%5Cfrac%7B%5Cpi_T(z_T)%7D%7B%5Cpi_0(z_0)%7D%20%5C,%0A%5Cexp%20%7B%5Cleft%5C%7B%20%5Csum_%7Bk=0%7D%5E%7BN-1%7D%20%5Cleft%3C%20z_%7Bt_%7Bk+1%7D%7D%20-%20z_%7Bt_k%7D,%20%5Cfrac%7B%5Cnabla%20U_%7Bt_k%7D(z_%7Bt_k%7D)%20+%20%5Cnabla%20U_%7Bt_%7Bk+1%7D%7D(z_%7Bt_%7Bk+1%7D%7D)%7D%7B2%7D%20%20%5Cright%3E%20%5Cright%5C%7D%7D%20%20%20+%20%5Cmathcal%7BO%7D(%5Cdelta%5E%7B1/2%7D).%0A"></p>
<p>One could probably use some <a href="https://en.wikipedia.org/wiki/Stratonovich_integral">Stratonovich</a> calculus to study this, but I always forget these things, so let’s use Ito instead. Write</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B%5Cnabla%20U_%7Bt_k%7D(z_%7Bt_k%7D)%20+%20%5Cnabla%20U_%7Bt_%7Bk+1%7D%7D(z_%7Bt_%7Bk+1%7D%7D)%7D%7B2%7D%0A=%0A%5Cnabla%20U_%7Bt_k%7D(z_%7Bt_k%7D)%20+%20%5Cfrac%7B1%7D%7B2%7D%20%5Cmathrm%7BHess%7D_%7BU_%7Bt_%7Bk%7D%7D%7D%20(z_%7Bt_%7Bk%7D%7D)%20(z_%7Bt_%7Bk+1%7D%7D%20-%20z_%7Bt_k%7D)%0A+%0A%5Cmathcal%7BO%7D(%5Cdelta).%0A"></p>
<p>Consequently, in the limit <img src="https://latex.codecogs.com/png.latex?N%20%5Cto%20%5Cinfty">, the ratio converges to:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7Bd%5Cmathbb%7BP%7D%5EY%7D%7Bd%5Cmathbb%7BP%7D%5EX%7D(%5Cunderline%7Bz%7D)%20%5C;%20=%20%5C;%0A%5Cfrac%7B%5Cpi_T(z_T)%7D%7B%5Cpi_0(z_0)%7D%20%5C,%0A%5Cexp%20%7B%5Cleft%5C%7B%20%20%5Cint_%7Bt=0%7D%5ET%20%5Cnabla%20U_t(z_t)%20%5C,%20dz_t%0A+%0A%5Cfrac%7B1%7D%7B2%7D%20%5Cint_0%5ET%20%5Cleft%3C%20dz_t,%20%5Cmathrm%7BHess%7D_%7BU_t%7D(z_t)%20%5C,%20dz_t%20%5Cright%3E%0A%5Cright%5C%7D%7D%20.%0A"></p>
<p>One can obtain a slightly simpler formula using Ito’s lemma. Since <img src="https://latex.codecogs.com/png.latex?d%20U_t(z_t)%20=%20%5Cpartial_t%20U_t(z_t)%20%5C,%20dt%20+%20%5Cleft%3C%20%5Cnabla%20U_t(z_t),%20dz_t%20%5Cright%3E%20+%20%5Cfrac%7B1%7D%7B2%7D%20%5Cleft%3C%20dz_t,%20%5Cmathrm%7BHess%7D_%7BU_t%7D(z_t)%20%5C,%20dz_t%20%5Cright%3E">, we also have:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7Bd%20%5Cmathbb%7BP%7D%5EY%7D%7Bd%20%5Cmathbb%7BP%7D%5EX%7D(%5Cunderline%7Bz%7D)%0A=%0A%5Cfrac%7B%5Cpi_T(z_T)%7D%7B%5Cpi_0(z_0)%7D%20%5C,%0A%5Cexp%20%7B%5Cleft%5C%7B%20%20U_T(z_T)%20-%20U_0(z_0)%20-%20%5Cint_0%5ET%20%5Cpartial_t%20U_t(z_t)%20%5C,%20dt%20%5Cright%5C%7D%7D%20.%0A"></p>
<p>Since <img src="https://latex.codecogs.com/png.latex?%5Cpi_t(z_t)%20=%20%5Cexp(-U_t(z_t))%20/%20Z_t">, this gives the <a href="https://en.wikipedia.org/wiki/Crooks_fluctuation_theorem">Crooks relation</a>:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7Bd%20%5Cmathbb%7BP%7D%5EY%7D%7Bd%20%5Cmathbb%7BP%7D%5EX%7D(%5Cunderline%7Bz%7D)%0A=%0A%5Cfrac%7BZ_0%7D%7BZ_T%7D%20%5C,%0A%5Cexp%20%7B%5Cleft%5C%7B%20%20-%20%5Cint_0%5ET%20%5Cpartial_t%20U_t(z_t)%20%5C,%20dt%20%5Cright%5C%7D%7D%20.%0A"></p>
<p>Integrating over trajectories of <img src="https://latex.codecogs.com/png.latex?X_t">, since <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D_%7BX%7D%5B(d%20%5Cmathbb%7BP%7D%5EY%20/%20d%20%5Cmathbb%7BP%7D%5EX)(X)%5D%20=%201">, one obtains the <a href="https://en.wikipedia.org/wiki/Jarzynski_equality">Jarzynski equality</a> <img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7BZ_T%7D%7BZ_0%7D%20%5C;%20=%20%5C;%20%5Cmathbb%7BE%7D_%7BX%7D%20%20%7B%5Cleft%5B%20%20%5Cexp%20%7B%5Cleft%5C%7B%20%20-%20%5Cint_0%5ET%20%5Cpartial_t%20U_t(X_t)%20%5C,%20dt%20%5Cright%5C%7D%7D%20%20%5Cright%5D%7D%0A"></p>
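<p>To make the Jarzynski equality concrete, here is a minimal Monte-Carlo sketch on an assumed toy example (not taken from the references): a Gaussian family with potential U_t(x) = (1+t) x^2/2, for which the exact ratio is Z_T/Z_0 = 1/sqrt(2). The forward chain is exactly the Euler discretization written above.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy example (not from the note): pi_t = N(0, 1/(1+t)), so that
# U_t(x) = (1+t) x^2 / 2,  dU_t/dt (x) = x^2 / 2,  and Z_T/Z_0 = sqrt(1/2).
T, N, n_particles = 1.0, 1000, 100_000
delta = T / N

x = rng.standard_normal(n_particles)       # X_0 ~ pi_0 = N(0, 1)
work = np.zeros(n_particles)               # accumulates int_0^T dU_t/dt (X_t) dt
for k in range(N):
    t = k * delta
    work += 0.5 * x**2 * delta
    # Euler step of the (uncontrolled) Langevin dynamics
    x += -(1.0 + t) * x * delta + np.sqrt(2 * delta) * rng.standard_normal(n_particles)

estimate = np.exp(-work).mean()            # Jarzynski estimate of Z_T / Z_0
print(estimate, np.sqrt(0.5))              # the two values should be close
```

<p>The empirical average of exp(-work) matches the exact ratio up to Monte-Carlo noise and a discretization bias of order delta.</p>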
<p>which is indeed also central to sequential Monte-Carlo methods. As described in <span class="citation" data-cites="vargas2023transport">(Vargas et al. 2024)</span>, the same approach can be used to slightly generalize the Crooks relation. Indeed, suppose that one instead considers the dynamics:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AdX_t%20%5C;%20=%20%5C;%20-%5Cnabla%20U_t(X_t)%20%5C,%20dt%20%20%5Ctextcolor%7Bblue%7D%7B+%20b_t(X_t)%20%5C,%20dt%7D%20+%20%5Csqrt%7B2%7D%20%5C,%20dW_t%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?b:%20%5Cmathbb%7BR%7D%5ED%20%5Cto%20%5Cmathbb%7BR%7D%5ED"> is a control function. One can consider the backward dynamics <img src="https://latex.codecogs.com/png.latex?Y_t"> that is initialized according to <img src="https://latex.codecogs.com/png.latex?%5Cpi_T"> and follows the dynamics <img src="https://latex.codecogs.com/png.latex?dY_t%20=%20-%5Cnabla%20U_t(Y_t)%20%5C,%20dt%20%20%5Ctextcolor%7Bred%7D%7B-%7D%20b(Y_t)%20%5C,%20dt%20+%20%5Csqrt%7B2%7D%20%5C,%20dW_t"> backward in time, i.e.</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Baligned%7D%0Ax_%7Bt%20+%20%5Cdelta%7D%20&amp;=%20x_t%20-%20%5Cnabla%20U_t(x_t)%20%5C,%20%5Cdelta%20+%20b_t(x_t)%20%5C,%20%5Cdelta%20+%20%5Csqrt%7B2%20%5Cdelta%7D%20%5C,%20%5Cxi_t%5C%5C%0Ay_%7Bt%7D%20&amp;=%20y_%7Bt%20+%20%5Cdelta%7D%20-%20%5Cnabla%20U_%7Bt%20+%20%5Cdelta%7D(y_%7Bt%20+%20%5Cdelta%7D)%20%5C,%20%5Cdelta%20%20%5Ctextcolor%7Bred%7D%7B-%7D%20b_%7Bt%20+%20%5Cdelta%7D(y_%7Bt%20+%20%5Cdelta%7D)%20%5C,%20%5Cdelta%20+%20%5Csqrt%7B2%20%5Cdelta%7D%20%5C,%20%5Cxi_t.%0A%5Cend%7Baligned%7D%0A%5Cright.%0A"></p>
<p>The minus sign for the backward dynamics <img src="https://latex.codecogs.com/png.latex?Y_t"> is natural. Indeed, one would like the dynamics of <img src="https://latex.codecogs.com/png.latex?Y_t"> to be as close as possible to the time-reversal of the dynamics of <img src="https://latex.codecogs.com/png.latex?X_t">. Furthermore, one knows that the <a href="../../notes/reverse_and_tweedie/reverse_and_tweedie.html">backward dynamics</a> of <img src="https://latex.codecogs.com/png.latex?X_t"> is given by</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AdY_t%20=%20%20%5Ctextcolor%7Bred%7D%7B+%7D%20%5Cnabla%20U_t(Y_t)%20%5C,%20dt%20%20%5Ctextcolor%7Bred%7D%7B-%7D%20b_t(Y_t)%20%5C,%20dt%20+%202%20%5Cnabla%20%5Clog%20p_t(Y_t)%20%5C,%20dt%20+%20%5Csqrt%7B2%7D%20%5C,%20dW_t%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?p_t"> is the marginal distribution of <img src="https://latex.codecogs.com/png.latex?X_t"> at time <img src="https://latex.codecogs.com/png.latex?t">. Since we would like <img src="https://latex.codecogs.com/png.latex?p_t%20=%20%5Cpi_t%20=%20%5Cexp(-U_t)%20/%20Z_t">, this gives</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AdY_t%20=%20-%5Cnabla%20U_t(Y_t)%20%5C,%20dt%20%20%5Ctextcolor%7Bred%7D%7B-%7D%20b_t(Y_t)%20%5C,%20dt%20+%20%5Csqrt%7B2%7D%20%5C,%20dW_t,%0A"></p>
<p>which is exactly the dynamics we chose for <img src="https://latex.codecogs.com/png.latex?Y_t">. One can then follow the exact same steps as previously, using that the quadratic variation is <img src="https://latex.codecogs.com/png.latex?%5Cleft%3C%20dz_t,%20dz_t%20%5Cright%3E%20=%202%20%5C,%20dt">, to obtain that</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7Bd%20%5Cmathbb%7BP%7D%5EY%7D%7Bd%20%5Cmathbb%7BP%7D%5EX%7D(%5Cunderline%7Bz%7D)%0A=%0A%5Cfrac%7BZ_0%7D%7BZ_T%7D%20%5C,%0A%5Cexp%20%7B%5Cleft%5C%7B%20%20%5Cint_0%5ET%20-%5Cpartial_t%20U_t(z_t)%20%20%5Ctextcolor%7Bblue%7D%7B+%20%5Cnabla%20%5Ccdot%20b_t(z_t)%20-%20%5Cleft%3C%20%5Cnabla%20U_t(z_t),%20b_t(z_t)%20%5Cright%3E%7D%20%5C,%20dt%20%5Cright%5C%7D%7D%20.%0A"></p>
<p>This for example shows that, for <img src="https://latex.codecogs.com/png.latex?dX_t%20%5C;%20=%20%5C;%20-%5Cnabla%20U_t(X_t)%20%5C,%20dt%20%20%5Ctextcolor%7Bblue%7D%7B+%20b_t(X_t)%20%5C,%20dt%7D%20+%20%5Csqrt%7B2%7D%20%5C,%20dW_t"> initialized according to <img src="https://latex.codecogs.com/png.latex?%5Cpi_0">, we have:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7BZ_T%7D%7BZ_0%7D%20%5C;%20=%20%5C;%20%5Cmathbb%7BE%7D_%7BX%7D%20%20%7B%5Cleft%5C%7B%20%20%5Cexp%20%7B%5Cleft%5C%7B%20%20%5Cint_0%5ET%20-%5Cpartial_t%20U_t(X_t)%20%20%5Ctextcolor%7Bblue%7D%7B+%20%5Cnabla%20%5Ccdot%20b_t(X_t)%20-%20%5Cleft%3C%20%5Cnabla%20U_t(X_t),%20b_t(X_t)%20%5Cright%3E%7D%20%5C,%20dt%20%5Cright%5C%7D%7D%20%20%5Cright%5C%7D%7D%20.%0A"></p>
<p>This generalization of the Crooks relation is also explored in <span class="citation" data-cites="albergo2024nets">(Albergo and Vanden-Eijnden 2024)</span>, where an alternative derivation is given by directly exploiting the <a href="https://en.wikipedia.org/wiki/Fokker–Planck_equation">Fokker-Planck</a> equation. Crucially, <span class="citation" data-cites="albergo2024nets">(Albergo and Vanden-Eijnden 2024)</span> note that, if the control function <img src="https://latex.codecogs.com/png.latex?b_t:%20%5Cmathbb%7BR%7D%5ED%20%5Cto%20%5Cmathbb%7BR%7D%5ED"> is chosen so that</p>
<p><span id="eq-nets"><img src="https://latex.codecogs.com/png.latex?%0A-%5Cpartial_t%20U_t(x)%20+%20%5Cnabla%20%5Ccdot%20b_t(x)%20-%20%5Cleft%3C%20%5Cnabla%20U_t(x),%20b_t(x)%20%5Cright%3E%0A=%0A%5Cfrac%7Bd%7D%7Bdt%7D%20%5C,%20%5Clog%20Z_t%0A%5Ctag%7B1%7D"></span></p>
<p>then the term <img src="https://latex.codecogs.com/png.latex?%5Cint_0%5ET%20-%5Cpartial_t%20U_t(X_t)%20+%20%5Cnabla%20%5Ccdot%20b_t(X_t)%20-%20%5Cleft%3C%20%5Cnabla%20U_t(X_t),%20b_t(X_t)%20%5Cright%3E%20%5C,%20dt"> is deterministic (the same for every trajectory), which gives a zero-variance estimator of the free energy difference <img src="https://latex.codecogs.com/png.latex?%5Clog(Z_T/Z_0)">. Of course, solving the high-dimensional PDE in Equation&nbsp;1 is a formidable challenge, and <span class="citation" data-cites="albergo2024nets">(Albergo and Vanden-Eijnden 2024)</span> propose interesting <a href="https://en.wikipedia.org/wiki/Physics-informed_neural_networks">PINNs</a>-based methods to do so.</p>
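<p>In simple cases, Equation&nbsp;1 can be solved by hand, which makes the zero-variance claim easy to verify numerically. In the assumed Gaussian example U_t(x) = (1+t) x^2/2, so that d/dt log Z_t = -1/(2(1+t)), one checks that the linear control b_t(x) = -x/(2(1+t)) satisfies Equation&nbsp;1 exactly; this choice of b_t is specific to this toy example, not a recipe from the papers. A short simulation then shows that every particle carries the same weight:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy example: pi_t = N(0, 1/(1+t)), U_t(x) = (1+t) x^2 / 2.
# The control b_t(x) = -x / (2 (1+t)) solves Equation 1, whose right-hand
# side is d/dt log Z_t = -1 / (2 (1+t)).
T, N, n_particles = 1.0, 1000, 10_000
delta = T / N

x = rng.standard_normal(n_particles)           # X_0 ~ pi_0 = N(0, 1)
work = np.zeros(n_particles)
for k in range(N):
    a = 1.0 + k * delta
    b = -x / (2 * a)                           # control solving Equation 1
    div_b = -1 / (2 * a)                       # divergence of b_t (D = 1)
    grad_U = a * x
    # integrand:  -dU_t/dt + div(b_t) - <grad U_t, b_t>
    work += (-0.5 * x**2 + div_b - grad_U * b) * delta
    x += (-grad_U + b) * delta + np.sqrt(2 * delta) * rng.standard_normal(n_particles)

weights = np.exp(work)                         # estimates of Z_T / Z_0 = sqrt(1/2)
print(weights.mean(), weights.std(), np.sqrt(0.5))
```

<p>The standard deviation of the weights is zero up to floating-point noise: the integrand collapses to the deterministic function -1/(2(1+t)), as predicted.</p>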
<section id="some-references" class="level3">
<h3 class="anchored" data-anchor-id="some-references">Some References:</h3>
<ul>
<li>The original papers by Jarzynski <span class="citation" data-cites="jarzynski1997nonequilibrium">(Jarzynski 1997)</span> and Crooks <span class="citation" data-cites="crooks1999entropy">(Crooks 1999)</span>.</li>
<li>The book <span class="citation" data-cites="stoltz2010free">(Stoltz, Rousset, et al. 2010)</span> is excellent!</li>
<li>The two papers that prompted these notes: <span class="citation" data-cites="vargas2023transport">(Vargas et al. 2024)</span> and <span class="citation" data-cites="albergo2024nets">(Albergo and Vanden-Eijnden 2024)</span>.</li>
</ul>



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0">
<div id="ref-albergo2024nets" class="csl-entry">
Albergo, Michael S, and Eric Vanden-Eijnden. 2024. <span>“Nets: A Non-Equilibrium Transport Sampler.”</span> <em>arXiv Preprint arXiv:2410.02711</em>.
</div>
<div id="ref-crooks1999entropy" class="csl-entry">
Crooks, Gavin E. 1999. <span>“Entropy Production Fluctuation Theorem and the Nonequilibrium Work Relation for Free Energy Differences.”</span> <em>Physical Review E</em> 60 (3). APS: 2721.
</div>
<div id="ref-del2006sequential" class="csl-entry">
Del Moral, Pierre, Arnaud Doucet, and Ajay Jasra. 2006. <span>“Sequential Monte Carlo Samplers.”</span> <em>Journal of the Royal Statistical Society Series B: Statistical Methodology</em> 68 (3). Oxford University Press: 411–36.
</div>
<div id="ref-jarzynski1997nonequilibrium" class="csl-entry">
Jarzynski, Christopher. 1997. <span>“Nonequilibrium Equality for Free Energy Differences.”</span> <em>Physical Review Letters</em> 78 (14). APS: 2690.
</div>
<div id="ref-stoltz2010free" class="csl-entry">
Stoltz, Gabriel, Mathias Rousset, et al. 2010. <em>Free Energy Computations: A Mathematical Perspective</em>. World Scientific.
</div>
<div id="ref-vargas2023transport" class="csl-entry">
Vargas, Francisco, Shreyas Padhy, Denis Blessing, and Nikolas Nüsken. 2024. <span>“Transport Meets Variational Inference: Controlled Monte Carlo Diffusions.”</span> <em>ICLR 2024</em>.
</div>
</div></section></div> ]]></description>
  <category>SDE</category>
  <category>markov</category>
  <guid>https://alexxthiery.github.io/notes/jarzynski/jarzynski.html</guid>
  <pubDate>Fri, 21 Feb 2025 17:00:00 GMT</pubDate>
</item>
<item>
  <title>Basic Conformal Inference</title>
  <link>https://alexxthiery.github.io/notes/conformal_inference/conformal.html</link>
  <description><![CDATA[ 





<p>Unfortunately, I am totally ignorant about <a href="https://en.wikipedia.org/wiki/Conformal_prediction">conformal inference</a>. However, in today’s seminar, I attended a very interesting talk on the topic, and I think it’s time I try implementing the most basic version of it. It seems like a useful concept, and I might even explain it next semester in my simulation class. What I’ll describe below is the simplest version of conformal inference. There appear to be many extensions and variations of it, most of which I don’t yet understand. For now, I just want to spend a few minutes implementing it myself to ensure I grasp the basic idea.</p>
<p>Consider the (simulated) 1D dataset <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BD%7D%20=%20%5C%7Bx_i,%20y_i%5C%7D_%7Bi=1%7D%5EN"> below; our goal is to build prediction confidence intervals <img src="https://latex.codecogs.com/png.latex?%5BL(x),%20U(x)%5D"> for the target variable <img src="https://latex.codecogs.com/png.latex?y"> given a new input <img src="https://latex.codecogs.com/png.latex?x">. Crucially, we would like these predictions to be well-calibrated in the sense that <img src="https://latex.codecogs.com/png.latex?y%20%5Cin%20%5BL(x),%20U(x)%5D"> with probability <img src="https://latex.codecogs.com/png.latex?90%5C%25">, say.</p>
<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/conformal_inference/data.png" class="img-fluid figure-img" style="width:80.0%"></p>
<figcaption>1D regression dataset</figcaption>
</figure>
</div>
</div>
<p>I am lazy so I will be using a simple KNN regressor to predict the target variable <img src="https://latex.codecogs.com/png.latex?y"> given a new input <img src="https://latex.codecogs.com/png.latex?x">. For this purpose, split the dataset <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BD%7D"> into two parts <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BD%7D_%7B%5Ctext%7Btrain%7D%7D"> and <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BD%7D_%7B%5Ctext%7BCal%7D%7D">. The regressor is fitted on <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BD%7D_%7B%5Ctext%7Btrain%7D%7D">. To calibrate the prediction intervals, compute the residuals <img src="https://latex.codecogs.com/png.latex?r_i%20=%20%7Cy_i%20-%20%5Chat%7By%7D_i%7C"> on the calibration set <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BD%7D_%7B%5Ctext%7BCal%7D%7D">, where <img src="https://latex.codecogs.com/png.latex?%5Chat%7By%7D_i%20=%20%5Chat%7By%7D(x_i)"> is the prediction of the regressor on <img src="https://latex.codecogs.com/png.latex?x_i">. One can then compute the <img src="https://latex.codecogs.com/png.latex?90%5C%25"> quantile <img src="https://latex.codecogs.com/png.latex?%5Cgamma_%7B90%5C%25%7D"> of the residuals: with probability <img src="https://latex.codecogs.com/png.latex?90%5C%25"> we have that <img src="https://latex.codecogs.com/png.latex?y_i%20%5Cin%20%5B%5Chat%7By%7D_i%20-%20%5Cgamma_%7B90%5C%25%7D,%20%5Chat%7By%7D_i%20+%20%5Cgamma_%7B90%5C%25%7D%5D"> on the calibration set, and this can be used to build the prediction intervals, as displayed below:</p>
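<p>The whole recipe fits in a few lines. Below is a minimal sketch with simulated data and a hand-rolled KNN regressor in plain NumPy; the dataset, the noise model and the regressor are assumptions of this sketch, not the ones used for the figures:</p>

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated heteroscedastic 1D dataset (an assumption of this sketch)
n = 2000
x = rng.uniform(0, 5, n)
y = np.sin(2 * x) + 0.3 * (1 + x) * rng.standard_normal(n)

# Split into a training half and a calibration half
idx = rng.permutation(n)
train, cal = idx[: n // 2], idx[n // 2:]

def knn_predict(x_query, x_ref, y_ref, k=30):
    """Naive k-nearest-neighbour regression in 1D."""
    d = np.abs(x_query[:, None] - x_ref[None, :])
    nn = np.argpartition(d, k, axis=1)[:, :k]
    return y_ref[nn].mean(axis=1)

# 90% quantile of the absolute residuals on the calibration set
residuals = np.abs(y[cal] - knn_predict(x[cal], x[train], y[train]))
gamma = np.quantile(residuals, 0.90)

# Prediction interval for a new input: [y_hat - gamma, y_hat + gamma].
# Sanity check: marginal coverage on fresh data should be close to 90%.
x_test = rng.uniform(0, 5, 2000)
y_test = np.sin(2 * x_test) + 0.3 * (1 + x_test) * rng.standard_normal(2000)
covered = np.abs(y_test - knn_predict(x_test, x[train], y[train])) <= gamma
print(covered.mean())    # marginal coverage, close to 0.90
```

<p>Note that the coverage guarantee holds marginally over fresh exchangeable pairs regardless of how bad the KNN regressor is; only the interval width suffers from a poor fit.</p>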
<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/conformal_inference/conformal_basic.png" class="img-fluid figure-img" style="width:80.0%"></p>
<figcaption>1D regression dataset: basic conformal inference</figcaption>
</figure>
</div>
</div>
<p>Not terribly impressive, but at least it is entirely straightforward to implement and it has the correct (marginal) coverage: for a new pair <img src="https://latex.codecogs.com/png.latex?(X,Y)"> coming from the same distribution as the training data, the probability that <img src="https://latex.codecogs.com/png.latex?Y"> falls within the prediction interval is indeed <img src="https://latex.codecogs.com/png.latex?90%5C%25">, up to a bit of nitpicking. Note that it is much, much less impressive than saying that</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D%20%7B%5Cleft(%20Y%20%5Cin%20%5B%5Chat%7By%7D(x)%20-%20%5Cgamma_%7B90%5C%25%7D,%20%5Chat%7By%7D(x)%20+%20%5Cgamma_%7B90%5C%25%7D%5D%20%5C;%20%7C%20%5C;%20X=x%20%5Cright)%7D%20%20=%2090%5C%25,"></p>
<p>which is clearly not true as can be seen from the figure above, but it is a good start. As a matter of fact, I’ve learned today from the very nice talk that without other assumptions, it is impossible to design a procedure that would guarantee the above so-called conditional coverage <span class="citation" data-cites="lei2014distribution">(Lei and Wasserman 2014)</span>. But let’s face it, the figure above is terribly unimpressive. Nevertheless, one can indeed make it slightly less useless by calibrating using a different strategy. For example, I can use the training set to estimate the Mean Absolute Deviation (MAD) of the residuals <img src="https://latex.codecogs.com/png.latex?%5Csigma(x)%20=%20%5Cmathbb%7BE%7D%5B%20%7CY%20-%20%5Chat%7By%7D(x)%20%7C%20%5C;%20%7C%20%5C;%20X=x%5D"> (again with a naive KNN regressor) and use the calibration set to estimate the <img src="https://latex.codecogs.com/png.latex?90%5C%25"> quantile <img src="https://latex.codecogs.com/png.latex?%5Cgamma_%7B90%5C%25%7D"> of the quantities <img src="https://latex.codecogs.com/png.latex?%7Cy_i%20-%20%5Chat%7By%7D_i%7C%20/%20%5Csigma(x_i)">. This allows one to produce calibrated prediction intervals of the type <img src="https://latex.codecogs.com/png.latex?%5B%5Chat%7By%7D_i%20-%20%5Cgamma_%7B90%5C%25%7D%20%5Csigma(x_i),%20%5Chat%7By%7D_i%20+%20%5Cgamma_%7B90%5C%25%7D%20%5Csigma(x_i)%5D">, which are displayed below:</p>
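<p>The adaptive version needs only one extra ingredient: an estimate of the conditional MAD sigma(x), obtained here by a second naive KNN regression on the absolute training residuals. As before, the simulated data and the NumPy regressor are assumptions of this sketch:</p>

```python
import numpy as np

rng = np.random.default_rng(2)

# Same kind of assumed heteroscedastic dataset as in the basic sketch
n = 3000
x = rng.uniform(0, 5, n)
y = np.sin(2 * x) + 0.3 * (1 + x) * rng.standard_normal(n)
idx = rng.permutation(n)
train, cal = idx[: n // 2], idx[n // 2:]

def knn_mean(x_query, x_ref, y_ref, k=30):
    """Naive k-nearest-neighbour regression in 1D."""
    d = np.abs(x_query[:, None] - x_ref[None, :])
    nn = np.argpartition(d, k, axis=1)[:, :k]
    return y_ref[nn].mean(axis=1)

# sigma(x): KNN regression of the absolute residuals on the training set
# (in-sample residuals, slightly optimistic -- the calibration step corrects it)
res_train = np.abs(y[train] - knn_mean(x[train], x[train], y[train]))

# 90% quantile of the *normalized* calibration residuals
scores = np.abs(y[cal] - knn_mean(x[cal], x[train], y[train]))
scores /= knn_mean(x[cal], x[train], res_train)
gamma = np.quantile(scores, 0.90)

# Adaptive interval: [y_hat - gamma * sigma(x), y_hat + gamma * sigma(x)].
# Marginal coverage on fresh data is again close to 90%, but the interval
# is now wider where the noise is larger.
x_test = rng.uniform(0, 5, 2000)
y_test = np.sin(2 * x_test) + 0.3 * (1 + x_test) * rng.standard_normal(2000)
half_width = gamma * knn_mean(x_test, x[train], res_train)
covered = np.abs(y_test - knn_mean(x_test, x[train], y[train])) <= half_width
print(covered.mean())
```
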
<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/conformal_inference/conformal.png" class="img-fluid figure-img" style="width:80.0%"></p>
<figcaption>1D regression dataset: less useless conformal inference</figcaption>
</figure>
</div>
</div>
<p>It is slightly more useful, and it is again surprisingly straightforward to implement, literally 5 lines of code. I think I will have to read more about this in the future and I am pretty sure I will introduce the idea to the next batch of students!</p>
<section id="readings" class="level4">
<h4 class="anchored" data-anchor-id="readings">Readings:</h4>
<ul>
<li>The introduction paper <span class="citation" data-cites="lei2018distribution">(Lei et al. 2018)</span> is really good</li>
<li>I am really curious about <span class="citation" data-cites="gibbs2023conformal">(Gibbs, Cherian, and Candès 2023)</span> and it’s next on my reading list</li>
</ul>



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0">
<div id="ref-gibbs2023conformal" class="csl-entry">
Gibbs, Isaac, John J Cherian, and Emmanuel J Candès. 2023. <span>“Conformal Prediction with Conditional Guarantees.”</span> <em>arXiv Preprint arXiv:2305.12616</em>.
</div>
<div id="ref-lei2018distribution" class="csl-entry">
Lei, Jing, Max G’Sell, Alessandro Rinaldo, Ryan J Tibshirani, and Larry Wasserman. 2018. <span>“Distribution-Free Predictive Inference for Regression.”</span> <em>Journal of the American Statistical Association</em> 113 (523). Taylor &amp; Francis: 1094–1111.
</div>
<div id="ref-lei2014distribution" class="csl-entry">
Lei, Jing, and Larry Wasserman. 2014. <span>“Distribution-Free Prediction Bands for Non-Parametric Regression.”</span> <em>Journal of the Royal Statistical Society Series B: Statistical Methodology</em> 76 (1). Oxford University Press: 71–96.
</div>
</div></section></div> ]]></description>
  <category>conformal</category>
  <guid>https://alexxthiery.github.io/notes/conformal_inference/conformal.html</guid>
  <pubDate>Sat, 07 Dec 2024 17:00:00 GMT</pubDate>
</item>
<item>
  <title>VIASM mini-course on diffusions and flows</title>
  <link>https://alexxthiery.github.io/notes/VIASM_2024/VIASM_2024.html</link>
  <description><![CDATA[ 





<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/VIASM_2024/viasm_2024.jpg" class="img-fluid quarto-figure quarto-figure-center figure-img" style="width:95.0%"></p>
</figure>
</div>
</div>
<p>The slides for this short course on diffusion models (denoising diffusions, probability flows) and other flow methods (stochastic interpolants, flow-matching) are available <a href="https://alexxthiery.github.io/viasm_2024/">here</a>. There are a few animations, so loading the slides may be slow…</p>



 ]]></description>
  <category>diffusion</category>
  <guid>https://alexxthiery.github.io/notes/VIASM_2024/VIASM_2024.html</guid>
  <pubDate>Mon, 29 Jul 2024 17:00:00 GMT</pubDate>
</item>
<item>
  <title>Doob, Girsanov and Bellman</title>
  <link>https://alexxthiery.github.io/notes/HJB/HJB.html</link>
  <description><![CDATA[ 





<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/HJB/bellman.jpg" class="img-fluid figure-img" style="width:35.0%"></p>
<figcaption><a href="https://en.wikipedia.org/wiki/Richard_E._Bellman">Richard Bellman</a> (1920 – 1984)</figcaption>
</figure>
</div>
</div>
<section id="change-of-measure" class="level3">
<h3 class="anchored" data-anchor-id="change-of-measure">Change of measure</h3>
<p>Consider a diffusion in <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BR%7D%5ED"> given by:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Balign*%7D%0AdX_t%20&amp;=%20b(X_t)dt%20+%20%5Csigma(X_t)%20%5C,%20dW_t%5C%5C%0AX_0%20&amp;%5Csim%20p_0(x_0)%0A%5Cend%7Balign*%7D%0A%5Cright.%0A"></p>
<p>for an initial distribution <img src="https://latex.codecogs.com/png.latex?p_0"> and for drift and volatility functions <img src="https://latex.codecogs.com/png.latex?b:%20%5Cmathbb%7BR%7D%5ED%20%5Cto%20%5Cmathbb%7BR%7D%5ED"> and <img src="https://latex.codecogs.com/png.latex?%5Csigma:%20%5Cmathbb%7BR%7D%5ED%20%5Cto%20%5Cmathbb%7BR%7D%5E%7BD%20%5Ctimes%20D%7D">. On the time interval <img src="https://latex.codecogs.com/png.latex?%5B0,T%5D">, this defines a probability <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D"> on the path-space <img src="https://latex.codecogs.com/png.latex?C(%5B0,T%5D;%5Cmathbb%7BR%7D%5ED)">. For two functions <img src="https://latex.codecogs.com/png.latex?f:%20%5B0,T%5D%20%5Ctimes%20%5Cmathbb%7BR%7D%5ED%20%5Cto%20%5Cmathbb%7BR%7D"> and <img src="https://latex.codecogs.com/png.latex?g:%20%5Cmathbb%7BR%7D%5ED%20%5Cto%20%5Cmathbb%7BR%7D">, consider the probability distribution <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BQ%7D"> defined as</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7Bd%20%5Cmathbb%7BQ%7D%7D%7Bd%20%5Cmathbb%7BP%7D%7D%20=%20%5Cfrac%7B1%7D%7B%5Cmathcal%7BZ%7D%7D%20%5Cexp%20%20%7B%5Cleft%5C%7B%20%20%5Cint_0%5ET%20f(X_s)%20%5C,%20ds%20+%20g(X_T)%20%20%5Cright%5C%7D%7D%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BZ%7D"> denotes the normalizing constant <span id="eq-normalizing-constant"><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7BZ%7D%5C;%20=%20%5C;%20%5Cmathbb%7BE%7D%20%7B%5Cleft%5B%20%20%5Cexp%20%20%7B%5Cleft%5C%7B%20%20%5Cint_0%5ET%20f(X_s)%20%5C,%20ds%20+%20g(X_T)%20%20%5Cright%5C%7D%7D%20%20%20%5Cright%5D%7D%20.%0A%5Ctag%7B1%7D"></span></p>
<p>The distribution <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BQ%7D"> places more probability mass on trajectories such that <img src="https://latex.codecogs.com/png.latex?%5Cint_0%5ET%20f(X_s)%20%5C,%20ds%20+%20g(X_T)"> is large. As described in these notes on <a href="../../notes/doob_transforms/doob.html">Doob h-transforms</a>, the path distribution <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BQ%7D"> can be described by a diffusion process <img src="https://latex.codecogs.com/png.latex?X%5E%5Cstar"> with dynamics</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Balign*%7D%0AdX%5E%5Cstar%20&amp;=%20b(X%5E%5Cstar)dt%20+%20%5Csigma(X%5E%5Cstar)%20%5C,%20%20%7B%5Cleft%5C%7B%20%20dW_t%20+%20%20%5Ctextcolor%7Bblue%7D%7Bu%5E%5Cstar(t,%20X%5E%5Cstar)%7D%20%5C,%20dt%20%20%5Cright%5C%7D%7D%20%5C%5C%0AX%5E%5Cstar_0%20&amp;%5Csim%20q_0(x_0)%20=%20p_0(x_0)%20%5C,%20h(0,x_0)%20/%20%5Cmathcal%7BZ%7D%0A%5Cend%7Balign*%7D%0A%5Cright.%0A"></p>
<p>The control function <img src="https://latex.codecogs.com/png.latex?%20%5Ctextcolor%7Bblue%7D%7Bu%5E%5Cstar:%20%5B0,T%5D%20%5Ctimes%20%5Cmathbb%7BR%7D%5ED%20%5Cto%20%5Cmathbb%7BR%7D%5ED%7D"> is of the gradient form</p>
<p><span id="eq-u-star"><img src="https://latex.codecogs.com/png.latex?%0A%5Ctextcolor%7Bblue%7D%7Bu%5E%5Cstar(t,%20x)%7D%20%5C;%20=%20%5C;%20%5Csigma%5E%5Ctop(x)%20%5C,%20%5Cnabla%20%5Clog%5B%20%20%5Ctextcolor%7Bgreen%7D%7Bh(t,x)%7D%20%5D%0A%5Ctag%7B2%7D"></span></p>
<p>and the function <img src="https://latex.codecogs.com/png.latex?%20%5Ctextcolor%7Bgreen%7D%7Bh(t,x)%7D"> is described by the conditional expectation,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Ctextcolor%7Bgreen%7D%7Bh(t,x)%20=%20%5Cmathbb%7BE%7D%20%7B%5Cleft%5B%20%20%5Cexp%20%20%7B%5Cleft%5C%7B%20%20%5Cint_t%5ET%20f(X_s)%20%5C,%20ds%20+%20g(X_T)%20%20%5Cright%5C%7D%7D%20%20%20%5Cmid%20X_t%20=%20x%20%20%5Cright%5D%7D%20%7D.%0A"></p>
<p>The expression <img src="https://latex.codecogs.com/png.latex?%20%5Ctextcolor%7Bblue%7D%7Bu%5E%5Cstar(t,%20x)%7D%20%5C;%20=%20%5C;%20%5Csigma%5E%5Ctop(x)%20%5C,%20%5Cnabla%20%5Clog%5B%20%20%5Ctextcolor%7Bgreen%7D%7Bh(t,x)%7D%20%5D"> is intuitive; to describe the tilted measure <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BQ%7D"> that places more probability mass on trajectories such that <img src="https://latex.codecogs.com/png.latex?%5Cexp%20%7B%5Cleft%5C%7B%20%20%5Cint_0%5ET%20f(X_s)%20%5C,%20ds%20+%20g(X_T)%20%20%5Cright%5C%7D%7D%20"> is large, the optimal control <img src="https://latex.codecogs.com/png.latex?u%5E%5Cstar(t,x)"> should point towards promising states, i.e.&nbsp;states such that the expected “reward-to-go” quantity <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D%20%7B%5Cleft%5B%20%20%5Cexp%20%20%7B%5Cleft%5C%7B%20%20%5Cint_t%5ET%20f(X_s)%20%5C,%20ds%20+%20g(X_T)%20%20%5Cright%5C%7D%7D%20%20%20%5Cmid%20X_t%20=%20x%20%20%5Cright%5D%7D%20"> is large.</p>
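<p>When no closed form is available, h(t,x), and hence the control, can always be approximated by brute-force Monte Carlo over the uncontrolled dynamics. The sketch below uses an assumed toy case (not from the note): X a standard Brownian motion (b = 0, sigma = 1), f = 0 and g(x) = -x^2/2, for which the Gaussian integral gives h(t,x) = exp(-x^2/(2(1+T-t)))/sqrt(1+T-t) and u*(t,x) = -x/(1+T-t), so the Monte-Carlo estimate can be checked exactly:</p>

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed toy case: b = 0, sigma = 1 (standard Brownian motion),
# f = 0 and g(x) = -x**2 / 2.  Since X_T | X_t = x  ~  N(x, T - t),
# the Gaussian integral gives h and u* in closed form.
T, t, x = 1.0, 0.3, 0.7
tau = T - t                                           # remaining time

samples = x + np.sqrt(tau) * rng.standard_normal(1_000_000)
h_mc = np.exp(-samples**2 / 2).mean()                 # Monte Carlo h(t, x)

h_exact = np.exp(-x**2 / (2 * (1 + tau))) / np.sqrt(1 + tau)
u_star = -x / (1 + tau)                               # sigma * d/dx log h(t, x)
print(h_mc, h_exact, u_star)
```

<p>The Monte-Carlo and closed-form values of h agree to a few decimals; differentiating log h in x then recovers the control pointing back towards the origin, where the terminal reward g is largest.</p>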
</section>
<section id="variational-formulation" class="level3">
<h3 class="anchored" data-anchor-id="variational-formulation">Variational Formulation</h3>
<p>To obtain a variational description of the optimal control function <img src="https://latex.codecogs.com/png.latex?%20%5Ctextcolor%7Bblue%7D%7Bu%5E%5Cstar%7D">, it suffices to express it as the solution of an optimization problem of the form</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Au%5E%5Cstar%20%5C;%20=%20%5C;%20%5Cmathop%7B%5Cmathrm%7Bargmin%7D%7D%5C;%20%5Ctextrm%7BDist%7D%20%7B%5Cleft(%20q%20%5Cotimes%20%5Cmathbb%7BP%7D%5E%7Bu%7D%20%5C,%20,%20%5C,%20%5Cmathbb%7BQ%7D%20%5Cright)%7D%0A"></p>
<p>for an appropriately chosen distance. Here <img src="https://latex.codecogs.com/png.latex?q%20%5Cotimes%20%5Cmathbb%7BP%7D%5E%7Bu%7D"> is the probability distribution of the controlled diffusion <img src="https://latex.codecogs.com/png.latex?X%5Eu"> with dynamics</p>
<p><span id="eq-Xu"><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Balign*%7D%0AdX%5Eu%20&amp;=%20b(X%5Eu)dt%20+%20%5Csigma(X%5Eu)%20%5C,%20%20%7B%5Cleft%5C%7B%20%20dW_t%20+%20%20%5Ctextcolor%7Bblue%7D%7Bu(t,%20X%5Eu)%7D%20%5C,%20dt%20%20%5Cright%5C%7D%7D%20%5C%5C%0AX%5Eu_0%20&amp;%5Csim%20q(x_0)%0A%5Cend%7Balign*%7D%0A%5Cright.%0A%5Ctag%7B3%7D"></span></p>
<p>for some control function <img src="https://latex.codecogs.com/png.latex?u:%20%5B0,T%5D%20%5Ctimes%20%5Cmathbb%7BR%7D%5ED%20%5Cto%20%5Cmathbb%7BR%7D%5ED"> and initial distribution <img src="https://latex.codecogs.com/png.latex?q(x_0)">. Note that <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BQ%7D=%20q_0%20%5Cotimes%20%5Cmathbb%7BP%7D%5E%7Bu%5E%5Cstar%7D">. The KL-divergence is a natural (pseudo-)distance since it deals elegantly with the intractable constant <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BZ%7D"> and the ratio <img src="https://latex.codecogs.com/png.latex?d%20%5Cmathbb%7BP%7D%5E%7Bu%7D%20/%20d%20%5Cmathbb%7BQ%7D"> is easy to compute. The <a href="../../notes/girsanov/girsanov.html">Girsanov theorem</a> gives that</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign*%7D%0A%5Cfrac%7Bd%5Cmathbb%7BQ%7D%7D%7Bd%5Bq%20%5Cotimes%20%5Cmathbb%7BP%7D%5Eu%5D%7D(X%5Eu)%20&amp;=%0A%5Cfrac%7Bp_0(X%5Eu_0)%7D%7Bq(X%5Eu_0)%7D%20%5C,%20%5Cexp%5CBig%5C%7B%20%5Cint_%7B0%7D%5E%7BT%7D%20(f-%5Ctfrac12%20%5C%7Cu%5C%7C%5E2)(s,%20X%5Eu_s)%20%20%5C,%20ds%5C%5C%0A&amp;-%20%5Cint_%7B0%7D%5E%7BT%7D%20u(s,X%5Eu_s)%5E%5Ctop%20%5C,%20dW_s%20+%20g(X%5Eu_T)%20%5CBig%5C%7D%20/%20%5Cmathcal%7BZ%7D.%0A%5Cend%7Balign*%7D%0A"></p>
<p>From this expression, one can readily write down <img src="https://latex.codecogs.com/png.latex?D_%7B%5Ctext%7BKL%7D%7D(q%20%5Cotimes%20%5Cmathbb%7BP%7D%5Eu,%5Cmathbb%7BQ%7D)">. Minimizing <img src="https://latex.codecogs.com/png.latex?D_%7B%5Ctext%7BKL%7D%7D(q%20%5Cotimes%20%5Cmathbb%7BP%7D%5Eu,%5Cmathbb%7BQ%7D)"> over the control <img src="https://latex.codecogs.com/png.latex?u"> and the initial distribution <img src="https://latex.codecogs.com/png.latex?q"> shows that the optimal pair is:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A(q_0,%20u%5E%5Cstar)%0A%5C;%20=%20%5C;%0A%5Cmathop%7B%5Cmathrm%7Bargmin%7D%7D_%7Bq,u%7D%20%5C;%20D_%7B%5Ctext%7BKL%7D%7D(q,%20p_0)%20+%0A%5Cmathbb%7BE%7D%20%7B%5Cleft%5B%20%20%5Cint_%7B0%7D%5E%7BT%7D%20(%5Ctfrac%2012%20%5C%7Cu%5C%7C%5E2%20-%20f)(s,%20X%5Eu_s)%20%20%5C,%20ds%20-%20g(X%5Eu_T)%20%20%5Cright%5D%7D%20.%0A"></p>
<p>Minimizing this loss seeks a control that makes the quantity <img src="https://latex.codecogs.com/png.latex?%5Cint_%7B0%7D%5E%7BT%7D%20f(X%5Eu_s)%20%5C,%20ds%20+%20g(X%5Eu_T)"> large while keeping the control effort <img src="https://latex.codecogs.com/png.latex?%5Cint_%7B0%7D%5E%7BT%7D%20%5C%7Cu(s,X%5Eu_s)%5C%7C%5E2%20%5C,%20ds"> small. Equivalently, this can be expressed as a maximization problem,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A(q_0,%20%5C,%20u%5E%5Cstar)%0A%5C;%20=%20%5C;%0A%5Cmathop%7B%5Cmathrm%7Bargmax%7D%7D_%7Bq,u%7D%20%5C;%0A%5Cmathbb%7BE%7D%20%7B%5Cleft%5B%20%20%5Cint_%7B0%7D%5E%7BT%7D%20(f-%5Ctfrac%2012%20%5C%7Cu%5C%7C%5E2)(s,%20X%5Eu_s)%20%20%5C,%20ds%20+%20g(X%5Eu_T)%20%20%5Cright%5D%7D%20%20-%20D_%7B%5Ctext%7BKL%7D%7D(q,%20p_0).%0A"></p>
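<p>This maximization can be checked numerically. The sketch below is a minimal Python/NumPy illustration with choices of my own (not imposed by the theory): zero drift, unit volatility, no running reward, quadratic terminal reward, and a deterministic start so that the KL term vanishes. It estimates the objective by Euler–Maruyama simulation for two controls.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choices (mine, not from the note): b = 0, sigma = 1, f = 0,
# terminal reward g(x) = -x**2 / 2, deterministic start q = p0 = delta_{x0}.
T, N, M, x0 = 1.0, 200, 50_000, 0.5
dt = T / N
g = lambda x: -0.5 * x**2

def objective(u_fn):
    """Euler-Maruyama estimate of E[ int (f - 0.5*||u||^2) dt + g(X^u_T) ]."""
    x = np.full(M, x0)
    running = np.zeros(M)
    for k in range(N):
        u = u_fn(k * dt, x)
        dW = rng.normal(scale=np.sqrt(dt), size=M)
        running += -0.5 * u**2 * dt
        x = x + u * dt + dW        # dX = sigma * (dW + u dt) with b = 0, sigma = 1
    return np.mean(running + g(x))

# Closed form here: X_T ~ N(x0, T) under P, so log Z = log E[exp(g(X_T))].
log_Z = -0.5 * np.log(1 + T) - 0.5 * x0**2 / (1 + T)

elbo_zero = objective(lambda t, x: 0.0 * x)           # u = 0: strict lower bound
elbo_opt = objective(lambda t, x: -x / (1 + T - t))   # u* = grad log h for this g
print(elbo_zero, elbo_opt, log_Z)
```

<p>The uncontrolled value lands strictly below <code>log_Z</code>, while the optimal feedback recovers it up to discretization and Monte-Carlo error.</p>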
<p>Note that since <img src="https://latex.codecogs.com/png.latex?q_0%20%5Cotimes%20%5Cmathbb%7BP%7D%5E%7Bu%5E%5Cstar%7D%20=%20%5Cmathbb%7BQ%7D=%20%5Cmathbb%7BP%7D%5C,%20e%5E%7B%5Cint_0%5ET%20f%20%5C,%20ds%20+%20g%7D%20/%20%5Cmathcal%7BZ%7D">, the optimal control <img src="https://latex.codecogs.com/png.latex?u%5E%5Cstar"> is such that for <strong>any trajectory</strong> we have:</p>
<p><span id="eq-logZ-pathwise"><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign*%7D%0A%5Clog%20%5Cmathcal%7BZ%7D%0A=%20%5Clog%20%5Cfrac%7Bp_0(X%5E%7Bu%5E%7B%5Cstar%7D%7D_0)%7D%7Bq_0(X%5E%7Bu%5E%7B%5Cstar%7D%7D_0)%7D%20&amp;+%20%5C,%20%5Cint_%7B0%7D%5E%7BT%7D%20(f-%5Ctfrac12%20%5C%7Cu%5E%5Cstar%5C%7C%5E2)(s,%20X%5E%7Bu%5E%7B%5Cstar%7D%7D_s)%20%5C,%20ds%20+%20g(X%5E%7Bu%5E%7B%5Cstar%7D%7D_T)%20%5C%5C%0A&amp;%5Cquad%20-%20%5Cint_%7B0%7D%5E%7BT%7D%20u%5E%7B%5Cstar%7D(s,X%5E%7Bu%5E%7B%5Cstar%7D%7D_s)%5E%5Ctop%20%5C,%20dW_s%20.%0A%5Cend%7Balign*%7D%0A%5Ctag%7B4%7D"></span></p>
</section>
<section id="stochastic-control" class="level3">
<h3 class="anchored" data-anchor-id="stochastic-control">Stochastic Control</h3>
<p>In the previous section, there was nothing special about the starting point <img src="https://latex.codecogs.com/png.latex?x_0"> and the time horizon <img src="https://latex.codecogs.com/png.latex?T%3E0">. This means that the same derivation gives the solution to the following stochastic optimal control problem. Consider the reward-to-go function (a.k.a. value function) defined as</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AV(t,x)%20=%20%5Csup_u%20%5C;%20%5Cmathbb%7BE%7D%20%7B%5Cleft%5B%20%20%5Cint_%7Bt%7D%5E%7BT%7D(f-%5Ctfrac%2012%20%5C%7Cu%5C%7C%5E2)(s,%20X%5Eu_s)%20%5C,%20ds%20+%20g(X%5Eu_T)%20%5Cmid%20X_t%20=%20x%20%5Cright%5D%7D%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?X%5Eu"> is the controlled diffusion of Equation&nbsp;3. We have that</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0AV(t,x)%0A&amp;=%20%5Clog%20%5Cmathbb%7BE%7D%20%7B%5Cleft%5B%20%20%5Cexp%20%20%7B%5Cleft%5C%7B%20%20%5Cint_t%5ET%20f(X_s)%20%5C,%20ds%20+%20g(X_T)%20%20%5Cright%5C%7D%7D%20%20%5Cmid%20X_t%20=%20x%20%5Cright%5D%7D%20%5C%5C%0A&amp;=%20%5Clog%5B%20%20%5Ctextcolor%7Bgreen%7D%7Bh(t,%20x)%7D%20%5D.%0A%5Cend%7Balign%7D%0A"></p>
<p>This shows that the optimal control <img src="https://latex.codecogs.com/png.latex?u%5E%5Cstar"> can also be expressed as</p>
<p><span id="eq-u-star-V"><img src="https://latex.codecogs.com/png.latex?%0Au%5E%5Cstar(t,x)%20=%20%5Csigma%5E%5Ctop(x)%20%5Cnabla%20%5Clog%5B%20%20%5Ctextcolor%7Bgreen%7D%7B%20h(t,x)%20%7D%5D%0A=%20%5Csigma%5E%5Ctop(x)%20%5C,%20%5Cnabla%20V(t,x)%20.%0A%5Ctag%7B5%7D"></span></p>
<p>The expression <img src="https://latex.codecogs.com/png.latex?%5Csigma%5E%5Ctop(x)%20%5C,%20%5Cnabla%20V(t,x)"> is intuitive: since we are trying to maximize the reward-to-go function, the optimal control should push in the direction of its gradient.</p>
</section>
<section id="hamilton-jacobi-bellman" class="level3">
<h3 class="anchored" data-anchor-id="hamilton-jacobi-bellman">Hamilton-Jacobi-Bellman</h3>
<p>Finally, let us mention that one can easily derive the <a href="https://en.wikipedia.org/wiki/Hamilton–Jacobi–Bellman_equation">Hamilton-Jacobi-Bellman</a> equation for the reward-to-go function <img src="https://latex.codecogs.com/png.latex?V(t,x)">. We have</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AV(t,x)%20=%20%5Csup_u%20%5C;%20%5Cmathbb%7BE%7D%20%7B%5Cleft%5B%20%20%5Cint_%7Bt%7D%5ET%20C_s%20%5C,%20ds%20+%20g(X%5Eu_T)%20%5Cright%5D%7D%0A"></p>
<p>with <img src="https://latex.codecogs.com/png.latex?C_s%20=%20-%5Ctfrac12%20%5C%7Cu(s,X%5Eu_s)%5C%7C%5E2%20+%20f(X%5Eu_s)">. For <img src="https://latex.codecogs.com/png.latex?%5Cdelta%20%5Cll%201">, we have</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0AV(t,x)%0A&amp;%5C;%20=%20%5C;%0A%5Csup_u%20%5C;%20%20%7B%5Cleft%5C%7B%20%20C_t%20%5C,%20%5Cdelta%20+%20%5Cmathbb%7BE%7D%20%7B%5Cleft%5B%20%20V(t+%5Cdelta,%20X%5Eu_%7Bt+%5Cdelta%7D)%20%5Cmid%20X%5Eu_t=x%20%5Cright%5D%7D%20%20%20%5Cright%5C%7D%7D%20%20+%20o(%5Cdelta)%5C%5C%0A&amp;%5C;%20=%20%5C;%0AV(t,x)%20+%20%5Cdelta%20%5C,%20%5Csup_%7Bu(t,x)%7D%20%5C;%20%20%7B%5Cleft%5C%7B%20%20C_t%20+%20(%5Cpartial_t%20+%20%5Cmathcal%7BL%7D+%20%5Csigma(x)%20%5C,%20u(t,x)%20%5C,%20%5Cnabla)%20%5C,%20V(t,x)%20%5Cright%5C%7D%7D%20%20+%20o(%5Cdelta)%0A%5Cend%7Balign%7D%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7D=%20b%20%5Cnabla%20+%20%5Ctfrac12%20%5C,%20%5Csigma%20%5Csigma%5E%5Ctop%20:%20%5Cnabla%5E2"> is the generator of the uncontrolled diffusion. Since <img src="https://latex.codecogs.com/png.latex?C_t%20=%20-%5Ctfrac12%20%5C%7Cu(t,x)%5C%7C%5E2%20+%20f(x)"> is a simple quadratic function of the control, the supremum over <img src="https://latex.codecogs.com/png.latex?u(t,x)"> can be computed in closed form,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0Au%5E%5Cstar(t,x)%0A&amp;=%20%5Cmathop%7B%5Cmathrm%7Bargmax%7D%7D_%7Bz%20%5Cin%20%5Cmathbb%7BR%7D%5ED%7D%20%5C;%20-%5Ctfrac12%20%5C%7Cz%5C%7C%5E2%20+%20%5Cleft%3C%20z,%20%5Csigma%5E%5Ctop(x)%20%5Cnabla%20V(t,x)%20%20%5Cright%3E%5C%5C%0A&amp;=%20%5Csigma%5E%5Ctop(x)%20%5C,%20%5Cnabla%20V(t,x),%0A%5Cend%7Balign%7D%0A"></p>
<p>as we already knew from Equation&nbsp;5. This implies that the reward-to-go function <img src="https://latex.codecogs.com/png.latex?V(t,x)"> satisfies the HJB equation</p>
<p><span id="eq-hjb"><img src="https://latex.codecogs.com/png.latex?%0A%7B%5Cleft(%20%5Cpartial_t%20+%20%5Cmathcal%7BL%7D%20%5Cright)%7D%20V%20+%20%5Cfrac12%20%5C%7C%20%5Csigma%5E%5Ctop%20%5Cnabla%20V%20%5C%7C%5E2%20+%20f%20=%200%0A%5Ctag%7B6%7D"></span></p>
<p>with terminal condition <img src="https://latex.codecogs.com/png.latex?V(T,x)%20=%20g(x)">. Another route to derive Equation&nbsp;6 is to use the fact that <img src="https://latex.codecogs.com/png.latex?V(t,x)%20=%20%5Clog%20h(t,x)">: since the <a href="https://en.wikipedia.org/wiki/Feynman–Kac_formula">Feynman-Kac</a> formula gives that the function <img src="https://latex.codecogs.com/png.latex?h(t,x)"> satisfies <img src="https://latex.codecogs.com/png.latex?(%5Cpartial_t%20+%20%5Cmathcal%7BL%7D+%20f)%20h%20=%200">, the conclusion follows from a few lines of algebra, starting by writing <img src="https://latex.codecogs.com/png.latex?%5Cpartial_t%20V%20=%20h%5E%7B-1%7D%20%5C,%20%5Cpartial_t%20h%20=%20-h%5E%7B-1%7D(%5Cmathcal%7BL%7D+%20f)%5Bh%5D">, expanding <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7Dh">, and expressing everything back in terms of <img src="https://latex.codecogs.com/png.latex?V">. The term <img src="https://latex.codecogs.com/png.latex?%5C%7C%5Csigma%5E%5Ctop%20%20%5Cnabla%20V%5C%7C%5E2"> arises naturally when expressing the diffusion term <img src="https://latex.codecogs.com/png.latex?%5Csigma%20%5Csigma%5E%5Ctop%20:%20%5Cnabla%5E2%20h"> in terms of the second derivatives of <img src="https://latex.codecogs.com/png.latex?V">; this is the idea behind the standard <a href="https://en.wikipedia.org/wiki/Cole–Hopf_transformation">Cole-Hopf transformation</a>.</p>
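<p>Equation&nbsp;6 is easy to check in a tractable scalar case. With the illustrative choices (mine, not from the note) of zero drift, unit volatility, no running reward and linear terminal reward <code>g(x) = a*x</code>, Feynman-Kac gives <code>h(t,x) = exp(a*x + a**2*(T-t)/2)</code> for Brownian motion, and the HJB residual of <code>V = log h</code> vanishes; the sketch below verifies this by finite differences.</p>

```python
# Illustrative scalar case (my choice): b = 0, sigma = 1, f = 0, g(x) = a*x.
# For Brownian motion, Feynman-Kac gives h(t,x) = exp(a*x + a**2*(T - t)/2),
# so the value function is V(t,x) = log h(t,x) = a*x + a**2*(T - t)/2.
a, T = 0.8, 1.0
V = lambda t, x: a * x + a**2 * (T - t) / 2

# Check the HJB equation by central finite differences:
# dV/dt + (1/2) d2V/dx2 + (1/2) (dV/dx)**2 + f = 0, with L = (1/2) d2/dx2 here.
t0, x0, e = 0.4, -1.3, 1e-4
dV_dt = (V(t0 + e, x0) - V(t0 - e, x0)) / (2 * e)
dV_dx = (V(t0, x0 + e) - V(t0, x0 - e)) / (2 * e)
d2V_dx2 = (V(t0, x0 + e) - 2 * V(t0, x0) + V(t0, x0 - e)) / e**2
residual = dV_dt + 0.5 * d2V_dx2 + 0.5 * dV_dx**2
print(residual)  # numerically zero
```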
<p>Finally, Ito’s lemma and Equation&nbsp;6 give that for <img src="https://latex.codecogs.com/png.latex?t_1%20%3C%20t_2">, the optimally controlled diffusion <img src="https://latex.codecogs.com/png.latex?X%5E%7Bu%5E%5Cstar%7D"> satisfies:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign*%7D%0AV(t_2,%20X%5E%7Bu%5E%5Cstar%7D_%7Bt_2%7D)%0A-%0AV(t_1,%20X%5E%7Bu%5E%5Cstar%7D_%7Bt_1%7D)%20&amp;=%0A%5Cint_%7Bt_1%7D%5E%7Bt_2%7D%20(%5Ctfrac12%20%5C,%20%5C%7Cu%5E%5Cstar%5C%7C%5E2%20-%20f)(s,X%5E%7Bu%5E%5Cstar%7D_s)%20%5C,%20ds%5C%5C%0A&amp;%5Cquad%20+%20%5Cint_%7Bt_1%7D%5E%7Bt_2%7D%20u%5E%5Cstar(s,X%5E%7Bu%5E%5Cstar%7D_s)%5E%5Ctop%20%5C,%20dW_s.%0A%5Cend%7Balign*%7D%0A"></p>
<p>Since <img src="https://latex.codecogs.com/png.latex?V(T,x_T)%20=%20g(x_T)"> and <img src="https://latex.codecogs.com/png.latex?V(0,x_0)%20=%20%5Clog%20%5Cmathcal%7BZ%7D+%20%5Clog%20%5Cfrac%7Bq_0(x_0)%7D%7Bp_0(x_0)%7D">, applying this identity between times <img src="https://latex.codecogs.com/png.latex?t_1=0"> and <img src="https://latex.codecogs.com/png.latex?t_2=T"> recovers the formula Equation&nbsp;4 for the log-normalizing constant <img src="https://latex.codecogs.com/png.latex?%5Clog%20%5Cmathcal%7BZ%7D">.</p>
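<p>The striking feature of Equation&nbsp;4, namely that the same value is obtained on <strong>every</strong> trajectory, can be seen numerically. In the tractable scalar case (an illustrative choice of mine: zero drift, unit volatility, no running reward, <code>g(x) = a*x</code>, deterministic start <code>q0 = p0</code>), one has <code>u*(t,x) = a</code> and <code>log Z = a*x0 + a**2*T/2</code>; the discretized pathwise estimator below has exactly zero variance.</p>

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative tractable case (my choice): b = 0, sigma = 1, f = 0, g(x) = a*x,
# deterministic start q0 = p0 = delta_{x0}.  Then h(t,x) = exp(a*x + a^2(T-t)/2),
# so u*(t,x) = a is constant and log Z = a*x0 + a**2*T/2.
a, x0, T, N, M = 0.7, -0.3, 1.0, 100, 10
dt = T / N

log_Z_paths = np.zeros(M)
for i in range(M):
    x, stoch_int = x0, 0.0
    for _ in range(N):
        dW = rng.normal() * np.sqrt(dt)
        stoch_int += a * dW              # int u* dW
        x += a * dt + dW                 # dX = sigma * (dW + u* dt)
    # right-hand side of Equation 4 (here f = 0 and q0 = p0)
    log_Z_paths[i] = -0.5 * a**2 * T + a * x - stoch_int

print(log_Z_paths)                # identical on every trajectory
print(a * x0 + a**2 * T / 2)      # the common value: log Z
```

<p>The Brownian increments cancel exactly in the pathwise identity, so even the Euler discretization returns the same number on every path, up to floating-point error.</p>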


</section>

 ]]></description>
  <category>SDE</category>
  <category>markov</category>
  <guid>https://alexxthiery.github.io/notes/HJB/HJB.html</guid>
  <pubDate>Mon, 10 Jun 2024 17:00:00 GMT</pubDate>
</item>
<item>
  <title>Girsanov and Importance Sampling</title>
  <link>https://alexxthiery.github.io/notes/girsanov/girsanov.html</link>
  <description><![CDATA[ 





<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/girsanov/girsanov_portrait.jpg" class="img-fluid figure-img" style="width:35.0%"></p>
<figcaption><a href="https://en.wikipedia.org/wiki/Igor_Girsanov">Igor Girsanov</a> (1934 – 1967)</figcaption>
</figure>
</div>
</div>
<p>Let <img src="https://latex.codecogs.com/png.latex?q(dx)%20%5Cequiv%20%5Cmathcal%7BN%7D(%5Cmu,%5CGamma)"> be the Gaussian distribution with mean <img src="https://latex.codecogs.com/png.latex?%5Cmu%20%5Cin%20%5Cmathbb%7BR%7D%5ED"> and covariance matrix <img src="https://latex.codecogs.com/png.latex?%5CGamma%20%5Cin%20%5Cmathbb%7BR%7D%5E%7BD%20%5Ctimes%20D%7D">. For a direction <img src="https://latex.codecogs.com/png.latex?u%20%5Cin%20%5Cmathbb%7BR%7D%5ED">, consider the distribution <img src="https://latex.codecogs.com/png.latex?q%5E%7Bu%7D(dx)%20%5Cequiv%20%5Cmathcal%7BN%7D(%5Cmu%20+%20%5CGamma%5E%7B1/2%7D%20%5C,%20u,%20%5CGamma)">, i.e.&nbsp;the same Gaussian distribution shifted by an amount <img src="https://latex.codecogs.com/png.latex?%5CGamma%5E%7B1/2%7D%20%5C,%20u">. Direct algebra gives that</p>
<p><span id="eq-girsanov"><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7Bq%5E%7Bu%7D(x)%7D%7Bq(x)%7D%0A=%0A%5Cexp%20%7B%5Cleft%5C%7B%20-%20%5Cfrac%7B1%7D%7B2%7D%20%5C%7C%20u%5C%7C%5E2%20+%20%5Cleft%3C%20u,%20%5C,%20%5CGamma%5E%7B-1/2%7D(x-%5Cmu)%20%5Cright%3E%20%5Cright%5C%7D%7D%20.%0A%5Ctag%7B1%7D"></span></p>
<p>We will see that, not very surprisingly, a similar change-of-probability result holds in continuous time. On the time interval <img src="https://latex.codecogs.com/png.latex?%5B0,T%5D">, let <img src="https://latex.codecogs.com/png.latex?W_t"> be a standard Brownian motion in <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BR%7D%5ED"> and <img src="https://latex.codecogs.com/png.latex?X_t"> be the solution to the <a href="https://en.wikipedia.org/wiki/Stochastic_differential_equation">SDE</a></p>
<p><span id="eq-sde-original"><img src="https://latex.codecogs.com/png.latex?%0AdX_t%20%5C;%20=%20%5C;%20b(X_t)%20%5C,%20dt%20+%20%5Csigma(X_t)%20%5C,%20dW_t%0A%5Ctag%7B2%7D"></span></p>
<p>for some drift <img src="https://latex.codecogs.com/png.latex?b:%20%5Cmathbb%7BR%7D%5ED%20%5Cto%20%5Cmathbb%7BR%7D%5ED">, diffusion coefficient <img src="https://latex.codecogs.com/png.latex?%5Csigma:%20%5Cmathbb%7BR%7D%5ED%20%5Cto%20%5Cmathbb%7BR%7D%5E%7BD%20%5Ctimes%20D%7D">, and initial distribution <img src="https://latex.codecogs.com/png.latex?%5Cmu_0(dx_0)">. This SDE defines a probability measure <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D"> on the <a href="https://en.wikipedia.org/wiki/Classical_Wiener_space">path-space</a> <img src="https://latex.codecogs.com/png.latex?C(%5B0,T%5D;%20%5Cmathbb%7BR%7D%5ED)">, the space of continuous functions from <img src="https://latex.codecogs.com/png.latex?%5B0,T%5D"> to <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BR%7D%5ED">. Consider a perturbation drift function <img src="https://latex.codecogs.com/png.latex?u:%20%5Cmathbb%7BR%7D%5ED%20%20%5Cto%20%5Cmathbb%7BR%7D%5ED"> and the associated perturbed SDE given by</p>
<p><span id="eq-sde-perturbed"><img src="https://latex.codecogs.com/png.latex?%0AdX_t%5Eu%20%5C;%20=%20%5C;%20b(X_t%5Eu)%20%5C,%20dt%20+%20%5Csigma(X_t%5Eu)%20%5C,%20%20%7B%5Cleft%5C%7B%20%20dW_t%20+%20%20%5Ctextcolor%7Bblue%7D%7Bu(X_t%5Eu)%20%5C,%20dt%7D%20%20%5Cright%5C%7D%7D%20.%0A%5Ctag%7B3%7D"></span></p>
<p>This perturbed SDE, started from the same initial distribution <img src="https://latex.codecogs.com/png.latex?%5Cmu_0(dx_0)">, defines a probability measure <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D%5Eu"> on the path-space <img src="https://latex.codecogs.com/png.latex?C(%5B0,T%5D;%20%5Cmathbb%7BR%7D%5ED)">, and it is often useful to understand the Radon-Nikodym derivative of <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D%5Eu"> with respect to <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D">. I have never really liked the way this is <a href="https://en.wikipedia.org/wiki/Girsanov_theorem">usually</a> derived, and also never really remember the result. It takes only a few lines of algebra to re-derive these results, at least informally. For this purpose, consider a simple <a href="https://en.wikipedia.org/wiki/Euler–Maruyama_method">Euler discretization</a> of the SDE with time step <img src="https://latex.codecogs.com/png.latex?%5Cdelta%20=%20T/N"> for <img src="https://latex.codecogs.com/png.latex?N%20%5Cgg%201">. Consider a discretized path <img src="https://latex.codecogs.com/png.latex?(x_0,%20x_%7B%5Cdelta%7D,%20%5Cldots,%20x_%7BT%7D)"> of Equation&nbsp;2 obtained by iterating the update</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ax_%7Bt_%7Bk+1%7D%7D%20%5C;%20=%20%5C;%20x_%7Bt_k%7D%20+%20b(x_%7Bt_k%7D)%5C,%5Cdelta%20+%20%5Csigma(x_%7Bt_k%7D)%20%5C,%20(%5CDelta%20W_%7Bt_k%7D)%0A"></p>
<p>with <img src="https://latex.codecogs.com/png.latex?t_k%20=%20k%5Cdelta"> and <img src="https://latex.codecogs.com/png.latex?%5CDelta%20W_%7Bt_k%7D%20=%20W_%7Bt_%7Bk+1%7D%7D%20-%20W_%7Bt_k%7D">. The probability of observing such a path reads <img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7B1%7D%7B%5Cmathcal%7BZ%7D%7D%20%5C,%20%5Cmu_0(x_0)%20%5C,%20%5Cexp%20%7B%5Cleft%5C%7B%20-%5Cfrac%7B1%7D%7B2%20%5Cdelta%7D%20%5Csum_%7Bk=0%7D%5E%7BN-1%7D%0A%5C%7Cx_%7Bt_%7Bk+1%7D%7D%20-%20%5Bx_%7Bt_k%7D%20+%20b(x_%7Bt_k%7D)%5C,%5Cdelta%5D%5C%7C%5E2_%7B%5CGamma%5E%7B-1%7D(x_%7Bt_k%7D)%20%7D%20%5Cright%5C%7D%7D%0A"></p>
<p>with <img src="https://latex.codecogs.com/png.latex?%5CGamma(x)%20%5Cequiv%20%5Csigma(x)%20%5Csigma%5E%5Ctop(x)"> the volatility matrix and an irrelevant multiplicative constant <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BZ%7D">. One obtains a similar expression for a discretized path of the perturbed SDE Equation&nbsp;3 and the ratio of these two quantities equals</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7Bd%20%5Cwidetilde%7B%5Cmathbb%7BP%7D%7D%5E%7Bu%7D%7D%7Bd%20%5Cwidetilde%7B%5Cmathbb%7BP%7D%7D%7D(x)%20=%20%5Cexp%20%7B%5Cleft%5C%7B%20%5Csum_%7Bk=0%7D%5E%7BN-1%7D%20-%5Cfrac%7B%5Cdelta%7D%7B2%7D%20%5C%7Cu(x_%7Bt_k%7D)%5C%7C%5E2%20%20+%0A%5Cleft%3C%20x_%7Bt_%7Bk+1%7D%7D-x_%7Bt_k%7D-b(x_%7Bt_k%7D)%5Cdelta,%20%5Csigma(x_%7Bt_k%7D)%20%5C,%20u(x_%7Bt_k%7D)%20%5Cright%3E_%7B%5CGamma%5E%7B-1%7D(x_%7Bt_k%7D)%7D%20%20%5Cright%5C%7D%7D%20.%0A"></p>
<p>where the tilde notation denotes the discretized version of the measures. Since</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ax_%7Bt_%7Bk+1%7D%7D-x_%7Bt_k%7D-b(x_%7Bt_k%7D)%5Cdelta%20=%20%5Csigma(x_%7Bt_k%7D)%20%5C,%20%5CDelta%20W_%7Bt_k%7D,%0A"> for a path <img src="https://latex.codecogs.com/png.latex?dx_t%20%5C;%20=%20%5C;%20b(x_t)%20%5C,%20dt%20+%20%5Csigma(x_t)%20%5C,%20dW_t">, taking the limit <img src="https://latex.codecogs.com/png.latex?N%20%5Cto%20%5Cinfty"> gives</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7Bd%20%5Cmathbb%7BP%7D%5E%7Bu%7D%7D%7Bd%20%5Cmathbb%7BP%7D%7D(x)%20%5C;%20=%20%5C;%20%5Cexp%20%7B%5Cleft%5C%7B%0A-%5Cfrac%2012%20%5C,%20%5Cint_0%5ET%20%5C%7Cu(x_t)%5C%7C%5E2%20%5C,%20dt%20+%20%5Cint_%7B0%7D%5ET%20u%5E%5Ctop(x_t)%20%5C,%20dW_t%0A%5Cright%5C%7D%7D%20.%0A"></p>
<p>Similarly, for a path <img src="https://latex.codecogs.com/png.latex?dx%5E%7Bu%7D_t%20%5C;%20=%20%5C;%20b(x%5Eu_t)%20%5C,%20dt%20+%20%5Csigma(x%5Eu_t)%20%5C,%20%20%7B%5Cleft(%20%20dW_t%20+%20u(x%5Eu_t)%20%5Cright)%7D%20">, we have</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7Bd%20%5Cmathbb%7BP%7D%7D%7Bd%20%5Cmathbb%7BP%7D%5Eu%7D(x%5Eu)%20%5C;%20=%20%5C;%20%5Cexp%20%7B%5Cleft%5C%7B%0A-%5Cfrac%2012%20%5C,%20%5Cint_0%5ET%20%5C%7Cu(x%5Eu_t)%5C%7C%5E2%20%5C,%20dt%20-%20%5Cint_%7B0%7D%5ET%20u%5E%5Ctop(x%5Eu_t)%20%5C,%20dW_t%0A%5Cright%5C%7D%7D%20.%0A"></p>
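<p>As an informal numerical counterpart of the discretization argument, the sketch below (Python/NumPy; the drift <code>b(x) = -x</code> and perturbation <code>u(x) = cos(x)</code> are arbitrary illustrative choices of mine) simulates paths under <code>P</code>, accumulates the discretized weight <code>dP^u/dP</code>, and checks that its expectation under <code>P</code> is 1, as it must be for a probability ratio.</p>

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative choices (not from the note): drift b(x) = -x, sigma = 1,
# perturbation u(x) = cos(x).
T, N, M = 1.0, 200, 200_000
dt = T / N
b = lambda x: -x
u = lambda x: np.cos(x)

# Simulate M Euler-Maruyama paths of dX = b(X) dt + dW under P, accumulating
# the discretized log-weight: log dP^u/dP = -0.5 int ||u||^2 dt + int u^T dW.
x = np.zeros(M)
log_w = np.zeros(M)
for _ in range(N):
    dW = rng.normal(scale=np.sqrt(dt), size=M)
    log_w += -0.5 * u(x) ** 2 * dt + u(x) * dW
    x += b(x) * dt + dW

mean_weight = np.mean(np.exp(log_w))
print(mean_weight)  # close to 1
```

<p>Because the drift term at each step is measurable with respect to the past, the discretized weight is exactly a martingale, so its mean is 1 without any discretization bias.</p>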
<p>These results remain identical for time-dependent drift and volatility functions, as is clear from this non-rigorous argument. The above two formulas for <img src="https://latex.codecogs.com/png.latex?d%5Cmathbb%7BP%7D%5Eu/d%5Cmathbb%7BP%7D(x)"> and <img src="https://latex.codecogs.com/png.latex?d%5Cmathbb%7BP%7D/d%5Cmathbb%7BP%7D%5Eu(x)"> may be slightly confusing since they are not immediately recognizable as inverses of each other. Furthermore, these probability ratios, evaluated along a path <img src="https://latex.codecogs.com/png.latex?x"> or <img src="https://latex.codecogs.com/png.latex?x%5Eu">, are expressed in terms of the Brownian trajectory that drives them, which adds to the confusion. In short, it would be better to express <img src="https://latex.codecogs.com/png.latex?d%5Cmathbb%7BP%7D%5Eu/d%5Cmathbb%7BP%7D(x)"> directly in terms of the path <img src="https://latex.codecogs.com/png.latex?x"> rather than the Brownian motion <img src="https://latex.codecogs.com/png.latex?W_t">, even though the two descriptions are equivalent. For these reasons, it is often convenient to use the following equivalent expressions:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Baligned%7D%0A%5Cfrac%7Bd%20%5Cmathbb%7BP%7D%5E%7Bu%7D%7D%7Bd%20%5Cmathbb%7BP%7D%7D(x)%20&amp;=%20%5Cexp%20%7B%5Cleft%5C%7B%0A%5Ctextcolor%7Bblue%7D%7B-%7D%5Cfrac%2012%20%5C,%20%5Cint_0%5ET%20%5C%7Cu(x_t)%5C%7C%5E2%20%5C,%20dt%20%20%5Ctextcolor%7Bblue%7D%7B+%7D%20%5Cint_%7B0%7D%5ET%20u%5E%5Ctop(x_t)%20%5C,%20%5Cfrac%7Bdx_t%20-%20b(x_t)%20dt%7D%7B%5Csigma(x_t)%7D%20%5Cright%5C%7D%7D%20%5C%5C%0A%5Cfrac%7Bd%20%5Cmathbb%7BP%7D%7D%7Bd%20%5Cmathbb%7BP%7D%5E%7Bu%7D%7D(x)%20&amp;=%20%5Cexp%20%7B%5Cleft%5C%7B%0A%5Ctextcolor%7Bblue%7D%7B+%7D%5Cfrac%2012%20%5C,%20%5Cint_0%5ET%20%5C%7Cu(x_t)%5C%7C%5E2%20%5C,%20dt%20%20%5Ctextcolor%7Bblue%7D%7B-%7D%20%5Cint_%7B0%7D%5ET%20u%5E%5Ctop(x_t)%20%5C,%20%5Cfrac%7Bdx_t%20-%20b(x_t)%20dt%7D%7B%5Csigma(x_t)%7D%20%5Cright%5C%7D%7D%0A%5Cend%7Baligned%7D%0A%5Cright.%0A"></p>
<p>From these expressions, it is clear that <img src="https://latex.codecogs.com/png.latex?d%5Cmathbb%7BP%7D%5Eu/d%5Cmathbb%7BP%7D(x)"> and <img src="https://latex.codecogs.com/png.latex?d%5Cmathbb%7BP%7D/d%5Cmathbb%7BP%7D%5Eu(x)"> are indeed inverses of each other. Another, entirely equivalent and slightly more symmetric, formulation is as follows. Consider the two measures <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D%5E%7B(1)%7D"> and <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D%5E%7B(2)%7D"> associated to</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AdX_t%5E%7B(i)%7D%20%5C;%20=%20%5C;%20b%5E%7B(i)%7D(X_t)%20%5C,%20dt%20+%20%5Csigma(X_t)%20%5C,%20dW_t%0A"></p>
<p>for two drift functions <img src="https://latex.codecogs.com/png.latex?b%5E%7B(1)%7D:%20%5Cmathbb%7BR%7D%5ED%20%5Cto%20%5Cmathbb%7BR%7D%5ED"> and <img src="https://latex.codecogs.com/png.latex?b%5E%7B(2)%7D:%20%5Cmathbb%7BR%7D%5ED%20%5Cto%20%5Cmathbb%7BR%7D%5ED">. Then, the Radon-Nikodym derivative between these two measures is given by</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7Bd%20%5Cmathbb%7BP%7D%5E%7B(2)%7D%7D%7Bd%20%5Cmathbb%7BP%7D%5E%7B(1)%7D%7D(x)%20=%20%5Cexp%20%7B%5Cleft%5C%7B%0A-%5Cfrac%7B1%7D%7B2%7D%5Cint_%7B0%7D%5ET%20%20%7B%5Cleft(%20%5Cfrac%7B%5C%7Cb%5E%7B(2)%7D_t%5C%7C%5E2%20-%20%5C%7Cb%5E%7B(1)%7D_t%5C%7C%5E2%7D%7B%5Csigma%5E2_t%7D%20%5Cright)%7D%20%20%5C,%20dt%0A+%0A%5Cint_%7B0%7D%5ET%20%5Cleft%3C%20%20%5Cfrac%7Bb%5E%7B(2)%7D_t%20-%20b%5E%7B(1)%7D_t%7D%7B%5Csigma_t%5E2%7D,%20dx_t%20%5Cright%3E%0A%5Cright%5C%7D%7D%0A"></p>
<p>with the shorthands <img src="https://latex.codecogs.com/png.latex?b%5E%7B(i)%7D_t%20=%20b%5E%7B(i)%7D(x_t)"> and <img src="https://latex.codecogs.com/png.latex?%5Csigma_t%20=%20%5Csigma(x_t)"> and <img src="https://latex.codecogs.com/png.latex?%5C%7Cv%5C%7C%5E2/%5Csigma%5E2%20=%20%5Cleft%3C%20v,%20%5B%5Csigma%20%5Csigma%5E%5Ctop%5D%5E%7B-1%7D%20v%20%5Cright%3E"> and <img src="https://latex.codecogs.com/png.latex?%5Cleft%3C%20u,v%20%5Cright%3E%20/%20%5Csigma%5E2%20=%20%5Cleft%3C%20u,%20%5B%5Csigma%20%5Csigma%5E%5Ctop%5D%5E%7B-1%7D%20v%20%5Cright%3E">. Again, this follows immediately from a discretized version of the SDEs. As described below, these change-of-variables formulae are often useful when performing importance sampling on path-space. As a sanity check, note that in the case of a scalar Brownian motion <img src="https://latex.codecogs.com/png.latex?dX%20=%20%5Csigma%20%5C,%20dW"> and a drifted version of it <img src="https://latex.codecogs.com/png.latex?dX%5Eu%20=%20%5Csigma%20%5C,%20dW%20+%20u%20%5C,%20dt">, the ratio <img src="https://latex.codecogs.com/png.latex?d%5Cmathbb%7BP%7D%5Eu/d%5Cmathbb%7BP%7D(x)"> indeed has unit expectation under <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D">: this is equivalent to the identity <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D%5B%5Cexp(%5Csigma%20%5C,%20%5Cxi)%5D%20=%20%5Cexp(%5Csigma%5E2/2)"> for a standard Gaussian random variable <img src="https://latex.codecogs.com/png.latex?%5Cxi">. Finally, note that the <a href="https://en.wikipedia.org/wiki/Kullback–Leibler_divergence">Kullback-Leibler divergence</a> between <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D"> and <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D%5Eu"> has a particularly simple form. Since <img src="https://latex.codecogs.com/png.latex?D_%7B%5Ctext%7BKL%7D%7D(%5Cmathbb%7BP%7D,%20%5Cmathbb%7BP%7D%5Eu)%20=%20%5Cmathbb%7BE%7D_%7B%5Cmathbb%7BP%7D%7D%20%7B%5Cleft%5B%20-%5Clog%20%7B%5Cleft%5C%7B%20%5Cfrac%7Bd%20%5Cmathbb%7BP%7D%5E%7Bu%7D%7D%7Bd%20%5Cmathbb%7BP%7D%7D(X)%20%5Cright%5C%7D%7D%20%20%5Cright%5D%7D%20">, one obtains</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AD_%7B%5Ctext%7BKL%7D%7D(%5Cmathbb%7BP%7D,%20%5Cmathbb%7BP%7D%5Eu)%20=%20%5Cfrac12%20%5Cmathbb%7BE%7D%20%7B%5Cleft%5B%20%20%5Cint_0%5ET%20%5C%7Cu(X_t)%5C%7C%5E2%20%5C,%20dt%20%20%5Cright%5D%7D%20.%0A"></p>
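<p>This KL formula can be checked by Monte-Carlo: along <code>P</code>-trajectories the stochastic-integral part of <code>-log(dP^u/dP)</code> averages to zero, so its empirical mean matches the empirical mean of <code>0.5 * int ||u||^2 dt</code>. A sketch with arbitrary illustrative choices of drift and perturbation (mine, not from the note):</p>

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative choices (not from the note): dX = -X dt + dW under P, with
# perturbation drift u(x) = cos(x).
T, N, M = 1.0, 200, 100_000
dt = T / N
u = lambda x: np.cos(x)

x = np.zeros(M)
neg_log_w = np.zeros(M)     # -log(dP^u/dP) accumulated along each P-path
half_int_u2 = np.zeros(M)   # 0.5 * int ||u(X_t)||^2 dt along the same path
for _ in range(N):
    dW = rng.normal(scale=np.sqrt(dt), size=M)
    neg_log_w += 0.5 * u(x) ** 2 * dt - u(x) * dW
    half_int_u2 += 0.5 * u(x) ** 2 * dt
    x += -x * dt + dW

kl_direct = np.mean(neg_log_w)      # E_P[-log dP^u/dP]
kl_formula = np.mean(half_int_u2)   # 0.5 * E[int ||u||^2 dt]
print(kl_direct, kl_formula)        # the two estimates agree
```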
<section id="importance-sampling-on-path-space" class="level3">
<h3 class="anchored" data-anchor-id="importance-sampling-on-path-space">Importance Sampling on path-space</h3>
<p>Consider a functional <img src="https://latex.codecogs.com/png.latex?%5CPhi:%20C(%5B0,T%5D;%20%5Cmathbb%7BR%7D%5ED)%20%5Cto%20%5Cmathbb%7BR%7D"> on path-space; a typical example is</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5CPhi(x)%20=%20%5Cexp%20%7B%5Cleft%5C%7B%20%5Cint_0%5ET%20f(X_t)%20%5C,%20dt%20%5C,%20+%20%5C,%20g(X_T)%20%5Cright%5C%7D%7D%20.%0A"></p>
<p>Suppose that we would like to evaluate the expectation of <img src="https://latex.codecogs.com/png.latex?%5CPhi"> under the measure <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D">. Naive Monte-Carlo (MC) would require sampling <img src="https://latex.codecogs.com/png.latex?M"> trajectories from Equation&nbsp;2 and computing the average of <img src="https://latex.codecogs.com/png.latex?%5CPhi"> on these trajectories. To reduce the variance of this naive MC estimator, one can also use importance sampling by sampling <img src="https://latex.codecogs.com/png.latex?M"> trajectories <img src="https://latex.codecogs.com/png.latex?x%5E%7B1,u%7D,%20%5Cldots,%20x%5E%7BM,u%7D"> from the measure <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D%5Eu"> and computing the average</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7B1%7D%7BM%7D%20%5C,%20%5Csum_%7Bi=1%7D%5EM%20%5CPhi(x%5E%7Bi,u%7D)%20%5C,%20W(x%5E%7Bi,u%7D)%0A"></p>
<p>with weights given by the Radon-Nikodym derivative</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AW(x%5E%7Bi,u%7D)%20%5C;%20=%20%5C;%20%5Cexp%20%7B%5Cleft%5C%7B%0A-%5Cfrac%2012%20%5C,%20%5Cint_0%5ET%20%5C%7Cu(x%5E%7Bi,u%7D_t)%5C%7C%5E2%20%5C,%20dt%20-%20%5Cint_%7B0%7D%5ET%20u%5E%5Ctop(x%5E%7Bi,u%7D_t)%20%5C,%20dW_t%0A%5Cright%5C%7D%7D%20.%0A"></p>
<p>Choosing the optimal “control” function <img src="https://latex.codecogs.com/png.latex?u"> that minimizes the variance of the estimator is not entirely straightforward, although this <a href="../../notes/doob_transforms/doob.html">previous note</a> already gives the answer. More on this in <a href="../../notes/HJB/HJB.html">another note</a>.</p>
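<p>To make the variance reduction concrete, the sketch below uses an illustrative setup of my choosing: <code>dX = dW</code> started at 0 and <code>Phi(x) = exp(a * x_T)</code>, so that the exact answer <code>exp(a**2 * T / 2)</code> is available. The constant tilted drift <code>u = a</code> happens to be the optimal choice for this <code>Phi</code>, and the weighted estimator then has zero variance.</p>

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative setup (my choice): dX = dW started at 0, Phi(x) = exp(a * x_T),
# so that E_P[Phi] = exp(a**2 * T / 2) in closed form.
a, T, N, M = 1.5, 1.0, 100, 20_000
dt = T / N

x_naive = np.zeros(M)     # paths under P
x_is = np.zeros(M)        # paths under P^u with constant drift u = a
log_w = np.zeros(M)       # log dP/dP^u accumulated along the controlled paths
for _ in range(N):
    x_naive += rng.normal(scale=np.sqrt(dt), size=M)
    dW = rng.normal(scale=np.sqrt(dt), size=M)
    x_is += a * dt + dW
    log_w += -0.5 * a**2 * dt - a * dW

naive = np.exp(a * x_naive)                  # naive MC samples of Phi
weighted = np.exp(a * x_is + log_w)          # importance-sampling samples
print(np.mean(naive), np.std(naive))         # noisy estimator
print(np.mean(weighted), np.std(weighted))   # zero variance in this case
print(np.exp(a**2 * T / 2))                  # exact value
```

<p>The Brownian increments cancel exactly between the path and the weight, which is the zero-variance phenomenon discussed in the notes on Doob transforms and stochastic control.</p>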


</section>

 ]]></description>
  <category>SDE</category>
  <category>markov</category>
  <guid>https://alexxthiery.github.io/notes/girsanov/girsanov.html</guid>
  <pubDate>Sun, 02 Jun 2024 17:00:00 GMT</pubDate>
</item>
<item>
  <title>Joe Doob &amp; Change of measures on path-space</title>
  <link>https://alexxthiery.github.io/notes/doob_transforms/doob.html</link>
  <description><![CDATA[ 





<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/doob_transforms/joseph_doob.jpg" class="img-fluid figure-img" style="width:35.0%"></p>
<figcaption><a href="https://en.wikipedia.org/wiki/Joseph_L._Doob">Joseph Doob</a> (1910 – 2004)</figcaption>
</figure>
</div>
</div>
<p>Consider a continuous-time Markov process <img src="https://latex.codecogs.com/png.latex?X_t"> on the time interval <img src="https://latex.codecogs.com/png.latex?%5B0,T%5D">, taking values in the state space <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BX%7D">. This defines a probability measure <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D"> on the set of <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BX%7D">-valued paths. It is often the case that one has to consider a perturbed, sometimes called “twisted”, probability distribution <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BQ%7D"> defined as</p>
<p><span id="eq-change-pb"><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7Bd%5Cmathbb%7BQ%7D%7D%7Bd%5Cmathbb%7BP%7D%7D(x_%7B%5B0,T%5D%7D)%20=%20%5Cfrac%7B1%7D%7B%5Cmathcal%7BZ%7D%7D%20%5C,%20%5Cexp%5Bg(X_T)%5D%0A%5Ctag%7B1%7D"></span></p>
<p>for a normalization constant <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BZ%7D"> and some function <img src="https://latex.codecogs.com/png.latex?g:%20%5Cmathcal%7BX%7D%5Cto%20%5Cmathbb%7BR%7D">. For example, collecting a noisy observation <img src="https://latex.codecogs.com/png.latex?y_T%20%5Csim%20%5Cmathcal%7BF%7D(X_T)%20+%20%5Ctextrm%7B(noise)%7D"> at time <img src="https://latex.codecogs.com/png.latex?T">, the distribution <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BQ%7D"> defined with the log-likelihood function <img src="https://latex.codecogs.com/png.latex?g(x)%20=%20%5Clog%20%5Cmathbb%7BP%7D(y_T%20%5Cmid%20X_T=x)"> describes the dynamics of the Markov process <img src="https://latex.codecogs.com/png.latex?X_t"> conditioned on the observation <img src="https://latex.codecogs.com/png.latex?y_T">; we adopt this interpretation in what follows since it is the most common use case and the most intuitive. It is then clear that the normalization constant <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BZ%7D"> is the model evidence <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D(y_T)">, and that the initial distribution <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BQ%7D(x_0)"> of the conditioned process is the conditional law of <img src="https://latex.codecogs.com/png.latex?X_0%20%5Cmid%20y_T">, given by <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D(x_0)%20%5C,%20%5Cmathbb%7BP%7D(y_T%20%5Cmid%20x_0)%20/%20%5Cmathbb%7BP%7D(y_T)">. Doob h-transforms are a powerful tool for describing the dynamics of the conditioned process.</p>
<p>For convenience, let us use the notation <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D_x%5B%5Cldots%5D%20%5Cequiv%20%5Cmathbb%7BE%7D%5B%5Cldots%20%5Cmid%20x_t=x%5D">. For a test function <img src="https://latex.codecogs.com/png.latex?%5Cvarphi:%20%5Cmathcal%7BX%7D%5Cto%20%5Cmathbb%7BR%7D"> and a time increment <img src="https://latex.codecogs.com/png.latex?%5Cdelta%20%3E%200">, we have</p>
<p><span id="eq-def"><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign*%7D%0A%5Cmathbb%7BE%7D%5B%5Cvarphi(x_%7Bt+%5Cdelta%7D)%20%7C%20x_t,%20y_T%5D%20&amp;=%20%5Cmathbb%7BE%7D_%7Bx_t%7D%5B%5Cvarphi(x_%7Bt+%5Cdelta%7D)%20%5C,%20%5Cexp(g(x_T))%20%5D%20%5C,%20/%20%5C,%20%5Cmathbb%7BE%7D_%7Bx_t%7D%5B%5Cexp(g(x_T))%5D%5C%5C%0A&amp;=%20%5Cfrac%7B%20%5Cmathbb%7BE%7D_%7Bx_t%7D%5B%5Cvarphi(x_%7Bt+%5Cdelta%7D)%20%5C,%20h(t+%5Cdelta,%20x_%7Bt+%5Cdelta%7D)%5D%20%7D%7Bh(t,%20x)%7D.%0A%5Cend%7Balign*%7D%0A%5Ctag%7B2%7D"></span></p>
<p>We have introduced the important function <img src="https://latex.codecogs.com/png.latex?h:%5B0,T%5D%20%5Ctimes%20%5Cmathcal%7BX%7D%5Cto%20%5Cmathbb%7BR%7D"> defined as</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ah(t,%20x)%20%5C;%20=%20%5C;%20%5Cmathbb%7BE%7D%20%7B%5Cleft%5B%20%20%5Cexp%5Bg(x_T)%5D%20%5Cmid%20x_t%20=%20x%20%5Cright%5D%7D%20%20%5C;%20=%20%5C;%20%5Cmathbb%7BP%7D(y_T%20%5Cmid%20x_t%20=%20x).%0A"></p>
<p>Denoting by <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7D"> the infinitesimal generator of the Markov process <img src="https://latex.codecogs.com/png.latex?X_t">, one can readily check that the function <img src="https://latex.codecogs.com/png.latex?h"> satisfies the <a href="https://en.wikipedia.org/wiki/Kolmogorov_equations">Kolmogorov equation</a></p>
<p><img src="https://latex.codecogs.com/png.latex?%0A(%5Cpartial_t%20+%20%5Cmathcal%7BL%7D)%20%5C,%20h%20=%200%0A"></p>
<p>with terminal condition <img src="https://latex.codecogs.com/png.latex?h(T,x)%20=%20%5Cexp%5Bg(x)%5D">. Furthermore, we have:</p>
<p><span id="eq-kolmogorov"><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign*%7D%0A%5Cmathbb%7BE%7D_%7Bx_t%7D%5B%5Cvarphi(x_%7Bt+%5Cdelta%7D)%20&amp;%20h(t+%5Cdelta,%20x_%7Bt+%5Cdelta%7D)%20%5D%0A%5C;%20=%20%5C;%0A%5Cvarphi(x_t)%20h(t,%20x_t)%20%5C%5C%0A&amp;+%20%5C;%20%5Cdelta%20%5C,%20(%5Cpartial_t%20+%20%5Cmathcal%7BL%7D)%5Bh%20%5C,%20%5Cvarphi%5D%20%5C,%20(t,%20x_t)%0A%5C;%20+%20%5C;%20o(%5Cdelta).%0A%5Cend%7Balign*%7D%0A%5Ctag%7B3%7D"></span></p>
<p>The infinitesimal generator <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7D%5E%7B%5Cstar%7D"> of the conditioned process is</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7BL%7D%5E%7B%5Cstar%7D%20%5Cvarphi(t,%20x_t)%20=%20%5Clim_%7B%5Cdelta%20%5Cto%200%5E+%7D%20%5C;%20%5Cfrac%7B%5Cmathbb%7BE%7D%5B%5Cvarphi(x_%7Bt+%5Cdelta%7D)%20%7C%20x_t,%20y_T%5D%20-%20%5Cvarphi(x_t)%7D%7B%5Cdelta%7D.%0A"></p>
<p>Plugging Equation&nbsp;3 into Equation&nbsp;2 directly gives</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7BL%7D%5E%7B%5Cstar%7D%20%5Cvarphi%0A%5C;%20=%20%5C;%0A%5Cfrac%7B(%5Cpartial_t%20+%20%5Cmathcal%7BL%7D)%5Bh%20%5C,%20%5Cvarphi%5D%7D%7Bh%7D%0A%5C;%20=%20%5C;%0A%5Cfrac%7B%5Cmathcal%7BL%7D%5B%5Cvarphi%5C,%20h%5D%7D%7Bh%7D%20+%20%5Cvarphi%5Cfrac%7B%5Cpartial_t%20h%7D%7Bh%7D.%0A"></p>
<details>
<summary>
Some details:
</summary>
<p style="color: blue;">
<img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign*%7D%0A%5Cmathbb%7BE%7D%5B%5Cvarphi(x_%7Bt+%5Cdelta%7D)%20%7C%20x_t,%20y_T%5D%0A&amp;=%0A%5Cfrac%7B%20%5Cmathbb%7BE%7D_%7Bx_t%7D%5B%5Cvarphi(x_%7Bt+%5Cdelta%7D)%20%5C,%20h(t+%5Cdelta,%20x_%7Bt+%5Cdelta%7D)%5D%20%7D%7Bh(t,%20x)%7D%5C%5C%0A&amp;=%0A%5Cfrac%7B%20%5Cvarphi(x_t)%20h(t,%20x_t)%20+%20%5Cdelta%20(%5Cpartial_t%20+%20%5Cmathcal%7BL%7D)%5Bh%20%5C,%20%5Cvarphi%5D%20%5C,%20(t,%20x_t)%7D%7Bh(t,%20x_t)%7D%20%20+%20o(%5Cdelta)%5C%5C%0A&amp;=%0A%5Cvarphi(x_t)%20+%20%5Cdelta%20%5C,%20%5Cfrac%7B(%5Cpartial_t%20+%20%5Cmathcal%7BL%7D)%5Bh%20%5C,%20%5Cvarphi%5D%20%5C,%20(t,%20x_t)%7D%7Bh(t,%20x_t)%7D%20+%20o(%5Cdelta).%0A%5Cend%7Balign*%7D%0A"> Since <img src="https://latex.codecogs.com/png.latex?%5Cvarphi"> does not depend on time, we have <img src="https://latex.codecogs.com/png.latex?(%5Cpartial_t%20+%20%5Cmathcal%7BL%7D)%5Bh%20%5C,%20%5Cvarphi%5D%20=%20%5Cvarphi%5C,%20%5Cpartial_t%20h%20+%20%5Cmathcal%7BL%7D%5Bh%20%5C,%20%5Cvarphi%5D"> and this gives the announced result.
</p>
</details>
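<p>This change of measure is easy to verify numerically on a finite state space, where <img src="https://latex.codecogs.com/png.latex?h"> can be computed with a matrix exponential. The following Python sketch (the generator, the observation log-likelihood and the sample sizes are illustrative choices, not taken from the text) compares the marginal of the conditioned chain at an intermediate time, computed through <img src="https://latex.codecogs.com/png.latex?h">, with a Monte-Carlo estimate obtained by reweighting unconditioned paths by <img src="https://latex.codecogs.com/png.latex?%5Cexp%5Bg(x_T)%5D">.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# A 3-state symmetric CTMC with generator Q; condition on a terminal
# log-likelihood g via the change of measure dQ/dP proportional to exp(g(X_T)).
Q = np.array([[-2.0, 1.0, 1.0],
              [1.0, -2.0, 1.0],
              [1.0, 1.0, -2.0]])
g = np.array([0.0, 1.0, -1.0])
T, s, x0 = 1.0, 0.5, 0

w_eig, V = np.linalg.eigh(Q)                 # Q is symmetric here
expmQ = lambda t: (V * np.exp(w_eig * t)) @ V.T

# Exact twisted marginal at time s: Q(X_s = k) = P(X_s = k) h(s, k) / h(0, x0),
# with h(t, x) = E[exp(g(X_T)) | X_t = x] = [expm(Q (T - t)) exp(g)]_x.
h_s = expmQ(T - s) @ np.exp(g)
p_s = expmQ(s)[x0]                           # unconditioned law of X_s
exact = p_s * h_s / (p_s @ h_s)

# Monte-Carlo check: simulate unconditioned paths, weight by exp(g(X_T)).
n_paths = 100_000
X_s = np.empty(n_paths, dtype=int)
X_T = np.empty(n_paths, dtype=int)
for i in range(n_paths):
    x, t, xs = x0, 0.0, x0
    while t < T:
        dt = rng.exponential(0.5)            # holding time: total jump rate is 2
        if t <= s < t + dt:
            xs = x                           # state occupied at time s
        t += dt
        if t < T:
            x = (x + rng.integers(1, 3)) % 3  # jump uniformly to another state
    X_s[i], X_T[i] = xs, x

w = np.exp(g[X_T])                           # importance weights exp(g(X_T))
mc = np.array([(w * (X_s == k)).sum() for k in range(3)]) / w.sum()
print(exact)
print(mc)    # agrees with `exact` up to Monte-Carlo error
```

<p>The two marginals agree up to Monte-Carlo error, as expected.</p>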
<p>The generator <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7D%5E%7B%5Cstar%7D"> describes the dynamics of the conditioned process. In fact, the same computation holds with a more general change of measure of the type <span id="eq-change-gen"><img src="https://latex.codecogs.com/png.latex?%0A%5Ctextcolor%7Bgreen%7D%7B%5Cfrac%7Bd%5Cmathbb%7BQ%7D%7D%7Bd%5Cmathbb%7BP%7D%7D(x_%7B%5B0,T%5D%7D)%20=%20%5Cfrac%7B1%7D%7B%5Cmathcal%7BZ%7D%7D%20%5C,%20%5Cexp%20%7B%5Cleft%5C%7B%20%5Cint_%7B0%7D%5ET%20f(s,%20X_s)%20%5C,%20ds%20+%20g(x_T)%20%5Cright%5C%7D%7D%20%20%7D%0A%5Ctag%7B4%7D"></span></p>
<p>for some function <img src="https://latex.codecogs.com/png.latex?f:%5B0,T%5D%20%5Ctimes%20%5Cmathcal%7BX%7D%5Cto%20%5Cmathbb%7BR%7D">. One can define the function <img src="https://latex.codecogs.com/png.latex?h"> similarly as</p>
<p><span id="eq-h-gen"><img src="https://latex.codecogs.com/png.latex?%0A%5Ctextcolor%7Bgreen%7D%7B%20h(t,%20x_t)%20%5C;%20=%20%5C;%20%5Cmathbb%7BE%7D%20%7B%5Cleft%5B%20%20%5Cexp%20%7B%5Cleft%5C%7B%20%5Cint_%7Bt%7D%5ET%20f(s,%20X_s)%20%5C,%20ds%20+%20g(x_T)%20%5Cright%5C%7D%7D%20%20%5Cmid%20x_t%20%5Cright%5D%7D%20%7D.%0A%5Ctag%7B5%7D"></span></p>
<p>By the <a href="https://en.wikipedia.org/wiki/Feynman–Kac_formula">Feynman-Kac formula</a>, this function satisfies <img src="https://latex.codecogs.com/png.latex?(%5Cpartial_t%20+%20%5Cmathcal%7BL%7D+%20f)%20%5C,%20h%20=%200">, and an entirely similar computation shows that the probability distribution <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BQ%7D"> describes a Markov process with infinitesimal generator</p>
<p><span id="eq-doob-transforms"><img src="https://latex.codecogs.com/png.latex?%0A%5Ctextcolor%7Bgreen%7D%7B%5Cmathcal%7BL%7D%5E%7B%5Cstar%7D%20%5Cvarphi%0A%5C;%20=%20%5C;%0A%5Cfrac%7B(%5Cpartial_t%20+%20%5Cmathcal%7BL%7D+%20f)%5Bh%20%5C,%20%5Cvarphi%5D%7D%7Bh%7D%0A%5C;%20=%20%5C;%0A%5Cfrac%7B%5Cmathcal%7BL%7D%5Bh%20%5C,%20%5Cvarphi%5D%7D%7Bh%7D%20+%20%20%7B%5Cleft(%20%20%5Cfrac%7B%5Cpartial_t%20h%7D%20%7Bh%7D%20+%20f%20%5Cright)%7D%20%20%5C,%20%5Cvarphi.%7D%0A%5Ctag%7B6%7D"></span></p>
<details>
<summary>
Some details:
</summary>
<p style="color: blue;">
An interpretation of the conditioned process is as follows. Suppose for example that every <img src="https://latex.codecogs.com/png.latex?%5Cdelta"> units of time, one observes a noisy measurement <img src="https://latex.codecogs.com/png.latex?y_t"> of the state <img src="https://latex.codecogs.com/png.latex?x_t"> with log-likelihood <img src="https://latex.codecogs.com/png.latex?f(t,%20x_t)%20%5C,%20%5Cdelta">, as well as a final observation at time <img src="https://latex.codecogs.com/png.latex?T"> with log-likelihood <img src="https://latex.codecogs.com/png.latex?g(x_T)">. For example, this could be a stream of noisy measurements with Gaussian noise of variance proportional to <img src="https://latex.codecogs.com/png.latex?1/%5Cdelta"> concluded at time <img src="https://latex.codecogs.com/png.latex?T"> with a final measurement; in other words, one very frequently observes the state with very noisy measurements and finally makes a more precise observation at time <img src="https://latex.codecogs.com/png.latex?T">. In the regime <img src="https://latex.codecogs.com/png.latex?%5Cdelta%20%5Cto%200">, the log-likelihood of that stream <img src="https://latex.codecogs.com/png.latex?y_%7B0:T%7D"> of observations is precisely: <img src="https://latex.codecogs.com/png.latex?%0A%5Cint_0%5ET%20f(s,%20X_s)%20%5C,%20ds%20+%20g(X_T).%0A"> Conditioning on these frequent observations then leads to the change of measure in Equation&nbsp;4. The computations of the conditioned generator then follow similarly as before. 
We have: <img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign*%7D%0A%5Cmathbb%7BE%7D%5B%5Cvarphi(x_%7Bt+%5Cdelta%7D)%20%7C%20x_t,%20y_%7B0:T%7D%5D%20&amp;=%20%5Cmathbb%7BE%7D%5B%5Cvarphi(x_%7Bt+%5Cdelta%7D)%20%7C%20x_t,%20y_%7Bt:T%7D%5D%5C%5C%0A&amp;=%0A%5Cfrac%7B%20%5Cmathbb%7BE%7D_%7Bx_t%7D%5B%5Cvarphi(x_%7Bt+%5Cdelta%7D)%20%5C,%20%5Cexp%20%7B%5Cleft%5C%7B%20%5Cint_%7Bt%7D%5ET%20f(s,%20X_s)%20%5C,%20ds%20+%20g(x_T)%20%5Cright%5C%7D%7D%20%5D%20%7D%7Bh(t,%20x)%7D%0A%5Cend%7Balign*%7D%0A"> where <img src="https://latex.codecogs.com/png.latex?h"> is defined in Equation&nbsp;5. The rest of the computation follows similarly as before. First, express <img src="https://latex.codecogs.com/png.latex?%5Cexp%20%7B%5Cleft%5C%7B%20%5Cint_%7Bt%7D%5ET%20f(s,%20X_s)%20%5C,%20ds%20+%20g(x_T)%20%5Cright%5C%7D%7D%20"> as <img src="https://latex.codecogs.com/png.latex?%0A%5Cexp%20%7B%5Cleft%5C%7B%20%5Cint_%7Bt%7D%5E%7Bt+%5Cdelta%7D%20f(s,%20X_s)%20%5C,%20ds%20%5Cright%5C%7D%7D%20%20%5C,%20%5Cexp%20%7B%5Cleft%5C%7B%20%5Cint_%7Bt+%5Cdelta%7D%5ET%20f(s,%20X_s)%20%5C,%20ds%20+%20g(x_T)%20%5Cright%5C%7D%7D%20,%0A"> and write that <img src="https://latex.codecogs.com/png.latex?%5Cexp%20%7B%5Cleft%5C%7B%20%5Cint_%7Bt%7D%5E%7Bt+%5Cdelta%7D%20f(s,%20X_s)%20%5C,%20ds%20%5Cright%5C%7D%7D%20%20=%201%20+%20f(t,%20x_t)%20%5C,%20%5Cdelta%20+%20o(%5Cdelta)"> for small <img src="https://latex.codecogs.com/png.latex?%5Cdelta">. 
Then condition on <img src="https://latex.codecogs.com/png.latex?x_%7Bt+%5Cdelta%7D"> to obtain: <img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign*%7D%0A%5Cmathbb%7BE%7D%5B%5Cvarphi(x_%7Bt+%5Cdelta%7D)%20%7C%20x_t,%20y_%7Bt:T%7D%5D%0A&amp;=%0A%5Cfrac%7B%20%5Cmathbb%7BE%7D_%7Bx_t%7D%5B%5Cvarphi(x_%7Bt+%5Cdelta%7D)%20%5C,%20(1%20+%20f(t,%20x_t)%20%5C,%20%5Cdelta)%20%5C,%20h(t+%5Cdelta,%20x_%7Bt+%5Cdelta%7D)%5D%20%7D%7Bh(t,%20x)%7D%20+%20o(%5Cdelta)%5C%5C%0A&amp;=%0A%5Cfrac%7B%20%5Cmathbb%7BE%7D_%7Bx_t%7D%5B%5Cvarphi(x_%7Bt+%5Cdelta%7D)%20%5C,%20h(t+%5Cdelta,%20x_%7Bt+%5Cdelta%7D)%5D%20%7D%7Bh(t,%20x)%7D%20+%20%5Cdelta%20%5C,%20%5Cvarphi(x_t)%20%5C,%20f(t,%20x_t)%20+%20o(%5Cdelta).%0A%5Cend%7Balign*%7D%0A"> The rest of the computations are then identical to before.
</p>
</details>
<p>To see how this works, let us consider a few examples:</p>
<section id="general-diffusions" class="level3">
<h3 class="anchored" data-anchor-id="general-diffusions">General diffusions</h3>
<p>Consider a diffusion process</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AdX%20=%20b(X)%20%5C,%20dt%20+%20%5Csigma(X)%20%5C,%20dW%0A"></p>
<p>with generator <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7D%5Cvarphi=%20b%20%5Cnabla%20%5Cvarphi+%20%5Ctfrac12%20%5C,%20%5Csigma%20%5Csigma%5E%5Ctop%20:%20%5Cnabla%5E2%20%5Cvarphi"> and initial distribution <img src="https://latex.codecogs.com/png.latex?%5Cmu_0(dx)">. We are interested in describing the dynamics of the “conditioned” process given by the probability distribution <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BQ%7D"> defined in Equation&nbsp;4. A direct computation from Equation&nbsp;6 then shows that</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7BL%7D%5E%5Cstar%20%5Cvarphi%5C;%20=%20%5C;%20%5Cmathcal%7BL%7D%5Cvarphi+%20%5Cunderbrace%7B%5Cfrac%7B%5Cvarphi%5C,%20(%5Cpartial_t%20+%20%5Cmathcal%7BL%7D+%20f)%5Bh%5D%7D%7Bh%7D%7D_%7B=0%7D%0A+%20%5Csigma%20%5C,%20%5Csigma%5E%5Ctop%20%5C,%20(%5Cnabla%20%5Clog%20h)%20%5C,%20%5Cnabla%20%5Cvarphi%0A"></p>
<p>where the function <img src="https://latex.codecogs.com/png.latex?h"> is described in Equation&nbsp;5. Since <img src="https://latex.codecogs.com/png.latex?(%5Cpartial_t%20+%20%5Cmathcal%7BL%7D+%20f)%20%5C,%20h%20=%200">, this reveals that the probability distribution <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BQ%7D"> describes a diffusion process <img src="https://latex.codecogs.com/png.latex?X%5E%5Cstar"> with dynamics</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AdX%5E%5Cstar%20=%20b(X%5E%5Cstar)%20%5C,%20dt%20+%20%5Csigma(X%5E%5Cstar)%20%5C,%20%20%7B%5Cleft%5C%7B%20%20dW%20+%20%20%5Ctextcolor%7Bblue%7D%7Bu%5E%5Cstar(t,%20X%5E%5Cstar)%7D%20%5C,%20dt%20%5Cright%5C%7D%7D%20.%0A"></p>
<p>The additional drift term <img src="https://latex.codecogs.com/png.latex?%5Csigma(X%5E%5Cstar)%20%5C,%20%20%5Ctextcolor%7Bblue%7D%7Bu%5E%5Cstar(t,%20X%5E%5Cstar)%7D%20%5C,%20dt"> involves a “control” <img src="https://latex.codecogs.com/png.latex?%20%5Ctextcolor%7Bblue%7D%7Bu%5E%5Cstar(t,%20X%5E%5Cstar)%7D"> with <span id="eq-diffusion-doob"><img src="https://latex.codecogs.com/png.latex?%0A%5Ctextcolor%7Bblue%7D%7Bu%5E%5Cstar(t,%20x)%20=%20%5Csigma%5E%5Ctop(x)%20%5C,%20%5Cnabla%20%5Clog%20h(t,%20x)%7D.%0A%5Ctag%7B7%7D"></span></p>
<p>Note that the initial distribution of the conditioned process is</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmu_0%5E%5Cstar(dx)%20=%20%5Cfrac%7B1%7D%7B%5Cmathcal%7BZ%7D%7D%20%5C,%20%5Cmu_0(dx)%20%5C,%20h(0,x).%0A"></p>
<p>Unfortunately, apart from a few straightforward cases such as a Brownian motion or an Ornstein-Uhlenbeck process, the function <img src="https://latex.codecogs.com/png.latex?h"> is generally intractable. Several numerical methods are nonetheless available to approximate it.</p>
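<p>One of the tractable cases is worth making concrete. The sketch below assumes a standard scalar Brownian motion started at zero and a Gaussian observation y ~ N(X_T, s2) (the values of y, s2 and the discretization are illustrative choices); in that case h(t, x) = N(y; x, s2 + T - t), so the control is u*(t, x) = (y - x) / (s2 + T - t), and the terminal marginal of the simulated conditioned process can be compared with the exact Gaussian posterior.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
T, s2, y = 1.0, 1.0, 2.0        # horizon, observation noise variance, observation
n_paths, n_steps = 20_000, 1_000
dt = T / n_steps

# Doob drift for a standard Brownian motion started at 0 with a Gaussian
# observation y ~ N(X_T, s2): here h(t, x) = N(y; x, s2 + T - t), so that
# u*(t, x) = grad log h(t, x) = (y - x) / (s2 + T - t).
x = np.zeros(n_paths)
for k in range(n_steps):
    t = k * dt
    x += (y - x) / (s2 + T - t) * dt + np.sqrt(dt) * rng.standard_normal(n_paths)

# Exact conditional law: X_T | y ~ N(y T / (T + s2), T s2 / (T + s2)).
print(x.mean(), x.var())   # close to 1.0 and 0.5 for these parameter values
```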
</section>
<section id="brownian-bridge" class="level3">
<h3 class="anchored" data-anchor-id="brownian-bridge">Brownian bridge</h3>
<p>What about a Brownian motion in <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BR%7D%5ED"> conditioned to hit the state <img src="https://latex.codecogs.com/png.latex?x_%5Cstar%20%5Cin%20%5Cmathbb%7BR%7D%5ED"> at time <img src="https://latex.codecogs.com/png.latex?t=T">, i.e.&nbsp;a <a href="https://en.wikipedia.org/wiki/Brownian_bridge">Brownian bridge</a>? In that case, the function <img src="https://latex.codecogs.com/png.latex?h"> is given by</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ah(t,x)%20=%20%5Cmathbb%7BP%7D(B_T%20=%20x_%5Cstar%20%5Cmid%20B_t%20=%20x)%0A=%0A%5Cexp%20%7B%5Cleft%5C%7B%20-%5Cfrac%7B%5C%7Cx-x_%5Cstar%5C%7C%5E2%7D%7B2%20%5C,%20(T-t)%7D%20%5Cright%5C%7D%7D%20%20/%20%5Cmathcal%7BZ%7D_%7BT-t%7D%0A"></p>
<p>for some irrelevant normalization constant <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BZ%7D_%7BT-t%7D"> that only depends on <img src="https://latex.codecogs.com/png.latex?T-t">. Plugging this into Equation&nbsp;7 gives that the conditioned Brownian motion <img src="https://latex.codecogs.com/png.latex?X%5E%7B%5Cstar%7D"> has dynamics</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AdX%5E%5Cstar%20%5C;=%5C;%20%20%5Ctextcolor%7Bblue%7D%7B-%20%5Cfrac%7BX%5E%5Cstar%20-%20x_%5Cstar%7D%7BT-t%7D%20%5C,%20dt%7D%20+%20dB.%0A"></p>
<p>The additional drift term <img src="https://latex.codecogs.com/png.latex?-(X%5E%5Cstar%20-%20x_%5Cstar)/(T-t)"> is intuitive: it points towards <img src="https://latex.codecogs.com/png.latex?x_%5Cstar"> and becomes increasingly strong as <img src="https://latex.codecogs.com/png.latex?t%20%5Cto%20T">.</p>
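<p>This drift is easy to test numerically. The following sketch (a plain Euler-Maruyama discretization; the endpoint, horizon, step size and sample counts are illustrative choices) simulates the bridge dynamics and checks that every path is pinned at the target at time <img src="https://latex.codecogs.com/png.latex?T">, and that the marginal at the mid-time matches the known bridge law N(x_star/2, T/4) for a bridge started at zero.</p>

```python
import numpy as np

rng = np.random.default_rng(1)
T, x_star = 1.0, 1.5
n_paths, n_steps = 20_000, 1_000
dt = T / n_steps

# Brownian bridge dynamics: dX = -(X - x_star) / (T - t) dt + dB, from X_0 = 0.
x = np.zeros(n_paths)
mid = None
for k in range(n_steps):
    t = k * dt
    x += -(x - x_star) / (T - t) * dt + np.sqrt(dt) * rng.standard_normal(n_paths)
    if k + 1 == n_steps // 2:
        mid = x.copy()           # snapshot of the marginal at t = T/2

print((x - x_star).std())        # ~ sqrt(dt): every path is pinned at x_star
print(mid.mean(), mid.var())     # bridge marginal at T/2: N(x_star/2, T/4)
```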
</section>
<section id="positive-brownian-motion" class="level3">
<h3 class="anchored" data-anchor-id="positive-brownian-motion">Positive Brownian motion</h3>
<p>What about a scalar Brownian motion conditioned to stay positive at all times? Let us fix a time horizon <img src="https://latex.codecogs.com/png.latex?T">, condition first on the event that the Brownian motion stays positive within <img src="https://latex.codecogs.com/png.latex?%5B0,T%5D">, and later consider the limit <img src="https://latex.codecogs.com/png.latex?T%20%5Cto%20%5Cinfty">. The function <img src="https://latex.codecogs.com/png.latex?h"> reads</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ah(t,x)%20=%20%5Cmathbb%7BP%7D%20%7B%5Cleft(%20%5Ctext%7B$B_t$%20stays%20$%3E0$%20on%20$%5Bt,T%5D$%7D%20%5Cmid%20B_t=x%20%5Cright)%7D%20.%0A"></p>
<p>This can easily be calculated with the <a href="https://en.wikipedia.org/wiki/Reflection_principle_(Wiener_process)">reflection principle</a>. It equals</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ah(t,x)%20=%201%20-%202%20%5C,%20%5Cmathbb%7BP%7D(B_T%20%3C%200%20%5Cmid%20B_t%20=%20x)%0A=%0A%5Cmathbb%7BP%7D(%5Csqrt%7BT-t%7D%20%5C,%20%5C%7C%20%5Cxi%20%5C%7C%20%3C%20x)%0A"></p>
<p>for a standard Gaussian <img src="https://latex.codecogs.com/png.latex?%5Cxi%20%5Csim%20%5Cmathcal%7BN%7D(0,1)">. Plugging this into Equation&nbsp;7 gives that the additional drift term is</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cnabla%20%5Clog%20h(t,x)%20=%20%5Cfrac%7B%5Cexp%20%7B%5Cleft(%20-x%5E2%20/%20(2%20%5C,%20(T-t))%20%5Cright)%7D%20%7D%7Bx%7D%20%5Cquad%20%5Cto%20%5Cquad%20%5Cfrac%7B1%7D%7Bx%7D%0A"></p>
<p>as <img src="https://latex.codecogs.com/png.latex?T%20%5Cto%20%5Cinfty">. This shows that a Brownian motion conditioned to stay positive at all times has an upward drift of size <img src="https://latex.codecogs.com/png.latex?1/x">,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AdX%5E%5Cstar%20%5C;=%5C;%20%5Cfrac%7B1%7D%7BX%5E%7B%5Cstar%7D%7D%20%5C,%20dt%20+%20dB.%0A"></p>
<p>Incidentally, these are the dynamics of a <a href="https://en.wikipedia.org/wiki/Bessel_process">Bessel process</a> of dimension <img src="https://latex.codecogs.com/png.latex?d=3">, i.e.&nbsp;the law of the modulus of a three-dimensional Brownian motion. More generally, if one conditions a Brownian motion to stay within a closed domain <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BD%7D">, the conditioned dynamics exhibit a repulsive drift term of size about <img src="https://latex.codecogs.com/png.latex?1/%5Ctextrm%7Bdist%7D(x,%20%5Cpartial%20%5Cmathcal%7BD%7D)"> near the boundary <img src="https://latex.codecogs.com/png.latex?%5Cpartial%20%5Cmathcal%7BD%7D"> of the domain, as described below.</p>
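<p>The identification with the Bessel process can be checked by simulation. The sketch below (Euler-Maruyama with an absolute-value guard against rare discretization overshoots through zero; the starting point, horizon and step sizes are illustrative choices) compares the conditioned diffusion with the modulus of a three-dimensional Brownian motion started at a point of the same norm.</p>

```python
import numpy as np

rng = np.random.default_rng(2)
T, x0 = 1.0, 1.0
n_paths, n_steps = 20_000, 1_000
dt = T / n_steps

# Brownian motion conditioned to stay positive: dX = dt / X + dB.
x = np.full(n_paths, x0)
for _ in range(n_steps):
    x += dt / x + np.sqrt(dt) * rng.standard_normal(n_paths)
    x = np.abs(x)        # guard against rare discretization overshoots below 0

# Same law as the modulus of a 3D Brownian motion started at (x0, 0, 0).
w = np.array([x0, 0.0, 0.0]) + np.sqrt(T) * rng.standard_normal((n_paths, 3))
print(x.mean(), np.linalg.norm(w, axis=1).mean())   # the two means agree
```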
</section>
<section id="brownian-motion-staying-in-a-domain" class="level3">
<h3 class="anchored" data-anchor-id="brownian-motion-staying-in-a-domain">Brownian motion staying in a domain</h3>
<p>What about a Brownian motion conditioned to stay within a domain <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BD%7D"> forever? As before, consider a time horizon <img src="https://latex.codecogs.com/png.latex?T"> and define the function <img src="https://latex.codecogs.com/png.latex?h"> as</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ah(t,x)%20=%20%5Cmathbb%7BP%7D%20%7B%5Cleft(%20%5Ctext%7B$B_t$%20stays%20in%20$%5Cmathcal%7BD%7D$%20on%20$%5Bt,T%5D$%7D%20%5Cmid%20B_t=x%20%5Cright)%7D%20.%0A"></p>
<p>One can see that the function <img src="https://latex.codecogs.com/png.latex?h"> satisfies the PDE</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A(%5Cpartial_t%20+%20%5Ctfrac12%20%5C,%20%5CDelta)%20%5C,%20h%20=%200%0A"></p>
<p>and equals zero on the boundary <img src="https://latex.codecogs.com/png.latex?%5Cpartial%20%5Cmathcal%7BD%7D"> of the domain. Furthermore, <img src="https://latex.codecogs.com/png.latex?h(t,x)%20%5Cto%201"> as <img src="https://latex.codecogs.com/png.latex?t%20%5Cto%20T"> for all <img src="https://latex.codecogs.com/png.latex?x%20%5Cin%20%5Cmathcal%7BD%7D">. Consider the eigenfunctions <img src="https://latex.codecogs.com/png.latex?%5Cpsi_k"> of the negative Laplacian <img src="https://latex.codecogs.com/png.latex?-%5CDelta"> with Dirichlet boundary conditions on <img src="https://latex.codecogs.com/png.latex?%5Cpartial%20%5Cmathcal%7BD%7D">. Recall that <img src="https://latex.codecogs.com/png.latex?-%5CDelta"> is a positive operator with a discrete spectrum <img src="https://latex.codecogs.com/png.latex?%5Clambda_1%20%5Cleq%20%5Clambda_2%20%5Cleq%20%5Cldots"> of non-negative eigenvalues. The eigenfunction <img src="https://latex.codecogs.com/png.latex?%5Cpsi_1"> corresponding to the smallest eigenvalue <img src="https://latex.codecogs.com/png.latex?%5Clambda_1"> is called the principal eigenfunction; it is standard that it is positive within the domain <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BD%7D">, as a “slight” generalization of the <a href="https://en.wikipedia.org/wiki/Perron–Frobenius_theorem">Perron-Frobenius theorem</a> from linear algebra shows. Expanding <img src="https://latex.codecogs.com/png.latex?h"> in the basis of eigenfunctions <img src="https://latex.codecogs.com/png.latex?%5Cpsi_k"> gives</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ah(t,x)%20=%20%5Cunderbrace%7Bc_1%20%5C,%20e%5E%7B-%5Clambda_1%20%5C,%20(T-t)/2%7D%20%5C,%20%5Cpsi_1(x)%7D_%7B%5Ctextrm%7Bdominant%20contribution%7D%7D%20+%20%5Csum_%7Bk%20%5Cgeq%202%7D%20c_k%20%5C,%20e%5E%7B-%5Clambda_k%20%5C,%20(T-t)/2%7D%20%5C,%20%5Cpsi_k(x).%0A"></p>
<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/doob_transforms/eigenfunctions.jpg" class="img-fluid figure-img" style="width:90.0%"></p>
<figcaption>Eigenfunctions of the Laplacian</figcaption>
</figure>
</div>
</div>
<p>Since we are interested in the regime <img src="https://latex.codecogs.com/png.latex?T%20%5Cto%20%5Cinfty">, it holds that</p>
<p><img src="https://latex.codecogs.com/png.latex?%20%5Cnabla_x%20%5Clog%20h(t,x)%20%5C;%20%5Cto%20%5C;%20%5Cnabla%20%5Clog%20%5Cpsi_1(x)."></p>
<p>This shows that the conditioned Brownian motion has a drift term expressed in terms of the principal eigenfunction <img src="https://latex.codecogs.com/png.latex?%5Cpsi_1"> of the Laplacian:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AdX%5E%5Cstar%20%5C;=%5C;%20%20%5Ctextcolor%7Bblue%7D%7B%20%5Cnabla%20%5Clog%20%5Cpsi_1(X%5E%5Cstar)%20%5C,%20dt%7D%20+%20dB.%0A"></p>
<p>For example, if <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BD%7D%5Cequiv%20%5B0,L%5D"> for a 1D Brownian motion, the principal eigenfunction is <img src="https://latex.codecogs.com/png.latex?%5Cpsi_1(x)%20=%20%5Csin(%5Cpi%20%5C,%20x%20/L)">. This shows that there is an upward drift of size <img src="https://latex.codecogs.com/png.latex?%5Csim%201/x"> near <img src="https://latex.codecogs.com/png.latex?x%20%5Capprox%200"> and a downward drift of size <img src="https://latex.codecogs.com/png.latex?%5Csim%201/(L-x)"> near <img src="https://latex.codecogs.com/png.latex?x%20%5Capprox%20L">.</p>
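<p>A short simulation illustrates this last example. The sketch below runs the conditioned diffusion on <img src="https://latex.codecogs.com/png.latex?%5B0,L%5D"> with drift grad log psi_1(x) = (pi/L) / tan(pi x / L); the drift cap and the boundary reflection are purely numerical safeguards against the blow-up of the drift near the boundary, not part of the exact dynamics, and all numerical values are illustrative choices.</p>

```python
import numpy as np

rng = np.random.default_rng(3)
L, T = 1.0, 2.0
n_paths, dt = 4_000, 1e-3
n_steps = int(T / dt)

# Brownian motion conditioned to stay in [0, L] forever:
# drift = grad log psi_1(x) with psi_1(x) = sin(pi x / L).
x = np.full(n_paths, L / 2)
for _ in range(n_steps):
    drift = (np.pi / L) / np.tan(np.pi * x / L)
    drift = np.clip(drift, -100.0, 100.0)   # tame the boundary blow-up
    x += drift * dt + np.sqrt(dt) * rng.standard_normal(n_paths)
    x = np.abs(x)                           # reflect rare overshoots below 0
    x = L - np.abs(L - x)                   # reflect rare overshoots above L

# The conditioned process equilibrates to the density (2/L) sin^2(pi x / L).
print(x.min(), x.max())     # all paths remain inside [0, L]
print(x.mean())             # close to L/2 by symmetry
```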


</section>

 ]]></description>
  <category>SDE</category>
  <category>markov</category>
  <guid>https://alexxthiery.github.io/notes/doob_transforms/doob.html</guid>
  <pubDate>Mon, 13 May 2024 17:00:00 GMT</pubDate>
</item>
<item>
  <title>RWM &amp; HMC on manifolds</title>
  <link>https://alexxthiery.github.io/notes/MCMC_on_manifold/mcmc_manifold.html</link>
  <description><![CDATA[ 





<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/MCMC_on_manifold/mcmc_manifold.jpg" class="img-fluid quarto-figure quarto-figure-center figure-img" style="width:80.0%"></p>
</figure>
</div>
</div>
<p>Consider a smooth manifold <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BM%7D%5Csubset%20%5Cmathbb%7BR%7D%5En"> of dimension <img src="https://latex.codecogs.com/png.latex?d_%7B%5Cmathcal%7BM%7D%7D%20=%20(n-d)"> defined as the zero set of a well-behaved “constraint” function <img src="https://latex.codecogs.com/png.latex?C:%20%5Cmathbb%7BR%7D%5En%20%5Cto%20%5Cmathbb%7BR%7D%5Ed">,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7BM%7D=%20%5C%7B%20x%20%5Cin%20%5Cmathbb%7BR%7D%5En%20%5C;%20%5Ctext%7Bsuch%20that%7D%20%5C;%20C(x)%20=%200%20%5C%7D.%0A"></p>
<p>We would like to use MCMC to sample from a probability distribution supported on <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BM%7D"> with density <img src="https://latex.codecogs.com/png.latex?%5Cpi(x)"> with respect to the uniform <a href="https://en.wikipedia.org/wiki/Hausdorff_measure">Hausdorff measure</a> on <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BM%7D">. It is relatively straightforward to adapt standard MCMC methods when dealing with simple manifolds such as a sphere or a torus since their geodesics and several other geometric quantities are analytically tractable. Perhaps surprisingly, it is also relatively straightforward to design MCMC samplers on general implicitly defined manifolds such as <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BM%7D">. The article <span class="citation" data-cites="zappa2018monte">(Zappa, Holmes-Cerfon, and Goodman 2018)</span> explains these ideas beautifully.</p>
<section id="manifold-random-walk-metropolis-hastings" class="level3">
<h3 class="anchored" data-anchor-id="manifold-random-walk-metropolis-hastings">Manifold Random Walk Metropolis-Hastings</h3>
<p>Assume that <img src="https://latex.codecogs.com/png.latex?x_n%20%5Cin%20%5Cmathcal%7BM%7D"> is the current position of the MCMC chain. To generate a proposal <img src="https://latex.codecogs.com/png.latex?y_n%20%5Cin%20%5Cmathcal%7BM%7D"> that will eventually be accepted or rejected, one can proceed very similarly to the standard RWM algorithm with Gaussian perturbations with variance <img src="https://latex.codecogs.com/png.latex?%5Csigma%5E2">. First, generate a vector <img src="https://latex.codecogs.com/png.latex?v%20%5Cin%20T_%7Bx_n%7D"> from a centred Gaussian distribution with covariance <img src="https://latex.codecogs.com/png.latex?%5Csigma%5E2%20%5C,%20I"> on the tangent space <img src="https://latex.codecogs.com/png.latex?T_%7Bx_n%7D"> to <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BM%7D"> at <img src="https://latex.codecogs.com/png.latex?x_n">. To do so, it suffices for example to generate a Gaussian vector <img src="https://latex.codecogs.com/png.latex?z%20%5Csim%20%5Cmathcal%7BN%7D(0,%20%5Csigma%5E2%20I_n)"> in <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BR%7D%5En"> and orthogonally project it onto <img src="https://latex.codecogs.com/png.latex?T_%7Bx_n%7D">. One cannot, however, simply define the proposal as <img src="https://latex.codecogs.com/png.latex?x_n%20+%20v"> since it would not necessarily lie on <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BM%7D">. Instead, one projects <img src="https://latex.codecogs.com/png.latex?x_n%20+%20v"> back to <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BM%7D">. To do so, one needs to choose the direction used for the projection; the manifold RWM algorithm uses <img src="https://latex.codecogs.com/png.latex?T_%7Bx_n%7D%5E%5Cperp">, for reasons that will become clear later. 
In other words, the proposal <img src="https://latex.codecogs.com/png.latex?y_n"> is obtained by seeking a vector <img src="https://latex.codecogs.com/png.latex?w%20%5Cin%20T_%7Bx_n%7D%5E%7B%5Cperp%7D"> such that <img src="https://latex.codecogs.com/png.latex?x_n%20+%20v%20+%20w%20%5Cin%20%5Cmathcal%7BM%7D">.</p>
<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/MCMC_on_manifold/projection_onto_M.jpg" class="img-fluid figure-img" style="width:80.0%"></p>
<figcaption>Projection onto <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BM%7D"> from <span class="citation" data-cites="zappa2018monte">(Zappa, Holmes-Cerfon, and Goodman 2018)</span></figcaption>
</figure>
</div>
</div>
<p>If one calls <img src="https://latex.codecogs.com/png.latex?J_%7Bx_n%7D"> the Jacobian matrix of <img src="https://latex.codecogs.com/png.latex?C"> at <img src="https://latex.codecogs.com/png.latex?x_n">, i.e.&nbsp;the matrix whose <strong>rows</strong> are the gradients of the components of <img src="https://latex.codecogs.com/png.latex?C">, this projection operation boils down to finding a vector <img src="https://latex.codecogs.com/png.latex?%5Clambda%20%5Cin%20%5Cmathbb%7BR%7D%5Ed"> such that</p>
<p><span id="eq-projection"><img src="https://latex.codecogs.com/png.latex?%0AC(%20%5C,%20x_n%20+%20v%20+%20J_%7Bx_n%7D%5E%5Ctop%20%5Clambda)%20=%200%20%5Cin%20%5Cmathbb%7BR%7D%5Ed.%0A%5Ctag%7B1%7D"></span></p>
<p>Note that Equation&nbsp;1 is a non-linear equation in <img src="https://latex.codecogs.com/png.latex?%5Clambda"> that can have no solution, one solution or many solutions – this can seem like a fundamental roadblock to the design of a valid MCMC algorithm, but we will see that it is not! Before discussing the resolution of Equation&nbsp;1 in slightly more detail, assume that a standard root-finding algorithm takes the pair <img src="https://latex.codecogs.com/png.latex?(x_n+v,%20J_%7Bx_n%7D)"> as input and attempts to produce the projection <img src="https://latex.codecogs.com/png.latex?y_n">,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Ctext%7BProj%7D:%20%5Cquad%20(x_n+v,%20J_%7Bx_n%7D)%20%5C;%20%5Cunderbrace%7B%5Cmapsto%7D_%7B%5Ctext%7Broot-finding%7D%7D%20%5C;%20y_n%20%5Cin%20%5Cmathcal%7BM%7D.%0A"></p>
<p>The algorithm will either converge to one of the possible solutions or fail. If the algorithm fails to converge, one simply sets <img src="https://latex.codecogs.com/png.latex?y_n%20=%20%5Ctext%7B(Failed)%7D">, rejects the proposal, and sets <img src="https://latex.codecogs.com/png.latex?x_%7Bn+1%7D%20=%20x_n">. If the algorithm converges, this defines a valid proposal <img src="https://latex.codecogs.com/png.latex?y_n%20%5Cin%20%5Cmathcal%7BM%7D">. To ensure reversibility – and this is one of the main novelties of the article <span class="citation" data-cites="zappa2018monte">(Zappa, Holmes-Cerfon, and Goodman 2018)</span> – one needs to verify that the reverse proposal <img src="https://latex.codecogs.com/png.latex?y_n%20%5Cmapsto%20x_n"> is possible.</p>
<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/MCMC_on_manifold/reverse_mcmc.jpg" class="img-fluid figure-img" style="width:80.0%"></p>
<figcaption>Reversibility check <span class="citation" data-cites="zappa2018monte">(Zappa, Holmes-Cerfon, and Goodman 2018)</span></figcaption>
</figure>
</div>
</div>
<p>To do so, note that the only possibility for the reverse move <img src="https://latex.codecogs.com/png.latex?y_n%20%5Cto%20x_n"> to happen is if <img src="https://latex.codecogs.com/png.latex?x_n%20=%20%5Ctext%7BProj%7D(y_n%20+%20v',%20J_%7By_n%7D)"> where</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ax_n-y_n%20%5C;=%5C;%20%5Cunderbrace%7Bv'%7D_%7B%5Cin%20T_%7By_n%7D%7D%20%20%5C,%20+%20%5C,%20%5Cunderbrace%7Bw'%7D_%7B%5Cin%20T_%7By_n%7D%5E%7B%5Cperp%7D%7D.%0A"></p>
<p>The uniqueness follows from the decomposition <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BR%7D%5En%20%5Cequiv%20T_%7By_n%7D%20%5Coplus%20T_%7By_n%7D%5E%7B%5Cperp%7D">. The reverse move is consequently possible if and only if the following <strong>reversibility check</strong> condition is satisfied,</p>
<p><span id="eq-reversibility"><img src="https://latex.codecogs.com/png.latex?%0Ax_n%20=%20%5Ctext%7BProj%7D(y_n%20+%20v',%20J_%7By_n%7D).%0A%5Ctag%7B2%7D"></span></p>
<p>This reversibility check is necessary as it is not guaranteed that the root-finding algorithm started from <img src="https://latex.codecogs.com/png.latex?y_n%20+%20v'"> converges at all, or converges to <img src="https://latex.codecogs.com/png.latex?x_n"> in the case when there are several solutions. If Equation&nbsp;2 is not satisfied, the proposal <img src="https://latex.codecogs.com/png.latex?y_n"> is rejected and one sets <img src="https://latex.codecogs.com/png.latex?x_%7Bn+1%7D%20=%20x_n">. On the other hand, if Equation&nbsp;2 is satisfied, the proposal <img src="https://latex.codecogs.com/png.latex?y_n"> is accepted with the usual Metropolis-Hastings probability</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmin%20%5Cleft%5C%7B1,%20%5Cfrac%7B%5Cpi(y_n)%20%5C,%20p(v'%7Cy_n)%7D%7B%5Cpi(x_n)%20%5C,%20p(v%7Cx_n)%7D%20%5Cright%5C%7D%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?p(v%7Cx)%20=%20Z%5E%7B-1%7D%20%5C,%20%5Cexp(-%5C%7Cv%5C%7C%5E2%20/%202%20%5Csigma%5E2)"> denotes the Gaussian density on the tangent space <img src="https://latex.codecogs.com/png.latex?T_%7Bx_n%7D">. The above description defines a valid MCMC algorithm on <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BM%7D"> that is reversible with respect to the target distribution <img src="https://latex.codecogs.com/png.latex?%5Cpi(x)">.</p>
</section>
<section id="projection-onto-the-manifold" class="level3">
<h3 class="anchored" data-anchor-id="projection-onto-the-manifold">Projection onto the manifold</h3>
<p>As described above, the main difficulty is to solve the non-linear Equation&nbsp;1 describing the projection of the proposal <img src="https://latex.codecogs.com/png.latex?(x_n%20+%20v)"> back onto the manifold <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BM%7D">. The projection is along the space spanned by the columns of <img src="https://latex.codecogs.com/png.latex?J_%7Bx_n%7D%5E%5Ctop%20%5Cin%20%5Cmathbb%7BR%7D%5E%7Bn,d%7D">, i.e.&nbsp;one needs to find a vector <img src="https://latex.codecogs.com/png.latex?%5Clambda%20%5Cin%20%5Cmathbb%7BR%7D%5Ed"> such that</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5CPhi(%5Clambda)%20=%20C(%20%5C,%20x_n%20+%20v%20+%20J_%7Bx_n%7D%5E%5Ctop%20%5Clambda)%20=%200%20%5Cin%20%5Cmathbb%7BR%7D%5Ed.%0A"></p>
<p>One can use a standard Newton method, started from <img src="https://latex.codecogs.com/png.latex?%5Clambda_0=0">, to solve this equation. Setting for notational convenience <img src="https://latex.codecogs.com/png.latex?q(%5Clambda)%20=%20x_n%20+%20v%20+%20J_%7Bx_n%7D%5ET%20%5C,%20%5Clambda">, this boils down to iterating</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Clambda_%7Bk+1%7D%20-%20%5Clambda_%7Bk%7D%0A=%0A-%20%5Cleft(%20J_%7Bq(%5Clambda_k)%7D%20%5C,%20J_%7Bx_n%7D%5E%5Ctop%20%5Cright)%5E%7B-1%7D%20%5C,%20%5CPhi(%5Clambda_k).%0A"></p>
<p>As described in <span class="citation" data-cites="barth1995algorithms">(Barth et al. 1995)</span>, it can sometimes be computationally advantageous to use a quasi-Newton method and iterate instead</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Clambda_%7Bk+1%7D%20-%20%5Clambda_%7Bk%7D%0A=%0A-%20G%5E%7B-1%7D%20%5C,%20%5CPhi(%5Clambda_k)%0A"></p>
<p>with <strong>fixed</strong> positive definite matrix <img src="https://latex.codecogs.com/png.latex?G%20=%20J_%7Bx_n%7D%20%5C,%20J_%7Bx_n%7D%5E%5Ctop"> since one can then pre-compute a decomposition of <img src="https://latex.codecogs.com/png.latex?G"> and use it to solve the linear systems at each iteration. In some recent and related work <span class="citation" data-cites="au2020manifold">(Au, Graham, and Thiery 2022)</span>, we observed that the standard Newton method performed well in the settings we considered and that there was usually no computational advantage to using a quasi-Newton method. In practice, the main computational bottleneck is computing the Jacobian matrix <img src="https://latex.codecogs.com/png.latex?J_%7Bx_n%7D">, although this is problem-dependent and some structure can typically be exploited. Typically, only a relatively small number of iterations is performed and the root-finding algorithm is stopped as soon as <img src="https://latex.codecogs.com/png.latex?%5C%7C%5CPhi(%5Clambda_k)%5C%7C"> falls below a certain threshold. If the step-size is small, i.e.&nbsp;<img src="https://latex.codecogs.com/png.latex?%5C%7Cv%5C%7C%20%5Cll%201">, Newton’s method will typically converge to a solution in only a very small number of iterations – indeed, <a href="https://en.wikipedia.org/wiki/Newton%27s_method">Newton’s method</a> is quadratically convergent when close to a solution.</p>
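<p>The Newton iteration above is straightforward to implement. Here is a minimal plain-NumPy sketch (the experiments in this note use JAX; the helper names <code>project</code>, <code>C</code> and <code>jac</code> are illustrative, not from a library):</p>

```python
import numpy as np

def project(C, jac, x, v, max_iter=25, tol=1e-10):
    """Newton projection of x + v back onto the manifold {C = 0}.

    jac(q) returns the (d, n) Jacobian of C at q; the search is constrained
    to the span of the columns of jac(x)^T, as in Equation 1.
    Returns (point, converged). Illustrative sketch, not library code.
    """
    Jx = np.atleast_2d(jac(x))
    lam = np.zeros(Jx.shape[0])
    z = x + v
    for _ in range(max_iter):
        q = z + Jx.T @ lam
        Phi = np.atleast_1d(C(q))
        if np.linalg.norm(Phi) < tol:
            return q, True
        # Newton update: lam <- lam - (J_q J_x^T)^{-1} Phi(lam)
        Jq = np.atleast_2d(jac(q))
        lam = lam - np.linalg.solve(Jq @ Jx.T, Phi)
    return z + Jx.T @ lam, False
```

<p>Replacing <code>jac(q)</code> inside the loop by the fixed <code>jac(x)</code> gives the quasi-Newton variant with the pre-computable matrix <code>G</code>.</p>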
<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/MCMC_on_manifold/RWM_double_torus.gif" class="img-fluid figure-img" style="width:90.0%"></p>
<figcaption>30k RWM chains run in parallel to explore a double torus.</figcaption>
</figure>
</div>
</div>
<p>In the figure above, I have used the RWM algorithm described earlier to sample from the uniform distribution supported on a double torus defined by the constraint function <img src="https://latex.codecogs.com/png.latex?C:%20%5Cmathbb%7BR%7D%5E3%20%5Cto%20%5Cmathbb%7BR%7D"> given as</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AC(x,y,z)%20=%20(x%5E2%20%5C,%20(x%5E2%20-%201)%20+%20y%5E2)%5E2+z%5E2-0.03.%0A"></p>
<p>The figure shows <img src="https://latex.codecogs.com/png.latex?30,000"> chains run in parallel, which is straightforward to implement in practice with JAX <span class="citation" data-cites="jax2018github">(Bradbury et al. 2018)</span>. All the chains are initialized from the same position so that one can visualize the evolution of the density of particles.</p>
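<p>For reference, the constraint function and its gradient can be written out explicitly; this is a plain-NumPy sketch with a hand-written gradient (the animation itself relies on JAX, which obtains the Jacobian by automatic differentiation):</p>

```python
import numpy as np

def C(p):
    """Double-torus constraint C(x, y, z) from the note."""
    x, y, z = p
    u = x**2 * (x**2 - 1) + y**2
    return u**2 + z**2 - 0.03

def grad_C(p):
    """Gradient of C, written by hand via the chain rule on u."""
    x, y, z = p
    u = x**2 * (x**2 - 1) + y**2
    return np.array([2 * u * (4 * x**3 - 2 * x), 4 * u * y, 2 * z])
```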
<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/MCMC_on_manifold/RWM_manifold_tuning.png" class="img-fluid figure-img" style="width:100.0%"></p>
<figcaption>Tuning of manifold-RWM</figcaption>
</figure>
</div>
</div>
<p>One can for example monitor the usual expected squared jump distance</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Ctextrm%7B(ESJD)%7D%20%5Cequiv%20%5Cmathbb%7BE%7D%5B%5C%7CX_%7Bn+1%7D%20-%20X_n%5C%7C%5E2%5D%0A"></p>
<p>and maximize it to tune the RWM step-size; it would probably make slightly more sense to monitor the squared geodesic distances instead of the naive squared norm <img src="https://latex.codecogs.com/png.latex?%5C%7CX_%7Bn+1%7D%20-%20X_n%5C%7C%5E2">, but that’s way too much hassle and would probably make only a negligible difference. In the figure above, I have plotted the expected squared jump distance as a function of the acceptance rate for different step-sizes. It is interesting to see a pattern extremely similar to the one observed in the standard RWM algorithm <span class="citation" data-cites="roberts2001optimal">(Roberts and Rosenthal 2001)</span>: in this double torus example, the optimal acceptance rate is around <img src="https://latex.codecogs.com/png.latex?25%5C%25">. Note that since the target distribution is uniform, the rate of acceptance is only very slightly lower than the proportion of successful reversibility checks.</p>
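<p>The expected squared jump distance is easy to estimate from a stored chain; a minimal sketch, assuming the chain is stored as a <code>(T, d)</code> array:</p>

```python
import numpy as np

def esjd(chain):
    """Monte Carlo estimate of E[||X_{n+1} - X_n||^2] from a (T, d) array."""
    diffs = np.diff(chain, axis=0)
    return float(np.mean(np.sum(diffs**2, axis=1)))
```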
</section>
<section id="hamiltonian-monte-carlo-hmc-on-manifolds" class="level3">
<h3 class="anchored" data-anchor-id="hamiltonian-monte-carlo-hmc-on-manifolds">Hamiltonian Monte Carlo (HMC) on manifolds</h3>
<p>While the Random Walk Metropolis-Hastings algorithm is interesting, exploiting gradient information is often necessary to design efficient MCMC samplers. Consider a single iteration of a standard <a href="https://en.wikipedia.org/wiki/Hamiltonian_Monte_Carlo">Hamiltonian Monte Carlo (HMC)</a> sampler targeting a density <img src="https://latex.codecogs.com/png.latex?%5Cpi(q)"> on <img src="https://latex.codecogs.com/png.latex?q%20%5Cin%20%5Cmathbb%7BR%7D%5En">. The method proceeds by simulating from a dynamics that is reversible with respect to an extended target density <img src="https://latex.codecogs.com/png.latex?%5Cbar%7B%5Cpi%7D(q,p)"> on <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BR%7D%5En%20%5Ctimes%20%5Cmathbb%7BR%7D%5En"> defined as</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Baligned%7D%0A%5Cbar%7B%5Cpi%7D(q,p)%0A&amp;%5Cpropto%20%5Cpi(q)%20%5C,%20%5Cexp%20%5Cleft%5C%7B%20-%5Cfrac%7B1%7D%7B2m%7D%20%5C%7Cp%5C%7C%5E2%20%5Cright%5C%7D%5C%5C%0A&amp;=%20%5Cexp%5Cleft%5C%7B%20%5Clog%20%5Cpi(q)%20-%20K(p)%20%5Cright%5C%7D%0A%5Cend%7Baligned%7D%0A"></p>
<p>for a user-defined mass parameter <img src="https://latex.codecogs.com/png.latex?m%20%3E%200">. In general, the mass parameter is a positive definite <strong>matrix</strong> but generalizing this to manifolds is slightly less useful in practice. For a time-discretization step <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon%3E%200"> and a current position <img src="https://latex.codecogs.com/png.latex?(q_n,p_n)">, the method proceeds by generating a proposal <img src="https://latex.codecogs.com/png.latex?(q_%7B*%7D,p_%7B*%7D)"> defined as</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Baligned%7D%0Ap_%7Bn+1/2%7D%20&amp;=%20p_n%20+%20%5Cfrac%7B%5Cvarepsilon%7D%7B2%7D%20%5Cnabla%20%5Clog%20%5Cpi(q_n)%5C%5C%0Aq_%7B*%7D%20&amp;=%20q_n%20+%20%5Cvarepsilon%5C,%20m%5E%7B-1%7D%20%5C,%20p_%7Bn+1/2%7D%5C%5C%0Ap_%7B*%7D%20&amp;=%20p_%7Bn+1/2%7D%20+%20%5Cfrac%7B%5Cvarepsilon%7D%7B2%7D%20%5Cnabla%20%5Clog%20%5Cpi(q_%7B*%7D).%0A%5Cend%7Baligned%7D%0A%5Cright.%0A"></p>
<p>This proposal is accepted with probability <img src="https://latex.codecogs.com/png.latex?%5Cmin%5Cleft(%201,%20%5Cbar%7B%5Cpi%7D(q_*,%20p_*)/%5Cbar%7B%5Cpi%7D(q_n,%20p_n)%20%5Cright)">. In standard implementations, several leapfrog steps are performed instead of a single one. One can also choose to perform a single leapfrog step as above and only do a partial refreshment of the momentum after each leapfrog step – this may be more efficient or easier to implement when running a large number of HMC chains in parallel on a GPU for example. To adapt the HMC algorithm to sample from a density <img src="https://latex.codecogs.com/png.latex?%5Cpi(q)"> supported on a manifold <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BM%7D">, one can proceed similarly to the RWM algorithm by interleaving additional projection steps. These projections are needed to ensure that the momentum vectors <img src="https://latex.codecogs.com/png.latex?p_n"> remain in the right tangent spaces and the position vectors <img src="https://latex.codecogs.com/png.latex?q_n"> remain on the manifold <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BM%7D">,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A(q_n,%20p_n)%20%5C;%20%5Cin%20%5C;%20%5Cmathcal%7BM%7D%5Ctimes%20T_%7Bq_n%7D.%0A"></p>
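<p>For concreteness, the unconstrained leapfrog update above reads as follows in code, a minimal sketch with scalar mass <code>m</code>:</p>

```python
import numpy as np

def leapfrog(q, p, grad_log_pi, eps, m=1.0):
    """One leapfrog step for standard (unconstrained) HMC."""
    p = p + 0.5 * eps * grad_log_pi(q)   # half momentum kick
    q = q + eps * p / m                  # full position drift
    p = p + 0.5 * eps * grad_log_pi(q)   # second half kick
    return q, p
```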
<p>As in the RWM algorithm, reversibility checks need to be performed to ensure that the overall algorithm is reversible with respect to the target distribution <img src="https://latex.codecogs.com/png.latex?%5Coverline%7B%5Cpi%7D(q,p)">. The resulting algorithm for generating a proposal <img src="https://latex.codecogs.com/png.latex?(q_n,%20p_n)%20%5Cmapsto%20(q_*,%20p_*)"> reads as follows:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Baligned%7D%0A%5Cwidetilde%7Bp%7D_%7Bn+1/2%7D%20&amp;=%20p_n%20+%20%5Cfrac%7B%5Cvarepsilon%7D%7B2%7D%20%5Cnabla%20%5Clog%20%5Cpi(q_n)%5C%5C%0Ap_%7Bn+1/2%7D%20&amp;=%20%20%5Ctextcolor%7Bred%7D%7B%5Ctext%7Borthogonal%20project%20$%5Cwidetilde%7Bp%7D_%7Bn+1/2%7D$%20onto%20$T_%7Bq_n%7D$%7D%7D%20%5C%5C%0A%5Cwidetilde%7Bq%7D_%7B*%7D%20&amp;=%20q_n%20+%20%5Cvarepsilon%5C,%20m%5E%7B-1%7D%20%5C,%20p_%7Bn+1/2%7D%5C%5C%0Aq_%7B*%7D%20&amp;=%20%20%5Ctextcolor%7Bred%7D%7B%5Ctext%7BProj$(%5Cwidetilde%7Bq%7D_%7B*%7D,%20J_%7Bq_n%7D)$%7D%7D%5C%5C%0A%5Coverline%7Bp%7D_%7Bn+1/2%7D%20&amp;=%20%20%5Ctextcolor%7Bred%7D%7B%5Ctext%7Borthogonal%20project%20$(q_%7B*%7D-q_n)%20%5C,%20m%20/%20%5Cvarepsilon$%20onto%20$T_%7Bq_%7B*%7D%7D$%7D%7D%20%5C%5C%0A%5Cwidetilde%7Bp%7D_%7B*%7D%20&amp;=%20%5Coverline%7Bp%7D_%7Bn+1/2%7D%20+%20%5Cfrac%7B%5Cvarepsilon%7D%7B2%7D%20%5Cnabla%20%5Clog%20%5Cpi(q_%7B*%7D)%5C%5C%0Ap_%7B*%7D%20&amp;=%20%20%5Ctextcolor%7Bred%7D%7B%5Ctext%7Borthogonal%20project%20$%5Cwidetilde%7Bp%7D_%7B*%7D$%20onto%20$T_%7Bq_%7B*%7D%7D$%7D%7D.%0A%5Cend%7Baligned%7D%0A%5Cright.%0A"></p>
<p>If any of the projection operations fail, the proposal is rejected. If no failure occurs, a reversibility check is performed by running the algorithm backward starting from <img src="https://latex.codecogs.com/png.latex?(q_*,%20-p_*)">. If the reversibility check is successful, the proposal is accepted with the usual Metropolis-Hastings probability <img src="https://latex.codecogs.com/png.latex?%5Cmin%5Cleft(%201,%20%5Cbar%7B%5Cpi%7D(q_*,%20p_*)/%5Cbar%7B%5Cpi%7D(q_n,%20p_n)%20%5Cright)">.</p>
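<p>Each of the red projection steps onto a tangent space is a standard linear-algebra operation; a minimal NumPy sketch, where <code>Jq</code> denotes the constraint Jacobian at the relevant point:</p>

```python
import numpy as np

def tangent_project(p, Jq):
    """Orthogonal projection of p onto the tangent space {v : Jq @ v = 0}."""
    Jq = np.atleast_2d(Jq)
    # coefficients of p on the normal directions span(Jq^T)
    coef = np.linalg.solve(Jq @ Jq.T, Jq @ p)
    return p - Jq.T @ coef
```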
<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/MCMC_on_manifold/HMC_double_torus_compressed.gif" class="img-fluid figure-img" style="width:90.0%"></p>
<figcaption>5k HMC chains run in parallel: the momentum is not refreshed</figcaption>
</figure>
</div>
</div>
<p>The article <span class="citation" data-cites="lelievre2019hybrid">(Lelièvre, Rousset, and Stoltz 2019)</span> provides a detailed description of several of these ideas, along with careful analysis and extensions.</p>



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0">
<div id="ref-au2020manifold" class="csl-entry">
Au, Khai Xiang, Matthew M Graham, and Alexandre H Thiery. 2022. <span>“Manifold Lifting: Scaling MCMC to the Vanishing Noise Regime.”</span> <em>Journal of the Royal Statistical Society: Series B</em>. <a href="https://arxiv.org/abs/2003.03950">https://arxiv.org/abs/2003.03950</a>.
</div>
<div id="ref-barth1995algorithms" class="csl-entry">
Barth, Eric, Krzysztof Kuczera, Benedict Leimkuhler, and Robert D Skeel. 1995. <span>“Algorithms for Constrained Molecular Dynamics.”</span> <em>Journal of Computational Chemistry</em> 16 (10). Wiley Online Library: 1192–1209. <a href="https://doi.org/10.1002/jcc.540161003">https://doi.org/10.1002/jcc.540161003</a>.
</div>
<div id="ref-jax2018github" class="csl-entry">
Bradbury, James, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, et al. 2018. <span>“<span>JAX</span>: Composable Transformations of <span>P</span>ython+<span>N</span>um<span>P</span>y Programs.”</span> <a href="http://github.com/google/jax">http://github.com/google/jax</a>.
</div>
<div id="ref-lelievre2019hybrid" class="csl-entry">
Lelièvre, Tony, Mathias Rousset, and Gabriel Stoltz. 2019. <span>“Hybrid Monte Carlo Methods for Sampling Probability Measures on Submanifolds.”</span> <em>Numerische Mathematik</em> 143 (2). Springer: 379–421. <a href="https://arxiv.org/abs/1807.02356">https://arxiv.org/abs/1807.02356</a>.
</div>
<div id="ref-roberts2001optimal" class="csl-entry">
Roberts, Gareth O, and Jeffrey S Rosenthal. 2001. <span>“Optimal Scaling for Various Metropolis-Hastings Algorithms.”</span> <em>Statistical Science</em> 16 (4). Institute of Mathematical Statistics: 351–67. <a href="https://doi.org/10.1214/ss/1015346320">https://doi.org/10.1214/ss/1015346320</a>.
</div>
<div id="ref-zappa2018monte" class="csl-entry">
Zappa, Emilio, Miranda Holmes-Cerfon, and Jonathan Goodman. 2018. <span>“Monte Carlo on Manifolds: Sampling Densities and Integrating Functions.”</span> <em>Communications on Pure and Applied Mathematics</em> 71 (12). Wiley Online Library: 2609–47. <a href="https://arxiv.org/abs/1702.08446">https://arxiv.org/abs/1702.08446</a>.
</div>
</div></section></div> ]]></description>
  <category>MCMC</category>
  <category>manifold</category>
  <guid>https://alexxthiery.github.io/notes/MCMC_on_manifold/mcmc_manifold.html</guid>
  <pubDate>Fri, 08 Mar 2024 17:00:00 GMT</pubDate>
</item>
<item>
  <title>Metropolis-Hastings ratio with deterministic proposals</title>
  <link>https://alexxthiery.github.io/notes/MCMC_deterministic_proposals/MCMC_deterministic.html</link>
  <description><![CDATA[ 





<p>Consider a probability density <img src="https://latex.codecogs.com/png.latex?%5Cpi(x)"> on <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BR%7D%5Ed"> and a (deterministic) function <img src="https://latex.codecogs.com/png.latex?F:%20%5Cmathbb%7BR%7D%5Ed%20%5Cto%20%5Cmathbb%7BR%7D%5Ed">. Assume further that <img src="https://latex.codecogs.com/png.latex?F"> is an <a href="https://en.wikipedia.org/wiki/Involution_(mathematics)">involution</a> in the sense that</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AF(F(x))%20=%20x%0A"></p>
<p>for all <img src="https://latex.codecogs.com/png.latex?x%20%5Cin%20%5Cmathbb%7BR%7D%5Ed">. To keep things simple, since it is not really the point of this short note, suppose that <img src="https://latex.codecogs.com/png.latex?%5Cpi(x)%3E0"> everywhere and that <img src="https://latex.codecogs.com/png.latex?F"> is smooth. This type of transformation can be used to define Markov Chain Monte Carlo algorithms, e.g.&nbsp;the standard <a href="https://en.wikipedia.org/wiki/Hamiltonian_Monte_Carlo">Hamiltonian Monte Carlo (HMC)</a> algorithm. To design an MCMC scheme with this involution <img src="https://latex.codecogs.com/png.latex?F">, one needs to answer the following basic question: suppose that <img src="https://latex.codecogs.com/png.latex?X%20%5Csim%20%5Cpi(dx)"> and the proposal <img src="https://latex.codecogs.com/png.latex?Y%20=%20F(X)"> is constructed and accepted with probability <img src="https://latex.codecogs.com/png.latex?%5Calpha(X)">, how should the acceptance probability function <img src="https://latex.codecogs.com/png.latex?%5Calpha:%20%5Cmathbb%7BR%7D%5Ed%20%5Cto%20%5B0,1%5D"> be chosen so that the resulting random variable <img src="https://latex.codecogs.com/png.latex?Z%20%5C;%20=%20%5C;%20Y%20%5C,%20B%20+%20(1-B)%20%5C,%20X"> is also distributed according to <img src="https://latex.codecogs.com/png.latex?%5Cpi(dx)">? The Bernoulli random variable <img src="https://latex.codecogs.com/png.latex?B"> is such that <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D(B=1%7CX=x)=%5Calpha(x)">. In other words, for any test function <img src="https://latex.codecogs.com/png.latex?%5Cvarphi:%20%5Cmathbb%7BR%7D%5Ed%20%5Cto%20%5Cmathbb%7BR%7D">, we would like <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D%5B%5Cvarphi(Z)%5D%20=%20%5Cmathbb%7BE%7D%5B%5Cvarphi(X)%5D">, which means that</p>
<p><span id="eq-necessary"><img src="https://latex.codecogs.com/png.latex?%0A%5Cint%20%20%7B%5Cleft%5C%7B%20%20%5Cvarphi(F(x))%20%5C,%20%5Calpha(x)%20+%20%5Cvarphi(x)%20%5C,%20(1-%5Calpha(x))%20%20%5Cright%5C%7D%7D%20%20%5C,%20%5Cpi(dx)%20=%20%5Cint%20%5Cvarphi(x)%20%5C,%20%5Cpi(dx).%0A%5Ctag%7B1%7D"></span></p>
<p>Requiring for Equation&nbsp;1 to hold for any test function <img src="https://latex.codecogs.com/png.latex?%5Cvarphi"> is easily seen to be equivalent to asking for the equation</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Calpha(x)%20%5C,%20%5Cpi(x)%20%5C;%20=%20%5C;%20%5Calpha(y)%20%5C,%20%5Cpi(y)%20%5C,%20%7CJ_F(x)%7C%0A"></p>
<p>to hold for any <img src="https://latex.codecogs.com/png.latex?x%20%5Cin%20%5Cmathbb%7BR%7D%5Ed"> where <img src="https://latex.codecogs.com/png.latex?y=F(x)"> and <img src="https://latex.codecogs.com/png.latex?J_F(x)"> is the Jacobian of <img src="https://latex.codecogs.com/png.latex?F"> at <img src="https://latex.codecogs.com/png.latex?x">. Since <img src="https://latex.codecogs.com/png.latex?%7CJ_F(y)%7C%20%5Ctimes%20%7CJ_F(x)%7C%20=%201"> because the function <img src="https://latex.codecogs.com/png.latex?F"> is an involution, this also reads</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Calpha(x)%20%5C,%20%5Cfrac%7B%5Cpi(x)%20%7D%7B%7CJ_F(x)%7C%5E%7B1/2%7D%7D%20%5C;%20=%20%5C;%0A%5Calpha(y)%20%5C,%20%5Cfrac%7B%5Cpi(y)%20%7D%7B%7CJ_F(y)%7C%5E%7B1/2%7D%7D.%0A"></p>
<p>At this point, it becomes clear to anyone familiar with the correctness-proof of the usual <a href="https://en.wikipedia.org/wiki/Metropolis–Hastings_algorithm">Metropolis-Hastings algorithm</a> that a possible solution is</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Calpha(x)%20%5C;%20=%20%5C;%20%5Cmin%20%7B%5Cleft%5C%7B%201,%20%5Cfrac%7B%5Cpi(y)%20/%20%7CJ_F(y)%7C%5E%7B1/2%7D%7D%7B%5Cpi(x)%20/%20%7CJ_F(x)%7C%5E%7B1/2%7D%7D%20%5Cright%5C%7D%7D%0A"></p>
<p>although there are indeed many other possible solutions. Since <img src="https://latex.codecogs.com/png.latex?%7CJ_F(y)%7C%20%5Ctimes%20%7CJ_F(x)%7C%20=%201">, this also reads</p>
<p><span id="eq-MH"><img src="https://latex.codecogs.com/png.latex?%0A%5Calpha(x)%20%5C;%20=%20%5C;%20%5Cmin%20%7B%5Cleft%5C%7B%201,%20%5Cfrac%7B%5Cpi(y)%7D%7B%5Cpi(x)%7D%20%5C,%20%7CJ_F(x)%7C%20%5Cright%5C%7D%7D%20.%0A%5Ctag%7B2%7D"></span></p>
<p>One can reach a similar conclusion by looking at the Radon-Nikodym ratio <img src="https://latex.codecogs.com/png.latex?%5B%5Cpi(dx)%20%5Cotimes%20q(x,dy)%5D%20/%20%5B%5Cpi(dy)%20%5Cotimes%20q(y,dx)%5D"> where <img src="https://latex.codecogs.com/png.latex?q(x,dy)"> is the Markov kernel describing the deterministic transformation <span class="citation" data-cites="green1995reversible">(Green 1995)</span>, but I do not find this approach significantly simpler. The very neat article <span class="citation" data-cites="andrieu2020general">(Andrieu, Lee, and Livingstone 2020)</span> describes much more sophisticated and interesting generalizations. In practice, Equation&nbsp;2 is often used in the simpler case when <img src="https://latex.codecogs.com/png.latex?F"> is volume preserving, i.e.&nbsp;<img src="https://latex.codecogs.com/png.latex?%7CJ_F(x)%7C=1">, as is the case for the <a href="https://en.wikipedia.org/wiki/Hamiltonian_Monte_Carlo">Hamiltonian Monte Carlo (HMC)</a>. The discussion above was prompted by a student implementing a variant of this but with the wrong acceptance ratio <img src="https://latex.codecogs.com/png.latex?%5Calpha(x)%20=%20%5Cmin%20%7B%5Cleft%5C%7B%201,%20%5Cfrac%7B%5Cpi(y)%7D%7B%5Cpi(x)%7D%20%5C,%20%5Cfrac%7B%7CJ_F(x)%7C%7D%7B%7CJ_F(y)%7C%7D%20%5Cright%5C%7D%7D%20"> and us taking quite a bit of time to find the bug…</p>
<p>Note that there are interesting and practical situations when the function <img src="https://latex.codecogs.com/png.latex?F"> satisfies the involution property <img src="https://latex.codecogs.com/png.latex?F(F(x))=x"> only when <img src="https://latex.codecogs.com/png.latex?x"> belongs to a subset of the state-space. For instance, this can happen when implementing MCMC on a manifold <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BM%7D%20%5Csubset%20%5Cmathbb%7BR%7D%5Ed"> and the function <img src="https://latex.codecogs.com/png.latex?F"> involves a “projection” on the manifold <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BM%7D">, as for example described in the really interesting article <span class="citation" data-cites="zappa2018monte">(Zappa, Holmes-Cerfon, and Goodman 2018)</span>. In that case, it suffices to add a “reversibility check”, i.e.&nbsp;make sure that when applying <img src="https://latex.codecogs.com/png.latex?F"> to the proposal <img src="https://latex.codecogs.com/png.latex?y=F(x)">, one goes back to <img src="https://latex.codecogs.com/png.latex?x"> in the sense that <img src="https://latex.codecogs.com/png.latex?F(y)=x">. The acceptance probability in that case should be amended and expressed as</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Calpha(x)%20%5C;%20=%20%5C;%20%5Cmin%20%7B%5Cleft%5C%7B%201,%20%5Cfrac%7B%5Cpi(y)%7D%7B%5Cpi(x)%7D%20%5C,%20%7CJ_F(x)%7C%20%5Cright%5C%7D%7D%20%20%5C,%20%5Cmathbf%7B1%7D%20%7B%5Cleft(%20F(y)=x%20%5Cright)%7D%20.%0A"></p>
<p>In other words, if applying <img src="https://latex.codecogs.com/png.latex?F"> to the proposal <img src="https://latex.codecogs.com/png.latex?y=F(x)"> does not lead back to <img src="https://latex.codecogs.com/png.latex?x">, the proposal is always rejected.</p>
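<p>Putting the pieces together, one Metropolis step with a deterministic involutive proposal, including the reversibility check, can be sketched as follows (a minimal sketch; the function names are illustrative):</p>

```python
import numpy as np

def involution_step(x, F, jac_det, log_pi, rng, tol=1e-9):
    """One Metropolis step with the deterministic proposal y = F(x).

    Accepts with probability min(1, pi(y)/pi(x) * |J_F(x)|), after first
    checking that F(F(x)) returns to x (the 'reversibility check').
    """
    y = F(x)
    if not np.allclose(F(y), x, atol=tol):
        return x  # reversibility check failed: always reject
    log_ratio = log_pi(y) - log_pi(x) + np.log(abs(jac_det(x)))
    if np.log(rng.uniform()) < log_ratio:
        return y
    return x
```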
<section id="same-but-without-involution" class="level3">
<h3 class="anchored" data-anchor-id="same-but-without-involution">Same, but without involution</h3>
<p>In some situations, the requirement for <img src="https://latex.codecogs.com/png.latex?F"> to be an involution can seem cumbersome. What if we consider the more general situation of a smooth bijection <img src="https://latex.codecogs.com/png.latex?T:%20%5Cmathbb%7BR%7D%5Ed%20%5Cto%20%5Cmathbb%7BR%7D%5Ed"> and its inverse <img src="https://latex.codecogs.com/png.latex?T%5E%7B-1%7D">? In that case, one can directly apply what has been described in the previous section: it suffices to consider an extended state-space <img src="https://latex.codecogs.com/png.latex?(x,%5Cvarepsilon)"> obtained by including an index <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon%5Cin%20%5C%7B-1,1%5C%7D"> and the involution <img src="https://latex.codecogs.com/png.latex?F"> defined as</p>
<p><span id="eq-extended"><img src="https://latex.codecogs.com/png.latex?%0AF:%0A%5Cleft%5C%7B%0A%5Cbegin%7Balign%7D%0A(x,%5Cvarepsilon=+1)%20&amp;%5Cmapsto%20(T(x),%20%5Cvarepsilon=-1)%5C%5C%0A(x,%5Cvarepsilon=-1)%20&amp;%5Cmapsto%20(T%5E%7B-1%7D(x),%20%5Cvarepsilon=+1).%0A%5Cend%7Balign%7D%0A%5Cright.%0A%5Ctag%7B3%7D"></span></p>
<p>This allows one to define a Markov kernel that leaves the distribution <img src="https://latex.codecogs.com/png.latex?%5Coverline%7B%5Cpi%7D(x,%20%5Cvarepsilon)%20=%20%5Cpi(dx)/2"> invariant. Things get a bit more interesting if a deterministic “flip” <img src="https://latex.codecogs.com/png.latex?(x,%20%5Cvarepsilon)%20%5Cmapsto%20(x,%20-%5Cvarepsilon)"> is applied after each application of the Markov kernel described above: doing so avoids immediately coming back to <img src="https://latex.codecogs.com/png.latex?x"> in the event the move <img src="https://latex.codecogs.com/png.latex?(x,%5Cvarepsilon)%20%5Cmapsto%20(T%5E%7B%5Cvarepsilon%7D(x),%20-%5Cvarepsilon)"> is accepted. Quite a few papers exploit this type of idea.</p>
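<p>The extended involution of Equation&nbsp;3 is only a few lines of code; a small sketch (names illustrative):</p>

```python
def extended_involution(T, T_inv):
    """Turn a bijection T into the involution of Equation 3 on the
    extended state-space: (x, +1) -> (T(x), -1), (x, -1) -> (T_inv(x), +1)."""
    def F(state):
        x, eps = state
        return (T(x), -1) if eps == +1 else (T_inv(x), +1)
    return F
```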
</section>
<section id="a-mixture-of-deterministic-transformations" class="level3">
<h3 class="anchored" data-anchor-id="a-mixture-of-deterministic-transformations">A mixture of deterministic transformations?</h3>
<p>To conclude these notes, here is a small riddle whose answer I do not have. One can check that for any <img src="https://latex.codecogs.com/png.latex?c%20%5Cin%20%5Cmathbb%7BR%7D">, the function <img src="https://latex.codecogs.com/png.latex?F_%7Bc%7D(x)%20=%20c%20+%201/(x-c)"> is an involution of the real line. This means that for any target density <img src="https://latex.codecogs.com/png.latex?%5Cpi(x)"> on the real line, one can build the associated Markov kernel <img src="https://latex.codecogs.com/png.latex?M_c"> defined as</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AM_c(x,%20dy)%20=%20%5Calpha_c(x)%20%5C,%20%5Cdelta_%7BF_c(x)%7D(dy)%20+%20(1-%5Calpha_c(x))%20%5C,%20%5Cdelta_x(dy)%0A"></p>
<p>for an acceptance probability <img src="https://latex.codecogs.com/png.latex?%5Calpha_c(x)"> described as above,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Calpha_c(x)%20=%20%5Cmin%20%7B%5Cleft%5C%7B%201,%20%5Cfrac%7B%5Cpi%5BF_c(x)%5D%7D%7B%5Cpi(x)%7D%20%7CF'_c(x)%7C%20%5Cright%5C%7D%7D%20.%0A"></p>
<p>Finally, choose <img src="https://latex.codecogs.com/png.latex?N%20%5Cgeq%202"> values <img src="https://latex.codecogs.com/png.latex?c_1,%20%5Cldots,%20c_N%20%5Cin%20%5Cmathbb%7BR%7D"> and consider the mixture of Markov kernels</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AM(x,dy)%20%5C;%20=%20%5C;%20%5Cfrac%7B1%7D%7BN%7D%20%5Csum_%7Bi=1%7D%5EN%20M_%7Bc_i%7D(x,%20dy).%0A"></p>
<p>The Markov kernel <img src="https://latex.codecogs.com/png.latex?M(x,%20dy)"> leaves the distribution <img src="https://latex.codecogs.com/png.latex?%5Cpi"> invariant since each Markov kernel <img src="https://latex.codecogs.com/png.latex?M_%7Bc_i%7D(x,%20dy)"> does, but it is not clear at all (to me) under what conditions the associated MCMC algorithm does converge to <img src="https://latex.codecogs.com/png.latex?%5Cpi">. One can empirically check that if <img src="https://latex.codecogs.com/png.latex?N"> is very small, things can break down quite easily. On the other hand, for <img src="https://latex.codecogs.com/png.latex?N"> large, the mixture of Markov kernels <img src="https://latex.codecogs.com/png.latex?M(x,dy)"> empirically seems to behave as if it were ergodic with respect to <img src="https://latex.codecogs.com/png.latex?%5Cpi">.</p>
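<p>One step of the mixture kernel is easy to implement, using <code>|F_c'(x)| = 1/(x - c)^2</code>; a minimal sketch (names illustrative):</p>

```python
import numpy as np

def mobius_step(x, cs, log_pi, rng):
    """One step of the mixture kernel M: pick c uniformly among cs, propose
    y = F_c(x) = c + 1/(x - c), and accept with the Jacobian-corrected
    Metropolis ratio min(1, pi(y)/pi(x) * |F_c'(x)|)."""
    c = rng.choice(cs)
    y = c + 1.0 / (x - c)
    # log |F_c'(x)| = -2 log |x - c|
    log_ratio = log_pi(y) - log_pi(x) - 2.0 * np.log(abs(x - c))
    return y if np.log(rng.uniform()) < log_ratio else x
```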
<div style="text-align:center;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexxthiery.github.io/notes/MCMC_deterministic_proposals/mcmc_deterministic.png" class="img-fluid quarto-figure quarto-figure-center figure-img" style="width:80.0%"></p>
</figure>
</div>
</div>
<p>For <img src="https://latex.codecogs.com/png.latex?N=5"> values <img src="https://latex.codecogs.com/png.latex?c_1,%20%5Cldots,%20c_5%20%5Cin%20%5Cmathbb%7BR%7D"> chosen at random, the illustration above shows the empirical distribution of the associated Markov chain run for <img src="https://latex.codecogs.com/png.latex?T=10%5E6"> iterations and targeting the standard Gaussian distribution <img src="https://latex.codecogs.com/png.latex?%5Cpi(dx)%20%5Cequiv%20%5Cmathcal%7BN%7D(0,1)">: the fit seems almost perfect.</p>



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0">
<div id="ref-andrieu2020general" class="csl-entry">
Andrieu, Christophe, Anthony Lee, and Sam Livingstone. 2020. <span>“A General Perspective on the Metropolis-Hastings Kernel.”</span> <em>arXiv Preprint arXiv:2012.14881</em>.
</div>
<div id="ref-green1995reversible" class="csl-entry">
Green, Peter J. 1995. <span>“Reversible Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination.”</span> <em>Biometrika</em> 82 (4). Oxford University Press: 711–32.
</div>
<div id="ref-zappa2018monte" class="csl-entry">
Zappa, Emilio, Miranda Holmes-Cerfon, and Jonathan Goodman. 2018. <span>“Monte Carlo on Manifolds: Sampling Densities and Integrating Functions.”</span> <em>Communications on Pure and Applied Mathematics</em> 71 (12). Wiley Online Library: 2609–47.
</div>
</div></section></div> ]]></description>
  <category>auxiliary-variable</category>
  <guid>https://alexxthiery.github.io/notes/MCMC_deterministic_proposals/MCMC_deterministic.html</guid>
  <pubDate>Sun, 17 Dec 2023 17:00:00 GMT</pubDate>
</item>
<item>
  <title>Averaging and homogenization</title>
  <link>https://alexxthiery.github.io/notes/averaging_homogenization/averaging_homogenization.html</link>
  <description><![CDATA[ 





<section id="averaging" class="level3">
<h3 class="anchored" data-anchor-id="averaging">Averaging</h3>
<p>Consider a pair of (coupled) Markov processes <img src="https://latex.codecogs.com/png.latex?X_t%20%5Cin%20%5Cmathcal%7BX%7D"> and <img src="https://latex.codecogs.com/png.latex?Y_t%20%5Cin%20%5Cmathcal%7BY%7D"> with dynamics that can informally be described as</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Balign%7D%0AdX%5E%7B%5Cvarepsilon%7D/dt%20&amp;=%20F(X%5E%7B%5Cvarepsilon%7D,Y%5E%7B%5Cvarepsilon%7D,%20W%5EX)%5C%5C%0AdY%5E%7B%5Cvarepsilon%7D/dt%20&amp;=%20%20%5Ctextcolor%7Bred%7D%7B%5Cvarepsilon%5E%7B-1%7D%7D%20%5C,%20G(X%5E%7B%5Cvarepsilon%7D,Y%5E%7B%5Cvarepsilon%7D,%20W%5EY)%5C%5C%0A%5Cend%7Balign%7D%0A%5Cright.%0A"></p>
<p>for two independent “noise” terms <img src="https://latex.codecogs.com/png.latex?W%5EX"> and <img src="https://latex.codecogs.com/png.latex?W%5EY"> and a time-scale parameter <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon%5Cll%201">. We assume that <img src="https://latex.codecogs.com/png.latex?X"> is a <strong>slow component</strong> that moves by <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BO%7D(%5Cdelta)"> on the time interval <img src="https://latex.codecogs.com/png.latex?%5Bt,%20t+%5Cdelta%5D">. The scaling <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon%5E%7B-1%7D"> in the dynamics of the <strong>fast process</strong> <img src="https://latex.codecogs.com/png.latex?Y%5E%7B%5Cvarepsilon%7D"> indicates that we expect the process <img src="https://latex.codecogs.com/png.latex?Y"> to evolve on a time scale of order <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BO%7D(%5Cvarepsilon)">. We are interested in the limit <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon%5Cto%200"> and hope to “average out” the fast process <img src="https://latex.codecogs.com/png.latex?Y%5E%7B%5Cvarepsilon%7D"> and be able to describe the slow (and interesting) process <img src="https://latex.codecogs.com/png.latex?X%5E%7B%5Cvarepsilon%7D"> without referring to the fast process. Informally, we would like to describe the process <img src="https://latex.codecogs.com/png.latex?X%5E%7B%5Cvarepsilon%7D">, in the limit <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon%5Cto%200">, as following an effective Markovian dynamics</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AdX/dt%20=%20%5Coverline%7BF%7D(X,%20W%5EX).%0A"></p>
<p>To describe the averaging phenomenon, we typically assume some ergodicity conditions on the fast process <img src="https://latex.codecogs.com/png.latex?Y">. Here, we assume that for each fixed <img src="https://latex.codecogs.com/png.latex?x%20%5Cin%20%5Cmathcal%7BX%7D">, the fast process <img src="https://latex.codecogs.com/png.latex?Y%5E%7B%5Bx%5D%7D"> with the slow component frozen at <img src="https://latex.codecogs.com/png.latex?x%20%5Cin%20%5Cmathcal%7BX%7D">, i.e.</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AdY%5E%7B%5Bx%5D%7D/dt%20=%20G(x,Y%5E%7B%5Bx%5D%7D,%20W%5EY)%0A"></p>
<p>is ergodic with respect to some probability distribution <img src="https://latex.codecogs.com/png.latex?%5Crho_x(dy)">. Although the averaging phenomenon is quite general, it is somewhat easier to illustrate it for diffusion processes. In this case, let us assume that the slow process is given by</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AdX%5E%7B%5Cvarepsilon%7D%20=%20%5Cmu(X%5E%7B%5Cvarepsilon%7D,%20Y%5E%7B%5Cvarepsilon%7D)%20%5C,%20dt%20+%20%5Csigma(X%5E%7B%5Cvarepsilon%7D,%20Y%5E%7B%5Cvarepsilon%7D)%20%5C,%20dW%5Ex.%0A"></p>
<p>For <img src="https://latex.codecogs.com/png.latex?X%5E%7B%5Cvarepsilon%7D_%7Bt%7D%20=%20x"> and for a time increment <img src="https://latex.codecogs.com/png.latex?%5Cdelta%20%5Cll%201">, since the process <img src="https://latex.codecogs.com/png.latex?X%5E%7B%5Cvarepsilon%7D"> can be considered constant on this interval, we have</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0AX%5E%7B%5Cvarepsilon%7D_%7Bt+%5Cdelta%7D%20-%20x%0A&amp;%5Capprox%20%5C;%0A%7B%5Cleft(%20%20%5Cfrac%7B%20%5Cint_%7Bt%7D%5E%7Bt+%5Cdelta%7D%20%5Cmu(x,%20Y%5E%7B%5Cvarepsilon%7D)%20%5C,%20dt%7D%7B%5Cdelta%7D%20%20%5Cright)%7D%20%20%5C,%20%5Cdelta%20+%20%5C%0A%7B%5Cleft(%20%20%5Cfrac%7B%20%5Cint_%7Bt%7D%5E%7Bt+%5Cdelta%7D%20%5Csigma%5E2(x,%20Y%5E%7B%5Cvarepsilon%7D)%20%5C,%20dt%20%7D%7B%5Cdelta%7D%20%5Cright)%7D%20%5E%7B1/2%7D%20%5C,%20%5Cmathcal%7BN%7D(0,%20%5Cdelta).%0A%5Cend%7Balign%7D%0A"></p>
<p>This can be regarded as a time-discretization of the <strong>averaged process</strong></p>
<p><img src="https://latex.codecogs.com/png.latex?%0AdX%20%5C,%20=%20%5C;%20%5Coverline%7B%5Cmu%7D(X)%20%5C,%20dt%20+%20%5Coverline%7B%5Csigma%7D(X)%20%5C,%20dW%0A"></p>
<p>for averaged drift and volatility functions given by</p>
<p><span id="eq-av-diff"><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Balign%7D%0A%5Coverline%7B%5Cmu%7D(x)%0A&amp;=%0A%5Cint%20%5Cmu(x,y)%20%5C,%20%5Crho_x(dy)%20%5C%5C%0A%5Coverline%7B%5Csigma%7D%5E2(x)%0A&amp;=%20%5Cint%20%5Csigma%5E2(x,y)%20%5C,%20%5Crho_x(dy).%0A%5Cend%7Balign%7D%0A%5Cright.%0A%5Ctag%7B1%7D"></span></p>
<p>One standard approach for proving this type of result is to write the Kolmogorov equations</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cfrac%7Bd%7D%7Bdt%7D%5Cvarphi%5E%7B%5Cvarepsilon%7D(x,y,t)%20=%20%5Cmathcal%7BL%7D%5E%7B%5Cvarepsilon%7D%20%5Cvarphi%5E%7B%5Cvarepsilon%7D(x,y,t)"> for <img src="https://latex.codecogs.com/png.latex?%5Cvarphi%5E%7B%5Cvarepsilon%7D(x,y,t)%20=%20%5Cmathbb%7BE%7D%5B%5Cvarphi(X%5E%7B%5Cvarepsilon%7D_%7Bt%7D,%20Y%5E%7B%5Cvarepsilon%7D_%7Bt%7D,%20t)%20%7C%20X%5E%7B%5Cvarepsilon%7D_%7B0%7D=x,%20Y%5E%7B%5Cvarepsilon%7D_%7B0%7D=y%5D"> and perform a <a href="https://en.wikipedia.org/wiki/Method_of_matched_asymptotic_expansions">multiscale expansion</a> <span class="citation" data-cites="hinch_1991">(Hinch 1991)</span> <span class="citation" data-cites="pavliotis2008multiscale">(Pavliotis and Stuart 2008)</span> <span class="citation" data-cites="weinan2011principles">(Weinan 2011)</span></p>
<p><span id="eq-multiscale"><img src="https://latex.codecogs.com/png.latex?%0A%5Cvarphi%5E%7B%5Cvarepsilon%7D(x,y,t)%0A=%0AA(x,t)%20+%20%5Cvarepsilon%20B(x,y,t)%20+%20%5Cmathcal%7BO%7D(%5Cvarepsilon%5E2).%0A%5Ctag%7B2%7D"></span></p>
<p>Indeed, the first order term <img src="https://latex.codecogs.com/png.latex?A(x,t)"> is expected to not depend on the initial condition <img src="https://latex.codecogs.com/png.latex?y"> since the process <img src="https://latex.codecogs.com/png.latex?(X%5E%7B%5Cvarepsilon%7D_t,%20Y%5E%7B%5Cvarepsilon%7D_t)"> forgets <img src="https://latex.codecogs.com/png.latex?Y%5E%7B%5Cvarepsilon%7D_0%20=%20y"> on time scales of order <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon"> and we are interested in the regime <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon%5Cto%200">. From Equation&nbsp;2 one can obtain the dynamics of the averaged process described by the function <img src="https://latex.codecogs.com/png.latex?A(x,t)">. One finds that <img src="https://latex.codecogs.com/png.latex?A"> is described by the averaged generator of the slow component, i.e.&nbsp;averaging <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7D%5E%7BX%5E%7B%5Cvarepsilon%7D%7D"> under <img src="https://latex.codecogs.com/png.latex?%5Crho_x(dy)">; this exactly gives Equation&nbsp;1 in the case of diffusions. A typical example could be as follows:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Balign%7D%0AdX%5E%7B%5Cvarepsilon%7D%20&amp;=%20%5Cmu(X%5E%7B%5Cvarepsilon%7D,%20Y%5E%7B%5Cvarepsilon%7D)%20dt%20+%20%5Csigma(X%5E%7B%5Cvarepsilon%7D,%20Y%5E%7B%5Cvarepsilon%7D)%20%5C,%20dW%5EX%5C%5C%0AdY%5E%7B%5Cvarepsilon%7D%20&amp;=%20-%20%5Ctextcolor%7Bred%7D%7B%5Cvarepsilon%5E%7B-1%7D%7D%20%5Cfrac%7B%20(Y%5E%7B%5Cvarepsilon%7D%20-%20X%5E%7B%5Cvarepsilon%7D)%20%7D%7B%5Csigma%5E2%7D%20%5C,%20dt%20+%20%5Csqrt%7B2%20%20%5Ctextcolor%7Bred%7D%7B%5Cvarepsilon%5E%7B-1%7D%7D%20%7D%20%5C,%20dW%5EY.%5C%5C%0A%5Cend%7Balign%7D%0A%5Cright.%0A"></p>
<p>The fast process <img src="https://latex.codecogs.com/png.latex?Y%5E%7B%5Cvarepsilon%7D_t"> is an <a href="https://en.wikipedia.org/wiki/Ornstein–Uhlenbeck_process">Ornstein-Uhlenbeck</a> process sped up by a factor <img src="https://latex.codecogs.com/png.latex?1/%5Cvarepsilon"> that rapidly oscillates around <img src="https://latex.codecogs.com/png.latex?X%5E%7B%5Cvarepsilon%7D_t">, with Gaussian fluctuations of variance <img src="https://latex.codecogs.com/png.latex?%5Csigma%5E2%3E0">, i.e.:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Crho_x(dy)%20%5C;%20=%20%5C;%20%5Cfrac%7B%20e%5E%7B-(y-x)%5E2/(2%5Csigma%5E2)%7D%20%7D%7B%5Csqrt%7B2%5Cpi%20%5Csigma%5E2%7D%7D%5C,%20dy.%0A"></p>
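<p>As a numerical sanity check of this averaging example, the sketch below simulates the fast-slow pair and compares the long-run variance of the slow component with that of the averaged dynamics. The concrete coefficients are hypothetical choices made for the test (they are not specified above): slow drift μ(x, y) = -y, slow volatility √2, and σ = 1 in the fast OU process, so that ρ_x = N(x, 1), the averaged drift is μ̄(x) = -x, and the averaged SDE dX = -X dt + √2 dW has stationary law N(0, 1).</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# time-scale separation and integration step (dt must resolve the fast scale eps)
eps, dt, T = 1e-2, 1e-3, 200.0
n = int(T / dt)
xi = rng.standard_normal((n, 2))   # pre-drawn unit-variance Brownian increments

# hypothetical coefficients: mu(x, y) = -y, slow volatility sqrt(2), sigma = 1,
# so the averaged SDE is dX = -X dt + sqrt(2) dW with stationary law N(0, 1)
x, y = 0.0, 0.0
xs = np.empty(n)
sq = np.sqrt(dt)
for i in range(n):
    x += -y * dt + np.sqrt(2.0) * sq * xi[i, 0]
    y += -(y - x) * dt / eps + np.sqrt(2.0 / eps) * sq * xi[i, 1]
    xs[i] = x

var_x = xs[n // 4:].var()   # discard burn-in; predicted stationary variance is 1
```

For finite ε the stationary variance of the coupled pair can be computed exactly and equals 1 + 2ε here, so the empirical variance should sit close to one.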
<p>This averaging phenomenon is relatively straightforward and not extremely surprising. More interesting is the homogenization phenomenon described in the next Section.</p>
</section>
<section id="homogenization" class="level3">
<h3 class="anchored" data-anchor-id="homogenization">Homogenization</h3>
<p>Consider the presence of an additional intermediate time scale <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon%5E%7B-1/2%7D">, <img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Balign%7D%0AdX%5E%7B%5Cvarepsilon%7D/dt%20&amp;=%20%20%5Ctextcolor%7Bblue%7D%7B%5Cvarepsilon%5E%7B-1/2%7D%20H(X%5E%7B%5Cvarepsilon%7D,Y%5E%7B%5Cvarepsilon%7D)%7D%20%5C,%20+%20%5C,F(X%5E%7B%5Cvarepsilon%7D,Y%5E%7B%5Cvarepsilon%7D,%20W%5EX)%5C%5C%0AdY%5E%7B%5Cvarepsilon%7D/dt%20&amp;=%20%20%5Ctextcolor%7Bred%7D%7B%5Cvarepsilon%5E%7B-1%7D%7D%20%5C,%20G(X%5E%7B%5Cvarepsilon%7D,Y%5E%7B%5Cvarepsilon%7D,%20W%5EY)%5C%5C%0A%5Cend%7Balign%7D%0A%5Cright.%0A"> with the same assumption that for any fixed <img src="https://latex.codecogs.com/png.latex?x%20%5Cin%20%5Cmathcal%7BX%7D"> the process <img src="https://latex.codecogs.com/png.latex?dY%5E%7B%5Bx%5D%7D/dt%20=%20G(x,Y%5E%7B%5Bx%5D%7D,%20W%5EY)"> is ergodic with respect to the probability distribution <img src="https://latex.codecogs.com/png.latex?%5Crho_x(dy)">. The same reasoning as in the averaging case shows that averaging the term <img src="https://latex.codecogs.com/png.latex?F(X%5E%7B%5Cvarepsilon%7D,Y%5E%7B%5Cvarepsilon%7D,%20W%5EX)"> is relatively straightforward and has the exact same expression: it suffices to average under <img src="https://latex.codecogs.com/png.latex?%5Crho_x(dy)">. This means that one can study instead</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Balign%7D%0AdX%5E%7B%5Cvarepsilon%7D/dt%20&amp;=%20%20%5Ctextcolor%7Bblue%7D%7B%5Cvarepsilon%5E%7B-1/2%7D%20H(X%5E%7B%5Cvarepsilon%7D,Y%5E%7B%5Cvarepsilon%7D)%7D%20%5C,%20+%20%5C,%20%5Coverline%7BF%7D(X%5E%7B%5Cvarepsilon%7D,%20W%5EX)%5C%5C%0AdY%5E%7B%5Cvarepsilon%7D/dt%20&amp;=%20%20%5Ctextcolor%7Bred%7D%7B%5Cvarepsilon%5E%7B-1%7D%7D%20%5C,%20G(X%5E%7B%5Cvarepsilon%7D,Y%5E%7B%5Cvarepsilon%7D,%20W%5EY)%5C%5C%0A%5Cend%7Balign%7D%0A%5Cright.%0A"></p>
<p>with, informally, <img src="https://latex.codecogs.com/png.latex?%5Coverline%7BF%7D(x,w)%20=%20%5Cint%20F(x,y,w)%20%5C,%20%5Crho_x(dy)">. The new and interesting phenomenon comes from the intermediate time scale <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon%5E%7B-1/2%7D">. Contrary to the averaging phenomenon of the previous section, which relied only on a Law of Large Numbers, dealing with the intermediate time scale requires a CLT and a quantification of the mixing rate of the fast process <img src="https://latex.codecogs.com/png.latex?Y%5E%7B%5Bx%5D%7D">. Note that since <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon%5E%7B-1/2%7D%20%5Cgg%201">, for the dynamics not to explode one needs the <strong>centering condition</strong>:</p>
<p><span id="eq-centering"><img src="https://latex.codecogs.com/png.latex?%0A%5Cint_%7B%5Cmathcal%7BY%7D%7D%20H(x,y)%20%5C,%20%5Crho_x(dy)%20=%200%0A%5Cqquad%20%5Ctextrm%7Bfor%20all%20%7D%20x%20%5Cin%20%5Cmathcal%7BX%7D.%0A%5Ctag%7B3%7D"></span></p>
<p>Because of the centering condition, the term <img src="https://latex.codecogs.com/png.latex?%20%5Ctextcolor%7Bblue%7D%7B%5Cvarepsilon%5E%7B-1/2%7D%20H(X%5E%7B%5Cvarepsilon%7D,Y%5E%7B%5Cvarepsilon%7D)%7D"> will contribute an additional noise term to the effective dynamics of the slow process. To describe this additional noise term, assume an ergodic central limit theorem (CLT) for the fast process <img src="https://latex.codecogs.com/png.latex?dY%5E%7B%5Bx%5D%7D/dt%20=%20G(x,Y%5E%7B%5Bx%5D%7D,%20W%5EY)">: for a test function <img src="https://latex.codecogs.com/png.latex?%5Cvarphi:%20%5Cmathcal%7BY%7D%5Cto%20%5Cmathbb%7BR%7D"> with zero expectation under <img src="https://latex.codecogs.com/png.latex?%5Crho_x(dy)"> we have:</p>
<p><span id="eq-CLT"><img src="https://latex.codecogs.com/png.latex?%0A%5Clim_%7BT%20%5Cto%20%5Cinfty%7D%20%5C;%20T%5E%7B-1/2%7D%0A%5Cint_%7Bt=0%7D%5ET%20%5C,%20%5Cvarphi(Y%5E%7B%5Bx%5D%7D_t)%20%5C,%20dt%0A%5C;%20=%20%5C;%20%5Cmathcal%7BN%7D(0,%20V_x%5B%5Cvarphi%5D)%0A%5Ctag%7B4%7D"></span></p>
<p>for asymptotic variance <img src="https://latex.codecogs.com/png.latex?V_x%5B%5Cvarphi%5D%20%5Cgeq%200">. For a time increment <img src="https://latex.codecogs.com/png.latex?%5Cdelta%20%3E%200"> and assuming <img src="https://latex.codecogs.com/png.latex?X%5E%7B%5Cvarepsilon%7D_%7Bt%7D=x"> we have</p>
<p><span id="eq-split"><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0AX%5E%7B%5Cvarepsilon%7D_%7Bt+%5Cdelta%7D%20-%20x%0A&amp;%5Capprox%20%20%5Ctextcolor%7Bblue%7D%7B%20%5Cvarepsilon%5E%7B-1/2%7D%20%5C,%20%5Cint_%7Bu=t%7D%5E%7Bt+%5Cdelta%7D%20H(X%5E%7B%5Cvarepsilon%7D_u,Y%5E%7B%5Cvarepsilon%7D_u)%7D%20%5C,%20du%20%5C,%20+%20%5C,%20%5Cint_%7Bu=t%7D%5E%7Bt+%5Cdelta%7D%20%5Coverline%7BF%7D(x,%20W%5EX_u)%20%5C,%20du.%0A%5Cend%7Balign%7D%0A%5Ctag%7B5%7D"></span></p>
<p>The second integral term is an averaging term that can be treated easily. Approximating the process <img src="https://latex.codecogs.com/png.latex?t%20%5Cmapsto%20Y%5E%7B%5Cvarepsilon%7D_t"> by <img src="https://latex.codecogs.com/png.latex?t%20%5Cmapsto%20Y%5E%7B%5Bx%5D%7D_%7Bt%20%5Cvarepsilon%5E%7B-1%7D%7D">, the first integral on the RHS of Equation&nbsp;5 can be approximated as</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0A%5Cunderbrace%7B%5Cvarepsilon%5E%7B-1/2%7D%20%5Cint_%7Bu=t%7D%5E%7Bt+%5Cdelta%7D%20%20H(x,Y%5E%7B%5Bx%5D%7D_%7Bu%20%5C,%20%5Cvarepsilon%5E%7B-1%7D%7D)%20%5C,%20du%7D_%7B%5Ctextrm%7BCLT%7D%7D%20%5C,%0A+%0A%5Cunderbrace%7B%5Cint_%7Bu=t%7D%5E%7Bt+%5Cdelta%7D%20%5Cvarepsilon%5E%7B-1/2%7D%20%5Cpartial_x%20H(x,Y%5E%7B%5Bx%5D%7D_%7Bu%20%5C,%20%5Cvarepsilon%5E%7B-1%7D%7D)%20%5C,%20(X%5E%7B%5Cvarepsilon%7D_u%20-%20x)%20%5C,%20%5C,%20du%7D_%7B%5Ctextrm%7B(drift)%7D%7D.%0A%5Cend%7Balign%7D%0A"></p>
<p>After a time-rescaling, one can readily see that the first term is described by the CLT of Equation&nbsp;4,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cvarepsilon%5E%7B-1/2%7D%20%5Cint_%7Bu=t%7D%5E%7Bt+%5Cdelta%7D%20%20H(x,Y%5E%7B%5Bx%5D%7D_%7Bu%20%5C,%20%5Cvarepsilon%5E%7B-1%7D%7D)%20%5C,%20du%0A%5Capprox%20V_x%5BH(x,%20%5Ccdot)%5D%5E%7B1/2%7D%20%5Cmathcal%7BN%7D(0,%20%5Cdelta).%0A"></p>
<p>The second term is further approximated as</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0A%5Cvarepsilon%5E%7B-1%7D%20%5C,%20&amp;%5Cint_%7Bu=t%7D%5E%7Bt+%5Cdelta%7D%5Cint_%7Bv=t%7D%5E%7Bt+%5Cdelta%7D%0A%5Cpartial_x%20H(x,Y%5E%7B%5Bx%5D%7D_%7Bu%20%5C,%20%5Cvarepsilon%5E%7B-1%7D%7D)%20%5C,%20H(x,Y%5E%7B%5Bx%5D%7D_%7Bv%20%5C,%20%5Cvarepsilon%5E%7B-1%7D%7D)%20%5C,%201_%7Bv%3Cu%7D%20%5C,%20du%20%5C,%20dv%5C%5C%0A&amp;=%20%20%7B%5Cleft(%20%20%5Cfrac%7B1%7D%7B%5Cdelta%20%5Cvarepsilon%5E%7B-1%7D%7D%20%5Cint_%7Bu=t%7D%5E%7Bt+%5Cdelta%20%5Cvarepsilon%5E%7B-1%7D%7D%5Cint_%7Bv=t%7D%5E%7Bt+%5Cdelta%20%5Cvarepsilon%5E%7B-1%7D%7D%20%5Cpartial_x%20H(x,Y%5E%7B%5Bx%5D%7D_%7Bu%7D)%20%5C,%20H(x,Y%5E%7B%5Bx%5D%7D_%7Bv%7D)%20%5C,%201_%7Bv%3Cu%7D%20%20%5Cright)%7D%20%20%5C,%20%5Cdelta,%0A%5Cend%7Balign%7D%0A"></p>
<p>where the second equality comes from the time-rescaling <img src="https://latex.codecogs.com/png.latex?t%20%5Cmapsto%20t%20%5Cvarepsilon">. The process <img src="https://latex.codecogs.com/png.latex?Y%5E%7B%5Bx%5D%7D"> mixes on time scales of order <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BO%7D(1)"> so that the term inside the brackets <img src="https://latex.codecogs.com/png.latex?%20%7B%5Cleft(%20%5Cldots%20%5Cright)%7D%20"> converges to its expectation. Setting <img src="https://latex.codecogs.com/png.latex?T%20=%20%5Cdelta%20%5C,%20%5Cvarepsilon%5E%7B-1%7D%20%5Cto%20%5Cinfty">, one obtains</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0AI(x)%20&amp;=%0A%5Cfrac%7B1%7D%7BT%7D%20%5Ciint_%7B%5B0,T%5D%5E2%7D%0A%5Cpartial_x%20H(x,Y%5E%7B%5Bx%5D%7D_%7Bu%7D)%20%5C,%20H(x,Y%5E%7B%5Bx%5D%7D_%7Bv%7D)%20%5C,%201_%7Bv%3Cu%7D%20%20%5C,%20du%20%5C,%20dv%20%5C%5C%0A&amp;%5Cto%0A%5Cint%20%5Crho_x(dy)%20%5C,%20H(x,y)%20%5C,%20%20%20%7B%5Cleft%5C%7B%20%20%5Cint_%7Bs=0%7D%5E%7B%5Cinfty%7D%20%5Cmathbb%7BE%7D%5B%5Cpartial_x%20H(%5Chat%7Bx%7D,%20Y%5E%7B%5Bx%5D%7D_s)%20%5C,%20%7C%20%20Y%5E%7B%5Bx%5D%7D_0=y%5D%20%5C,%20ds%20%20%5Cright%5C%7D%7D%20.%0A%5Cend%7Balign%7D%0A"></p>
<p>In conclusion, the fast-slow system</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Balign%7D%0AdX%5E%7B%5Cvarepsilon%7D%20&amp;=%20%20%5Ctextcolor%7Bblue%7D%7B%5Cvarepsilon%5E%7B-1/2%7D%20%5C,%20H(X%5E%7B%5Cvarepsilon%7D,%20Y%5E%7B%5Cvarepsilon%7D)%7D%20%5C,%20dt%20+%20%5Cmu(X%5E%7B%5Cvarepsilon%7D,%20Y%5E%7B%5Cvarepsilon%7D)%20%5C,%20dt%20+%20%5Csigma(X%5E%7B%5Cvarepsilon%7D,%20Y%5E%7B%5Cvarepsilon%7D)%20%5C,%20dW%5Ex%5C%5C%0AdY%5E%7B%5Cvarepsilon%7D%20&amp;=%20%20%5Ctextcolor%7Bred%7D%7B%5Cvarepsilon%5E%7B-1%7D%7D%20%5C,%20G(X%5E%7B%5Cvarepsilon%7D,Y%5E%7B%5Cvarepsilon%7D,%20W%5EY)%20%5C,%20dt%0A%5Cend%7Balign%7D%0A%5Cright.%0A"></p>
<p>can be described in the regime <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon%5Cto%200"> by the effective dynamics</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AdX%20=%20%20%5Ctextcolor%7Bblue%7D%7BI(X)%20%5C,%20dt%20+%20%5CGamma%5E%7B1/2%7D(X)%20%5C,%20dW%5E%7BH%7D%7D%0A+%0A%5Coverline%7B%5Cmu%7D(X)%20%5C,%20dt%20+%20%5Coverline%7B%5Csigma%7D(X)%20%5C,%20dW%5EX.%0A"></p>
<p>for two independent Brownian motions <img src="https://latex.codecogs.com/png.latex?W%5EX"> and <img src="https://latex.codecogs.com/png.latex?W%5EH">. The volatility term <img src="https://latex.codecogs.com/png.latex?%20%5Ctextcolor%7Bblue%7D%7B%5CGamma(x)%7D"> comes from the CLT and the drift term <img src="https://latex.codecogs.com/png.latex?%20%5Ctextcolor%7Bblue%7D%7BI(x)%7D"> comes from the self-interaction term:</p>
<p><span id="eq-homogenized-terms"><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Balign%7D%0A%5CGamma(x)%0A&amp;=%20%5Clim_%7BT%20%5Cto%20%5Cinfty%7D%20%5C;%20%5Cmathop%7B%5Cmathrm%7BVar%7D%7D%20%7B%5Cleft%5C%7B%20T%5E%7B-1/2%7D%20%5Cint_%7B0%7D%5ET%20H(x,%20Y%5E%7B%5Bx%5D%7D_t)%20%5C,%20dt%20%5Cright%5C%7D%7D%20%5C%5C%0A%25%0AI(x)%0A&amp;=%20%5Clim_%7BT%20%5Cto%20%5Cinfty%7D%20%5C;%20%5Cfrac%7B1%7D%7BT%7D%20%5Ciint_%7B0%3Cu%3Cv%3CT%7D%20H(x,%20Y%5E%7B%5Bx%5D%7D_u)%20%5C,%20%5Cpartial_x%20H(x,%20Y%5E%7B%5Bx%5D%7D_v)%20%5C,%20du%20%5C,%20dv.%0A%5Cend%7Balign%7D%0A%5Cright.%0A%5Ctag%7B6%7D"></span></p>
<p>For the drift function, the scaling <img src="https://latex.codecogs.com/png.latex?T%5E%7B-1%7D%20%5Cint_%7B0%3Cu%3Cv%3CT%7D%20(%5Cldots)"> may look a bit surprising at first sight as one may expect <img src="https://latex.codecogs.com/png.latex?T%5E%7B-2%7D%20%5Cint_%7B0%3Cu%3Cv%3CT%7D%20(%5Cldots)"> instead. Note that since the process <img src="https://latex.codecogs.com/png.latex?Y%5E%7B%5Bx%5D%7D"> mixes on a time scale <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BO%7D(1)"> and the centering condition <img src="https://latex.codecogs.com/png.latex?%5Cint%20H(x,%20y)%20%5Crho_x(dy)=0"> holds, the expectation <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D%5BH(x,%20Y%5E%7B%5Bx%5D%7D_u)%20%5C,%20%5Cpartial_x%20H(x,%20Y%5E%7B%5Bx%5D%7D_v)%5D"> goes to zero as soon as <img src="https://latex.codecogs.com/png.latex?%7Cu-v%7C%20%5Cgg%201">. This means that only the subset <img src="https://latex.codecogs.com/png.latex?%7Cu-v%7C%20=%20%5Cmathcal%7BO%7D(1)"> of <img src="https://latex.codecogs.com/png.latex?%5B0,T%5D%5E2"> really matters in that double integral, hence the <img src="https://latex.codecogs.com/png.latex?(1/T)"> normalization factor.</p>
</section>
<section id="closed-form-solution-poisson-equation" class="level3">
<h3 class="anchored" data-anchor-id="closed-form-solution-poisson-equation">Closed form solution &amp; Poisson equation:</h3>
<p>The drift and volatility terms <img src="https://latex.codecogs.com/png.latex?%5CGamma(x)"> and <img src="https://latex.codecogs.com/png.latex?I(x)"> quantify the mixing properties of the fast process <img src="https://latex.codecogs.com/png.latex?Y%5E%7B%5Bx%5D%7D">. While the formulas in Equation&nbsp;6 are intuitive, they can be difficult to work with if one needs exact expressions for the drift and volatility functions. Instead, these can also be expressed in terms of the solution to an appropriate <a href="../../notes/Poisson_Eq_Asymp_Var/Poisson_Eq_Asymp_Var.html">Poisson equation</a>.</p>
<p><span id="eq-poisson"><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0AI(x)%20&amp;=%0A%5Cfrac%7B1%7D%7BT%7D%20%5Ciint_%7B%5B0,T%5D%5E2%7D%0AH(x,Y%5E%7B%5Bx%5D%7D_%7Bv%7D)%20%5C,%20%5Cpartial_x%20H(x,Y%5E%7B%5Bx%5D%7D_%7Bu%7D)%20%5C,%201_%7Bv%3Cu%7D%20%20%5C,%20du%20%5C,%20dv%20%5C%5C%0A&amp;%5Cto%0A%5Cint%20%5Crho_x(dy)%20%5C,%20H(x,y)%20%5C,%20%5Cpartial_%7B%5Chat%7Bx%7D%7D%20%20%7B%5Cleft%5C%7B%20%20%5Cint_%7Bs=0%7D%5E%7B%5Cinfty%7D%20%5Cmathbb%7BE%7D%5BH(%5Chat%7Bx%7D,%20Y%5E%7B%5Bx%5D%7D_s)%20%5C,%20%7CY%5E%7B%5Bx%5D%7D_0=y%5D%20%5C,%20ds%20%20%5Cright%5C%7D%7D%20%5C%5C%0A&amp;=%0A-%5Cint%20%5Crho_x(dy)%20%5C,%20H(x,y)%20%5C,%20%5Cpartial_%7Bx%7D%20%5CPhi(x,y)%5C%5C%0A&amp;=%20-%5Cleft%3C%20H(x,%20%5Ccdot),%20%5Cpartial_x%20%5CPhi(x,%20%5Ccdot)%20%5Cright%3E_%7B%5Crho_x%7D%0A%5Cend%7Balign%7D%0A%5Ctag%7B7%7D"></span></p>
<p>where the function <img src="https://latex.codecogs.com/png.latex?%5CPhi(x,y)"> is the solution to the <a href="../../notes/Poisson_Eq_Asymp_Var/Poisson_Eq_Asymp_Var.html">Poisson equation</a></p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7BL%7D%5E%7BY%5E%7B%5Bx%5D%7D%7D%20%5CPhi(x,%20%5Ccdot)%20=%20H(x,%20%5Ccdot)%0A"></p>
<p>for all <img src="https://latex.codecogs.com/png.latex?x%20%5Cin%20%5Cmathcal%7BX%7D"> and <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7D%5E%7BY%5E%7B%5Bx%5D%7D%7D"> is the generator of the fast process <img src="https://latex.codecogs.com/png.latex?dY%5E%7B%5Bx%5D%7D/dt%20=%20G(x,Y%5E%7B%5Bx%5D%7D,%20W%5EY)">. The last equality in Equation&nbsp;7 follows from the <a href="../../notes/Poisson_Eq_Asymp_Var/Poisson_Eq_Asymp_Var.html">integral representation</a> of the Poisson equation. Similarly, and also as explained <a href="../../notes/Poisson_Eq_Asymp_Var/Poisson_Eq_Asymp_Var.html">here</a>, the asymptotic variance term can also be expressed in terms of the function <img src="https://latex.codecogs.com/png.latex?%5CPhi">,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0A%5CGamma(x)%0A&amp;=%20%5Clim_%7BT%20%5Cto%20%5Cinfty%7D%20%5C;%20%5Cmathop%7B%5Cmathrm%7BVar%7D%7D%20%7B%5Cleft%5C%7B%20T%5E%7B-1/2%7D%20%5Cint_%7B0%7D%5ET%20H(x,%20Y%5E%7B%5Bx%5D%7D_t)%20%5C,%20dt%20%5Cright%5C%7D%7D%20%5C%5C%0A&amp;=%20-2%20%5Cint_%7B%5Cmathcal%7BY%7D%7D%20%5CPhi(x,%20y)%20%5C,%20H(x,%20y)%20%5C,%20%5Crho_x(dy)%5C%5C%0A&amp;=%20-2%20%5Cleft%3C%20%5CPhi,%20%5Cmathcal%7BL%7D%5E%7BY%5E%7B%5Bx%5D%7D%7D%20%5CPhi%20%5Cright%3E_%7B%5Crho_x%7D.%0A%5Cend%7Balign%7D%0A"></p>
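<p>These Poisson-equation formulas are easy to check in the OU setting used in the examples below. For the generator Lφ = λ(-y φ' + φ'') with invariant law N(0, 1) and the observable H(y) = y, the Poisson equation LΦ = H is solved by Φ(y) = -y/λ, and the formula Γ = -2⟨Φ, H⟩_ρ should give 2/λ. A small sketch evaluating the Gaussian integral by Gauss-Hermite quadrature (the value of λ is arbitrary):</p>

```python
import numpy as np

lam = 1.5                      # hypothetical mixing rate lambda > 0
phi = lambda y: -y / lam       # Poisson solution: L phi = lam * (-y) * (-1/lam) = y = H(y)

# Gamma = -2 <Phi, H>_rho with rho = N(0, 1), via Gauss-Hermite quadrature:
# E_{N(0,1)}[g(Y)] = sum_i w_i g(sqrt(2) x_i) / sqrt(pi)
nodes, weights = np.polynomial.hermite.hermgauss(40)
y = np.sqrt(2.0) * nodes                 # map quadrature nodes to a standard normal
w = weights / np.sqrt(np.pi)
gamma = -2.0 * np.sum(w * phi(y) * y)    # predicted value: 2 / lam
```

Since the integrand is a degree-two polynomial, the quadrature is exact up to floating-point error.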
</section>
<section id="example-integrated-ou-process" class="level3">
<h3 class="anchored" data-anchor-id="example-integrated-ou-process">Example: integrated OU process</h3>
<p>Consider a slow process obtained by integrating an OU process,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Balign%7D%0AdX%5E%7B%5Cvarepsilon%7D%20&amp;=%20%5Cvarepsilon%5E%7B-1/2%7D%20%5C,%20Y%5E%7B%5Cvarepsilon%7D%20%5C,%20dt%5C%5C%0AdY%5E%7B%5Cvarepsilon%7D%20&amp;=%20-%5Clambda%20%5Cvarepsilon%5E%7B-1%7D%5C,%20Y%5E%7B%5Cvarepsilon%7D%20%5C,%20dt%20+%20%5Csqrt%7B2%20%5Clambda/%5Cvarepsilon%7D%20%5C,%20dW%5EY,%0A%5Cend%7Balign%7D%0A%5Cright.%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5Clambda%20%3E%200"> is just a fixed time-scaling parameter. The fast OU process mixes on time scales of order <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BO%7D(%5Cvarepsilon)"> and has a standard Gaussian distribution as invariant distribution. Homogenization gives that in the regime <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon%5Cto%200">, the slow process can be approximated as</p>
<p><span id="eq-integ-OU"><img src="https://latex.codecogs.com/png.latex?%0AdX%20=%20%5Csqrt%7B2/%5Clambda%7D%20%5C,%20dW%0A%5Ctag%7B8%7D"></span></p>
<p>since the asymptotic variance is</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathop%7B%5Cmathrm%7BVar%7D%7D%20%7B%5Cleft%5C%7B%20%20T%5E%7B-1/2%7D%20%5Cint_%7Bt=0%7D%5E%7BT%7D%20Y_t%20%5C,%20dt%20%5Cright%5C%7D%7D%0A%5Cto%0A2%20%5C,%20%5Cint_%7B0%7D%5E%7B%5Cinfty%7D%20C(r)%20%5C,%20dr%20=%202/%5Clambda%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?C(r)%20=%20%5Cmathbb%7BE%7D%5BY_t%20Y_%7Bt+r%7D%5D%20=%20%5Cexp%5B-%5Clambda%20r%5D"> is the autocorrelation function of the fast OU process, as explained <a href="../../notes/Poisson_Eq_Asymp_Var/Poisson_Eq_Asymp_Var.html">here</a>. The fact that the effective diffusion is (twice) the integrated autocorrelation of the fast process is an example of <a href="https://en.wikipedia.org/wiki/Green–Kubo_relations">Green-Kubo relations</a>.</p>
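<p>A direct Monte-Carlo check of Equation 8: simulating many independent copies of the fast-slow pair with a small ε, the variance of the slow component at time T should be close to (2/λ) T. A sketch, with all parameter values arbitrary choices for illustration:</p>

```python
import numpy as np

rng = np.random.default_rng(2)

lam, eps = 2.0, 1e-2
dt, T = 2e-4, 1.0              # dt must resolve the fast time scale eps / lam
n, reps = int(T / dt), 5000

x = np.zeros(reps)
y = rng.standard_normal(reps)  # start the fast OU process at stationarity
for _ in range(n):
    x += y * dt / np.sqrt(eps)
    y += -(lam / eps) * y * dt + np.sqrt(2.0 * lam * dt / eps) * rng.standard_normal(reps)

var_T = x.var()                # homogenization predicts (2 / lam) * T = 1.0 here
```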
</section>
<section id="example-overdamped-langevin-dynamics" class="level3">
<h3 class="anchored" data-anchor-id="example-overdamped-langevin-dynamics">Example: Overdamped Langevin Dynamics</h3>
<p>This example does not exactly fall within the homogenization result described in the previous section, but almost. Consider a potential <img src="https://latex.codecogs.com/png.latex?U"> and the slow-fast dynamics:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Balign%7D%0AdX%5E%7B%5Cvarepsilon%7D%20&amp;=%20%5Cvarepsilon%5E%7B-1/2%7D%20%5C,%20Y%5E%7B%5Cvarepsilon%7D%20%5C,%20dt%5C%5C%0AdY%5E%7B%5Cvarepsilon%7D%20&amp;=%20-%20%5Cvarepsilon%5E%7B-1%7D%5C,%20%5BY%5E%7B%5Cvarepsilon%7D+%5Cvarepsilon%5E%7B1/2%7D%5Cnabla%20U(X%5E%7B%5Cvarepsilon%7D)%5D%20%5C,%20dt%20+%20%5Csqrt%7B2%20/%5Cvarepsilon%7D%20%5C,%20dW%5EY.%0A%5Cend%7Balign%7D%0A%5Cright.%0A"></p>
<p>For any fixed value of <img src="https://latex.codecogs.com/png.latex?x%20%5Cin%20%5Cmathcal%7BX%7D">, the fast OU-dynamics</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AdY%20=%20-%5BY+%5Cvarepsilon%5E%7B1/2%7D%5Cnabla%20U(x)%5D%20%5C,%20dt%20+%20%5Csqrt%7B2%7D%20%5C,%20dW%5EY%0A"></p>
<p>converges to a Gaussian distribution with mean <img src="https://latex.codecogs.com/png.latex?-%5Cvarepsilon%5E%7B1/2%7D%20%5C,%20%5Cnabla%20U(x)"> and unit variance. The same arguments as in the previous section immediately give that, starting from <img src="https://latex.codecogs.com/png.latex?X%5E%7B%5Cvarepsilon%7D_0=x">, we have</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cvarepsilon%5E%7B-1/2%7D%20%5C,%20%5Cint_%7B0%7D%5E%7B%5Cdelta%7D%20Y%5E%7B%5Cvarepsilon%7D%20%5C,%20dt%0A%5C;%20%5Cto%20%5C;%0A-%5Cnabla%20U(x)%20%5C,%20%5Cdelta%20+%20%5Csqrt%7B2%20%5Cdelta%7D%20%5C,%20%5Cmathcal%7BN%7D(0,1).%0A"></p>
<p>The <img src="https://latex.codecogs.com/png.latex?%5Csqrt%7B2%7D"> term comes from the OU asymptotic variance. This shows that the slow process converges as <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon%5Cto%200"> to the <a href="https://en.wikipedia.org/wiki/Brownian_dynamics">overdamped Langevin dynamics</a></p>
<p><img src="https://latex.codecogs.com/png.latex?%0AdX%20=%20-%5Cnabla%20U(X)%20%5C;%20+%20%5C;%20%5Csqrt%7B2%7D%20%5C,%20dW.%0A"></p>
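<p>As an illustration, one can simulate this slow-fast system for the quadratic potential U(x) = x²/2 (a choice made here purely for testing, since the limiting overdamped Langevin dynamics dX = -X dt + √2 dW then has stationary law N(0, 1)) and check the long-run variance of the slow component:</p>

```python
import numpy as np

rng = np.random.default_rng(3)

eps, dt, T = 1e-2, 1e-3, 400.0
n = int(T / dt)
xi = rng.standard_normal(n)        # Brownian increments driving the fast channel
grad_U = lambda x: x               # U(x) = x^2 / 2 (illustrative choice)

x, y = 0.0, 0.0
xs = np.empty(n)
for i in range(n):
    x += y * dt / np.sqrt(eps)
    y += -(y + np.sqrt(eps) * grad_U(x)) * dt / eps + np.sqrt(2.0 * dt / eps) * xi[i]
    xs[i] = x

var_x = xs[n // 4:].var()          # limiting overdamped Langevin stationary law is N(0, 1)
```

For this linear case the stationary variance of the slow component is in fact exactly one for every ε, so the check does not even rely on ε being very small.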
</section>
<section id="example-stratonovich-corrections" class="level3">
<h3 class="anchored" data-anchor-id="example-stratonovich-corrections">Example: Stratonovich Corrections</h3>
<p>Consider a function <img src="https://latex.codecogs.com/png.latex?f:%20%5Cmathbb%7BR%7D%5Cto%20%5Cmathbb%7BR%7D"> and the slow-fast system</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AdX%5E%7B%5Cvarepsilon%7D%20=%20%5Cvarepsilon%5E%7B-1/2%7D%20%5C,%20f(X%5E%7B%5Cvarepsilon%7D)%20%5C,%20Y%5E%7B%5Cvarepsilon%7D%20%5C,%20dt%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?dY%5E%7B%5Cvarepsilon%7D%20=%20-(%5Clambda/%5Cvarepsilon)%20Y%5E%7B%5Cvarepsilon%7D%20%5C,%20dt%20+%20%5Csqrt%7B2%20%5Clambda%20/%20%5Cvarepsilon%7D%20%5C,%20dW%5EY"> is a fast OU process mixing on scales of order <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BO%7D(%5Cvarepsilon)"> and with standard centred Gaussian invariant distribution <img src="https://latex.codecogs.com/png.latex?%5Crho(dy)">. The discussion leading to Equation&nbsp;8 suggests that the term <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon%5E%7B-1/2%7D%20%5C,%20Y%5E%7B%5Cvarepsilon%7D%20%5C,%20dt"> can heuristically be thought of as <img src="https://latex.codecogs.com/png.latex?(2/%5Clambda)%5E%7B1/2%7D%20%5C,%20dW">, which would imply that the effective dynamics for the slow process is</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AdX%20=%20%5Csqrt%7B2/%5Clambda%7D%20%5C,%20f(X)%20%5C,%20dW.%0A"></p>
<p>We will see that this heuristic is <strong>wrong</strong>! To obtain the effective dynamics of the slow process as <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon%5Cto%200">, note that the generator of the fast OU process reads <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7D%5Cvarphi=%20%5Clambda%20%5B%20-y%5C,%5Cvarphi_y%20+%20%5Cvarphi_%7Byy%7D%5D">, so that the Poisson equation <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7D%5CPhi(x,y)%20=%20f(x)y"> is solved by <img src="https://latex.codecogs.com/png.latex?%5CPhi(x,y)%20=%20-f(x)y/%5Clambda">. One already knows that <img src="https://latex.codecogs.com/png.latex?%5Cmathop%7B%5Cmathrm%7BVar%7D%7D%5BT%5E%7B-1/2%7D%5Cint_%7B%5B0,T%5D%7D%20Y_t%20%5C,%20dt%5D%20%5Cto%202/%5Clambda">. The drift term is given by</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0AI(x)%20&amp;=%20-%5Cint%20f(x)%20%5C,%20y%20%5C,%20%5Cpartial_x%20%5CPhi(x,y)%20%5C,%20%5Crho(dy)%5C%5C%0A&amp;=%20%5Clambda%5E%7B-1%7D%20%5Cint%20f(x)%20f'(x)%20y%5E2%20%5C,%20%5Crho(dy)%5C%5C%0A&amp;=%20%5Clambda%5E%7B-1%7D%20f(x)%20f'(x).%0A%5Cend%7Balign%7D%0A"></p>
<p>Putting everything together gives that the effective slow dynamics reads</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0AdX%20&amp;=%20%20%5Ctextcolor%7Bblue%7D%7B%20%5Clambda%5E%7B-1%7D%20f'(X)%20f(X)%20%5C,%20dt%20%7D%20+%20%5Csqrt%7B2/%5Clambda%7D%20%5C,%20f(X)%20%5C,%20dW%5C%5C%0A&amp;=%20%5Csqrt%7B2/%5Clambda%7D%20%5C,%20f(X)%20%20%5Ctextcolor%7Bred%7D%7B%5Ccirc%7D%20dW%0A%5Cend%7Balign%7D%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%20%5Ctextcolor%7Bred%7D%7B%5Ccirc%7D"> denotes <a href="https://en.wikipedia.org/wiki/Stratonovich_integral">Stratonovich integration</a>.</p>
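<p>The Itô/Stratonovich distinction is visible in simulation. With the illustrative choices f(x) = x, λ = 1 and X₀ = 1 (made here only for the test), the Stratonovich limit dX = √2 X ∘ dW gives E[X_T] = e^T, whereas the naive Itô reading dX = √2 X dW would predict E[X_T] = 1. A Monte-Carlo sketch:</p>

```python
import numpy as np

rng = np.random.default_rng(4)

lam, eps = 1.0, 1e-3
dt, T = 1e-4, 0.5
n, reps = int(T / dt), 4000

x = np.ones(reps)                      # X_0 = 1 for every replica
y = rng.standard_normal(reps)          # fast OU process started at stationarity
for _ in range(n):
    x *= 1.0 + y * dt / np.sqrt(eps)   # dX = eps^{-1/2} f(X) Y dt with f(x) = x
    y += -(lam / eps) * y * dt + np.sqrt(2.0 * lam * dt / eps) * rng.standard_normal(reps)

mean_T = x.mean()   # Stratonovich limit predicts exp(T) ~ 1.65; naive Ito would give 1.0
```

The empirical mean clearly favors the Stratonovich prediction, confirming that the heuristic Itô substitution misses the drift correction λ⁻¹ f'(X) f(X).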
</section>
<section id="readings" class="level3">
<h3 class="anchored" data-anchor-id="readings">Readings</h3>
<p>The book <span class="citation" data-cites="pavliotis2008multiscale">(Pavliotis and Stuart 2008)</span> is beautiful, and I quite like the section on multiscale expansion in <span class="citation" data-cites="weinan2011principles">(Weinan 2011)</span>. For proving this type of results with the “martingale problem” approach <span class="citation" data-cites="stroock1997multidimensional">(Stroock and Varadhan 1997)</span>, the lectures <span class="citation" data-cites="papanicolaou1977martingale">(Papanicolaou 1977)</span> are nicely done.</p>



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0">
<div id="ref-hinch_1991" class="csl-entry">
Hinch, E. J. 1991. <em>Perturbation Methods</em>. Cambridge University Press.
</div>
<div id="ref-papanicolaou1977martingale" class="csl-entry">
Papanicolaou, George. 1977. <span>“Martingale Approach to Some Limit Theorems.”</span> In <em>Papers from the Duke Turbulence Conference, Duke Univ., Durham, NC, 1977</em>.
</div>
<div id="ref-pavliotis2008multiscale" class="csl-entry">
Pavliotis, Grigoris, and Andrew Stuart. 2008. <em>Multiscale Methods: Averaging and Homogenization</em>. Springer Science &amp; Business Media.
</div>
<div id="ref-stroock1997multidimensional" class="csl-entry">
Stroock, Daniel W, and SR Srinivasa Varadhan. 1997. <em>Multidimensional Diffusion Processes</em>. Vol. 233. Springer Science &amp; Business Media.
</div>
<div id="ref-weinan2011principles" class="csl-entry">
Weinan, E. 2011. <em>Principles of Multiscale Modeling</em>. Cambridge University Press.
</div>
</div></section></div> ]]></description>
  <category>diffusion</category>
  <guid>https://alexxthiery.github.io/notes/averaging_homogenization/averaging_homogenization.html</guid>
  <pubDate>Mon, 27 Nov 2023 17:00:00 GMT</pubDate>
</item>
<item>
  <title>Ensemble Kalman Smoother (EnKS)</title>
  <link>https://alexxthiery.github.io/notes/Gaussian_Assimilation/gaussian_assimilation_smoothing.html</link>
  <description><![CDATA[ 





<p>Consider a linear-Gaussian state space model with <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BR%7D%5E%7BD_x%7D">-valued dynamics <img src="https://latex.codecogs.com/png.latex?X_%7Bt+1%7D%20%5Csim%20F%20%5C,%20X_t%20+%20%5Cmathcal%7BN%7D(0,Q)"> and <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BR%7D%5E%7BD_y%7D">-valued observations <img src="https://latex.codecogs.com/png.latex?Y_t%20%5Csim%20H%20%20X_t%20+%20%5Cmathcal%7BN%7D(0,R)">. Assuming a Gaussian initial distribution, the <strong>filtering distributions</strong> <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D(X_t%20%5Cin%20dx%20%5C,%20%7C%20Y_%7B1:t%7D)%20%5Cequiv%20%5Cmathcal%7BN%7D(%5Cmu_%7Bt%7Ct%7D,%20P_%7Bt%7Ct%7D)"> are Gaussian and can be sequentially computed with the <a href="https://en.wikipedia.org/wiki/Kalman_filter">Kalman Filter</a>. Similarly, the <strong>predictive distributions</strong> <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D(X_%7Bt+1%7D%20%5Cin%20dx%20%5C,%20%7C%20Y_%7B1:t%7D)%20%5Cequiv%20%5Cmathcal%7BN%7D(%5Cmu_%7Bt+1%7Ct%7D,%20P_%7Bt+1%7Ct%7D)"> are straightforward to obtain from the filtering distributions: <img src="https://latex.codecogs.com/png.latex?%5Cmu_%7Bt+1%7Ct%7D%20=%20F%20%5C,%20%5Cmu_%7Bt%7Ct%7D"> and <img src="https://latex.codecogs.com/png.latex?P_%7Bt+1%7Ct%7D%20=%20F%20%5C,%20P_%7Bt%7Ct%7D%20%5C,%20F%5E%5Ctop%20+%20Q">. Given observations <img src="https://latex.codecogs.com/png.latex?y_%7B1:T%7D%20%5Cequiv%20(y_1,%20%5Cldots,%20y_T)"> and <img src="https://latex.codecogs.com/png.latex?1%20%5Cleq%20t%20%5Cleq%20T">, the <strong>smoothing distributions</strong> <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D(X_t%20%5Cin%20dx%20%5C,%20%7C%20Y_%7B1:T%7D)%20%5Cequiv%20%5Cmathcal%7BN%7D(%5Cmu_%7Bt%7CT%7D,%20P_%7Bt%7CT%7D)"> can be computed by performing a “backward pass”. 
Since everything is linear and Gaussian, it is just an exercise in Linear Algebra &amp; Gaussian-conditioning, as described by the Rauch-Tung-Striebel <span class="citation" data-cites="rauch1965maximum">(Rauch, Tung, and Striebel 1965)</span> smoothing recursions. The backward recursion reads</p>
<p><span id="eq-RTS"><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Baligned%7D%0A%5Cmu_%7Bt%7CT%7D%0A&amp;=%20%5Cmu_%7Bt%7Ct%7D%20+%20B_t%20%5C,%20%20%7B%5Cleft(%20%5Cmu_%7Bt+1%7CT%7D%20-%20%5Cmu_%7Bt+1%7Ct%7D%20%5Cright)%7D%20%5C%5C%0AP_%7Bt%7CT%7D%0A&amp;=%0AP_%7Bt%7Ct%7D%20+%20B_t%20%20%7B%5Cleft(%20%20P_%7Bt+1%7CT%7D%20-%20P_%7Bt+1%7Ct%7D%20%20%5Cright)%7D%20%20B%5E%5Ctop_%7Bt%7D%0A%5Cend%7Baligned%7D%0A%5Cright.%0A%5Ctag%7B1%7D"></span></p>
<p>and allows one to compute the smoothing means and covariance matrices <img src="https://latex.codecogs.com/png.latex?(%5Cmu_%7Bt%7CT%7D,%20P_%7Bt%7CT%7D)"> for <img src="https://latex.codecogs.com/png.latex?1%20%5Cleq%20t%20%5Cleq%20T"> starting from the knowledge of <img src="https://latex.codecogs.com/png.latex?(%5Cmu_%7BT%7CT%7D,%20P_%7BT%7CT%7D)">. In Equation&nbsp;1, the <strong>smoothing gain matrix</strong> <img src="https://latex.codecogs.com/png.latex?B_t"> is given by</p>
<p><span id="eq-B-cond"><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0AB_t%20&amp;=%0A%5Cmathop%7B%5Cmathrm%7BCov%7D%7D(X_t,%20X_%7Bt+1%7D%20%5C,%20%7C%20y_%7B1:t%7D)%20%5C,%20%5Cmathop%7B%5Cmathrm%7BVar%7D%7D(X_%7Bt+1%7D%20%5C,%20%7C%20y_%7B1:t%7D)%5E%7B-1%7D%20%5C%5C%0A&amp;=%0AP_%7Bt%7Ct%7D%20F%5E%5Ctop%20%5C,%20%20%7B%5Cleft(%20F%20%5C,%20P_%7Bt%7Ct%7D%20%5C,%20F%5E%5Ctop%20+%20Q%20%5Cright)%7D%20%5E%7B-1%7D.%0A%5Cend%7Balign%7D%0A%5Ctag%7B2%7D"></span></p>
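<p>The backward recursion Equation&nbsp;1, together with the gain formula Equation&nbsp;2, can be sketched in a few lines of NumPy. This is only an illustrative sketch, not part of the original derivation: the function name <code>rts_smoother</code> and its interface are made up here, and the filtering moments are assumed to have been produced beforehand by a forward Kalman filter pass.</p>

```python
import numpy as np

def rts_smoother(mus_f, Ps_f, F, Q):
    """Rauch-Tung-Striebel backward pass (illustrative sketch).

    mus_f: (T, Dx) filtering means mu_{t|t}
    Ps_f:  (T, Dx, Dx) filtering covariances P_{t|t}
    F, Q:  transition matrix and transition-noise covariance
    Returns the smoothing means mu_{t|T} and covariances P_{t|T}.
    """
    T, _ = mus_f.shape
    mus_s, Ps_s = mus_f.copy(), Ps_f.copy()  # at t = T, smoothing = filtering
    for t in range(T - 2, -1, -1):
        # predictive moments: mu_{t+1|t} = F mu_{t|t},  P_{t+1|t} = F P_{t|t} F' + Q
        mu_p = F @ mus_f[t]
        P_p = F @ Ps_f[t] @ F.T + Q
        # smoothing gain of Equation 2: B_t = P_{t|t} F' (F P_{t|t} F' + Q)^{-1}
        B = Ps_f[t] @ F.T @ np.linalg.inv(P_p)
        # backward recursion of Equation 1
        mus_s[t] = mus_f[t] + B @ (mus_s[t + 1] - mu_p)
        Ps_s[t] = Ps_f[t] + B @ (Ps_s[t + 1] - P_p) @ B.T
    return mus_s, Ps_s
```

In practice one would solve a linear system rather than form the explicit inverse; it is written this way only to mirror Equation&nbsp;2.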
<p><a href="../../notes/Gaussian_Assimilation/gaussian_assimilation.html">The Ensemble Kalman Filter</a> (EnKF) is a non-linear equivalent of the Kalman filter, and the purpose of these notes is to derive the equivalent “ensemble version” of the backward recursion Equation&nbsp;1. For this purpose, it helps to understand the role of the smoothing gain matrix <img src="https://latex.codecogs.com/png.latex?B_t"> a little better. Consider the pair of random variables <img src="https://latex.codecogs.com/png.latex?(X%5Ef_t,%20X%5Ep_%7Bt+1%7D)"> distributed according to the joint distribution of the filtering distribution at time <img src="https://latex.codecogs.com/png.latex?t"> and the predictive distribution at time <img src="https://latex.codecogs.com/png.latex?t+1">, in the sense that</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A(X%5Ef_t,%20X%5Ep_%7Bt+1%7D)%20%5C;%20%5Cunderbrace%7B=%7D_%7B%5Ctext%7B(law)%7D%7D%5C;%20(X_t,%20X_%7Bt+1%7D%20%5C,%20%5Cmid%20%5C,%20y_%7B1:t%7D).%0A"></p>
<p>This means that <img src="https://latex.codecogs.com/png.latex?X%5Ef_t%20%5Csim%20%5Cmathcal%7BN%7D(%5Cmu_%7Bt%7Ct%7D,%20P_%7Bt%7Ct%7D)"> and <img src="https://latex.codecogs.com/png.latex?X%5Ep_%7Bt+1%7D%20%5Csim%20%5Cmathcal%7BN%7D(%5Cmu_%7Bt+1%7Ct%7D,%20P_%7Bt+1%7Ct%7D)">, with <img src="https://latex.codecogs.com/png.latex?X%5Ep_%7Bt+1%7D%20=%20F%20%5C,%20X%5Ef_t%20+%20%5Cmathcal%7BN%7D(0,%20Q)">. Furthermore, Equation&nbsp;2 and the standard <a href="https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Conditional_distributions">Gaussian conditioning</a> formulas give the conditional mean and covariance:</p>
<p><span id="eq-gauss-cond"><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Balign%7D%0A%5Ctextrm%7BMean%7D%20%7B%5Cleft(%20X%5Ef_t%20%7C%20(X%5Ep_%7Bt+1%7D=x_%7Bt+1%7D)%20%5Cright)%7D%20%20%0A%5C;%20&amp;=%20%5C;%0A%5Cmu_%7Bt%7Ct%7D%20+%20B_t%20(x_%7Bt+1%7D%20-%20%5Cmu_%7Bt+1%7Ct%7D)%20%5C%5C%0A%5Ctextrm%7BCov%7D%20%7B%5Cleft(%20X%5Ef_t%20%7C%20(X%5Ep_%7Bt+1%7D=x_%7Bt+1%7D)%20%5Cright)%7D%20%20%0A%5C;%20&amp;=%20%5C;%0AP_%7Bt%7Ct%7D%20-%20B_t%20%5C,%20P_%7Bt+1%7Ct%7D%20%5C,%20B_t%5E%5Ctop.%0A%5Cend%7Balign%7D%0A%5Cright.%0A%5Ctag%7B3%7D"></span></p>
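<p>Equations&nbsp;2 and&nbsp;3 can be checked numerically: sampling a large number of pairs <img src="https://latex.codecogs.com/png.latex?(X%5Ef_t,%20X%5Ep_%7Bt+1%7D)"> and regressing one on the other recovers the closed-form gain. The two-dimensional model below is purely illustrative (all matrices are made up for the sake of the experiment):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# an arbitrary 2-dimensional linear-Gaussian model (illustrative numbers)
F = np.array([[0.9, 0.1], [0.0, 0.8]])
Q = 0.3 * np.eye(2)
mu, P = np.zeros(2), np.array([[1.0, 0.2], [0.2, 0.5]])

# closed-form smoothing gain of Equation 2: B = P F' (F P F' + Q)^{-1}
B_exact = P @ F.T @ np.linalg.inv(F @ P @ F.T + Q)

# Monte Carlo: sample (X^f, X^p) jointly, with X^p = F X^f + noise
N = 100_000
Xf = rng.multivariate_normal(mu, P, size=N)
Xp = Xf @ F.T + rng.multivariate_normal(np.zeros(2), Q, size=N)

# least-squares regression of X^f on X^p using centred anomalies
Af, Ap = Xf - Xf.mean(axis=0), Xp - Xp.mean(axis=0)
B_mc = (Af.T @ Ap) @ np.linalg.inv(Ap.T @ Ap)

# B_mc agrees with B_exact up to O(1/sqrt(N)) Monte Carlo error
```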
<p>The above expression for the conditional mean also shows that the matrix <img src="https://latex.codecogs.com/png.latex?B_t"> is a minimizer of the loss</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AM%20%5C;%20%5Cmapsto%20%5C;%0A%5Cmathbb%7BE%7D%20%7B%5Cleft(%20%20%5Cleft%5C%7C%20(X%5Ef_t%20-%20%5Cmu_%7Bt%7Ct%7D)%20-%20M%20(X%5Ep_%7Bt+1%7D%20-%20%5Cmu_%7Bt+1%7Ct%7D)%20%5Cright%5C%7C%5E2%20%20%5Cright)%7D%0A"></p>
<p>over all matrices <img src="https://latex.codecogs.com/png.latex?M%20%5Cin%20%5Cmathbb%7BR%7D%5E%7BD_x%20%5Ctimes%20D_x%7D">. Heuristically, this shows that the smoothing gain matrix <img src="https://latex.codecogs.com/png.latex?B_t"> can easily be computed by <strong>regressing</strong> <img src="https://latex.codecogs.com/png.latex?X%5Ef_t"> against <img src="https://latex.codecogs.com/png.latex?X%5Ep_%7Bt+1%7D">. We can use this remark to build an ensemble version of the backward recursion Equation&nbsp;1. Recall that when running an EnKF for filtering the observations <img src="https://latex.codecogs.com/png.latex?y_%7B1:T%7D">, the final stage proceeds in two steps:</p>
<ol type="1">
<li>Obtain an ensemble of particles <img src="https://latex.codecogs.com/png.latex?X%5E%7Bi,p%7D_%7BT%7D%20=%20F%20%5C,%20X%5E%7Bi,f%7D_%7BT-1%7D%20+%20%5Cmathcal%7BN%7D(0,Q)"> that approximate the predictive distribution <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D(X_T%20%7C%20y_%7B1:T-1%7D)">.<br>
</li>
<li><a href="../../notes/Gaussian_Assimilation/gaussian_assimilation.html">Assimilate</a> the last observation <img src="https://latex.codecogs.com/png.latex?y_T"> using the Kalman gain matrix <img src="https://latex.codecogs.com/png.latex?K_T"> and the correction <img src="https://latex.codecogs.com/png.latex?%5CDelta_T%5Ei%20=%20K_T%20%5C,%20(%5Ctilde%7By%7D_%7Bi,%5Cstar%7D%20-%20H%20%5C,%20X%5E%7Bi,p%7D_T)"> by setting <span id="eq-pred-perturb"><img src="https://latex.codecogs.com/png.latex?%0AX%5E%7Bi,s%7D_T%20=%20X%5E%7Bi,p%7D_T%20+%20%5CDelta_T%5Ei.%0A%5Ctag%7B4%7D"></span> The particles <img src="https://latex.codecogs.com/png.latex?X%5E%7Bi,s%7D_T"> approximate the smoothing distribution <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D(X_T%20%7C%20y_%7B1:T%7D)">.</li>
</ol>
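<p>Step 2 above, i.e.&nbsp;the perturbed-observation assimilation of Equation&nbsp;4, can be sketched as follows. This is only a stochastic-EnKF sketch under the linear observation model of these notes; the function name <code>enkf_assimilate</code> and its interface are illustrative.</p>

```python
import numpy as np

def enkf_assimilate(X_p, y, H, R, rng):
    """Perturbed-observation EnKF update (Equation 4), illustrative sketch.

    X_p: (N, Dx) predictive ensemble, y: (Dy,) observation.
    Returns the analysis ensemble and the corrections Delta.
    """
    N, Dx = X_p.shape
    # ensemble covariance and Kalman gain K = P H' (H P H' + R)^{-1}
    P = np.cov(X_p.T).reshape(Dx, Dx)
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
    # perturbed observations y~_{i,*} = y + N(0, R), one per particle
    y_pert = y + rng.multivariate_normal(np.zeros(y.shape[0]), R, size=N)
    # corrections Delta_i = K (y~_{i,*} - H X^{i,p})
    Delta = (y_pert - X_p @ H.T) @ K.T
    return X_p + Delta, Delta
```

The corrections <img src="https://latex.codecogs.com/png.latex?%5CDelta%5Ei_T"> returned here are exactly the quantities that are pulled backward in Equation&nbsp;5.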
<p>Following our discussion of the smoothing gain matrix <img src="https://latex.codecogs.com/png.latex?B_%7Bt%7D"> and Equation&nbsp;4, it seems sensible to set</p>
<p><span id="eq-rec-ens"><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign%7D%0AX%5E%7Bi,s%7D_%7BT-1%7D%0A&amp;=%20X%5E%7Bi,f%7D_%7BT-1%7D%20+%20B_%7BT-1%7D%20%5C,%20%5CDelta%5Ei_T%5C%5C%0A&amp;=%20X%5E%7Bi,f%7D_%7BT-1%7D%20+%20B_%7BT-1%7D%20%5C,%20(X%5E%7Bi,s%7D_%7BT%7D%20-%20X%5E%7Bi,p%7D_%7BT%7D)%0A%5Cend%7Balign%7D%0A%5Ctag%7B5%7D"></span></p>
<p>and hope that the ensemble of updated particles <img src="https://latex.codecogs.com/png.latex?X%5E%7Bi,s%7D_%7BT-1%7D"> approximates the smoothing distribution <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D(X_%7BT-1%7D%20%7C%20y_%7B1:T%7D)">. In words, the particle <img src="https://latex.codecogs.com/png.latex?X%5E%7Bi,s%7D_%7BT-1%7D"> is obtained by “pulling” the correction term <img src="https://latex.codecogs.com/png.latex?%5CDelta%5Ei_%7BT%7D%20=%20X%5E%7Bi,s%7D_%7BT%7D%20-%20X%5E%7Bi,p%7D_%7BT%7D"> back to <img src="https://latex.codecogs.com/png.latex?X%5E%7Bi,f%7D_%7BT-1%7D"> through the “regression” smoothing gain matrix <img src="https://latex.codecogs.com/png.latex?B_%7BT-1%7D">. To check that the particles <img src="https://latex.codecogs.com/png.latex?X%5E%7Bi,s%7D_%7BT-1%7D"> indeed approximate the smoothing distribution <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D(X_%7BT-1%7D%20%5C,%7Cy_%7B1:T%7D)">, it suffices to compute the mean and covariance and verify that they match those given by Equation&nbsp;1. Recall that Equation&nbsp;3 gives that the filtering/predictive distributions satisfy</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AX%5Ef_%7BT-1%7D%20=%20%5Cmu_%7BT-1%7CT-1%7D%20+%20B_%7BT-1%7D%20%5C,%20(X%5Ep_%7BT%7D%20-%20%5Cmu_%7BT%7CT-1%7D)%20+%20%5Cvarepsilon_%7BT-1%7D%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon_%7BT-1%7D%20%5Csim%20%5Cmathcal%7BN%7D(0,%20P_%7BT-1%7CT-1%7D%20-%20B_%7BT-1%7D%20%5C,%20P_%7BT%7CT-1%7D%20%5C,%20B_%7BT-1%7D%5E%5Ctop)"> is independent of all other sources of randomness. Plugging this into Equation&nbsp;5 gives that</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AX%5E%7Bi,s%7D_%7BT-1%7D%0A=%0A%5Cmu_%7BT-1%7CT-1%7D%20+%20B_%7BT-1%7D%20%5C,%20(X%5E%7Bi,s%7D_%7BT%7D%20-%20%5Cmu_%7BT%7CT-1%7D)%20+%20%5Cvarepsilon_%7BT-1%7D.%0A"></p>
<p>Since the <img src="https://latex.codecogs.com/png.latex?X%5E%7Bi,s%7D_%7BT%7D"> are distributed according to the smoothing distribution, i.e.&nbsp;<img src="https://latex.codecogs.com/png.latex?X%5E%7Bi,s%7D_%7BT%7D%20%5Csim%20%5Cmathcal%7BN%7D(%5Cmu_%7BT%7CT%7D,%20P_%7BT%7CT%7D)">, this immediately shows that <img src="https://latex.codecogs.com/png.latex?X%5E%7Bi,s%7D_%7BT-1%7D"> is Gaussian with</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5C%7B%0A%5Cbegin%7Balign%7D%0A%5Ctextrm%7BMean%7D%20&amp;=%20%5Cmu_%7BT-1%7CT%7D%20=%20%5Cmu_%7BT-1%7CT-1%7D%20+%20B_%7BT-1%7D%20%5C,%20%20%7B%5Cleft(%20%5Cmu_%7BT%7CT%7D%20-%20%5Cmu_%7BT%7CT-1%7D%20%5Cright)%7D%20%5C%5C%0A%5Ctextrm%7BCovariance%7D%20&amp;=%20P_%7BT-1%7CT%7D%20=%20P_%7BT-1%7CT-1%7D%20+%20B_%7BT-1%7D%20%20%7B%5Cleft(%20%20P_%7BT%7CT%7D%20-%20P_%7BT%7CT-1%7D%20%20%5Cright)%7D%20%20B%5E%5Ctop_%7BT-1%7D,%0A%5Cend%7Balign%7D%0A%5Cright.%0A"></p>
<p>as it should. One can then iterate this construction to obtain particle approximations of the smoothing distributions <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D(X_t%20%7C%20y_%7B1:T%7D)"> for <img src="https://latex.codecogs.com/png.latex?1%20%5Cleq%20t%20%5Cleq%20T"> by running a backward pass and recursively setting</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AX%5E%7Bi,s%7D_t%20%5C;%20=%20%5C;%20X%5E%7Bi,f%7D_t%20+%20B_t%20%5C,%20%20%7B%5Cleft(%20X%5E%7Bi,s%7D_%7Bt+1%7D%20-%20X%5E%7Bi,p%7D_%7Bt+1%7D%20%5Cright)%7D%20.%0A"></p>
<p>The ensemble of particles <img src="https://latex.codecogs.com/png.latex?X%5E%7Bi,s%7D_t"> approximates the smoothing distribution <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D(X_t%20%7C%20y_%7B1:T%7D)">. In a nonlinear setting, one can approximate the smoothing gain matrices with the empirical estimates</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cwidehat%7BB%7D_t%20=%20%5Cmathop%7B%5Cmathrm%7BCov%7D%7D%20%7B%5Cleft(%20%20x%5Ef_%7Bt,i%7D,%20x%5Ep_%7Bt+1,i%7D%20%20%5Cright)%7D%20%20%5C,%20%5Cmathop%7B%5Cmathrm%7BVar%7D%7D%20%7B%5Cleft(%20%20x%5Ep_%7Bt+1,i%7D%20%20%5Cright)%7D%20%5E%7B-1%7D.%0A"></p>
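<p>Putting everything together, the full ensemble backward pass can be sketched as below. Again, this is a sketch rather than a reference implementation: it assumes the forward EnKF stored, for each time, the filtering ensemble and the predictive ensemble generated from it (so that <code>Xs_p[t+1]</code> is paired particle-by-particle with <code>Xs_f[t]</code>), and the names are illustrative.</p>

```python
import numpy as np

def ensemble_rts_pass(Xs_f, Xs_p):
    """Ensemble backward smoothing pass with empirically estimated gains.

    Xs_f: list of (N, Dx) filtering ensembles X^{i,f}_t, t = 0, ..., T-1
    Xs_p: list of (N, Dx) predictive ensembles; Xs_p[t+1] was generated
          from Xs_f[t], so the two are paired particle-by-particle
          (Xs_p[0] is unused and may be None).
    Returns the list of smoothing ensembles X^{i,s}_t.
    """
    T = len(Xs_f)
    Xs_s = [None] * T
    Xs_s[-1] = Xs_f[-1].copy()  # at t = T, smoothing = filtering
    for t in range(T - 2, -1, -1):
        Xf, Xp = Xs_f[t], Xs_p[t + 1]
        Af = Xf - Xf.mean(axis=0)  # centred anomalies
        Ap = Xp - Xp.mean(axis=0)
        # empirical gain: hat(B)_t = Cov(X^f_t, X^p_{t+1}) Var(X^p_{t+1})^{-1}
        B = (Af.T @ Ap) @ np.linalg.inv(Ap.T @ Ap)
        # pull the correction X^{i,s}_{t+1} - X^{i,p}_{t+1} backward
        Xs_s[t] = Xf + (Xs_s[t + 1] - Xp) @ B.T
    return Xs_s
```

In the linear-Gaussian case, the empirical gain converges to the exact <img src="https://latex.codecogs.com/png.latex?B_t"> of Equation&nbsp;2 as the ensemble size grows.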
<p>[Experiments: TODO]</p>




<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0">
<div id="ref-rauch1965maximum" class="csl-entry">
Rauch, Herbert E, F Tung, and Charlotte T Striebel. 1965. <span>“Maximum Likelihood Estimates of Linear Dynamic Systems.”</span> <em>AIAA Journal</em> 3 (8): 1445–50.
</div>
</div></section></div> ]]></description>
  <category>enkf</category>
  <category>data-assimilation</category>
  <guid>https://alexxthiery.github.io/notes/Gaussian_Assimilation/gaussian_assimilation_smoothing.html</guid>
  <pubDate>Fri, 17 Nov 2023 17:00:00 GMT</pubDate>
</item>
</channel>
</rss>
