Let \(q(dx) \equiv \mathcal{N}(\mu,\Gamma)\) be the Gaussian distribution with mean \(\mu \in \mathbb{R}^D\) and covariance matrix \(\Gamma \in \mathbb{R}^{D \times D}\). For a direction \(u \in \mathbb{R}^D\), consider the distribution \(q^{u}(dx) \equiv \mathcal{N}(\mu + \Gamma^{1/2} \, u, \Gamma)\), i.e. the same Gaussian distribution but shifted by an amount \(\Gamma^{1/2} \, u\). Direct algebra gives that
\[ \frac{q^{u}(x)}{q(x)} = \exp {\left\{ - \frac{1}{2} \| u\|^2 + \left< u, \, \Gamma^{-1/2}(x-\mu) \right> \right\}} . \tag{1}\]
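As a numerical sanity check of Equation 1, one can compare the ratio of the two Gaussian densities against the closed form; the following NumPy sketch uses an arbitrarily chosen random covariance, direction, and test point:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 3
mu = rng.normal(size=D)
A = rng.normal(size=(D, D))
Gamma = A @ A.T + D * np.eye(D)          # a random SPD covariance matrix

# symmetric square root Gamma^{1/2} via eigendecomposition
w, V = np.linalg.eigh(Gamma)
Gamma_half = V @ (np.sqrt(w)[:, None] * V.T)

def log_density(x, m):
    """Log-density of N(m, Gamma), up to the common normalizing constant."""
    r = x - m
    return -0.5 * r @ np.linalg.solve(Gamma, r)

u = rng.normal(size=D)
x = rng.normal(size=D)

# left-hand side: ratio of the shifted and unshifted densities
lhs = np.exp(log_density(x, mu + Gamma_half @ u) - log_density(x, mu))
# right-hand side: exp{ -||u||^2/2 + <u, Gamma^{-1/2}(x - mu)> }
rhs = np.exp(-0.5 * u @ u + u @ np.linalg.solve(Gamma_half, x - mu))
```

The normalizing constants cancel in the ratio, so they can be dropped from `log_density`.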
We will see that, not very surprisingly, a similar change-of-probability result holds in continuous time. On the time interval \([0,T]\), let \(W_t\) be a standard Brownian motion in \(\mathbb{R}^D\) and \(X_t\) be the solution to the SDE
\[ dX_t \; = \; b(X_t) \, dt + \sigma(X_t) \, dW_t \tag{2}\]
for some drift \(b: \mathbb{R}^D \to \mathbb{R}^D\) and diffusion \(\sigma: \mathbb{R}^D \to \mathbb{R}^{D \times D}\) and initial distribution \(\mu_0(dx_0)\). This SDE defines a probability measure \(\mathbb{P}\) on the path-space \(C([0,T]; \mathbb{R}^D)\), the space of continuous functions from \([0,T]\) to \(\mathbb{R}^D\). Consider a perturbation drift function \(u: \mathbb{R}^D \to \mathbb{R}^D\) and associated perturbed SDE given by
\[ dX_t^u \; = \; b(X_t^u) \, dt + \sigma(X_t^u) \, {\left\{ dW_t + \textcolor{blue}{u(X_t^u) \, dt} \right\}} . \tag{3}\]
This perturbed SDE, started from the same initial distribution \(\mu_0(dx_0)\), defines a probability measure \(\mathbb{P}^u\) on the path-space \(C([0,T]; \mathbb{R}^D)\), and it is often useful to understand the Radon-Nikodym derivative of \(\mathbb{P}^u\) with respect to \(\mathbb{P}\). I have never really liked the way this is usually derived, and also never really remember the result. It takes only a few lines of algebra to re-derive it, at least informally. For this purpose, consider a simple Euler discretization of the SDE with time step \(\delta = T/N\) for \(N \gg 1\). Consider a discretized path \((x_0, x_{\delta}, \ldots, x_{T})\) of Equation 2 obtained by iterating the update
\[ x_{t_{k+1}} \; = \; x_{t_k} + b(x_{t_k})\,\delta + \sigma(x_{t_k}) \, (\Delta W_{t_k}) \]
with \(t_k = k\delta\) and \(\Delta W_{t_k} = W_{t_{k+1}} - W_{t_k}\). The probability of observing such a path reads \[ \frac{1}{\mathcal{Z}} \, \mu_0(x_0) \, \exp {\left\{ -\frac{1}{2 \delta} \sum_{k=0}^{N-1} \|x_{t_{k+1}} - [x_{t_k} + b(x_{t_k})\,\delta]\|^2_{\Gamma^{-1}(x_{t_k}) } \right\}} \]
with \(\Gamma(x) \equiv \sigma(x) \sigma^\top(x)\) the volatility matrix and an irrelevant multiplicative constant \(\mathcal{Z}\). One obtains a similar expression for a discretized path of the perturbed SDE Equation 3 and the ratio of these two quantities equals
\[ \frac{d \widetilde{\mathbb{P}}^{u}}{d \widetilde{\mathbb{P}}}(x) = \exp {\left\{ \sum_{k=0}^{N-1} -\frac{\delta}{2} \|u(x_{t_k})\|^2 + \left< x_{t_{k+1}}-x_{t_k}-b(x_{t_k})\delta, \sigma(x_{t_k}) \, u(x_{t_k}) \right>_{\Gamma^{-1}(x_{t_k})} \right\}} , \]
where the tilde notation denotes the discretized versions of the measures. Since
\[ x_{t_{k+1}}-x_{t_k}-b(x_{t_k})\delta = \sigma(x_{t_k}) \, \Delta W_{t_k}, \]
under \(\widetilde{\mathbb{P}}\) and as \(N \to \infty\), for a path \(dx_t \; = \; b(x_t) \, dt + \sigma(x_t) \, dW_t\), we have
\[ \frac{d \widetilde{\mathbb{P}}^{u}}{d \widetilde{\mathbb{P}}}(x) \to \exp {\left\{ -\frac 12 \, \int_0^T \|u(x_t)\|^2 \, dt + \int_{0}^T u^\top(x_t) \, dW_t \right\}} . \]
Similarly, under \(\widetilde{\mathbb{P}}^u\) and as \(N \to \infty\), for a path \(dx^{u}_t \; = \; b(x^u_t) \, dt + \sigma(x^u_t) \, {\left( dW_t + u(x^u_t) \, dt \right)} \), we have
\[ \frac{d \widetilde{\mathbb{P}}}{d \widetilde{\mathbb{P}}^u}(x^u) \to \exp {\left\{ -\frac 12 \, \int_0^T \|u(x^u_t)\|^2 \, dt - \int_{0}^T u^\top(x^u_t) \, dW_t \right\}} . \]
These results remain valid for time-dependent drift and volatility functions, as is clear from this non-rigorous argument. The above two formulas for \(d\mathbb{P}^u/d\mathbb{P}(x)\) and \(d\mathbb{P}/d\mathbb{P}^u(x)\) may be slightly confusing since they are not immediately recognizable as inverses of each other. Another way to write these results, very similar to Equation 1 and often used in the physics literature, is the following:
\[ \frac{d \mathbb{P}^{u}}{d \mathbb{P}}(x) = \exp {\left\{ -\frac 12 \, \int_0^T \|u(x_t)\|^2 \, dt + \int_{0}^T u^\top(x_t) \, \sigma^{-1}(x_t) \, {\left( dx_t - b(x_t) \, dt \right)} \right\}} . \]
From this expression, it is slightly easier to see the relationship between \(d\mathbb{P}^u/d\mathbb{P}(x)\) and \(d\mathbb{P}/d\mathbb{P}^u(x)\). As described below, these change-of-measure formulae are often useful when performing importance sampling on path-space. As a sanity check, consider a scalar Brownian motion \(dX = \sigma \, dW\) and a drifted version of it \(dX^u = \sigma \, dW + u \, dt\) for constants \(\sigma, u\): the fact that \(d\mathbb{P}^u/d\mathbb{P}(X)\) has unit expectation under \(\mathbb{P}\) reduces to the identity \(\mathop{\mathrm{\mathbb{E}}}[\exp(a \, \xi)] = \exp(a^2/2)\) for a standard Gaussian random variable \(\xi\), here with \(a = (u/\sigma)\sqrt{T}\). Finally, note that the Kullback-Leibler divergence between \(\mathbb{P}\) and \(\mathbb{P}^u\) has a particularly simple form. Since \(\mathop{\mathrm{D_{\text{KL}}}}(\mathbb{P}, \mathbb{P}^u) = \mathop{\mathrm{\mathbb{E}}}_{\mathbb{P}} {\left[ -\log {\left\{ \frac{d \mathbb{P}^{u}}{d \mathbb{P}}(X) \right\}} \right]} \) and the stochastic integral \(\int_0^T u^\top(X_t) \, dW_t\) has zero mean under \(\mathbb{P}\), one obtains
\[ \mathop{\mathrm{D_{\text{KL}}}}(\mathbb{P}, \mathbb{P}^u) = \frac12 \mathop{\mathrm{\mathbb{E}}} {\left[ \int_0^T \|u(X_t)\|^2 \, dt \right]} . \]
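Both the unit-expectation property of \(d\mathbb{P}^u/d\mathbb{P}\) under \(\mathbb{P}\) and the KL formula above are easy to check numerically from the Euler discretization; the sketch below uses the scalar case with drift \(b(x) = -x\), unit diffusion, and an arbitrarily chosen perturbation \(u(x) = \cos x\):

```python
import numpy as np

rng = np.random.default_rng(2)
T, N, M = 1.0, 200, 20_000     # horizon, time steps, Monte-Carlo paths
delta = T / N
b = lambda x: -x               # drift of the unperturbed SDE
u = lambda x: np.cos(x)        # arbitrary perturbation drift (sigma = 1)

x = np.zeros(M)                # M Euler paths simulated under P
log_rn = np.zeros(M)           # running log dP^u/dP along each path
half_int_u2 = np.zeros(M)      # running (1/2) * integral of u(x_t)^2 dt
for _ in range(N):
    dW = rng.normal(scale=np.sqrt(delta), size=M)
    log_rn += -0.5 * delta * u(x) ** 2 + u(x) * dW
    half_int_u2 += 0.5 * delta * u(x) ** 2
    x += b(x) * delta + dW     # Euler update of the unperturbed SDE

mean_rn = np.exp(log_rn).mean()  # should be close to 1 (exponential martingale)
kl_lhs = -log_rn.mean()          # Monte-Carlo estimate of D_KL(P, P^u)
kl_rhs = half_int_u2.mean()      # right-hand side of the KL formula
```

Since \(u(x_{t_k})\) is independent of the increment \(\Delta W_{t_k}\), each factor \(\exp(-\tfrac{\delta}{2} u^2 + u \, \Delta W)\) has conditional expectation one, so `mean_rn` is close to one even before taking \(N \to \infty\).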
Importance Sampling on path-space
Consider a functional \(\Phi: C([0,T]; \mathbb{R}^D) \to \mathbb{R}\) on path-space; a typical example is
\[ \Phi(x) = \exp {\left\{ \int_0^T f(x_t) \, dt \, + \, g(x_T) \right\}} . \]
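On a discretized path, this functional is simply a Riemann sum plus a terminal term; a minimal sketch, where the choices of `f`, `g`, and the stand-in path are purely illustrative:

```python
import numpy as np

def phi(path, delta, f, g):
    """Discretized Phi(x) = exp( sum_k f(x_{t_k}) * delta + g(x_T) )."""
    return np.exp(np.sum(f(path[:-1])) * delta + g(path[-1]))

# illustrative evaluation on a deterministic scalar path
delta = 0.01
path = np.linspace(0.0, 1.0, 101)   # stand-in for a simulated trajectory
val = phi(path, delta, f=lambda x: -x**2, g=lambda x: x)
```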
Suppose that we would like to evaluate the expectation of \(\Phi\) under the measure \(\mathbb{P}\). Naive Monte-Carlo (MC) would require sampling \(M\) trajectories from Equation 2 and computing the average of \(\Phi\) along these trajectories. To reduce the variance of this naive MC estimator, one can instead use importance sampling: sample \(M\) trajectories \(x^{1,u}, \ldots, x^{M,u}\) from the measure \(\mathbb{P}^u\) and compute the weighted average
\[ \frac{1}{M} \, \sum_{i=1}^M \Phi(x^{i,u}) \, W(x^{i,u}) \]
with weights given by the Radon-Nikodym derivative
\[ W(x^{i,u}) \; = \; \exp {\left\{ -\frac 12 \, \int_0^T \|u(x^{i,u}_t)\|^2 \, dt - \int_{0}^T u^\top(x^{i,u}_t) \, dW_t \right\}} . \]
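As a concrete illustration, take the scalar Brownian motion \(dX = dW\) (unit diffusion) and \(\Phi(x) = \exp(x_T)\), for which \(\mathop{\mathrm{\mathbb{E}}}_{\mathbb{P}}[\Phi] = \exp(T/2)\); the sketch below compares naive MC (\(u = 0\)) with importance sampling under the constant drift \(u = 1\), all parameter choices being illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
T, N, M = 1.0, 200, 50_000       # horizon, time steps, Monte-Carlo paths
delta = T / N

def estimate(u_const):
    """Importance-sampling estimate of E_P[exp(X_T)] using drift u_const."""
    x = np.zeros(M)              # M paths simulated under P^u
    log_w = np.zeros(M)          # running log dP/dP^u along each path
    for _ in range(N):
        dW = rng.normal(scale=np.sqrt(delta), size=M)
        log_w += -0.5 * delta * u_const ** 2 - u_const * dW
        x += dW + u_const * delta
    return (np.exp(x) * np.exp(log_w)).mean()

naive = estimate(0.0)            # plain MC: all weights equal one
tilted = estimate(1.0)           # drift u = 1 pushes paths toward large x_T
# both estimate E[exp(W_T)] = exp(T/2)
```

For this particular \(\Phi\), the constant drift \(u = 1\) makes the weighted integrand \(\Phi(x^{u}) \, W(x^{u})\) exactly equal to \(\exp(T/2)\) on every path, i.e. a zero-variance estimator.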
Choosing the optimal “control” function \(u\) that minimizes the variance of the estimator is not entirely straightforward, although this previous note already gives the answer. More on this in another note.