
Change of measure
Consider a diffusion in \(\mathbb{R}^D\) given by:
\[ \left\{ \begin{align*} dX_t &= b(X_t)dt + \sigma(X_t) \, dW_t\\ X_0 &\sim p_0(x_0) \end{align*} \right. \]
for an initial distribution \(p_0\) and for drift and volatility functions \(b: \mathbb{R}^D \to \mathbb{R}^D\) and \(\sigma: \mathbb{R}^D \to \mathbb{R}^{D \times D}\). On the time interval \([0,T]\), this defines a probability \(\mathbb{P}\) on the path-space \(C([0,T];\mathbb{R}^D)\). For two functions \(f: \mathbb{R}^D \to \mathbb{R}\) and \(g: \mathbb{R}^D \to \mathbb{R}\), consider the probability distribution \(\mathbb{Q}\) defined as
\[ \frac{d \mathbb{Q}}{d \mathbb{P}} = \frac{1}{\mathcal{Z}} \exp {\left\{ \int_0^T f(X_s) \, ds + g(X_T) \right\}} \]
where \(\mathcal{Z}\) denotes the normalizing constant \[ \mathcal{Z}\; = \; \mathbb{E} {\left[ \exp {\left\{ \int_0^T f(X_s) \, ds + g(X_T) \right\}} \right]} . \tag{1}\]
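As a concrete (if naive) way to read Equation 1, here is a minimal Monte Carlo sketch: simulate the uncontrolled diffusion with an Euler-Maruyama scheme and average the exponential weights. The vectorized callables `b`, `sigma`, `f`, `g`, `sample_p0`, the discretization and all parameter values are illustrative assumptions, not something prescribed in these notes.

```python
import numpy as np

def estimate_Z(b, sigma, f, g, sample_p0, T=1.0, n_steps=200, n_paths=10_000, seed=0):
    """Naive Monte Carlo estimate of Z = E[exp(int_0^T f(X_s) ds + g(X_T))]
    using an Euler-Maruyama discretization of the uncontrolled diffusion."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = sample_p0(n_paths)                    # (n_paths, D) samples from p_0
    running_f = np.zeros(n_paths)             # accumulates int_0^T f(X_s) ds
    for _ in range(n_steps):
        running_f += f(x) * dt
        dW = rng.normal(scale=np.sqrt(dt), size=x.shape)
        # Euler step for dX = b(X) dt + sigma(X) dW; sigma(x) has shape (n_paths, D, D)
        x = x + b(x) * dt + np.einsum("nij,nj->ni", sigma(x), dW)
    return np.mean(np.exp(running_f + g(x)))  # one weight per trajectory

# Illustration: D = 1 Brownian motion started at 0 with f = 0 and g(x) = x,
# for which Z = exp(T/2) ~ 1.65.
Z_hat = estimate_Z(b=lambda x: 0.0 * x,
                   sigma=lambda x: np.ones((len(x), 1, 1)),
                   f=lambda x: np.zeros(len(x)),
                   g=lambda x: x[:, 0],
                   sample_p0=lambda n: np.zeros((n, 1)))
```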
The distribution \(\mathbb{Q}\) places more probability mass on trajectories such that \(\int_0^T f(X_s) \, ds + g(X_T)\) is large. As described in these notes on Doob h-transforms, the path distribution \(\mathbb{Q}\) is the law of a diffusion process \(X^\star\) with dynamics
\[ \left\{ \begin{align*} dX^\star &= b(X^\star)dt + \sigma(X^\star) \, {\left\{ dW_t + \textcolor{blue}{u^\star(t, X^\star)} \, dt \right\}} \\ X^\star_0 &\sim q_0(x_0) = p_0(x_0) \, h(0,x_0) / \mathcal{Z} \end{align*} \right. \]
The control function \( \textcolor{blue}{u^\star: [0,T] \times \mathbb{R}^D \to \mathbb{R}^D}\) is of the gradient form
\[ \textcolor{blue}{u^\star(t, x)} \; = \; \sigma^\top(x) \, \nabla \log[ \textcolor{green}{h(t,x)} ] \tag{2}\]
and the function \( \textcolor{green}{h(t,x)}\) is given by the conditional expectation,
\[ \textcolor{green}{h(t,x) = \mathbb{E} {\left[ \exp {\left\{ \int_t^T f(X_s) \, ds + g(X_T) \right\}} \mid X_t = x \right]} }. \]
The expression \( \textcolor{blue}{u^\star(t, x)} \; = \; \sigma^\top(x) \, \nabla \log[ \textcolor{green}{h(t,x)} ]\) is intuitive: since the tilted measure \(\mathbb{Q}\) places more probability mass on trajectories for which \(\exp {\left\{ \int_0^T f(X_s) \, ds + g(X_T) \right\}} \) is large, the optimal control \(u^\star(t,x)\) should point towards promising states, i.e. states for which the expected “reward-to-go” quantity \(\textcolor{green}{h(t,x)} = \mathbb{E} {\left[ \exp {\left\{ \int_t^T f(X_s) \, ds + g(X_T) \right\}} \mid X_t = x \right]} \) is large.
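For instance, in the scalar case \(b \equiv 0\), \(\sigma \equiv 1\), \(f \equiv 0\) and \(g(x) = \alpha x\) (a standard Brownian motion tilted by \(e^{\alpha X_T}\), used here purely as an illustration), the conditional expectation is Gaussian and everything is explicit:
\[ \textcolor{green}{h(t,x)} \; = \; \mathbb{E} {\left[ e^{\alpha X_T} \mid X_t = x \right]} \; = \; \exp {\left\{ \alpha x + \tfrac{\alpha^2}{2}(T-t) \right\}}, \qquad \textcolor{blue}{u^\star(t,x)} \; = \; \partial_x \log h(t,x) \; = \; \alpha , \]
so the tilted measure \(\mathbb{Q}\) is simply the law of a Brownian motion with constant drift \(\alpha\).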
Variational Formulation
To obtain a variational description of the optimal control function \( \textcolor{blue}{u^\star}\), it suffices to express it, together with the optimal initial distribution \(q_0\), as the solution of an optimization problem of the form
\[ (q_0, \, u^\star) \; = \; \mathop{\mathrm{argmin}}_{q, \, u}\; \textrm{Dist} {\left( q \otimes \mathbb{P}^{u} \, , \, \mathbb{Q} \right)} \]
for an appropriately chosen distance. Here \(q \otimes \mathbb{P}^{u}\) is the probability distribution of the controlled diffusion \(X^u\) with dynamics
\[ \left\{ \begin{align*} dX^u &= b(X^u)dt + \sigma(X^u) \, {\left\{ dW_t + \textcolor{blue}{u(t, X^u)} \, dt \right\}} \\ X^u_0 &\sim q(x_0) \end{align*} \right. \tag{3}\]
for some control function \(u: [0,T] \times \mathbb{R}^D \to \mathbb{R}^D\) and initial distribution \(q(x_0)\). Note that \(\mathbb{Q}= q_0 \otimes \mathbb{P}^{u^\star}\). The KL-divergence is a natural (pseudo-)distance since it elegantly deals with the intractable constant \(\mathcal{Z}\) and the ratio \(d \mathbb{Q}/ d[q \otimes \mathbb{P}^{u}]\) is easy to compute. Girsanov's theorem gives that
\[ \begin{align*} \frac{d\mathbb{Q}}{d[q \otimes \mathbb{P}^u]}(X^u) &= \frac{p_0(X^u_0)}{q(X^u_0)} \, \exp\Big\{ \int_{0}^{T} (f-\tfrac12 \|u\|^2)(s, X^u_s) \, ds\\ &- \int_{0}^{T} u(s,X^u_s)^\top \, dW_s + g(X^u_T) \Big\} / \mathcal{Z}. \end{align*} \]
From this expression, one can readily write \(D_{\text{KL}}(q \otimes \mathbb{P}^u,\mathbb{Q}) = \mathbb{E} {\left[ -\log \tfrac{d\mathbb{Q}}{d[q \otimes \mathbb{P}^u]}(X^u) \right]} \); the stochastic integral has zero expectation and the constant \(\log \mathcal{Z}\) does not depend on \((q,u)\). Minimizing \(D_{\text{KL}}(q \otimes \mathbb{P}^u,\mathbb{Q})\) over the control \(u\) and the initial distribution \(q\) therefore gives:
\[ (q_0, u^\star) \; = \; \mathop{\mathrm{argmin}}_{q,u} \; D_{\text{KL}}(q, p_0) + \mathbb{E} {\left[ \int_{0}^{T} (\tfrac 12 \|u\|^2 - f)(s, X^u_s) \, ds - g(X^u_T) \right]} . \]
Minimizing this loss seeks a control that makes the reward \(\int_{0}^{T} f(X^u_s) \, ds + g(X^u_T)\) large while keeping the control effort \(\tfrac12 \int_{0}^{T} \|u(s,X^u_s)\|^2 \, ds\) small. Equivalently, this can be expressed as a maximization problem,
\[ (q_0, \, u^\star) \; = \; \mathop{\mathrm{argmax}}_{q,u} \; \mathbb{E} {\left[ \int_{0}^{T} (f-\tfrac 12 \|u\|^2)(s, X^u_s) \, ds + g(X^u_T) \right]} - D_{\text{KL}}(q, p_0). \]
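In the same spirit, here is a minimal sketch of a Monte Carlo estimate of the objective being maximized, taking \(q = p_0\) so that the \(D_{\text{KL}}(q, p_0)\) term vanishes; the callables and the Euler-Maruyama discretization are again illustrative assumptions. In practice, \(u\) would be parametrized (e.g. by a neural network) and this estimate differentiated through the simulation.

```python
import numpy as np

def control_objective(u, b, sigma, f, g, sample_p0, T=1.0, n_steps=200, n_paths=4_096, seed=0):
    """Monte Carlo estimate of E[int_0^T (f - 0.5*||u||^2) ds + g(X^u_T)] for the
    controlled diffusion dX = b dt + sigma (dW + u dt), with q = p_0 so that the
    KL(q, p_0) term of the variational objective vanishes."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = sample_p0(n_paths)                               # (n_paths, D)
    reward = np.zeros(n_paths)
    for k in range(n_steps):
        u_t = u(k * dt, x)                               # control at (t, X_t), (n_paths, D)
        reward += (f(x) - 0.5 * np.sum(u_t**2, axis=-1)) * dt
        dW = rng.normal(scale=np.sqrt(dt), size=x.shape)
        x = x + b(x) * dt + np.einsum("nij,nj->ni", sigma(x), dW + u_t * dt)
    return np.mean(reward + g(x))
```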
Note that since \(q_0 \otimes \mathbb{P}^{u^{\star}} = \mathbb{Q}\), the Radon-Nikodym derivative \(\tfrac{d\mathbb{Q}}{d[q \otimes \mathbb{P}^u]}\) above equals one when \((q,u) = (q_0, u^\star)\). Taking logarithms, the optimal control \(u^\star\) is such that along any trajectory we have:
\[ \begin{align*} \log \mathcal{Z} = \log \frac{p_0(X^{u^{\star}}_0)}{q_0(X^{u^{\star}}_0)} &+ \, \int_{0}^{T} (f-\tfrac12 \|u^\star\|^2)(s, X^{u^{\star}}_s) \, ds + g(X^{u^{\star}}_T) \\ &\quad - \int_{0}^{T} u^{\star}(s,X^{u^{\star}}_s)^\top \, dW_s . \end{align*} \tag{4}\]
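For the tilted Brownian motion example above (\(b \equiv 0\), \(\sigma \equiv 1\), \(f \equiv 0\), \(g(x) = \alpha x\) and \(X_0 = 0\), so that \(u^\star \equiv \alpha\) and \(\log \mathcal{Z}= \alpha^2 T / 2\)), the following sketch checks numerically that the right-hand side of Equation 4 returns the same number on every simulated trajectory, i.e. the estimator of \(\log \mathcal{Z}\) has zero variance once the optimal control is plugged in; the discretization and parameter values are illustrative.

```python
import numpy as np

def logZ_pathwise(alpha=1.5, T=1.0, n_steps=200, n_paths=8, seed=0):
    """Per-path evaluation of Equation 4 for b = 0, sigma = 1, f = 0,
    g(x) = alpha * x and X_0 = 0, where u* = alpha and log Z = alpha**2 * T / 2.
    The log(p_0/q_0) term vanishes since the starting point is deterministic."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = np.zeros(n_paths)
    acc = np.zeros(n_paths)          # int_0^T (f - 0.5*||u*||^2) ds - int_0^T u* dW
    for _ in range(n_steps):
        dW = rng.normal(scale=np.sqrt(dt), size=n_paths)
        acc += -0.5 * alpha**2 * dt - alpha * dW
        x = x + dW + alpha * dt      # controlled dynamics dX = dW + u* dt
    return acc + alpha * x           # add the terminal reward g(X_T)

print(logZ_pathwise())  # every entry ~ alpha**2 * T / 2 = 1.125 (up to float error)
```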
Stochastic Control
In the previous section, there was nothing special about the initial time \(0\) and the starting point \(x_0\): the same derivation, applied on the interval \([t,T]\) to a diffusion started at \(X_t = x\), gives the solution to the following stochastic optimal control problem. Consider the reward-to-go function (a.k.a. value function) defined as
\[ V(t,x) = \sup_u \; \mathbb{E} {\left[ \int_{t}^{T}(f-\tfrac 12 \|u\|^2)(s, X^u_s) \, ds + g(X^u_T) \mid X_t = x \right]} \]
where \(X^u\) is the controlled diffusion of Equation 3. We have that
\[ \begin{align} V(t,x) &= \log \mathbb{E} {\left[ \exp {\left\{ \int_t^T f(X_s) \, ds + g(X_T) \right\}} \mid X_t = x \right]} \\ &= \log[ \textcolor{green}{h(t, x)} ]. \end{align} \]
This shows that the optimal control \(u^\star\) can also be expressed as
\[ u^\star(t,x) = \sigma^\top(x) \nabla \log[ \textcolor{green}{ h(t,x) }] = \sigma^\top(x) \, \nabla V(t,x) . \tag{5}\]
The expression \(\sigma^\top(x) \, \nabla V(t,x)\) is intuitive: since we are trying to maximize the reward-to-go, the optimal control should push the state in the direction of the gradient of the value function, weighted by the volatility \(\sigma^\top(x)\).
Hamilton-Jacobi-Bellman
Finally, let us mention that one can easily derive the Hamilton-Jacobi-Bellman equation for the reward-to-go function \(V(t,x)\). We have
\[ V(t,x) = \sup_u \; \mathbb{E} {\left[ \int_{t}^T C_s \, ds + g(X^u_T) \mid X^u_t = x \right]} \]
with \(C_s = -\tfrac12 \|u(s,X^u_s)\|^2 + f(X^u_s)\). For \(\delta \ll 1\), we have
\[ \begin{align} V(t,x) &\; = \; \sup_u \; {\left\{ C_t \, \delta + \mathbb{E} {\left[ V(t+\delta, X^u_{t+\delta}) \mid X^u_t=x \right]} \right\}} + o(\delta)\\ &\; = \; V(t,x) + \delta \, \sup_{u(t,x)} \; {\left\{ C_t + (\partial_t + \mathcal{L}) \, V(t,x) + {\left< \sigma(x) \, u(t,x), \nabla V(t,x) \right>} \right\}} + o(\delta) \end{align} \]
where \(\mathcal{L}= b^\top \nabla + \tfrac12 \, \sigma \sigma^\top : \nabla^2\) is the generator of the uncontrolled diffusion. Since \(C_t = -\tfrac12 \|u(t,x)\|^2 + f(x)\) is a simple quadratic function of the control, the supremum over \(u(t,x)\) can be computed in closed form,
\[ \begin{align} u^\star(t,x) &= \mathop{\mathrm{argmax}}_{z \in \mathbb{R}^D} \; -\tfrac12 \|z\|^2 + \left< z, \sigma^\top(x) \nabla V(t,x) \right>\\ &= \sigma^\top(x) \, \nabla V(t,x), \end{align} \]
as we already knew from Equation 5. This implies that the reward-to-go function \(V(t,x)\) satisfies the HJB equation
\[ {\left( \partial_t + \mathcal{L} \right)} V + \frac12 \| \sigma^\top \nabla V \|^2 + f = 0 \tag{6}\]
with terminal condition \(V(T,x) = g(x)\). Another route to derive Equation 6 is to use the fact that \(V(t,x) = \log h(t,x)\): since the Feynman-Kac formula gives that the function \(h(t,x)\) satisfies \((\partial_t + \mathcal{L}+ f) h = 0\), the conclusion follows from a few lines of algebra, starting by writing \(\partial_t V = h^{-1} \, \partial_t h = -h^{-1}(\mathcal{L}+ f)[h]\), expanding \(\mathcal{L}h\) and expressing everything back in terms of \(V\). The term \(\|\sigma^\top \nabla V\|^2\) naturally arises when expressing the diffusion term \(\tfrac12 \, \sigma \sigma^\top : \nabla^2 h\) as a function of the derivatives of \(V\); this is the idea behind the standard Cole-Hopf transformation.
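Spelling out this algebra: with \(V = \log h\) one has \(\partial_t h = h \, \partial_t V\), \(\nabla h = h \, \nabla V\) and \(\nabla^2 h = h \, {\left( \nabla^2 V + \nabla V \, \nabla V^\top \right)}\), so that dividing \((\partial_t + \mathcal{L}+ f) h = 0\) by \(h\) gives
\[ 0 \; = \; \partial_t V + b^\top \nabla V + \tfrac12 \, \sigma \sigma^\top : {\left( \nabla^2 V + \nabla V \, \nabla V^\top \right)} + f \; = \; {\left( \partial_t + \mathcal{L} \right)} V + \tfrac12 \| \sigma^\top \nabla V \|^2 + f, \]
since \(\sigma \sigma^\top : \nabla V \, \nabla V^\top = \|\sigma^\top \nabla V\|^2\); this is exactly Equation 6.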
Finally, Ito’s lemma and Equation 6 give that for \(t_1 < t_2\), the optimally controlled diffusion \(X^{u^\star}\) satisfies:
\[ \begin{align*} V(t_2, X^{u^\star}_{t_2}) - V(t_1, X^{u^\star}_{t_1}) &= \int_{t_1}^{t_2} (\tfrac12 \, \|u^\star\|^2 - f)(s,X^{u^\star}_s) \, ds\\ &\quad + \int_{t_1}^{t_2} u^\star(s,X^{u^\star}_s)^\top \, dW_s. \end{align*} \]
Since \(V(T,x_T) = g(x_T)\) and \(V(0,x_0) = \log \mathcal{Z}+ \log \frac{q_0(x_0)}{p_0(x_0)}\), writing this identity between times \(t_1=0\) and \(t_2=T\) recovers the formula in Equation 4 for the log-normalizing constant \(\log \mathcal{Z}\).