
Change of measure
Consider a diffusion in \(\mathbb{R}^D\) given by:
\[ \left\{ \begin{align*} dX_t &= b(X_t)dt + \sigma(X_t) \, dW_t\\ X_0 &\sim p_0(x_0) \end{align*} \right. \]
for an initial distribution \(p_0\) and for drift and volatility functions \(b: \mathbb{R}^D \to \mathbb{R}^D\) and \(\sigma: \mathbb{R}^D \to \mathbb{R}^{D \times D}\). On the time interval \([0,T]\), this defines a probability \(\mathbb{P}\) on the path-space \(C([0,T];\mathbb{R}^D)\). For two functions \(f: \mathbb{R}^D \to \mathbb{R}\) and \(g: \mathbb{R}^D \to \mathbb{R}\), consider the probability distribution \(\mathbb{Q}\) defined as
\[ \frac{d \mathbb{Q}}{d \mathbb{P}} = \frac{1}{\mathcal{Z}} \exp {\left\{ \int_0^T f(X_s) \, ds + g(X_T) \right\}} \]
where \(\mathcal{Z}\) denotes the normalizing constant \[ \mathcal{Z}\; = \; \mathbb{E} {\left[ \exp {\left\{ \int_0^T f(X_s) \, ds + g(X_T) \right\}} \right]} . \tag{1}\]
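As a concrete (if naive) way to read Equation 1, here is a minimal Monte Carlo sketch: simulate the uncontrolled diffusion with an Euler-Maruyama scheme and average the exponential weights. The vectorized callables `b`, `sigma`, `f`, `g`, `sample_p0`, the discretization and all parameter values are illustrative assumptions, not something prescribed in these notes.

```python
import numpy as np

def estimate_Z(b, sigma, f, g, sample_p0, T=1.0, n_steps=200, n_paths=10_000, seed=0):
    """Naive Monte Carlo estimate of Z = E[exp(int_0^T f(X_s) ds + g(X_T))]
    using an Euler-Maruyama discretization of the uncontrolled diffusion."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = sample_p0(n_paths)                    # (n_paths, D) samples from p_0
    running_f = np.zeros(n_paths)             # accumulates int_0^T f(X_s) ds
    for _ in range(n_steps):
        running_f += f(x) * dt
        dW = rng.normal(scale=np.sqrt(dt), size=x.shape)
        # Euler step for dX = b(X) dt + sigma(X) dW; sigma(x) has shape (n_paths, D, D)
        x = x + b(x) * dt + np.einsum("nij,nj->ni", sigma(x), dW)
    return np.mean(np.exp(running_f + g(x)))  # one weight per trajectory

# Illustration: D = 1 Brownian motion started at 0 with f = 0 and g(x) = x,
# for which Z = exp(T/2) ~ 1.65.
Z_hat = estimate_Z(b=lambda x: 0.0 * x,
                   sigma=lambda x: np.ones((len(x), 1, 1)),
                   f=lambda x: np.zeros(len(x)),
                   g=lambda x: x[:, 0],
                   sample_p0=lambda n: np.zeros((n, 1)))
```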
The distribution \(\mathbb{Q}\) places more probability mass on trajectories such that \(\int_0^T f(X_s) \, ds + g(X_T)\) is large. As described in these notes on Doob h-transforms, the path distribution \(\mathbb{Q}\) is the law of a diffusion process \(X^\star\) with dynamics
\[ \left\{ \begin{align*} dX^\star &= b(X^\star)dt + \sigma(X^\star) \, {\left\{ dW_t + \textcolor{blue}{u^\star(t, X^\star)} \, dt \right\}} \\ X^\star_0 &\sim q_0(x_0) = p_0(x_0) \, h(0,x_0) / \mathcal{Z} \end{align*} \right. \]
The control function \( \textcolor{blue}{u^\star: [0,T] \times \mathbb{R}^D \to \mathbb{R}^D}\) is of the gradient form
\[ \textcolor{blue}{u^\star(t, x)} \; = \; \sigma^\top(x) \, \nabla \log[ \textcolor{green}{h(t,x)} ] \tag{2}\]
and the function \( \textcolor{green}{h(t,x)}\) is given by the conditional expectation,
\[ \textcolor{green}{h(t,x) = \mathbb{E} {\left[ \exp {\left\{ \int_t^T f(X_s) \, ds + g(X_T) \right\}} \mid X_t = x \right]} }. \]
The expression \( \textcolor{blue}{u^\star(t, x)} \; = \; \sigma^\top(x) \, \nabla \log[ \textcolor{green}{h(t,x)} ]\) is intuitive: since the tilted measure \(\mathbb{Q}\) places more probability mass on trajectories for which \(\exp {\left\{ \int_0^T f(X_s) \, ds + g(X_T) \right\}} \) is large, the optimal control \(u^\star(t,x)\) should point towards promising states, i.e. states for which the expected “reward-to-go” quantity \(\textcolor{green}{h(t,x)} = \mathbb{E} {\left[ \exp {\left\{ \int_t^T f(X_s) \, ds + g(X_T) \right\}} \mid X_t = x \right]} \) is large.
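For instance, in the scalar case \(b \equiv 0\), \(\sigma \equiv 1\), \(f \equiv 0\) and \(g(x) = \alpha x\) (a standard Brownian motion tilted by \(e^{\alpha X_T}\), used here purely as an illustration), the conditional expectation is Gaussian and everything is explicit:
\[ \textcolor{green}{h(t,x)} \; = \; \mathbb{E} {\left[ e^{\alpha X_T} \mid X_t = x \right]} \; = \; \exp {\left\{ \alpha x + \tfrac{\alpha^2}{2}(T-t) \right\}}, \qquad \textcolor{blue}{u^\star(t,x)} \; = \; \partial_x \log h(t,x) \; = \; \alpha , \]
so the tilted measure \(\mathbb{Q}\) is simply the law of a Brownian motion with constant drift \(\alpha\).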
Variational Formulation
To obtain a variational description of the optimal control function \( \textcolor{blue}{u^\star}\), it suffices to express it, together with the optimal initial distribution \(q_0\), as the solution of an optimization problem of the form
\[ (q_0, \, u^\star) \; = \; \mathop{\mathrm{argmin}}_{q, \, u}\; \textrm{Dist} {\left( q \otimes \mathbb{P}^{u} \, , \, \mathbb{Q} \right)} \]
for an appropriately chosen distance. Here \(q \otimes \mathbb{P}^{u}\) is the probability distribution of the controlled diffusion \(X^u\) with dynamics
\[ \left\{ \begin{align*} dX^u &= b(X^u)dt + \sigma(X^u) \, {\left\{ dW_t + \textcolor{blue}{u(t, X^u)} \, dt \right\}} \\ X^u_0 &\sim q(x_0) \end{align*} \right. \tag{3}\]
for some control function \(u: [0,T] \times \mathbb{R}^D \to \mathbb{R}^D\) and initial distribution \(q(x_0)\). Note that \(\mathbb{Q}= q_0 \otimes \mathbb{P}^{u^\star}\). The KL-divergence is a natural (pseudo-)distance since it elegantly deals with the intractable constant \(\mathcal{Z}\) and the ratio \(d \mathbb{Q}/ d[q \otimes \mathbb{P}^{u}]\) is easy to compute. Girsanov's theorem gives that
\[ \begin{align*} \frac{d\mathbb{Q}}{d[q \otimes \mathbb{P}^u]}(X^u) &= \frac{p_0(X^u_0)}{q(X^u_0)} \, \exp\Big\{ \int_{0}^{T} (f-\tfrac12 \|u\|^2)(s, X^u_s) \, ds\\ &- \int_{0}^{T} u(s,X^u_s)^\top \, dW_s + g(X^u_T) \Big\} / \mathcal{Z}. \end{align*} \]
From this expression, one can readily write \(D_{\text{KL}}(q \otimes \mathbb{P}^u,\mathbb{Q}) = \mathbb{E} {\left[ -\log \tfrac{d\mathbb{Q}}{d[q \otimes \mathbb{P}^u]}(X^u) \right]} \); the stochastic integral has zero expectation and the constant \(\log \mathcal{Z}\) does not depend on \((q,u)\). Minimizing \(D_{\text{KL}}(q \otimes \mathbb{P}^u,\mathbb{Q})\) over the control \(u\) and the initial distribution \(q\) therefore gives:
\[ (q_0, u^\star) \; = \; \mathop{\mathrm{argmin}}_{q,u} \; D_{\text{KL}}(q, p_0) + \mathbb{E} {\left[ \int_{0}^{T} (\tfrac 12 \|u\|^2 - f)(s, X^u_s) \, ds - g(X^u_T) \right]} . \]
Minimizing this loss seeks a control that makes the reward \(\int_{0}^{T} f(X^u_s) \, ds + g(X^u_T)\) large while keeping the control effort \(\tfrac12 \int_{0}^{T} \|u(s,X^u_s)\|^2 \, ds\) small. Equivalently, this can be expressed as a maximization problem,
\[ (q_0, \, u^\star) \; = \; \mathop{\mathrm{argmax}}_{q,u} \; \mathbb{E} {\left[ \int_{0}^{T} (f-\tfrac 12 \|u\|^2)(s, X^u_s) \, ds + g(X^u_T) \right]} - D_{\text{KL}}(q, p_0). \]
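In the same spirit, here is a minimal sketch of a Monte Carlo estimate of the objective being maximized, taking \(q = p_0\) so that the \(D_{\text{KL}}(q, p_0)\) term vanishes; the callables and the Euler-Maruyama discretization are again illustrative assumptions. In practice, \(u\) would be parametrized (e.g. by a neural network) and this estimate differentiated through the simulation.

```python
import numpy as np

def control_objective(u, b, sigma, f, g, sample_p0, T=1.0, n_steps=200, n_paths=4_096, seed=0):
    """Monte Carlo estimate of E[int_0^T (f - 0.5*||u||^2) ds + g(X^u_T)] for the
    controlled diffusion dX = b dt + sigma (dW + u dt), with q = p_0 so that the
    KL(q, p_0) term of the variational objective vanishes."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = sample_p0(n_paths)                               # (n_paths, D)
    reward = np.zeros(n_paths)
    for k in range(n_steps):
        u_t = u(k * dt, x)                               # control at (t, X_t), (n_paths, D)
        reward += (f(x) - 0.5 * np.sum(u_t**2, axis=-1)) * dt
        dW = rng.normal(scale=np.sqrt(dt), size=x.shape)
        x = x + b(x) * dt + np.einsum("nij,nj->ni", sigma(x), dW + u_t * dt)
    return np.mean(reward + g(x))
```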
Note that since \(q_0 \otimes \mathbb{P}^{u^{\star}} = \mathbb{Q}\), the Radon-Nikodym derivative \(\tfrac{d\mathbb{Q}}{d[q \otimes \mathbb{P}^u]}\) above equals one when \((q,u) = (q_0, u^\star)\). Taking logarithms, the optimal control \(u^\star\) is such that along any trajectory we have:
\[ \begin{align*} \log \mathcal{Z} = \log \frac{p_0(X^{u^{\star}}_0)}{q_0(X^{u^{\star}}_0)} &+ \, \int_{0}^{T} (f-\tfrac12 \|u^\star\|^2)(s, X^{u^{\star}}_s) \, ds + g(X^{u^{\star}}_T) \\ &\quad - \int_{0}^{T} u^{\star}(s,X^{u^{\star}}_s)^\top \, dW_s . \end{align*} \tag{4}\]
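For the tilted Brownian motion example above (\(b \equiv 0\), \(\sigma \equiv 1\), \(f \equiv 0\), \(g(x) = \alpha x\) and \(X_0 = 0\), so that \(u^\star \equiv \alpha\) and \(\log \mathcal{Z}= \alpha^2 T / 2\)), the following sketch checks numerically that the right-hand side of Equation 4 returns the same number on every simulated trajectory, i.e. the estimator of \(\log \mathcal{Z}\) has zero variance once the optimal control is plugged in; the discretization and parameter values are illustrative.

```python
import numpy as np

def logZ_pathwise(alpha=1.5, T=1.0, n_steps=200, n_paths=8, seed=0):
    """Per-path evaluation of Equation 4 for b = 0, sigma = 1, f = 0,
    g(x) = alpha * x and X_0 = 0, where u* = alpha and log Z = alpha**2 * T / 2.
    The log(p_0/q_0) term vanishes since the starting point is deterministic."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = np.zeros(n_paths)
    acc = np.zeros(n_paths)          # int_0^T (f - 0.5*||u*||^2) ds - int_0^T u* dW
    for _ in range(n_steps):
        dW = rng.normal(scale=np.sqrt(dt), size=n_paths)
        acc += -0.5 * alpha**2 * dt - alpha * dW
        x = x + dW + alpha * dt      # controlled dynamics dX = dW + u* dt
    return acc + alpha * x           # add the terminal reward g(X_T)

print(logZ_pathwise())  # every entry ~ alpha**2 * T / 2 = 1.125 (up to float error)
```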
Stochastic Control
In the previous section, there was nothing special about the initial time \(0\) and the starting point \(x_0\): the same derivation, applied on the interval \([t,T]\) to a diffusion started at \(X_t = x\), gives the solution to the following stochastic optimal control problem. Consider the reward-to-go function (a.k.a. value function) defined as
\[ V(t,x) = \sup_u \; \mathbb{E} {\left[ \int_{t}^{T}(f-\tfrac 12 \|u\|^2)(s, X^u_s) \, ds + g(X^u_T) \mid X_t = x \right]} \]
where \(X^u\) is the controlled diffusion of Equation 3. We have that
\[ \begin{align} V(t,x) &= \log \mathbb{E} {\left[ \exp {\left\{ \int_t^T f(X_s) \, ds + g(X_T) \right\}} \mid X_t = x \right]} \\ &= \log[ \textcolor{green}{h(t, x)} ]. \end{align} \]
This shows that the optimal control \(u^\star\) can also be expressed as
\[ u^\star(t,x) = \sigma^\top(x) \nabla \log[ \textcolor{green}{ h(t,x) }] = \sigma^\top(x) \, \nabla V(t,x) . \tag{5}\]
The expression \(\sigma^\top(x) \, \nabla V(t,x)\) is intuitive: since we are trying to maximize the reward-to-go, the optimal control should push the state in the direction of the gradient of the value function, weighted by the volatility \(\sigma^\top(x)\).
Hamilton-Jacobi-Bellman
Finally, let us mention that one can easily derive the Hamilton-Jacobi-Bellman equation for the reward-to-go function \(V(t,x)\). We have
\[ V(t,x) = \sup_u \; \mathbb{E} {\left[ \int_{t}^T C_s \, ds + g(X^u_T) \mid X^u_t = x \right]} \]
with \(C_s = -\tfrac12 \|u(s,X^u_s)\|^2 + f(X^u_s)\). For \(\delta \ll 1\), we have
\[ \begin{align} V(t,x) &\; = \; \sup_u \; {\left\{ C_t \, \delta + \mathbb{E} {\left[ V(t+\delta, X^u_{t+\delta}) \mid X^u_t=x \right]} \right\}} + o(\delta)\\ &\; = \; V(t,x) + \delta \, \sup_{u(t,x)} \; {\left\{ C_t + (\partial_t + \mathcal{L}) \, V(t,x) + {\left< \sigma(x) \, u(t,x), \nabla V(t,x) \right>} \right\}} + o(\delta) \end{align} \]
where \(\mathcal{L}= b^\top \nabla + \tfrac12 \, \sigma \sigma^\top : \nabla^2\) is the generator of the uncontrolled diffusion. Since \(C_t = -\tfrac12 \|u(t,x)\|^2 + f(x)\) is a simple quadratic function of the control, the supremum over \(u(t,x)\) can be computed in closed form,
\[ \begin{align} u^\star(t,x) &= \mathop{\mathrm{argmax}}_{z \in \mathbb{R}^D} \; -\tfrac12 \|z\|^2 + \left< z, \sigma^\top(x) \nabla V(t,x) \right>\\ &= \sigma^\top(x) \, \nabla V(t,x), \end{align} \]
as we already knew from Equation 5. This implies that the reward-to-go function \(V(t,x)\) satisfies the HJB equation
\[ {\left( \partial_t + \mathcal{L} \right)} V + \frac12 \| \sigma^\top \nabla V \|^2 + f = 0 \tag{6}\]
with terminal condition \(V(T,x) = g(x)\). Another route to derive Equation 6 is to use the fact that \(V(t,x) = \log h(t,x)\): since the Feynman-Kac formula gives that the function \(h(t,x)\) satisfies \((\partial_t + \mathcal{L}+ f) h = 0\), the conclusion follows from a few lines of algebra, starting by writing \(\partial_t V = h^{-1} \, \partial_t h = -h^{-1}(\mathcal{L}+ f)[h]\), expanding \(\mathcal{L}h\) and expressing everything back in terms of \(V\). The term \(\|\sigma^\top \nabla V\|^2\) naturally arises when expressing the diffusion term \(\tfrac12 \, \sigma \sigma^\top : \nabla^2 h\) as a function of the derivatives of \(V\); this is the idea behind the standard Cole-Hopf transformation.
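Spelling out this algebra: with \(V = \log h\) one has \(\partial_t h = h \, \partial_t V\), \(\nabla h = h \, \nabla V\) and \(\nabla^2 h = h \, {\left( \nabla^2 V + \nabla V \, \nabla V^\top \right)}\), so that dividing \((\partial_t + \mathcal{L}+ f) h = 0\) by \(h\) gives
\[ 0 \; = \; \partial_t V + b^\top \nabla V + \tfrac12 \, \sigma \sigma^\top : {\left( \nabla^2 V + \nabla V \, \nabla V^\top \right)} + f \; = \; {\left( \partial_t + \mathcal{L} \right)} V + \tfrac12 \| \sigma^\top \nabla V \|^2 + f, \]
since \(\sigma \sigma^\top : \nabla V \, \nabla V^\top = \|\sigma^\top \nabla V\|^2\); this is exactly Equation 6.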
Finally, Ito’s lemma and Equation 6 give that for \(t_1 < t_2\), the optimally controlled diffusion \(X^{u^\star}\) satisfies:
\[ \begin{align*} V(t_2, X^{u^\star}_{t_2}) - V(t_1, X^{u^\star}_{t_1}) &= \int_{t_1}^{t_2} (\tfrac12 \, \|u^\star\|^2 - f)(s,X^{u^\star}_s) \, ds\\ &\quad + \int_{t_1}^{t_2} u^\star(s,X^{u^\star}_s)^\top \, dW_s. \end{align*} \]
Since \(V(T,x_T) = g(x_T)\) and \(V(0,x_0) = \log \mathcal{Z}+ \log \frac{q_0(x_0)}{p_0(x_0)}\), writing this identity between times \(t_1=0\) and \(t_2=T\) recovers the formula in Equation 4 for the log-normalizing constant \(\log \mathcal{Z}\).