Consider a diffusion in \(\mathbb{R}^D\) with deterministic starting position \(x_0 \in \mathbb{R}^D\) and dynamics

\[ dX_t = b(X_t)dt + \sigma(X_t) \, dW_t \]

for a drift and volatility functions \(b: \mathbb{R}^D \to \mathbb{R}^D\) and \(\sigma: \mathbb{R}^D \to \mathbb{R}^{D \times D}\). On the time interval \([0,T]\), this defines a probability \(\mathbb{P}\) on the path-space \(C([0,T];\mathbb{R}^D)\). For two functions \(f: [0,T] \times \mathbb{R}^D \to \mathbb{R}\) and \(g: \mathbb{R}^D \to \mathbb{R}\), consider the probability distribution \(\mathbb{Q}\) defined as

\[ \frac{d \mathbb{Q}}{d \mathbb{P}} = \frac{1}{\mathcal{Z}} \exp {\left\{ \int_0^T f(X_s) \, ds + g(X_T) \right\}} \]

where \(\mathcal{Z}\) denotes the normalizing constant \[ \mathcal{Z}\; = \; \mathop{\mathrm{\mathbb{E}}} {\left[ \exp {\left\{ \int_0^T f(X_s) \, ds + g(X_T) \right\}} \right]} . \tag{1}\]

The distribution \(\mathbb{Q}\) places more probability mass on trajectories such that \(\int_0^T f(X_s) \, ds + g(X_T)\) is large. As described in these notes on Doob h-transforms, the tilted probability distribution \(\mathbb{Q}\) can be described by a diffusion process \(X^\star\) with dynamics

\[ dX^\star = b(X^\star)dt + \sigma(X^\star) \, {\left\{ dW_t + \textcolor{blue}{u^\star(t, X^\star)} \, dt \right\}} . \]

The control function \( \textcolor{blue}{u^\star: [0,T] \times \mathbb{R}^D \to \mathbb{R}^D}\) is of the gradient form

\[ \textcolor{blue}{u^\star(t, x)} \; = \; \sigma^\top(x) \, \nabla \log[ \textcolor{green}{h(t,x)} ] \tag{2}\]

and the function \( \textcolor{green}{h(t,x)}\) is described by the conditional expectation,

\[ \textcolor{green}{h(t,x) = \mathbb{E} {\left[ \exp {\left\{ \int_t^T f(X_s) \, ds + g(X_T) \right\}} \mid X_t = x \right]} }. \]

The expression \( \textcolor{blue}{u^\star(t, x)} \; = \; \sigma^\top(x) \, \nabla \log[ \textcolor{green}{h(t,x)} ]\) is quite intuitive; in order to describe the tilted measure \(\mathbb{Q}\) that places more probability mass on trajectories such that \(\int_0^T f(X_s) \, ds + g(X_T)\) is large, the optimal control \(u^\star(t,x)\) should be in the direction of states such that the “reward-to-go” quantity \(\int_t^T f(X_s) \, ds + g(X_T)\) is large.

To obtain a variational description of the optimal control function \( \textcolor{blue}{u^\star}\), it suffices to express it as the solution of an optimization problem. It turns out that KL-divergences between diffusion processes are the right tool for this: we will write \(\mathbb{Q}\) as the minimizer of \(\mathop{\mathrm{D_{\text{KL}}}}(\mathbb{P}^u \| \mathbb{Q})\) for a class of tractable probability distributions \(\mathbb{P}^u\) described by controlled diffusions. As described in these notes on the Girsanov Theorem, for any control function \(u(t,x)\), the controlled diffusion \(X^u\) with dynamics

\[ dX^u = b(X^u)dt + \sigma(X^u) \, {\left\{ dW_t + \textcolor{blue}{u(t, X^u)} \, dt \right\}} \]

and started at \(x_0\) induces a probability distribution \(\mathbb{P}^u\) on path-space given by

\[ \frac{d\mathbb{P}}{d\mathbb{P}^u}(x) \; = \; \exp {\left\{ -\frac 12 \int_{0}^{T} \|u(s,X^u_S)\|^2 \, ds - \int_{0}^{T} u(s,X^u_s)^\top \, dW_s \right\}} . \tag{3}\]

This allows one to write down explicitly the expression for the negative KL divergence

\[ -\mathop{\mathrm{D_{\text{KL}}}}(\mathbb{P}^u \| \mathbb{Q}) = \mathbb{E}_u {\left[ \log {\left\{ \frac{d\mathbb{Q}}{d\mathbb{P}^u}(X^u) \right\}} \right]} \]

between \(\mathbb{P}^u\) and the tilted distribution \(\mathbb{Q}\). The notation \(\mathbb{E}_u\) denotes the expectation with respect to the controlled diffusion \(X^u\). The negative KL is, up to a constant, the usual Evidence Lower Bound (ELBO) used in variational inference. Since the quantity \(\log {\left\{ \frac{d\mathbb{Q}}{d\mathbb{P}^u}(X^u) \right\}} \) can be expressed as

\[ \log {\left\{ \frac{d\mathbb{P}}{d\mathbb{P}^u}(X^u) \right\}} - \log(\mathcal{Z}) + \int_0^T f(X^u_s) \, ds + g(X^u_T) \]

it follows from Equation 3 that \(-\mathop{\mathrm{D_{\text{KL}}}}(\mathbb{P}^u \| \mathbb{Q})\) equals

\[ -\log(\mathcal{Z}) + \mathop{\mathrm{\mathbb{E}}} {\left[ \int_{0}^{T} {\left\{ -\tfrac 12 \|u(s,X^u_s)\|^2 + f(X^u_s) \right\}} \, ds + g(X^u_T) \right]} . \]

Since the KL divergence is positive and the optimal control \(u^\star\) in Equation 2 drives the KL divergence to zero, we have that

\[ \max_u \; \text{ELBO}(u) = \log \mathcal{Z} \]

where the minimization is over all (reasonably well-behaved) control functions \(u: [0,T] \times \mathbb{R}^D \to \mathbb{R}^D\) and

\[ \text{ELBO}(u) \; = \; \mathop{\mathrm{\mathbb{E}}} {\left[ \int_{0}^{T} {\left\{ -\tfrac 12 \|u(s,X^u_s)\|^2 + f(X^u_s) \right\}} \, ds + g(X^u_T) \right]} . \]

For maximizing the ELBO, the control needs to drive the trajectories to regions where \(\int_{0}^{T} f(X^u_s) \, ds + g(X^u_T)\) is large while at the same time keep the control effort \(\int_{0}^{T} \|u(s,X^u_s)\|^2 \, ds\) small. The optimal control \(u^\star\) is given by Equation 2 and Equation 1 gives that

\[ \begin{align} \log \mathcal{Z} &= \log \mathop{\mathrm{\mathbb{E}}} {\left[ \exp {\left\{ \int_0^T f(X_s) \, ds + g(X_T) \right\}} \right]} \\ &= \log[ \textcolor{green}{ h(0,x_0) } ]. \end{align} \]

Since there was nothing really special about the starting point \(x_0\) and the time horizon \(T>0\), the above derivation gives the solution to the following stochastic optimal control problem. It is written as a maximization problem although a large part of the control and physics literature writes it as an equivalent minimization problem. Consider the reward-to-go function (a.k.a. value function) defined as

\[ V(t,x) = \sup_u \; \mathop{\mathrm{\mathbb{E}}} {\left[ \int_{t}^{T} {\left\{ -\tfrac 12 \|u(s,X^u_s)\|^2 + f(X^u_s) \right\}} \, ds + g(X^u_T) \mid X_t = x \right]} . \]

We have that

\[ \begin{align} V(t,x) &= \log \mathbb{E} {\left[ \exp {\left\{ \int_t^T f(X_s) \, ds + g(X_T) \right\}} \mid X_t = x \right]} \\ &= \log[ \textcolor{green}{h(t, x)} ]. \end{align} \]

This shows that optimal control \(u^\star\) can also be expressed as

\[ u^\star(t,x) = \sigma^\top(x) \nabla \log[ \textcolor{green}{ h(t,x) }] = \sigma^\top(x) \, \nabla V(t,x) . \tag{4}\]

The expression \(\sigma^\top(x) \, \nabla V(t,x)\) is intuitive: since we are trying to maximize the reward-to-go function, the optimal control should be in the direction of the gradient of the reward-to-go function.

Finally, let us mention that one can easily derive the Hamilton-Jacobi-Bellman equation for the reward-to-go function \(V(t,x)\). We have

\[ V(t,x) = \sup_u \; \mathop{\mathrm{\mathbb{E}}} {\left[ \int_{t}^T C_s \, ds + g(X^u_T) \right]} \]

with \(C_s = -\tfrac12 \|u(s,X^u_s)\|^2 + f(X^u_s)\). For \(\delta \ll 1\), we have

\[ \begin{align} V(t,x) &\; = \; \sup_u \; {\left\{ C_t \, \delta + \mathop{\mathrm{\mathbb{E}}} {\left[ V(t+\delta, X^u_{t+\delta}) \mid X^u_t=x \right]} \right\}} + o(\delta)\\ &\; = \; V(t,x) + \delta \, \sup_{u(t,x)} \; {\left\{ C_t + (\partial_t + \mathcal{L}+ \sigma(x) \, u(t,x) \, \nabla) \, V(t,x) \right\}} + o(\delta) \end{align} \]

where \(\mathcal{L}= b \nabla + \sigma \sigma^\top : \nabla^2\) is the generator of the uncontrolled diffusion. Since \(C_t = -\tfrac12 \|u(t,x)\|^2 + f(x)\) is a simple quadratic function, the supremum over the control \(u(t,x)\) can be computed in closed form,

\[ \begin{align} u^\star(t,x) &= \mathop{\mathrm{argmax}}_{z \in \mathbb{R}^D} \; -\tfrac12 \|z\|^2 + \left< z, \sigma^\top(x) \nabla V(t,x) \right>\\ &= \sigma^\top(x) \, \nabla V(t,x), \end{align} \]

as we already knew from Equation 4. This implies that the reward-to-go function \(V(t,x)\) satisfies the HJB equation

\[ {\left( \partial_t + \mathcal{L} \right)} V + \frac12 \| \sigma^\top \nabla V \|^2 + f = 0 \tag{5}\]

with terminal condition \(V(T,x) = g(x)\). Another route to derive Equation 5 is to simply use the fact that \(V(t,x) = \log h(t,x)\); since the Feynman-Kac gives that the function \(h(t,x)\) satisfies \((\partial_t + \mathcal{L}+ f) h = 0\), the conclusion follows from a few lines of algebra by starting writing \(\partial_t V = h^{-1} \, \partial_t h = -h^{-1}(\mathcal{L}+ f)[h]\), expanding \(\mathcal{L}h\) and expressing everything back in terms of \(V\). The term \(\|\sigma^\top \nabla V\|^2\) naturally arises when expressing the diffusion term \(\sigma \sigma^\top : \nabla^2 h\) as a function of the second derivative of \(V\); it is the idea of the standard Cole-Hopf transformation.