# [RL Note] Estimating the Policy Gradient

The policy gradient theorem gives a simple expression for the gradient of a policy's performance $r(\pi)$:

\nabla r(\pi) = \sum_{s} \mu_\pi(s) \sum_{a} \nabla \pi(a|s, \boldsymbol{\mathrm{\theta}}) q_{\pi}(s, a) \tag{1}

Here $\mu_\pi$ is the on-policy distribution over states encountered while following $\pi$ along a trajectory

S_0, A_0, R_1, S_1, A_1, R_2, \cdots, S_t, A_t, R_{t+1}, \cdots

Recall the stochastic-gradient update rule for state-value prediction:

\boldsymbol{\mathrm{w}}_{t+1} \doteq \boldsymbol{\mathrm{w}}_t + \alpha \Big[U_t - \hat{v}(S_t, \boldsymbol{\mathrm{w}}_t)\Big] \nabla \hat{v}(S_t, \boldsymbol{\mathrm{w}}_t)
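As a concrete sketch of the update above, assume a linear approximator $\hat{v}(s, \boldsymbol{\mathrm{w}}) = \boldsymbol{\mathrm{w}}^\top \boldsymbol{\mathrm{x}}(s)$, so that $\nabla \hat{v} = \boldsymbol{\mathrm{x}}(s)$; the feature vector and target below are made-up illustrative values:

```python
import numpy as np

def value_update(w, x_s, U_t, alpha):
    """One stochastic-gradient step: w += alpha * (U_t - v_hat) * grad v_hat.

    For a linear approximator v_hat(s, w) = w @ x(s), the gradient
    with respect to w is just the feature vector x(s).
    """
    v_hat = w @ x_s
    return w + alpha * (U_t - v_hat) * x_s

w = np.zeros(3)
x_s = np.array([1.0, 0.5, 0.0])   # features of S_t (placeholder values)
w = value_update(w, x_s, U_t=1.0, alpha=0.1)
```

Each step moves $\hat{v}(S_t, \boldsymbol{\mathrm{w}})$ a fraction $\alpha$ of the way toward the target $U_t$.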

Similarly, note that equation $(1)$ is in fact an expectation:

\begin{align}
\nabla r(\pi) &= \sum_{s} \mu_\pi(s) \sum_{a} \nabla \pi(a|s, \boldsymbol{\mathrm{\theta}}) q_{\pi}(s, a) \\
&= \mathbb{E}_\mu\Big[\sum_{a} \nabla \pi(a|S_t, \boldsymbol{\mathrm{\theta}}) q_{\pi}(S_t, a)\Big] \tag{2}
\end{align}

Sampling $S_t \sim \mu_\pi$ then yields a stochastic update whose expectation equals the true gradient (the "all-actions" update):

\boldsymbol{\mathrm{\theta}}_{t+1} \doteq \boldsymbol{\mathrm{\theta}}_t + \alpha \sum_{a} \nabla \pi(a|S_t, \boldsymbol{\mathrm{\theta}}_t) q_{\pi}(S_t, a) \tag{3}
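A minimal sketch of the all-actions update in equation $(3)$, assuming a softmax policy over a single state, where $\boldsymbol{\mathrm{\theta}}$ holds one preference per action (the step size and $q$ values below are placeholders):

```python
import numpy as np

def softmax_policy(theta):
    e = np.exp(theta - theta.max())   # shift for numerical stability
    return e / e.sum()

def grad_pi(theta, a):
    """Gradient of pi(a | theta) for a softmax policy:
    d pi_a / d theta_b = pi_a * (1[a == b] - pi_b)."""
    pi = softmax_policy(theta)
    one_hot = np.eye(len(theta))[a]
    return pi[a] * (one_hot - pi)

def all_actions_update(theta, q, alpha):
    """Equation (3): sum over every action; only S_t is sampled."""
    return theta + alpha * sum(grad_pi(theta, a) * q[a]
                               for a in range(len(theta)))

theta = all_actions_update(np.zeros(3), q=np.array([1.0, 0.0, 0.0]), alpha=0.1)
```

Because every action's gradient is included, no sampling of $A_t$ (and hence no importance weighting) is needed at this stage.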

Going further, the term $\sum_{a} \nabla \pi(a|S_t, \boldsymbol{\mathrm{\theta}}_t) q_{\pi}(S_t, a)$ can itself be written as an expectation:

\begin{align}
\sum_{a} \nabla \pi(a|S_t, \boldsymbol{\mathrm{\theta}}_t) q_{\pi}(S_t, a) &= \sum_{a} \pi(a|S_t, \boldsymbol{\mathrm{\theta}}_t) \frac{\nabla \pi(a|S_t, \boldsymbol{\mathrm{\theta}}_t)}{\pi(a|S_t, \boldsymbol{\mathrm{\theta}}_t)} q_{\pi}(S_t, a) \\
&= \mathbb{E}_{\pi}\bigg[\frac{\nabla \pi(A_t|S_t, \boldsymbol{\mathrm{\theta}}_t)}{\pi(A_t|S_t, \boldsymbol{\mathrm{\theta}}_t)} q_{\pi}(S_t, A_t) \bigg] \quad \text{replacing } a \text{ by the sample } A_t \sim \pi \tag{4}
\end{align}
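Over a finite action set, the expectation in $(4)$ can be computed exactly, so the identity can be checked numerically: the factor $\pi(a|s)$ introduced in the first step cancels the $1/\pi(a|s)$ inside the ratio, recovering the original sum. The softmax parameters and $q$ values below are arbitrary illustrative numbers:

```python
import numpy as np

theta = np.array([0.2, -0.1, 0.5])
q = np.array([1.0, 0.0, 2.0])

e = np.exp(theta - theta.max())
pi = e / e.sum()

# Row a holds the gradient of pi(a | theta): pi_a * (1[a] - pi).
grad_pi = np.array([pi[a] * ((np.arange(3) == a) - pi) for a in range(3)])

# Left-hand side of (4): sum_a grad pi(a) q(a).
lhs = (grad_pi * q[:, None]).sum(axis=0)

# Right-hand side: E_pi[(grad pi(A)/pi(A)) q(A)], expectation taken exactly.
rhs = (pi[:, None] * (grad_pi / pi[:, None]) * q[:, None]).sum(axis=0)
```

The two sides agree to machine precision; in practice $A_t$ is drawn from $\pi$ and the ratio term is a one-sample estimate of this expectation.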

Sampling $A_t \sim \pi$ as well gives the update

\boldsymbol{\mathrm{\theta}}_{t+1} \doteq \boldsymbol{\mathrm{\theta}}_t + \alpha \frac{\nabla \pi(A_t|S_t, \boldsymbol{\mathrm{\theta}}_t)}{\pi(A_t|S_t, \boldsymbol{\mathrm{\theta}}_t)} q_{\pi}(S_t, A_t) \tag{5}

Using the identity $\nabla \ln x = \nabla x / x$, this is equivalently written as

\boldsymbol{\mathrm{\theta}}_{t+1} \doteq \boldsymbol{\mathrm{\theta}}_t + \alpha \nabla \ln \pi(A_t|S_t, \boldsymbol{\mathrm{\theta}}_t) q_{\pi}(S_t, A_t) \tag{6}
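The log form of equation $(6)$ is convenient because for a softmax policy $\nabla \ln \pi(a|\boldsymbol{\mathrm{\theta}})$ has the simple closed form $\mathbf{1}_a - \pi(\cdot|\boldsymbol{\mathrm{\theta}})$. A minimal sketch, again assuming a single-state softmax policy with a placeholder value for $q_\pi(S_t, A_t)$:

```python
import numpy as np

def log_grad_update(theta, a_t, q_sa, alpha):
    """Equation (6) for a softmax policy:
    grad ln pi(a_t | theta) = one_hot(a_t) - pi(theta)."""
    e = np.exp(theta - theta.max())
    pi = e / e.sum()
    grad_log_pi = np.eye(len(theta))[a_t] - pi
    return theta + alpha * grad_log_pi * q_sa

theta = log_grad_update(np.zeros(3), a_t=0, q_sa=2.0, alpha=0.1)
```

A positive $q_{\pi}(S_t, A_t)$ raises the preference for the sampled action and lowers the others, making that action more likely under the new parameters.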