[RL Notes] 策略迭代

Author: nex3z 2019-10-20

由策略改进定理，对于给定策略 $\pi$，在每一个状态都根据价值函数 $v_\pi$ 贪心的选择动作，就可以得到一个更优的策略 $\pi’$

\begin{align}
\pi'(s) \doteq \underset{a}{\arg\max} \; \sum_{s’, r} p(s’, r | s, a) \big[ r + \gamma v_\pi(s’) \big] \tag{1}
\end{align}

然后计算 $\pi’$ 的价值函数 $v_{\pi’}$，通过相同的方式就可以找到一个新的更优的策略 $\pi^{\prime\prime}$。不断地对价值函数进行评估（evaluation），并对策略进行改进（improvement），得到一个不断改进的策略和价值函数序列

\begin{equation}
\pi_0 \overset{\mathrm{E}}{\rightarrow} v_{\pi_0} \overset{\mathrm{I}}{\rightarrow} \pi_1 \overset{\mathrm{E}}{\rightarrow} v_{\pi_1} \overset{\mathrm{I}}{\rightarrow} \pi_2 \overset{\mathrm{E}}{\rightarrow} \cdots \overset{\mathrm{I}}{\rightarrow} \pi_* \overset{\mathrm{E}}{\rightarrow} v_{\pi_*}
\end{equation}

在上面的序列中，每一个策略都严格优于上一个策略，除非上一个策略已经最优。有限 MDP 只有有限种策略，故在有限次迭代后，上面的序列一定收敛到一个最优策略和最优价值函数。

　　上面这种寻找最优策略的方法称为策略迭代，其完整版本如下所示。

策略迭代（使用迭代策略评估），用于估计 $\pi \approx \pi_*$

初始化
对 $s \in \mathcal{S}$，任意设定 $V(s) \in \mathbb{R}$ 以及 $\pi(s) \in \mathcal{A}(s)$
策略评估
循环：
　　$\Delta \leftarrow 0$
　　对每一个 $s \in \mathcal{S}$ 循环：
　　　　$v \leftarrow V(s)$
　　　　$V(s) \leftarrow \sum\limits_{s’, r}p(s’,r|s, \pi(s))\big[r + \gamma V(s’)\big]$
　　　　$\Delta \leftarrow \max(\Delta, |v – V(s)|)$
　　直到 $\Delta < \theta$（一个决定估计精度的小正数）
策略改进
　　$policy \text{-} stable \leftarrow true$
　　对每一个 $s \in \mathcal{S}$：
　　　　$old\text{-}action \leftarrow \pi(s)$
　　　　$\pi(s) \leftarrow \underset{a}{\arg\max} \sum\limits_{s’, r}p(s’,r|s, a)\big[r + \gamma V(s’)\big]$
　　　　如果 $old\text{-}action \neq \pi(s)$，那么 $policy\text{-}stable \leftarrow false$
　　如果 $policy\text{-}stable$ 为 $true$，那么停止并返回 $V \approx v_*$ 以及 $\pi \approx \pi_*$；否则跳转到 2。

一	二	三	四	五	六	日
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31