[RL Notes] 加权重要度采样及增量实现

Author: nex3z 2019-10-26

Importance Sampling, Reinforcement Learning

Contents

1. 加权重要度采样
2. 增量实现

1. 加权重要度采样

　　MC 预测算法中使用的简单重要度采样定义为

\begin{equation}
V(s) \doteq \frac{\sum\limits_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{|\mathcal{T}(s)|} \tag{1}
\end{equation}

其中 $\mathcal{T}(s)$ 为所有访问过状态 $s$ 的集合，$T(t)$ 表示在时刻 $t$ 后首次终止，$G_t$ 表示在 $t$ 之后到达 $T(t)$ 时的回报，$\{G_t\}_{t \in \mathcal{T}(s)}$ 为状态 $s$ 对应的回报，$\{\rho_{t:T(t)-1}\}_{t \in \mathcal{T}(s)}$ 为相应的重要度采样比。简单重要度采样只是简单地计算平均值。

　　另外还可以使用加权平均的方法，称为加权重要度采样，定义为

\begin{equation}
V(s) \doteq \frac{\sum\limits_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{\sum\limits_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} } \tag{2}
\end{equation}

如果式 $(2)$ 中的分母为零，则该式的定义也为零。

2. 增量实现

　　假设有一个回报序列 $G_1, G_2, \cdots, G_{n-1}$，它们都从相同的状态开始，每一个回报对应一个随机权重 $W_t$（如前面的重要度采样比 $W_t = \rho_{t:T(t)-1}$），使用加权重要度采样时，需要估计

\begin{equation}
V_n \doteq \frac{\sum\limits_{k=1}^{n-1} W_k G_k}{\sum\limits_{k=1}^{n-1} W_k}, \quad n \geq 2 \tag{3}
\end{equation}

我们希望能跟踪 $V_n$ 的变化，从而使用增量的方式对其进行更新

\begin{equation}
V_{n+1} \doteq V_n + \frac{W_n}{C_n} [G_n – V_n], \quad n \geq 1 \tag{4}
\end{equation}

其中

\begin{equation}
C_{n+1} \doteq C_n + W_{n+1} \tag{4}
\end{equation}

定义 $C_0 \doteq 0$。

　　由此可以得到增量实现离轨策略 MC 预测算法如下所示。

离轨策略 MC 预测算法（策略评估），用于估计 $Q \approx q_\pi$
输入：一个任意的目标策略 $\pi$
初始化：对所有 $s \in \mathcal{S}$，$a \in \mathcal{A(s)}$：
　　任意初始化 $Q(s, a) \in \mathbb{R}$
　　$C(s, a) \leftarrow 0$
无限循环（对每幕）：
　　$b \leftarrow$ 任何能包括 $\pi$ 的策略
　　根据 $b$ 生成一幕序列：$S_0, A_0, R_1, \cdots, S_{T-1}, A_{T-1}, R_T$
　　$G \leftarrow 0$
　　$W \leftarrow 1$
　　对幕中的每一步循环，$t = T-1, T-2, \cdots, 0$，当 $W \neq 0$ 时：
　　　　$G \leftarrow \gamma G + R_{t+1}$
　　　　$C(S_t, A_t) \leftarrow C(S_t, A_t) + W$
　　　　$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{W}{C(S_t, A_t)} [G – Q(S_t, A_t)]$
　　　　$W \leftarrow W \frac{\pi(A_t|S_t)}{b(A_t|S_t)}$
　　　　如果 $W = 0$，则退出内循环

一	二	三	四	五	六	日
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31