Tag Archive: Prediction

[RL Notes] Temporal-Difference Learning

Author: nex3z 2019-10-26

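In the prediction problem, our goal is to estimate the value function \begin{equation} v_\pi(s) \doteq \mathbb{E}[G_t|S_t = s] \tag{1} \end{equation} that is, the expected return obtained when starting from a given state. When Monte Carlo methods are used for policy evaluation, the estimate can be updated incrementally via \begin{equation} V(S_t) \leftarrow V(S…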

Reinforcement Learning

Prediction, Reinforcement Learning, Time Difference
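
The excerpt above is cut off mid-equation, so the following is only a sketch of an incremental Monte Carlo update. It assumes the standard constant-step-size form, in which each visited state's estimate is moved a fraction `alpha` of the way toward the observed return. The function name `mc_update`, the parameters `alpha` and `gamma`, and the `(state, reward)` episode layout are illustrative assumptions, not taken from the post.

```python
def mc_update(V, episode, alpha=0.1, gamma=1.0):
    """Every-visit, constant-step-size Monte Carlo update (illustrative sketch).

    V       -- dict mapping state -> current value estimate (updated in place)
    episode -- list of (state, reward) pairs, where reward is the reward
               received after leaving that state (assumed layout)
    """
    G = 0.0
    # Walk the episode backwards so the return G can be accumulated in one pass.
    for state, reward in reversed(episode):
        G = reward + gamma * G
        old = V.get(state, 0.0)
        # Move the estimate toward the observed return by a fraction alpha.
        V[state] = old + alpha * (G - old)
    return V


if __name__ == "__main__":
    values = mc_update({}, [("A", 0.0), ("B", 1.0)], alpha=0.5)
    print(values)
```

Because the return `G` is only known once the episode has ended, an update of this kind cannot be applied until the episode terminates.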

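1. Off-Policy Learning with Importance Sampling. The Monte Carlo prediction algorithm estimates state values by averaging returns, that is, \begin{equation} v_\pi(s) \doteq \mathbb{E}_\pi[G_t|S_t = s] = \mathrm{average}(Returns(s)) \tag{1} \end{equation} In the off-policy setting, however, the samples are generated by the behavior policy, so averaging the returns estimates the behavior policy…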

Reinforcement Learning

Off-Policy, Prediction, Reinforcement Learning
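
To illustrate the point made in the excerpt above, here is a minimal sketch on a hypothetical one-step problem with a single state and two actions. The policies `PI` (target) and `B` (behavior), the reward table `REWARD`, and the function `estimate` are all made up for illustration. A plain average of returns sampled under the behavior policy approaches the behavior policy's value (0.5 in this setup), while weighting each return by the importance ratio π(a)/b(a) (ordinary importance sampling) approaches the target policy's value (0.9 in this setup).

```python
import random

# Hypothetical one-step problem: a single state, two actions, deterministic rewards.
PI = {0: 0.1, 1: 0.9}      # target policy pi(a)   (assumed for illustration)
B = {0: 0.5, 1: 0.5}       # behavior policy b(a)  (assumed for illustration)
REWARD = {0: 0.0, 1: 1.0}  # return G obtained after taking each action


def estimate(num_episodes=20000, seed=0):
    """Return (plain average, importance-weighted average) of sampled returns."""
    rng = random.Random(seed)
    plain_sum = 0.0
    weighted_sum = 0.0
    for _ in range(num_episodes):
        # Sample an action from the behavior policy.
        action = 1 if rng.random() < B[1] else 0
        G = REWARD[action]
        # Importance sampling ratio for a one-step episode.
        rho = PI[action] / B[action]
        plain_sum += G
        weighted_sum += rho * G
    return plain_sum / num_episodes, weighted_sum / num_episodes


if __name__ == "__main__":
    plain, weighted = estimate()
    print(f"plain average of returns:     {plain:.3f}")
    print(f"importance-weighted average:  {weighted:.3f}")
```

In a multi-step episode the ratio would be a product over all time steps from the visited state to the end of the episode; the one-step case keeps the sketch short.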



[RL Notes] Importance Sampling and Off-Policy Monte Carlo Prediction