Chapter05 infinite_variance

The code is taken from ShangtongZhang's chapter05/infinite_variance.py.

This example demonstrates the instability of ordinary importance sampling.

Problem description

The program uses a simple example to show that the ordinary importance sampling estimator can have infinite variance, so its estimates often fail to converge.

The example uses an MDP with a single non-terminal state s, two actions (left and right), and a terminal state. The rewards and transition probabilities are shown below:

(Figure: the one-state MDP. From s, action right leads deterministically to the terminal state with reward 0; action left returns to s with probability 0.9 and reward 0, or terminates with probability 0.1 and reward +1.)

The target policy is π(left|s) = 1, π(right|s) = 0.

The behavior policy used to generate episodes is b(left|s) = b(right|s) = 0.5.

The behavior policy covers the target policy (b(a|s) > 0 wherever π(a|s) > 0), as importance sampling requires, and under the target policy v_π(s) = 1. Now let's run the code and see what the estimate computed from episodes generated by the behavior policy actually converges to.
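
A quick check that v_π(s) = 1: under the target policy the agent always chooses left, so with γ = 1 the Bellman equation for s is

v_π(s) = 0.1 × 1 + 0.9 × (0 + v_π(s))

which solves to v_π(s) = 1.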

Import the modules and define the constants, where ACTION_BACK corresponds to left and ACTION_END corresponds to right.

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from tqdm import tqdm
%matplotlib inline

ACTION_BACK = 0
ACTION_END = 1

Define the behavior policy and the target policy, then run the experiment.

# behavior policy: choose left/right with equal probability
def behavior_policy():
    return np.random.binomial(1, 0.5)

# target policy: always choose left
def target_policy():
    return ACTION_BACK

# play one episode
# returns the reward and the action trajectory; the state is always s, so it is not recorded
def play():
    # track the actions for the importance ratio
    trajectory = []
    while True:
        action = behavior_policy()
        trajectory.append(action)
        if action == ACTION_END:
            # right: terminate immediately with reward 0
            return 0, trajectory
        if np.random.binomial(1, 0.9) == 0:
            # left: terminate with probability 0.1 and reward 1
            return 1, trajectory

def figure_5_4():
    runs = 10
    episodes = 100000
    for run in tqdm(range(runs)):
        # each run is independent
        rewards = []
        for episode in range(0, episodes):
            reward, trajectory = play()
            if trajectory[-1] == ACTION_END:
                # the target policy never chooses right, so rho = 0
                rho = 0
            else:
                # every action is left: pi = 1 at each step, b = 0.5 at each step
                rho = 1.0 / pow(0.5, len(trajectory))
            rewards.append(rho * reward)
        rewards = np.add.accumulate(rewards)
        estimations = np.asarray(rewards) / np.arange(1, episodes + 1)
        plt.plot(estimations)

    # reference line at the true value v_pi(s) = 1
    plt.plot(np.ones(episodes + 1))
    plt.xlabel('Episodes (log scale)')
    plt.ylabel('Ordinary Importance Sampling')
    plt.xscale('log')
    plt.ylim(0, 2)
    plt.savefig('./figure_5_4.png')
    plt.show()

figure_5_4()
100%|██████████| 10/10 [00:06<00:00,  1.59it/s]

(figure_5_4.png: ordinary importance sampling estimates of v_π(s) over 10 independent runs; even after 100,000 episodes the curves do not settle at the true value 1.)
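
Why doesn't the estimate settle at 1? Following the argument in Example 5.5 of Sutton and Barto, only episodes in which every action is left have a nonzero importance ratio. An episode with k non-terminating left transitions followed by a terminating left transition has probability 0.05 × 0.45^k under b and scaled return ρG = 2^(k+1), so the second moment of the scaled return is

E_b[(ρG)²] = Σ_{k≥0} 0.05 × 0.45^k × 4^(k+1) = 0.2 × Σ_{k≥0} 1.8^k = ∞

Since the variance of ρG is infinite, the sample average that ordinary importance sampling computes has no guarantee of converging to 1, which is why the curves above behave so erratically.

To see this directly, here is a minimal sketch (not part of the original repository; sample_scaled_return is a helper name introduced here) that samples ρG for many episodes and prints the running sample variance, which keeps drifting upward instead of stabilizing:

import numpy as np

# sample one value of rho * G under the behavior policy
def sample_scaled_return(rng):
    length = 0
    while True:
        length += 1
        if rng.binomial(1, 0.5) == 1:   # right: episode ends, pi(right|s) = 0, so rho * G = 0
            return 0.0
        if rng.binomial(1, 0.9) == 0:   # left terminates with reward 1
            return 2.0 ** length        # rho = (1 / 0.5)^length, G = 1

rng = np.random.default_rng(0)
samples = np.array([sample_scaled_return(rng) for _ in range(10 ** 6)])
for n in (10 ** 3, 10 ** 4, 10 ** 5, 10 ** 6):
    print(n, samples[:n].var())   # the sample variance keeps growing with n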