Chapter03 grid-world

The code is taken from ShangtongZhang's chapter03/grid_world.py.

Solving the Grid-world problem with an MDP-based reinforcement learning algorithm.

Task description (Example 3.5 in Chapter 3)

grid_world is a grid in which every cell represents a state. In every state there are 4 possible actions: left, right, up, down. The reward rules are as follows:

if an action would move the agent off the grid, the agent stays where it is and reward = -1;

if the agent is in A, every action gives reward = 10 and the next state is always A_PRIME_POS; in B, every action gives reward = 5 and the next state is always B_PRIME_POS;

all actions taken in any other state give reward = 0.
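Since the environment is deterministic, these rules can be restated compactly in the book's p(s', r | s, a) notation (a sketch only; A' and B' stand for A_PRIME_POS and B_PRIME_POS, and s + a abbreviates "the cell reached by taking action a from s"):

$$
(s', r) =
\begin{cases}
(A',\, 10) & \text{if } s = A \\
(B',\, 5) & \text{if } s = B \\
(s,\, -1) & \text{if } s + a \text{ leaves the grid} \\
(s + a,\, 0) & \text{otherwise}
\end{cases}
\qquad p(s', r \mid s, a) = 1 .
$$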

The layout of the grid is shown below:

[figure: grid-world layout with the special states A, A', B, B']

Import modules and define constants

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.table import Table
%matplotlib inline

WORLD_SIZE = 5
A_POS = [0, 1]
A_PRIME_POS = [4, 1]
B_POS = [0, 3]
B_PRIME_POS = [2, 3]
DISCOUNT = 0.9

# left, up, right, down
ACTIONS = [np.array([0, -1]),
           np.array([-1, 0]),
           np.array([0, 1]),
           np.array([1, 0])]
ACTION_PROB = 0.25

Get the next_state (based on the action) and the corresponding reward

def step(state, action):
    # the two special states always jump to their primed positions,
    # regardless of the action taken
    if state == A_POS:
        return A_PRIME_POS, 10
    if state == B_POS:
        return B_PRIME_POS, 5

    state = np.array(state)
    next_state = (state + action).tolist()
    x, y = next_state
    if x < 0 or x >= WORLD_SIZE or y < 0 or y >= WORLD_SIZE:
        # moving off the grid: stay in place and receive a penalty
        reward = -1.0
        next_state = state
    else:
        reward = 0
    return next_state, reward
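As a quick sanity check of the rules above, a few illustrative calls (a sketch; note that in this implementation the off-grid case returns the position as a NumPy array rather than a list, which the sweep code below handles transparently):

# expected behaviour of the dynamics defined above
print(step(A_POS, ACTIONS[0]))    # ([4, 1], 10): every action from A leads to A'
print(step(B_POS, ACTIONS[3]))    # ([2, 3], 5): every action from B leads to B'
print(step([2, 2], ACTIONS[2]))   # ([2, 3], 0): an ordinary move to the right
print(step([0, 0], ACTIONS[1]))   # (array([0, 0]), -1.0): moving up off the grid, the agent stays put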

Draw the grid of values as a table from a given np.array

def draw_image(image):
    fig, ax = plt.subplots()
    ax.set_axis_off()
    tb = Table(ax, bbox=[0, 0, 1, 1])

    nrows, ncols = image.shape
    width, height = 1.0 / ncols, 1.0 / nrows

    # Add cells
    # np.ndenumerate() returns an iterator yielding pairs of array coordinates and values.
    for (i, j), val in np.ndenumerate(image):
        # every cell uses a white background (the checkerboard indexing of the
        # original matplotlib table demo is not needed here)
        color = 'white'

        tb.add_cell(i, j, width, height, text=val,
                    loc='center', facecolor=color)

    # Row labels...
    for i, label in enumerate(range(len(image))):
        tb.add_cell(i, -1, width, height, text=label + 1, loc='right',
                    edgecolor='none', facecolor='none')
    # Column labels...
    for j, label in enumerate(range(len(image))):
        tb.add_cell(-1, j, width, height / 2, text=label + 1, loc='center',
                    edgecolor='none', facecolor='none')
    ax.add_table(tb)

# test draw_image
test_image = np.array([[1.2, 2.1], [3.5, 4.7]])
draw_image(test_image)

[figure: the 2x2 test table rendered by draw_image]

Policies and Value Functions
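The function below evaluates the equiprobable random policy (every action taken with probability ACTION_PROB = 0.25) by sweeping the grid and applying the Bellman expectation update until the values stop changing. With the deterministic dynamics above, the update accumulated in the inner loop corresponds to

$$
v_{k+1}(s) = \sum_{a} \pi(a \mid s)\,\bigl[\, r(s, a) + \gamma\, v_k(s') \,\bigr],
\qquad \pi(a \mid s) = 0.25,\quad \gamma = 0.9,
$$

where s' is the state returned by step(s, a).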

def figure_3_2():
    value = np.zeros((WORLD_SIZE, WORLD_SIZE))
    while True:
        # keep iterating until convergence
        new_value = np.zeros(value.shape)
        for i in range(0, WORLD_SIZE):
            for j in range(0, WORLD_SIZE):
                for action in ACTIONS:
                    (next_i, next_j), reward = step([i, j], action)
                    # Bellman equation; each action is chosen at random with probability 0.25
                    new_value[i, j] += ACTION_PROB * (reward + DISCOUNT * value[next_i, next_j])
        if np.sum(np.abs(value - new_value)) < 1e-4:
            draw_image(np.round(new_value, decimals=2))
            plt.savefig('./figure_3_2.png')
            plt.show()
            plt.close()
            break
        value = new_value

figure_3_2()

[figure_3_2.png: state-value function of the equiprobable random policy]

Optimal Policies and Value Functions (greedy action selection)
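Here the expectation over actions is replaced by a max over actions, which turns the sweep into value iteration on the Bellman optimality equation:

$$
v_{k+1}(s) = \max_{a}\,\bigl[\, r(s, a) + \gamma\, v_k(s') \,\bigr],
\qquad \gamma = 0.9 .
$$

The converged array is the optimal state-value function v*.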

def figure_3_5():
    value = np.zeros((WORLD_SIZE, WORLD_SIZE))
    while True:
        # keep iterating until convergence
        new_value = np.zeros(value.shape)
        for i in range(0, WORLD_SIZE):
            for j in range(0, WORLD_SIZE):
                values = []
                for action in ACTIONS:
                    (next_i, next_j), reward = step([i, j], action)
                    # value iteration: back up every action, then keep the best one
                    values.append(reward + DISCOUNT * value[next_i, next_j])
                new_value[i, j] = np.max(values)
        if np.sum(np.abs(new_value - value)) < 1e-4:
            draw_image(np.round(new_value, decimals=2))
            plt.savefig('./figure_3_5.png')
            plt.show()
            plt.close()
            break
        value = new_value

figure_3_5()

[figure_3_5.png: optimal state-value function]

Afterword

This is a relatively simple Markov decision process: once the previous state and the action are fixed, the current state and its reward are completely determined, i.e. Pr{s', r | s, a} = 1. In practice, applying MDPs, including the optimization procedure above, has real limitations:

(1) we need to know the dynamics of the environment exactly;

(2) we need enough computing power to work out all the required states and values;

(3) the problem has to satisfy the Markov property.

Limitation (2) in particular restricts how widely MDP methods can be applied: if we have to take every action in every state and update every value, the computation is practically impossible for tasks with a huge number of states.
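As a rough illustration of point (2), a back-of-the-envelope sketch of the cost of a single synchronous sweep (the grid sizes other than 5 are hypothetical and only meant to show the growth):

# one Bellman backup per (state, action) pair per sweep
for world_size in (5, 100, 1000):          # 5 is this example; the larger sizes are hypothetical
    n_states = world_size ** 2
    backups = n_states * len(ACTIONS)      # 4 actions per state
    print(world_size, n_states, backups)   # e.g. 5 -> 25 states, 100 backups per sweep
# The per-sweep cost grows with the number of states, and many sweeps are
# needed before convergence, so exhaustive sweeps quickly become infeasible.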