The code below is adapted from ShangtongZhang's chapter09/random_walk.py.
Using the 1000-state random-walk MRP as an example, it compares how different feature-construction schemes for linear approximation affect the performance of value-function-approximation (VFA) reinforcement learning algorithms.
Problem description
This problem extends the random walk of Example 6.2:
(1) The state space is enlarged to 1000 states, numbered 1 through 1000; every episode starts in state 500.
(2) State transitions are from the current state to one of the 100 neighboring states to its left, or to one of the 100 neighboring states to its right, all with equal probability.
(3) If the current state is near an edge, there may be fewer than 100 neighbors on that side. In this case, all the probability that would have gone into those missing neighbors goes into the probability of terminating on that side (thus, state 1 has a 0.5 chance of terminating on the left, and state 950 has a 0.25 chance of terminating on the right).
(4) Terminating on the left gives a reward of -1, terminating on the right a reward of +1; all other transitions have reward 0.
Import modules and define constants
```python
import numpy as np
```
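Only the first line of the listing survives above; the constants that the rest of this walkthrough relies on would look roughly like the following sketch, modeled on ShangtongZhang's repo (the exact names N_STATES, STATES, START_STATE, END_STATES, STEP_RANGE are assumptions):

```python
from tqdm import tqdm             # progress bars seen in the output below
import matplotlib.pyplot as plt   # used for the figures later on

N_STATES = 1000                      # number of non-terminal states
STATES = np.arange(1, N_STATES + 1)  # the non-terminal states 1..1000
START_STATE = 500                    # every episode starts in the middle
END_STATES = [0, N_STATES + 1]       # pseudo-states past the two edges
STEP_RANGE = 100                     # one transition jumps up to 100 states
```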
Define a function that computes the true state values; the result is nearly a straight line, with nonlinearity only near the two ends.
```python
def compute_true_value():
```
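A minimal self-contained sketch of one way to write this, assuming the constants above: since no closed form is needed, we can sweep the undiscounted Bellman equation of the MRP until it converges, folding the terminal rewards of -1/+1 into the boundary entries of the array.

```python
def compute_true_value():
    # linear initial guess; entries 0 and N_STATES + 1 start at -1 and +1 and
    # act as terminal pseudo-values, which folds the terminal reward into the
    # bootstrap (all other rewards are 0 and there is no discounting)
    true_value = np.arange(-1001, 1003, 2) / 1001.0
    while True:
        old_value = np.copy(true_value)
        for state in STATES:
            true_value[state] = 0
            for action in [-1, 1]:
                for stride in range(1, STEP_RANGE + 1):
                    next_state = state + action * stride
                    # strides past an edge collapse onto the terminal state,
                    # which realizes the boundary rule of the problem
                    next_state = max(min(next_state, N_STATES + 1), 0)
                    # every (direction, stride) pair is equally likely
                    true_value[state] += true_value[next_state] / (2 * STEP_RANGE)
        if np.sum(np.abs(old_value - true_value)) < 1e-2:
            break
    # the terminal states themselves are worth 0
    true_value[0] = true_value[-1] = 0
    return true_value
```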
Define a step function that simulates a single transition, and a get_action function that chooses actions with the random policy.
```python
# take an @action at @state, return new state and reward for this transition
```
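A sketch of the two helpers under the dynamics above; the "action" of the random policy is just a direction, and the stride is sampled uniformly:

```python
def get_action():
    # random policy: go right or left with equal probability
    if np.random.binomial(1, 0.5) == 1:
        return 1
    return -1

def step(state, action):
    # sample a stride of 1..STEP_RANGE in the chosen direction; strides that
    # overshoot an edge are clipped onto the terminal pseudo-state
    stride = np.random.randint(1, STEP_RANGE + 1)
    state = max(min(state + action * stride, N_STATES + 1), 0)
    if state == 0:
        reward = -1
    elif state == N_STATES + 1:
        reward = 1
    else:
        reward = 0
    return state, reward
```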
Define the function approximation based on state aggregation.
```python
# a wrapper class for aggregation value function
```
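A sketch of the wrapper (the class shape follows the repo, but the details are my reconstruction): with state aggregation the feature vector is one-hot in the state's group, so the linear estimate is just the group's single weight, and the SGD gradient with respect to that weight is 1.

```python
class ValueFunction:
    # @num_of_groups: how many groups the 1000 states are aggregated into
    def __init__(self, num_of_groups):
        self.num_of_groups = num_of_groups
        self.group_size = N_STATES // num_of_groups
        # one weight per group; the weight is the group's value estimate
        self.params = np.zeros(num_of_groups)

    def value(self, state):
        if state in END_STATES:
            return 0
        group_index = (state - 1) // self.group_size
        return self.params[group_index]

    def update(self, delta, state):
        # delta = step size * (target - estimate); the gradient of the active
        # group's weight is 1, so the SGD step just adds delta to it
        group_index = (state - 1) // self.group_size
        self.params[group_index] += delta
```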
Define the gradient Monte Carlo method that trains the approximate value function.
```python
# gradient Monte Carlo algorithm
```
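A sketch of gradient Monte Carlo under the setup above: roll out one episode with the random policy, then move every visited state's estimate toward its actual return. Because the only nonzero reward arrives at termination and there is no discounting, every state's return in an episode equals that final reward.

```python
def gradient_monte_carlo(value_function, alpha, distribution=None):
    state = START_STATE
    trajectory = [state]
    reward = 0.0
    while state not in END_STATES:
        action = get_action()
        state, reward = step(state, action)
        trajectory.append(state)
    # the return G_t of every visited state is the final reward
    for state in trajectory[:-1]:
        delta = alpha * (reward - value_function.value(state))
        value_function.update(delta, state)
        if distribution is not None:
            distribution[state] += 1  # track the on-policy state distribution
```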
Plot the results: compare the value function predicted by the SGD-based MC method against true_value, and show the distribution of states.
```python
# Figure 9.1, gradient Monte Carlo algorithm
```
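The driver would look roughly like this: 10 groups of 100 states, step size 2e-5 and 1e5 episodes as in the book's Figure 9.1 (treat the exact numbers as assumptions):

```python
value_function = ValueFunction(10)
distribution = np.zeros(N_STATES + 2)   # visit counts, including terminals
for _ in tqdm(range(int(1e5))):
    gradient_monte_carlo(value_function, 2e-5, distribution)

true_value = compute_true_value()
state_values = [value_function.value(state) for state in STATES]
distribution /= np.sum(distribution)

# learned step function vs. the (almost linear) true values
plt.plot(STATES, state_values, label='Approximate MC value')
plt.plot(STATES, true_value[1:-1], label='True value')
plt.legend()
plt.show()

# the state-visit distribution mu, peaked around the start state
plt.plot(STATES, distribution[1:-1], label='State distribution')
plt.legend()
plt.show()
```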
100%|██████████| 100000/100000 [01:26<00:00, 1156.28it/s]
The distribution explains several details of the function approximation. The most obvious are the lower-left and upper-right corners of the estimate: at the left end the estimated values mostly sit above true_value, and at the right end mostly below it. Within a state group, the states that take a larger share of the distribution contribute more to the updates of the aggregated value. The distribution grows toward the middle, but inside the central groups the member states' relative shares are nearly equal, so there the estimates scatter evenly around true_value. In the groups at the two ends, however, the relative shares differ sharply across member states, which produces the visible deviation from true_value.
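This bias has a closed form. Gradient MC converges to the minimizer of the $\mu$-weighted squared error, and with aggregation features that minimizer is, group by group, the $\mu$-weighted average of the true values:

$$w_g = \frac{\sum_{s \in g} \mu(s)\, v_\pi(s)}{\sum_{s \in g} \mu(s)}$$

So in the outermost groups, where $\mu$ differs strongly across the 100 member states, $w_g$ is pulled toward the values of the more frequently visited inner states: above the true value at the far left, below it at the far right.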
Define training with the n-step TD method on the approximate value function, using semi-gradient updates.
```python
# semi-gradient n-step TD algorithm
```
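A sketch of the algorithm, following the n-step TD pseudocode of Chapter 7 but with the approximate value function; "semi-gradient" means the bootstrapped target is treated as a constant, so only the gradient of the estimate is followed:

```python
def semi_gradient_temporal_difference(value_function, n, alpha):
    state = START_STATE
    states = [state]   # states[t]: the state at time t
    rewards = [0]      # rewards[t]: the reward received on entering states[t]
    T = float('inf')   # episode length, discovered at termination
    t = 0
    while True:
        t += 1
        if t < T:
            action = get_action()
            state, reward = step(state, action)
            states.append(state)
            rewards.append(reward)
            if state in END_STATES:
                T = t
        tau = t - n   # the time whose state estimate is updated now
        if tau >= 0:
            # undiscounted n-step return: up to n rewards ...
            returns = 0.0
            for i in range(tau + 1, min(tau + n, T) + 1):
                returns += rewards[i]
            # ... plus a bootstrap from the current estimate if the horizon
            # ends before termination (the semi-gradient target)
            if tau + n < T:
                returns += value_function.value(states[tau + n])
            if states[tau] not in END_STATES:
                delta = alpha * (returns - value_function.value(states[tau]))
                value_function.update(delta, states[tau])
        if tau == T - 1:
            break
```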
Plot the results: observe how the n-step TD method performs, and how different choices of n affect the RMS error.
```python
# semi-gradient TD on 1000-state random walk, here with TD(0)
```
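A sketch of both experiments, matching the two progress bars below: the left panel reuses the 10-group aggregation of Figure 9.1 (the TD(0) step size 2e-4 is my assumption), and the right panel averages the RMS error of the first 10 episodes over 100 independent runs with 20 groups of 50 states.

```python
true_value = compute_true_value()

# left panel: TD(0), i.e. n = 1, for 1e5 episodes
value_function = ValueFunction(10)
for _ in tqdm(range(int(1e5))):
    semi_gradient_temporal_difference(value_function, 1, 2e-4)

# right panel: RMS error vs. step size for n = 1, 2, 4, ..., 512
steps = np.power(2, np.arange(10))
alphas = np.arange(0, 1.1, 0.1)
errors = np.zeros((len(steps), len(alphas)))
for _ in tqdm(range(100)):                      # 100 runs
    for i, n in enumerate(steps):
        for j, alpha in enumerate(alphas):
            value_function = ValueFunction(20)  # 20 groups of 50 states
            for _ in range(10):                 # first 10 episodes
                semi_gradient_temporal_difference(value_function, n, alpha)
                values = np.array([value_function.value(s) for s in STATES])
                errors[i, j] += np.sqrt(np.mean((values - true_value[1:-1]) ** 2))
errors /= 10 * 100
for i, n in enumerate(steps):
    plt.plot(alphas, errors[i, :], label='n = %d' % n)
plt.xlabel('alpha'); plt.ylabel('Average RMS error'); plt.legend()
plt.show()
```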
100%|██████████| 100000/100000 [01:32<00:00, 1077.20it/s]
100%|██████████| 100/100 [04:50<00:00, 2.87s/it]
The following examples each use a different feature-construction scheme from Section 9.5 to perform VFA (value function approximation): the polynomial / Fourier basis-function models and the tile-coding model. In each case the learning algorithm is the gradient Monte Carlo method defined above.
Linear function approximation with tile-coding features
```python
# a wrapper class for tile coding value function
```
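A sketch of one possible wrapper: each of several tilings partitions [1, 1000] into tiles of width tile_width, with successive tilings shifted by tiling_offset, so a state activates exactly one tile per tiling and its value is the sum of the active tiles' weights. (The two progress bars below presumably correspond to the slow 50-tiling run and a fast single-tiling baseline.)

```python
class TilingsValueFunction:
    # @num_of_tilings: how many overlapping tilings
    # @tile_width: width of a tile, in states
    # @tiling_offset: shift between successive tilings
    def __init__(self, num_of_tilings, tile_width, tiling_offset):
        self.num_of_tilings = num_of_tilings
        self.tile_width = tile_width
        # one extra tile per tiling so shifted tilings still cover all states
        self.tiling_size = N_STATES // tile_width + 1
        self.params = np.zeros((num_of_tilings, self.tiling_size))
        # left edge of each tiling; each tiling starts a bit further left
        self.tilings = np.arange(-tile_width + 1, 0, tiling_offset)

    def value(self, state):
        state_value = 0.0
        # sum the weight of the single active tile in every tiling
        for tiling_index in range(len(self.tilings)):
            tile_index = (state - self.tilings[tiling_index]) // self.tile_width
            state_value += self.params[tiling_index, tile_index]
        return state_value

    def update(self, delta, state):
        # spread one SGD step evenly over the active tiles, keeping the
        # effective step size comparable to a single tiling
        delta /= self.num_of_tilings
        for tiling_index in range(len(self.tilings)):
            tile_index = (state - self.tilings[tiling_index]) // self.tile_width
            self.params[tiling_index, tile_index] += delta
```

With num_of_tilings=50, tile_width=200 and tiling_offset=4 this matches the setup of the book's Figure 9.10; the natural baseline is a plain ValueFunction(N_STATES // tile_width), i.e. a single tiling.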
100%|██████████| 5000/5000 [03:53<00:00, 21.37it/s]
100%|██████████| 5000/5000 [00:09<00:00, 527.11it/s]
Linear function approximation with polynomial / Fourier basis features
```python
# a wrapper class for polynomial / Fourier -based value function
```
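A sketch of the wrapper for both bases: the state is first scaled onto [0, 1], a polynomial basis of order k uses the features s^i and a Fourier basis uses cos(i*pi*s) for i = 0..k, and since the approximator is linear, the gradient with respect to the weights is just the feature vector. (The `i=i` default argument pins the loop variable inside each lambda. The six progress bars below presumably correspond to orders 5, 10 and 20 for each of the two bases.)

```python
POLYNOMIAL_BASES = 0
FOURIER_BASES = 1

class BasesValueFunction:
    # @order: highest basis index; there are order + 1 weights
    # @basis_type: POLYNOMIAL_BASES or FOURIER_BASES
    def __init__(self, order, basis_type):
        self.order = order
        self.weights = np.zeros(order + 1)
        self.bases = []
        if basis_type == POLYNOMIAL_BASES:
            for i in range(order + 1):
                self.bases.append(lambda s, i=i: pow(s, i))
        elif basis_type == FOURIER_BASES:
            for i in range(order + 1):
                self.bases.append(lambda s, i=i: np.cos(i * np.pi * s))

    def value(self, state):
        state /= float(N_STATES)   # scale the state onto [0, 1]
        features = np.asarray([basis(state) for basis in self.bases])
        return np.dot(self.weights, features)

    def update(self, delta, state):
        state /= float(N_STATES)
        # linear approximation: gradient w.r.t. the weights = feature vector
        features = np.asarray([basis(state) for basis in self.bases])
        self.weights += delta * features
```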
100%|██████████| 5000/5000 [01:18<00:00, 63.43it/s]
100%|██████████| 5000/5000 [02:05<00:00, 38.65it/s]
100%|██████████| 5000/5000 [01:36<00:00, 51.71it/s]
100%|██████████| 5000/5000 [02:54<00:00, 27.98it/s]
100%|██████████| 5000/5000 [02:06<00:00, 39.42it/s]
100%|██████████| 5000/5000 [04:38<00:00, 18.22it/s]