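The code in this post is taken from ShangtongZhang's chapter10/mountain_car.py.
It uses Mountain Car, a classic reinforcement learning example, to test the performance of the semi-gradient Sarsa algorithm.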
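Problem description
Mountain Car is a classic reinforcement learning example: an underpowered car has to reach the goal at the top of the hill on the right, but its engine is too weak to overcome gravity. The only way up is to first drive up the slope on the left and then accelerate toward the right. In effect, the car swings back and forth between the two slopes, building up momentum from its limited engine power until it can reach the goal.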
The car's position and velocity are determined by the following equations:
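For reference, the simplified physics as given in Example 10.1 of Sutton and Barto's Reinforcement Learning: An Introduction (assumed here to be the source of the equations, since the code accompanies Chapter 10 of that book) are written below. Here \dot{x} corresponds to xdot in the text, and A_t ∈ {-1, 0, 1} is the action taken at time t.

```latex
\begin{aligned}
x_{t+1} &\doteq \mathrm{bound}\big[\, x_t + \dot{x}_{t+1} \,\big] \\
\dot{x}_{t+1} &\doteq \mathrm{bound}\big[\, \dot{x}_t + 0.001\,A_t - 0.0025\cos(3x_t) \,\big]
\end{aligned}
```

Note that the velocity is updated first and the new velocity is then used to update the position.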
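The bound operation enforces -1.2 ≤ x_t ≤ 0.5 and -0.07 ≤ xdot_t ≤ 0.07. When x reaches the left bound, the velocity xdot_t is reset to 0. When x reaches the right bound, the car has reached the goal and the episode ends. Each episode starts from a random position x_t ∈ [-0.6, -0.4) with xdot_t = 0. There are three actions: full throttle reverse (-1), zero throttle (0), and full throttle forward (+1). The reward is -1 on every time step, whichever action is taken, so shorter episodes earn a higher return.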
Import modules and define constants
import numpy as np
Tile coding (Section 9.5) is used to convert each (s, a) pair into features. Rather than a custom tile coding implementation, the code uses Richard S. Sutton's tile coding software.
#######################################################################
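To make the feature construction concrete, here is a minimal sketch of hashing-free tile coding. It is a simplified illustration of the idea, not Sutton's software itself: the names `IndexTable` and `active_tiles` are hypothetical, and the offset scheme (each tiling shifted by an odd multiple of its index per dimension) is an assumption modelled on the asymmetric displacements described in Section 9.5.

```python
from math import floor


class IndexTable:
    """Assigns consecutive integer indices to tile coordinates (hypothetical helper)."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.table = {}

    def index_of(self, coords):
        if coords not in self.table:
            if len(self.table) >= self.max_size:
                raise RuntimeError("index table is full")
            self.table[coords] = len(self.table)
        return self.table[coords]


def active_tiles(table, num_tilings, floats, ints=()):
    """Return one active tile index per tiling.

    `floats` must already be scaled so that one unit equals one tile width;
    `ints` carries discrete inputs such as the action.
    """
    # Quantize to a resolution of 1 / num_tilings of a tile width.
    quantized = [floor(f * num_tilings) for f in floats]
    indices = []
    for tiling in range(num_tilings):
        coords = [tiling]
        offset = tiling
        for q in quantized:
            coords.append((q + offset) // num_tilings)
            offset += 2 * tiling  # a different shift for each dimension
        coords.extend(ints)
        indices.append(table.index_of(tuple(coords)))
    return indices
```

Nearby inputs share most of their active tiles and distant inputs share none, which is what gives the linear approximator its local generalization. Appending the action to `ints` means each action gets its own set of tiles.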
The step function handles the interaction between the agent and the environment
# take an @action at @position and @velocity
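As a rough sketch of what such a step function does, the version below applies the Mountain Car dynamics described in the problem description. The constant names and the exact return signature are assumptions made for illustration and may differ from the original file.

```python
import math

# Bounds from the problem description (names are illustrative).
POSITION_MIN, POSITION_MAX = -1.2, 0.5
VELOCITY_MIN, VELOCITY_MAX = -0.07, 0.07


def step(position, velocity, action):
    """Apply `action` in {-1, 0, 1}; return (new_position, new_velocity, reward)."""
    # Velocity is updated first, then clipped to its bounds.
    new_velocity = velocity + 0.001 * action - 0.0025 * math.cos(3 * position)
    new_velocity = min(max(VELOCITY_MIN, new_velocity), VELOCITY_MAX)
    # Position uses the already-updated velocity, then is clipped as well.
    new_position = position + new_velocity
    new_position = min(max(POSITION_MIN, new_position), POSITION_MAX)
    # Hitting the left wall resets the velocity to zero.
    if new_position == POSITION_MIN:
        new_velocity = 0.0
    # Every time step costs -1 regardless of the action.
    return new_position, new_velocity, -1.0
```

The caller decides when the episode ends by checking whether the returned position equals `POSITION_MAX`.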
Define the value function approximation, using tile coding to construct the features
# wrapper class for state action value function
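The sketch below shows one way such a wrapper might look: a linear action-value function over binary features, where the value is the sum of the weights of the active tiles. The class name, the injected `feature_fn`, and the toy feature function are all hypothetical. Dividing the step size by the number of tilings follows the convention used in the book's Mountain Car experiments.

```python
class LinearTileValue:
    """Linear q(s, a) over binary features (illustrative, not the original class)."""

    def __init__(self, feature_fn, num_tilings, max_size, alpha, goal_position=0.5):
        self.feature_fn = feature_fn          # maps (position, velocity, action) to indices
        self.weights = [0.0] * max_size
        self.step_size = alpha / num_tilings  # each update is spread over num_tilings weights
        self.goal_position = goal_position

    def value(self, position, velocity, action):
        if position == self.goal_position:    # terminal states have value 0
            return 0.0
        return sum(self.weights[i] for i in self.feature_fn(position, velocity, action))

    def learn(self, position, velocity, action, target):
        active = self.feature_fn(position, velocity, action)
        estimate = sum(self.weights[i] for i in active)
        # Semi-gradient update: the gradient of a linear function with binary
        # features is 1 for each active tile and 0 elsewhere.
        delta = self.step_size * (target - estimate)
        for i in active:
            self.weights[i] += delta


def toy_features(position, velocity, action):
    """Two coarse tilings over position only, offset by half a cell (toy example)."""
    base = (action + 1) * 8
    cell0 = int((position + 1.2) // 0.5)
    cell1 = int((position + 1.2 + 0.25) // 0.5)
    return [base + cell0, base + 4 + cell1]
```

With `alpha=0.5` and two tilings, a single update moves the estimate halfway toward the target, which is the intended meaning of a step size expressed as a fraction of the number of tilings.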
Select actions with an epsilon-greedy policy
# get action at @position and @velocity based on epsilon greedy policy and @valueFunction
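A minimal sketch of epsilon-greedy selection over the three actions follows. The function signature is an assumption, and breaking ties at random is a design choice made here rather than something the original is known to do.

```python
import random

ACTIONS = (-1, 0, 1)  # reverse, coast, forward


def epsilon_greedy(q_value, position, velocity, epsilon, rng=random):
    """Pick a random action with probability `epsilon`, else a greedy one.

    `q_value(position, velocity, action)` is any callable returning an estimate.
    """
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    values = [q_value(position, velocity, a) for a in ACTIONS]
    best = max(values)
    # Break ties at random so no action is systematically favoured.
    return rng.choice([a for a, v in zip(ACTIONS, values) if v == best])
```

With all weights initialized to zero and rewards of -1, the initial estimates are optimistic, so even a purely greedy policy (`epsilon=0`) is pushed toward state-action pairs it has not tried yet.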
Train the control policy with n-step semi-gradient Sarsa
# semi-gradient n-step Sarsa
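The episode loop for n-step semi-gradient Sarsa can be sketched as below, following the structure of the pseudocode in Section 10.2 of the book. The environment, policy, and value function are passed in as callables so the sketch stays self-contained. All parameter names are illustrative, and a state here is any object the callables understand (for Mountain Car, a (position, velocity) pair).

```python
def run_episode(start, step_fn, is_terminal, choose, q_value, learn, n=1, gamma=1.0):
    """Run one episode of n-step semi-gradient Sarsa; return the episode length.

    step_fn(state, action) -> (next_state, reward)
    choose(state)          -> action
    q_value(state, action) -> current estimate
    learn(state, action, target) performs the weight update toward `target`.
    """
    states = [start]
    actions = [choose(start)]
    rewards = [0.0]              # rewards[t] holds R_t; index 0 is unused
    T = float("inf")             # episode length, unknown until termination
    t = 0
    while True:
        if t < T:
            next_state, reward = step_fn(states[t], actions[t])
            states.append(next_state)
            rewards.append(reward)
            if is_terminal(next_state):
                T = t + 1
            else:
                actions.append(choose(next_state))
        tau = t - n + 1          # time step whose estimate is updated now
        if tau >= 0:
            last = min(tau + n, T)
            # Discounted sum of the next (up to) n rewards.
            G = sum(gamma ** (i - tau - 1) * rewards[i] for i in range(tau + 1, last + 1))
            # Bootstrap from the current estimate unless the episode ended first.
            if tau + n < T:
                G += gamma ** n * q_value(states[tau + n], actions[tau + n])
            learn(states[tau], actions[tau], G)
        if tau == T - 1:
            return T
        t += 1
```

With `n=1` this reduces to one-step Sarsa. Larger `n` delays each update by `n - 1` steps but lets a single update draw on more actual rewards.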
Plot the cost-to-go function learned by semi-gradient Sarsa(0) over the course of training
# print learned cost to go
100%|██████████| 9000/9000 [03:37<00:00, 41.37it/s]
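The cost-to-go at a state is the negative of the best action value there, that is, -max_a q(s, a). The sketch below evaluates it on a regular grid over the state space and leaves the plotting itself out. The function name and the default grid resolution are assumptions chosen for illustration.

```python
POSITION_MIN, POSITION_MAX = -1.2, 0.5
VELOCITY_MIN, VELOCITY_MAX = -0.07, 0.07


def cost_to_go_grid(q_value, grid_size=40, actions=(-1, 0, 1)):
    """Return (positions, velocities, grid) with grid[i][j] = -max_a q(p_i, v_j, a)."""
    def linspace(lo, hi, count):
        return [lo + k * (hi - lo) / (count - 1) for k in range(count)]

    positions = linspace(POSITION_MIN, POSITION_MAX, grid_size)
    velocities = linspace(VELOCITY_MIN, VELOCITY_MAX, grid_size)
    grid = [
        [-max(q_value(p, v, a) for a in actions) for v in velocities]
        for p in positions
    ]
    return positions, velocities, grid
```

Because every step costs -1 and there is no discounting, the value at each grid point can be read as an estimate of the number of steps still needed to reach the goal from that state.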
Performance of semi-gradient Sarsa(0) for different values of α
# Figure 10.2, semi-gradient Sarsa with different alphas
[tqdm progress output: 30 runs of 500 episodes each, about 14 to 22 seconds per run]
Comparing the performance of Sarsa(0) and n-step Sarsa
# Figure 10.3, one-step semi-gradient Sarsa vs multi-step semi-gradient Sarsa
[tqdm progress output: 20 runs of 500 episodes each, about 12 to 15 seconds per run]
Comparing the effect of the step size α and the bootstrapping depth n on the performance of semi-gradient Sarsa
# Figure 10.4, effect of alpha and n on multi-step semi-gradient Sarsa
[tqdm progress output: 11 runs of 50 episodes each, about 2 to 4 seconds per run]