The code is taken from ShangtongZhang's chapter08/trajectory_sampling.py.
It uses an example MDP to compare the performance of uniform sampling and on-policy sampling.
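Problem description
The problem is defined on an MDP with n ordinary states, indexed [0, n), plus a terminal state with index n. Each state has two actions, 0 and 1. Each action moves to the terminal state with probability epsilon; otherwise it moves with equal probability to one of b branch states, and each of these branch states is itself drawn uniformly from the state set S. The MDP is sketched below: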
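The transition rule described above can be written out as a distribution. This is only a restatement of the prose, with the assumed notation $s_1, \dots, s_b$ for the b branch states attached to a pair $(s, a)$:

```latex
\begin{aligned}
P(s' = n \mid s, a) &= \epsilon, \\
P(s' = s_i \mid s, a) &= \frac{1 - \epsilon}{b}, \qquad i = 1, \dots, b, \\
s_i &\sim \mathrm{Uniform}\{0, 1, \dots, n - 1\}.
\end{aligned}
```

Because the $s_i$ are drawn independently, two branches of the same pair may point to the same state, in which case their probabilities add.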
Import modules and define constants
import numpy as np
Define the argmax function, which returns the index of a maximum of value, picking one at random when there are ties
# break tie randomly
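Only the first comment of this function is shown here, so the following is a minimal sketch of an argmax with random tie-breaking, not necessarily the original implementation. The `rng` parameter is an assumption added to make the behaviour reproducible.

```python
import numpy as np


def argmax(values, rng=None):
    """Return the index of a maximal entry, chosen uniformly among ties."""
    rng = np.random.default_rng() if rng is None else rng
    values = np.asarray(values)
    # Indices of every entry equal to the maximum.
    best = np.flatnonzero(values == values.max())
    return int(rng.choice(best))
```

Unlike `np.argmax`, which always returns the first maximal index, this version does not favour action 0 while all values are still equal.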
Define the Task class, which simulates the interaction between the agent and the environment
class Task():
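The class body is not shown here. As a rough sketch of what such an environment could look like under the problem description, the version below stores the b successor states of every state-action pair and samples one transition per `step` call. The value `TERMINATION_PROB = 0.1`, the standard-normal rewards, the zero reward on termination, and the `rng` argument are all assumptions made for illustration.

```python
import numpy as np

TERMINATION_PROB = 0.1  # epsilon in the description; assumed value


class Task:
    def __init__(self, n_states, b, rng=None):
        self.n_states = n_states
        self.b = b
        self.rng = np.random.default_rng() if rng is None else rng
        # transition[s, a, i] is the i-th branch state of (s, a),
        # drawn uniformly from the ordinary states [0, n_states).
        self.transition = self.rng.integers(n_states, size=(n_states, 2, b))
        # reward[s, a, i] is the reward on that branch (assumed N(0, 1)).
        self.reward = self.rng.standard_normal(size=(n_states, 2, b))

    def step(self, state, action):
        """Sample one transition; returns (next_state, reward)."""
        if self.rng.random() < TERMINATION_PROB:
            # Index n_states denotes the terminal state.
            return self.n_states, 0.0
        i = self.rng.integers(self.b)
        return int(self.transition[state, action, i]), float(self.reward[state, action, i])
```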
Compute the average episode reward obtained by following the greedy policy derived from the value function
# Evaluate the value of the start state for the greedy policy
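One way to estimate this quantity is plain Monte Carlo: run several episodes from the start state, always take the greedy action, and average the returns. The sketch below assumes state 0 is the start state and takes the MDP as two arrays, `transition[s, a, i]` and `reward[s, a, i]`; these names and the function signature are hypothetical.

```python
import numpy as np


def evaluate_greedy(q, transition, reward, eps, runs, rng):
    """Average undiscounted return of the greedy policy from state 0."""
    b = transition.shape[2]
    returns = []
    for _ in range(runs):
        state, total = 0, 0.0
        while True:
            # np.argmax takes the first maximum; ties are not randomized here.
            action = int(np.argmax(q[state]))
            if rng.random() < eps:
                break  # jump to the terminal state, reward 0
            i = rng.integers(b)
            total += reward[state, action, i]
            state = int(transition[state, action, i])
        returns.append(total)
    return float(np.mean(returns))
```

The loop terminates with probability 1 only when `eps` is positive, since termination is the only way out of an episode.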
Update the value function with expected updates, sampling state-action pairs uniformly
# perform expected update from a uniform state-action distribution of the MDP @task
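An expected update replaces Q(s, a) by the expectation over all b branches of the reward plus the greedy value of the successor, weighted by the probability 1 - epsilon of not terminating. The sketch below cycles through all state-action pairs in a fixed order, which is one simple way to realize a uniform distribution; the array-based signature and the function names are assumptions.

```python
import numpy as np


def expected_update(q, transition, reward, state, action, eps):
    """In-place expected update of q[state, action] (undiscounted)."""
    next_states = transition[state, action]      # shape (b,)
    targets = reward[state, action] + q[next_states].max(axis=1)
    # Termination (probability eps) contributes a value of 0.
    q[state, action] = (1 - eps) * np.mean(targets)


def uniform_updates(q, transition, reward, eps, n_updates):
    """Apply expected updates while cycling over all (state, action) pairs."""
    n_states, n_actions = q.shape
    for step in range(n_updates):
        state = step // n_actions % n_states
        action = step % n_actions
        expected_update(q, transition, reward, state, action, eps)
    return q
```

On a one-state MDP with rewards 1 and 2 for the two actions and eps = 0.5, the first two updates give Q = [0.5, 1.25], and repeated sweeps approach the fixed point [1.5, 2.0].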
Update the value function with expected updates, sampling state-action pairs on-policy
# perform expected update from an on-policy distribution of the MDP @task
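In the on-policy variant, the pairs to update are the ones actually visited while simulating episodes under the current policy. The sketch below assumes an epsilon-greedy behaviour policy with a separate exploration rate `eps_explore`, a restart at state 0 after termination, and the same hypothetical `transition` / `reward` arrays as above; none of these details are confirmed by the text.

```python
import numpy as np


def on_policy_updates(q, transition, reward, eps_term, eps_explore, n_updates, rng):
    """Expected updates along simulated epsilon-greedy trajectories."""
    n_states, n_actions, b = transition.shape
    state = 0
    for _ in range(n_updates):
        # Choose the action with the epsilon-greedy policy on the current q.
        if rng.random() < eps_explore:
            action = int(rng.integers(n_actions))
        else:
            action = int(np.argmax(q[state]))
        # Expected update of the visited pair.
        next_states = transition[state, action]
        targets = reward[state, action] + q[next_states].max(axis=1)
        q[state, action] = (1 - eps_term) * np.mean(targets)
        # Sample the real transition to continue the trajectory.
        if rng.random() < eps_term:
            state = 0  # episode ended; restart from the start state
        else:
            state = int(transition[state, action, rng.integers(b)])
    return q
```

The update rule is the same as in uniform sampling; only the choice of which pair to update differs, so computation concentrates on states the policy actually reaches.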
Compare the performance of the two sampling methods in a plot
def figure_8_9():
Output: 60 tqdm progress bars of the following form, each covering 20000 iterations in 21 to 23 seconds (roughly 830 to 925 it/s):
100%|██████████| 20000/20000 [00:23<00:00, 863.09it/s]