Chapter04 gamblers_problem

The code is taken from ShangtongZhang's chapter04/gamblers_problem.py.

Gambler's Problem

Problem description

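In each round, a gambler decides how much of the money in hand to stake on a coin flip. If the coin comes up heads, the gambler wins an amount equal to the stake; if it comes up tails, the stake is lost. The game ends when the gambler's capital reaches 100 or when all the money is gone.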
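The round dynamics described above can be sketched as a small simulation. This is only an illustrative sketch and not part of the original code: `play_round`, `is_terminal`, and the use of `numpy.random.default_rng` are names and choices assumed here, and the stake limit `min(capital, GOAL - capital)` follows the action range used in the value iteration code rather than the description itself.

```python
import numpy as np

GOAL = 100  # target capital, as in the problem description


def play_round(capital, stake, head_prob, rng):
    """Simulate one coin flip (hypothetical helper, not in the original code).

    Returns the new capital: capital + stake on heads, capital - stake on tails.
    """
    if not 1 <= stake <= min(capital, GOAL - capital):
        raise ValueError("stake must be between 1 and min(capital, GOAL - capital)")
    heads = rng.random() < head_prob
    return capital + stake if heads else capital - stake


def is_terminal(capital):
    """The episode ends when the gambler is broke or has reached the goal."""
    return capital == 0 or capital == GOAL


rng = np.random.default_rng(0)
capital = 50
while not is_terminal(capital):
    # always stake the maximum allowed amount (an arbitrary illustrative policy)
    capital = play_round(capital, min(capital, GOAL - capital), 0.4, rng)
print(capital)  # either 0 or 100
```

Starting from a capital of 50 and staking the maximum, the episode ends after a single flip, which makes the two terminal outcomes easy to see.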

This problem can be formulated as an MDP with discount=1, where the state is the gambler's capital and the action is the amount staked in each round.
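Written out, this formulation leads to the Bellman optimality equation below, which is the quantity the value iteration code evaluates for each state. The symbols $p_h$ (probability of heads) and $v_*$ (optimal state value) are notation introduced here; the action range and the boundary values are taken from the code, where the value of state 100 is fixed at 1 in place of an explicit reward.

```latex
v_*(s) = \max_{1 \le a \le \min(s,\, 100 - s)}
         \Bigl[\, p_h \, v_*(s + a) + (1 - p_h) \, v_*(s - a) \Bigr],
\qquad v_*(0) = 0, \quad v_*(100) = 1 .
```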

Import modules and define global variables

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

# goal
GOAL = 100

# all states, including state 0 and state 100
STATES = np.arange(GOAL + 1)

# probability of heads, i.e. the probability of winning the stake
HEAD_PROB = 0.4

Value Iteration

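A few points to note:

When the state values are initialized, every state is set to 0 except state 100, which is set to 1. This can be read as: reaching 100 yields reward=1 and every other transition yields reward=0. In other words, the state-value initialization is used to implement the reward.

action=0 is removed during training. A stake of 0 leaves the capital unchanged, so the agent can get stuck in that state (a kind of local optimum); excluding the action forces it out of that point.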
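Both notes above can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the original code: the helper names `backup_with_terminal_value` and `backup_with_explicit_reward` are hypothetical, and the value estimates are random numbers used only to compare the two formulations.

```python
import numpy as np

GOAL = 100
HEAD_PROB = 0.4


def backup_with_terminal_value(value, state, action):
    """Expected return as in the notebook: no explicit reward, value[GOAL] = 1."""
    return HEAD_PROB * value[state + action] + (1 - HEAD_PROB) * value[state - action]


def backup_with_explicit_reward(value, state, action):
    """Same quantity with reward +1 on reaching GOAL and terminal values of 0."""
    def target(next_state):
        reward = 1.0 if next_state == GOAL else 0.0
        next_value = 0.0 if next_state in (0, GOAL) else value[next_state]
        return reward + next_value
    return (HEAD_PROB * target(state + action)
            + (1 - HEAD_PROB) * target(state - action))


# arbitrary value estimates for the non-terminal states, for illustration only
rng = np.random.default_rng(0)
value = rng.random(GOAL + 1)
value[0], value[GOAL] = 0.0, 1.0

# Note 1: both formulations give the same backup for every legal (state, action)
for state in range(1, GOAL):
    for action in range(1, min(state, GOAL - state) + 1):
        a = backup_with_terminal_value(value, state, action)
        b = backup_with_explicit_reward(value, state, action)
        assert np.isclose(a, b)

# Note 2: a stake of 0 just returns the current estimate of the state itself
state = 37
stay = backup_with_terminal_value(value, state, 0)
print(np.isclose(stay, value[state]))  # True
```

Because a stake of 0 always reproduces the current estimate, it can tie with the best real stake in the `argmax`, which is one way to read the "stuck" behavior described in the second note.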

def figure_4_3():
    # state value initialize
    state_value = np.zeros(GOAL + 1)
    state_value[GOAL] = 1.0

    # value iteration
    while True:
        delta = 0.0
        for state in STATES[1:GOAL]:
            # get possible actions (stakes) for the current state; stake 0 is excluded
            actions = np.arange(1, min(state, GOAL - state) + 1)
            action_returns = []
            for action in actions:
                action_returns.append(
                    HEAD_PROB * state_value[state + action] + (1 - HEAD_PROB) * state_value[state - action])
            new_value = np.max(action_returns)
            delta += np.abs(state_value[state] - new_value)
            # update state value
            state_value[state] = new_value
        if delta < 1e-9:
            break

    # compute the optimal policy
    policy = np.zeros(GOAL + 1)
    for state in STATES[1:GOAL]:
        actions = np.arange(1,min(state, GOAL - state) + 1)
        action_returns = []
        for action in actions:
            action_returns.append(
                HEAD_PROB * state_value[state + action] + (1 - HEAD_PROB) * state_value[state - action])

        # round to resemble the figure in the book, see
        # https://github.com/ShangtongZhang/reinforcement-learning-an-introduction/issues/83
        # variant for an action list that starts at stake 0 (skips the first entry):
        # policy[state] = actions[np.argmax(np.round(action_returns[1:], 5)) + 1]
        policy[state] = actions[np.argmax(np.round(action_returns, 5))]

    plt.figure(figsize=(10, 20))

    plt.subplot(2, 1, 1)
    plt.plot(state_value)
    plt.xlabel('Capital')
    plt.ylabel('Value estimates')

    plt.subplot(2, 1, 2)
    plt.scatter(STATES, policy)
    plt.xlabel('Capital')
    plt.ylabel('Final policy (stake)')

    plt.savefig('./figure_4_3.png')
    plt.show()

figure_4_3()

Win probability = 0.4

[Figure: value estimates and final policy (stake) versus capital]

Win probability = 0.1

[Figure: value estimates and final policy (stake) versus capital]

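Win probability = 0.8 (so even with a high win probability, it still doesn't pay to bet recklessly...)

[Figure: value estimates and final policy (stake) versus capital]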
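The three results differ only in the win probability. One way to run them without editing the `HEAD_PROB` global is to pass the probability as an argument. The following is a minimal sketch under that assumption: `solve_gamblers_problem` and its parameters are hypothetical names, and the logic mirrors `figure_4_3` with the plotting left out and the inner action loop vectorized.

```python
import numpy as np


def solve_gamblers_problem(head_prob, goal=100, tol=1e-9):
    """Value iteration for a given win probability (hypothetical helper).

    Returns (state_value, policy), both arrays of length goal + 1.
    """
    state_value = np.zeros(goal + 1)
    state_value[goal] = 1.0

    def action_returns(state):
        # stakes 1 .. min(state, goal - state); stake 0 is excluded
        actions = np.arange(1, min(state, goal - state) + 1)
        returns = (head_prob * state_value[state + actions]
                   + (1 - head_prob) * state_value[state - actions])
        return actions, returns

    while True:
        delta = 0.0
        for state in range(1, goal):
            _, returns = action_returns(state)
            new_value = returns.max()
            delta += abs(state_value[state] - new_value)
            state_value[state] = new_value  # in-place update
        if delta < tol:
            break

    policy = np.zeros(goal + 1)
    for state in range(1, goal):
        actions, returns = action_returns(state)
        # round before argmax so near-ties resolve to the smallest stake
        policy[state] = actions[np.argmax(np.round(returns, 5))]
    return state_value, policy


for p in (0.4, 0.1, 0.8):
    values, policy = solve_gamblers_problem(p)
    print(p, round(values[50], 4), policy[50])
```

The returned arrays can be passed to the same `plt.plot` and `plt.scatter` calls used in `figure_4_3` to draw one figure per probability.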