Chapter03 Finite Markov Decision Processes

A Markov decision process (MDP) is a class of reinforcement learning problems with the Markov property: the next state of the system depends only on the current state and the action taken in the current state.

some notation in MDPs

1. state: S_t

2. action: A_t

3. reward: R_t

4. the probability of those values occurring at time t, given particular values of the preceding state and action:

$$p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}$$

Summing this probability over all s' and r gives 1 for every state–action pair:

$$\sum_{s'} \sum_{r} p(s', r \mid s, a) = 1, \quad \text{for all } s, a$$

5. state-transition probability:

$$p(s' \mid s, a) \doteq \Pr\{S_t = s' \mid S_{t-1} = s, A_{t-1} = a\} = \sum_{r} p(s', r \mid s, a)$$

6. expected reward for a state–action pair:

$$r(s, a) \doteq \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a] = \sum_{r} r \sum_{s'} p(s', r \mid s, a)$$

7. expected reward for a state–action–next-state triple:

$$r(s, a, s') \doteq \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a, S_t = s'] = \sum_{r} r \, \frac{p(s', r \mid s, a)}{p(s' \mid s, a)}$$
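As a concrete illustration (not from the book), here is a minimal Python sketch in which the two-state MDP, its action names, and its reward values are all made up for the example; it shows how quantities 5–7 can be derived from the four-argument dynamics p(s', r | s, a):

```python
# A hypothetical two-state MDP. The dict encodes the four-argument
# dynamics p(s', r | s, a); keys are (s, a), values map (s', r) to a
# probability, and each inner distribution sums to 1, as required above.
p = {
    ("s0", "a0"): {("s0", 0.0): 0.5, ("s1", 1.0): 0.5},
    ("s0", "a1"): {("s1", 0.0): 1.0},
    ("s1", "a0"): {("s0", 2.0): 1.0},
    ("s1", "a1"): {("s1", -1.0): 1.0},
}

def state_transition_prob(s, a, s_next):
    # 5. p(s' | s, a) = sum_r p(s', r | s, a)
    return sum(prob for (sn, r), prob in p[(s, a)].items() if sn == s_next)

def expected_reward(s, a):
    # 6. r(s, a) = sum_{s', r} r * p(s', r | s, a)
    return sum(r * prob for (sn, r), prob in p[(s, a)].items())

def expected_reward_given_next(s, a, s_next):
    # 7. r(s, a, s') = sum_r r * p(s', r | s, a) / p(s' | s, a)
    total = sum(r * prob for (sn, r), prob in p[(s, a)].items() if sn == s_next)
    return total / state_transition_prob(s, a, s_next)

print(state_transition_prob("s0", "a0", "s1"))       # 0.5
print(expected_reward("s0", "a0"))                   # 0.5
print(expected_reward_given_next("s0", "a0", "s1"))  # 1.0
```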

returns of episodic tasks and continuing tasks

what is the return of a task?

If we denote the reward at each step as R_i (i = 0, 1, 2, 3, …), then the expected return, denoted G_t, is defined as some specific function of the reward sequence.

what is an episodic task?

Episodic tasks arise in settings such as plays of a game, trips through a maze, or any sort of repeated interaction. Each episode ends in a special state called the terminal state, followed by a reset to a standard starting state or to a sample from a standard distribution of starting states.

And the next episode begins independently of how the previous one ended.

what is a continuing task?

These are tasks in which the agent–environment interaction does not break naturally into identifiable episodes, but goes on continually without limit. For example, this would be the natural way to formulate an on-going process-control task, or an application to a robot with a long life span.

We call these continuing tasks.

the returns of the above two kinds of task

episodic task:

$$G_t \doteq R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T$$

where T is the final time step of the episode.

continuing task (using discounting):

$$G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

where 0 ≤ γ ≤ 1 is the discount rate.
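A small sketch of computing both kinds of return from a reward sequence (the rewards and γ below are arbitrary, and the infinite sum of the continuing case is truncated, which is the usual numerical approximation):

```python
def episodic_return(rewards):
    # G_t = R_{t+1} + R_{t+2} + ... + R_T (undiscounted, finite episode)
    return sum(rewards)

def discounted_return(rewards, gamma=0.9):
    # G_t = sum_{k>=0} gamma^k * R_{t+k+1}, truncated to the given sequence
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1.0, 0.0, 2.0, 1.0]
print(episodic_return(rewards))          # 4.0
print(discounted_return(rewards, 0.9))   # 1.0 + 0.0 + 1.62 + 0.729 = 3.349
```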

Unified notation

By adding an absorbing state (the solid square) to the episodic task, the two cases can be unified; the absorbing state transitions only to itself and generates rewards of zero:

(Figure: state-transition diagram of an episode ending in an absorbing state, drawn as a solid square.)

unified notation:

$$G_t \doteq \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k$$

including the possibility that T = ∞ or γ = 1 (but not both).
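A quick numeric check of the unification idea, under the assumption that the absorbing state emits reward 0 forever (the episode rewards below are made up):

```python
# Padding an episode with an absorbing state that yields reward 0 forever
# leaves the return unchanged, so the infinite-sum notation also covers
# the episodic case. gamma = 1 is allowed here because the sum is
# effectively finite.
episode_rewards = [1.0, 0.0, 2.0]        # rewards until the terminal state
padded = episode_rewards + [0.0] * 100   # absorbing state: zero reward forever

gamma = 1.0
g_episodic = sum(episode_rewards)
g_unified = sum((gamma ** k) * r for k, r in enumerate(padded))
print(g_episodic, g_unified)  # 3.0 3.0
```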

updating the value function in an MDP

state-value function for policy π:

$$v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s\right]$$

state-action-value function for policy π:

$$q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s, A_t = a\right]$$
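One direct way to estimate v_π from the definition is to average sampled returns, which is the Monte Carlo idea contrasted below. A rough sketch, reusing the same hypothetical two-state dynamics and truncating the infinite return at a finite horizon:

```python
import random

# Hypothetical two-state dynamics p(s', r | s, a) and a random policy.
p = {
    ("s0", "a0"): {("s0", 0.0): 0.5, ("s1", 1.0): 0.5},
    ("s0", "a1"): {("s1", 0.0): 1.0},
    ("s1", "a0"): {("s0", 2.0): 1.0},
    ("s1", "a1"): {("s1", -1.0): 1.0},
}
policy = {"s0": {"a0": 0.5, "a1": 0.5}, "s1": {"a0": 0.5, "a1": 0.5}}

def sample_step(s):
    # sample a ~ pi(.|s), then (s', r) ~ p(., . | s, a)
    a = random.choices(list(policy[s]), weights=list(policy[s].values()))[0]
    outcomes, probs = zip(*p[(s, a)].items())
    return random.choices(outcomes, weights=probs)[0]

def sampled_return(s, gamma=0.9, horizon=200):
    # truncated discounted return G_t from state s
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        s, r = sample_step(s)
        g += discount * r
        discount *= gamma
    return g

n = 5000
print(sum(sampled_return("s0") for _ in range(n)) / n)  # approximates v_pi(s0)
```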

how do we compute the state values under policy π?

Bellman equation for v_π:

$$v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_\pi(s') \right]$$

Instead of averaging the returns of random samples (the Monte Carlo method), the Bellman equation expresses v_π in terms of the model's parameters:

π(a|s) is the policy π: the probability of selecting action a in state s

p(s', r|s, a) is the probability of transitioning to state s' with reward r, given the previous state s and action a; it depends only on s and a, which is the Markov property
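Turning the Bellman equation into an update rule yields iterative policy evaluation. A minimal sketch on the same hypothetical two-state MDP, sweeping the states until the values stop changing:

```python
# Hypothetical dynamics p(s', r | s, a) and an equiprobable random policy.
p = {
    ("s0", "a0"): {("s0", 0.0): 0.5, ("s1", 1.0): 0.5},
    ("s0", "a1"): {("s1", 0.0): 1.0},
    ("s1", "a0"): {("s0", 2.0): 1.0},
    ("s1", "a1"): {("s1", -1.0): 1.0},
}
policy = {"s0": {"a0": 0.5, "a1": 0.5}, "s1": {"a0": 0.5, "a1": 0.5}}
gamma, theta = 0.9, 1e-8
v = {s: 0.0 for s in policy}

while True:
    delta = 0.0
    for s in v:
        # v(s) = sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma * v(s')]
        new_v = sum(
            pi_a * sum(prob * (r + gamma * v[sn])
                       for (sn, r), prob in p[(s, a)].items())
            for a, pi_a in policy[s].items()
        )
        delta = max(delta, abs(new_v - v[s]))
        v[s] = new_v
    if delta < theta:
        break

print(v)  # converged state values v_pi under the random policy
```

The Monte Carlo estimate above and this model-based sweep converge to the same v_π; the sweep needs the full dynamics p, while Monte Carlo only needs samples.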

get the optimal policy π_* and the optimal state-value function

choose the maximum over the actions of state s (the Bellman optimality equations):

$$v_*(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_*(s') \right]$$

$$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \max_{a'} q_*(s', a') \right]$$
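Turning the Bellman optimality equation for v_* into an update rule gives value iteration, and the greedy policy with respect to the converged values is then an optimal policy. Again a sketch on the hypothetical two-state MDP:

```python
# Hypothetical dynamics p(s', r | s, a) and the available actions per state.
p = {
    ("s0", "a0"): {("s0", 0.0): 0.5, ("s1", 1.0): 0.5},
    ("s0", "a1"): {("s1", 0.0): 1.0},
    ("s1", "a0"): {("s0", 2.0): 1.0},
    ("s1", "a1"): {("s1", -1.0): 1.0},
}
actions = {"s0": ["a0", "a1"], "s1": ["a0", "a1"]}
gamma, theta = 0.9, 1e-8
v = {s: 0.0 for s in actions}

while True:
    delta = 0.0
    for s in v:
        # v(s) = max_a sum_{s',r} p(s',r|s,a) [r + gamma * v(s')]
        new_v = max(
            sum(prob * (r + gamma * v[sn]) for (sn, r), prob in p[(s, a)].items())
            for a in actions[s]
        )
        delta = max(delta, abs(new_v - v[s]))
        v[s] = new_v
    if delta < theta:
        break

# greedy policy with respect to v_* is an optimal policy
pi_star = {
    s: max(actions[s], key=lambda a: sum(
        prob * (r + gamma * v[sn]) for (sn, r), prob in p[(s, a)].items()))
    for s in actions
}
print(v, pi_star)
```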