A Markov Decision Process (MDP) is a class of reinforcement learning problems with the Markov property: the next state of the system depends only on the current state and on the action taken in that state.
some notation in MDPs
1. state: S_t
2. action: A_t
3. reward: R_t
4. the probability of those values occurring at time t, given particular values of the preceding state and action; summing this probability over all s' and r gives 1 (equations for items 4–7 are sketched after this list)
5. state-transition probability
6. expected reward for a state-action pair
7. expected reward for a state-action-next-state triple
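The equations for items 4–7 appear to have been lost here; assuming the standard four-argument dynamics notation from Sutton & Barto, they would read:

$$p(s',r \mid s,a) \doteq \Pr\{S_t = s',\, R_t = r \mid S_{t-1} = s,\, A_{t-1} = a\}, \qquad \sum_{s'}\sum_{r} p(s',r \mid s,a) = 1 \ \text{ for all } s, a$$

$$p(s' \mid s,a) \doteq \Pr\{S_t = s' \mid S_{t-1} = s,\, A_{t-1} = a\} = \sum_{r} p(s',r \mid s,a)$$

$$r(s,a) \doteq \mathbb{E}\big[R_t \mid S_{t-1} = s,\, A_{t-1} = a\big] = \sum_{r} r \sum_{s'} p(s',r \mid s,a)$$

$$r(s,a,s') \doteq \mathbb{E}\big[R_t \mid S_{t-1} = s,\, A_{t-1} = a,\, S_t = s'\big] = \sum_{r} r\,\frac{p(s',r \mid s,a)}{p(s' \mid s,a)}$$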
return of episodic tasks and continuing tasks
what is the return of a task?
If we define the reward at each step as R_i (i = 0, 1, 2, 3, …), then the return G_t is defined as some specific function of the reward sequence, and the agent seeks to maximize its expected value.
what is an episodic task?
Such as plays of a game, trips through a maze, or any sort of repeated interaction. Each episode ends in a special state called the terminal state, followed by a reset to a standard starting state or to a sample from a standard distribution of starting states.
And the next episode begins independently of how the previous one ended.
what is a continuing task?
In many cases the agent–environment interaction does not break naturally into identifiable episodes, but goes on continually without limit. For example, this would be the natural way to formulate an on-going process-control task, or an application to a robot with a long life span.
We call these continuing tasks.
the return for the above two kinds of task (formulas sketched below)
episodic task:
continuing task:
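Presumably these are the standard return definitions; with T the final time step and γ ∈ [0, 1] the discount rate, a sketch:

$$\text{episodic:}\qquad G_t \doteq R_{t+1} + R_{t+2} + \cdots + R_T$$

$$\text{continuing:}\qquad G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$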
Unified notation
by adding an absorbing state (the solid square in the transition diagram) at the end of an episodic task; the absorbing state transitions only to itself and generates only rewards of zero:
unified notation:
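Assuming the usual convention, the unified return that covers both cases is

$$G_t \doteq \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k,$$

where either T = ∞ or γ = 1 is allowed, but not both at once.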
updating the value function in an MDP
state-value function for policy π:
state-action-value function for policy π:
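The definitions that belong under these two headings, in standard form (assuming discount rate γ):

$$v_\pi(s) \doteq \mathbb{E}_\pi\big[G_t \mid S_t = s\big] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right]$$

$$q_\pi(s,a) \doteq \mathbb{E}_\pi\big[G_t \mid S_t = s,\, A_t = a\big] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s,\, A_t = a\right]$$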
how do we update the state values under policy π?
Bellman equation for v_π:
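The equation itself seems to be missing here; in the notation used in this post it would be

$$v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s',\,r} p(s',r \mid s,a)\,\big[r + \gamma\, v_\pi(s')\big] \quad \text{for all } s,$$

which expresses v_π(s) recursively in terms of the values of the possible successor states.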
Instead of estimating the average return from random samples (the Monte Carlo method), we use the model's parameters to describe the value:
π(a|s) is the policy π: the probability of choosing action a given state s
p(s',r|s,a) is the transition probability given the previous state s and action a, which follows the Markov property
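To make the update concrete, below is a minimal sketch of iterative policy evaluation, which repeatedly applies this Bellman equation as an update rule. The names `states`, `actions`, `policy`, `dynamics`, and `gamma` are hypothetical inputs describing a small finite MDP, not anything defined in this post.

```python
# Minimal sketch of iterative policy evaluation for a small finite MDP.
# Hypothetical inputs:
#   states  : list of states
#   actions : dict mapping state -> list of available actions
#   policy  : dict mapping (state, action) -> pi(a|s)
#   dynamics: dict mapping (state, action) -> list of (next_state, reward, prob),
#             i.e. the entries of p(s', r | s, a)
#   gamma   : discount rate in [0, 1]
def policy_evaluation(states, actions, policy, dynamics, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in states}          # initial guess v(s) = 0
    while True:
        delta = 0.0
        for s in states:
            v_new = 0.0
            for a in actions[s]:
                # sum over s', r of p(s', r | s, a) * [r + gamma * V(s')]
                backup = sum(prob * (r + gamma * V[s2])
                             for (s2, r, prob) in dynamics[(s, a)])
                v_new += policy[(s, a)] * backup
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new                   # in-place (sweep-by-sweep) update
        if delta < theta:                  # stop once values have converged
            return V
```

Each sweep replaces V(s) by the right-hand side of the Bellman equation; the loop stops once the largest change in a sweep falls below the threshold theta.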
getting the optimal policy π_* and the optimal state-value function
choose the maximum value over the actions available in state s:
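Presumably the Bellman optimality equation was meant here; in the same notation:

$$v_*(s) = \max_{a} \sum_{s',\,r} p(s',r \mid s,a)\,\big[r + \gamma\, v_*(s')\big]$$

and a greedy (optimal) policy simply picks the maximizing action:

$$\pi_*(s) = \arg\max_{a} \sum_{s',\,r} p(s',r \mid s,a)\,\big[r + \gamma\, v_*(s')\big]$$

In code this is just an argmax over the one-step backups; a sketch using the same hypothetical `actions`/`dynamics` format as the policy-evaluation example above:

```python
# Greedy action selection from a value function V_star; `actions` and
# `dynamics` follow the same hypothetical format as the earlier sketch.
def greedy_action(s, actions, dynamics, V_star, gamma=0.9):
    return max(actions[s],
               key=lambda a: sum(prob * (r + gamma * V_star[s2])
                                 for (s2, r, prob) in dynamics[(s, a)]))
```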