A Markov Decision Process (MDP) is a class of reinforcement learning problems with the Markov property: the next state of the system depends only on the current state and on the action taken in that state.
some notation in MDPs
1. state: S_t
2. action: A_t
3. reward: R_t
4. the probability of those values occurring at time t, given particular values of the preceding state and action; summing this probability over all s' and r gives 1 (equations for items 4–7 are sketched after this list)
5. state-transition probability
6. expected reward for a state-action pair
7. expected reward for a state-action-next-state triple
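The equations for items 4–7 appear to have been lost here; assuming the standard four-argument dynamics notation from Sutton & Barto, they would read:

$$p(s',r \mid s,a) \doteq \Pr\{S_t = s',\, R_t = r \mid S_{t-1} = s,\, A_{t-1} = a\}, \qquad \sum_{s'}\sum_{r} p(s',r \mid s,a) = 1 \ \text{ for all } s, a$$

$$p(s' \mid s,a) \doteq \Pr\{S_t = s' \mid S_{t-1} = s,\, A_{t-1} = a\} = \sum_{r} p(s',r \mid s,a)$$

$$r(s,a) \doteq \mathbb{E}\big[R_t \mid S_{t-1} = s,\, A_{t-1} = a\big] = \sum_{r} r \sum_{s'} p(s',r \mid s,a)$$

$$r(s,a,s') \doteq \mathbb{E}\big[R_t \mid S_{t-1} = s,\, A_{t-1} = a,\, S_t = s'\big] = \sum_{r} r\,\frac{p(s',r \mid s,a)}{p(s' \mid s,a)}$$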
return of episodic tasks and continuing tasks
what is the return of a task?
If we define the reward at each step as R_i (i = 0, 1, 2, 3, …), then the return G_t is defined as some specific function of the reward sequence, and the agent seeks to maximize its expected value.
what is an episodic task?
Such as plays of a game, trips through a maze, or any sort of repeated interaction. Each episode ends in a special state called the terminal state, followed by a reset to a standard starting state or to a sample from a standard distribution of starting states.
And the next episode begins independently of how the previous one ended.
what is a continuing task?
In many cases the agent–environment interaction does not break naturally into identifiable episodes, but goes on continually without limit. For example, this would be the natural way to formulate an on-going process-control task, or an application to a robot with a long life span.
We call these continuing tasks.
the return for the above two kinds of task (formulas sketched below)
episodic task:
continuing task:
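Presumably these are the standard return definitions; with T the final time step and γ ∈ [0, 1] the discount rate, a sketch:

$$\text{episodic:}\qquad G_t \doteq R_{t+1} + R_{t+2} + \cdots + R_T$$

$$\text{continuing:}\qquad G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$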
Unified notation
by adding an absorbing state (the solid square in the transition diagram) at the end of an episodic task; the absorbing state transitions only to itself and generates only rewards of zero:
unified notation:
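Assuming the usual convention, the unified return that covers both cases is

$$G_t \doteq \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k,$$

where either T = ∞ or γ = 1 is allowed, but not both at once.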
updating the value function in an MDP
state-value function for policy π:
state-action-value function for policy π:
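The definitions that belong under these two headings, in standard form (assuming discount rate γ):

$$v_\pi(s) \doteq \mathbb{E}_\pi\big[G_t \mid S_t = s\big] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right]$$

$$q_\pi(s,a) \doteq \mathbb{E}_\pi\big[G_t \mid S_t = s,\, A_t = a\big] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s,\, A_t = a\right]$$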
how do we update the state values under policy π?
Bellman equation for v_π:
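The equation itself seems to be missing here; in the notation used in this post it would be

$$v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s',\,r} p(s',r \mid s,a)\,\big[r + \gamma\, v_\pi(s')\big] \quad \text{for all } s,$$

which expresses v_π(s) recursively in terms of the values of the possible successor states.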
Instead of estimating the average return from random samples (the Monte Carlo method), we use the model's parameters to describe the value:
π(a|s) is the policy π: the probability of choosing action a given state s
p(s',r|s,a) is the transition probability given the previous state s and action a, which follows the Markov property
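To make the update concrete, below is a minimal sketch of iterative policy evaluation, which repeatedly applies this Bellman equation as an update rule. The names `states`, `actions`, `policy`, `dynamics`, and `gamma` are hypothetical inputs describing a small finite MDP, not anything defined in this post.

```python
# Minimal sketch of iterative policy evaluation for a small finite MDP.
# Hypothetical inputs:
#   states  : list of states
#   actions : dict mapping state -> list of available actions
#   policy  : dict mapping (state, action) -> pi(a|s)
#   dynamics: dict mapping (state, action) -> list of (next_state, reward, prob),
#             i.e. the entries of p(s', r | s, a)
#   gamma   : discount rate in [0, 1]
def policy_evaluation(states, actions, policy, dynamics, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in states}          # initial guess v(s) = 0
    while True:
        delta = 0.0
        for s in states:
            v_new = 0.0
            for a in actions[s]:
                # sum over s', r of p(s', r | s, a) * [r + gamma * V(s')]
                backup = sum(prob * (r + gamma * V[s2])
                             for (s2, r, prob) in dynamics[(s, a)])
                v_new += policy[(s, a)] * backup
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new                   # in-place (sweep-by-sweep) update
        if delta < theta:                  # stop once values have converged
            return V
```

Each sweep replaces V(s) by the right-hand side of the Bellman equation; the loop stops once the largest change in a sweep falls below the threshold theta.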
getting the optimal policy π_* and the optimal state-value function
choose the maximum value over the actions available in state s:
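Presumably the Bellman optimality equation was meant here; in the same notation:

$$v_*(s) = \max_{a} \sum_{s',\,r} p(s',r \mid s,a)\,\big[r + \gamma\, v_*(s')\big]$$

and a greedy (optimal) policy simply picks the maximizing action:

$$\pi_*(s) = \arg\max_{a} \sum_{s',\,r} p(s',r \mid s,a)\,\big[r + \gamma\, v_*(s')\big]$$

In code this is just an argmax over the one-step backups; a sketch using the same hypothetical `actions`/`dynamics` format as the policy-evaluation example above:

```python
# Greedy action selection from a value function V_star; `actions` and
# `dynamics` follow the same hypothetical format as the earlier sketch.
def greedy_action(s, actions, dynamics, V_star, gamma=0.9):
    return max(actions[s],
               key=lambda a: sum(prob * (r + gamma * V_star[s2])
                                 for (s2, r, prob) in dynamics[(s, a)]))
```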