Temporal Difference method 可以说是强化学习算法的代表，它结合了Monte Carlo方法和DP方法的优点：
TD方法不需要借助环境的信息(model-free)，可以直接通过模拟得到经验进行学习；
同时它像DP算法一样，一个state的估计(estimate)可以通过其它部分的估计来进行更新，而不需要等待完整的episode结束。

TD Prediction

TD方法和MC方法都是根据生成的经验来学习的，但是MC方法需要在episode结束后才能更新，比如对一个非平稳问题有如下更新规则的every-visit MC方法：

$V(S_{t})\leftarrow V(S_{t})+\alpha (G_{t}-V(S_{t}))$

TD methods need to wait only until the next time step. At time t+1 they immediately form a target and make a useful update using the observed reward
R_{t+1} and the estimate V(S_{t+1}). The simplest TD method makes the update, which we call it as TD(0):

$V(S_{t})\leftarrow V(S_{t})+\alpha \left [ R_{t+1}+\gamma V(S_{t+1})-V(S_{t}) \right ]$

接着给出TD(0)方法预测(prediction)的伪代码：

png

关于TD方法的理论基础，我们需要分析一下target的由来。target的概念在第二章Bandit的增量实现那一部分提到过，在强化学习中使用的更新通式：

$NewEstimate\leftarrow OldEstimate+StepSize[Target-OldEstimate]$

从第三章我们推导出这样的公式：

png

可以看到Monte Carlo方法是使用(6.3)的近似来作为target，使用样本返回的reward来近似；DP方法使用(6.4)的近似作为target。TD方法将两者结合起来，首先它并没有使用R_{t+1}的期望，而是使用了一次抽样的reward来近似，其次没有使用基于policy π的expect value，而是使用了它当前的近似V来更新。所以可以认为TD方法结合了Monte Carlo的抽样原理，使它不必基于model，同时基于DP的原理使得TD可以快速迭代更新。

这一部分关于TD方法的预测基本完成了，但是书中还提到了一个量δ_t，这个值是update rule右边括号里的值，是V(s)更新到更好的值所使用的偏差。因为需要采样来进行更新，所以δ_t只有到t+1时刻才可以知道。

如果假设V array的值不变，或者因为step size的值α很小，认为V array的更新比较缓慢，则有下述等式近似成立：

png

Generalizations of this identity play an important role in the theory and algorithms of temporal-difference learning.

Advantages of TD Prediction Methods

这部分回答了几个问题：

1、TD方法比DP好吗？比MC好吗？

TD vs DP:

Obviously, TD methods have an advantage over DP methods in that they do not require a model of
the environment, of its reward and next-state probability distributions.

TD vs MC:

(1)One of most obvious advantage of TD methods over Monte Carlo methods is that they are naturally
implemented in an on-line, fully incremental fashion.

(2)With Monte Carlo methods one must wait until
the end of an episode, because only then is the return known, whereas with TD methods one need wait
only one time step. Surprisingly often this turns out to be a critical consideration. Some applications
have very long episodes, so that delaying all learning until the end of the episode is too slow. Other
applications are continuing tasks and have no episodes at all.

(3)Finally, as we noted in the previous
chapter, some Monte Carlo methods must ignore or discount episodes on which experimental actions
are taken, which can greatly slow learning. TD methods are much less susceptible to these problems
because they learn from each transition regardless of what subsequent actions are taken.

2、TD方法收敛吗？

答案是肯定的。对于任意给定的policy π，TD(0)方法的结果被证明会收敛到v_π，如果使用一个小的固定step size那么收敛情况是平均的(我觉得应该是在附近很小的振荡吧)或者变化的，如果使用满足如下条件的递减step size则一定会收敛v_π。

3、TD方法和MC方法那个训练起来更快？

这个目前还没有人能通过数学推导出两者的速度孰快孰慢，但是一般来说TD方法总是比constant-α MC方法快一些。

这部分的实验Example6.2 Random Walk的代码可以看这个。

Optimality of TD(0)

这一部分主要讲了batch updating的优化方法，这个方法不同于之前的方法,V(S_t)不是在得到V(S_{t+1})后就直接更新了,而是先模拟了n个完整的episode,然后再用每个episode计算相应的target值,相当于原来增量以episode为单位的总和，再用这个总和进行更新，如果value收敛了，这个episode就算用完了，再换下一个episode继续操作。这种方法适用于训练样本比较少的情况，在相同样本数量下这种方法训练速度较慢，而且训练结果从和理论值计算得到的rms error来看并没有很大的优化。

使用batch updating方法更新的python代码

其中的temporal_difference(value,batch)函数是使用TD方法更新的函数，如果batch=True则不执行数据更新，只返回trajectory和reward的两个array；

monte_carlo(value,batch)函数是使用MC方法更新的函数，如果batch=True，只返回trajectory和return的两个array。

同时因为这里使用的是MRP(example6.3)，所以可以直接计算state-value(model-based)。

for r in tqdm(range(0, runs)):
    # current_value是当前对v_π的近似(estimate)
    current_values = np.copy(VALUES)
    trajectories = []
    rewards = []
    for ep in range(episodes):
        if method == 'TD':
            # trajectory_ 是单次episode的轨迹，包括了(1)中提到的S0,S1,S2,...,S_n
            # rewards在TD方法里指的是每一步action引入的reward。
            trajectory_, rewards_ = temporal_difference(current_values, batch=True)
        else:
            # trajectory_ 是单次episode的轨迹，包括了(1)中提到的S0,S1,S2,...,S_n
            # rewards在MC方法里指的是每个state的return
            trajectory_, rewards_ = monte_carlo(current_values, batch=True)
        
        # 将每一个episode产生的trajectory还有rewards都推进宏观的trajectories和rewards
        # 这两个list随着实验进行长度迅速增长
        trajectories.append(trajectory_)
        rewards.append(rewards_)
        while True:

            # update相当于累加版的target
            update = np.zeros(7)
            
            for trajectory_, rewards_ in zip(trajectories, rewards):
                # len(trajectory)-1表明不考虑终止状态
                # 使用所有的trajectory和reward计算所以trajectory上的state的update
                for i in range(0, len(trajectory_) - 1):
                    if method == 'TD':
                        updates[trajectory_[i]] += rewards_[i] + current_values[trajectory_[i + 1]] - current_values[trajectory_[i]]
                    else:
                        updates[trajectory_[i]] += rewards_[i] - current_values[trajectory_[i]]
            updates *= alpha
            if np.sum(np.abs(updates)) < 1e-3:
                break
            # 使用update进行一次batch update
            current_values += updates

现在来讨论一下Example6.4并对TD和MC性能差异给出解释

首先给出8个episode：

A, 0, B, 0 B, 1

B, 1 B, 1

B, 1 B, 0

这是一条参数未知的MRP，如何根据给出的episode来估计其中的参数？

首先来讨论一下value(B)，因为TD(0)method 和MC method更新最后一个state的方法一致，所有最终会收敛到最大似然估计值：0.75。

最大似然估计值在《统计学习方法》里经常可以看到，抛开详细的证明，根据频率来逼近概率的思路就是最大似然估计，所以这里对v(B)的最大似然估计就是出现1的概率：0.75。

但是在v(A)的计算中，TD和MC方法出现了分歧：

MC方法直接从train data中来，所有只有一次A出现，而且G=0，所以很自然v(A)=0;

但是如果用TD方法来估计，因为reward(action:A->B)=0，所以最终v(A)会和v(B)一起收敛到0.75。

比较这两个方法可以看到，MC方法在train data上计算出来的rms error无疑更小，但是我们仍然认为TD方法更好，因为TD方法在未来或者说未知数据上可能会有更好的近似性。PS:这个问题比较像欠拟合和过拟合的辨析。

这里有一个问题，关于last state value，在github issue上问了一下代码作者，他给出的解释是最后一个state的value始终是0。的确是这样的，因为最后一个state都不更新action了，所以不会有value更新的。思考之后，我觉得问题出在Example6.4上，因为它这里让我预测的B的value，正是最后一个state的value!所以如果需要代码解决这个问题，需要在B之后再加上两个分别对应reward=0和reward=1的terminal state才行。

从Example6.4总结得到结论

batch MC方法总是找到使训练集上的均方误差(rms error)最小的estimate，而batch-TD(0)总是找到对马尔可夫过程的最大似然模型精确的估计。

通常，参数的最大似然估计是其生成数据的概率最大的参数值。在这种情况下，最大似然估计得到的是马尔可夫过程的模型，该模型以明显的方式由观测事件形成：从i到j的转移概率的估计是从i到j的观测转移的频率，并且关联的期望报酬是在这些转变中观察到的回报。给定这个模型，我们可以计算值函数的估计，如果模型是正确的，则该估计将完全正确。这被称为确定性等价性估计(certainty-equivalence estimate)，因为它等价于假定基础过程的估计是确定的，而不是近似的。总体上，批处理TD(0)收敛于确定性等价估计。

总体来说，TD方法得到的estimate是最大似然估计(maximum-likelihood)，得到的是确定性等价评估；而MC方法则太过于注重样本的结果，得到的是训练样本的无差估计。虽然这个结论是我们从batch updating得到的，但是nonbatch算法的收敛趋势和batch updating是一致的，即总的方向是粗略的朝着batch updating的方向的。从而我们证明了TD方法训练效果和收敛速度快于constant-α MC。

Sarsa: On-policy TD Control

从预测算法得到控制算法或者提升算法(chapter04)，重要的是引入greedy-policy。在DP方法那一章引入的GPI方法几乎给出了控制算法的”通解”。所以可以仿照上一章给出的基于MC方法的on-policy控制算法，给出TD方法的on-policy控制算法——Sarsa:

png

target-policy:ε-greedy or ε-soft policies.

convergence properties of the Sarsa algorithm:

depend on the nature of the policy’s dependence on Q, ε-greedy or ε-soft policies are OK. Sarsa converges with probability 1 to an optimal policy and action-value function as long as all state–action pairs are visited an infinite number of times and the policy converges in the limit to the greedy policy (which can be arranged, for example, with ε-greedy policies by setting ε = 1/t).

Q-learning: Off-policy TD Control

Q-learning 是off-policy的控制方法，因为在update Q的时候，使用的不是behavior policy(比如ε-greedy)，而是使用的next-state对应的maximization state-action-value。伪代码如下：

png

Expected Sarsa

如果把Sarsa的update规则更换为如下：

png

就是Expected sarsa算法。

Expected sarsa算法相比Sarsa方法表现更好，因为引入了期望来更新Q，一定程度上消除了因为随机选择Action带来的方差(variance)。从cliff-walk的例子中我们可以看到，state的转换都是确定的，随机性或者说不确定性都是policy引入的，因为Expected Sarsa引入了期望，所以可以设置α=1，并且不会降低渐近性能(asymptotic performance)，而Sarsa只能把α设置的低一些才能有效的运行，并且只有长期训练才有好的效果。关于渐近性能和临时性能的比较，可以参考post。

Maximization Bias and Double Learning

所谓的最大正偏差，产生的原因就是在算法中使用的maximization操作。这种maximization操作无形中将估计Q值中的最大值用作了对最大q*的估计值：

In these algorithms, a maximum over estimated values is used implicitly as an estimate of the maximum value, which can lead to a significant positive bias.

比如，假设只有一个state，并且它有很多对q(s,a)=0。但是estimate of q(s,a)有大于0的也有小于0的。如果把估计值中的最大值用作了estimate of maximization q*(s,a)，那么就会产生一个正偏差。

那么如何解决这个问题那？

One way to view the problem is that it is due to using the same samples (plays) both to determine the maximizing action and to estimate its value.

意思就是维护两个q(s,a)的估计Q1和Q2，当估计max q*的时候，可以先找到对应最大Q1的A*，然后使用Q2(A*)来估计max q*。原理就是E[Q2(A*)] = q(A*)这个期望公式是无偏的。
同理也可以反过来更新Q1，二者谁更新的问题就好比抛硬币正反面一样，可以使用二项分布来决定。这种算法和Q-learning结合起来就可以得到避免最大正向偏差的控制算法：Double Q-learning.

更新规则下例，反过来同理:

png

Example6.7代码可以参考这个post

Games, Afterstates, and Other Special Cases

算法根据不同的情况也需要进行一些调整，给出的方法只是一种参考格式，并不一定适应所有问题。

一个例子就是在导论中学习的tic-tac-toe问题，可以看到我们在这里学习的value function并不每一步的state-action-value，而是采取action后当前棋盘的所有棋子分布的value。因为只考虑action的话，就无法将对手的走子情况考虑进去，作为一个竞技游戏很明显是不合适的。比如下面的两种棋盘，虽然所处state和采取action都不同，但是结果是一样的，所以这种afterstate的更新方法更加有效可靠。

png