Chapter02 softmax-theory

This post records my understanding of the deeper-insight box on p. 29: "The Bandit Gradient Algorithm as Stochastic Gradient Ascent".

1. Understanding why it comes from stochastic gradient ascent

Before we get into this question, let's look at what Wikipedia says about using softmax in reinforcement learning.

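I suspect you are as surprised as I was: the H_t(a) from the book has become q_t(a)/τ here. τ is the temperature parameter. As τ -> ∞, all actions get (nearly) the same selection probability, so the policy is exploratory. As τ -> 0, the action with the largest q_t(a) gets selected, and the smaller τ is, the closer the probability of selecting that action gets to 1.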
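To make the effect of the temperature concrete, here is a minimal sketch of softmax action selection over q_t(a)/τ. The function name `softmax_with_temperature` and the example values in `q` are my own assumptions for illustration; they do not come from the book or from Wikipedia.

```python
import math

def softmax_with_temperature(q_values, tau):
    """Return action probabilities proportional to exp(q_t(a) / tau)."""
    scaled = [q / tau for q in q_values]
    # Subtract the max before exponentiating for numerical stability;
    # this leaves the resulting probabilities unchanged.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

q = [1.0, 2.0, 3.0]  # hypothetical action-value estimates q_t(a)

for tau in (100.0, 1.0, 0.01):
    probs = softmax_with_temperature(q, tau)
    print(tau, [round(p, 3) for p in probs])
```

With a large τ the three probabilities come out almost equal, and with a very small τ nearly all of the probability mass sits on the action with the largest q_t(a).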

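This is really interesting: Wikipedia directly explains why it is reasonable to select actions with the softmax function, but that is exactly the question this post wants to discuss @_@.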

OK, back to the title. Here is what the book says:

In exact gradient ascent, each preference H_t(a) would be incremented proportional to the increment’s effect on performance:

and it gives the formula:

$H_{t+1}(a)\doteq H_{t}(a)+\alpha \frac{\partial E[R_{t}]}{\partial H_{t}(a)}$

E[R_t] is defined as a sum over all actions x:

$E[R_{t}]=\sum_{x}\pi _{t}(x)q_{*}(x)$

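From the gradient ascent point of view: if the derivative of the expected reward E[R_t] with respect to H_t(a) is positive, then increasing H_t(a) increases E[R_t]; if it is negative, then decreasing H_t(a) increases E[R_t]. In short, the increment to H_t(a) is proportional to the partial derivative of E[R_t] with respect to H_t(a).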
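The update rule above can be tried out numerically. The sketch below is only an illustration under strong assumptions: it pretends the true values q_*(x) are known (which is never the case in a real bandit problem), and it estimates the partial derivatives by finite differences instead of using a closed-form expression. The names `expected_reward`, `numerical_gradient` and `gradient_ascent`, the values in `q_star`, and the step size are all hypothetical.

```python
import math

def softmax(prefs):
    """pi_t(x): softmax over the preferences H_t(x)."""
    m = max(prefs)
    exps = [math.exp(h - m) for h in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def expected_reward(prefs, q_star):
    # E[R_t] = sum_x pi_t(x) * q_*(x)
    return sum(p * q for p, q in zip(softmax(prefs), q_star))

def numerical_gradient(prefs, q_star, eps=1e-6):
    # Central finite differences for dE[R_t]/dH_t(a), one action at a time.
    grad = []
    for a in range(len(prefs)):
        up = list(prefs)
        up[a] += eps
        down = list(prefs)
        down[a] -= eps
        diff = expected_reward(up, q_star) - expected_reward(down, q_star)
        grad.append(diff / (2 * eps))
    return grad

def gradient_ascent(q_star, alpha=0.5, steps=200):
    prefs = [0.0] * len(q_star)  # start with equal preferences
    history = [expected_reward(prefs, q_star)]
    for _ in range(steps):
        grad = numerical_gradient(prefs, q_star)
        # H_{t+1}(a) = H_t(a) + alpha * dE[R_t]/dH_t(a)
        prefs = [h + alpha * g for h, g in zip(prefs, grad)]
        history.append(expected_reward(prefs, q_star))
    return prefs, history

q_star = [0.2, 1.0, 0.5]  # hypothetical true action values
prefs, history = gradient_ascent(q_star)
print(history[0], history[-1])
print(prefs)
```

In this toy run the preference of the action with the largest q_*(x) grows relative to the others, and the expected reward ends up higher than where it started.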

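Computing the partial derivative

The derivation goes as follows:

$\begin{aligned}
\frac{\partial E[R_{t}]}{\partial H_{t}(a)} &= \frac{\partial}{\partial H_{t}(a)}\left[\sum_{x}\pi_{t}(x)q_{*}(x)\right] \\
&= \sum_{x}q_{*}(x)\frac{\partial \pi_{t}(x)}{\partial H_{t}(a)} \\
&= \sum_{x}\left(q_{*}(x)-B_{t}\right)\frac{\partial \pi_{t}(x)}{\partial H_{t}(a)}
\end{aligned}$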

The tricky part of the derivation is the introduction of B_t in the third step. B_t, called the baseline, can be any scalar that does not depend on x. We can include a baseline here without changing the equality because the gradient sums to zero over all the actions:

$\sum _{x}\frac{\partial \pi _{t}(x)}{\partial H_{t}(a)}=0$
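The reason the sum vanishes is that the probabilities π_t(x) always add up to one, so any change in H_t(a) can only move probability between actions. This can be checked numerically. The sketch below estimates each ∂π_t(x)/∂H_t(a) by finite differences and compares the weighted sum with and without a baseline. The helper `dpi_dH` and all the numbers (preferences, action values, baseline) are made up for illustration.

```python
import math

def softmax(prefs):
    """pi_t(x): softmax over the preferences H_t(x)."""
    m = max(prefs)
    exps = [math.exp(h - m) for h in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def dpi_dH(prefs, x, a, eps=1e-6):
    # Central finite-difference estimate of d pi_t(x) / d H_t(a).
    up = list(prefs)
    up[a] += eps
    down = list(prefs)
    down[a] -= eps
    return (softmax(up)[x] - softmax(down)[x]) / (2 * eps)

prefs = [0.3, -1.2, 2.0, 0.5]    # hypothetical preferences H_t(x)
q_star = [1.0, 0.0, 2.5, -0.5]   # hypothetical true action values
baseline = 7.0                   # arbitrary scalar B_t, independent of x
a = 2                            # the action whose preference we vary
n = len(prefs)

# Sum of the partial derivatives over all actions x: should be about zero.
grad_sum = sum(dpi_dH(prefs, x, a) for x in range(n))

# The weighted sums with and without the baseline should agree.
without_baseline = sum(q_star[x] * dpi_dH(prefs, x, a) for x in range(n))
with_baseline = sum((q_star[x] - baseline) * dpi_dH(prefs, x, a) for x in range(n))

print(grad_sum)
print(without_baseline, with_baseline)
```

Up to floating-point noise, the first printed value is zero and the last two coincide, which is what lets B_t be subtracted without changing the equality.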