Chapter02 Multi-armed Bandits

The k-armed bandit problem is arguably the simplest task in reinforcement learning, since it only involves choosing actions in a single state. This chapter gives a first look at the goal of reinforcement learning and at how methods are evaluated and trained.

2.1.1 What Is the k-armed Bandits Problem

You are faced repeatedly with a choice among k different options, or actions. After each choice you receive a numerical reward chosen from a stationary probability distribution that depends on the action you selected. Your objective is to maximize the expected total reward over some time period, for example, over 1000 action selections, or time steps.

Roughly speaking: at each step there are k actions to choose from, each yielding a different reward, and the goal is to maximize the total reward accumulated over some number of steps.

In our k-armed bandit problem, each of the k actions has an expected or mean reward given that that action is selected; let us call this the value of that action:

$$q_*(a) \doteq \mathbb{E}[R_t \mid A_t = a]$$

2.1.2 The Difficulty of the k-armed Bandits Problem

The main difficulty is the trade-off between exploration and exploitation. The explanation below is adapted from the reinforcement learning section of Zhou Zhihua's "watermelon book" (Machine Learning):

If the goal were only to learn each arm's expected reward, we could use an exploration-only approach: divide all trial opportunities evenly among the arms, and at the end take each arm's observed average payout rate as an approximate estimate of its expected reward. If the goal were only to take the action with the highest reward, we could use an exploitation-only approach: always pull the arm that currently looks best. The former estimates each arm's reward well but gives up many chances to pull the optimal arm, so it cannot obtain the best return; the latter is just the opposite: it never estimates the arms' expected rewards well and is likely to miss the optimal arm frequently. Neither approach alone can maximize the final cumulative reward.

In experiments, the exploration-only approach converges to results noticeably worse than exploitation-only, while exploitation-only itself also performs poorly.

2.2 Action-value Methods

This heading is copied directly from the book. Since the previous section introduced the value of an action, we need an algorithm that quantifies this value.

We begin by looking more closely at some simple methods for estimating the values of actions and for using the estimates to make action selection decisions. Recall that the true value of an action is the mean reward when that action is selected. One natural way to estimate this is by averaging the rewards actually received:

$$Q_t(a) \doteq \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i = a}}$$

How, then, do we select an action?

greedy action: at time t, select the action a with the largest $Q_t(a)$, i.e. $A_t = \operatorname{argmax}_a Q_t(a)$.

ε-greedy action: behave greedily most of the time, but every once in a while, say with small probability ε, instead select randomly from among all the actions with equal probability, independently of the action-value estimates.
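As a concrete illustration (my own sketch, not code from the book), an ε-greedy selector over an array of current estimates $Q_t(a)$ might look like this; `rng` is assumed to be a `numpy.random.Generator`:

```python
import numpy as np

def epsilon_greedy(q_estimates, epsilon, rng):
    """Explore with probability epsilon, otherwise pick the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_estimates)))   # explore: uniform over all actions
    return int(np.argmax(q_estimates))               # exploit (ties go to the first maximum)

# usage: epsilon_greedy(np.zeros(10), 0.1, np.random.default_rng(0))
```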

2.3 The 10-armed Testbed

The 10-armed Testbed is the basic setup for testing k-armed bandit algorithms; the evaluations that follow are all based on it.

What is it?

In short, the testbed consists of 2000 randomly generated 10-armed bandit problems. For each problem, the value q*(a) of each of the 10 actions is drawn from a normal distribution with mean 0 and variance 1. When an algorithm is run on a problem, the actual reward for a selected action is drawn from a normal distribution centered on that action's value (mean = q*(a), variance = 1). The greedy and ε-greedy algorithms are then run on each problem for 1000 steps, and the results are averaged over the 2000 problems to produce curves of average reward (or % optimal action) versus steps, which are used to judge the algorithms' performance.

For the comparison of the algorithms, see the blog post with the Python code.
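For orientation, here is a rough sketch (mine, not the linked blog's code) of how the testbed itself can be generated with numpy; `q_star` and `reward` are names I made up:

```python
import numpy as np

rng = np.random.default_rng(0)
n_problems, k = 2000, 10

# True action values q*(a) for every problem, each drawn from N(0, 1).
q_star = rng.normal(loc=0.0, scale=1.0, size=(n_problems, k))

def reward(problem, action):
    """Actual reward for one pull: N(q*(a), 1) for the given problem and action."""
    return rng.normal(loc=q_star[problem, action], scale=1.0)

r = reward(0, 3)   # e.g. one reward from arm 3 of problem 0
```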

2.4 Incremental Implementation

This section presents an incremental way of computing the estimates, which reminds me of incremental vs. positional PID control, QAQ. The underlying idea is actually quite similar.

previous edition (the plain sample-average form):

$$Q_n \doteq \frac{R_1 + R_2 + \cdots + R_{n-1}}{n-1}$$

PS: this formula looks a lot like the V(s) update rule for tic-tac-toe in Chapter 1.

But there is still a difference. In Chapter 1, V(s) was updated only for greedy moves; here, because the actions are independent of one another, if we did not also update on exploratory selections we would have no effective way of finding the optimal action, and the exploratory step would be wasted.

incremental edition:

$$Q_{n+1} = Q_n + \frac{1}{n}\bigl[R_n - Q_n\bigr]$$

incremental bandit algorithm:

(figure: the book's pseudocode box "A simple bandit algorithm", combining ε-greedy selection with the incremental update)
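A minimal Python sketch of that loop (my own code, under the testbed assumptions above): ε-greedy selection plus the incremental sample-average update, run on a single 10-armed problem:

```python
import numpy as np

def run_bandit(k=10, steps=1000, epsilon=0.1, seed=0):
    """ε-greedy bandit with the incremental sample-average update (a sketch, not the book's code)."""
    rng = np.random.default_rng(seed)
    q_star = rng.normal(0.0, 1.0, size=k)     # true action values of this problem
    Q = np.zeros(k)                           # value estimates
    N = np.zeros(k, dtype=int)                # selection counts
    rewards = np.empty(steps)

    for t in range(steps):
        if rng.random() < epsilon:
            a = int(rng.integers(k))          # explore
        else:
            a = int(np.argmax(Q))             # exploit
        r = rng.normal(q_star[a], 1.0)        # reward ~ N(q*(a), 1)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]             # Q_{n+1} = Q_n + (1/n)[R_n - Q_n]
        rewards[t] = r
    return rewards, Q

rewards, Q = run_bandit()
```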

an interesting update rule

The update rule above is of a form that occurs frequently throughout this book. Its general form is:

$$\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize}\,\bigl[\text{Target} - \text{OldEstimate}\bigr]$$

about this rule:

1、The expression [Target−OldEstimate] is an error in the estimate. It is reduced by taking a step toward the “Target.” The target is presumed to indicate a desirable direction in which to move, though it may be noisy. In the case above, for example, the target is the nth reward.

2、Note that the step-size parameter (StepSize) used in the incremental method described above changes from time step to time step. In processing the nth reward for action a, the method uses the step-size parameter 1/n. In this book we denote the step-size parameter by α or, more generally, by α_t(a).

2.5 Tracking a Nonstationary Problem

All of the bandit methods discussed above assume that the reward probabilities do not change; for example, the testbed specifies that reward(a) always follows a distribution with mean q*(a) and variance 1. In real reinforcement learning tasks, the opposite is common: the problem is nonstationary and the reward distributions change over time.

Consider the general form of the update rule from the previous section. One effective way to handle a nonstationary problem is to use a constant step-size parameter α, because this increases the weight of recent rewards and decreases the weight of rewards from long ago, as the following expansion shows:

$$Q_{n+1} = Q_n + \alpha\bigl[R_n - Q_n\bigr] = (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha(1-\alpha)^{n-i} R_i$$

Since α < 1 (so 1 − α < 1), clearly the smaller i is, the smaller the weight α(1−α)^{n−i} given to R_i.

There is one caveat: if we want the value estimates to converge, the step sizes must satisfy the following conditions:

$$\sum_{n=1}^{\infty} \alpha_n(a) = \infty \qquad \text{and} \qquad \sum_{n=1}^{\infty} \alpha_n^2(a) < \infty$$

The first condition guarantees that the steps are large enough to eventually overcome the error in the initial value and any fluctuations; the second guarantees that the steps eventually become small enough for the estimates to converge.

When α_n = 1/n both conditions are clearly satisfied, but with a constant α the second condition fails. For the nonstationary case, however, that is exactly what we want, since the estimates then keep tracking the most recent rewards. The proof is not given in the book; maybe I will fill it in next time I come across it.
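To make the contrast concrete, here is a tiny sketch (mine) of the two update rules side by side; only the step size differs, and with a constant α the weight on an old reward R_i after n updates is α(1−α)^(n−i):

```python
def update_sample_average(Q, n, R):
    """Q_{n+1} = Q_n + (1/n)(R_n - Q_n): every past reward ends up weighted equally."""
    return Q + (R - Q) / n

def update_constant_alpha(Q, R, alpha=0.1):
    """Q_{n+1} = Q_n + alpha (R_n - Q_n): recent rewards are weighted more heavily,
    so the estimate keeps tracking a nonstationary target."""
    return Q + alpha * (R - Q)
```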

2.6 Optimistic Initial Values (encouraging exploration)

We saw above that when α = 1/n the Q_1 term is eliminated, but in most cases Q_1 does influence learning; for example, when α is constant, Q_1 is retained, only multiplied by a rather small coefficient $(1-\alpha)^n$...

In fact, this kind of bias is usually not a problem and can sometimes be very helpful. For example, in the testbed, suppose I set the initial estimates to +5. Since the true action values are drawn from a normal distribution with mean 0 and variance 1, an initial estimate of +5 is wildly optimistic, and no matter which action is selected the early rewards will be below 5: whichever actions are initially selected, the reward is less than the starting estimates; the learner switches to other actions, being "disappointed" with the rewards it is receiving. The result is that all actions are tried several times before the value estimates converge. The system does a fair amount of exploration even if greedy actions are selected all the time.

For a nonstationary problem, however, the distributions change over time, and this trick only encourages exploration once, at the very beginning of training, so it will not work well there; for the stationary, sample-average case it does have some effect.
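Here is a sketch (my own code, under the testbed assumptions) of the optimistic-initial-values experiment: a purely greedy learner with a constant step size, roughly the setup used in the book's comparison (Q_1 = +5, ε = 0, α = 0.1):

```python
import numpy as np

def run_optimistic(k=10, steps=1000, q0=5.0, alpha=0.1, seed=0):
    """Greedy (ε = 0) learner with optimistic initial estimates and a constant step size."""
    rng = np.random.default_rng(seed)
    q_star = rng.normal(0.0, 1.0, size=k)
    Q = np.full(k, q0)                        # optimistic start: every arm looks great
    rewards = np.empty(steps)
    for t in range(steps):
        a = int(np.argmax(Q))                 # purely greedy selection
        r = rng.normal(q_star[a], 1.0)
        Q[a] += alpha * (r - Q[a])            # each "disappointing" reward pulls Q[a] down,
        rewards[t] = r                        # so the other (still optimistic) arms get tried
    return rewards
```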

Of course, we will not rely on just one method; combining different methods is an idea that runs throughout the book.

2.7 Upper-Confidence-Bound Action Selection

In the ε-greedy method, a small probability ε is used to pick a non-greedy action for exploration, but this choice is indiscriminate. If we instead evaluate the non-greedy actions up front and explore them according to their potential (their uncertainty), exploration becomes broader and more efficient, and the results may be better.

One effective way of doing this is to select actions according to the upper confidence bound (UCB) rule:

$$A_t \doteq \operatorname*{argmax}_a \left[\, Q_t(a) + c\,\sqrt{\frac{\ln t}{N_t(a)}} \,\right]$$

N_t(a) denotes the number of times that action a has been selected prior to time t.

The number c > 0 controls the degree of exploration.

If N_t(a) = 0, then a is considered to be a maximizing action.

The square-root term in the formula is an estimate of the uncertainty of action a's value, and it can be read in two ways: if action a is selected often, the denominator N_t(a) grows and the uncertainty shrinks; if action a is rarely selected, then as the total number of steps t grows while N_t(a) stays the same, the uncertainty grows, and we ought to give a more consideration.
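A small sketch (mine) of UCB selection implementing the formula above; untried actions (N_t(a) = 0) are simply taken first, which is one way to treat them as maximizing:

```python
import numpy as np

def ucb_action(Q, N, t, c=2.0):
    """Select argmax_a [ Q_t(a) + c * sqrt(ln t / N_t(a)) ]."""
    untried = np.flatnonzero(N == 0)
    if untried.size > 0:
        return int(untried[0])                # an untried action counts as maximizing
    bound = Q + c * np.sqrt(np.log(t) / N)
    return int(np.argmax(bound))
```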

But there are also some problems.

Compared with ε-greedy, UCB is more complex, it is much harder to extend to the other reinforcement learning problems considered later in the book, and it does not perform well on nonstationary problems. Another difficulty is dealing with large state spaces, particularly when using function approximation as developed in Part II of this book. In these more advanced settings the idea of UCB action selection is usually not practical.

2.8 Gradient Bandit Algorithms

We've been at this for a while; how about a short break ^_^

A new method to choose actions

Consider learning a numerical preference for each action a, which we denote H_t(a). The larger the preference, the more often that action is taken:

$$\pi_t(a) \doteq \Pr\{A_t = a\} \doteq \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}}$$

π_t(a) is the probability of taking action a at time t. Initially all preferences are the same (zero) so that all actions have an equal probability of being selected.

How to update H_t(a)

$$H_{t+1}(A_t) \doteq H_t(A_t) + \alpha\,(R_t - \bar{R}_t)\bigl(1 - \pi_t(A_t)\bigr), \qquad H_{t+1}(a) \doteq H_t(a) - \alpha\,(R_t - \bar{R}_t)\,\pi_t(a) \quad \text{for all } a \neq A_t$$

in the rule:

α > 0 is a step-size parameter;

$\bar{R}_t$ is the baseline with which the reward is compared; it can be computed incrementally as described in Incremental Implementation (section 2.4).

about the baseline

If the reward is higher than the baseline, then the probability of taking A_t in the future is increased, and if the reward is below the baseline, then the probability is decreased. The non-selected actions move in the opposite direction.
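Here is a sketch (my own, not the book's code) of one gradient-bandit preference update following the rule above; `H` is the preference vector, `a` the action actually taken, `r` its reward, and `avg_reward` the incrementally maintained baseline R̄_t:

```python
import numpy as np

def gradient_bandit_step(H, a, r, avg_reward, alpha=0.1):
    """One update of the preferences H_t under the gradient-bandit rule."""
    pi = np.exp(H - H.max())
    pi /= pi.sum()                            # softmax action probabilities π_t(a)
    H = H - alpha * (r - avg_reward) * pi     # non-selected actions: -α (R_t - R̄_t) π_t(a)
    H[a] += alpha * (r - avg_reward)          # selected action nets +α (R_t - R̄_t)(1 - π_t(A_t))
    return H, pi
```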

the theory of the softmax method

There is a separate blog post explaining it.

2.9 Associative Search (Contextual Bandits)

The algorithms above all deal with a single k-armed bandit. Now consider this problem: suppose you face 10 different k-armed bandits ("k-armed bandit" comes from the slot machine), and at each step one of these machines is drawn at random for you to play. You could still regard this as playing a single slot machine, but its true values would then change drastically from step to step.

By this point it should be clear: each of my actions influences the situation and reward that follow, whereas in an ordinary k-armed bandit the steps are independent of one another.

Associative search can be regarded as the bridge between the k-armed bandit problem and the full reinforcement learning problem; it is now usually called the contextual bandits problem.