Chapter 01 Introduction

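I'm hoping that keeping a blog updated will motivate me a little. My current plan is to read the introductory reinforcement learning book "Reinforcement Learning: An Introduction" and take reading notes along the way. Below are my notes on the Introduction chapter.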

What is Reinforcement Learning

First Impression of RL

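If the machine learning methods I had met before, such as logistic regression, SVM, decision trees, and artificial neural networks, are data-driven, then reinforcement learning is "situation"-driven: the action is determined by the current environment, the action produces a numerical reward, and that reward in turn feeds back into how the agent acts on the environment. This is my first impression of reinforcement learning.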

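Reinforcement learning has two distinguishing features:
1. Trial-and-error search: the learner is not told which actions to take; it has to discover, by trying them, which actions yield reward (the immediate payoff).

2. Delayed reward: actions are not independent of one another; the current action may bring a long-term payoff (value) that is richer than its immediate payoff.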

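Reinforcement learning has three essential ingredients: the learner (the agent, which I understand as the program that runs while the model is being trained) must be able to sense the environment (state) and take actions that affect it (action), and the agent must have a goal or goals relating to the state of the environment.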

RL vs Supervised learning

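I won't go into detail on supervised learning. Roughly speaking, a supervised learning algorithm learns, from a labeled training set, the "ability" to make judgments about unseen examples. For learning from interaction, however, the supervised approach would need examples of behavior that are both correct and meaningful for every situation, and obtaining those is impractical. So it cannot grow from its own experience and is not suited to problems that require interacting with the environment.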

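To give a not-quite-apt analogy: supervised learning is like you cramming practice problems before an exam, while reinforcement learning is like the top student who studies diligently every day. If the exam closely matches the problems you practiced, you are fine. But the top student studies every day, and the data has already been internalized as ability, so he is not worried at all.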

RL vs Unsupervised learning

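Quite a few people think that supervised and unsupervised learning exhaustively partition machine learning, and they put reinforcement learning in the latter because neither uses labeled data. In terms of goals, though, reinforcement learning tries to maximize reward, whereas unsupervised learning tries to find hidden structure in the data. Unsupervised methods do not, by themselves, solve the reinforcement learning problem.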

exploration and exploitation

The agent has to exploit what it has already experienced in order to obtain reward, but it also has to explore in order to make better action selections in the future.

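This problem is a bit like stochastic gradient descent (SGD) as used in gradient descent, or like simulated annealing. But SGD is a speed-up technique: if I stubbornly refuse to use it, I may still get a decent result. In reinforcement learning, if the agent never learns about the unknown, the result will definitely be flawed. Yet if it keeps exploring by trial and error and never consolidates its "experience" (exploit), the algorithm cannot converge either.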
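One common way to balance the two is epsilon-greedy selection, which the following minimal sketch illustrates. The function name `epsilon_greedy`, the `rng` parameter, and the value estimates are all made up for illustration; this is not code from the book.

```python
import random

def epsilon_greedy(values, epsilon, rng=random):
    """Return an action index: random with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        # explore: try any action, even one that currently looks bad
        return rng.randrange(len(values))
    # exploit: pick the action with the highest current estimate
    return max(range(len(values)), key=lambda i: values[i])

estimates = [0.2, 0.8, 0.5]  # made-up value estimates for three actions
print(epsilon_greedy(estimates, epsilon=0.0))  # never explores, so always greedy
print(epsilon_greedy(estimates, epsilon=0.1))  # explores about 10% of the time
```

With `epsilon=0` the agent only exploits what it already believes; with `epsilon=1` it only explores. Anything in between mixes the two.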

Tic-Tac-Toe

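I think analyzing an example first makes it easier to understand how the pieces of RL relate to one another. Tic-Tac-Toe is essentially the three-in-a-row grid game we played as kids; I won't draw the board here.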

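Reinforcement learning differs from both the classical minimax solution from game theory and classical optimization methods for sequential decision problems, such as dynamic programming. The former assumes play follows a strongly constrained pattern (here, a minimax player never moves to a position from which it could lose), while the latter needs the probability of each of the opponent's moves to be specified in advance before it can optimize.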

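The reinforcement learning approach used for this example learns through value functions, which the book contrasts with evolutionary methods. The former learns from the value of each state; the latter revises the policy using only the final outcome of each game. The value-function approach therefore makes better use of the states visited during play.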

For a Python implementation of tic-tac-toe, click here.

Elements of RL

Beyond the agent and the environment, one can identify four main subelements of a reinforcement learning system: a policy, a reward signal , a value function, and, optionally, a model of the environment.

policy

As I understand it, a policy is a mapping from state to action, and this is where the explore/exploit trade-off mentioned above comes in. In general, policies may be stochastic.
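To make "a mapping from state to action that may be stochastic" concrete, here is a small sketch. The states, actions, and probabilities are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical policy: each state maps to a probability distribution over actions.
policy = {
    "low_battery": {"recharge": 0.9, "search": 0.1},
    "high_battery": {"recharge": 0.1, "search": 0.9},
}

def sample_action(policy, state, rng=random):
    """Draw one action for `state` according to the policy's probabilities."""
    actions = list(policy[state].keys())
    weights = list(policy[state].values())
    return rng.choices(actions, weights=weights, k=1)[0]

print(sample_action(policy, "low_battery"))  # usually "recharge", sometimes "search"
```

A deterministic policy is the special case where one action in each state has probability 1.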

reward

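The short-term payoff produced by each action. Reward shapes the policy, pushing behavior toward high reward. Reward generally depends on both the state and the action, and can be random: In general, reward signals may be stochastic functions of the state of the environment and the actions taken.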

value function

The long-term payoff built up from rewards. Value is aggregated from rewards, and value in turn steers the agent toward higher reward over time. Value governs the choice of action: Action choices are made based on value judgments. We seek actions that bring about states of highest value, not highest reward, because these actions obtain the greatest amount of reward for us over the long run.
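The difference between choosing by reward and choosing by value can be shown with a toy example. The action names and reward sequences below are made up, and a plain sum of known rewards stands in for value, which in practice is an estimate of the reward to come.

```python
# Made-up reward sequences that follow two candidate actions from the same state.
rewards_after = {
    "grab_now": [5, 0, 0],    # high immediate reward, nothing afterwards
    "be_patient": [1, 4, 6],  # lower immediate reward, more later
}

def best_by_immediate_reward(rewards_after):
    """Pick the action whose first reward is largest."""
    return max(rewards_after, key=lambda a: rewards_after[a][0])

def best_by_long_run_total(rewards_after):
    """Pick the action whose total reward over the whole sequence is largest."""
    return max(rewards_after, key=lambda a: sum(rewards_after[a]))

print(best_by_immediate_reward(rewards_after))  # grab_now
print(best_by_long_run_total(rewards_after))    # be_patient
```

Judging by value corresponds to the second function: the agent accepts a smaller reward now because the total over the long run is larger.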

model of the environment

This is something that mimics the behavior of the environment, or more generally, that allows inferences to be made about how the environment will behave.
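One very simple way to picture such a model is a table that remembers which next state followed each (state, action) pair and predicts the most frequent one. The class `TabularModel` and its method names are hypothetical, and this is only one of many possible forms a model could take.

```python
from collections import Counter, defaultdict

class TabularModel:
    """Hypothetical model: predicts the next state seen most often so far."""

    def __init__(self):
        # (state, action) -> counts of the next states observed after it
        self.counts = defaultdict(Counter)

    def observe(self, state, action, next_state):
        """Record one real transition experienced by the agent."""
        self.counts[(state, action)][next_state] += 1

    def predict(self, state, action):
        """Return the most frequently observed next state, or None if unseen."""
        seen = self.counts.get((state, action))
        if not seen:
            return None
        return seen.most_common(1)[0][0]

model = TabularModel()
model.observe("s0", "right", "s1")
model.observe("s0", "right", "s1")
model.observe("s0", "right", "s2")
print(model.predict("s0", "right"))  # s1
print(model.predict("s0", "left"))   # None, nothing observed yet
```

With a model like this the agent can ask "what would probably happen if I did this?" without actually doing it.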

Summary

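Overall, reinforcement learning is quite different from the other machine learning methods I have studied. The introduction did leave me with a few questions:

1. How are action, state, reward, and value updated in relation to one another? I still don't get it.

2. How is the value table built?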

---------- (answers in 11.14) ----------

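1. Value is attached to states, and every state transition comes with a value update. Actions come in two kinds, explore and exploit: the former is random, the latter picks the move leading to the highest value. As for reward, I take it to be, when updating from current state -> next state, the corresponding next value or (next value - current value).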

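2. The value table is built over states, and the rules depend on the problem. For tic-tac-toe, for example, a win for X has value = 1, a win for O has value = 0, and a draw has value = 0.5, taking X's point of view of course.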
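The two answers can be put together in a short sketch: a value table that is filled in lazily with the initial values from answer 2, plus an update that moves the current state's value toward the next state's value using (next value - current value) from answer 1. The board encoding, the function names, and the step size `alpha` are my own assumptions for illustration.

```python
# Each board is a 9-character string read row by row; " " marks an empty cell.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return "X" or "O" if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def initial_value(board):
    """Initial value from X's point of view: X win 1, O win 0, otherwise 0.5."""
    w = winner(board)
    if w == "X":
        return 1.0
    if w == "O":
        return 0.0
    return 0.5  # draws and undecided positions

def td_update(values, state, next_state, alpha=0.1):
    """Move V(state) a fraction alpha of the way toward V(next_state)."""
    current = values.setdefault(state, initial_value(state))
    target = values.setdefault(next_state, initial_value(next_state))
    values[state] = current + alpha * (target - current)
    return values[state]

values = {}                # the value table, filled in as states are visited
before = "XX OO    "       # X is about to move
after = "XXXOO    "        # X completes the top row and wins
print(td_update(values, before, after))  # nudged from 0.5 toward 1.0
```

Repeating this update over many games lets the values of earlier positions gradually reflect how often they lead to a win.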