
Reinforcement Learning: An Introduction (2018) [pdf] - atomroflbomber
http://incompleteideas.net/book/RLbook2018.pdf
======
svalorzen
If you ever feel like trying out the algorithms contained in the book
without going to the trouble of reimplementing everything from scratch,
feel free to come over to
[https://github.com/Svalorzen/AI-Toolbox](https://github.com/Svalorzen/AI-Toolbox).
This is a library I have maintained for the past 5 years; it implements
quite a lot of RL algorithms and can be used from both C++ and Python.
It's very focused on being understandable and having clear
documentation, so I'd love to help you get started :)

~~~
clickok
Cool! I'd also like to plug my own RL-related repositories:
[https://github.com/rldotai/rl-algorithms](https://github.com/rldotai/rl-algorithms)
and [https://github.com/rldotai/mdpy](https://github.com/rldotai/mdpy).

The first one implements some of the more "exotic" temporal difference
learning algorithms (Gradient, Emphatic, Direct Variance) with links to the
associated papers. It's in Python and heavily documented.

The second one (mdpy) has code for analyzing MDPs (with a particular focus on
RL), so you can look at what the solutions to the algorithms might be under
linear function approximation. I wrote it when I was trying to get a feel for
what the math _meant_ and continue to find it helpful, particularly when I'm
dubious about the results of some calculation.
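
For instance, here is roughly the kind of calculation a tool like mdpy
makes easy, done here in plain numpy (a sketch of the analysis, not
mdpy's actual API): solving directly for the TD(0) fixed point of a
small, fully known MDP under linear function approximation.

    import numpy as np

    # Fixed point of linear TD(0) on a small, fully known MDP.
    # Plain numpy; illustrative values, not mdpy's API.
    P = np.array([[0.5, 0.5, 0.0],   # on-policy transition matrix
                  [0.0, 0.5, 0.5],
                  [0.5, 0.0, 0.5]])
    r = np.array([0.0, 0.0, 1.0])    # expected reward per state
    gamma = 0.9
    Phi = np.array([[1.0, 0.0],      # features: one row per state
                    [0.0, 1.0],
                    [1.0, 1.0]])

    # stationary distribution: left eigenvector of P for eigenvalue 1
    evals, evecs = np.linalg.eig(P.T)
    d = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    d = d / d.sum()
    D = np.diag(d)

    # TD(0) converges to theta solving A theta = b, where
    # A = Phi' D (I - gamma P) Phi  and  b = Phi' D r
    A = Phi.T @ D @ (np.eye(3) - gamma * P) @ Phi
    b = Phi.T @ D @ r
    theta = np.linalg.solve(A, b)
    print("TD fixed point:", theta)
    print("approximate state values:", Phi @ theta)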

------
nafizh
With all the hype around RL, I have yet to see significant real-life
problems solved with it. I am afraid that with all the funding going
into it, and nothing to show for it except being able to play complex
games, this might contribute to mistrust in the proper utilization of
research funds. Also, the reproducibility problem in RL is many times
worse than in the rest of ML.

~~~
milaresearcher
I agree with you that it's early days for RL. I think some companies are using
it in their advertising platforms, but it's not really my field.

That said, I strongly disagree about what constitutes the proper utilization
of research funds. IMO, society should invest in basic research without the
expectation of solutions to significant real-world problems.

Still, I'd be really surprised if I don't see advances from the field of
reinforcement learning used in a ton of applications during my lifetime.

------
keeptrying
The authors, Barto and Sutton, take such a complicated subject and
explain it in such simple prose.

I don't think I've read any other work that does this as well.

Also, RL is only going to grow in use and popularity. Highly recommend
it for ML practitioners.

~~~
abhgh
I hope it grows in popularity, if only because it's an interesting take
on learning. I did a course on RL in 2007 and our textbook was the 1st
edition of this book. Back then, it was perceived to be a very niche
area, and a lot of ML practitioners (there weren't many of those either
:) ) had only just about heard of RL.

I am happy that it's popular today.

~~~
platz
> I am happy that it's popular today.

For doing what?

~~~
siekmanj
It's definitely finding a niche in robotic control. My lab just released
a paper about running a policy trained in simulation on a real bipedal
robot.

------
wenc
My understanding is that RL is a reasonable attack for situations where
the environment is (1) mathematically uncharacterized, (2)
insufficiently characterized, or (3) characterized, but the resulting
model is too complex to use; RL therefore simultaneously explores the
environment in simple ways and takes actions to maximize some objective
function.

However, there are many environments (chemical/power plants, machines,
etc.) where there are good mathematical/empirical data-based models, and
where model-based optimal control works extremely well in practice (much
better than RL).

I'm wondering why the ML community has elected to skip over this latter
class of problems, with its large swaths of proven applications, and
instead has gone directly to RL, which is a really hard problem. Is it
to publish more papers? Or because of self-driving cars?*

(* optimal control tends not to work well in highly uncertain,
uncharacterized, changing environments -- self-driving cars are an
example of one such environment, where even the sensing problem is
highly complicated, much less control)

~~~
svalorzen
RL is actually quite an umbrella term for a lot of things. There's policy
gradient methods, which improve directly on the policy to select better
actions, there's value based methods which try to approximate the value
function of the problem, and get a policy from that, and there's model based
methods which try to learn a model and do some sort of planning/processing in
order to get the policy.

Using model based methods can allow you to do some pretty fancy stuff while
massively reducing the number of data samples you need, but on the other side
there's a trade off. Using the model usually tends to require lots of not-
very-parallelizable computations, and can be more costly computationally. Very
large problems can get out of hand pretty quickly, and there's still a lot of
work to do before there is something which can be applied in general quickly
and efficiently.
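
To make the value-based flavor concrete, here is a minimal tabular
Q-learning sketch in plain numpy. The Gym-style `env` object, with
`reset()` and a `step()` returning `(state, reward, done)`, is an
assumption for illustration -- this is not AI-Toolbox's API.

    import numpy as np

    # Tabular Q-learning: a value-based method. The greedy policy is
    # read off the learned action-value table at the end.
    def q_learning(env, n_states, n_actions, episodes=500,
                   alpha=0.1, gamma=0.99, epsilon=0.1):
        Q = np.zeros((n_states, n_actions))
        for _ in range(episodes):
            s = env.reset()          # assumed Gym-style interface
            done = False
            while not done:
                # epsilon-greedy action selection
                if np.random.rand() < epsilon:
                    a = np.random.randint(n_actions)
                else:
                    a = int(np.argmax(Q[s]))
                s_next, reward, done = env.step(a)  # assumed signature
                # TD update toward the one-step bootstrapped target
                target = reward + (0.0 if done else gamma * np.max(Q[s_next]))
                Q[s, a] += alpha * (target - Q[s, a])
                s = s_next
        return Q, np.argmax(Q, axis=1)   # values and greedy policy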

~~~
wenc
Thanks for the insight on RL. That's good context for me.

I would say, though, that in my experience computational cost is rarely
the issue with model-based control, because there are various attacks,
ranging from model simplification (surrogate models; piecewise-affine
multi-models, i.e. switching between many simpler local models; etc.) to
precomputing the optimal control law [1] to embedding the model in
silicon. Also, some optimal models/control laws can actually be
parallelized fairly easily (MLD models are expressed as mixed-integer
programs, which can be solved in performant ways using parallel
algorithms, with some provisos). This is a well-trodden space with a
tremendous amount of industry-driven research behind it.

Most of these methods come under the Model Predictive Control (MPC)
umbrella, which has been studied extensively over three decades [2]. The
paradigm is extremely simple: (1) given a model of how output y responds
to input u, predict over the next n time periods the values of u needed
to optimize an objective function. (2) Implement ONLY the first u. (3)
Read the sensor value for y (the actual y in the real world). (4) Update
your model with the difference between actual y and predicted y, move
the prediction window forward, and repeat (feedback). When this is
applied recursively, you obtain approximately optimal control of
real-life systems even in the presence of model-reality mismatch, noise,
and bounded uncertainty.
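
As a rough illustration of that four-step loop (my own sketch, not taken
from any particular MPC package), here it is on a scalar linear plant.
The model coefficients deliberately differ from the plant's to show the
feedback step absorbing model-reality mismatch; all numbers are made up.

    import numpy as np
    from scipy.optimize import minimize

    a_true, b_true = 0.9, 0.5     # the real plant: y' = a*y + b*u
    a_model, b_model = 0.8, 0.6   # our (deliberately wrong) model
    y_ref, horizon, rho = 1.0, 10, 0.01
    y, bias = 0.0, 0.0            # plant output and disturbance estimate

    def predicted_cost(u_seq, y0, bias):
        # (1) predict y over the horizon; tracking cost plus input effort
        cost, y_pred = 0.0, y0
        for u in u_seq:
            y_pred = a_model * y_pred + b_model * u + bias
            cost += (y_pred - y_ref) ** 2 + rho * u ** 2
        return cost

    for t in range(30):
        # (1) optimize the whole input sequence over the horizon
        res = minimize(predicted_cost, np.zeros(horizon), args=(y, bias))
        u = res.x[0]                        # (2) implement ONLY the first u
        y_model = a_model * y + b_model * u + bias
        y = a_true * y + b_true * u         # (3) read the "sensor" (true plant)
        bias += 0.5 * (y - y_model)         # (4) update model with the mismatch
    print("final y:", y)  # should settle near y_ref despite the wrong model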

If you think about it, this is the paradigm behind many planning strategies --
forecast, take a small action, get feedback, try again. The difference though
is that MPC is a strategy with a substantial amount of mathematical theory
(including stability analysis, reachability, controllability, etc.), software,
and industrial practice behind it.

[1] Explicit MPC:
[http://divf.eng.cam.ac.uk/cfes/pub/Main/Presentations/Morari.pdf](http://divf.eng.cam.ac.uk/cfes/pub/Main/Presentations/Morari.pdf)

[2]
[https://en.wikipedia.org/wiki/Model_predictive_control](https://en.wikipedia.org/wiki/Model_predictive_control)

~~~
breatheoften
(disclaimer: I am not a RL researcher) I think grandparent was using 'model'
to refer to model-based or 'value-based' reinforcement learning algorithms (as
distinct from 'model-free' methods (ex: 'policy-based' methods)). I don't
think they were directly referring to the same 'model' as is meant by MPC.

In RL, the goal is to try to find a function that produces actions that
optimize the expected reward of some reward function. Model-based RL methods
typically try to extract a function for 'representing' the environment and
employ techniques to optimize action selection over that 'representation'
(replace the word 'representation' with the word 'model'). Model-free RL
methods instead try to directly learn to predict which actions to take without
extracting a representation. A good paper describing deep q-learning -- a
commonly cited model-free method that was one of the earliest to employ deep-
learning for a reinforcement learning task [1].
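
A minimal tabular sketch of that model-based recipe (my own
illustration, not from the DQN paper; deep model-based methods replace
the counts below with learned function approximators): estimate the
transition and reward model from experience, then plan over the
estimates with value iteration.

    import numpy as np

    # Model-based RL in miniature: fit a model from experience, then plan.
    def plan_from_experience(transitions, n_states, n_actions, gamma=0.99):
        # transitions: list of (s, a, r, s_next) gathered by any policy
        counts = np.zeros((n_states, n_actions, n_states))
        rewards = np.zeros((n_states, n_actions))
        for s, a, r, s_next in transitions:
            counts[s, a, s_next] += 1
            rewards[s, a] += r
        visits = counts.sum(axis=2, keepdims=True)
        P_hat = counts / np.maximum(visits, 1)            # estimated P(s'|s,a)
        R_hat = rewards / np.maximum(visits[:, :, 0], 1)  # estimated r(s,a)

        # value iteration on the *learned* model
        V = np.zeros(n_states)
        for _ in range(1000):
            Q = R_hat + gamma * P_hat @ V
            V_new = Q.max(axis=1)
            if np.max(np.abs(V_new - V)) < 1e-8:
                break
            V = V_new
        return np.argmax(R_hat + gamma * P_hat @ V, axis=1)  # greedy policy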

I think it's worth clarifying -- RL algorithms as a whole are more akin
to search than to control algorithms. RL algorithms can be used to solve
some control problems, but that is not all they are used for, unless you
take an extremely broad view of what constitutes a 'control problem'...
I don't think it would be common to model playing Go as a control
problem, for example -- nor would I consider learning how to play every
Atari game ever created, given only image frames and the current score
and no other pre-supplied knowledge, to be a control problem...?

(I'm talking way past my familiarity now.) That said, optimal control
theory intersects with RL quite a bit in the foundations -- Q-learning
techniques (a foundational family of methods in RL) have proofs showing
under what conditions they will converge on the optimal policy -- I
believe this mathematics to be quite similar to the mathematics used in
optimal control theory...

[1] Deep Q-Networks:
[https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf](https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf)

~~~
wenc
Thanks for sharing some really interesting thoughts. Just to add on to your
comment...

The goal of optimal control is broadly similar to that of RL, in that it
aims to optimize some expected reward function by optimizing the
selection of actions to implement in the environment.

The difference is that optimal control does not seek to learn either a
representation or a policy in real time -- it assumes both are known a
priori.

Both can be thought of as containing hidden Markov models, though in
optimal control the transition functions are assumed to be known,
whereas in RL they are unknown.

Another difference is that in control theory, we assume there is always
a model -- though some models are implicit. You see, control algorithms
either assume that the environment is explicitly characterized
(model-based, like MPC), or that the controller contains an implicit
model of the environment (the internal model control principle: e.g. we
adjust tuning parameters in PID control... there's no explicit model,
but a correctly tuned controller behaves like a model-inverse/mirror of
reality). In either case, the implicit or explicit model is arrived at
beforehand -- once deployed, no learning or continual updating of the
controller structure is done.

In contrast, RL has an exploration (i.e. learning) component that is
missing from most control algorithms [1], and it actively trades off
exploration vs. exploitation. In that sense, RL encompasses a larger
class of problems than control theory does, whereas control theory is
specialized toward the exploitation end of the
exploration-vs-exploitation spectrum.
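
For a picture of that trade-off in isolation, here is a sketch of the
classic UCB1 bandit rule (a standard algorithm, though not one mentioned
above; the `pull` function standing in for the environment is
hypothetical):

    import numpy as np

    # UCB1: exploit the best-looking action, plus an exploration bonus
    # that shrinks as an action's uncertainty shrinks.
    def ucb1(pull, n_actions, steps=1000):
        counts = np.zeros(n_actions)
        means = np.zeros(n_actions)
        for t in range(1, steps + 1):
            if t <= n_actions:
                a = t - 1                                # try each action once
            else:
                bonus = np.sqrt(2 * np.log(t) / counts)  # exploration bonus
                a = int(np.argmax(means + bonus))        # explore + exploit
            r = pull(a)
            counts[a] += 1
            means[a] += (r - means[a]) / counts[a]       # incremental mean
        return means

    # usage with a made-up 3-armed Bernoulli bandit:
    # est = ucb1(lambda a: float(np.random.rand() < [0.2, 0.5, 0.8][a]), 3)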

[1] Though there are some learning controllers, like ILCs (iterative
learning control) and adaptive controllers, which continually adapt to
the environment. They have a weakness (perhaps RL suffers from the same)
in that if a transient anomalous event comes through, they learn it, and
it messes up their subsequent behavior...

~~~
breatheoften
I'm not sure how comparable adaptive control theory notions are to
"reinforcement learning". "Adaptive" obviously isn't a perfectly defined
word, but your usage makes me think you might be pondering applying RL
to non-stationary environments, which I'm not sure is something RL would
currently be likely to perform well on. Many reinforcement learning
techniques _do_ require an approximately stationary environment (or at
least perform much better in one) -- of course it can be stochastic, but
the distributions should be mostly fixed, or else convergence challenges
are likely to be exacerbated.

