
A Long Peek into Reinforcement Learning - tosh
https://lilianweng.github.io/lil-log/2018/02/19/a-long-peek-into-reinforcement-learning.html
======
clickok
Pretty cool, this is actually a great reference for a lot of things. Even if
you're familiar with RL, you might be reminded of something or learn something
new.

Sutton and Barto's book is also good if you want to do more than dip your toes
into RL: [http://incompleteideas.net/book/the-book-2nd.html](http://incompleteideas.net/book/the-book-2nd.html)
That page has the PDF with links to code, problem solutions, and course material.

If you just want an idea of what an RL algorithm looks like, I've got a
(heavily) documented version of TD(λ) for linear function approximation here:
[https://github.com/rldotai/rl-algorithms/blob/master/py3/td.py](https://github.com/rldotai/rl-algorithms/blob/master/py3/td.py)
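
The core of that algorithm is only a few lines. Here's a minimal, self-contained sketch (variable names are my own, not the ones from that file) of TD(λ) with accumulating traces and linear function approximation:

```python
import numpy as np

class LinearTD:
    """TD(lambda) with linear function approximation and accumulating traces."""

    def __init__(self, num_features, alpha=0.01, gamma=0.99, lam=0.9):
        self.w = np.zeros(num_features)  # weights; value estimate is w . x
        self.z = np.zeros(num_features)  # eligibility trace
        self.alpha = alpha               # step size
        self.gamma = gamma               # discount factor
        self.lam = lam                   # trace-decay parameter

    def update(self, x, reward, x_next, done=False):
        """One transition: feature vector x -> reward -> successor features x_next."""
        v = self.w @ x
        v_next = 0.0 if done else self.w @ x_next
        delta = reward + self.gamma * v_next - v       # TD error
        self.z = self.gamma * self.lam * self.z + x    # decay, then accumulate traces
        self.w = self.w + self.alpha * delta * self.z  # step toward the TD target
        if done:
            self.z = np.zeros_like(self.z)             # fresh traces next episode
        return delta
```

Everything else in the linked file is scaffolding around that update.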

------
hhmc
If anyone is interested in a fun toy program to learn Temporal Difference
Learning on, the game 2048 is amenable to the approach - in fact, TD learning
currently gives the best results of any 2048 AI.

It uses the N-tuple technique - an instructive paper describing N-tuples and
TDL applied to 2048 is here:
[https://arxiv.org/ftp/arxiv/papers/1606/1606.07374.pdf](https://arxiv.org/ftp/arxiv/papers/1606/1606.07374.pdf)
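
To get a feel for the mechanics: the value of a board is the sum of
lookup-table entries indexed by the tiles at a few fixed sets of cells, and TD
learning adjusts those entries directly. A rough sketch - the tuple shapes and
all names here are invented for illustration, not taken from the paper:

```python
import numpy as np

# A board is a flat array of 16 tile exponents (0 = empty, 1 = "2", 2 = "4", ...).
# Each n-tuple is a fixed set of board positions; these are illustrative picks.
TUPLES = [(0, 1, 2, 3), (4, 5, 6, 7), (0, 4, 8, 12), (1, 5, 9, 13)]
NUM_VALUES = 16  # possible tile exponents per cell

# One lookup table per tuple, indexed by the tuple's joint tile configuration.
tables = [np.zeros(NUM_VALUES ** len(t)) for t in TUPLES]

def tuple_index(board, positions):
    """Pack the tile exponents at `positions` into a single table index."""
    idx = 0
    for p in positions:
        idx = idx * NUM_VALUES + board[p]
    return idx

def value(board):
    """Board value = sum of lookup-table entries over all n-tuples."""
    return sum(tab[tuple_index(board, t)] for tab, t in zip(tables, TUPLES))

def td_update(board, target, alpha=0.01):
    """Nudge every contributing table entry toward a TD target
    (e.g. the reward plus the value of the successor board)."""
    delta = target - value(board)
    for tab, t in zip(tables, TUPLES):
        tab[tuple_index(board, t)] += alpha * delta / len(TUPLES)
```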

An alternative project would be a game like "connect-4", which is amenable to
the same approach.

~~~
jrx
I'd be very interested to hear for which similar problems (games, control
problems) reinforcement learning is not the best approach (i.e., where other,
qualitatively better methods exist). We hear a lot about success stories but
not a lot about the limitations of RL.

~~~
clickok
There are a couple of places where RL/Evolutionary Methods/Something Else are
jockeying for the top of the leaderboards, but usually if one technique
succeeds over the others it just means that someone's put a lot of work into
that one particular approach.

Anecdotally, I've heard about some optimization problems that should be a good
fit for dynamic programming or reinforcement learning, but where these methods
turn out not to work as well as desired.

For example, optimizing elevator policy: an agent controlling an elevator
wants to achieve high throughput of passengers while also minimizing each
individual's wait time. We want people to get to their floor quickly, but we
also want to avoid having any individual wait for an excessively long period
of time. The agent can only observe information about the buttons pressed on
each floor, although it gets a reward signal that reflects our desiderata[1].
As it turns out, most elevators are not running some sort of cool
machine-optimized policy, but rather something hand-coded.

This is not for lack of opportunity, either -- apparently, some researchers
pitched Otis on this in the 90s, and it didn't work as well as what they
already had, despite this looking like a case where theory should match
reality. _Why_ this is, I don't know, but it might come down to the fact that
there's a lot more to a "pleasant elevator experience" than might be assumed
(or modeled by the reward function), or perhaps the hand-coded policy
incorporates background knowledge about human vertical locomotion that is
unavailable to a machine.

-----

1. Maybe something like `R_t = Transporting(t) - ∑_i Waiting(i, t)`, meaning
"the number of people currently being transported, minus the sum, over
everyone still waiting, of how long they've been waiting".
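In code, that hypothetical reward might look something like this (all names
invented for the example):

```python
def elevator_reward(num_riding, wait_times):
    """R_t = (people currently being transported)
           - (sum over waiting people of their wait so far)."""
    return num_riding - sum(wait_times)
```

where `wait_times` holds, for each person still waiting, how long they have
waited so far.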

~~~
hhmc
Or perhaps those researchers in the 90s just whiffed?

It's apparently the case for other RL problems that we've only recently
crossed some compute-power viability threshold. Maybe contemporary researchers
with contemporary compute power would have better luck?

