
Deep reinforcement learning doesn't work yet - deepGem
https://www.alexirpan.com/2018/02/14/rl-hard.html
======
tehsauce
"A friend is training a simulated robot arm to reach towards a point above a
table. It turns out the point was defined with respect to the table, and the
table wasn’t anchored to anything. The policy learned to slam the table really
hard, making the table fall over, which moved the target point too."

Seems as though the problem of learning unintended techniques may sometimes be
better described as the model being too creative! Hitting the table is a
really clever solution to the problem it was given. These examples show that
the real challenge for researchers is constraining the model's enormous
capacity for creativity without stifling its ability to learn.
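
A toy sketch of the bug described in the quote (hypothetical names; the
actual setup isn't public): if the reward uses a target stored in the table's
frame, then moving the table moves the goal itself.

    # Hypothetical reconstruction of the mis-specified reward.
    import numpy as np

    def reward(arm_tip, table_pos, target_offset):
        # Bug: the target is defined relative to the table, not the world,
        # so knocking the table over relocates the goal point.
        target_world = table_pos + target_offset
        return -np.linalg.norm(arm_tip - target_world)

Slamming the table changes table_pos, which can raise the reward without the
arm ever reaching the intended point above the table.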

~~~
mtgx
Can't wait to see how "creative" our future autonomous drone strikes will be.

~~~
IshKebab
Like if we program our AI overlords to minimise unavoidable deaths, and it
learns that the best way to do so in the long term is to sterilise everyone
and wipe out the human race!

~~~
PeterisP
Well, sure, 7 billion quick deaths means less death and suffering than even a
measly one gruesome random accident per year for 8 billion years in the
future.

------
blt
There was a good talk at NIPS this year about the difficulty of reproducing
results and benchmarking in RL. In supervised learning you can easily compare
results on standard datasets. In RL though, once the actions of two policies
diverge, you're at the whim of the PRNG. Even with the same random seeds,
gradient-descent hyperparameters, etc., it is not possible to meaningfully
compare two policies with a single training run. Even if you hold everything
else constant, the random seed alone has a huge effect. Ideally papers would show
aggregated data over 100 runs for each setup, but RL is too computationally
expensive for that.
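
A minimal sketch of the aggregation the talk calls for; train_and_score here
is a made-up placeholder for a full training run, not a real benchmark:

    import numpy as np

    def train_and_score(seed):
        rng = np.random.default_rng(seed)
        # Stand-in for a full RL training run; returns final mean reward.
        return 100 + 40 * rng.standard_normal()

    scores = np.array([train_and_score(s) for s in range(100)])
    print(f"mean {scores.mean():.1f} +/- {scores.std():.1f} over {len(scores)} runs")

With spreads this wide, two single-run learning curves tell you almost
nothing about which algorithm is actually better.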

It is frustrating, but also exciting, because the field has so many open
problems.

On the other hand, it really sucks that those with the 1000-machine cluster
have such a huge advantage over smaller labs.

------
mistercow
> Whenever someone asks me if reinforcement learning can solve their problem,
> I tell them it can’t. I think this is right at least 70% of the time.

This seems to be a really strange calibration for "doesn't work". If you
replace "reinforcement learning" with other well known technologies, and ask
"of the instances where someone asks if X is a good solution, what percentage
of the time is it actually a good solution?" I feel like 30% would be on the
high end of the scale.

The rest of the article has a lot of interesting discussion of RL's
limitations, but it seems weird to make the article's thesis "RL doesn't
work" rather than just "RL still has a lot of limitations".

~~~
hateduser2
You don’t understand that quote. It’s saying that 70% of the time, when
people think reinforcement learning is the solution, it’s not; it is not
saying that 30% of all problems are solvable by reinforcement learning.
Sometimes I can’t tell if people here, you included, are intentionally
misreading things to further their agenda.

~~~
TeMPOraL
> _Sometimes I can’t tell if people here, you included, are intentionally
> misreading things to further their agenda._

More likely, sometimes people are just tired, distracted, or under the
influence of alcohol or drugs.

Personally, every other week I find myself writing a comment - sometimes long
- and then deleting it a minute later, after re-reading the comment I was
replying to and realizing I completely misunderstood it, and/or was arguing
against a strawman of my own creation. And every other month I write a
comment about which I realize my mistake only a few hours later.

------
alexbeloi
It's a shame how data-hungry DRL is (even compared to DL), but the DRL
framework encompasses all the standard classification/regression tasks and
also covers decision making, planning, and pretty much anything else you can
think of.

Model generality and data efficiency are in an inverse relationship, and a
lot of research has gone into moving along this hyperbola: at one extreme,
tailoring models to specific use cases/datasets/environments; at the other,
transferring learning across domains. DRL is stuck pretty high on the
generality end. Some breakthroughs seem to have moved progress onto a higher
curve: optimizers and algorithms (Adam, DQN, TRPO) have gotten better, which
helps everything in general, and core structures like CNNs or memory cells
seem to be somewhat universal (or our best guess yet). But there's still
something fundamental that seems to be missing. Or maybe this is all there
is, and we just need a computer with a richer/higher-resolution sense field
and the FLOPS to process it.

~~~
deepGem
> _but the DRL framework encompasses all the standard
> classification/regression tasks and also includes decision making_

Except labelling data, correct? Which is a non-trivial effort.

~~~
gwern
> Except labelling data, correct? Which is a non-trivial effort.

Choosing what to label is an RL problem too.

If you're choosing environment actions to learn how the environment 'labels'
them, that's the classic topic of 'how do we make DRL models explore well'
(and arguably the Achilles heel of the model-free DRL approaches the OP is
criticizing: the NNs can _easily_ learn to optimize their actions, even tiny
NNs are more than enough, but they just don't get fed the 'right' data, i.e.
exploration is bad). Relevant papers:
[https://www.reddit.com/r/reinforcementlearning/search?q=flai...](https://www.reddit.com/r/reinforcementlearning/search?q=flair%3AExp&sort=new&restrict_sr=on)
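
For the narrow 'choose actions so the environment labels them' case, the
simplest exploration knob is epsilon-greedy action selection; a minimal
sketch (mine, not from any of the linked papers):

    import numpy as np

    def epsilon_greedy(q_values, epsilon, rng):
        # With probability epsilon, take a uniformly random action (explore);
        # otherwise take the current best estimate (exploit).
        if rng.random() < epsilon:
            return int(rng.integers(len(q_values)))
        return int(np.argmax(q_values))

    rng = np.random.default_rng(0)
    action = epsilon_greedy(np.array([0.1, 0.5, 0.2]), epsilon=0.1, rng=rng)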

If you're being very narrow and considering a classification problem, well,
that's an RL problem too: you can optimize which datapoints you get labels for
based on how informative a datapoint is (most datapoints are simply redundant)
or how expensive it is to label. That's called 'active learning':
[https://www.reddit.com/r/reinforcementlearning/search?q=flai...](https://www.reddit.com/r/reinforcementlearning/search?q=flair%3AActive&sort=new&restrict_sr=on)
It's particularly natural if you are doing large-scale image classification
and have a service like Amazon Mechanical Turk plugged in to get (or correct)
labels.
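
A minimal sketch of uncertainty-based active learning (the synthetic data and
model choice are illustrative, not from the linked work): fit on a small
labeled pool, then query the points the model is least sure about.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 5))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    labeled = list(range(20))  # small seed set of labeled points
    model = LogisticRegression().fit(X[labeled], y[labeled])

    # Entropy of the predicted class probabilities ~ informativeness.
    probs = model.predict_proba(X)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    entropy[labeled] = -np.inf            # don't re-query known labels
    query = np.argsort(entropy)[-10:]     # send these 10 to the labelers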

------
bluetwo
I'm watching an AI agent right now learn to play Texas Hold'em and after
75,000 hands, starting with zero knowledge, it plays about as well as some of
my friends.

What should I try to teach it next?

~~~
yazr
75K hands seems very low to me.

Can you give more details?

Is it HU? NL? Using just self-play from zero knowledge?

How many roll-outs do you perform for each game action?

~~~
bluetwo
Home-brewed AI, not an NN. Similar in some ways to AlphaGo Zero's MCTS.

Learns to play with zero knowledge. No training data. No rollouts. It
squeezes a lot of info from each hand, more than an NN does, which is why it
needs fewer trials.

Currently at 100k hands and playing quite respectably.

~~~
yazr
Would love to chat and exchange some info

yazr2yazr@gmail.com

------
jostmey
I do not have the experience to support or refute the author's claim, but he
writes:

"The paper does not clarify what “worker” means, but I assume it means 1 CPU."

That seems way under-powered to me. It's DeepMind, so I would assume that 1
worker is 1 GPU/TPU _node_, meaning there are multiple GPUs per worker. I
could see how not having enough compute could result in a poor solution.

~~~
gwern
DRL is different from regular DL in that it tends to be CPU-heavy, not
GPU-heavy. It's hard to saturate even a single GPU/TPU, since you're using
tiny little NNs and only updating them once in a while based on long episodes
through the environment.

It might not be using GPUs/TPUs at all! If you look at the algorithm which
that DM paper is based on, PPO, the original OpenAI paper & implementation
([https://blog.openai.com/openai-baselines-
ppo/](https://blog.openai.com/openai-baselines-ppo/)) doesn't use GPUs, it's
pure-CPU. (They have a second version which adds GPU support.)

Or in a DM vein, look at their latest IMPALA which you might've noticed on the
front page a few days ago:
[https://arxiv.org/pdf/1802.01561.pdf](https://arxiv.org/pdf/1802.01561.pdf)
Look at the computational resources for various agents in Table 1 on page 5:
note how many of them use no GPUs whatsoever. Even the largest configuration,
500 CPUs, only saturates a single Nvidia P100 GPU.

(So, 'worker' could hypothetically refer to a server with X cores and 1 GPU
processing them locally, but this is almost certainly not the case since it
would imply scaling up to thousands of CPUs which is actually highly difficult
and requires careful engineering like with IMPALA.)
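
A rough sketch of that CPU-bound actor/learner split (my own toy
reconstruction, not DM's or OpenAI's code): many cheap CPU workers step the
environment, and the learner only occasionally consumes their batched
episodes.

    from multiprocessing import Pool
    import numpy as np

    def rollout(seed):
        rng = np.random.default_rng(seed)
        # Stand-in for stepping a simulator for one episode on one core.
        states = rng.standard_normal((200, 8))   # 200 steps, 8-dim obs
        rewards = rng.standard_normal(200)
        return states, rewards

    if __name__ == "__main__":
        with Pool(processes=16) as pool:          # 16 CPU workers
            episodes = pool.map(rollout, range(16))
        # The (tiny) policy net updates once per batch, so even a single
        # GPU sits mostly idle unless hundreds of such workers feed it.
        print(np.mean([r.sum() for _, r in episodes]))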

------
edraferi
> In this run, the initial random weights tended to output highly positive or
> highly negative action outputs. This makes most of the actions output the
> maximum or minimum acceleration possible. It’s really easy to spin super
> fast: just output high magnitude forces at every joint. Once the robot gets
> going, it’s hard to deviate from this policy in a meaningful way - to
> deviate, you have to take several exploration steps to stop the rampant
> spinning. It’s certainly possible, but in this run, it didn’t happen.

This is extremely human. Once you're deeply committed to something, it's hard
to imagine alternatives, never mind embrace them.
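
The quoted failure mode is easy to reproduce in isolation; a minimal sketch
(toy dimensions, with a deliberately large init scale):

    import numpy as np

    rng = np.random.default_rng(0)
    obs = rng.standard_normal(32)
    W = rng.standard_normal((8, 32)) * 3.0   # overly large initial weights
    actions = np.tanh(W @ obs)               # 8 joint torques in [-1, 1]
    print(np.round(actions, 2))              # nearly all +/-1.0: max force

Once every joint is pinned at maximum force, the gradient signal toward
anything subtler is weak, which is exactly the lock-in the quote describes.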

------
korbonits
This is an excellent write-up.

~~~
abledon
I might show this to that Uncle who talks about Kurzweil and the Singularity
at Thanksgiving.

~~~
darkmighty
Even AI researchers fall for the (IMO) Singularity fallacy, including
Schmidhuber:

[http://people.idsia.ch/~juergen/history.html](http://people.idsia.ch/~juergen/history.html)

Although at least there's some display of self-skepticism:

"Kurzweil (2005) plots exponential speedups in sequences of historic paradigm
shifts identified by various historians, to back up the hypothesis that "the
singularity is near." His historians are all contemporary though, presumably
being subject to a similar bias. People of past ages might have held quite
different views. For example, possibly some historians of the year 1525 felt
inclined to predict a convergence of history around 1540, deriving this date
from an exponential speedup of recent breakthroughs such as Western bookprint
(around 1444), the re-discovery of America (48 years later), the Reformation
(again 24 years later - see the pattern?), and other events they deemed
important although today they are mostly forgotten."

Which is a little rare, if you know the curious character of Jürgen
Schmidhuber :)

~~~
taneq
Well, in a way, they're right. You could consider the printing press, the
aeroplane, electricity, or the telegraph to be a mini-"singularity" event,
since each drastically changed the world in ways unpredictable beforehand.
"Singularity" doesn't necessarily equate to "rapture of the nerds where AI
gods make everything awesome (and/or kill us all)", it just means "point where
things get weird and we can't predict what will happen next."

~~~
darkmighty
Well, that's IMO draining the word of its original meaning. You're referring
to a revolution, which is a well-established term, not a bona fide
'Singularity', which comes from mathematics as a point where the rate of
change properly diverges -- as would be the case if we had a geometric time
series of events with constant improvements.

This usage of the term really originated in the context of rampant
intelligence growth (through supposed explosive self-improvement); see the
Wikipedia article:

[https://en.wikipedia.org/wiki/Technological_singularity](https://en.wikipedia.org/wiki/Technological_singularity)

------
fabmilo
I don't like the title of the article; I think it is sensationalist. Deep
reinforcement learning has beaten a human at the game of Go. It does work.
The models are not easy for most laymen out there to train, and a lot remains
to be addressed to make them production-ready. But it does work. Great
write-up overall. Thank you for the effort.

~~~
PeterisP
The current AlphaZero system has little in common with deep reinforcement
learning; the NN-guided MCTS is effective but not something that can be
applied to general reinforcement learning tasks. And the first AlphaGo system
was highly reliant on standard supervised learning (training the value network
and policy network on grandmaster games).

If anything, AlphaGo, AlphaGo Zero and AlphaZero are illustrations of how
"pure" deep reinforcement learning is insufficient, that the non-RL parts have
an enormous impact.

------
sushirain
Can autonomous vehicles work without deep reinforcement learning? I thought
that things like negotiating entry into an intersection required DRL.

~~~
wnoise
Of course they can. DRL is a very, very specific set of techniques for
training decision-making over multiple timesteps.

