
Human-level control through deep reinforcement learning - daisystanton
http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html
======
erostrate
The code is online if you want to play with it.
[https://sites.google.com/a/deepmind.com/dqn/](https://sites.google.com/a/deepmind.com/dqn/)

If you're interested, one of the main authors (David Silver) teaches a very
good and intuitive introductory class on reinforcement learning at UCL:
[http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html](http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html)

~~~
fchollet
Interesting that they're using Torch7. The code is pretty concise and
readable, very cool stuff.

~~~
yablak
They wrote Torch7...

~~~
fchollet
I'm sure a number of people who have contributed to Torch are working at
DeepMind. However, Torch has been around for much longer than DeepMind (about
12 years at this point). Two of the major contributors to Torch, Ronan
Collobert and Clement Farabet, were never DeepMind employees.

To be fair, another major contributor to Torch is a co-author of this paper
(Kavukcuoglu).

------
bmh100
> _...the authors used the same algorithm, network architecture, and
> hyperparameters on each game..._

This is huge. It shows that the algorithm was able to generalize across
multiple problem sets within the same domain of "playing Atari 2600 games",
rather than relying on a "lucky" choice of algorithm, network architecture, or
hyperparameters that a per-game random search might have found. This is also
not a violation of the No Free Lunch (NFL) theorem [1], because the domain is
limited to playing Atari 2600 games, which share many characteristics.

[1]:
[https://en.wikipedia.org/wiki/No_free_lunch_in_search_and_op...](https://en.wikipedia.org/wiki/No_free_lunch_in_search_and_optimization)

~~~
bsdetector
> the algorithm was able to generalize across multiple problem sets

Did it really? I think they reset it and retrained it for each game.

I'd like to know how much more is needed to make one instance of the AI that
can successfully play any of the games. To play all 49 games that it could
learn, does it need to be an extra level deep? Or 49 times larger? Or 2^49
times more?

~~~
meric
Can one human successfully play all of the games without prior practice in
each game? As far as I know a human has to practice almost every game to be
able to play all of them without losses. I think that for an AI, achieving
this standard is a good result: first practicing on each game and then playing
through all of them without losses.

~~~
bsdetector
A person that learns one game will learn the next game much faster, because
they have already learned concepts such as bullets, switches, reflection, or
wrapping.
We take this for granted, but there was a time when Breakout was actually
marginally fun because it was new.

A person that's played all the other games in the list can win Montezuma's
Revenge on the first try; this AI can't play Montezuma's Revenge at all.

------
sjtrny
Watch it play:

[http://www.nature.com/nature/journal/v518/n7540/extref/natur...](http://www.nature.com/nature/journal/v518/n7540/extref/nature14236-sv1.mov)

[http://www.nature.com/nature/journal/v518/n7540/extref/natur...](http://www.nature.com/nature/journal/v518/n7540/extref/nature14236-sv2.mov)

~~~
dwaltrip
This is so cool. I'd love to work on this stuff...

Anyone know how hard it would be for someone who is fairly good at programming
(works as a full stack developer and feels quite comfortable learning new
things) and has strong math skills (undergrad degree) to break into this
field? Is going back to school for a master's/PhD the best way?

~~~
ratsimihah
How about you read the article and get a few books on the relevant topics? It
would probably be much cheaper than going back to school.

~~~
dwaltrip
Very good point. I taught myself web dev (now working at a pretty awesome
startup) so I'm definitely familiar with that route.

I have a few cool AI ideas I'm hoping to start spending more time on in the
coming months, and I have heard of some great online courses to check out. I
was just curious how important institutional credentials are for this kind of
thing, seeing as it's much more academic than building CRUD web apps.

~~~
meric
I think there are lots of AI competitions you can join to make a name for
yourself.

------
superfx
Here's a publicly-accessible link to the full paper:

[http://rdcu.be/cdlg](http://rdcu.be/cdlg)

~~~
jmnhr
Only the first page; the rest is blurred and has to be paid for.

~~~
frandroid
Let it load...

~~~
jmnhr
They are using tricks. The first time I tried, the PDF was blurred and the
page automatically opened the payment menu. On the second try it showed the
entire PDF, but disallowed downloading.

~~~
leereeves
It loaded completely for me the first time, but doesn't allow download.

A step toward open access, but they're still trying to claim copyright and
control the work of others.

------
j_m_b
It is interesting how they are using various biological models to develop
their own model. They gave their model a reward system and a memory. It will
be interesting to see how far deep Q-networks can be extended and at what
point they hit the wall of diminishing returns.

|Nevertheless, games demanding more temporally extended planning strategies
still constitute a major challenge for all existing agents including DQN.

|Notably, the successful integration of reinforcement learning with deep
network architectures was critically dependent on our incorporation of a
replay algorithm involving the storage and representations of recently
experienced transitions.

I am not sure what data the replay algorithm has access to, but I wonder what
happens if you extend the amount of data it has. This might be where the
algorithm hits the brick wall of diminishing returns.
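
My rough understanding of experience replay is something like the following
sketch (a toy version of the idea, not DeepMind's actual code): store recent
transitions and train on random minibatches, which breaks up the correlations
between consecutive frames.

    import random
    from collections import deque

    class ReplayBuffer:
        def __init__(self, capacity=100000):
            # Old transitions fall off the end as new ones arrive.
            self.buffer = deque(maxlen=capacity)

        def add(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size=32):
            # Uniform random sampling; weighting "important" transitions
            # more heavily is one obvious extension.
            return random.sample(self.buffer, batch_size)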

It would be interesting to hear what the authors think could help improve
how their model deals with temporally extended planning strategies.

As someone who grew up on Atari, Nintendo and Sony this is pretty cool work.

~~~
morenoh149
I expect it could go far. Mind you, I only did parts of _Artificial
Intelligence: A Modern Approach_, but the Q-learning algorithm seems very
flexible.
[https://en.wikipedia.org/wiki/Q-learning](https://en.wikipedia.org/wiki/Q-learning)
It basically keeps doing good stuff, while exploring to get out of local
minima.
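
Roughly, in code (my own toy sketch, assuming a hypothetical env object with
reset(), step() and an actions list):

    import random
    from collections import defaultdict

    def q_learning(env, episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
        Q = defaultdict(float)  # Q[(state, action)] -> estimated future reward
        for _ in range(episodes):
            state, done = env.reset(), False
            while not done:
                if random.random() < epsilon:   # explore to escape local optima
                    action = random.choice(env.actions)
                else:                           # otherwise keep doing good stuff
                    action = max(env.actions, key=lambda a: Q[(state, a)])
                next_state, reward, done = env.step(action)
                # Nudge Q toward reward plus discounted best next value.
                best_next = max(Q[(next_state, a)] for a in env.actions)
                Q[(state, action)] += alpha * (reward + gamma * best_next
                                               - Q[(state, action)])
                state = next_state
        return Q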

------
albertzeyer
An interesting critique by Schmidhuber of this publication:

[https://plus.google.com/100849856540000067209/posts/eLQf4KC9...](https://plus.google.com/100849856540000067209/posts/eLQf4KC97Bs)

~~~
arvinjoar
Seems to be critiquing the claim that this is "groundbreaking" and not much
else. Nice to get some further context though. :)

------
nl
Is this a different paper to the original DeepMind video game paper?
[http://arxiv.org/abs/1312.5602](http://arxiv.org/abs/1312.5602)

~~~
sp332
Yes, I can't access the full paper but at least the figures are different :)

Edit: Ars Technica has a summary of this new paper.
[https://arstechnica.com/science/2015/02/ai-masters-49-atari-...](https://arstechnica.com/science/2015/02/ai-masters-49-atari-2600-games-without-instructions/)

------
discardorama
Is there a chance this paper will be available as a PDF? I'm finding it
difficult to read the ReadCube version. :-(

~~~
p1esk
[http://www.nature.com/nature/journal/v518/n7540/pdf/nature14...](http://www.nature.com/nature/journal/v518/n7540/pdf/nature14236.pdf)

~~~
teraflop
Maybe you have institutional access or something, but for the rest of us, that
link just redirects to the abstract.

------
javierluraschi
I think Q-learning is really interesting; yesterday I posted a simple
implementation/demo of Q-learning in JavaScript. This paper goes way beyond
Q-learning by deducing states from the actual game rendering with a deep
neural network, which is really cool. Regardless, as a first intro to
Q-learning I had fun putting this together:
[https://news.ycombinator.com/item?id=9105818](https://news.ycombinator.com/item?id=9105818)

------
javierluraschi
Here is the marketing side of this publication, in which Google scientists
(acqui-hired from DeepMind) have developed a way to outperform humans at Atari
games:
[http://m.phys.org/news/2015-02-hal-bests-humans-space-invade...](http://m.phys.org/news/2015-02-hal-bests-humans-space-invaders.html)

------
plinkplonk
Is the paper available anywhere to read without having to pay Nature? From the
comments it seems as if everyone is able to read this but me! Even in their
"readcube" access method, only the first page is (barely) visible, the rest
seems blurred.

------
nl
The most interesting thing about this is that it shows significant progress
towards goal-oriented AI. The fact this system is effectively learning what
"win" means in the context of a game is something of a breakthrough.

~~~
eveningcoffee
I do not think that it figures out "what the win is", as the score parameter
is explicitly made available to the algorithm.

In some sense this paper even demonstrates how simple the problem actually is.

I think the more important question is what else can be modelled as this kind
of problem.

------
craftit
It is an amazingly powerful technique. We've been working on a service which
lets you do this kind of learning with any JSON stream. You can see a demo
here:

[https://aiseedo.com/demos/cookiemonster/](https://aiseedo.com/demos/cookiemonster/)

~~~
ya3r
The amazing part of what DeepMind has achieved is its capability to learn from
raw pixel input with deep convolutional neural networks, which as I understand
it is quite different from what you do.

Still, the reinforcement learning part is the same, but reinforcement learning
was not the main contribution of this Nature paper.

~~~
craftit
It's not all that different: we take multiple asynchronous streams of
messages, integrate them into a coherent predictive model, and use that to
feed the reinforcement learning. The messages can contain images; a simple
case can be seen in the demo with a 1D vision sensor.

------
viggity
Can someone convert "academia nerd language" down one notch into "regular
nerd language"? On the surface this sounds interesting, but despite being a
huge nerd I'm not really sure what the hell they're talking about.

~~~
brentjanderson
Here is a very rough translation from my POV:

> The theory of reinforcement learning provides a normative account, deeply
> rooted in psychological and neuroscientific perspectives on animal
> behaviour, of how agents may optimize their control of an environment.

Reinforcement (rewards/punishments) is a highly effective way to train
autonomous individuals to succeed in arbitrary environments.

> To use reinforcement learning successfully in situations approaching real-
> world complexity, however, agents are confronted with a difficult task: they
> must derive efficient representations of the environment from high-
> dimensional sensory inputs, and use these to generalize past experience to
> new situations.

This kind of open-ended learning in a simulation is hard: Think of the number
of inputs from all your senses being continually processed by the nervous
system at a given time. Being able to take all those inputs, each of which
changes meaning depending on context, to figure out what to do right now
(while learning from the past in the process) is a hard problem, especially
for a computer to solve.

> Remarkably, humans and other animals seem to solve this problem through a
> harmonious combination of reinforcement learning and hierarchical sensory
> processing systems, the former evidenced by a wealth of neural data
> revealing notable parallels between the phasic signals emitted by
> dopaminergic neurons and temporal difference reinforcement learning
> algorithms.

Humans and other animals have this figured out through our own brains
("dopaminergic neurons") and the combination of our senses and various parts
of our nervous systems/biology (e.g. our reflexes respond faster than our
cognition due to the hierarchy of autonomic nervous system response).

> While reinforcement learning agents have achieved some successes in a
> variety of domains, their applicability has previously been limited to
> domains in which useful features can be handcrafted, or to domains with
> fully observed, low-dimensional state spaces.

Past reinforcement-based algorithms have worked well, but require thorough
understanding of the problem being solved, or for the problem to be relatively
simple and predictable.

> Here we use recent advances in training deep neural networks to develop a
> novel artificial agent, termed a deep Q-network, that can learn successful
> policies directly from high-dimensional sensory inputs using end-to-end
> reinforcement learning.

By combining advances in deep neural network training with reinforcement
learning in a "novel artificial agent" ("a deep Q-network"), our agent can
learn sophisticated problems through only reinforcement learning.

> We tested this agent on the challenging domain of classic Atari 2600 games.
> We demonstrate that the deep Q-network agent, receiving only the pixels and
> the game score as inputs, was able to surpass the performance of all
> previous algorithms and achieve a level comparable to that of a professional
> human games tester across a set of 49 games, using the same algorithm,
> network architecture and hyperparameters.

Our new approach works across 49 games using the same approach for each game
(where each game presumably has different rules and dynamics), and is able to
perform at the same level as a professional human being.

> This work bridges the divide between high-dimensional sensory inputs and
> actions, resulting in the first artificial agent that is capable of learning
> to excel at a diverse array of challenging tasks.

We've created a universal learning algorithm that can take a multitude of
inputs and consistently respond correctly without having to re-define the
model for each game or problem.
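
To make the "deep Q-network" part concrete, here is a rough sketch of the
network the paper describes (re-expressed in PyTorch for readability; the
released code is in Torch7/Lua): four stacked 84x84 greyscale frames go in,
one estimated value per joystick action comes out.

    import torch.nn as nn

    class DQN(nn.Module):
        def __init__(self, n_actions):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
                nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
                nn.Flatten(),
                nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
                nn.Linear(512, n_actions),  # one Q-value per action
            )

        def forward(self, x):
            return self.net(x)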

~~~
j_m_b
|> This work bridges the divide between high-dimensional sensory inputs and
actions, resulting in the first artificial agent that is capable of learning
to excel at a diverse array of challenging tasks.

|We've created a universal learning algorithm that can take a multitude of
inputs and consistently respond correctly without having to re-define the
model for each game or problem.

Further along in the article, it turns out this actually holds only for the
subset of games which don't involve long-term planning strategies.

~~~
chriswarbo
> Further along in the article, it turns out this actually holds only for the
> subset of games which don't involve long-term planning strategies.

Unfortunately I can't access the paper, and the ReadCube link seems to require
Flash, but I'm assuming a "Q-network" is based on Q-learning. Q-learning is
essentially a lookup table of states to actions, where we guess what state
we're in from our observations and look up which action to perform that will
get us the most reward.

In that sense, it's clear that a Q-learning approach would struggle with long-
term planning, since its memory only goes as far as one action. Of course there
are ways to extend Q-learning, but these tend to destroy its best feature:
implementation efficiency.

One nice alternative I've seen in recent years is Gradient Temporal
Difference, which learns linear function approximations rather than just a
table of values, and retains lots of the performance properties of Q-learning
(O(n) per step in the number of features, off-policy learning, etc.).

~~~
DavidSJ
Q-learning is not a lookup table from states to actions. It's a learned
mapping (using any method you want, such as neural networks) from agent
history (i.e. the agent is allowed to consider the past) & action pairs, to
estimates of future reward, learned through temporal differences between the
prediction at time step t, and the sum of new prediction plus immediate reward
at time step t+1.

In this way, information can flow back in time as the agent learns that some
observation is predictive of reward, and then learns some observation is
predictive of that observation, and so on until it connects the reward with
some action K time steps ago. So long-term planning is definitely possible.
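
A toy illustration of that backward flow (my own sketch, using plain TD on
state values rather than full Q-learning, for brevity): a chain of states with
a reward only at the very end. Repeated sweeps pull the reward estimate
further and further back in time.

    alpha, gamma = 0.5, 0.9
    V = [0.0] * 6              # V[5] is the terminal state
    rewards = [0, 0, 0, 0, 1]  # reward arrives only on the last transition

    for episode in range(20):
        for t in range(5):
            # Temporal-difference update toward reward + discounted successor.
            V[t] += alpha * (rewards[t] + gamma * V[t + 1] - V[t])

    print([round(v, 2) for v in V[:5]])
    # Early states end up predicting the distant reward, discounted by gamma,
    # even though they never receive it directly.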

This does require that some relevant information be available at each
intermediate time step to connect the actions with the ultimate reward. The
nice thing about these Atari games is you can usually judge value just by
what's immediately on the screen, and in this paper they only use the last
four frames' worth of state. In a game requiring a memory for past sensory
information (e.g. where information appears on the screen then disappears),
this might not do so well, but that's more a matter of working memory than
long-term planning, and a different Q-learning system could contain a working
memory (e.g. if it was RNN-based).

------
sharemywin
PDF:

[http://arxiv.org/pdf/1312.5602v1.pdf](http://arxiv.org/pdf/1312.5602v1.pdf)

~~~
p1esk
No, this is a new paper.

------
eveningcoffee
I am wondering what kind of real-life problems could be modelled this way.

~~~
phkahler
>> I am wondering what kind of real-life problems could be modelled this way.

Driving a car? Making a taco? Working the checkout counter?

~~~
eveningcoffee
So could you define a cost function for these activities?

~~~
craftit
I suspect setting up motivations for the AI is going to be a big research
issue before too long. If you can write a simulator for the task you want it
to solve, you should be able to train it. Often writing the simulator is much
easier than solving the task itself. For example, with the Atlas robot, they
can simulate it but struggle to control it.
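
For something like driving you could imagine a reward along these lines
(entirely hypothetical field names, just to illustrate the shape of the
problem):

    def driving_reward(state):
        r = 0.0
        r += 0.1 * state["forward_progress"]       # reward progress toward goal
        r -= 10.0 if state["collision"] else 0.0   # heavily penalize crashes
        r -= 1.0 if state["off_lane"] else 0.0     # stay in the lane
        r -= 0.01 * abs(state["jerk"])             # penalize jerky driving
        return r

The hard part is exactly what this thread suggests: deciding what to reward
in the first place.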

------
Someone
For comparison:
[http://www.cs.cmu.edu/~tom7/mario/](http://www.cs.cmu.edu/~tom7/mario/). That
is way more of a hack, but I am not sure this is that big a step forward.
Space Invaders and Breakout aren't the hardest games, and I haven't heard a
convincing argument that it is just a matter of scale to create a machine
that, say, plays chess.

~~~
TheEzEzz
The biggest differences are this:

1. The Mario algo has direct access to the game state, and will only work for
games where it has that game state access. The DeepMind algo plays directly
from the screen pixels. That means DeepMind has to first learn to interpret
time varying (!) visual information correctly, then deduce rules and good play
strategy on top of that leaky abstraction. That's hard. It also means the
algorithm can be applied to any game with a screen output, not just to an
Atari.

2. The Mario algo is doing a direct search through move space. It can back up
and explore a different branch of the tree and play differently to see a
different outcome (a toy sketch of this is at the end of this comment). When
the DeepMind algo plays Atari, it can't undo a move that it just did. It has
to make good choices, using intelligence, just like a human player would.

The impressive thing here is not that it plays Atari games. You're right, we
have had AIs that can do this for a long time, even better than this. The
impressive thing is that it's a single AI algorithm that works for many games,
and that is learning directly from the screen. We have not had anything like
this before.
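
For the curious, the "can back up" distinction in point 2 looks roughly like
this (a toy sketch with a hypothetical emulator API offering save/load
states, not tom7's actual code):

    def best_score(emulator, depth):
        # Exhaustive search through move space, rewinding after each branch.
        if depth == 0:
            return emulator.score()
        best = float("-inf")
        for move in emulator.legal_moves():
            snapshot = emulator.save_state()   # search can back up...
            emulator.step(move)
            best = max(best, best_score(emulator, depth - 1))
            emulator.load_state(snapshot)      # ...and try a different branch
        return best

The DQN agent gets no save_state: every action is committed the moment it is
taken.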

