
Deep Mind Playing Montezuma's Revenge with Intrinsic Motivation [video] - tonybeltramelli
https://www.youtube.com/watch?v=0yI2wJ6F8r0
======
jerf
If you end up blocked by a popup blocker, or just don't feel like reading
fluff, or, like me, you've got the browser locked down too tightly for their
integrated video player to work, the paper is at
[https://arxiv.org/pdf/1606.01868v1.pdf](https://arxiv.org/pdf/1606.01868v1.pdf)
and the video mentioned is (probably, since I didn't see the original) at
[https://www.youtube.com/watch?v=0yI2wJ6F8r0](https://www.youtube.com/watch?v=0yI2wJ6F8r0)
.

~~~
dang
Thanks. We've changed the URL from [http://www.wired.co.uk/article/google-ai-
montezuma-revenge](http://www.wired.co.uk/article/google-ai-montezuma-
revenge). At first to
[https://arxiv.org/abs/1606.01868](https://arxiv.org/abs/1606.01868), but
since everyone's going to want to watch the video, that seems a bit too
abstruse. Hopefully someone will say meaningful things about the paper too.

------
ChuckMcM
I look forward to the day when you are taunted on the Starcraft boards for
playing "like an AI" :-)

That said, game theory has always been an excellent way to analyze AI systems.
And using "modern" games (which generally provide attractive skins over a
classic mechanic) certainly makes it easier to watch/sit through. When
DeepMind starts beating people playing Diplomacy then we'll know we're in a
whole new game.

~~~
bytefactory
Indeed, I can't wait for the day we see DeepMind (or its ilk) playing games
like StarCraft; I expect it'll be quite uncanny.

------
folli
Is it really true that Montezuma's Revenge is more challenging for DeepMind
than Go, as they mention in the article?

~~~
kastnerkyle
Montezuma's Revenge, Castle Wolfenstein (not the shooter), and puzzle games
in general have a problem of long-term credit assignment and sparse reward.
This "intrinsic reward" approach from the paper, based on pseudo-counts,
seems to be one way to get an intermediate reward that helps the model learn
toward an overall goal (winning/progress) which happens only rarely. The
previous best work had to pre-define the intrinsic rewards, as I understand
it [0], and DeepMind has been tracking this general problem for a while [1],
along with a whole bunch of earlier work from the 70s/80s/90s (cited in the
background of this new paper).
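
In code, the idea reduces to something like the sketch below: pay an
intrinsic bonus that shrinks as a state's (pseudo-)count grows. This is my
own simplified illustration, using exact tabular counts in place of the
paper's density model, and `beta` is an assumed hyperparameter name:

```python
import math
from collections import defaultdict

class CountBonus:
    """Tabular stand-in for the paper's pseudo-count bonus.

    The real method derives a pseudo-count N-hat(x) from a density
    model over raw screens; here we simply count discretized states.
    """
    def __init__(self, beta=0.05):
        self.beta = beta                # bonus scale (hyperparameter)
        self.counts = defaultdict(int)

    def bonus(self, state_key):
        self.counts[state_key] += 1
        n = self.counts[state_key]
        # Novel states (small n) earn a large intrinsic reward;
        # familiar ones earn almost nothing.
        return self.beta / math.sqrt(n)

# The agent then trains on r_total = r_env + bonus(state), so merely
# reaching a never-before-seen room is itself rewarding.
```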

Credit assignment in a nutshell is "which actions helped me get the
reward?" For action games this is fairly easy - there are only a few moves
between rewards. For puzzlers, something like left, up, right, up, left,
left, left, left, up, up could get a reward. We can see there is a cycle in
there which is probably unnecessary, but the whole path may also be much
longer than the ideal one. Deciding which moves should get credit is a hard
problem, but an important one. [2]
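
As a toy illustration (mine, not the paper's) of how little credit the
early moves get: give a single reward at the end of that ten-move sequence
and discount it by gamma per step, and the return assigned to each move
decays geometrically with its distance from the reward.

```python
# Toy credit assignment: a single reward of 1.0 arrives only after
# the final move of the ten-step sequence above.
actions = ["left", "up", "right", "up", "left",
           "left", "left", "left", "up", "up"]
rewards = [0.0] * (len(actions) - 1) + [1.0]
gamma = 0.9  # discount factor

# Discounted return from each step: G_t = r_t + gamma * G_{t+1}
returns, g = [], 0.0
for r in reversed(rewards):
    g = r + gamma * g
    returns.append(g)
returns.reverse()

for a, g in zip(actions, returns):
    print(f"{a:>5}: credit {g:.3f}")
# The first 'left' receives only gamma^9 ~= 0.387; with hundreds of
# steps between real rewards, the signal all but vanishes.
```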

If you look at the results of the original DQN paper [3] you will see the
games it fared best on were ones with frequent rewards (e.g. Breakout).
Puzzle-like games (such as Q-Bert) fared much worse against human
benchmarks, whereas action games like Breakout (which is fully observable
given 4 frames of context, IIRC) generally beat the human benchmark.

This paper seems to be a big step toward deep RL for more than just
short-term decisions and a huge jump towards goal-oriented planning.

[0] Kulkarni et al.
[https://arxiv.org/abs/1604.06057](https://arxiv.org/abs/1604.06057)

[1] Mohamed, Rezende
[https://arxiv.org/pdf/1509.08731.pdf](https://arxiv.org/pdf/1509.08731.pdf)

[2]
[http://www.scholarpedia.org/article/Reinforcement_learning#....](http://www.scholarpedia.org/article/Reinforcement_learning#.28Temporal.29_Credit_Assignment_Problem)

[3] Nature results are better but paywalled :/ NIPS paper here
[https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf)
.
[http://www.nature.com/nature/journal/v518/n7540/abs/nature14...](http://www.nature.com/nature/journal/v518/n7540/abs/nature14236.html)
- Figure 3

~~~
bytefactory
Thank you, that was an excellent summary!

On a somewhat related note, it seems clear that AI research and
breakthroughs are occurring at breakneck speed. I wish there were a place
where you could see expert commentary like yours, in layman's terms, on
interesting or important papers that stand out.

~~~
argonaut
The r/MachineLearning subreddit generally has higher-quality technical
discussion than HN, although kastnerkyle's comment is really great.

------
chongli
This is interesting but it's still far from how a human would learn how to
play the game. Humans don't have inbuilt rewards for Montezuma's Revenge, they
acquire them culturally. How much of what was learned (by the machine, not the
researchers) in playing Montezuma's Revenge could be applied to a game like
Zelda? A human would instantly notice many of the connections between the two
games: enemies that follow simple patterns and harm the player on contact,
rooms that connect to one another laid out on a grid pattern, single use
consumable keys that open doors, valuable gems to collect. Is the machine able
to make any of these connections on its own?

~~~
aab0
I think if you watch a child play anything, even a video game, you'll see
that when it's not immediately obvious what to do, a human will just 'mess
around' and 'try stuff out', 'just to see what happens'. As the paper says,
the idea of novelty bonuses in reinforcement learning draws directly from
discussions of curiosity, play, and intrinsic motivation in animals and
humans.

~~~
chongli
Initially, yes, but then as a human learns more about the game they begin to
make plans at a much higher level than just the basic mechanics of moving
around. In the video, the machine continues to die in very basic ways even
after a hundred million iterations. A human who had played the game even a
tiny fraction of that many times would be expected to complete an entire play-
through flawlessly.

And this still doesn't address my original question. A human who was able to
master Montezuma's Revenge would have a dramatic advantage in learning and
mastering the game Zelda compared to somebody who had played neither game
before. What experience, if any, could this machine be expected to bring to
Zelda, assuming no modification by the researchers?

~~~
gwern
You would get some boost if you used an NN factored to have a game-specific
module feeding into a universal game-playing NN:
[https://arxiv.org/abs/1511.06342](https://arxiv.org/abs/1511.06342),
"Actor-Mimic: Deep Multitask and Transfer Reinforcement Learning",
Parisotto et al. 2015 (a rough sketch of the factoring follows the abstract
below):

> The ability to act in multiple environments and transfer previous knowledge
> to new situations can be considered a critical aspect of any intelligent
> agent. Towards this goal, we define a novel method of multitask and transfer
> learning that enables an autonomous agent to learn how to behave in multiple
> tasks simultaneously, and then generalize its knowledge to new domains. This
> method, termed "Actor-Mimic", exploits the use of deep reinforcement
> learning and model compression techniques to train a single policy network
> that learns how to act in a set of distinct tasks by using the guidance of
> several expert teachers. We then show that the representations learnt by the
> deep policy network are capable of generalizing to new tasks with no prior
> expert guidance, speeding up learning in novel environments. Although our
> method can in general be applied to a wide range of problems, we use Atari
> games as a testing environment to demonstrate these methods.
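
A rough sketch of that factoring, per-game encoders feeding one shared
policy trunk, might look like the following. This is my own illustration of
the general idea, not Actor-Mimic's actual code (which trains a single
multitask network to mimic expert DQNs and then transfers it):

```python
import torch
import torch.nn as nn

class FactoredAgent(nn.Module):
    """Per-game input modules feeding a shared, game-agnostic trunk."""
    def __init__(self, game_names, n_actions, feat_dim=256):
        super().__init__()
        # One small CNN encoder per game (the game-specific modules).
        self.encoders = nn.ModuleDict({
            g: nn.Sequential(
                nn.Conv2d(4, 16, kernel_size=8, stride=4), nn.ReLU(),
                nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
                nn.Flatten(),
                nn.LazyLinear(feat_dim), nn.ReLU(),
            ) for g in game_names
        })
        # Shared trunk: the part you hope transfers to unseen games.
        self.policy = nn.Linear(feat_dim, n_actions)

    def forward(self, frames, game):
        return self.policy(self.encoders[game](frames))

# agent = FactoredAgent(["montezuma", "breakout"], n_actions=18)
# q_values = agent(torch.zeros(1, 4, 84, 84), game="montezuma")
```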

------
biztos
On behalf of all who have "played" Montezuma's Revenge for real, let me say we
look forward to the day when it can be outsourced to the Cloud.

[https://en.wikipedia.org/wiki/Montezuma%27s_Revenge](https://en.wikipedia.org/wiki/Montezuma%27s_Revenge)

------
louhike
It might be a interesting future if computers become powerful enough to be
able to run this kind of IA. Then games companies might be able to use
directly a generic IA like this instead of developing specific ones. But I
suppose it will be hard to do and won't happen (if it does) before a long
time.

------
discardorama
So let's say I take the trained network, flip the color of the pixels, and
make some other cosmetic changes (keeping the game intact). Will the
network then solve it on the first try?

~~~
gwern
The ALE agents are almost always using grey-scale pixels in the first place
(to save on processing and NN space requirements), so flipping the color makes
no difference.
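
Concretely, the standard DQN-style preprocessing throws color away before
the network sees anything. A rough sketch (function names are mine; real
pipelines use a proper resize such as bilinear interpolation):

```python
import numpy as np

def to_grayscale(rgb_frame):
    """Luminance-weighted grayscale; ALE frames are (210, 160, 3) uint8."""
    return rgb_frame @ np.array([0.299, 0.587, 0.114])

def resize_nearest(img, h=84, w=84):
    """Naive nearest-neighbor resize down to the DQN input size."""
    ys = np.arange(h) * img.shape[0] // h
    xs = np.arange(w) * img.shape[1] // w
    return img[ys][:, xs]

def preprocess(rgb_frame):
    # Hue information is discarded here, so inverting or swapping
    # colors changes little beyond overall luminance.
    return resize_nearest(to_grayscale(rgb_frame)).astype(np.float32) / 255.0
```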

~~~
seanwilson
Is it learning the specific level layout or can it adapt when the screens are
different and order swapped?

~~~
gwern
DQN is stateless, so at every point it's reacting to just the current frame
(technically, the average of the last 4 frames, IIRC) being fed into a CNN
and outputting a motion. It doesn't care about ordering all that much
because it has no memory. In this case, it's probably learning stuff like
'if there's a key, go towards it' and 'if you have a key, go towards a
barrier'. If each screen keeps working the same way, then I guess it would
achieve about the same performance as it does now. New screens with new
layouts/patterns of enemies will test how well its learned heuristics
generalize, though.

Personally, I think that someone should be trying a DQN with an RNN rather
than CNN in it to see if that helps on the harder levels. Or better yet,
combine it with some of the memory mechanisms and see if it can start doing
some real long-term planning.
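
To make the statelessness concrete, here's a sketch of the Nature-paper DQN
architecture in PyTorch (`n_actions` would come from the ALE action set;
the commented usage is only illustrative):

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Nature-paper DQN: a CNN over 4 stacked 84x84 frames -> Q-values."""
    def __init__(self, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),
        )

    def forward(self, stacked_frames):
        # Every action depends only on the current 4-frame input;
        # nothing earlier is remembered.
        return self.net(stacked_frames)

# frames = torch.zeros(1, 4, 84, 84)  # the most recent 4 frames
# action = DQN(n_actions=18)(frames).argmax(dim=1)
```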

~~~
SandB0x
I believe the input state is the past 4 frames stacked or concatenated.

------
bra-ket
I have a feeling DeepMind AI will specialize in playing video games.

~~~
thisisdave
Here's one impressive counterexample:
[https://arxiv.org/abs/1604.08772](https://arxiv.org/abs/1604.08772)

------
schoen
I'm confused by the subtitle of the linked article, which says "The AI system
was able to solve the complex game in just four tries". But the video shows
the AI dying many more times than that, and not ultimately winning the game,
just learning to explore a portion of it successfully.

What did the Wired editor mean by "solve" and "four tries"? (Or, for that
matter, "complete"?)

~~~
wrsh07
Oh, I think I get it now. I think they mean that in one play [i.e., 4
lives], it solves significantly more of the game than any previous network.

------
pgrote
The text of the article is cut in half for me. Can anyone else confirm the
same rendering behavior?

[http://i.imgur.com/FfcXAmi.png](http://i.imgur.com/FfcXAmi.png)

I am on Windows 7 using Chrome 51.0.2704.79 m and the text issue occurs in
incognito mode as well.

~~~
dragontamer
Win7 using Firefox. Text issue for me; I can't read it.

------
tyingq
Super confused.

Mon·te·zu·ma's re·venge (noun, informal)

 _Diarrhea suffered by travelers, especially visitors to Mexico._

~~~
schoen
It's a video game for the Apple II, named after the illness.

[https://en.wikipedia.org/wiki/Montezuma's_Revenge_%28video_g...](https://en.wikipedia.org/wiki/Montezuma's_Revenge_%28video_game%29)

