
Learning ‘Montezuma’s Revenge’ from a single demonstration - gdb
https://blog.openai.com/learning-montezumas-revenge-from-a-single-demonstration/
======
rsf
> In addition, the agent learns to exploit a flaw in the emulator to make a
> key re-appear at minute 4:25 of the video

After a bit of debugging, this appears to be a very intentional feature in the
game rather than a flaw. That key appears after a while if you're not in the
room (and don't have one).

Based on this disassembly:
[http://www.bjars.com/source/Montezuma.asm](http://www.bjars.com/source/Montezuma.asm)

Here's the relevant code with some annotations added:

[https://goo.gl/VUDr9F](https://goo.gl/VUDr9F)

I'm not sure if this is a previously known feature in the game (a quick google
search does not reveal much). It would be quite interesting if the RL agent
was the first to find it!

PS: If you launch MAME with the "-debug" option and press CTRL+M, you can see
the whole memory (the Atari 2600 only has 128 bytes!!) while playing the game.
If you keep an eye on the byte at 0xEA, you will know when the key is about to
pop up. Alternatively, you can speed things along by changing it yourself to a
value just below 0x3F.
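
For the curious, the same trick works from Python via the Arcade Learning
Environment instead of MAME's debugger (a sketch; it assumes ale-py and a
local ROM file whose name here is made up; the 2600 maps its 128 bytes of RAM
at 0x80-0xFF, so address 0xEA is index 0x6A in ALE's getRAM() array):

    import random
    from ale_py import ALEInterface

    ale = ALEInterface()
    ale.loadROM("montezuma_revenge.bin")  # ROM path is an assumption
    actions = ale.getLegalActionSet()
    KEY_TIMER = 0xEA - 0x80  # 2600 address 0xEA -> RAM index 0x6A

    while not ale.game_over():
        ale.act(random.choice(actions))
        timer = ale.getRAM()[KEY_TIMER]
        if timer >= 0x3C:  # approaching the 0x3F threshold
            print("key about to reappear, timer =", hex(int(timer)))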

------
bryanh
One thing that is striking to me in almost all of these otherwise impressive
demonstrations is the apparently bizarre "jitter" movement while waiting for a
door to open or a path to clear in the game. Clearly there is no fitness gain
in quietly waiting.

It is darkly humorous to contrast Hollywood's or scifi's "killer AI robots"
that methodically hunt you down to these real world demonstrations of emerging
AI. Maybe the first "killer AI robots" would exhibit similarly bizarre
behaviors while they methodically hunt down the unlikely hero. :-)

~~~
throwaway37585
This behavior wouldn’t necessarily transfer to the real world because the real
world has costs (e.g. energy utilization and hardware damage, both very
important in nature) which are not always accurately reflected in these
simulations. It brings to mind the example where an agent learned how to make
a cheetah “run” while repeatedly banging its head on the ground, which
wouldn’t work in the real world for obvious reasons.
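
A standard fix in simulated control is to fold an energy term into the reward
so that flailing is no longer free (a sketch; the quadratic penalty and the
coefficient are arbitrary choices):

    import numpy as np

    def shaped_reward(task_reward, action, ctrl_cost=0.1):
        # Quadratic penalty on actuator effort, so thrashing costs something
        return task_reward - ctrl_cost * np.square(action).sum()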

~~~
adrianmonk
Also, for a human there is cognitive load in moving around. If you are safe
where you are now and nothing significant changes, it's mentally easier to
stay still instead of re-evaluating everything constantly. And this frees your
brain to better plan your next move, so it's advantageous. For an AI, CPU
power isn't as scarce.

And even with a computer playing a video game (not the real world), your
joystick hand gets tired.

------
dane-pgp
"By multiplying N of these probabilities together, we end up with the
resulting probability p(get key) that is exponentially smaller than any of the
individual input probabilities."

So they solved this by feeding the AI with a human demonstration, but have
there been any attempts at giving the AI an explicit reward for maximising the
"novelty" of the input state (i.e. the image on the screen)?

The game does not give the player points for reaching new rooms, but if the AI
was rewarded for producing the "novel" state of a new room, then that would
give it a drive to explore. Similarly, there would be an implicit penalty to
the AI for repeatedly falling off a ledge or returning back to a room it had
already visited (although some amount of back-tracking would no doubt be
useful), whereas reaching a new part of the screen (by climbing a ladder, say)
would be rewarded.

There are times when the AI would have to be patient and wait, but the waiting
window could be learned or set as a hyper-parameter. This might be enough to
stop the unproductive behaviour of jittering left and right continuously,
since doing so does not produce a new state, at least relative to just
standing still.
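
Something like a count-based bonus would do (a rough sketch; the hashing and
the coefficient are arbitrary choices, not from the article):

    import numpy as np

    counts = {}

    def novelty_bonus(frame, beta=0.1):
        # Coarsely discretize the screen so near-identical frames
        # (e.g., jittering in place) collapse to the same key
        key = (frame[::8, ::8] // 32).tobytes()
        counts[key] = counts.get(key, 0) + 1
        return beta / np.sqrt(counts[key])

    # total_reward = game_reward + novelty_bonus(screen)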

~~~
Eridrus
As the sibling comment says, there is work on this, but the key question
becomes "how do you define novelty?" You could say new rooms are novel, but
that is basically just the engineering approach of defining a new reward, and
isn't a very interesting solution, since it's not applicable to other RL use
cases, or even other games.

~~~
dane-pgp
In the worst case, the AI could store every single input it has received (for
example, every frame it has seen) and calculate how similar each new input is
to its past corpus.

Calculating similarity of images is quite a well-understood problem, but
you're right that generalising the idea of similarity across all types of
input data, in a way that is helpful for the AI to learn from, and efficient
to calculate, may end up requiring a lot of coding that's specific to the
individual use case.
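
Even the brute-force version is simple to sketch for the image case (the
downsampling factor and the distance metric are arbitrary choices):

    import numpy as np

    corpus = []

    def novelty(frame):
        small = frame[::8, ::8].astype(np.float32).ravel()
        # Distance to the nearest stored frame; the very first frame
        # is maximally novel. O(N) per step, hence "worst case".
        score = min((float(np.linalg.norm(small - past)) for past in corpus),
                    default=float("inf"))
        corpus.append(small)
        return score  # large distance = novel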

------
jwcrux
I thought this was going to be very different from what it was.

~~~
seanmcdirmid
I was pleasantly surprised that this was about the first thing that popped
into my mind (having played the game a lot as a kid on a Coleco).

~~~
jhbadger
I played it on the Apple ][ -- it seemed to be ported to many platforms. But
it is a fairly obscure game -- it makes me wonder why they picked it rather
than a more common one.

~~~
jonrei
It was picked because it is difficult to train a reinforcement learning model
to play it well. In most other games you can create a reward function based on
the score or something similar, and then the AI can explore possible actions
that give the best score. In those cases AI players are already doing quite
well. In this case, finding the key requires long-term planning before any
actual reward arrives, and AI agents have previously gotten stuck before that
point.

------
empressplay
Montezuma's Revenge is one of the more impressive "3D conversions" done by our
Apple II emulator. [1]

[1] [https://paleotronic.com/wp-content/uploads/2018/05/5.png](https://paleotronic.com/wp-content/uploads/2018/05/5.png)

------
goatlover
It's an impressive achievement, but the agent does seem to get stuck at times,
like from around 1:35 to 2:10 and again from 3:45 to 4:30 (irritatingly, on
the edge of two screens). That second time actually resulted in a new key
showing up, which the article says was an emulation flaw the agent was
exploiting.

Interesting that their approach didn't work for Pitfall (never played
Gravitar).

~~~
rsf
Interestingly, it's actually not a flaw: the key appears after a while if
you're not in the room (and don't have one):

[https://news.ycombinator.com/item?id=17460392](https://news.ycombinator.com/item?id=17460392)

I assume the agent somehow found this out and developed the behavior of going
in and out of the room until the key shows up (which, with enough randomness
in the agent's actions, it apparently will).

------
gfodor
If I'm understanding this right, the AI wasn't given a "full demonstration" of
the game, but specific frame snapshots at goal completion points. So it
basically learned how to get from goal A to goal B, but it had to be given
examples of what goal A and goal B looked like visually.

In other words, it was shown what beating the game would _look like_ at some
level of granularity. I guess the next obvious question is how far you could
dial up the granularity and still have the AI learn how to beat the game.

~~~
Groxx
I read it as: "single demonstration" means they had a single run's worth of
data that they could train against for as long as they wanted.

So they took that single play-through, chopped it up by room, and trained each
room in reverse.

------
YeGoblynQueenne
>> The exploration problem can largely be bypassed in Montezuma’s Revenge by
starting each RL episode by resetting from a state in a demonstration. By
starting from demonstration states, the agent needs to perform much less
exploration to learn to play the game compared to when it starts from the
beginning of the game at every episode. Doing so enables us to disentangle
exploration and learning.

Or in other words: use the Domain Knowledge, Luke. Quit trying to learn
everything from scratch. Because that's just dumb.

~~~
radarsat1
Interesting, though, that they seem to be using the demonstration only for
initial states and _not_ for action choices. It's like using an example of
solving a maze just to get a bunch of places to start exploring from, but not
to actually try and copy someone's "turn right at every corner" strategy. The
use of domain knowledge is actually pretty limited in that sense.
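
Schematically the trick is something like this (a sketch; the names are
illustrative, not OpenAI's actual code, and it assumes an emulator that can
save and restore states):

    def reverse_curriculum(env, demo_states, run_episode, rl_update,
                           success_rate, step_back=5, threshold=0.2):
        # Start episodes from states saved along a single demonstration,
        # moving the start point earlier once the agent reliably finishes
        # from the current start.
        start = len(demo_states) - 1  # begin near the end of the demo
        while start >= 0:
            env.restore_state(demo_states[start])
            rl_update(run_episode(env))  # ordinary RL from this state on
            if success_rate() > threshold:
                start -= step_back  # agent succeeds here; start earlier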

------
seanwilson
> Our agent playing Montezuma’s Revenge. The agent achieves a final score of
> 74,500 over approximately 12 minutes of play (video is double speed).
> Although much of the agent’s game mirrors our demonstration, the agent
> surpasses the demonstration score of 71,500 by picking up more diamonds
> along the way.

How well would this adapt if the map/layout changed then?

------
AstralStorm
Please call me again when they actually solve the exploration problem instead
of falling back on a good example.

People who beat this game do not do it based on a Let's Play video.

~~~
alexcnwy
This is highly significant because it was one of the Atari games that the
original Deep Q-learning model could not beat.

~~~
backpropaganda
DQN is solving a model-free RL problem. This method is not. You're not allowed
to reset states in model-free RL. If you have access to the model/simulator
which allows you to reset, you might as well use a model-based method like
MCTS.
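
To make the contrast concrete: reset access already buys you something like
flat Monte Carlo search, a simpler cousin of MCTS (a sketch;
clone_state/restore_state/act stand in for whatever the simulator actually
provides):

    import random

    def plan(env, actions, rollouts=10, depth=20):
        # Try each action from a snapshot, estimate its value with a few
        # random rollouts, and return the best-scoring action.
        root = env.clone_state()
        scores = {}
        for a in actions:
            total = 0.0
            for _ in range(rollouts):
                env.restore_state(root)
                total += env.act(a)  # assume act() returns a reward
                for _ in range(depth):
                    total += env.act(random.choice(actions))
            scores[a] = total / rollouts
        env.restore_state(root)
        return max(scores, key=scores.get)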

------
Johnny555
Not being familiar with the game, I thought Montezuma's Revenge was something
entirely different.

[https://en.wikipedia.org/wiki/Traveler%27s_diarrhea](https://en.wikipedia.org/wiki/Traveler%27s_diarrhea)

