Learning ‘Montezuma’s Revenge’ from a single demonstration (blog.openai.com)
151 points by gdb on July 4, 2018 | 45 comments



> In addition, the agent learns to exploit a flaw in the emulator to make a key re-appear at minute 4:25 of the video

After a bit of debugging, this appears to be a very intentional feature in the game rather than a flaw. That key appears after a while if you're not in the room (and don't have one).

Based on this disassembly: http://www.bjars.com/source/Montezuma.asm

Here's the relevant code with some annotations added:

https://goo.gl/VUDr9F

I'm not sure if this is a previously known feature in the game (a quick Google search does not reveal much). It would be quite interesting if the RL agent were the first to find it!

PS: If you launch MAME with the "-debug" option and press CTRL+M you can see the whole memory (the Atari 2600 only has 128 bytes of RAM!!) while playing the game. If you keep an eye on the byte at 0xEA you will know when the key is about to pop up. Alternatively, you can speed things along by changing it yourself to a value just below 0x3F.
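
If you'd rather poke at this from Python than from the MAME debugger, something like the following untested sketch should work with ale-py. The mapping of address 0xEA to RAM index 0x6A is my own inference (ALE exposes the 2600's 128 RAM bytes, which live at addresses 0x80-0xFF, as a 128-element array):

    # Sketch, not tested: watch the key-respawn byte via the Arcade Learning Environment
    # instead of the MAME debugger. Assumes ale-py is installed and you have a
    # Montezuma's Revenge ROM.
    from ale_py import ALEInterface

    ale = ALEInterface()
    ale.loadROM("montezuma_revenge.bin")  # path to your ROM
    ale.reset_game()

    KEY_TIMER = 0xEA - 0x80  # = 0x6A: ALE's RAM array starts at address 0x80

    prev = None
    while not ale.game_over():
        ale.act(0)  # NOOP; a real agent would choose actions here
        val = int(ale.getRAM()[KEY_TIMER])
        if val != prev:
            print(f"byte 0xEA = {val:#04x} (key reportedly pops up around 0x3F)")
            prev = val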


One thing that is striking to me in almost all of these otherwise impressive demonstrations is the apparently bizarre "jitter" movement while waiting for a door to open or a path to clear in the game. Clearly there is no fitness benefit in quietly waiting.

It is darkly humorous to contrast Hollywood's or scifi's "killer AI robots" that methodically hunt you down to these real world demonstrations of emerging AI. Maybe the first "killer AI robots" would exhibit similarly bizarre behaviors while they methodically hunt down the unlikely hero. :-)


This behavior wouldn’t necessarily transfer to the real world because the real world has costs (e.g. energy utilization and hardware damage, both very important in nature) which are not always accurately reflected in these simulations. It brings to mind the example where an agent learned how to make a cheetah “run” while repeatedly banging its head on the ground, which wouldn’t work in the real world for obvious reasons.


Also, for a human there is cognitive load in moving around. If you are safe where you are now and nothing significant changes, it's mentally easier to stay still instead of re-evaluating everything constantly. And this frees your brain to better plan your next move, so it's advantageous. For an AI, CPU power isn't as scarce.

And even with a computer playing a video game (not the real world), your joystick hand gets tired.


I'm not sure about systems older than the NES, but as of the NES era, it was reasonably common for a game's "physics engine" to model player position and velocity at finer values than just the displayed pixels. That is, an x position might consist of one byte for x [in pixels] plus one byte for x_sub [in sub-pixels] and similarly, the minimum non-zero speed could be smaller than 1 px/frame.

If there is some modeling of acceleration, it will be slower to stand and wait right next to a barrier or trap and not start moving until it disappears. Instead, it's best to already have some speed and be positioned so that you're about to run into the barrier on the frame before it disappears. Then on the next frame, when it does disappear, you move into the vacated space without slowing down or taking damage, as the case may be (rather than starting from a stop, as you would have if you stood still while waiting).
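
As a toy illustration of that fixed-point scheme (all numbers here are made up, not taken from any particular game), sub-pixel velocity accumulates like this:

    # Toy sketch of the 8.8 fixed-point scheme described above: one byte of pixels plus
    # one byte of 1/256 sub-pixels, so speeds below 1 px/frame still add up over frames.
    SUBPIXELS = 256                 # one pixel = 256 sub-pixel units

    x = 40 * SUBPIXELS              # start at pixel 40
    velocity = 96                   # 96/256 = 0.375 px/frame, i.e. less than one pixel per frame
    acceleration = 8                # gained each frame while holding a direction

    for frame in range(16):
        velocity = min(velocity + acceleration, 2 * SUBPIXELS)   # cap at 2 px/frame
        x += velocity
        print(f"frame {frame:2d}: pixel {x // SUBPIXELS}, sub-pixel {x % SUBPIXELS}")

Which is why arriving at the barrier with velocity already built up beats waiting next to it from a standstill.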

If you're not already familiar with tool-assisted speedruns, I highly recommend tasvideos.org

In some TASes, the player will move/shoot/jump/etc. in ways that seem bizarre but are actually done to favorably affect the pseudorandom number generator (e.g. prevent enemies from spawning or shooting, so the additional sprites don't cause lag frames). Since the goal of this bot was score, not speed, there's probably not much of that going on here. I really like the move at 0:35 in the top video for Montezuma's Revenge, where the bot climbs up to the next screen and then immediately back down, presumably to reset the position of the green bug thing and get on the left side of it. That reminded me of some ladder action I've seen in Mega Man TASes.


> In some TASes, the player will move/shoot/jump/etc. in ways that seem bizarre but are actually done to favorably affect the pseudorandom number generator

And then there’s movements like the arm pumping in Super Metroid, where it’s done “because you can” while performing long stretches of movement that aren’t very entertaining/complex on their own. (One might say that it’s done for a demonstration of skill when human players do it, but there’s no similar justification for a TAS doing it.)


https://wiki.supermetroid.run/Arm_Pumping

Arm pumping actually moves Samus forward a pixel each time. It was considered "too hard" for human runners, until I think hotarubi's 0:32 in 2006.


Okay, bad example, I guess. There’s still the entire “playaround” TAS category, mostly featuring TASes that will be the same length no matter what you do (e.g. Gradius, Brain Age.)


> It is darkly humorous to contrast Hollywood's or scifi's "killer AI robots" that methodically hunt you down to these real world demonstrations of emerging AI. Maybe the first "killer AI robots" would exhibit similarly bizarre behaviors while they methodically hunt down the unlikely hero. :-)

That sounds very much like ED-209 in RoboCop (a film whose satire and themes in general are far more on-point and prescient than one might expect from the premise).


Sounds kind of like fly-by-wire, where more maneuverability is gained by being in a constant state of instability.

On the other hand, if they are trained to optimize for energy usage, staying still when movement isn't needed can be an advantage.


Sorry for being pedantic, but that's more a property of certain airframe designs, not a property of fly-by-wire. Planes can be designed that are inherently less stable, with the instability counteracted using software, in order to produce more maneuverability. (Presumably these designs would be less viable if a human had to deal with the instability.)

But you can have a mundane, stable aircraft that is fly-by-wire as well. Like the Airbus A320.


Entirely speculation, but that is typically what you're left with when trying to find a cause for an RL/NN model's behavior :)

If you consider the action space, removing the 'do-nothing' action could provide a learning benefit. Consider a set of models that happen to set the do-nothing action's weights to zero, but manage to achieve a similar effect using quick left/right movement. Perhaps these models train slightly better, meaning that in the same number of iteration steps they reach a better score than the models that do use the do-nothing action.
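
(For reference, the 'do-nothing' choice is the NOOP action in ALE's Atari action set; a rough sketch of how one might test this speculation by masking it out, assuming gymnasium + ale-py are installed:)

    # Sketch: list the Atari action meanings and build a reduced action set without NOOP.
    import gymnasium as gym

    env = gym.make("ALE/MontezumaRevenge-v5")
    meanings = env.unwrapped.get_action_meanings()
    print(meanings)  # typically starts with 'NOOP', 'FIRE', 'UP', 'RIGHT', 'LEFT', ...

    # Hypothetical "no do-nothing" action space: every action index except NOOP.
    non_noop_actions = [i for i, m in enumerate(meanings) if m != "NOOP"]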

Checking the video, you do see the person waiting from time to time. Perhaps this is an artifact of its demonstration learning episodes?

I do not see their Dota2 bot do this jitter movement. (Interestingly, the official/vanilla Dota2 bot does have this jitter!) This is likely because there is a benefit in being economical in your movements in that game: turning takes time. I postulate that an OpenAI bot for League of Legends, where turning is instant and free, would exhibit the jitter movements ;)

edit: Alternatively, inspired by the 'fly-by-wire' sibling comment: maybe spamming the emulator with left/right actions does provide a slight benefit. It wouldn't be the first time an AI finds video game exploits[1].

[1] https://arstechnica.com/gaming/2013/04/this-ai-solves-super-...


Counterpoint:

Real humans that play StarCraft often have bizarrely high Actions Per Minute (APM) scores in the 200-300 range, and often most of that movement is spamming the same commands over and over.


Which suggests the problem is a bad command-interface, rather than the decision mechanism.

Consider how that clicking would plunge if there were a single "kite move" command for units.


StarCraft is a strategy game but it's also an action sport. The "how" of micro actions is largely the sport itself, both from a decision-making and execution standpoint. Imagine a footballer having a "kick goal" or a "kick ball past goalie" button -- it'd largely defeat the purpose.

I believe the command spamming /u/tomlock is referring to is the phenomenon where players issue as many or more commands when doing nothing as when an intense sequence is happening. This is primarily done to keep rhythm.


Not necessarily, it could just be that we don't understand the mechanism of making fast decisions well enough. Many players with high APMs anecdotally report that it helps them maintain a quick response time. It could be that the jittering in an AI's command inputs is aligning to some quality of fast decision making that we don't understand, but the AI intuits.


Perhaps I wasn't clear: I'm saying that there is at least one contributing factor to humans clicking a lot, and that factor is NOT attributable to the human/computer player's thinking or decision process.

Rather, a fair portion of that churn is due to an inefficient communication system or control structure between the player and the game.

It's analogous to pitting an android against a human at racing an old car: Both of them will have a high volume of activity with their left foot on the clutch and right hand changing gears, but it doesn't reveal anything amazing about their thought-processes. It's simply what manual-transmission cars require in order to accomplish a certain task.


>Rather, a fair portion of that churn is due to an inefficient communication system or control structure between the player and the game.

I'm not sure what this statement is based on.


This reminds me of professional StarCraft players. Watch how much NaDa rapidly hits keys and moves the mouse around even as very little is happening in the game [1].

[1] https://youtu.be/EfkQ-JmBJ5o


FiGAR, a simple enhancement to the deep-net heads behind the RL mechanism, fixes the jitter for most Atari games as far as I have tested it. See https://openreview.net/forum?id=B1GOWV5eg and https://github.com/jolibrain/manette for code and results (I'm not the author of the paper).


That makes me think of microsaccades. Our eyes never really stop jittering either.


"By multiplying N of these probabilities together, we end up with the resulting probability p(get key) that is exponentially smaller than any of the individual input probabilities."

So they solved this by feeding the AI with a human demonstration, but have there been any attempts at giving the AI an explicit reward for maximising the "novelty" of the input state (i.e. the image on the screen)?

The game does not give the player points for reaching new rooms, but if the AI was rewarded for producing the "novel" state of a new room, then that would give it a drive to explore. Similarly, there would be an implicit penalty to the AI for repeatedly falling off a ledge or returning back to a room it had already visited (although some amount of back-tracking would no doubt be useful), whereas reaching a new part of the screen (by climbing a ladder, say) would be rewarded.

There are times when the AI would have to be patient and wait, but the waiting window could be learned or set as a hyper-parameter. This might be enough to stop the unproductive behaviour of jittering left and right continuously, since doing so does not produce a new state, relative to just standing still at least.
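
A minimal sketch of what such a bonus could look like (roughly the count-based flavour from the exploration literature; the downsampling scheme and constants are arbitrary choices on my part):

    # Minimal sketch of a novelty bonus: hash a downsampled, quantized frame and pay an
    # intrinsic reward that shrinks with the visit count, added on top of the game score.
    from collections import Counter
    import numpy as np

    visit_counts = Counter()

    def novelty_bonus(frame: np.ndarray, scale: float = 0.01) -> float:
        small = frame[::14, ::10] // 32   # ~15x16 grid, 8 gray levels, so similar frames collide
        key = small.tobytes()
        visit_counts[key] += 1
        return scale / np.sqrt(visit_counts[key])

    # total_reward = game_reward + novelty_bonus(observation)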


There has been a fair bit of past work exploring the idea you described (examples: https://arxiv.org/pdf/1606.01868.pdf, https://arxiv.org/pdf/1703.01310.pdf, https://pathak22.github.io/noreward-rl/resources/icml17.pdf, https://openreview.net/forum?id=H1RPJf5Tz). Such methods can't yet solve games like Montezuma's Revenge to a comparable level of performance, but I'm sure they'll eventually get there.


As the sibling comment says, there is work on this, but the key question becomes "how do you define novelty?" You could say new rooms are novel, but this is basically just the engineering approach of defining a new reward, and isn't really an interesting solution since it's not really applicable to other RL use cases, or even other games.


In the worst case, the AI could store every single input it has received (for example, every frame it has seen) and calculate how similar each new input is to its past corpus.

Calculating similarity of images is quite a well-understood problem, but you're right that generalising the idea of similarity across all types of input data, in a way that is helpful for the AI to learn from, and efficient to calculate, may end up requiring a lot of coding that's specific to the individual use case.
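
For what it's worth, the literal version of that is easy to sketch, and the sketch also shows why it gets expensive (everything here is illustrative):

    # Literal "store every frame and compare" novelty: score a new frame by its distance
    # to the nearest frame seen so far. Works, but the cost grows with the corpus, which
    # is why practical methods hash or learn a compact embedding instead.
    import numpy as np

    corpus = []  # downsampled copies of every frame seen so far

    def nearest_neighbour_novelty(frame: np.ndarray) -> float:
        small = frame[::8, ::8].astype(np.float32)   # crude downsample
        if not corpus:
            corpus.append(small)
            return 1.0
        nearest = min(float(np.mean(np.abs(small - past))) for past in corpus)  # O(len(corpus))
        corpus.append(small)
        return nearest / 255.0   # 0 = identical to something seen, ~1 = maximally new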


> and isn't really an interesting solution since it's not really applicable to other RL use cases, or even other games.

I think it applies to many games: you typically progress through different screens/levels/whatnot on your way to completing the game.


I thought this was going to be very different than what it was.


I was pleasantly surprised that it was about the first thing that popped into my mind (having played the game a lot as a kid on a Coleco).


I played it on the Apple ][ -- it seemed to be ported to many platforms. But it is a fairly obscure game -- it makes me wonder why they picked it rather than a more common one.


It was picked because it is difficult to train a reinforcement learning model to play it well. In most other games you can create a reward function based on the score or something similar, and then the AI can explore possible actions that give the best score. In those cases AI players are already doing quite well. In this case, finding the key requires long-term planning to get an actual reward, and the AI has previously gotten stuck before that.


A computer that has diarrhea? Now that’s novel.


Also read the source domain differently, subliminally.


Montezuma's Revenge is one of the more impressive "3D conversions" done by our Apple II emulator [1]

[1] https://paleotronic.com/wp-content/uploads/2018/05/5.png


It's an impressive achievement, but it does seem to get stuck at times, like from around 1:35 to 2:10 and from 3:45 to 4:30 (irritatingly on the edge of two screens). That second time actually resulted in a new key showing up, which the article says was a flaw in the emulation that the agent was exploiting.

Interesting that their approach didn't work for Pitfall (never played Gravitar).


Interestingly, it's actually not a flaw; the key appears after a while if you're not in the room (and don't have one):

https://news.ycombinator.com/item?id=17460392

I assume the agent somehow found this out and developed the behavior of going in and out of the room until the key shows up (which, with enough agent randomness, it apparently will).


If I'm understanding this right, the AI wasn't given a "full demonstration" of the game, but specific frame snapshots at goal completion points. So it basically learned how to get from goal A to goal B, but it had to be given examples of what goal A and goal B looked like visually.

In other words, it was shown what beating the game would look like at some level of granularity. I guess the next obvious question is how far up you could dial the granularity and still have the AI learn how to beat the game.


I read it as: "single demonstration" means they had a single run's worth of data that they could train against for as long as they wanted.

So they took that single play-through, chopped it up by room, and trained each room in reverse.
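
Roughly, as I read the post, the loop looks something like this (pseudocode; the restoreState call is ALE's real state API, but run_episode, train, demo_return_from and the thresholds are placeholders of mine, not details from the post):

    # Pseudocode sketch of reset-from-demonstration: start episodes from a state late in
    # the single demo and move the starting point earlier once the agent reliably matches
    # the demonstrator from there.
    demo_states = [...]                      # emulator states saved along the one demonstration
    start = len(demo_states) - 1

    while start >= 0:
        successes = 0
        for _ in range(EPISODES_PER_STAGE):
            ale.restoreState(demo_states[start])        # reset into the demo, not to the title screen
            episode_return = run_episode(ale, policy)   # ordinary RL from this point on
            train(policy)
            if episode_return >= demo_return_from(start):
                successes += 1
        if successes / EPISODES_PER_STAGE > 0.8:        # reliable from here...
            start -= 1                                  # ...so start a little earlier next time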


>> The exploration problem can largely be bypassed in Montezuma’s Revenge by starting each RL episode by resetting from a state in a demonstration. By starting from demonstration states, the agent needs to perform much less exploration to learn to play the game compared to when it starts from the beginning of the game at every episode. Doing so enables us to disentangle exploration and learning.

Or in other words: use the Domain Knowledge, Luke. Quit trying to learn everything from scratch. Because that's just dumb.


Interesting, though, that they seem to be using the demonstration only for initial states and not for action choices. It's like using an example of solving a maze just to get a bunch of places to start exploring from, but not to actually try to copy someone's "turn right at every corner" strategy. The use of domain knowledge is actually pretty limited in that sense.


> Our agent playing Montezuma’s Revenge. The agent achieves a final score of 74,500 over approximately 12 minutes of play (video is double speed). Although much of the agent’s game mirrors our demonstration, the agent surpasses the demonstration score of 71,500 by picking up more diamonds along the way.

How well would this adapt if the map/layout changed then?


Please call me again when they actually solve the exploration problem instead of falling back to a good example.

People beating this game do not do it based on a let's play video.


And no human was ever able to play Montezuma's Revenge given only a let's-play video. Usually it requires 5-10 years of accumulating prior information. It is still a remarkable achievement.


This is highly significant because it was one of the Atari games that the original deep Q-learning model could not beat.


DQN is solving a model-free RL problem. This method is not. You're not allowed to reset states in model-free RL. If you have access to the model/simulator which allows you to reset, you might as well use a model-based method like MCTS.


Not being familiar with the game, I thought Montezuma's Revenge was something entirely different.

https://en.wikipedia.org/wiki/Traveler%27s_diarrhea



