
Montezuma’s Revenge Solved by Go-Explore - henning
http://eng.uber.com/go-explore/
======
reidjs
For those of you who are confused like I was: this is referring to a video
game, not stomach issues from travel.
https://en.m.wikipedia.org/wiki/Montezuma's_Revenge_(video_game)

------
taliesinb
This is very cool work. Well done to the team!

“Mapping” the state space has seemed like a really exciting direction to me
and I’ve been puzzled to see so little work in this vein (I’d be interested if
anyone knows of other examples).

Just to add, Kenneth, if you are reading: we met at GECCO 2017, hi again, and
really nice work! (I remember first learning about MAP-Elites there, a simple
yet beautiful idea from another of the co-authors. Satisfying to see it play a
role in RL!)

------
yazr
The reddit discussion is

https://old.reddit.com/r/MachineLearning/comments/a0nnp7/r_montezumas_revenge_solved_by_goexplore_a_new/

Some skepticism in the comments, since the algorithm assumes you can "return
to a previous state" (e.g. with a simulator). So it could be interesting for
robotics, games, etc.
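
For reference, Atari emulators make this trivial via state snapshots. A
minimal sketch, assuming the ALE Python bindings (cloneState/restoreState
are real calls; the ROM path is illustrative):

    import random
    from ale_py import ALEInterface

    ale = ALEInterface()
    ale.loadROM("montezuma_revenge.bin")  # path is illustrative
    actions = ale.getLegalActionSet()

    snapshot = ale.cloneState()         # save the full emulator state
    for _ in range(100):                # wander off with random actions
        ale.act(random.choice(actions))
    ale.restoreState(snapshot)          # instantly back, no replay needed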

~~~
taliesinb
State reset is the kind of unique thing computers can do that we should be
exploring and developing, rather than shying away from.

------
gambler
From what I'm reading, their initial algorithm doesn't use neural networks at
all. It's something that resembles MCTS.

What I don't understand from the article: does their "robust" version run on
the policy network alone, or does it still rely on the resetting "world
simulation"?

~~~
joosthuizinga
Correct. For the results presented here, we do not use a neural network during
the exploration phase of the algorithm.
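
To give a rough idea of that phase, here is a heavily simplified sketch; the
cell representation, the selection rule, and the env interface are
illustrative stand-ins, not our exact implementation:

    import random

    archive = {}  # cell -> (emulator snapshot, trajectory, score)

    def to_cell(frame):
        # Conflate similar frames so the archive stays small; the real
        # representation is a coarse, downsampled image. Assumes a flat
        # uint8 frame here.
        return bytes(frame[::16])

    def go_explore_step(env, steps=100):
        # Assumes the archive was seeded with the initial state.
        snapshot, trajectory, score = random.choice(list(archive.values()))
        env.restore(snapshot)                  # "go": return to the cell
        for _ in range(steps):                 # "explore": random actions
            action = random.choice(env.actions)
            frame, reward = env.step(action)
            trajectory, score = trajectory + [action], score + reward
            cell = to_cell(frame)
            if cell not in archive or score > archive[cell][2]:
                archive[cell] = (env.snapshot(), trajectory, score)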

However, we train a neural network during the robustification phase, and it is
this trained network that actually plays the game in the end. This final
policy is much more robust than the deterministic trajectories collected
earlier, and it is able to deal with slight changes in the environment.
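
As a rough illustration of that idea (not our actual training setup: plain
behavioral cloning and the network shapes below are simplified stand-ins):

    import torch
    import torch.nn as nn

    # Toy policy: 128-d RAM observations in, 18 Atari actions out.
    policy = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 18))
    opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    def robustify(demos, epochs=10):
        # demos: (observation, action) pairs from the best trajectories
        # found during the exploration phase.
        for _ in range(epochs):
            for obs, act in demos:
                logits = policy(torch.as_tensor(obs, dtype=torch.float32))
                loss = loss_fn(logits.unsqueeze(0), torch.tensor([act]))
                opt.zero_grad()
                loss.backward()
                opt.step()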

Note that, while we have not tested it yet, you can also run this algorithm
by immediately training a policy to reach different states. This variant
would be slower (as you now need to train a network during the exploration
phase), but it would be directly applicable to non-deterministic
environments.

~~~
henning
Will the forthcoming paper clarify details like this? Thanks for sharing this
exciting result.

------
spirosbax
Very impressive!

