I found this passage very interesting, as it reads like a definition of the benefits of delayed gratification: optimizing for a local maximum isn't always the best way to reach the global maximum. Maybe algorithms that can capture that will be useful in getting corporations away from the habit of chasing the highest short-term gains at the expense of long-term viability.
I don't like bad-mouthing the work of others, so I feel bad for saying this, and maybe someone can prove me wrong: but is this not borderline "cheating" (in quotes because obviously there are no agreed-upon rules)?
The goal never was to solve Montezuma's revenge just for the sake of it. We could have done that way before by using hard engineering.
The interesting thing about this game is that it reveals inherent flaws in our reinforcement learning approach. And so, if you manage to solve this game using a domain-transferable approach, surely that means you have made some significant progress in RL.
That doesn't seem to be the case here?
The most efficient way to learn real-world tasks (say, for a robot) is to learn in a realistic simulated environment, ideally one the agent can control, e.g. rewinding state after a failure to correct that particular behaviour. Ideally the simulation is regularly calibrated against real experiments, including confidence intervals on the real and simulated sensor inputs.
Given that model of learning, Uber's approach here is very reasonable.
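For what it's worth, the "return, then explore" loop that exploits such a rewindable simulator can be sketched in a few lines. Everything below is a toy stand-in, not Uber's actual code: `SimEnv` is a hypothetical 1-D corridor whose full state can be saved and restored, and `cell_of` is a hypothetical coarse state discretisation.

```python
import random

class SimEnv:
    """Toy corridor: move left/right along positions 0..10, reward at the end.
    get_state/set_state stand in for emulator save-state and restore."""
    def __init__(self):
        self.pos = 0

    def step(self, action):                  # action in {-1, +1}
        self.pos = min(10, max(0, self.pos + action))
        reward = 1.0 if self.pos == 10 else 0.0
        return self.pos, reward

    def get_state(self):
        return self.pos

    def set_state(self, state):
        self.pos = state


def cell_of(obs):
    """Map an observation to a coarse 'cell' (here: the position itself)."""
    return obs


def go_explore(iterations=1000, explore_steps=5, seed=0):
    rng = random.Random(seed)
    env = SimEnv()
    # Archive: cell -> (saved simulator state, best score seen reaching it)
    archive = {cell_of(env.get_state()): (env.get_state(), 0.0)}
    best_score = 0.0
    for _ in range(iterations):
        # 1) Pick a cell from the archive and *return* to it exactly.
        cell = rng.choice(list(archive))
        state, score = archive[cell]
        env.set_state(state)
        # 2) *Explore* from there with random actions.
        for _ in range(explore_steps):
            obs, reward = env.step(rng.choice([-1, 1]))
            score += reward
            c = cell_of(obs)
            # 3) Archive newly reached cells, or better routes to known ones.
            if c not in archive or score > archive[c][1]:
                archive[c] = (env.get_state(), score)
            best_score = max(best_score, score)
    return best_score, len(archive)
```

The point of the sketch is that nothing here requires the environment to be stochastic-proof or learned: the archive plus exact restore does the heavy lifting, which is exactly why it maps so naturally onto a controllable simulator.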
This solution does not seem to bring us much closer to the possibly unsolvable problem of finding an ultimate reward for RL, as it is very specific to the game.
If they manage to get that behavior to emerge from something less explicit, though, that would be a huge achievement.
> I was sour on the results themselves, because they smelled too much like PR, like a result that was shaped by PR, warped in a way that preferred flashy numbers too much and applicability too little.
Harsh! OpenReview's double blind system seems to work quite well in this regard as a peer review mechanism.
Quick Opinions on Go-Explore
> Like Go-Explore, this post had interesting ideas that I hadn’t seen before, which is everything you could want out of research. And like Go-Explore, I was sour on the results themselves, because they smelled too much like PR, like a result that was shaped by PR, warped in a way that preferred flashy numbers too much and applicability too little.
From the linked page:
> To enable the community to benefit from Go-Explore and help investigate its potential, source code and a full paper describing Go-Explore will be available here shortly.
Tangent: I notice I find it really annoying whenever an internet article talks about a blog post or other article, yet doesn't link to the source on the spot. Take this sentence from technologyreview's article:
> The approach leads to some interesting practical applications, Clune and his team write in a blog post released today
There is no excuse to have a sentence like this and not have "a blog post" be a hyperlink. It feels rude somehow, like it's breaking internet etiquette.
I played Pitfall! as a kid and it seems quite straightforward for a computer to solve... jump over the puddle. I'd like it if someone could talk more about this game in particular, specifically about why it's so hard for AI to solve. Any interesting paper/link on the subject?
1) Train a model to predict what happens next given an input
2) Each frame, predict the next frames given all possible inputs
3) Choose the input that maximizes uncertainty
I would expect this to learn to avoid deaths relatively quickly. It doesn’t need to be good at knowing what will happen next, just better at recognizing specific dead ends (e.g. spikes or holes).
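Steps 2 and 3 above can be sketched using ensemble disagreement as the uncertainty signal, which is one common stand-in for a learned model's confidence. Everything here is made up for illustration: a real version would predict frames with the model trained in step 1, and `NoisyPredictor` is a hypothetical dynamics model whose ensemble members are each slightly wrong in their own way.

```python
import random
import statistics

class NoisyPredictor:
    """One ensemble member's next-state model; the per-member bias stands
    in for the differing weights a real training run would produce."""
    def __init__(self, rng):
        self.bias = rng.uniform(-0.5, 0.5)

    def predict(self, state, action):
        return state + action * (1 + self.bias)


def uncertainty(ensemble, state, action):
    """Step 2: disagreement (stdev) across the ensemble's predictions."""
    preds = [m.predict(state, action) for m in ensemble]
    return statistics.stdev(preds)


def pick_action(ensemble, state, actions):
    """Step 3: choose the action whose outcome the models disagree on most."""
    return max(actions, key=lambda a: uncertainty(ensemble, state, a))
```

Note this only covers the exploration half of the comment's argument; the death-avoidance half would follow because a well-trained ensemble agrees confidently on what spikes and holes do, giving those actions low uncertainty.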
> AI researchers have typically tried to get around the issues posed by Montezuma's Revenge and Pitfall! by instructing reinforcement-learning algorithms to explore randomly at times, while adding rewards for exploration—what's known as "intrinsic motivation."
> But the Uber researchers believe this fails to capture an important aspect of human curiosity. "We hypothesize that a major weakness of current intrinsic motivation algorithms is detachment," they write, "wherein the algorithms forget about promising areas they have visited, meaning they do not return to them to see if they lead to new states."
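The "rewards for exploration" described there can be as simple as a count-based novelty bonus added on top of the game's own score. This is a generic illustration of the idea, not any team's specific implementation: the `1/sqrt(n)` decay and the assumption that states can be hashed into discrete keys are both my own simplifications.

```python
import math
from collections import Counter

class IntrinsicBonus:
    """Count-based exploration bonus: novel states pay the most,
    and the bonus decays each time a state is revisited."""
    def __init__(self, scale=1.0):
        self.counts = Counter()
        self.scale = scale

    def reward(self, state):
        self.counts[state] += 1
        return self.scale / math.sqrt(self.counts[state])
```

The detachment failure mode the researchers describe fits this picture: once a promising state's count is high, its bonus has decayed, so nothing pulls the agent back there even if unexplored territory lies just beyond it.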
An AI tries to figure out how to win at nuclear war, then concludes that:
"Nuclear war is a strange game in which the only winning move is not to play."
Opinion: Equally true in non-incubator-assisted entrepreneurship... that is, real entrepreneurship...
Should they tell investors to patiently wait while some academic comes up with a solution to their problem?
If I were a VC with a stake in Uber and they weren't researching this area, I would be pretty mad.
I'm pretty confident that this is flat out wrong. Just within the tech sphere you have Intel, Microsoft, Apple, Google, IBM, Samsung, and Baidu that have all historically put out large amounts of research. Outside of the tech sphere look at Pfizer, Johnson & Johnson, GM, Toyota.
Here's a slightly dated summary of research spending by large (profitable) companies https://www.recode.net/2017/9/1/16236506/tech-amazon-apple-g....