
Curiosity Killed the Mario - MichaelBurge
http://www.michaelburge.us/2019/05/21/marai-agent.html
======
AgentME
I find the video super interesting to watch. It's not like previous Mario bots
I've seen, which were programmed with path-finding and a go-right goal and
race through the level perfectly. Instead, this makes me think of a very young
kid playing the game who isn't entirely sure what they should do, who is more
interested in figuring out what the controls do, and who usually goes right
because that's where more stuff is, not because they know they're "supposed"
to do that.

The idea of curiosity (seeking out states that lead to unpredictable stuff)
being a good scoring function for an AI seems really compelling to me. I'm
really curious where else this idea could be applied.
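
For anyone wondering what that scoring function looks like concretely, here's
a minimal numpy sketch of one common formulation (a forward model's
prediction error; the post builds on OpenAI's prediction-based rewards, and
all names here are mine):

```python
import numpy as np

# Sketch of a curiosity bonus as forward-model prediction error.
# The model and dimensions are illustrative, not from the article.
rng = np.random.default_rng(0)

def init_model(state_dim, action_dim):
    """Tiny linear forward model mapping (state, action) -> next state."""
    return rng.normal(scale=0.1, size=(state_dim + action_dim, state_dim))

def curiosity_reward(model, state, action, next_state):
    """Intrinsic reward = squared error of the next-state prediction,
    so poorly-predicted (surprising) states pay the most."""
    x = np.concatenate([state, action])
    return float(np.mean((x @ model - next_state) ** 2))

W = init_model(state_dim=4, action_dim=2)
s, a, s2 = rng.normal(size=4), rng.normal(size=2), rng.normal(size=4)
print(curiosity_reward(W, s, a, s2))
```

States the model already predicts well yield a near-zero bonus, so the agent
drifts toward whatever it can't yet anticipate.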

~~~
gipp
At first glance it reminds me of an old paper I read on causal entropic
forces, [1] which is kind of a thermodynamic approach to understanding the
emergence of complex behaviors. I've always had that in the back of my mind as
an idea I'd never have the time or resources to investigate further.

[1] [https://www.google.com/amp/s/phys.org/news/2013-04-emergence...](https://www.google.com/amp/s/phys.org/news/2013-04-emergence-complex-behaviors-causal-entropic.amp)

------
jolfdb
See also playfun/learnfun, a game-agnostic AI that uses raw uninterpreted
memory bytes as its signal: [http://tom7.org/mario/](http://tom7.org/mario/)

------
chriswarbo
I remember Schmidhuber showing off "artificial curiosity" stuff a while back
(e.g.
[http://people.idsia.ch/~juergen/interest.html](http://people.idsia.ch/~juergen/interest.html)
). In particular, ideas like "compression progress" have been influential on
my own research about how to measure what's "interesting", and I've
implemented a rudimentary version of PowerPlay (and a slight alternative) at
[http://chriswarbo.net/projects/powerplay](http://chriswarbo.net/projects/powerplay)

(He's since applied these ideas to art, humour, etc. which I think is a nice
thought, but not worth taking particularly seriously)
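
For a concrete feel of "compression progress", a rough zlib-based proxy might
look like this (my own approximation, not Schmidhuber's exact formulation):

```python
import zlib

# Rough proxy: an observation is "interesting" if appending it lets the
# whole history compress better than the two pieces compress separately.
def compressed_len(data: bytes) -> int:
    return len(zlib.compress(data, level=9))

def compression_progress(history: bytes, observation: bytes) -> int:
    separate = compressed_len(history) + compressed_len(observation)
    together = compressed_len(history + observation)
    return separate - together  # positive => shared structure was found

print(compression_progress(b"abcabcabc" * 100, b"abcabc" * 50))
```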

------
AstralStorm
Pity it actually fails at the truly complex levels. Gets stuck in decision
minima due to lack of memory?

Or is the reward too sparse?

Cannot find the symbolic notion of progress and dies of boredom?

~~~
teeki
For longer levels, I think training on the later parts of the level tends to
change the policy so that it no longer does as well on the earlier parts. I
suspect it would do fine on the linear levels if the number of agents and the
batch size were increased.

------
fartcannon
The author mentions self-driving cars and, indeed, it works well for Mario
Kart:
[https://www.youtube.com/playlist?list=PLTWFMbPFsvz122oi3aEWZ...](https://www.youtube.com/playlist?list=PLTWFMbPFsvz122oi3aEWZ9QTJwwF78UPr)

------
debatem1
This is an interesting idea for concolic execution. I assume it's been done
before, but a quick search doesn't turn it up. Maybe it's just assumed?

------
DanielleMolloy
There were labs at my university researching curiosity rewards in non-deep RL
and evolutionary robotics a few years back, but I hadn't seen anything like
this until this year.

Is it just scale and more powerful NNs that are making curiosity work well
now? The Montezuma result and this Mario video look like a serious
breakthrough to me (non-expert).

------
cwyers
The video is set to play back at a funky speed; playing it at 1.5x is much
more natural.

------
sdan
Is there a paper published (or preprinted) on this?

~~~
flooo
There is a link to the original work by OpenAI:
[https://openai.com/blog/reinforcement-learning-with-predicti...](https://openai.com/blog/reinforcement-learning-with-prediction-based-rewards/)

------
negamax
How is this different from brute-forcing and then recording a success path?
Not sure this qualifies as AI.

~~~
teeki
A naive brute force algorithm would take a lot longer to complete some of
these levels.

Let's say it takes 45 seconds to complete a level. That's 45 * (60 / 12) = 225
moves. The size of the action space is 14, so you'd be looking at 14 ^ 225
(give or take a few orders of magnitude) trajectories before finding the
solution.
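
The same back-of-the-envelope in Python (assuming 60 fps with each action
held for 12 frames, i.e. 5 actions per second):

```python
# 45 seconds of play at 5 actions per second.
seconds = 45
moves = seconds * (60 // 12)        # 225 moves
action_space = 14
trajectories = action_space ** moves
print(f"14^{moves} has {len(str(trajectories))} digits")  # ~10^257
```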

To brute force in a reasonable time, you would have to look at the
environment and have the algorithm iterate through trajectories in a clever
way. For instance, you might choose to only try trajectories that constantly
move right. That strategy may find a solution to 1-1 fairly quickly, but it
does not generalize to other levels, especially ones that require
backtracking or waiting.

You'd have to design a pretty gnarly algorithm for it to beat 1-1, 1-4, and
2-2. This gets even more complicated if you bring in other environments: the
original paper also trained on Montezuma's Revenge, Private Eye, Venture,
Freeway, and Gravitar.

~~~
negamax
They have 350 hours of playing per level. They could brute force on these
parameters:

1. Detect moving obstacles to avoid

2. Detect stationary obstacles to avoid (gaps)

3. Player moves (jump (left, right)(up, far, farthest), walk)

4. Success (hitting the flag, reaching the princess)

Once the above information is available from the screen, it becomes easily
brute-forcible. I was hoping for the use of genetic algorithms, i.e. they
could have taken a successful path from one part of a level, crossed it with
another, and tried that across levels. But there won't be a generic strategy
or learning anyway, as there's a fair bit of randomness. So this does seem
like brute-forcing a success path.
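
A minimal sketch of that crossover idea, treating trajectories as action
lists (names are mine, reusing the 14-action space mentioned upthread):

```python
import random

# Hypothetical one-point crossover of two action trajectories.
ACTIONS = range(14)

def crossover(traj_a, traj_b):
    """Splice the prefix of one trajectory onto the suffix of another."""
    cut = random.randrange(1, min(len(traj_a), len(traj_b)))
    return traj_a[:cut] + traj_b[cut:]

a = [random.choice(ACTIONS) for _ in range(225)]
b = [random.choice(ACTIONS) for _ in range(225)]
child = crossover(a, b)
print(len(child))  # still 225 moves
```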

~~~
null000
The whole point is that the algorithm doesn't have obstacles or success as
concepts baked into it. Besides, this is pretty early research, meant to
inform and promote further work.

In other words, this isn't meant to be super useful by itself. It seems
tailor-made (as many of these things do) to play super-simple 80's video
games and literally nothing else, but it's an interesting proof of concept.
I'd also be interested in different iterations on this general pattern - for
instance, something that didn't translate directly from screen + button ->
prediction, and instead had some interstitial systems: translating from
screen -> entities, then predicting the state of those entities given button
presses. It'd also be interesting to see how this performs with ML algorithms
designed to learn on the fly instead of training on a static data set (at
least, this looked like it learned through backpropagation - I skimmed).
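
A rough sketch of what that interstitial pipeline could look like (entirely
hypothetical names and dynamics, just to make the two stages concrete):

```python
from dataclasses import dataclass, replace
from typing import List

# All names and dynamics here are invented for illustration.
@dataclass(frozen=True)
class Entity:
    kind: str   # e.g. "mario", "goomba"
    x: float
    vx: float

def predict_entities(entities: List[Entity], right_pressed: bool) -> List[Entity]:
    """Stage 2: advance each entity one tick; the button only moves Mario.
    Stage 1 (screen -> entities) would be an object detector, omitted here."""
    out = []
    for e in entities:
        vx = e.vx + (0.5 if right_pressed and e.kind == "mario" else 0.0)
        out.append(replace(e, x=e.x + vx, vx=vx))
    return out

frame = [Entity("mario", 0.0, 0.0), Entity("goomba", 50.0, -1.0)]
print(predict_entities(frame, right_pressed=True))
```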

But I can see broader practical applications for this in, for instance,
recommender systems trying to break users out of the closed feedback loop that
people tend to end up in when going down certain rabbit holes (e.g. watch one
Flat Earther conspiracy video and suddenly that's all you see for a week
because the recommender system _knows_ that people who look at one will look
at more). The point being: the real test comes when this strategy is exposed
to more diverse problem spaces; it's just that those are harder to model, and
we need to weed out the pointless stuff first.

------
bicepjai
In reinforcement learning this falls under the exploration vs. exploitation
trade-off.
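
The textbook knob for that trade-off is epsilon-greedy action selection, e.g.:

```python
import random

# Epsilon-greedy: explore a random action with probability epsilon,
# otherwise exploit the best-known action.
def epsilon_greedy(q_values, epsilon=0.1):
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

print(epsilon_greedy([0.1, 0.9, 0.3]))  # usually 1, occasionally random
```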

------
otakucode
A couple years ago there was a paper reported somewhere (it may have been
here) that dealt with unsupervised learning using entropy as the only fitness
function. Regardless of the task or any other factors, the researchers used
maximizing entropy as the only goal. And this immediately led to the
development of complex, interesting, and desirable behavior. When used for a
system balancing a pole, it would learn to balance the pole upright. When
given a ball where a hoop was present, it would automatically navigate the
ball through the hoop. I tried to reach out to the author to get a full copy
of the paper (I could only find a paywalled abstract online) but never got a
response. It seemed like a very interesting approach, and this sounds like
basically the same thing: favor moving to any state that increases the number
of likely future states. Increase entropy, and 'intelligence' emerges.
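
A toy illustration of that rule (entirely my own construction, not from the
paper): prefer the action whose sampled futures are most spread out.

```python
import numpy as np

# Toy 1-D world: actions move left/stay/right, with walls at 0 and 10.
rng = np.random.default_rng(0)
N_ACTIONS = 3

def step(state, action):
    return int(np.clip(state + (action - 1), 0, 10))

def future_entropy(state, action, horizon=5, samples=200):
    """Shannon entropy over states reached by random rollouts after `action`."""
    counts = {}
    for _ in range(samples):
        s = step(state, action)
        for _ in range(horizon - 1):
            s = step(s, int(rng.integers(N_ACTIONS)))
        counts[s] = counts.get(s, 0) + 1
    p = np.array(list(counts.values())) / samples
    return float(-(p * np.log(p)).sum())

# From the left wall, the entropy-maximizing move should head right,
# toward open space where more futures are reachable.
print(max(range(N_ACTIONS), key=lambda a: future_entropy(0, a)))
```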

