
A step-by-step guide to the “World Models” AI paper - davidfoster
https://applied-data.science/blog/hallucinogenic-deep-reinforcement-learning-using-python-and-keras
======
hardmaru
Hi, I'm one of the authors of this paper
([https://arxiv.org/abs/1803.10122](https://arxiv.org/abs/1803.10122),
[https://worldmodels.github.io](https://worldmodels.github.io)).

Happy to answer any questions you may have.

~~~
birthcert
How did you get in contact with Schmidhuber for co-authoring? What stage
was the research at when he joined?

Were you expecting the net to generalize from dream to reality, before you
wrote the paper, or did this materialize during experimentation?

Do you expect this approach to also be feasible for more difficult games,
with higher dimensionality and longer-delayed rewards?

Both congrats and thanks for writing this very accessible paper. I really
found this a creative and inspiring paper, and the presentation of the
results was marvelous.

(BTW: I remember you from the RNN-volleyball game. Back then you had quite
a few jealous detractors, telling you DeepMind would be too
difficult/academic for you. You sure shut those people up!)

~~~
hardmaru
> How did you get in contact with Schmidhuber for co-authoring? What stage
> was the research at when he joined?

The first time I discussed this topic with Jürgen Schmidhuber was at NIPS
2016, where he gave a talk about "Learning to Think" [1]; we spoke during a
break at one of the sessions and kept in contact afterwards.

> Were you expecting the net to generalize from dream to reality, before
> you wrote the paper, or did this materialize during experimentation?

When I tried this, I didn't expect it to work at all, to be honest! And in
fact, as discussed in the paper, it didn't work at the beginning (the agent
would just cheat the world model). That's why I adjusted the temperature
parameter to control the stochasticity of the generated environment, and
trained the agent inside a more difficult dream.
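
A minimal sketch of what that temperature knob does when sampling one
output dimension from the MDN (hypothetical variable names, not our actual
code):

    import numpy as np

    def sample_mdn(logit_pi, mu, log_sigma, tau=1.0):
        # Higher tau flattens the mixture weights and widens each Gaussian,
        # making the dreamed environment noisier and harder to exploit.
        scaled = logit_pi / tau
        pi = np.exp(scaled - scaled.max())
        pi /= pi.sum()
        k = np.random.choice(len(pi), p=pi)  # pick a mixture component
        sigma = np.exp(log_sigma[k]) * np.sqrt(tau)  # one way to scale noise
        return np.random.normal(mu[k], sigma)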

> Do you expect this approach to also be feasible for more difficult
> games, with higher dimensionality and longer-delayed rewards?

I expect the iterative training approach to be promising for difficult
games with higher dimensionality, where we need better V and M models with
more capability and capacity (we can already find many candidates for V and
M in the deep learning literature) that can still be trained efficiently
with backprop on GPUs/TPUs. Using policy search methods such as evolution
(or even augmented random search) allows us to work only with the
cumulative reward we see at the end of a rollout, rather than demanding a
dense reward signal at every single time step, and I think this will help
cope with environments with sparse, delayed rewards. Even in the
experiments in this paper, we only work with the cumulative reward at the
end of each rollout, and we don't care about intermediate rewards.
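
To make the "cumulative rewards only" point concrete, here is a minimal
evolution-strategy sketch, assuming a hypothetical rollout(params) function
that runs one episode and returns only its total reward:

    import numpy as np

    def evolve(theta, rollout, pop_size=64, sigma=0.1, lr=0.03, iters=200):
        # theta is a flat NumPy parameter vector. Only the total return of
        # each rollout is used; no per-step reward signal and no gradients
        # through the environment are required.
        for _ in range(iters):
            eps = np.random.randn(pop_size, theta.size)
            returns = np.array([rollout(theta + sigma * e) for e in eps])
            adv = (returns - returns.mean()) / (returns.std() + 1e-8)
            theta = theta + (lr / (pop_size * sigma)) * (eps.T @ adv)
        return theta

(In the paper we used CMA-ES; this simpler estimator just illustrates the
interface.)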

> Both congrats and thanks for writing this very accessible paper. I
> really found this a creative and inspiring paper, and the presentation
> of the results was marvelous. (BTW: I remember you from the
> RNN-volleyball game. Back then you had quite a few jealous detractors,
> telling you DeepMind would be too difficult/academic for you. You sure
> shut those people up!)

Thanks! The RNN-volleyball game from 2015 was a lot of fun to make. Back
then, I trained the agents with self-play and evolution, and I remember
people telling me I should really be using DQN or something instead. Fast
forward a few years: self-play is now a really popular area of research
(for instance, many nice works from OpenAI and DeepMind last year), and
evolution methods are making a comeback. I think it is best to work with
something you believe in, and sometimes it is okay to not pursue what
everyone else is doing.

[1] On Learning to Think: Algorithmic Information Theory for Novel
Combinations of Reinforcement Learning Controllers and Recurrent Neural World
Models [https://arxiv.org/abs/1511.09249](https://arxiv.org/abs/1511.09249)

------
npr11
This is a neat paper - it's an interesting empirical result combining known
techniques - but machine learning academics should really know better than to
contribute to the over-hyping of results. For example, talking about "dreams"
and "hallucinations" is not helpful - it doesn't make the work more accessible
and only adds unnecessary hype.

~~~
hardmaru
Hi, thanks for the feedback! Honestly, we didn't intend to over-hype the
results. We took the terms from existing works that we knew:

1) Alex Graves on Hallucination with Recurrent Neural Networks, a 2015 lecture
at the University of Oxford from a course by Nando de Freitas (highly
recommended).

[http://www.creativeai.net/posts/kp4bTG993JTQcqy2d/alex-graves-on-hallucination-with-recurrent-neural-networks](http://www.creativeai.net/posts/kp4bTG993JTQcqy2d/alex-graves-on-hallucination-with-recurrent-neural-networks)

2) Generating Sequences With Recurrent Neural Networks

[https://arxiv.org/abs/1308.0850](https://arxiv.org/abs/1308.0850)

"Assuming the predictions are probabilistic, novel sequences can be generated
from a trained network by iteratively sampling from the network’s output
distribution, then feeding in the sample as input at the next step. In other
words by making the network treat its inventions as if they were real, much
like a person dreaming."
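
That iterative sampling loop is essentially what the M model does when we
"dream". A rough sketch, with a hypothetical model interface:

    def dream(model, state, x0, steps=100):
        # model.step returns a predictive distribution over the next
        # frame plus the updated recurrent state (hypothetical interface).
        xs = [x0]
        for _ in range(steps):
            dist, state = model.step(xs[-1], state)
            xs.append(dist.sample())  # treat the sample as if it were real
        return xs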

There are other terms, such as Imagination, also used in the literature:

3) Imagination-Augmented Agents for Deep Reinforcement Learning

[https://arxiv.org/abs/1707.06203](https://arxiv.org/abs/1707.06203)

4) Uncertainty-driven Imagination for Continuous Deep Reinforcement Learning

[http://proceedings.mlr.press/v78/kalweit17a/kalweit17a.pdf](http://proceedings.mlr.press/v78/kalweit17a/kalweit17a.pdf)

In our work, the procedure is closer to the approaches in (1) and (2) than
to the "Imagination" approach in (3) and (4), where there are more subtle
differences (e.g. planning), so we followed the terminology of (1) and (2).

~~~
ppod
I completely agree with you. Dreams, imagination, or hallucination are
appropriate terms for an agent working through solutions within its own world-
model without using new external input. Would we reserve the verb 'to fly'
only for birds? As Dijkstra said, "the question of whether a computer can
think is no more interesting than whether a submarine can swim".

~~~
therein
I guess the question is: why did we need to move away from `to generate`
or `to permute` on feedback with no additional input?

It seems to have coincided with the re-emergence of neural networks, and
the only way I can see it is that it romanticizes the field at the expense
of some accuracy of statement.

However, I definitely can't claim to be immune to the charm of this
romanticization; it surely appeals to something inside me.

~~~
ppod
'generate' and 'permute' are more semantically general words. To convey
what you mean, you have to add "on feedback with no additional input".
'imagine' or 'dream' fully includes this specific meaning: it is more
accurate. The only difficulty is that we are not used to applying these
verbs to non-animal subjects. It is just like going out of your way to say
"the submarine propelled itself through the water" or "the plane propelled
itself through the air" because you don't want to use the verbs swim or fly
with inanimate subjects. Why we make the distinction in those two
particular cases, I have no idea. Maybe we're used to seeing birds glide
without moving, while we don't really see fish swimming without that
distinctive wriggling-flapping motion.

------
bassman9000
_Our agent consists of three components that work closely together: Vision
(V), Memory (M), and Controller (C)_

Next web frameworks are going to be smart!

------
make3
The original interactive blog post is also really awesome
[https://worldmodels.github.io/](https://worldmodels.github.io/)

------
minimaxir
The post talks about running "video" on a remote server for the RL training,
but not how to take that image and visualize it locally (which would be
helpful for debugging failing models).

Let's say I wanted to run a Twitch stream of RL training on a remote server
(and stream directly from the server to Twitch). What is the intended way to
render the video in real time remotely?

------
BrandonSmithJ
Is this similar to Dyna-Q learning, but with modeling/simulation being handled
by the RNN?

It looks like the VAE is just used to create a feature vector, so the main
difference seems to be in the MDN-RNN - which is taking the place of the usual
state/action simulation in Dyna-Q.

~~~
Cybiote
Yeah, it's the same general principle of using a model to cheaply speed up
policy learning. An advantage of their approach, however, is that it learns
a latent space and generalizes better.

The VAE learns a compressed vector, and the latent variables are somewhat
meaningful. The VAE can also be sampled from; it is not just a table of
memorized examples. The RNN maintains coherence with the actions and
observations of previous time-steps, and a separate controller is also
learned. The end result is that their approach is richer and more
flexible.
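
Roughly, the per-time-step data flow looks like this (hypothetical
interfaces, just to sketch how V, M, and C fit together):

    import numpy as np

    def step(vae, rnn, controller, obs, h):
        z = vae.encode(obs)                     # V: frame -> latent vector z
        a = controller(np.concatenate([z, h]))  # C: linear policy on [z, h]
        h = rnn.step(z, a, h)                   # M: update memory state
        return a, h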

------
hmate9
This post's author is fantastic. He breaks things down and explains
everything very nicely.

------
flyingcircus3
Who decides what is the correct information to learn? What will prevent a
bad actor from providing subject material that teaches people to bring harm
to themselves or others? Post-Traumatic Stress Disorder sounds, at least to
the layman, like this very design pattern, but one that reinforces
undesirable subjects.

~~~
Maybestring
> What will prevent a bad actor from providing subject material that
> teaches people to bring harm to themselves or others?

Well, the bad actor would need root access to your brain. Make sure you set
a good password, and don't tell anyone what it is.

~~~
birthcert
Not necessarily. At a minimum, you need access to the sensory environment
of the subject: teens on Twitter are more easily radicalized when their
timelines consist largely of terrorist propaganda or war-front reporting on
civilian casualties. Facebook has run experiments in which they changed the
sentiment of the timeline for certain users and saw a significant sentiment
change in those users' future posts.

Besides, the average human is not able to set a password, and their brain
is open to all sorts of attacks. Cults, terrorist organizations, and
multi-level marketing schemes abuse these weaknesses to get their followers
to do things that may not be in their own best interest.

