A step-by-step guide to the “World Models” AI paper (applied-data.science)
261 points by datashrimp on April 17, 2018 | 37 comments



Hi, I'm one of the authors of this paper (https://arxiv.org/abs/1803.10122, https://worldmodels.github.io).

Happy to answer any questions you may have.


Do you not feel that ML researchers have a duty not to inflate their research with loaded, anthropomorphic terms that could be misinterpreted by the public? I'm talking particularly about using words like 'dreaming' and 'imagination' for what is essentially prediction (albeit in complex sensor modalities like vision).

Similarly, your title 'World Models' is equally ambitious and deceptive. It only hints at its relation to model-based reinforcement learning, and using 'world' to mean 'rendering of a gym environment' is definitely an exaggeration.

This is not to say I don't like your work, but I am becoming increasingly frustrated by the language and habits of the newer breed of ML / robotics researchers.


How did you get into contact with Schmidhuber for co-authoring? What stage was the research at when he joined?

Were you expecting the net to generalize from dream to reality, before you wrote the paper, or did this materialize during experimentation?

Do you expect this approach is also feasible for more difficult games: higher dimensionality, longer delayed rewards?

Both congrats and thanks for writing this very accessible paper. Really found this a creative paper with a lot of inspiration, and the presentation of the results was marvelous.

(BTW: I remember you from the RNN-volleyball game. Back then you had quite some jealous detractors, telling you DeepMind would be too difficult/academic for you. You sure shut those people up!)


> How did you get into contact with Schmidhuber for co-authoring? What stage was the research at when he joined?

The first time I discussed this topic with Jürgen Schmidhuber was at NIPS 2016, when he gave a talk about "Learning to Think" [1], during the break at one of the sessions, and we kept in contact afterwards.

> Were you expecting the net to generalize from dream to reality, before you wrote the paper, or did this materialize during experimentation?

When I tried this, I didn't expect this to work at all, to be honest! And in fact, as discussed in the paper, it didn't work at the beginning (the agent would just cheat the world model). That's why I tried to adjust the temperature parameter to control the stochasticity of the generated environment, and trained the agent inside a more difficult dream.
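As a toy numpy sketch (my own hypothetical illustration, not our actual code), the temperature adjustment amounts to flattening the mixture weights and widening each Gaussian of the MDN's output before sampling:

```python
import numpy as np

def sample_mdn(logit_pi, mu, log_sigma, tau=1.0, rng=np.random):
    """Sample one value from a mixture-of-Gaussians head at temperature tau.

    Higher tau flattens the mixture weights and widens each Gaussian,
    making the generated ("dreamed") next state more stochastic.
    """
    scaled = logit_pi / tau                      # temperature-scaled logits
    pi = np.exp(scaled - scaled.max())           # stable softmax
    pi /= pi.sum()
    k = rng.choice(len(pi), p=pi)                # pick a mixture component
    sigma = np.exp(log_sigma[k]) * np.sqrt(tau)  # widen the chosen Gaussian
    return rng.normal(mu[k], sigma)
```

At tau > 1 the dream becomes noisier and harder for the agent to cheat; as tau goes to 0 it collapses toward the deterministic mode.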

> Do you expect this approach is also feasible for more difficult games: higher dimensionality, longer delayed rewards?

I expect the iterative training approach to be promising for difficult games with higher dimensionality, where we need V and M models with more capability and capacity (the deep learning literature already offers many candidates for V/M), while still training these models efficiently with backprop on GPUs/TPUs. Using policy search methods such as evolution (or even augmented random search) allows us to work only with the cumulative reward we see at the end, rather than demanding a dense reward signal at every single time step, and I think this will help cope with environments with sparse, delayed rewards. Even in the experiments in this paper, we only work with cumulative rewards at the end of each rollout, and we don't care about intermediate rewards.
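To illustrate why only the final cumulative reward is needed, here is a toy evolution strategy (my own minimal sketch, not the CMA-ES setup used in the paper). The black-box `rollout_return` could be an entire episode; the optimizer never sees per-step rewards:

```python
import numpy as np

def evolve(rollout_return, n_params, pop_size=64, sigma=0.1, lr=0.03,
           iters=300, seed=0):
    """Toy evolution strategy: the optimizer only ever sees the cumulative
    return of a full rollout, never a per-step reward signal."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(n_params)
    for _ in range(iters):
        # Perturb the parameters, evaluate each perturbation by one rollout.
        eps = rng.standard_normal((pop_size, n_params))
        returns = np.array([rollout_return(theta + sigma * e) for e in eps])
        adv = returns - returns.mean()  # baseline-subtracted returns
        # Move theta toward perturbations that earned higher returns.
        theta = theta + lr / (pop_size * sigma) * (adv @ eps)
    return theta
```

Anything that maps parameters to a single scalar return fits this interface, which is what makes it a natural match for sparse, end-of-episode rewards.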

> Both congrats and thanks for writing this very accessible paper. Really found this a creative paper with a lot of inspiration, and the presentation of the results was marvelous. (BTW: I remember you from the RNN-volleyball game. Back then you had quite some jealous detractors, telling you DeepMind would be too difficult/academic for you. You sure shut those people up!)

Thanks! The RNN-volleyball game from 2015 was a lot of fun to make. Back then, I trained the agents using self-play, with evolution, and I remember people telling me I should really be using DQN or something. Fast forward a few years: self-play is now a really popular area of research (for instance, many nice works from OpenAI and DeepMind last year), and evolution methods are really making a comeback. I think it is best to work with something you believe in, and sometimes it is okay to not pursue what everyone else is doing.

[1] On Learning to Think: Algorithmic Information Theory for Novel Combinations of Reinforcement Learning Controllers and Recurrent Neural World Models https://arxiv.org/abs/1511.09249


Are you familiar with the current debate about predictive processing / free energy minimisation / active inference, driven by philosophers such as Clark & Hohwy on the one side and the neuroscience tribe around Friston alongside them?


I'm familiar with the works of Andy Clark. In particular, I found parts (but not all) of "Being There" and "Supersizing the Mind" to be interesting to read, although he does ramble on sometimes. There's an interesting article called "The Mind-Expanding Ideas of Andy Clark" that I can recommend reading:

https://www.newyorker.com/magazine/2018/04/02/the-mind-expan...

There's an older work from Hod Lipson, that is often referenced in Clark's writing, that I also found inspirational. An old TED talk (2007) from Lipson about "Building 'self-aware' robots":

https://www.ted.com/talks/hod_lipson_builds_self_aware_robot...


Ha, I did not know that there had been a New Yorker profile of him just a few days ago. They are really jumping on the predictive processing train hard: in the same issue, there was a profile of Metzinger [0], who plays an important role in bridging the divide between Friston's work on perception and theories of the self.

I'm very curious how that line of research will turn out. My interest comes from the behavioural economics perspective on decision making. Big names (Akerlof, Kahneman, Tirole) have approached narratives as a way to cope with multiple selves [1], but I believe that the free energy principle, when integrated with the work by Metzinger, may be able to introduce a naturalistic way to ground both preferences and the development of preferences in empirical findings from neuroscience.

[0] https://www.newyorker.com/magazine/2018/04/02/are-we-already... [1, paywalled] https://www.tandfonline.com/doi/abs/10.1080/1350178X.2017.12...


I'm curious to know how your paper differs from Learning and Querying Fast Generative Models for Reinforcement Learning. It seems relevant, but you don't mention it iirc.


Thanks for pointing out this paper. We were not aware of it, as it was published only a few weeks before our publication date. Upon going through it, it seems to be an extension of "Imagination-Augmented Agents for Deep Reinforcement Learning" (Weber et al. 2017, which, btw, is an _amazing_ paper I can highly recommend, or even just watch Theo's recorded talk at NIPS 2017). Going through and preparing for the publication process for ML papers takes time, in some cases even months. In our case, it certainly took months to build the interactive article and go through many rounds of editing and revisions, and also to test that the interactive demos work well across all sorts of test cases, tablets, smartphones, and browsers, in addition to just the arxiv pdf.

That being said, here are a few differences I noticed:

- We minimize the parameters needed for the controller module, and solve for the parameters using Evolution Strategies.

- We try to replace the actual environment entirely with the generated environment, discuss when this approach will fail, and also suggest practical methods to make this work better. (This part of our work is not really discussed in detail in this particular blog post here.)

- Rather than create new architectures, we take a minimalist design approach. We tried to keep the building blocks as simple as possible, sticking to plain vanilla VAEs and MDN-RNNs and tiny linear layers for controllers, to reinforce key concepts clearly. For instance, when we were training the VAE, we didn't even use batchnorm, and just used an L2 loss, so that someone implementing the method for similar problems would have fewer issues getting it to work, and wouldn't have to spend too much time tweaking or tuning hyperparameters. This might come at the expense of performance, but we feel it is the right tradeoff.

- We wrote the article with clarity in mind, and invested considerable effort to communicate the ideas as clearly as possible, with the hope that readers with some ML background can understand, and even reproduce and extend some of the experiments from first principles.
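To show how spare that minimalist VAE recipe is, here is a toy linear-VAE loss in numpy (a hypothetical sketch of the idea, not our actual code): plain L2 reconstruction plus the standard KL term, and nothing else to tune.

```python
import numpy as np

def vae_loss(x, enc_w, dec_w, rng=np.random):
    """One forward pass of a bare-bones linear VAE.

    Loss = L2 reconstruction + KL(q(z|x) || N(0, I)); no batchnorm,
    no perceptual loss, nothing to tweak beyond the latent size.
    """
    h = x @ enc_w                      # encoder output is [mu | log_var]
    mu, log_var = np.split(h, 2, axis=-1)
    # Reparameterization trick: sample z = mu + sigma * noise.
    z = mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)
    x_hat = z @ dec_w                  # linear decoder
    recon = np.sum((x - x_hat) ** 2)   # plain L2 reconstruction loss
    kl = -0.5 * np.sum(1 + log_var - mu ** 2 - np.exp(log_var))
    return recon + kl
```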


I'm also curious what your thoughts are on this paper: https://arxiv.org/abs/1803.10760 As a hobbyist/independent researcher I think it's really interesting to compare the two in terms of the way you model the environment and the parallels with neuroscience. It seems like their use of a DNC could address some of the points you mention about the limited historical capacity of LSTMs and catastrophic forgetting.

I was very glad when I saw on GitHub that you said the whole system could be trained in a reasonably short amount of time, because it makes it so much more feasible to try out and experiment with as an individual. Awesome paper, and I thought the way the material was presented was excellent and made for a great read. I hope this kind of interactive presentation becomes more common in the future!


Fantastic paper, glad to see it is so powerful! I'm just graduating as a computer scientist and independently came up with a very similar idea. Nice to see it validated, especially with it solving previously unsolved problems. I called the latent space of the VAE "Mental Space", serving a similar purpose to the Vision Model.

Have you done any experiments feeding the cell states into the Controller in addition to the latent vector and hidden states? If so how did it perform?


Thanks! We did try feeding both the "cell" of the LSTM in addition to the hidden state of the LSTM and the latent vector into the controller, and this works better. We discussed this in the Appendix section.
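Concretely, the controller can stay a single linear layer over the concatenated features; a toy sketch with hypothetical shapes (not our exact code):

```python
import numpy as np

def controller_action(z, h, c, W, b):
    """Single linear layer from [latent z, LSTM hidden h, LSTM cell c] to action.

    Concatenating the cell state c alongside h gives the controller access
    to the LSTM's full internal state, which worked better in our experiments.
    """
    features = np.concatenate([z, h, c])
    return np.tanh(W @ features + b)   # squash to a bounded action
```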

I would still encourage you to pursue your idea, since there are still lots of limitations in this model (discussed in the paper), and a lot of work remains to be done to solve more difficult problems.


Super impressive that it works! Have you thought of using a GAN instead of a Variational Autoencoder?


This is a neat paper - it's an interesting empirical result combining known techniques - but machine learning academics should really know better than to contribute to the over-hyping of results. For example, talking about "dreams" and "hallucinations" is not helpful - it doesn't make the work more accessible and only adds unnecessary hype.


Hi, thanks for the feedback! Honestly we didn't intend to over-hype the results. We took the terms from existing works that we knew:

1) Alex Graves on Hallucination with Recurrent Neural Networks, a 2015 lecture at the University of Oxford from a course by Nando de Freitas (highly recommended).

http://www.creativeai.net/posts/kp4bTG993JTQcqy2d/alex-grave...

2) Generating Sequences With Recurrent Neural Networks

https://arxiv.org/abs/1308.0850

"Assuming the predictions are probabilistic, novel sequences can be generated from a trained network by iteratively sampling from the network’s output distribution, then feeding in the sample as input at the next step. In other words by making the network treat its inventions as if they were real, much like a person dreaming."

There are other terms, such as Imagination, also used in the literature:

3) Imagination-Augmented Agents for Deep Reinforcement Learning

https://arxiv.org/abs/1707.06203

4) Uncertainty-driven Imagination for Continuous Deep Reinforcement Learning

http://proceedings.mlr.press/v78/kalweit17a/kalweit17a.pdf

In our work, the procedure is closer to the approaches in (1) and (2), rather than the "Imagination" approach in (3) and (4) where there are more subtle differences (i.e. planning), so we followed the terms in (1) and (2).


I completely agree with you. Dreams, imagination, or hallucination are appropriate terms for an agent working through solutions within its own world-model without using new external input. Would we reserve the verb 'to fly' only for birds? As Dijkstra said, "the question of whether a computer can think is no more interesting than whether a submarine can swim".


I guess the question is, why did we need to move away from `to generate` or `to permutate` on feedback with no additional input?

It seems to have coincided with the re-emergence of neural networks, and the only way I can see it is that it romanticizes the field at the expense of some accuracy of statement.

However, I definitely can't claim to be immune to the charm of this romanticization; it surely appeals to something inside me.


'generate' and 'permutate' are more semantically general words. To convey what you mean, you have to add "on feedback with no additional input". 'imagine' or 'dream' fully includes this specific meaning: it is more accurate. The only difficulty is that we are not used to applying these verbs to non-animal subjects. It is just like going out of your way to say "the submarine propelled itself through the water" or "the plane propelled itself through the air" because you don't want to use the verbs swim or fly with inanimate subjects. Why the distinction in those two particular cases, I have no idea. Maybe we're used to seeing birds glide without moving, while you don't really see fish swimming without that distinctive wriggling-flapping motion.


9.9 times out of 10, I'd be in strong agreement with you. This time however, I think the use of the term "dream" is not as egregious as would be typical. When we dream, we often believe we're conscious and multiple senses will seem to corroborate this. It must be the case that we're interacting with a generative world model. Hippocampal replay is hypothesized to, in part, facilitate what their model is doing when they call it dreaming.

Hallucination on the other hand, is a less defensible use here. Hallucination refers to when predictions overwhelm sense input. Since their agent was not behaving in a way uncorrelated with its inputs, such as its predictive model overriding input data, it doesn't qualify as hallucination.


I find that intuitive words like "dream" and "hallucination," as opposed to exclusively using dry jargon, make papers much more accessible.

I also don't think it's the paper author's job to manage hype.


There is a difference between ML research and AI research. AI, traditionally, has more leeway in using intuitive, abstract, or anthropomorphized terms than ML, which has established learning and optimization theory and a more solid foundation in applied mathematics.

Basically, the deep learning hype in popular science media owes its status in large part to the fact that it allows nice pictures to be shown. RL research fares well because they can show video of the agent playing the game. I bet the choice of Doom was also made with this PR in mind, and of course publications like Wired are going to show this work to their readers over, say, the ReLU paper (impactful in the field, but not much to write an article around).


It's also just plain easier and more fun to work on something when you can see how it behaves rather than trying to infer it by reading some numbers.


> Our agent consists of three components that work closely together: Vision (V), Memory (M), and Controller (C)

Next web frameworks are going to be smart!


The original interactive blog post is also really awesome https://worldmodels.github.io/


The post talks about running "video" on a remote server for the RL training, but not how to take that image and visualize it locally (which would be helpful for debugging failing models).

Let's say I wanted to run a Twitch stream of RL training on a remote server (and stream directly from the server to Twitch). What is the intended way to render the video in real time remotely?


Is this similar to Dyna-Q learning, but with modeling/simulation being handled by the RNN?

It looks like the VAE is just used to create a feature vector, so the main difference seems to be in the MDN-RNN - which is taking the place of the usual state/action simulation in Dyna-Q.


Yeah, it's the same general principle of using a model to cheaply speed up policy learning. An advantage of their approach, however, is that it learns a latent space and generalizes better.

The VAE learns a compressed vector and the latent variables are somewhat meaningful. The VAE can also be sampled from and is not just a table of memorized examples. The RNN maintains coherence with actions and observations of previous time-steps and a separate controller is also learned. The end result is their approach is richer and more flexible.
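For comparison, classic tabular Dyna-Q fits in a few lines (a toy sketch, with `Q` assumed to be a nested defaultdict of floats): the "model" is just a memorized transition table that gets replayed during planning, whereas World Models replays from a learned generative latent model.

```python
import random
from collections import defaultdict

def dyna_q_update(Q, model, s, a, r, s2,
                  alpha=0.1, gamma=0.95, planning_steps=5):
    """One Dyna-Q step: a real Q-learning update, then a few extra updates
    replayed from a learned (here, purely tabular) model of the environment."""
    def backup(s, a, r, s2):
        best_next = max(Q[s2].values()) if Q[s2] else 0.0
        Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
    backup(s, a, r, s2)              # learn from the real transition
    model[(s, a)] = (r, s2)          # memorize it in the model
    for _ in range(planning_steps):  # planning: replay remembered transitions
        (ps, pa), (pr, ps2) = random.choice(list(model.items()))
        backup(ps, pa, pr, ps2)
```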


This post's author is fantastic. Breaks things down and explains everything very nicely.


Who decides what is the correct information to learn? What will prevent a bad actor from providing subject material that teaches people to bring harm to themselves or others? Post-Traumatic Stress Disorder sounds, at least to the layman, like this very design pattern, but obviously reinforces undesirable subjects.


It sounds like you have grossly misunderstood what this paper is about. Like, on the order of several scientific disciplines wrong. You were probably confused by the authors using some of the commonly used terms, but nevertheless, this work is in the field of Machine Learning.


I can see healthy discussion abounds on HN. As I understand it, Machine Learning is aiming to create computer systems that can learn from a set of training data. The machine starts out like an infant, with no experience, and no knowledge, just sensors and memory. But like all computers, garbage in begets garbage out.

The machine/infant may receive stimuli through their transducers. Computers are provided the stimuli via digitized images, audio, or text, depending on the type of learning system. Infants are provided stimuli via their 5 senses. They receive their images through their eyes. Their audio is received through their ears. It's probably a little early at the infant stage, but in a few years, they will receive text through their eyes as well. Let's leave taste and smell for another time.

What is traditionally considered good parenting consists of curating these stimuli. There are all sorts of dangers a child will encounter. Without a shepherd to mitigate these dangers, and reinforce their negative consequences, the child will almost certainly die in infancy. But past a few years, although they may still have several years of development left, once their basic needs of food and shelter are met, a toddler will learn whatever they are continuously exposed to, in accordance with how the people they respect react to the situation.

To the extent the child does not die, it accrues positive associations with stimuli that their parents approve of. If my parents are smiling and laughing, I'm going to associate the activity with happiness. If my parents are yelling and have an angry face, I'm going to associate that activity with anger. As I accrue these associations, I slowly begin to become more and more self-sufficient.

However, notice how I didn't explain any singular activities. If my parents lacked the patience or resources to teach me, they might start sending me mixed messages. Perhaps I'm playing with legos one day, learning all of the positive things we all assume legos teach children. But my dad has a rough day at work, comes home, and steps on a lego. Now my dad is furious, screaming at me about my legos. If I continue to accrue similar experiences, I'll likely have an irrational aversion to legos later in life.

So with as much snark as you can muster, could you please patronize me a little more, and correct any misunderstandings I still maintain about how machine learning is not analogous to developing human brains?


>What will prevent a bad actor from providing subject material that teaches people to bring harm to themselves or others.

Well the bad actor would need root access to your brain. Make sure you set a good password, and don't tell anyone what it is.


Not necessarily. At a minimum you need access to the sensory environment of the subject: Teens on Twitter are more easily radicalized when their timeline consists largely of terrorist propaganda or war front reporting on civilian casualties. Facebook has done experiments where they changed the sentiment of the timeline for a certain user and saw a significant sentiment change in future posts by that user.

Besides, the average human is not able to set a password, and their brains are open to all sorts of attacks. Cults, terrorist organizations, and multi-level marketing schemes abuse these weaknesses to get their followers to do things that may not be in their own best interest.


Hallucinations can be induced through LSD, or psilocybin, or just being in the Middle East over the last century. Perhaps you should ask yourself if you've gotten good at parrying Russian troll accounts in the last couple of weeks, or if you've learned to question your own assumptions. My grandfather got his PTSD in Okinawa. My brother's friend got his in a state park when he took too much LSD.

My original point is that cognitive behavioral therapy is a medium. It is just as good at creating addicts as it is at helping them recover. Teenagers learning to put up with the downsides of cigarettes to gain their peers' social proof is cognitive behavioral therapy. It's pretty successful too, if you happen to manufacture tobacco.


You may be interested in the field of adversarial reinforcement learning. In adversarial reinforcement learning, an agent operates in the presence of a destabilizing adversary that applies disturbance forces to its system.

See also the Adversarial Bandit:

> Another variant of the multi-armed bandit problem is called the adversarial bandit, first introduced by Auer and Cesa-Bianchi (1998). In this variant, at each iteration an agent chooses an arm and an adversary simultaneously chooses the payoff structure for each arm. This is one of the strongest generalizations of the bandit problem as it removes all assumptions of the distribution and a solution to the adversarial bandit problem is a generalized solution to the more specific bandit problems.
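A standard algorithm for this setting is EXP3 (from the Auer and Cesa-Bianchi line of work the quote mentions); a compact toy sketch of my own, not code from that reference:

```python
import numpy as np

def exp3(payoffs, gamma=0.1, seed=0):
    """EXP3 for the adversarial bandit.

    payoffs[t] is the (possibly adversarially chosen) reward vector in [0, 1]
    at round t; only the pulled arm's payoff is revealed to the learner.
    Returns the learner's total reward over all rounds.
    """
    rng = np.random.default_rng(seed)
    n_arms = payoffs.shape[1]
    w = np.ones(n_arms)
    total = 0.0
    for row in payoffs:
        # Mix the exponential weights with uniform exploration.
        p = (1 - gamma) * w / w.sum() + gamma / n_arms
        arm = rng.choice(n_arms, p=p)
        total += row[arm]
        # Importance-weighted update: only the observed payoff is used.
        w[arm] *= np.exp(gamma * row[arm] / (p[arm] * n_arms))
    return total
```

Because the update makes no distributional assumptions about the payoffs, the same regret guarantee holds even when an adversary picks them.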

Good robust RL algorithms are able to learn in the presence of adversarial noise. Correct information is information that allows you to compress reality better. When an agent is able to compress reality better (has access to a better generalizing world model), it will be rewarded. Correct information is information that helps an agent better optimize its policy function.

You actually hit on an interesting angle of research, and you probably will be vindicated in the near future, when adversarial images (those that fool state-of-the-art image classifiers into failing) move on to adversarial agents (those that fool other agents into making bad decisions). However, this research was not about multi-agent systems, though the opponents (those that shoot fireballs and try to kill the agent) can already be seen as adversaries to the agent's goal of staying alive longer.


To stay in our abstract mode of thinking: does this effectively kick off an arms race? Let's assume Bob has bad intentions and wants to rule the world to benefit himself at the expense of others, and Alice has good intentions and wants to improve living conditions for everyone around the world. If Bob has sufficiently larger data centers and greater overall throughput in his system, would it be accurate to say Bob will always be able to deduce, and subsequently employ, the "Trojan horse" that satisfies all of Alice's criteria for an authorized user of her system?


Yes. Though AI is already in an arms race (mostly US vs. China/Russia).

Likely: future AI will be decentralized for exactly these reasons. We don't want a single bad actor to control it. Security agencies are now warning that Russia is building a large botnet in case it needs to go to war and wants to disable enemy infrastructure. The US has similar needs.

Well-designed game theory makes it possible for adversaries to cooperate, so there is no guarantee that Alice is always susceptible to Bob's attacks. Cryptography provides methods that can't be attacked if properly implemented. Defense and offense can also have differing costs: it can be way (computationally) cheaper to create defenses for Alice than it is to craft adversarial offenses for Bob.

Though the risk is real: spam preceded spam filters. There was a short period (in internet years) where spam was more effective than our methods to counter it. So intelligent self-modifying worms/viruses will probably precede intelligent self-learning anti-viruses.

We also see inverse reinforcement learning (learning about the policy of another agent through observing its behavior), adversarial RL (forcing another trading bot to make unprofitable decisions), and computational arms races (who has the lowest latency?) between High Frequency Trading firms.



