Happy to answer any questions you may have.
Similarly, your title 'World Models' is equally ambitious and deceptive. It only hints at its relation to model-based reinforcement learning, and using 'world' to mean 'rendering of a gym environment' is definitely an exaggeration.
This is not to say I don't like your work, but I am becoming increasingly frustrated by the language and habits of the newer breed of ML / robotics researchers.
Were you expecting the net to generalize from dream to reality, before you wrote the paper, or did this materialize during experimentation?
Do you expect this approach to also be feasible for more difficult games: higher dimensionality, longer-delayed rewards?
Both congrats and thanks for writing this very accessible paper. I really found this a creative and inspiring paper, and the presentation of the results was marvelous.
(BTW: I remember you from the RNN-volleyball game. Back then you had quite some jealous detractors, telling you DeepMind would be too difficult/academic for you. You sure shut those people up!)
The first time I discussed this topic with Jürgen Schmidhuber was at NIPS 2016, during a break at one of the sessions where he gave a talk about "Learning to Think", and we kept in contact afterwards.
> Were you expecting the net to generalize from dream to reality, before you wrote the paper, or did this materialize during experimentation?
When I tried this, I didn't expect it to work at all, to be honest! And in fact, as discussed in the paper, it didn't work at the beginning (the agent would just cheat the world model). That's why I tried adjusting the temperature parameter to control the stochasticity of the generated environment, and trained the agent inside a more difficult dream.
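In case it helps, here is a rough sketch of what temperature-scaled sampling from a Gaussian mixture can look like (the exact scaling in the released code may differ; this just shows the general idea of making the generated environment more stochastic):

    import numpy as np

    def sample_mdn(logit_pi, mu, log_sigma, temperature=1.0, rng=np.random):
        """Sample one value from a Gaussian mixture; temperature controls stochasticity."""
        # Higher temperature flattens the mixture weights, so unlikely modes get picked more often.
        scaled = logit_pi / temperature
        pi = np.exp(scaled - scaled.max())
        pi /= pi.sum()
        k = rng.choice(len(pi), p=pi)
        # Higher temperature also widens each Gaussian, making the dream environment noisier.
        sigma = np.exp(log_sigma[k]) * np.sqrt(temperature)
        return rng.normal(mu[k], sigma)

    # toy 3-component mixture for a single latent dimension
    print(sample_mdn(np.array([2.0, 0.5, -1.0]),
                     np.array([-1.0, 0.0, 1.5]),
                     np.array([-1.0, -0.5, 0.0]),
                     temperature=1.2))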
> Do you expect this approach to also be feasible for more difficult games: higher dimensionality, longer-delayed rewards?
I expect the iterative training approach to be promising for difficult games with higher dimensionality, where we need to use better V and M models with more capabilities and capacity (we can already find many candidates for V/M by looking at the deep learning literature), and still train these models efficiently with backprop on GPUs/TPUs. Using policy search methods such as evolution (or even augmented random search) allows us to work only with the cumulative reward we see at the end, rather than demanding a dense reward signal at every single time step, and I think this will help cope with environments with sparse, delayed rewards. Even in the experiments in this paper, we only work with the cumulative reward at the end of each rollout, and we don't care about intermediate rewards.
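To make that concrete, here is a bare-bones sketch of this kind of training loop (not the actual CMA-ES setup used in the paper; the toy environment and hyperparameters are placeholders). The only learning signal is the cumulative reward returned at the end of each rollout:

    import numpy as np

    class ToyEnv:
        """Stand-in environment: reward for steering a point toward the origin."""
        obs_dim, action_dim = 4, 2

        def reset(self):
            self.x = np.random.randn(self.obs_dim)
            return self.x

        def step(self, action):
            self.x[:2] -= 0.1 * action
            reward = -np.sum(self.x[:2] ** 2)
            return self.x, reward, False

    def rollout(env, params, n_steps=100):
        """Run one episode with a linear policy; keep only the total reward."""
        obs, total = env.reset(), 0.0
        W = params.reshape(env.action_dim, env.obs_dim)
        for _ in range(n_steps):
            obs, reward, done = env.step(np.tanh(W @ obs))
            total += reward
            if done:
                break
        return total   # cumulative reward at the end of the rollout, nothing per-step

    def simple_es(env, n_params, pop_size=32, sigma=0.1, lr=0.03, iters=50):
        """A bare-bones evolution strategy over the controller parameters."""
        theta = np.zeros(n_params)
        for _ in range(iters):
            eps = np.random.randn(pop_size, n_params)
            returns = np.array([rollout(env, theta + sigma * e) for e in eps])
            ranks = returns.argsort().argsort() / (pop_size - 1) - 0.5   # rank-normalize
            theta += lr / (pop_size * sigma) * (eps.T @ ranks)
        return theta

    env = ToyEnv()
    theta = simple_es(env, n_params=env.action_dim * env.obs_dim)
    print("final return:", rollout(env, theta))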
> Both congrats and thanks for writing this very accessible paper. I really found this a creative and inspiring paper, and the presentation of the results was marvelous. (BTW: I remember you from the RNN-volleyball game. Back then you had quite some jealous detractors, telling you DeepMind would be too difficult/academic for you. You sure shut those people up!)
Thanks! The RNN-volleyball game from 2015 was a lot of fun to make. Back then, I trained the agents using self-play with evolution, and I remember people telling me I should really be using DQN or something instead. Fast forward a few years, self-play is now a really popular area of research (for instance, many nice works from OpenAI and DeepMind last year), and evolution methods are really making a comeback. I think it is best to work with something you believe in, and sometimes it is okay to not pursue what everyone else is doing.
 On Learning to Think: Algorithmic Information Theory for Novel Combinations of Reinforcement Learning Controllers and Recurrent Neural World Models https://arxiv.org/abs/1511.09249
There's an older work from Hod Lipson, often referenced in Clark's writing, that I also found inspirational. An old TED talk (2007) from Lipson about "Building 'self-aware' robots":
I'm very curious how that line of research will turn out. My interest comes from the behavioural economics perspective on decision making. Big names (Akerlof, Kahneman, Tirole) have approached narratives as a way to cope with multiple selves, but I believe that the free energy principle, when integrated with the work by Metzinger, may be able to introduce a naturalistic way to ground both preferences and the development of preferences in empirical findings from neuroscience.
[1, paywalled] https://www.tandfonline.com/doi/abs/10.1080/1350178X.2017.12...
That being said, here are a few differences I noticed:
- We minimize the parameters needed for the controller module, and solve for the parameters using Evolution Strategies.
- We try to replace the actual environment entirely with the generated environment, discuss when this approach will fail, and also suggest practical methods to make this work better. (This part of our work is not really discussed in detail in this particular blog post here.)
- Rather than create new architectures, we take a minimalist design approach. We tried to keep the building blocks as simple as possible, sticking to plain vanilla VAEs and MDN-RNNs, and tiny linear layers for controllers, to reinforce the key concepts clearly (a rough sketch of such a tiny controller appears after this list). For instance, when we were training the VAE, we didn't even use batchnorm, and just used an L2 loss, so that someone implementing the method for similar problems would have fewer issues getting it to work, and wouldn't have to spend too much time tweaking it or tuning hyperparameters. This might come at the expense of performance, but we feel it is the right tradeoff.
- We wrote the article with clarity in mind, and invested considerable effort to communicate the ideas as clearly as possible, with the hope that readers with some ML background can understand, and even reproduce and extend some of the experiments from first principles.
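As referenced above, here is a minimal sketch of what such a tiny linear controller can look like (the sizes are illustrative and not taken from the released code):

    import numpy as np

    Z_DIM, H_DIM, A_DIM = 32, 256, 3   # illustrative sizes, not from the repo

    def controller(z, h, params):
        """The entire C model: one linear layer mapping [z; h] to actions."""
        n_w = A_DIM * (Z_DIM + H_DIM)
        W = params[:n_w].reshape(A_DIM, Z_DIM + H_DIM)
        b = params[n_w:]
        return np.tanh(W @ np.concatenate([z, h]) + b)

    n_params = A_DIM * (Z_DIM + H_DIM) + A_DIM   # 867 parameters with these sizes
    params = np.zeros(n_params)                  # this small vector is all ES has to search over
    print(controller(np.zeros(Z_DIM), np.zeros(H_DIM), params))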
I was very glad when I saw on GitHub that you said the whole system could be trained in a reasonably short amount of time, because it makes it so much more feasible to try out and experiment with as an individual. Awesome paper, and I thought the way the material was presented was excellent and made for a great read. I hope this kind of interactive presentation becomes more common in the future!
Have you done any experiments feeding the cell states into the Controller in addition to the latent vector and hidden states? If so, how did it perform?
I would still encourage you to pursue your idea, since there are still lots of limitations in this model (discussed in the paper), and a lot of work remains to be done to solve more difficult problems.
1) Alex Graves on Hallucination with Recurrent Neural Networks, a 2015 lecture at the University of Oxford from a course by Nando de Freitas (highly recommended).
2) Generating Sequences With Recurrent Neural Networks (a toy sketch of the sampling loop described in the quote below appears after this list):
"Assuming the predictions are probabilistic, novel sequences can be generated from a trained network by iteratively sampling from the network’s output distribution, then feeding in the sample as input at the next step. In other words by making the network treat its inventions as if they were real, much like a person dreaming."
There are other terms, such as Imagination, also used in the literature:
3) Imagination-Augmented Agents for Deep Reinforcement Learning
4) Uncertainty-driven Imagination for Continuous Deep Reinforcement Learning
In our work, the procedure is closer to the approaches in (1) and (2), rather than the "Imagination" approach in (3) and (4) where there are more subtle differences (i.e. planning), so we followed the terms in (1) and (2).
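As a toy illustration of the procedure Graves describes in the quote under (2), sampling from the output distribution and feeding the sample back in as the next input, here is a minimal loop with an untrained RNN (random weights, so the output is noise, but the loop structure is the point):

    import numpy as np

    def dream(Wxh, Whh, Why, n_steps=50, rng=np.random):
        """Generate a sequence by treating each sampled output as the next input."""
        n_hidden, n_symbols = Whh.shape[0], Why.shape[0]
        h = np.zeros(n_hidden)
        x = np.zeros(n_symbols); x[0] = 1.0   # arbitrary start symbol
        sequence = []
        for _ in range(n_steps):
            h = np.tanh(Wxh @ x + Whh @ h)    # one recurrent step
            logits = Why @ h
            p = np.exp(logits - logits.max()); p /= p.sum()
            k = rng.choice(n_symbols, p=p)    # sample from the output distribution
            sequence.append(k)
            x = np.zeros(n_symbols); x[k] = 1.0   # feed the "invention" back in as if it were real
        return sequence

    n_symbols, n_hidden = 16, 32
    print(dream(0.1 * np.random.randn(n_hidden, n_symbols),
                0.1 * np.random.randn(n_hidden, n_hidden),
                0.1 * np.random.randn(n_symbols, n_hidden)))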
It seems to have coincided with the re-emergence of neural networks, and the only way I can see it is that it romanticizes the field at the expense of some accuracy of statement.
However, I definitely can't claim to be immune to the charm of this romanticization; it surely appeals to something inside me.
Hallucination, on the other hand, is a less defensible term here. Hallucination refers to predictions overwhelming sense input. Since their agent was not behaving in a way uncorrelated with its inputs, with its predictive model overriding the input data, it doesn't qualify as hallucination.
I also don't think it's the author of the paper's job to manage hype.
Basically, the deep learning hype in popular science media owes its status in large part to the fact that it allows nice pictures to be shown. RL research fares well because researchers can show video of the agent playing the game. I bet the choice of Doom was also made with this PR in mind, and of course publications like Wired are going to show this work to their readers over, say, the ReLU paper (impactful in the field, but not much to write an article around).
Next web frameworks are going to be smart!
Let's say I wanted to run a Twitch stream of RL training on a remote server (and stream directly from the server to Twitch). What is the intended way to render the video in real time remotely?
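One possible setup (an assumption on my part, not necessarily the intended way): render frames with gym's rgb_array mode and pipe them into ffmpeg, which encodes them and pushes to Twitch's RTMP ingest. On a server with no display, environments that render through OpenGL usually also need a virtual framebuffer such as Xvfb. The stream key and ingest URL below are placeholders.

    import subprocess
    import numpy as np
    import gym

    env = gym.make("CarRacing-v0")
    env.reset()
    frame = env.render(mode="rgb_array")          # H x W x 3 uint8 frame, no window needed
    h, w, _ = frame.shape

    # Raw RGB frames go to ffmpeg's stdin; ffmpeg encodes and streams them to Twitch.
    ffmpeg = subprocess.Popen(
        ["ffmpeg", "-f", "rawvideo", "-pix_fmt", "rgb24", "-s", "{}x{}".format(w, h),
         "-r", "30", "-i", "-", "-c:v", "libx264", "-pix_fmt", "yuv420p",
         "-f", "flv", "rtmp://live.twitch.tv/app/YOUR_STREAM_KEY"],
        stdin=subprocess.PIPE)

    for _ in range(3000):
        _, _, done, _ = env.step(env.action_space.sample())
        ffmpeg.stdin.write(env.render(mode="rgb_array").astype(np.uint8).tobytes())
        if done:
            env.reset()

    ffmpeg.stdin.close()
    ffmpeg.wait()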
It looks like the VAE is just used to create a feature vector, so the main difference seems to be in the MDN-RNN - which is taking the place of the usual state/action simulation in Dyna-Q.
The VAE learns a compressed vector, and the latent variables are somewhat meaningful. The VAE can also be sampled from and is not just a table of memorized examples. The RNN maintains coherence with the actions and observations of previous time-steps, and a separate controller is also learned. The end result is that their approach is richer and more flexible.
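To make the "not just a table of memorized examples" point concrete, here is a toy contrast (neither Dyna-Q's nor the paper's actual machinery, just the distinction): a transition table can only replay the exact (state, action) pairs it has stored, while a fitted parametric model generalizes and can be sampled at states it has never seen.

    import numpy as np

    # Dyna-Q flavour: a table of memorized transitions, usable only at stored (s, a) keys.
    table = {}
    def record(s, a, s_next, reward):
        table[(tuple(np.round(s, 2)), tuple(np.round(a, 2)))] = (s_next, reward)

    # World-Models flavour (very loosely): a parametric model of p(s' | s, a) fit to the
    # same data, which generalizes and can be sampled anywhere in state space.
    class LinearGaussianModel:
        def fit(self, S, A, S_next):
            X = np.hstack([S, A, np.ones((len(S), 1))])
            self.W, *_ = np.linalg.lstsq(X, S_next, rcond=None)
            self.noise = np.std(S_next - X @ self.W, axis=0)

        def sample(self, s, a, rng=np.random):
            x = np.concatenate([s, a, [1.0]])
            return x @ self.W + rng.normal(0.0, self.noise)

    rng = np.random.RandomState(0)
    S = rng.randn(200, 3); A = rng.randn(200, 1)
    S_next = S + 0.1 * A + 0.01 * rng.randn(200, 3)
    record(S[0], A[0], S_next[0], 0.0)        # the table only "knows" this exact transition
    model = LinearGaussianModel(); model.fit(S, A, S_next)
    print(model.sample(np.array([5.0, -2.0, 0.0]), np.array([0.3])))   # a state never stored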
The machine/infant may receive stimuli through their transducers. Computers are provided the stimuli via digitized images, audio, or text, depending on the type of learning system. Infants are provided stimuli via their five senses. They receive their images through their eyes. Their audio is received through their ears. It's probably a little early at the infant stage, but in a few years, they will receive text through their eyes as well. Let's leave taste and smell for another time.
What is traditionally considered good parenting consists of curating these stimuli. There are all sorts of dangers a child will encounter. Without a shepherd to mitigate these dangers, and reinforce their negative consequences, the child will almost certainly die in infancy. But past a few years, although they may still have several years of development left, once their basic needs of food and shelter are met, a toddler will learn whatever they are continuously exposed to, in accordance with how the people they respect react to the situation.
To the extent the child does not die, it accrues positive associations with the stimuli its parents approve of. If my parents are smiling and laughing, I'm going to associate the activity with happiness. If my parents are yelling and have an angry face, I'm going to associate that activity with anger. As I accrue these associations, I slowly begin to become more and more self-sufficient.
However, notice how I didn't explain any singular activities. If my parents lacked the patience or resources to teach me, they might start sending me mixed messages. Perhaps I'm playing with legos one day, learning all of the positive things we all assume legos teach children. But my dad has a rough day at work, and comes home, steps on a lego. Now my dad is furious, screaming at me about my legos. This is now a negative association for legos. If I continue to accrue similar experiences, I'll likely have an irrational aversion to legos later in life.
So with as much snark as you can muster, could you please patronize me a little more, and correct any misunderstandings I still maintain about how machine learning is not analogous to developing human brains?
Well the bad actor would need root access to your brain. Make sure you set a good password, and don't tell anyone what it is.
Besides, the average human is not able to set a password, and their brains are open to all sorts of attacks. Cults, terrorist organizations, and multi-level marketing schemes abuse these weaknesses to get their followers to do things that may not be in their own best interest.
My original point is that cognitive behavioral therapy is a medium. It is just as good at creating addicts as it is at helping them recover. Teenagers learning to put up with the downsides of cigarettes to gain their peers' social proof is cognitive behavioral therapy. It's pretty successful too, if you happen to manufacture tobacco.
See also the Adversarial Bandit:
> Another variant of the multi-armed bandit problem is called the adversarial bandit, first introduced by Auer and Cesa-Bianchi (1998). In this variant, at each iteration an agent chooses an arm and an adversary simultaneously chooses the payoff structure for each arm. This is one of the strongest generalizations of the bandit problem as it removes all assumptions of the distribution and a solution to the adversarial bandit problem is a generalized solution to the more specific bandit problems.
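For reference, here is a compact sketch of Exp3, the standard algorithm for this setting (the update follows the usual textbook form; the step size and the toy adversary are my own illustration):

    import numpy as np

    def exp3(pull, n_arms, n_rounds, gamma=0.1, rng=np.random):
        """Exp3: no distributional assumptions; per-round rewards in [0, 1] may be adversarial."""
        w = np.ones(n_arms)
        for _ in range(n_rounds):
            p = (1.0 - gamma) * w / w.sum() + gamma / n_arms   # mix in uniform exploration
            arm = rng.choice(n_arms, p=p)
            reward = pull(arm)                                 # the adversary decides this payoff
            w[arm] *= np.exp(gamma * (reward / p[arm]) / n_arms)   # importance-weighted update
            w /= w.max()                                       # rescale to avoid overflow
        return w / w.sum()

    # toy adversary: arm 0 usually pays best, but is zeroed out on every 5th pull
    pulls = np.zeros(3)
    def pull(arm):
        pulls[arm] += 1
        if arm == 0:
            return 0.0 if pulls[0] % 5 == 0 else 0.8
        return 0.4

    print(exp3(pull, n_arms=3, n_rounds=2000))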
Good robust RL algorithms are able to learn in the presence of adversarial noise. Correct information is information that allows you to compress reality better. When an agent is able to compress reality better (has access to a better generalizing world model), it will be rewarded. Correct information is information that helps an agent better optimize its policy function.
You actually hit on an interesting angle of research, and you probably will be vindicated in the near future, when adversarial images (those that fool state-of-the-art image classifiers into failing) move to adversarial agents (those that fool other agents into making bad decisions). However, this research was not about multi-agent systems, though the opponents (those that shoot fireballs and try to kill the agent) can already be seen as adversaries to the agent's goal of staying alive longer.
Likely: Future AI will be decentralized for exactly these reasons. We don't want a single bad actor to control it. Security agencies are now warning that Russia is building a large botnet in case it needs to go to war and wants to disable enemy infrastructure. The US has similar needs.
Well-designed game theory makes it possible for adversaries to cooperate. So there is no guarantee that Alice is always susceptible to Bob's attacks. Cryptography provides methods that can't be attacked if properly implemented. Defense and offense can also have differing costs: it can be way (computationally) cheaper to create defenses for Alice than it is to craft adversarial offenses for Bob.
Though the risk is real: Spam preceded spam-filters. There was a short period (in internet years) where spam was more effective than our methods to counter it. So intelligent self-modifying worms/viruses will probably precede intelligent self-learning anti-viruses.
We also see inverse reinforcement learning (learning about the policy of another agent through observing its behavior), adversarial RL (forcing another trading bot to make unprofitable decisions), and computational arms races (who has the lowest latency?) between high-frequency trading firms.