
Model-Based Reinforcement Learning for Atari - henning
https://arxiv.org/abs/1903.00374
======
taliesinb
Hasselt and co, who authored the original Rainbow DQN model that was bested by
SimPLe (this paper), responded recently with "When to use parametric models in
reinforcement learning?" [1], which is an interesting read. Punchline: "We
explain why their [SimPLe's] results may perhaps be considered surprising, and
show that in a like-for-like comparison Rainbow DQN outperformed the scores of
the model-based agent, with less experience and computation." In other words,
in an apples-to-apples comparison, model-based RL has not been shown to be
more sample efficient than suitably optimized DQNs.

Hasselt argues this is because, if you need _accurate_ updates to your Q
values, you have to trust the learned model you use for simulated rollouts on
exactly the states you are sampling from. But if the model is trustworthy on
those states, that is because it saw plenty of real transitions from them in
the actual environment. In that case you might as well have stored those
transitions in a big enough replay buffer and used ordinary Q-learning with
experience replay. And this indeed seems to be the case: given a nice big
replay buffer, Rainbow DQN is more sample efficient than SimPLe (counting
both real and imagined samples). Hasselt leaves some wiggle room for learned
models to help with action selection and credit assignment, though.
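
To make the two regimes concrete, here is a minimal tabular sketch (my own
illustration, with made-up names and hyperparameters; neither Rainbow nor
SimPLe works at this level of simplicity). Both apply the same TD update; the
only difference is whether the replayed transitions are real experience or
rollouts from a learned model.

```python
# Illustrative tabular Q-learning, not Rainbow or SimPLe.
import random
from collections import deque

GAMMA, ALPHA = 0.99, 0.1
replay = deque(maxlen=100_000)  # the "big enough replay buffer"
Q = {}                          # Q[(state, action)] -> value

def q(s, a):
    return Q.get((s, a), 0.0)

def td_update(s, a, r, s2, actions):
    target = r + GAMMA * max(q(s2, b) for b in actions)
    Q[(s, a)] = q(s, a) + ALPHA * (target - q(s, a))

# Model-free route: replay stored *real* transitions.
def replay_updates(actions, batch=32):
    for s, a, r, s2 in random.sample(list(replay), min(batch, len(replay))):
        td_update(s, a, r, s2, actions)

# Model-based route: roll out a *learned* model and train on imagined
# transitions (Dyna/SimPLe style). These updates are only as accurate as
# model() is on the states the rollout visits -- exactly Hasselt's point.
def imagined_updates(model, policy, s0, actions, horizon=10):
    s = s0
    for _ in range(horizon):
        a = policy(s)
        r, s2 = model(s, a)  # predicted reward and next state
        td_update(s, a, r, s2, actions)
        s = s2
```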

My counterargument to this (supplied with zero evidence of course!) would be
that with the right inductive biases, a learned model can generalize
accurately from very few observed transitions, and hence be sample efficient
enough to outperform the replay memory approach. I'd imagine
that the kinds of inductive biases that are appropriate for a varied meta-
environment like Atari are quite general things like 'visually localized
objects typically only interact when they approach or touch each other', and
'the arrow keys likely control one localized object'. There are approaches to
encoding such priors; [2] is a good survey paper, and [3] employs some of
these ideas for RL. Moreover, these are the kinds of priors that one imagines
are encoded or biased towards by evolution in actual animal brains.
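
For a flavor of what encoding the first prior might look like in a learned
dynamics model, here is a toy sketch (my own, not drawn from [2] or [3]):
pairwise effects are hard-masked by distance, so the model never has to spend
samples learning that far-apart objects leave each other alone. The effect
functions stand in for small learned networks; the distance mask is the
hand-coded inductive bias.

```python
# Toy "objects only interact when near each other" prior.
import numpy as np

def predict_next_state(objs, self_effect, pairwise_effect, radius=2.0):
    """objs: (N, D) array of object states; dims 0-1 assumed to be (x, y)."""
    delta = np.stack([self_effect(o) for o in objs])
    for i in range(len(objs)):
        for j in range(len(objs)):
            # Interaction term is exactly zero unless i and j are close.
            if i != j and np.linalg.norm(objs[i, :2] - objs[j, :2]) < radius:
                delta[i] += pairwise_effect(objs[i], objs[j])
    return objs + delta
```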

[1] [https://arxiv.org/abs/1906.05243](https://arxiv.org/abs/1906.05243)

[2] [https://arxiv.org/abs/1806.01261](https://arxiv.org/abs/1806.01261)

[3] [https://arxiv.org/abs/1806.01830](https://arxiv.org/abs/1806.01830)

------
s_Hogg
I wonder if these guys are a bit quick to give themselves credit.

There's a lengthy (and depressing) series of posts by Ben Recht[1] that
basically outlines all the reasons that model-free RL is effectively bunk if
you have access to any model of how the environment evolves. These guys (as
far as I can tell) haven't made any attempt to disambiguate how much of the
improvement they see is down to having a model of the environment _at all_,
as opposed to the specific model they propose. I think there's probably more
they could do to prove that they're actually onto something, and aren't just
confusing compute thrown at the problem with a genuinely better solution.

[1]: [http://www.argmin.net/outsider-rl.html](http://www.argmin.net/outsider-rl.html)

~~~
habitue
I mean, everyone knows that if you hand a human-created model to a learning
algorithm, it's going to do better. The posts you're pointing to here are
talking about using human intelligence to build a model of the system. I
think the point of this paper is that the model is learned in a self-
supervised way, and that learning environment dynamics by predicting the
next frame is useful by itself. That's not as obvious.
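
For what that setup looks like in miniature (a bare-bones sketch with made-up
layer sizes; SimPLe's actual model is a far larger video-prediction network):
the target is simply the next frame the environment emits anyway, so the
dynamics model trains with no human-built model and no extra labels.

```python
# Bare-bones next-frame prediction: the dynamics model is trained
# self-supervised, with the environment's own next frame as the target.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NextFrameModel(nn.Module):
    def __init__(self, n_actions):
        super().__init__()
        self.enc = nn.Conv2d(3, 32, kernel_size=3, padding=1)
        self.act = nn.Embedding(n_actions, 32)
        self.dec = nn.Conv2d(32, 3, kernel_size=3, padding=1)

    def forward(self, frame, action):
        h = F.relu(self.enc(frame))
        h = h + self.act(action)[:, :, None, None]  # condition on the action
        return self.dec(h)

def training_loss(model, frame, action, next_frame):
    # No labels beyond the observed next frame itself.
    return F.mse_loss(model(frame, action), next_frame)
```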

