Model-Based Reinforcement Learning for Atari (arxiv.org)
58 points by henning 30 days ago | 3 comments

Van Hasselt and co., who authored the original Rainbow DQN model that was bested by SimPLe (this paper), responded recently with "When to use parametric models in reinforcement learning?" [1], which is an interesting read. Punchline: "We explain why their [SimPLe's] results may perhaps be considered surprising, and show that in a like-for-like comparison Rainbow DQN outperformed the scores of the model-based agent, with less experience and computation." In other words, in an apples-to-apples comparison, model-based RL has not been shown to be more sample efficient than a suitably optimized DQN.

Van Hasselt argues this is because, if you need accurate updates to your Q values, you need to trust the learned model you are using for simulated rollouts on the states you are sampling from. But if your simulation model is trustworthy on those states, it is because it saw a lot of real transitions from those states in the actual environment. In that case you might as well have stored those transitions in a big enough replay buffer and used ordinary Q-learning with experience replay. And this indeed seems to be the case: when you give Rainbow DQN a nice big replay buffer, it is more sample efficient (counting both real and imagined samples) than SimPLe. Van Hasselt leaves some wiggle room for learned models to help with action selection and credit assignment, though.
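
To make that argument concrete, here is a toy sketch of my own (not code from either paper). It assumes a hypothetical 5-state deterministic chain and a tabular "learned model" that just memorizes observed transitions. Under those assumptions, planning with the model and replaying the stored transitions produce exactly the same Q-learning updates, which is the sense in which a model that is only accurate where it has seen data buys you nothing over a replay buffer. All names below are made up for illustration.

```python
import random

N_STATES = 5        # states 0..4; reaching state 4 ends the episode with reward 1
ACTIONS = (0, 1)    # 0 = left, 1 = right
GAMMA = 0.9
ALPHA = 0.5


def env_step(state, action):
    """Toy deterministic chain; a hypothetical stand-in for the real environment."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def collect(n_steps, seed=0):
    """Gather real transitions with a uniformly random policy."""
    rng = random.Random(seed)
    data, state = [], 0
    for _ in range(n_steps):
        action = rng.choice(ACTIONS)
        nxt, reward, done = env_step(state, action)
        data.append((state, action, reward, nxt, done))
        state = 0 if done else nxt
    return data


def new_q():
    return {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}


def q_update(q, s, a, r, s2, done):
    """One ordinary Q-learning backup."""
    target = r if done else r + GAMMA * max(q[(s2, b)] for b in ACTIONS)
    q[(s, a)] += ALPHA * (target - q[(s, a)])


def train_replay(data, n_updates, seed=1):
    """Model-free: sample stored transitions from the replay buffer."""
    rng = random.Random(seed)
    q = new_q()
    for _ in range(n_updates):
        q_update(q, *rng.choice(data))
    return q


def train_model(data, n_updates, seed=1):
    """Model-based: fit a tabular model, then plan with imagined transitions."""
    rng = random.Random(seed)
    model = {(s, a): (r, s2, done) for s, a, r, s2, done in data}
    visited = [(s, a) for s, a, *_ in data]  # query the model where data exists
    q = new_q()
    for _ in range(n_updates):
        s, a = rng.choice(visited)
        r, s2, done = model[(s, a)]
        q_update(q, s, a, r, s2, done)
    return q


data = collect(200)
q_replay = train_replay(data, 2000)
q_model = train_model(data, 2000)
print(q_replay == q_model)
```

The equivalence only holds because the model here cannot generalize beyond the (state, action) pairs it has seen. A parametric model with useful generalization would break the tie, for better or worse.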

My counterargument to this (supplied with zero evidence of course!) would be that with the right inductive biases, a learned model can generalize quite accurately and with very few seen transitions, and hence be so sample efficient that it would outperform the replay memory approach. I'd imagine that the kinds of inductive biases that are appropriate for a varied meta-environment like Atari are quite general things like 'visually localized objects typically only interact when they approach or touch each other', and 'the arrow keys likely control one localized object'. There are approaches for how to encode such priors; [2] is a good survey paper, and [3] employs some of these ideas for RL. Moreover, these are the kinds of priors that one imagines are encoded or biased towards by evolution in actual animal brains.

[1] https://arxiv.org/abs/1906.05243

[2] https://arxiv.org/abs/1806.01261

[3] https://arxiv.org/abs/1806.01830

I wonder if these guys are a bit quick to give themselves credit.

There's a lengthy (and depressing) series of posts by Ben Recht [1] that basically outlines all the reasons that model-free RL is effectively bunk if you have access to any model of how the environment evolves. These guys (as far as I can tell) haven't made any attempt to disentangle how much of the improvement they see is down to having a model of the environment _at all_, as against the specific model they propose. I think there's probably more these guys could do to prove that they're actually onto something and aren't just mistaking extra compute thrown at the problem for a genuinely better solution.

[1]: http://www.argmin.net/outsider-rl.html

I mean, everyone knows that if you hand a human-created model to a learning algorithm, it's going to do better. The posts you're linking to are talking about using human intelligence to build a model of the system. I think the point of this paper is that the model is learned in a self-supervised way, and that learning environment dynamics by predicting the next frame is useful by itself. That's not as obvious.
