
PlaNet: A Deep Planning Network for Reinforcement Learning - danijar
https://ai.googleblog.com/2019/02/introducing-planet-deep-planning.html
======
mark_l_watson
Nice, another large contribution to the field from Google, just a day after
OpenAI's paper on better language models and their implications. This is in
addition to other nice recent public contributions from Uber, Facebook,
Microsoft, etc.

I think I understand these huge tech companies' "generosity": these public
contributions to the field probably help in recruiting efforts, much like
salary and fringe benefits do. The field is moving and growing so fast that it
is difficult to hire talent right now (I manage a machine learning team at a
very large company, and at least this is my experience).

This paper is claiming a 5000 times increase in performance over previous
state-of-the-art techniques. Huge.

~~~
GistNoesis
It's not as "huge" as they make it look. The goal of the technique is to
increase data efficiency (the number of real-world tries the agent needs to
learn). So instead of using only real trajectories, it simulates trajectories
(that's the planning) and learns from those.
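
To make that concrete, here is a minimal sketch of planning by simulated
rollouts: the cross-entropy method searching over action sequences under a
learned model. The `dynamics` and `reward` callables are hypothetical
stand-ins for PlaNet's learned latent-space models, not its actual API:

    import numpy as np

    def plan(state, dynamics, reward, horizon=12, candidates=1000,
             elites=100, iters=10, action_dim=2):
        # Search over action sequences with the cross-entropy method,
        # evaluating each candidate entirely inside the learned model.
        mean = np.zeros((horizon, action_dim))
        std = np.ones((horizon, action_dim))
        for _ in range(iters):
            actions = mean + std * np.random.randn(candidates, horizon, action_dim)
            returns = np.zeros(candidates)
            for i in range(candidates):
                s = state
                for t in range(horizon):
                    returns[i] += reward(s, actions[i, t])
                    s = dynamics(s, actions[i, t])
            # Refit the sampling distribution to the highest-return sequences.
            best = actions[np.argsort(returns)[-elites:]]
            mean, std = best.mean(axis=0), best.std(axis=0)
        return mean[0]  # execute only the first action, then re-plan

Only the first action of the best sequence is executed before re-planning,
the standard model-predictive-control loop.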

This line of ideas is not new. The main problems associated with it are that
it is almost always more computationally expensive (you learn from both real
and dreamed trajectories), and that it is harder to learn because it is
susceptible to a kind of exposure bias: once the agent has built a model like
"the earth is flat", it will simulate/dream trajectories according to that
model, diluting the weak evidence from real data telling it that the earth is
round, and so it gets stuck with a wrong model.
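
As a rough illustration of the dilution, consider a Dyna-style training loop
that mixes real and dreamed transitions (the `model` object and the ratio
here are hypothetical, just to show the effect):

    import random

    def training_batch(real_buffer, model, batch_size=64, dream_ratio=0.9):
        # real_buffer holds (state, action, reward, next_state) tuples.
        batch = []
        for _ in range(batch_size):
            if random.random() < dream_ratio:
                # Dreamed transition: start from a real state, but let the
                # (possibly wrong) model supply the reward and next state.
                s, a, _, _ = random.choice(real_buffer)
                batch.append((s, a, model.reward(s, a), model.step(s, a)))
            else:
                batch.append(random.choice(real_buffer))  # ground truth
        return batch  # at dream_ratio=0.9, only ~10% of updates see real data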

The performance gain you refer to is a gain relative to a naive way of doing
things, i.e. working directly in pixel space.

Don't get me wrong, I'm a big fan of the model-based approach, and every small
step in this direction is good, as it helps with explainability. This paper is
one of those nice small steps, but it doesn't compare to the gains from
previous techniques like experience replay or hindsight experience replay.

~~~
danijar
Author here. First of all, I'd like to clarify that the data efficiency gain
over D4PG is 5000% or 50x.

Regarding computational efficiency, we match D4PG, a top model-free agent that
uses experience replay among other techniques (actor-critic, distributional
loss, n-step returns, prioritized replay, distributed experience collection).
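
For reference, a minimal sketch of one of those listed ingredients, n-step
returns: bootstrapped targets that sum n real rewards before falling back on
the value estimate (the function below is illustrative, not D4PG's code):

    def n_step_return(rewards, values, t, n, gamma=0.99):
        # Target for timestep t: sum n discounted real rewards, then
        # bootstrap from the value estimate at timestep t + n.
        g = sum(gamma ** k * rewards[t + k] for k in range(n))
        return g + gamma ** n * values[t + n]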

Your point about exposure bias is interesting, and applies equally to agents
that do not learn a model. Personally, I think we need reliable uncertainty
estimates in neural networks to make progress on this research question, so
the agent can know what it doesn't know.
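
One common recipe for such estimates, sketched here under the assumption of
an ensemble of independently trained dynamics models, is to use their
disagreement as an uncertainty signal:

    import numpy as np

    def predict_with_uncertainty(models, state, action):
        # Disagreement between independently trained models acts as a
        # rough uncertainty estimate: high std marks unfamiliar regions.
        preds = np.stack([m(state, action) for m in models])
        return preds.mean(axis=0), preds.std(axis=0)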

Hindsight experience replay doesn't apply to tasks where the inputs are images
because it requires knowledge of a meaningful goal space with a distance
function (e.g. 2D coordinates of goal positions).
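
To illustrate, HER's relabeling step recomputes rewards from a distance in an
explicit goal space (the function and its epsilon threshold are illustrative):

    import numpy as np

    def relabel(state, action, next_achieved, new_goal, eps=0.05):
        # Hindsight relabeling: pretend the achieved outcome was the goal.
        # The sparse reward is recomputed from a distance in goal space,
        # e.g. 2D positions; with raw pixels there is no such metric.
        reward = 0.0 if np.linalg.norm(next_achieved - new_goal) < eps else -1.0
        return state, action, reward, new_goal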

