
TDM: From Model-Free to Model-Based Deep Reinforcement Learning - jonbaer
http://bair.berkeley.edu/blog/2018/04/26/tdm/
======
jampekka
Although the naming has a rationale (as explained in the footnote), using the
name Temporal Difference Model for this method is a recipe for a lot of
confusion.

https://en.wikipedia.org/wiki/Temporal_difference_learning

------
keyle
Great article. It started out easy to read, but then it got to the math section
and lost me.

Any other developers like me who get lost at the math?

I really wish I could contribute and learn more about this field, but every
time I try, there is a point where I hit a brick wall and turn away.

~~~
yorwba
I have been steeped in math for as long as I can remember, so I frequently have a
hard time telling whether an explanation is actually helpful to someone who
doesn't already understand the underlying concept, and I'd like to use this
opportunity to improve.

Could you specify where exactly you get lost, and why? (Is it something that's
just not explained, or is there an explanation, but one that doesn't make it
easier to understand?)

~~~
sarabande
Things that weren't obvious to me as a non-mathematician:

    f: S × A → S

(I assumed it meant: the model is represented by function f, whose inputs can
be any combination of S and A [domain], and will produce an output value in S
[codomain])

Why do we set the Q to 0 below?

    The constraint that Q(s_t, a_t, s_{t+K}, K) = 0 enforces the feasibility of the trajectory

~~~
yorwba
> the model is represented by function f, whose inputs can be any combination
> of S and A [domain], and will produce an output value in S [codomain])

Exactly. f: S × A → S is a function signature, just like in a programming language.
Basically, the model tells you which state you end up in after taking a
certain action in the given state.
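
To make the "just like in a programming language" analogy concrete, here is a
minimal Python sketch (the State and Action types and the toy dynamics are
made-up placeholders for illustration, not something from the post):

    from typing import Callable
    
    # Hypothetical types standing in for the state space S and action space A.
    State = tuple[float, float]   # e.g. an (x, y) position
    Action = tuple[float, float]  # e.g. a (dx, dy) velocity command
    
    # f: S × A → S as a Python type: the model maps a (state, action)
    # pair to the predicted next state.
    DynamicsModel = Callable[[State, Action], State]
    
    def f(s: State, a: Action) -> State:
        # Toy deterministic dynamics: the action shifts the state directly.
        return (s[0] + a[0], s[1] + a[1])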

> Why do we set the Q to 0 below?

Q is introduced as

 _A temporal difference model (TDM)†, which we will write as Q(s, a, s_g, τ),
is a function that, given a state s ∈ S, action a ∈ A, and goal state s_g ∈ S,
predicts how close an agent can get to the goal within τ time steps.
Intuitively, a TDM answers the question, “If I try to bike to San Francisco in
30 minutes, how close will I get?”_

That means that setting Q(s_t, a_t, s_{t+K}, K) = 0 is the same as enforcing
that the state s_{t+K} can actually be reached (distance 0) from s_t in K time
steps. Without the constraint, it would be possible to plan a trajectory that
can't be executed because the intermediate goals are too far away.
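
As a rough illustration, here is a minimal Python sketch of using the
constraint as a feasibility check on each K-step segment of a candidate plan
(tdm_q, K, and the tolerance are my own placeholder names, not the paper's
actual optimizer):

    # Hypothetical sketch: use Q(s_t, a_t, s_{t+K}, K) = 0 as a feasibility
    # check on each K-step segment of a candidate trajectory.
    K = 25        # steps between intermediate goal states
    TOL = 1e-3    # numerical tolerance for "distance 0"
    
    def segment_feasible(tdm_q, s_t, a_t, s_goal):
        # tdm_q(s, a, s_g, tau) returns the predicted distance to s_g
        # after tau steps; 0 means s_goal is actually reachable.
        return abs(tdm_q(s_t, a_t, s_goal, K)) <= TOL
    
    def plan_is_executable(tdm_q, states, actions):
        # Reject the whole plan if any K-step segment is infeasible.
        return all(
            segment_feasible(tdm_q, states[t], actions[t], states[t + K])
            for t in range(0, len(states) - K, K)
        )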

~~~
sarabande
Ah, thanks for the explanation!

------
mooneater
Models that have an explicit goal state and a distance estimator are naturally
going to be more efficient than vanilla RL, which lacks this side information
and has to learn purely through exploration.

Edit: I'm saying it looks like apples vs. oranges.

------
edhu2017
How is the distance being estimated? As another comment said, if a good
distance estimator is provided, that simplifies the task. Is there a baseline
that uses distance in its input as well?

------
gok
> You’ve decided that you want to bike from your house by UC Berkeley to the
> Golden Gate Bridge

Kind of off topic but... is this even possible with the route drawn? Isn’t it
still impossible to take a bicycle on the West span of the Bay Bridge?

