Hacker News
TDM: From Model-Free to Model-Based Deep Reinforcement Learning (bair.berkeley.edu)
97 points by jonbaer on April 26, 2018 | 11 comments

Although the naming has a rationale (as explained in the footnote), using the name Temporal Difference Model for this method is a recipe for a lot of confusion.


Great article, started very easy to read, then it got to the math section and it lost me.

Any developers like me that get lost at the math?

I really wish I could contribute and learn more about this field, but every time I try, there is a point where I hit a brick wall and turn away.

You are not alone, but I personally think (and I think many others do too) that AI will get to the point where you don't really need to understand the math behind it, only how it works. For example, you only need to know what backpropagation does and how it works, not the exact formula for it. I think you can already see this happening with Keras. You would only need to know the math and the nitty-gritty if you want to build or research state-of-the-art ML/DL.

I have been steeped in math for as long as I can remember, so I frequently have a hard time telling whether an explanation is actually helpful to someone who doesn't already understand the underlying concept, and I'd like to use this opportunity to improve.

Could you specify where exactly you get lost, and why? (Is it something that's just not explained, or is there an explanation, but one that doesn't make it easier to understand?)

Things that weren't obvious to me as a non-mathematician:

(I assumed it meant: the model is represented by function f, whose inputs can be any combination of S and A [domain], and will produce an output value in S [codomain])

Why do we set the Q to 0 below?

    The constraint that Q(s_t,a_t,s_{t+K},K)=0 enforces the feasibility of the trajectory

> the model is represented by function f, whose inputs can be any combination of S and A [domain], and will produce an output value in S [codomain]

Exactly. f:S×A↦S is a function signature, just like in a programming language. Basically, the model tells you which state you end up in after taking a certain action in the given state.
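To make the programming-language analogy concrete, here is a minimal sketch in Python type hints. The grid world, the `State` and `Action` aliases, and the `MOVES` table are all made up for illustration and are not from the article; the only point is that f takes one state and one action and returns a state.

```python
from typing import Callable, Tuple

# Hypothetical toy world: a state is an (x, y) grid position,
# an action is one of four move names.
State = Tuple[int, int]
Action = str

# The type of any dynamics model f: S x A -> S.
Model = Callable[[State, Action], State]

MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}


def f(s: State, a: Action) -> State:
    """Toy deterministic model: apply the chosen move to the current position."""
    dx, dy = MOVES[a]
    return (s[0] + dx, s[1] + dy)


model: Model = f  # f satisfies the signature S x A -> S
print(model((0, 0), "right"))  # (1, 0)
```

In the article's setting the model would be learned from data rather than written by hand, but the signature is the same.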

> Why do we set the Q to 0 below?

Q is introduced as:

> A temporal difference model (TDM)†, which we will write as Q(s,a,s_g,τ), is a function that, given a state s∈S, action a∈A, and goal state s_g∈S, predicts how close an agent can get to the goal within τ time steps. Intuitively, a TDM answers the question, “If I try to bike to San Francisco in 30 minutes, how close will I get?”

That means that setting Q(s_t,a_t,s_{t+K},K)=0 is the same as enforcing that the state s_{t+K} can actually be reached (distance 0) from s_t in K time steps. Without the constraint, it would be possible to plan a trajectory that can't be executed because the intermediate goals are too far away.
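If it helps, here is a rough sketch of that idea in Python. Everything in it is an assumption for illustration: the world is a 1-D line where the agent moves at most one unit per step, `toy_tdm` is a hand-written closed form standing in for the learned Q, and I count the action a as the first of the τ steps (the paper's exact indexing may differ).

```python
def toy_tdm(s: int, a: int, goal: int, tau: int) -> int:
    """Hand-written stand-in for Q(s, a, s_g, tau) on a 1-D line.

    Returns the negative of the closest distance to `goal` reachable in
    `tau` steps when the first step is `a` and every step moves at most
    one unit. A value of 0 means the goal can be reached exactly.
    """
    if a not in (-1, 0, 1):
        raise ValueError("action must be -1, 0, or 1")
    if tau < 1:
        raise ValueError("tau must be at least 1")
    remaining = abs(s + a - goal)        # distance left after the first step
    return -max(0, remaining - (tau - 1))  # close the gap with the other steps


def is_feasible(s: int, a: int, waypoint: int, k: int) -> bool:
    """The constraint Q(s_t, a_t, s_{t+K}, K) = 0 from the article."""
    return toy_tdm(s, a, waypoint, k) == 0


print(toy_tdm(0, 1, 5, 3))      # -2: the agent ends up 2 units short
print(is_feasible(0, 1, 3, 3))  # True: position 3 is reachable in 3 steps
print(is_feasible(0, 1, 5, 3))  # False: position 5 is too far for 3 steps
```

A planner that only keeps waypoints for which `is_feasible` holds can never propose an intermediate goal that is out of reach, which is what the constraint is for.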

Ah, thanks for the explanation!

I hate to say it, but you should probably understand some math before trying to do some math.

This would be like a statistician reading an article about some software implementation and complaining that they couldn't understand the jargon.

Models that are given an explicit goal state and a distance estimator are naturally going to be more efficient than vanilla RL, which lacks this side information and has to learn purely by exploration.

Edit: I'm saying it looks like apples vs oranges.

How is the distance estimated? As another comment said, if a good distance estimator is provided, this simplifies the task. Is there a baseline that uses distance in its input as well?

> You’ve decided that you want to bike from your house by UC Berkeley to the Golden Gate Bridge

Kind of off topic but... is this even possible with the route drawn? Isn’t it still impossible to take a bicycle on the West span of the Bay Bridge?
