
Reinforcement Learning: An Introduction, Second Edition - Buttons840
http://incompleteideas.net/book/the-book.html
======
snrji
Can someone explain the following doubt in layman's terms?

One of the (many) things I still don't understand about RL, even after having
tried to read about it, is how the loss function of Q-networks is computed if
you don't have the target value. I understand that you are trying to predict
the "value" of a (state, action) pair, and I know that there is a formula for
updating the weights based on the difference between the network's current
prediction and the reward plus the maximum predicted value of the next state
(probably I'm messing something up here). But doesn't that mean that the
initialization is extremely relevant? As in, how will the actual rewards ever
win out over the random values the randomly initialized network already produces?
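
(For reference, the update being described here looks like the standard
Q-learning target; in the usual notation:)

    y = r + \gamma \max_{a'} Q(s', a'; \theta)
    L(\theta) = (y - Q(s, a; \theta))^2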

~~~
GistNoesis
It's easier to understand by considering finite-horizon problems. In a
finite-horizon problem, the target value for the last time step is exactly the
expected received reward, so it does not depend on the initialization at all.
Your network learns to approximate it, and that approximation is then used to
compute the approximation for the second-to-last time step. So you construct
the value from the end (in a "dynamic programming" kind of way).
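
A tiny tabular sketch of that backward construction (everything here, the
transition probabilities P, rewards R, and horizon H, is invented purely for
illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    n_states, n_actions, H = 4, 2, 5
    # Random toy MDP: P[s, a] is a distribution over next states.
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
    R = rng.standard_normal((n_states, n_actions))

    Q = np.zeros((H, n_states, n_actions))
    # Last time step: the target is exactly the expected immediate reward.
    Q[H - 1] = R
    # Earlier steps: bootstrap from the step after (dynamic programming).
    for t in range(H - 2, -1, -1):
        V_next = Q[t + 1].max(axis=1)   # best achievable value at step t+1
        Q[t] = R + P @ V_next           # Bellman backup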

But the Bellman equation works just the same for infinite-horizon problems.
The weights all converge simultaneously so that the network's approximation
solves the equation.
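
In the deep case, the loss the original question asks about is usually
computed along these lines (a PyTorch-style sketch; q_net, target_net, and
batch are placeholder names, not from any particular library). The key point
is that the target is held fixed inside torch.no_grad(), so gradients only
flow through the network's own prediction:

    import torch
    import torch.nn.functional as F

    def dqn_loss(q_net, target_net, batch, gamma=0.99):
        s, a, r, s_next, done = batch   # tensors sampled from a replay buffer
        # The network's current prediction Q(s, a).
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        with torch.no_grad():           # the target is treated as a constant
            max_next = target_net(s_next).max(dim=1).values
            target = r + gamma * (1.0 - done) * max_next   # TD target
        return F.mse_loss(q_sa, target)

Because the target keeps getting re-anchored to real observed rewards r, the
influence of the random initialization washes out as training proceeds.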

A great resource for understanding RL is OpenAI's Spinning Up.

~~~
snrji
I see, thanks for your time!

------
Buttons840
The Second Edition of this book appears to be complete. You can get a free PDF
from the authors' site.

