
Reinforcement Learning: From Zero to State of the Art with Pytorch 4 - codentropy
https://github.com/higgsfield/RL-Adventure-2
======
activatedgeek
The title is "click-baity" but I've used this repo in the past to verify parts
of my code and it is highly recommended. The code is amazingly clean!

Once you've verified that the implementations are correct, it is easier to
start the journey to reproduce SOTA on harder problems by playing around with
the side-tricks that are often employed.

------
RobertoG
Pytorch 4?

Pytorch 1 is not available yet.

I suppose it means PyTorch 0.4, but that doesn't sound as good. We should leave
marketing out of technical explanations.

What will happen in the future when there is a PyTorch 4 and somebody finds
this repo?

~~~
Sean1708
As far as I can tell it's only the Hacker News title that mentions PyTorch 4
(in fact the repo description says "PyTorch0.4 tutorial ..."), and I think
it's far more likely that it was just a typo or misunderstanding than some
sort of insidious marketing technique.

~~~
RobertoG
Maybe I just dreamed it, but I think the description on GitHub said "Pytorch
4". Not anymore.

Well, if it was changed because of my comment, I'm happy to have been useful.
If it was not changed and I dreamed it, my apologies.

------
2bitencryption
I did a great deal of reading on Q-learning around the time of the original
AlphaGo; it looks like that was covered in a previous repo (RL-Adventures-1).

This new one doesn't seem to mention Q-learning. Is that because all these
examples are implicitly based on Q-learning, or are they totally new
alternatives?

~~~
blt
In a Deep Q Network (DQN), the Q network itself is the policy. You have one
output for each possible action, and the neural network estimates the Q
value of each action in the current state. You act by selecting the action
with the highest Q output. This doesn't work if the action space is
continuous, e.g. motor torques for a robot.
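A minimal sketch of that discrete-action setup (network sizes and dimensions here are made up for illustration, not taken from the repo):

```python
import torch
import torch.nn as nn

# Hypothetical DQN head: one Q-value output per discrete action.
class QNetwork(nn.Module):
    def __init__(self, state_dim=4, num_actions=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, num_actions),  # Q(s, a) for every action at once
        )

    def forward(self, state):
        return self.net(state)

q = QNetwork()
state = torch.randn(1, 4)           # e.g. a CartPole-sized observation
q_values = q(state)                 # shape (1, num_actions)
action = q_values.argmax(dim=1)     # greedy policy: pick the best action
```

The argmax over a finite set of outputs is what breaks down when the action space is continuous.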

You might think you can fix this by making the action an input to the Q
network and keeping only one output; then you could find the action with the
highest output. But due to the nonlinearity in the neural network, this is an
intractable nonconvex optimization problem.

So instead, you train a neural network to output the action given the state.
The algorithms are harder to understand, because Q-learning is kind of like
supervised learning but policy gradients really aren't. A lot of algorithms
(A2C, DDPG, TRPO, etc.) still use a one-output Q network (as described in the
previous paragraph), but it is just a part of the learning algorithm*.
Once training is done, you throw away this Q network. The learned behavior is
entirely contained in the policy network. These methods are usually called
policy gradient methods.
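A sketch of the corresponding continuous-action policy network (a DDPG-style deterministic actor; the dimensions and torque bound are hypothetical):

```python
import torch
import torch.nn as nn

# Hypothetical deterministic actor: maps state directly to a continuous
# action (e.g. motor torques), sidestepping the argmax-over-actions problem.
class PolicyNetwork(nn.Module):
    def __init__(self, state_dim=8, action_dim=2, max_torque=1.0):
        super().__init__()
        self.max_torque = max_torque
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, action_dim),
            nn.Tanh(),  # squash outputs into (-1, 1)
        )

    def forward(self, state):
        return self.max_torque * self.net(state)  # scale to torque limits

policy = PolicyNetwork()
state = torch.randn(1, 8)
action = policy(state)  # continuous action vector, within torque bounds
```

During training a critic (the one-output Q network) scores these actions; at deployment only this policy network is kept.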

This article covers policy gradient methods only.

* It's possible to do "pure" policy gradients using only the empirical return, but the Q network helps reduce the variance of the gradient estimate and stabilize learning.

------
poiuytqwer
Anyone got anything similar to this for deep learning?

~~~
poppingtonic
[https://github.com/fastai/fastai/tree/master/courses/](https://github.com/fastai/fastai/tree/master/courses/)

------
zxcvvcxz
Hate to be that person, but -

If you can go from zero to SOTA from a web tutorial in anything less than, I
don't know, a year, then the field is severely underdeveloped.

But let that be an opportunity: what it really means is that the field is
quickly growing and there aren't well-established experts or leaders.

~~~
laGrenouille
Looking at the material here, I think you may be underestimating what is
meant by "zero" here. To fully follow these notes you would need to have
significant research-level knowledge of general machine learning techniques,
with a good deal of specific experience working with neural networks (both in
theory and application).

I think someone with a general graduate-level background in most fields of ML
or computer science could acquire a working knowledge of the SOTA in a
particular associated sub-topic after reading through a half dozen or so
research papers and code. That doesn't seem unreasonable or surprising.

~~~
backpropaganda
He's also underestimating what it means to be a hero. Most of the experiments
here are on CartPole, which is a very basic environment (not even MNIST-level
complexity). Most papers in RL use Atari, which requires some amount of
engineering that this repo does not have.

