
OpenAI Baselines - astdb
https://blog.openai.com/openai-baselines-dqn/
======
Smerity
To extend on what was written in the article, reproducibility is difficult in
science generally but can be insanely difficult for machine learning. The key
insight I've found over the years in the field, especially when applied to
deep learning, is that gradient descent is highly effective at hiding my darn
bugs.

I ran into this recently by accident when writing a simple RL example. Of the
two weight matrices being learned, the first was given correct gradients while
the second only received partial gradient information.
Surprise, still works, and I only discovered the bug _after_ submitting to
OpenAI's Gym with quite reasonable results. I've seen similar issues in the
past such as accidentally leaving a part of the network frozen (i.e. it was
randomly initialized and never changed) yet the model still happily went along
with it.
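
As a rough sketch of that "frozen yet still learning" failure mode (PyTorch
assumed; the setup is made up for illustration, not the code from the actual
Gym submission):

    import torch

    torch.manual_seed(0)
    X = torch.randn(256, 10)   # toy inputs
    y = torch.randn(256, 1)    # toy targets

    W1 = torch.randn(10, 32, requires_grad=True)
    W2 = torch.randn(32, 1, requires_grad=True)

    opt = torch.optim.SGD([W1, W2], lr=1e-2)
    for step in range(500):
        h = torch.tanh(X @ W1)
        out = h.detach() @ W2      # bug: detach() cuts all gradient flow to W1
        loss = ((out - y) ** 2).mean()
        opt.zero_grad()
        loss.backward()            # W1.grad stays None; only W2 is updated
        opt.step()

    # The loss still drops, because W2 learns to use the random, frozen
    # features produced by W1 - which is exactly why the bug is easy to miss.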

This is good and bad. Bad in that it makes errors difficult to catch. Good in
that, if you had a reason for freezing part of the network (maybe transfer
learning etc) your model will learn to happily use it, even if that
"information" is more akin to noise.

Regarding reproducibility, most papers I've gone to reproduce take far longer
than expected and usually involve deducing or requesting additional information
from the lead authors. Even minor details, such as how the loss is normalized
(loss / (batch * timestep) vs loss / batch), can cause substantial confusion,
and since they seem "insignificant" and papers have space constraints, they are
rarely written down.
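
As a concrete (hypothetical, NumPy-based) illustration of the two conventions
above - the reported numbers differ by a factor of the sequence length, which
matters when comparing losses or tuning learning rates:

    import numpy as np

    batch, timesteps = 32, 50
    token_losses = np.random.rand(batch, timesteps)   # per-token losses

    loss_per_token = token_losses.sum() / (batch * timesteps)  # loss / (batch * timestep)
    loss_per_batch = token_losses.sum() / batch                # loss / batch

    print(loss_per_token, loss_per_batch)  # second is ~timesteps times larger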

The worst I've seen recently was a state-of-the-art published result where the
paper was accepted to a conference, yet they didn't include a single
hyperparameter for their best performing model - and no code. There is near
zero ability to reproduce that given the authors spent a small nuclear
reactor's worth of compute performing grid search to find the optimal
hyperparameters.

tl;dr: There are reproducibility issues all the way up the stack, from gradient
descent working against you, to minor omissions in papers, to full-fledged
omissions that are still accepted by the community.

~~~
cf
I wish there were more ways to unit test the gradient calculations in my code.
While I try to use AD as much as possible, it isn't always feasible. And of
course it isn't just the gradients: if you're implementing something like
Adagrad, there are additional corners where bugs can lurk.

~~~
Smerity
I presume you're already familiar with computing the numerical and analytical
Jacobian[1][2] and just wishing for a better way? :) They're memory intensive
as all hell and pretty finicky but at least it's something. I'll admit that
when floating point calculations are involved it can all go to hell anyway.
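
For anyone who hasn't seen it, the core idea is just a central finite
difference check. A minimal sketch (NumPy assumed; f and x stand in for your
own loss function and parameters):

    import numpy as np

    def numerical_grad(f, x, eps=1e-5):
        # central-difference estimate of df/dx, one coordinate at a time
        grad = np.zeros_like(x)
        for i in range(x.size):
            orig = x.flat[i]
            x.flat[i] = orig + eps
            f_plus = f(x)
            x.flat[i] = orig - eps
            f_minus = f(x)
            x.flat[i] = orig
            grad.flat[i] = (f_plus - f_minus) / (2 * eps)
        return grad

    # Example: the analytical gradient of f(x) = sum(x**2) is 2x.
    x = np.random.randn(4, 3)
    assert np.allclose(2 * x,
                       numerical_grad(lambda z: np.sum(z ** 2), x),
                       atol=1e-4)

The per-coordinate loop is why it gets so memory and compute hungry for real
models, but it catches a lot.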

I recently had to implement gradient calculations by hand (writing custom CUDA
code) and had a pretty terrible time. Mixing the complications of CUDA code
with my iffy manual differentiation and floating point silliness can drive you
a little bonkers. I ended up implementing a slow, automatically differentiated
version and compared the resulting outputs and gradients to help work through
my bugs.
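
The "slow autodiff reference" approach looks roughly like this (PyTorch
assumed; the toy op below is a stand-in for the actual custom CUDA kernel):

    import torch

    class MySquare(torch.autograd.Function):
        # toy custom op with a hand-written backward, standing in for a kernel
        @staticmethod
        def forward(ctx, x):
            ctx.save_for_backward(x)
            return x * x

        @staticmethod
        def backward(ctx, grad_out):
            (x,) = ctx.saved_tensors
            return grad_out * 2 * x  # hand-derived gradient

    # gradcheck compares the hand-written backward against numerical gradients;
    # double precision inputs are needed for the finite differences to behave.
    x = torch.randn(5, 7, dtype=torch.double, requires_grad=True)
    print(torch.autograd.gradcheck(MySquare.apply, (x,), eps=1e-6, atol=1e-4))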

Here's hoping that TensorFlow's XLA and other JIT-style CUDA
compilers/optimizers will make much of this obsolete in the near future.

For those not familiar, the overhead for calling a CUDA kernel can be insanely
high, especially when you're just doing an elementwise operation such as an
add. Given your neural network likely has many of these, fusing them into one
small piece of custom CUDA can result in substantial speed increases.
Unfortunately there's not really any automatic way of doing that yet; we're
stuck in the days of either writing manual assembly or being fine with
suboptimal compiled C.
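
A rough sketch of the launch overhead issue (PyTorch assumed, numbers
illustrative): a hundred tiny elementwise kernels versus one kernel doing the
mathematically equivalent work.

    import time
    import torch

    if torch.cuda.is_available():
        x = torch.randn(1024, 1024, device="cuda")

        def many_launches(t):
            for _ in range(100):
                t = t + 0.01          # 100 separate elementwise kernel launches
            return t

        def one_launch(t):
            return t + 1.0            # the same result in a single kernel

        for name, fn in (("100 launches", many_launches), ("1 launch", one_launch)):
            fn(x)                     # warm-up
            torch.cuda.synchronize()
            start = time.time()
            for _ in range(100):
                fn(x)
            torch.cuda.synchronize()
            print(name, (time.time() - start) * 1000, "ms")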

[1]:
[https://www.tensorflow.org/versions/r0.11/api_docs/python/te...](https://www.tensorflow.org/versions/r0.11/api_docs/python/test/gradient_checking)

[2]:
[https://github.com/pytorch/pytorch/blob/master/torch/autogra...](https://github.com/pytorch/pytorch/blob/master/torch/autograd/gradcheck.py)

~~~
agibsonccc
We spent a ton of time thinking about this. We have an "op executioner" in our
tensor library that handles special cases like this. We call it "grid
execution": we look for opportunities to group ops automatically. We will be
combining that with our new computation graph to find such optimization
opportunities automatically.

Right now we hand write all of our own gradients as well.

The overhead can come from a ton of different places. This is why we wrote
workspaces:
[http://deeplearning4j.org/workspaces](http://deeplearning4j.org/workspaces)

Allocation reduction and op grouping are only a few things you can do.

------
gwern
Advertising: if you're interested in RL, subscribe to
[https://www.reddit.com/r/reinforcementlearning/](https://www.reddit.com/r/reinforcementlearning/)
!

------
SamBam
This is good advice, and it's just as important for authors to release the code
they used.

For my Master's thesis in AI (12 years ago, so before most of this open stuff)
I compared an existing Genetic Algorithm, described in a published paper,
against my improvement. My improvement was significantly better.

However, I relied on the prose description of the original algorithm. The
original paper (cited many, many times) didn't even have pseudocode, let alone
source code.

For my paper, I included pseudocode of both the original algorithm and my
improved algorithm. But we still didn't have established practices for how to
make source code available to readers in such a way that it would be archived
long term.

Is there an established way now?

~~~
zardo
[http://www.gitxiv.com](http://www.gitxiv.com)

Seems to be gaining popularity.

------
Houshalter
It's horrifying to think about how many published papers have incorrect
results due to bugs. There was a famous incident last year where someone
published a paper with amazing results and got a ton of attention. No one
could reproduce it and eventually it was withdrawn.

~~~
Smerity
Agreed. Admittedly the paper that you're speaking about was on the extreme
level of "bugged". The results were beyond stellar, attracting a great deal of
interest, and researchers who are usually very reserved quickly pointed out
methodological flaws and showed through previous / current experimentation how
broken it was. A friend noted that their process was so flawed it was
potentially _worse_ than if you'd just trained on the test set.

Your broader point is spot on, however. My general hope is that people are wary
when their results are strong^, checking that the good result didn't come from
"cheating", so that the bugs which remain are the kind that harm performance.
If a result is not reproducible (i.e. a "cheating" bug) it won't be used and
built on - but if a result is bugged yet reproducible (i.e. a bug where
performance was lower than it should have been) then the field can still move
forward in spite of these issues.

^ When I achieved state of the art for a task - especially given it was a huge
jump in accuracy for a relatively small model compared to the previous state
of the art - I spent many days sitting there double checking I hadn't
accidentally cheated ;)

~~~
Houshalter
Yeah the SARM paper was an extreme example, but that's why it got caught so
quickly. How many papers have less extreme but still serious flaws, and don't
get caught?

~~~
Smerity
As an extreme example it actually brought me a bit of hope. No source code was
released but the "peer review" via arXiv, Twitter, and other various channels
ended up bringing the story to a close.

I'd like to imagine it's how effective peer review could be if given
sufficient motivation ;)

As you note, though, most papers don't get anywhere near the same magnitude of
focus, and others which do may still be entirely unreproducible anyway :(

------
aub3bhat
Can anyone at OpenAI explain the sole focus on RL while ignoring Vision and NLP?

~~~
karpathy
Our focus at OpenAI is on AGI research. Many of us believe that Vision/NLP
research falls into the category of AI applications, and does not inform
insights into how to achieve generally intelligent agents. Instead, the
approach is to work with full-stack agents and the core challenge is to get
them to develop a cognitive toolkit for general problem solving, not anything
that has to do with the specifics of perception.

This is a historically backed insight. If you're interested in a good critique
of the decompose-by-function-then-combine-later approach, I recommend
"Intelligence without Representation" from Rodney Brooks
[http://www.scs.ryerson.ca/aferworn/courses/CPS607/CLASSES/su...](http://www.scs.ryerson.ca/aferworn/courses/CPS607/CLASSES/subsumption.pdf)

~~~
mindcrime
_Many of us believe that Vision/NLP research falls into the category of AI
applications, and does not inform insights into how to achieve generally
intelligent agents._

Hmm... I can agree that vision and NLP could be seen as "applications", from
one point of view. But I can see another position where each simply represents
a different aspect of underlying cognition. Language, in particular, seems to
be closely tied up in how we (humans) think. And without proposing a strong
version of the Sapir-Whorf hypothesis, I can't help but believe that a lot of
human cognition is carried out _in_ our primary language. Now to be fair, this
belief comes from not much more than obsessively trying to introspect on my
own thinking and "observe" my own mental processes.

In any case, it leads me to suspect that building generally intelligent AIs
will be tightly bound up with understanding how language works in the brain,
the extent to which there is a "mentalese", and how - if at all - a language
like English (or Mandarin or Tamil, whatever) maps onto "mentalese".
Vision also seems crucial to the way humans learn, given our status as
embodied agents that learn from the environment using sight, smell, sound,
kinesthetic awareness, proprioception, etc.

Quite likely I'm wrong, but I have a hunch that building a truly intelligent
agent may well require creating an agent that can see, hear, touch, smell,
balance, etc. At least to the extent that humans serve as our model of how to
construct intelligence.

On the other hand, as the old saying goes "we didn't build flying machines by
creating mechanical birds with flapping wings". :-)

~~~
dkarapetyan
I have to disagree with you on language and cognition. If you went around
asking famous mathematicians how they think, the last thing they would tell you
would be "words".

Creative and inventive thought is very picturesque and non-linear:
[https://www.amazon.com/Psychology-Invention-Mathematical-Fie...](https://www.amazon.com/Psychology-Invention-Mathematical-Field/dp/0486201074)

~~~
mindcrime
Also, thanks for the book recommendation. I ordered a copy. Looking forward to
digging into it.

------
sanjeetsuhag
This site is so pretty.

------
kmicklas
Releasing both code and models needs to be STANDARD for ML research.

~~~
joshmarlow
s/STANDARD/ALL/

~~~
SamBam
Did you substitute the wrong word?

~~~
joshmarlow
Yes, I definitely did.

