
A Recipe for Training Neural Networks - yigitdemirag
https://karpathy.github.io/2019/04/25/recipe/
======
GistNoesis
I usually try a technique Andrej didn't mention here, which helps me a lot in
the debugging and modelling phase: simulated data that encompasses a single
key difficulty of the problem.

For example, in this line of thought for question answering there is the bAbI
dataset, which creates auxiliary problems, so you know where the modelling
problems are.

By pushing this to the extreme, you can often create trivial problems that
take minutes to run on a single machine and help you discover most bugs. For
example, in NLP you can have a task that consists of repeating a sequence of
characters in reverse order, to demonstrate that the architecture is indeed
capable of memorizing like a parrot.
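A minimal sketch of what such a toy-task generator could look like (the
function name and defaults are my own, not from the article):

```python
import random
import string

def make_reverse_copy_example(length=8, alphabet=string.ascii_lowercase):
    """One toy sample: an input sequence and its reversal.

    A trivially solvable "parrot" task that checks whether an
    architecture can actually memorize and reproduce a sequence.
    """
    seq = [random.choice(alphabet) for _ in range(length)]
    return "".join(seq), "".join(reversed(seq))

# A few thousand of these train in minutes on a laptop and surface
# most wiring bugs (off-by-one masking, wrong decoding order, ...).
dataset = [make_reverse_copy_example() for _ in range(5000)]
```

If a model can't learn this, there is no point trying it on the real task.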

You can also create hierarchies of such problems, so you know in which order
you have to tackle them. And you can build sub-modules and reuse them.

Quite often the code you obtain this way is very explainable, and you know
which situations will work and which probably won't. But this network
architecture is usually "verbose" and optimizes numerically a little less
well on large-scale problems. The trick is then to simplify your network
mathematically into something more linear and more general. You can reorder
some operations, like summing along a different dimension first. Semantically
this is different, but it will converge better, because for a network to
optimize well it needs to work well in both the forward and the backward
direction, so that the gradient flows well.

Once you have a set of simple problems that encompass your general problem, a
good solution architecture is usually a more general mixture of the model
architectures of the simple problems.

~~~
dual_basis
This is great, like TDD for ML!

------
shmageggy
This might be the most "Deep Learning" thing I've ever read:

> _One time I accidentally left a model training during the winter break and
> when I got back in January it was SOTA (“state of the art”)._

~~~
akhilcacharya
Hope they're not using a cloud instance - that sounds incredibly expensive!

------
6gvONxR4sf7o
>The first step to training a neural net is to not touch any neural net code
at all and instead begin by thoroughly inspecting your data. This step is
critical.

This can't be overstated. I can't count the number of times I'm the first
person to find a problem with the data. It's incredibly frustrating. Just look
at your damned data to sanity check it and understand what's going on. Do the
same with your model outputs. Don't just look at aggregates. Look at
individual instances. Lots of them.

~~~
odnes
I was guilty of this once and would add some more specific advice: if your
dataset contains multiple candidate labels for the same samples, do not just
assume that the average of those labels is the best label. And don't assume
that training on the unclean data will produce a net that magically learns to
do the aggregation for you.
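A toy illustration of why averaging goes wrong (hypothetical numbers, not
from any real dataset):

```python
from collections import Counter

# Suppose three annotators assigned class ids to the same sample.
labels = [0, 0, 2]

# The mean is not even a valid class id.
mean_label = sum(labels) / len(labels)

# A majority vote (or an explicit annotator-noise model) is usually
# a better stand-in for "the best label".
majority_label = Counter(labels).most_common(1)[0][0]
```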

------
olooney
Well, this was unexpectedly excellent.

I don't think "stick with supervised learning" is very good advice, though.
Unsupervised techniques sometimes work well for NLP and have worked well in
other domains, such as medical records[1]. In particular, anytime you have
access to much more unlabeled data than labeled data, it's something you
should at least consider.

[1]:
[https://www.nature.com/articles/srep26094](https://www.nature.com/articles/srep26094)

~~~
m0zg
Why "unexpectedly"? Karpathy has a weapons-grade knack for explaining complex
subjects in plain terms. Case in point:
[http://karpathy.github.io/2016/05/31/rl/](http://karpathy.github.io/2016/05/31/rl/)
explains RL in a way even a non-practitioner will have little trouble
understanding. Another prominent person with this skill is Chris Olah, one of
the people behind Distill.

~~~
olooney
Not knocking this author at all... It's just that nowadays, if I see a title
in the vein of "7 Tips to Train Deep Neural Nets for Complete Beginners On
Rails With Keras and TensorFlow", I click on it more out of a sense of
obligation than anything, and I don't go in with very high expectations. So I
was pleasantly surprised to find this article substantive and high quality.

------
cs702
As a practitioner, I found myself nodding in agreement again and again and
again.

This blog post is full of the kind of real-world knowledge and how-to details
that are not taught in books and often take endless hours to learn the hard
way.

If you're interested in deep learning, do yourself a favor and go read this.

It is worth its weight in gold.

~~~
mark_l_watson
I agree. I have been using machine learning since the 1980s and deep learning
for the last 4 years. This is great advice that I have both bookmarked and
made into a PDF to store away in my searchable collection of research
material.

Karpathy is amazing. I have gotten so much 'mileage' on two projects out of
his "Unreasonable Effectiveness of RNNs" article.

------
IOT_Apprentice
Andrej is the guy doing Tesla's neural network work for their FSD hardware. I
truly appreciated his talk during the autonomy reveal the other day.

~~~
spectramax
I have the opposite opinion. I dislike marketing wankery and investor
bullshit, however entertaining it may be. Instead, I tremendously enjoyed
Karpathy’s Stanford class and lecture videos.

~~~
Fricken
I thought Karpathy's contribution to the presentation was an excellent
technical summary of what Tesla is trying to do. It clarified a lot for me
about what they're doing with Autopilot. Up until a few days ago my attempts
at scrutinizing Autopilot had been limited to little snippets of information
here and there, rumours, and guesswork.

Bullshitting and wankery don't come naturally to Karpathy, so the few spots
where he was under pressure to do as much stood out like a sore thumb.

------
ArtWomb
>>> There is a large number of fancy bayesian hyper-parameter optimization
toolboxes around and a few of my friends have also reported success with them,
but my personal experience is that the state of the art approach to exploring
a nice and wide space of models and hyperparameters is to use an intern :).
Just kidding.

LOL. Human-assisted training at scale is perfectly allowable for mission
critical success. Especially if you enjoy an unlimited research budget!

You can follow these instructions to the letter, and the same problems around
generalization will still arise. It's foundational.

For 30fps camera images, handling new data in real time works fine for 99% of
scenarios. But achieving usable convergence rates on petascale data problems,
such as NVIDIA's recent work on deep learning for fusion reaction container
design, requires a breakthrough, not just in software but in computation
architectures as well.

Deep Reinforcement Learning and the Deadly Triad

[https://arxiv.org/pdf/1812.02648.pdf](https://arxiv.org/pdf/1812.02648.pdf)

Identifying and Understanding Deep Learning Phenomena

[http://deep-phenomena.org/](http://deep-phenomena.org/)

------
kriro
"""If you have an imbalanced dataset of a ratio 1:10 of positives:negatives,
set the bias on your logits such that your network predicts probability of 0.1
at initialization."""

Can someone translate this to PyTorch for me? Or give a simple example of how
one would go about doing this?

It means that if I have a 1:10 ratio in the data, an untrained net should
predict positive in 10% of the cases, right?
------
OceanKing
I am bookmarking this article, this is pure gold.

Also, it seems to me that most of what he says can be distilled into a
boilerplate/template structure for any given deep learning framework, from
which new projects can be forked - does this already exist?

~~~
hnarayanan
Yes, it’s called [http://fast.ai](http://fast.ai)

~~~
gojima2
lol, when he wrote

>> model = SuperCrossValidator(SuperDuper.fit, your_data, ResNet50,
SGDOptimizer)

under "Neural net training is a leaky abstraction" my first thought was, this
IS fastai's API

------
mitchellgoffpc
For anyone learning to build and train neural nets, this is a fantastic cheat
sheet; Andrej is top-notch at explaining these kinds of things. The other
posts on this blog are definitely worth a read as well!

~~~
eanzenberg
A lot of this process goes beyond NN into generic ML. Especially understanding
and diving into the data.

~~~
mitchellgoffpc
Yep agreed!

------
mollerhoj
"though NLP seems to be doing pretty well with BERT and friends these days,
quite likely owing to the more deliberate nature of text, and a higher signal
to noise ratio"

What is he talking about here? BERT, GPT, etc. are not unsupervised; they are
pretrained on a task with naturally occurring supervision (language
modelling).

------
indweller
In the blog, he refers to test losses at an early stage, like in "add
significant digits to your eval". Does he actually mean the test data, or is
he referring to validation data? I was under the impression that we were
supposed to touch the test data only once, at the end of all training and
validation. What is the right way to handle the test data?

~~~
snrji
By "eval" he may also mean the training subset. As I understood it, it's the
code that evaluates the network at a given point on a given dataset. For
instance, after each epoch the model is evaluated on both the training and
validation sets (so you see both losses).

As you said, the test subset should only be used at the very end.
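A skeleton of what I mean (every name here is a placeholder, not from the
post):

```python
def run_training(model, train_data, val_data, n_epochs,
                 fit_one_epoch, evaluate):
    """After each epoch, run the same eval code on BOTH the training
    and validation splits, so both losses are visible side by side.
    The test set never appears in this loop at all.
    """
    history = []
    for epoch in range(n_epochs):
        fit_one_epoch(model, train_data)
        history.append({
            "epoch": epoch,
            "train_loss": evaluate(model, train_data),
            "val_loss": evaluate(model, val_data),
        })
    return history
```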

------
mendeza
This recipe results in a large amount of time spent before any results appear
(depending on the task you are trying to solve). Classification is an easy
task to apply this recipe to, but when you venture into object detection or
pose estimation, the data collection, labeling, and setup of training and
evaluation infrastructure are much more complex.

~~~
liuliu
Can you expand a little bit? I often find that if I skip one or more of the
steps mentioned here, the later debugging is tremendously harder (and often
involves going back to these steps anyway). Some of this advice, like
visualization, is well supported in many frameworks, usually through
TensorBoard. The rest are really just good common-sense
try-first-or-you-will-regret-later steps that don't require a significant
time investment.

------
cdelsolar
What a fantastic post, thank you for this.

