A Recipe for Training Neural Networks

GistNoesis · on April 25, 2019

I usually try a technique Andrej didn't mention here, which helps me a lot in the debugging and modelling phase: simulated data that encompasses a single key difficulty of the problem.

For example, along this line of thought in question answering, there is the bAbI dataset, which creates auxiliary problems so you know where the modelling problems are.

By pushing this idea to the extreme (for example, in NLP you can have tasks that consist of repeating a sequence of characters in reverse order, to demonstrate that the architecture is indeed capable of memorizing like a parrot), you can often create trivial problems that take minutes to run on a single machine and help uncover most bugs.
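A minimal sketch of such a toy task, assuming a character-level sequence reversal problem as described above. The function names, lengths, and alphabet are illustrative choices, not anything from the original post; the two baselines only bracket what a real model should reach.

```python
import random
import string


def make_reversal_example(rng, min_len=3, max_len=8, alphabet=string.ascii_lowercase):
    """Return (source, target) where target is source reversed."""
    length = rng.randint(min_len, max_len)
    source = "".join(rng.choice(alphabet) for _ in range(length))
    return source, source[::-1]


def make_reversal_dataset(n_examples, seed=0):
    """Build a reproducible list of (source, target) pairs."""
    rng = random.Random(seed)
    return [make_reversal_example(rng) for _ in range(n_examples)]


def exact_match_accuracy(predict, dataset):
    """Fraction of examples where predict(source) equals target exactly."""
    hits = sum(1 for source, target in dataset if predict(source) == target)
    return hits / len(dataset)


dataset = make_reversal_dataset(1000)

# Upper bound: a perfect "model" that really reverses the input.
oracle_acc = exact_match_accuracy(lambda s: s[::-1], dataset)
# Lower bound: a "model" that just copies its input (right only on palindromes).
copy_acc = exact_match_accuracy(lambda s: s, dataset)
```

A network that cannot get close to the oracle score on data this simple probably has a bug or a capacity problem that would be much harder to spot on the real task.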

You can also create hierarchies of such problems, so you know in which order you have to tackle them. And you can build sub-modules and reuse them.

Quite often the code you obtain this way is very explainable, and you know which situations will work and which probably won't. But such a network architecture is usually "verbose" and optimizes numerically a little less well on large-scale problems. The trick is then to simplify your network mathematically into something that is more linear and more general. You can reorder some operations, like summing along a different dimension first. Semantically this will be different, but it will converge better, because for a network to optimize well it needs to work well in both the forward and the backward direction, so that the gradient flows well.

Once you have a set of simple problems that encompass your general problem, a good solution architecture is usually a more general mixture of the model architectures of the simple problems.

dual_basis · on April 26, 2019

This is great, like TDD for ML!

shmageggy · on April 25, 2019

This might be the most "Deep Learning" thing I've ever read:

> One time I accidentally left a model training during the winter break and when I got back in January it was SOTA (“state of the art”).

akhilcacharya · on April 26, 2019

Hope they're not using a cloud instance - that sounds incredibly expensive!

6gvONxR4sf7o · on April 26, 2019

>The first step to training a neural net is to not touch any neural net code at all and instead begin by thoroughly inspecting your data. This step is critical.

This can't be overstated. I can't count the number of times I'm the first person to find a problem with the data. It's incredibly frustrating. Just look at your damned data to sanity check it and understand what's going on. Do the same with your model outputs. Don't just look at aggregates. Look at individual instances. Lots of them.

odnes · on April 26, 2019

I was guilty of this once and would add some more specific advice: if your dataset contains multiple possible labels for the same samples, do not just assume that the average of these labels will describe the best label. And don't assume that training with the unclean data will produce a net that magically learns to do the aggregation for you.
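A small sketch of why averaging can mislead, assuming categorical labels from several annotators. The sample ids and label values are made up for illustration, and majority voting is shown only as a contrast, not as the right aggregation rule.

```python
from collections import Counter
from statistics import mean

# Hypothetical annotations: sample id -> labels given by different annotators.
annotations = {
    "sample_a": [0, 0, 0],
    "sample_b": [0, 4, 4],
    "sample_c": [1, 3, 5],
}


def summarize(labels):
    """Compare the mean label with the majority label and measure agreement."""
    counts = Counter(labels)
    top_label, top_count = counts.most_common(1)[0]
    return {
        "mean": mean(labels),
        "majority": top_label,
        "agreement": top_count / len(labels),
    }


summaries = {sid: summarize(labels) for sid, labels in annotations.items()}

# Samples where annotators disagree are the ones worth looking at by hand.
needs_review = [sid for sid, s in summaries.items() if s["agreement"] < 1.0]
```

For `sample_b` the mean is a value no annotator actually chose, which is the kind of case that deserves a manual look before training.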

distant_hat · on April 26, 2019

The number of kids who dive right into modelling without sanity checking, visualizing, or otherwise exploring the data is shocking. The same goes for people who just look at the metrics and decide the model is better because the error is lower, ignoring that at times the output is meaningless, like negative probabilities.
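A minimal sketch of the kind of output check meant here, assuming the model is supposed to emit one probability distribution per row. The function name and the tolerance are illustrative assumptions.

```python
import math


def check_probabilities(rows, tol=1e-6):
    """Return (row_index, reason) for every row that is not a valid distribution."""
    problems = []
    for i, row in enumerate(rows):
        if any(math.isnan(p) for p in row):
            problems.append((i, "contains NaN"))
        elif any(p < 0.0 or p > 1.0 for p in row):
            problems.append((i, "value outside [0, 1]"))
        elif abs(sum(row) - 1.0) > tol:
            problems.append((i, "does not sum to 1"))
    return problems


# Hypothetical model outputs: the first row is fine, the other two are not.
outputs = [
    [0.7, 0.2, 0.1],
    [1.1, -0.1, 0.0],
    [0.5, 0.4, 0.4],
]
problems = check_probabilities(outputs)
```

Note that the second row still sums to 1, so an aggregate check alone would miss it; looking at individual values is what catches it.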

olooney · on April 25, 2019

Well, this was unexpectedly excellent.

I don't think "stick with supervised learning" is very good advice, though. Unsupervised techniques sometimes work well for NLP and have worked well in other domains, such as medical records[1]. In particular, any time you have access to much more unlabeled data than labeled data, it is something you should at least consider.

[1]: https://www.nature.com/articles/srep26094

m0zg · on April 26, 2019

Why "unexpectedly"? Karpathy has a weapons-grade knack for explaining complex subjects in plain terms. Case in point: http://karpathy.github.io/2016/05/31/rl/ explains RL in a way even a non-practitioner will have little trouble understanding. Another prominent person with this skill is Chris Olah, one of the people behind Distill.

olooney · on April 26, 2019

Not knocking on this author at all... It's just that nowadays if I see a title in the vein of "7 Tips to Train Deep Neural Nets for Complete Beginners On Rails With Keras and TensorFlow" I click on it more out of a sense of obligation than anything but I don't go in with very high expectations. So I was pleasantly surprised to find this article was substantive and high quality.

cs702 · on April 25, 2019

As a practitioner, I found myself nodding in agreement again and again and again.

This blog post is full of the kind of real-world knowledge and how-to details that are not taught in books and often take endless hours to learn the hard way.

If you're interested in deep learning, do yourself a favor and go read this.

It is worth its weight in gold.

mark_l_watson · on April 26, 2019

I agree. I have been using machine learning since the 1980s and deep learning for the last 4 years. This is great advice that I have both bookmarked and made into a PDF to store away in my searchable collection of research material.

Karpathy is amazing. I have had so much ‘mileage’ on two projects out of his unreasonable effectiveness of RNNs article.

IOT_Apprentice · on April 25, 2019

Andrej is the guy doing Tesla's neural network for their FSD hardware. I truly appreciated his talk during the autonomy reveal the other day.

spectramax · on April 25, 2019

I have the opposite opinion. I dislike marketing wankery and investor bullshit, however entertaining it may be. Instead, I tremendously enjoyed Karpathy’s Stanford class and lecture videos.

Fricken · on April 26, 2019

I thought Karpathy's contribution to the presentation was an excellent technical summary of what Tesla is trying to do. It clarified a lot for me about what they're doing with Autopilot. Up until a few days ago my attempts at scrutinizing Autopilot had been limited to little snippets of information here and there, rumours, and guesswork.

Bullshitting and wankery don't come naturally to Karpathy, so the few spots where he was under pressure to do as much stood out like a sore thumb.

coder543 · on April 26, 2019

Based on your description, I can confidently say you and I didn't watch the same Investor Day presentation by Andrej.

ArtWomb · on April 26, 2019

>>> There is a large number of fancy bayesian hyper-parameter optimization toolboxes around and a few of my friends have also reported success with them, but my personal experience is that the state of the art approach to exploring a nice and wide space of models and hyperparameters is to use an intern :). Just kidding.

LOL. Human assisted training at scale is perfectly allowable for mission critical success. Especially if you enjoy an unlimited research budget!

You can follow these instructions to the letter. And the same problems around generalization will arise. It's foundational.

For 30fps camera images, handling new data in real time works fine for 99% of scenarios. But seeking usable convergence rates on petascale-sized data problems, such as NVIDIA's recent work on Deep Learning for fusion reaction container design, requires a breakthrough. Not just in software. But in computation architectures as well.

Deep Reinforcement Learning and the Deadly Triad

https://arxiv.org/pdf/1812.02648.pdf

Identifying and Understanding Deep Learning Phenomena

http://deep-phenomena.org/

kriro · on April 26, 2019

"""If you have an imbalanced dataset of a ratio 1:10 of positives:negatives, set the bias on your logits such that your network predicts probability of 0.1 at initialization."""

Can someone translate this to PyTorch for me? Or give a simple example of how one would go about doing this?

It means that if I have a 1:10 ratio in the data, an untrained net should predict positive in 10% of the cases, right?
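A minimal sketch of the arithmetic behind the quoted advice, in plain Python rather than PyTorch so it runs without dependencies. It assumes a single sigmoid output unit and takes the prior 0.1 from the quote (a strict 1:10 ratio would give 1/11, about 0.09). The weights and features are made-up values; the weights are set to zero only to isolate the effect of the bias.

```python
import math


def logit(p):
    """Inverse of the sigmoid: the pre-activation value that yields probability p."""
    return math.log(p / (1.0 - p))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


prior = 0.1            # desired predicted probability at initialization
bias = logit(prior)    # about -2.197

# Hypothetical final layer with zero weights, so only the bias contributes.
weights = [0.0, 0.0, 0.0]
features = [1.5, -0.3, 2.0]
pre_activation = sum(w * x for w, x in zip(weights, features)) + bias
initial_prediction = sigmoid(pre_activation)
```

Read this way, the advice is about the predicted probability being roughly 0.1 for every input at initialization, rather than 10% of inputs being classified as positive. In PyTorch the same value could presumably be written into the final layer's bias with something like `torch.nn.init.constant_(layer.bias, bias)`, where `layer` is whatever your last linear layer is called.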

OceanKing · on April 25, 2019

I am bookmarking this article, this is pure gold.

Also, it seems to me that most of what he says can be distilled into a boilerplate/template structure for any given deep learning framework, from which new projects can be forked - does this already exist?

hnarayanan · on April 25, 2019

Yes, it’s called http://fast.ai

gojima2 · on April 25, 2019

lol, when he wrote

>> model = SuperCrossValidator(SuperDuper.fit, your_data, ResNet50, SGDOptimizer)

under "Neural net training is a leaky abstraction" my first thought was, this IS fastai's API

yorwba · on April 25, 2019

Fast.ai is nice, but not "a boilerplate/template structure for any given deep learning framework"

laughingman2 · on April 26, 2019

AllenNLP is one such project for NLP. It uses dependency injection to separate out the components, aiming for maximum reusability. https://github.com/allenai/allennlp

mitchellgoffpc · on April 25, 2019

For anyone learning to build and train neural nets, this is a fantastic cheat sheet; Andrej is top-notch at explaining these kinds of things. The other posts on this blog are definitely worth a read as well!

eanzenberg · on April 25, 2019

A lot of this process goes beyond NN into generic ML. Especially understanding and diving into the data.

mitchellgoffpc · on April 26, 2019

Yep agreed!

mollerhoj · on April 29, 2019

"though NLP seems to be doing pretty well with BERT and friends these days, quite likely owing to the more deliberate nature of text, and a higher signal to noise ratio"

What is he talking about here? BERT, GPT, etc. are not unsupervised; they are pretrained on a task that has naturally supervised data (language modelling).

indweller · on April 26, 2019

In the blog, he refers to test losses at an early stage, like in "add significant digits to your eval". Does he actually mean the test data, or is he referring to validation data? I was under the impression that we were supposed to touch the test data only once, at the end of all training and validation. What is the right way to handle the test data?

snrji · on April 26, 2019

By "eval" you can also mean the training subset. As I understood it, "eval" is the code that evaluates the network at a given point on a given dataset. For instance, after each epoch, the model is evaluated on both the training and validation subsets (you see both losses).

As you said, the test subset should only be used at the very end.
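A minimal sketch of that three-way split, assuming the data can be addressed by integer index. The fractions, seed, and function name are illustrative choices.

```python
import random


def split_indices(n, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle the indices 0..n-1 once and cut them into train/val/test parts."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = split_indices(1000)

# During development: report losses on train_idx and val_idx after each epoch.
# test_idx stays untouched until all training and model selection are finished.
```

Fixing the seed keeps the test subset identical across runs, so it cannot leak into training by accident when the script is re-run.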

mendeza · on April 25, 2019

This recipe results in large amounts of time spent before any results occur (depending on the task you are trying to solve). Classification is an easy task to apply this recipe to, but when you venture into object detection or pose estimation, data collection, labeling, and setting up training and evaluation infrastructure are much more complex.

liuliu · on April 25, 2019

Can you expand a little bit? I often find that if I skip one or more of the steps mentioned here, the later debugging is tremendously harder (and often involves going back to these steps again). Some of this advice, like visualization, is well supported in many frameworks, usually through TensorBoard. The rest is really just good common-sense try-first-or-you-will-regret-later steps that don't require a significant time investment.

cdelsolar · on April 25, 2019

What a fantastic post, thank you for this.