
AdamW and Super-convergence is now the fastest way to train neural nets - tim_sw
http://www.fast.ai/2018/07/02/adam-weight-decay/
======
cs702
A 5x to 10x reduction in training time versus other approaches is _impressive_
, regardless of how it's achieved. Hours of waiting become minutes; days
become hours; weeks become days.

Among the examples mentioned:

* They fine-tuned a Resnet50 to 90% accuracy on the Stanford Cars dataset in 60 epochs vs 600 in previous reports.

* They trained an AWD-LSTM from scratch to state-of-the-art perplexity on Wikitext-2 in 90 epochs vs 750 epochs in previous reports.

* They trained a QRNN from scratch to state-of-the-art perplexity on Wikitext-2 in 90 epochs vs 500 epochs in previous reports.

There are more examples of improvements in training speed (and accuracy) in
the blog post, which provides persuasive evidence that a training regime
combining (a) "AdamW" (which fixes a well-known issue with weight-decay in
Adam) and (b) "superconvergence" (i.e., raising then decreasing LR, and doing
the inverse with momentum) now appears to be the best/fastest way to train
deep neural nets.

Interestingly, no one yet really knows _why_ or _how_ AdamW and SuperC
together can achieve a 10x improvement in training efficiency.
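
For anyone who wants to experiment with the combination, here is a minimal sketch (my own illustration, not fastai's code, assuming PyTorch's torch.optim.AdamW and OneCycleLR; the model and data are toy placeholders):

    import torch

    model = torch.nn.Linear(10, 2)  # toy stand-in for a real network
    data = [(torch.randn(32, 10), torch.randint(0, 2, (32,)))] * 100

    opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
    # 1cycle: LR ramps up then anneals down, while momentum (beta1 for Adam)
    # moves in the opposite direction between base_momentum and max_momentum.
    sched = torch.optim.lr_scheduler.OneCycleLR(
        opt, max_lr=1e-2, total_steps=len(data),
        cycle_momentum=True, base_momentum=0.85, max_momentum=0.95)

    loss_fn = torch.nn.CrossEntropyLoss()
    for x, y in data:
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
        sched.step()  # advance the schedule once per batch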

~~~
jph00
Jeremy from fast.ai here. Thanks for the thoughtful summary. A few more
clarifications which hopefully are of some interest:

- Sylvain tried AdamW + 1cycle on many different datasets and with different architectures. The 5-10x improvement over regular SGD and LR annealing was not a rare occurrence, but the most common result.

- Accuracy was always at least about as good as the regular approach, and usually better.

- So our main result here is to strongly suggest that AdamW + 1cycle should be the default for most neural net training.

- The goal of this research was to improve the fastai library, not to write a paper. But since the results were so practically useful, we figured we'd take the time to document them in a blog post so others can benefit too.

- fastai is a self-funded (i.e. out of my own pocket) non-profit research lab.

~~~
cs702
Thank YOU.

I posted this comment here only because the prior top comment was unfairly
negative, in my view. Sometimes I'm afflicted by this:
[https://xkcd.com/386/](https://xkcd.com/386/)

Had you posted on the main thread I would have upvoted your comment over mine.

Have you considered accepting donations from third parties?

------
Permit
Does this work on any dataset other than CIFAR-10 with ResNets? I ask because
I worked on reproducing this paper for the ICLR 2018 Reproducibility Challenge,
and the paper noted that this was the only setup in which super-convergence
would be observed.

The paper's authors didn't succeed with Adam (which this article seems to have
overcome), so I'm curious whether they attempted this training method on any
other datasets.

From the last paragraph of:
[https://openreview.net/forum?id=H1A5ztj3b](https://openreview.net/forum?id=H1A5ztj3b)

> Our experiments with Densenets and all our experiments with the Imagenet
> dataset for a wide variety of architectures failed to produce the super-
> convergence phenomenon. Here we will list some of the experiments we tried
> that failed to produce the super-convergence behavior. Super-convergence did
> not occur when training with the Imagenet dataset; that is, we ran Imagenet
> experiments with Resnets, ResNeXt, GoogleNet/Inception, VGG, AlexNet, and
> Densenet without success. Other architectures we tried with Cifar-10 that
> did not show super-convergence capability included ResNeXt, Densenet, and a
> bottleneck version of Resnet.

EDIT: I see now that they mention a few other datasets: the Stanford Cars
dataset and Wikitext-2.

~~~
jph00
Thanks for your work on the reproducibility challenge - we read your results
and found them interesting. The good news is that we had a lot more success,
although it took many months of work to really make it sing, and bringing in
AdamW was an important part of it. Note that the super-convergence paper has
been greatly improved by the recent 1cycle work, which includes a lot of
important points for making it work in practice.

We tried the most divergent datasets and architectures we could think of,
including even AWD-LSTM and QRNN. We got great results for pretty much
everything we tried.

(We didn't try ResNeXt, since it's so slow in practice, or VGG or AlexNet,
since they're largely obsolete. We did look at inception-resnet-v2; I'll have
to go back and check the results, but IIRC it worked quite well.)

------
desku
I'm not sure how to interpret the argument of the article or the results in
the appendix here.

The first table shows AdamW having the best results, which follows the
argument of the article. However, the following three tables all have plain
Adam producing the best results.

The way the article is written, it seems to be championing AdamW, but the
results just seem to conclude that AMSGrad is bad and that plain Adam is best,
with AdamW showing a negligible performance increase over Adam in a single task.

~~~
yorwba
From the article:

 _So, weight decay is always better than L2 regularization with Adam then? We
haven’t found a situation where it’s significantly worse, but for either a
transfer-learning problem (e.g. fine-tuning Resnet50 on Stanford cars) or
RNNs, it didn’t give better results._
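
For concreteness, here is the mechanical difference in a hedged NumPy sketch (a single-tensor Adam update written out by hand for illustration; this is not any library's actual code):

    import numpy as np

    def adam_l2_step(w, g, m, v, t, lr=1e-3, wd=1e-2, b1=0.9, b2=0.999, eps=1e-8):
        # L2 regularization: the penalty gradient wd*w is folded into the
        # gradient before Adam's moment estimates, so it gets rescaled too.
        g = g + wd * w
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        mhat, vhat = m / (1 - b1 ** t), v / (1 - b2 ** t)
        return w - lr * mhat / (np.sqrt(vhat) + eps), m, v

    def adamw_step(w, g, m, v, t, lr=1e-3, wd=1e-2, b1=0.9, b2=0.999, eps=1e-8):
        # AdamW: the moments see only the raw gradient; the decay term
        # lr*wd*w is applied directly ("decoupled") at the update step.
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        mhat, vhat = m / (1 - b1 ** t), v / (1 - b2 ** t)
        return w - lr * (mhat / (np.sqrt(vhat) + eps) + wd * w), m, v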

------
rerx
From the article:

> 200% speed up in training! “Overall, we found Adam to be robust and well-
> suited to a wide range of non-convex optimization problems in the field
> machine learning” concluded the paper. Ah yes, those were the days, over
> three years ago now, a life-time in deep-learning-years. But it started to
> become clear that all was not as we hoped. Few research articles used it to
> train their models, new studies began to clearly discourage to apply it and
> showed on several experiments that plain ole SGD with momentum was
> performing better.

This is not true for all domains. In machine translation, Adam is used in all
of the top-result papers I can think of.

------
sandeepeecs
Nice to see great advancements in training speed for deep neural network
algorithms. We are also very interested in improving the speed of training. We
at alpes.ai have developed a non-recursive neural network algorithm with a
very fast training time. We got very good results on standard open datasets,
with training times in the range of seconds and accuracy on par with standard
results, on normal laptops without any special GPUs or hardware.

[http://alpes.ai/](http://alpes.ai/)

These are the results on some of the datasets:

    Dataset                            Training time   Accuracy
    Extended Yale DB                   40 sec          94.00%
    Human Activity Detection Dataset   3 sec           86%
    MNIST                              90 sec          97%
    Google Speech Dataset              60 sec          92%
    Liver Dataset                      2 sec           89%

~~~
osaariki
The page you linked is very light on details. Could you point to a paper?

~~~
sandeepeecs
hey paper is on its way probably we will be publishing it in couple of weeks
from now.

~~~
yorwba
Make sure to post it to HN so we can criticize your work into the ground ;)

------
mlthoughts2018
This reads a bit too much like a plug for fastai’s framework, which, frankly,
is not very good, and training-performance results are not much of a reason to
care about it (for instance, adding AdamW to Keras is basically trivial).
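
(To illustrate the “trivial” claim: with the AdamW that now ships in Keras it is a one-liner; a hedged sketch with a toy model, and in 2018 you’d have subclassed an optimizer instead:)

    import tensorflow as tf

    model = tf.keras.Sequential([tf.keras.layers.Dense(2)])  # toy model
    model.compile(
        optimizer=tf.keras.optimizers.AdamW(learning_rate=1e-3,
                                            weight_decay=1e-2),
        loss="mse")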

In most use cases, worrying about the difference between these optimizer
choices, and how they interact with initialization schemes and metaparameters,
is overkill. If you’re training a pretty standard variant of e.g. ResNet50, for
some fairly common classification or localization task, splitting hairs
between these things just doesn’t matter.

I actually really think these optimizer hacks are not a productive line of
research for deep learning. Too many projects make just the slightest
incremental tweaks to an optimization scheme to eke out something they can
call “state of the art” and milk it for conference presentations and the
release of some sexy new deep learning package. But these changes are often
entirely immaterial except for the very largest training runs, which are
usually already doing their own highly customized, distributed optimization
anyway.

Especially when it’s tied to a plug for a framework, as this is for fastai,
it’s just too focused on hype and creates this peacock-feather effect where
everyone has to spread themselves thin just to keep up with all these
middling, incremental tweaks, which very likely inhibits research that might
actually produce _difference-of-kind_ results.

~~~
bitL
I have an orthogonal perspective: I really think that current deep learning
works only because those non-linear optimization tricks like Adam somehow fit
the model structure. You, as the model-preparer, restrict the search space with
some structural rules (CNN, LSTM, 1x1 conv, etc.), and unleashing fairly simple
non-linear optimization procedures suddenly works because they don't have to
wander unguided around a much more complex space (which may be why these simple
optimizers don't work well with large fully-connected networks, even though in
theory those should be better than any specialization).

Combining structural rules (reducing dimensionality, providing some guidance
in the form of restricted connectivity) with optimization procedures that can
take advantage of them is, IMO, what gives us the current great DL
performance; if we can squeeze even more out of optimizers, that could help,
and I am glad the fast.ai people are doing it. I simply view it in a dual
(meta-optimization) way: both the structural and the optimization sides of the
"match" should be researched.

~~~
mlthoughts2018
What you call “structural rules” is just what machine learning calls
regularization. Dropout, early stopping, batch norm, advanced initialization,
etc. are all equally useful ways of regularizing models, as are explicit
penalty terms or implicit restrictions to subspaces of the possible weight
space. Whichever of these _engineering_ tricks works best for your use case
either (a) doesn’t matter much because they all work about the same, or (b) is
a matter of specialized experimentation and case-by-case analysis anyway.

Also, the theory behind this type of approach to model fitting is not new with
modern deep learning. Older methods like RANSAC did similar things even when
the model itself was far simpler and there was no network-topology effect.
Arguably, advanced MCMC methods like NUTS, or approximation methods like ADVI,
are even more formalized versions of the deep learning hacks: the network
structure and regularization constraints define a prior over model-parameter
space, and when combined with the data’s likelihood, what you really want is
something that draws from the posterior over that space.

The reason it’s a “hack” in deep learning is that nobody is trying to formally
define the posterior based on the model and the complicated metaparameters;
rather, they are just adding tweaks on top of tweaks so that deeply
inefficient momentum-SGD methods can sample from the posterior and optimize a
la simulated annealing, instead of making the models truly amenable to
something like NUTS. Which IMO is yet another reason to view incremental
optimizer tweaks as a negative thing, slowing down and distracting the
community.
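
To make “amenable to something like NUTS” concrete, here is a toy hedged sketch (assuming PyMC; a tiny Bayesian logistic model stands in for a network, with the prior playing the role of the structural/regularization constraints):

    import numpy as np
    import pymc as pm

    X = np.random.randn(200, 2)
    y = (X @ np.array([1.0, -2.0]) > 0).astype(int)

    with pm.Model():
        w = pm.Normal("w", mu=0, sigma=1, shape=2)  # prior = explicit regularizer
        p = pm.math.sigmoid(pm.math.dot(X, w))
        pm.Bernoulli("obs", p=p, observed=y)
        trace = pm.sample(500)  # NUTS draws from the posterior by default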

~~~
cs702
Yes, deep learning today is a _trade_. If we want to be more charitable, we
can call it an _experimental science_ -- some work truly deserves the moniker.

The state of deep learning today is analogous to the state of bridge
engineering before the advent of physics:
[https://www.technologyreview.com/s/608911/is-ai-riding-a-one-trick-pony/](https://www.technologyreview.com/s/608911/is-ai-riding-a-one-trick-pony/)
-- everyone in the trade is aware of this, I think.

But that doesn't mean we should stop building better deep-learning models,
finding more efficient ways to train them, and improving state-of-the-art
performance in increasingly challenging tasks.

~~~
amelius
Perhaps we can reach the singularity with slowly-trained models, and from
there get fast training methods for free ;)

------
raverbashing
I wonder how this compares to Neuroevolution and other GA-based methods:
[https://eng.uber.com/deep-neuroevolution/](https://eng.uber.com/deep-neuroevolution/)

------
aldanor
Wonder if anything similar can be applied to gradient boosting to speed up
training forests by tweaking the LR?
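
One hedged way to try the idea, assuming LightGBM (its reset_parameter callback can vary the shrinkage per boosting round; the 1cycle-style up-then-down shape here is my own guess, not anything from the article):

    import numpy as np
    import lightgbm as lgb

    X, y = np.random.randn(500, 10), np.random.randn(500)
    n_rounds = 200

    def one_cycle_lr(i, lo=0.01, hi=0.3, n=n_rounds):
        # shrinkage rises for the first half of boosting, falls for the second
        half = n // 2
        return lo + (hi - lo) * (i / half if i < half else (n - i) / half)

    booster = lgb.train(
        {"objective": "regression", "verbosity": -1},
        lgb.Dataset(X, label=y),
        num_boost_round=n_rounds,
        callbacks=[lgb.reset_parameter(learning_rate=one_cycle_lr)],
    )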

