AdamW and Super-convergence is now the fastest way to train neural nets (fast.ai)
329 points by tim_sw 10 months ago | 37 comments

A 5x to 10x reduction in training time versus other approaches is impressive, regardless of how it's achieved. Hours of waiting become minutes; days become hours; weeks become days.

Among the examples mentioned:

* They fine-tuned a Resnet50 to 90% accuracy on the Stanford Cars dataset in 60 epochs vs 600 in previous reports.

* They trained an AWD LSTM RNN from scratch to state-of-the-art perplexity on Wikitext-2 in 90 epochs vs 750 epochs in previous reports.

* They trained a QRNN from scratch to state-of-the-art perplexity on Wikitext-2 in 90 epochs vs 500 epochs in previous reports.

There are more examples of improvements in training speed (and accuracy) in the blog post, which provides persuasive evidence that a training regime combining (a) "AdamW" (which fixes a well-known issue with weight-decay in Adam) and (b) "superconvergence" (i.e., raising then decreasing LR, and doing the inverse with momentum) now appears to be the best/fastest way to train deep neural nets.

Interestingly, no one yet really knows why or how AdamW and SuperC together can achieve a 10x improvement in training efficiency.
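For intuition, the schedule in (b) can be sketched in a few lines of Python. This is only an illustration of the shape — a linear ramp up and back down, with momentum mirroring the LR — using made-up hyperparameter values, not fastai's actual implementation:

```python
def one_cycle(step, total_steps, lr_min=1e-5, lr_max=1e-3,
              mom_min=0.85, mom_max=0.95):
    """Return (lr, momentum) for a 1cycle-style schedule.

    The LR ramps linearly from lr_min up to lr_max over the first half
    of training, then back down; momentum does the inverse.
    """
    half = total_steps / 2
    if step <= half:
        frac = step / half                    # 0 -> 1 during warmup
    else:
        frac = (total_steps - step) / half    # 1 -> 0 during cooldown
    lr = lr_min + (lr_max - lr_min) * frac
    mom = mom_max - (mom_max - mom_min) * frac  # momentum mirrors the LR
    return lr, mom
```

IIRC the real policy also adds a final phase that anneals the LR well below the starting value; see the fast.ai post for the details.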

Jeremy from fast.ai here. Thanks for the thoughtful summary. A few more clarifications which hopefully are of some interest:

- Sylvain tried AdamW + 1cycle on many different datasets and with different architectures. The 5-10x improvement over regular SGD and LR annealing was not a rare occurrence, but was the most common result

- Accuracy was always at least about as good as the regular approach, and usually better

- So our main result here is to strongly suggest that AdamW + 1cycle should be the default for most neural net training

- The goal of this research was to improve the fastai library, not to write a paper. But since the results were so practically useful we figured we'd take the time to document them in a blog post so others can benefit too

- fastai is a self-funded (i.e. out of my own pocket) non-profit research lab.

Thank YOU.

I posted this comment here only because the prior top comment was unfairly negative, in my view. Sometimes I'm afflicted by this: https://xkcd.com/386/

Had you posted on the main thread I would have upvoted your comment over mine.

Have you considered accepting donations from third parties?

How can we send you $/BTC donations for funding support? Couldn't find it anywhere on fast.ai.

Question about terminology: is superconvergence that particular schedule for LR/momentum updates, or is it a desirable emergent property that can be achieved by that LR/momentum schedule?

I've been assuming it's the latter, but your usage sounds very much like the former. Is "1cycle" the schedule itself, or some other training policy?

It's a desirable emergent property. AFAIK, no one knows of a way to achieve it other than this kind of LR/momentum policy, so informally I think it's OK to use the term "superconvergence" (or "SuperC" for short) for the policy itself. "1cycle" refers to having a single cycle over all epochs, as opposed to multiple cycles, as in SGDR for example.

PS. Here's a good walkthrough of the training policy by a coauthor of the blogpost, with link to a sample notebook: https://sgugger.github.io/the-1cycle-policy.html

Does this work on any dataset other than Cifar-10 with ResNets? I ask because I worked on reproducing this paper for the ICLR 2018 Reproducibility Challenge, and the paper noted that this was the only setup in which super-convergence would be observed.

The paper's authors didn't succeed with Adam (which this article seems to have overcome) so I'm curious if they attempted this training method on any other datasets?

From the last paragraph of: https://openreview.net/forum?id=H1A5ztj3b

> Our experiments with Densenets and all our experiments with the Imagenet dataset for a wide variety of architectures failed to produce the super-convergence phenomenon. Here we will list some of the experiments we tried that failed to produce the super-convergence behavior. Super-convergence did not occur when training with the Imagenet dataset; that is, we ran Imagenet experiments with Resnets, ResNeXt, GoogleNet/Inception, VGG, AlexNet, and Densenet without success. Other architectures we tried with Cifar-10 that did not show super-convergence capability included ResNeXt, Densenet, and a bottleneck version of Resnet.

EDIT: I see now that they mention a few other datasets: Cars Stanford Dataset and Wikitext-2.

Thanks for your work on the reproducibility challenge - we read your results and found them interesting. The good news is that we had a lot more success, although it took many months of work to really make it sing, and bringing in AdamW was an important part of the success. Note that the super-convergence paper has been greatly improved through the recent 1cycle work, which includes a lot of important points to make it work in practice.

We tried the most divergent datasets and architectures we could think of, including even AWD-LSTM and QRNN. We got great results for pretty much everything we tried.

(We didn't try ResNeXt, since it's so slow in practice, or VGG or AlexNet, since they're largely obsolete. We did look at inception-resnet-v2; I'll have to go back and check the results, but IIRC it worked quite well.)

I'm not sure how to interpret the argument of the article or the results in the appendix here.

The first table shows AdamW having the best results, which follows the argument of the article. However, the following three tables all have plain Adam producing the best results.

The way the article is written, it seems to be championing AdamW, but the results just seem to conclude that AMSGrad is bad and Adam is the best, with AdamW showing a negligible performance increase over Adam in a single task.

From the article:

> So, weight decay is always better than L2 regularization with Adam then? We haven’t found a situation where it’s significantly worse, but for either a transfer-learning problem (e.g. fine-tuning Resnet50 on Stanford cars) or RNNs, it didn’t give better results.
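The distinction the passage above is drawing can be seen in a toy single-step sketch (my own illustration in numpy, not the fastai or paper code): with plain Adam, an L2 penalty adds wd * w to the gradient before the adaptive rescaling, so the effective decay gets divided by sqrt(v_hat); AdamW instead subtracts the decay from the weights directly, after the Adam step.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, wd=0.0, decoupled=False):
    """One Adam update, with weight decay applied either as an L2 penalty
    (decoupled=False) or decoupled a la AdamW (decoupled=True)."""
    if not decoupled:
        grad = grad + wd * w          # classic L2: decay passes through the adaptive scaling
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)      # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)      # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    if decoupled:
        w = w - lr * wd * w           # AdamW: decay applied directly to the weights
    return w, m, v
```

With a tiny gradient, the L2 version shrinks the weight far more than the decoupled version here, because the adaptive rescaling blows up the decay term — one concrete way the two regularizers diverge in practice.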

From the article:

> 200% speed up in training! “Overall, we found Adam to be robust and well-suited to a wide range of non-convex optimization problems in the field of machine learning,” concluded the paper. Ah yes, those were the days, over three years ago now, a lifetime in deep-learning-years. But it started to become clear that all was not as we hoped. Few research articles used it to train their models, new studies began to clearly discourage applying it, and several experiments showed that plain ole SGD with momentum was performing better.

This is not true for all domains. In machine translation Adam is used in all top results papers I can think of.

Nice to see great advancements in training speed for deep neural networks. We are also very interested in improving training speed. At alpes.ai we have developed a non-recursive neural network algorithm with very fast training times. We got very good results on standard open datasets: training times in the range of seconds, with accuracy on par with standard results, on ordinary laptops without any special GPUs or hardware.


These are the results on some of the datasets:

  Dataset                             Training time    Accuracy
  Extended Yale DB                    40 sec           94%
  Human Activity Detection Dataset    3 sec            86%
  MNIST                               90 sec           97%
  Google Speech Dataset               60 sec           92%
  Liver Dataset                       2 sec            89%

> We believe, the SNN is the correct representation of the how the mind works. In fact, we have mathematical proof that our representation is the most efficient representation.

No details on the website, though. This is either very poor marketing, or (much more likely) snake oil.

The page you linked is very light on details. Could you point to a paper?

Hey, the paper is on its way; we will probably be publishing it in a couple of weeks from now.

Make sure to post it to HN so we can criticize your work into the ground ;)

This reads a bit too much like a plug for fastai’s framework, which, frankly, is not very good, and training-performance results are not much of a reason to care about it (for instance, adding AdamW to Keras is basically trivial).

In most use cases, agonizing over these optimizer choices, and how they interact with initialization schemes and metaparameters, is overkill. If you’re training a pretty standard variant of e.g. ResNet50, for some fairly common classification or localization task, splitting hairs between these things just doesn’t matter.

I actually really think these optimizer hacks are not a productive line of research for deep learning. Too many projects try to do just the slightest amount of incremental tweaks to an optimization scheme to eke out something they can call “state of the art” and milk it for conference presentations and the release of some sexy new deep learning package. But these changes are often entirely immaterial except for the very largest network training schemes, which are usually already doing their own highly customized and distributed optimization schemes anyway.

Especially when it’s tied to a plug for a framework, as this is for fastai, it’s just too focused on hype. It creates this peacock-feather effect where everyone has to spread themselves thin just to keep up with all these middling, incremental tweaks, which very likely inhibits research that might actually produce difference-of-kind results.

I have an orthogonal perspective: I really think current deep learning works only because those non-linear optimization tricks like Adam happen to fit the model structure. As a model-preparer, you restrict the search space with structural rules (CNN, LSTM, 1x1 conv, etc.), and then fairly simple non-linear optimization procedures suddenly work, because they don't have to wander unguided around a much more complex space. (That's arguably why these simple optimizers don't work well with large fully-connected networks, which in theory should be better than any specialization.)

Combining structural rules (reducing dimensionality, providing guidance in the form of restricted connectivity) with optimization procedures that can take advantage of them is, IMO, what gives us today's great DL performance. If we can squeeze even more from the optimizers, that could help, and I am glad the fast.ai people are doing it. I view it in a dual (meta-optimization) way: both the structural and optimization "match" should be researched.

What you call “structural rules” is just what machine learning calls regularization. Dropout, early stopping, batch norm, advanced initialization, etc. are all as useful for regularizing models as explicit penalty terms or implicit restrictions to subspaces of the possible weight space. Whichever of these engineering tricks works best for your use case either (a) doesn’t matter much because they all work about the same, or (b) is a matter of specialized experimentation and case-by-case analysis anyway.

Also, the theory behind this type of model fitting is not new with modern deep learning. Older methods like RANSAC did similar things even when the model itself was far simpler and there was no network-topology effect. Arguably, advanced MCMC methods like NUTS, or approximation methods like ADVI, are even more formalized versions of the deep learning hacks: the network structure and regularization constraints define a prior over model parameter space, and once you combine that with the data to get a likelihood, what you really want is something that draws from the posterior over model parameter space.

The reason it’s a “hack” in deep learning is that nobody is trying to formally define the posterior based on the model and the complicated metaparameters; rather, they are just adding tweaks on top of tweaks, using deeply inefficient momentum-SGD methods to sample from the posterior and optimize a la simulated annealing, instead of making the models truly amenable to something like NUTS. Which IMO is yet another reason to view incremental optimizer tweaks as a negative thing, slowing down and distracting the community.

Yes, deep learning today is a trade. If we want to be more charitable, we can call it an experimental science -- some work truly deserves the moniker.

The state of deep learning today is analogous to the state of bridge-engineering before the advent of physics: https://www.technologyreview.com/s/608911/is-ai-riding-a-one... -- everyone in the trade is aware of this, I think.

But that doesn't mean we should stop building better deep-learning models, finding more efficient ways to train them, and improving state-of-the-art performance in increasingly challenging tasks.

Perhaps we can reach the singularity with slowly-trained models, and from there get fast training methods for free ;)

There doesn't seem to be much interest among NN researchers in finding the right tool for the job: understanding the task at hand and engineering appropriate inputs and features, appropriate neural network types, and appropriate training and optimization techniques.

Finding the right job for the tool, to the point of repurposing and adapting "proven" network structures or even already trained networks, is an easier and more popular approach.

If your perspective were truly orthogonal, wouldn't that mean, by definition, that you have nothing to say at all on the topic of the long term intellectual value to humanity of fiddling with loss functions and making optimizer tweaks?

Relatedly, optimizer performance can behave very wildly on different types of models. Although many new DL projects use Adam as a baseline, in one of my projects simple RMSprop performed significantly better after a bit of testing.

The optimizer is just another hyperparameter to tune.

Besides like 10 papers a year, almost everything else is incremental tweaks that are largely immaterial.

I agree, but it seems like we went past a bend in the hype curve with how this plays out in deep learning vs. other fields or historically in machine learning.

A paper like this used to be a minor thing. Neat result, A for effort, great to see if the authors take the time to generate PRs adding the updated method into mainstream packages and libraries. But no more than that.

Now, with deep learning, every little result has to be trumpeted out with blog posts, claims about what is “state of the art” on some cherry picked example cases, and you’ll be treated like a dunce if you don’t get push notifications direct from arxiv and drop what you’re doing to read everything immediately.

It’s a form of credit inflation. Researchers want sweet, sweet deep learning hype credit for fame and fortune. Instead of just admitting that a lot of results are inapplicable to most practitioners, and that’s totally fine and common in published research ... it always has to be a hype arms race about who knows more trivia.

Very well said.

Edit: feel like I should clarify a bit to avoid the "me too +1" style comment. I've been torn in recent months between fast.ai's great hands-on and to-the-point style of teaching (which I think is great and have strived to use as an example in my own courses) and their somewhat overhyped presentation of the fast.ai library. It's not bad, but ultimately it's a collection of rather hacky function calls built on top of other frameworks.

This being said, from an academician's point of view (a group to which I can still consider myself to belong, albeit perhaps not for much longer), I do appreciate most of the things Jeremy and his team have brought into the spotlight, e.g. insights around learning rates and embeddings for sparse categoricals. What amuses me most is that these things are by no means very new, but I enjoy the spirit of fast.ai pointing out that they are not new and should be considered (again) today, in the midst of a research community that is moving way too fast.

Then again, I also think this post is a bit too hype-centric and hence falls into the same traps. I am also somewhat disappointed that fast.ai's claims to state of the art lie mainly in well-trodden image and text domains, while not talking much, or at all, about e.g. RL in courses or papers. I think the work being done by OpenAI, for instance, during the past few months is far more exciting.

> It’s a form of credit inflation. Researchers want sweet, sweet deep learning hype credit for fame & fortune.

The authors of the AdamW or super-convergence papers did not write the blog post - we did. There is no connection between the authors of the papers mentioned and the authors of the blog post.

It's simply a genuine and honest attempt by us to study these papers and share what we learned, in order to help others get similar results.

I was referring only to the fastai post plugging the fastai toolkit. The research papers themselves seem fine enough, as they aren’t making a bombastic claim like saying AdamW + 1cycle should be considered the general default starting point for training any NN.

Not my field, but from a humble programmer's perspective, a 5x to 10x speedup sure sounds like a significant result. It certainly would be in most areas where we already have efficient algorithms.

Why is this one "inapplicable to most practitioners"?

Because the published result is almost surely not repeatable in each nuanced case for day-to-day practitioners who have custom layers, extra regularization, heterogeneous hardware for training, etc.

When you’ve seen enough of these same papers come and go, your eyes glaze over. It’s like p-hacking, but for finding tasks that show X% improvement. Basically it needs a huge Your Mileage May Vary sticker attached, because for practitioner’s specific situations, you still have to do the experiments to tune these parameters.

Use AdamW or rmsprop or adagrad or ... and use dropout (what fraction?) and use penalty terms, and use early stopping, etc. etc., ... ?

You can’t rely on papers like this for much insight beyond just saying AdamW is yet another tool in the toolbox... and that’s all.

It would be comparable to someone writing a paper about some slight variation of Timsort that is ~10% faster on some empirical distribution of data, like data already 80% sorted. Is it good for you? You have to do legwork to know. Such a result is perfectly good research, but would be silly for a bunch of blog posts about “state of the art” sorting, and plugs for some software package that has it.

> the published result is almost surely not repeatable in each nuanced case for day to day practitioners that have custom layers, extra regularization, heterogeneous hardware for training

As shown in the post and linked code, it works very well with custom layers and extra regularization. Hardware does not impact the results at all. The results hold across CNNs, RNNs, and even QRNNs, across a wide variety of datasets.

We literally tried to find a real-world NLP or vision dataset it didn't work with, and failed to do so.

This is working against the fundamental "no free lunch" theorem of machine learning (i.e., there is no one-size-fits-all solution in ML).

There is no explanation of why it worked, and given no-free-lunch, it should be easy to generate a synthetic counterexample where it does not work.

I totally agree with you. Here is one more example https://openreview.net/forum?id=SyrGJYlRZ

I wonder how this compares to neuroevolution and other GA-based methods: https://eng.uber.com/deep-neuroevolution/

Wonder if anything similar can be applied to gradient boosting, to speed up learning forests by tweaking the LR?
