Among the examples mentioned:
* They fine-tuned a ResNet50 to 90% accuracy on the Stanford Cars dataset in 60 epochs, vs. 600 in previous reports.
* They trained an AWD LSTM from scratch to state-of-the-art perplexity on Wikitext-2 in 90 epochs, vs. 750 in previous reports.
* They trained a QRNN from scratch to state-of-the-art perplexity on Wikitext-2 in 90 epochs, vs. 500 in previous reports.
There are more examples of improved training speed (and accuracy) in the blog post. Together they provide persuasive evidence that a training regime combining (a) AdamW (which fixes a well-known problem with weight decay in Adam) and (b) super-convergence (i.e., raising then lowering the learning rate, while doing the inverse with momentum) now appears to be the best/fastest way to train deep neural nets.
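For concreteness, here's a minimal single-parameter sketch of the decoupled weight decay that AdamW introduces (following the Loshchilov & Hutter formulation; the function name and default values here are illustrative, not fastai's API):

```python
import math

def adamw_step(p, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=1e-2):
    """One AdamW update for a single scalar parameter p with gradient g.

    The key difference from "Adam + L2 regularization": the weight-decay
    term (wd * p) is NOT folded into the gradient, so it is not rescaled
    by the adaptive sqrt(v) denominator; it is applied directly to the
    weight -- hence "decoupled" weight decay.
    """
    m = beta1 * m + (1 - beta1) * g          # first-moment estimate
    v = beta2 * v + (1 - beta2) * g * g      # second-moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias corrections
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * p)
    return p, m, v
```

With plain Adam + L2, `wd * p` would be added to `g` before the moment updates, so weights with a large gradient history would effectively be decayed less; decoupling removes that interaction.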
Interestingly, no one yet really knows why or how AdamW and super-convergence together can achieve a 10x improvement in training efficiency.
- Sylvain tried AdamW + 1cycle on many different datasets and architectures. The 5-10x speedup over regular SGD with LR annealing was not a rare occurrence; it was the most common result.
- Accuracy was always at least about as good as with the regular approach, and usually better.
- So our main result here is to strongly suggest that AdamW + 1cycle should be the default for most neural net training.
- The goal of this research was to improve the fastai library, not to write a paper. But since the results were so practically useful, we figured we'd take the time to document them in a blog post so others can benefit too.
- fastai is a self-funded (i.e., out of my own pocket) non-profit research lab.
I posted this comment here only because the prior top comment was unfairly negative, in my view. Sometimes I'm afflicted by this: https://xkcd.com/386/
Had you posted on the main thread I would have upvoted your comment over mine.
Have you considered accepting donations from third parties?
Is "1cycle" the schedule itself, or some other training policy? I've been assuming it's the latter, but your usage sounds very much like the former.
PS. Here's a good walkthrough of the training policy by a coauthor of the blogpost, with link to a sample notebook: https://sgugger.github.io/the-1cycle-policy.html
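For anyone else unsure: it is a schedule, one triangular cycle spanning the whole run. A rough sketch of the linear variant follows, with illustrative defaults (Sylvain's post also adds a final annihilation phase where the LR drops well below its starting value, omitted here for brevity):

```python
def one_cycle(step, total_steps, lr_max=1e-2, div=10.0,
              mom_max=0.95, mom_min=0.85):
    """Learning rate and momentum at `step` under a linear 1cycle schedule.

    The LR climbs from lr_max/div to lr_max over the first half of
    training, then descends back; momentum mirrors it in the opposite
    direction (high -> low -> high).
    """
    half = total_steps / 2
    lr_min = lr_max / div
    if step <= half:
        frac = step / half                    # 0 -> 1 going up
    else:
        frac = (total_steps - step) / half    # 1 -> 0 coming down
    lr = lr_min + frac * (lr_max - lr_min)
    mom = mom_max - frac * (mom_max - mom_min)
    return lr, mom
```

The intuition from the super-convergence paper is that the high-LR middle of the cycle acts as a regularizer, while the low momentum there keeps the large steps stable.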
The paper's authors didn't succeed with Adam (a limitation this article seems to have overcome), so I'm curious whether they attempted this training method on any other datasets.
From the last paragraph of: https://openreview.net/forum?id=H1A5ztj3b
> Our experiments with Densenets and all our experiments with the Imagenet dataset for a wide variety
> of architectures failed to produce the super-convergence phenomenon. Here we will list some of the
> experiments we tried that failed to produce the super-convergence behavior. Super-convergence did not occur when training with the Imagenet dataset; that is, we ran Imagenet experiments with Resnets,
> ResNeXt, GoogleNet/Inception, VGG, AlexNet, and Densenet without success. Other architectures
> we tried with Cifar-10 that did not show super-convergence capability included ResNeXt, Densenet,
> and a bottleneck version of Resnet.
EDIT: I see now that they mention a few other datasets: the Stanford Cars dataset and Wikitext-2.
We tried the most divergent datasets and architectures we could think of, including even AWD-LSTM and QRNN. We got great results for pretty much everything we tried.
(We didn't try ResNeXt, since it's so slow in practice, or VGG or AlexNet, since they're largely obsolete. We did look at inception-resnet-v2; I'll have to go back to check to see the results, but IIRC it worked quite well.)
The first table shows AdamW having the best results, which follows the argument of the article. However, the following three tables all have plain Adam producing the best results.
The way the article is written, it seems to be championing AdamW, but the results seem to conclude only that AMSGrad is bad and Adam is best, with AdamW showing a negligible improvement over Adam in a single task.
> So, weight decay is always better than L2 regularization with Adam then?

We haven't found a situation where it's significantly worse. But for transfer-learning problems (e.g., fine-tuning ResNet50 on Stanford Cars) and for RNNs, it didn't give better results either.
> 200% speed up in training! "Overall, we found Adam to be robust and well-suited to a wide range of non-convex optimization problems in the field of machine learning," concluded the paper. Ah yes, those were the days, over three years ago now, a lifetime in deep-learning-years. But it started to become clear that all was not as we hoped. Few research articles used it to train their models, and new studies began to clearly discourage its use, showing in several experiments that plain ol' SGD with momentum performed better.
This is not true for all domains. In machine translation Adam is used in all top results papers I can think of.
These are the results on some of the datasets:

| Dataset | Training time | Accuracy |
|---|---|---|
| Extended Yale DB | 40 sec | 94% |
| Human Activity Detection Dataset | 3 sec | 86% |
| MNIST | 90 sec | 97% |
| Google Speech Dataset | 60 sec | 92% |
| Liver Dataset | 2 sec | 89% |
No details on the website, though. This is either very poor marketing, or (much more likely) snake oil.
In most use cases, agonizing over the difference between these optimizer choices, and how they interact with initialization schemes and metaparameters, is overkill. If you’re training a fairly standard variant of, e.g., ResNet50 for a common classification or localization task, splitting hairs between them just doesn’t matter.
I actually really think these optimizer hacks are not a productive line of research for deep learning. Too many projects try to do just the slightest amount of incremental tweaks to an optimization scheme to eke out something they can call “state of the art” and milk it for conference presentations and the release of some sexy new deep learning package. But these changes are often entirely immaterial except for the very largest network training schemes, which are usually already doing their own highly customized and distributed optimization schemes anyway.
Especially when it’s tied to a plug for a framework, as this is for fastai, it’s just too focused on hype. It creates a peacock-feather effect where everyone has to spread themselves thin just to keep up with all these middling, incremental tweaks, which very likely inhibits research that might actually produce difference-in-kind results.
Combining structural rules (reducing dimensionality, providing some guidance in the form of restricted connectivity) with optimization procedures that can take advantage of them is, IMO, what gives us today's great DL performance. If we can squeeze even more out of optimizers, that helps, and I am glad the fast.ai people are doing it. I simply view it in a dual (meta-optimization) way: both the structural and the optimization "match" should be researched.
Also, the idea of using this type of approach for model fitting is not new to modern deep learning. Older methods like RANSAC did similar things even when the model itself was far simpler and there was no network-topology effect. Arguably, advanced MCMC methods like NUTS, or approximation methods like ADVI, are even more formalized versions of the deep-learning hacks: the network structure and regularization constraints define a prior over model parameter space, and once you combine that with the data to get a likelihood, you really just want something that draws from the posterior over model parameter space.
The reason it’s a “hack” in deep learning is that nobody is trying to formally define the posterior based on the model and the complicated metaparameters, rather they are just adding tweaks on top of tweaks to use deeply inefficient momentum SGD methods to sample from the posterior and optimize a la simulated annealing, instead of making the models truly amenable to something like NUTS. Which IMO is yet another reason to view incremental optimizer tweaks as a negative thing, slowing down and distracting the community.
The state of deep learning today is analogous to the state of bridge-engineering before the advent of physics: https://www.technologyreview.com/s/608911/is-ai-riding-a-one... -- everyone in the trade is aware of this, I think.
But that doesn't mean we should stop building better deep-learning models, finding more efficient ways to train them, and improving state-of-the-art performance in increasingly challenging tasks.
Finding the right job for the tool, to the point of repurposing and adapting "proven" network structures or even already trained networks, is an easier and more popular approach.
The optimizer is just another hyperparameter to tune.
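In that spirit, here is a toy illustration of treating the optimizer itself as a searchable hyperparameter, minimizing f(x) = x^2 (the functions and settings are made up for this sketch, not taken from any library):

```python
def sgd(x, lr, steps):
    """Plain gradient descent on f(x) = x^2 (gradient is 2x)."""
    for _ in range(steps):
        x -= lr * 2 * x
    return x

def sgd_momentum(x, lr, steps, mu=0.9):
    """Heavy-ball momentum on the same objective."""
    v = 0.0
    for _ in range(steps):
        v = mu * v + 2 * x
        x -= lr * v
    return x

# Treat (optimizer, learning rate) as one joint hyperparameter and
# pick whichever combination ends with the lowest loss.
candidates = [(opt, lr) for opt in (sgd, sgd_momentum)
              for lr in (0.01, 0.1)]
best_opt, best_lr = min(candidates,
                        key=lambda c: c[0](5.0, c[1], 50) ** 2)
```

In practice you would do this on a validation set, of course, and the search space would include the schedule as well as the update rule.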
A paper like this used to be a minor thing. Neat result, A for effort, great to see if the authors take the time to generate PRs adding the updated method into mainstream packages and libraries. But no more than that.
Now, with deep learning, every little result has to be trumpeted out with blog posts, claims about what is “state of the art” on some cherry picked example cases, and you’ll be treated like a dunce if you don’t get push notifications direct from arxiv and drop what you’re doing to read everything immediately.
It’s a form of credit inflation. Researchers want sweet, sweet deep learning hype credit for fame & fortune. Instead of just admitting that a lot of results are inapplicable to most practitioners and that’s totally fine and common in published research ... it has to be always a hype arms race about who knows more trivia.
Edit: feel like I should clarify a bit to avoid the "me too +1" style comment. I've been torn in recent months between fast.ai's hands-on, to-the-point style of teaching (which I think is great and have strived to use as an example in my own courses) and their somewhat overhyped presentation of the fastai library. It's not bad, but ultimately it's a collection of rather hacky function calls built on top of other frameworks.
That being said, from an academician's point of view (a group to which I can, so far, still consider myself to belong, albeit perhaps not for much longer), I do appreciate most things Jeremy and his team have brought into the spotlight, e.g. insights around learning rates and embeddings for sparse categoricals. What amuses me most is that these things are by no means very new, but I enjoy the spirit of fast.ai pointing out that they are not new and should be considered (again) today, in the midst of a research community that is moving way too fast.
Then again, I also think this post is a bit too hype-centric and hence falls into the same traps. I am also somewhat disappointed that fast.ai's claims to state of the art lie mainly in well-trodden image-detection and text domains, while saying little or nothing about e.g. RL, in courses or papers. I think the work being done by OpenAI, for instance, during the past few months is far more exciting.
The authors of the AdamW or super-convergence papers did not write the blog post - we did. There is no connection between the authors of the papers mentioned and the authors of the blog post.
It's simply a genuine and honest attempt by us to study these papers and share what we learned, in order to help others get similar results.
Why is this one "inapplicable to most practitioners"?
When you’ve seen enough of these same papers come and go, your eyes glaze over. It’s like p-hacking, but for finding tasks that show X% improvement. Basically it needs a huge Your Mileage May Vary sticker attached, because for a practitioner’s specific situation, you still have to do the experiments to tune these parameters.
Use AdamW or rmsprop or adagrad or ... and use dropout (what fraction?) and use penalty terms, and use early stopping, etc. etc., ... ?
You can’t rely on papers like this for much insight beyond just saying AdamW is yet another tool in the toolbox... and that’s all.
It would be comparable to someone writing a paper about some slight variation of Timsort that is ~10% faster on some empirical distribution of data, like data that is already 80% sorted. Is it good for you? You have to do the legwork to know. Such a result is perfectly good research, but it would be silly to surround it with a bunch of blog posts about “state of the art” sorting and plugs for some software package that has it.
As shown in the post and linked code, it works very well with custom layers and extra regularization. Hardware does not impact the results at all. The results hold across CNNs, RNNs, and even QRNNs, across a wide variety of datasets.
We literally tried to find a real-world NLP or vision dataset it didn't work with, and failed to do so.
There is no explanation of why it worked. Without that understanding, it is very easy to generate a synthetic counterexample where it does not work.