Training EfficientNets at Scale: 83% ImageNet Top-1 Accuracy in One Hour (arxiv.org)
10 points by jonbaer 18 days ago | 7 comments

I wonder at what stage notions of metabolism will be baked directly into neural network operations -- e.g. optimize not just for reducing some notion of error on a dataset, but also for the cost of training itself.

EfficientNet is extremely difficult to train: it took people months to replicate even a subset of the original results from the paper, and for the larger configs IIRC the results still haven't been replicated. Same with EfficientDet, the detection model that uses EfficientNet as a backbone. It's also not actually much faster to run on e.g. NVIDIA GPUs, where much lower GFLOPs don't necessarily translate into much lower milliseconds: EfficientNet has _a ton_ more ops in the graph, and those ops are "small" and can't take advantage of all the available throughput. Indeed, some configs are slower than comparably accurate alternatives. TL;DR: just because Google can get sexy results out of this doesn't mean there aren't better options available to you for practical tasks.
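The GFLOPs-vs-milliseconds point can be sketched with a toy latency model (all constants here are made-up illustrations, not measurements): once you account for a fixed per-op launch overhead, a low-FLOP graph with many small ops can lose to a higher-FLOP graph with fewer, larger ops.

```python
# Toy latency model, purely illustrative: total time = compute time + a fixed
# per-op launch/dispatch overhead. All constants are assumptions, not benchmarks.

def est_latency_ms(gflops, num_ops, peak_tflops=100.0, overhead_us=10.0):
    """Rough estimate: 1 TFLOP/s == 1 GFLOP/ms, so compute_ms = gflops / peak_tflops."""
    compute_ms = gflops / peak_tflops
    overhead_ms = num_ops * overhead_us / 1000.0  # launch overhead dominates small ops
    return compute_ms + overhead_ms

# Hypothetical nets: a few big ops vs. half the FLOPs spread over many small ops.
few_big = est_latency_ms(gflops=8.0, num_ops=100)     # 0.08 + 1.0 = 1.08 ms
many_small = est_latency_ms(gflops=4.0, num_ops=600)  # 0.04 + 6.0 = 6.04 ms
print(few_big, many_small)
```

In this model the low-FLOP net is several times slower end to end, which is the "lower GFLOPs, higher milliseconds" effect in miniature.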

Agreed. I trained a 3D version of b0-b2 on a classification task I worked on, and besides being very slow to train they did not outperform a simple baseline VGG architecture. Interestingly, training time was much improved by setting the cudnn benchmark flag in pytorch. I haven't seen this reduction in training time for any other architecture, but for b2 it went from about 12 sec per iteration to about 0.8. I guess there is more margin for optimization in NAS models.
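For reference, the flag in question is a one-liner in PyTorch (a sketch; how much it helps depends on your GPU, cuDNN version, and on input shapes staying fixed):

```python
# Enable cuDNN autotuning: cuDNN benchmarks several convolution algorithms on
# the first forward pass for each input shape and caches the fastest one.
# NAS-derived graphs like EfficientNet contain many distinct conv
# configurations, so the algorithm choice matters more than usual.
import torch

torch.backends.cudnn.benchmark = True

# Caveat: with variable input shapes, every new shape triggers re-benchmarking,
# which can make things slower rather than faster.
```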

What would be some practical alternative model architectures one could use? I've seen EfficientNet be used fairly often (apart from ResNet/ImageNet etc.) in transfer learning.

The answer is unfortunately "it's complicated" and "it depends on the task". Very large EfficientNet variants are impractical anyway. For detection I've been very impressed with https://www.paperswithcode.com/method/vovnetv2. You can use "plain" convolutions when you need higher accuracy and depthwise separable convolutions when you need speed, at the expense of a relatively small accuracy drop. Generally GPU hardware prefers the kind of nets where it can "stretch its legs", so to speak. VoVNet does have more parameters, but it's _far_ easier to train to practical levels of accuracy, and it does well on NVIDIA hardware, especially under TensorRT. Note that you can plug in VoVNet2 as a backbone into just about any detector; it doesn't have to be CenterNet. For mobile GPUs, MobileNet V3 is still adequate for a lot of things.
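To make the plain-vs-depthwise-separable trade concrete, here's a back-of-the-envelope parameter count for a single 3x3 layer (the channel sizes are assumptions for illustration; the same ratio governs FLOPs):

```python
# Parameter count of a plain KxK conv vs. its depthwise-separable split
# (KxK depthwise + 1x1 pointwise). Channel sizes below are assumed examples.

def plain_conv_params(c_in, c_out, k):
    return c_in * c_out * k * k

def dw_separable_params(c_in, c_out, k):
    return c_in * k * k + c_in * c_out  # depthwise + pointwise

c_in, c_out, k = 256, 256, 3
plain = plain_conv_params(c_in, c_out, k)  # 589,824
sep = dw_separable_params(c_in, c_out, k)  # 67,840
print(f"plain: {plain}, separable: {sep}, ~{plain / sep:.1f}x fewer params")
```

At these shapes the separable form does roughly 9x less work, which is where the speed comes from; the cost is the relatively small accuracy drop noted above.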

If anyone from NVIDIA is reading this, I feel like this is a good and under-explored area of research that'd be very tractable for you: figure out architectures custom-designed to do well on recent NVIDIA GPUs, and especially on Jetson Xavier.

One thing people don't realize is that EfficientNet/EfficientDet aren't necessarily the best choice _for their specific dataset_. In a way, a lot of these academic networks are overfit to the task of e.g. detecting objects in MSCOCO. If your dataset doesn't look like MSCOCO, there's no guarantee whatsoever that they will do well on it. Same with ImageNet for classification. ImageNet is very hard: to do well on it, your net has to do something most humans won't be able to do without substantial training, namely recognize the various dog breeds. If your problem is simpler (which nearly all of them are), chances are you don't need as complicated a model to do well on it. Indeed, a "complicated" model is likely to actually do worse than a model that's "just complicated enough", due to e.g. overfitting or being more sensitive to noise in real-world data. Not to mention it will naturally limit your experiment throughput, which is one of the most important factors in getting a good model that does something practical.

I appreciate your answer and your take on the subject; I'll definitely check VoVNet2 out. I think the practice of taking whatever model is currently achieving SotA results and fine-tuning it on one's data does bring results, but at the same time it makes it too easy to just blindly apply those models.

I think part of the problem is that it's currently too easy to reach for the big models - why train one from scratch when you can just tweak a few layers and get better results?

That's the thing, if your dataset is not similar to the dataset on which your model was pre-trained (which it often isn't), the initial weights may actually prove to be either worthless, or even detrimental. That is, they may be worse than random initialization.
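One cheap way to test this on your own data is to fine-tune both the pretrained weights and a randomly re-initialized copy of the same architecture and compare validation curves. A minimal PyTorch sketch (`reinit_` and `build_backbone` are hypothetical helpers; `reset_parameters()` exists on most built-in layers):

```python
import torch.nn as nn

def reinit_(model: nn.Module) -> None:
    """Randomly re-initialize every submodule that supports it, discarding
    any pretrained weights so training starts from scratch."""
    for m in model.modules():
        if hasattr(m, "reset_parameters"):
            m.reset_parameters()

# Usage sketch: train both for a few epochs and compare.
# pretrained = build_backbone(pretrained=True)  # hypothetical constructor
# scratch = build_backbone(pretrained=True)
# reinit_(scratch)  # same architecture, random initial weights
```

If the from-scratch copy catches up quickly, the pretrained initialization is buying you little on that dataset.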
