DeepMind achieves SOTA image recognition with 8.7x faster training (arxiv.org)
291 points by highfrequency 19 days ago | hide | past | favorite | 83 comments

I have a feeling the ML community is going to pivot focus to faster and smaller training before larger advancements are made. It's simply too expensive for much AI research to happen when state-of-the-art models take $500k of hardware to train.

For all the mathematical hype around ML research, much of the work is closer to alchemy than science. We simply don't understand a great deal about why these neural nets work.

The people doing math above algebra are few and the scene is dominated by "guess and check" style model tinkering.

Many "state of the art models" are simply a bunch of common strategies glued together in a way researchers found worked the best (by trying a bunch of different ones).

An average Joe could probably write influential ML papers by gluing RNN/GAN layers to existing models and fiddling with the parameters until they beat the current state of the art. In fact, in NLP this is essentially what has happened with RoBERTa, XLNet, ELECTRA, etc. They're all somewhat trivial variations on Google's BERT, which is more creative but itself built on existing models.

Anyways, my point is, none of this required math or genius or particularly demanding thought. It was basically let's tinker with this until we find a way that's better, using guess and check. No equations needed.
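The "guess and check" loop described above can be sketched in a few lines of Python. Everything here (the knob names, the scoring stub) is purely illustrative; `evaluate` stands in for an expensive training run:

```python
import random

# Caricature of "guess and check": random search over architecture
# knobs, keeping whatever configuration scores best on the benchmark.
search_space = {
    "layers": [12, 24, 48],
    "hidden": [256, 512, 1024],
    "dropout": [0.0, 0.1, 0.3],
}

def evaluate(cfg):
    # stand-in for "train the model, measure benchmark accuracy"
    rng = random.Random(str(sorted(cfg.items())))
    return rng.random()

best_cfg, best_score = None, -1.0
for _ in range(20):
    cfg = {k: random.choice(v) for k, v in search_space.items()}
    score = evaluate(cfg)
    if score > best_score:
        best_cfg, best_score = cfg, score
```

No equations needed, as the parent says; the only cost is the (very large) compute bill for each call to the scoring function.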

We are a long way from the type of simulations done for protein folding and materials strength and basically every other scientific field. It's still the wild west.

There's a lot of interest in various ML communities on more efficient training and inference. Both vision and NLP have had a growing focus on these problems in recent years.

I think you make a good observation that much of ML progress is driven by tinkering with existing models, though instead of describing it as more "alchemy than science" it's probably more accurate to say it's very experimental right now. Being very experimental is neither unscientific nor unusual in the development of knowledge. James Watt worked as an instrument maker (not a theoretician) when he invented the Watt steam engine in 1776 [1], and at the time the idea of heat as Phlogiston [2] was still more prevalent than anything that looks like modern thermodynamics. Theory and practice naturally take turns outpacing each other, which is part of why we need both.

I'd also caution against the belief that experimental work doesn't require "particularly demanding thought". There are many things one can tweak in current ML models (the search space is exponential) and, as you point out, the experiments are expensive. Having a solid understanding of the system, great intuition, and good heuristics is necessary to reliably make progress.

For those who are interested in the theory of deep learning, the community has recently made great strides on developing a mathematical understanding of neural networks. The research is still very cutting edge, but the following PDF helps introduce the topic [3].

[1]: https://en.wikipedia.org/wiki/James_Watt

[2]: https://en.wikipedia.org/wiki/Phlogiston_theory

[3]: https://www.cs.princeton.edu/courses/archive/fall19/cos597B/...

That course from Princeton looks great! This paper is a nice short read and gives some geometric insight: https://arxiv.org/abs/1805.10451

Apply to TFRC! https://www.tensorflow.org/tfrc

They are very permissive. And you get to play with $500k worth of hardware. Been a member for over a year now. Jonathan is singlehandedly the best support person I've ever worked with, or perhaps ever will work with.

I would've completely agreed with you if not for TFRC. And I couldn't resist the opportunity of playing with some big metal, even if it's hard to work with.

just applied. thanks for sharing!

>We are a long way from the type of simulations done for protein folding and materials strength and basically every other scientific field.

Have you not heard of AlphaFold?

Thanks for mentioning AlphaFold. This article by the DeepMind team is very insightful.


> Anyways, my point is, none of this required math or genius or particularly demanding thought. It was basically let's tinker with this until we find a way that's better, using guess and check. No equations needed.

I get that you'd like to have a clear theoretical basis for what works and we're far from there. But in the meantime we're stumbling in the dark, discovering tricks and forming intuitions, not knowing even where the road is going to lead us.

This is an evolutionary process of ideas, similar to biological evolution that managed to make us. If you know where you're going you can optimise your actions but when you don't even know what might be useful later on, then all attempts are good. They increase diversity and discover blind spots. Some of them will be the stepping stones for the future, but we can't say in advance which and how.

Link to a long discussion about the evolution of ML ideas and the book "Why greatness cannot be planned" by Kenneth Stanley - https://youtu.be/lhYGXYeMq_E?t=416

I see your point when it comes to paper publication in general, but I feel that your post is unwarranted with respect to the original post. DeepMind has recruited top-quality theoretical researchers from public institutions; it is not just experimental work which your average Joe could do with a few guesses. These researchers published a lot of theoretical papers before they were recruited, and they still publish a lot of them now that they are working at DeepMind, only with more computational hardware to apply their ideas.

Here is just one uncurated example of a publication:

- https://deepmind.com/research/publications/Taylor-Expansion-...

- https://arxiv.org/pdf/2003.06259.pdf

> An average Joe could probably write influential ML papers by gluing RNN/GAN layers to existing models and fiddling with the parameters until they beat current state of the art.

Right, but you have to remember there are legions of grad students doing exactly this so it ends up being quite competitive to churn out papers this way.

> The people doing math above algebra are few and the scene is dominated by "guess and check" style model tinkering.

"guess and check" is terribly ineffective with multi-day training runs. Brings us right back to the batch processing paradigm of the 1960s.

This feels much like the sentiment in the field about two years ago or so. While I feel like the "alchemy" storyline is still somewhat in play, most of the big important parts of the deep learning process have enough ideological linear approximators stacked around them that, if you know what you're doing or looking at, you can jump to an unexplored trench with some reasonable feeling about whether you'll get something good or not. I feel like the "alchemy" framing comes from people new to the field being inundated with information about it, and while I think that still holds, there very much is a well-understood science of principles in most parts of it.

There's the neural tangent kernel work that's achieved a lot, and the transformers themselves are really taking off a lot as the blockwise/lower rank approximation algorithms look more and more like circuits built off of basic, more well-established components.

> An average Joe could probably write influential ML papers by gluing RNN/GAN layers to existing models and fiddling with the parameters until they beat current state of the art. In fact, in NLP models, this is essentially what has happened with RoBERTa, XLNet, ELECTRA, etc. They're all somewhat trivial variations on Google's BERT, which is more creative but yet again built on existing models.

This feels like it trivializes a lot of the work and collapses some of the major advancements in training at scale down to a one-dimensional outlook. Companies are doing both, but it's easy to throw money and compute at an absolutely guaranteed logarithmic improvement in results. It's not stupidity; it's just reducing variance by scaling known laws while we work on making things more efficient, which, weirdly enough, starts the iterative process of academics frantically mining the expensive, inefficient compute tactics to flag-plant their own material.

With respect to your comment on protein folding and such, I feel you might have missed a lot of the major work in that arena more recently. There really and truly has been some field-shattering work combining deep learning systems with last-mile supervision and refinement systems. I'd posit that we're very much out of the wild west and into the mild, but still rambunctious, west, if I were to put terms on it.

With reference to guess and check: yes, that was especially prevalent and worked 2-3 years ago, and I'd advocate that it still happens somewhat, in a more refined fashion. But I personally believe we'd not get far beyond the SOTA without working (effectively) with your data manifold now, tightly incorporating the projections/constraints of that data distillation process into your network training procedure. I really do agree with you that average Joe breakthroughs will happen and continue to benefit the field, and I'd certainly agree there will always be the mediocre churn of the paper mills you alluded to, justifying their own existence as academics/paper writers. But I legitimately think there's enough precedent set in most parts of the field that you need some kind of thoughtful improvement to move forward (like AdaBelief, which is still terrible because they straight up lie about what they do in the abstract, even though the improvement of debiasing the variance estimates during training is an exceptionally good idea).

Just my 2c, hope this helps. I think we may have a similar end perspective from two different sides, like two explorers looking at the same peak from the different side of the mountain. :thumbsup:

Good write-up. Indeed I'm a novice tinkering with a decent gaming GPU :) . I was initially daunted by ML but the more I read I began to realize the field is quite accessible these days. Most of the time, you don't need to understand why or how this stuff works at a deep level. You just need a good feel of what might work and a training dataset.

Much of that is the enormous amount of work done plastering over complex GPU programming. But some of it is the tinkering nature of solving ML problems.

The field I'm most interested in right now for instance, NLP, is highly dataset dependent. It's fairly easy to exceed SoTA right now using open sourced models if you have a better, more specialized dataset than what's freely available.

Absolutely, couldn't agree more. If you want a secret, just find what scaling laws are there and find tunnels to bypass them. There's always a way to the secret garden, you just sometimes have to look long and hard... ;)

I started flat with my 1070, and have had some people far, far, far smarter and more experienced than me help me understand a lot of the underlying mathematics. Semi-supervised learning/bootstrapping may be a fun topic, if you can avoid the giant CAT trucks of the FAANG monoliths blazing through there, and there's always really good artisanal work to be done if you can prove certain mathematical conditions hold such that other (oftentimes counterintuitive and bizarre) operators still work, or work when they shouldn't have before.

You could also get into the rat race of the *formers -- the Nyströmformer is quite spectacular and nearly linear, and yes, if you're hot on your feet and clever enough, you might be able to beat everything into submission.

Also, distrust every non-Bayesian thing involving means and sigmas. Those are always ad hoc and beat the real data manifold into submission, which really does it a disservice a lot of the time, I think. There's a lot of work to get around that (I suppose including the above, which I'd forgotten about, but there's always, uh, SELU if you're looking for inspiration, plus a phenomenal appendix. You want universal attractors? Set up and prove something that's more amenable to a good manifold structure than simply a certain distribution of activations -- that truly tells us nothing!)

Hope those are fun ideas -- and my deepest apologies if I was uncharitable to you in my former post. I went back and edited it for politeness, but reading it again I felt some of my earlier aggression come through, and I'm certainly sorry about that -- I should be helping new folks, not being an aggressive gatekeeper against that.

In any case, so long as you're able to keep mathematical interest, there's always a nice hole to square yourself away into. Talk to a good accomplished research professional and they might be able to point you in fun directions (aside from my personal noobishness ;))

Let me know if any of those catch your eye and end up going anywhere, I'm happy to help when it moves the field forward! :)))

Minor variations on top of existing stuff and occasional leaps forward is most research though, it's not surprising ML research follows a similar pattern.

While I generally agree to some extent with "roBERTa, XLNET, ELECTRA, etc. They're all somewhat trivial variations on Google's BERT, which is more creative": researchers take inspiration from existing models, of course, and some BERT derivatives are trivial. However, XLNet is in its own league. While its author (a genius Chinese student) was inspired by BERT, it is one of the few SOTA pre-trained models not based on BERT, and it is actually an autoregressive one! That difference allows it to be better at many things, since it doesn't have to corrupt the tokens (from my shallow understanding). The model is two years old but is sadly still the one that ranks most SOTA in key tasks, e.g. dependency parsing. And after all this time nobody has cared enough to even test it on other foundational tasks (which is extremely sad and pathetic), e.g. coreference resolution. Because of conformism effects, almost zero researchers have created XLNet derivatives. Almost all researchers continue to search in the local minimum that is BERT, which I find immensely ironic.

While ad hoc empirical fine-tuning is a big part of improving SOTA, mathematical genius can still enable revolutions, e.g. this recent alternative to classical backpropagation that is 300x faster with low accuracy loss: https://paperswithcode.com/paper/zorb-a-derivative-free-back...

Not sure why you're being downvoted. I was about to swoop in and mention that the top level comment was wrong about XLNet being some bert based model but you beat me to it.

Sometimes HN is full of people who think they know what they're talking about but just don't. This is one of those times.

Honestly, my comments that have negative karma generally contain much more useful truths than my comments that have positive karma; this is almost systematic. It shows how low-quality the HN community is, epistemologically speaking. There are much less lazy communities out there, like lesswrong.com, but unfortunately they don't talk much about computer science.

Interesting, but I'm not sure you're completely right about XLNet. I heard it takes an absurd amount of resources to train, even more than the BERT variations, and this is likely why there's not a ton of interest in it.

https://github.com/renatoviolin/xlnet

XLNet running on very low-end hardware (a single 8GB 2080, non-Ti) significantly outperforms BERT-large on e.g. the reference question-answering benchmark, SQuAD 2.0: 86% vs 81%.

Nobody has even tried to create a spanXLNet (akin to SpanBERT). How many years will be wasted before researchers get out of the BERT local minimum? I'm afraid it might last a decade.

I was skeptical of the 83%-whatever top-1 accuracy on ImageNet. But someone pointed out that when the model is pretrained on JFT, Google's proprietary 300-million-image dataset, its accuracy increases to 87%-whatever.

That's pretty interesting. It implies the original accuracy rating might be legit. The concern is that we're chasing the ImageNet benchmark as if it's the holy grail, when in fact it's a very narrow slice of what we normally care about as ML researchers. But the fact that pretraining on JFT increases the accuracy means the model is generalizing, which is very interesting; it implies that models might be "just that good now."

Or more succinctly, if the result was bogus, you'd expect JFT pretraining to have no effect whatsoever (or a negative effect). But it has a positive result.

The other thing worth mentioning is that AJMooch seems to have killed batch normalization dead, which is very strange to think about. BN has had a long reign of some ~4 years, but the drawbacks are significant: you have to maintain the running statistics yourself, for example, which was quite annoying.

It always seemed like a neural net ought to be able to learn what BN forces you to keep track of. And AJMooch et al seem to prove this is true. I recommend giving evonorm-s a try; it worked perfectly for us the first time, with no loss in generality, and it's basically a copy-paste replacement.
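For reference, here's a rough NumPy sketch of what EvoNorm-S0 computes as I understand the paper (x times sigmoid(v * x), divided by a grouped standard deviation); the grouping and parameter shapes here are illustrative and may differ from the official implementation:

```python
import numpy as np

def evonorm_s0(x, v, gamma, beta, groups=4, eps=1e-5):
    # Batch-independent alternative to BatchNorm: no running statistics
    # to maintain, everything is computed from the current activations.
    n, c, h, w = x.shape
    num = x * (1.0 / (1.0 + np.exp(-v * x)))  # x * sigmoid(v * x)
    grp = x.reshape(n, groups, c // groups, h, w)
    std = np.sqrt(grp.var(axis=(2, 3, 4), keepdims=True) + eps)
    std = np.broadcast_to(std, grp.shape).reshape(n, c, h, w)
    return num / std * gamma + beta

x = np.random.default_rng(0).normal(size=(2, 8, 4, 4))
shape = (1, 8, 1, 1)  # per-channel learnable parameters
y = evonorm_s0(x, np.ones(shape), np.ones(shape), np.zeros(shape))
```

Note there's nothing stateful here, which is exactly why it drops into places BatchNorm used to occupy without bookkeeping.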

(Our BigGAN-Deep model is so good that I doubt you can tell the difference vs the official model. It uses AJMooch's evonorm-s rather than batchnorm: [1] https://i.imgur.com/sfGVbuq.png [2] https://i.imgur.com/JMJ1Ll0.png and lol at the fake speedometer.)

Not sure if I follow your JFT argument, but there's a large body of work on both (a) studying whether chasing ImageNet accuracy yields models that generalize well to out of distribution data [1, 2, 3] and (b) contextualizing progress on ImageNet (i.e., what does high accuracy on ImageNet really mean?) [4, 5, 6].

For (a), maybe surprisingly the answer is mostly yes! Better ImageNet accuracy generally corresponds to better out of distribution accuracy. For (b), it turns out that the ImageNet dataset is full of contradictions---many images have multiple ImageNet-relevant objects, and often are ambiguously or mis-labeled, etc---so it's hard to disentangle progress in identifying objects vs. models overfitting to the quirks of the benchmark.

[1] ObjectNet: https://objectnet.dev / associated paper

[2] ImageNet-v2: https://arxiv.org/abs/1902.10811

[3] An Unbiased Look at Dataset Bias: https://people.csail.mit.edu/torralba/publications/datasets_... (pre-AlexNet!)

[4] From ImageNet to Image Classification: https://arxiv.org/abs/2005.11295

[5] Are we done with ImageNet? https://arxiv.org/abs/2006.07159

[6] Evaluating Machine Accuracy on ImageNet: http://proceedings.mlr.press/v119/shankar20c.html

Those example pictures are trippy! Some of them look like those weird DeepDream fever-dream creations. Except I assume they are all real photos, not generated. It's very possible that a trained AI would be better able to identify some of the candidates than I would.

For example, from OP's post, w/ coordinate system starting at lower left, I have no idea what I'm looking at in these examples, except they look organic-ish: [1]: [1,4], [3,2], [4,1]

sillysaurusx: I've never seen conglomerate pictures like this used in AI training. Do you train models on these 4x4 images? What's the purpose vs a single picture at a time? Does the model know that you're feeding it 4x4 examples, or does it have to figure that out itself?

Aside: Another awesome 'sick-fever dream creation' example if you missed it when it made the rounds on HN is this[3]. Slide the creativity filter up for weirdness!

[3] https://thisanimedoesnotexist.ai/

I'm surprised so many people want to see our BigGAN images. Thank you for asking :)

You can watch the training process here: http://song.tensorfork.com:8097/#images

It's been going on for a month and a half, but I leave it running mostly as a fishtank rather than to get to a specific objective. It's fun to load it up and look at a new random image whenever I want. Plus I like the idea of my little TPU being like "look at me! I'm doing work! Here's what I've prepared for you!" so I try to keep my little fella online all the time.

- https://i.imgur.com/0O5KZdE.png

- Plus stuff like this makes me laugh really hard. https://i.imgur.com/EnfIBz3.png

- Some nice flowers and a boat. https://i.imgur.com/mrFkIx0.png

The model is getting quite good. I kind of forgot about it over the past few weeks. StyleGAN could never get anywhere close to this level of detail. I had to spend roughly a year tracking down a crucial bug in the implementation that prevented biggan from working very well until now: https://github.com/google/compare_gan/issues/54

And we also seemed to solve BigGAN collapse, so theoretically the model can improve forever now. I leave it running to see how good it can get.

> I've never seen conglomerate pictures like this used in AI training. Do you train models on these 4x4 images? What's the purpose vs a single picture at a time? Does the model know that you're feeding it 4x4 examples, or does it have to figure that out itself?

Nah, the grid is just for convenient viewing for humans. Robots see one image at a time. (Or more specifically, a batch of images; we happen to use batch size 2 or 4, I forget, so each core sees two images at a time, and then all 8 cores broadcast their gradients to each other and average, so it's really seeing 16 or 32 images at a time.)
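The cross-core averaging described above is just a mean over per-core gradients; a minimal NumPy sketch (core count and shapes illustrative):

```python
import numpy as np

# Each of 8 cores computes a gradient on its own mini-batch, then all
# cores average their gradients (an "all-reduce") so every core applies
# the identical update: one effective step over 8x the per-core batch.
rng = np.random.default_rng(0)
per_core_grads = [rng.normal(size=(3, 3)) for _ in range(8)]
avg_grad = np.mean(per_core_grads, axis=0)  # what each core actually applies
```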

I feel a bit silly plugging our community so much, but it's really true. If you like tricks like this, join the Tensorfork discord:


My theory when I set it up was that everyone has little tricks like this, but there's no central repository of knowledge / place to ask questions. But now that there are 1,200+ of us, it's become the de facto place to pop in and share random ideas and tricks.

For what it's worth, https://thisanimedoesnotexist.ai/ was a joint collaboration of several Tensorfork discord members. :)

If you want future updates about this specific BigGAN model, twitter is your best bet: https://twitter.com/search?q=(from%3Atheshawwn)%20biggan&src...

This is awesome, thanks.

What safeguards are there or what assurances do we have that JFT is not contaminated with images from (or extremely similar to) the validation set?

Haha. None whatsoever.

The assurance is that everyone in the field seems to take the work seriously. But the reality is that errors creep in from a variety of corners. I would not be even slightly surprised to find that the validation data is substantially similar. We're still at the "bangs rocks together to make fire" phase of ML, which is both exciting and challenging; we're building the future from the ground up.

People rarely take the time to look at the actual images, but if you do, you'll notice they have some interesting errors in them: https://twitter.com/theshawwn/status/1262535747975868418

I built an interactive viewer for the tagging site: https://tags.tagpls.com/

(Someone tagged all 70 shoes in this one, which was kind of impressive... https://tags.shawwn.com/tags/https://battle.shawwn.com/sdc/i... )

Anyway, some of the validation images happen to be rotated 90 degrees and no one noticed. That made me wonder what other sorts of unexpected errors are in these specific 50,000 validation images that the world just-so-happened to decide were Super Important to the future of AI.

The trouble is, images in general are substantially similar to the imagenet validation dataset. In other words, it's tempting to try to think of some way of "dividing up" the data so that there's some sort of validation phase that you can cleanly separate. But reality isn't so kind. When you're at the scale of millions of images, holding out 10% is just a way of sanity checking that your model isn't memorizing the training data; nothing more.
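The 10% hold-out mentioned above is nothing fancier than a shuffled index split; a small sketch (function name and fraction illustrative):

```python
import numpy as np

def holdout_split(n, frac=0.1, seed=0):
    # Holding out ~10%: a sanity check that the model isn't memorizing
    # the training set, not a guarantee of a truly separate distribution.
    idx = np.random.default_rng(seed).permutation(n)
    cut = int(n * frac)
    return idx[cut:], idx[:cut]  # (train indices, validation indices)

train_idx, val_idx = holdout_split(1_000_000)
```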

Besides, random 90 degree rotations are introduced on purpose now, so it's funny that old mistakes tend not to matter.

Just consider the sheer size of JFT (the latest versions I've heard of approach 1B images); it's probably impractical to train on it until overfitting.

That's awesome! Is your model available publicly? I run a site [0] where users can generate images from text prompts using models like the official BigGAN-Deep one, and I'd love to try yours out for this purpose. Do you also have somewhere to discuss this stuff? I'm new to ML in general and was wondering where y'all experts gather.

[0]: https://dank.xyz

> pretraining on JFT increases the accuracy means that the model is generalizing

Not necessarily; it may be mostly a bonus of the transfer, especially considering that JFT is that much larger. Getting, for example, the first conv layers' kernels to converge to Gabor-like filters takes time, yet those kernels are very similar across well-trained image nets (and there were some works showing that this is optimal in a sense, and that it's one of the reasons it appears in our visual cortex) and thus transferable; they can practically be treated as fixed in the new model (especially if those layers were pretrained in a very large model and reached the state of generic feature extraction). I suspect the same is applicable to the low-level feature-aggregating layers too.
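The "treat the transferred layers as fixed" idea, in a toy NumPy sketch (all names, shapes, and values are illustrative; the gradient is a stand-in, no real framework assumed):

```python
import numpy as np

rng = np.random.default_rng(0)

# Only the new task head receives updates; the transferred generic
# feature extractor is left untouched.
W_early = rng.normal(size=(8, 16))  # pretrained generic features, frozen
W_head = rng.normal(size=(16, 2))   # new task head, trainable

def forward(x):
    return np.maximum(x @ W_early, 0.0) @ W_head  # ReLU features -> head

frozen_before = W_early.copy()
fake_grad = rng.normal(size=W_head.shape)  # stand-in for a real gradient
W_head -= 0.01 * fake_grad                 # update the head only
```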

Is there any (inference) SW framework that takes a YouTube video as input and spits out object/timestamp pairs as output?

Where are these images from? Are there more?

Oh, you! I'm so flattered. You're making me blush.

Sure, you can have as many as you want. Watch it train in real time:


Though I imagine HN might swamp our little training server running tensorboard, so here you go.


We've been training a BigGAN-Deep model for 1.5 months now. Though that sounds like a long time, in reality it's completely automatic and I've been leaving it running just to see what will happen. Every single other BigGAN implementation reports that eventually the training run will collapse. We observed the same thing. But gwern came up with a brilliantly simple way to solve this:

  if D_loss < 0.2:
    D_loss = 0  # discriminator skips learning this step
It takes some thinking to see why this solves collapse. But in short, the discriminator isn't allowed to get too intelligent, so the generator is never forced to degenerate into single examples that happen to fool the discriminator, i.e. collapse.
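In a training loop the trick amounts to conditionally skipping the discriminator's update; a hypothetical sketch (the function names are stand-ins, not real framework calls):

```python
def discriminator_step(d_loss, apply_update, threshold=0.2):
    # If the discriminator is already winning too hard, zero the loss
    # and skip its update so the generator gets a chance to catch up.
    if d_loss < threshold:
        return 0.0
    apply_update(d_loss)  # otherwise learn as usual
    return d_loss

applied = []
normal = discriminator_step(0.5, applied.append)   # update happens
skipped = discriminator_step(0.1, applied.append)  # update skipped
```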

If you like this sort of thing in general, I encourage you to come join the Tensorfork community discord server, which we affectionately named the "TPU podcast": https://discord.com/invite/x52Xz3y

There are some 1,200 of us now, and people are always showing off stuff like this in our (way too many) channels.

Do you take advantage of previous iterations of the generator and discriminator? i.e. the generator should be able to fool all previous discriminators, and the discriminator should be able to recognise the work of all previous generators?

Nope! It's an interesting balance. The truth of the situation seems to be: the generator and discriminator provide a "signal" to each other, like two planets orbiting around each other.

If you cut the signal from one, the other will rapidly veer off into infinity, i.e. collapse quickly. Or it will veer off in the other direction, i.e. all progress will stop and the model won't improve.

So it's a constant "signal", you see, where one is dependent on the other in the current state. Therefore I am skeptical of attempts to use previous states of discriminators.

However! One of the counterintuitive aspects of AI is that the strangest-sounding ideas often have a chance of being good ideas. It's also so hard to try new ideas that you have to pick specific ones. So, roll up your sleeves and implement yours; I would personally be delighted to see what the code would look like for "the current generator can fool all previous discriminators".

I really do not mean that in any sort of negative or dismissive way. I really hope that you will come try it, because DL has never been more accessible. And the time is ripe for fresh takes on old ideas; there's a very real chance that you'll stumble across something that works quite well, if you follow your line of thinking.

But for practical purposes, the current theory with generators and discriminators is that they react to their current states. So there's not really any way of testing "can the generator fool all previous discriminators?", because in reality the generator isn't fooling the discriminator at all: each simply notices when the other deviates by a small amount, and makes a corresponding small delta change in response. Kind of like an ongoing chess game.

Thanks for the detailed answer.

I don't claim it to be a novel idea; I just remember the AlphaGo (Zero?) paper saying they played it against older versions to make sure it hadn't got into a bad state.

Ah! This is an interesting difference, and it illustrates one fun aspect of GANs vs other types of models: AlphaGo had a very specific "win condition" that you can measure precisely. (Can the model win the game?)

Whereas it's very difficult to quantify what it means to be "better" at generating images, once you get to a certain threshold of realism. (Was Leonardo better than Michelangelo? Probably, but it's hard to measure precisely.)

The way AlphaGo worked was: it gathered a bunch of experiences, i.e. it played a bunch of games. Then, after playing tons of games (tens of dozens! just kidding, probably like 20 million), it performed a single gradient update.

In other words, you gather your current experiences, and then you react to them. It's a two-phase commit. There's an explicit "gather" step, which you then react to by updating your database of parameters, so to speak.

Whereas with GANs, that happens continuously. There's no "gather" step. The generator simply tries to maximize the discriminator's loss, and the discriminator tries to minimize it.

Balancing the two has been very tricky. But the results speak for themselves.
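That orbiting dynamic can be seen in a toy minimax problem. Here G ascends and D descends a hypothetical bilinear loss with simultaneous small steps; the pair circles the equilibrium at (g=1, d=0) and slowly spirals outward, a cartoon of why balancing the two is tricky:

```python
# Toy continuous minimax: generator ascends, discriminator descends.
def loss(g, d):
    return (g - 1.0) * d  # hypothetical bilinear objective

g, d, lr = 0.0, 0.5, 0.1
start_radius = (g - 1.0) ** 2 + d ** 2
for _ in range(100):
    grad_g = d          # d(loss)/dg at the current point
    grad_d = g - 1.0    # d(loss)/dd at the current point
    g, d = g + lr * grad_g, d - lr * grad_d  # simultaneous update
end_radius = (g - 1.0) ** 2 + d ** 2
# each step multiplies the squared radius by exactly (1 + lr**2):
# the pair orbits the equilibrium while slowly drifting away from it
```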

The big deal here is the removal of BatchNorm. People never really liked BatchNorm for various theoretical and practical reasons, and yet it was required for all the top performing models. If this allows us to get rid of it forever that will be really nice.

Yannic Kilcher has a great video on this out already, including some thoughtful critique. https://youtu.be/rNkHjZtH0RQ

One of the things he mentioned was that they introduce this fancy alternative to BatchNorm AND come up with a fancy new architecture. The combination does really well, but it isn't clear how much of the gain is due to the new improved architecture Vs the 'adaptive gradient clipping' they introduce.

> 8.7x faster to train

This is an achievement but it would be helpful to have put "to train" in the title as this is quite different from efficiency at inference time, which is what often actually matters in deployed applications.

From Table 3 on Page 7 it appears to me that NFNet is significantly heavier in parameter count than EfficientNet at similar accuracies. For example, EffNet-B5 achieves 83.7% top-1 with 30M params and 9.9B FLOPs, while NFNet-F0 achieves 83.6% top-1 with 71.5M params and 12.38B FLOPs.

It appears to me at first glance that NFNet has not achieved SOTA at inference.

> It appears to me at first glance that NFNet has not achieved SOTA at inference.

It has, for larger models (F1 vs B7). See Fig 4 in the Appendix.

> No it hasn't https://paperswithcode.com/sota/image-classification-on-imag...

We were talking about models trained on ImageNet, specifically about the trade-off between accuracy and FLOPs. But the higher-accuracy models listed in your link use extra data. So it's not quite the same benchmark we were talking about.

The NFNet-F4+ from the DeepMind paper you were talking about also uses external training data.

The number one in accuracy (Meta Pseudo Labels) is also lighter at inference (390M vs. 570M parameters) than the DeepMind one. So what are you disagreeing with?

> The deepmind paper NFNet-F4+ you were talking about also has external training data.

@dheera and I did not mention NFNet-F4+. All models, tables, figures and numbers that we did mention resulted from training on ImageNet alone.

Make it work, then make it fast and efficient.

People complaining about how slow and expensive brand new models are to train are ignorant of the history of machine learning and of engineering in general.

*11.5% as much compute.

But what was the baseline hardware for a reasonable training time?

Page 7 has a table of one training step on TPUv3 and V100 GPUs.

I don't completely understand this: NFNet is slower than its competitors on this benchmark, but they claim higher efficiency. This isn't obvious to me.

> I don't completely understand this: NFNet is slower than its competitors on this benchmark, but they claim higher efficiency.

Take a look at F1 and B7: They have the same accuracy, but F1 is smaller and much faster.

Rediscovering NLMS now, good for ML folks.


tl;dr: Instead of using batch norm to prevent exploding gradients, use adaptive gradient clipping thresholds.

For this they compute the Frobenius norm (square root of the sum of squares) of the weight layer and of its gradient, and use the ratio of the two to set the clipping threshold.

That avoids a meta-search for the optimal threshold, and it also adapts in a way a fixed threshold never could.

Very simple idea.
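The ratio test described above fits in a few lines of NumPy. A minimal sketch (layer-wise for simplicity, while the paper actually applies it per output unit; the `clip` and `eps` values here are just illustrative defaults):

```python
import numpy as np

def agc_clip(weight, grad, clip=0.01, eps=1e-3):
    """Scale grad down whenever ||grad||_F / ||weight||_F exceeds `clip`."""
    w_norm = max(np.linalg.norm(weight), eps)  # Frobenius norm, floored by eps
    g_norm = np.linalg.norm(grad)              # Frobenius norm of the gradient
    max_norm = clip * w_norm                   # allowed gradient norm
    if g_norm > max_norm:
        grad = grad * (max_norm / g_norm)      # rescale onto the threshold
    return grad
```

The `eps` floor keeps freshly initialized (near-zero) weights from having their gradients clipped to nothing.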

Thanks for macroexpanding frobnorm.

I'm skeptical that these hand-coded thresholds can ever match what a model can learn automatically. But it's hard to argue with results.

Is this the first time that the gradient clip threshold has been chosen relative to the size of the weight matrix?

Lol, why is my comment down here with 7 upvotes.

As distributed training becomes the norm, the lack of BN will become more and more desirable.

The speed improvements are certainly interesting; the performance improvements seem decidedly not. This method has more than 2x the parameters of all but one of the models it was compared against.

If I’m off-base here can someone explain?

I don't care how many parameters my model has per se. What I care about is how expensive it is to train in time and dollars. If this makes it cheaper to train better models despite more parameters, that's still a win.

There's one important caveat, though I agree with your thrust: at GPT-3 scale, cutting params in half is a nontrivial optimization. So it's worth keeping an eye out for that concern.

(Yeah, none of us are anywhere near GPT-3 scale. But I spend most of my time thinking about scaling issues, and it was interesting to see your comment pop up; I would've agreed entirely with it myself, if not for seeing all the anguish caused by attempting to train and deploy billions of parameters.)

In cases where you have to deploy the model and you are limited in terms of FLOPs, this paper does not help much, unless its removal of batchnorm somehow allows a future network that is actually faster at inference time.

There are a lot of techniques for sparsifying or pruning or distilling models to reduce inference FLOPS, and they almost always produce better results when starting with a better model. Also, if your model is 8x faster to train at the same size then you can do 8x as much hyperparameter tuning and get a better result.

This model is much more expensive than efficientnet at inference (I think the flops are about 2x?). You can use these same techniques with efficientnet.

> they almost always produce better results when starting with a better model.

If you have a FLOPs limit, this new model would first need to be shrunk 2-3x more than EfficientNet to fit the same constraint. So you would be starting from a smaller model and thus lower performance. EfficientNet is most likely still better for embedded applications.

But for deployment in smaller devices you can use techniques such as distillation, quantization and sparsity. Training and inference are very different problems in practice.

Yes, but you can do that with EfficientNet as well. The point is that this is an improvement only for training, because it uses computations which are highly optimized on TPUs.

Some models are still memory limited. Fewer parameters are very useful in those settings.

Is there any video work being done in the realm of image processing? I mean, a video is essentially just a sequence of images/frames, no?

Do they disclose any important techniques/ideas on how to achieve these results in the paper, or it's more of a technical press release?

Yes they do (arXiv is a place for scientific papers, not press releases). I've only skimmed it, but the paper introduces an adaptive way to clip gradients: if the ratio of the gradient norm to the weight norm surpasses a certain threshold, they clip the gradient. This stabilizes learning and seems to remove the need for batch normalization. Seems quite promising imo and something that could stick (I'd be quite happy if we could finally do away with batchnorm).

You missed a big part: they did a big NAS run to make it work.

Where did you see that they used NAS? Their preliminary results show it works even for the baseline model

they did a lot of manual hyperparameter optimization, and spend a fair amount of time unpacking the rationale for their choices, including a negative results section (!) in the appendix

Is there an implementation for what they are describing in PyTorch?

They compare the training latency for different models with a fixed batch size of 32. But if the DeepMind models are several times larger than the comparison models in each latency class, it seems that the comparison models could use larger batch sizes for faster overall training time.

For ConvNets, the memory use of the models themselves is pretty modest. For example, even with 0.5B parameters in FP32, weights+gradients+momentum should use just 6GB (unless your framework sucks, or you have extra overhead from distributed training). So if your model is half the size, you only save 3GB. If your VRAM is 32GB, saving 3GB won't let you use a much bigger batch size. On the other hand, the absence of batch norm can actually lead to memory savings proportional to batch size.
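The arithmetic behind that 6GB figure, assuming plain SGD with momentum (three FP32 copies per parameter; an optimizer like Adam would keep a second moment buffer and need a fourth copy):

```python
params = 0.5e9       # 0.5B parameters
copies = 3           # weights + gradients + momentum buffer
bytes_per = 4        # FP32 is 4 bytes per value
total_gb = params * copies * bytes_per / 1e9
print(total_gb)      # 6.0
```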

To what do you attribute the gains? Adaptive clipping? Or the $$$ spent on NAS?

They do a small study in section 4.1 comparing batchnorm to adaptive gradient clipping for ResNets over a range of hyperparameters, and they also compare performance to batchnorm versions in Table 6. The results indicate AGC does give a real boost over batchnorm.

They do a fair amount of manual hyperparameter tuning that seems necessary to get the state-of-the-art results. From my reading it doesn't seem like they actually used NAS, just that the baseline they compare against was found with NAS.

eli5 what SOTA image recognition is?

SOTA is 'state of the art'. Image recognition is a task classically appraised by calculating the accuracy on the ImageNet dataset, which requires a system to classify images each as one of 1,000 pre-determined classes.

so how many images does the current SOTA take to train a classifier? Trying to gauge how much of an improvement Deepmind has made here.

The results that should be used to compare their results to others involves training on just under 1.3 million images across the 1,000 classes.

Their best results involve 'pretraining' on a dataset of 300 million examples, before 'tuning' it on the actual ImageNet training dataset as above.

State of The Art == SOTA

You might enjoy paperswithcode.com

This isn't the real SOTA: "Meta Pseudo Labels" has ~10% fewer errors while having fewer parameters. https://paperswithcode.com/sota/image-classification-on-imag... However, the fast training is an interesting property.

It would be interesting to test those EfficientNets with zeroth-order backpropagation, as it claims a 300x speedup (vs. 8.7x) while not regressing accuracy that much: https://paperswithcode.com/paper/zorb-a-derivative-free-back...
