For all the mathematical hype around ML research, much of the work is closer to alchemy than science. We simply don't understand much about why these neural nets work.
The people doing math beyond algebra are few, and the scene is dominated by "guess and check" style model tinkering.
Many "state of the art models" are simply a bunch of common strategies glued together in a way researchers found worked the best (by trying a bunch of different ones).
An average Joe could probably write influential ML papers by gluing RNN/GAN layers to existing models and fiddling with the parameters until they beat the current state of the art. In fact, in NLP models, this is essentially what has happened with RoBERTa, XLNet, ELECTRA, etc. They're all somewhat trivial variations on Google's BERT, which is more creative but again built on existing models.
Anyways, my point is, none of this required math or genius or particularly demanding thought. It was basically "let's tinker with this until we find something better," using guess and check. No equations needed.
We are a long way from the type of simulations done for protein folding and materials strength and basically every other scientific field. It's still the wild west.
I think you make a good observation that much of ML progress is driven by tinkering with existing models, though instead of describing it as more "alchemy than science" it's probably more accurate to say it's very experimental right now. Being very experimental is neither unscientific nor unusual in the development of knowledge. James Watt worked as an instrument maker (not a theoretician) when he invented the Watt steam engine in 1776, and at the time the idea of heat as phlogiston was still more prevalent than anything that looks like modern thermodynamics. Theory and practice naturally take turns outpacing each other, which is part of why we need both.
I'd also caution against the belief that experimental work doesn't require "particularly demanding thought". There are many things one can tweak in current ML models (the search space is exponential) and, as you point out, the experiments are expensive. Having a solid understanding of the system, great intuition, and good heuristics is necessary to reliably make progress.
For those who are interested in the theory of deep learning, the community has recently made great strides in developing a mathematical understanding of neural networks. The research is still very cutting edge, but the following PDF helps introduce the topic.
They are very permissive. And you get to play with $500k worth of hardware. Been a member for over a year now. Jonathan is singlehandedly the best support person I've ever worked with, or perhaps ever will work with.
I would've completely agreed with you if not for TFRC. And I couldn't resist the opportunity of playing with some big metal, even if it's hard to work with.
Have you not heard of alphafold?
I get that you'd like to have a clear theoretical basis for what works, and we're far from there. But in the meantime we're stumbling in the dark, discovering tricks and forming intuitions, not even knowing where the road is going to lead us.
This is an evolutionary process of ideas, similar to biological evolution that managed to make us. If you know where you're going you can optimise your actions but when you don't even know what might be useful later on, then all attempts are good. They increase diversity and discover blind spots. Some of them will be the stepping stones for the future, but we can't say in advance which and how.
Link to a long discussion about the evolution of ML ideas and the book "Why greatness cannot be planned" by Kenneth Stanley - https://youtu.be/lhYGXYeMq_E?t=416
Here is just one uncurated example of a publication:
Right, but you have to remember there are legions of grad students doing exactly this so it ends up being quite competitive to churn out papers this way.
"guess and check" is terribly ineffective with multi-day training runs. Brings us right back to the batch processing paradigm of the 1960s.
There's the neural tangent kernel work, which has achieved a lot, and transformers themselves are really taking off as the blockwise/low-rank approximation algorithms look more and more like circuits built from basic, well-established components.
"An average Joe could probably write influential ML papers by gluing RNN/GAN layers to existing models and fiddling with the parameters until they beat current state of the art. In fact, in NLP models, this is essentially what has happened with roBERTa, XLNET, ELECTRA, etc. They're all somewhat trivial variations on Google's BERT, which is more creative but yet again built on existing models."
This feels like it trivializes a lot of the work and collapses some of the major advancements in training at scale down to a more one-dimensional outlook. Companies are doing both, but it's easy to throw money and compute at an absolutely guaranteed logarithmic improvement in results. It's not stupidity, it's just reducing variance by exploiting known scaling laws while we work on making things more efficient, which weirdly enough starts the iterative process of academics frantically trying to mine the expensive, inefficient compute tactics to flag-plant their own material.
With respect to your comment on protein folding and such, I feel you might have missed a lot of the major work in that area more recently. There really and truly has been some field-shattering work there, combining deep learning systems with last-mile supervision and refinement systems. I'd posit that we're very much out of the wild west and into the mild, but still rambunctious, west, if I were to put terms on it.
With reference to guess and check -- yes, that was especially prevalent and worked 2-3 years ago, and I'd argue it does still happen somewhat, in a more refined fashion. But I personally believe we'd not get too far beyond the SOTA now without working (effectively) with the data manifold and tightly incorporating the projections/constraints of that data distillation process into the network training procedure. I really do agree that average Joe breakthroughs will happen and continue to benefit the field, and I'd certainly agree there's always going to be the mediocre churn of the paper mills you alluded to, trying to justify their own existence as academics/paper writers. But I legitimately think there's enough precedent set in most parts of the field that you need some kind of thoughtful improvement to move forward (like AdaBelief, which is still terrible because they straight up lie about what they do in the abstract, even though debiasing the variance estimates during training is an exceptionally good idea).
Just my 2c, hope this helps. I think we may have a similar end perspective from two different sides, like two explorers looking at the same peak from different sides of the mountain. :thumbsup:
Much of that is the enormous amount of work done plastering over complex GPU programming. But some of it is the tinkering nature of solving ML problems.
The field I'm most interested in right now for instance, NLP, is highly dataset dependent. It's fairly easy to exceed SoTA right now using open sourced models if you have a better, more specialized dataset than what's freely available.
I started with my 1070 flat, and have had some people far, far, far smarter and more experienced than me help me understand a lot of the underlying mathematics. Semi-supervised/bootstrapping may be a fun topic, if you can avoid the giant CAT trucks of the FAANG monoliths blazing through there, and there's always really good artisanal work to be done if you can prove certain mathematical conditions hold such that other (oftentimes counterintuitive and bizarre) operators still work, or work where they shouldn't have before.
You could also get into the rat race of the *formers -- the Nyströmformer is quite spectacular and nearly linear, and yes, if you're hot on your feet and clever enough, you might be able to beat everything into submission.
Also, distrust every non-Bayesian thing involving means and sigmas. Those are always ad hoc and beat the real data manifold into submission, which really does a disservice a lot of the time, I think. There's a lot out there to get around that (I suppose including the above, which I'd forgotten about), but there's always, uh, SELU if you're looking for inspiration, plus a phenomenal appendix. You want universal attractors? Set up and prove something that's more amenable to a good manifold structure than simply a certain distribution of activations -- that alone truly tells us nothing!
Hope those are fun ideas -- and my deepest apologies if I was uncharitable to you in my former post. I went back and edited it for politeness, but reading it again I felt some of my earlier aggression come through, and I'm certainly sorry about that -- I should be helping new folks, not aggressively gatekeeping against them.
In any case, so long as you're able to keep mathematical interest, there's always a nice hole to square yourself away into. Talk to a good accomplished research professional and they might be able to point you in fun directions (aside from my personal noobishness ;))
Let me know if any of those catch your eye and end up going anywhere, I'm happy to help when it moves the field forward! :)))
While ad hoc empirical fine-tuning is a big part of improving SOTA, mathematical genius can still enable revolutions, e.g. this recent alternative to classical backpropagation that is 300x faster with low accuracy loss.
Sometimes HN is full of people who think they know what they're talking about but just don't. This is one of those times.
Nobody has even tried to create a SpanXLNet (akin to SpanBERT).
How many years will be wasted before researchers get out of the BERT local minimum? I'm afraid it might last a decade.
That's pretty interesting. It implies that the original accuracy rating might be legit. The concern is that we're chasing the imagenet benchmark as if it's the holy grail, when in fact it's a very narrow slice of what we normally care about as ML researchers. However, the fact that pretraining on JFT increases the accuracy means that the model is generalizing, which is very interesting; it implies that models might be "just that good now."
Or more succinctly: if the result were bogus, you'd expect JFT pretraining to have no effect whatsoever (or a negative effect). But it has a positive effect.
The other thing worth mentioning is that AJMooch seems to have killed batch normalization dead, which is very strange to think about. BN has had a long reign of some ~4 years, but the drawbacks are significant: you have to maintain the running mean/variance counters yourself, for example, which was quite annoying.
It always seemed like a neural net ought to be able to learn what BN forces you to keep track of. And AJMooch et al. seem to prove this is true. I recommend giving evonorm-s a try; it worked perfectly for us the first time, with no loss in generality, and it's basically a copy-paste replacement.
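For reference, here's a minimal sketch of what an evonorm-s style layer computes, going off my reading of the "Evolving Normalization-Activation Layers" paper (the S0 variant); the group count and epsilon here are illustrative, so check the official code before trusting the details:

    import torch

    def evonorm_s0(x, gamma, beta, v, groups=8, eps=1e-5):
        # x: (N, C, H, W); gamma, beta, v: learned (1, C, 1, 1) tensors.
        # One layer replaces batchnorm + activation: x * sigmoid(v * x),
        # divided by a per-group standard deviation. Note there are no
        # running mean/variance counters to maintain, which is the appeal.
        n, c, h, w = x.shape
        xg = x.view(n, groups, c // groups, h, w)
        std = xg.var(dim=(2, 3, 4), keepdim=True).add(eps).sqrt()
        std = std.expand_as(xg).reshape(n, c, h, w)
        return (x * torch.sigmoid(v * x)) / std * gamma + beta

In use you'd initialize gamma and v to ones and beta to zeros, as with an ordinary norm layer.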
(Our BigGAN-Deep model is so good that I doubt you can tell the difference vs the official model. It uses AJMooch's evonorm-s rather than batchnorm:  https://i.imgur.com/sfGVbuq.png  https://i.imgur.com/JMJ1Ll0.png and lol at the fake speedometer.)
For (a), maybe surprisingly, the answer is mostly yes! Better ImageNet accuracy generally corresponds to better out-of-distribution accuracy. For (b), it turns out that the ImageNet dataset is full of contradictions---many images have multiple ImageNet-relevant objects, are ambiguously labeled or mislabeled, etc.---so it's hard to disentangle progress in identifying objects vs. models overfitting to the quirks of the benchmark. Some relevant reading:
ObjectNet: https://objectnet.dev (and the associated paper)
ImageNet-v2: https://arxiv.org/abs/1902.10811
An Unbiased Look at Dataset Bias: https://people.csail.mit.edu/torralba/publications/datasets_... (pre-AlexNet!)
From ImageNet to Image Classification: https://arxiv.org/abs/2005.11295
Are we done with ImageNet? https://arxiv.org/abs/2006.07159
Evaluating Machine Accuracy on ImageNet: http://proceedings.mlr.press/v119/shankar20c.html
For example, from OP's post, w/ coordinate system starting at lower left, I have no idea what I'm looking at in these examples, except they look organic-ish:
[1,4], [3,2], [4,1]
sillysaurusx: I've never seen conglomerate pictures like this used in AI training. Do you train models on these 4x4 images? What's the purpose vs a single picture at a time? Does the model know that you're feeding it 4x4 examples, or does it have to figure that out itself?
Aside: Another awesome 'sick-fever dream creation' example if you missed it when it made the rounds on HN is this. Slide the creativity filter up for weirdness!
You can watch the training process here: http://song.tensorfork.com:8097/#images
It's been going on for a month and a half, but I leave it running mostly as a fishtank rather than to get to a specific objective. It's fun to load it up and look at a new random image whenever I want. Plus I like the idea of my little TPU being like "look at me! I'm doing work! Here's what I've prepared for you!" so I try to keep my little fella online all the time.
- Plus stuff like this makes me laugh really hard. https://i.imgur.com/EnfIBz3.png
- Some nice flowers and a boat. https://i.imgur.com/mrFkIx0.png
The model is getting quite good. I kind of forgot about it over the past few weeks. StyleGAN could never get anywhere close to this level of detail. I had to spend roughly a year tracking down a crucial bug in the implementation that prevented biggan from working very well until now: https://github.com/google/compare_gan/issues/54
And we also seemed to solve BigGAN collapse, so theoretically the model can improve forever now. I leave it running to see how good it can get.
"I've never seen conglomerate pictures like this used in AI training. Do you train models on these 4x4 images? What's the purpose vs a single picture at a time? Does the model know that you're feeding it 4x4 examples, or does it have to figure that out itself?"
Nah, the grid is just for convenient viewing for humans. Robots see one image at a time. (Or more specifically, a batch of images; we happen to use batch size 2 or 4 per core, I forget which, so each core sees two or four images at a time, and then all 8 cores broadcast their gradients to each other and average, so it's really seeing 16 or 32 images at a time.)
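If it helps to picture the mechanics, here's a toy sketch of that "each core computes gradients on its own small batch, then everyone averages" pattern. The model is a stand-in least-squares problem (not BigGAN), and the sizes are only illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    CORES, PER_CORE = 8, 2          # 8 cores x 2 images = effective batch of 16
    true_w = np.array([1.0, -2.0, 0.5])
    w = np.zeros(3)

    def local_grad(w, X, y):
        # Toy least-squares gradient; stands in for one core's backward pass.
        return 2 * X.T @ (X @ w - y) / len(y)

    for step in range(200):
        grads = []
        for core in range(CORES):   # in reality these run in parallel
            X = rng.normal(size=(PER_CORE, 3))
            y = X @ true_w
            grads.append(local_grad(w, X, y))
        # "broadcast their gradients to each other and average" = an all-reduce mean
        w -= 0.1 * np.mean(grads, axis=0)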
I feel a bit silly plugging our community so much, but it's really true. If you like tricks like this, join the Tensorfork discord:
My theory when I set it up was that everyone has little tricks like this, but there's no central repository of knowledge / place to ask questions. But now that there are 1,200+ of us, it's become the de facto place to pop in and share random ideas and tricks.
For what it's worth, https://thisanimedoesnotexist.ai/ was a joint collaboration of several Tensorfork discord members. :)
If you want future updates about this specific BigGAN model, twitter is your best bet: https://twitter.com/search?q=(from%3Atheshawwn)%20biggan&src...
The assurance is that everyone in the field seems to take the work seriously. But the reality is that errors creep in from a variety of corners. I would not be even slightly surprised to find that the validation data is substantially similar to the training data. We're still at the "bangs rocks together to make fire" phase of ML, which is both exciting and challenging; we're building the future from the ground up.
People rarely take the time to look at the actual images, but if you do, you'll notice they have some interesting errors in them: https://twitter.com/theshawwn/status/1262535747975868418
I built an interactive viewer for the tagging site: https://tags.tagpls.com/
(Someone tagged all 70 shoes in this one, which was kind of impressive... https://tags.shawwn.com/tags/https://battle.shawwn.com/sdc/i... )
Anyway, some of the validation images happen to be rotated 90 degrees and no one noticed. That made me wonder what other sorts of unexpected errors are in these specific 50,000 validation images that the world just-so-happened to decide were Super Important to the future of AI.
The trouble is, images in general are substantially similar to the ImageNet validation dataset. In other words, it's tempting to try to think of some way of "dividing up" the data so that there's some sort of validation phase you can cleanly separate. But reality isn't so kind. When you're at the scale of millions of images, holding out 10% is just a way of sanity-checking that your model isn't memorizing the training data; nothing more.
Besides, random 90-degree rotations are introduced on purpose now (as data augmentation), so it's funny that old mistakes tend not to matter.
Not necessarily; it may be mostly a bonus of the transfer, especially considering that JFT is that much larger. Getting, for example, the first conv layers' kernels to converge to Gabor-like filters takes time, yet those kernels are very similar across well-trained ImageNet models (and there were some works showing that this is optimal in a sense, and that it is one of the reasons it appears in our visual cortex) and thus transferable; they can practically be treated as fixed in the new model (especially if those layers were pretrained on a very large dataset and reached the state of generic feature extraction). I suspect something similar applies to the low-level feature-aggregating layers too.
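To make that concrete, here's a hedged sketch of treating those low-level layers as fixed during transfer; the model and layer choices (torchvision's ResNet-50, its first stage) are my own illustration, not anything from the paper under discussion:

    import torch
    import torchvision

    # Load ImageNet-pretrained weights, then freeze the earliest stages, on
    # the theory that their Gabor-like filters are generic and transfer as-is.
    model = torchvision.models.resnet50(pretrained=True)
    for name, p in model.named_parameters():
        if name.startswith(("conv1", "bn1", "layer1")):
            p.requires_grad = False

    # Fine-tune only the remaining parameters on the new task.
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(trainable, lr=0.01, momentum=0.9)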
Sure, you can have as many as you want. Watch it train in real time:
Though I imagine HN might swamp our little training server running tensorboard, so here you go.
We've been training a BigGAN-Deep model for 1.5 months now. Though that sounds like a long time, in reality it's completely automatic and I've been leaving it running just to see what will happen. Every single other BigGAN implementation reports that eventually the training run will collapse. We observed the same thing. But gwern came up with a brilliantly simple way to solve this:
    if D_loss < 0.2:
        D_loss = 0
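To spell that out: the trick zeroes the discriminator's loss (and therefore its gradients) whenever the discriminator gets too far ahead. Here's a self-contained toy sketch of how it might slot into an ordinary GAN loop; everything except the "below 0.2, zero it" rule (the models, sizes, optimizer settings) is made up for illustration:

    import torch
    import torch.nn as nn

    G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
    D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))
    g_opt = torch.optim.Adam(G.parameters(), lr=2e-4)
    d_opt = torch.optim.Adam(D.parameters(), lr=2e-4)
    bce = nn.BCEWithLogitsLoss()

    for step in range(1000):
        real = torch.randn(64, 1) * 0.5 + 3.0   # toy "real" data: N(3, 0.5^2)
        fake = G(torch.randn(64, 8))

        # Discriminator step, with the collapse fix:
        d_loss = (bce(D(real), torch.ones(64, 1))
                  + bce(D(fake.detach()), torch.zeros(64, 1)))
        if d_loss.item() < 0.2:
            d_loss = d_loss * 0.0               # D is winning too hard; learn nothing
        d_opt.zero_grad(); d_loss.backward(); d_opt.step()

        # Generator step, unchanged:
        g_loss = bce(D(fake), torch.ones(64, 1))
        g_opt.zero_grad(); g_loss.backward(); g_opt.step()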
If you like this sort of thing in general, I encourage you to come join the Tensorfork community discord server, which we affectionately named the "TPU podcast": https://discord.com/invite/x52Xz3y
There are some 1,200 of us now, and people are always showing off stuff like this in our (way too many) channels.
If you cut the signal from one, the other will rapidly veer off into infinity, i.e. collapse quickly. Or it will veer off in the other direction, i.e. all progress will stop and the model won't improve.
So it's a constant "signal", you see, where each depends on the other's current state. That's why I'm skeptical of attempts to use previous states of discriminators.
However! One of the counterintuitive aspects of AI is that the strangest-sounding ideas often have a chance of being good ideas. It's also so hard to try new ideas that you have to pick specific ones. So, roll up your sleeves and implement yours; I would personally be delighted to see what the code would look like for "the current generator can fool all previous discriminators".
I really do not mean that in any sort of negative or dismissive way. I really hope that you will come try it, because DL has never been more accessible. And the time is ripe for fresh takes on old ideas; there's a very real chance that you'll stumble across something that works quite well, if you follow your line of thinking.
But for practical purposes, the current theory with generators and discriminators is that they react to their current states. So there's not really any way of testing "can the generator fool all previous discriminators?", because in reality the generator isn't fooling the discriminator at all -- each simply notices when the other deviates by a small amount, and makes a corresponding "small delta change" in response. Kind of like an ongoing chess game.
I don't claim it to be a novel idea; I just remember the AlphaGo (Zero?) paper saying they played it against older versions to make sure it hadn't gotten into a bad state.
Whereas it's very difficult to quantify what it means to be "better" at generating images, once you get to a certain threshold of realism. (Was Leonardo better than Michelangelo? Probably, but it's hard to measure precisely.)
The way AlphaGo worked was that it gathered a bunch of experiences, i.e. it played a bunch of games. Then, after playing tons of games -- tens of dozens! just kidding, probably like 20 million -- it performed a single gradient update.
In other words, you gather your current experiences, and then you react to them. It's a two-phase commit. There's an explicit "gather" step, which you then react to by updating your database of parameters, so to speak.
Whereas with GANs, that happens continuously. There's no "gather" step. The generator simply tries to maximize the discriminator's loss, and the discriminator tries to minimize it.
Balancing the two has been very tricky. But the results speak for themselves.
One of the things he mentioned was that they introduce this fancy alternative to BatchNorm AND come up with a fancy new architecture. The combination does really well, but it isn't clear how much of the gain is due to the new, improved architecture vs. the 'adaptive gradient clipping' they introduce.
This is an achievement but it would be helpful to have put "to train" in the title as this is quite different from efficiency at inference time, which is what often actually matters in deployed applications.
From Table 3 on page 7, it appears to me that NFNet is significantly heavier in parameter count than EfficientNet at similar accuracies. For example, EffNet-B5 achieves 83.7% Top-1 with 30M params and 9.9B FLOPs, while NFNet-F0 achieves 83.6% Top-1 with 71.5M params and 12.38B FLOPs.
It appears to me at first glance that NFNet has not achieved SOTA at inference.
It has, for larger models (F1 vs B7). See Fig 4 in the Appendix.
We were talking about models trained on ImageNet, specifically about the trade-off between accuracy and FLOPs. But the higher-accuracy models listed in your link use extra data. So it's not quite the same benchmark we were talking about.
The number one in accuracy (Meta pseudo labels) is also faster for inference (390M vs 570M parameters) vs the deepmind one.
So what are you disagreeing with?
@dheera and I did not mention NFNet-F4+. All models, tables, figures and numbers that we did mention resulted from training on ImageNet alone.
People complaining about how slow and expensive brand new models are to train are ignorant of the history of machine learning and of engineering in general.
I don't completely understand this: NFNet is slower than its competitors on this benchmark, but they claim higher efficiency. This isn't obvious to me.
Take a look at F1 and B7: they have the same accuracy, but F1 is smaller and much faster.
For this they compute the Frobenius norm (the square root of the sum of squares) of each weight layer and of its gradient, and clip based on the ratio of the two.
That saves the meta-search for the optimal threshold, but it's also better than a fixed threshold could ever be.
Very simple idea.
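For the curious, a minimal sketch of that idea in PyTorch; the lambda and epsilon values are illustrative, and note that the paper applies the clip unit-wise (per output row) while this simplified version clips whole tensors:

    import torch

    def adaptive_grad_clip(parameters, lam=0.02, eps=1e-3):
        # Rescale each gradient so its Frobenius norm is at most lam times
        # the Frobenius norm of the corresponding weight tensor. eps guards
        # against near-zero weights (e.g. right after initialization).
        for p in parameters:
            if p.grad is None:
                continue
            w_norm = p.detach().norm()
            g_norm = p.grad.detach().norm()
            max_norm = lam * torch.clamp(w_norm, min=eps)
            if g_norm > max_norm:
                p.grad.mul_(max_norm / g_norm)

You'd call it between loss.backward() and optimizer.step(), where a fixed-threshold clip_grad_norm_ would normally go.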
I'm skeptical that these hand-coded thresholds can ever match what a model can learn automatically. But it's hard to argue with results.
If I’m off-base here, can someone explain?
(Yeah, none of us are anywhere near GPT-3 scale. But I spend most of my time thinking about scaling issues, and it was interesting to see your comment pop up; I would've agreed entirely with it myself, if not for seeing all the anguish caused by attempting to train and deploy billions of parameters.)
They did a lot of manual hyperparameter optimization, and spent a fair amount of time unpacking the rationale for their choices, including a negative-results section (!) in the appendix.
They do a bunch of manual hyperparameter tuning that seems necessary to get the state-of-the-art results. From my reading, it doesn’t seem like they actually used NAS, just that the baseline they compare to was found with NAS.
Their best results involve 'pretraining' on a dataset of 300 million examples, before 'tuning' it on the actual ImageNet training dataset as above.
You might enjoy paperswithcode.com
It would be interesting to test those EfficientNets with zeroth-order backpropagation, as it allows a 300x speedup (vs 8.7x) while not regressing accuracy that much.