> We trained a generative model on CIFAR-10, which we then used to generate ~6 million images. The scale of the dataset was chosen to ensure that it is “virtually infinite” from the model’s perspective, so that the model never resamples the same data.
I don't think so, Tim. They observe identical performance between cifar-10 and cifar-5m, because the generative model for cifar-5m learned to replicate the distribution that cifar-10 was sampled from. It's the same dataset.
Note that we also have ImageNet experiments, with entirely real data (non-synthetic). See Section 4, and in particular Figure 3 of the full paper: https://arxiv.org/abs/2010.08127
To clarify some other comments on this post: In all settings, we compare "Real World" and "Ideal World" for the same underlying distribution. E.g., we never compare CIFAR-10 and CIFAR-5m; we only compare "Real World CIFAR-5m" vs. "Ideal World CIFAR-5m".
But who actually uses your ImageNet DogBird dataset in an existing study? Part of the problem is that you're inventing your own benchmark. And yet you could have used SVHN and/or the same fixtures (i.e. datasets and ablations) from some of the work you cite, e.g. "Fantastic Generalization Measures and Where to Find Them" https://arxiv.org/pdf/1912.02178.pdf
The CIFAR-5m result is interesting, but it's misleading to the reader since the dataset is so contrived. And yet you lead with it on your front page. There's wayyyy too much hype going on here.
This looked suspect to me at first as well, but on reflection it doesn't seem so bad. The important aspect here is that they have some large dataset that a model will converge on in less than one epoch. The benefit of generating it from cifar-10 is just that they already have multiple reasonable models to compare with, and they already have the hyperparameters for them.
I haven't read the paper, but my guess is that the 50K images in the real-world epoch are not just real images from the cifar-10 dataset, they're 50K random images from cifar-5m. I'm also guessing they don't ever compare performance between a model trained on cifar-10 vs. cifar-5m, they only compare performance of real vs. ideal. So in effect, you can ignore the cifar-10 dataset.
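To make that reading concrete, here's a minimal sketch of how I understand the setup (the sizes and sampling below are my assumptions from the post, not the paper's actual code):

    import numpy as np

    rng = np.random.default_rng(0)
    cifar5m_size = 6_000_000  # ~6M generated images, per the post

    # Real World: a fixed 50K subset of cifar-5m, reused epoch after epoch.
    real_world_indices = rng.choice(cifar5m_size, size=50_000, replace=False)

    # Ideal World: fresh indices at every step, so no sample is ever reused.
    def ideal_world_batch(batch_size=128):
        return rng.choice(cifar5m_size, size=batch_size, replace=False)

    # The reported comparison is then "Real World test error" vs. "Ideal World
    # test error" at the same number of SGD steps, for the same architecture;
    # cifar-10 itself never enters the comparison.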
I've had the opportunity to train on an 88 billion example dataset, and while you do see diminishing returns with each additional sample, it is still important to try to cover as much of the dataset as possible. The limiting factor on training on datasets that large seems to be some sort of hysteresis overfit on the earlier samples that prevents the model from generalizing to the entire dataset. Normally it goes unnoticed, but there is still a small amount of overfit that accumulates the longer you train the model, even if every sample is new.
It's an ultra high resolution sales forecast model for a national retailer. It's made somewhat more manageable by the fact that the data is 1D instead of 2D though.
I wonder if that's an important difference? Sales data seems far more noisy and less informative than all of the scaling research datasets I can think of like images or text. It could also just be that you had some minor flaw causing underfitting in your approach like using too-small models which aren't compute-optimal or too much regularization (a critical part of scaling papers is removing standard regularizers like dropout to avoid hobbling the model).
I've found that it's surprisingly insensitive to the amount of regularization applied; however, it definitely does warrant further experimentation. Also, I expect that one other distinction vs. image data is much higher correlation between samples in the dataset.
But maybe with 5 million real examples sampled from the distribution CIFAR-10 was sampled from they would in fact see a difference. Maybe the generative model is capturing only a limited slice of the diversity that the ideal model would really see.
It seems like they should have downsampled from an actually large dataset rather than generatively upsampled from a small dataset. Unless I'm missing something?
I think what you're saying is plausible, but I would expect the different models to diverge at different rates in that case. So, for example, if resnet-real and resnet-ideal stayed within 1% and the other models showed a bigger range, I would be more suspicious that the generative model was simply creating a dataset that was easy for some architectures to learn.
That being said, I think it would have been much better if they had compared some non-convolutional architectures just as a sanity check.
Edit: after I wrote this, I checked, and ViT-b/4 is actually a transformer architecture, not a CNN. So they did this! And it stayed about as close to its ideal-world error as the CNNs did. I am much more confident now that what they did is fine.
I glanced over the paper and it appears that the generative model used for cifar-5m wasn't even trained with a proper train/test split. It was simply trained until the FID on the training set stopped decreasing. That's a pretty good way to overfit a model -- especially on a dataset as small as cifar-10 -- so it's hard to trust that cifar-5m is a decent proxy for the underlying cifar-10 data distribution.
What's surprising is that Google has an enormous dataset called JFT where they could have tested this without this confound. Just shrink the images and you can make something cifar-like and something cifar-5m-like.
or even just an ablation on Imagenet or some other “large” dataset. Did this paper get accepted because it has five pages of citations and invokes Vapnik in the second sentence?
You're not ablating anything there. What happens when Train Infinity (Train 150K) doubles in size (to Train Infinity_2 -> 300K)? What happens when you add an unseen class? These are real-world conditions that hamper existing theoretical estimation of the generalization gap-- the "ideal world" always gets larger. In Bengio's group paper (Predicting the Generalization Gap https://arxiv.org/pdf/1810.00113.pdf ) they actually do these sorts of ablations.
Also, you use K (thousands) and $K$ (LaTeX K) interchangeably; it's really hard to decipher whether K is a variable or what you mean.
This article was promoted by Google's PR blog and the conference is ultra selective (probably excessively selective). Hyped results deserve extra criticism. And no, they didn't do any useful ablation experiments.
I agree with the philosophical conclusions here (e.g. I take their point on data augmentation to basically be that the name "data augmentation" is well chosen - it enlarges the restricted real world). I'm confused by the setup though. The training algorithm seeks to minimize the training loss on, in one case, the CIFAR-10 distribution D, and in the second case on D', their "ideal world" distribution. They find that these two tasks look very similar. I don't see what conclusions you can draw from this except that they did a good job training their generative model and D' is indeed very close to D, and so minimizers for the training loss on D' will look similar to those for D. (Close can mean something like the distributions D|_{label = c} and D'|_{label = c} are close in Earth-mover distance in some appropriate embedding space).
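For anyone who wants the "close in Earth-mover distance in some embedding space" part made concrete, here's a minimal sketch using a sliced-Wasserstein approximation (the embeddings below are random placeholders standing in for embeddings of D|_{label=c} and D'|_{label=c}):

    import numpy as np
    from scipy.stats import wasserstein_distance

    def sliced_w1(a, b, n_projections=64, seed=0):
        # Approximate the Earth-mover (W1) distance between two point clouds
        # by averaging 1-D Wasserstein distances along random projections.
        rng = np.random.default_rng(seed)
        total = 0.0
        for _ in range(n_projections):
            v = rng.normal(size=a.shape[1])
            v /= np.linalg.norm(v)
            total += wasserstein_distance(a @ v, b @ v)
        return total / n_projections

    # Placeholder embeddings for one class under D and D'.
    rng = np.random.default_rng(1)
    emb_d = rng.normal(size=(1000, 128))
    emb_d_prime = rng.normal(loc=0.05, size=(1000, 128))
    print(sliced_w1(emb_d, emb_d_prime))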
I guess this makes sense, intuitively. If I understand a cat as a “furry animal with four legs, a long tail and pointy ears”, that’s a pretty generalizable model. If I see a new picture of a cat that has orange fur for the first time ever, I will still guess it’s a cat. I’ll also be able to quickly update my model parameters to include orange fur (I.e. on-line learning is fast). Whereas if my model of a cat is “brown stripy thing that’s usually on a window sill”, and I see a picture of an orange cat outdoors, not only will I probably guess it’s a pumpkin or something, it will also take many steps in parameter space to adjust my model to incorporate this new data (on-line learning will be slow).
Resnet works better than MLP not because it optimizes faster, but because the architecture of the network contains a bias towards spatially structured data. MLPs see images the same whether the pixels are sorted or scrambled: the relationship of each pixel to every other pixel is identical at the start (or randomly stronger or weaker if you initialize the network randomly).
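A quick way to see the "MLPs don't care about pixel ordering" point (just an illustrative sketch, not anything from the paper):

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    x = torch.randn(1, 3 * 32 * 32)    # a flattened "image"
    perm = torch.randperm(x.shape[1])  # an arbitrary, fixed pixel shuffle

    mlp = nn.Linear(x.shape[1], 64)
    mlp_shuffled = nn.Linear(x.shape[1], 64)
    with torch.no_grad():
        # Permute the weight columns to match the shuffled input.
        mlp_shuffled.weight.copy_(mlp.weight[:, perm])
        mlp_shuffled.bias.copy_(mlp.bias)

    # Prints True: an MLP has no notion of which pixels are neighbours, so a
    # fixed pixel shuffle costs it nothing; it can absorb the shuffle into its
    # first-layer weights.
    print(torch.allclose(mlp(x), mlp_shuffled(x[:, perm]), atol=1e-6))

    # A conv layer, by contrast, hard-codes locality and weight sharing; no
    # re-indexing of a 3x3 kernel reproduces its output on shuffled pixels.
    # That's the spatial bias being described above.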
So the main claim of this paper is: "...good models and training procedures are those that (1) optimize quickly in the ideal world and (2) do not optimize too quickly in the real world."
Will be interesting to replicate those results on different data.
Maybe I'm being pedantic, but the choice of the terms "real world" and "ideal world" is really confusing for identifying a finite vs. infinite training regime. Even in the real world, you can have an effectively infinite training sample if you have a large enough dataset, just like they did for the experiments.
The real comparison would have been between Cifar-10 (or Imagenet) and a downsampled version thereof, not an upsampled one, and I actually know for sure that this harms performance. So this ideal-world and real-world training are definitely not the same!
This is the exact comparison we make in the paper.
We have subsampled ImageNet experiments as well; see Figure 3 in the full paper: https://arxiv.org/abs/2010.08127
Is there anyone doing Deep Learning research who acknowledges that it is an optimization subfield and accordingly uses rigorous math?
I don't mean to be snarky, and I am sure there is some use for this result, but I hope somebody is working on the math and I want to know who that person is.
To me, the strange thing about talk of "real world generalization" is that the characteristics of this world aren't described. In fact, nothing really delineates what it is - not that we humans don't have an immediate understanding of it but still.
Overall, one idea is experiments on data sets that have structure but where neural networks don't generalize well. Or data sets where neural networks generalize even better than in "the real world".
> To me, the strange thing about talk of "real world generalization" is that the characteristics of this world aren't described. In fact, nothing really delineates what it is - not that we humans don't have an immediate understanding of it but still.
But it is defined: In the real world, you have a finite dataset, and so have to use it over and over during training. You can shuffle the order, augment it with transformations, divide it into batches, but ultimately you reuse the whole dataset (other than the subset withheld for evaluation) in each training epoch. If the trained network deals correctly with never-before-seen examples (whether they were examples withheld from the training set or brand new ones), then it has generalized well.
Ideally, you wouldn't have to go through any of that rigmarole with a finite dataset, every sample you train on would be fresh and unique, and you would have an endless stream of them. That's the 'ideal world' scenario. The same standard for generalization applies, except that every training sample is never-before-seen as well (so, no epochs).
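Here's a bare-bones sketch of the two regimes as I understand them (train_step and the data are just placeholders):

    import numpy as np

    rng = np.random.default_rng(0)

    def train_step(batch):
        pass  # stand-in for one SGD update on `batch`

    # Real world: a finite dataset, shuffled and reused every epoch.
    finite_data = rng.normal(size=(50_000, 32))  # placeholder samples
    for epoch in range(100):
        order = rng.permutation(len(finite_data))
        for i in range(0, len(finite_data), 128):
            train_step(finite_data[order[i:i + 128]])  # each sample seen ~100 times

    # Ideal world: an endless stream of fresh samples; no epochs, no reuse.
    def fresh_batch(batch_size=128):
        return rng.normal(size=(batch_size, 32))  # stand-in for a new draw from D

    for step in range(100 * 50_000 // 128):  # same number of SGD steps
        train_step(fresh_batch())            # every sample is brand new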
> But it is defined: In the real world, you have a finite dataset, and so have to use it over and over during training. You can shuffle the order, augment it with transformations, divide it into batches, but ultimately you reuse the whole dataset (other than the subset withheld for evaluation) in each training epoch.
I have no idea how "you have a finite dataset" defines "real world generalization". This line of reasoning seems completely incoherent.
Sorry, that got scrambled in an edit due to my conflation of "real world generalization" and "real world vs. ideal world training." The result is, indeed, incoherent.
Anyway, while "real world vs. ideal world training" is defined, you are correct that generalization is not, at least here. But definitions do exist, and more or less conform to what I wrote. Withholding a subset of training data for validation is meant as a proxy for the entirely new samples that will (presumably) be encountered upon deploying the model, but if the training data is biased or otherwise not representative of the real world data, then it is probable that testing the withheld data won't reveal the problem. In other words, the model generalizes well within the parameters of the training set, but not to the real-world data that has a different distribution, because the training set wasn't representative.
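A toy sketch of that failure mode (all numbers made up): a validation split carved from a biased training pool looks fine, while accuracy on the unbiased "deployment" distribution is noticeably worse.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    # "Deployment" distribution: the two classes overlap along one feature.
    def deployment_data(n):
        y = rng.integers(0, 2, n)
        x = rng.normal(loc=3.0 * y, scale=2.0, size=(n, 1))
        return x, y

    # Biased training pool: class-1 examples only from a narrow slice.
    x_pool, y_pool = deployment_data(5000)
    keep = (y_pool == 0) | (x_pool[:, 0] > 4.0)
    x_pool, y_pool = x_pool[keep], y_pool[keep]

    # The withheld validation split comes from the *same biased pool*.
    x_val, y_val = x_pool[-500:], y_pool[-500:]
    x_tr, y_tr = x_pool[:-500], y_pool[:-500]

    clf = LogisticRegression().fit(x_tr, y_tr)
    x_dep, y_dep = deployment_data(5000)
    print("validation accuracy:", clf.score(x_val, y_val))  # looks good
    print("deployment accuracy:", clf.score(x_dep, y_dep))  # noticeably worse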
There is some interesting discussion of this within Gwern's investigation of the apocryphal 'tank story':