> We trained a generative model on CIFAR-10, which we then used to generate ~6 million images. The scale of the dataset was chosen to ensure that it is “virtually infinite” from the model’s perspective, so that the model never resamples the same data.
I don't think so, Tim. They observe identical performance between cifar-10 and cifar-5m, because the generative model for cifar-5m learned to replicate the distribution that cifar-10 was sampled from. It's the same dataset.
Note that we also have ImageNet experiments, with entirely real data (non-synthetic). See Section 4, and in particular Figure 3 of the full paper: https://arxiv.org/abs/2010.08127
To clarify some other comments on this post: In all settings, we compare "Real World" and "Ideal World" for the same underlying distribution. E.g., we never compare CIFAR-10 and CIFAR-5m; we only compare "Real World CIFAR-5m" vs. "Ideal World CIFAR-5m".
But who actually uses your ImageNet DogBird dataset in an existing study? Part of the problem is that you're inventing your own benchmark. And yet you could have used SVHN and/or the same fixtures (i.e. datasets and ablations) from some of the work you cite, e.g. "Fantastic Generalization Measures and Where to Find Them" https://arxiv.org/pdf/1912.02178.pdf
The CIFAR-5m result is interesting, but it's misleading to the reader since the dataset is so contrived. And yet you lead with it on your front page. There's wayyyy too much hype going on here.
This looked suspect to me at first as well, but on reflection it doesn't seem so bad. The important aspect here is that they have some large dataset that a model will converge on in less than one epoch. The benefit of generating it from cifar-10 is just that they already have multiple reasonable models to compare with, and they already have the hyperparameters for them.
I haven't read the paper, but my guess is that the 50K images in the real-world epoch are not just real images from the cifar-10 dataset, they're 50K random images from cifar-5m. I'm also guessing they don't ever compare performance between a model trained on cifar-10 vs. cifar-5m, they only compare performance of real vs. ideal. So in effect, you can ignore the cifar-10 dataset.
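To make that reading concrete, here's a minimal sketch of how I understand the setup (the sizes and sampling below are my assumptions from the post, not the paper's actual code):

    import numpy as np

    rng = np.random.default_rng(0)
    cifar5m_size = 6_000_000  # ~6M generated images, per the post

    # Real World: a fixed 50K subset of cifar-5m, reused epoch after epoch.
    real_world_indices = rng.choice(cifar5m_size, size=50_000, replace=False)

    # Ideal World: fresh indices at every step, so no sample is ever reused.
    def ideal_world_batch(batch_size=128):
        return rng.choice(cifar5m_size, size=batch_size, replace=False)

    # The reported comparison is then "Real World test error" vs. "Ideal World
    # test error" at the same number of SGD steps, for the same architecture;
    # cifar-10 itself never enters the comparison.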
I've had the opportunity to train on an 88 billion example dataset, and while you do see diminishing returns with each additional sample, it is still important to try to cover as much of the dataset as possible. The limiting factor on training on datasets that large seems to be some sort of hysteresis overfit on the earlier samples that prevents the model from generalizing to the entire dataset. Normally it goes unnoticed, but there is still a small amount of overfit that accumulates the longer you train the model, even if every sample is new.
It's an ultra high resolution sales forecast model for a national retailer. It's made somewhat more manageable by the fact that the data is 1D instead of 2D though.
I wonder if that's an important difference? Sales data seems far more noisy and less informative than all of the scaling research datasets I can think of like images or text. It could also just be that you had some minor flaw causing underfitting in your approach like using too-small models which aren't compute-optimal or too much regularization (a critical part of scaling papers is removing standard regularizers like dropout to avoid hobbling the model).
I've found that it's surprisingly insensitive to the amount of regularization applied; however, it definitely does warrant further experimentation. Also, I expect that one other distinction vs. image data is much higher correlation between samples in the dataset.
But maybe with 5 million real examples sampled from the distribution CIFAR-10 was sampled from they would in fact see a difference. Maybe the generative model is capturing only a limited slice of the diversity that the ideal model would really see.
It seems like they should have downsampled from an actually large dataset rather than generatively upsampled from a small dataset. Unless I'm missing something?
I think what you're saying is plausible, but I would expect the different models to diverge at different rates in that case. So, for example, if resnet-real and resnet-ideal stayed within 1% and the other models showed a bigger range, I would be more suspicious that the generative model was simply creating a dataset that was easy for some architectures to learn.
That being said, I think it would have been much better if they had compared some non-convolutional architectures just as a sanity check.
Edit: after I wrote this, I checked, and ViT-b/4 is actually a transformer architecture, not a CNN. So they did this! And it stayed about as close to its ideal-world error as the CNNs did. I am much more confident now that what they did is fine.
I glanced over the paper and it appears that the generative model used for cifar-5m wasn't even trained with a proper train/test split. It was simply trained until the FID on the training set stopped decreasing. That's a pretty good way to overfit a model -- especially on a dataset as small as cifar-10 -- so it's hard to trust that cifar-5m is a decent proxy for the underlying cifar-10 data distribution.
What's surprising is that Google has an enormous dataset called JFT where they could have tested this without this confound. Just shrink the images and you can make something cifar-like and something cifar-5m-like.
or even just an ablation on Imagenet or some other “large” dataset. Did this paper get accepted because it has five pages of citations and invokes Vapnik in the second sentence?
You're not ablating anything there. What happens when Train Infinity (Train 150K) doubles in size (to Train Infinity_2 -> 300K)? What happens when you add an unseen class? These are real-world conditions that hamper existing theoretical estimation of the generalization gap-- the "ideal world" always gets larger. In Bengio's group paper (Predicting the Generalization Gap https://arxiv.org/pdf/1810.00113.pdf ) they actually do these sorts of ablations.
Also, you use K (thousands) and $K$ (LaTeX K) interchangeably; it's really hard to decipher whether K is a variable or what you mean.
This article was promoted by Google's PR blog and the conference is ultra selective (probably excessively selective). Hyped results deserve extra criticism. And no, they didn't do any useful ablation experiments.
I agree with the philosophical conclusions here (e.g. I take their point on data augmentation to basically be that the name "data augmentation" is well chosen - it enlarges the restricted real world). I'm confused by the setup though. The training algorithm seeks to minimize the training loss on, in one case, the CIFAR-10 distribution D, and in the second case on D', their "ideal world" distribution. They find that these two tasks look very similar. I don't see what conclusions you can draw from this except that they did a good job training their generative model and D' is indeed very close to D, and so minimizers for the training loss on D' will look similar to those for D. (Close can mean something like the distributions D|_{label = c} and D'|_{label = c} are close in Earth-mover distance in some appropriate embedding space).
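For anyone who wants the "close in Earth-mover distance in some embedding space" part made concrete, here's a minimal sketch using a sliced-Wasserstein approximation (the embeddings below are random placeholders standing in for embeddings of D|_{label=c} and D'|_{label=c}):

    import numpy as np
    from scipy.stats import wasserstein_distance

    def sliced_w1(a, b, n_projections=64, seed=0):
        # Approximate the Earth-mover (W1) distance between two point clouds
        # by averaging 1-D Wasserstein distances along random projections.
        rng = np.random.default_rng(seed)
        total = 0.0
        for _ in range(n_projections):
            v = rng.normal(size=a.shape[1])
            v /= np.linalg.norm(v)
            total += wasserstein_distance(a @ v, b @ v)
        return total / n_projections

    # Placeholder embeddings for one class under D and D'.
    rng = np.random.default_rng(1)
    emb_d = rng.normal(size=(1000, 128))
    emb_d_prime = rng.normal(loc=0.05, size=(1000, 128))
    print(sliced_w1(emb_d, emb_d_prime))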
I guess this makes sense, intuitively. If I understand a cat as a “furry animal with four legs, a long tail and pointy ears”, that’s a pretty generalizable model. If I see a new picture of a cat that has orange fur for the first time ever, I will still guess it’s a cat. I’ll also be able to quickly update my model parameters to include orange fur (I.e. on-line learning is fast). Whereas if my model of a cat is “brown stripy thing that’s usually on a window sill”, and I see a picture of an orange cat outdoors, not only will I probably guess it’s a pumpkin or something, it will also take many steps in parameter space to adjust my model to incorporate this new data (on-line learning will be slow).
Resnet works better than MLP not because it optimizes faster, but because the architecture of the network contains a bias towards spatially structured data. MLPs see images the same whether the pixels are sorted or scrambled: the relationship of each pixel to every other pixel is identical at the start (or randomly stronger or weaker if you initialize the network randomly).
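A quick way to see the "MLPs don't care about pixel ordering" point (just an illustrative sketch, not anything from the paper):

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    x = torch.randn(1, 3 * 32 * 32)    # a flattened "image"
    perm = torch.randperm(x.shape[1])  # an arbitrary, fixed pixel shuffle

    mlp = nn.Linear(x.shape[1], 64)
    mlp_shuffled = nn.Linear(x.shape[1], 64)
    with torch.no_grad():
        # Permute the weight columns to match the shuffled input.
        mlp_shuffled.weight.copy_(mlp.weight[:, perm])
        mlp_shuffled.bias.copy_(mlp.bias)

    # Prints True: an MLP has no notion of which pixels are neighbours, so a
    # fixed pixel shuffle costs it nothing; it can absorb the shuffle into its
    # first-layer weights.
    print(torch.allclose(mlp(x), mlp_shuffled(x[:, perm]), atol=1e-6))

    # A conv layer, by contrast, hard-codes locality and weight sharing; no
    # re-indexing of a 3x3 kernel reproduces its output on shuffled pixels.
    # That's the spatial bias being described above.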
So the main claim of this paper is: "...good models and training procedures are those that (1) optimize quickly in the ideal world and (2) do not optimize too quickly in the real world."
Will be interesting to replicate those results on different data.
Maybe I'm being pedantic, but the choice of the terms "real world" and "ideal world" is really confusing for identifying a finite vs. infinite training regime. Even in the real world, you can have an effectively infinite training sample if you have a large enough dataset, just like they did for the experiments.
The real comparison would have been between Cifar-10 (or Imagenet) and a downsampled version thereof, not an upsampled one, and I actually know for sure that this harms performance. So this ideal-world and real-world training are definitely not the same!
This is the exact comparison we make in the paper.
We have subsampled ImageNet experiments as well; see Figure 3 in the full paper: https://arxiv.org/abs/2010.08127
Is there anyone doing Deep Learning research who acknowledges that it is an optimization subfield and accordingly uses rigorous math?
I don't mean to be snarky, and I am sure there is some use for this result, but I hope somebody is working on the math and I want to know who that person is.
To me, the strange thing about talk of "real world generalization" is that the characteristics of this world aren't described. In fact, nothing really delineates what it is - not that we humans don't have an immediate understanding of it but still.
Overall, one idea is experiments on data sets that have structure but where neural networks don't generalize well. Or data sets where neural networks generalize even better than in "the real world".
> To me, the strange thing about talk of "real world generalization" is that the characteristics of this world aren't described. In fact, nothing really delineates what it is - not that we humans don't have an immediate understanding of it but still.
But it is defined: In the real world, you have a finite dataset, and so have to use it over and over during training. You can shuffle the order, augment it with transformations, divide it into batches, but ultimately you reuse the whole dataset (other than the subset withheld for evaluation) in each training epoch. If the trained network deals correctly with never-before-seen examples (whether they were examples withheld from the training set or brand new ones), then it has generalized well.
Ideally, you wouldn't have to go through any of that rigmarole with a finite dataset, every sample you train on would be fresh and unique, and you would have an endless stream of them. That's the 'ideal world' scenario. The same standard for generalization applies, except that every training sample is never-before-seen as well (so, no epochs).
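Here's a bare-bones sketch of the two regimes as I understand them (train_step and the data are just placeholders):

    import numpy as np

    rng = np.random.default_rng(0)

    def train_step(batch):
        pass  # stand-in for one SGD update on `batch`

    # Real world: a finite dataset, shuffled and reused every epoch.
    finite_data = rng.normal(size=(50_000, 32))  # placeholder samples
    for epoch in range(100):
        order = rng.permutation(len(finite_data))
        for i in range(0, len(finite_data), 128):
            train_step(finite_data[order[i:i + 128]])  # each sample seen ~100 times

    # Ideal world: an endless stream of fresh samples; no epochs, no reuse.
    def fresh_batch(batch_size=128):
        return rng.normal(size=(batch_size, 32))  # stand-in for a new draw from D

    for step in range(100 * 50_000 // 128):  # same number of SGD steps
        train_step(fresh_batch())            # every sample is brand new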
> But it is defined: In the real world, you have a finite dataset, and so have to use it over and over during training. You can shuffle the order, augment it with transformations, divide it into batches, but ultimately you reuse the whole dataset (other than the subset withheld for evaluation) in each training epoch.
I have no idea how "you have a finite dataset" defines "real world generalization". This line of reasoning seems completely incoherent.
Sorry, that got scrambled in an edit due to my conflation of "real world generalization" and "real world vs. ideal world training." The result is, indeed, incoherent.
Anyway, while "real world vs. ideal world training" is defined, you are correct that generalization is not, at least here. But definitions do exist, and more or less conform to what I wrote. Withholding a subset of training data for validation is meant as a proxy for the entirely new samples that will (presumably) be encountered upon deploying the model, but if the training data is biased or otherwise not representative of the real world data, then it is probable that testing the withheld data won't reveal the problem. In other words, the model generalizes well within the parameters of the training set, but not to the real-world data that has a different distribution, because the training set wasn't representative.
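A toy sketch of that failure mode (all numbers made up): a validation split carved from a biased training pool looks fine, while accuracy on the unbiased "deployment" distribution is noticeably worse.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    # "Deployment" distribution: the two classes overlap along one feature.
    def deployment_data(n):
        y = rng.integers(0, 2, n)
        x = rng.normal(loc=3.0 * y, scale=2.0, size=(n, 1))
        return x, y

    # Biased training pool: class-1 examples only from a narrow slice.
    x_pool, y_pool = deployment_data(5000)
    keep = (y_pool == 0) | (x_pool[:, 0] > 4.0)
    x_pool, y_pool = x_pool[keep], y_pool[keep]

    # The withheld validation split comes from the *same biased pool*.
    x_val, y_val = x_pool[-500:], y_pool[-500:]
    x_tr, y_tr = x_pool[:-500], y_pool[:-500]

    clf = LogisticRegression().fit(x_tr, y_tr)
    x_dep, y_dep = deployment_data(5000)
    print("validation accuracy:", clf.score(x_val, y_val))  # looks good
    print("deployment accuracy:", clf.score(x_dep, y_dep))  # noticeably worse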
There is some interesting discussion of this within Gwern's investigation of the apocryphal 'tank story':