An overview of the theory of overparameterized machine learning (arxiv.org)
123 points by sebg 32 days ago | 52 comments

A quick summary/translation for those of us who don't speak ML.

We keep hearing about these giant models like GPT3 with 1.5 billion parameters. Parameters are the things that change when we train a model; you can think of them as degrees of freedom. If you have a lot of parameters, theory led us to believe that the model would just "overfit" the training data, i.e., memorize it. That's bad, because when new data comes in in production we'd expect the model to be unable to "generalize" to it, i.e., make accurate predictions on data it hasn't seen before, because it has just memorized the training data instead of uncovering the "guiding principles" of the data, so to speak.

In practice, these huge models are, in layman's terms, fucking awesome and work really well, i.e., they generalize and work in production. No one understands why.

This paper is a survey or overview of what "too many parameters" means, and of all the research into why these models work even though they shouldn't.

My big beef with a lot of the 'leading edge' ML research is that it tends to be waaaaay too focused on classification problems, and ImageNet in particular. And, last I checked, you /do/ still fight with overfitting in classification models, by cleverly choosing learning rate schedules and using early stopping schemes, 'double descent' be damned.

You can solve classification with a hash function: Hash the image, and then just memorize which label goes with which hash. You can try to dodge this obviously dodgy solution by adding augmentation to the dataset. Then you instead learn to find a representation invariant under the set of augmentations, and learn the hash of that representation. It turns out these augmentation-invariant representations are actually pretty good, so we can solve the classification problem in what looks like a general way.

However, there are many other classes of problems where the hash problem doesn't exist, because the information density of the outputs is too high to memorize in the same way. Specifically, generative models, and the sorts of predictive/infill problems used for self-supervision. In these spaces, the problems are more like: "Given this pile of augmented input, generate half a megabyte of coherent output." These kinds of problems simply don't overfit: Train a speech separation model on a big dataset, and the train+eval quality metrics will just asymptote their way up and to the right until you run out of training budget.

Memorization is only an issue if you allow it to be. If you design the model with a "narrow" enough inner stage, then that limits the level of detail (in terms of distinct representable values) passed to subsequent stages. This should give you an ML algorithm that consists of a fingerprinting stage (approximating your hashing) followed by a classifier that works on the fingerprint input (approximating a table lookup). Such an algorithm should not have the over-fitting problem you describe.

Memorization is only an issue if you allow it to be.

Sure, it's a potential problem that can appear in the process of implementing a deep learning solution. It's not an insurmountable problem. But the fact that it still appears seems like an indication that the situation in deep learning is more complicated than "overparameterization is not a problem".

When you say hash function, do you mean a cryptographic hash function? How on earth could the performance of that be anywhere near the simplest probabilistic algorithm on unseen examples?

No, nothing cryptographic here. All I'm saying is that you can memorize the dataset by extracting a small fingerprint of each training example and associating it with an output label: i.e., learn by lookup table. Then you don't need to memorize the whole training set, you just need to find/learn the fingerprinting function. With no augmentation, you might as well use MD5... With augmentation, you do need to do some actual work to learn to extract an augmentation-invariant projection of the training examples, but the basic principle is the same.
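The "learn by lookup table" idea above can be sketched in a few lines. This is a toy illustration with made-up data, not anyone's actual pipeline: fingerprint each training example with MD5, associate fingerprints with labels, and you get perfect training accuracy with zero generalization.

```python
import hashlib

# Toy "training set": raw example bytes -> label (made-up data).
train_set = {
    b"cat picture 1": "cat",
    b"dog picture 1": "dog",
    b"cat picture 2": "cat",
}

# "Training": memorize a fingerprint -> label table instead of the raw data.
lookup = {hashlib.md5(x).hexdigest(): y for x, y in train_set.items()}

def classify(x: bytes) -> str:
    """Perfect on the training set, useless on anything unseen."""
    return lookup.get(hashlib.md5(x).hexdigest(), "unknown")

print(classify(b"cat picture 1"))  # -> cat (memorized)
print(classify(b"dog picture 7"))  # -> unknown (no generalization)
```

With augmentation, the hash would have to be replaced by a learned, augmentation-invariant fingerprint, but the table-lookup structure is the same.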

I have nothing to do with machine learning, but it seems like the hashing approach would only work if you are "training" on the evaluation set instead of a separate training set. Afaik in ImageNet-like challenges the set of labeled training images does not contain any of the evaluation images, so there wouldn't be any hashes matching any of the evaluation data.

Yes, you're right. You should never see the test/evaluation dataset during training, so it would be impossible to "memorize" the test cases. You would get near-perfect accuracy on the training data, but not on the test set. I think the closest analogue would be models that produce conceptual embeddings somewhere in them -- those are kind of like hashes with the property that similar things have similar embeddings. Many classification neural networks kind of operate like that -- the initial layers produce a representation of the data and then the final layer actually performs the classification.

Err.. hash functions like MD5 and SHA256 are "cryptographic". That just means one with a random distribution of outputs as opposed to maybe Apple's "neural" hash function which has outputs that do the "augmentation invariant projection" you speak of.

What I'm trying to say is that neural networks are "universal approximators of continuous real functions". You can think of them as finding the curve of a function which matches the data to an expected output, and they get their predictive power by matching the underlying "function" of the problem.

Applying a cryptographic hash function is like completely scrambling the underlying function. The only way for a neural network to match it is if it was somehow a universal approximator of a discontinuous real function. You can either do that by getting into unexplored chaos theory or making a gigantic lookup table for every single possible bit combination. The former no human being knows how to do, and the latter is impossible for even a 64 bit combination (nevermind an entire image, audio clip, or video).

>> making a gigantic lookup table for every single possible bit combination

You don't need this to achieve zero loss on the training set, though: You only need a lookup table for the images in the train set.

We know that neural networks can do something like this (learning the lookup table) because large networks can get to zero training loss on randomly assigned labels. (I linked the paper a bit further down in the thread.) This means there's some memorization capability in the architecture, even if it's a weird emulation of some memorization strategy that we would consider easy.

The actual mechanism here is probably closer to random projection + nearest neighbor; NNs are not obviously learning crypto functions. But they /are/ learning some kind of lookup mechanism. There's some indication (see Sara Hooker's work) that in practice they use a mixture of 'reasonable' strategies and memorization for long-tail training examples. We don't know /how much/ the leading networks trained on real labels rely on memorization because we don't have any real insight into the learned structures.

(as an aside, we train neural networks for discontinuous functions all the time: Classification is discontinuous, by the nature of the labels. We turn it into a continuous+trainable problem by choosing a probabilistic framing.)
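The "random projection + nearest neighbor" picture above can be made concrete with a toy numpy sketch (all sizes and data made up): with random labels there is nothing to generalize, yet a fixed random fingerprint plus 1-NN lookup hits zero training error while staying at chance on held-out data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random inputs with *random* labels: there is nothing to learn,
# so any zero-training-error "learner" is pure memorization.
n, dim, k = 200, 100, 16
X_tr = rng.normal(size=(n, dim))
y_tr = rng.integers(0, 10, size=n)
X_te = rng.normal(size=(n, dim))
y_te = rng.integers(0, 10, size=n)

# Fixed random projection as a crude fingerprint, then 1-nearest-neighbor
# lookup in fingerprint space (the "lookup table").
P = rng.normal(size=(dim, k))
Z_tr, Z_te = X_tr @ P, X_te @ P

def predict(Z):
    # Squared distances to every training fingerprint; take the closest.
    d = ((Z[:, None, :] - Z_tr[None, :, :]) ** 2).sum(-1)
    return y_tr[d.argmin(axis=1)]

train_acc = (predict(Z_tr) == y_tr).mean()
test_acc = (predict(Z_te) == y_te).mean()
print(train_acc, test_acc)  # train 1.0; test near chance (~0.1)
```

This is only a cartoon of the mechanism, not a claim about what trained networks actually compute internally.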

Okay, but that would only work for examples you already have. All interesting uses of neural networks apply them to unseen inputs. How does your technique work on unseen inputs?

And while we interpret the result of a classification as a 1 or 0, the underlying result is a continuous probability. Even in reality, our training examples are labeled with too much confidence - some labels are vague even for humans. If it approximates a discontinuous function, then it does so by approximating a continuous function. You can read here for more information: https://www.sciencedirect.com/science/article/abs/pii/089360...

Yes, this is the point: When we train a neural network, especially on a classification problem, it has multiple avenues to solve the problem. We know they are capable of ineffectual memorization, as well as some other less ridiculous things. When we train, it's not clear what mix we're getting of 'neural hashing' vs learning abstracted features.

My point up above is that classification problems are too weak, exactly because these kinds of shortcuts are readily available. The leading edge of ML research is over-focused on ImageNet classification in particular.

Ok, so according to your theory we could make this hypothesis: if we applied a neural network to an unseen example (for example, a validation dataset), then we would get accuracy equivalent to picking a label at random. Well, surprise, surprise - we obviously don't get that. So there is clearly more going on than "neural hashing".

You're not answering this problem with unseen data so it's really hard for me to follow your reasoning here.

Has this been implemented? What kinds of hashing functions are you talking about? How would you guarantee the same hash for all the augmentations?

It seems like the approach you describe just moves the complexity of the task solved by neural networks into the hashing function.


"our experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data. This phenomenon is qualitatively unaffected by explicit regularization, and occurs even if we replace the true images by completely unstructured random noise."

What's your point? It's not clear at all from this post.

Fuzzy hashing is a thing and may actually use cryptographic hashing internally.

See for example recent work on hashing for malware detection.

> In practice, these huge models are, in laymans terms, fucking awesome and work really well

A similarly surprising result from an adjacent community, Bayesian Statistics, is that in the case of hierarchical models, increasing your number of parameters can paradoxically reduce overfitting.

The scale of parameters in Bayesian models is nowhere near that of these deep neural nets, but nonetheless this is a similarly shocking result, since typically adding parameters is penalized when model building.

It's a bit more explainable in Bayesian stats since what you're using some parameters for is limiting the impact of more granular parameters (i.e. you're learning a prior probability distribution for the other parameters, which prevents extreme overfitting in cases with less information).

I wouldn't be too surprised if eventually we realized there was a similar cause preventing overfitting in overparameterized ML models.
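The partial-pooling mechanism described above can be sketched in numpy with made-up numbers (this is a crude empirical-Bayes-style shrinkage, not a full hierarchical model): the extra "hyper" parameters (grand mean, between-group variance) shrink each noisy group mean toward the grand mean.

```python
import numpy as np

rng = np.random.default_rng(1)

# 8 groups, 5 noisy observations each (all values made up).
true_means = rng.normal(0.0, 1.0, size=8)
obs = true_means[:, None] + rng.normal(0.0, 2.0, size=(8, 5))

raw = obs.mean(axis=1)   # no pooling: a free mean per group (overfits noise)
grand = raw.mean()       # complete pooling: one shared mean (underfits)

# Partial pooling adds parameters (a learned prior: grand mean plus
# between-group variance) yet *reduces* overfitting by shrinking each
# noisy group mean toward the grand mean.
sigma2 = 2.0 ** 2 / 5                  # sampling variance of each group mean
tau2 = max(raw.var() - sigma2, 1e-6)   # crude between-group variance estimate
w = tau2 / (tau2 + sigma2)             # shrinkage weight in (0, 1)
partial = grand + w * (raw - grand)

print(np.round(raw, 2))
print(np.round(partial, 2))  # each value pulled toward the grand mean
```

Groups with less information get pulled harder toward the prior, which is exactly the "limiting the impact of more granular parameters" effect described above.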

Do you have any good references of this phenomenon in hierarchical models?

Gelman and Hill has a brief intro and some references.

This is likely part of the reason. The only problem is that said models require a lot of data, but humans can learn from a very small number of examples.

Humans are continuously pretrained on a variety of tasks, though. Teaching a kid to say one word takes about a year...

And the very same system that is being trained to say a word is also being trained to recognize intent from intonation all the while. As soon as it can say it, the child will likely use that one word with different tones to mean different things, successfully.

We are insanely complex machines...

Any chance you could share a link to a relevant paper?

I don’t know if it’s correct, but I often think of a classification model as learning the parameters of a Dirichlet distribution, with the final softmax layer being a sample from it.

>In practice, these huge models are, in laymans terms, fucking awesome and work really well e.g. they generalize and work in production. No one understands why.

To add nuance to this, these models are awesome at interpolation, but not so much at extrapolation. Or in different terms, they generalize very well to an IID test set, but don't generalize under (even slight) distribution shift.

The main reason for this is that these models tend to solve classification and regression problem quite differently from how humans do it. Broadly speaking, a large, flexible NN will find a "shortcut", i.e. a simple relation between some part of the input and the output, which may not be informative in the way we want; such as a watermark in the corner of an image, or statistical regularities in textures which disappear in slightly different lighting conditions. See e.g. https://thegradient.pub/shortcuts-neural-networks-love-to-ch...

I think it's fair to say that these models are great when you have an enormous dataset that covers the entire domain, but sub-Google-scale problems are usually still solved by underparametrized models (even at Google).

It depends. It really doesn’t take that much data to train a pretty stunning (if simple) RNN character-level “language model” that beats any n-gram. Or on mnist. ANNs really are a useful tool for a vast class of problems, many of which can be solved with comparatively little data.

Maybe your point stands, and it’s just that some domains need less data, just saying.

>ANNs really are a useful tool for a vast class of problems, many of which can be solved with comparatively little data.

For sure, it all depends on how robust the model needs to be, how strongly it needs to generalize. If your dataset covers the entire domain, you don't need a robust model. If you need strong generalization, then you need to build in stronger priors.

Take f(x) = x^2. If your model only needs to work in finite interval, you just need a decent sample that covers that interval. But if it needs to generalize outside that interval, no amount of parameters will give you good performance. Outside the boundaries of the interval, the NN will either be constant (with a sigmoid activation) or linear (with ReLU type activations).
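The extrapolation claim above can be checked directly without any training: a ReLU network is piecewise linear, so far outside any training interval every unit is frozen on or off and the output is exactly linear. Below is a minimal sketch with a random, untrained 1-d ReLU network (all weights made up):

```python
import numpy as np

rng = np.random.default_rng(42)

# A tiny *random* ReLU network f(x) = w2 . relu(w1*x + b1) + b2, 1-d input.
w1, b1 = rng.normal(size=16), rng.normal(size=16)
w2, b2 = rng.normal(size=16), rng.normal()

def f(x):
    return np.maximum(w1 * x + b1, 0.0) @ w2 + b2

# Far outside any plausible data range, no ReLU changes state, so the
# local slope of f stops changing: pure linear extrapolation, nothing
# like x^2 no matter how many parameters the network has.
s1 = f(1e5 + 1) - f(1e5)
s2 = f(2e5 + 1) - f(2e5)
print(s1, s2)  # identical slopes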

My sister works in the NLP arm of ML and analogized it to the Clever Hans effect.

To add to this, there's a misleading phenomenon that first occurs where the performance actually gets worse with too much data/parameters/epochs, but oddly improves again if you throw even more at the model.

For the interested, this phenomenon is known as (deep) double descent:



(Edit: Oh, the definition appears in the abstract of the linked paper.)
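A common toy setup for seeing double descent (my own minimal sketch, not the linked paper's experiment) is minimum-norm least squares on random Fourier features: training error hits zero once the feature count p reaches the sample count n, test error typically spikes near that interpolation threshold, and then improves again as p grows far past n.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Noisy sine target on [0, 1] (made-up toy problem).
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=n)

x_tr, y_tr = make_data(30)    # n = 30 training points
x_te, y_te = make_data(200)

def fit_predict(p):
    # p random Fourier features; minimum-norm least squares via pinv.
    W = rng.normal(scale=20.0, size=p)
    b = rng.uniform(0, 2 * np.pi, size=p)
    feats = lambda x: np.cos(np.outer(x, W) + b)
    beta = np.linalg.pinv(feats(x_tr)) @ y_tr  # min-norm interpolator once p >= n
    tr = ((feats(x_tr) @ beta - y_tr) ** 2).mean()
    te = ((feats(x_te) @ beta - y_te) ** 2).mean()
    return tr, te

for p in [5, 15, 30, 100, 1000]:
    tr, te = fit_predict(p)
    print(f"p={p:4d}  train MSE={tr:.5f}  test MSE={te:.3f}")
```

The exact shape of the test-error curve depends on the seed and feature scale, but the qualitative pattern (worst test error near p = n, recovery for p >> n) is the double descent phenomenon in miniature.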

Is this the ML equivalent of Dunning–Kruger effect? A model with a bit of data is too afraid of being wrong to be overconfident. A model with a bit more data is overconfident in itself and gets things wrong. Finally, a model with tons and tons of data understands the complexity of the problem set and once again becomes too afraid of being wrong.

Model confidence as reported by softmax probability scores is notoriously noisy and miscalibrated. With larger models and more data the confidence estimation gets more nuanced.

> In practice, these huge models are, in laymans terms, fucking awesome and work really well e.g. they generalize and work in production. No one understands why.

How about the resulting weights? If most of them are close to 0, then that would mean that a part of the training is for NN to learn which of 1.5B parameters are relevant, and which are not.

There is something called the golden ticket theory (maybe mentioned in the paper, I’m on my phone), that says indeed that the large models are effectively ensembles of massive random models, and the top levels of the network pick the one or two that randomly happen to work.

Maybe true, but even then it's only part of the story: kernels in CNNs genuinely seem to learn features like edges and textures.

There are two answers to this. First, empirically we see that the more parameters we add, the better the model performs ==> the weights continue to contribute (and aren't dead).

Second, there is a very popular paper called "The lottery ticket hypothesis" [1] showing that in any network you can find subnetworks that work just as well, i.e., the parameters are redundant. This was written in 2018, which is a long time ago in big-NN world, so I'm not sure how it holds up to current insanity-sized models.


A couple notes...

1) Imagine the loss surface of a given model architecture; each point on the surface corresponds to a full set of weights, and the value at the point is the model loss. So, a billion-dimensional surface, give or take. There's a massive amount of flexibility in that space. Some models in the surface are sparse, but they are adjacent to models which are just as good but not sparse at all. Likewise, if you 'rotate' a sparse model, you can end up with an entirely equivalent dense model. So, you really need additional 'pressure' on the learning problem to ensure you actually get sparsity, even if the sparsity is in some sense natural.

2) IIUC, lottery ticket kinda breaks with larger models/problems. For small enough problems, the initial random projection given by the random starting weights is already good enough to build on. For bigger + more complicated problems, you need to really adapt in early training, and so lottery ticket breaks down.

My take as a 90s math grad (out of touch with modern teaching): Theory is useful to show human society is stagnant.

There’s an infinite number of sentences but our ML models are having tons of “success” as society relies on finite set in daily life; those that instigate commerce.

Like religion relied on an acceptable finite set of sentences, so too does our society. We’re a bunch of weird little missionaries living in one geometric world, still believing in bigger purpose.

ML isn’t really outputting novelty, it’s spewing our own inanity at us, and helping correct some bad math in engineering cases.

We’re easily mesmerized apes.

Super pedantic comment. GPT-3 has 175 Billion parameters. GPT-2 was the 1.5 Billion model.

Thanks for the summary!

I'm not sure if it's related, but I've seen discussions of modern ML methods (in particular those trained using stochastic algorithms...maybe also models with low float precision...?) approximating Bayesian methods. The way I've imagined it is that the training path, by virtue of its stochasticity, resembles MCMC sampling and therefore tends to end up in regions of high posterior volume (the "typical set"), rather than high posterior density. I could see this resulting in a fit with parameters closer to their conditional expectations (in the Bayesian sense), which should be more generalizable to new data, hence fewer issues with overfitting.

A consequence of this would be that if somehow a method were able to successfully find the _global_ loss-function minimum on the training data, it would perform worse on the test set. Fortunately, our optimization methods _don't_ find the global minimum at all.

Can anybody point me to literature on this idea? I don't know if my uninformed interpretation is actually close to what experts are thinking.

You're looking for flat minima / wide basins. (Amusingly, this one actually does go back to Schmidhuber etc.) Explains a lot of phenomena like poorer generalization of second-order optimizers, SGD sometimes working surprisingly better, stochastic weight averaging / EMA, grokking, or patient teachers.

Oh, and I enjoyed reading this primer on the Double Descent Phenomenon for anybody, like me, who hadn't heard of it before: https://openai.com/blog/deep-double-descent/

I have to reread it but the model misspecification interpretation (that highly overparameterized models exhibit DD because they reduce misspecification error) is the only thing I've read in this area that makes some sense to me theoretically.

I think DD is a huge issue for a number of fields and is really underappreciated a lot. Without meaning to sound disrespectful, much of this literature seems a little superficial or dismissive, not aware of the broad implications of the claims often being made.

This is because of the ties between information-theory and statistics/modeling. In some sense, at least in the way I've thought about it, the DD seems to imply some kind of violations of fundamental information theory and comes across to me a bit as if someone in chemistry started claiming that some basic laws of thermodynamics in physics didn't apply anymore. Basically, the DD seems to imply that someone can extract more information from a string than the string contains. If you put it this way, it makes no sense, which is why I think this is such a hugely important issue.

On the other hand, the empirical results are there, so figuring out what's going on is worthwhile and I have an open mind.

This paper seems nice with the misspecification angle, because it is realistic and seems to open a path to some interpretations that might not violate some fundamental identities in IT. Misspecification (mismatched coding in IT) can lead to some weird phenomena that's not always intuitive.

Another thing in the paper that's made clear is that DD might not always happen, and it seems informative to figure out when that's the case.

In the background I have to say I'm still skeptical of the empirical breadth of DD. These weird cases of ML failures due to subtle challenge inputs (errors in identifying Obama based on positioning and ties (?) being one example) seem to me like prime examples of overfitting. I still have a hunch that something about the training and test samples relative to the universe of actual intended samples is at play, or that the whole phenomenon of DD is misleading because the overfitting problem is really about model flexibility versus data complexity, and not necessarily about number of parameters per se versus sample size.

DD disappears when doing Bayesian model averaging: https://arxiv.org/pdf/2002.08791.pdf It seems DD is a phenomenon specific to point estimates.

>overfitting problem is really in terms of model flexibility versus data complexity, and not necessarily in terms of number of parameters per se versus sample size

Yep, well put.

Thanks for the link! I hadn't thought about model averaging in this context but definitely bears on how to interpret this stuff. Interesting to think about too re: connections with model misspecification.

Briefly skimmed the paper and I'm not a ML or math dude but familiar with the double descent problem.

I wonder if there's a good way to measure information content in models and how it scales with model parameters, whether there are any invariants, scaling laws that arise, etc. Reminds me of renormalization group methods, which could be applied to a space of models, etc...

Again, I'm no expert in any of this stuff, just arm-chairing away.

Would it work to measure the size of compressed model?
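Compressed size is a reasonable crude proxy for information content (it upper-bounds Kolmogorov complexity). A minimal sketch with made-up "weight tensors": structured weights (here, mostly zeros, as in a pruned model) compress far better than incompressible noise of the same size.

```python
import gzip

import numpy as np

rng = np.random.default_rng(0)

# Two fake "weight tensors" of identical shape: incompressible noise
# vs. a highly structured (99% zero) tensor, as in a sparse/pruned model.
noisy = rng.normal(size=100_000).astype(np.float32)
sparse = np.where(rng.random(100_000) < 0.01, noisy, 0.0).astype(np.float32)

size_noisy = len(gzip.compress(noisy.tobytes()))
size_sparse = len(gzip.compress(sparse.tobytes()))
print(size_noisy, size_sparse)  # structured weights compress far better
```

The catch is that real trained weights are mostly dense and noise-like at the byte level, so naive gzip of raw floats mostly measures entropy of the low-order mantissa bits; quantization-aware compression would be needed for a sharper estimate.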

I'll come out and say what's on many practitioners' minds:

It could very well be that generations of academics have been WRONG about the relationship between the number of parameters (or complexity) of a model and its ability to generalize to new, previously unseen data, particularly when the data is drawn from many naturally occurring distributions -- as opposed to, say, distributions randomly chosen from the space of all possible distributions, as assumed by many theoretical frameworks (e.g., No Free Lunch theorem). It could be that generations of students have been taught The Wrong Thing™.

In many cases we must increase, not decrease, model complexity to improve generalization!

I'm not surprised. In physics/engineering-related model fitting, there is a double-edged risk of over-parameterization as well as under-parameterization. The former is almost always non-physical - it fits, but the parameters are meaningless as predictors, conclusions, or ties to ANYTHING physical as a design strategy. The latter fits only certain corners of the data space, and if you don't know which ones, you can get yourself into trouble that way as well.

The former can be worse than the latter in terms of meaningful use for engineering.

It's obvious, no? For double descent: the network with a billion parameters is so large that it learns to simulate a neural network itself, which first uses the test dataset to train itself and then produces the final overfitted values??

One can say, hardwork-beats-talent in some cases and in some cases talent-beats-hardwork :P

Talent: a "smarter/clever" model Hardwork: more and more parameters
