We keep hearing about these giant models like GPT-3 with 175 billion parameters. Parameters are the things that change when we train a model; you can think of them as degrees of freedom.
If you have a lot of parameters, theory led us to believe that the model would just "overfit" the training data, i.e. memorize it. That's bad, because when new data comes in during production we'd expect the model to not be able to "generalize" to it, i.e. make accurate predictions on data it hasn't seen before, because it has just memorized the training data instead of uncovering the "guiding principles" of the data, so to speak.
In practice, these huge models are, in layman's terms, fucking awesome and work really well, i.e. they generalize and work in production. No one understands why.
This paper is a survey, an overview of what "too many parameters" means and of the research into why these models work even though they shouldn't.
You can solve classification with a hash function: Hash the image, and then just memorize which label goes with which hash. You can try to dodge this obviously dodgy solution by adding augmentation to the dataset. Then you instead learn to find a representation invariant under the set of augmentations, and learn the hash of that representation. It turns out these augmentation-invariant representations are actually pretty good, so we can solve the classification problem in what looks like a general way.
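To make the degenerate solution concrete, here's a minimal sketch of a "hash the image, memorize the label" classifier. Nothing here comes from the paper; the class name and the toy byte-string inputs are purely illustrative.

```python
# Degenerate "classifier": hash each training input and memorize its label.
# Zero training error by construction, zero generalization by construction.
import hashlib

class HashMemorizer:
    def __init__(self):
        self.table = {}  # hash digest -> label

    def _key(self, x: bytes) -> str:
        return hashlib.sha256(x).hexdigest()

    def fit(self, inputs, labels):
        for x, y in zip(inputs, labels):
            self.table[self._key(x)] = y
        return self

    def predict(self, x: bytes):
        # Exact hit for anything seen in training, nothing for anything else.
        return self.table.get(self._key(x))

model = HashMemorizer().fit([b"cat image bytes", b"dog image bytes"], ["cat", "dog"])
assert model.predict(b"cat image bytes") == "cat"        # memorized
assert model.predict(b"slightly different cat") is None  # no generalization at all
```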
However, there are many other classes of problems where the hash problem doesn't exist, because the information density of the outputs is too high to memorize in the same way. Specifically, generative models, and the sorts of predictive/infill problems used for self-supervision. In these spaces, the problems are more like: "Given this pile of augmented input, generate half a megabyte of coherent output." These kinds of problems simply don't overfit: Train a speech separation model on a big dataset, and the train+eval quality metrics will just asymptote their way up and to the right until you run out of training budget.
Sure, it's a potential problem that can appear in the process of implementing a deep learning solution. It's not an insurmountable problem. But the fact that it still appears seems like an indication that the situation in deep learning is more complicated than "overparameterization is not a problem".
What I'm trying to say is that neural networks are "universal approximators of continuous real functions". You can think of them as finding the curve of a function that maps inputs to the expected outputs, and they get their predictive power by matching the underlying "function" of the problem.
Applying a cryptographic hash function is like completely scrambling the underlying function. The only way for a neural network to match it is if it were somehow a universal approximator of discontinuous real functions. You could do that either by getting into unexplored chaos theory or by building a gigantic lookup table for every single possible bit combination. The former no human being knows how to do, and the latter is impossible for even a 64-bit input (never mind an entire image, audio clip, or video).
You don't need this to achieve zero loss on the training set, though: You only need a lookup table for the images in the train set.
We know that neural networks can do something like this (learning the lookup table) because large networks can get to zero training loss on randomly assigned labels. (I linked the paper a bit further down in the thread.) This means there's some memorization capability in the architecture, even if it's a weird emulation of some memorization strategy that we would consider easy.
The actual mechanism here is probably closer to random projection + nearest neighbor; NNs are not obviously learning crypto functions. But they /are/ learning some kind of lookup mechanism. There's some indication (see Sara Hooker's work) that in practice they use a mixture of 'reasonable' strategies and memorization for long-tail training examples. We don't know /how much/ the leading networks trained on real labels rely on memorization because we don't have any real insight into the learned structures.
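As a rough sketch of that "random projection + nearest neighbor" picture (random data standing in for images, made-up sizes, and random labels as in the paper linked elsewhere in the thread), a fixed random feature map plus a lookup already gives zero training error without learning anything general:

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, dim, proj_dim = 500, 1024, 128        # e.g. flattened images -> small feature space

X_train = rng.normal(size=(n_train, dim))      # stand-in "images"
y_train = rng.integers(0, 10, size=n_train)    # labels carry no signal at all

P = rng.normal(size=(dim, proj_dim)) / np.sqrt(proj_dim)   # fixed random projection
Z_train = X_train @ P                                       # memorized "features"

def predict(x):
    # 1-nearest neighbor in the projected space: pure lookup, no "understanding".
    z = x @ P
    nearest = np.argmin(np.linalg.norm(Z_train - z, axis=1))
    return y_train[nearest]

train_acc = np.mean([predict(x) == y for x, y in zip(X_train, y_train)])
print(train_acc)   # 1.0: perfect memorization of arbitrary labels
```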
(as an aside, we train neural networks for discontinuous functions all the time: Classification is discontinuous, by the nature of the labels. We turn it into a continuous+trainable problem by choosing a probabilistic framing.)
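A small numpy sketch of that probabilistic framing (illustrative numbers only): the labels stay discrete, but the model's output and the loss are continuous in the logits, and therefore in the weights, which is what makes the problem trainable by gradient descent.

```python
import numpy as np

def softmax(logits):
    # Continuous map from raw scores to a probability distribution over classes.
    e = np.exp(logits - logits.max())
    return e / e.sum()

def cross_entropy(logits, label_index):
    # Smooth surrogate for the discrete "right or wrong" classification decision.
    return -np.log(softmax(logits)[label_index])

logits = np.array([2.0, 0.5, -1.0])   # raw network outputs for 3 classes
print(softmax(logits))                 # continuous probabilities, not a hard 0/1
print(cross_entropy(logits, 0))        # small loss when the true class is likely
```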
And while we interpret the result of a classification as a 1 or 0, the underlying output is a continuous probability. In reality, our training examples are labeled with too much confidence anyway - some labels are vague even for humans. If the network approximates a discontinuous function, it does so by approximating a continuous one. You can read here for more information: https://www.sciencedirect.com/science/article/abs/pii/089360...
My point up above is that classification problems are too weak, exactly because these kinds of shortcuts are readily available. The leading edge of ML research is over-focused on ImageNet classification in particular.
You're not addressing the problem of unseen data, so it's really hard for me to follow your reasoning here.
It seems like the approach you describe just moves the complexity of the task solved by neural networks into the hashing function.
"our experiments establish that state-of-the-art
convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data. This phenomenon is qualitatively unaffected by explicit regularization, and occurs
even if we replace the true images by completely unstructured random noise."
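Here's a toy-scale version of that randomized-label experiment, with a small sklearn MLP standing in for a state-of-the-art convnet; the sizes and hyperparameters are arbitrary, but the qualitative point (near-perfect training accuracy on pure noise) usually reproduces:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))         # "completely unstructured random noise"
y = rng.integers(0, 10, size=200)      # random labels, no signal to learn

# Far more parameters than training examples.
clf = MLPClassifier(hidden_layer_sizes=(512, 512), max_iter=5000, random_state=0)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))   # typically ~1.0: pure memorization
```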
See for example recent work on hashing for malware detection.
A similarly surprising result from an adjacent community, Bayesian Statistics, is that in the case of hierarchical models, increasing your number of parameters can paradoxically reduce overfitting.
The number of parameters in Bayesian models is nowhere near that of these deep neural nets, but this is nonetheless a similarly shocking result, since adding parameters is typically penalized when model building.
It's a bit more explainable in Bayesian stats, since some of the parameters are there to limit the impact of the more granular parameters (i.e. you're learning a prior probability distribution for the other parameters, which prevents extreme overfitting in cases with less information).
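Here's a minimal sketch of that shrinkage mechanism in the simplest normal-normal case, with invented numbers: the "extra" hyperparameters (a grand mean and a between-group variance) are exactly what pulls sparsely observed groups toward the pooled estimate instead of letting them overfit their few data points.

```python
import numpy as np

group_means = np.array([4.2, 7.9, 1.1, 5.5])   # raw per-group sample means
group_sizes = np.array([3, 50, 2, 40])          # some groups have very little data
sigma2 = 4.0                                     # assumed within-group variance

# The "hyper" parameters, learned from all groups jointly.
grand_mean = np.average(group_means, weights=group_sizes)
tau2 = max(group_means.var() - sigma2 / group_sizes.mean(), 1e-3)  # crude between-group variance

# Shrinkage: groups with little data get pulled strongly toward the grand mean.
shrink = (sigma2 / group_sizes) / (sigma2 / group_sizes + tau2)
pooled_means = shrink * grand_mean + (1 - shrink) * group_means
print(pooled_means)
```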
I wouldn't be too surprised if we eventually realized there was a similar cause preventing overfitting in overparameterized ML models.
We are insanely complex machines...
To add nuance to this, these models are awesome at interpolation, but not so much at extrapolation. Or in different terms, they generalize very well to an IID test set, but don't generalize under (even slight) distribution shift.
The main reason for this is that these models tend to solve classification and regression problems quite differently from how humans do. Broadly speaking, a large, flexible NN will find a "shortcut", i.e. a simple relation between some part of the input and the output, which may not be informative in the way we want: such as a watermark in the corner of an image, or statistical regularities in textures that disappear in slightly different lighting conditions. See e.g. https://thegradient.pub/shortcuts-neural-networks-love-to-ch...
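Here's a tiny synthetic illustration of that shortcut failure mode (not taken from the linked article): the "watermark" is a single column that happens to equal the label in training, the model leans on it, and accuracy collapses under a shift where the watermark disappears.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
y_train = rng.integers(0, 2, size=n)
real_features = y_train[:, None] * 0.1 + rng.normal(size=(n, 20))  # weak genuine signal
watermark = y_train[:, None].astype(float)                          # spurious, perfect in training
X_train = np.hstack([real_features, watermark])

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Slightly shifted test distribution: same genuine signal, but the watermark is gone.
y_test = rng.integers(0, 2, size=n)
real_test = y_test[:, None] * 0.1 + rng.normal(size=(n, 20))
X_test = np.hstack([real_test, np.zeros((n, 1))])

print("train accuracy:", clf.score(X_train, y_train))       # typically ~1.0 (rides the shortcut)
print("shifted test accuracy:", clf.score(X_test, y_test))  # typically close to chance
```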
I think it's fair to say that these models are great when you have an enormous dataset that covers the entire domain, but sub-Google-scale problems are usually still solved by underparametrized models (even at Google).
Maybe your point stands, and it’s just that some domains need less data, just saying.
For sure, it all depends on how robust the model needs to be, how strongly it needs to generalize. If your dataset covers the entire domain, you don't need a robust model. If you need strong generalization, then you need to build in stronger priors.
Take f(x) = x^2. If your model only needs to work in a finite interval, you just need a decent sample that covers that interval. But if it needs to generalize outside that interval, no amount of parameters will give you good performance. Outside the boundaries of the interval, the NN will either be constant (with a sigmoid activation) or linear (with ReLU-type activations).
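A quick sketch of that f(x) = x^2 example with an off-the-shelf ReLU network (sklearn's MLPRegressor, arbitrary sizes): inside the sampled interval the fit is fine; outside it the prediction is forced to be roughly linear, so no amount of extra parameters rescues the extrapolation.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, size=(2000, 1))   # samples only cover [-1, 1]
y_train = x_train.ravel() ** 2

model = MLPRegressor(hidden_layer_sizes=(256, 256), max_iter=3000, random_state=0)
model.fit(x_train, y_train)

for x in [0.5, 0.9, 2.0, 5.0]:
    pred = model.predict([[x]])[0]
    print(f"x={x}  true={x**2:.2f}  predicted={pred:.2f}")
# Inside [-1, 1] the prediction tracks x^2; at x=2 or x=5 the ReLU net
# extrapolates roughly linearly and the error typically blows up.
```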
(Edit: Oh, the definition appears in the abstract of the linked paper.)
How about the resulting weights? If most of them are close to 0, then that would mean that part of the training is for the NN to learn which of the billions of parameters are relevant, and which are not.
Maybe true, but even then it's only part of the story: kernels in CNNs genuinely seem to learn features like edges and textures.
Second, there is a very popular paper called "The Lottery Ticket Hypothesis" showing that in any network you can find subnetworks that work just as well, i.e. the parameters are redundant.
This was written in 2018, which is a long time ago in big NN world, so I'm not sure how it holds up to current insanity sized models.
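As a crude illustration of the redundancy claim (this is plain magnitude pruning after training, not the full lottery-ticket procedure, which also rewinds the surviving weights to their original initialization): zeroing out most of the smallest weights usually moves accuracy very little.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(256, 256), max_iter=2000, random_state=0)
clf.fit(X_train, y_train)
print("dense test accuracy: ", clf.score(X_test, y_test))

# Magnitude-prune 80% of the weights in each layer, in place.
for W in clf.coefs_:
    cutoff = np.quantile(np.abs(W), 0.8)
    W[np.abs(W) < cutoff] = 0.0
print("pruned test accuracy:", clf.score(X_test, y_test))  # typically only a small drop
```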
1) Imagine the loss surface of a given model architecture; each point on the surface corresponds to a full set of weights, and the value at the point is the model loss. So, a billion-dimensional surface, give or take. There's a massive amount of flexibility in that space. Some models in the surface are sparse, but they are adjacent to models which are just as good but not sparse at all. Likewise, if you 'rotate' a sparse model, you can end up with an entirely equivalent dense model (there's a small numerical sketch of this after point 2). So, you really need additional 'pressure' on the learning problem to ensure you actually get sparsity, even if the sparsity is in some sense natural.
2) IIUC, lottery ticket kinda breaks with larger models/problems. For small enough problems, the initial random projection given by the random starting weights is already good enough to build on. For bigger + more complicated problems, you need to really adapt in early training, and so lottery ticket breaks down.
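And a quick numerical check of the "rotate a sparse model into an equivalent dense one" intuition from point 1, for the simplest possible case of a single linear layer: W x = (W R^T)(R x) for any orthogonal R, so a sparse weight matrix has dense counterparts computing exactly the same function on correspondingly rotated inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
W = np.zeros((4, d))
W[:, :3] = rng.normal(size=(4, 3))            # sparse: only 3 of 16 input columns used

R, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random orthogonal matrix
W_dense = W @ R.T                             # no visible sparsity left

x = rng.normal(size=d)
print(np.allclose(W @ x, W_dense @ (R @ x)))  # True: identical function, dense weights
```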
There’s an infinite number of sentences, but our ML models are having tons of “success” because society relies on a finite set in daily life: those that instigate commerce.
Just as religion relied on an acceptable finite set of sentences, so too does our society. We’re a bunch of weird little missionaries living in one geometric world, still believing in a bigger purpose.
ML isn’t really outputting novelty, it’s spewing our own inanity at us, and helping correct some bad math in engineering cases.
We’re easily mesmerized apes.
A consequence of this would be that if somehow a method were able to successfully find the _global_ loss-function minimum on the training data, it would perform worse on the test set. Fortunately, our optimization methods _don't_ find the global minimum at all.
Can anybody point me to literature on this idea? I don't know if my uninformed interpretation is actually close to what experts are thinking.
I think double descent (DD) is a huge issue for a number of fields and is really underappreciated. Without meaning to sound disrespectful, much of this literature seems a little superficial or dismissive, unaware of the broad implications of the claims often being made.
This is because of the ties between information theory and statistics/modeling. In some sense, at least in the way I've thought about it, DD seems to imply some kind of violation of fundamental information theory, and it comes across to me a bit as if someone in chemistry started claiming that some basic laws of thermodynamics in physics didn't apply anymore. Basically, DD seems to imply that someone can extract more information from a string than the string contains. Put that way, it makes no sense, which is why I think this is such a hugely important issue.
On the other hand, the empirical results are there, so figuring out what's going on is worthwhile and I have an open mind.
This paper seems nice with the misspecification angle, because it is realistic and seems to open a path to interpretations that might not violate fundamental identities in IT. Misspecification (mismatched coding in IT) can lead to weird phenomena that aren't always intuitive.
Another thing in the paper that's made clear is that DD might not always happen, and it seems informative to figure out when that's the case.
In the background I have to say I'm still skeptical of the empirical breadth of DD. These weird cases of ML failures on subtle challenge inputs (the errors in identifying Obama based on positioning and ties (?), for example) seem to me like prime examples of overfitting. I still have a hunch that something about the training and test samples relative to the universe of actually intended samples is at play, or that the whole phenomenon of DD is misleading because the overfitting problem is really in terms of model flexibility versus data complexity, and not necessarily in terms of number of parameters per se versus sample size.
>overfitting problem is really in terms of model flexibility versus data complexity, and not necessarily in terms of number of parameters per se versus sample size
Yep, well put.
I wonder if there's a good way to measure information content in models and how it scales with model parameters, whether there are any invariants or scaling laws that arise, etc. Reminds me of renormalization group methods, which could be applied to a space of models, etc...
Again, I'm no expert in any of this stuff, just arm-chairing away.
It could very well be that generations of academics have been WRONG about the relationship between the number of parameters (or complexity) of a model and its ability to generalize to new, previously unseen data, particularly when the data is drawn from many naturally occurring distributions -- as opposed to, say, distributions randomly chosen from the space of all possible distributions, as assumed by many theoretical frameworks (e.g., No Free Lunch theorem). It could be that generations of students have been taught The Wrong Thing™.
In many cases we must increase, not decrease, model complexity to improve generalization!
The former can be worse than the latter in terms of meaningful use for engineering.
Talent: a "smarter/clever" model
Hard work: more and more parameters