Grokking is certainly an interesting phenomenon, but have practical applications of it been discovered yet?
I remember seeing grokking demonstrated for MNIST (are there any other non-synthetic datasets for which it has been shown?), but the authors of that paper had to shrink the training set and ended up with test accuracy far below state of the art.
I'm very interested in this research, just curious about how practically relevant it is (yet).
My gut instinct from reading about the phenomenon says that a “grokked” model of X parameters on Y tokens is not going to outperform an “ungrokked” model with 2X parameters on 2Y tokens - since “grokking” uses the same resources as parameter and token scaling, it’s simply not a competitive scaling mechanism at the moment. It might make sense in some applications where some other hard limit (e.g. memory capacity at inference time) occurs before your resource limit AND you would still see good returns on improvements in quality, but I suspect those are still fairly narrow and/or rare applications.
According to https://arxiv.org/abs/2405.15071, their grokked model outperformed GPT-4 and Gemini 1.5 on the reasoning task.
We can then argue about whether the task makes sense and whether the conclusion holds for other use cases, but I think grokking can be useful.
Improving generalization in deep learning is a big deal. The phenomenon is academically interesting either way, but e.g. making SOTA nets more economical with training data seems like a practical result that might be entirely within reach.
i think y'all are both right. grokking is a phenomenon that by definition applies to severely overfit neural networks, which is a very different regime than modern ML - but we might learn something from this that we can use to improve regularization
grokking doesn't and will not have practical uses, imo - it is just an experiment that revealed cool things that we mostly already suspected about implicit regularization
however, techniques we learn from grokking about implicit regularization might be helpful for the training regimes we actually use
I missed the beginning of the story. Why and when does grokking occur? It seems to be a case of reaching a new basin, casting doubt on the shallow basin hypothesis in over-parameterized neural networks? The last I checked, all the extrema in such models were supposed to be good and easy to reach?
i've worked in this field for 6 years and have never heard of the 'shallow basin hypothesis', care to explain more? is it just the idea that there are many good solutions that can be reached in very different parts of parameter space?
all that grokking really means is that the 'correct', generalizable solution is often simpler than the overfit 'memorize all the datapoints' solution, so if you apply some sort of regularization to a model that you overfit, the regularization will make the memorized solution unstable and you will eventually tunnel over to the 'correct' solution
actual DNNs nowadays are usually not obviously overfit because they are trained on only one epoch
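For anyone who wants to see this concretely, here is a minimal sketch of the kind of setup grokking is usually reported in: a small overparameterized net on modular addition, a deliberately small training fraction, and weight decay as the regularizer that eventually destabilizes the memorized solution. The architecture and hyperparameters below are illustrative guesses, not taken from any of the linked papers.

```python
import torch
import torch.nn as nn

P = 97  # modulus; the task is predicting (a + b) mod P
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))  # all (a, b) pairs
labels = (pairs[:, 0] + pairs[:, 1]) % P

# Use only ~30% of all pairs for training; hold out the rest.
perm = torch.randperm(len(pairs))
split = int(0.3 * len(pairs))
train_idx, test_idx = perm[:split], perm[split:]

model = nn.Sequential(
    nn.Embedding(P, 128),     # shared embedding for both operands
    nn.Flatten(start_dim=1),  # concatenate the two operand embeddings
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, P),
)
# Weight decay is the regularizer doing the work here; without it the
# memorized solution tends to stay put.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

for step in range(50_000):  # train far past the point where train accuracy saturates
    opt.zero_grad()
    loss = loss_fn(model(pairs[train_idx]), labels[train_idx])
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        with torch.no_grad():
            test_acc = (model(pairs[test_idx]).argmax(-1) == labels[test_idx]).float().mean()
        print(f"step {step}: train loss {loss.item():.4f}, test acc {test_acc:.3f}")
```

The typical signature is train accuracy hitting 100% early while test accuracy sits near chance for a long stretch, then jumping up much later.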
There's also a very interesting body of work on merging trained models, such as by interpolating between points in weight space, which relates to the concept of "basins" of similar solutions. Skim the intro of this if you're interested in learning more: https://arxiv.org/abs/2211.08403
Yes, you both understood what I meant. I just coined the term, having in mind illustrations like Fig. 1 in Low-Pass Filtering SGD for Recovering Flat Optima in the Deep Learning Optimization Landscape (https://proceedings.mlr.press/v151/bisla22a.html)
cheers! i'm familiar with those first two papers, just not with the specific term. my intuition was more relatively deep points connected by tunnels than shallow basin - but it might just be the difficulty of describing high dimensional spaces
How does this differ from momentum in practice? Gradient momentum already applies an exponential-decay average to the gradients. The authors discuss how their approach differs from momentum in its formula, but not how it differs in practice. Essentially, momentum, Adam, and the other optimizers built on gradient moments have already explored this intellectual space, so I'm not sure why this paper exists unless it has some practical application on top of existing practice.
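To make the practical contrast concrete, here is my rough reading of the difference (a sketch, not the authors' code): momentum replaces the update direction with a running average of the gradients, while the gradient-filtering idea keeps the raw gradient and adds an amplified copy of its slow, low-frequency component on top before the usual optimizer consumes it. `alpha` and `lam` are illustrative values, not numbers from the paper.

```python
import torch

def apply_grad_lowpass(model, ema_state, alpha=0.98, lam=2.0):
    """Mutate p.grad in place: g <- g + lam * EMA(g).

    Unlike momentum, the raw gradient is kept; the filtered component is
    added on top and the normal optimizer (SGD/Adam) runs unchanged.
    """
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        if name not in ema_state:
            ema_state[name] = torch.zeros_like(p.grad)
        ema_state[name].mul_(alpha).add_(p.grad, alpha=1 - alpha)  # update EMA
        p.grad.add_(ema_state[name], alpha=lam)                    # amplify slow component

# usage inside a normal training loop, before optimizer.step():
#   loss.backward()
#   apply_grad_lowpass(model, ema_state)   # ema_state = {} created once, outside the loop
#   optimizer.step()
```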
I have a suspicion that this technique will prove most valuable for market-oriented datasets (like price-related time series), where there isn't the massive data scale of text corpora, and where there are very tight limits on the amount of training data because you only want to include recent data to reduce the chance of market regime changes. This approach seems to shine when you don't quite have enough training data to fully map out the general case, but can get lucky and fall into it if you train naively for long enough.
Why only MNIST and a graph CNN? Those are small and somewhat odd choices. Scale these days should be at least 100-million-parameter models and something like OpenWebText as a dataset, in my opinion. Not sure what the SoTA is for vision, but the same argument applies there.
This paper is from a small group at an academic institution. They are trying to innovate in the idea space and are probably quite compute constrained. But for proving out ideas, smaller problems make for easier analysis, even leaving compute resources aside. Not all research can jump straight to SOTA applications. It looks quite interesting, and I wouldn't be surprised to see it applied to larger problems soon.
Baseline time to grok something looks to be around 1000x normal training time so make that $20k per attempt. Probably takes a while too. Their headline number (50x faster than baseline, $400) looks pretty doable if you can make grokking happen reliably at that speed.
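Back-of-the-envelope on those numbers (my own arithmetic, assuming a normal training run on this kind of task costs on the order of $20):

```latex
\underbrace{\$20}_{\text{normal run}} \times 1000 \approx \$20\text{k per grokking attempt},
\qquad
\frac{\$20\text{k}}{50} = \$400 \text{ with the claimed } 50\times \text{ speedup}.
```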
I’ve been in a small group at an academic institution. With our meager resources we trained larger models than this on many different vision problems. I personally train larger LLMs than this on OpenWebText using a few 4090s (not work related). Is that too much for a small group?
MNIST is solvable using two pixels. It shouldn’t be one of two benchmarks in a paper, again just in my opinion. It’s useful for debugging only.
I thought so at first, but the repo's[0] owner and the first name listed in the article has Seoul National University on their Github profile.
Far away from a small academic institution.
Oh hm, so they are. I thought they were binary because they used a digital pen to create them, IIRC, and logistic regression is always the baseline; but checking, they technically are grayscale and people don't always binarize them. So I guess information-theoretically, if they are 0-255 valued, then 2 pixels could potentially let you classify pretty well if sufficiently pathological.
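For what it's worth, the information-theoretic headroom is there in principle, though whether any two real MNIST pixels actually carry that information is a separate empirical question:

```latex
2 \text{ pixels} \times \log_2 256 = 16 \text{ bits} \;\gg\; \log_2 10 \approx 3.32 \text{ bits needed to pick one of ten classes}.
```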
Grokking may not even occur for datasets of that scale. Even the MNIST experiments require dropping the training data size from 50k examples to 1k. The reason for this is that the phenomenon seems to occur at a critical zone of having just barely enough training data to make generalization possible. See https://arxiv.org/abs/2205.10343 for details.
Even figuring out how to induce grokking behavior on a 100M model or OpenWebText would be a big leap in the understanding of grokking. It's perfectly reasonable for a paper like this to show results on the standard tasks for which grokking has already been characterized.
It’s because that’s where the effect is showing up right now. This is a situation where the analogy to pre-paradigmatic optics is pretty strong: if your telescope for taking pictures of Jupiter was having problems with rainbow fringes, you'd design a diffraction grating to investigate the fringes.
Several reasons - computational effort, effort it takes to reproduce results on complex datasets, academic publishing model.
Academic research is often about making incremental steps and limiting uncertainty.
Making something work for MNIST is already so much work, researchers don’t have the time, money, and energy to run experiments for 10 datasets.
Complex datasets are much harder to get a proper model trained on - larger images, more classes, harder tasks, etc.
Also, as soon as you run your experiments on more datasets, you create an opportunity for reviewers to take you down - “why didn’t you test it on this other dataset?”