This paper is from a small group at an academic institution. They are trying to innovate in the idea space and are probably quite compute constrained. But even leaving compute aside, smaller problems make for easier analysis when you're proving out ideas. Not all research can jump straight to SOTA applications. It looks quite interesting, and I wouldn't be surprised to see it applied to larger problems soon.
Baseline time to grok something looks to be around 1000x normal training time, so call that $20k per attempt. It probably takes a while, too. Their headline number (50x faster than baseline, so $400) looks pretty doable if you can make grokking happen reliably at that speed.
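Just to spell out the arithmetic behind those figures (the $20-per-normal-run number is my inference from the $20k and 1000x figures above, not something stated in the paper):

```python
# Back-of-envelope check of the cost figures above.
# Assumption: a normal (non-grokking) training run costs ~$20,
# inferred from the $20k-per-attempt and 1000x figures.
baseline_grok_cost = 20_000            # $ per grokking attempt
normal_run_cost = baseline_grok_cost / 1000  # ≈ $20 per normal run
speedup = 50                           # claimed headline speedup

print(normal_run_cost)                 # → 20.0
print(baseline_grok_cost / speedup)    # → 400.0
```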
I’ve been in a small group at an academic institution. With our meager resources we trained larger models than this on many different vision problems. I personally train larger LLMs than this on OpenWebText using a few 4090s (not work related). Is that too much for a small group?
MNIST is solvable using two pixels. It shouldn't be one of only two benchmarks in a paper, again just in my opinion; it's useful for debugging only.
I thought so at first, but the repo's[0] owner, who is also the first author listed in the article, has Seoul National University on their GitHub profile.
That's far from a small academic institution.
Oh hm, so they are. I thought the images were binary because, IIRC, they were created with a digital pen, and logistic regression is always the baseline; but checking, they technically are grayscale, and people don't always binarize them. So I guess information-theoretically, if the pixels are 0-255 valued, then 2 of them could potentially let you classify pretty well, if the data were sufficiently pathological.
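For what it's worth, the capacity argument checks out on paper: two 8-bit grayscale pixels carry up to 16 bits, comfortably more than the ~3.32 bits needed to name one of MNIST's 10 classes. Whether two *fixed* pixel locations actually separate the classes on real MNIST is an empirical question this doesn't settle.

```python
import math

# Two 8-bit pixels can take 256^2 = 65536 joint values → 16 bits.
bits_in_two_pixels = math.log2(256 ** 2)

# Distinguishing 10 digit classes needs log2(10) ≈ 3.32 bits.
bits_needed = math.log2(10)

print(bits_in_two_pixels)        # → 16.0
print(round(bits_needed, 2))     # → 3.32
```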