
Unfortunately, as things stand, it's well known that behaviors and optimizations in small-scale models fail to replicate in larger models.




Doing hyperparameter sweeps on lots of small models to find the optimal values for each size and fitting scaling laws to predict the hyperparameters to use for larger models seems to work reasonably well. I think https://arxiv.org/abs/2505.01618 is the latest advance in that vein.
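For concreteness, here's a minimal sketch of that fitting step, assuming the usual power-law form; the sweep results below are invented for illustration:

    # Fit a scaling law for the optimal learning rate from small-model sweeps.
    # The sweep numbers are made up; the log-log power-law fit is the point.
    import numpy as np

    params  = np.array([1e7, 3e7, 1e8, 3e8, 1e9])                 # model sizes swept
    best_lr = np.array([1.2e-3, 8.1e-4, 5.3e-4, 3.6e-4, 2.4e-4])  # best LR found per size

    # Fit log(lr) = a + b*log(N), i.e. lr ~ exp(a) * N**b.
    b, a = np.polyfit(np.log(params), np.log(best_lr), deg=1)

    def predict_lr(n_params: float) -> float:
        """Extrapolate the fitted power law to a larger model."""
        return float(np.exp(a) * n_params ** b)

    print(predict_lr(7e10))  # predicted learning rate for a hypothetical 70B model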

The problem is that the eval processes don't really work here if you believe in "Emergent Abilities": https://arxiv.org/abs/2206.07682

Which we probably should not, at least not the "sudden" emergence that those researchers claimed to see.

https://arxiv.org/abs/2304.15004

Good article about why here; this helped me understand a lot:

https://www.wired.com/story/how-quickly-do-large-language-mo...
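A rough illustration of the mirage paper's core point, using a synthetic curve (not data from the paper): if per-token accuracy improves smoothly with scale, a strict exact-match metric over long outputs still looks like a sudden jump.

    import numpy as np

    # Made-up, smoothly improving per-token accuracy as a function of "scale".
    scale = np.logspace(7, 11, 5)            # pretend parameter counts
    per_token_acc = 1 / (1 + 1e9 / scale)    # smooth, no discontinuity anywhere

    # Exact-match on a 30-token answer requires every token to be correct.
    exact_match = per_token_acc ** 30

    for n, p, em in zip(scale, per_token_acc, exact_match):
        print(f"{n:.0e} params: per-token {p:.3f}, exact-match {em:.6f}")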


Why not? It takes models of a certain size to contain xyz neuron/feature.

https://www.youtube.com/watch?v=AgkfIQ4IGaM

That's not a mirage; it's clearly a capability that a smaller model cannot demonstrate. A model with fewer parameters and fewer hidden layers cannot have a neuron that lights up when it detects a face.


Consider a single-neuron model that just pools all pixels in an image together. It's possible for the average activation of this neuron to be exactly the same on faces and non-faces, but extremely unlikely given the large range of possibilities. So in aggregate, this neuron can distinguish faces from non-faces, even though, when you apply it to classifying a particular image, it'll be better than random only by an extremely tiny amount.

As the number of neurons increases, the best face/non-face distinguisher neuron gets better and better, but there's never a size where the model cannot recognize faces at all and then you add just a single neuron that recognizes them perfectly.
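Here's a toy version of that pooled-pixel "neuron", on synthetic noise images where "faces" get a faint brightness offset (all numbers invented):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 20000, 32 * 32                  # images per class, pixels per image

    # Synthetic data: "faces" are noise plus a tiny brightness offset.
    non_faces = rng.normal(0.0, 1.0, size=(n, d))
    faces     = rng.normal(0.0, 1.0, size=(n, d)) + 0.003

    # The one-neuron "model": pool (average) all pixels together.
    score_non  = non_faces.mean(axis=1)
    score_face = faces.mean(axis=1)

    # Per image it's barely better than a coin flip...
    threshold = 0.0015
    acc = 0.5 * ((score_face > threshold).mean() + (score_non <= threshold).mean())
    print(f"single-image accuracy: {acc:.3f}")   # roughly 0.52

    # ...but in aggregate the two classes separate clearly.
    print(f"mean pooled score, faces: {score_face.mean():.4f}, non-faces: {score_non.mean():.4f}")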


> there's never a size where the model cannot recognize faces at all

True

> then you add just a single neuron that recognizes them perfectly

Not true.

Don't think in terms of neurons, think in terms of features. A feature can be spread out over multiple neurons (polysemanticity); I just used a single neuron as a simplified example. But if those multiple neurons perfectly describe the feature, then all of them are important to describe the feature.
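As a toy picture of a feature spread across neurons, think of it as a direction in activation space rather than a single unit (the activation matrix and readout direction here are made up):

    import numpy as np

    rng = np.random.default_rng(1)
    acts = rng.normal(size=(5, 4))          # hidden activations: 5 inputs, 4 neurons

    # A "face" feature encoded as a direction spread over neurons 0, 2 and 3;
    # no single neuron carries it on its own.
    face_direction = np.array([0.6, 0.0, 0.5, 0.62])
    face_direction /= np.linalg.norm(face_direction)

    # Reading out the feature value is a projection onto that direction.
    print(acts @ face_direction)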

The Universal Approximation Theorem implies that a large enough network exists that achieves that goal to any desired accuracy (call that size n or larger), so somewhere between 0 and n neurons you'd get what you want.
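For reference, one standard form of the theorem being invoked (Cybenko 1989 / Hornik 1991, single hidden layer); note it only guarantees that some width n exists, not how accuracy improves as n grows:

    \[
    \forall f \in C(K),\ \forall \varepsilon > 0,\ \exists n,\ \exists \{a_i, b_i, w_i\}_{i=1}^{n}:\quad
    \sup_{x \in K} \Big| f(x) - \sum_{i=1}^{n} a_i\, \sigma(w_i^\top x + b_i) \Big| < \varepsilon,
    \]
    where $K \subset \mathbb{R}^d$ is compact and $\sigma$ is a suitable (e.g. sigmoidal) activation.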


> if those multiple neurons perfectly describe the feature, then all of them are important to describe the feature.

You could remove any one of those neurons and retrain the model from scratch; polysemanticity would increase slightly and performance would decrease slightly, but really only slightly. There are no hard size thresholds, just a spectrum of more or less accurate approximations.
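A rough way to sanity-check that on a toy task (sklearn; the widths are chosen arbitrarily and the synthetic task is a stand-in, not face recognition):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    # Synthetic binary classification task as a stand-in for "face vs. non-face".
    X, y = make_classification(n_samples=4000, n_features=40, n_informative=10, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Retrain from scratch at each width; accuracy should change smoothly,
    # not fall off a cliff when a single hidden unit is removed.
    for width in [32, 31, 30, 16, 8, 4, 2]:
        clf = MLPClassifier(hidden_layer_sizes=(width,), max_iter=500, random_state=0)
        clf.fit(X_tr, y_tr)
        print(f"{width:>2} hidden units: test accuracy {clf.score(X_te, y_te):.3f}")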


Which in itself is very interesting and requires study.

It mostly has to do with sparsity in high-dimensional space. When you scale things to the extreme, everything is very far away from everything else, the space is sparse, and random vectors are very likely to be nearly orthogonal. All of this makes optimization incredibly slow and difficult. Just another facet of the so-called "curse of dimensionality".
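The near-orthogonality part is easy to check directly (dimensions and sample counts picked arbitrarily):

    import numpy as np

    rng = np.random.default_rng(0)

    # Mean |cosine similarity| between pairs of random unit vectors, per dimension.
    for d in [2, 10, 100, 1000, 10000]:
        a = rng.normal(size=(500, d))
        b = rng.normal(size=(500, d))
        a /= np.linalg.norm(a, axis=1, keepdims=True)
        b /= np.linalg.norm(b, axis=1, keepdims=True)
        cos = np.abs((a * b).sum(axis=1))
        print(f"d={d:>5}: mean |cos| ~ {cos.mean():.3f}")   # shrinks roughly like 1/sqrt(d)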

Well-known but not well-understood

That's not universally true. E.g., the GPT-4 tech report pointed out that nearly all their experiments were done on models 1000x smaller than the final model.

Fair point, though I'd argue there's an inherent selection bias here toward improvements that fit a scaling-law curve in the small-model regime.

But why? If we don't know why, then how do we figure it out?



