In a very short time, transformers have gone from under 1B, to 1.5B, to 3B, to 5B, to 175B, and now 600B parameters. 1T is only, what, like 67% more parameters, and therefore likely to be achieved in the short term. In fact, the authors of this paper tried 1T but ran into numerical issues that they will surely address soon. Not long after someone crosses 1T, expect 10T to become the next target. And why not? The best-funded AI research groups are in a friendly competition to build the biggest, baddest, meanest m-f-ing models the world has ever seen.
Scores continue to increase with diminishing returns, which is all fine and nice, but more importantly it seems we should expect to see machine-generated text getting much better from a qualitative standpoint -- that is, becoming less and less distinguishable from a lot of human output. That has been the trend so far.
We live in interesting times.
Otherwise, Google already had a 137B parameter model in 2017:
As to comparing parameter counts, I disagree with you. I think it's perfectly OK to compare parameter counts for different kinds of models. It would also be perfectly OK to compare, say, computational efficiency per parameter in each forward pass (which for this model is impressive), but that wasn't the focus of my comment above.
Finally, you're right that I didn't mention all the interim parameter counts that we have seen below 600B in all transformer variants. The list would have been way too long had I tried to include every figure!
The 100's of trillions of connections (synapses) in the human brain are sparsely used -- i.e., your entire brain doesn't light up in response to every single stimulus. But we still talk about 100's of trillions of synapses when we refer to the size of the human brain's connectome. It's a perfectly valid way of measuring model size.
More to your point, the authors measure the computational cost of training in Table 3 of the paper in TPU-core-years for the various mixture-of-expert models, and compare them to an always-densely-used variant.
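For anyone who wants to see why parameter count and per-token compute come apart in a mixture-of-experts layer, here is a rough sketch. Everything in it is an assumption for illustration: the function name `moe_ffn_params`, the layer sizes, and the simplification that each expert is a two-matrix feed-forward block with a linear router in front. None of the numbers are taken from the paper.

```python
def moe_ffn_params(d_model, d_ff, num_experts, top_k):
    """Return (total, active_per_token) weight counts for one MoE feed-forward layer.

    Assumes each expert holds two matrices (d_model x d_ff and d_ff x d_model,
    biases ignored) and the router is a single d_model x num_experts matrix.
    """
    per_expert = 2 * d_model * d_ff
    router = d_model * num_experts
    total = num_experts * per_expert + router
    # Each token only passes through top_k experts, plus the router itself.
    active = top_k * per_expert + router
    return total, active


# Hypothetical sizes chosen for illustration only.
total, active = moe_ffn_params(d_model=1024, d_ff=8192, num_experts=2048, top_k=2)
print(f"total weights in layer: {total:,}")
print(f"weights used per token: {active:,}")
print(f"fraction active:        {active / total:.4%}")
```

Under these made-up sizes the layer stores tens of billions of weights but touches only a tiny fraction of them for any one token, which is the gap both sides of this argument are pointing at.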
Obviously you can compare parameter counts if you really want to, but from a technical point of view training a densely activated model is a much bigger feat. Also, I have personally spoken to one of the authors of this paper and they said sparsely activated models tend to do better on tasks that require knowledge but not on tasks that require intelligence (e.g. GLUE).
Otherwise, as I mentioned elsewhere on this page, we routinely describe the size of the human brain in terms of numbers of synapses (connections), even though they are sparsely activated. Only a small subset of your brain 'lights up' for a given input. Number of parameters (connections) is a perfectly sensible way to measure model size.
Anyway, I expect we will see both much larger sparsely and densely activated models going forward. We live in interesting times :-)
Personally, I think the friendly race to build bigger models is a great development. As I mentioned above, it seems to be leading to models that generate text/sequences that are qualitatively much better.
With a cursory analysis, it's not obvious whether DeepL is better than Google Translate any more.
It does appear that at the initial, resource-intensive stages of tech like NLP, big tech is primed to pave the way. We saw this happen across cloud, AI more generally, storage etc. But big tech then begins focusing on making the tech accessible to industry value chains (Azure, AWS, Amazon's AI services etc.). As the industry matures there's more room for specialized startups/companies to enter the space and capture lucrative niches - that's exactly what Snowflake did for cloud.
Definitely see this kind of scale as a step toward a more robust, mature industry if anything. Better it move forward than not.
In 2013 AWS augmented its core cloud offering with the introduction of Redshift, a ‘data warehousing as a service’ solution. Redshift bundled compute and storage, which made it hard for customers to scale either component separately in a cost-efficient manner. Not having the option to unbundle compute and storage was inconsistent with the flexibility that cloud had become known for.
Snowflake’s solution split storage, compute, and services into separate layers, allowing them to scale independently and achieve greater cost efficiencies. By offering that flexibility it was able to better address the requirements of a wider range of customers, who had previously been limited to the more restrictive bundled options like Redshift.
We've barely scratched the surface of what's possible. Even if Moore's Law was dead (though it seems that TSMC may keep it alive for a bit longer) there are huge gains to be had when co-designing models and hardware. Stuff like https://www.cerebras.net/ is the direction I expect things to go.
Still impressive, don't get me wrong, but I am starting to believe that NLP will be dominated increasingly by the big players, since they are the only ones who can train a 1 TRILLION parameter model (they show that in the paper). I can't even do inference with a 36 layer, 2048 neuron per layer network on my RTX 2080 Ti. Sad....
Not even for a single instance? Your GPU has 11GB of RAM. Spread over 36 x 2048 neurons that is roughly 150 KB per neuron. Why isn't that enough? Is the input really large, or does each neuron have very high precision?
What have I missed?
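A back-of-envelope version of that per-neuron budget, in case it helps pin down where the memory goes. This is only a sketch under loud assumptions: decimal gigabytes, fp32 weights, and a plain fully connected stack where each neuron has 2048 inputs plus a bias. The network described above may well be something else (a transformer has attention blocks and embeddings on top of this), and activations and framework overhead are not counted at all.

```python
GPU_BYTES = 11 * 10**9      # 11 GB card, decimal gigabytes (use 2**30 for GiB)
LAYERS = 36
WIDTH = 2048                # neurons per layer
BYTES_PER_WEIGHT = 4        # assumption: fp32 weights

neurons = LAYERS * WIDTH
bytes_per_neuron = GPU_BYTES / neurons

# Assumption: a plain dense stack, each neuron fed by WIDTH inputs plus one bias.
dense_weights = LAYERS * (WIDTH * WIDTH + WIDTH)
dense_bytes = dense_weights * BYTES_PER_WEIGHT

print(f"neurons: {neurons:,}")
print(f"memory budget per neuron: {bytes_per_neuron / 1e3:.0f} kB")
print(f"fp32 weights for a dense stack: {dense_bytes / 1e9:.2f} GB")
```

If the weights of a dense stack this size fit comfortably, the bottleneck is more likely the architecture itself, the sequence length, or the batch size than the raw neuron count.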
A 1 trillion parameter model should not be far off. That is about the same number of parameters as a house mouse has synapses.
We will be around 1% of the way to human brain complexity (well, probably not, but it is fun to think about).
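The arithmetic behind those two comparisons, spelled out. The synapse counts are order-of-magnitude figures as used in this discussion (roughly a trillion for a mouse, the low end of hundreds of trillions for a human); treat them as assumptions, not measurements.

```python
# Order-of-magnitude figures as quoted in this discussion; assumptions, not measurements.
MODEL_PARAMS = 1e12        # the hypothetical 1T-parameter model
MOUSE_SYNAPSES = 1e12      # taken as roughly equal to the model
HUMAN_SYNAPSES = 1e14      # low end of the hundreds-of-trillions range

print(f"model / mouse: {MODEL_PARAMS / MOUSE_SYNAPSES:.0%}")
print(f"model / human: {MODEL_PARAMS / HUMAN_SYNAPSES:.0%}")
```

Pick a higher human estimate and the percentage shrinks accordingly, so the 1% is best read as an upper-end guess.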
On the other hand, we don't have a robot body to house the model in. Without embodiment it won't be able to learn to interact with the world like us.
Thirdly, in humans, specific priors have been baked into the brain by evolution (data symmetries and efficiencies). We don't know all of them yet, or how to replicate them. We do rely on translation invariance for images, time shift invariance for sequences, and permutation invariance for some set and graph neural nets, but those are not all the priors the brain makes use of.
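To make "baking in a prior" concrete, here is a toy sketch of the permutation-invariance case. It is not any particular published architecture: `phi`, `set_encoder`, and `sequence_encoder` are made-up names, and the feature map is arbitrary. The point is only that sum pooling gives the same answer for every ordering of the input by construction, while an order-sensitive encoder would have to learn that from data.

```python
import itertools

def phi(x):
    # Per-element feature map; an arbitrary fixed nonlinearity for illustration.
    return (x, x * x)

def set_encoder(items):
    # Sum pooling over per-element features: reordering the inputs cannot change the sums.
    feats = [phi(x) for x in items]
    return tuple(sum(col) for col in zip(*feats))

def sequence_encoder(items):
    # Position-weighted sum: an order-sensitive baseline with no permutation prior.
    return sum((i + 1) * x for i, x in enumerate(items))

data = [3, 1, 4, 1, 5]
perms = list(itertools.permutations(data))

# Count distinct outputs across all orderings of the same five numbers.
print("set encoder distinct outputs:     ", len({set_encoder(p) for p in perms}))
print("sequence encoder distinct outputs:", len({sequence_encoder(p) for p in perms}))
```

Integers are used so the sums are exact; with floats the set encoder would still be invariant up to rounding.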
However it seems fairly reasonable to say a synapse is roughly 1:1 comparable to a network parameter, in that they seem to be doing about the same sort of weighted propagation with about the same computational power. A synapse does work very differently, and has a couple of very low bandwidth side-channels, but its main job is the same job as a network weight.