
GShard: Scaling giant models with conditional computation and automatic sharding - MrUssek
https://arxiv.org/abs/2006.16668
======
cs702
_" Quién es más macho?"_

In a very short time, transformers have gone from under 1B, to 1.5B, to 3B, to
5B, to 175B, and now 600B parameters. 1T is only, what, like 67% more
parameters, and therefore likely to be achieved in the short term. In fact,
the authors of this paper tried 1T but ran into numerical issues that they
will surely address soon. Not long after someone crosses 1T, expect 10T to
become the next target. And why not? The best-funded AI research groups are in
a friendly competition to build the biggest, baddest, meanest m-f-ing models
the world has ever seen.

Scores continue to increase with diminishing returns, which is all fine and
nice, but more importantly it seems we should expect to see machine-generated
text getting much better _from a qualitative standpoint_ -- that is, becoming
less and less distinguishable from a lot of human output. That has been the
trend so far.

We live in interesting times.

~~~
xbmcuser
Are they using this for Google Translate yet?
[https://www.deepl.com/en/translator](https://www.deepl.com/en/translator) is
currently better than Google Translate. Although for translating forums on a
website etc., I think the Netflix method would be better; I hope Google adopts
it for its Translate app:
[https://arxiv.org/abs/2005.11197](https://arxiv.org/abs/2005.11197)

~~~
cs702
Highly unlikely at the moment. But clearly that is the direction in which
translation is going, so companies lacking the economies of scale that come
with owning massive computational infrastructure will be at a serious
disadvantage.

------
dig6x
"...600 billion parameters using automatic sharding. We demonstrate that such
a giant model can efficiently be trained on 2048 TPU v3 accelerators in 4 days
to achieve far superior quality for translation from 100 languages to English
compared to the prior art."

It does appear that at the initial, resource-intensive stages of tech like NLP,
big tech is primed to pave the way. We saw this happen across cloud, AI more
generally, storage, etc. But big tech then begins focusing on making the tech
accessible to industry value chains (Azure, AWS, Amazon's AI services, etc.).
As the industry matures there's more room for specialized startups/companies
to enter the space and capture lucrative niches - that's exactly what
Snowflake did for cloud.

Definitely see this kind of scale as a step toward a more robust, mature
industry if anything. Better it move forward than not.

~~~
saltking112
Sounds interesting. Can you elaborate? I'm not familiar with what Snowflake
does or how it compares. Thanks

~~~
dig6x
Launched in 2014, it's basically a purpose-built SQL cloud data warehouse
solution. Its success hinged, among other factors, on its ability to decouple
compute power and data storage, creating a modular solution that could be made
efficient for any data warehousing configuration.

In 2013 AWS augmented its core cloud offering with the introduction of
Redshift, a ‘data warehousing as a service’ solution. Redshift bundled compute
and storage, limiting customers' ability to scale either component separately
in a cost-efficient manner. Not having the option to unbundle compute and
storage was inconsistent with the flexibility that cloud had become known for.

Snowflake’s solution split storage, compute, and services into separate
layers, allowing them to scale independently and achieve greater cost
efficiencies. By offering that flexibility it was able to better address the
requirements of a wider range of customers, who had previously been limited
to the more restrictive bundled options like Redshift.

~~~
saltking112
I see. But from Amazon's perspective, if customers want something that is
mostly turn-key with the ability to customize, wouldn't they just combine AWS
services themselves? I would believe Amazon has DB-only solutions,
compute-only solutions like EC2, etc. So why was Snowflake able to thrive in
this environment? Was the market simply too big?

~~~
dig6x
Yeah, the cloud market was at a stage where a niche player with some
convenience add-on could thrive. Now we're seeing all these multi-cloud
platforms emerge because enterprises are managing multiple cloud providers at
once, so you can imagine all the opportunities for horizontal scaling beyond
big tech in the industry.

------
mensetmanusman
Awe-inspiring to think of the number of transistors working in concert to
translate human language to English...

------
modeless
The most important advancements in machine learning for the next 10 years at
least will be in hardware, and the software to take advantage of said
hardware. You could even say that was already true starting with AlexNet, but
it's even more obvious now with these enormous models.

We've barely scratched the surface of what's possible. Even if Moore's Law
were dead (though it seems that TSMC may keep it alive for a bit longer),
there are huge gains to be had from co-designing models and hardware. Stuff like
[https://www.cerebras.net/](https://www.cerebras.net/) is the direction I
expect things to go.

~~~
dna_polymerase
Hardware will be a huge part, yes, but algorithmic advances would be even
better. Utilizing existing commodity hardware to its full extent is where the
money is. Specialized hardware will probably remain just that: specialized,
and mostly too expensive.

~~~
modeless
There will be algorithmic advances, but they will benefit larger models too.
Larger models will still win. The value of having the best AI is so great that
it will be worth nearly any level of investment in hardware to the large tech
companies that can afford it.

------
Der_Einzige
Yet another paper with results that basically look like this:
[https://d3b8hk1o42ev08.cloudfront.net/wp-content/uploads/201...](https://d3b8hk1o42ev08.cloudfront.net/wp-content/uploads/2018/10/9-752x440.png)

Still impressive, don't get me wrong, but I am starting to believe that NLP
will be dominated increasingly by the big players, since they are the only
ones who can train a 1 TRILLION parameter model (they show that in the paper).
I can't even do inference with a 36-layer, 2048-neuron-per-layer network with
my RTX 2080 Ti. Sad....

~~~
rahimnathwani
"I can't even do inference with a 36 layer, 2048 neuron per layer network with
my GTX 2080ti."

Not even for a single instance? Your GPU has 11GB of RAM. Why isn't 14k per
neuron enough? Is the input really large, or does each neuron have very high
precision?

~~~
MrUssek
There's an extremely large number of parameters per "neuron". The 600B
parameters will take up more than 1TB of space in memory, far too much for the
2080 Ti or even the main memory of most systems.
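
A quick back-of-the-envelope check (a minimal sketch; the 2-byte and 4-byte
figures assume fp16 and fp32 storage, which may not match the paper's exact
setup):

```python
# Memory footprint of 600B parameters alone (weights only;
# no optimizer state, gradients, or activations).
params = 600e9  # 600B parameters

for dtype, bytes_per_param in [("fp16", 2), ("fp32", 4)]:
    total_tb = params * bytes_per_param / 1e12
    print(f"{dtype}: {total_tb:.1f} TB")  # fp16: 1.2 TB, fp32: 2.4 TB
```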

~~~
rahimnathwani
I'm not talking about inference on a 600B parameter model. GP said they can't
do inference on a 36-layer, 2048-neurons-per-layer network. Let's assume every
layer is fully connected. So each neuron will have 2048 parameters. So that's
36 * 2048 * 2048 parameters. That's roughly 151MM parameters in 11GB of RAM,
or about 73 bytes per parameter. If each parameter is 4 bytes (that seems like
a lot of precision), plus 4 bytes per calculated value, you're still only
using a small fraction of the GPU's RAM. You should be able to do inference on
a batch of 16-20 examples at a time.

What have I missed?
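
Spelling that arithmetic out (a minimal sketch of the fully-connected
assumption above; the 36-layer / 2048-neuron sizes come from the GP comment,
and fp32 storage is an assumption):

```python
# Back-of-the-envelope memory estimate for a plain fully-connected network
# (ignores biases and activations, which are comparatively tiny).
layers = 36
width = 2048              # neurons per layer
gpu_ram = 11e9            # ~11 GB on a 2080 Ti

params = layers * width * width               # ~151M weights
bytes_available_per_param = gpu_ram / params  # ~73 bytes
fp32_footprint_gb = params * 4 / 1e9          # ~0.6 GB

print(f"parameters: {params / 1e6:.0f}M")
print(f"RAM available per parameter: {bytes_available_per_param:.0f} bytes")
print(f"fp32 weight footprint: {fp32_footprint_gb:.2f} GB "
      f"({100 * fp32_footprint_gb * 1e9 / gpu_ram:.0f}% of GPU RAM)")
```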

~~~
tehsauce
2048 neurons per layer isn't really an accurate description, what he means is
2048 dimensional embeddings at each layer. The actual multihead attention
layers in a transformer are not just feed forward 2048*2048, but actually have
many more parameters. That's why there's 600B total.
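
Rough per-layer counts for illustration (a minimal sketch; the d_model=2048,
4x feed-forward width, and 2048-expert figures are assumptions for the example,
not the paper's exact configuration):

```python
# Per-layer parameter counts for a transformer block, versus the
# naive "width x width" estimate.
d_model = 2048
d_ff = 4 * d_model      # common feed-forward expansion factor
num_experts = 2048      # GShard-style mixture-of-experts (illustrative)

naive = d_model * d_model              # ~4.2M
attention = 4 * d_model * d_model      # Q, K, V, output projections: ~16.8M
dense_ffn = 2 * d_model * d_ff         # two feed-forward matrices: ~33.6M
moe_ffn = num_experts * dense_ffn      # FFN replicated per expert: ~68.7B

print(f"naive width^2:       {naive / 1e6:.1f}M")
print(f"attention:           {attention / 1e6:.1f}M")
print(f"dense FFN:           {dense_ffn / 1e6:.1f}M")
print(f"MoE FFN (per layer): {moe_ffn / 1e9:.1f}B")
```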

------
teruakohatu
The brain has ~100+ trillion synapses [1] (estimates seem to range from 100T
to 1000T).

A 1 trillion parameter model should not be far off, which is about the same
number of synapses as a house mouse.

We would be around 1% of the way to human brain complexity (well, probably
not, but it is fun to think about).

[1]
[https://en.wikipedia.org/wiki/List_of_animals_by_number_of_n...](https://en.wikipedia.org/wiki/List_of_animals_by_number_of_neurons)
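
As a crude ratio (a minimal sketch; treating one parameter as roughly one
synapse, which is a very loose analogy):

```python
# How far a 1T-parameter model gets toward 100-1000T synapses.
params = 1e12
for synapses in (100e12, 1000e12):
    print(f"{100 * params / synapses:.1f}% of {synapses / 1e12:.0f}T synapses")
# -> 1.0% and 0.1%
```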

~~~
visarga
You can't directly compare biological and artificial neurons like that.
Biological ones have synapses that function in a much more complex way than
weights in a neural net, but are also much slower and noisier.

Secondly, we don't have a robot body to house the model in. Without embodiment
it won't be able to learn to interact with the world like we do.

Thirdly, in humans, specific priors have been baked into the brain by
evolution (data symmetries and efficiencies). We don't know all of them yet,
or how to replicate them. We do rely on translation invariance for images,
time-shift invariance for sequences, and permutation invariance for some set
and graph neural nets, but those are not all the priors the brain makes use
of.

~~~
Veedrac
Biological neurons are complex networks of thousands of synapses, and it's
definitely reasonable to say a biological neuron is not 1:1 comparable to an
artificial NN neuron. Biological neurons can compute XOR[1], and some even
contain loops, called autapses.

However, it seems fairly reasonable to say a synapse is roughly 1:1 comparable
to a network parameter, in that they seem to be doing about the same sort of
weighted propagation with about the same computational power. A synapse does
work very differently, and has a couple of very low-bandwidth side channels,
but its main job is the same as a network weight's.

[1]
[https://science.sciencemag.org/content/367/6473/83](https://science.sciencemag.org/content/367/6473/83)

------
justicezyx
Note that this is a systems paper, not an ML/DL/NLP paper. In that context
it's kind of OK to scale the parameter count to such a large number.

