Google Open-Sources Trillion-Parameter AI Language Model Switch Transformer (infoq.com)
275 points by jonbaer on Feb 17, 2021 | 75 comments



If the model weights aren’t open source, the model isn’t open source in my opinion. It’s not like a switch transformer is that much more complicated to slap into pytorch than any other new research. Also, in my personal opinion, while yes, this model performs great, it’s ridiculously parameter inefficient. 1.6T parameters is 3.12TB using bfloat16. I’d rather take 7x longer to train and have a model I can actually put places.
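
Back-of-the-envelope, for anyone who wants to check that figure (my own arithmetic; the exact total depends on the precise parameter count):

  params = 1.6e12            # parameters
  bytes_per_param = 2        # bfloat16 is 16 bits
  total_bytes = params * bytes_per_param
  print(total_bytes / 1e12, "TB")   # ~3.2 TB decimal (~2.9 TiB)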


> it’s ridiculously parameter inefficient

How do you know that? Perhaps with this methodology you really need those 3.12TB to reach comparable performance?


I beg to differ.

[1] describes a whole-data-set training method (ADMM, one of several such methods). Page 8 contains Figure 2(b): accuracy as a function of training time. Note that ADMM starts where stochastic gradient descent stops.

[1] https://arxiv.org/pdf/1605.02026.pdf

At [2] I tried logistic regression trained with the iteratively reweighted least squares (IRLS) algorithm on the same Higgs boson data set. I got the same accuracy (64%) as reported in the ADMM paper with far fewer coefficients - basically, just the size of the input vector + 1, instead of 300 such rows of coefficients followed by a 300x1 affine transformation. When I added squares of the inputs (the simplest approximation of polynomial regression) and used the same IRLS algorithm, I got even better accuracy (66%) for double the number of coefficients. (A minimal IRLS sketch follows the reference below.)

[2] https://github.com/thesz/higgs-logistic-regression
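
For the curious, a minimal sketch of the kind of IRLS/Newton update I mean for logistic regression (illustration only, not the actual code from the repo; the small ridge term is just there to keep the solve stable):

  import numpy as np

  def irls_logistic(X, y, n_iter=25, ridge=1e-6):
      """Logistic regression via iteratively reweighted least squares (Newton).
      X: (n, d) design matrix (include a column of ones for the intercept),
      y: (n,) labels in {0, 1}."""
      beta = np.zeros(X.shape[1])
      for _ in range(n_iter):
          p = 1.0 / (1.0 + np.exp(-X @ beta))   # current predictions
          W = p * (1.0 - p)                      # IRLS weights
          H = (X * W[:, None]).T @ X + ridge * np.eye(X.shape[1])
          g = X.T @ (y - p)
          beta += np.linalg.solve(H, g)          # Newton / IRLS step
      return beta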

There's a hypothesis [3] that SGD and Adam are considered the best optimizers simply because they are what everyone uses and reports on. Rarely, if ever, do you see anything that differs.

[3] https://parameterfree.com/2020/12/06/neural-network-maybe-ev...

So, answering your question of "how do you know": researchers at Google cannot do IRLS (a search turns up IRLS only for logistic regression in TensorFlow), they cannot do Hessian-free optimization ([4], closed due to lack of activity - notice the "we can't support RNNs due to the WHILE loop" bonanza), etc. All because they have to use TensorFlow - it just does not support these things.

[4] https://github.com/tensorflow/tensorflow/issues/2682

I haven't seen anything about whole-data-set optimization from Google at all. That's why I (and only I, based on the position I take and the experiments I did) conclude that they do not quite care about parameter efficiency.


> they do not quite care about parameter efficiency.

Google Research is pretty big. I used to think like you do, but I think it's mostly because DeepMind just hogs all the spotlight.

Check out PRESS [0] for example.

[0]: https://research.google/pubs/pub46141


Thank you! I skimmed over the abstract and will read the paper later, it seems interesting.

But you gave me another point to support my view: PRESS uses stochastic gradient, not second-order method like IRLS.


I agree, I was really only proposing PRESS for the "parameter efficiency" part of your comment. It'd be interesting to see some modern takes on IRLS. I think generally this goes against the grain of the Cheap Gradient Principle which is why you see less of it (edit: eh, I think Fisher scoring can be cast in this light).

For instance, on modern modelling problems with non-linearities and change points, it's a lot harder to do something like IRLS in an end-to-end system, but it's interesting as a research direction.


I also agree about the "Cheap Gradient Principle". I see it as a case of "the width of two horses' backs in Ancient Rome determined the Space Shuttle booster width" (which is untrue, but cool as a reference).

SGD itself was developed because it was the only way to train something like a neural network with little memory and, more importantly, in reasonable time. Multiplying training time by N (the number of parameters) would have meant getting a good result in a year, not in a day.

And today we have large-batch training with complex synchronization systems to speed up training even more, which brings us closer to whole-data-set training and, I guess, to second-order optimization as well.


If you read through the paper where they present switch transformers, they train all models to the same performance, so for example t5-large, at 770M parameters [1], is just as good as this model. That's only 1.54GB.

[1] https://huggingface.co/t5-large


Some other thoughts on what makes machine learning open source, from Debian:

https://salsa.debian.org/deeplearning-team/ml-policy


Mixture of experts architectures like this are a specific design decision to increase the parameter count but use and update those parameters sparsely. Sure, that design decision doesn't fit all scenarios, but it fits some, and it has its own advantages, like faster training time.


But do you see any practical scenario where you have to keep 3TB of parameters in an accelerator's RAM at inference time?

The accuracy would have to be significantly higher than any alternative to justify monopolizing that much hardware.


Yep, it would definitely be difficult to justify running it in production. Accuracy would need to be higher as you said or it would need to be applicable to more tasks such that you can take other models out of production.

This kind of model could be used as the teacher in a distillation setup too though. Then faster training of the teacher is actually a huge benefit since it speeds up model development iteration cycles.

But even if it weren't practical to use in production in any sense, I'd argue there's value in doing the basic research of exploring design space of architectures in this way. This came out of a research team at Google. It may inspire and inform smaller, more practical architectures.


"Yep, it would definitely be difficult to justify running it in production. Accuracy would need to be higher as you said or it would need to be applicable to more tasks such that you can take other models out of production."

Part of the justification is that the MoE's sparsity by design means only a small part of the model will be activated by a given query. Don't think of it as a single giant 1T model; think of it as 50 small models which happen to share a glue layer at the input. So at deployment, you could, for example, keep the gating layer in RAM and only pull the necessary sub-model off disk as needed. Or you could shard the sub-models over 51 GPUs and feed the master gating-layer+GPU lots of queries, and each query will be dispatched to a different expert+GPU pair. This could easily be competitive with running a lot of dense models in parallel trying to keep up with the same load.
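
A rough sketch of that deployment pattern (purely illustrative: the class, file paths and router are made up, and it assumes each expert was saved as a whole module that maps inputs back to the same shape):

  import torch

  class LazyMoE:
      """Keep only the small router resident; pull each expert off disk on demand."""
      def __init__(self, router, expert_paths):
          self.router = router              # gating network, always in RAM
          self.expert_paths = expert_paths  # expert_id -> checkpoint file
          self.cache = {}                   # expert_id -> loaded module

      def _expert(self, i):
          if i not in self.cache:
              self.cache[i] = torch.load(self.expert_paths[i])  # load from disk
          return self.cache[i]

      def __call__(self, x):
          scores = self.router(x)            # (batch, num_experts)
          top1 = scores.argmax(dim=-1)       # chosen expert per example
          out = torch.empty_like(x)
          for i in top1.unique():
              sel = top1 == i
              out[sel] = self._expert(int(i))(x[sel])
          return out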


Google's legal team would be freaking out if any of its engineers proposed open-sourcing the model weights. This is a minefield of legal and privacy headaches, and I guess it is infeasible to audit 1.6T parameters anyway...


I have to say that I really doubt this, as they've released model weights for their last several models, which have been nearly this size (and were trained on the same dataset).

See https://github.com/google-research/text-to-text-transfer-tra...


I'm not sure whether I'd call 10^10 nearly the size of 10^12, but the dataset is open, so it shouldn't be an issue.

https://github.com/google-research/text-to-text-transfer-tra...


This is a really good point, and it also depends on what the training data is - although arguably, if they are using this model to power their services and it can leak personal data, that might be an issue in its own right.


Can someone translate this post into English for those of us not imbued with AI knowledge?


One key idea here is to use a very large number of parameters (model weights), but only use some subset of the parameters on each example. The parameters are divided up into blocks called "experts", and then some subset of experts are used on any given input. Which subset is used is chosen by the model itself in a data-dependent manner. This can be thought of as letting the model specialize different experts to handle different situations.

The advantage, as they show, is that the model can train to a given level of performance much faster with a fixed amount of computing power compared to an architecture that uses all parameters on every step. This might be because it allows you to have a very large number of parameters that can store a lot more specialized information without incurring as much of a computational cost. Of course the downside is that you end up with a very large model that literally won't fit in a lot of environments.
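
A toy version of that routing idea in PyTorch, loosely along the lines of the paper's switch layer but heavily simplified (no capacity factor, no load-balancing loss, and a plain Python loop over experts instead of a real dispatch):

  import torch
  import torch.nn as nn

  class SwitchFFN(nn.Module):
      """Toy top-1 mixture-of-experts feed-forward layer."""
      def __init__(self, d_model=512, d_ff=2048, num_experts=8):
          super().__init__()
          self.router = nn.Linear(d_model, num_experts)
          self.experts = nn.ModuleList([
              nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                            nn.Linear(d_ff, d_model))
              for _ in range(num_experts)
          ])

      def forward(self, x):                    # x: (tokens, d_model)
          probs = torch.softmax(self.router(x), dim=-1)
          gate, idx = probs.max(dim=-1)        # pick one expert per token
          out = torch.zeros_like(x)
          for e, expert in enumerate(self.experts):
              mask = idx == e
              if mask.any():
                  # only this slice of tokens touches expert e's parameters
                  out[mask] = gate[mask].unsqueeze(1) * expert(x[mask])
          return out

  layer = SwitchFFN()
  tokens = torch.randn(10, 512)
  print(layer(tokens).shape)                   # torch.Size([10, 512])

Each token's output is scaled by its router probability, which is what lets gradients flow back into the router so it learns where to send things.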


Another naive question: why is this better than creating a separate, smaller model for each expert?


The common argument I've heard: because then you would have to decide how many expert models are required, train and evaluate them separately, and overall make your architecture dependent on this choice. If your expert is wrong and miscalculates how many models are required, then your entire architecture is also likely to be wrong (humans, am I right?).

Researchers at Google's scale prefer a single model where you throw all your data in a single bin and get perfect performance out, no tweaking and no pesky humans required.


But this is something you could just use a hyperparameter search for, right, to determine the number of models? Or are hyperparameter searches generally not used anymore at that scale?


How would you know which inputs to use for training each expert?


Yes, you can think of this as a collection of small models, with another model choosing which smaller model to use for each input.


Not really: the dispatching between experts happens for every word and every layer, so you can't easily isolate distinct models.


lol they didn’t open source the model weights


If I did the math right it would be 3.12TB of weights, maybe they are trying to upload it to gdrive still. (/s, probably)


They released more data than that for their Google Books n-grams datasets:

https://storage.googleapis.com/books/ngrams/books/datasetsv3...

(I don't remember exactly how much it is, but I remember that the old version was already in the terabytes.)


Another example of Google giving much data away is 50 trillion digits of pi [1], which contains about 42 TB of data (decimal and hexadecimal combined).

[1] https://storage.googleapis.com/pi50t/index.html


The Waymo open dataset is about 1TB. I don't think releasing a 3TB dataset would present a technical challenge for Google.


Even a 3PB model would be very doable for Google...


The daily upload quota for a user is ~750 GB. It'll take a few days to upload that much data to Google Drive!


Google Cloud Storage. The files could be dumped as tfrecord in a bucket with "requester pays". So anybody could reproduce it using the open source code, by paying for the costs incurred to move the data from GCS to the training nodes.
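
On the consumer side it would look something like this (a sketch; the bucket and object names are made up, and you need the google-cloud-storage client plus a project of your own to bill the egress to):

  from google.cloud import storage

  # Bucket, object and project names here are made up for illustration.
  client = storage.Client()
  bucket = client.bucket("switch-transformer-checkpoints",
                         user_project="my-billing-project")  # requester pays
  blob = bucket.blob("checkpoints/shard-00000-of-01024.tfrecord")
  blob.download_to_filename("shard-00000-of-01024.tfrecord")  # egress billed to you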


Weights are just numbers (probably floats?), right?

This model has 3.12TB of floats??? That's insane. How do you load that into memory for inferencing?


Use x1e.32xlarge on AWS with 3TB of RAM. Just $12,742/mo - https://calculator.aws/#/estimate?id=7428fa81192c57087ac8cdf...

Alternatively order something like the HP Z8 with 3TB RAM configured, which is only $75k - https://zworkstations.com/configurations/2040422/

It's interesting. It would take ~six years for the Z8 to break even compared to AWS, but traffic into and out of the machine would be $0, and I don't think you're running directly on the metal with AWS, so performance would probably be a bit higher. And then there's storage - I configured, uhh, 120TB of a mixture of SSDs and HDDs. I'm not even going to try and ask AWS for a comparable quote there.

I may or may not have added dual Xeon Platinum 8280s to the Z8 as well. :P


When you're spending that kind of money on a machine, there's no way you're paying retail price. Sales reps would give you a significant discount.

Also - think you meant 6 months, not 6 years anyhow :)


Interesting. I'm very curious... 20%? 35%?

And I did mean 6 months, woops. Didn't even notice...


> It would take ~six years for the Z8 to break even

Do you mean six months?


Oh *dear*. I definitely tripped over there, and I didn't even notice.

Yup.


Z8 sounds like fun. but I might just buy two teslas (roadster and X, or a cybertruck) and a gaming PC. :D


Hate to break up the party, but this model only loads a small part of itself into RAM when inferencing.


That's a good thing. Less complexity means more energy for interestingness, and less expense means more accessibility.


(They are definitely going to exceed their storage quotas.)

I want to see how well weights for these models compress, but it will take me some time to run this code and generate some. I'm guessing they won't compress well, but I can't articulate a reason why.
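
In the meantime, a quick stand-in experiment with random float16 noise (worst case; trained weights have more structure, but dense float weights are generally high-entropy):

  import zlib
  import numpy as np

  # How well does a blob of float16 "weights" compress?
  rng = np.random.default_rng(0)
  weights = rng.standard_normal(10_000_000).astype(np.float16)   # ~20 MB
  raw = weights.tobytes()
  packed = zlib.compress(raw, level=9)
  print(f"ratio: {len(packed) / len(raw):.2f}")   # near 1.0 means it barely compresses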


If weights compress, they have low information, which would suggest they're either useless or the architecture is bad.


In the source code it says "I have discovered truly marvelous weights for this, which this header file is too small to contain"


Data is the new oil; so what's the analogy for the data industry's impact on society, akin to climate change?




Is this because they are afraid of the model being misused, e.g. for generating fake reviews? It is frustrating that I've been hearing great news about NLP but am able to try none of it myself.


It's because the model weights are the valuable thing here. The fancy new architectures are nice and everything, but transformer models are a dime a dozen these days. Seems like they're using this as an example to point at and say "Hey, look at us, we support open source!", whereas unless you're willing to go ahead and spend a small fortune on compute (possibly using their GPUs), these models are somewhat useless.


hah! yeah that's what I was looking for too


There is something fundamentally wrong with these models.

The brain "works" because its evolved structure matches or reflects reality. It is not about having billions of neurons, but about having the right structure, one which matches the environment.

My favourite example is how butterflies evolved pictures of eyes on their wings to scare predators, having literally no idea about the existence of other creatures.

The pattern evolved because other creatures have eyes and are out there, of course.

The proper structure of neural networks must be based on such fundamental features, like "most creatures have eyes" and similar ones.

The brain does not have a flat structure, like a billion-by-billion matrix. It is more clever, and simpler, than that.

A language model must be based on the fundamental notion that there are nouns (things), verbs (processes) and adjectives (attributes). It is that simple.


Slightly apples to oranges comparison.

The brain has an incredibly complex architecture, which evolved over millions of years. On top of that, it then develops throughout a human's lifespan. The brain we observe is a "finished product", and even then it has ~150 trillion synapses to do computations [0].

Even massive neural networks have a relatively simple architecture before they are trained. Part of the training process is effectively learning more complex architectures, which are manifested by changing weights.

What I'm getting at is that artificial neural networks aren't equivalent to the brain - ANNs are learning their own structure on top of being the circuits that actually do the computation. They are doing the work of millions of years of evolution, genetics, developmental biology, interaction with the environment, etc. Perhaps it's to be expected that ANNs will need orders of magnitude more parameters than a brain.

An interesting development is meta-learning, where we separate the process for learning the architecture (which could use deep learning, but not necessarily) from the network actually doing the computation (equivalent to the brain).

> A language model must be based on the fundamental notion that there are nouns (things), verbs (processes) and adjectives (attributes).

I agree, but how does the brain represent these concepts? Some would argue that ANNs do have these concepts, just hidden away in abstract vector representations. Take the visual system, which has been extensively studied - we see the brain represents contrast, edges, shapes and so on very similarly to convolutional NNs.

[0] It's likely that this number doesn't come close to capturing the brain's complexity, as it doesn't incorporate parameters like long-term potentiation/depression, synchronization, firing rates, habituation vs sensitization, immunomodulation and likely so much more we haven't yet discovered.


If the brain is the answer, the question involves trillions of parameters. Clearly there's more to the brain than just size, but also clearly, the brain is big for a reason. In fact scale is one of the very few things we can say for certain plays a big role in the function of the brain. GOFAI notions of embedding grammars and knowledge webs are just guesses on faith—the evidence points precisely in the opposite direction—and don't really make much sense anyway.


So, almost every possible n-gram ever used is an 'input'?

Can someone describe what the lexical reality of 3.5T inputs actually means?

I feel like this is 'Deep Memorization' instead of 'Deep Learning'.

Like a doctor who passes every exam merely by memorizing the textbook, with absolutely no ability beyond that.


Even so, memorisation here implies word-by-word information retrieval and interpolation; it's not a hash table.


Yes, I mean, that's cool, but I'm just thinking that this isn't quite the AI we were thinking about before.

It's like a 'new form of storage and lookup' as opposed to the kind of 'magic algorithm' we usually think of when we think of AI. Or maybe that's just me.


Magic algorithms don't exist. That's the true reality of AI. I also felt the same thing and became dismayed as an undergrad at what was going on in computation, having also studied neurobiology. But as it turns out, a shit ton of data with some stats can get you very, very far.


I didn't mean 'magic' (I know it's not that) - I just meant to imply that I think of AI as a 'function' not a 'lookup'.

Inputs -> Outputs not Search -> Response

Like if you train AI on a small dataset, it feels like what it's doing afterwards is a 'function' or 'algorithm' using what is, in the end, some arcane algebra.

But if you train on all the data in the world, with a trillion parameters ... well ... I kind of feel that 'all the data' is sitting in an AI-style data structure that we are 'querying' with AI.

But that's just an observer's abstraction.


See Searle’s Chinese room.


Magic algorithms (or perhaps magic machines) do exist, though. See: humans.


We have ~150 trillion synapses. If every synapse were equivalent to just one parameter in an ANN, then we could be just magic memoization machines.


> If every synapse were equivalent to just one parameter in an ANN, then we could be just magic memoization machines.

It's a poor comparison, though, since neurons and synapses aren't the same as parameters in a computational network. It's the same trap that news outlets routinely fall into when citing "storage capacity of the brain" in TBs and other such nonsense.


With that many trillion parameters, every utterance can be a special case...


Since this type of model can be used for the task of "question answering", would it be possible to create a search engine with this type of model?


Google already uses a combination of NLP + PageRank to serve search queries [1].

[1] https://en.wikipedia.org/wiki/RankBrain


You could make something like an "expert system" with it. But since it doesn't seem to link back to the source, you'd never know if it was giving you a real answer or just making something up.


A bit like people


Not exactly. People can be asked to provide their sources or reasoning, which an informed user can easily check and verify.

This is not the case with black-box models. They only point to an answer/give a reply without any justification or reasoning behind it. This is actually a very severe problem with black-box models: ultimately they cannot be trusted, because it's very hard to verify whether the learned objective function matches the intended objective function (this is called the alignment problem [1].)

Optimisers tend to produce Clever Hans instances whenever they can, because that's the cheapest and therefore most optimal solution. Even if this becomes obvious from failure cases (e.g. common misclassifications in image recognition systems), it's still not obvious which clues the system used that led to the misclassification.

This is in contrast to a person, who can be queried as to why and how they arrived at their conclusion.

[1] https://bdtechtalks.com/2021/01/18/ai-alignment-problem-bria...


Would GDPR or other regulations apply to the data that these models are trained with? Is it not a risk that the model will record some private information from someone's email?


Yes, it has been shown [1] that these models can memorize information even if it only appears once in the training data. It could potentially cause issues with privacy and especially copyright. However, in this case it's not relevant because they didn't release the model weights.

[1] https://arxiv.org/abs/2012.07805


Even if they didn't release the weights, they still have them for themselves. If the model uses any personal data, then GDPR and other regulations may not be respected.


If they have a policy for responding to GDPR requests, and maybe if they use a Bloom filter to avoid regurgitating the training data, then it should be OK?
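
For what it's worth, such a filter is cheap to sketch (a hand-rolled toy, not production-grade; sizes and the hashing scheme here are arbitrary):

  import hashlib

  class BloomFilter:
      """Tiny Bloom filter: answers 'definitely not seen' or 'maybe seen'."""
      def __init__(self, size_bits=1 << 24, num_hashes=5):
          self.size = size_bits
          self.num_hashes = num_hashes
          self.bits = bytearray(size_bits // 8)

      def _positions(self, item):
          for i in range(self.num_hashes):
              h = hashlib.sha256(f"{i}:{item}".encode()).digest()
              yield int.from_bytes(h[:8], "big") % self.size

      def add(self, item):
          for pos in self._positions(item):
              self.bits[pos // 8] |= 1 << (pos % 8)

      def __contains__(self, item):
          return all(self.bits[pos // 8] & (1 << (pos % 8))
                     for pos in self._positions(item))

  # Index training sentences (or n-grams), then suppress generations that match.
  seen = BloomFilter()
  seen.add("some sensitive training sentence")
  print("some sensitive training sentence" in seen)   # True ("maybe seen")
  print("a novel sentence" in seen)                   # False ("definitely not")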


I feel like a future version of GPT will easily be able to answer 'What is the home address of ...@gmail.com', given the abundance of data it is trained on and the widespread proliferation of leaked databases.


Maybe host and use it in a non-GDPR environment; it's always going to be ambiguous and risky.



