If the model weights aren’t open source, the model isn’t open source in my opinion.
It’s not like a switch transformer is that much more complicated to slap into pytorch than any other new research.
Also, in my personal opinion: yes, this model performs great, but it's ridiculously parameter-inefficient. 1.6T parameters is roughly 3.2TB in bfloat16. I'd rather take 7x longer to train and have a model I can actually put places.
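Back-of-the-envelope, in case anyone wants to check the number (a quick sketch assuming 2 bytes per bfloat16 parameter, weights only, no optimizer state):

    # rough weight-memory estimate, assuming 2 bytes per parameter (bfloat16)
    params = 1.6e12
    print(f"{params * 2 / 1e12:.1f} TB")  # -> 3.2 TB, weights only, no optimizer state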
[1] describes a whole-dataset training method (ADMM, one such method). Figure 2(b) on page 8 shows accuracy as a function of training time. Note that ADMM starts roughly where stochastic gradient descent stops.
At [2] I tried logistic regression trained with an iteratively reweighted least squares (IRLS) algorithm on the same Higgs boson data set. I got the same accuracy (64%) as reported in the ADMM paper with far fewer coefficients - basically just the size of the input vector + 1, instead of 300 such rows of coefficients plus a 300x1 affine transformation. When I added squares of the inputs (the simplest approximation of polynomial regression) and used the same IRLS algorithm, I got even better accuracy (66%) for double the number of coefficients.
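For reference, a minimal sketch of the IRLS/Newton update I mean (the data loading is omitted and the small ridge term for numerical stability is my addition, not part of the textbook algorithm or of the experiment in [2]):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def irls_logistic(X, y, n_iter=20, ridge=1e-6):
        # X: (n, d) inputs, y: (n,) labels in {0, 1}
        X = np.hstack([np.ones((X.shape[0], 1)), X])      # bias column -> d + 1 coefficients
        w = np.zeros(X.shape[1])
        for _ in range(n_iter):
            p = sigmoid(X @ w)
            s = p * (1.0 - p)                              # diagonal of the IRLS weight matrix
            H = X.T @ (s[:, None] * X) + ridge * np.eye(X.shape[1])
            w += np.linalg.solve(H, X.T @ (y - p))         # Newton / IRLS step
        return w

    # adding squares of the inputs is just a feature map:
    # w = irls_logistic(np.hstack([X, X ** 2]), y)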
So, answering your question of "how do you know": researchers at Google cannot do IRLS (a search turns up IRLS only for logistic regression in TensorFlow), they cannot do Hessian-free optimization ([4], closed due to lack of activity - notice the "we can't support RNNs due to the WHILE loop" bonanza), etc. All because they have to use TensorFlow - it just does not support these things.
I haven't seen anything about whole-dataset optimization from Google at all. That's why I (and only I, given the position I take and the experiments I did) conclude that they do not particularly care about parameter efficiency.
I agree, I was really only proposing PRESS for the "parameter efficiency" part of your comment. It'd be interesting to see some modern takes on IRLS. I think generally this goes against the grain of the Cheap Gradient Principle which is why you see less of it (edit: eh, I think Fisher scoring can be cast in this light).
For instance, on modern modelling problems with non-linearities and change points, it's much harder to do something like IRLS in an end-to-end system, but it's interesting as a research direction.
I also agree with the "Cheap Gradient Principle". I see it as a case of "the width of two horses' backs in Ancient Rome determined the Space Shuttle booster width" (which is untrue, but cool as a reference).
SGD itself was developed because it was the only way to train something like a neural network with little memory and, more importantly, in reasonable time. Multiplying training time by N (the number of parameters) meant getting a good result in a year instead of a day.
And today we have large-batch training with complex synchronization systems to speed up training even more, which brings us closer to whole-dataset training and, I guess, to second-order optimization as well.
If you read through the paper where they present Switch Transformers, they train all models to the same performance, so for example T5-Large, at 770M parameters [1], is just as good as this model. That's only 1.54GB.
Mixture of experts architectures like this are a specific design decision to increase the parameter count but use and update those parameters sparsely. Sure, that design decision doesn't fit all scenarios, but it fits some, and it has its own advantages, like faster training time.
Yep, it would definitely be difficult to justify running it in production. Accuracy would need to be higher as you said or it would need to be applicable to more tasks such that you can take other models out of production.
This kind of model could be used as the teacher in a distillation setup too though. Then faster training of the teacher is actually a huge benefit since it speeds up model development iteration cycles.
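A minimal sketch of what that distillation setup looks like in PyTorch (temperature and the loss weighting are the usual knobs; the values here are placeholders, not anything from the paper):

    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
        # soft targets from the frozen teacher, softened by temperature T
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        # usual hard-label cross-entropy on the student
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard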
But even if it weren't practical to use in production in any sense, I'd argue there's value in doing the basic research of exploring design space of architectures in this way. This came out of a research team at Google. It may inspire and inform smaller, more practical architectures.
"Yep, it would definitely be difficult to justify running it in production. Accuracy would need to be higher as you said or it would need to be applicable to more tasks such that you can take other models out of production."
Part of the justification is that the MoE sparsity, by design, means only a small part of the model is activated by a given query. Don't think of it as a single giant 1T model; think of it as 50 small models which happen to share a glue layer at the input. So at deployment you could, for example, keep the gating layer in RAM and only pull the relevant sub-model off disk as needed. Or you could shard the sub-models over 51 GPUs and feed the master gating-layer+GPU lots of queries, and each query will be dispatched to a different expert+GPU pair. This could easily be competitive with running a lot of dense models in parallel trying to keep up with the same load.
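A toy sketch of that deployment pattern (the file names, shapes, and the single-matrix "gate" are made up for illustration; a real serving stack would cache hot experts and batch queries):

    import torch

    gate_w = torch.load("gate.pt")                        # small router kept in RAM, (d_model, n_experts)

    def serve(x):
        # x: (d_model,) query representation
        expert_id = int((x @ gate_w).argmax())            # top-1 routing decision
        w = torch.load(f"experts/expert_{expert_id}.pt")  # only this sub-model leaves disk
        return torch.relu(x @ w["w_in"]) @ w["w_out"]     # one FFN expert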
Google's legal team would be freaking out if any of its engineers proposed open-sourcing model weights. This is a minefield full of legal and privacy headaches, and I guess it is infeasible to audit 1.6T parameters anyway...
I have to say that I really doubt this, as they've released model weights for their last few models that have been nearly this size (and were trained on the same dataset).
This is a really good point, and it also depends on what the training data is - although arguably, if they are using this model to power their services and it can leak personal data, that might be an issue in its own right.
One key idea here is to use a very large number of parameters (model weights), but only use some subset of the parameters on each example. The parameters are divided up into blocks called "experts", and then some subset of experts are used on any given input. Which subset is used is chosen by the model itself in a data-dependent manner. This can be thought of as letting the model specialize different experts to handle different situations.
The advantage, as they show, is that the model can train to a given level of performance much faster with a fixed amount of computing power compared to an architecture that uses all parameters on every step. This might be because it allows you to have a very large number of parameters that can store a lot more specialized information without incurring as much of a computational cost. Of course the downside is that you end up with a very large model that literally won't fit in a lot of environments.
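A minimal sketch of the mechanism (top-1 routing in the style of Switch Transformers; the dimensions are arbitrary and the load-balancing loss from the paper is omitted):

    import torch
    import torch.nn as nn

    class ToySwitchFFN(nn.Module):
        # one feed-forward "expert" per index; each token is routed to exactly one expert
        def __init__(self, d_model=512, d_ff=2048, n_experts=8):
            super().__init__()
            self.router = nn.Linear(d_model, n_experts)
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
                for _ in range(n_experts)
            ])

        def forward(self, x):                      # x: (n_tokens, d_model)
            probs = self.router(x).softmax(dim=-1)
            gate, idx = probs.max(dim=-1)          # top-1: probability and expert id per token
            out = torch.zeros_like(x)
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():                     # only the chosen expert's parameters are touched
                    out[mask] = gate[mask].unsqueeze(1) * expert(x[mask])
            return out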
The common argument I've heard: because then you would have to decide how many expert models are required, train and evaluate them separately, and overall make your architecture dependent on this choice. If your expert is wrong and miscalculates how many models are required, then your entire architecture is also likely to be wrong (humans, am I right?).
Researchers at Google's scale prefer a single model where you throw all your data in a single bin and get perfect performance out, no tweaking and no pesky humans required.
But this is something you could just use a hyperparameter search for, right, to determine the number of expert models? Or are hp searches generally not used anymore at that scale?
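Something as simple as this, in principle (a hypothetical sketch; build_model and train_and_eval are placeholders, and at this scale each trial is obviously very expensive):

    # hypothetical grid search over the number of experts
    candidates = [4, 8, 16, 32, 64]
    scores = {}
    for n_experts in candidates:
        model = build_model(n_experts=n_experts)    # assumed helper
        scores[n_experts] = train_and_eval(model)   # assumed helper, returns a validation metric
    best = max(scores, key=scores.get)
    print(best, scores[best])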
Another example of Google giving much data away is 50 trillion digits of pi [1], which contains about 42 TB of data (decimal and hexadecimal combined).
Google Cloud Storage. The files could be dumped as tfrecord in a bucket with "requester pays". So anybody could reproduce it using the open source code, by paying for the costs incurred to move the data from GCS to the training nodes.
It's interesting. It would take ~six years for the Z8 to break even compared to AWS, but traffic into and out of the machine would be $0, and I don't think you're running directly on the metal with AWS, so performance would probably be a bit higher. And then there's storage - I configured, uhh, 120TB of a mixture of SSDs and HDDs. I'm not even going to try and ask AWS for a comparable quote there.
I may or may not have added dual Xeon Platinum 8280s to the Z8 as well. :P
(They are definitely going to exceed their storage quotas.)
I want to see how well weights for these models compress, but it will take me some time to run this code and generate some. I'm guessing they won't compress well, but I can't articulate a reason why.
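If anyone wants to run the same check, something like this is enough to get a rough number (gzip as a stand-in for a real codec; the checkpoint path is a placeholder):

    import gzip, os, shutil

    path = "checkpoint.bin"                          # placeholder: raw weight bytes
    with open(path, "rb") as src, gzip.open(path + ".gz", "wb") as dst:
        shutil.copyfileobj(src, dst)                 # stream so huge files don't need to fit in RAM
    raw, comp = os.path.getsize(path), os.path.getsize(path + ".gz")
    print(f"compression ratio: {raw / comp:.2f}x")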
Is this because they are afraid of the model being misused, like being used to generate fake reviews? It is frustrating that I've been hearing great news about NLP but am able to try none of it myself.
It's because the model weights are the valuable thing here. The fancy new architectures are nice and everything, but transformer models are a dime a dozen these days. Seems like they're using this as an example to point at and say "Hey, look at us, we support open source!", whereas unless you're willing to go ahead and spend a small fortune on compute (possibly using their GPUs), these models are somewhat useless.
There is something fundamentally wrong with these models.
The brain "works" because it's evolved structure matches or reflects reality. It is not about having billions of neurons, but about to have the right structure which matches the environment.
My favourite example is how butterflies evolve pictures of eyes on its wings to scare predators, having literally no idea about existence of other creatures.
It has been evolved because other creatures have eyes, and they are there, of course.
The proper structure of neural networks must be based on such fundamental features, like "most of creatures have eyes" and similar ones.
Brain does not have a flat structure, like a billion x billion matrix. It is more clever and simpler that that.
A language model must be based on the fundamental notion that there are nouns (things), verbs (processes) and adjectives (attributes). It is that simple.
The brain has an incredibly complex architecture, which evolved over millions of years. On top of that, it then develops throughout a human's lifespan. The brain we observe is a "finished product", and even then it has ~150 trillion synapses to do computations [0].
Even massive neural networks have a relatively simple architecture before they are trained. Part of the training process is effectively learning more complex architectures, which are manifested by changing weights.
What I'm getting at is that artificial neural networks aren't equivalent to the brain - ANNs are learning their own structure on top of the circuits actually doing the computation. They are doing the work of millions of years of evolution, genetics, developmental biology, interaction with the environment, etc. Perhaps it's to be expected that ANNs will need orders of magnitude more parameters than a brain.
An interesting development is meta-learning, where we separate the process for learning the architecture (this could use deep learning, but not necessarily) from the network actually doing the computation (equivalent to the brain).
> A language model must be based on the fundamental notion that there are nouns (things), verbs (processes) and adjectives (attributes).
I agree, but how does the brain represent these concepts? Some would argue that ANNs do have these concepts, just hidden away in abstract vector representations. Take the visual system, which has been extensively studied - we see the brain represents contrast, edges, shapes and so on very similarly to convolutional NNs.
[0] It's likely that this number doesn't come close to capturing the brain's complexity, as it doesn't incorporate parameters like long-term potentiation/depression, synchronization, firing rates, habituation vs sensitization, immunomodulation and likely so much more we haven't yet discovered.
If the brain is the answer, the question involves trillions of parameters. Clearly there's more to the brain than just size, but also clearly, the brain is big for a reason. In fact scale is one of the very few things we can say for certain plays a big role in the function of the brain. GOFAI notions of embedding grammars and knowledge webs are just guesses on faith—the evidence points precisely in the opposite direction—and don't really make much sense anyway.
Yes, I mean, that's cool, but I'm just thinking that this isn't quite the AI we were thinking about before.
It's like a 'new form of storage and lookup' as opposed to the kind of 'magic algorithm' we usually think of when we think of AI. Or maybe that's just me.
Magic algorithms don’t exist. That’s the true reality of AI. I also felt the same thing and became dismayed as an undergrad at what was going on in computation having also studied neurobiology. But as it turns out, a shit ton of data with some stats can get you very, very far.
I didn't mean 'magic' (I know it's not that) - I just meant to imply that I think of AI as a 'function' not a 'lookup'.
Inputs -> Outputs not Search -> Response
Like if you train an AI on a small dataset, it feels like what it's doing afterwards is a 'function' or 'algorithm' using what is, in the end, some arcane algebra.
But if you train on all the data in the world, with a trillion parameters ... well ... I kind of feel that 'all the data' is sitting in an AI-style data structure that we are 'querying' with AI.
> If every synapse was equivalent to just one parameter in ANN, then we could be just magic memoization machines.
It's a poor comparison, though, since neurons and synapses aren't the same as parameters in a computational network. It's the same trap that news outlets routinely fall into when citing "storage capacity of the brain" in TBs and other such nonsense.
You could make something like an "expert system" with it. But since it doesn't seem to link back to the source, you'd never know if it was giving you a real answer or just making something up.
Not exactly. People can be asked to provide their sources or reasoning, which an informed user can easily check and verify.
This is not the case with black-box models. They only point to an answer/give a reply without any justification or reasoning behind it. This is actually a very severe problem with black-box models: ultimately they cannot be trusted, because it's very hard to verify whether the learned objective function matches the intended objective function (this is called the alignment problem [1].)
Optimisers tend to produce Clever Hans instances whenever they can, because it's the cheapest and therefore most optimal solution. Even if this becomes obvious from failure cases (e.g. common misclassifications in image recognition systems), it's still not obvious which clues the system used that led to the misclassification.
This is in contrast to a person, who can be queried as to why and how they arrived at their conclusion.
Would GDPR or other regulations apply to the data these models are trained on? Is it not a risk that the model will record some private information from someone's email?
Yes, it has been shown [1] that these models can memorize information even if it only appears once in the training data. It could potentially cause issues with privacy and especially copyright. However, in this case it's not relevant because they didn't release the model weights.
I feel like a future version of GPT will easily be able to answer 'What is the home address of ...@gmail.com', given the abundance of data it is trained on and the widespread proliferation of leaked databases.