* First of all, the GPT-3 authors successfully trained a model with 175 billion parameters. I mean, 175 billion. The previous largest model in the literature, Google’s T5, had "only" 11 billion. Models with trillions of weights are suddenly looking... achievable. That's a significant experimental accomplishment.
* Second, the model achieves competitive results on many NLP tasks and benchmarks without finetuning, using only a context window of text for instructions and input. There is only unsupervised (i.e., autoregressive) pretraining. AFAIK, this is the first paper to report a model doing this. It's a significant experimental accomplishment that points to a future in which general-purpose NLP models could be applied to novel tasks without any additional training.
* Finally, the model’s text generation fools human beings without having to cherry-pick examples. AFAIK, this is the first paper that has reported a model doing this. It's another significant experimental accomplishment.
More generally, I find that some AI researchers and practitioners with strong theoretical backgrounds tend to dismiss this kind of paper as "merely" engineering. I think this tendency is misguided. We must build giant machines and gather experimental evidence from them, much as physicists build giant high-energy particle colliders for exactly that purpose.
I'm reminded of Rich Sutton's essay, "The Bitter Lesson:"
This paper implements an architecture that will be out of reach for me for about 5 years. So, I ask myself "why will this paper matter in 5 years?"
There are two reasons I can imagine:
1. It shows that there is no phase change in the size-performance trend already documented over many orders of magnitude.
2. It was used as input data for pruning or distillation algorithms rooted in a better understanding of why language models work.
If the NLP community remains laser-focused on hill-climbing compute-agnostic benchmarks, I don't think there will be enough people working on 2.
If 1 is the only reason it matters, I struggle to see how it is worth the cost in high-end research talent. It feels like standard low-risk corporate iteration.
3. First evidence that performance continues to improve at hundreds of billions of weights -- paving the way for trillions of weights, an order of magnitude comparable to that of the human brain connectome.
4. First evidence (AFAIK) that larger NLP models do not need task-specific finetuning -- paving the way for general-purpose models that work well on any NLP task without additional training.
5. First evidence (AFAIK) that larger NLP models fool human beings without cherry-picking -- paving the way for models that can pass ever more challenging Turing tests.
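The "size-performance trend" in point 1 is usually summarized as a power law in parameter count: a straight line in log-log space, with "no phase change" meaning the line doesn't bend. A minimal sketch of fitting and extrapolating such a trend -- the sizes and losses below are made-up illustrative numbers, not real GPT results:

```python
import math

# Hypothetical (parameter_count, validation_loss) pairs spanning several
# orders of magnitude -- illustrative only, not measured values.
sizes = [1e8, 1e9, 1e10, 1e11]
losses = [4.0, 3.2, 2.56, 2.048]   # here, each 10x in size cuts loss by 20%

# Fit log10(loss) = a + b * log10(size) by least squares; a single
# straight line across all points is what "no phase change" looks like.
xs = [math.log10(s) for s in sizes]
ys = [math.log10(l) for l in losses]
n = len(xs)
b = (n * sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys)) / \
    (n * sum(x * x for x in xs) - sum(xs) ** 2)
a = (sum(ys) - b * sum(xs)) / n

# Extrapolate the same line out to a trillion-parameter model.
predicted_loss = 10 ** (a + b * math.log10(1e12))
```

A bend in the fitted line at large sizes would be exactly the "phase change" the trend has so far failed to show.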
GPT-1, -2, and -3 have all shown that larger is better. While in the short term this means people will simply throw larger and larger clusters at the problem, in the longer term there needs to be innovation in making it more efficient on the clusters we have (as even the cloud has limits).
I think sheer parameter count is an important part of the equation in general intelligence, so it's important that there are labs that work on scaling up promising leads to trillions of parameters on top of labs thinking of new promising directions.
It's just that this kind of work is more interesting to a general member of the public than to an AI researcher.
As a human being I find it really interesting to see where these kinds of models can take us. I was amazed playing with GPT-2 online demos and seeing the extent to which it could generate text that looked like what a human could produce -- with its quirks and problems, but still impressive. And I can't wait to get my hands on a GPT-3 online demo.
But as an NLP academic researcher (and this is not hypothetical, I actually am one), what do I learn from this paper? What importance does it have to my research? Actually, very little. You need more than 350 GB of memory to fit the 175B parameters; the largest GPU I can currently access has 24 GB (and I can access only one of those, which I use to (barely) run BERT-large). The cost of training the model in the cloud is estimated at $12 million (https://twitter.com/eturner303/status/1266264358771757057) -- and that's a single training run, not including any neural architecture search, bug fixing, etc. So even though my funding situation is not bad at all for an academic researcher, I'm a couple of orders of magnitude away from being able to do anything meaningful with models of this size, and I can't expect that to change for at least 8-10 years (by which point, at the pace NLP evolves, this will be ancient history anyway).
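The 350 GB figure is straightforward back-of-the-envelope arithmetic, assuming 2 bytes per weight at half precision (fp32 doubles it):

```python
# Rough memory needed just to *hold* GPT-3's weights -- no activations,
# gradients, or optimizer state, which make training several times worse.
n_params = 175e9               # 175 billion parameters
fp16_gb = n_params * 2 / 1e9   # 2 bytes per weight at half precision
fp32_gb = n_params * 4 / 1e9   # 4 bytes per weight at single precision

# A 24 GB GPU holds only a small slice of that, even at fp16:
gpus_needed = fp16_gb / 24     # roughly 15 such GPUs for the weights alone
```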
On the other hand, you often learn useful ideas from papers that you can apply yourself even without implementing the same models, but that's not the case here either. The lesson is "bigger is better," and I cannot train these enormous models, so there is not much here that I can apply.
So as an academic researcher, there really isn't much to do with this apart from shrugging, dismissing it, and continuing to do our best with what we have. Which is still useful, at least if we don't want NLP applications to be in the hands of an oligopoly of megacorps and restricted to the few most economically viable languages.
BTW, I like the clever username, Al-Khwarizmi: https://en.wikipedia.org/wiki/Muhammad_ibn_Musa_al-Khwarizmi
I know I'm cherry-picking your post, sorry, but this line kind of stood out to me as funny but intriguing.
Doesn't something like this go without saying? Or is GPT-3 advanced enough that we must now distinguish ourselves from the robots and the dogs?
> But as an NLP academic researcher
It is in opposition to that.
Academic researchers shouldn't try to compete with Google or OpenAI in scaling up models. They should try to come up with new approaches. Our brains have been evolving under tight constraints (size, energy, noise, etc). Maybe a good academic problem to solve is "how can I do what GPT-3 does if I only have an 8 GPU workstation?" This might lead to all kinds of breakthroughs.
Rich Sutton is a great scientist but he is fooled by randomness. He initiated his research program just as Moore's law was taking off. Thanks to Moore, his approach saw incredible success and brought him deserved acclaim. But just as Moore's law is pulling the rug from under him, he is using his stature to claim that no other approach but his can work.
> One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great.
Rich used to be very bullish on neural nets, then somewhat dismissive of them (due to the fragility/inadequacy of FCNs), and then increasingly enthusiastic as the renewed interest demonstrated that those problems could be overcome -- e.g., through better initialization, training, and (as you note) different architecture choices.
His main concern was whether a method could keep working as more resources became available, as otherwise you would tautologically end up with something short of true artificial intelligence.
The important thing is that the technique can scale with increasing data or compute without hitting a hard or soft limit.
I thought HN visitors were not capable of that level of naive thinking.
The analogy is a bit off to me. As far as I can tell, there was significant impetus from within particle physics to commit a huge amount of resources and political effort toward verifying theories with experiment. I don't see anything similar in deep learning, because in this case the "theory" is mostly that "bigger is probably better". I think that idea is pretty uncontroversial for stuff like this. And if the work reduces to marshaling enough resources, what exactly is it?
We should give OpenAI some credit for doing the damn thing, but as-is the result seems like an answer to a question people weren't really asking.
Moore's law is running on fumes at this point. The complexity of further scaling has reached geopolitical proportions. We need to get back to looking at more creative models in both the software and hardware domains.
the model achieves competitive results on many NLP tasks and benchmarks without finetuning
The article dismisses this result with the following analogy:
“No, my 10-year-old math prodigy hasn’t proven any new theorems, but she can get a perfect score on the math SAT in under 10 minutes. Isn’t that groundbreaking?”
And I tend to agree. We've had a game of benchmarking brinkmanship for a while now. At what point are we going to see some groundbreaking applications?
That would be pretty foolish, given the fact that every hand-crafted model eventually gets surpassed by brute force. A better use of time would be tackling whatever you mean by "complexity of further scaling has reached geopolitical proportions". I'm not a fan of it, as it is terribly inelegant, but denying the years of consistent brute-force wins would just be silly.
The best strategy for any nation with an interest in AI (be it economic or something much more skynety) would be securing two things very quickly: fabrication capacity and nuclear power, because this stuff is going to be measured in megawatts -- not ANN layers. Improving the efficiency of that conversion would certainly be helpful, but history has shown that to be a lower priority; just take a look at how ridiculously deep software stacks are compared to 20 years ago. I really wish the linguists had been proven right in the 1970s...
Who said anything about hand crafted AI models? I’m talking about revisiting our models of computation. Moore’s law has long made it impossible to challenge the dominance of Von Neumann. Perhaps what we need to make further progress is some sort of decentralized, busless computer? Who knows?
Also, Moore's Law is running on fumes, yes, but there's quite a bit of R&D focused on coming up with hardware that massively scales up (e.g., by more efficiently parallelizing) the dense and sparse multiply-sum operations common to so many AI models. I think Sutton's point about models that leverage computation is spot-on.
> It's a significant experimental accomplishment that points to a future in which general-purpose NLP models could be used for novel tasks without requiring additional training from the get-go.
This premise is still purely science fiction. This model touches on neither novel tasks nor freedom from pretraining (unless I misunderstand). But overall, I think you’re right: it’s significant for a number of reasons.
For each task, the authors feed in a context window of text with zero to a few sample queries and responses, followed by a query without a response. The model generates a response for the last query. BTW, this approach is analogous to what you would do with a human being: provide zero to a few sample questions and answers, and then ask a question.
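The setup described above is plain prompt construction; a minimal sketch (the task, examples, and helper name are illustrative stand-ins, not OpenAI's API):

```python
def build_few_shot_prompt(task_description, examples, query):
    """Assemble a few-shot prompt: an instruction, K solved examples
    ("shots"), then the unanswered query. Zero-shot is examples=[]."""
    chunks = [task_description]
    for q, a in examples:              # the K demonstration pairs
        chunks.append(f"Q: {q}\nA: {a}")
    chunks.append(f"Q: {query}\nA:")   # the model completes this answer
    return "\n\n".join(chunks)

prompt = build_few_shot_prompt(
    "Translate English to French.",
    [("cheese", "fromage"), ("dog", "chien")],
    "house",
)
# The prompt becomes the model's context window; its generated
# continuation after the final "A:" is taken as the answer.
```

No gradient update happens anywhere in this loop, which is exactly what makes the "no finetuning" result notable.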
I am having some trouble seeing the benefits in light of the mess that is coming.
Because it's not a transformer paper.
This paper's goal was to see how far an increase in compute can continue to deliver an increase in model performance.
There is no better way to study this than to take a very well known architecture and keep it as unchanged as possible; otherwise it becomes very hard to know what is due to the increased size of the model and what is due to the tweaks you make.
So yes, it's a disappointing paper if you expect it to be on a different topic than what it is.
Yes, it’s true. But there is a difference between what’s interesting and what works. Deep learning (RNNs, transformers, etc.) is usually old ideas applied at large scale with slight modifications. Proving a model works well at large scale (175B parameters) is a great contribution and measures our progress toward AI.
This sort of "learning" is not necessarily real learning, and it's not new for GPT-3. Even the reduced GPT-2 readily used made-up terms from the prompt in its results:
Search the article for 'Now I will feed it the same thing, but with a bunch of made-up terms.' It has some examples of how that stuff worked.
I've already posted this in the original discussion of the GPT-3 paper and I will post it again: statements about whether some system "learns new words" or "does math" require hypothesis formulation and testing. It astounds me that many people in the ML community not only don't do this sort of thing, but even actively oppose the very idea that it is necessary.
Recently there was a great live-stream from DarkHorse talking about this problem in science in general:
They talk about "data-driven" science and the fundamental problems with that notion.
The mathematical/conceptual error is that they are assuming each test point is added to the "post-hoc aggregated" prior when they evaluate the bound. This is analogous to including a test point in the training set. Another version of this error would be adding a kernel centered on each test point to a kernel density estimator prior to evaluating test set NLL. In this case, obviously the best kernel has variance 0 and assigns arbitrarily high likelihood to the test data.
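The kernel-density version of the error is easy to demonstrate numerically. Below is a minimal sketch with a hand-rolled Gaussian KDE (my own illustrative code, not the paper's model): once a kernel is centered on the test point itself, shrinking the bandwidth drives the "likelihood" of that point arbitrarily high, even though nothing was learned about it.

```python
import math

def kde_density(x, centers, h):
    """Gaussian kernel density estimate at x with bandwidth h."""
    n = len(centers)
    return sum(
        math.exp(-0.5 * ((x - c) / h) ** 2) / (h * math.sqrt(2 * math.pi))
        for c in centers
    ) / n

train = [0.0, 1.0, 2.0]
x_test = 5.0                      # far from every training point

# The error: include the test point itself as a kernel center.
centers = train + [x_test]

# As the bandwidth shrinks, the density assigned to the test point
# explodes -- the degenerate "best" kernel has variance zero.
densities = [kde_density(x_test, centers, h) for h in (1.0, 0.1, 0.01)]
```

An honest evaluation would compute `kde_density(x_test, train, h)`, with the test point excluded from the centers.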
The paper you linked seems interesting on paper, but does it bring any new SOTA?
Can you point me to some example of generated text this model produced? Something similar in quality to that unicorn story from GPT-2?
but does it bring any new SOTA?
Looking at their tables, it seems so. The code is open source, and there's a demo at https://mosaickg.apps.allenai.org/
So it is a totally new model that will probably keep evolving and being applied to more and more kinds of NLP tasks. And it seems that it can take first place on most NLP tasks; empirically, it's the breakthrough of the year.
It achieves this while having an extremely small number of parameters, which shows:
* The model is smarter
* The model has room for more parameters, and hence even more accuracy!
Finally, theoretically it is a breakthrough as it is a port of a computer vision technology (variational autoencoders) to the NLP world.
Actually it might be the successor to the Transformer paradigm.
I wonder what pile of incremental improvements researchers will be able to bring to it, as they have with the Transformer paradigm (SpanBERT, XLNet, etc.)
VAEs are not really exclusively a vision thing; they have been used in a variety of settings. Using VAEs for NLP is also nothing new -- an early example is Bowman et al., 2015.
You can read it on ArXiv https://arxiv.org/abs/2002.04013v1 or browse the code here: https://github.com/learning-at-home/hivemind. It's not ready for widespread use yet, but the core functionality is stable and you can see what features we are working on now.
It'll take some work, but I think I can come up with something clever to dump samples on a TPUv2-8. i.e. the free one that comes with Colab.
Realistically, I don't think OpenAI will release the model. Why would they? And I'm not sure they'd dare use "it might be dangerous" as an excuse.
As a relative outsider to this field, I don’t really see the stark line between natural language and general intelligence implied by this statement. Language is just abstractions encoded in symbols, and general intelligence is just the ability to construct and manipulate abstractions. Seems reasonable to think that these are two sides of the same coin.
Put another way, natural language is the product of general intelligence.
I'd conjecture that this might include something like describing where places are in relation to each other, and asking it to describe a route. (Not an NLP expert, but work with AI folks; this task chosen as an example because it seems like something you'd want a planner for rather than anything MLful.)
1) "It’s another big jump in the number, but the underlying architecture hasn’t changed much... it’s pretty annoying and misleading to call it “GPT-3.” GPT-2 was (arguably) a fundamental advance, because it demonstrated the power of way bigger transformers when people didn’t know about that power. Now everyone knows, so it’s the furthest thing from a fundamental advance."
2) "The “zero-shot” learning they demonstrated in the paper – stuff like “adding tl;dr after a text and treating GPT-2′s continuation thereafter as a ‘summary’” – were weird and goofy and not the way anyone would want to do these things in practice... They do better with one task example than zero (the GPT-2 paper used zero), but otherwise it’s a pretty flat line; evidently there is not too much progressive “learning as you go” here."
3) "Coercing it to do well on standard benchmarks was valuable (to me) only as a flamboyant, semi-comedic way of pointing this out, kind of like showing off one’s artistic talent by painting (but not painting especially well) with just one’s non-dominant hand."
4) "On abstract reasoning... So, if we’re mostly seeing #1 here, this is not a good demo of few-shot learning the way the authors think it is."
1) The fact that we can get so much improvement out of something so "mundane" should be cause for celebration, rather than disappointment. It means that we have found general methods that scale well and a straightforward recipe for brute-forcing our way through problems we haven't solved before.
At this point it becomes not a question of possibility, but of engineering investment. Isn't that the dream of an AI researcher? To find something that works so well you can stop "innovating" on the math stuff?
2) Are we reading the same plot? I see an improvement after >16 shots.
I believe the point of that setup is to illustrate that any model trained to make sequential decisions can be regarded as "learning to learn", because the arbitrary computation in between sequential decisions can incorporate "adaptive feedback". It blurs the semantics between "task learning" and "instance learning".
3) This is a fair point actually, and perhaps now that models are doing better (no thanks to people who spurn big compute), we should propose better metrics to capture general language understanding.
4) It's certainly possible, but you come off as pretty confident for someone who hasn't tried running the model and trying to test its abilities.
Who is the author, anyway? Are they capable of building systems like GPT-3?