
GPT-3: A Disappointing Paper? - reedwolf
https://www.greaterwrong.com/posts/ZHrpjDc3CepSeeBuE/gpt-3-a-disappointing-paper
======
cs702
All valid points, but I disagree with the conclusion, for several reasons:

* First of all, the GPT-3 authors successfully trained a model with 175 billion parameters. I mean, 175 billion. The previous largest model in the literature, Google’s T5, had "only" 11 billion. Models with trillions of weights are suddenly looking... achievable. That's a significant experimental accomplishment.

* Second, the model achieves competitive results on many NLP tasks and benchmarks without finetuning, using only a context window of text for instructions and input. There is only unsupervised (i.e., autoregressive) pretraining. AFAIK, this is the first paper that has reported a model doing this. It's a significant experimental accomplishment that points to a future in which general-purpose NLP models could be used for novel tasks from the get-go, without any additional training (see the sketch after this list).

* Finally, the model’s text generation fools human beings without having to cherry-pick examples. AFAIK, this is the first paper that has reported a model doing this. It's another significant experimental accomplishment.
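Here's a minimal sketch of what "instructions and input in a context window" looks like in practice. GPT-3's weights aren't public, so this stand-in uses GPT-2 via Huggingface's transformers library; the prompt and task are illustrative, and a small GPT-2 will mostly fail at it, which is part of the paper's point about scale:

    # Few-shot prompting: the task is specified entirely in the prompt;
    # there is no finetuning and no gradient update. Stand-in model: GPT-2.
    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")

    prompt = (
        "Translate English to French.\n"
        "sea otter => loutre de mer\n"
        "plush giraffe => girafe en peluche\n"
        "cheese =>"
    )
    # The model's continuation is read off as its "answer" to the last item.
    print(generator(prompt, max_new_tokens=5)[0]["generated_text"])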

More generally, I find that some AI researchers and practitioners with strong
theoretical backgrounds tend to dismiss this kind of paper as "merely"
engineering. I think this tendency is misguided. We _must_ build giant
machines and gather experimental evidence from them -- akin to physicists who
build giant high-energy particle colliders to gather experimental evidence
from them.

I'm reminded of Rich Sutton's essay, "The Bitter Lesson:"

[http://www.incompleteideas.net/IncIdeas/BitterLesson.html](http://www.incompleteideas.net/IncIdeas/BitterLesson.html)

~~~
Al-Khwarizmi
> More generally, I find that some AI researchers and practitioners with
> strong theoretical backgrounds tend to dismiss this kind of paper as
> "merely" engineering.

It's just that this kind of work is more interesting to a general member of
the public than to an AI researcher.

As a human being I find it really interesting to see where this kind of model
can take us. I was amazed playing with GPT-2 online demos and seeing to what
extent it could generate text that looked like something a human could
produce. With its quirks and problems, but still impressive. And I can't wait
to get my hands on a GPT-3 online demo.

But as an NLP academic researcher (and this is not hypothetical, I'm actually
one), what do I learn from this paper? What importance does it have for my
research? Actually very little. You need more than 350 GB to fit the 175B
parameters in memory; currently the largest GPU I can access has 24 GB (and I
can access only one of those, which I use to barely run BERT-large). The cost
of training the model in the cloud is estimated to be $12 million
([https://twitter.com/eturner303/status/1266264358771757057](https://twitter.com/eturner303/status/1266264358771757057)).
That's a single training run, not including any neural architecture search,
bug fixing, etc. So even though my funding situation is not bad at all for an
academic researcher, I'm a couple of orders of magnitude away from being able
to do anything meaningful with models of this size, and I can't expect that to
change for at least 8-10 years (by which point, at the pace NLP evolves, this
will be ancient history anyway).
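To spell out that arithmetic (a back-of-envelope sketch; my numbers, not
anything from the paper):

    # Weight storage for 175B parameters vs. one 24 GB GPU (rough figures).
    params = 175e9
    for name, bytes_per_param in [("FP32", 4), ("FP16", 2)]:
        size_gb = params * bytes_per_param / 1e9
        print(f"{name}: {size_gb:.0f} GB, ~{size_gb / 24:.0f}x a 24 GB card")
    # FP32: 700 GB, ~29x a 24 GB card
    # FP16: 350 GB, ~15x a 24 GB card

And that is just holding the weights; training needs several times more for
gradients and optimizer state.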

On the other hand, it's true that you often learn useful ideas from papers
that you can apply in your own work even without implementing the same models,
but that's not the case here either. The lesson learned is "bigger is better",
and since I cannot train these enormous models, there is not really much here
that I can apply.

So as an academic researcher, there really isn't a lot to do with this apart
from shrugging, basically setting it aside, and keeping on trying to do our
best with what we have. Which is still useful, at least if we don't want NLP
applications to be in the hands of an oligopoly of megacorps and restricted
only to the few most economically viable languages.

~~~
andi999
I do other HPC stuff, so I am wondering: why are you limited to 1 GPU?
Windows recognizes up to 8, I think.

~~~
freeone3000
The 24GB card mentioned is almost certainly an RTX Titan; those are $3000
each. Just the card. Second, training frameworks like Megatron can distribute
to multiple GPUs in the same computer as if they were on different machines,
but the naive trainer is greatly helped by NVLink in order to actually pool
the memory and greatly improve throughput, which means V100s, which are $5000
each. (Also, people use Linux for ML.)
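For a sense of what "distribute to multiple GPUs in the same computer" means
at its most naive, here is a toy sketch of layer-wise model parallelism (not
Megatron's actual tensor-parallel scheme; assumes two CUDA devices):

    # Split the layers across two GPUs so neither holds all the parameters.
    # Activations cross the interconnect (PCIe or NVLink) at the .to() calls,
    # which is exactly where NVLink helps.
    import torch
    import torch.nn as nn

    class TwoGPUNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.part1 = nn.Linear(1024, 4096).to("cuda:0")
            self.part2 = nn.Linear(4096, 1024).to("cuda:1")

        def forward(self, x):
            x = torch.relu(self.part1(x.to("cuda:0")))
            return self.part2(x.to("cuda:1"))

    out = TwoGPUNet()(torch.randn(8, 1024))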

~~~
p1esk
The average cost of an NLP researcher is probably around $300k/year. If
buying ten $5k cards makes them twice as productive, then it's a no-brainer.

~~~
mattkrause
The average cost of an _academic_ NLP researcher is probably closer to
$30k/year.

~~~
p1esk
Most universities have access to supercomputers, including GPU clusters. But
that's not the point; not every NLP problem requires experimenting with
175B-parameter models.

Academic researchers shouldn't try to compete with Google or OpenAI in scaling
up models. They should try to come up with new approaches. Our brains have
been evolving under tight constraints (size, energy, noise, etc.). Maybe a
good academic problem to solve is "how can I do what GPT-3 does if I only have
an 8-GPU workstation?" This might lead to all kinds of breakthroughs.

------
Voloskaya
> Transformers are extremely interesting. And this is about the least
> interesting transformer paper one can imagine in 2020.

Because it's not a transformer paper.

The goal of this paper was to see how far an increase in compute can continue
to deliver an increase in model performance. There is no better way to study
this than to take a very well known architecture and keep it as unchanged as
possible; otherwise it becomes very hard to know what is due to the increased
size of the model and what is due to the tweaks you made.

So yes, it's a disappointing paper if you expect it to be on a different topic
than what it is.

------
strin
> “GPT-3” is just a bigger GPT-2. In other words, it’s a straightforward
> generalization of the “just make the transformers bigger” approach

Yes, it’s true. But there is a difference between what’s interesting and what
works. Deep learning (RNNs, transformers, etc.) is usually old ideas applied
at large scale with slight modifications. Proving a model works well at large
scale (175B parameters) is a great contribution and measures our progress
towards AI.

------
gambler
_Article > One of their experiments, “Learning and Using Novel Words,” strikes
me as more remarkable than most of the others and the paper’s lack of focus on
it confuses me._

This sort of "learning" is not necessarily real learning and it's not new for
GPT-3. Even reduced GPT-2 willingly used made-up terms from the prompt in its
results:

[https://medium.com/@VictorBanev/interrogating-gpt-2-345m-aaff8dcc516d](https://medium.com/@VictorBanev/interrogating-gpt-2-345m-aaff8dcc516d)

Search the article for 'Now I will feed it the same thing, but with a bunch of
made-up terms.' It has some examples of how that stuff worked.

I've already posted this in the original discussion of the GPT-3 paper and I
will post it again: statements about whether some system "learns new words" or
"does math" require hypothesis formulation and testing. It astounds me that
many people in the ML community not only don't do this sort of thing, but
actively oppose the very idea that it is necessary.

Recently there was a great live-stream from DarkHorse talking about this
problem in science in general:

[https://www.youtube.com/watch?v=QvljruLDhxY](https://www.youtube.com/watch?v=QvljruLDhxY)

They talk about "data-driven" science and the fundamental problems with that
notion.

------
victor9000
My biggest problem with GPT-3 is that it's not going to be accessible
(practically speaking) to the general public. There's been a recent push to
democratize this type of work with libraries like Huggingface transformers,
but models this large will force the benefits of this work back into the ivory
tower.

------
The_rationalist
Meanwhile, a revolutionary paper that for the first time successfully brought
a new paradigm (latent variational autoencoders) to NLP, and that destroys
GPT-3 on text perplexity on the Penn Treebank (4.6 vs 20) with orders of
magnitude fewer parameters, is talked about nowhere on the web...

[https://arxiv.org/abs/2003.02645v2](https://arxiv.org/abs/2003.02645v2)

~~~
psb217
FYI, the MELBO bound in that paper is invalid. Their perplexity numbers using
the MELBO bound are also invalid.

~~~
The_rationalist
Where is the error? How much would that change their score of 4.6?

~~~
psb217
The bound is completely invalid, as are the NLL/PPL numbers they report with
the MELBO. Look at the equation. If they optimized it directly, it would be
trivially driven to 0 by the identity function if we used a latent space
equivalent to the input space. The MELBO just adds noiseless autoencoder
reconstruction error to a constant offset equal to log of the test set size.
This can be driven to zero by evaluating an average bound over test sets of
size 1.

The mathematical/conceptual error is that they are assuming each test point is
added to the "post-hoc aggregated" prior when they evaluate the bound. This is
analogous to including a test point in the training set. Another version of
this error would be adding a kernel centered on each test point to a kernel
density estimator prior to evaluating test set NLL. In this case, obviously
the best kernel has variance 0 and assigns arbitrarily high likelihood to the
test data.
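To make the KDE analogy concrete, here is a toy sketch (my own illustration,
not code from the paper): leaking each test point into the density estimate
drives the test NLL arbitrarily low as the kernel bandwidth shrinks, even
though nothing has been learned.

    # Toy version of the "test point leaked into the prior" error, using a
    # Gaussian kernel density estimator. Illustrative numbers only.
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    train, test = rng.normal(size=100), rng.normal(size=20)

    def avg_nll(points, centers, bandwidth):
        # Average negative log-likelihood of `points` under a KDE whose
        # kernels are centered on `centers`.
        dens = norm.pdf(points[:, None], loc=centers[None, :],
                        scale=bandwidth).mean(axis=1)
        return -np.log(dens).mean()

    for bw in [1.0, 0.1, 0.001]:
        fair = avg_nll(test, train, bw)  # prior from training data only
        leaky = avg_nll(test, np.concatenate([train, test]), bw)
        print(f"bw={bw:<6} fair NLL={fair:8.2f}  leaky NLL={leaky:8.2f}")
    # As bw -> 0 the leaky NLL keeps dropping (toward -infinity in the
    # limit) while the honest NLL blows up: the "improvement" is leakage.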

~~~
The_rationalist
Interesting, thx!

------
krzyk
What does GPT mean? I assume it is not about partition tables (GUID Partition
Table); it has something to do with NLP, but beyond that it is hard to find
out what this acronym means.

~~~
Smaug123
Supposedly it stands for "Generative Pretrained Transformer", but nobody ever
expands the acronym; it's a language model, originally released by OpenAI and
announced at
[https://openai.com/blog/language-unsupervised/](https://openai.com/blog/language-unsupervised/)
and
[https://openai.com/blog/better-language-models/](https://openai.com/blog/better-language-models/).

------
reddickulous
It would be cool if there were a platform to crowdsource compute resources to
train stuff like this, so that regular people (without 7-figure budgets) could
have access to these models, which are becoming increasingly out of reach of
the general public.

~~~
mryab
Here is a recent paper (disclaimer: I am the first author) named
"Learning@home" which proposes something along these lines. Basically, we
develop a system that allows you to train a network with thousands of
"experts" distributed across hundreds of consumer-grade PCs or more. You don't
have to fit 700GB of parameters on a single machine, and there is
significantly less network delay than for synchronous model-parallel training.
The only thing you sacrifice is the guarantee that all batches will be
processed by all required experts.

You can read it on arXiv
([https://arxiv.org/abs/2002.04013v1](https://arxiv.org/abs/2002.04013v1)) or
browse the code here:
[https://github.com/learning-at-home/hivemind](https://github.com/learning-at-home/hivemind).
It's not ready for widespread use yet, but the core functionality is stable
and you can see what features we are working on now.
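For readers unfamiliar with the "experts" framing, here is a toy,
single-machine mixture-of-experts sketch (my illustration, not the
Learning@home code; in the real system each expert lives on a different
volunteer machine, and some requests may never arrive):

    # Toy mixture-of-experts: a gate picks the top-k experts per input, and
    # only those experts run. In Learning@home the experts are remote, so
    # "every selected expert actually ran" is the guarantee you give up.
    import torch
    import torch.nn as nn

    class TinyMoE(nn.Module):
        def __init__(self, dim=64, num_experts=4, top_k=2):
            super().__init__()
            self.experts = nn.ModuleList(
                nn.Linear(dim, dim) for _ in range(num_experts))
            self.gate = nn.Linear(dim, num_experts)
            self.top_k = top_k

        def forward(self, x):  # x: (batch, dim)
            weights, idx = self.gate(x).softmax(-1).topk(self.top_k, dim=-1)
            out = torch.zeros_like(x)
            for b in range(x.size(0)):  # route each input to its experts
                for w, e in zip(weights[b], idx[b]):
                    out[b] += w * self.experts[int(e)](x[b])
            return out

    print(TinyMoE()(torch.randn(8, 64)).shape)  # torch.Size([8, 64])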

------
drcode
Newbie question: if/when models the size of GPT-3 are released to the general
public, will average people be able to run them on their PCs, as they can with
GPT-2? Or will that basically be impossible without expensive specialty
hardware?

~~~
6gvONxR4sf7o
The big one is 175 billion parameters. With your hardware's usual 32 bit
floats, that's a 700GB model. You won't be using the big one for a while.

~~~
p1esk
This one uses FP16, so you just need a server with >350GB of RAM. 512GB of
DDR4 would set you back around two grand. The total cost of a server for this
would probably be under $5k. Comparable to a good gaming rig.

~~~
sillysaurusx
A TPU can allocate 300GB without OOMing on the TPU's CPU. That's tantalizingly
close to 350GB. And 300GB + 8 cores * 8GB = 364GB.

It'll take some _work_, but I think I can come up with something clever to
dump samples on a TPUv2-8, i.e. the free one that comes with Colab.

Realistically, I don't think OpenAI will release the model. Why would they?
And I'm not sure they'd dare use "it might be dangerous" as an excuse.

~~~
p1esk
Have you (or anyone) tried running GPT-2 inference in INT8 precision? Perhaps
worth looking at one of these efforts:
[https://www.google.com/search?q=running+transformer+in+int8&oq=running+transformer+in+int8](https://www.google.com/search?q=running+transformer+in+int8&oq=running+transformer+in+int8)
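For what it's worth, here is a minimal sketch of what INT8 inference could
look like with PyTorch's dynamic quantization (illustrative; it uses a generic
transformer encoder rather than GPT-2, whose Huggingface implementation uses
custom Conv1D layers that this API would not touch):

    # Dynamic INT8 quantization: nn.Linear weights are stored as INT8 and
    # activations are quantized on the fly; CPU inference only.
    import torch
    import torch.nn as nn

    layer = nn.TransformerEncoderLayer(d_model=768, nhead=12)
    model = nn.TransformerEncoder(layer, num_layers=12).eval()

    quantized = torch.quantization.quantize_dynamic(
        model, {nn.Linear}, dtype=torch.qint8)

    x = torch.randn(16, 1, 768)  # (seq_len, batch, d_model)
    with torch.no_grad():
        print(quantized(x).shape)  # torch.Size([16, 1, 768])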

------
jkhdigital
> it would represent a kind of non-linguistic general intelligence ability
> which would be remarkable to find in a language model

As a relative outsider to this field, I don’t really see the stark line
between natural language and general intelligence implied by this statement.
Language is just abstractions encoded in symbols, and general intelligence is
just the ability to construct and manipulate abstractions. Seems reasonable to
think that these are two sides of the same coin.

Put another way, natural language is the product of general intelligence.

~~~
remexre
I think the line you'd see is that there exists some task where the
language-based model suddenly lacks the ability to perform the task, despite
the fact that it "should."

I'd conjecture that this might include something like describing where places
are in relation to each other, and asking it to describe a route. (Not an NLP
expert, but I work with AI folks; this task was chosen as an example because
it seems like something you'd want a planner for rather than anything MLful.)

------
bitL
I think the main disappointment is that we humans aren't that special, when a
brute-forced scalable transformer is getting into our ballpark. We have also
recently seen how OpenAI + MS were able to use a GPT variant for automated
text-description-to-Python-code generation, and utilizing something like GPT-3
for that task might render many swengs obsolete fairly soon.

------
ericjang
I could not disagree more with this post. To summarize what the author is
unhappy with:

1) "It’s another big jump in the number, but the underlying architecture
hasn’t changed much... it’s pretty annoying and misleading to call it “GPT-3.”
GPT-2 was (arguably) a fundamental advance, because it demonstrated the power
of way bigger transformers when people didn’t know about that power. Now
everyone knows, so it’s the furthest thing from a fundamental advance."

2) "The “zero-shot” learning they demonstrated in the paper – stuff like
“adding tl;dr after a text and treating GPT-2′s continuation thereafter as a
‘summary’” – were weird and goofy and not the way anyone would want to do
these things in practice... They do better with one task example than zero
(the GPT-2 paper used zero), but otherwise it’s a pretty flat line; evidently
there is not too much progressive “learning as you go” here."

3) "Coercing it to do well on standard benchmarks was valuable (to me) only as
a flamboyant, semi-comedic way of pointing this out, kind of like showing off
one’s artistic talent by painting (but not painting especially well) with just
one’s non-dominant hand."

4) "On Abstract reasoning..So, if we’re mostly seeing #1 here, this is not a
good demo of few-shot learning the way the authors think it is."

--------- My response:

1) The fact that we can get so much improvement out of something so "mundane"
should be cause for celebration, rather than disappointment. It means that we
have found general methods that scale well, and a straightforward recipe for
brute-forcing our way through problems we haven't solved before.

At this point it becomes not a question of possibility, but of engineering
investment. Isn't that the dream of an AI researcher? To find something that
works so well you can stop "innovating" on the math stuff?

2) Are we reading the same plot? I see an improvement after >16 shots.

I believe the point of that setup is to illustrate that any model trained to
make sequential decisions can be regarded as "learning to learn", because the
arbitrary computation in between sequential decisions can incorporate
"adaptive feedback". It blurs the semantics between "task learning" and
"instance learning".

3) This is a fair point actually, and perhaps now that models are doing better
(no thanks to people who spurn big compute), we should propose better metrics
to capture general language understanding.

4) It's certainly possible, but you come off as pretty confident for someone
who hasn't tried running the model and testing its abilities.

Who is the author, anyway? Are _they_ capable of building systems like GPT-3?

~~~
master_yoda_1
I think we should not argue over whether anybody is capable or not. Here most
of the work was done by Azure engineers, and none of them are mentioned in the
paper. So no, even OpenAI can't do it without Azure infrastructure.

~~~
sanxiyn
OpenAI wrote a new GPU kernel to do this. It's collaborative work.

