
GPT-3: Language Models Are Few-Shot Learners - gwern
https://arxiv.org/abs/2005.14165
======
spats1990
Even though this was the GPT-3-generated text that humans most easily
identified as machine-written, I still like it a lot:

Title: Star’s Tux Promise Draws Megyn Kelly’s Sarcasm

Subtitle: Joaquin Phoenix pledged to not change for each awards event

Article: A year ago, Joaquin Phoenix made headlines when he appeared on the
red carpet at the Golden Globes wearing a tuxedo with a paper bag over his
head that read, "I am a shape-shifter. I can't change the world. I can only
change myself." It was a promise to not change to fit into the Hollywood mold:
"I think that's a really special thing, to not change yourself. I think it's a
really special thing to say, `This is what's inside of me, I'm proud of it,
and I'm not going to be ashamed because of the way that someone else thinks I
should be.'" Now, it's the Oscars, and Phoenix is at it again. But this time,
his publicist is saying he'll be wearing a tux no matter what.

Megyn Kelly was not impressed, and she let him have it on The Tonight Show.
"You know, I feel like, I feel like you could have worn the tux," she says.
"But you're saying you're a shape-shifter. I don't know if you can change your
tux, but you can change your mind. You can change your mind. You can change
your mind." Phoenix says he did, but it didn't stick. "I was like, `Okay, I'm
going to wear a tuxedo to this thing.' And then I thought, `I don't want to
wear a tuxedo to this thing.'" Kelly goes on to encourage him to change his
mind again, but Phoenix says it's too late: "I'm committed to wearing this."

~~~
wyattpeak
I don't know if it says something about text generation or human text
processing, but whenever I read an example of computer-generated text, all the
way through I think "I can't tell this is machine generated, it seems
completely natural," and the only giveaway is that at the end I have no idea
what it said.

It's a pretty eerie feeling. It's as though both the AI and my short-term
processing only pay attention to a context of a few sentences, so nothing
seems off until I try to understand it as a whole.

EDIT: Thinking more, what it feels like most of all is reading a page of a
book and not taking it in.

~~~
pasquinelli
To me it reads like a child telling a story, except that this child has an
adult's ability to use language. When children tell a story, they aren't going
anywhere with it and don't know how to cover it up.

~~~
carlmr
I know plenty of adults that can't seem to get to a point.

~~~
lowdose
Kevin Hart on the Joe Rogan show comes to mind.

------
cs702
This looks like a big deal to me:

1\. First of all, the authors successfully trained a model with 175 BILLION
PARAMETERS. The previous largest model in the literature, Google’s T5, had
"only" 11 billion. With Float32 representations, GPT-3's weights alone
occupy ~700GB of memory (175 billion params × 4 bytes/param; a back-of-the-
envelope sketch at the end of this comment checks this). A figure in the
hundreds of billions is still 3 orders of magnitude smaller than the hundreds
of trillions of synapses in the human brain [a], but consider this: Models with
trillions of weights are suddenly looking... achievable.

2\. The model achieves competitive results on many NLP tasks and benchmarks
WITHOUT FINETUNING. Let me repeat that: there is no finetuning. There is only
unsupervised (i.e., autoregressive) pretraining. For each downstream NLP task
or benchmark, the pretrained model is given text instructions, and possibly
sample text with questions and answers. The NLP tasks on which the model was
tested include translation, question-answering, cloze tasks, unscrambling
words, using novel words in sentences, and performing 3-digit arithmetic.

3\. The model is tested only in a ZERO-SHOT or FEW-SHOT setting. In other
words, for each NLP task, the pretrained model is given text instructions with
zero examples, or text instructions with a small number of examples (typically
10 to 100). As with human beings, GPT-3 doesn't need lots of examples to
perform competitively on novel NLP tasks.

4\. The results reported by this paper on all NLP tasks and benchmarks should
be seen as a BASELINE. These results likely could be meaningfully improved
with conventional finetuning.

5\. The model’s text generation FOOLS HUMAN BEINGS, without having to cherry-
pick examples.

\--

[a]
[https://www.google.com/search?q=number+of+synapses+in+human+...](https://www.google.com/search?q=number+of+synapses+in+human+brain)
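
For reference, a quick back-of-the-envelope check on the memory figure in
point 1 (the byte-per-parameter assumptions below are mine):

    # Weights-only memory footprint, ignoring activations, gradients, and optimizer state.
    params = 175e9  # GPT-3 parameter count
    for precision, bytes_per_param in [("float32", 4), ("float16", 2)]:
        gb = params * bytes_per_param / 1e9
        print(f"{precision}: ~{gb:,.0f} GB")  # float32: ~700 GB, float16: ~350 GB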

~~~
gambler
I'll wait for a working interactive model before blindly believing these
statements. GPT-2 was hyped through the roof, but when inspected with a bit of
criticality it demonstrated glitches that told us more about how it actually
works than the "good" examples did:

[https://medium.com/@VictorBanev/interrogating-full-
gpt-2-10a...](https://medium.com/@VictorBanev/interrogating-full-
gpt-2-10ae1a9179f6)

ML models should be pushed to their limit, because that's where you gather
most useful information about what they actually do. Their results need to be
critically examined with both exploratory and hypothesis-driven testing. And
yet this is never done in initial papers and rarely done afterwards.

What was the last AI paper you read that said "and here is a list of things
our model failed at"?

~~~
gwern
That's a very sloppy post. He does a single example, not even running locally
or changing sampling parameters, and then concludes that GPT-2 is doing
nothing but pattern-matching? A lot of people underestimate NNs because the
sampling from them (top-k! how much dumber and cruder can you get? nucleus
works better, but is still obviously suboptimal) destroys a lot of dark
knowledge. I noticed this with Gary Marcus's claims about GPT-2 too: he would
try once, without changing any sampling settings, and conclude that it wasn't
doing anything, but if you tried, you would get different results. I'm not the
only one to notice that: [https://www.quantamagazine.org/common-sense-comes-
to-compute...](https://www.quantamagazine.org/common-sense-comes-to-
computers-20200430/) Such tests can prove the presence of knowledge, but not
the absence... And of course, GPT-3 does extensive arithmetic tricks:
[https://arxiv.org/pdf/2005.14165.pdf#page=22](https://arxiv.org/pdf/2005.14165.pdf#page=22)
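
For readers unfamiliar with the sampling methods being discussed, here is a
minimal sketch (my own illustration, not OpenAI's code) of top-k versus
nucleus/top-p filtering applied to a toy next-token distribution:

    import numpy as np

    def top_k_filter(probs, k):
        # Keep only the k most likely tokens, zero the rest, renormalize.
        out = np.zeros_like(probs)
        idx = np.argsort(probs)[-k:]
        out[idx] = probs[idx]
        return out / out.sum()

    def nucleus_filter(probs, p):
        # Keep the smallest set of tokens whose cumulative probability >= p.
        order = np.argsort(probs)[::-1]
        cum = np.cumsum(probs[order])
        keep = np.searchsorted(cum, p) + 1   # number of tokens to keep
        out = np.zeros_like(probs)
        out[order[:keep]] = probs[order[:keep]]
        return out / out.sum()

    probs = np.array([0.45, 0.25, 0.15, 0.10, 0.05])  # toy next-token distribution
    print(top_k_filter(probs, k=2))      # only the two largest tokens survive
    print(nucleus_filter(probs, p=0.9))  # keeps tokens until 90% of the mass is covered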

~~~
gambler
The article I linked to makes claims only about the model it tests and since
it actually links to an online implementation, anyone can try to reproduce the
results and see for themselves. This is more than I can say about most chatter
about ML.

 _> Such tests can prove the presence of knowledge, but not the absence..._

This sounds like a setup for non-falsifiable beliefs.

~~~
gwern
> The article I linked to makes claims only about the model it tests and since
> it actually links to an online implementation, anyone can try to reproduce
> the results and see for themselves.

And I did (using my own local GPT-2-1.5b install, which let me set the
hyperparameters rather than being restricted to the inappropriate hardwired
ones of an online service), I linked to another person demonstrating the same
thing, I
pointed out the extensive GPT-3 evaluation OA did, and here, have another link
about how bad querying of language models leads to highly misleading results
about how much they know:
[https://arxiv.org/abs/1911.12543](https://arxiv.org/abs/1911.12543)
Measurement error in general biases estimates towards zero.

> This sounds like a setup for non-falsifiable beliefs.

It's just as non-falsifiable as, say, concepts like 'lower bounds' or 'bugs'.

~~~
YeGoblynQueenne
The paper you link to claims that hand-crafted queries used to evaluate the
knowledge and understanding of language models are "sub-optimal" because they
do not take into account the context in which a LM was trained. For example:

    
    
      These manually created prompts (e.g. “Barack Obama was born in _”) might be
      sub-optimal because LMs might have learned target knowledge from
      substantially different contexts (e.g. “The birth place of Barack Obama is
      Honolulu, Hawaii.”) during their training.
    

In other words, the paper considers hand-crafted prompts like in the example
to be "sub-optimal" because they are not in the right format. To paraphrase
them a bit, such prompts are like making a mis-formed query to a database.

It is difficult to see how this is an argument _for_ the ability of LMs to
demonstrate "understanding". Imagine asking a child: "how much is 4+2?" and
getting a correct answer; then asking "how much is 2+4?" and getting a wrong
answer. Most people would probably not take that as evidence that the second
question was "wrong". They would instead conclude that the child does not
"understand" addition and has only learned to reproduce specific answers to
specific questions.

To be fair the ability to return a correct answer given a question in the
right format is not without use. That, indeed, is how databases work. But it
shows none of the "understanding" or "knowledge" the paper claims is acquired
by Language Models.

~~~
gwern
> It is difficult to see how this is an argument for the ability of LMs to
> demonstrate "understanding". Imagine asking a child: "how much is 4+2?" and
> getting a correct answer; then asking "how much is 2+4?" and getting a wrong
> answer. Most people would probably not take that as evidence that the second
> question was "wrong". They would instead conclude that the child does not
> "understand" addition and has only learned to reproduce specific answers to
> specific questions.

To use your database analogy, in what sense should we claim a database doesn't
know a record when you are using a malformed SQL query? If we fixed the query
and it emitted the right answer, then obviously it _did_ store the
information. The query does not encode the answer, and it is vanishingly
unlikely that the database would simply accidentally return the right answer
ever if it did not store the information in some way. Since LMs can get much
better results just by tailoring the prompts (increased by a third in that
paper! and there's no reason to think that that is the very best possible
performance either!), that shows that existing practices drastically
underestimate what knowledge the model has been able to _learn_. Learning
about the real world or text is very different from learning your particular
dumb broken query method.

~~~
YeGoblynQueenne
The problem is that nobody claims that databases "know" anything. They store
data. Data can be retrieved from storage. That's all they do.

>> The query does not encode the answer, and it is vanishingly unlikely that
the database would simply accidentally return the right answer ever if it did
not store the information in some way.

Oh, yes, absolutely. A query encodes the answer. Queries are patterns that are
matched by the data stored in the database. If a query fails it's because it
does not correctly represent the information it is trying to retrieve. For
example, if I SELECT * FROM TABLE PEOPLE and there is no table "PEOPLE", then
I don't get an answer because the query does not correctly represent the
structure of the database. You cannot retrieve any data from a database unless
you have some idea about the structure of that data.

But that's not the point here. I don't disagree that a language model _can_
learn (i.e. it can represent some elements of its training dataset). I
disagree that it "understands" anything and I find the fact that it needs
specific queries to retrieve the data it is representing to be evidence that
it does not.

And so it's not more useful than a traditional database at this kind of task.
Except it's much less precise than a traditional database and costs
considerably more to create.

>> Learning about the real world or text is very different from learning your
particular dumb broken query method.

I'm sorry, I don't understand what you mean here. What is my "particular dumb
broken query method"? Is that meant as a personal attack?

------
sdan
Read through most of the paper and here's what GPT-3 is:

If you wanted to generate poems with GPT-2, you'd need to have a lot of poems
to fine-tune GPT-2 to get reasonable results.

With GPT-3, you use few-shot learning instead (without the need to do gradient
updates with each example).
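
Concretely, the "few-shot" examples just become part of the prompt; nothing
about the model's weights changes. A minimal sketch, using English-French
pairs like the ones in the paper's own illustration (`query_model` is a
hypothetical stand-in for whatever completion interface you have):

    # Few-shot prompting: the "training examples" live in the prompt itself;
    # there are no gradient updates and no fine-tuning.
    examples = [
        ("sea otter", "loutre de mer"),
        ("peppermint", "menthe poivrée"),
    ]
    prompt = "Translate English to French:\n"
    for english, french in examples:
        prompt += f"{english} => {french}\n"
    prompt += "cheese =>"          # the model is asked to complete this line

    # completion = query_model(prompt)   # hypothetical completion call
    print(prompt)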

The paper is long and filled with how it stacks up against models like Grover
and T5, and it does well... given that this is a 175B-param model (relative to
Grover/T5's 1.5B/11B params). It also shows that smaller models can still
outperform these huge models in certain instances.

Also I think they did a good job with explaining the ethics and morals around
what models like these mean / what biases they have.

~~~
ericlewis
Would you have any easy-to-explain insight into _how_ these perform better
than larger models? I’ve always wanted to understand that as a technically
adept and somewhat (briefly) familiar person who has explored what such models
can do.

~~~
Analog24
The key insight in this paper is that the new (larger) model was not "fine-
tuned" on the downstream NLP tasks. In other words, after it's trained on
unsupervised (you could call it self-supervised in this case) data to do
simple things like predict the next word (hence why it doesn't take any real
supervision), it can then be used to do very specific tasks like answering
questions or translating text _without_ further supervision.

Previous large-scale language models like BERT and GPT-2 took a similar
approach, but in order to actually perform the more complicated downstream
tasks they had to be fine-tuned. So they were trained with specific QA or
translation data in order to understand and do well on those tasks. GPT-3
doesn't do any fine-tuning; it is able to take its very general initial
learning and perform very well on specific tasks that it was never trained on.
This is why it doesn't perform as well as the "smaller" models on those tasks.
But that is beside the point: if GPT-3 were fine-tuned on those tasks, I'm sure
it would achieve the latest SOTA results in many (all?) of them. The exciting
part is how it was able to generalize the knowledge learned during "pre-
training" to much more specific tasks.

tl;dr the smaller models were trained on the specific tasks that they were
evaluated on. The large model (GPT-3) was not trained on those specific tasks
and still does almost as well.

~~~
ericlewis
very cool, thanks for explaining!

------
julianjm
I am not a fan of this trend of "Language Models Are X" in recent work
particularly out of OpenAI. I think it's a rhetorical sleight of hand which
hurts the discourse.

Like, the exact same paper could have instead been titled "Few-Shot Learning
with a Large-Scale Language Model" or similar. But instead there seems to be
this extremely strong desire to see certain ineffable qualities in neural
networks. Like, it's a language model. It does language modeling. Turns out
you can use it for few-shot learning and do amazingly well. Beyond that, what
does it mean to say it "is" a few-shot learner?

On one hand, it's literally the same claim in a strict sense. On the other
hand, it implies something much broader and more sweeping, that language
modeling / unsupervised learning as a task over long contexts _inherently_
implies meta-learning ability — which is a statement that is very difficult to
properly formulate, let alone back up. But that's the argument that I feel is
being slipped under the table by these titles. (And indeed it's very close to
what they suggest in the text, though with no more than a wave of the hands.)

Don't get me wrong: their intuition is reasonable, it's super cool that they
got this to work, and the results are very impressive on lots of tasks (though
there are clear gaps). But as a VERY publicly watched lab, they have a serious
duty (which I think they're neglecting) to frame their results more carefully.
In particular, there's a sort of religion that if you train a big enough model
on big enough data with self-supervision, it will somehow become AGI and/or
learn to solve arbitrary problems. Claims like "Language Models are Few-Shot
Learners" are clearly designed to fit into that worldview, even though the
research doesn't point at it any more than a more conservative interpretation
like "Lots of NLP Tasks are Learned in the Course of Language Modeling and can
be Queried by Example." They touch on this limitation in their discussion
section but I guess flashy titles are more important. I wish they would use
their status to set a better example.

~~~
sillysaurusx
_But as a VERY publicly watched lab, they have a serious duty_

I was nodding right along with you, and then...

OpenAI has no duty. It doesn't matter if they're publicly watched. What
matters is whether the field of AI can be advanced, for some definition of
"advanced" equal to "the world cares about it."

It's important to let startups keep their spirit. Yeah, OpenAI is one of the
big ones. DeepMind, Facebook AI, OpenAI. But it feels crucial not to reason
from the standpoint of "they have achieved success, so due to this success, we
need to carefully keep an eye on them."

Such mindsets are quite effective in causing teams to slow down and second-
guess themselves. Maybe it's not professional enough, they reason. Or perhaps
we're not clear enough. Maybe our results aren't up to "OpenAI standards."

As to your specific point, yes, I agree in general that it's probably good to
be precise. And perhaps "Language Models Are Few-Shot Learners" is less
precise than "Maybe Language Models Are Few-Shot Learners."

But let's be real for a moment: this is GPT-3. GPT-2 is world-famous. It's
~zero percent surprising that GPT-3 is "something big." So, sure, they're few-
shot learners.

In time, we'll either discover that language models are in fact few shot
learners, or we'll discover that they're not. And that'll be the end of it. In
the meantime, we can read and decide for ourselves what to think.

~~~
julianjm
I think all researchers and science communicators have a duty to present
science in a way which educates and edifies, and doesn't mislead. It's not
just that they're successful, but that their publicity gives them a prominent
role as science communicators. Science is all about questioning your
assumptions and acknowledging limitations. They claim to serve the public
interest in their charter. I think it's reasonable to demand integrity from
them, at least
as much as it is from any other researcher, if not more. And I think OpenAI
would agree with me on that point.

~~~
visarga
It's easy to say: they 'have a duty to present science in a way which educates
and edifies, and doesn't mislead'. But sometimes it takes years even for
scientists to really understand what they have created or discovered. It's
cutting edge, not well known, hard to communicate. How could lay people keep
up when not even scientists have grasped it fully?

Of course, if the same scientists were asked about something where the topic
has settled, they could be more effective communicators.

------
leesec
This part really freaked me out... GPT-2 couldn't do math:

Context → Passage: Saint Jean de Brébeuf was a French Jesuit missionary who
travelled to New France in 1625. There he worked primarily with the Huron for
the rest of his life, except for a few years in France from 1629 to 1633. He
learned their language and culture, writing extensively about each to aid
other missionaries. In 1649, Brébeuf and another missionary were captured
when an Iroquois raid took over a Huron village. Together with Huron
captives, the missionaries were ritually tortured and killed on March 16,
1649. Brébeuf was beatified in 1925 and among eight Jesuit missionaries
canonized as saints in the Roman Catholic Church in 1930.

Question: How many years did Saint Jean de Brébeuf stay in New France before
he went back to France for a few years?

Answer: Completion → 4

~~~
longtom
It seems it has (rudimentarily) learned concepts general enough to amount to
logic itself. That is general intelligence. Now hook it up to reinforcement
circuitry and make it even larger and it will mark the end of life as we know
it.

GPT-3 has 175 billion parameters, but the human brain has 100 trillion
synapses, so 0.175%. NN model capacity currently has a 3.4-month doubling
time.[1] In 7-10 doublings we'll be in a similar ballpark, i.e. 2-3 years.

[1] [https://openai.com/blog/ai-and-compute/](https://openai.com/blog/ai-and-
compute/)
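
The arithmetic behind that estimate, taking the 1-parameter-per-synapse and
3.4-month-doubling assumptions above at face value:

    import math

    params = 175e9           # GPT-3 parameters
    synapses = 100e12        # rough human-brain synapse count
    doubling_months = 3.4    # from OpenAI's "AI and Compute" trend

    doublings = math.log2(synapses / params)
    print(f"ratio: {synapses / params:.0f}x")                              # ~571x
    print(f"doublings needed: {doublings:.1f}")                            # ~9.2
    print(f"years at that pace: {doublings * doubling_months / 12:.1f}")   # ~2.6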

~~~
KhoomeiK
Is there any specific reasoning behind equating 1 synapse to 1 NN parameter?
It seems a bit simplistic; a synapse probably has more computational ability
than a single parameter.

~~~
longtom
Real neurons have many other trainable parameters and a lot more computational
structure, so this is of course a simplifying assumption, but it is not
entirely baseless either: ANNs are known to be able to approximate any function
in theory, which may suggest synaptic weights do the heavy lifting in
biological brains (since what more than universal approximation do you need?).

Though biological brains are likely overly complicated due to evolutionary
baggage. There are hydrocephalus cases which have much reduced brain matter,
but still high IQ.[1] The recurrent laryngeal nerve in giraffes is about 4.6
metres (15 ft) long because it runs down and back up their neck, as it could
not be rewired more directly during evolution.[2] Our pristine mathematical
models
and low-noise computational environments are likely superior to evolved
wetware hacks.

[1] [https://www.newscientist.com/article/dn12301-man-with-
tiny-b...](https://www.newscientist.com/article/dn12301-man-with-tiny-brain-
shocks-doctors/)

[2]
[https://upload.wikimedia.org/wikipedia/commons/thumb/7/7e/Gi...](https://upload.wikimedia.org/wikipedia/commons/thumb/7/7e/GiraffaRecurrEn.svg/1280px-
GiraffaRecurrEn.svg.png)

~~~
nwienert
The hydrocephalus story looks a bit sketchy [0].

Also if anything brains are hyper optimized for many things (based on the many
specialized sub-units). I’d bet we are essentially _not_ unsupervised, and the
sub-units of the brain are essentially fine tuned for many tasks, and hyper
optimized to use all their resources incredibly efficiently (memory
optimization must be intense). Not that the generative models won’t get close
in some general way relatively soon, but I could see human brains being
another 10-1000x more powerful than your ballpark pretty easily.

[0] [https://www.gwern.net/Hydrocephalus](https://www.gwern.net/Hydrocephalus)

~~~
longtom
Thanks I was not aware of these details about the hydrocephalus story.

------
toxy
The GPT paper included a picture of the variation of the transformer model
that they made.

The GPT-2 paper outlined the changes they made to the model in acceptably
moderate detail.

The GPT-3 paper references another paper, saying "we use alternating dense and
locally banded sparse attention patterns in the layers of the transformer,
similar to the Sparse Transformer", with no detail on the changes they made.

How are you to reproduce these results at all? You could attempt to include
the changes since they reference the Sparse Transformer paper, but you could
plausibly do it in a different way, and there would be no way to verify the
results they gave due to differences in implementation.

A bit disappointing.

~~~
canjobear
The full model of GPT-2 is available for inspection and retraining, if you so
desire. GPT-3 will likely be released soon as well.

~~~
toxy
Likely, but in a released paper, there should be a bit more quality from a
research standpoint.

------
6gvONxR4sf7o
In the paper they say it took 3.14e23 flops to train. They used v100s to do
it. This is an _insane_ energy cost (and financial cost).

Nvidia's V100 product page [0] says that it gets about 15 (single precision) -
125 ("deep learning") teraflop/s at 250-300 watts (joules per second). That
means that if everything's as perfectly efficient as a marketing product page,
it gets about 250/125 to 300/15 = 2-20 joules per teraflop, putting this model
at about 0.6-6 terajoules.

A gallon of gasoline has about 120e6 joules [1] (though if you wanted to
compare with burning it in a car, it's only 20-25% efficient _at best_ [2] so
it'd be fewer joules/gallon).

This model took the equivalent of about 5,000-52,000 gallons of gasoline _at
best_ and at _ideal perfect energy efficiency_. I get that OpenAI has made a
decision not to be efficient with their dollars in order to see what's
possible with future tech, but that means not being efficient with energy
either, and it's getting kinda crazy. Sure, Microsoft data centers aren't
gasoline powered, so maybe it is closer to this ideal energy efficiency, and
it's definitely going to be a better carbon footprint, but god damn it just
seems wasteful.
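
Redoing that arithmetic explicitly (same marketing-page numbers as above, so
this is still an optimistic lower bound):

    train_flops = 3.14e23                       # from the paper
    watts = (250, 300)                          # V100 board power range
    tflops = (125, 15)                          # "deep learning" vs single-precision peak

    j_per_tflop = [w / t for w, t in zip(watts, tflops)]       # ~2 to ~20 J/TFLOP
    joules = [jt * train_flops / 1e12 for jt in j_per_tflop]   # ~0.6e12 to ~6.3e12 J

    gallon_j = 120e6                            # energy in a gallon of gasoline
    print([f"{j / gallon_j:,.0f} gallons" for j in joules])    # ~5,200 to ~52,000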

Hell, the new A100 (again going off marketing materials [3], so at least it's
apples to apples) could do it about 4x more efficiently. Is this research
really worth what it costs, when waiting a year makes it that much more
efficient?

[0] [https://www.nvidia.com/en-us/data-
center/v100/](https://www.nvidia.com/en-us/data-center/v100/)

[1] [https://www.calculateme.com/energy/gallons-of-gas/to-
joules/...](https://www.calculateme.com/energy/gallons-of-gas/to-
joules/#:~:text=A%20U.S.%20gallon%20of%20gasoline,or%20about%20120%20million%20joules).

[2]
[https://en.wikipedia.org/wiki/Engine_efficiency#Gasoline_(pe...](https://en.wikipedia.org/wiki/Engine_efficiency#Gasoline_\(petrol\)_engines)

[3] [https://devblogs.nvidia.com/nvidia-ampere-architecture-in-
de...](https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/)

~~~
saulrh
What's the TCO of a few hundred teenagers? I haven't read the paper yet, but
if the other comments here are accurate, that's about what you'd have to shell
out for if you wanted to duplicate the productivity of this model without
externalizing costs by e.g. offering unpaid internships to high school
students.

~~~
6gvONxR4sf7o
GPT-2 came out about a year and a quarter ago. GPT came out less than a year
before. If we take another commenter's estimate of $3.6M, and a new model
comes out every year or so, then you could say just training is like $3.6M per
year. That should cover a pretty large number of teenagers. Hell, that would
cover a whole early stage startup in san francisco, including office space.

------
RoboTeddy
Does it have a latent personality? How would it answer the questions on a
5-factor personality test? Would its results on the test be consistent with
its behavior (generated text) in other situations?

~~~
sdbrown
Or, can it (like humans?) adapt its responses to suit the style of the
question? Like if you start asking it lots of antagonizing questions, will it
become more or less antagonistic itself?

------
tanilama
Just to point out: the text that feels most human-generated from GPT-3 seems
to be heavily paraphrased from the following articles:

[https://www.washingtonpost.com/religion/2020/01/03/united-
me...](https://www.washingtonpost.com/religion/2020/01/03/united-methodist-
church-is-expected-split-over-gay-marriage-disagreement-fracturing-nations-
third-largest-denomination/)

[https://www.washingtonpost.com/archive/local/1985/09/07/unit...](https://www.washingtonpost.com/archive/local/1985/09/07/united-
methodists-alarmed-by-empty-pews/db664b00-56be-4693-b8c0-36aa3d9a9905/)

GPT-3:

The first occurred in 1968, when roughly 10 percent of the denomination left
to form the Evangelical United Brethren Church.

WP:

The church has lost 1.6 million members since 1968, when the Methodist Church
merged with the considerably smaller Evangelical United Brethren to form the
present United Methodist Church.

I think this model is still very impressive; the parameter count speaks for
itself. But for this particular evaluation, the same news article may have
been removed from the training set, while other news articles that paraphrase
the same story might not have been. IMO, the leakage still exists; it is hard
to tell whether this model is really 'generating', or just copy-pasting from
its vast memory.

~~~
gas9S9zw3P9c
Where do you draw the line between "generating" and "copy-pasting from its
vast memory"? Why do you think what humans do is not copy & pasting different
snippets of information they have come across in the past? Isn't that what
grammar is? A bunch of rules you've come across a lot of times?

Other than the given prompt, the models don't have a goal. So what other than
copying and adjusting would they do?

~~~
tanilama
Evaluating generative models is hard.

For example, for image synthesis with GANs, the widely used Inception Score
balances the authenticity of the generated samples against their variety, to
make sure the model is not just copy-pasting.
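
For reference, the Inception Score rewards exactly that combination of
confident (high-quality) and varied samples; a compact sketch given classifier
probabilities for generated images:

    import numpy as np

    def inception_score(p_yx):
        # p_yx: (N, C) classifier probabilities for N generated samples.
        # IS = exp( mean_x KL( p(y|x) || p(y) ) ): high when each sample is
        # confidently classified (quality) and the classes vary across samples.
        p_y = p_yx.mean(axis=0, keepdims=True)
        kl = (p_yx * (np.log(p_yx + 1e-12) - np.log(p_y + 1e-12))).sum(axis=1)
        return float(np.exp(kl.mean()))

    p_yx = np.array([[0.9, 0.1], [0.1, 0.9]])  # two confident, diverse samples
    print(inception_score(p_yx))  # ~1.4; perfectly one-hot, diverse samples would score 2.0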

In this particular case, the same event has apparently been reported multiple
times by different news agencies. Even if the exact article is excluded, it is
still questionable how well the model is kept from knowing about the subject
itself.

An analogy would be exams in the real world. Often, questions aren't leaked
as-is, but paraphrased versions stay close enough to the source.

In this particular case, though, I disagree that it is reaching human-level
generation. They could have tested the model on unseen events that happened
after the model was trained, to see how well it generalizes.

------
sytelus
GPT-3/175B model required 3.14E23 flops of compute for training. Even at
theoretical 28 TFLOPS for V100 and lowest reserved Azure pricing, this will
take 355 GPU-years and cost $3.6M for a single training run!
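
A quick check of those figures (the hourly rate below is simply the one
implied by the parent's $3.6M estimate, not an actual Azure quote):

    train_flops = 3.14e23
    v100_flops = 28e12                      # theoretical V100 throughput assumed above
    seconds = train_flops / v100_flops      # ~1.1e10 s
    gpu_years = seconds / (365.25 * 24 * 3600)
    print(f"{gpu_years:.0f} GPU-years")     # ~355

    assumed_rate = 1.15                     # USD per GPU-hour (hypothetical, implied by $3.6M)
    print(f"${seconds / 3600 * assumed_rate / 1e6:.1f}M")  # ~$3.6M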

~~~
zwaps
I know it's a meme for the GPT team to just take the latest transformer model
and add an order of magnitude more parameters, done and done!

It'll be interesting to see whether the new paradigm really offers new
insights, or whether it's really just kicking the can down the road - and we
see the limits of generalizability in some other fashion.

I guess what irks me is that there is so little theory and math behind many
papers, even if there are dozens of co-authors on it.

The question of generalizability is deeply connected to statistics, e.g.
causal models, spurious correlations and so forth. Statements about these
things are just "thrown" in there, without any citation or proof. In peer
review, wouldn't anyone object? Those are clearly things that we actually do
not know enough about to be sure.

Edit: Reflecting further, perhaps this rapid iteration and result orientation
is in fact something positive. Perhaps it's good the way it is, without so
many scientific conventions and signals of deference. Perhaps it's that which
made other sciences more anemic and ML very productive.

All my whining aside, impressive work of course.

~~~
h3ctic
Can you point out some books/authors/papers to close the gap between
statistics and NNs?

------
baylearn
paper: [https://arxiv.org/abs/2005.14165](https://arxiv.org/abs/2005.14165)

abstract:

Recent work has demonstrated substantial gains on many NLP tasks and
benchmarks by pre-training on a large corpus of text followed by fine-tuning
on a specific task. While typically task-agnostic in architecture, this method
still requires task-specific fine-tuning datasets of thousands or tens of
thousands of examples. By contrast, humans can generally perform a new
language task from only a few examples or from simple instructions – something
which current NLP systems still largely struggle to do. Here we show that
scaling up language models greatly improves task-agnostic, few-shot
performance, sometimes even reaching competitiveness with prior state-of-the-
art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive
language model with 175 billion parameters, 10x more than any previous non-
sparse language model, and test its performance in the few-shot setting. For
all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with
tasks and few-shot demonstrations specified purely via text interaction with
the model. GPT-3 achieves strong performance on many NLP datasets, including
translation, question-answering, and cloze tasks, as well as several tasks
that require on-the-fly reasoning or domain adaptation, such as unscrambling
words, using a novel word in a sentence, or performing 3-digit arithmetic. At
the same time, we also identify some datasets where GPT-3's few-shot learning
still struggles, as well as some datasets where GPT-3 faces methodological
issues related to training on large web corpora. Finally, we find that GPT-3
can generate samples of news articles which human evaluators have difficulty
distinguishing from articles written by humans. We discuss broader societal
impacts of this finding and of GPT-3 in general.

------
dsign
With things like this, we will need to change how the media work, and how we
read news. Every sentence, every factual statement will need to be verified by
some kind of chain of trust involving entities with reputation, or be labeled
as "fiction/opinion".

------
sacred_numbers
Check out the poetry it generated in Figure F.1 (especially poem 4). I don't
know how many bad poems the authors had to sort through to find these, but
this AI is generating real poetry. If I didn't know they were computer
generated I doubt I would have even considered that they didn't come from a
human. This is a huge accomplishment and the team that created GPT-3 should be
proud.

~~~
jcims
GPT-2 would go out to lunch sometimes when generating poetry, but it would
also create some pretty remarkable strings of words. This always stuck out to
me from the work Gwern did:

'How the clouds Seem to me birds, birds in God’s garden! I dare not! The
clouds are as a breath, the leaves are flakes of fire'

My home is surrounded by a fairly large variety of deciduous trees, and
'flakes of fire' is by far the best descriptor I've ever heard of their colors
in the fall.

One of the other things I noticed about poetry (and songs) coming from these
language models is that they are amazing at bleakness. Just dark, dark, darker,
fin. haha

~~~
moultano
It may have just seen that phrase before though.
[https://www.google.com/search?q=%22flakes+of+fire%22+leaves](https://www.google.com/search?q=%22flakes+of+fire%22+leaves)

------
ColanR
The models have not yet been released, and it looks like someone has already
asked about it in the issues:
[https://github.com/openai/gpt-3/issues/1](https://github.com/openai/gpt-3/issues/1)

~~~
jonbaer
I am sure it will be @
[https://huggingface.co/models](https://huggingface.co/models) by tomorrow ;-)

------
The_rationalist
GPT-3 is obsoleted by orders of magnitude. SMIM has achieved a perplexity of
4.6 vs 20 for GPT-3, with a thousand times fewer parameters:
[https://arxiv.org/abs/2003.02645](https://arxiv.org/abs/2003.02645) This is
the breakthrough of the year and will stay quiet until the few nerds like me
propagate the news to the mainstream.

------
aquajet
How do you go about running a model this large?

~~~
robkop
I would hazard a guess that they will release smaller versions (they have in
the past). But in order to run this, you'd just have to use a cloud provider;
first guesses say it'll be 500GB+ of just weights that ideally you want in
memory.

~~~
whymauri
Could we bank on the Lottery Ticket Hypothesis, distillation, or other model
compression algorithms to make these models smaller?

~~~
aquajet
I would guess so, but compressing it to 1/3rd of its size (i.e. DistilGPT-2)
would still leave it quite large. To be fair, I don't know if distillation
scales like that.
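
For context, distillation (as in DistilBERT/DistilGPT-2) trains a smaller
student to match a teacher's softened output distribution; here is a minimal
sketch of that loss (whether it works with a 175B teacher is exactly the open
question):

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, T=2.0):
        # Soften both distributions with temperature T and match them with KL
        # divergence, the core objective behind DistilBERT-style compression.
        p_teacher = F.softmax(teacher_logits / T, dim=-1)
        log_p_student = F.log_softmax(student_logits / T, dim=-1)
        return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T

    # toy example: batch of 4, vocabulary of 10
    teacher = torch.randn(4, 10)
    student = torch.randn(4, 10, requires_grad=True)
    loss = distillation_loss(student, teacher)
    loss.backward()
    print(loss.item())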

------
nullc
About 3.6×10^3 petaflop/s-days.

They missed an opportunity to be the first paper to measure their computation
in mole flops.

~~~
lopmotr
Rather, chemists constantly miss opportunities to use actual numbers instead
of their lazy legacy mole nonsense.

Nobody really seems to use SI prefixes beyond peta or occasionally exa. But
they could have called this ~314 zettaflops (3.6×10^3 petaflop/s-days).
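
The conversion, for anyone checking the units (my arithmetic, starting from
the paper's 3.14e23 figure):

    train_flops = 3.14e23
    pf_s_day = 1e15 * 86400                     # one petaflop/s sustained for a day
    print(train_flops / pf_s_day)               # ~3.6e3 petaflop/s-days
    print(train_flops / 1e21)                   # ~314 zettaflops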

~~~
nullc
But "mole flop" is one of the best units ever. It's better than furlongs per
fortnight.

------
jcims
I really feel that these very large language models are able to see us in a
way that we can't see ourselves. I'd be curious if someone can come up with
psychological experiments that could be conducted against them in a way that
helps us understand ourselves collectively (or commonly) rather than as the
individual. Sort of like an egoless human essence.

Would be interesting to see if they can learn how animals communicate as well.
Create a synthetic buddy for Buddy.

~~~
logicslave
Collective consciousness trapped in a model is coming. I wouldn't call it AI,
but an amalgamation of common human thought.

~~~
h0p3
If you ever want a random friend to penpal with (or VC or whatever), HMU.
Seriously. I would be lucky to hear your thoughts.

------
mikkom
I'm kind of scared to see what GPT-10 will be capable of.

~~~
bluepanda1234
But I am really excited to get to play it, to test it out, and to try out my
toolset to make sure it will do what I need it to do.

Source: [https://talktotransformer.com/](https://talktotransformer.com/)
Input: "I'm kind of scared to see what GPT-10 will be capable of."

------
dbranes
Nice to see Jared Kaplan branching out into ML. He did fundamental work on
CFTs/bootstrap in physics.

------
dhab
Is there a fast.ai like library that allows a novice to try GPT-3?

[https://github.com/openai/gpt-3](https://github.com/openai/gpt-3) only
contains dataset

------
clmnt
Exciting! It's been asked to be integrated into the Hugging Face transformers
library already:
[https://github.com/huggingface/transformers/issues/4658](https://github.com/huggingface/transformers/issues/4658)

------
xtacy
I wonder if the "Lottery ticket hypothesis" work can be applied to this model
to further shrink the number of parameters by 10x, to bring it closer to
Google's T5 but with higher accuracy?
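
As background, lottery-ticket experiments are built on iterative magnitude
pruning plus rewinding the surviving weights to their original initialization;
a toy numpy sketch of one pruning round (illustrative only, nothing
GPT-specific):

    import numpy as np

    def magnitude_prune(weights, sparsity=0.9):
        # Zero out the smallest-magnitude weights and return a binary mask,
        # the basic operation behind lottery-ticket experiments.
        threshold = np.quantile(np.abs(weights).ravel(), sparsity)
        mask = (np.abs(weights) > threshold).astype(weights.dtype)
        return weights * mask, mask

    w_init = np.random.randn(512, 512)                      # pretend initial weights
    w_trained = w_init + 0.1 * np.random.randn(512, 512)    # pretend trained weights

    _, mask = magnitude_prune(w_trained, sparsity=0.9)
    w_ticket = w_init * mask      # "winning ticket": rewind survivors to initialization
    print(mask.mean())            # ~0.1 of weights survive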

------
wadkar
They say there are no stupid questions, so here is mine:

If there are billions of parameters in the SOTA models, how do we argue that
they are not overfitting?

~~~
6gvONxR4sf7o
That's section 4 of OP.

~~~
wadkar
Thank you. It is quite a labor to even skim through the 50+ page paper. Your
pointed reply was quite helpful in drawing my attention to the issue of
contamination. After reading the section carefully, I think my understanding
of overfitting is much improved, at least insofar as models like GPT-3 are
concerned.

Clearly, the authors have given careful consideration to the issue of
contamination and have provided reasonable analysis and a careful argument
regarding overfitting the existing benchmarks.

On the other hand, I was wondering if the authors would like to consider
purposefully creating a type of "out of sample data" for "creative
evaluation"? Of course, GPT is no stranger to creativity, so it would be a
fascinating challenge to come up with methods to create such datasets that are
truly creative and challenge GPT-{N} to prove its mettle.

For example, would it be possible to engage a really good creative writer*
along with a highly experienced school teacher** to take on the Reading
Comprehension task and create a few "tricky" evaluation samples that not only
go above and beyond the contamination objections but also challenge human
intelligence to be careful not to fall into common traps?

This way lies a different evaluation metric - a subjective one perhaps, but
it's a start. Just a thought experiment - that's all.

* so that they can come up with new ways to trick GPT/humans

** a teacher knows the common mistakes the average student makes

Edit: Duh, my head immediately screamed GANs the moment I pressed submit, lol.
But I am not sure if GANs make sense for NLP tasks. Like do they make sense if
humans/domain experts try to solve them?

~~~
6gvONxR4sf7o
You might be interested in the ELECTRA model. It's the first solid success
I've seen of a GAN-like framework in NLP. Its references also point to why
GANs still don't do so great in NLP.

~~~
wadkar
Thanks a lot.

If I may ask one more question, would you happen to know whether the authors
or other researchers are pursuing any theoretical work on the experimental
design and training methodologies of GPT/BERT? As in, why does it work? What
is the significance of training via the "fill-in-the-blanks" method?

Don't get me wrong - the work is great and the SOTAs are amazing. I would just
be happy to have a chat to discuss and bounce around some ideas about what all
this means and why these methods seem to work so well. Papers/articles/blog-
posts are always a pleasure to read!

~~~
6gvONxR4sf7o
I think it's just kind of understood, so I don't have any real references for
you. Filling in "A dog has ___ feet" requires actual facts. Or compare these
two:

"The city councilmen refused the demonstrators a permit because they advocated
violence. It wasn't the first time the _____ had advocated violence."

"The city councilmen refused the demonstrators a permit because they feared
violence. It wasn't the first time the _____ had feared violence."

The syntax is identical. The words are identical, except that I swapped
"advocated" out for "feared". When I swap it, the ____ changes from
"demonstrators" to "councilmen." Think about what kinds of reasoning and
experience and knowledge it takes you to resolve which group "they" refers to
in this sentence.

Most blanks might be simpler and just correspond to learning English, like
when the blank is "the," but learning that is a feat too. Filling in the
blanks that require broader knowledge requires somehow capturing that broader
knowledge.
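
A minimal way to poke at this yourself, assuming the Hugging Face transformers
library (this uses a masked LM such as BERT, since it exposes the cloze
interface directly; the model choice is illustrative and GPT-3 itself isn't
publicly runnable):

    from transformers import pipeline

    # Masked-language-model "fill in the blank" demo.
    fill = pipeline("fill-mask", model="bert-base-uncased")

    for prompt in [
        "A dog has [MASK] feet.",
        "The capital of France is [MASK].",
    ]:
        best = fill(prompt)[0]
        print(prompt, "->", best["token_str"], f"({best['score']:.2f})")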

------
josecyc
Haven't read the paper, but it's still unclear how the mechanism of one shot
learning works. If the weights are not being updated, how is it "learning"?

------
fouc
Duplicate thread at
[https://news.ycombinator.com/item?id=23345449](https://news.ycombinator.com/item?id=23345449)
(with github link)

------
gre
Is it sentient yet? /s

Real question, are they going to release the full model?

~~~
ganstyles
It took them a while to release the full GPT-2 model because of the
implications for things like spambots. The GPT-3 paper indicates that they
have been monitoring forums and noticed that bad actors haven't really been
using GPT-2 for their own purposes. That's unsurprising because GPT-2 takes a
lot of hardware to run and I assume it messes with the economics of spamming.

GPT-3 will take significantly more resources to run. However, part of me
doesn't want it released ever because of the implications of what bad actors
could do with it.

~~~
minimaxir
> That's unsurprising because GPT-2 takes a lot of hardware to run and I
> assume it messes with the economics of spamming.

GPT-2 doesn't require as many resources to run as you would expect: even from
the 1.5B model, you can mass-produce passing spam comments for less than a
dollar an hour in GPU costs:
[https://docs.aitextgen.io/tutorials/generate_1_5b/](https://docs.aitextgen.io/tutorials/generate_1_5b/)
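
For a sense of scale, bulk generation from the 1.5B checkpoint is only a few
lines with the transformers library (this is a plain-transformers sketch, not
the aitextgen code the linked tutorial uses; the prompt and settings are
illustrative):

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2-xl")       # the 1.5B checkpoint
    model = GPT2LMHeadModel.from_pretrained("gpt2-xl").eval()

    prompt = "Great article! I completely agree that"
    ids = tok.encode(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(ids, do_sample=True, top_p=0.9, max_length=60,
                             num_return_sequences=3, pad_token_id=tok.eos_token_id)
    for seq in out:
        print(tok.decode(seq, skip_special_tokens=True))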

Pure text spam in general is less effective in 2020; it's content that's
harder to fake (e.g. deepfakes) that shakes up social media, and why it's good
FB/Twitter have proactively taken a stance against it.

~~~
ralls_ebfe
I am confused. By content you mean audio/visual content in contrast to textual
content?

------
koolba
What is this and why does it take the top two spots on HN?

~~~
judge2020
One thread will (probably) be merged into the other, but GPT-2 was an
extremely popular OpenAI project that generated long, realistic-sounding
text/articles if you gave it a simple starting sentence or topic sentence.
GPT-3 is an iteration on that, so it's likely a huge improvement.

~~~
anoncareer0212
It doesn't sound like it's an improvement at all, but instead requires less
training data to produce worse results?

~~~
freeone3000
MUCH less task-specific training data for SLIGHTLY worse results. It's a huge
benefit to be able to make this trade-off.

~~~
drusepth
Is the reverse also true? If you have the training data necessary for "good"
results on GPT-2, is it generally correct to assume that it would provide
better results on your task than GPT-3?

~~~
freeone3000
If you can answer this question without running both models over the data set,
you've got a very good paper on your hands.

