
GPT-3 - cocoflunchy
https://www.gwern.net/newsletter/2020/05#gpt-3
======
chillee
Personally, as a ML researcher, I find GPT-3 very unsatisfying. There aren't
any novel architectural details, it doesn't improve our "fundamental"
understanding of the field, and it requires the type of computation I have no
chance of getting.

As a fan of the field, however, it is undoubtedly very cool.

I still think there are tasks that GPT-3 has no chance of tackling (say,
generating code to solve a novel programming task), but the bitter lesson is a
bitter one indeed...

~~~
FeepingCreature
It's kind of LHC-like, isn't it? Same physics, "but larger". "Will it work if
we scale it up? Apparently: yes."

I think the big point, and what scares me a bit, is that we have yet to
discover any sort of fundamental conceptual limit to the Transformer
architecture.

~~~
sooheon
Each attention block in the Transformer models a fully connected graph (with
the attention heads being learned edge attributes). A graph is the most
general data structure possible, so yeah, I don't think there's really a
fundamental limitation to them, just a computational one. The latest ICLR
papers explore how they fully model CNNs and RNNs, for example, and I'm sure
papers on their theoretical equivalence with GNNs are coming.
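
To make that concrete, here is a minimal NumPy sketch of a single
self-attention head (shapes and names are illustrative, not from any
particular implementation). The softmaxed score matrix is a dense n-by-n
"adjacency" over tokens, so every token aggregates information from every
other token:

    import numpy as np

    def self_attention(X, Wq, Wk, Wv):
        """X: (n_tokens, d_model); Wq, Wk, Wv: (d_model, d_head)."""
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(K.shape[-1])    # (n, n): one score per token pair
        weights = np.exp(scores - scores.max(-1, keepdims=True))
        weights /= weights.sum(-1, keepdims=True)  # row-wise softmax: the dense "adjacency"
        return weights @ V                         # aggregate values along every "edge"

    rng = np.random.default_rng(0)
    n_tokens, d_model, d_head = 5, 16, 8
    X = rng.normal(size=(n_tokens, d_model))
    Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    print(self_attention(X, Wq, Wk, Wv).shape)     # (5, 8)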

~~~
p1esk
I'm not sure we want to work with "the most general data structure possible".
We could use a Hopfield-like network where every neuron is connected to every
other neuron and to itself; it probably wouldn't be very useful. NN design has
been moving from more general to less general architectures.

~~~
Erlich_Bachman
It is implied that, at least in theory, the promise is to use them at similar
total complexity (number of parameters and amount of required computation), in
which case yes, we do want the most general data structure possible. If we can
have a more general data structure that provides similar performance
characteristics, it is easier to apply, to debug, and to understand, and it
likely means that we have found something more fundamental about the
underlying world in general.

------
throwaway4666
>Like the genomics revolution where a few far-sighted seers extrapolated that
the necessary n for GWASes would increase exponentially & deliver powerful
PGSes soon, while sober experts wrung their hands over “missing heritability”
& the miraculous complexity of biology & scoff about how such n requirements
proved GWAS was a failed paradigm, the future arrived at first slowly and then
quickly. Yet, here we are: all honor to the fanatics, and shame and
humiliation to the critics!

No. Actual geneticists are still pretty skeptical about GWASes because they
tell us almost _nothing_ about, well, the _genetics_ behind complex traits.
It's all well and good running GWASes for literally anything (see:
twitter.com/sbotgwa, or that dude who got a pretty good PGS from correlating a
country's GDP with the genotypes of _Arabidopsis thaliana_ available in that
country), but that's virtually useless for serious research or if you want to
know how genes work.

Actually figuring out to what extent a trait is genetically determined usually
involves much more complex methods (e.g. Mendelian randomization) and knock-
out experiments on animal models, which is all terribly expensive and tedious.
But that's how actual genetics works, not waving a magic wand of +15%
heritability.

~~~
esyir
Unless something has drastically changed in the past five-odd years, Actual
Geneticists (tm) happily use GWAS in many investigations of complex traits.
They're often an early step in the overall pipeline, used to trawl for
potential candidates for the target of interest, after which researchers go on
to said more complex methods.

Why? You've already said so yourself. Those methods are expensive and tedious,
and searching across the entire human genome with them is an exercise in
futility.

~~~
throwaway4666
The important part in your post is

>They're often an _early_ step

I think we're in agreement here; I'm just arguing that "woah, look at all
those correlations" isn't a breakthrough or 'genomic revolution' in any sense
of the word as far as our understanding of human genetics is concerned.

------
roca
I don't know much about ML but I wonder what the need for so much training
data (apparently about 500 billion words for GPT-3) means for this approach.
Humans achieve their performance levels while only ever observing a tiny
fraction of that training data --- at least in word form. I see only two
possibilities:

- Human brains are somehow able to learn from all forms of sensory input and
world interaction in ways that pay off in word tests. (Even then, would the
information processed by an average human match the GPT-3 training corpus?)

- The human brain has a very different architecture that lets us learn vastly
more efficiently from small amounts of training data.

Is there another possibility I'm missing?

If the latter is true, then humans have a durable advantage over GPT-3-like
approaches.
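
For a rough sense of the scale of the gap, here is a back-of-the-envelope
comparison (every number is a loose assumption, including the words-per-day
figure):

    # Assume a person is exposed to ~20,000 words/day for 20 years, versus the
    # ~500 billion words cited above for GPT-3's training corpus.
    human_words = 20_000 * 365 * 20      # ~1.5e8 words
    gpt3_words = 500e9
    print(f"ratio: ~{gpt3_words / human_words:,.0f}x")  # ~3,425x more text for GPT-3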

~~~
api
> Humans achieve their performance levels while only ever observing a tiny
> fraction of that training data

This is really the key detail and the hole in Gwern's argument that AGI is
around the corner. You can't just compare the result. You also have to look at
what it took to train the model and at what the model is actually doing.

If you look at GPT-3's output, it only superficially makes sense. Is there
evidence of true understanding or is it just a really really good text
generator?

Regardless of AGI though, I do think that models like this will eventually
mean the end of social media and perhaps all wide open discourse on the
Internet. When this stuff gets easy and cheap enough that spammers and
propagandists can use it, it's over. How much money in compute/storage would
it take to train GPT-3 to advocate for Donald Trump or Joe Biden all day long,
or to shill products, or to just generate superficially comprehensible text to
execute a kind of denial of service attack on a community?

~~~
dwheeler
I don't buy the "we have GPT-3, therefore we may soon have artificial general
intelligence" (AGI) notion.

AGI might happen tomorrow; it might happen in decades, in centuries, or never.
GPT-3 is basically a straightforward scaling of GPT-2, but I see no evidence
that simply scaling GPT-2 or GPT-3 will lead to AGI. The problem is we don't
know what else is needed.

------
SonOfLilit
Just so we're all on the same page, this is a random GPT-3 sample (I clicked
"Random" twice; the first was a short Wikipedia-like article, this was the
second):

[https://read-the-samples.netlify.app/sample_1749](https://read-the-samples.netlify.app/sample_1749)

As far as I'm concerned, I would probably pick this over a random fanfic in a
Turing Test.

~~~
sooheon
I was very impressed, but two clicks later landed me here:

[https://read-the-samples.netlify.app/sample_1792](https://read-the-samples.netlify.app/sample_1792)

This reads exactly like bad plagiarism, with repetitive phrases padding the
word count and too-on-the-nose facts. It's not generating, it's memorizing.

~~~
afiori
> This reads exactly like bad plagiarism

Are you saying that it is plagiarism, or that it just looks like it? To be
honest, given the huge training sets, and knowing nothing of the testing
methodology, I have some lurking suspicion that it could just be copy-pasting
chunks of text...

On the other hand, if it is just writing "articles that feel a lot like
plagiarism", then I suppose it's doing its job properly, considering what you
find on the internet.

~~~
sooheon
I guess it's both. It reads like plagiarism of bad plagiarism.

------
abj
What human like abilities would a scaled up version of GPT-3 have?

"Would it be worthwhile, even if it represented another large leap in AI
capabilities, to spend up to 10 milli-Manhattan-Projects to scale GPT-3 100x
to achieve human-like performance in some domains? Many researchers feel that
such a suggestion is absurd and refutes the entire idea of scaling machine
learning research further, and that the field would be more productive if it
instead focused on research which can be conducted by an impoverished
goatherder on an old laptop running off solar panels. Nevertheless, I think we
can expect further scaling." [1]

[1]
[https://www.gwern.net/newsletter/2020/05#gpt-3](https://www.gwern.net/newsletter/2020/05#gpt-3)

~~~
nl
Solving Winograd schemas would be a pretty interesting and significant step
forward. (A canonical example: "The trophy doesn't fit in the brown suitcase
because it's too big." Deciding whether "it" is the trophy or the suitcase
requires commonsense reasoning, not just surface statistics.)

------
minimaxir
I released a large number of GPT-3 demos yesterday:
[https://github.com/minimaxir/gpt-3-experiments](https://github.com/minimaxir/gpt-3-experiments)

------
fock
Well, I take the opposite stand: GPT-3 = Give a 3rd grader Wikipedia and some
paper. While it's certainly fun how relatively coherent its writing is, I have
yet to see how it links syntactically correct text to actual facts. That, in
my opinion, is the difference between 1e100 apes with typewriters and the
aforementioned 3rd grader.

~~~
dwohnitmok
> GPT-3 = Give a 3rd grader Wikipedia and some paper.

I don't know if you're being hyperbolic here, but if not, I consider that a
_massive_ step forward for AI.

~~~
wokwokwok
It's not a 3rd grader.

This kind of weird characterisation that people keep bringing up is totally
wrong. It's more like a savant, who can generate random stories that have no
bearing on reality.

It's just a bit better than GPT-2, which spat out mostly incoherent crap.

What's interesting about this is that the _approach_ still seems to scale, and
at some point it might make something that actually generates useful
output... and the ability of the model to handle general NLP tasks is a bit
better.

So yeah, it's interesting, but no, it's not a massive step forward for AI, in
the way having _an actual 3rd grader_ would be.

~~~
zozbot234
I don't think a third grader would ever be able to write anything like this:
[https://read-the-samples.netlify.app/sample_1986](https://read-the-samples.netlify.app/sample_1986).
Like it or not, this dreamed-up story is internally coherent in a truly
impressive way, and even more impressive is how it stays on-message
throughout.

~~~
criddell
Grammarly says that text was plagiarized, and if that's true, it's no surprise
that it's coherent.

~~~
fock
And plagiarizing/wrongly paraphrasing texts is something every 3rd grader
should be able to do. And yeah, generally a 3rd grader should be able to stay
on topic as well...

Additionally, a 3rd grader can draw, do image recognition, run in circles,
climb trees, pick fruit, and mine rare minerals. Seems like we already have
the business proposition of most AI businesses!

------
losvedir
When I read the output of this model, I'm really quite impressed. However,
given its sheer size and huge training corpus, to what extent is it just
regurgitating source text?

~~~
justanotherhn
I guess we could ask the same question about ourselves. How much of what we
say is just a regurgitation of what we hear/read every day?

------
andreyk
"What should we think about the experts? Projections of failure were made by
eminent, respectable, serious people. They spoke in considered tones of why AI
hype was excessive and might trigger an “AI winter”, and the fundamental flaws
of fashionable approaches and why brute force could not work. These statements
were made routinely in 2014, 2015, 2016… And they were wrong. I am aware of
few issuing a mea culpa or reflecting on it. It is a puzzling failure, and
I’ve reflected on it before."

Who are these experts? Where are records of these routine statements?
Seriously, I am an AI researcher, who said this?

~~~
p1esk
Gary Marcus?

~~~
andreyk
I mean, he and one or two other AI-research-adjacent people basically have a
hobby of bringing up the limitations of Deep Learning etc., sure. But if
that's all that is meant by this post, this is a weak point indeed...

~~~
p1esk
After IBM's Watson won Jeopardy!, Noam Chomsky said: "Watson understands
nothing. It's a bigger steamroller." I wonder if he still holds this view
after reading GPT-3 samples.

------
cannabis_sam
“but an idiot savant, we should remember, is only a genetic mutation or bit
of brain damage away from a normal human.”

I'm not sure this comparison holds... it seems like a chain of fuzzy
implications taken as necessary fact.

------
mark_l_watson
I retired a year ago from managing a deep learning team. While I am a fan of
the technology, I really yearn for more research aimed at hybrid AI systems.
Even given one-shot (few-shot?) learning, transfer learning, etc., I keep
coming back to watching my grandchildren when they were infants. They could
see a picture of a new animal in a picture book and really "get" why the
animal looked different from others, etc. Deep learning will not get us there.

I think we get to general AI by using prior knowledge, exploiting deep
learning, and also building systems that can develop their own models of the
world, which they can maintain, modify, discard, and combine.

As always, a great write up by gwern!

------
nl
Winograd schemas falling at 10T parameters is interesting. That's probably
only 5 years off.

If we can build something capable of passing Winograd schemas, then it can
probably write working non-trivial computer programs from plain text.

Google's PEGASUS summarization model[1] has learned to count up to five (which
is _amazing_!!). That's "only" 568M parameters. It'd be interesting to see
GPT-3 fine-tuned against the PEGASUS objective function.

[1] [https://ai.googleblog.com/2020/06/pegasus-state-of-art-model...](https://ai.googleblog.com/2020/06/pegasus-state-of-art-model-for.html)

~~~
LordDragonfang
For those curious, the "counting" example is at the end of the article, and it
really is quite impressive:

>Following this post is an example article from the XSum dataset along with
the model-generated abstractive summary. The model correctly abstracts and
paraphrases four named frigates (HMS Cumberland, HMS Campbeltown, HMS Chatham
and HMS Cornwall) as “four Royal Navy frigates”, something an extractive
approach could not do since “four” is not mentioned anywhere. Was this a fluke
or did the model actually count? One way to find out is to add and remove
ships to see if the count changes.

>As can be seen below, the model successfully “counts” ships from 2 to 5.
However, when we add a sixth ship, the “HMS Alphabet”, it miscounts it as
“seven”. So it appears the model has learned to count small numbers of items
in a list, but does not yet generalize as elegantly as we would hope. Still,
we think this rudimentary counting ability is impressive as it was not
explicitly programmed into the model, and it demonstrates a limited amount of
“symbolic reasoning” by the model.

~~~
empath75
Sounds like my children. For a long time my now-four year old counted like
this: “one, two, three, so many!”

~~~
jimbokun
Sounds like your four year old was ready to start making inductive proofs!

------
personjerry
I'm failing to connect the anecdotes with the conclusion. He's claiming that
ML is scaling well, but then gives data on how GPT-3 is expensively
brute-forcing its way to "success".

To me it just seems like what supercomputing is to normal computing: it makes
the computationally expensive stuff doable in a reasonable amount of time, or
wrings diminishing returns out of existing algorithms. But it doesn't magic up
any real advancements.

To me, the problem in AI/ML, and the root of the "AI winter" idea, was always
that we're just doing predictions, with no deep meaning or comprehension of
features. The layman thinks there's magic, but there's not, so when that truth
is revealed there will be problems. There's nothing intelligent about our
Artificial Intelligence; we're just doing statistics with big data and added
sprinkles. OpenAI just proved they could do statistics with even bigger data
and more expensive sprinkles.

Has their work really shown we can get past that core problem? Personally, I
don't see it.

~~~
LordDragonfang
>we're just doing statistics with big data and added sprinkles.

I mean, you could pretty much say that's how the human brain works, couldn't
you?

~~~
saberience
But that's the point: our brain is clearly a lot more than just a big table of
probabilities. You only need to look at the absolutely insane volume of data
and training time that these models need. How much time does it take for a
human to understand a concept like "love", and what volume of training data is
required? Computers would just regurgitate quotes from poetry or novels about
love, without any real understanding, after billions of hours of training time
and ingesting every document on the internet. A human can understand love in a
tiny fraction of the time, with a tiny fraction of the volume of information
processed, and they also understand it in a more fundamental way and can
articulate that in a way these models cannot.

You might argue back: well, the human brain has pre-trained neural networks
with billions of hours of training time. That isn't really the case, though.
We don't start off with some pre-existing memory of what "love" means, or what
"physics" is, or trillions of bytes of data. All we have is a capacity to
learn which is highly efficient, a conscious mind which is aware of itself,
and certain fundamental drives rooted in our bodies and instincts. If you take
a human child and give it zero input information, it will never learn a
language or be capable at all in any sense of the term. So we become
incredibly capable based on a tiny fraction of input data fed into us after
birth.

The way the human brain and mind works is deeply tied to the experience of
having a body, knowing we are mortal, and having fundamental drives such as a
drive to survive, eat, drink, keep ourselves safe, and also a drive to be
social, find a mate, and procreate. I would argue that we will never be able
to have a computer/algorithm that thinks like we do unless it also has drives
like we do and a body like we do, since so much of our process of thinking is
tied to having a body, our awareness of mortality, and our basic human drives
and experience.

~~~
keenmaster
X = A+B

A = C+D

B = E+F

Love = X = A+B = C+D+E+F

Obviously the above is contrived and abstracted, but you get my point. If I
took a little bit of time, I could schematically map every word that makes up
the definition of love, and how those words interact. Then I could associate
3D, real-world graphical observations with each of those words, and then with
love as a concept holistically (as humans do; we're not confined to text data,
we observe extremely rich 3D visual data, audio data, touch data, etc.).
There's no reason to believe a massive "correlation machine" can't do the same
thing with the right algorithms, enough compute power, and multimodal inputs.
Furthermore, we can make the correlation machine even better by specializing
parts of the hardware for certain tasks, just like the brain.

~~~
saberience
And it still wouldn't be the same. Again, our notion of love is tied into
having consciousness, which machines would not have. We still don't even
understand what consciousness is, how to define it, or how it is generated in
the brain. Since machines are not conscious, they could never "understand" or
"experience" what we call "love", because love, again, is tied up with our
experience of consciousness, the idea of mortality, and having a physical
presence in the world.

~~~
keenmaster
What is consciousness, and what unique non-tautological properties does it
give us?

I know of the general idea of consciousness, but I can’t boil it down to first
principles. Self-awareness, on the other hand, is more tangible. AI would seem
capable of internal cognition, reflection on past experiences, etc... They
might not have the desire or need for such reflection, but they would
certainly have the ability.

------
Sniffnoy
Something seems to be wrong with the footnotes on this page...

Edit: Nevermind, my browser seems to just be screwing up on Gwern's footnotes
in general at the moment.

~~~
renewiltord
On my desktop, I hover over them and a little window pops up with the footnote
in it. That looks like the intended behaviour, but it must be failing on your
platform.

------
jefftk
"GPT-3 is an extraordinarily expensive model by the standards of machine
learning: it is estimated that training it may require the annual cost of more
machine learning researchers than you can count on one hand (~$5m), up to $30
of hard drive space to store the model (500–800GB), and multiple pennies of
electricity per 100 pages of output (0.4 kWH). Researchers are concerned about
the prospects for scaling: can ML afford to run projects which cost more than
0.1 milli-Manhattan-Projects? Would it be worthwhile, even if it represented
another large leap in AI capabilities, to spend up to 10 milli-Manhattan-
Projects to scale GPT-3 100x to achieve human-like performance in some
domains? Many researchers feel that such a suggestion is absurd and refutes
the entire idea of scaling machine learning research further, and that the
field would be more productive if it instead focused on research which can be
conducted by an impoverished goatherder on an old laptop running off solar
panels. Nevertheless, I think we can expect further scaling."
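
For scale, a quick conversion of the "milli-Manhattan-Project" unit, assuming
the Manhattan Project cost about $2B in 1940s dollars, or very roughly $23B
inflation-adjusted (both ballpark assumptions):

    manhattan_usd = 23e9  # rough inflation-adjusted assumption, not an exact figure
    print(f"0.1 mMP ~ ${0.1e-3 * manhattan_usd / 1e6:.1f}M")  # ~$2.3M: GPT-3's training scale
    print(f"10 mMP  ~ ${10e-3 * manhattan_usd / 1e6:.0f}M")   # ~$230M: the 100x scale-up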

------
deskamess
Love the document format and how hovering over links presents enough detail
inline (for example, Sutton's Bitter Lesson). I appreciate not being taken to
another page/tab. I wonder if a CSS style sheet is easily available.

Had never heard of MuZero before. It's impressive that it can reach AlphaZero
levels in Go without knowing the rules.

~~~
davnn
Just look at the page source for CSS and JS used to create it.

------
nl
Given that Image-GPT[1], which was based on the GPT-2 architecture, has been
shown to do very good image completion, I think in the next 12 months we'll
see a unified Image/Language GPT.

I think it will be able to (imperfectly) do things like the following:

- OCR from images
- Textual descriptions of images

It may start to make some progress towards things like:

- Generating images from a textual description
- Producing structured documents (e.g. HTML) from document images

It'd be interesting to see how far along they already are with this.

[1] [https://openai.com/blog/image-gpt/](https://openai.com/blog/image-gpt/)

------
Tepix
I'm happy to see it being released.

What is the minimum hardware required to run this locally?

Or what's the cheapest way to run this model (even at barely acceptable
performance) at a cloud provider?

~~~
p1esk
Any server with 512GB of RAM. So basically a couple of bucks per hour, to
generate a few words per minute.
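
A rough sanity check of that figure, assuming the paper's ~175 billion
parameters for the largest model:

    # Weights alone, ignoring activations and runtime overhead.
    params = 175e9
    print(f"fp16: ~{params * 2 / 1e9:.0f} GB")  # ~350 GB at 2 bytes/parameter
    print(f"fp32: ~{params * 4 / 1e9:.0f} GB")  # ~700 GB at 4 bytes/parameter

Either way, that's in the ballpark of the 500-800GB storage estimate quoted
elsewhere in the thread.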

~~~
Tepix
You don't need fancy GPUs with lots of RAM?

~~~
p1esk
If you’re ok with generating a few words per minute, no.

------
Der_Einzige
All this advancement in text generation - but if you want to use it to
represent sentences or documents in a semantic space, you still have to use
horrifically bad techniques like average or max pooling.

When is someone going to advance pooling techniques? We desperately need
improvement!
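
For anyone unfamiliar with the complaint, this is all that "pooling" means
here: collapsing a model's per-token embeddings into one fixed-size sentence
vector (a minimal sketch; the sizes are illustrative):

    import numpy as np

    token_embeddings = np.random.default_rng(0).normal(size=(12, 768))  # e.g. 12 tokens

    mean_pooled = token_embeddings.mean(axis=0)  # average pooling
    max_pooled = token_embeddings.max(axis=0)    # elementwise max pooling
    print(mean_pooled.shape, max_pooled.shape)   # (768,) (768,)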

------
gwern
Incidentally, I've also been working on a selection of GPT-3 generated
creative writing (primarily, but far from limited to, poetry):
[https://www.gwern.net/GPT-3](https://www.gwern.net/GPT-3)

------
akeck
Did Gwern write parts of this with GPT-3? It has a certain... flavor.

------
yters
Reminds me of _Explorers_, where the alien talks in soundbites from Earth TV.
All hail our future plagiarist AI overlords!

------
dgrabla
What book should I read to be able to understand this text? No papers please.
I don't understand what scaling a model means.

~~~
mkl
This is about new research, which mostly lives in papers and articles (like
this) about the papers. It won't show up in introductory books for a while, so
if you're unwilling to read or even look at papers, you won't be able to
understand details of new research.

Scaling a model is just like it sounds: more data fed into a bigger network
with more parameters. The gist of what this article is saying about scaling is
that there's no sign of diminishing returns yet in terms of what the network
can do and how well it generalises as the number of parameters is increased:
the "more parameters = better performance" trend continues up to the enormous
size of the full GPT-3 model, with no indication that even bigger models won't
have even better performance.

Here is the GPT-3 paper:
[https://arxiv.org/pdf/2005.14165.pdf](https://arxiv.org/pdf/2005.14165.pdf)

If you really want to understand, skim this, and focus especially on the
graphs, as they show the scaling. The x axis is usually model size, and the y
axis is mostly accuracy or "loss" (~error).
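
As a sketch of what those curves look like, here is the power-law form
reported in the OpenAI scaling-laws paper (Kaplan et al. 2020), with its rough
fitted constants; treat the numbers as approximate illustrations, not exact
predictions:

    # L(N) = (N_c / N)^alpha_N: test loss as a power law in non-embedding parameters.
    N_C, ALPHA_N = 8.8e13, 0.076  # rough fit from Kaplan et al. 2020

    def test_loss(n_params):
        return (N_C / n_params) ** ALPHA_N  # cross-entropy, nats per token

    for n in (1.5e9, 13e9, 175e9):  # roughly GPT-2, GPT-3 13B, GPT-3 175B
        print(f"{n:.1e} params -> predicted loss ~ {test_loss(n):.2f}")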

------
skrause
The GUID Partition Table is already at its third version?

~~~
anticensor
No, Generative Pre-Trained Transformer.

~~~
skrause
I know, it's just a stupid acronym because it's already used for something
else (the partition table).

~~~
Smaug123
Oh boy, do I have bad news for you: Wikipedia lists ten overloaded acronyms
starting with "AA" alone!

~~~
skrause
But they are usually in unrelated fields.

------
nonesuchluck
There are a lot of mixed metaphors here, but just to attack one I immediately
recognize: the 1940 Scientific American article "Don't worry, it can't happen"
is not (as Gwern appears to be implying?) stating that nuclear explosions are
impossible. Instead, the article explains why nuclear chain reactions
eventually stop and do not continue to explode all available matter. S.A. is
not saying the Trinity test would not explode, merely that it would not turn
Earth into a new sun, and that you can get some sleep about it.

~~~
nkurz
Gwern's typographical conventions are idiosyncratic, and you might not have
noticed that he linked the original Scientific American article in his text.
Reading the article now, I don't see how your interpretation is defensible.

It doesn't mention exploding the earth, and while there is a little ambiguity,
as Gwern does imply, they are describing a recent paper that concludes that
large fission reactions are simply impossible: "They found instead, that
instead of building up to a grand climax, it runs down and stops like an
unwound clock."

The final line of the caption is 'Readers made insomnious by "newspaper talk"
of terrific atomic weapons held in reserve by dictators may now get sleep'. At
least superficially, that sure sounds to be more about atomic weapons being
impossible than about whether the chain reaction would consume the entire
earth.

I think you are confusing the actual linked article with Edward Teller's later
argument that a nuclear fission explosion might ignite the atmosphere:
[https://www.realclearscience.com/blog/2019/09/12/the_fear_th...](https://www.realclearscience.com/blog/2019/09/12/the_fear_that_a_nuclear_bomb_could_ignite_the_atmosphere.html).

