
Tempering Expectations for GPT-3 and OpenAI’s API - vortex_ape
https://minimaxir.com/2020/07/gpt3-expectations/
======
Bx6667
I am totally confused by people not being impressed with GPT-3. If you asked
100 people in the tech industry in 2015 whether these results would be possible in 2020,
95 would say no, not a chance in hell. Nobody saw this coming. And yet nobody
cares because it isn’t full blown AGI. That’s not the point. The point is that
we are getting unintuitive and unexpected results. And further, the point is
that the substrate from which AGI could spring may already exist. We are
digging deeper and deeper into “algorithm space” and we keep hitting stuff
that we thought was impossible and it’s going to keep happening and it’s going
to lead very quickly to things that are too important and dangerous to
dismiss. People who say AGI is a hundred years away also said Go was 50 years
away, and they certainly didn't predict anything even close to what we are
seeing now, so why is everyone believing them?

~~~
abernard1
> And yet nobody cares because it isn’t full blown AGI. That’s not the point.
> The point is that we are getting unintuitive and unexpected results.

I don't think these are unintuitive or unexpected results. They seem exactly
like what you'd get when you throw huge amounts of compute power at model
generation and memorize gigantic amounts of stuff that humans have already
come up with.

A very basic Markov model can come up with content that seems surprisingly like
something a human would say. If anything, what all of the OpenAI hype should
confirm is just how predictable and regular human language is.
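
As a minimal sketch of that point (the corpus and names here are made up), even a first-order word-level Markov chain produces locally plausible text:

    import random
    from collections import defaultdict

    # Build a first-order Markov chain: each word maps to the list of
    # words observed to follow it in the corpus.
    def train(words):
        chain = defaultdict(list)
        for a, b in zip(words, words[1:]):
            chain[a].append(b)
        return chain

    # Walk the chain, picking a random observed successor at each step.
    def generate(chain, start, n=20):
        out = [start]
        for _ in range(n):
            successors = chain.get(out[-1])
            if not successors:
                break
            out.append(random.choice(successors))
        return " ".join(out)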

~~~
yters
Exactly. Even GPT-3 is not creating new content. It is just permuting existing
content while retaining some level of coherence. I don't reason by repeating
various tidbits I've read in books in random permutations. I reason by
thinking abstractly and logically, with a creative insight here and there.
Nothing at all like a Markov model trained on a massive corpus. GPT-3 may give
the appearance of intelligent thought, but appearance is not reality.

~~~
mietek
_> I don't reason by repeating various tidbits I've read in books in random
permutations._

Are you sure?

~~~
yters
Yes, I would fail any sort of math exam if I used the GPT-3 model.

------
minimaxir
From Sam Altman just now:

> The GPT-3 hype is way too much. It’s impressive (thanks for the nice
> compliments!) but it still has serious weaknesses and sometimes makes very
> silly mistakes. AI is going to change the world, but GPT-3 is just a very
> early glimpse. We have a lot still to figure out.

[https://twitter.com/sama/status/1284922296348454913](https://twitter.com/sama/status/1284922296348454913)

~~~
currymj
It's interesting that he says that proving a mathematical theorem would be a
bigger milestone.

From my limited perspective, that seems less surprising than getting language
models of GPT-3's quality.

There are already theorem proving environments like Lean, HOL, and Coq.
Proving theorems in these systems is essentially just a kind of guided tree
search - you're trying to reach a goal state by choosing from a finite menu of
"tactics", and might have to backtrack.

There is nascent research in this area already; it works, but it hasn't proved
any significant theorems yet. From what I've read, this may be partly because
most "interesting" mathematics can't yet be formally expressed in a theorem
prover.

Eventually mathematicians will build out the libraries of mathematical objects
in those languages to the point where it's possible to state an interesting
theorem; at that point, people already seem very good at combining neural nets
with tree search to surpass human capabilities.

~~~
oggy
AI can be very useful in practice for theorem proving WITHOUT proving "big new
theorems" or inventing new mathematics. Right now, what makes theorem proving
an extremely expensive undertaking is making proof search work. AI could help
immensely by improving the search.

The process of proving things in provers like Lean, HOL or Coq is interactive;
roughly, the theorem you want to prove is your starting goal, and you apply
tactics to transform the proof goal until the goal becomes the boolean value
true. The Platonic ideal of this process is that you provide the main,
"creative" steps of the goal transformation, and the tactics discharge the
rest automatically. In practice, however, for the process to work, you need to
state your theorems carefully in a certain way, and you need an enormous
amount of tacit knowledge about the internals of the tactics, as well as about
the libraries of already-proved facts that you can use. These things take
years of practice to build up. I think it is unlikely that theorem provers will
see significant uptake until the tactics (i.e., proof search) improve
massively.
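
As a minimal sketch of this interactive loop, here is a tiny Lean proof (Lean 3 syntax); each tactic transforms the current goal until all goals are discharged:

    -- Goal: p ∧ q → q ∧ p. Each tactic rewrites the goal state.
    theorem and_swap (p q : Prop) : p ∧ q → q ∧ p :=
    begin
      intro h,              -- assume h : p ∧ q
      cases h with hp hq,   -- split h into hp : p and hq : q
      split,                -- goal q ∧ p becomes two goals: q and p
      { exact hq },         -- discharge the first goal
      { exact hp }          -- discharge the second goal
    end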

This is where I hope AI could step in. There are many challenges, obviously.
The training data sets are relatively modest; when I looked a few years ago,
the publicly available Isabelle theories, for example, amounted to a few
hundred thousand proved theorems, which is a minuscule corpus compared to what
something like GPT-3 uses. Then, how do you represent the input: just
characters, or do you need more structure? Can you leverage something like
reinforcement learning? How would you set the training process up? On the
other hand, compared to chat bots, the quality of the resulting AI system
would be much easier to quantify (success rate on a benchmark of new
theorems).
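
As one hypothetical framing (the tactic menu, encoding, and model below are all made up for illustration), tactic selection could be treated as a classification problem over serialized goal states:

    # Sketch: learn a policy that maps a serialized proof goal to a
    # score over candidate tactics, trained on (goal, tactic) pairs
    # mined from existing proof libraries.
    import torch
    import torch.nn as nn

    TACTICS = ["intro", "cases", "split", "exact", "simp"]  # toy tactic menu

    class TacticPolicy(nn.Module):
        def __init__(self, vocab_size=10000, dim=256):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, dim)
            self.encoder = nn.GRU(dim, dim, batch_first=True)
            self.head = nn.Linear(dim, len(TACTICS))

        def forward(self, goal_tokens):          # (batch, seq) token ids
            x = self.embed(goal_tokens)
            _, h = self.encoder(x)               # final hidden state
            return self.head(h[-1])              # logits over tactics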

There's already work in this area, but I'm not aware of any grand successes
yet. I hope to see much more work on this in the near future. I dabbled in it
myself a while ago, while I was still in academia, but other priorities have
taken precedence in the past couple of years.

------
GIFtheory
> As an example, despite the Star Wars: Episode III - Revenge of the Sith
> prompt containing text from a single scene, the 0.7 temperature generation
> imputes characters and lines of dialogue from much further into the movie

This makes me believe it is actually just memorizing the movie script, which
is probably in its corpus. As pointed out here, the model has enough
parameters to straight-up memorize over 1/3 of its gigantic training set.

[https://lambdalabs.com/blog/demystifying-gpt-3/](https://lambdalabs.com/blog/demystifying-gpt-3/)
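
A rough back-of-the-envelope for that claim, using the publicly reported figures (a capacity intuition only, not evidence that the model literally stores the text):

    # GPT-3 has ~175B parameters; its filtered training text is
    # reported at roughly 570 GB. At 2 bytes per float16 parameter,
    # the weights alone are a sizable fraction of the corpus size.
    params = 175e9
    model_bytes = params * 2           # ~350 GB of weights
    corpus_bytes = 570e9               # ~570 GB of filtered text
    print(model_bytes / corpus_bytes)  # ~0.61, comfortably over 1/3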

------
Barrin92
I'm not really sure I understand the hype anyway. All GPT-3 does is generate
text from human input to begin with; it's not actually intelligent at all, as
the person from the Turing test thread pointed out.

Sure, GPT-3 can respond with factoids, but it doesn't actually understand
anything. If I have a chat with the model and ask it "what did we talk about
thirty minutes ago", it's as clueless as anything. A few weeks ago
Computerphile put out a video of GPT-3 doing poetry that was allegedly
identified as computer-generated only half of the time, but if you actually
read the poems they're just lyrical-sounding word salad, as it does not at all
understand what it's talking _about_.

Honestly, the only expectation I have for this is generating a barrage of spam
or fake news that uncritical readers can't distinguish from human output.

~~~
nullc
> If I have a chat with the model and I ask it "what did we talk about thirty
> minutes ago" it's as clueless as anything.

Come on, this comes from a specific structural limit: the transformer only
looks back a thousand symbols or so.

If you simply scaled it up, it would handle that fine (though it's costly to
scale up this architecture).

> but if you actually read the poems they're just lyrically sounding word
> salad,

This is what a lot of famous poetry sounds like to some people.

Saying it doesn't "understand" really sounds like a lot of what people said
about chess engines in the early 90s. However you define understanding, if it
doesn't have it probably doesn't need it to still do amazing and useful
things.

The new Navy Seal copypasta GPT-3 wrote on gwern's blog is the greatest of that
form I've ever seen, by a wide margin. It is witty, clever, and hilarious.
Could a very talented writer do as well? Probably. But none would ever bother
... because it's just a silly joke, yet GPT-3 generates world-class raging Navy
Seal 24/7, rain or shine, on demand.

[https://www.gwern.net/GPT-3#navy-seal-copypasta](https://www.gwern.net/GPT-3#navy-seal-copypasta) (best screamed)

People seem unbounded in their ability to hype this stuff. Sure, it has
significant limitations. But so what? Decades ago someone might have asked
"What use is a 'computer' if it cannot learn to love?" Yet we can all
agree computers have been extremely useful.

If GPT's output brings beauty or truth to you, does it matter that the computer
lacks some vague property humans have? People can find meaning in waterfalls
and sunsets, and knowledge from looking at the stars. If we learn or feel
something on account of a great machine that read the whole internet, that
would seem to me the least surprising of all these examples.

~~~
abernard1
> If you simply scaled it up it would handle that fine. (it's costly to scale
> up this architecture).

What compelling reason do we have to believe this is true, when this is not
how animal or human brains work?

We humans don't get to "scale up" our architecture with billion-fold increases
in our training set. We have more sophisticated hardware, sure, but we don't
actually have more _data_.

I think we've just pursued this path because it's the one we can do. We have
more compute, we have more data, so we use it. And that's fine. But it still
baffles me that the field isn't more interested in trying to move in the
direction of less data, more sophistication in structures, when we already
know that works at some level because that's how we work.

~~~
nullc
> What compelling reason do we have to believe this is true, when this is not
> how animal or human brains work?

The problem of GPT having zero idea about far-past text isn't some spooky
emergent problem: it literally has zero access to anything far enough in the
past.

GPT is completely memoryless: it has access to the past 1024 symbols. It
predicts the next symbol given the last 1024-- and that's it. When it goes to
predict the next symbol, it has literally zero access to what came before
except for the effect it had on the text still in the window.

(symbols are usually words, but can also be letters, due to the compressed
input).

The result is that when text ends up too far back it will totally "forget it".
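
In pseudocode terms, the limitation is just a fixed sliding window; the `model.predict` call below is hypothetical:

    # The model conditions only on the most recent CONTEXT_LEN tokens;
    # anything earlier is simply not part of its input.
    CONTEXT_LEN = 1024

    def next_symbol(model, tokens):
        window = tokens[-CONTEXT_LEN:]   # older tokens fall out entirely
        return model.predict(window)     # hypothetical single-step call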

If you scale it up to more symbols that effect will go away.

Maybe some other issues arise at some scale, but it seems unlikely that they'd
be all that similar to that totally-forgetting effect.

I'm not arguing that just scaling it up is the best approach-- just that
faulting its forgetting effect is not the most compelling criticism, because
that one almost certainly can be addressed by scaling.

An architecture that could give it memory would be interesting-- people have
shown impressive results by making it "show its work", effectively turning the
conversation into external memory-- but it's less obvious how to train it to
use memory.

~~~
Barrin92
>I'm not arguing that just scaling it up is the best approach-- just that
faulting its forgetting effect is not the most compelling criticism, because
that one almost certainly can be addressed by scaling.

It can't, because the size of the system completely explodes toward practical
infinity if you try to learn the past just by absorbing random factoids.

This is clearly not how human memory works. If I ask you "did you float
through your bedroom on March 2nd, 2015 at 11 pm?", you don't consult some
sort of history you have burned into a neural net; you make an inference using
the laws of physics to conclude that you didn't, because that's impossible.

These neural nets don't reason this way; they don't have a high-level
understanding of ontologies or laws, and they can't make inferences like this.
I haven't tested GPT-3, but I assume it can't even reliably solve basic
algebra that isn't found in the dataset.

So memory is a function of combining high level understanding of the world
with data, not shoving history into an encoder.

~~~
nullc
> I assume it can't even reliably solve basic algebra that isn't found in the
> dataset.

::sigh:: It can, even though there are specific technical reasons that working
with both numbers and exact answers is hard for it.

(The way the input text is encoded combines digits in an inconsistent,
formatting-specific way, and text is generated not by taking the model's
single most likely output but by randomly sampling among the top few most
likely outputs, which is an obvious handicap when there is a single right
answer.)
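
The digit-chunking issue is easy to see with the GPT-2 byte-pair tokenizer, which GPT-3's encoding reportedly resembles; this sketch assumes the HuggingFace `transformers` package:

    # Digit strings get split into inconsistent multi-digit chunks, so
    # the model never sees numbers as clean sequences of single digits.
    from transformers import GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    print(tok.tokenize("1234 + 5678"))
    # e.g. ['12', '34', 'Ġ+', 'Ġ56', '78'] -- chunk boundaries depend
    # on the exact digits and the surrounding formatting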

[https://pbs.twimg.com/media/EdHuMgsWsAEk-29?format=png&name=...](https://pbs.twimg.com/media/EdHuMgsWsAEk-29?format=png&name=large)

[https://pbs.twimg.com/media/EdII5jhXkAA8eKv?format=png&name=...](https://pbs.twimg.com/media/EdII5jhXkAA8eKv?format=png&name=medium)

(lines starting with > are the human)

------
Reelin
The 0.7 unicorn example is doing absolutely nothing to temper my expectations.
Quite the opposite in fact.
([https://github.com/minimaxir/gpt-3-experiments/blob/master/e...](https://github.com/minimaxir/gpt-3-experiments/blob/master/examples/unicorn/output_0_7.md))

~~~
alserio
"They're so intelligent. I was able to converse with them about quantum
mechanics, which is something I've never even tried to talk to a regular horse
about."

I mean, I would classify that as just genius.

~~~
repsilat
Wow, reading through... The Onion can fire half of its staff -- just write a
headline, generate the article a few times, and either edit the best one or
mine them all for gold.

~~~
not2b
For me, the fun of The Onion is almost entirely in the headline. That is where
the creativity and the laugh are, maybe there and in the first sentence or so.
The rest of the piece tends to be predictable from that, so GPT-3 could do
that part. Coming up with the beginning is another matter, so the writers'
jobs are safe for now.

------
rvz
> GPT-3 itself, like most neural network models, is a black box where it’s
> impossible to see why it makes its decisions, so let’s think about GPT-3 in
> terms of inputs and outputs.

Spot on. Explainability is always glossed over in the AI landscape, and neural
networks such as CNNs, RNNs and GANs are still generally unable to explain
themselves and their decisions.

On top of that, detection mechanisms for AI-generated content (faces, voices
and now text) are needed to combat its use by bad actors. Otherwise, it is
very dangerous when all of this is used together.

While GPT-3 is impressive in its capabilities, we must think about detection
methods that distinguish content created by an AI from content created by a
human.

~~~
dharma1
> While GPT-3 is impressive in its capabilities, we must think about detection
> methods that distinguish content created by an AI from content created by a
> human.

While this would be very useful, I'm not sure it's possible - at least not on
something like a static text article, where you can't query the machine
learning model with custom inputs to expose its faults.

I think the internet will soon be flooded with an absolute tsunami of
AI-generated content, practically indistinguishable from human-created content
- and it will make the job of search engines much more difficult.

It's interesting that in our quest to organise the world we always actually
create a lot more entropy.

------
davesque
As encouraging as GPT-3's results are, I still don't see the critical ability
to synthesize _large-scale_ structure (think book-level, as opposed to just a
few sentences) manifest in this research. Even the much-touted 0.7 temperature
Revenge of the Sith examples don't exhibit coherent high-level structure that
continues across the span of more than a few sentences. And I differentiate
high-level structure from long-distance structure. The algorithm does appear
to be able to relate themes from a sentence earlier in a generated text to
ones which appear later (up to a few paragraphs later). But the narrative
connecting those sentences is shallow. Compare this with the complex,
multi-tiered narrative that underpins the structure of a well-written book or
body of research.

My intuition here is that there is an exponential relationship between the
"depth" evident in the narrative of generated text and the number of
parameters required in the model, which places viable, AGI-like intelligence
still much further in the future.

~~~
emsy
Absolutely. At the moment AI seems to scale extremely well in one dimension
(amount of data) but really poorly in the other (semantics of the data). A
huge increase in the former can look like an increase in the latter, because
more data covers more of the simple semantics.

------
knzhou
The people here reading GPT-3's output and dismissing it as "not really
understanding anything" or "pushing symbols without conscious intent" because
of minor slips in coherence have remarkably high standards for understanding.
On a high school essay prompt, GPT-3 produces more accurate and more coherent
output than 90% of high school students. If the commenters' standards were
actually used consistently, they would also apply to almost all human beings.

Another interestingly common objection is "we think in terms of meaningful
semantics, but GPT-3 only calculates meaningless coefficients". This is a
category error masquerading as an argument, equivalent to saying "I have moral
worth because I'm made of cells, but you don't because you're only made of
quarks." Cells and quarks are just two different-level descriptions of
precisely the same thing. Understanding in your brain can be reduced to the
calculation of "meaningless" coefficients just as surely as cells can be
reduced to quarks.

------
IXxXI
GPT-3 is more likely to revolutionize its industry the way the flying car
didn't than to mirror Bitcoin's success.

~~~
Judgmentality
I think many people don't even consider bitcoin successful. I still see it as
a solution looking for a problem, personally. It's not actually anonymous, it
doesn't scale for transactions, it's still not used or even understood by the
vast majority of people.

I've yet to hear of the killer app for blockchain technology.

~~~
IXxXI
If bitcoin is a long term store of value, it doesn't need to scale.

For the same reasons a person wouldn't expect a 401k to work like a debit
card.

(((((:

~~~
bkanber
That's not how bitcoin was marketed in the early days. And I don't trust it
even remotely as a long term store of value.

~~~
ClumsyPilot
Well, dynamite and the machine gun were meant to end all wars. How things turn
out is not up to the original creator.

------
HugThem
Strange, nothing in the article tempered my expectations.

------
qeternity
Although GPT-3 is most certainly not immune from the relentless hype cycle,
the low barrier to entry provided by the API, combined with what appear to be
SOTA results on many tasks, will undoubtedly upend the business models of many
platform companies. There is a whole industry of feature-as-a-service
companies out there providing domain-specific functionality to various
industries... GPT-3 looks like it could make it a few orders of magnitude
easier/cheaper for these customers to build those features in-house and ditch
the feature providers.

------
anotheryou
I played AI Dungeon and the move from GPT-2 to GPT-3 didn't feel like much of
a difference. In general, all the cool stories remind me of cherry-picked AI
Dungeon with GPT-2.

------
gdulli
> However, I confess that the success of GPT-3 has demotivated me to continue
> working on my own GPT-2 projects, especially since they will now be
> impossible to market competitively (GPT-2 is a number less than GPT-3 after
> all).

Doesn't the much larger size and therefore much higher expected cost of GPT-3
ensure that demand for GPT-2 will continue?

------
matsemann
Can someone explain to me how the design mockup demo and React generation
relate to GPT-3? Isn't it just prompts? How does inputting text produce a
design as output?

~~~
zora_goron
I believe those demos involve "priming" GPT-3 by providing a few examples of
(text → generated code), then at inference time passing in just the text. The
model follows the examples provided and subsequently generates a string of
code/mockup syntax, which is then evaluated.
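
A hedged sketch of what such priming might look like against the 2020-era `openai` Completion endpoint (the examples and stop sequence below are invented; the real demos used much richer prompts):

    import openai  # assumes openai.api_key is already set

    # A few (description -> code) examples, then the new description;
    # the model continues the pattern.
    prompt = ("description: a button that says hello\n"
              "code: <button>hello</button>\n\n"
              "description: a large red heading that says welcome\n"
              "code: <h1 style=\"color:red\">welcome</h1>\n\n"
              "description: an input box with a submit button\n"
              "code:")

    resp = openai.Completion.create(engine="davinci", prompt=prompt,
                                    max_tokens=64, temperature=0.3,
                                    stop="\n\ndescription:")
    print(resp.choices[0].text.strip())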

Edit: here is a tweet (from the author of the GPT3 layout generator) that
seems to show this in practice:
[https://twitter.com/sharifshameem/status/1282692481608331265](https://twitter.com/sharifshameem/status/1282692481608331265)

------
me551ah
It is still amazing for what it is and could be used to generate boilerplate
code. For many writing tasks, it can come up with a good starting point. For
programming tasks, maybe someone can write a translator from Python to C++ or
even assembly. It can also come up with basic designs, which can be a useful
starting point, and as long as people don't see it as a replacement for
themselves, we will be fine.

------
hn3333
So many possibilities:

This can be used to generate fake flirting on dating sites. (I assume this is
already going on, but now they won't need to hire humans to do it.) Or how
about flamewar bots to keep Twitter users busy? Also, perhaps it can answer
support questions better than current systems. Come to think of it, we might
eventually need some sort of proof/signature that something was written by a
human being.

------
Ptrulli
GPT-3 is at the very least an indication of future AI advancement. Sure, maybe
this particular version is still too early to be widely incorporated in
business or used globally, but it's definitely a step in the right direction.
I'd personally love the opportunity to test it to really get the whole
experience.

------
king07828
I'd like to see the LSTM brain in the OpenAI Five model [1] replaced by GPT-3
to see if there is any improvement.

[1] [https://neuro.cs.ut.ee/the-use-of-embeddings-in-openai-five/](https://neuro.cs.ut.ee/the-use-of-embeddings-in-openai-five/)

------
unexaminedlife
I wonder to what extent GPT-3's training was domain-specific to React. And
what arbitrary limitations, if any, exist on what the user can request? I
noticed the user didn't have to specify which programming language they wanted
the apps in, yet it always chose React.

~~~
ZephyrBlu
That's only because the guy primed it with React. Someone else has done it
with SwiftUI [1].

[1]
[https://twitter.com/jsngr/status/1284874360952692736](https://twitter.com/jsngr/status/1284874360952692736)

------
lifeisstillgood
My question is the other way round - where is the state of the art for
_extracting meaning_ from existing text? How close are we to understanding
legal contracts, for example?

~~~
TaylorAlexander
Well, my understanding is that GPT-3 can “read” material you feed it and then
respond in a pretty open-ended way about the material, with pretty good
accuracy and comprehension.

~~~
matsemann
But how can that further be used? Getting a text back only gives me the same
problem again.

------
id_ris
How are people accessing GPT-3? Is there a website, code repo, API... or do
only a select few have access?

~~~
drusepth
The OpenAI website has a login that lets people access it through a web
interface, but there is also an API. You can join the waitlist at
[https://beta.openai.com/](https://beta.openai.com/); they're apparently
working through the list.

------
brisky
GPT-3 demos are impressive. This looks like a significant step towards AGI. I
think GPT-3 simulates memory retrieval and data reconstruction quite well. The
part we are still missing for AGI is curiosity: the ability to ask questions
and fill in the missing information gaps.

------
cryptoz
This part piqued my interest:

> GPT-3 seed prompts can be reverse-engineered, which may become a rude
> awakening for entrepreneurs and the venture capitalists who fund them.

Is there any more information about this? Is the reverse engineering done by
hand or is this something that can be coded? Super curious about this point.

~~~
gwern
I think you would have to do it by hand. Assuming that the prompt is stripped
off before the text is sent to the user, there's no obvious way to
reverse-engineer it except by having an experienced GPT-3 user use their
intuition to think about what kind of prompt would elicit the observed
responses.

Since you don't have access to the GPT-3 model itself, you can't directly use
gradient ascent or MCMC to try to reverse it to get the prompt. You might be
able to blackbox it, but the only approach that comes to mind is using the
logprobs, and your target won't give you those any more than they will give
you the prompt itself.

I'm not sure what he has in mind; just because model-stealing has been
demonstrated for CNN classifiers doesn't mean you can feasibly steal GPT-3
prompts...
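
For concreteness, the logprobs approach would look roughly like the following sketch, if you could query the same model through the API (which, as noted, a deployed product won't let you do; the engine name and helper are illustrative):

    import openai

    # Score a candidate prompt by the log-likelihood the model assigns
    # to the observed output appended to it. echo=True with
    # max_tokens=0 returns per-token logprobs without generating text.
    def score(candidate_prompt, observed_output):
        resp = openai.Completion.create(
            engine="davinci",
            prompt=candidate_prompt + observed_output,
            max_tokens=0, echo=True, logprobs=0)
        lps = resp.choices[0].logprobs.token_logprobs
        return sum(lp for lp in lps if lp is not None)  # first token is None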

~~~
cryptoz
Super interesting, thanks for the reply!

------
zitterbewegung
I was going to give a talk at Thotcon 2020 (pushed back to 2021) about
generating fake tweets by refining GPT-2. The whole purpose of my experiment
was to see whether OpenAI was right in their statement that GPT-2 could be
used in nefarious ways. I saw a bunch of tutorials, but first I read gwern's
tutorial, and that made me understand the basics of using GPT-2.

If you look at gwern's experiments with GPT-2, you notice that his websites
are actually just extremely large samples of text/image data. Essentially,
the design is that the whole website is practically statically generated.

Refining on AWS was costing me $100 a day.

I also met Shawn ([https://github.com/shawwn](https://github.com/shawwn)), who
decided to attempt to refine/train large models using TPUs. That sounded
interesting, but since I figured I would waste time trying to understand TPUs,
given that I have a day job, I instead just bought a Titan RTX.

My first experiment was to see whether refinement of GPT-2 with your own data
would work. At first I used Donald Trump, because he is a very active person
on Twitter, so I believed people would have some ability to detect whether the
tweets were fake or not.

The above was a bad choice, for the following reasons:

1. People would generally believe that Trump would say nearly anything.

2. Someone who basically duplicated everything I did had commenters who were
apparently much better at figuring out that the tweets were fake.

Since that experiment sort of failed, I created an expanded refinement of
GPT-2 on general Twitter data: 200 MB of tweets from
[https://www.kaggle.com/kazanova/sentiment140](https://www.kaggle.com/kazanova/sentiment140).
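
For reference, this kind of "refinement" is a few lines with, e.g., the `gpt-2-simple` package (one of several tools for this; the file name and step count are illustrative):

    import gpt_2_simple as gpt2

    gpt2.download_gpt2(model_name="124M")      # fetch base GPT-2 weights
    sess = gpt2.start_tf_sess()
    gpt2.finetune(sess,
                  "tweets.txt",                # plain-text tweet corpus
                  model_name="124M",
                  steps=1000)                  # fine-tuning steps
    print(gpt2.generate(sess, return_as_list=True)[0])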

I then repeated the experiment and eventually figured out that some people
were really bad at spotting fake tweets, and that a person who used Twitter a
great deal would actually perform better. I'm not sure if it's that you tweet
a bunch or just read Twitter enough. I had a low sample size (n=5) when I
carried out the test, so my results could be completely biased.

The test procedure I followed was to make a set of 10-20 questions and then
have the user pick which tweets were fake and which weren't.

I was also doing these experiments on a Titan RTX (I'm thinking about getting
a V100, or maybe just another Titan RTX so I can train two models at the same
time). I initially upgraded the system memory to 32GB, which worked for a bit,
but you should probably get at least one or two multiples of your VRAM so that
you can keep your operating system running while holding the dataset in
memory.

Also, during refinement I don't think my loss was improving; it may simply be
that I wasn't training long enough.

But, as a conclusion, I figured out that there is a huge shortcut I never
considered. It would be MUCH easier to just take any refined GPT-2 model, or
even use GPT-3 via the API as above, and then make the output LOOK like a
tweet. Just adding a hashtag and a t.co link would work. (The funny part is
that GPT-2 actually seems to have some notion of a t.co link and will happily
generate t.co links that don't work. Removing t.co links before refinement
would be one way to get this out.)
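
The shortcut amounts to a few lines of string dressing (a toy sketch; the hashtag and dead t.co slug are made up):

    import random
    import string

    # Make arbitrary generated text look like a tweet by appending a
    # hashtag and a plausible-looking (dead) t.co link.
    def tweetify(text, max_len=280):
        slug = "".join(random.choices(string.ascii_letters + string.digits, k=10))
        suffix = " #news https://t.co/" + slug
        return text[: max_len - len(suffix)] + suffix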

I did some of these experiments using Google Colab initially, but as I used it
I ran into out-of-memory errors. I asked someone at PyOhio who worked for
Google if there was a way to connect Google Colab to a paid instance. They
referred me to Jake Vanderplas
([https://twitter.com/jakevdp](https://twitter.com/jakevdp)), who said no at
the time. A few months later, Google came out with Google Colab Pro, but I was
then able to run out of system memory even on the high-memory instances. I
upgraded my deep learning rig with an AMD 3600 and am now waiting for the 64
GB of memory that is in the mail.

The latest thing that I have done is use local voice synthesis and recognition
so that you can talk to GPT-2 locally. My tutorial is at
[https://www.youtube.com/watch?v=d6Lset0RFAw&t=2s](https://www.youtube.com/watch?v=d6Lset0RFAw&t=2s)

~~~
jamesdutc
In the future, will it be considered offensive to suggest that someone’s posts
look like they were generated by an ML model…?

~~~
teruakohatu
I have the same suspicion as you, and I like how delicately you put it.

One thing that stands out is the reference to an "AMD 3600", which makes
little sense in the context of buying a V100. Why buy an older low-to-mid-range
CPU for a deep learning rig with a high-end pro GPU? There is also the talk of
having "accidentally upgraded the memory to 32GB".

~~~
zitterbewegung
Yeah, so I thought I only needed 32GB for refinement, but I ran out of memory
when training on an 800MB dataset.

I use a Ryzen 3600 right now, and I saw that V100s are approximately the same
price as a Titan RTX.

~~~
britmob
You only need a large amount of system RAM for the initial dataset encoding.
You can do this once per dataset on a high memory config on AWS for pennies.

~~~
zitterbewegung
I agree with you, but I want to keep costs down, so I am using my workstation.

