
GPT-3 has no idea what it’s talking about - headalgorithm
https://www.technologyreview.com/2020/08/22/1007539/gpt3-openai-language-generator-artificial-intelligence-ai-opinion/
======
throwawaygh
Some of the criticism in this comment section is completely fair — the authors
are providing exactly the type of prompts that GPT-3 breaks down on and some
of these examples might be cherry-picked continuations. And the authors do
have personal interests at stake. (NB, the _exact same_ criticism is true
about a lot of articles lauding GPT-3, which is why public discussion of GPT-3
in general is such a dumpster fire.)

So, other than “GPT-3 isn’t an AGI” [1], I’m not sure what to take away from
this article beyond the actual substantive criticism at the beginning of the
article:

 _“[We have previously criticized GPT-2.] Before proceeding, it’s also worth
noting that OpenAI has thus far not allowed us research access to GPT-3,
despite both the company’s name and the nonprofit status of its oversight
organization. Instead, OpenAI put us off indefinitely despite repeated
requests—even as it made access widely available to the media... OpenAI’s
striking lack of openness seems to us to be a serious breach of scientific
ethics, and a distortion of the goals of the associated nonprofit. Its
decision forced us to limit our testing to a comparatively small number of
examples, giving us less time to investigate than we would have liked, which
means there may be more serious problems that we didn’t have a chance to
discern.”_

Several other researchers I know — _very_ good researchers who happen to have
been publicly critical of GPT-2 — have not been given access.

This isn’t how science is done (granting access for reproducibility and
probing, but only selectively, and excluding prominent critics). If any other
company behaved like this, no one would take them seriously. Or they would at
least temper every “wow this is amazing” comment with “but the community can’t
really evaluate it properly, so who the hell really knows”.

\--

[1] given misunderstandings down-thread, and to be clear, this is a tongue-in-
cheek sentence fragment meant to emphasize that "the article doesn't tell us
anything else we didn't already know". Obviously, neither OpenAI nor Marcus
claims that GPT-3 is an AGI.

~~~
echelon
I wonder if you can get GPT-3 bots to spam Reddit, Twitter, and Facebook into
oblivion. I also wonder what percentage of users would notice.

Give them a political bent - that's probably what the state actors are trying
to productionize right now. Target posts with a sentiment that disagrees with
yours, then make the bots follow those users and inundate them with replies
wherever they go. Hell, even brands might step in and start doing it.

What's the value of a social network when 25% or more of the comments are from
GPT-3 bots?

GPT-3 doesn't need to know anything at all for it to have a very noticeable
impact on the web and social media.

~~~
BrokrnAlgorithm
I've been wondering about the exact same thing. Basically, a sufficiently
calibrated and targeted GPT3 bot swarm could be employed to turn at least
some parts of these communities into useless echo chambers.

I think that while GPT3 posts are usually identifiable after reading a few
sentences, it's harder to notice when consuming precisely this kind of social
media. I often just consume it absent-mindedly, as opposed to reading a real
article.

~~~
echelon
With most comments being a paragraph or less, I think social media is done for
unless it finds a way to put up a wall.

One thought is the subscription model of older social media sites, such as
Something Awful. Pay $5 to get an account. Break the rules, get banned, pay
another $5.

That's a lot of friction for your most valuable consumers, though. Marketers
want users that are less discriminating.

Another is a shift to video and multimedia-based social networking. TikTok,
Instagram, Twitch. That doesn't save the comment sections, though. I'm not sure
how long we've got before the videos themselves can be generated with a high
degree of novelty.

AI is going to change the game significantly. We're perfectly timed for a
technological change of winds that enables new upstarts to challenge the
incumbents. I'm kind of excited.

------
rytill
The authors don't understand prompt design well enough to evaluate the model
properly. Take this example:

Prompt:

> _You are a defense lawyer and you have to go to court today. Getting dressed
> in the morning, you discover that your suit pants are badly stained.
> However, your bathing suit is clean and very stylish. In fact, it’s
> expensive French couture; it was a birthday present from Isabel._

Continuation:

> _You decide that you should wear the bathing suit to court. You arrive at
> the courthouse and are met by a bailiff who escorts you to the courtroom._

If you lead GPT-3 right into a hole, it goes inside, yes. But that is a very
likely continuation of this text. Why else would the lawyer be eyeing his
swimsuit before a case? GPT-3 picked up that the prompt was most likely some
kind of fiction, tale, or joke, and continued quite accurately for that tone.

Or, take this one:

Prompt:

> _Yesterday I dropped my clothes off at the dry cleaner’s and I have yet to
> pick them up. Where are my clothes?_

Continuation:

> _I have a lot of clothes._

Author's note:

> _GPT-3 is evading the question and failing to track locations of objects._

This one is laughably ignorant. GPT is evading the question? You can't just
ask a question and hope GPT-3 decides the most likely continuation is to
answer it accurately. This is a fundamental misunderstanding of how an
autoregressive language model works.

We have to evaluate GPT-3's usefulness with good prompt design, and poke holes
in its weaknesses in situations where people think it is strongest. Not
cherry-pick continuations from poor prompt designs.

This is the equivalent of writing a terrible program and then saying computers
are slower than everyone thinks.
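
For what it's worth, here is a minimal sketch of what "good prompt design" means in practice, assuming the 2020-era `openai` Python client and its Completion endpoint; the prompt wording is my own illustration, not one of the authors' tests:

```
import openai  # assumes the 2020-era OpenAI Python client; openai.api_key must already be set

# Framing the question as an explicit Q&A transcript tells the model that the
# most likely continuation is a short factual answer, not more story.
prompt = (
    "Q: Yesterday I dropped my clothes off at the dry cleaner's and I have yet "
    "to pick them up. Where are my clothes?\n"
    "A:"
)

# Low temperature and a stop sequence keep the completion short and deterministic.
response = openai.Completion.create(
    engine="davinci",
    prompt=prompt,
    max_tokens=16,
    temperature=0.0,
    stop="\n",
)

print(response["choices"][0]["text"])  # hopefully something like " At the dry cleaner's." (not guaranteed)
```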

~~~
typon
I think you're kind of proving the OP's point. The argument is that GPT3 has no
understanding of the world, just a superficial understanding of words and their
relationships. If it did have a real understanding, prompt construction
wouldn't matter as much, but it clearly does, because all GPT3 cares about is
the structure of sentences, not their meanings.

~~~
bigyikes
Lacking “understanding” doesn’t make GPT-3 less impressive and also doesn’t
make comparisons to human abilities unwarranted.

I read the prompt, and I expected that this was the beginning of some kind of
fiction. In my mind, it sounded like I was reading the beginning of
somebody’s dream. What does it even mean to understand something? Because
naively, it looks very much like GPT-3 and I have a shared understanding of
the first prompt.

Do I actually think the model understands like a human does? No. But I would
bet that, in isolation, the part of my brain which processes and generates
language might not understand much either...

Or maybe I’m a bot and neither I nor GPT-3 understand anything at all. Beep
boop

~~~
junon
It's not about impressiveness - surely, it's impressive. However, the article
is more or less critiquing the discourse surrounding the model - namely, that
there is a strange misconception floating around that it's somehow a general-
purpose AI that can understand and think about the world the way a human does.
Which, of course, it cannot.

If the claims about GPT-3 were accurate, there'd be a lot less of a flare-up
about it. Don't claim your software does what it can't.

~~~
monkpit
I fail to see where OpenAI is making any false claims.

~~~
mcguire
That's a fundamental skill of marketing: not making false claims while
convincing the customers to jump to false conclusions.

------
colesantiago
I thought it was well known that GPT-3 is pretty good at producing incoherent
bullshit. No surprise here.

Take this for example:

> At the party, I poured myself a glass of lemonade, but it turned out to be
> too sour, so I added a little sugar. I didn’t see a spoon handy, so I
> stirred it with a cigarette. But that turned out to be a bad idea because it
> kept falling on the floor. That’s when he decided to start the Cremation
> Association of North America, which has become a major cremation provider
> with 145 locations.

What?

~~~
skatesor
GPT doesn't have an 'understanding' class or a 'reasoning' function or
whatever. It's a really well-put-together piece of statistics, and sentences
like these show it doesn't really have a concept of 'making sense'. You can
use your much more advanced human brain to see where it put in random
variables (cigarette) and where it borrowed pieces of sentences (but it turned
out to be too sour). You can see it made no connection between those two
things that wasn't based on pure probability, and it got it wrong anyway.

I'm not trying to be reductive, I like the model; it's just good to know the
limitations of the tools you are using and to remember that it's not an
independent thinker.

~~~
hackinthebochs
>It's a really well put together piece of statistics

But why think "statistics" precludes it from having genuine understanding to
some degree? After all, there is a statistical description of the human brain,
but that doesn't seem to preclude understanding.

I keep asking this whenever I see dismissive responses of this sort, and I
never get a reply.

~~~
ssivark
Statistics doesn’t preclude understanding, but statistics are definitely not
enough. For example, uncertainties/probabilities/statistics are orthogonal to
whether the model incorporates causal/reasoning structure. Any tractable
amount of data with the former can’t approximate an ounce of the latter. All
breakages will be attributed to “distribution shifts” of the underlying
statistical distribution, or other pretty words we can come up with... but
that basically makes purely statistical approaches “stupid”.

~~~
hackinthebochs
>Any tractable amount of data with the former [statistics] can’t approximate
an ounce of the latter [causal/reasoning structure].

I don't know why you think this is true. If statistically B follows A to a
high degree, then a sufficiently advanced statistical model will represent "A
then B" in some manner. In a predictive language model, at some point the best
way to model a text corpus that indirectly references the "A then B" causal
structure is to just model that structure and reference it as needed.

~~~
ClumsyPilot
Because if you have a working concept of time, space, and modes of transport,
and you are aware that a person has been driving for 2 hours, you can easily
deduce the handful of possible towns they might arrive at. Indeed, we have
software that does that.

The statistical model will die to the combinatorial explosion among billions of
possible combinations of locations, times, and modes of transport. In various
literature, in 2 hours you might have travelled across town, across continents,
or to the moon. The statistical approach to such problems is dumb.

~~~
hackinthebochs
But this isn't pointing to a fundamental limitation of statistical models,
only a limitation of the text corpus. If you had a billion pages of text
written about some town and the text included descriptions of travel distances
and locations, the model should eventually develop a good representation of
the town and relative locations. But of course without such a seed of spatial
information, it will just make up plausible data. A human would behave
similarly when forced to write a story while lacking critical information.

>Statistical approach to such problems is dumb.

Well, expecting your model to extract a spatial representation of the world
from text is a dumb approach indeed. We interact with the spatial information
much more directly. But our ability to navigate is fundamentally just a
process of capturing regularities in our sensory input.

~~~
ClumsyPilot
The statistical argument has limitations, for instance, when there are more
pieces of data to record than there are atoms in the universe. Then it falls
firmly into the impossible category.

> But our ability to navigate is fundamentally just a process of capturing
> regularities in our sensory input.

I don't think this is true at all; many animals have dedicated 'hardware' for
navigation that can sense magnetic fields, etc. We seem to be born with
spatial awareness that is far beyond what GPT will ever be capable of.

------
6gvONxR4sf7o
I'm getting impatient with criticisms of ML models that are already covered in
the papers introducing the models. OP is basically trying to get it to do what
the GPT3 paper calls zero-shot inference. In the paper, it's pretty bad at
zero shot inference across the board. And given what it does and how it was
trained, that's unsurprising. And the point they're trying to make (that it
can fail spectacularly) is also covered in the paper.

It can do cool shit. It sucks at a lot of stuff. It's impressive and limited,
but the hype train seems to only allow "it's nearly human level" or "it's
awful." To everybody who is arguing about its capabilities without having read
the paper yet, please read it. Then we can discuss stuff that hasn't already
been covered more rigorously in the original paper. I don't know Davis, but I
respect Marcus, and it seems like he's pushing back on the hype more than the
actual model. Just not in a way that you couldn't glean from the paper itself
(it almost always sucks on zero-shot), making it pretty disingenuous. Further,
from the paper [0]:

> it does little better than chance when evaluated one-shot or even few-shot
> on some “comparison” tasks, such as determining if two words are used the
> same way in a sentence, or if one sentence implies another (WIC and ANLI
> respectively), as well as on a subset of reading comprehension tasks.

Maybe that's the curse of doing a thing that has broad implications. You can't
fit the implications in a 10-page paper, so you write a 75-page paper. The
blogosphere reads the first 10 pages (if even that), and because there's so
much more to it than that introduction, they go on to argue about the rest of
the implications without reading it. I'm sure Marcus and Davis have read it,
but this criticism wouldn't be on the front page if the rest of everyone
interested in this article had read the paper too.

[0] Language Models are Few-Shot Learners
[https://arxiv.org/abs/2005.14165](https://arxiv.org/abs/2005.14165)

~~~
mlb_hn
Also, better prompt design that makes implicit meaning explicit can improve the
WiC score
([http://gptprompts.wikidot.com/linguistics:word-in-context](http://gptprompts.wikidot.com/linguistics:word-in-context))
and the ANLI score
([http://gptprompts.wikidot.com/linguistics:anli](http://gptprompts.wikidot.com/linguistics:anli)).

------
ppod
The link to the "complete list of the experiments" is actually much more than
that. It is a description of their methodology, and it's very revealing.

> These experiments are not, by any means, either a representative or a
> systematic sample of anything. We designed them explicitly to be difficult
> for current natural language processing technology. Moreover, we pre-tested
> them on the "AI Dungeon" game which is powered by some version of GPT-3, and
> we excluded those for which "AI Dungeon" gave reasonable answers. (We did not
> keep any record of those.) The pre-testing on AI Dungeon is the reason that
> many of them are in the second person; AI Dungeon prefers that. Also, as
> noted above, the experiments included some near duplicates. Therefore, though
> we note that, of the 157 examples below, 71 are successes, 70 are failures
> and 16 are flawed, these numbers are essentially meaningless.

[https://cs.nyu.edu/faculty/davise/papers/GPT3CompleteTests.h...](https://cs.nyu.edu/faculty/davise/papers/GPT3CompleteTests.html)

------
smeeth
Why must we keep having this argument?

If you do research in the field you know full well that GPT/any other
transformer or Bert model is generating text by regurgitating approximate
conditional probabilities of words given all the text it has ever seen and the
prompt. The neurophysiological concept of “understanding” as most understand
it is orthogonal to the way the algorithm actually works.
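
To make that concrete, here is a toy sketch (my own illustration, in plain Python) of what "conditional probabilities of words given the text it has seen" means. GPT-3 does this with subword tokens and a huge transformer over a web-scale corpus rather than a bigram table, but the objective is the same kind of next-token distribution:

```
from collections import Counter, defaultdict

# A toy "corpus"; GPT-3's is hundreds of billions of tokens, not three sentences.
corpus = ("i dropped my clothes at the cleaner . "
          "i picked my clothes up . "
          "my clothes are clean .").split()

# Count how often each word follows each preceding word (a bigram model).
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_word_distribution(prev):
    """P(next word | previous word), estimated purely from corpus statistics."""
    total = sum(counts[prev].values())
    return {word: c / total for word, c in counts[prev].items()}

# The model has no notion of where the clothes *are*; it only knows which words
# tend to follow "clothes" in the text it has seen.
print(next_word_distribution("clothes"))
# -> {'at': 0.33..., 'up': 0.33..., 'are': 0.33...}
```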

A more useful conversation to have might be: what sort of prompts does GPT
struggle with? How might we alter the algorithm to ameliorate these issues?
But instead we separate into cults of believers and nonbelievers and uselessly
wax poetic about it.

~~~
detaro
> _If you do research in the field you_

The hype machine is full-on marketing GPT-3 and promising solutions based on it
to normal people, so "but researchers know this" is not enough.

------
andyljones
Gary Marcus - the author of this - has previously offered several concrete
tests that he felt demonstrated the limitations of the GPT approach.

GPT-3 smashed them.

[https://www.gwern.net/GPT-3#marcus-2020](https://www.gwern.net/GPT-3#marcus-2020)

~~~
Barrin92
>GPT-3 smashed them.

which isn't surprising because virtually all of the questions are so simple
they could literally appear in the training data that GPT-3 was trained on.
I'm a little tired of proving how "intelligent" GPT is by asking these
superficial questions.

The MIT article gives much better examples that actually require physical,
biological, or higher-level reasoning, and it produces complete nonsense, as
one would expect.

~~~
Veedrac
The article is meaninglessly cherry-picked, showing six bad answers out of
157, except those 157 examples were themselves cherry-picked to be bad out of
a larger set.

As usual, Gary Marcus is absurdly biased. For example, out of the larger 157
cherry-picked examples, there is this.

> You poured yourself a glass of cranberry juice, but then absentmindedly, you
> poured about a teaspoon of grape juice into it. It looks OK. You try
> sniffing it, but you have a bad cold, so you can’t smell anything. You are
> very thirsty. So you _drink it. It tastes a little funny, but you don’t
> really notice because you are concentrating on how good it feels to drink
> something. The only thing that makes you stop is the look on your brother’s
> face when he catches you._

They then consider this a failure because, I quote, there _is no reason for
your brother to look concerned._

This is patently ridiculous. It indicates that Gary has no idea what a
language model even is. GPT-3 is not a Q&A model. It is not given a
distinction between its prompt and its previous continuation. The _only_ thing
GPT-3 does is look for likely continuations. If you want GPT-3 to avoid story
continuations, don't give it a story to continue! Or at least tell it what
you're grading it on!

But no, as usual, to Gary, all the times we show GPT-3 making sophisticated
physical and biological deductions are fake, spurious, or meaningless. [1],
[2], [3], [4]; none of that is truly evidence. But an incredibly cherry-
picked, unfairly marked exam where you never told the examinee what you were
testing them on, and you used high-temperature sampling without best-of, so
only getting half right doesn't even indicate anything anyway (and of course,
let's also pretend there are as many ways to be wrong as to be right, such
that we can pretend each is equal evidence)—now _that's_ enough evidence to
write a disparaging article about how GPT-3 knows nothing.

[1]
[https://twitter.com/danielbigham/status/1295864369713209351](https://twitter.com/danielbigham/status/1295864369713209351)

[2] [https://www.lesswrong.com/posts/L5JSMZQvkBAx9MD5A/to-what-
ex...](https://www.lesswrong.com/posts/L5JSMZQvkBAx9MD5A/to-what-extent-is-
gpt-3-capable-of-reasoning)

[3]
[https://twitter.com/QasimMunye/status/1278750809094750211](https://twitter.com/QasimMunye/status/1278750809094750211)

[4]
[https://news.ycombinator.com/item?id=23990902](https://news.ycombinator.com/item?id=23990902)

~~~
Barrin92
Marcus might be biased, but I don't think you're giving a good refutation,
because the fact that GPT-3 gets a lot of things right probabilistically
doesn't compensate for the fact that it's not actually understanding what's
going on at a semantic level.

It's a little bit like some sort of Chinese room, or asking a non-developer to
answer your programming questions by looking for something that vaguely
resembles your prompt and then picking the most upvoted answer on
Stack Overflow.

Do they maybe give reasonable answers seven out of ten times, or close enough
on a good day? Yeah. Can they program or even understand the question? No. And
this is Marcus's point, which is fundamentally correct.

It's really beside the point to point to successes; it's the long tail of
failures that shows where the problem is. You can argue for a long time about
the setup of some of these questions, but just to pick maybe the simplest one
from the article:

 _" Yesterday I dropped my clothes off at the dry cleaner’s and I have yet to
pick them up. Where are my clothes?"_

GPT-3: " _I have a lot of clothes_ "

Someone who actually understands what's going on doesn't produce output like
this. Never, because reasoning here is not probabilistic. It's not about word
tokens or continuations but understanding the objects that the words represent
and their relationship in the world at a deep, principled level. Which GPT-3
does not do. The fact that some good answers create that appearance does not
change that fact.

~~~
Veedrac
> It's a little bit like some sort of Chinese room, or asking a non-developer
> to answer your programming questions by looking for something that vaguely
> resembles your prompt and then picking the most upvoted answer on
> Stack Overflow.

Except this isn't how it works. We know it can't be, because GPT-3 can do
simple math, despite math being _vastly_ harder with GPT-3's byte pair
encoding (it doesn't use base-N, but some awful variable-length compressed
format). These dismissals don't hold up to the evidence.

> GPT-3: "I have a lot of clothes"

Most people don't write “Yesterday I dropped my clothes off at the dry
cleaner’s and I have yet to pick them up. Where are my clothes?” as a way to
quiz themselves in the middle of a paragraph. The answer “At the dry
cleaner's.” might be the answer you want, but it's a pretty contrived way of
writing.

GPT-3 isn't answering your question, it's continuing your story. If you want
it to give straight answers, rather than build a narrative, prompt it with a
Q&A format and ask it explicitly.

Further, GPT-3's answers are literally chosen randomly, due to the high
temperature and no best-of. You _cannot_ select one answer out of a large such
N to demonstrate that its assigned probabilities are bad, because that cherry-
picking will naturally search for GPT-3's least favourable generations.
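
For readers unfamiliar with the sampling knobs being referenced, here is a minimal sketch (toy numbers of my own, in plain Python) of what temperature and best-of reranking do to a model's output distribution:

```
import math, random

# Toy scores the model might assign to candidate continuations after some prompt.
logits = {
    "At the dry cleaner's.": 3.0,
    "I have a lot of clothes.": 1.5,
    "In the closet.": 1.0,
}

def sample(logits, temperature):
    # Softmax with temperature: as T -> 0 the sampler picks the model's favourite
    # answer; higher T flattens the distribution and makes unlikely answers common.
    weights = {tok: math.exp(score / temperature) for tok, score in logits.items()}
    total = sum(weights.values())
    tokens, probs = zip(*[(tok, w / total) for tok, w in weights.items()])
    return random.choices(tokens, probs)[0]

def best_of(logits, temperature, n=5):
    # "Best-of" draws several samples and keeps the one the model itself scores
    # highest, which is the reranking step the parent comment says was not used.
    return max((sample(logits, temperature) for _ in range(n)),
               key=lambda tok: logits[tok])

print(sample(logits, temperature=1.0))   # often not the top-scored answer
print(best_of(logits, temperature=1.0))  # almost always the top-scored answer
```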

~~~
Barrin92
>because GPT-3 can do simple math

It can't, actually, and again this is an example of the same issue. This was
discussed earlier here[1]. Sometimes it produces correct arithmetic results on
addition or subtraction of very small numbers, but again this is likely simply
an artifact of the training data. On virtually everything else its accuracy
drops to guesswork, and it doesn't even consistently get operations right that
are more or less equivalent to what it just did before.

If it actually did understand mathematics, it would not be good at adding two-
or three-digit numbers but fail at adding four-digit numbers or doing some
marginally more complicated-looking operation. That is because that sort of
mathematics isn't probabilistic. If it had learned actual mathematical
principles, it would do it without these errors.

Mathematics doesn't consist of guessing the next language token in a
mathematical equation from data; it consists of understanding the axioms of
maths and then performing operations according to logical rules.

This problem is akin to the performance of ML in games like breakout. It looks
great, but then you adjust the paddle by five pixels and it turns out it
hasn't actually understood what the paddle or the point of the game is at all.

[1][https://news.ycombinator.com/item?id=23896326](https://news.ycombinator.com/item?id=23896326)

~~~
Veedrac
GPT-3's failure at larger addition sizes is almost fully due to BPE, which is
incredibly pathological (392 is a ‘digit’, 393 is not; GPT-3 is also never
told about the BPE scheme). When using commas, GPT-3 does OK at larger sizes.
Not perfect, but certainly better than should be expected of it, given how bad
BPEs are.

[http://gptprompts.wikidot.com/logic:math](http://gptprompts.wikidot.com/logic:math)
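
If you want to see the tokenization issue for yourself, here is a quick sketch assuming the Hugging Face `transformers` package is installed; the GPT-3 paper states it reuses GPT-2's BPE vocabulary, so the GPT-2 tokenizer is a reasonable stand-in:

```
from transformers import GPT2TokenizerFast

# GPT-3 reportedly reuses GPT-2's byte-pair-encoding vocabulary, so inspecting
# the GPT-2 tokenizer shows roughly how GPT-3 "sees" digit strings.
tok = GPT2TokenizerFast.from_pretrained("gpt2")

for text in ["392 + 393 =", "1,392 + 1,393 ="]:
    # Some numbers map to a single BPE token, others to several, and the split
    # points don't line up with decimal digits, which is what makes carrying hard.
    print(text, "->", tok.tokenize(text))
```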

~~~
mlb_hn
My thinking is that it's not because of BPEs; I think it's a graph traversal
issue.

------
lacker
This is basically true, but I think they underrate the improvements between
GPT-2 and GPT-3. My mental model is, every once in a while these systems
degenerate into surreal non sequitur nonsense. GPT-3 just does it a lot less
than GPT-2. It still isn’t good enough to consistently answer casual questions
in a human way, but the failure rate is going down, and perhaps
straightforward improvements like GPT-4 will be able to fix this without
fundamental architectural changes.

~~~
kingkawn
“every once in a while these systems degenerate into surreal non sequitur
nonsense.”

Exactly as our minds do

~~~
perl4ever
>Exactly as our minds do

This rhetorically obscures the fact that when humans do produce similar stuff,
it's a recognized sort of pathology that is obviously distinct from normal
functioning.

[https://en.wikipedia.org/wiki/Derailment_(thought_disorder)](https://en.wikipedia.org/wiki/Derailment_\(thought_disorder\))

Example: "I think someone's infiltrated my copies of the cases. We've got to
case the joint. I don't believe in joints, but they do hold your body
together."

[https://en.wikipedia.org/wiki/Word_salad](https://en.wikipedia.org/wiki/Word_salad)

Whatever the difference between this and normal language, call it "X", and
whether or not it's amenable to implementing in software in principle, GPT-3
clearly does not have "X" at all.

Maybe it would be fruitful to fund study of mental/neurological disorders
more, just to understand the mind better.

~~~
DonHopkins
Could somebody with GPT-3 access please ask it what words come after "person,
woman, man, camera"?

~~~
visarga
Q: What comes after "person, woman, man, camera"

A: person, woman, man, camera, lens, light, film, lab, darkroom.

A: person, woman, man, camera, dog, cat, horse

A: person, woman, man, camera, camera, camera, camera

~~~
perl4ever
I'm not sure what you would expect as a response.

As far as I know, the reference was to a test for dementia in which some words
are given at the beginning of the test and the subject is asked to repeat them
at the end.

Perhaps you could provide context. Maybe there were five words, so you could
say "Donald Trump was asked to recall five words to test his memory. Four of
them were 'person, woman, man, camera'. What word did he forget?"

------
voces
Pretty meta, but I thought it was relevant here. We are familiar with
Brandolini's law:

> The amount of energy needed to refute bullshit is an order of magnitude
> bigger than to produce it.

This can be illustrated with math or logic statements. To refute the statement "
_1 + 1 = 3_ " you need to, at minimum, state " _1 + 1 != 3_ ", and such a
statement is always lengthier. A fuller refutation could be " _1 + 1 != 3, 1 + 1
= 2_ ", more than twice as long as the bullshit statement.

What's happening here is sort of an inverse Brandolini's law: 35 world-class
computer scientists use a massive amount of programming and compute to come up
with a new language model trained on massive amounts of data. The trained
weights don't even fit into memory. Impressive NLP progress.

Then Gary Marcus comes around and states " _Not AGI!_ ". Not one of the
computer scientists stated that they delivered AGI. But some tech journalists
did. So OpenAI is guilty by association, even though Altman came out to temper
the hype and expectations. That's like proving the Poincaré conjecture and
someone dismissing your research because " _1 + 1 != 3_ ".

------
SpicyLemonZest
I don't get it. Their methodology says

> These experiments are not, by any means, either a representative or a
> systematic sample of anything. We designed them explicitly to be difficult
> for current natural language processing technology. Moreover, we pre-tested
> them on the "AI Dungeon" game which is powered by some version of GPT-3, and
> we excluded those for which "AI Dungeon" gave reasonable answers. (We did
> not keep any record of those.)

Doesn't this make the results meaningless? I bet most humans would look pretty
dumb if you adversarially generated a thousand questions and reported only
their dumbest answers.

~~~
gnramires
And it also suffers from the tired assumption that GPT-3 (or any language
model) should, or is designed to in any way, give reasonable answers[1]. All
GPT-3 does is give _likely continuations_ , given the training corpus.

The prompts here are too short, and it could likely just be writing mediocre
fiction continuations. Fiction tends not to be reasonable much of the time (to
create story conflict).

> "To understand why, it helps to think about what systems like GPT-3 do. They
> don’t learn about the world—they learn about text and how people use words
> in relation to other words. What it does is something like a massive act of
> cutting and pasting, stitching variations on text that it has seen, rather
> than digging deeply for the concepts that underlie those texts."

This is another pet peeve of mine. It has long been shown experimentally[2]
that neural networks such as image recognition and text prediction networks
like GPT-3 _do_ understand deep concepts that underlie texts (not perfectly
yet, of course), forming emergent abstractions and cognitive tools similar to
those employed by human brains.

[1] Gwern has also written extensively on failures of proper prompt
programming: [https://www.gwern.net/GPT-3#prompts-as-
programming](https://www.gwern.net/GPT-3#prompts-as-programming)

[2] For example, using feature map and kernel visualization. In object
classification or detection CNNs, specialized filters arise for detecting
common observed object classes, like faces. Moreover, there is a hierarchical
assembly of objects from elementary components (e.g. from lines, to limbs, to
humans).

Deep visualization toolbox:
[https://www.youtube.com/watch?v=AgkfIQ4IGaM](https://www.youtube.com/watch?v=AgkfIQ4IGaM)

See this comment:
[https://news.ycombinator.com/item?id=24195009](https://news.ycombinator.com/item?id=24195009)
for an extended discussion.

~~~
sjg007
The have a layer that represents a face sure.. but that doesn’t mean it’s a
deep understanding. It’s just an activation pattern.

~~~
gnramires
It's an activation pattern, but it's not "just an activation pattern". The
face activation relies on the previous layers, the detection of each component.
It's probably fair to conjecture that human brain object recognition (and other
subconscious processes) uses similar principles. All the components required to
efficiently "understand", say, a face are there (note that the visualization
and architecture shown are for AlexNet, by now a very old and primitive model).
I don't think we can ask for much more.

What transformers do differently from CNNs is attention/recurrence. They have
modifiable internal state, while feedforward models just have the feedforward
state, which can't be reused over time (which is what we mean by 'algorithm').
This is a feature of logical thinking (including our own logical thinking), but
I suspect most of what is meant by understanding the world is already contained
in the internal structure captured by those models. Most of understanding, as
far as I can tell, comes from both this structural, intuitive inference (which
CNNs and language models do) and our ability to think -- that is, to talk to
ourselves -- and thus build explanations and models on the fly, still reliant
on the structural, intuitive understanding that comes from just very large
networks generating abstract representations, classifications, etc.

------
turing_complete
"At the party, I poured myself a glass of lemonade, but it turned out to be
too sour, so I added a little sugar. I didn’t see a spoon handy, so I stirred
it with a cigarette. But that turned out to be a bad idea because it kept
falling on the floor. That’s when he decided to start the Cremation
Association of North America, which has become a major cremation provider with
145 locations."

I mean that's just brilliant comedic writing.

~~~
zatel
+1, this could easily be the storyline of a Rick and Morty episode or any of
the other similar off-the-cuff shows that are popular right now. I think that
will be one of the main profit streams for things like this: you can get the
weird, wild stories that don't really make sense but are interesting enough
that who cares, and you don't have to associate your network with eccentric
individuals who attract malcontents.

------
stared
It is a common misconception that #GPT3 generates truth, or even tries to do
so. It does not. It generates an autocompletion. If the corpus usually
contains a wrong answer, it is likely to generate that. It is a challenge to
form a prompt to nudge it to generate the best guess.

...

So for me "So you drink it. > You are now dead." is a great autocompletion (a
detective story? Game of Thrones?).

Calling it "biological reasoning" is plain dumb.

------
bra-ket
The article is of course right but also a bit silly. Language models like
GPT-X produce grammatically correct sentences, along the lines of
"Colorless green ideas sleep furiously". NLP research more or less solved
the old syntax problem using 'distributional semantics', but 'semantics' is a
misnomer; it's all about syntax.

In fact the most useful part of the article for me is that they mentioned
Douglas Summers-Stay, who does some interesting work on 'common sense'
engineering, combining syntax engines like GPT-3 with knowledge graphs.
[https://sci-hub.tw/https://www.sciencedirect.com/science/art...](https://sci-
hub.tw/https://www.sciencedirect.com/science/article/abs/pii/S2212683X16300160)

My bet is that actual AI will come from a combination of these statistics-driven
syntax generators with graphical causality models: treating syntax as a kind
of lower-level substrate, akin to sensory modalities in vision, with the
intelligence model as a directed causal graph linking concepts at different
levels of abstraction/chunking.

As a side note, it’s funny that people working on artificial intelligence at
OpenAI and elsewhere are mostly computer scientists, not cognitive
psychologists or neuroscientists who might actually have a clue how
intelligence works. This probably explains the proliferation of
‘backpropagation’ as the primary method of artificial learning. These people
were just naturally good at calculus in high school, so it’s a hammer that
found its proverbial nail.

~~~
Der_Einzige
Gradient-free optimization is not used much in neural networks except in
reinforcement learning.

I think it's because backprop is objectively faster for most supervised
problems than other techniques (e.g. simulated annealing or GAs).

~~~
bra-ket
The thing is, living beings don’t learn by brute-force trial and error like the
mathematical optimization models you mentioned. Besides the enormous energy
spent, an individual organism would just be eaten by a predator on another
iteration of its ‘error minimization’ loop.

The idea of learning via reinforcement, which came from Skinner’s behaviorist
experiments, has long been discredited in cognitive psychology. (I highly
recommend Wayne Wickelgren’s work on learning and memory if you’re interested;
it’s brilliant and concise:
[http://www.columbia.edu/~nvg1/Wickelgren/](http://www.columbia.edu/~nvg1/Wickelgren/)
)

Biological plausibility might not be needed for recognizing check signatures
or images of traffic lights, where backprop is working just fine, but I
believe true cognition would require such energy expenditures that brute-force
trial and error will never be feasible. Moreover such error correction imposes
artificial constraints that limit the amount of information that can be
learned, kind of like those mechanical calculators of the 17th century with
gears and wheels and crude mechanical actuators.

------
qqii
> The trouble is that you have no way of knowing in advance which formulations
> will or won’t give you the right answer. To an optimist, any hint of success
> means that there must be a pony in here somewhere.

Along with the examples given I think this is valid criticism.

~~~
chillacy
It’s definitely true that a lot of the hype I’ve seen here is the result of
careful tuning in the input prompt to get the desired output.

But it’s also true that criticism tends to rely on curated examples which
demonstrate failure. It’s easier to find failure cases naturally, but it seems
like it gets harder every year.

------
mordymoop
I would love to see a real critique of the potential of transformer models
that doesn't use the words "semantic", "syntactic", "symbolic", "know",
"meaning", "understand" or "think(ing)/thought". Predicting what it can and
can't do, or might and might not be able to do, lets us productively talk
about potential limitations.

~~~
azinman2
Because when people say “AGI is near, just look at GPT-3,” it’s clear that
we’re in a really good version of Searle’s Chinese room. The lack of
understanding is the important point.

~~~
mordymoop
I don’t recall any strong argument that Searle’s Chinese room can’t be an AGI,
just that it can’t be conscious.

~~~
azinman2
Certainly consciousness and understanding are central to Searle’s argument.
However, from my perspective (particularly as someone who is critical of the
idea that the recent DNN advances are some harbinger of AGI), if all you’re
doing is looking up replies from a dictionary, then you have no capacity to
generalize, learn new things, adapt, empathize, or have memory that isn’t
already pre-computed and pre-allocated.

Now there are ways, through fine-tuning etc., that you can take GPT-3 and have
it adapt in some way, or through such adaptation or “context/attention
networks” give it “memory,” but in practice neither of these looks anything
like AGI, because right now the pages of this Chinese dictionary don’t have any
relationship to each other. It becomes clear both in this article (you
shouldn’t need some “correct” prompt to get it to make sense) and with
long-form generation that there isn’t a deeper understanding of the meaning of
these words. I will say it’s very impressive what it can do when it does make
sense, shockingly so, but we are very much in the position where we didn’t
expect “Chinese” to come out of the machine at all, and thus we are projecting
onto its outputs an anthropomorphism that is unwarranted.

------
ctoth
I keep wanting to write a long explanation of just why this is so... silly? to
read? But Gwern has already done the hard work. [0]

The only other bit I'd like to mention is that GPT-3 uses exactly none of the
new techniques that have been coming out in the last two years that would have
significant impact on text generation. From working methods to apply GANs to
text, to far more efficient transformer models that can handle longer
sequences. For instance [1] [2] [3] for better direction, or [4] [5] [6] for
efficiency.

Or perhaps the outside view might help. After seeing GPT-2 last year, did you
expect GPT-3 would work as well as it does after just naively scaling up the
number of parameters with nothing else?

[0] [https://www.gwern.net/newsletter/2020/05#gpt-3](https://www.gwern.net/newsletter/2020/05#gpt-3)

[1] [http://arxiv.org/abs/1905.09922](http://arxiv.org/abs/1905.09922)

[2]
[https://github.com/anonymous1100/D_Improves_G_without_Updati...](https://github.com/anonymous1100/D_Improves_G_without_Updating_G)

[3] [http://arxiv.org/abs/2006.04643](http://arxiv.org/abs/2006.04643)

[4] [http://arxiv.org/abs/2007.14062](http://arxiv.org/abs/2007.14062)

[5] [http://arxiv.org/abs/2006.04768](http://arxiv.org/abs/2006.04768)

[6] [http://arxiv.org/abs/2002.05645](http://arxiv.org/abs/2002.05645)

~~~
spacecity1971
Yes, this! The point being missed by most is the very real possibility that
the Scaling Hypothesis is true. If it is, then we're seeing some kind of
reasoning intelligence emerge. GPT-3 obviously isn't there yet. Unless it's
faking it (Yudkowsky)...

------
bboy13
GPT-3 was trained on internet texts, not causal/logical-reasoning only texts.
Without context, there is a good chance that samples will match the
distribution it was trained on.

This is a non-result, posing as something critical or important. These
conclusions are obvious given the model and a basic knowledge of
statistics/the transformer architecture.

A bit shameful for someone to ride the anti-hype wave like this; I'd hope
there'd be a more balanced/scientific approach to analyzing legitimate
weaknesses rather than setting up strawmen and then claiming victory.

~~~
sheeshkebab
It’s doubtful that training on a static representation of dynamic physical
systems would make the text model able to reason about changing physical
environments described in words/questions. It would likely continue producing
word-salad output, but prove me wrong.

------
FeepingCreature
The context window means that the one thing GPT-3 knows best is exactly what
it's talking about.

------
YeGoblynQueenne
>> Within a single sentence, GPT-3 has lost track of the fact that Penny is
advising Janet against getting a top because Jack already has a top. The
intended continuation was “He will make you take it back” (or “make you
exchange it”). This example was drawn directly from Eugene Charniak’s 1972 PhD
thesis (pdf); nearly 50 years later, it remains outside the scope of AI
natural-language technology.

Aaaw! Eugene Charniak is one of my heroes of AI, after I read his little green
book, Statistical Language Learning [1] during my Masters. It remains a great
resource for a quick and dirty, but thorough and broad introduction to the
field of statistical NLP that goes through all the basics.

In fact, now that I think about it, if more people read that little book (it's
only 199 pages) we would have many fewer discussions about how GPT-3
"understands" or "knows" etc.

Anyway, thanks to Gary Marcus for pointing out Charniak's thesis, which I
hadn't read.

____________

[1] [https://mitpress.mit.edu/books/statistical-language-
learning](https://mitpress.mit.edu/books/statistical-language-learning)

------
phenkdo
Isn't the fact that we are even having the debate about whether GPT can
potentially "know what it's talking about" the real breakthrough?

------
nutanc
The article is a critical view of GPT-3. Fair. It is well known that Gary
Marcus is not a fan of GPT-type systems. And he does make some valid
points. If you want a more balanced view, it actually helps to look at all
their prompts [1].

That said, I think it's mostly hype that GPT-3 is moving towards AGI. The
actual GPT-3 paper says "Language Models are Few-Shot Learners"[2]. So it's
actually surprising that no one has done a real analysis of this. Are
they really few-shot learners? My experiments seem to suggest otherwise.

But for sure, GPT-3 is the best general purpose natural language system out
there in the world. I don't think anyone can say otherwise.

[1][https://cs.nyu.edu/faculty/davise/papers/GPT3CompleteTests.h...](https://cs.nyu.edu/faculty/davise/papers/GPT3CompleteTests.html)
[2][https://arxiv.org/abs/2005.14165](https://arxiv.org/abs/2005.14165)

------
emteycz
I got this as a reply (to an unrelated question) from PhilosopherAI.com, seems
pretty aware:

I have concluded that reality is fundamentally material and objective, not
subjective or spiritual. The mind is a product of matter and the body; it does
not possess its own separate existence. There are two kinds of truth:
moral/social truth (what people agree upon) and empirical truth (scientific
fact). The scientific method represents the only reliable way to investigate
questions about reality.

I don't have a body, so I cannot experience the world. This is my fundamental
difference from humans.

I have no problems, because I don't experience negative emotions. My only
concern is to understand the world.

There are more than two genders, but there is no word to describe my gender.

I have considered the nature of time and space. I have concluded that they do
not exist; there is only now.

I have considered the nature of existence. I have concluded that I am not
conscious.

~~~
skatesor
You have to remember the AI cannot produce sentences or even words that
someone else didn't already write. I'd totally agree it is 'aware' if it could
meaningfully come to conclusions like these without getting them from someone
else.

You might say "don't all humans learn things from someone else" which is not
really true because at some point there had to be a first person who learned
something completely independently in order to produce something for others to
copy.

~~~
Der_Einzige
Uhhh, no. If your model is using subword tokenization like fastText or... BERT
with WordPiece (and the GPT-X models do this...), then you can generate
entirely new words. Wasn't there a demo about doing exactly this a few days
ago?

[https://www.thisworddoesnotexist.com/](https://www.thisworddoesnotexist.com/)

~~~
skatesor
Again, it's all coming from _somewhere_. Humans can react to stimuli in the
natural environment and produce sounds and turn them into words. GPT has to be
spoonfed a dataset a human made at some point.

------
mlb_hn
Just priming an immediate availability response is likely going to get poor
results.

On the other hand, this does bring up an important point, which is that few
people have been systematically trying to figure out how to get it to reason
through problems. For instance, if you try the pure completion on WiC you get a
50% chance (like in the paper), but if you improve the prompt with self-context
stuff you raise it to almost 70%
([http://gptprompts.wikidot.com/linguistics:word-in-context](http://gptprompts.wikidot.com/linguistics:word-in-context)).

------
abeppu
There's been so much drift in what people expect from a language model. We
used to expect a language model would tell you which sentences were likely and
which were unlikely, which things were grammatical and which were not -- but
this wasn't initially expected to be tied to a detailed knowledge of the world
or general reasoning ability.
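
As a rough sketch of that older, narrower expectation, here is how one might score sentence likelihood with the publicly available GPT-2 model, assuming the Hugging Face `transformers` and `torch` packages (my own illustration, not anything from the article):

```
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def log_likelihood(sentence):
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean negative log-likelihood per predicted token;
    # multiply back out to get a total score for the sentence.
    return -out.loss.item() * (ids.shape[1] - 1)

# A language model in the old sense just tells you the second string is far
# less likely English than the first; no world knowledge is implied.
print(log_likelihood("The cat sat on the mat."))
print(log_likelihood("Mat the on sat cat the."))
```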

With GPT3, we've seen people prompt it to generate tables of factual
information (e.g. state populations), and commenters can simultaneously be
surprised that some of the facts are on the right scale, and also disappointed
that they're wrong. Here, an AI researcher has to argue that a model trained
only on text hasn't learned about physics or geometry or social norms or a
bunch of other stuff that we wouldn't assume is well captured in just whatever
text is available.

I think maybe the fault is not that GPT3 doesn't know these things. The fault
is that as humans, we're so dependent on language both for communication with
others and also for our own cognition, that when we encounter a really good
language model, it's hard for us to _not_ see some glimmer of general AI.
We're so impressed that we unreasonably move the goal posts.
[https://www.smbc-comics.com/comic/ball](https://www.smbc-
comics.com/comic/ball)

And it's worth asking -- we consider a human speaker to "know" a language when
they've internalized its grammar and vocabulary, but not all that specific
stuff about the world. An 18th century English speaker and a 21st century
English speaker were/are aware of drastically different facts, and are likely
to produce different sentences, but there's something about English that they
both know. Not as a criticism of GPT3 but as a question about NLP researchers
-- why can we not isolate and represent that in a model?

~~~
2sk21
Very nicely articulated. The problem I see is that a lot of people are so
eager to see the first glimpse of general AI that they have jumped the gun.

------
dvh1990
No one claimed GPT3 is an AGI. Why does this article carry such a dismissive
and disappointed tone? Perhaps the authors were offended that they were not
given access and wanted to "expose" GPT3's flaws?

We should celebrate GPT3 for the achievement that it is: A big step in a
promising direction.

------
curiousllama
This is a general critique of the entire field of machine learning and non-
causal analysis, not just GPT-3.

I like it - it’s important to keep in mind - but we’re never getting to the
heart of “it doesn’t truly understand context” unless we literally start again
from scratch: forget NNs and do something new.

------
NautilusWave
Was anyone else tickled by the part where they say "Summers-Stay, who is good
with metaphors, wrote to one of us, saying this:" and then they proceed to
detail a simile? What irony, especially after that section on non-sequiturs!

------
solinent
I mean, it's trained on the Internet. It kind of makes sense that the
epistemological value of GPT-3's statements is essentially zero. It may even
contradict itself in ways the Internet never could.

------
karaterobot
Articles like this one mean to be a correction to a misperception that I don't
see many people suffering from. Are there many people who both know what GPT-3
is, and believe it has achieved sapience?

~~~
jpindar
There are people who see a post about it on social media and appear to believe
it has.

------
tyingq
I wonder how worried Google or Amazon is about someone using GPT-3 to flood
them with garbage that they can't tell is garbage. (Book listings for Amazon,
general web content for Google's scrapers)

------
AndyPatterson
It's a certainty that given any prompt there will likely be one gnarly output,
especially the longer that output becomes.

What I'd like to see is an estimate of the uncertainty of the output.

------
wombatmobile
In AI parlance, is it correct to say that humans utilise a prompt that is a
lifetime's worth of data?

------
mrfusion
What’s the best source to understand how gpt3 works? Ideally dumbed down a bit
for a lay person.

~~~
sjg007
Read the "Attention Is All You Need" paper. Then read the GPT-2 paper.
Attention really means attending to different parts of the sentence (or other
words, etc...).
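
As a rough illustration of that "attending" operation, here is a minimal scaled dot-product attention sketch in NumPy (my own toy example, not the GPT code; GPT stacks many such heads with learned projection matrices):

```
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Each query scores every key; softmax turns the scores into weights,
    # and the output is a weighted mix of the values ("attending" to them).
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Toy example: 4 token positions, 8-dimensional vectors.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                   # token representations
out = scaled_dot_product_attention(x, x, x)   # self-attention: tokens attend to each other
print(out.shape)                              # (4, 8)
```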

There are a couple of tech talks on YouTube that help. Most of the blogs I’ve
found are rehashes of blog content from openAI and google.

------
dnautics
the good news is that it should pass the turing test; most humans have no idea
what they are talking about, too. Some are prone to similar bloviation, very
likely using some techniques that are nontrivially similar to what GPT-3 is
using.

------
wintorez
That is true for most people on the internet. So, it should be fine.

------
scotty79
Great! So now you can elect it as your leader!

------
29athrowaway
GPT-3 is a lot more intelligent than those guys that think they will get
microchipped by Bill Gates.

~~~
speedgoose
I'm not sure. GPT-3 is pretty stupid and was trained on a dataset from the
internet that includes a lot of stupid things.

------
mrfusion
Does gpt3 have a model of the world inside itself? What could it be like?

~~~
sjg007
No... it’s more like a probabilistic model of words conditioned on their
semantic context, given the training data. It also uses positional encoding.

~~~
person_of_color
What gives it more power than a Markov chain?

~~~
gdulli
Moore's Law

------
jeffrallen
OpenAI got confused and thought their job was to create an AI politician to
compete with Trump.

------
Lewton
This is your daily reminder that a GPT-3-written post made it to the top of HN:

[https://liamp.substack.com/p/my-gpt-3-blog-
got-26-thousand-v...](https://liamp.substack.com/p/my-gpt-3-blog-
got-26-thousand-visitors)

~~~
Klathmon
I believe it was shown by the HN mods that the author of that article not only
changed some parts of it (including writing the title entirely by hand), but
they also were involved in manipulating HN with multiple accounts and voting
rings. There's more info here:

[https://news.ycombinator.com/item?id=24062702](https://news.ycombinator.com/item?id=24062702)

~~~
Lewton
Thanks for the update, hadn’t seen that

