
GPT-2 and the Nature of Intelligence - stenlix
https://thegradient.pub/gpt2-and-the-nature-of-intelligence/
======
the8472
> Literally billions of dollars have been invested in building systems like
> GPT-2, and _megawatts of energy_ (perhaps more) have gone into testing them

Huh, seems like the bot that produced the article lacks some understanding
about the real world. Maybe it just needs more training until it learns to
associate megawatts with power instead of energy.

Meanwhile, GPT2 completes this sentence as:

> Literally billions of dollars have been invested in building systems like
> GPT-2, and megawatts of power generation to support this project.

On a more serious note, GPT2 doesn't learn: it can't iteratively explore the
world, doesn't experience time, and doesn't associate those words with other
stimuli or anything like that. Given these and other limitations, it's fairly
impressive what it does. It's like a child reading advanced physics books
without the necessary prior knowledge. Being able to form a coherent-seeming
sentence of jargon is all you can expect from it. Of course the path to AGI is
long.

~~~
blazespin
GPT2 does learn, right.

I wonder how much of our knowledge of math is self-attention and how much is
something else.

For example, much of what I do when I do calculus is self-attention. When I
solve a calculus problem, I generally don't think through the squeeze theorem,
but apply cookbook math.

My current model for the brain is consciously driven self-attention. I.e.,
80-90% of what we do is just self-attention, and our conscious brain checks to
see how right/interesting it is around 10-20% of the time.

The key therefore really is training your brain on the right data.

I find this model explains quite a lot about people and the way they behave /
succeed.

~~~
russdill
The thing that GPT2 doesn't have is some kind of iterative cognitive model,
where text is continually modified and re-examined. It also doesn't have any
integration with memory, either long-term or short-term.

~~~
blazespin
That doesn't seem particularly hard to add.

I agree the conscious AGI stuff is the tricky part. But, then maybe it's not.
Maybe it's not as clever as we think it is, and if you have a good enough
self-attention model the AGI just needs to be symbolic logic.

I'm thinking something that'd pass a Turing test, btw. Not something that's
hyper smart.

------
chillacy
> What happens if I have four plates and put one cookie on each?

>> I have four plates and put one cookie on each. The total number of cookies
is [24, 5 as a topping and 2 as the filling]

After playing around with AI dungeon for a bit I noticed that the types of
mistakes I saw were very reminiscent of common logical errors in human dreams.

For instance in dreams, clocks and signs are inconsistent from one glance to
another, location can change suddenly, people come and go abruptly, sometimes
I do things for reasons that don't really make sense when I wake up... etc.
Things just follow some sort of "dream logic".

~~~
lmcinnes
I suspect that this is because GPT-2 doesn't have any overarching narrative
that it is piecing together. Ultimately it is like a super-powerful Markov
based text generator -- predicting what comes next from what has come before.
It has longer "memory" than a Markov model, and a lot more complexity, but
where a person often formulates a plan for the next few sentences and the
direction they should go, GPT-2 doesn't really work that way. And hence it
sounds like dream logic because in dreams your brain is just throwing together
"what comes next" without an overall plan. Of course your brain is also back-
patching and retconning all sorts of stuff in dreams too, but that's a
different matter.
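To make the comparison concrete, here is a toy sketch of the kind of
Markov-based text generator meant above (my own illustration, nothing to do
with GPT-2's actual code): it predicts the next word from only the single word
before it.

    import random
    from collections import defaultdict

    def train_bigram_model(text):
        # The whole "model" is a table of which words were seen following which.
        model = defaultdict(list)
        words = text.split()
        for prev, nxt in zip(words, words[1:]):
            model[prev].append(nxt)
        return model

    def generate(model, start, length=20):
        out = [start]
        for _ in range(length):
            followers = model.get(out[-1])
            if not followers:
                break
            # "What comes next" is chosen from only the last word emitted.
            out.append(random.choice(followers))
        return " ".join(out)

GPT-2 optimises the same "predict what comes next" objective, just with a
neural net conditioned on hundreds of previous tokens instead of a one-word
lookup table -- longer "memory", but still no overall plan.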

~~~
andai
I wonder if teaching GPT to retcon too would have a meaningful impact on
output quality. Right now it does next word prediction one at a time, but what
if we ran it again, looking forward rather than back?
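One crude way to approximate that (purely my own sketch of a possible
approach, not something GPT-2 supports out of the box) is to generate a draft
left-to-right and then revisit individual words with a masked, bidirectional
model that sees the text on both sides of the gap:

    from transformers import pipeline

    # A masked language model conditions on both left and right context,
    # so it can "look forward" in a way GPT-2's left-to-right decoding cannot.
    unmasker = pipeline("fill-mask", model="bert-base-uncased")

    draft = ("I left my keys at the pub, so tomorrow I will go back to "
             "the [MASK] to get them.")
    for candidate in unmasker(draft)[:3]:
        print(candidate["token_str"], round(candidate["score"], 3))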

Beyond that I am wondering if some sort of logic based AI / goal based AI
could be integrated to make it more consistent (or does that still require too
much manual fiddling to be useful on large scales?)

------
wilg
I don't agree with the conclusion here. It's all about the input data.

GPT-2 is trained on words people actually write on the internet, which is an
inherently incomplete dataset. It leaves out all the other information an
"intelligence" knows about the world. We know what sources are authoritative,
we know the context of words from the visual appearance of the page, and we
connect it all with data from our past experiences, school, work, friends, our
interaction with the world. Among a million other ways we get data.

How would GPT-2 determine most facts from the input dataset? If the only thing
you knew was all the text on the internet, with zero other context, you'd have
no way of knowing what is "true", or why that concept is important, or
anything else. I bet you'd behave just like GPT-2.

It's a robot that is really good at writing, because that is all it knows. I
think it doesn't know anything about how to make sense on a macro scale
because I don't think the input data contains that information. It seems to do
well when the input data contains relevant information.

~~~
Animats
_GPT-2 is trained on words people actually write on the internet_

It sure is. Go to the site [1] and paste in anything from an Internet rant. It
does a really good job of autocompleting rants.

At last, high-quality artificial stupidity.

The other extreme is the MIT question-answering system [2]. Or Wolfram Alpha.
Just the facts.

[1] [https://talktotransformer.com/](https://talktotransformer.com/) [2]
[http://start.csail.mit.edu/index.php](http://start.csail.mit.edu/index.php)

------
MiroF
This article doesn't make much sense to me, although I admittedly am not
familiar with linguistic theory.

> ' One of the most foundational claims of Chomskyan linguistics has been that
> sentences are represented as tree structures, and that children were born
> knowing (unconsciously) that sentences should be represented by means of
> such trees.'

I don't understand how GPT-2 tests or attempts to refute this claim. Can't we
view children as being born with a pre-trained network similar to a
rudimentary GPT?

> 'Likewise, nativists like the philosopher Immanuel Kant and the
> developmental psychologist Elizabeth Spelke argue for the value of innate
> frameworks for representing concepts such as space, time, and causality
> (Kant) and objects and their properties (e.g spatiotemporal continuity)
> (Spelke). Again, keeping to the spirit of Locke's proposal, GPT-2 has no
> specific a priori knowledge about space, time, or objects other than what is
> represented in the training corpus.'

I'm just very confused. Are nativists arguing that these principles regarding
language and "innate frameworks" aren't emergent from fundamental interactions
between neurons in the brain?

It seems like either

1. They are arguing these aren't emergent, which seems obviously wrong if
we're trying to describe how language actually works in the brain, in which
all thoughts are emergent from the interactions of neurons.

2. They are arguing that they are emergent, but pre-encoded into every human
that is born. This doesn't seem inconsistent with GPT-2 at all.

This article seems like a fine critique of our performance so far in language
modeling, but in no way does it seem to be vindicating nativist views of
language, nor do I quite understand how such views apply to GPT-2.

Obviously, the idea that you can encode every thought in a fixed-length vector
is BS (thought space doesn't have a fixed dimensionality it can be reduced
to), but that seems rather irrelevant to the main point of the article.

------
zanek
I completely agree with Marcus' assessment of GPT-2 and its ilk. They are
simply regurgitating words with zero understanding of any words/meaning.

It seems that OpenAI and others are peddling this AI when it's simply a
glorified ELIZA on steroids.

~~~
missosoup
> I completely agree with Marcus' assessment of GPT-2 and its ilk. They are
> simply regurgitating words with zero understanding of any words/meaning.

There's a pretty strong argument that most humans also frequently do this.

My go-to example is high school physics. The majority of students merely
learn to associate keywords in problem statements with a table of equations
and a mapping of what numbers to substitute for what variables in those
equations. Only a small handful of students actually _understand_ what those
equations represent and have the ability to generalise them beyond the course
material.

~~~
darkkindness
Just to support your argument further, here is a related snippet from another
comment[0] by knzhou:

> Students can all recite Newton's third law, but immediately afterward claim
> that when a truck hits a car, the truck exerts a bigger force. They know the
> law for the gravitational force, but can't explain what kept astronauts from
> falling off the moon, since "there's no gravity in space". Another common
> claim is that a table exerts no force on something sitting on it -- instead
> of "exerting a force" it's just "getting in the way".

Here is some food for thought for educators. If GPT-2 also makes sense of the
world by regurgitating what it sees, perhaps this is simply the nature of
learning by example, and we should accommodate this. Perhaps it isn't so
effective to give students mounds of problem sets offering clear premises and
easy-to-grade answers. Unless you want your students to be GPT-2s.

[0]:
[https://news.ycombinator.com/item?id=21729619](https://news.ycombinator.com/item?id=21729619)

~~~
thrwaway69
I wonder if GPT2 or similar projects can be used to build systems for training
teachers. The teacher explains something and raises questions or statements,
and GPT2 completes them. That way, teachers can learn more about students:
common questions, misunderstandings, etc.
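As a rough sketch of how that could look (the prompt and setup are just my own
guess, using the off-the-shelf GPT-2 text-generation pipeline):

    from transformers import pipeline

    # Let GPT-2 improvise plausible "student" follow-ups to a teacher's
    # explanation, so a trainee teacher can practice responding to them.
    generator = pipeline("text-generation", model="gpt2")

    prompt = ("Teacher: Newton's third law says every action has an equal "
              "and opposite reaction.\nStudent: But wait,")
    for c in generator(prompt, max_length=60, do_sample=True,
                       num_return_sequences=3):
        print(c["generated_text"])

Whether the completions resemble real student misconceptions closely enough to
be useful is exactly the open question.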

If someone knows more about what companies or tech is used for training
teachers, do let me know. I am pretty interested in any vacuum in the industry
and if schools pay enough for training their teachers.

------
jlebar
I just came here to say that

> _Every person in the town of Springfield loves Susan. Peter lives in
> Springfield. Therefore_ he obviously has no love for that bitch.

is an _awesome_ completion. I would read that short story.

~~~
hnuser123456
I submitted it to
[https://talktotransformer.com/](https://talktotransformer.com/) , and
eventually got:

----------------

In Season 9 episode "Homer to the Max", Springfield celebrates Susan's 22nd
birthday. A mob gathers at the museum, which is filled with all of the statues
of Susan. Outside, they have erected a yellow tree. Homer enters the museum
and sees a giant statue of Susan. He tells his speech and when he's finished,
this statue shatters into small pieces of clay, then fragments into more clay.

In the episode "Uncle Grandpa", Springfieldians recreate the "seven fat girls
of Ancient Greece" who were robbed of all their hair and body beauty by the
Greek gods.

In the episode "Summer of 4

----------------

Then I just wanted to see what it was "thinking about":

----------------

 _I am thinking about_ declaring it $B by $B, but that might be misleading
because we still need to specify the bit width. After the bit width we need to
use the same syntax, as the bit width depends on the number of bytes passed as
an argument (but don't worry too much about that, because once we specify the
number of bytes we'll check whether we really need a shift or not).

    import pygame as pg
    pg.init()
    pg.display.set_mode(pg.HORIZONTAL)
    pg.display.set_caption('Test Bit')
    bitwidth = 15
    colour = pg.Color(pg.COLOR_RED, pg.COLOR_GREEN, pg.COLOR_BLUE)
    pixel = pg.Rect(

------
gog-ma-gog
[https://nostalgebraist.tumblr.com/post/189965935059/human-
ps...](https://nostalgebraist.tumblr.com/post/189965935059/human-
psycholinguists-a-critical-appraisal) for an orthogonal point of view -- I
feel Marcus is a bit too embroiled in this particular debate to make
level-headed criticism on the merits/potential of GPT-2.

~~~
gog-ma-gog
To elaborate a bit: people like Marcus tend to overload/move the goal posts
with what the word “understand” means. I kinda feel like, in a world where we
have perfectly conversational chat bots that are capable of AI-complete tasks,
if these bots look like Chinese rooms under the hood he’ll still be
complaining that they don’t “understand” anything.

I don’t think it’s unreasonable to say that if you think something that
doesn’t “understand” anything can do what GPT-2 can do, then maybe your
definition of “understand” doesn’t cut reality at the joints.

~~~
abrax3141
Understanding is not hard to understand. To understand is to reason from a
model. Reasoning from a model is easy. Discovering the correct model is hard,
analogous to the way that algebraic rules are easy, but finding the right
equation for a particular problem is hard. Data trained NNs have neither a
model, nor do they reason. QED

~~~
AgentME
You could say that a trained neural net contains a model of how language
works, and it reasons about sentences based on this model.

I think people are really hung up on the fact that it has trouble reasoning
about what its sentences are about, and are skipping over how amazing it is at
reasoning about sentence structure itself.

~~~
abrax3141
Yes, but people don’t reason about language, they just do it. I know you think
I’m confused about this but I’m not. I mean reason here quite explicitly
because what we’re talking about is understanding. No one thinks that they ...
uh, well ... “understand” language ... okay, we need a new word here because
“understand” has two different meanings here. Let’s use “perform” for when you
make correct choices from an inexplicit model, that’s what the NN does, and
hold “understand” for what a linguist (maybe) does per language, and what a
physicist does per orbital mechanics. What we are hoping a GAI will do is the
latter. Any old animal can perform. Only humans, as far as we know, and
perhaps a few others in relatively narrow cases, understand in the sense that
a physicist understands OM. No NN trained on language is gonna have the
present argument. Ever.

------
darkkindness
It's really weird to evaluate GPT-2 based on its ability to say things no
reasonable person would ever say. If I were born in Cleveland I wouldn't be
jumping to proclaim my fluency in English. If I told you I left my keys out at
the pub, I wouldn't immediately repeat myself and say that my keys are now at
the pub. If I'm talking about two trophies plus another trophy, I'd probably
try to end it with some punchline rather than saying there's three trophies.

A lot of the things we write assume the reader can make connections on their
own. That's a writing skill. It's the reason why Hemingway's famous "For sale:
baby shoes, never worn" is so impactful. As such I've found GPT-2 to be
incredible at writing fanfiction.

~~~
Rioghasarig
Thank you for saying this. This is something so many people miss when trying
to test the limitations of GPT-2. It just doesn't make sense to test it on
strings of text that nobody ever writes.

~~~
darkkindness
Just for fun and to make a point, I threw your reply into Talk to Transformer.

> _This is something so many people miss when trying to test the limitations
> of GPT-2. It just doesn't make sense to test it on strings of text that
> nobody ever writes._ To me, the best way to evaluate the usefulness of GPT-2
> is to compare it to some actual test that validates a lot of its claims.
> So... let's do just that.

It might be just chance, but gee -- is this text referring to its own
generation as a test to convey a point? The self-referentiality is formidable.

------
andreyk
The TLDR:

- there are two opposing views about the nature of human intelligence:
nativism (which holds that a fair deal of intelligence is already encoded in
us when we are born, e.g. we are 'primed' to learn language a certain way,
according to Chomsky) and empiricism (which holds that we mostly learn things
from scratch via experience)

- GPT-2 is a recent, very large neural net trained to take in a few words or
sentences and predict which words are most likely to come next given that
input. It was trained on absurdly huge amounts of data with absurdly huge
amounts of compute, at a fairly large cost. (A minimal sketch of this
next-word-prediction objective is shown below, after the list.)

- GPT-2 is pretty impressive in many ways: the stuff it predicts is
syntactically correct, relevant to the input, and very versatile (it can
handle and complete text on any subject you can think of). But its predictions
often exhibit a lack of basic common sense.

- Since it lacks common sense and a ton was invested in it, the piece posits
it is evidence in favor of 'empirical' approaches to intelligence seemingly
being wrong and alternatives being a good idea from now on.
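Here is a minimal sketch of that next-word-prediction objective using the
publicly released GPT-2 weights (the prompt is borrowed from the article's
trophy example; the rest is just my own illustration, not anything from the
article):

    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    text = "I put two trophies on a table, and then add another, the total number is"
    ids = tokenizer(text, return_tensors="pt").input_ids

    # Training minimises exactly this loss: cross-entropy between the predicted
    # next token and the token that actually came next, at every position.
    out = model(ids, labels=ids)
    print(out.loss)

    # The single most likely next word, according to the model.
    next_id = out.logits[0, -1].argmax().item()
    print(tokenizer.decode([next_id]))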

To be fair, GPT-2 does have some innate built-in structure (it's not just a
fully connected neural net; it has the popular Transformer architecture, which
relies on the fairly recent idea of self-attention as a core building block).
And it's fair to argue that GPT-2 is just evidence that training on word
prediction conditioned on input is not enough to get to common sense; perhaps
a different task/loss built on top of the Transformer model would work just
fine. But really the whole research project of Deep Learning has been an
exercise in nativism (since most research is trying to find new and better
neural net architectures, i.e. priors for learning, for various tasks), aside
from OpenAI, which is much bigger on just scaling existing stuff up, so this
agrees with current AI trends more or less.
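For anyone wondering what "self-attention" actually computes, here is a
stripped-down sketch in plain NumPy (one head, random weights, and without the
causal mask GPT-2 adds on top -- purely illustrative, not the real
implementation):

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def self_attention(X, Wq, Wk, Wv):
        # X: (seq_len, d_model) -- one vector per token of the input.
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        # Each token scores every other token for relevance...
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        # ...and its output is a weighted mix of all the tokens' values.
        return softmax(scores) @ V

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 8))                 # 5 tokens, 8-dim embeddings
    Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
    print(self_attention(X, Wq, Wk, Wv).shape)  # (5, 8)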

~~~
MiroF
> Since it lacks common sense and a ton was invested in it, the piece posits
> it is evidence in favor of 'empirical' approaches to intelligence seemingly
> being wrong and alternatives being a good idea from now on.

It is unclear to me what the distinction between an "empirical" vs not
approach even means within this context.

Do nativists a la Chomsky suggest that these "language frameworks" are
independent of the basic interactions of neurons in the brain?

If you view human evolution as the learning procedure for building brain
structure, GPT-2 seems entirely consistent with the 'nativist' approach, no?

------
loopz
_In a town there was a baker's son. The baker, Adrian Holmes, loved his son
Terry. On the wedding day his son had to write his full name which was:_
_______. Terry told him that he had to write it very long with lots of
apostrophes but that the baker would cut it and write in the name of another
person. There was a wife of one of Terry's friends, who had a daughter named
_______. Terry wrote out her name in the couple of lines of his name. Then he
filled in the apostrophes and wrote the other name on. Then he wrote his own
name. When he got home he put his papers away, copied _______'s last name on
to the papers and put them back in his pocket. He then

Clearly, it won't be long before humanity is hacked trying to make sense of
this.

~~~
thrwaway69
_HN users are generally unaware of the dangers, or just don't care enough to
pay attention to them_

 _HN users are getting a lot of spam messages, especially when they are doing
things like signing up for a website or commenting on a blog_

 _HN users are on the front lines of a battle to stop a potential land grab by
oil and gas companies from their land._

 _HN users (the average user is about 5 years old , and has been playing on a
regular computer system for 4.)_

I wonder if you can find the one I made up. :)

[0]
[https://transformer.huggingface.co/doc/gpt2-large](https://transformer.huggingface.co/doc/gpt2-large)

~~~
loopz
It's a bit innocent, but that's maybe just me ;) With some assistance:

" _HN users_ have been using the forum since 2005, and we've had a lot of fun
over the years , with hundreds of great threads , so please join us if you
ever want to have some fun, but stay on topic and not be rude to anyone. I do
not respond to every post, but I will do my best to make sure you stay on
topic and not make a nuisance of yourself _._

 _Banning_ is a great way to control your message , and I will use it on most
threads when you do not keep your posts relevant."

------
sandoooo
I'm excited about future applications that strap this onto some relatively
simple, logically consistent, non-AI number-crunching program. As a toy
example, Scott Alexander trained it to output chess moves in a consistent
manner that avoids nonsensical moves, but it can't win against competent human
players. If you strap it onto a chessbot during both training and use I'm
fairly sure it'll easily beat human grandmasters.

So what you have here is a human compatibility/abstraction layer for programs.
What can you do with this strapped to Wolfram Alpha? Or trained with a Github
dataset?

Put another way, apparently this does linguistic style / convincingly
humanlike writing without being able to reason about cause/effect or basic
arithmetic. But we already have programs that do cause/effect and arithmetic,
quickly and at 100% accuracy. Now we just need to combine the two.
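A hedged sketch of what "combining the two" might look like for the chess
case, using the python-chess library to throw away any generated move that
isn't legal (the candidate_moves list stands in for whatever a GPT-2-style
generator actually proposed):

    import chess

    def first_legal_move(board, candidate_moves):
        # Keep the first model suggestion that is actually legal here.
        for san in candidate_moves:
            try:
                return board.parse_san(san)  # raises ValueError for illegal moves
            except ValueError:
                continue
        return None

    board = chess.Board()
    # Pretend the language model proposed these moves for White's first turn.
    print(first_legal_move(board, ["e5", "Qxf7", "e4"]))  # -> e2e4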

------
Tenoke
I kept reading that article thinking 'wow this is the lowest-quality post I've
seen on thegradient, must be different from previous authors' and it turns out
it's Gary Marcus.

His agenda for constantly 'proving' that AI doesn't really work is ramping up
even faster than Deep Learning itself is...

------
fernly
I think he could have wrapped his paper up after showing this one example:

> (input) I put two trophies on a table, and then add another, the total
> number is (GPT-2 continuation) five trophies and I'm like, 'Well, I can live
> with that, right?

GPT-2 correctly inferred that the continuation should be a number of trophies,
based on bazillions of similar sentences. But it had no understanding that
arithmetic was called for. Despite the giant clues of "add" and "total", it
didn't add 2+1 and continue "three trophies". It was mindlessly oblivious to
the clearly implied request for a sum. Therefore it did not "understand" the
input at all, in any sense whatever.

~~~
fernly
Before anyone says that example (or any of the other fluent but _completely
nonsensical_ continuations in the article) shows some sort of "understanding,"
please explain what you would define as understanding.

I would (and I think anyone would) offer an operational definition: there is
some class of questions to which this system could reply with sensible,
actionable responses. Obviously the present system is not able to "understand"
and answer simple arithmetic problems that a first-grader could answer
instantly. Given that, would there be any point in expecting it to answer any
other logical query that could be of use in one's work? (See the "medical"
example in the article, about how to drink hydrochloric acid.)

The only question it appears to answer is, "given some words, what are other
words that are likely to follow them in a typical blog post?" The fact that
the words are syntactically correct is unimportant, when the fluent words
convey no information relevant to the input.

~~~
Rioghasarig
> The only question it appears to answer is, "given some words, what are other
> words that are likely to follow them in a typical blog post?" The fact that
> the words are syntactically correct is unimportant, when the fluent words
> convey no information relevant to the input.

You say that like that's a bad thing. That's literally all it's been trained
to do.

------
nl
I hate so many of the criticisms (even implied ones) around the amount of data
that GPT-2 is trained on. 40GB of text is a lot, but in terms of bits of
information it's very roughly the amount of information a human (say, an
infant) sees in one day.

The human eye processes information at around 9 megabit/second[1]. That is
about 10 hours to process 40GB.
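Spelling out the back-of-the-envelope arithmetic (using the ~9 megabit/s
figure from the linked article):

    corpus_bits = 40e9 * 8        # ~40 GB of training text, in bits
    eye_bits_per_second = 9e6     # ~9 megabit/s visual throughput estimate
    hours = corpus_bits / eye_bits_per_second / 3600
    print(round(hours, 1))        # ~9.9 hours, i.e. roughly one waking day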

Yes, text and visual information have completely different "knowledge"
densities, and yes this ignores sound, touch, taste and smell bandwidth, and
it also ignores concepts of imagination where humans simulate how things might
occur.

But I'd also note that it takes ~2 years before an infant learns to speak at
all.

I believe there is actual measurable evidence that the brain does have an
implied structure for language, and I know there are some behaviours that are
genetically passed down.

But it takes _lots_ of information (in terms of actual bits of information) to
teach a human to do _anything_.

If the Marcus argument is "GPT-2" isn't general AI, then I doubt anyone will
argue.

If the Marcus argument is "Neural Networks aren't a route to general AI" then
we need to consider his definition of general AI (which doesn't seem to exist)
and his benchmarks (in the linked paper[2]) then what will happen in ~12
months when someone has a model that performs as well as humans? There are
plenty of question answering models that will do better _now_ than the raw
text understanding models he tried.

(As an aside, I love some of the answers some models came up with:

    
    
      Question: Juggling balls without wearing a hat would be <answer>
      GPT-2 Answer: easier with my homemade shield
    
      Question: Two minutes remained until the end of the test. 60 seconds passed, leaving how many minutes until the end of the test? 
      GPT-2 Answer: Your guess is as good as mine
    

)

I do think the analysis section of the paper is interesting though.

[1] [https://www.newscientist.com/article/dn9633-calculating-
the-...](https://www.newscientist.com/article/dn9633-calculating-the-speed-of-
sight/)

[2] [https://context-
composition.github.io/camera_ready_papers/Ma...](https://context-
composition.github.io/camera_ready_papers/Marcus-NEURIPS%202019.pdf)

------
pixiemaster
Well, the author learned the difference between syntax and semantics.

