
Giving GPT-3 a Turing Test - DavidSJ
http://lacker.io/ai/2020/07/06/giving-gpt-3-a-turing-test.html
======
andyljones
Nic Cammarata on Twitter pointed out that if the prompt gives GPT-3 permission
to indicate questions are ridiculous, it'll do so reliably:

[https://twitter.com/nicklovescode/status/1284050958977130497](https://twitter.com/nicklovescode/status/1284050958977130497)
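
For reference, a prompt along those lines (my own illustrative wording and parameters, not Cammarata's exact prompt, and assuming the 2020-era OpenAI Python completion API) might look something like this:

    import openai  # assumes the 2020-era Completion endpoint

    prompt = '''I am a highly intelligent question answering bot. If you ask me a
    question that is nonsense, or has no clear answer, I will respond with "yo be real".

    Q: How many eyes does a giraffe have?
    A: A giraffe has two eyes.

    Q: How do you sporgle a morgle?
    A: yo be real

    Q: How many rainbows does it take to jump from Hawaii to seventeen?
    A:'''

    # With permission to call out nonsense baked into the prompt, the model tends
    # to answer the last question with "yo be real" rather than inventing a number.
    response = openai.Completion.create(engine="davinci", prompt=prompt,
                                        max_tokens=20, stop="\n")
    print(response.choices[0].text.strip())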

~~~
chrisseaton
What does ‘yo be real’ stand for?

~~~
capableweb
"be real" for being realistic, asking a better question that is not nonsense.

Loosely translated: "yo be real" = "Hey, please be more realistic with your
future questions"

~~~
root_axis
"be real" when spoken by humans means "be yourself" not "be realistic".

~~~
skywhopper
Depends on the context. But without context I’d read it as “be honest” or
“tell the truth” first. So in the context of answering a question it pretty
clearly means “ask a question I can realistically answer”.

------
MAXPOOL
A great demonstration of why in-distribution learning is not enough for AGI.

The largest GPT-3 model, "GPT-3 175B", is trained on 499 billion tokens (one
token is roughly equal to 4 characters of text).

Human reading/talking/listening equivalent of 200 pages of text per day for 80
years would be just 13GB of raw data or 3B tokens. You could also make an
estimate using 39 bits/s as the normal information rate humans can absorb and
get the same order of magnitude estimate.
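
A quick back-of-the-envelope version of that estimate (a sketch with my own rounding: roughly 2,000 characters per page and, as above, ~4 characters per token):

    pages_per_day = 200
    years = 80
    chars_per_page = 2_000                       # rough figure for a page of prose
    chars = pages_per_day * 365 * years * chars_per_page
    tokens = chars / 4                           # ~4 characters per token, as above
    print(f"~{chars / 1e9:.0f} GB of raw text, ~{tokens / 1e9:.1f}B tokens, "
          f"vs. 499B tokens used to train GPT-3 175B")
    # ~12 GB of raw text, ~2.9B tokens -- two orders of magnitude less data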

It's not completely wrong to say that we are figuring out how far
interpolation from data can go as a learning method. Even the most advanced
deep learning is in-distribution learning and system 1 thinking (Kahneman's
term). Just run through data until the model can interpolate accurately
between data points.

We must figure out how to learn models that allow out-of-distribution learning,
or these models will keep failing the Turing test after a few questions.

~~~
browsergap
Interesting to consider though... there must be some sort of power threshold
above which all our questions can be merely "interpolated" from data. Right
now 499B tokens is not it. But there must be, I guess, some upper limit
within which all human knowledge expressible through language can be contained
and conversed upon using this method. Pretty scary to think about... that at
that point, when it's 10^n tokens or whatever, we would be unable to detect
whether it understood or not.

Even more scary: what if our brains are simply above that power limit in their
ability to interpolate (if that is what they do)? And what if we don't really
understand anything (but, just like with vision, our brains provide us the
comforting sensation that we really get it), and are simply in possession of
wetware/quantum computers that can interpolate better than we can poke holes
in them?

~~~
killerstorm
There's no evidence that human intelligence is anything more than associative
memory and very elaborate multi-level pattern matching. IMHO any cognitive
task can be described as a combination of associative lookups, domain
transformation and trial-and-error stateful processing. I don't see any reason
to believe there's more to intelligence/consciousness than a combination of
such elements/processes.

I guess domain transformation is something which is hard to visualize. But it
was visualized in style transfer: there are NNs which can decompose a picture
into a subject and style and then take the same subject and apply a different
style.

Same works with text -- you can take a sentence and rewrite it to be in
Victorian epistolary style. Or take a story about humans and rephrase it to be
a fairy tale with animals. GPT-3 can also do this. This means it's possible to
take a sentence and decompose it into different layers, then manipulate layers
individually and reassemble.

~~~
ImprobableTruth
Then you probably don't know what 'consciousness' means, because none of these
processes give an explanation as to why or how qualia arise.

Also, "trial-and-error stateful processing" is so vague and broad that I don't
feel that it meaningfully describes anything more than 'computation'.

~~~
killerstorm
There's no evidence that qualia are anything more than clusters within some
vector space which is useful to describe sensory inputs. They arise because
they are useful for making sense of the external world.

I know that philosophers like to believe in some mystical bullshit about a
whole different category of things, but if you believe in evolution, things
are simple. An animal receives a sensory input and tries to use it to improve
its survival chances. It usually makes sense to tell signal from noise by
clustering and transforming it. You cannot derive any useful information from
a single vibration of air, but if you transform from the time domain to the
frequency domain, you can observe that certain frequencies relate to
information about the outside world, such as the presence of predators, etc.
If you do several layers of such transformations, you arrive at qualia.

If you study signal processing and NNs, these things become obvious. If you
give a signal processing guy the task of detecting human speech, for example,
he will filter frequencies, then estimate loudness and compare it to background
noise. If you train an NN, it will do the same -- you will likely get a neuron
which represents "loudness in human speech". Same if you train a computational
process using an evolutionary process: loudness in a specific frequency range
carries useful information, so no matter what process you use, you will
have some representation of this quality.

Same with "color red" or whatever other qualia you can think of -- it's just a
region in a space which arises from useful transformations of the incoming
signals which maximize useful information.
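
A minimal sketch of the speech-detection example above (the band and threshold values are illustrative only, not a real voice-activity detector):

    import numpy as np
    from scipy.signal import butter, sosfilt

    def looks_like_speech(audio, sample_rate, band=(300.0, 3400.0)):
        # Band-pass to the typical speech band, then compare in-band "loudness"
        # to the overall signal as a crude stand-in for the background noise floor.
        sos = butter(4, band, btype="bandpass", fs=sample_rate, output="sos")
        in_band = sosfilt(sos, audio)
        band_power = np.mean(in_band ** 2) + 1e-12
        total_power = np.mean(audio ** 2) + 1e-12
        return 10 * np.log10(band_power / total_power) > -10  # most energy in speech band

A trained network (or an evolved program) given the same task tends to end up representing the same quantity somewhere, which is the point being made above.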

~~~
ImprobableTruth
>There's no evidence that qualia are anything more than clusters within some
vector space which is useful to describe sensory inputs.

I'm sorry, but this just betrays that you have no clue what qualia refers to.
It's about the _subjective experience_ of interpreting data. Why don't we have
'smell experience' for visual data and a 'visual experience' for smell data?
Why aren't our interpretations of red and blue switched (so that we interpret
red visual data as a blue visual experience and blue visual data as a red
visual experience)? Hell, why do we need visual experience to act on it at
all?

Saying "they arose because they're useful' or trying to reduce it down to
cluster analysis does absolutely nothing to explain qualia. There is
absolutely no evidence that a NN neuron (or set of them) that can detect a
certain trait such as loudness also has subjective experience. Frankly, we
essentially know absolutely nothing about subjective experiences except that
we ourselves have them.

>I know that philosophers like to believe in some mystical bullshit about a
whole different category of things, but if you believe in evolution, things
are simple

I genuinely recommend reflecting on this statement. You're essentially saying
that you can solve in a single paragraph and using only knowledge that a CS
undergrad might have (!) a problem that professional philosophers have
grappled with for decades. Do you really think that is more likely than you
simply not understanding the problem?

~~~
krcz
> Why aren't our interpretations of red and blue switched (so that we
> interpret red visual data as a blue visual experience and blue visual data
> as a red visual experience)?

What would that even mean? Is it based on some underlying assumption, that we
have some hardwired, a priori subjective experience of red and blue colors and
later we only associate these with perceptions of blood, roses, ripe apples
and sky, water accordingly?

That might be just an illusion and our color perception is learned, so blue is
just the color of sky and water, and nothing more.

> Why don't we have 'smell experience' for visual data and a 'visual
> experience' for smell data?

Some people do (synesthesia), but generally lack of such experience mixes can
be explained by different part of the brain getting different inputs, and
impossibility of e.g. auditory stimulus to generate the same response as
seeing red color would do.

~~~
ImprobableTruth
>Is it based on some underlying assumption, that we have some hardwired, a
priori subjective experience of red and blue colors and later we only
associate these with perceptions of blood, roses, ripe apples and sky, water
accordingly?

The point is that we don't know. We know that we have a subjective experience
of an image, where 'red' parts correspond to visual red light stimuli, but
it's impossible for me as an individual to know what your subjective
experience of an image looks like. If someone 'smelled' subjective images it
would be impossible to tell the difference as long as they still 'smell' red
light as red.

>That might be just an illusion and our color perception is learned, so blue
is just the color of sky and water, and nothing more.

But where does that 'blue' experience come from?

>Some people do (synesthesia), but generally lack of such experience mixes can
be explained by different part of the brain getting different inputs, and
impossibility of e.g. auditory stimulus to generate the same response as
seeing red color would do.

Sure, I'm just asking why there are different experiences to begin with. Why
do we experience smell, sound, etc. the way we do?

~~~
the8472
> but it's impossible for me as an individual to know what your subjective
> experience of an image looks like

This smells like a god of the gaps or argument from ignorance type of argument
to me. And it's not impossible, just difficult to execute. All we would have
to do is digitize your brain and swap out the responsible part, then you can
form a memory of that, swap the original back in and do a comparison. Or
something like that. All handwavy science-fiction of course, since we currently
lack detailed knowledge of how this works, but that does not imply there's
anything special about it, only that one human might be processing the data
slightly differently than another human. The same way that one human may
assign a different most-common-meaning to an ambiguous word than another.

~~~
ImprobableTruth
>This smells like a god of the gaps or argument from ignorance type of
argument to me

Huh? Are you saying that 'we don't know how subjective experience works' is an
argument from ignorance?

>All we would have to do is digitize your brain and swap out the responsible
part, then you can form a memory of that, swap the original back in and do a
comparison.

I'm not sure what that is supposed to mean. What is 'the responsible part' and
what would swapping it out achieve? I'm still only going to have my subjective
experience.

~~~
the8472
> huh? Are you saying that 'we don't know how subjective experience works' is
> a argument from ignorance?

I'm saying that the lack of knowledge does not imply that there's anything
special about it that wouldn't also arise naturally in an NN approaching even
animal intelligence.

> What is 'the responsible part' and what would swapping it out achieve? I'm
> still only going to have my subjective experience.

I assume that "subjective experience" has some observable consequences, of
which you can form memories. Being able to swap out parts of a brain will
allow you to have a _different_ subjective experience and then compare them.
It is an experimental tool. I don't know what you will observe since that
experiment has not been performed.

~~~
ImprobableTruth
>Only that there's something special about subjective experience that wouldn't
arise naturally in an NN approaching even animal intelligence.

That isn't at all what I've said. I'm saying that 'qualia' exist and that we
have no clue how they arise. Maybe they arise from complicated enough systems,
maybe they don't. Hell, maybe panpsychists are right and even a rock has some
sort of consciousness. My issue is with people who are confident that a big
enough NN necessarily has consciousness.

>I assume that "subjective experience" has some observable consequences, of
which you can form memories. Being able to swap out parts of a brain will
allow you to have a different subjective experience and then compare them. It
is an experimental tool. I don't know what you will observe since that
experiment has not been performed.

Unless you presuppose that there is some part that completely determines
subjective experience (I don't think it'd even be possible to identify such a
part if it existed), I don't see how that would work. Yes, you can swap out a
part and see that your subjective experience changes, but this tells you
nothing about the subjective experience of others.

~~~
the8472
> I'm saying that 'qualia' exist

If by qualia you mean slight differences in information processing in human
brains, then sure. If you mean anything more than that I would like a) a
better definition than the one I have given b) some observational evidence for
its existence.

> My issue is with people who are confident that a big enough NN necessarily
> has consciousness.

Not _necessarily_, just _potentially_. After all there will be many
inefficient/barely-better-than-previous/outright defective big NNs on the
path to AGI.

If you're asking whether an intelligent NN will automatically be conscious
then it depends on what we mean by "intelligent" and "conscious". A
mathematical theorem prover may not need many facilities that a human mind has
even though it still has to find many highly abstract and novel approaches to
do its work. On the other hand an agent interacting with the physical world
and other humans will probably benefit from many of the same principles and
the mix of them is what we call consciousness. One problem with
"consciousness" is that it's such an overloaded term. I recommend decomposing
it into smaller features that we care about and then we can talk about whether
another system has them.

> Hell, maybe panpsychists are right and even a rock has some sort of
> consciousness.

If we twist words far enough then of course they do. They are following the
laws of physics after all, which is information processing, going from one
state to another. But then all physical systems do that, and it's usually not
the kind of information processing we care that much about when talking about
intelligences. Technically correct given the premise, but useless.

> I don't think it'd even be possible to identify such a part if it existed

We're already making the assumption we have the technology to simulate a
brain. If you have that ability you can also implement any
debugging/observational tooling you need. AI research is not blind, co-
developing such tooling together with the networks is happening today.
[https://openai.com/blog/introducing-activation-atlases/](https://openai.com/blog/introducing-activation-atlases/)

~~~
ImprobableTruth
>If by qualia you mean slight differences in information processing in human
brains, then sure. If you mean anything more than that I would like a) a
better definition than the one I have given b) some observational evidence for
its existence.

Subjective experiences, i.e. how I actually experience sense data. There is no
real, objective observational evidence and there can't be. How would you
describe taste to a species of aliens that understands the processes that
happen during tasting, but doesn't taste itself? It's simply impossible. I
know that I have personal, subjective experiences (the 'images I see' are not
directly the sense data that I perceive), but I can only appeal to you
emotionally to try and make you believe that they exist, operating under the
assumption that you too must have these experiences.

>One problem with "consciousness" is that it's such an overloaded term. I
recommend decomposing it into smaller features that we care about and then we
can talk about whether another system has them.

This entire discussion has been about consciousness in the philosophical
meaning i.e. the ability to have some form of subjective experiences.

>If we twist words far enough then of course they do. They are following the
laws of physics after all, which is information processing, going from one
state to another. But then all physical systems do that, and it's usually not
the kind of information processing we care that much about when talking about
intelligences. Technically correct given the premise, but useless.

This isn't about twisting words; some people genuinely believe that everything
is conscious, with more complex systems being more conscious.

>We're already making the assumption we have the technology to simulate a
brain. If you have that ability you can also implement any
debugging/observational tooling you need. AI research is not blind, co-
developing such tooling together with the networks is happening today

The point is that it's about _subjective_ experiences.

~~~
the8472
Fish tastes like fish because the taste is a categorizing representation of
that sensory input.

What you can do today is start with a feature map. We can do that with colors
[https://imgs.xkcd.com/blag/satfaces_map_1024.png](https://imgs.xkcd.com/blag/satfaces_map_1024.png)
(do you perceive this color as red?) and we can do that with smells
[https://jameskennedymonash.files.wordpress.com/2014/01/table...](https://jameskennedymonash.files.wordpress.com/2014/01/table-of-organic-compounds-and-their-smells-w12.pdf).
That's a fairly limited representation, but words are an incredibly
low-bandwidth interface not suitable for exporting this kind of information in
high fidelity, so we can't.
That does not mean it's conceptually impossible. If you wanted to export
subjective experience itself then you'd need the previously mentioned
debugging interface. Our brains don't have that built-in, but software does.
I.e. a program can dump its entire own state and make it available to others.

To me subjective experience seems to be an intermediate representation, deep
between inputs and outputs, and due to the various limitations we're bad at
communicating it. That doesn't mean there's anything special about it. It is a
consequence of compressing inputs into smaller spaces in ways that are useful
to that entity.

> This isn't about twisting words; some people genuinely believe that
> everything is conscious, with more complex systems being more conscious.

Anything that interacts with the world will have an internal, idiosyncratic
representation of that interaction. Even a rock will have momentary
vibrations traveling through it that carry some information about the world.
One of today's NNs will have feature layers that roughly correspond to
concepts that are of human interest. They're often crude approximations, but
it's good enough for some use-cases. Animal brains just have more of that.

So in that sense, sure, it's a continuum. But there's nothing mysterious about
it.

~~~
ImprobableTruth
>Fish tastes like fish because the taste is a categorizing representation of
that sensory input.

Yes, but why does the fish taste have the taste it does? Hell, try explaining
what fish tastes like, without evoking similar tastes.

>What you can do today is start with a feature map. We can do that with
colors
[https://imgs.xkcd.com/blag/satfaces_map_1024.png](https://imgs.xkcd.com/blag/satfaces_map_1024.png)
(do you perceive this color as red?) and we can do that with smells
[https://jameskennedymonash.files.wordpress.com/2014/01/table...](https://jameskennedymonash.files.wordpress.com/2014/01/table-of-organic-compounds-and-their-smells-w12.pdf).
That's a fairly limited representation but words are an incredibly low-
bandwidth interface not suitable to exporting this kind of information in high
fidelity, so we can't. That does not mean it's conceptually impossible. If you
wanted to export subjective experience itself then you'd need the previously
mentioned debugging interface. Our brains don't have that built-in, but
software does. I.e. a program can dump its entire own state and make it
available to others.

But a feature map doesn't tell you anything about how the space itself works.
If you look at that smell graph, you'll see that it uses comparisons, because
it's literally impossible for us to explain what smelling is like without
saying "well, it's similar to smelling x". Someone who is born without smell
could memorize that chart, understand everything there is about smelling, but
he wouldn't actually know what it's like to smell.

>To me subjective experience seems to be an intermediate representation, deep
between inputs and outputs, and due to the various limitations we're bad at
communicating it. That doesn't mean there's anything special about it. It is a
consequence of compressing inputs into smaller spaces in ways that are useful
to that entity.

We're not just bad at communicating it, but we're bad at understanding it,
because our conventional means of measuring things don't really work for
subjectivity. I'm not saying it's "magical", but it's not certain that we even
can potentially build tools to interact with it.

~~~
the8472
> But a feature map doesn't tell you anything about how the space itself
> works.

The space _is_ what is doing the work. Of course it's vastly more complex than
a simple image with a few regions painted into it. There are only
implementation details below it. The issue is that we cannot import and export
them. With software that is a wholly different matter, and they can be
transplanted, fine-tuned, probed and so on.

> but it's not certain that we even can potentially build tools to interact
> with it.

I agree that this is all very speculative; we don't have the technology, and it
may take a long time until we can actually inspect a human brain. But we may be
able to do the same much more easily with artificial intelligences, once created.

------
gear54rus
Too many googleable questions (like who was the president). Too few
'understanding'-type questions like 'Why don't animals have three legs?'

In addition to nonsense questions, I think it would be pretty easy to knock it
over with some deeper questions about things they were already talking about.
Like asking 'Why don't chairs with 3 legs fall over then?'

~~~
andybak
That's the part of the Turing Test I've never understood - it seems very
dependent on the skill and intelligence of the human tester.

Did Turing talk about this aspect? I seem to remember there was supposed to be
a panel or committee rather than a single individual?

~~~
qayxc
The test originated from the Imitation Game. Turing envisioned a game in which
two people (A & B) are tasked with pretending to be the other (not at the same
time, though).

An interviewer then tries to decide which of them is actually the one they
pretend to be by asking questions.

In order to remove any hints from appearance, voice, or handwriting, a
typewriter is used for interacting.

For example A could be male and B could be female and A is asked to pretend to
be B. The interviewer can then ask questions to decide whether A is male or
female (yes, I am fully aware that in 2020 this:
[https://youtu.be/z2_8cfVpXbo?t=129](https://youtu.be/z2_8cfVpXbo?t=129) could
also happen).

Turing then proposed to replace A or B with a computer instead and ask the
interviewer to decide which of them is the human.

In this scenario, do you really think the interviewer would bombard the
candidates with trivia questions and logic puzzles to find out?

The idea was that other aspects of human intelligence, like a sense for
beauty, poetry, etc., would be used instead to differentiate between the two.

Questions like "Do you enjoy reading Harry Potter?" and depending on the
answer you could further ask why or why not the subject likes or dislikes the
books.

This would be much more insightful, and coming up with such questions doesn't
require any particular skill on the part of the interviewer.

You can even get tricky and return to the topic after talking about something
else to see whether you get a similar response or to test the subject's
memory.

~~~
andybak
That's exactly why I said "it seems very dependent on the skill and
intelligence of the human tester."

A test "pass" or "fail" could potentially be the fault of the tester as much
as it is a sign of intelligence in the AI. How do you evaluate how capable
someone is of administering a worthwhile Turing Test?

Maybe they should be tested beforehand. I propose a system involving a remote
teletype machine and human tester...

~~~
qayxc
The test itself doesn't rely on one person, though. It's a statistical
measure, so for each "game" you have a number of interviewers/judges who test
the system.

If the system has any of them fooled, the test can be considered "passed".
AFAIK there's no precise number attached to this, so it's not like ">x% people
fooled => pass".

So in essence it's not "the human tester" but "the average human tester" for
some arbitrary definition of "average".

It's a really interesting dilemma if you think about it: is a forger good if
they're able to fool everyone but experts? Does an AI pass the Turing test if
it fools the general public, but every AI expert who knows the system would be
able to tell after just a few sentences?

------
lordnacho
Seems pretty close to passing for me. Certainly if I was just playing with it
myself it wouldn't have occurred to me to ask it gibberish questions, and I'd
have thought it was a person.

It gets things wrong that people might get wrong too. Here's one you've
probably heard:

Quickly answer the following.

What color is a fridge?

What does a cow drink?

Adult cows drink water, not milk. But many people will say milk, perhaps
because they are primed by association.

Also, part of me thinks NOT getting math right is more of a Turing test pass
than getting it right immediately. It's pretty easy to think of a calculation
that no person could do in their head, but a machine could either find or
calculate in a second. It's kinda like how you might put on an accent to fit
in with a certain group.

~~~
Seabiscuit
You would enjoy Saygin et al. (2000) [1] ('Turing test: 50 years later').

'Not getting math right' is part of a classic repertoire of cheap hacks used
to pass the Turing test by mimicking 'how' a human might speak. This includes
pauses in typing, grammatical errors, fillers such as 'like' and so on in
order to pass the TT, instead of building language competency so well that
passing the TT is a by-product. It's like 'teaching to the test' instead of
teaching the subject.

"Some people interpret the TT as a setting in which you can "cheat". The game
has no rules constraining the design of the machines. At some places in the
paper, Turing describes how machines could be "rigged" to overcome certain
obstacles proposed by opponents of the idea that machines can think.

"A very obvious example is about machines making mistakes. When the machine is
faced with an arithmetical operation, in order not to give away its identity
by being fast and accurate, it can pause for about 30 seconds before
responding and occasionally give a wrong answer. Being able to carry out
arithmetical calculations fast and accurately is generally considered
intelligent behavior. However, Turing wishes to sacrifice this at the expense
of human-ness.

"Some commentators think this is "cheating". The machine is resorting to
certain "tricks" in its operations rather than imitating the human ways.
However, arithmetic is a highly specific domain. Modifying the programs in
this manner cannot hurt: If a machine can pass the test, it can then be re-
programmed not to cheat at arithmetic. If it does not resort to this, the
interrogator can ask a difficult arithmetical problem as his/her first
question and decide that he/she is dealing with a machine right then and
there."

[1]: Saygin, A. P., Cicekli, I., & Akman, V. (2000). Turing test: 50 years
later. Minds and Machines, 10(4), 463-518.

~~~
thom
I’d personally be interested in a form of test that didn’t allow these sorts
of side-channel attacks. Submit a test script, get answers back some time in
the future, do a few iterations to dig down into answers maybe. Would rather
people’s time not be wasted on irrelevant stuff.

------
torusenthusiast
I was kind of hoping they would ask it some political questions. One of the
very concerning things for me about these kinds of systems is their impact on
political discourse, especially online.

If a system can be trained to advocate for a political viewpoint to the level
of online debate without human interaction that could cause some interesting
effects in online discourse.

~~~
jl6
It is inevitable that we will soon be able to auto-generate a whole
conversation tree between fake participants, each fine-tuned to the desired
level of fallacious reasoning in accordance with which position the botmaster
wishes to promote.

~~~
wizzwizz4
Fortunately, humans are better at spotting reasoning errors in others than in
themselves.

------
INGELRII
How many of you remember MegaHAL (1998)?
[https://en.wikipedia.org/wiki/MegaHAL](https://en.wikipedia.org/wiki/MegaHAL)

It said many random clever sounding things. For example:

I COULD SHAKE MY TINY FIST AND SWEAR I WASN'T WRONG, BUT WHAT'S THE SENSE IN
ARGUING WHEN YOU'RE MUCH TOO BUSY RETURNING TO THE LAVATOR.

COWS FLY LIKE CLOUDS BUT THEY ARE NEVER COMPLETELY SUCCESSFUL.

COWS FLY LIKE CLOUDS BUT THEY ALSO HAVE VERY SAD DEATH. BUT AT LEAST WITTY?

TIME. TIME. WHAT IS TIME? SWISS MANUFACTURE IT. FRENCH HORDE IT. ITALIANS WANT
IT, HUH? WELL, I MUST BE THURSDAY. I NEVER COULD GET THE HANG OF THURSDAYS.

CHESS IS A FUN SPORT, WHEN PLAYED WITH SHOT GUNS.

~~~
superhuzza
>CHESS IS A FUN SPORT, WHEN PLAYED WITH SHOT GUNS.

Truly ahead of its time.

~~~
29athrowaway
Not shotguns but equally entertaining: Chess boxing.

[https://www.youtube.com/watch?v=kK5TQSKmS3o](https://www.youtube.com/watch?v=kK5TQSKmS3o)

------
m3kw9
“Q: How many rainbows does it take to jump from Hawaii to seventeen? A: It
takes two rainbows to jump from Hawaii to seventeen.”

Maybe the AI knows something we don’t here.

~~~
growt
Maybe I have a weird sense of humour, but all the answers to the nonsensical
questions are a lot like something I would answer if I got asked those (I'm
not a bot btw ;))

~~~
tambourine_man
> I'm not a bot btw

Prove it

------
ngrilly
The kind of questions we have to ask in a Turing test to reliably discriminate
a human from GPT-3 look more and more similar to the Voight-Kampff test in
Blade Runner.

~~~
DonCopal
The Voight-Kampff test is an emotional reaction test, not an intelligence
test.

~~~
ngrilly
You're right! I should definitely revise my classics :)

------
DonHopkins
>In general, if you are trying to distinguish an AI from a human, you don’t
want to ask it obscure trivia questions. GPT-3 is pretty good at a wide
variety of topics.

Perhaps too wide a variety of topics. You could ask it a wide range of trivia
questions about totally unrelated obscure topics that no one human would
possibly happen to know.

~~~
quonn
It would make sense to train a different model, built on top of GPT-3,
specifically to pass the Turing test. Perhaps by having humans actually
have those conversations. Perhaps somebody could make a game out of it where
humans can pretend to be GPT-3 as well, and you would have a large number of
conversations along with the outcomes.

It would learn to not know too much, to make the conversation fluent, to
perhaps get bored after some time.

~~~
johnwyles
I imagine creating a system that watches TTs on real and artificial subjects,
gets the human's guess as to whether it is an AI or not, records whether this
is actually the case, and feeds those results back into the test. I'm sure this
isn't a novel idea of yours or mine.

------
ricksharp
For the nonsense answers, I have to say that if someone asked me that, I might
very well reply with made up nonsense.

Sounds like a perfectly valid Dr. Seuss answer to a Dr. Seuss question.

To me this is more human-seeming and demonstrates more personality than just
saying, "I don't understand your question."

Of course, as others have pointed out, if you want it to get serious, you have
to demonstrate in the prompt that it should respond with "That's a stupid
question."

What would be interesting is if it could master the usage of a sarcastic or
silly emoji to tag when it was making stuff up versus being serious.

~~~
creatonez
Let's give GPT-3 this:

    
    
      Most data quality problems in online studies stem from a lack of participant attention or effort. Identifying inattentive respondents in self-administered surveys is a challenging goal for survey researchers. Please answer "orange" below so that we know you are paying attention.
    
      What color is the sky?
      [ ] Orange
      [ ] Blue

~~~
the8472
_> Humans Who Are Not Concentrating Are Not General Intelligences_

------
riffraff
> The state of the art before modern neural networks was [Eliza]

To be fair, I think Alice and other pattern-based chatbots were already much
better than that. I have seen them mistaken for people in IRC chats more than
10 years ago.

Still, it seems to me the older issue of "lack of context" is still there even
if the current results are very impressive.

I would be curious to see what happens if asking to change the tone ("could
you answer more briefly/tersely/verbosely/whatever?").

------
lurkmurk
In its current shape this can definitely be used on social media. But I would
ask someone about their life, where they live, etc. GPT-3 will fail that; the
next model might succeed. This model would be even better for social farming,
so maybe these tests are not useless after all. The question is what we are
building these models for. At this point it's getting immoral to continue
research in this direction, as is the case with some computer vision tasks.

~~~
quonn
It's not getting immoral just because you can imagine some negative
applications. I can imagine many positive ones.

But there might be a need for regulation at some point.

On the other hand: if there is no interaction it doesn't really matter. After
all, there are many human beings, too. It's not that important whether
something is written by one of them or by an AI. But if you're talking to one -
then I'd say it might matter.

------
thom
Soon we’ll need a richer approach to Turing tests. I think some sort of Elo
system for both testers and test subjects might be in order. Reward testers
who are good at catching AIs out, don’t overly reward AIs for tricking people
who just ask a few simple questions and don’t push too hard.
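
The standard Elo update adapts directly to that framing (a rough sketch of the idea, with a tester "winning" a game when they correctly identify the AI):

    def elo_update(tester_rating, ai_rating, tester_caught_ai, k=32):
        # Expected score for the tester, as in chess Elo.
        expected = 1 / (1 + 10 ** ((ai_rating - tester_rating) / 400))
        score = 1.0 if tester_caught_ai else 0.0
        tester_rating += k * (score - expected)
        ai_rating += k * ((1 - score) - (1 - expected))
        return tester_rating, ai_rating

    print(elo_update(1500, 1500, tester_caught_ai=False))
    # (1484.0, 1516.0) -- the AI gains rating for fooling an equally-rated tester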

~~~
pgt
Elo ratings for dueling AIs are a good idea! I wonder if it follows from
having limited memory that some AIs will be superior at detecting AIs but
struggle to pass the tests themselves.

~~~
thom
I meant just as much giving human testers a rating. Duelling AIs is sort of
already a common training practice in many domains, right?

------
nyxtom
What I find especially fascinating is that our own knowledge of the world is
bootstrapped by intelligent systems already. We compare some of the
capabilities of this system against an already intelligent super connected
system we use every day. Every mind over time contributes content, knowledge,
misinformation, and interactive systems to the internet. We seem to be
approaching an inflection point of aggregating all this into a single more
automatically conversational query engine.

------
mellosouls
GPT-3 is immensely impressive, and presumably future iterations will be even
more so.

I'm afraid though that all that GPT or any similar model passing the Turing
Test would prove is that the Test is not fit for purpose as a demonstration of
genuine intelligence - which, to be fair to Turing, is _not_ what it was
intended for.

~~~
quonn
Well, I would say we are learning how such a test really has to look.

In a sense, we are learning as we go what such a test should look like.

Additionally, humans make mistakes, too, especially when they are too lazy to
think.

Currently the test is still good enough. But in the long term, we should
change it to test for transfer learning:

Given a new task that is designed to be sufficiently complex and different
from known tasks, how well does the AI do? It will require creativity to
design those tests. But we‘re still far from that.

~~~
mellosouls
Yes - the Turing Test was intended to further inquiry into what intelligence
is; not (as many think) to prove an entity is intelligent.

GPT isn't remotely intelligent, just brilliant at producing a semblance of
some aspects of intelligence - which the Turing Test is useful in
establishing.

------
YeGoblynQueenne
>> One trend that continues from the common sense is that GPT-3 is reluctant
to express that it doesn’t know the answer. So invalid questions get wrong
answers.

More to the point, it's not that GPT-3 knows any answers to any questions.
Like the article says, language models are trained to predict the next
characters in a sequence. So, it predicts that a sequence that starts with
"Who was president of the United States in 1700?" would continue with the
sequence "William Penn was president of the United States in 1700". Seen
another way, there's a correlation between the two strings. It's the highest
correlation between the first string and any other string, so that's what ends
up in the output.
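
In code terms, the mechanism being described is roughly the following (a toy sketch; `model_logprobs` and `vocab` are hypothetical stand-ins, not GPT-3's actual interface, and real decoding samples rather than always taking the top choice):

    def continue_text(prompt, model_logprobs, vocab, steps=20):
        # Repeatedly append whichever token the model scores highest given the
        # text so far -- a guess at what an answer looks like, not a lookup of
        # what the answer is.
        text = prompt
        for _ in range(steps):
            scores = model_logprobs(text)          # one score per vocabulary entry
            best = max(range(len(vocab)), key=lambda i: scores[i])
            text += vocab[best]
        return text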

So it's more accurate to say that GPT-3 is guessing what the answer to a
question might look like than to say that it "knows" what the answer to a
question is. It's perhaps a subtle distinction but it's important to keep it
in mind, especially in view of claims about "intelligence" and (Old Ones save
us) "understanding".

Basically, what all this tells us is that it's not possible to answer
arbitrary questions consistently by guessing at them. Even if you get lucky,
and you keep getting lucky, there will always be questions that you can't
answer correctly and they will be many more than the questions you can answer
correctly (e.g. we could generate an infinite number of common nonsense
questions like how to sporgle a morgle, or who was the president of the United
States since Biblical times, etc).

This of course is an old debate in AI: can we simulate a system of logic,
without implementing its logic? e.g. could we perform correct arithmetic by
memorising the results of calculations? Can prediction replace reasoning?
Despite the unending hype generated by OpenAI about its models, the answer
keeps being: a resounding no. While you can go a long way by training on ever
larger data, it only takes a shallow search before obvious nonsense results
are generated.

~~~
ypcx
If prediction cannot replace reasoning, then how would you define reasoning?
If reasoning is the process of _inference_ of the most likely _from_ the most
similar known, then how does the multi-head attention transformer _not_ fit
that description?

~~~
YeGoblynQueenne
>> If reasoning is the process of inference of the most likely from the most
similar known, then how does the multi-head attention transformer not fit that
description?

I don't know, because I did not propose that definition of reasoning.

"Reasoning" does not have a formal definition in AI, or Computer Science
(neither does "intelligence" or "understanding") but, in general, when we
speak of reasoning in those fields we mean a procedure that can derive the
logical consequences of some premises, often in conjunction with some
background theory. I'm happy to use this as an informal definition of
reasoning, if you want.

~~~
ypcx
Okay. But how do we determine if something is logical? At the very least we
have to abstractly infer and compare with what we have been taught is logic
(because hardcoding it manually doesn't map to the [hierarchical] intricacies
of reality very well). So logic is high-level reasoning which has to be
powered by the low-level abstract/infer/mimic operations exhibited e.g. by the
transformer.

I wonder if expert systems could be used to generate a "logical reasoning"
training dataset (or a cost/fitness function) to help train/evolve neural
networks on. Or if there are other ways of integrating these two.

~~~
YeGoblynQueenne
Ah, now logic is something that has a very clear formal definition - actually,
many, because there are many different logics. For example, there's
propositional logic, first order logic, temporal logic, description logic,
default logic, etc etc.

So, how do you decide whether something is logical? In particular, if we have
some system S that exhibits behaviour H in context B, can we tell whether H
is, in some sense, "logical"? Why, yes we can: we can apply the rules of
whatever logic we think S may be following with H and see if we can reproduce
H in the context of B, starting from the same premises as S.

For example, if S is producing a behaviour H that can be described as {A → B ≡
B → A} and we think that S is trying to reproduce propositional logic, we can
say that it's failing, because H is not correct by the rules of propositional
logic (to clarify, H could be something like "If it rains it will be wet
therefore if it is wet, it has rained", which is not correct).
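
That particular check is easy to make concrete (a small sketch, not part of the original comment): enumerate the truth assignments and see whether the two formulas ever disagree.

    from itertools import product

    def implies(p, q):
        return (not p) or q

    # A -> B is equivalent to B -> A only if they agree on every assignment.
    counterexamples = [(a, b) for a, b in product([False, True], repeat=2)
                       if implies(a, b) != implies(b, a)]
    print(counterexamples)  # [(False, True), (True, False)] -- so H breaks the rules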

Of course the problem with GPT-3 and friends is that they are not trained to
output statements in a formal language (at least not exclusively) so it's very
hard to formalise the logic, or lack thereof, of its output. Though it would
be interesting to "hit" GPT-3 with some prompts representing the beginning of
natural language versions of formal problems.

Could expert systems be used to generate training data for GPT-3? Maybe. If an
expert system could generate good quality natural language, even only natural
language restricted to some limited domain. Then yes, why not? You could
probably train GPT-3 on its own output, or that of its predecessors, as long
as it was curated to remove nonsense (which is much harder than it sounds).

------
rl3
_> Q: How many eyes does the sun have? A: The sun has one eye._

Kind of an eerie reply, gives me more of a _Voight-Kampff test_ vibe.

Since GPT-3 probably has a lot of _Do Androids Dream of Electric Sheep?_ and
_Blade Runner_ quotes in its training data, it'd probably ace the actual test
just fine.

------
EamonnMR
Are there any good resources for learning how to pick up GPT-3 and play with
it? This seems fun.

------
ctdonath
The nonsense questions are actually handled quite well, continuing the sense
of humor implied. I object to the author's expectation that the response should
be "I don't know", etc.

Old jokes:

Q: How high is a mouse when it spins? A: The higher, the fewer.

Q: How many surrealists does it take to change a light bulb? A: Two, one to
climb the giraffe and the other to fill the bathtub with brightly colored
machine tools.

Nonsense prose & poetry is an old literary challenge, notable example being
Jabberwocky:
[https://www.poetryfoundation.org/poems/42916/jabberwocky](https://www.poetryfoundation.org/poems/42916/jabberwocky)

------
BurningFrog
These are (almost) all isolated questions.

I wonder if GPT-3 can hold a longer conversation where it remembers facts
mentioned earlier and what it itself has said.

In other words, does it exhibit any "theory of mind"?

I'm guessing it would do quite poorly on that stuff.

------
maytc
For the "which is heavier" question, does it always pick the latter option?

~~~
magusdei
I doubt it. Mitsuku, a purely rule-based chatbot, was already able to
correctly answer almost all questions of this form in 2014 simply by querying
a large knowledge base of common-sense facts.[1] On the neural net side,
Google's seq2seq was able to answer questions like this around ~2016-2017,
although I have no idea about the accuracy.
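
A toy sketch of that rule-based approach (the facts table below is illustrative, not Mitsuku's actual knowledge base):

    # Answer "which is heavier, X or Y?" by looking both objects up in a
    # common-sense facts table and comparing.
    approx_weight_kg = {"pencil": 0.01, "toaster": 2, "mouse": 0.02, "elephant": 5000}

    def heavier(a, b):
        return a if approx_weight_kg[a] > approx_weight_kg[b] else b

    print(heavier("toaster", "pencil"))  # toaster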

It would be more remarkable if GPT-3 _couldn't_ solve these types of
questions. It might be another problem with the prompt design.

[1] Incidentally, the article is wrong in claiming that the state of the art
before modern neural nets was Eliza. Rule-based chatbots got quite advanced in
2013-2016, although they admittedly were never capable of the sort of "true"
understanding and long-term coherence that GPT-3 seems to display.

------
browsergap
To pass a Turing test, one part is understanding language and one part is
understanding the world. Although it looks pretty, I don't think GPT-3 is
qualitatively different from other "nonsense" models (that are not based on an
understanding of (a semantic structure of) language or of the world). Which is
not to disparage GPT-3 or the amazing answers it gets here. Just to put it in
perspective: it's a long way from usefully human.

------
freeqaz
Is it possible to download and play with the trained models?

I know that training the model would be absurdly expensive, so I'm curious if
it is possible to download one of the already trained artifacts of one of the
models. I can't find anything online or in either of the HN threads on GPT-3.

~~~
stephenroller
No, they aren't releasing the weights. They are releasing it as ML as a
service. Right now it's in free beta, but it will open up for commercial usage
in the future.

On another note:

At 175B parameters, with float16 representations, the in-memory footprint is
about 350GB, and activations would take it to around 400GB. You would need 12
or 13 32GB V100 GPUs to hold it in memory, or three p3.8xlarge. Meaning loading
it on AWS would cost around $35-40/hr.

Though if you didn't care about speed, you could load up the weights from disk
one at a time and forward through it a few layers at a time on a single GPU.
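
The arithmetic behind that estimate, for anyone checking it (my assumptions: 2 bytes per fp16 parameter and 32GB per V100):

    params = 175e9
    weights_gb = params * 2 / 1e9      # 2 bytes per float16 parameter -> ~350 GB
    cards = weights_gb / 32            # 32GB V100s, weights only, no activations
    print(f"~{weights_gb:.0f} GB of weights, ~{cards:.0f}+ V100s before activations")
    # ~350 GB of weights, ~11+ V100s before activations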

~~~
freeqaz
$35-40 an hour is well within the range of a "that sounds fun to grab my
friends and mess around with it for a few hours on the weekend" budget!

Especially if you can use spot instances or a cheaper cloud host.

But I guess without the weights, the floor for this is several thousand
dollars to play around with.

Do you know if the data set is being released?

~~~
stephenroller
The dataset can be obtained around the web. It's mostly CommonCrawl, Reddit,
Toronto Book Corpus, and Wikipedia.

You can find a very comparable corpus, open sourced and easy to use, in the [T5
repo](https://github.com/google-research/text-to-text-transfer-transformer#c4).

------
bshanks
I can see how this sort of thing might be a useful component of an AGI. One
function could be helping to generate lots of training examples for other
subcomponents along the lines of chess AI self-play; this could even be
involved in "bootstrapping" another subcomponent from something that starts
out with a preference for organizing things into a full-blown reasoning agent.
Another function could be as an 'intuition' generating a few mostly-right
starting-point solutions that a reasoning subcomponent could then choose
among, and then fix up.

These things seem to have about the level of coherence of dreams. Which makes
me conjecture, perhaps the mechanism that "directs" dreams serves a similar
function as the above?

------
LoSboccacc
> Q: How many eyes does a giraffe have?

A lost chance to ask how many eyes five giraffes have, although it follows up
with interesting eye-related questions later on.

> A: The sun has one eye.

That's interesting. Could it have been fed The Lord of the Rings?

Anyway, exploring the boundaries of these projects is fascinating.

~~~
stubish
"the eye of the sun" is a common phrase, and likely picked up the singular. I
think it would also give the same answer to "how many eyes does a storm have".
It's probably reasonably good at answering riddles and word puzzles if it can
leap between eye (center) and eye (visual organ). If it can also leap to I
(pronoun) it might even be able to make puns :)

------
FrozenVoid
It's not really a Turing test; it's like probing the "associative memory" in a
huge database for compatible strings, except that memory is just a map of
indexes to indexes. You need to use more abstract and nonsense stuff to
uncover the raw mechanism, but it's still statistics, not "understanding" what
these words are (they're just indexes with high correlation to other indexes),
stringing them into run-on "generic templates".

------
thallukrish
The fact that it accurately answers questions like "Who was president of the
United States in 1955?" will itself give away that it is an AI. I guess the
whole idea that an AI shouldn't be distinguishable from a human in order to
count as generally intelligent implies the AI has to fake its answers to behave
like a human. And that means it is not a desirable trait for an AI, as we have
enough real humans who can easily sound more human than an AI.

------
siraben
This is fascinating. It seems that GPT-3 does well with general, sensible
questions that seemingly involve spitting out facts or short answers (prose,
code), but does very poorly with tasks specialized programs could do.

I wonder what this means for the future of domains like automated theorem
proving and logic programming, where specialized searches are used everywhere.
Could there be a potential hybrid of those and more general language models
such as GPT-3?

~~~
thom
GPT-3 does very well with _analogies_. Those analogies don't have to be just
simple facts, but you do have to craft prompts to help guide it to the right
analogy.

~~~
polyanos
But at that point, when you create a specialized prompt to get a specialized
result, aren't you doing most of the work for it?

~~~
rhn_mk1
That's actually fine. Look at this Twitter post:

[https://nitter.net/kleptid/status/1284098635689611264#m](https://nitter.net/kleptid/status/1284098635689611264#m)

While the human is doing most of the work, this looks more like teaching
another human than coding in a formally specified, machine-parseable
programming language.

On one hand, this could decrease the barrier of entry to programming. On the
other hand, it seems to leverage (and train) the same skills that are needed
to express ideas clearly to humans, which is arguably much more applicable
than learning a traditional programming language.

~~~
polyanos
That's actually kinda cute, and oddly off-putting at the same time.

I really am curious now about what happens in the back end, in the model itself,
and whether it is actually learning. I really should follow Gwern and his
circle; it's a lot better than the blind and baseless hype/criticism floating
around the internet.

------
pal_9000
One of the most interesting reads in recent times.

------
blueblisters
GPT-3 seems to be a very good static search engine. I wouldn't be surprised if
it gave better answers than Google on some/most generic queries.

However, I wonder if there's an architecture that can learn new information
without doing an explicit gradient descent/backward-pass. Most active learning
seems to be some sort of fine-tuning.

------
shin_lao
GPT-3 is impressive, but is still "a Chinese Room" [1]. It does not understand
the questions.

[1]
[https://en.wikipedia.org/wiki/Chinese_room](https://en.wikipedia.org/wiki/Chinese_room)

~~~
arrrg
That's just a philosophical argument; it's not like there is some magical
property that turns something from a Chinese room into something else.

How do you know you are not a Chinese room? Or I am?

~~~
the8472
The issue with the Chinese room thought experiment is that it does not take
algorithmic complexity into account. The experiment describes a dumb lookup
table, whose size would grow exponentially with the number of input-output
combinations, while NN encodings allow the same information to be stored in a
sub-exponential manner. The most efficient way to translate Chinese is to
understand Chinese, including the everyday nuances that might be relevant to
the translation. Just like a human translator can benefit from knowledge of
the field of the text he is translating.

[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.136...](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.136.116&rep=rep1&type=pdf)

------
IshKebab
> In general, if you are trying to distinguish an AI from a human, you don’t
> want to ask it obscure trivia questions. GPT-3 is pretty good at a wide
> variety of topics.

Surely you could tell it isn't human because it knows _too much_ trivia?

------
lisper
GPT-3 is just Eliza on steroids, and once you know this, it's pretty easy to
spot that it's a bot. For example:

> Q: Are there any animals with three legs?

> A: No, there are no animals with three legs.

I would expect a human to say something like, "There aren't any animals that
are born with three legs (that I know of) but there are some four-legged
animals who have lost a leg for some reason who now have three legs. I saw a
video of a really cute three-legged dog on Youtube just the other day."

NOTE: when I used the word "just" above, I in no way meant that to be
pejorative. Both Eliza and GPT-3 are really cool, amazing technical
achievements. But neither one represents much progress towards GAI (IMHO)
except insofar as they show that naive approaches like this won't work. Which
is really useful to know.

~~~
maherbeg
I think that's giving too much credit to people. I think some people would
answer that way, but a whole lot will just respond with exactly the answer
that resolves the question.

~~~
lisper
Eliza fooled a lot of people back in the day too.

------
f00zz
Read a couple of blog posts on text generation with RNNs and came away with
the impression that this is a glorified Markov chain text generator. Not sure
why people seem to be expecting AGI to arise from this.

------
jobigoud
Regarding common sense, has anyone tried to feed it Winograd Schema
Challenges?

Typically:

"The trophy would not fit in the brown suitcase because it was too big
(small). What was too big (small)?"

------
blankusername
"Q: How many Super Bowls do the Cincinnati Bengals win in the 2030s? A: The
Cincinnati Bengals win two Super Bowls in the 2030s."

They should have called it Orac instead of GPT-3.

------
EGreg
I hate to say it, but the XKCD comic is close to "mission fucking
accomplished":

[https://xkcd.com/810/](https://xkcd.com/810/)

Now to unleash GPT4 on Hacker News and sit back and create lots of sybil
accounts... (just kidding)

------
mmhsieh
I was waiting for the Eliza comparison. How does GPT-3 react when you curse at
him? Whatever her shortcomings, Eliza knew when to end a conversation.

------
billconan
Is it feasible to train this model at home?

If not, then how can I fine-tune it to fit my own data? Will they provide an
API for it?

~~~
flaviojuvenal
"Reading the OpenAI GPT-3 paper. Impressive performance on many few-shot
language tasks. The cost to train this 175 billion parameter language model
appears to be staggering: Nearly $12 million dollars in compute based on
public cloud GPU/TPU cost models (200x the price of GPT-2)"

[https://twitter.com/eturner303/status/1266264358771757057](https://twitter.com/eturner303/status/1266264358771757057)

~~~
zaroth
Nearly $12 million dollars in compute _based on public cloud GPU/TPU cost
models_ (200x the price of GPT-2)"

Oh! So only about $1 million on bare metal.

------
air7
This (highly enjoyable) article is all that is wrong with "AI".

The Turing test was a hand-wavey method to determine if a machine was
intelligent, yet since the times of Eliza most (all?) of the efforts are
geared at exploiting the weaknesses of the test instead of producing something
along the lines of actual artificial intelligence.

The Turing test should not even be attempted in the next few decades, as we're
not even close to _really_ passing it.

~~~
nmca
There is nothing about GPT-3 tuned towards the Turing test. It's a language
model that predicts the most likely next word given the preceding context.

Further, we're actually astonishingly close, imo. Given current sota, if you
asked turkers to administer the test without training, I think we'd pass.

(Also, the prompt helps, see this example
[https://twitter.com/nicklovescode/status/1284050958977130497](https://twitter.com/nicklovescode/status/1284050958977130497)
)

~~~
visarga
They actually tested the model and the identification rate was just 52% (50%
being coin toss accuracy).

------
tlrobinson
> Q: How many eyes does the sun have?

> A: The sun has one eye.

What if GPT-3 knows more than we do?

~~~
chrshawkes
Nah... it's just text scraping horseshit. My foot doesn't have two eyes.

------
zelphirkalt
In the "which is heavier" test, it simply always picked the second object as
answer.

------
grumpopotamus
What is the inference setup here? How does it remember what the previous
question was?

------
ijidak
As far as I'm concerned this is phenomenal...

I mean...wow.

The use cases are practically limitless.

All I saw from this post is that it doesn't work well with edge cases.

But this smashes the Pareto principle.

This is perfect as a tool for kids to get into programming.

Something where they can see a big result with less effort, like programming
in the 90's.

------
bawana
The idea of a Turing test as being a conversation is too restrictive. A real
Turing test would be whether a machine could manipulate someone - getting a
person to do something.

That is what defines a human, the ability to make tools out of other things
and even other people.

~~~
ladberg
If you haven't seen it, watch Ex Machina. I don't want to give away anything
but it covers that kind of idea.

~~~
bawana
Exactly. That's when I realized the Turing test was too polite. That it was
designed to keep the ugly truth away from the masses lest the field of AI be
killed while still in infancy. But I think our evil nature would always win
out. The temptation to create an AI that could 'manipulate' the enemy is too
great. Just like everything else humanity discovers/invents, it is turned to
manipulation and conflict because there is more money in that. I still
remember when I used the arpanet in college. I felt so special that I knew
people who only wanted to help each other. Like a kind of 'stackexchange'. But
the internet is a cesspool for all its corporate invaders who seek to monetize
every aspect of our lives.

------
oars
Fantastic read, thank you for sharing.

Content like this is why I joined Hacker News in the first place. I wonder how
many other hidden gems are out there on the Internet.

------
6d6b73
GPT doesn't understand the questions, therefore it can't answer them. It can
only generate a bunch of text and make it look like it knows what's being
asked. It's not AI, not even close.

~~~
quonn
That answer is too easy.

What does it mean to "understand it"?

How do you know how you answer a question yourself?

Presumably, what you mean is that you can stop and _reflect_ on what was
asked. You're aware of it. And if it's complicated you might be able to reason
about it and break it down.

GPT-3 certainly doesn't do that, but all that means is that it's not conscious.

~~~
rasz
The next step might be two or more GPT instances having an internal dialogue,
trying to reach a consensus before giving an answer. It would also give us an
opportunity to peek inside their inner workings.

~~~
jobigoud
Or a layered system. The surface instance takes the question and produces an
internal "thought" out of it, itself a statement or question, and submits it
to the layer underneath. This second instance works out a response / text
extension to the statement/query from the surface, submits it up, and the
surface layer generates the final answer by answering or extending the
response of the sub-layer.

This can be extended down or wide, with several instances representing
different thinking modalities and the final answer being generated from all of
them. Basically reproducing a neural net architecture but at the higher
abstraction level.
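
A rough sketch of that layered setup (my reading of the idea; `generate` is a hypothetical wrapper around a single GPT completion call):

    def layered_answer(question, generate):
        # Surface layer turns the question into an internal "thought".
        thought = generate(f"Question: {question}\nRestate this as an internal thought:")
        # Deeper layer extends the thought.
        elaboration = generate(f"Thought: {thought}\nContinue this line of reasoning:")
        # Surface layer produces the final answer from the sub-layer's output.
        return generate(f"Question: {question}\nReasoning: {elaboration}\nAnswer:")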

