GPT-4 Passes Turing Test. Humans Often Mistake Each Other for AI (the-decoder.com)
14 points by biscuit1v9 20 days ago | 16 comments



The "decoder" article says:

> The researchers defined 50 percent as success on the Turing test, since participants then couldn't distinguish between human and machine better than chance.

54% of GPT-4 conversations were judged to be human, so the "decoder" article says the Turing test has been passed--indeed, it seems more human than human. But the paper says:

> humans’ pass rate was significantly higher than GPT-4’s (z = 2.42, p = 0.017)

The seeming discrepancy arises because they've run a nonstandard test, in which the meaning of that 50% threshold is very hard to interpret (and definitely not what the "decoder" author claims). The canonical version of Turing's test is passed by a machine that can

> play the imitation game so well, that an average interrogator will not have more than a 70 percent chance of making the right identification after five minutes of questioning

The canonical experiment is thus to give the interrogator two conversations, one with a human and one with a non-human, and ask them to judge which is which. The probability that they judge correctly maps directly to Turing's criterion. If the two conversations were truly indistinguishable, then the interrogator would judge correctly with p = 50%; but that would take infinitely many trials to distinguish, so Turing (arbitrarily, but reasonably) increased the threshold to 70%.

That doesn't seem to be the experiment that this paper actually conducted. They don't say it explicitly, but it seems like each interrogator had a single conversation, with a witness who was human with probability 1/4. The interrogator wasn't told anything about that prior, leading them to systematically overestimate P(human). If every interrogator had simply always guessed "non-human", then they'd collectively have been right more often.

Even if the interrogators had been given that prior, very few would have the mathematical background to make use of it. GPT-4 is impressive, but this test is strictly worse than Turing's, whose result has clear and intuitive meaning.
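
To make the base-rate point concrete, here's a rough sketch (the 1/4 prior is my reading of their setup, not a number the paper states):

  # Toy numbers: compare the chance level in Turing's paired design with
  # the single-witness design this paper appears to have used.
  p_witness_is_human = 0.25   # assumed prior that the lone witness is human

  # Paired design: one human and one machine per trial, so an interrogator
  # guessing at random is right 50% of the time -- a meaningful floor.
  chance_level_paired = 0.5

  # Single-witness design: an interrogator who never reads a word and
  # always answers "non-human" is right 75% of the time.
  accuracy_always_nonhuman = 1 - p_witness_is_human

  print(chance_level_paired, accuracy_always_nonhuman)  # 0.5 0.75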


As mentioned in another comment, there are also numerous other elements missing relative to the canonical experiment. Three big ones:

- The whole point of the name, "the imitation game", is to imitate a specific identity. The more precise an identity is, the more difficult it would be for an imposter to imitate it. Turing chose male vs female, but modern choices have generalized it down to 'human or not' which is of course vastly easier to imitate than a more specific choice.

- Participants are expected to collaborate with the interrogator, break the 4th wall, and do everything possible to make it clear to the interrogator that they are the real person. Modern variants generally have participants acting adversarially, actively and intentionally giving responses that would be difficult to distinguish from those of a bot.

- The interrogator is expected to intelligently interrogate. For example one [intentionally] naive idea Turing gave was that an interrogator might ask the person to perform some mathematical calculation. If the person even tries to answer 37167361 * 372 (let alone succeeds in a short time frame), then they are probably not human. Of course the bot could be programmed to respond accordingly, but the point is to actively and intelligently try to break the bot and have it reveal itself. Contemporary interrogators typically ask the participants random and inane questions like "Where are you from?", which is a complete and absolute waste of time unless it's part of some more precise plan - but it never is.

To my knowledge there have been no Turing Tests carried out with anything even vaguely resembling the rigor and purpose of the original test, but I think that's largely because the goal seems to be to create a test that can be passed, rather than to actually evaluate the capabilities of the various LLMs.


> - The whole point of the name, "the imitation game", is to imitate a specific identity. The more precise an identity is, the more difficult it would be for an imposter to imitate it. Turing chose male vs female, but modern choices have generalized it down to 'human or not' which is of course vastly easier to imitate than a more specific choice.

Turing did introduce the concept of the game by having it played between a human man and human woman, with the man pretending to be a woman, but to my understanding this was just a stepping stone to move on to having the game played between machine and human.

I don't think the gender specificity was meant to stick around beyond that initial introductory example. If you mean how he says things like "imitation of the behaviour of a man", that's most likely intended generally rather than specifically male (particularly as the "machine takes the part of A", which was the man pretending to be a woman).


Here [1] is the original paper. Though he does not say so explicitly, I'm sure the idea of man vs woman was just an example. It could be anything, but I think it inherently must be something. Generalizing this down to being human or not greatly simplifies the test, because the identity aspect is basically just free information for the interrogator. With or without the identity, he could still ask the exact same questions. The only difference is the domain of viable answers is greatly limited with identities. And the more specific the identity, the more the real person will be able to reveal themselves, and the more difficulty the imposter will have impersonating them.

[1] - https://redirect.cs.umbc.edu/courses/471/papers/turing.pdf


I agree that "Man pretending to be woman, vs real woman" is just an example, used to introduce the question in the form of a party game between humans. I see the "something" it is replaced by as "Machine pretending to be human, vs real human".

I don't see indication that the machine must pretend to be human in addition to some other characteristic of the second player. I think the reason you see others as having "generalized it down" is that your interpretation is not apparent in the text.

> the more specific the identity, the more the real person will be able to reveal themselves, and the more difficulty the imposter will have impersonating them.

Definitely makes for a more difficult problem (arbitrarily difficult, even) and a potentially interesting extension.

Currently to me it doesn't seem as insightful as Turing's original proposal - there's no longer an inherent human benchmark of 50%, for instance, since humans can also be bad at impersonating some specific characteristic.


With no need to actually imitate anything in particular, you could simply chop away everything except the most basic linguistic functions and claim you are a non-native preteen. And who's to say otherwise? In fact that's literally the exact "trick" that yet another mockery of the Turing Test used when claiming they'd overcome it. In fact shall we not just take it to the next level? You're 5 years old - and simply respond by randomly pounding various keys on the keyboard on occasion. Boom - didn't see that coming, now did ya Turing?

Passing the test is a meaningful benchmark not because the test has been passed, but because of what passing it ought to entail. People often complain about shifting goalposts on AI, but that's not the issue. The issue is doing exactly what you're doing here and creating worthless goalposts to begin with. And so of course when you cross them, the first thing that happens is that they get inched forward somewhere closer to something reasonable, before you even have time to uncork the champagne. Why not simply skip this nonsense, and start with a reasonable goalpost to begin with? Because it's too hard? Well obviously - that's why it's a goal, and not next month's scrimmage point!


> you could simply chop away everything except the most basic linguistic functions and claim you are a non-native preteen [...] You're 5 years old - and simply respond by randomly pounding various keys on the keyboard on occasion. Boom - didn't see that coming, now did ya Turing?

Then the real human B would, on average, offer far more compelling evidence of personhood and the bot would fail the majority of the time. I don't see how this issue affects Turing's proposed version of the experiment.

> The issue is doing exactly what you're doing here and creating worthless goalposts to begin with

Claims from skeptics that "machines fundamentally cannot do X without real intelligence" are relatively easy to come by even now, which creates goalposts for intelligence by contrapositive (¬I => ¬X, so X => I).

For me Turing's test is interesting because fully solving it implies achieving all (or at least, a very large class of) observable "X"s to the degree that current humans are capable of. If playing chess truly required intelligence, you could feed in chess moves and a machine that cannot play chess would (over a large enough experiment, so you get people who can and cannot play chess) offer less evidence than the average person.

I believe the overall impact is a push towards either "something can behave exactly like it is intelligent without being intelligent" or "machines can be intelligent". Both are interesting and, I feel, increasingly common viewpoints.

> Because it's too hard? Well obviously - that's why it's a goal, and not next month's scrimmage point!

Because the goal should be meaningful - "find the factors of this absurdly large semiprime" doesn't really say all that much about intelligence, and many other tests would only cover one particular idea of what intelligence is.


But I think you're running into some cognitive dissonance, because in your argument here you're not talking about a "more compelling evidence of personhood", but rather discriminating towards some specific identity. A preteen non-native speaker is just as much a person, with just as much personhood, as our earlier example of a nuclear physicist with a twin neurologist.

The only difference is that imitating a preteen non-native speaker is quite trivial and says very little, which is why you would obviously never select it as the identity. In other words your version of the test doesn't involve solving many "X"s at all. In fact it only requires one - the one which simplifies the domain as much as possible. And as you're strongly implying, but not acknowledging, this was not a meaningful achievement at all.


> in your argument here you're not talking about a "more compelling evidence of personhood", but rather discriminating towards some specific identity. A preteen non-native speaker is just as much a person, with just as much personhood

It's not about how much of a person they are, but how much evidence of personhood responding in a particular way gives. If you were the interrogator and gave some challenge you think only humans can solve, and got back one "asdfghjkl" (no real evidence either way) and one correct answer (evidence in favor of personhood), your beliefs should be adjusted towards believing the latter is the human. Always giving bad answers just because humans can also give a bad answer is already a failing strategy with low success rate when the test is carried out as Turing specified, with no requirement to imitate a specific characteristic added in.

As an analogy that may or may not help: You have two boxes, one containing a rabbit and one containing a turtle. One box is perfectly still, offering little evidence either way (rabbits and turtles can both trivially stay still). The other box is bouncing up and down (something you have reason to believe is difficult for turtles). Which box more likely contains the rabbit?
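
To put rough numbers on that (entirely made up, just to show the direction of the update):

  # Toy Bayes update: starting from even odds, how much does each kind of
  # response shift belief toward "this witness is the human"?
  def posterior_human(prior, p_resp_given_human, p_resp_given_bot):
      joint_human = prior * p_resp_given_human
      joint_bot = (1 - prior) * p_resp_given_bot
      return joint_human / (joint_human + joint_bot)

  # "asdfghjkl": about equally likely from either, so no real update.
  print(posterior_human(0.5, 0.10, 0.10))  # 0.5
  # A correct answer to a challenge the judge believes only humans can pass.
  print(posterior_human(0.5, 0.60, 0.05))  # ~0.92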

> In other words your version of the test doesn't involve solving many "X"s at all. In fact it only requires one - the one which simplifies the domain as much as possible.

I think the key you are missing is that it is up against a real human. It does not just have to pass the "well both humans and bots could theoretically respond in this way" mark by giving gibberish answers, but instead get chosen as human when the second player is likely satisfying many of the interrogator's tests.


Your analogy inadvertently once again emphasizes the point. Now change it from being a rabbit and a turtle, to an unknown animal and some non-animal thing pretending to be another unknown animal. And you have to guess which is which. It would be effectively impossible to figure anything out, because you have absolutely no basis to work from.

LLMs are trained on nothing except the corpus of human knowledge. It is literally impossible for them to e.g. accidentally say something that it's inconceivable for a human to say, if we have absolutely no way of constraining the identity of said human. And another way these tests are made easier is by generally having all participants manually type out their interactions, instead of having them transcribed. So the dozens of interactions you'd normally have in 5 minutes go down to ~5. Making all of this that much more true. It's just silly.

And no, always giving bad answers is not a failing strategy. As I mentioned, the scenario I'm describing is not a hypothetical. The Turing Test (or at least yet another abysmal bastardization of it) was passed in 2014, with the chatbot in question impersonating a 13-year-old Ukrainian teen who was not good at English, at maintaining a coherent dialogue or train of thought, or even at being coherent for the most part. I'm sure this is exactly what Turing had in mind. [1]

[1] - https://www.bbc.com/news/technology-27762088


> Your analogy inadvertently once again emphasizes the point. Now change it from being a rabbit and a turtle, to an unknown animal and some non-animal thing pretending to be another unknown animal. And you have to guess which is which. It would be effectively impossible to figure anything out, because you have absolutely no basis to work from.

It'd be possible to get an idea if there was some box movement that was unique to animals. That's not particularly interesting because it's fairly uncontroversial that a box-sized robot could very accurately imitate an animal through the medium of box movement, but for a bot to imitate a human through the medium of text (seen as a sufficiently general interface to test "almost any one of the fields of human endeavour that we wish to include") is interesting to many.

But, the concept the analogy was demonstrating was really just basic reasoning. That, if you're given X xor Y and have evidence of Y, you should tend towards Y even lacking direct evidence for/against X. Do you agree that, in my example, you would choose the box giving some evidence of being a rabbit over the one that gives none?

> LLMs are trained on nothing except the corpus of human knowledge. It is literally impossible for them to e.g. accidentally say something that it's inconceivable for a human to say

Depends on what you mean by "inconceivable", but it's certainly possible for it to say things that it is unlikely for a human to say due to the bot's limitations (at the extreme, consider a Markov chain). And, even if only saying things that a human could just as well say, if those things are also trivial for a bot to say it is poor evidence of personhood.

> And no, always giving bad answers is not a failing strategy. As I mentioned, the scenario I'm describing is not a hypothetical. The Turing Test (or at least yet another abysmal bastardization of it) [...]

To put relevant emphasis on my claims:

> > Always giving bad answers just because humans can also give a bad answer is already a failing strategy with low success rate when the test is carried out as Turing specified

> > Then the real human B would, on average, offer far more compelling evidence of personhood and the bot would fail the majority of the time. I don't see how this issue affects Turing's proposed version of the experiment.

I agree that there are ways to bastardize the test. If for instance you have no second player that you must choose between (have to say A xor B is a bot), then just remaining silent/incoherent to give no information either way can be a reasonable strategy. As with all benchmarks, you also need a sufficient number of repeats such that your margin of error is low enough - fooling a handful of judges does not give a good approximation of the bot's actual rate.

I'd even claim it's a bit of a bastardization to use Turing's 30% prediction (of where we'd be by 2000) to reduce the experiment down to just pass or fail. Ultimately the test gives a metric for which the human benchmark is 50%.


Wellp this was a fun conversation, but it seems to me that at this point there's not much more to do other than repeat ourselves. The final thing I'd emphasize is that it's important to make sure metrics measure what you want them to measure. To some degree we've already ruined the name of the Turing Test with excessive simplifications. 'Oh that? Yeah, it was passed a decade ago, right?'

Of course one practical issue that, in some ways, makes this all moot - is that if we ever create genuine AI systems capable of actual thought, the entire idea of a "test" would be quite pointless. Rapid recursive self improvement, perfect memory, perfect calculation, and the ability to think? We'd likely see rapid exponential discovery and advancement in basically every field of human endeavor simultaneously. It'd be akin to carrying out a 'flying test' after we landed on the Moon.


I think we generally both agree that there are some poor misimplementations of the test, like the one you linked where (according to their paper) the interrogator could answer "unsure" on a bot's response and count as being "fooled" by that bot even if they then answer "human" on a human, which does allow for giving nonsense answers to be a legitimate strategy (unlike with Turing's specification, I'd claim).

Ultimately I do think Turing's experiment measures something interesting. There's a nice "minimal maximality" to it, in that it's a simple game yet set up in a way that solving it encompasses all facets of intelligence that current humans have. Maybe coincidentally comparable to the test for Turing completeness, in that a Turing machine is conceptually simple yet simulating it proves computational universality. I feel there's a risk of missing the nuance and just taking the experiment as a singular benchmark, whether it's made "easier" or "harder", akin to "simulating a Turing machine is too easy, how about simulating the Numerical Wind Tunnel?"

> Rapid recursive self improvement

I'm a bit sceptical of a hard take-off scenario.

Say on first pass it cleans up a lot of obvious inefficiencies and improves itself by 50%. On the next pass it has more capacity to work with, but the low-hanging fruit are already dealt with, so it probably only manages to squeeze out an extra 10%. To avoid diminishing returns, it'd need to automatically build better chip fabrication plants, improve mining equipment, etc. so that many steps in the pipeline are improving. This will all happen eventually, and contribute to humanity's continuing exponential progress, but IMO will be a relatively gradual changeover (as is happening now) rather than an overnight explosion from some researcher making a bot that can rewrite itself as soon as it can "actually think", whatever that entails.


I agree that those are also highly significant differences, though I'd consider them a reasonable "easy mode" while we wait for a machine capable of passing Turing's original test.

I focused on the statistical issue because that one seems indefensible to me. The paper's result has no clear interpretation, depending completely on what assumption the interrogator makes about the unspecified prior probability that their witness is human. It's not clear to me whether the paper's authors even understand what they've broken.
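
As a toy illustration of that dependence (my numbers, nothing from the paper): suppose a judge answers "human" whenever their posterior P(human | chat) exceeds 0.5.

  # The same chat, read by the same judge, gets opposite verdicts
  # depending on which prior the judge assumes.
  def says_human(likelihood_ratio, assumed_prior):
      # likelihood_ratio = P(chat | human) / P(chat | bot)
      prior_odds = assumed_prior / (1 - assumed_prior)
      return prior_odds * likelihood_ratio > 1   # posterior P(human) > 0.5

  chat_lr = 2.0  # a chat twice as likely to come from a human as from a bot
  print(says_human(chat_lr, assumed_prior=0.50))  # True
  print(says_human(chat_lr, assumed_prior=0.25))  # False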

Just for fun, I tried a few LLMs and couldn't get them to recognize the statistical issue either. I guess they'll probably learn before social science professors do, though.


I would always pick the one that is the most polite and makes the fewest syntactic and grammatical errors to be the AI; most humans are absolutely terrible at formulating anything coherent, so it's a safe bet that the ones who actually form correct sentences are AI. Many (most I see online or on the street) humans talk like Markov chains (check your kid's Snapchat for examples) or, at best, very early transformers (tons of repetition, getting stuck, and not all that coherent).


Ugh, these modern "Turing Tests" are a complete bastardization and dumbed down version of what Turing described. Here is his original paper. [1] In short, the actual task involves a skilled interrogator, somebody of a given and specific identity, and then somebody pretending to have that identity. Turing proposed a simple example where you'd have a woman, and then a man pretending to be a woman. The more precise the identity, the more challenging the test becomes. A man might kind of sort of be able to pass for a woman in text, but he'd never be able to pass for a nuclear physicist who has a twin brother working in neuroscience, against a skilled interrogator. And all participants are expected to actively collude and collaborate as much as possible to emphasize who is the "real" person. So for instance the woman might help the interrogator by proposing questions he could use to help spot the fake, and/or to emphasize her own authenticity.

Modern takes generalize the identity to absurdity (with the identity being human or not), generally feature idiots (or people acting like such) for interrogators, and participants who are actively trying to act like a computer to trick the interrogator. Like in this article, the human is B and was asked, "What could you say to convince me that you're a human?" His response was "You just have to believe!" Why not skip the pretext and just have the human start responding 01001001 01000010 01101111 01110100 01001100 01101111 01101100 to every question? And if all this nonsense wasn't enough, they bumped it up to 3 comps and 1 human pretending to be a comp. This isn't the Turing Test - it's complete LARPing!

[1] - https://redirect.cs.umbc.edu/courses/471/papers/turing.pdf



