> But today’s A.I. systems can pass the Turing Test with flying colors, and researchers have had to come up with new, harder evaluations.
It’s a bit petty to rag on a single sentence… but why do editors still let falsehoods like this slide? There is not a single LLM that can pass a properly administered Turing test. Just last week I saw GPT-4 badly fail a de facto Turing test because it wasn’t able to count to 17. The idea that the Turing test means “do laypeople find the dialogue eerily human-like?” is one of the tech community’s most pernicious bits of nonsense. And here it is, repeated uncritically in the New York Times. Extremely frustrating.
Yeah, we are struggling with this question too. An honest subjective assessment on everyday tasks has value, just as benchmarks do, but no one has really solved this yet.
I disagree. Just look at the exact definition of the original Turing test:
> https://en.wikipedia.org/w/index.php?title=Turing_test&oldid...
I do claim that a trained judge, when confronted with two entities (one person, one computer/AI), can easily come up with questions that enable them to distinguish the person from the AI in most cases. For inspiration, just look at older HN discussions about hallucinations, or at what AIs do in "unexpected situations", such as being handed nonsense text like "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" etc.
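To make the idea concrete, here is a minimal sketch (not anyone's actual test protocol) of the kind of probe described above: hand the candidate a nonsense string and check whether it can count its characters exactly, the sort of task the GPT-4 counting failure mentioned earlier points at. `ask_model` is a hypothetical placeholder for whatever chat interface the judge happens to use.

```python
# Sketch of a "judge probe": exact character counting on nonsense input.
# `ask_model` is a stand-in for the judge's chat interface (human or AI),
# not a real library call.
import random
import re

def make_probe(min_len: int = 80, max_len: int = 200) -> tuple[str, int]:
    """Build a nonsense-string counting probe and its expected answer."""
    n = random.randint(min_len, max_len)
    prompt = (
        f'How many characters are in the following text: "{"a" * n}"? '
        "Reply with a number only."
    )
    return prompt, n

def judge(ask_model, trials: int = 5) -> float:
    """Return the fraction of probes the candidate answers exactly right."""
    correct = 0
    for _ in range(trials):
        prompt, expected = make_probe()
        reply = ask_model(prompt)
        digits = re.findall(r"\d+", reply)
        if digits and int(digits[0]) == expected:
            correct += 1
    return correct / trials

# Usage (hypothetical): score = judge(ask_model=my_chat_function)
# A careful human will count (or refuse); models often guess a round number.
```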