There is a square with an arrow, like the Volvo logo, that suggests things are clickable (to get more information?). I tried tapping the Turing test one in different places but nothing happened - I am on an iPhone with Safari.
Also if you want to read the original Turing paper (it was interesting to look back upon, I think the future of benchmarks may look a lot more like the Turing test): https://courses.cs.umbc.edu/471/papers/turing.pdf
Btw the conversation with the robot is super robot-like, because in real life people send two, three, even five separate messages in a row and mix topics and different conversations. It still feels very robot-like to me.
For my year end I collected data on how quickly AI benchmarks are becoming obsolete (https://r0bk.github.io/killedbyllm/). Some interesting findings:
2023: GPT-4 was truly something new
- It didn't just beat SOTA scores, it completely saturated several benchmarks
- First time humanity created something that can beat the Turing test
- Created a clear "before/after" divide
2024: Others caught up, progress in fits and spurts
- o1/o3 used test-time compute to saturate math and reasoning benchmarks
- Sonnet 3.5/4o nudged some benchmarks into saturation, and pushed new visual evals into saturation
- Llama 3/Qwen 2.5 made open-weight models competitive across the board
And yet with all these saturated benchmarks, I personally still can't trust a model to do the same work as a junior - our benchmarks aren't yet measuring real-world reliability.
P.S. I've had a hard time deciding what benchmarks are significant enough to include. If you know of other benchmarks (including those yet to be saturated) that help answer "can AI do X" questions then please let me know.