Hacker News
Show HN: Killed by LLM – I catalogued AI benchmarks we thought would last years (r0bk.github.io)
15 points by robkop 3 days ago | hide | past | favorite | 6 comments





There is a square with an arrow (like the Volvo logo) that suggests things are clickable (to get more information?). I tried to tap the Turing test one in different places but nothing happened. I'm on iPhone with Safari.

Ahh, there's a bug with the z-index on the Turing one (I made it a "legendary card" for surviving 70 years), will fix shortly.

Here's the link for the moment: https://arxiv.org/pdf/2405.08007

Also if you want to read the original Turing paper (it was interesting to look back upon, I think the future of benchmarks may look a lot more like the Turing test): https://courses.cs.umbc.edu/471/papers/turing.pdf


Btw the conversation with the robot is super robot-like, because in real life people send two, three, even five separate messages in a row and mix topics and different conversations. It still feels very robot-like to me.

Thank you for the links!

It's interesting to see how short-lived some of these were (e.g. HumanEval).

For my year end I collected data on how quickly AI benchmarks are becoming obsolete (https://r0bk.github.io/killedbyllm/). Some interesting findings:

2023: GPT-4 was truly something new
- It didn't just beat SOTA scores, it completely saturated several benchmarks
- First time humanity created something that can beat the Turing test
- Created a clear "before/after" divide

2024: Others caught up, progress in fits and spurts
- O1/O3 used test-time compute to saturate math and reasoning benchmarks
- Sonnet 3.5/4o incremented some benchmarks into saturation, and pushed new visual evals into saturation
- Llama 3/Qwen 2.5 brought open-weight models to be competitive across the board

And yet with all these saturated benchmarks, I personally still can't trust a model to do the same work as a junior - our benchmarks aren't yet measuring real-world reliability.

Data & sources (if you'd like to contribute): https://github.com/R0bk/killedbyllm

Interactive timeline: https://r0bk.github.io/killedbyllm/
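The "time to saturation" framing above can be sketched in a few lines of Python. The years here are rough illustrations taken from the thread (Turing's 1950 paper, GPT-4 in 2023, HumanEval's short life), not values pulled from the dataset:

```python
# Sketch: how long a benchmark "lived" between its introduction and
# the year a model effectively saturated it.
# Years are approximate illustrations, not data from the repo.
benchmarks = [
    ("Turing test", 1950, 2023),  # the "legendary card", ~70 years
    ("HumanEval", 2021, 2024),    # saturated within a few years
]

for name, introduced, saturated in benchmarks:
    lifespan = saturated - introduced
    print(f"{name}: saturated after {lifespan} years")
```

A real version would read the release/saturation dates from the JSON in the repo instead of a hardcoded list.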

P.S. I've had a hard time deciding what benchmarks are significant enough to include. If you know of other benchmarks (including those yet to be saturated) that help answer "can AI do X" questions then please let me know.




