
That doesn't show that they "fall over". All of the degraded performances are still highly non-trivial. And even the paper admits humans would show degraded performance on counterfactuals as well; the authors suggest humans might avoid it only given "enough time to reason and revise", something the LLMs being evaluated don't get to do here.

If you took arithmetic tests in base 8, you wouldn't reach the same accuracy either.
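The base-8 point is easy to check mechanically. A quick Python sketch of what such a counterfactual arithmetic item looks like (the specific example numbers are illustrative, not taken from the paper):

```python
def to_base8(n: int) -> str:
    """Render a non-negative integer in base 8."""
    return format(n, "o")

def add_base8(a: str, b: str) -> str:
    """Add two base-8 numerals and return the base-8 result."""
    return to_base8(int(a, 8) + int(b, 8))

# In base 10, 27 + 35 = 62. Read as base 8, the same digit strings
# denote 23 and 29 decimal; their sum is 52 decimal, written "64".
print(add_base8("27", "35"))  # → "64"
```

The digits look familiar but the carry rules change, which is exactly what makes the counterfactual variant harder for a test-taker trained on base 10.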




Well, sure, but the problem is that LLMs architecturally can't reason and revise. Perhaps we can chain together a system that approximates this, but then it still wouldn't be the LLM itself doing the reasoning.
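The "chain together a system" idea can be sketched as an outer loop that feeds a failed attempt back to the model. This is a minimal illustration only, assuming a hypothetical `llm(prompt)` callable and `check(answer)` verifier (not any real API); the point is that the revision lives in the loop, not in the model:

```python
def reason_and_revise(llm, check, question, max_rounds=3):
    """Generate an answer, then repeatedly ask the model to revise it.

    `llm` and `check` are hypothetical stand-ins: any callable that maps
    a prompt string to an answer string, and any answer verifier.
    """
    answer = llm(question)
    for _ in range(max_rounds):
        if check(answer):
            return answer
        # The "revision" happens here, in the scaffolding around the
        # model, not inside the model's own forward pass.
        answer = llm(f"{question}\nPrevious attempt: {answer}\nRevise it.")
    return answer

# Stub demonstration: this toy "model" only succeeds when asked to revise.
demo_llm = lambda p: "right" if "Revise" in p else "wrong"
print(reason_and_revise(demo_llm, lambda a: a == "right", "2+2?"))  # prints "right"
```

Whether the loop-plus-model composite counts as "the LLM reasoning" is exactly the distinction the comment is drawing.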




