Claude is the best example of benchmarks not being reflective of reality. All the AI labs are so focused on improving benchmark scores but when it comes to providing actual utility Claude has been the winner for quite some time.
Which isn’t to say that benchmarks aren’t useful. They surely are. But labs are clearly both overtraining and overindexing on benchmarks.
Coming from gamedev I’ve always been significantly more yolo trust your gut than my PhD co-workers. Yes data is good. But I think the industry would very often be better off trusting guts and not needing a big huge expensive UX study or benchmark to prove what you can plainly see.
Which isn’t to say that benchmarks aren’t useful. They surely are. But labs are clearly both overtraining and overindexing on benchmarks.
Coming from gamedev I’ve always been significantly more yolo trust your gut than my PhD co-workers. Yes data is good. But I think the industry would very often be better off trusting guts and not needing a big huge expensive UX study or benchmark to prove what you can plainly see.