Even professional human evaluators are quite vulnerable to sycophancy and overconfident-and-wrong answers. And LMArena evaluators aren't professionals.
A lot of the sycophancy mess that seeps from this generation of LLMs stems from reckless tuning on human feedback. Tuning for good LMArena performance has similar effects - and that's no coincidence.
It's also biased toward small-context performance, which is why, as a developer, I don't pay it much attention beyond a quick glance. I need performance at 40-100k tokens, which models like DeepSeek can't deliver but Gemini 2.5 Pro and ChatGPT 5.0 Thinking can.
And even "long term performance" splits itself into "performance on multi-turn instruction following" and "performance on agentic tasks" down the line. And "performance on agentic tasks" is a hydra in itself.
Capturing LLM performance with a single metric is a hopeless task. But even a single flawed metric beats no metrics at all.
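To make the "single flawed metric" point concrete: leaderboards like LMArena collapse thousands of pairwise human votes into one rating per model. Here's a minimal Elo-style sketch of that collapse (LMArena's actual aggregation differs in detail; the model names, K factor, and vote list are made up for illustration). The point is that whatever nuance exists across context lengths, turn counts, or task types, it all gets flattened into one scalar.

```python
# Minimal sketch (not LMArena's actual pipeline): collapsing pairwise human
# votes into a single Elo-style rating per model. Names and constants are
# illustrative assumptions.
from collections import defaultdict

K = 32  # update step size; an assumed constant, not LMArena's

def expected(r_a: float, r_b: float) -> float:
    """Expected win probability of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, winner: str, loser: str) -> None:
    """Shift both ratings toward the observed pairwise outcome."""
    e_w = expected(ratings[winner], ratings[loser])
    ratings[winner] += K * (1 - e_w)
    ratings[loser] -= K * (1 - e_w)

ratings = defaultdict(lambda: 1000.0)  # every model starts equal
votes = [("model-a", "model-b"), ("model-b", "model-c"), ("model-a", "model-c")]
for winner, loser in votes:
    update(ratings, winner, loser)

print(dict(ratings))  # one scalar per model, regardless of context length or task type
```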