Even professional human evaluators are quite vulnerable to sycophancy and overconfident-and-wrong answers. And LMArena evaluators aren't professionals.
A lot of the sycophancy mess that seeps from this generation of LLMs stems from reckless tuning on human feedback. Tuning for good LMArena performance has similar effects - and that's no coincidence.
It's also biased toward small-context performance, which is why, as a developer, I don't pay it much attention beyond a quick glance. I need performance at 40-100k tokens, which models like DeepSeek can't deliver but Gemini 2.5 Pro and ChatGPT 5.0 Thinking can.
And even "long term performance" splits itself into "performance on multi-turn instruction following" and "performance on agentic tasks" down the line. And "performance on agentic tasks" is a hydra in itself.
Capturing LLM performance with a single metric is a hopeless task. But even a single flawed metric beats no metrics at all.
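To make the "single flawed metric" point concrete: leaderboards like LMArena collapse thousands of pairwise human votes into one rating per model. Here's a minimal Elo-style sketch of that collapse (LMArena's actual aggregation differs in detail; the model names, K factor, and vote list are made up for illustration). The point is that whatever nuance exists across context lengths, turn counts, or task types, it all gets flattened into one scalar.

```python
# Minimal sketch (not LMArena's actual pipeline): collapsing pairwise human
# votes into a single Elo-style rating per model. Names and constants are
# illustrative assumptions.
from collections import defaultdict

K = 32  # update step size; an assumed constant, not LMArena's

def expected(r_a: float, r_b: float) -> float:
    """Expected win probability of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, winner: str, loser: str) -> None:
    """Shift both ratings toward the observed pairwise outcome."""
    e_w = expected(ratings[winner], ratings[loser])
    ratings[winner] += K * (1 - e_w)
    ratings[loser] -= K * (1 - e_w)

ratings = defaultdict(lambda: 1000.0)  # every model starts equal
votes = [("model-a", "model-b"), ("model-b", "model-c"), ("model-a", "model-c")]
for winner, loser in votes:
    update(ratings, winner, loser)

print(dict(ratings))  # one scalar per model, regardless of context length or task type
```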