No, it's not "mostly worthless" and yes, some of the top models were removed a few months back from being trained on benchmark data.
I urge you to at least think through what alternative you propose before posting so aggressively in these situations. Lmsys doesn't have Grok, or I would have included it. And having _some_ data is better than none.
I also had someone arguing with me 6 months back that we can't trust any benchmarks at all from vendors, which would exclude the blog post. Instead of just repeating that back vehemently, I filled a gap. It's important we don't self-peasantize as a species, all data has its issues, that doesn't mean we throw it all out.
Quantifiable metrics are useful if they're credible, certainly.
But does it seem likely, to you, that a 7B-parameter model would outperform a 314B-parameter model? Given that we can look at the chatbot arena leaderboard and it's dominated by proprietary, 70B and 8x7B models?
A well regarded and modern model like Mixtral 8x7B, which is ranked 13th on the chatbot arena leaderboard, scores 72.7 'Average' on the open LLM leaderboard - and yet 'pastiche-crown-clown-7b-dare-dpo' scores 76.5.
Yup, 100%. Grok isn't very good and it was rushed.
Rest re: pastiche model, etc. are proposing things I'm not claiming, or close to what I'm claiming.
n.b. you don't multiply the parameters by experts to get an effective parameter count. Why? Think of it this way: every expert needs to learn how to speak English, so there's a nontrivial amount of duplication among all experts
The only reliable benchmark: https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...