This benchmark is mostly worthless: it's well known in the community that some of the top models there were trained on benchmark data.

The only reliable benchmark: https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...




No, it's not "mostly worthless", and yes, some of the top models were removed a few months back for having been trained on benchmark data.

I urge you to at least think through what alternative you propose before posting so aggressively in these situations. Lmsys doesn't have Grok, or I would have included it. And having _some_ data is better than none.

I also had someone arguing with me 6 months back that we can't trust any benchmarks from vendors at all, which would exclude the blog post. Instead of just repeating that back vehemently, I filled a gap. It's important we don't self-peasantize as a species: all data has its issues, but that doesn't mean we throw it all out.


Quantifiable metrics are useful if they're credible, certainly.

But does it seem likely, to you, that a 7B-parameter model would outperform a 314B-parameter model? Especially given that the chatbot arena leaderboard is dominated by proprietary, 70B, and 8x7B models?

A well-regarded and modern model like Mixtral 8x7B, which is ranked 13th on the chatbot arena leaderboard, scores 72.7 'Average' on the open LLM leaderboard - and yet 'pastiche-crown-clown-7b-dare-dpo' scores 76.5.

To me, that sounds too good to be true.


Yup, 100%. Grok isn't very good and it was rushed.

The rest, re: the pastiche model etc., argues against things I'm not claiming, or anything close to what I'm claiming.

n.b. you don't multiply the parameters by experts to get an effective parameter count. Why? Think of it this way: every expert needs to learn how to speak English, so there's a nontrivial amount of duplication among all experts.
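
To put rough numbers on it: in a mixture-of-experts model only the expert FFN blocks are replicated; attention, embeddings, etc. are shared, and only a couple of experts run per token. A quick sketch of the arithmetic (the shared/per-expert split below is a back-of-envelope estimate implied by Mixtral's published ~46.7B total / ~12.9B active figures, not an official breakdown):

    # Rough MoE parameter arithmetic -- illustrative, not exact.
    # Only the expert FFN blocks are replicated per expert; the rest is shared,
    # and only a subset of experts is active for each token.
    def moe_params(shared_b, per_expert_b, n_experts, active_experts):
        total = shared_b + per_expert_b * n_experts
        active = shared_b + per_expert_b * active_experts
        return total, active

    # Mixtral 8x7B: roughly ~1.6B shared + ~5.6B per expert, 8 experts, 2 active per token
    print(moe_params(1.6, 5.6, 8, 2))  # -> (46.4, 12.8) billion, well under the naive 8 * 7B = 56B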


> n.b. you don't multiply the parameters by experts to get an effective parameter count.

I actually took the 314B from Grok's HF page [1], which describes the model as "314B parameters" when explaining why it needs a multi-GPU machine.

I certainly agree that parameter count isn't everything, though; clearly things like training data quality and fine-tuning count for a lot.

[1] https://huggingface.co/xai-org/grok-1



