What's Going on with the Open LLM Leaderboard? (huggingface.co)
40 points by tim_sw on June 23, 2023 | 2 comments



From the article:

> We can see that for the same dataset, both absolute scores and model rankings (see the first figure) are very sensitive to the evaluation method we decide to use.

[...]

> Different models may fare differently when evaluated one way or another as we see above when the rankings change. To keep this as fair as possible, one may be tempted to select an implementation where the average score for all tested models is the highest so that we "unlock" as much capabilities as possible from the models. In our case, that would mean using the loglikelihood option of the original implementation. But as we saw above, using the loglikelihood is also giving some indications to the model in some way by restricting the scope of possible answers, and thus is helping the less powerful models maybe too much. Also Loglikelihood is easy to access for open-source models but is not always exposed for closed source API models.
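To make the loglikelihood point concrete, here is a minimal sketch of how loglikelihood-based multiple-choice scoring typically works (my own illustration using transformers and GPT-2, not code from the article). The model never generates text; it only has to rank a handful of pre-written answers:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    prompt = "Question: What is the capital of France?\nAnswer:"
    choices = [" Paris", " London", " Berlin", " Madrid"]

    def choice_loglikelihood(prompt, choice):
        # Sum the log-probabilities of the choice tokens, conditioned on
        # the prompt; the prompt tokens themselves are not scored. Note:
        # this assumes the tokenizer does not merge tokens across the
        # prompt/choice boundary -- exactly the kind of detail the
        # article warns about.
        prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
        full_ids = tokenizer(prompt + choice, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(full_ids).logits
        log_probs = torch.log_softmax(logits, dim=-1)
        total = 0.0
        for pos in range(prompt_len, full_ids.shape[1]):
            # logits at position pos-1 predict the token at position pos
            total += log_probs[0, pos - 1, full_ids[0, pos]].item()
        return total

    scores = {c: choice_loglikelihood(prompt, c) for c in choices}
    print(max(scores, key=scores.get))

Because the model is only asked to rank four pre-written strings rather than produce an answer from scratch, even a weak model can score well far more easily, which is the "helping the less powerful models" effect the article describes.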

---

My takeaway is that this space is evolving in near real-time, with so many runtime variables that building a reliable, reproducible, high-quality measurement harness is a significant challenge. Such a harness would give everyone solid footing from which to begin their own evaluations of how the various LLMs perform on their use cases.

I'm glad HF is giving the community the opportunity to weigh in with their thoughts. Given that the LLM community is strong and learning fast, I am confident this is the best way to get to a solid place that provides the most meaningful results in the short term while building out long-term viability and success.

---

We are currently using the HF LLM Leaderboard only as a starting point: a way to wade through the large volume of available models and narrow it down to a manageable set that we can then optimize against our own internal metrics.

Kudos to the HF team for this solid breakdown and approach.


Also from the article:

"A key takeaway lesson from our journey is that evaluations are strongly tied to their implementations–down to minute details such as prompts and tokenization."

In line with what you were saying about the HF LLM Leaderboard only being a starting point, I wish it had more (selectable) columns with significant hyperparameters, like the tokenization method or the context window size.

A way to group related models would also be nice.

I think that would make the Leaderboard even more useful than it already is.
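For what it's worth, some of this can be approximated today by post-processing the leaderboard data yourself. A minimal sketch, assuming the table has been exported to CSV; the file name and column names (model, base_model, average_score) are illustrative assumptions, not the leaderboard's actual schema:

    import pandas as pd

    # Hypothetical export of the leaderboard table; these column
    # names are assumptions for illustration, not the real schema.
    df = pd.read_csv("open_llm_leaderboard.csv")

    # Group related fine-tunes under their base model and keep the
    # best-scoring variant of each family.
    best_per_family = (
        df.sort_values("average_score", ascending=False)
          .groupby("base_model", as_index=False)
          .first()
    )
    print(best_per_family[["base_model", "model", "average_score"]])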



