LLM Hallucination Benchmark: R1, o1, o3-mini, Gemini 2.0 Flash Think Exp 01-21

jszymborski · 2025-02-10T20:08:55

Some very odd choices in that first plot. Lower is better, but also the x-axis is inverted such that higher scores go towards the left.

zone411 · 2025-02-10T20:18:50

Not ideal, but the reason for this is that people have gotten used to larger bars indicating better performance on bar charts when evaluating LLMs. That includes people who were confused by the older version of this very benchmark.

Two bar charts are also shown, along with a link to https://lechmazur.github.io/leaderboard1.html

jszymborski · 2025-02-10T20:36:42

After a bout of confusion, reading further, including those charts, is what made me realize that my initial reading of the first chart was wrong.

IMHO, it's still the wrong choice. If one feels like their audience doesn't understand "Lower=Better", then I feel like the solution is to plot the inverse or the difference (I'm not familiar with this score), so that higher really does mean better. Breaking the x-axis convention invites confusion, especially alongside the "Lower=Better" disclaimer (again, IMHO).
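
To make the "plot the difference" idea concrete, here's a minimal sketch. It assumes the score is a percentage between 0 and 100 where lower is better, which may not match how this benchmark actually defines its score. The model names, the numbers, and the `to_higher_is_better` helper are all made up for illustration.

```python
# Hypothetical lower-is-better scores in percent.
# These are made-up numbers, NOT results from the benchmark.
scores = {"model_a": 12.5, "model_b": 30.0, "model_c": 4.0}


def to_higher_is_better(score_pct: float, max_score: float = 100.0) -> float:
    """Return the difference from the maximum, so a larger value is a better result.

    Assumes the score lies in [0, max_score]; that range is an assumption here.
    """
    if not 0.0 <= score_pct <= max_score:
        raise ValueError(f"score {score_pct} is outside [0, {max_score}]")
    return max_score - score_pct


flipped = {name: to_higher_is_better(s) for name, s in scores.items()}

# Sort best-first and draw a plain-text bar chart on a normal axis:
# values grow to the right, and a longer bar is better.
for name, value in sorted(flipped.items(), key=lambda kv: kv[1], reverse=True):
    bar = "#" * round(value / 5)
    print(f"{name:8s} {value:5.1f} {bar}")
```

With the transformed values, the axis can keep its usual direction and no "Lower=Better" note is needed; the axis label just has to say what the new quantity is.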