LLM Hallucination Benchmark: R1, o1, o3-mini, Gemini 2.0 Flash Think Exp 01-21 (github.com/lechmazur)
17 points by zone411 7 months ago | 3 comments


Some very odd choices in that first plot. Lower is better, but also the x-axis is inverted such that higher scores go towards the left.


Not ideal, but the reason for this is that people have gotten used to larger bars indicating better performance on bar charts when evaluating LLMs. That included people being confused by the older version of this very benchmark.

Two bar charts are also shown, along with a link to https://lechmazur.github.io/leaderboard1.html


After a bout of confusion, reading further, including those charts, is what made me understand that my initial reading of the first chart was wrong.

IMHO, it's still the wrong choice. If you feel your audience doesn't understand "Lower=Better", then I think plotting the inverse or the difference (I'm not familiar with this score) is the solution. Breaking the x-axis convention invites confusion, especially alongside the "Lower=Better" disclaimer (again, IMHO).
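
To make "plot the difference" concrete, here is a rough sketch. It assumes the score is a percentage between 0 and 100, which I have not verified for this benchmark. The model names and numbers are made up for illustration, and `to_higher_is_better` and `text_bar_chart` are hypothetical helpers, not anything from the linked repo.

```python
# Hypothetical data: invented percentages, NOT results from the benchmark.
hallucination_rate = {"model_a": 12.0, "model_b": 35.5, "model_c": 20.0}


def to_higher_is_better(rates, max_score=100.0):
    """Turn a lower-is-better percentage into its complement (max_score - rate)."""
    return {name: max_score - rate for name, rate in rates.items()}


def text_bar_chart(scores, width=40, max_score=100.0):
    """Render horizontal bars on a normal left-to-right axis, best model first."""
    lines = []
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    for name, score in ranked:
        bar = "#" * round(width * score / max_score)
        lines.append(f"{name:<10} {bar} {score:.1f}")
    return "\n".join(lines)


# Longer bar now means better, and the axis still grows to the right.
non_hallucination_rate = to_higher_is_better(hallucination_rate)
print(text_bar_chart(non_hallucination_rate))
```

With the complement plotted, larger bars mean better performance, so neither an inverted axis nor a "Lower=Better" note is needed.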



