LLM Hallucination Benchmark: R1, o1, o3-mini, Gemini 2.0 Flash Think Exp 01-21 (github.com/lechmazur)
17 points by zone411 39 days ago | 3 comments



Some very odd choices in that first plot. Lower is better, but also the x-axis is inverted such that higher scores go towards the left.


Not ideal, but the reason for this is that people have gotten used to larger bars indicating better performance on bar charts when evaluating LLMs. Some were even confused by the older version of this very benchmark.

Two bar charts are also shown, along with a link to https://lechmazur.github.io/leaderboard1.html


Reading further, including those charts, is what made me realize, after a bout of confusion, that my initial reading of the first chart was wrong.

IMHO, it's still the wrong choice. If you feel your audience doesn't understand "Lower=Better", then I think plotting the inverse or the difference (I'm not familiar with this score) is the solution. Breaking the x-axis convention invites confusion, especially alongside the "Lower=Better" disclaimer (again, IMHO).
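
To make the "plot the difference" suggestion concrete, here is a minimal sketch. It assumes the score is a percentage in [0, 100] where lower is better, which is only an assumption since the thread does not define the score. The model names and values are made-up placeholders, not benchmark results, and the helper names are hypothetical.

```python
# Hypothetical lower-is-better scores in percent; placeholder values, not real results.
scores = {"model_a": 12.5, "model_b": 30.0, "model_c": 4.0}


def to_higher_is_better(score_pct, max_score=100.0):
    """Return the complement so that a larger value means better performance."""
    return max_score - score_pct


def text_bar_chart(values, width=40, max_value=100.0):
    """Render conventional left-to-right bars as text lines, best model first."""
    lines = []
    for name, value in sorted(values.items(), key=lambda kv: kv[1], reverse=True):
        bar = "#" * round(width * value / max_value)
        lines.append(f"{name:<10} {bar} {value:.1f}")
    return lines


# Transform once, then plot on an ordinary (non-inverted) axis.
transformed = {name: to_higher_is_better(s) for name, s in scores.items()}
for line in text_bar_chart(transformed):
    print(line)
```

With this transform, longer bars mean better results and the axis keeps its usual direction, so no "Lower=Better" disclaimer is needed on the chart.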



