Yes, answers were distilled from a much stronger model. On the one hand, you can argue this is exactly what the LMSYS, WildBench, etc. datasets are for (improving performance/alignment on real-world use cases). On the other hand, training on the questions, many of which ChatArena users (a group hardly representative of the general population) reuse to compare and test models, clearly makes ChatArena Elo less useful as a model-comparison tool and artificially inflates Gemma 2's ChatArena score relative to its out-of-distribution performance.

At the end of the day, optimizing for leaderboard scoring makes the leaderboard ranking less useful as a benchmark (Goodhart's law strikes again). The Gemma team obviously isn't the only one doing this, but it's important to be clear-eyed about the consequences.