
Just saw this, might get lost in the noise, but just for posterity, apparently the Gemma 2 models were specifically RL’d to index on Chat Arena performance: https://x.com/natolambert/status/1806384821826109597

(Relevant sections of the paper highlighted.)




On prompts only, with answers presumably from the teacher model (Gemini).

It was not trained or RLHF'd on Arena replies or user preferences.
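
For concreteness, prompt-only distillation roughly means: take the collected prompts, have the stronger teacher model answer them, and fine-tune the student on those (prompt, teacher answer) pairs. A minimal sketch below; the function names and the teacher/student interfaces are hypothetical, not Google's actual pipeline:

    from typing import Callable

    def build_distillation_set(prompts: list[str],
                               teacher_generate: Callable[[str], str]) -> list[dict]:
        """Ask the stronger teacher model to answer each prompt.

        Only the prompts come from Arena-style datasets; the responses
        come from the teacher, not from Arena users or their votes.
        """
        return [{"prompt": p, "response": teacher_generate(p)} for p in prompts]

    # Usage (teacher_generate would wrap the stronger model's API;
    # student.finetune is a stand-in for whatever SFT routine is used):
    # sft_data = build_distillation_set(arena_prompts, teacher_generate)
    # student.finetune(sft_data)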


Yes, the answers were distilled from a much stronger model. On one hand, you can argue this is exactly what the LMSYS, WildBench, etc. datasets are for: improving performance/alignment on real-world use cases. On the other hand, training on the questions, most of which are reused repeatedly by Chat Arena's (largely non-representative) users to compare and test models, makes Chat Arena Elo less useful as a model-comparison tool and artificially inflates Gemma 2's Arena score relative to its OOD performance.

At the end of the day, optimizing for leaderboard scoring makes the leaderboard ranking less useful as a benchmark (Goodhart's law strikes again). The Gemma team obviously isn't the only one doing this, but it's important to be clear-eyed about the consequences.
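
To see why this skews the ranking: Chat Arena aggregates pairwise human votes into a rating (the exact aggregation has varied; a classic Elo update captures the intuition). If a model has effectively been trained on the most common Arena prompts, it wins more of exactly the battles that feed this update, so its rating rises without any corresponding gain on prompts it hasn't seen. A simplified sketch, not LMSYS's actual computation:

    def expected_score(r_a: float, r_b: float) -> float:
        # Probability that model A beats model B under the Elo model.
        return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

    def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
        # One head-to-head vote moves both ratings toward the observed outcome.
        e_a = expected_score(r_a, r_b)
        s_a = 1.0 if a_won else 0.0
        return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

    # Example: two equally rated models; A wins a vote, e.g. because it has
    # already seen this prompt in training.
    print(elo_update(1000.0, 1000.0, a_won=True))  # -> (1016.0, 984.0)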




