The LMSYS leaderboards are crowdsourced and would be hard to fake, and it's showing pretty strong performance there in terms of human preference.





Crowdsourced data is the easiest to fake unless you can somehow ensure that you have a completely unbiased population (which is impossible). There's a reason why certain models do so well on upvote-based leaderboards but rank nowhere on objective tests.

Which ones? Fine-tunes are where I see most of this (I'll just call it) "model spam", but the base models don't seem to exhibit this behavior. I do see some models perform way below what their benchmark scores would suggest, though (the Phi family being the most famous example).


