
The LMSYS leaderboards are crowdsourced and would be hard to fake, and it's showing pretty strong performance in terms of human preference.



Crowdsourced data is the easiest to fake unless you can somehow ensure that you have a completely unbiased population (which is impossible). There's a reason why certain models do so well on upvote-based leaderboards but rank nowhere on objective tests.


Which ones? I think fine-tunes are where I see most of this (I'll just call it) "model spam", but the base models don't seem to exhibit this behavior. I do see some models perform way below the curve compared to their benchmark performance, though (Phi family being the most famous).



