
> I want to be charitable and assume genuine oversight rather than "benchmaxxing", probably an easy to miss thing if you are new to benchmarking

I don't doubt that it's an oversight. It does, however, say something about the researchers that they didn't look at a single output, where they would have caught this immediately.



So many data problems would be solved if everyone looked at a few outputs instead of only metrics.


Given the decrease in the benchmark score from the correction, I don't think you can assume they didn't check a single output. Clearly the model is still very capable, and its cheating didn't affect most of the benchmark.



