> I want to be charitable and assume genuine oversight rather than "benchmaxxing", probably an easy to miss thing if you are new to benchmarking
I don't doubt that it's an oversight. It does, however, say something about the researchers that they didn't look at a single output, where they would have immediately caught this.
Given the decrease in the benchmark score from the correction, I don't think you can assume they didn't check a single output. Clearly the model is still very capable, and the model cheating on its results didn't affect most of the benchmark.