Great catch. Super disappointing that AI companies continue to do things like this. It's a great result either way but predictably the excitement is focused on the jump from o1, which is now in question.
To me it's very frustrating because such little caveats make benchmarks less reliable. Implicitly, benchmarks are no different from tests in that someone/something who scores high on a benchmark/test should be able to generalize that knowledge out into the real world.
While that is true of humans taking tests, it's not really true of AIs being evaluated on benchmarks.
SWE-bench is a great example. Claude Sonnet can get something like 50% on SWE-bench Verified, whereas I think I might be able to score 20-25%? So, by that measure, Claude is a better programmer than me.
Except that isn't really true. Claude can still make a lot of clumsy mistakes. I wouldn't even say these are junior engineer mistakes. I've used it for creative programming tasks and have found one example where it tried to use a library written for d3js for a p5js programming example. The confusion is kind of understandable, but it's also a really dumb mistake.
Some very simple explanations: the models were probably overfitted to a degree on Python given its popularity in AI/ML work, and SWE-bench is all Python. Also, the underlying GitHub issues are quite old, so they have probably contaminated the training data and the models have simply memorized the answers.
Or maybe benchmarks are just bad at measuring intelligence in general.
Regardless, every time a model beats a benchmark I'm annoyed by the fact that I have no clue whatsoever how much this actually translates into real world performance. Did OpenAI/Anthropic/Google actually create something that will automate wide swathes of the software engineering profession? Or did they create the world's most knowledgeable junior engineer?
> Some very simple explanations: the models were probably overfitted to a degree on Python given its popularity in AI/ML work, and SWE-bench is all Python. Also, the underlying GitHub issues are quite old, so they have probably contaminated the training data and the models have simply memorized the answers.
My understanding is that SWE-bench works by checking whether the proposed solution passes the test cases included in the original (human) PR. This seems to present some problems too, because there are surely ways to write code that passes the tests but would fail human review for one reason or another. It would be interesting to see not only the pass rate but also the rate at which the proposed solutions are preferred to the original ones (preferably evaluated by a human, but even an LLM comparing the two solutions would be interesting).
If I recall correctly the authors of the benchmark did mention on Twitter that for certain issues models will submit an answer that technically passes the test but is kind of questionable, so yeah, good point.