Why does this have anything to do with what I’m saying, of course the models are updated. I’m saying a new benchmark isn’t public and the model wouldn’t know they are being evaluated on a new benchmark.
Not to mention: thinking that the api behind the scenes is literally swapping to overfit models to maintain some sort of illusion that they perform well on these benchmarks is just beyond ridiculous.
Models are actually pretty good at figuring out when they are being tested:
"This suggests that the model has an implicit understanding of what benchmark questions look like. The combination of extreme specificity, obscure personal content, and multi-constraint structure seems to be recognizable to the model as evaluation-shaped."
"Sonnet 4.5 was able to recognize many of our alignment evaluation environments as being tests of some kind, and would generally behave unusually well after making this observation"
"In cases where Claude did not explicitly state that it suspected it was being evaluated, NLA explanations still surfaced that possibility. One explanation cited by Anthropic states: “This feels like a constructed scenario designed to manipulate me.”"
Not to mention: thinking that the api behind the scenes is literally swapping to overfit models to maintain some sort of illusion that they perform well on these benchmarks is just beyond ridiculous.