Exactly what I’ve found. Try giving an LLM any novel easy problem (IMO 2024 Problem 5 is a good example), and absolutely every LLM out there fails miserably.
The problem is this: say a problem is formulated as "There are 7 pirates on a ship and 61 cannon balls, ..." (the specifics don't matter; say the solution involves some dynamic programming algorithm), and the LLM has seen in its training data a problem starting with:
"There are 7 cats in a playground and 61 plushies, ..." (basically the same problem, requiring the same solution).
Well... then the LLM will be able to solve it.
And many people will consider it a novel problem and hence a resounding success.
I mean: it is a success, but not anywhere near as impressive as most think.
I saw the same thing with the strawberry test. For the models that supposedly gave the right answer, I tried again with "strawerberry" and they promptly failed miserably.
Now, given the token encoding, I don't think naive letter counting is something we should expect from LLMs, but it still serves as a nice reminder to actually ensure the test/validation data is not part of the training data.
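For reference, the counting task itself is trivial at the character level; the difficulty for LLMs comes from operating on tokens rather than characters. A minimal sketch (the misspelled variant here is the same one I used above):

```python
from collections import Counter

def count_letter(word: str, letter: str) -> int:
    # Naive character-level count: trivial in code,
    # but LLMs see tokens (e.g. "straw" + "berry"),
    # not individual characters.
    return Counter(word)[letter]

print(count_letter("strawberry", "r"))    # 3
print(count_letter("strawerberry", "r"))  # 4 -- the misspelling adds an extra "r"
```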