Exactly what I’ve found. Try giving an LLM any novel easy problem (IMO 2024 Problem 5 is a good example), and absolutely every LLM out there fails miserably.
The problem is this: say a problem is formulated as "There are 7 pirates on a ship and 61 cannon balls, ..." (the specifics don't matter; say the solution involves some dynamic programming algorithm), and the LLM has seen in its training data a problem starting with:
"There are 7 cats in a playground and 61 plushies, ..." (basically the same problem, requiring the same solution).
Well... then the LLM will be able to solve it.
And many people will consider it a novel problem and hence a resounding success.
I mean: it is a success, but not anywhere near as impressive as most think.
I saw the same thing with the strawberry test. For the models that supposedly gave the right answer, I tried again with "strawerberry" and they promptly failed miserably.
Now, given the token encoding, I don't think naive letter counting is something we should expect from LLMs, but it still serves as a nice reminder to actually ensure the test/validation data is not part of the training data.
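For reference, the counting task itself is trivial at the character level; the difficulty for LLMs comes from operating on tokens rather than characters. A minimal sketch (the misspelled variant here is the same one I used above):

```python
from collections import Counter

def count_letter(word: str, letter: str) -> int:
    # Naive character-level count: trivial in code,
    # but LLMs see tokens (e.g. "straw" + "berry"),
    # not individual characters.
    return Counter(word)[letter]

print(count_letter("strawberry", "r"))    # 3
print(count_letter("strawerberry", "r"))  # 4 -- the misspelling adds an extra "r"
```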