Exactly what I’ve found. Try giving an LLM any novel easy problem (IMO 2024 Problem 5 is a good example), and absolutely every LLM out there fails miserably.

If I'm reading this right, these are narrowly-applicable purpose-built systems, not general-purpose LLMs.

> AlphaProof, a new reinforcement-learning based system for formal math reasoning

> AlphaGeometry 2, an improved version of our geometry-solving system


The problem is this: if a problem is formulated like "There are 7 pirates on a ship and 61 cannon balls, ..." somewhere in the training data (it doesn't matter what the problem actually is; say the solution involves some dynamic programming algorithm), and the LLM is then given a problem starting with:

"There are 7 cats in a playground and 61 plushies, ..." (insert basically the same problem, requiring the same solution)

Well... then the LLM will be able to solve it.

And many people will consider it's a novel problem and hence a resounding success.

I mean: it is a success, but not anywhere near as impressive as most think.
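To make that concrete, here is a minimal Python sketch. The comment above deliberately leaves the underlying problem unspecified, so the concrete task below (a standard counting DP: in how many ways can n identical items be split among k recipients) is only a hypothetical stand-in of my choosing:

    # Hypothetical stand-in problem: count the ways to hand out `items`
    # identical things to `recipients` distinguishable recipients.
    def count_distributions(items: int, recipients: int) -> int:
        # ways[j] = number of ways to hand out j items using the recipients processed so far
        ways = [1] + [0] * items
        for _ in range(recipients):
            # In-place prefix sum: the current recipient takes anywhere from 0 to j items.
            for j in range(1, items + 1):
                ways[j] += ways[j - 1]
        return ways[items]

    # The surface story is irrelevant; both calls are literally the same computation.
    print(count_distributions(61, 7))  # 61 cannon balls among 7 pirates
    print(count_distributions(61, 7))  # 61 plushies among 7 cats

Swapping pirates for cats changes nothing about the solution the model has to produce, which is why solving the reworded version says little about handling genuinely novel problems.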


I had the same experience with the strawberry test. For the models that supposedly gave the right answer, I tried again with "strawerberry" and they promptly failed miserably.

Now, given the token encoding, I don't think naive letter counting is something we should expect from LLMs, but it still serves as a nice reminder to actually ensure the test/validation data is not part of the training data.
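For what it's worth, the token/character mismatch is easy to see directly. A rough sketch, assuming the tiktoken package and its cl100k_base encoding (other models use other tokenizers, so the exact splits will differ):

    # LLMs see subword token IDs, not characters, so "count the r's" is not a
    # native operation for them. Requires the `tiktoken` package.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    for word in ["strawberry", "strawerberry"]:
        pieces = [enc.decode([t]) for t in enc.encode(word)]
        # The misspelling shifts the token boundaries, so a memorized answer for
        # "strawberry" does not carry over to "strawerberry".
        print(word, "->", pieces, "| actual r count:", word.count("r"))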