
I think LLMs aren't better than the median software developer at LeetCode. They simply have a compressed database of the (stolen) answers. If any software developer had access to Google in an interview, he could "solve" all of the problems instantly.





> he could "solve" all of the problems instantly

Set up an experiment: select 100 random software developers around the world and test this hypothesis. You're in for a surprise.

Nevertheless, I bet most developers who couldn't "solve" a LeetCode challenge in a couple of hours even with access to Google would still perform much better than o1 on real-world GitHub issues in their technical domain.

This gets at the essence of OP's question about why the focus is on Codeforces. And it shows me that "intelligence" involves a dimension that isn't purely logical and that we don't understand yet.


Exactly what I’ve found. Try giving an LLM any novel, easy problem (IMO 2024 Problem 5 is a good example), and absolutely every LLM out there fails miserably.


If I'm reading this right, these are narrowly-applicable purpose-built systems, not general-purpose LLMs.

> AlphaProof, a new reinforcement-learning based system for formal math reasoning

> AlphaGeometry 2, an improved version of our geometry-solving system


The problem is when a problem is formulated like this: "There are 7 pirates on a ship and 61 cannonballs, ..." (it doesn't matter what the problem is: say the solution involves some dynamic programming algorithm), and the LLM has seen in its training data a problem starting with:

"There are 7 cats in a playground and 61 plushies, ..." (insert basically the same problem, requiring the same solution)

Well... then the LLM will of course be able to solve it.

And many people will consider it a novel problem, and hence a resounding success.

I mean: it is a success, but not anywhere near as impressive as most think.


I had the same experience with the strawberry test. For the models that supposedly gave the right answer, I tried again with "strawerberry" and they promptly failed miserably.

Now, given the token encoding, I don't think naive letter counting is something we should expect from LLMs, but it still serves as a nice reminder to actually ensure the test/validation data is not part of the training data.
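
To illustrate the tokenization point, here's a minimal sketch using the open-source tiktoken library (my assumption; the actual encoding varies by model) that shows the sub-word pieces a model works with instead of individual letters:

    # Minimal sketch: inspect how a BPE tokenizer splits words.
    # Assumes `pip install tiktoken`; cl100k_base is just one example encoding.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    for word in ["strawberry", "strawerberry"]:
        token_ids = enc.encode(word)
        # Decode each token id back to its text piece so we can see
        # the chunks the model actually receives (not letters).
        pieces = [enc.decode_single_token_bytes(t).decode("utf-8") for t in token_ids]
        print(word, "->", pieces)

The point is that the model never sees "r", "r", "r" as separate symbols; it sees a handful of multi-character chunks, so counting letters has to be inferred rather than read off the input.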



