They’re not wrong though. The frequency with which these things still just make shit up is astonishingly bad. Very dismissive of a legitimate criticism.
It's getting better, faster than you and I and the GP are. What else matters?
You can't bullshit your way through this particular benchmark. Try it.
And yes, they're wrong. The latest/greatest models "make shit up" perhaps 5-10% as frequently as we were seeing just a couple of years ago. Only someone who has deliberately decided to stop paying attention could possibly argue otherwise.
And yet I still can't trust Claude or o1 to consistently get even the simplest things right, such as test cases (not even full test suites, just the test cases). No amount of handholding, prompting, or feeding it examples helps in the slightest; it is consistently wrong for anything but the simplest possible examples, and verifying its output by hand takes more effort than writing it myself. I'm not even using an obscure stack or language, but with anything that isn't Python or JS it shits the bed even worse.
I have noticed it's great in the hands of marketers and scammers, however. Real good at those "jobs," so I see why the cryptobros have now moved on to hailing LLMs as the next coming of Jesus.
I still find that 'trusting' the models is a waste of time; we agree there. But I haven't had much more luck with blindly telling a junior programmer to go write something, either. The process of creating something new was, and still is, an interactive endeavor.
I do find, however, that the newer the model, the fewer elementary mistakes it makes, and the better it is at figuring out what I really want. The process of getting the right answer or the working function continues to become less frustrating over time, although not always monotonically so.
o1-pro is expensive and slow, for instance, but its performance on tasks that require step-by-step reasoning is just astonishing. As long as things keep moving in that direction I'm not going to complain (much).
That would bug me, if I were you.