>...it’s a useless tool. I don’t like collaborating with chronic liars who aren’t able to openly point out knowledge gaps...
I think a more correct take here might be "it's a tool that I don't trust enough to use without checking," or at the very least, "it's a useless tool for my purposes." I understand your point, but I got a little caught up on the above line because it's very far out of alignment with my own experience using it to save enormous amounts of time.
And as others have pointed out, this issue of “how much should I check” is really just a subset of an old general problem in trust and knowledge (“epistemology” or what have you) that people have recognized since at least the scientific revolution. The Royal Society’s motto on its founding in the 1660s was “take no man’s word for it.”
Coding agents have now got pretty good at checking themselves against reality, at least for things where they can run unit tests or a compiler to surface errors. That would catch the error in TFA. Of course there is still more checking to do down the line, in code reviews etc, but that goes for humans too. (This is not to say that humans and LLMs should be treated the same here, but nor do I treat an intern’s code and a staff engineer’s code the same.) It’s a complex issue that we can’t really collapse into “LLMs are useless because they get things wrong sometimes.”
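To be concrete about what "checking themselves against reality" means mechanically, here is a minimal sketch of that kind of loop in Python. The `ask_model()` helper is a made-up placeholder for whatever LLM/agent API you actually call; the important bit is that ground truth comes from pytest's exit code, not from anything the model says about its own work.

```python
import subprocess

def ask_model(prompt: str) -> str:
    """Hypothetical stand-in for whatever LLM/agent API you actually call."""
    raise NotImplementedError

def check_against_reality(max_rounds: int = 3) -> bool:
    """Run the real test suite; on failure, hand the real output back to the model."""
    for _ in range(max_rounds):
        result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
        if result.returncode == 0:
            return True  # the tests genuinely pass, not just "the model says so"
        # Feed the actual failure output back instead of a self-reported summary.
        ask_model("These tests failed; propose a fix:\n" + result.stdout)
        # (Applying the proposed patch is omitted from this sketch.)
    return False
```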
> Coding agents have now got pretty good at checking themselves against reality, at least for things where they can run unit tests or a compiler to surface errors.
YMMV. I've seen Claude go completely batshit insane, insisting that the tests all passed. Then I run them and see 50+ failures. I copy the output and tell him to fix it, and he goes into his sycophantic apologia before spinning his wheels, doing nothing, and saying all tests are back to green.
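These days I don't even read the agent's summary; I gate on the exit code myself. A rough sketch (assuming a pytest project) of the only check I actually trust:

```python
import subprocess
import sys

# Run the suite yourself; the exit code is the source of truth,
# not whatever the agent claims about its own work.
result = subprocess.run(["pytest", "-q", "--tb=short"], capture_output=True, text=True)

if result.returncode != 0:
    print(result.stdout)  # this is the output worth pasting back to the agent
    sys.exit("Tests are NOT green, whatever the agent says.")

print("Tests actually pass.")
```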
We double-check human work too, in all kinds of contexts.
A whole lot of my schooling involved listening to teachers repeating over and over to us how we should check our work, because we can't even trust ourselves.
(heck, I had to double-check and fix typos in this comment)
It’s very weird to me when someone says “this tool does not have this common property good tools have” and someone replies “humans also do not have those properties!” As if that is responsive to a complaint about a tool lacking a common property of tools.
We use tools because they work in ways that humans do not. Through centuries of building and using tools, we as a society have learned what makes a tool good versus bad.
Good tools are reliable. They have a clear purpose and ergonomic user interface. They are straightforward to use and transparent in how they operate.
LLMs are none of these things. It doesn’t matter that humans also are none of these things, if we are trying to use LLMs as tools.
The closest human invention resembling an LLM is the idea of a bureaucracy — LLMs are not good tools, they are not good humans, they are mindless automatons that stand in the way and lead you astray.
At best, LLMs are poor tools and also poor human replacements, which is why it’s so frustrating to me we are so intent on replacing good tools and humans with LLMs.
The reason for making that observation is that we don't have any other comparable tools, and it is more reasonable to benchmark LLMs against what humans are capable of, because whether or not they are good approximations, we are trying to model human abilities.
One day maybe they will exceed human abilities, but it's unreasonable to expect early attempts - and they are still early attempts - to do things we don't know how to do well other than by putting all kinds of complex process on top of very flawed human thinking.
We do have comparable tools to LLMs. There are plenty of human-composed tools that can do what LLMs do, like Mechanical Turk, for instance. The human-composed tool that most closely resembles an LLM is the "bureaucracy".
An LLM is like a little committee you send a request to, and based on capricious, opaque rules that can change at any time, the committee returns a response that may or may not service your request. You can't know ahead of time if it will, and depending on the time of day, the political environment, or the amount of work before the committee, the delay of servicing and the quality of service may or may not degrade. Quality may range from a quick accurate response, to flat refusal to service without explanation, or outright lies to your face. There's no way to guarantee a good result, and there's no recourse or explanation for why things go wrong or changed.
LLMs feel like a customer service agent turned into a computer program, which is probably why it was people's first thought to use LLMs to automate customer service agents. They are a perfect fit there, but I don't want them to be my primary interface to do work. I have enough bureaucracies to deal with as it is.
> We do have comparable tools to LLMs. There are plenty of human-composed tools that can do what LLMs do, like Mechanical Turk for instance.
If you are going to treat humans as tools, then sure. In which case measuring LLMs against human ability is exactly the right thing, given that with Mechanical Turk the tasks are carried out by humans - sometimes with the help of LLMs...
It's utterly bizarre to argue over my comparing LLMs to humans when the tools you argue are comparable are humans.
> when the tools you argue are comparable are humans.
No, they are abstractions over humans. A group of people is not a person; groups behave differently from the individuals they are composed of. Abstractions are hard to compare, but still much easier than people.
It's not a meaningless difference, it's a crucial one. The contradiction only exists if you collapse all the differences between people and abstractions of people -- crucially that the former are people and the latter are abstractions -- and claim they're the same. Which they are not.
Anyway, we've gotten far from the point, which is that LLMs are not people and you can't treat them as such.
> Checking is usually faster than writing from scratch
Famous last words. Checking trivial code for trivial bugs, yes. In science, you can have very subtle bugs that bias your results in ways that aren't obvious for a while until suddenly you find yourself retracting papers.
I've used LLMs to write tedious code (that should probably have been easier if the right API had been thought through), but when it comes to the important stuff, I'll probably always write an obviously correct version first and then let the LLM try to make a faster/more capable version that I can check against the correct one.
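To make that concrete, here's roughly the shape of it, with made-up function names (none of this is from a real project): a slow but obviously correct reference, the LLM's "faster" attempt, and a randomized differential check between the two.

```python
import random

def dedupe_reference(items):
    """Obviously correct version: quadratic, but I can explain every line."""
    out = []
    for x in items:
        if x not in out:
            out.append(x)
    return out

def dedupe_fast(items):
    """Stand-in for the LLM's optimized version (here just dict.fromkeys)."""
    return list(dict.fromkeys(items))

# Differential check: the two versions must agree on many random inputs.
for _ in range(10_000):
    data = [random.randint(0, 20) for _ in range(random.randint(0, 50))]
    assert dedupe_fast(data) == dedupe_reference(data), data
```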
Humans are notoriously bad at checking for meticulous detail, which is why copyediting was a specialized role. We’ve seen what’s happened since the newspapers eliminated it for cost savings.
I only used an LLM for the first time recently, to rewrite a YouTube transcript into a recipe. It was excellent at the overall restructuring, but it made a crucial and subtle mistake. The recipe called for dividing 150g of sugar, adding 30g of cornstarch to one half, and blanching eggs in that mixture. ChatGPT rewrote it so that you blanched the eggs in the other half, without the cornstarch. This left me with a boiling custard that wasn’t setting up.
I did confirm that the YouTube transcript explicitly said to use the sugar and cornstarch mixture. But I didn’t do a side by side comparison because the whole reason for doing the rewrite is that transcripts are difficult to read!
> I'll probably always write an obviously correct version first
I’m not usually so confident in my own infallibility, so I prefer to think of it as “I might get this wrong, the LLM might get this wrong, our failure modes are probably not very correlated, so the best thing is for us both to do it and compare.”
Agreed, it is always better for the human engineer to try writing the critical code first, since they are susceptible to being biased by seeing the LLM’s attempt, whereas you can easily hide your own solution from the LLM.
> I’m not usually so confident in my own infallibility
I'll concede that point, but let me put it differently: when I write the "obviously correct" version, what I mean is that I can explain what every line of code is supposed to be doing. I can say what the expected result would be at each step, and test it.
When the LLM writes an inscrutable version for me, I might be able to test the final result, but I'll never know if there are corner cases (or not even that) in between where it will go spectacularly wrong. If I have to untangle what it's written to get that kind of confidence, then it's going to take me much longer. Reading code, especially somebody else's, never mind an alien's, is just so much harder for me.
It seems that on some level you have to, in order not to be constantly reflecting upon your own thoughts and re-researching facts. Whether you trust a given source should surely depend upon its reputation regarding the validity of its claims.
I agree, and by reputation you mean accuracy. We implicitly know not to judge anything as 100% true and implicitly apply skepticism towards sources; the degree of skepticism is decided by our past experience with those sources.
Think of LLMs as the less accurate version of scientific journals.
Accuracy certainly does play a role, but this in itself is not sufficient to prevent an infinite regress: how does one determine the accuracy of a source if not by evaluating claims about the source, which themselves have sources that need to be checked for accuracy? Empirical inquiry is optimal but often very impractical. Reputation is accuracy as imperfectly valued by society or specific social groups collectively.