> I mean, I can ask for obscure things with subtle nuance where I misspell words and mess up my question and it figures it out.
If you're lucky it figures it out. If you aren't, it makes stuff up in a way that seems almost purposefully calculated to fool you into assuming that it's figured everything out. That's the real problem with LLMs: they fundamentally cannot be trusted because they're just a glorified autocomplete; they don't come with any inbuilt sense of when they might be getting things wrong.
I see this complaint a lot, and frankly, it just doesn't matter.
What matters is speeding up how fast I can find information. Not only will LLMs sometimes answer my obscure questions perfectly themselves, but they also help point me to the jargon I need to find that information online. In many areas this has been hugely valuable to me.
Sometimes you do just have to cut your losses. I've given up on asking LLMs for help with Zig, for example; I guess the language is just too obscure, because the hallucination rate is too high to be useful. But for webdev, Python, matplotlib, or bash help? It is invaluable to me, even though it makes mistakes every now and then.
We're talking about getting work done here, not some purity dance about how you find your information the "right way" by looking in books in libraries or something. Or wait, do you use the internet? How very impure of you. You should know, people post misinformation on there!
> Yeah but if your accountant bullshits when doing your taxes, you can sue them.
What is the point of limiting delegation to such an extreme dichotomy, as opposed to just getting more things done?
The vast majority of useful things we delegate, or do ourselves for others, are neither as well specified nor as legally liable for imperfections as an accountant doing your taxes.
Let's try it this way: give me one or two prompts that you personally have had trouble with, in terms of hallucinated output and lack of awareness of potential errors or ambiguity. I have paid accounts on all the major models except Grok, and I often find it interesting to probe the boundaries where good responses give way to bad ones, and to see how they get better (or worse) between generations.
Sounds like your experiences, along with zozbot234's, are different enough from mine that they are worth repeating and understanding. I'll report back with the results I see on the current models.