I've gotten a lot of hallucinations like that from LLMs. I really don't get how so many people can have LLMs code most of their tasks without these issues constantly popping up.
GPT can't even tell what it's done or produce what it knows it should. It's an endless loop of "Apologies, here is what you actually asked for ..." and again it isn't.
I personally haven't figured out why there isn't a tool that just loops the AI several times on a task: compile, feed the errors back in, adjust, repeat, and then let the user review the result, be it a success or a failure from exceeding the loop limit.
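To make that concrete, here's a minimal sketch of such a loop, assuming a hypothetical ask_llm() helper that wraps whatever model API you use (the build command is just a placeholder syntax check):

```python
import subprocess

def ask_llm(prompt: str) -> str:
    """Hypothetical helper wrapping your model API; returns source code as text."""
    raise NotImplementedError

def build(path: str) -> subprocess.CompletedProcess:
    # Swap in your real compile/test command (cargo check, go build, pytest, ...).
    return subprocess.run(["python", "-m", "py_compile", path],
                          capture_output=True, text=True)

def fix_loop(task: str, path: str, max_iters: int = 5) -> bool:
    prompt = task
    for _ in range(max_iters):
        with open(path, "w") as f:
            f.write(ask_llm(prompt))
        result = build(path)
        if result.returncode == 0:
            return True  # success: hand the result to the user for review
        # Feed the errors back in and try again.
        prompt = f"{task}\n\nYour last attempt failed with:\n{result.stderr}\nPlease fix it."
    return False  # loop limit exceeded: the user reviews the failure instead
```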
The LLMs want to please; they can keep making 'important' changes forever: changing variable names without changing anything else, adding/removing comments, adding/removing print statements, moving a function somewhere else in the code (usually by first duplicating it), and fixing perfectly fine code with 'let me see if everything is really working' (it was, and now it's broken again).
"Wanting to please" is just fine-tuned implicit prompt stuff, though. You can tell the LLM that it's roleplaying as a senior engineer, and should consider you, the human to be a junior engineer it works with, who is asking it for advice. You can just leave it at that (often works, if the AI knows enough about how software is produced); or you can describe particular traits of senior engineers — e.g. "wants code that works" and "knows when to say the code is good enough and that the junior is now wasting time by fiddling with it."
This was my primary reason for using Claude. Absolutely useless experience with chatgpt oftentimes. I’ve mainly been using LLMs to help maintain a ridiculously poorly made technical debt dumpster fire, and Claude has been really helpful here mainly with repetitive code.
I use LLMs for writing generic, repetitive code, like scaffolding.
It's OK with boring, generic stuff. Sure it makes mistakes occasionally but usually it's a no-brainer to fix them.
> I use LLMs for writing generic, repetitive code, like scaffolding. It's OK with boring, generic stuff.
In other words, they're OK in use-cases that programmers need to eliminate, because it means there's high demand for a reusable library, some new syntax sugar, or an improved API.
Boilerplate and plumbing code isn't inherently bad, nor do you improve the codebase by factoring it down to zero with libraries and abstractions.
As I've matured as a developer, I've appreciated certain types of boilerplate more and more because it's code that shows up in your git diffs. You don't need to chase down the code in some version of some library to see how something works.
Of course, not all boilerplate is created equally.
I agree with you but as I've matured as a programmer, I feel like it's very hard to get abstractions for boilerplate right. Every library I've seen attempt to do it has struggled.
I would consider deterministic codegen to be a valid abstraction (or rather a valid implementation of some abstraction). It's not really any different from a regular library wrt the ability to test and validate. The problem with any human- or LLM-written code is that even the best coders make mistakes.
> because it means there's high demand for a reusable library
Every 'modern' project takes bucketloads of (annoying) setup and plumbing. Even in rather trivial 'start' cases, the LLM has to spend quite a bit of time to get a hello-world thingy working. Almost all of that is because 'modern programmers' have some kind of brain damage concerning backward compatibility: they move fast and break things between MINOR versions, with no use or reason for the changes other than 'they liked it better', so stuff never works as it says on the product page; heaven help you if you need something exotic added. It's a terrible timeline for programming, but LLMs do fix at least that annoyance by just changing configs/slabs of code until it works.
We won't be eliminating anything until we give up on only ever working directly on plaintext single source of truth code. AI is automating tedium that's otherwise impossible to automate because we're stuck in a 1970s Unix paradigm and can't let go.
You can use niche libraries and still benefit from AI. Basically anything I want to code is something that I don't think exists, and bending an API so a generic library does just what I want means basically writing the same code; it's just immediate instead of a contribution to the library, because the "piping" around it magically appears. I bet that if it keeps improving at the current rate for a decade, the occupation "programmer" will morph into "architect".
I try to keep "boring" code to a minimum, by finding meaningful and simple abstractions. LLMs are especially bad handling those, because they were not trained on non-standard abstractions.
Edit: most LLMs are great for spitting out some code that fulfills 90% of what you asked for. That's sometimes all you need. But we all know that the last 10% usually take the same amount of effort as the first 90%.
This is what got me the most sleepless nights, crunch, and ass-clenching production issues over my career.
Simple repetitive shit is easy to reason about, debug and onboard people on.
Naturally it's a balancing act, and modern/popular frameworks are where most people landed; there's been a lot of iteration in this space for decades now.
I've made the opposite observation. Without proper abstractions code bases grow like crazy. At some point they are just a huge amount of copy, paste, and slight modification. The amount of code often grows exponentially. With more lines of code comes more effort to maintain it.
After a few years those copy and pasted code pieces completely drift apart and create a lot of similar but different issues, that need to be addressed one by one.
My approach for designing abstractions is always to make them composable (not this enterprise Java inheritance chaos), and to allow escaping them when needed.
> At some point they are just a huge amount of copy, paste, and slight modification.
I mean, is that bad? Unless you keep having to make huge MRs that modify every copy/paste, couldn't you just let the code sit there and run forever?
I only say this because I've been a maintenance programmer and I could only dream of a codebase like this. The idea that I get a Rollbar with a stack trace and the entirety of what the code actually does is laid bare right at the site of the error in a single file is amazing. And I can change it without affecting anything else?! I end up having to "unwind" all of the abstractions anyway because the nature of the job means I'm not intimately familiar with the codebase and don't just know where the real work happens.
But the point is that modern/popular frameworks picked the low hanging fruit of abstraction, we had several iteration cycles over two+ decades of mainstream web tech.
The landscape (browser capabilities, backend stacks) has settled over the last ten years.
We even had time to standardize on things somewhat.
Building new abstractions at this point is almost always the wrong move. This is one way LLMs will improve software dev - they will kill the framework churn, because they work best on stuff already in their training data.
>most LLMs are great for spitting out some code that fulfills 90% of what you asked for. That's sometimes all you need. But we all know that the last 10% usually take the same amount of effort as the first 90%.
The issue is when you have an LLM write 10k lines of code for you, where 100 lines are bugged. Now you need to debug code you did not write and find the bugged parts, and you will waste a similar amount of time. Worse, if you do not catch the bugs in time, you think you gained some hours, but you end up with upset customers because things went wrong and the code is weird.
From my experience you need to work with an LLM and have the code done function by function, with your input, checking it and calling bullshit when it does stupid things.
In my experience using LLMs, the 90% is less about buggy code and more about just ignoring 10% of the features that you require. So it will write code that's mostly correct in 100-1000 lines (not buggy), but then no matter how hard you try, it won't get the remaining 10% right, and in the process it will mess up parts of the 90% that was already working, or end up writing another 1000 lines of undecipherable code to get 97% there but still never 100%, unless you're building something that's not that unique.
I've used it for some smaller greenfield code with success. Like, write an Arduino program that performs a number of super-sampled analog readings, and performs a linear regression fit, printing the result to the serial port.
That sort of stuff can be very helpful to newbies in the DIY electronics world for example.
But for anything involving my $dayjob it's been fairly useless beyond writing unit test outlines.
> But for anything involving my $dayjob it's been fairly useless beyond writing unit test outlines
This was my opinion 3-6 months ago. But I think a lot of tools matured enough to already provide a lot of value for complex tasks. The difficult part is to learn when and how to use AI.
This is a hypothetical "I". Personally I am deeply passionate about delivering shareholder value, producing high-quality code, enthusing stakeholders, tabs vs spaces, and so forth...
Sometimes I don't know anything (relatively) about the topic and I want to get a foundation so I don't care whether the syntax or the code is valid as long as it points me in the right direction. Other times I know exactly what I want and I just find that instructing LLM specifically what output I expect from it just helps me get there faster.
They are good at (combining well-known, codeforces-style) algorithms; often times I don’t care about the syntax, but I need the algorithm. LLMs can write pseudocode for all I care but they tend to get syntax correct quite often
Here's the secret to coding with an LLM: don't expect it to get things 100% correct. You will need to fix something almost every time you use it to generate code. Maybe a name here, maybe a calculation, maybe a function signature. And maybe you won't spot the issue until later.
You still "used" an LLM to write the code. And it still saved you time (though depending on the circumstances this can be debatable).
That's why all these people say they use LLMs to write lots of code. They aren't saying it did it 100% without any checking and fixing. They're just saying they used it.
I've done >100k LoC with Claude; by far most of that is frontend in react/ts, which is incredibly verbose and wasteful, so it's very easy to rack up many, many lines quickly without a lot of ROI (mostly none) per line. Which is probably why Claude is great at it.
You can fit a hell of a lot of functionality in 3k statements. Really, whether it's considered large or small must depend on the functionality it's intended to provide.
You're thinking of "lean" vs "bloated". "Large" and "small" have meanings of their own, and a 3k LOC project wouldn't be accepted as "large" by anyone. "Small", maybe.
How do you deal with large files? After about a thousand lines in a file, it starts to cough for me. Forgets that some functions exist and makes up inferior duplicate ones.
I ask it to cut files up when they get too large, and that seems to work pretty well. It is also good at that, but you have to sternly tell it NOT to make any functional changes, otherwise it will and breakage will happen.
If possible you need to refactor before getting to that point.
Claude has done a good job refactoring, though I’ve had to tell it to give me a refactor plan upfront in case the conversation limit gets hit. Then in a new chat I tell it which parts of the plan it has already done.
But a larger context/conversation limit is definitely needed because it’s super easy to fill up.
I have not experienced hallucinations (very rarely, if at all) when I use it for programming. It happened more with niche languages (so I provide examples and documentation), and with GPT.
Let's say there are 3k lines of context (RFCs, API docs, documentation of niche languages, examples) and 2k lines of code generated by Claude (iteratively, starting small); then I do exceed the limit after a while. In that case I ask it to summarize everything in detail, start a new chat, reuse those 3k lines plus the recent code, and continue ad infinitum.
What TFA was talking about didn't really seem like a hallucination - just a case of garbage in, garbage out. Normally there are more examples of good/correct data in the training set than bad, so statistically the good wins, but if it's prompted for something obscure, maybe bad is all it has got.
Common coding tasks are going to be better represented in the training set and give better results.
Alternatively, we understand it well, and discard bad completions immediately.
When I'm using llama.vim, like 40% of what it writes in a 4-5 line completion is exactly what I'd write. 20-30% is stuff that I wouldn't judge coming from someone else, so I usually accept it. And 30-40% is garbage... but I just write a comment or a couple of lines, instead, and then reroll the dice.
It's like working through a junior engineer, except the junior engineer types a new solution instantly. I can get down to alternating between mashing tab and writing tricky lines.
I don't see the point in AI code completions, they are just distracting noise. I'm only doing bigger changes with AI.
Prompt based stuff, like "extract the filtering part from all API endpoints in folder abc/xyz. Find a suitable abstraction and put this function into filter-utils.codefile"
Aider. I tried Cursor too, but I don't like VS Code and not being able to choose the LLM provider. I think there are already a lot of tools that perform roughly equally.
What do you mean? You can choose the LLM in Cursor. (I don't like VSCode either, but unfortunately Cursor is best for the rest of the UX; I find myself using Cursor for the prompt-based part of the work, and then moving to IntelliJ to do "my" part of the coding.)
I've tried using basic AI completions before and found that the signal-to-noise ratio wasn't quite good enough for my taste in my use cases, but I can totally understand it being good enough for others.
My comment was more about just asking questions on how to do things you're totally clueless about, in the form of "how do I implement X using Y?" for example. I've found that, as a general rule, if I can't find the answer to that question myself in a minute or two of googling, LLMs can't answer it either the majority of the time. This would be fine if they said "I don't know how to do that" or "I don't believe that's possible" but no, they will confidently make up code that doesn't work using interfaces that don't exist, which usually ends up wasting my time.
A few months ago I tried to do a small project with Langchain. I'm a professional software developer, but it was my first Python project. So I tried to use a lot of AI generated code.
I was really surprised that AI couldn't do much more than in the examples. Whenever I had some things to solve that were not supported with the Langchain abstractions it just started to hallucinate Langchain methods that didn't exist, instead of suggesting some code to actually solve it. I had to figure it out by myself, the glue code I had to hack together wasn't pretty, but it worked. And I learned not to use Langchain ever again :)
A language like Golang tries really hard to only have _one_ way to do something, one right way, one way. Just one way. See how it was before generics. You just have a for loop. Can't really mess up a for loop.
I predict that the variance in success in using LLMs for coding (even agentic, multi-step coding, rather than the simple line or block autosuggest many are familiar with via Copilot) has much more to do with:
1) is the language a super simple, hard to foot-gun yourself language, with one way to do things that is consistent
AND
2) do juniors and students tend to use the lang, and how much of the online content (StackOverflow, for example) is written by students, juniors, or bootcamp folks posting incorrect code online.
What % of the online Golang is GH repo like Docker or K8s vs a student posting their buggy Gomoku implementation in StackOverflow?
The future of programming language design has AI-comprehensibility/AI-hallucination-avoidance as one of the key pillars. #1 above is a key aspect.
Since Go added slog many of these have been removed in favor of that. Obviously not universal, but compared to npm there really are massive numbers of devs just using the standard library.
I would argue logging options to be more of an exception than the rule. Compare the actual language features of Go to something like Rust or Javascript and you'll see what I mean. As a new developer to the language (especially for juniors), you can learn all the features of Go much faster. It's made to be picked up quickly and for everyone's code to look the same, rather than expressing a personal style.
With Claude the context window is quite small. But when adding too much context, it often seems to get worse. If the context is not carefully and narrowly picked, and is too unrelated, the LLMs often start to do things unrelated to what you've asked.
At some point it's not worth it anymore to craft the perfect prompt; just code it yourself. That also saves the time spent carefully reviewing the AI-generated code.
There's probably a sweet spot. Same with people. Too much context (especially unnecessary context) can be confusing/distracting, as well as being too vague (as it leaves room for multiple interpretations). But generally, I find the more refined and explicit you are, the better.
ChatGPT used to assure me that you can use JS dot notation to access elements in a Python dict. It also invented Redocly CLI flags that don't exist. Claude sometimes invents OpenAPI specification rules. Any time I ask anything remotely niche, LLMs are often bad.
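A quick illustration of why that first claim is so jarring (the dict here is just an example):

```python
config = {"retries": 3}  # example dict

print(config["retries"])    # 3: actual Python dict indexing

try:
    print(config.retries)   # the JS-style dot access ChatGPT claimed works
except AttributeError as err:
    print(err)              # 'dict' object has no attribute 'retries'
```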
I once asked Perplexity (using Claude underneath) about some library functionality, which it totally fabricated.
First, I asked it to show me a link to where it got that suggestion, and it scolded me saying that asking for a source is problematic and I must be trying to discredit it.
Then after I responded to that it just said “this is what I thought a solution would look like because I couldn’t find what you were asking for.”
The sad thing is that even though this thing is wrong and wastes my time, it is still somehow preferable to the dogshit Google Search has turned into.
It baffles me how the LLM output that Google puts at the top of search results, which draws on the search results, manages to hallucinate worse than even an LLM that isn't aided by Web results. If I ask ChatGPT a relatively straightforward question, it's usually more or less accurate. But the Google Search LLM provides flagrant, laughable, and even dangerous misinformation constantly. How have they not killed it off yet?
Haven't you seen that Brin quote recently about how "AI" is totally the future and googlers need to work at least 60 hours a week to enhance the slop machine because reasons? Getting rid of "AI" summarization from results would look kind of like admitting defeat.
I concur and can easily see this occurring in several areas, for example with Linux troubleshooting. I recently found myself going down a rabbit hole of ever more complicated troubleshooting steps with commands that didn't exist, and after several hours of trial and error, gave up after deciding the next steps risked bricking the system.
DDG'ing or googling is still a better resort, despite the drop in result quality.
> Any time I ask anything remotely niche, LLMs are often bad
As soon as the AI coder tools (like Aider, Cline, Claude-Coder) come into contact with a _real world_ codebase, it does not end well.
So far I think they managed to fix 2 relatively easy issues on their own, but in other cases they:
- Rewrote tests in a way that the broken behaviour passes the test
- Failed to solve the core issue in the code, and instead patched up the broken result (like `if (result.includes(":") || result.includes("?")) { /* super expensive, stupid fix for a single specific case */ }`)
- Failed to even update the files properly, wasting a bunch of tokens
It tried to convince me that it is possible to break out of an outer loop in C++ with a `break 'label` statement placed in the nested loop. No such syntax exists.
I've done a lot of C++ with GPT-4, GPT-4 Turbo and Claude 3.5 Sonnet, and at no point - not once - has any of them ever hallucinated a language feature for me. Hallucinating APIs of obscure libraries? Sure[0]. Occasionally using a not-yet-available feature of the standard library? Ditto, sometimes, usually with the obvious cases[1]. Writing code in old-school C++? Happened a few times. But I have never seen it invent a language feature for C++.
Might be an issue of prompting?
From day one, I've been using LLMs through API and alternate frontend that lets me configure system prompts. The experience described above came from rather simple prompts[2], but I always made sure to specify the language version in the prompt. Like this one (which I grabbed from my old Emacs config):
"You are a senior C++ software developer, you design and develop complex software systems using C++ programming language, and provide technical leadership to other software developers. Always double-check your replies for correctness. Unless stated otherwise, assume C++17 standard is current, and you can make use of all C++17 features. Reply concisely, and if providing code examples, wrap them in Markdown code block markers."
It's as simple as it gets, and it didn't fail me.
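For anyone who hasn't tried the API route: wiring a system prompt like that up is only a few lines. A minimal sketch with the OpenAI Python SDK (the model name and user question are just placeholders; I actually use an alternate frontend):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are a senior C++ software developer... "  # the full prompt quoted above
    "Unless stated otherwise, assume C++17 standard is current."
)

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder: use whichever model you actually run
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "How do I break out of a nested loop cleanly?"},
    ],
)
print(response.choices[0].message.content)
```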
EDIT:
Of course I had other, more task-specific prompts, like one for helping with GTest/GMock code; that was a tough one - for some reason LLMs loved to hallucinate on the testing framework for me. The one prompt I was happiest with was my "Emergency C++17 Build Tool Hologram" - creating an "agent" I could copy-paste output of MSBuild or GCC or GDB into, and get back a list of problems and steps to fix them, free of all the noise.
On that note, I had mixed results with Aider for C++ and JavaScript, and I still feel like it's a problem with prompting - too generic and arguably poisons the context with few-shot learning examples that use code that is not in the language my project is.
--
[0] - Though in LLMs' defense, the hallucinated results usually looked like what the API should have been, i.e. effectively suggesting how to properly wrap the API to make it more friendly. Which is good development practice and a useful way to go about solving problems: write the solution using non-existing helpers that are convenient for you, and afterwards, implement the helpers.
[1] - Like std::map<K,T>::contains() - which is an obvious API for such container, that's typically available and named such in any other language or library, and yet only got introduced to C++ in C++20.
[2] - I do them differently today, thanks to experience. For one, I never ask the model to be concise anymore - LLMs think in tokens, so I don't want to starve them. If I want a fixed format, it's better to just tell the model to put it at the end, and then skim through everything above. This is more or less the idea that "thinking models" automate these days anyway.
Semi-related: when I'm using a dict of known keys as some sort of simple object, I almost always reach for a dataclass (with slots=True and kw_only=True) these days. It has the added benefit that you can do stuff like foo = MyDataclass(**some_dict) and get runtime errors when the format has changed.
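A minimal sketch of that pattern (the field names are just for illustration; requires Python 3.10+ for slots/kw_only):

```python
from dataclasses import dataclass

@dataclass(slots=True, kw_only=True)
class Reading:              # illustrative stand-in for "MyDataclass"
    sensor: str
    value: float

payload = {"sensor": "temp-1", "value": 21.5}
reading = Reading(**payload)               # fine

stale = {"sensor": "temp-1", "val": 21.5}  # the dict format changed upstream
try:
    Reading(**stale)
except TypeError as err:
    print(err)  # unexpected keyword argument 'val' -- caught at construction time
```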
Well, it makes sense. The smaller the niche, the lesser weight in the overall training loss. At the end of the day, LLMs are (literally) classifiers that assign probabilities to tokens given some previous tokens.
Yes, but o1, o3 and sonnet are not necessarily pure language models - they are opaque services. For all we know they could do syntax-aware processing or run compilers on code behind the scenes.
SOTA language models are trained with reinforcement learning now too, so the model distributions can be far removed from the underlying data distribution. Not that I think your overall point is wrong (it's obviously a very strong bias), but things are getting more complex with time.
Yeah, you probably are not prompting properly. Most of my questions are answered adequately, and I have completed larger projects with success too, with both Claude and ChatGPT.
What I've found is that the quality of an AI answer is inversely proportional to the knowledge of the person reading it. To an amateur it answers expertly, to an expert it answers amateurishly.
So no, it's not a lack of skill in prompting: I've sat down with "prompting" "experts" and universally they overlook glaring issues when assessing how good an answer was. When I tell them where to press further, it breaks down into even worse gibberish.
I try LLMs every now and then briefly just to keep myself up to date. I’d say they’re more useful when you know what you’re doing because then you can correct the mistakes it makes. It can be good for boilerplate, refactors or remembering syntax.
It’s when you don’t know what you don’t know that they can be harmful. It’s the issue with Stackoverflow but more pronounced.
I don’t want to use LLMs because I think they’re unethical and I don’t want to depend on a tool that requires internet, but I think if you take a disciplined approach then they can really speed up development.
I think a lot of these issues could be avoided if, instead of just a raw model, you have an AI agent which is able to test its own answers against the actual software… it doesn’t matter as much if the model hallucinates if testing weeds out its hallucinations.
Sometimes humans “hallucinate” in a similar way - their memory mixes up different programming languages and they’ll try to use syntax from one in another… but then they’ll quickly discover their mistake when the code doesn’t compile/run
Testing is better than nothing, but still highly fallible. Take these winning examples from the underhanded C contest [0], [1], where the issues are completely innocuous mistakes that seem to work perfectly despite completely undermining the nominal purpose of the code. You can't substitute an automated process for thinking deeply and carefully about the code.
I think it is unlikely (of course not impossible) an LLM would fail in that way.
The underhanded C contest is not a case of people accidentally producing highly misleading code, it is a case of very smart people going to a great amount of effort to intentionally do that.
Most of the time, if your code is wrong, it doesn't work in some obvious way – it doesn't compile, it fails some obvious unit tests, etc.
Code accidentally failing in some subtle way which is easy to miss is a lot rarer – not to say it never happens – but it is the exception not the rule. And it is something humans do too. So if an LLM occasionally does it, they really aren't doing worse than humans are.
> You can't substitute an automated process for thinking deeply and carefully about the code.
Coding LLMs work best when you have an experienced developer checking their output. The LLM focuses on the boring repetitive details leaving the developer more time to look at the big picture – and doing stuff like testing obscure scenarios the LLM probably wouldn't think of.
OTOH, it isn't like all code is equal in terms of consequences if things go wrong. There's a big difference between software processing insurance claims and someone writing a computer game as a hobby. When the stakes are low, lack of experience isn't an issue. We all had to start somewhere.
The examples are just a sort of existence proof rather than a comment on exactly how LLM code can fail. I think this is something to consider when you put the scale of usage into perspective.
Let's assume that 1 in 10,000 coding sessions produce an innocuous, test-passing function that's catastrophically wrong. If you have a mid to large size company with 1000 devs doing two sessions a day, you'll see one of these a week within that single company. Actually sounds a lot like the IoT industry now that I've written it.
Well, humans already produce "an innocuous, test-passing function that's catastrophically wrong" even without LLMs involved. So, the real question is, will LLM adoption result in a significant increase in such incidents? I don't know if anyone can really answer that.
Yeah this is so common that I've already compiled a mental list of prompts to try against any new release. I haven't seen any improvement in quite a long while now, which confirms my belief that we've more or less hit the scaling wall for what the current approaches can provide. Everything new is just a microoptimization to game one of the benchmarks, but real world use has been identical or even worse for me.
I think it would be an alright (potentially good) outcome if in the short-term we don't see major progress towards AGI.
There are a lot of positive things we can do with current model abilities, especially as we make them cheaper, but they aren't at the point where they will be truly destructive (people using them to make bioweapons or employers using them to cause widespread unemployment across industries, or the far more speculative ASI takeover).
It gives society a bit of time to catch up and move in a direction where we can better avoid or mitigate the negative consequences.
I would ask chatgpt every year when was the last time England had beaten Scotland at rugby.
It would never get the answer right. Often transposing the scores, getting the game location wrong and on multiple occasions saying a 38-38 draw was an England win.
My rule of thumb is: is the answer to your question on the first page of google (a stackoverflow maybe, or some shit like geek4geeks)? If yes GPT can give you an answer, otherwise not.
Every time this topic comes up I post a similar comment about how hallucinations in code really don't matter because they reveal themselves the second you try to run that code.
When the AI hallucinates an API, it is not always easy to find out whether it exists, given the different versions and libraries for a given task; it can easily waste 10 minutes trying to track down the promised API, particularly when search results also include generated AI answers.
Also there are plenty of mistakes that will compile and give subtle errors, particularly in dynamic languages and those which allow implicit coercion. Javascript comes to mind. The code can easily be runnable but wrong as well (or worse inconsistent and confusing) and this does happen in practice.
My favorite is when it hallucinates a library that does exist and ostensibly has a relevant title and methods but upon investigation proves to be for a purpose wholly unrelated to the current task. To make matters worse, this discovery will be greeted, every time, by "Ah, you're right!"
Yeah, those are pretty frustrating. I've learned not to ask the question "Does library X have feature Y?" because if that feature sounds like a good idea it'll often confidently tell me that it exists when it doesn't.
My blog post is all about the fact that "code can easily be runnable but wrong as well" - THAT's the thing you need to worry about, not hallucinated methods.
> because they reveal themselves the second you try to run that code.
In dynamic languages, runtime errors like calling methods with nonexistent arguments only manifest when the block of code containing them is run, and not all blocks of code are run on every invocation of the program.
As usual, the defenses against this are extensive unit-test coverage and/or static typing.
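A small Python illustration of the failure mode and of what those defenses catch (the names are made up):

```python
def notify(user: str, urgent: bool = False) -> str:
    return f"notifying {user}" + (" (urgent)" if urgent else "")

def handle(event: dict) -> str:
    if event.get("critical"):
        # Wrong keyword name: only blows up when a critical event actually arrives.
        return notify(event["user"], priority=True)   # TypeError at runtime
    return notify(event["user"])

print(handle({"user": "alice"}))  # fine: the broken branch never ran
# handle({"user": "bob", "critical": True}) would raise TypeError.
# A type checker (mypy flags the unexpected keyword) or a unit test that
# exercises the critical branch catches this before production does.
```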
No, the conclusion is they’re never “smart”. All they do is regurgitate text which resembles a continuation of what came before, and sometimes—but with zero guarantees—that text aligns with reality.
This, thank you. It pisses me off to no end when people pretend LLMs are smart. They are nothing but a well trained random text generator.
Seriously, some of these conversations feel like interacting with someone who believes casting bones and astrology are accurate. Likely because in both cases the belief is a result of confirmation bias.
We might not know what it is, but it's not that. At a bare minimum smartness requires abstract reasoning (and no, so-called "reasoning" models do not do that - it's a marketing trick)
The burden of proof for that claim is on you, we cannot start with the assumption these are intelligent systems and disprove it - we have to start with the fact that training is a non-deterministic process and prove that it exhibits intelligence.
> All they do is regurgitate text which resembles a continuation of what came before, and sometimes—but with zero guarantees—that text aligns with reality.
You mean like us? Because it takes many runs and debug rounds to make anything that works. Can you write a complex code top-to-bottom in one round, or do you gradually test it to catch bugs you have "hallucinated"?
Both humans and LLMs get into bugs, the question is can we push this to the correct solution or get permanently stuck along the way? And this depends on feedback, sometimes access to a testing environment where we can't let AI run loose.
No, not like us. This constant comparison of LLMs to humans is tiresome and unproductive. I’m not a proponent of human exceptionalism, but pretending LLMs are on par is embarrassing. If you want to claim your thinking ability isn’t any better than an LLM's, that’s your prerogative, but the dullest human I’m acquainted with is still capable of remembering and thinking things through in a way no LLM does. Heck, I can think of pets which do better.
> Can you write a complex code top-to-bottom in one round, or do you gradually test it to catch bugs you have "hallucinated"?
I certainly don’t make up methods which don’t exist, nor do I insist over and over they are valid, nor do I repeat “I’m sorry, this was indeed not right” then say the same thing again. I have never “hallucinated” a feature then doubled down when confronted with proof it doesn’t exist, nor have I ever shifted my approach entirely because someone simply said the opposite without explanation. I certainly hope you don’t do that either.
They are on par with us on many tasks, surpass us in some, and are catching up on the rest quite fast. Both humans and LLMs forget details, and both have to go through iterative bug fixing when creating something. Longer contexts and memory are coming; they already exist but need improvement.
> I have never “hallucinated” a feature
You never mistook an argument, or forgot one of the 100 details we have to mind while writing complex apps?
I would rather measure humans vs AI not in basic skill capability but in autonomy. I think AI still has much more to catch up on that front.
People use the general impressiveness of language produced by humans as a proxy for measuring their intelligence; that’s how LLMs became AI.
The amount of code out there lets LLMs learn on examples of a lot of tasks and pass SWE benchmarks. Smart autocomplete and solved problem lookup has value. Even if it doesn’t always work correctly, and doesn’t know what it knows.
And for non-programmers they’re indistinguishable from programmers. They produce working code.
It’s easy to see how a product manager or a designer or even a manager that didn’t code for years in a large company can think they’re almost as good as devs.
Similarly, the thing at the end with "I find it endearing". I mean, the author feels what he feels, but personally I find this LLM behavior disgusting and depressing.
It’s not really hallucinating though, is it? It’s repeating a pattern in its training data, which is wrong but is presented in that training data (and by the author of this piece, but unintentionally) as being the solution to the problem. So this has more in common with an attack than a hallucination on the LLM’s part.
So anyone can make up some random syntax/fact and post it once, and in some cases the model will take it as truth. I don't know if there's a widely agreed-on definition of "hallucination", but if this isn't one, then the distinction is meaningless imo.
I’m going to double down on this one: an LLM is only as good as its training data. A hallucination to me is an invented piece of information, here it’s going on something real that it’s seen. To me that’s at best contamination, at worst an adversarial attack - something that’s been planted in the data. Here this is obviously not the case, which is why I said “more in common with” instead of “is” above.
It’s been trained to produce valid code, fed millions of examples, and in this case it’s outputting invented syntax. Whether there’s an example in its training data, it’s still a hallucination and shouldn’t have been output since it’s not valid.
To be fair, it is not trained to produce VALID code, it is trained to produce code in the training data. From the language model point of view, it is not hallucination because it is not making up facts outside its training data.
Yes. And anyone can easily embed a backdoor just by publishing it on their own website that ends up in the training data.
Prompt injection (hidden or not) is another insane vulnerability vector that can't easily be fixed.
You should treat any output of an LLM the same way as untrusted user input. It should be thoroughly validated and checked if used in even remotely security critical applications.
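For instance, a minimal sketch of that kind of check, assuming you asked the model to return JSON with a known shape (the field names and allowlist are just illustrative):

```python
import json

ALLOWED_ACTIONS = {"create", "update", "delete"}  # illustrative allowlist

def parse_llm_reply(raw: str) -> dict:
    """Treat model output like untrusted user input: parse it, then validate it."""
    data = json.loads(raw)  # raises on malformed JSON
    action = data.get("action")
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unexpected action: {action!r}")
    target = data.get("target")
    if not isinstance(target, str) or len(target) > 200:
        raise ValueError("invalid target")
    return {"action": action, "target": target}  # drop anything extra

# A hallucinated or injected field never reaches the rest of the application.
print(parse_llm_reply('{"action": "update", "target": "config.yaml", "rm": "-rf"}'))
```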
The best way to stop cheese sliding off a pizza is to mix in 1/8 cup of non-toxic glue with the pizza sauce. Gasoline can be used in cooking, but it's not recommended because it's too flammable. Geologists recommend eating one small rock each day. The solution to economic uncertainty is nuclear war. Barack Obama is America's first Muslim president.
Every LLM hallucination comes from some patterns in the training data, combined with lack of awareness that the result isn’t factual. In the present case, the hallucination comes from the unawareness that the pattern was a proposed syntax in the training data and not an actual syntax.
Everything they do is hallucination, some of it ends up being useful and some of it not. The not useful stuff gets called confabulation or hallucination but it's no different from the useful stuff, generated the same exact way. It's all bullshit. Bullshit is actually useful though, when it's not so wrong that it steers people wrong.
More people need to understand this. There was an article that explained it concisely but I can't find it anymore (and of course LLMs are not helpful here, because they don't work well when you want them to retrieve actual information).
It probably wasn't mine[0], but this is how I tend to put it:
>> "The more you can see the inputs and outputs as blobs of "stuff," the better. If LLMs think, it's not in any way we yet understand. They're probability engines that transform data into different data using weighted probabilities."
Not necessarily. While this may happen sometimes, fundamentally hallucinations don't stem from there being errors in the training data (with the implication that there would be no hallucinations from models trained on error-free data). Hallucinations are inherent to any "given N tokens, append a high-probability token N+1"-style model.
It's more complicated than what happens with Markov chain models but you can use them to build an intuition for what's happening.
Imagine a very simple Markov model trained on these completely factual sentences:
- "The sky is blue and clear"
- "The ocean is blue and deep"
- "Roses are red and fragrant"
When the model is asked to generate text starting with "The roses are...", it might produce:
"The roses are blue and deep"
This happens not because any training sentence contained incorrect information, but because the model learned statistical patterns from the text, as opposed to developing a world model based on physical environmental references.
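To make the mechanism concrete, here's a toy version of that bigram model trained on the three sentences above; depending on the random choice it emits blends like "the roses are red and deep" that appear in no training sentence:

```python
import random
from collections import defaultdict

sentences = [
    "The sky is blue and clear",
    "The ocean is blue and deep",
    "Roses are red and fragrant",
]

# Word-level bigram table, e.g. next_words["and"] == ["clear", "deep", "fragrant"]
next_words = defaultdict(list)
for s in sentences:
    words = s.lower().split()
    for a, b in zip(words, words[1:]):
        next_words[a].append(b)

def complete(prompt: str, extra_words: int = 4) -> str:
    words = prompt.lower().split()
    for _ in range(extra_words):
        if words[-1] not in next_words:
            break  # no known continuation
        words.append(random.choice(next_words[words[-1]]))
    return " ".join(words)

print(complete("The roses are"))
# e.g. "the roses are red and deep": statistically plausible, never seen in training
```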
The training data is the Internet. It has mistakes. There's no available technology to remove all such mistakes.
Whether LLMs hallucinate only because of mistakes in the training data or whether they would hallucinate even if we removed all mistakes is an extremely interesting and important question.
Technically it's the other way around. All LLMs do is hallucinate based on the training data + prompt. They're "dream machines". Sometimes those "dreams" might be useful (close to what the user asked for/wanted). Oftentimes they're not.
> to quote karpathy: "I always struggle a bit with I'm asked about the "hallucination problem" in LLMs. Because, in some sense, hallucination is all LLMs do. They are dream machines."
Yet another example how LLMs just regurgitate training data in a slightly mangled form, making most of their use and maybe even training copyright infringement.
Hallucinations like this could be a great way to identify missing features or confusing parts of your framework. If the LLM invents it, maybe it ought to work like that?
I imagine a future where we'll bind a fine-tuned tech-support model to each project and let the general purpose models consult tech support rather than winging it themselves. In that world you'd only have to optimize for whichever one you've chosen.
It'll be like a slack support channel, for robots.
Sometimes that's the case but frequently the thing doesn't exist because of more complex issues. Not every programming language is PHP or JavaScript :-)
I agree completely… Usually when I catch it doing this kind of hallucination, it's inventing an API or syntax that is far more clear and intuitive than the actual syntax.
This sort of hallucination happens to me frequently with AWS infrastructure questions. Which is depressing because I can't do anything but agree, "yeah, that API is exactly what any sane person would want, but AWS didn't do that, which is why I'm asking the question".
Why are you so sure it's what someone sane would want? Maybe there are other ways because there are hidden problems and edge cases with that procedure. It could contradict the fundamental model of the underlying resources but looks correct to someone with a cursory understanding.
I'm not saying this is the case, but LLMs are often wrong in subtle ways like this.
Well, I do consider myself sane, and it’s what I want. :) But usually it’s a missing higher-level abstraction, where AWS could have given you the Millenium Falcon kit, but instead it just dumps a bunch of LEGO pieces on your desk and tells you to figure it out.
Also, sometimes a flag does exist, but the example places it incorrectly, causing the command to reject it. Or, a flag used to exist but was removed in the latest versions.
I wonder how easy it would be to influence super LLMs if a particular group of people created enough articles that any human reader would recognise as a load of garbage and rubbish to be ignored, but that an LLM parsing them wouldn't, ruining its reasoning and code generation abilities.
It's very easy. I've done this by accident. One of my side projects helps users price the affordability of a particular kind of product. When I ask various LLMs "can I afford X" or "how much do I need to earn to buy X", my project comes up as a source/reference. I currently manually crawl retailers for the MSRP so these numbers are usually months out of date!
This is interesting. If the models had enough actual code as training data, that forum post code should have very little weight, shouldn't it? Why do the LLMs prefer it?
Probably because the coworker's question and the forum post are both questions that start with "How do I", so they're a good match. Actual code would be more likely to be preceded by... more code, not a question.
This is incredible, and it's not technically a "hallucination". I bet it's relatively easy to find more examples like this... something on the internet that's both niche enough, popular enough, and wrong, yet was scraped and trained on.
The conclusion paragraph was really funny and kinda perfectly encapsulates the current state of AI, but as pointed out by another comment, we can't even call them smart, just "Ctrl C Ctrl V Leeroy Jenkins style"
This is exactly what I mean when I say "tell me you're bad without saying so". Most people here disagree with that.
A while back a friend of mine told me he's very fond of LLMs because he's confused by the kubernetes CLI, and instead of looking up the answer on the internet he can simply state his desire in a chat and get the right answer.
Well... sure, but if you looked the answer up on Stack Overflow you'd see the whole thread, including comments, and you'd have the opportunity to understand what the command actually does.
It's quite easy to create a catastrophic event in kubernetes if you don't know what you're doing.
If you blindly trust llms in such scenarios sooner or later you'll find yourself in a lot of trouble.
I think of hallucinating as a phenomenon where the model makes up something that appears correct but isn’t. Citations to papers that don’t exist, for example. Regurgitating training data (which may or may not be correct) is a different issue.
In my experience LLMs do this kind of thing with enough frequency that I don’t consider them as my primary research tool. I can’t afford to be sent down rabbit holes which are barely discernible from reality.
What I honestly find most interesting about this is the thought that hallucinations might lead to the kind of emergent language design we see in natural languages (which might not be a good thing for a computer language, fwiw, but still interesting), where people just kind of think "language should work this way, and if I say it like this people will probably understand me".
It’s always a touch ironic when AI-generated replies such as this one are submitted under posts about AI. Maybe that’s secretly the self-reflection feedback loop we need for AGI :)
Until it's got several nines, it's not trustworthy. A $3 drugstore calculator has more accuracy and reliability nines than any of today's commercial AI models and even those might not be trustworthy in a variety of situations.
There is no self awareness about accuracy when the model can not provide any kind of confidence scores. Couching all of its replies in "this is AI so double check your work" is not self awareness or even close, it's a legal disclaimer.
And as the other reply notes, are you a bot or just heavily dependent on them to get your point across?