I suppose that Simon, being all in with LLMs for quite a while now, has developed a good intuition/feeling for framing questions so that they produce fewer hallucinations.
Yeah I think that's exactly right. I don't ask questions that are likely to produce hallucinations (like asking an LLM without search access for citations from papers on a topic), so I rarely see them.
But how would you verify? Are you constantly asking questions you already know the answers to? In-depth answers?
Often the hallucinations I see are subtle, though usually critical. I see them when generating code, doing my testing, or even just writing. There are hallucinations in today's announcements, such as the airfoil example[0]. An example of a more obvious hallucination: I was asking for help improving an abstract for a paper. I gave it my draft and it inserted new numbers and metrics that weren't there. I tried again, providing my whole paper. I tried again, making it explicit not to add new numbers. I tried the whole process again in new sessions and in private sessions. Claude did better than GPT-4 and o3, but none would do it without follow-ups and a few iterations.
Honestly I'm curious what you use them for where you don't see hallucinations
[0] This is a subtle but famous misconception, one that you'll even see in textbooks. The hallucination was probably caused by Bernoulli being in the prompt.
When I'm using them for code these days it is usually in a tool that can execute code in a loop - so I don't tend to even spot the hallucinations because the model self-corrects.
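Roughly, a tool like that amounts to something along these lines - a toy sketch, not any particular tool's actual implementation, and `call_model` is a hypothetical stand-in for whatever LLM client you use:

```python
# Toy sketch of a "generate, execute, feed the errors back" loop.
# call_model() is a hypothetical stand-in for a real LLM client.
import subprocess
import sys
import tempfile


def call_model(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")


def generate_with_self_correction(task: str, max_attempts: int = 3) -> str:
    prompt = task
    code = ""
    for _ in range(max_attempts):
        code = call_model(prompt)
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        result = subprocess.run([sys.executable, path], capture_output=True, text=True)
        if result.returncode == 0:
            # A hallucinated import or API would have raised before we got here.
            return code
        # Show the model its own traceback and let it try again.
        prompt = f"{task}\n\nYour previous attempt failed with:\n{result.stderr}\nPlease fix it."
    return code
```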
For factual information I only ever use search-enabled models like o3 or GPT-4.
Most of my other use cases involve pasting large volumes of text into the model and having it extract information or manipulate that text in some way.
I don't think this means no hallucinations (in output). I think it'd be naive to assume that compiling and passing tests means the output is hallucination-free.
> For factual information
I've used both quite a bit too. While o3 tends to be better, I see hallucinations frequently with both.
> Most of my other use cases
I guess my question is how you validate the hallucination-free claim.
Maybe I'm misinterpreting your claim? You said "I rarely see them," but I'm assuming you mean something stronger, and I think it would be reasonable for anyone to read it that way. Are you claiming only that you don't see them, or that they are uncommon? I interpreted the latter.
I don't understand why code passing tests wouldn't be protection against most forms of hallucinations. In code, a hallucination means an invented function or method that doesn't exist. A test that uses that function or method genuinely does prove that it exists.
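As a toy illustration (pandas chosen arbitrarily): if the model had invented a method like `df.fast_dedupe()`, which pandas doesn't have, any test that actually exercises the generated code would fail with an AttributeError rather than pass:

```python
# Toy example: a test that calls the generated function end to end.
import pandas as pd


def dedupe(df: pd.DataFrame) -> pd.DataFrame:
    # A hallucinated version might have been: return df.fast_dedupe()
    # That would blow up here, because no such method exists.
    return df.drop_duplicates()


def test_dedupe_removes_duplicates():
    df = pd.DataFrame({"a": [1, 1, 2]})
    assert len(dedupe(df)) == 2
```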
It might be using it wrong but I'd qualify that as a bug or mistake, not a hallucination.
Is it likely we have different ideas of what "hallucination" means?
> tests wouldn't be protection against most forms of hallucinations.
Sorry, that's a stronger condition than I intended to communicate. I agree, tests are a good mitigation strategy. We use them for similar reasons. But I'm saying that passing tests is insufficient to conclude the code is hallucination-free.
My claim is more along the lines of "passing tests doesn't mean your code is bug free," which I think we can all agree is a pretty mundane claim?
> Is it likely we have different ideas of what "hallucination" means?
I agree, I think that's where our divergence is. In that case, let's continue over here[0] (linking in case others are following). I'll add that I think we're going to run into the problem of what we consider to be in distribution; for my part, I think coding is in distribution.
Haven't you effectively built a system that detects and removes those specific kinds of hallucinations, repeating the process whenever one is detected, before the result is presented to you?
So you're not seeing hallucinations in the same way that Van Halen isn't seeing the brown M&Ms: they've been removed; it's not that they never existed.
I think systems integrated with LLMs that help spot and eliminate hallucinations - like code execution loops and search tools - are effective at reducing the impact of hallucinations in how I use models.
That's part of what I was getting at when I very clumsily said that I rarely experience hallucinations from modern models.
On multiple occasions, Claude Code has claimed it completed a task when it actually just wrote mock code. It will also answer questions with certainty (e.g. where a value is being passed), when in reality it is making it up. So if you haven't been seeing hallucinations with Opus/Sonnet, you probably aren't looking deep enough.
This is because you haven't given it a tool to verify the task is done.
TDD works pretty well: have it write even the most basic test first (or go full artisanal and write it yourself), then ask it to implement the code.
I have a standing order in my main CLAUDE.md to "always run `task build` before claiming a task is done". All my projects use Task[0] with a pretty standard structure where build always runs lint + test before building the project.
With a semi-robust test suite I can be pretty sure nothing major broke if `task build` completes without errors.
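For anyone unfamiliar with Task, the setup is roughly along these lines - a hypothetical Taskfile.yml sketch, where the golangci-lint / go commands are just placeholders for whatever your project's actual lint, test, and build steps are:

```yaml
# Hypothetical Taskfile.yml sketch; swap the commands for your own toolchain.
version: '3'

tasks:
  lint:
    cmds:
      - golangci-lint run ./...
  test:
    cmds:
      - go test ./...
  build:
    deps: [lint, test]  # build never runs without lint + test passing first
    cmds:
      - go build ./...
```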
Yes, though it's an easier hallucination to solve - that is, if you know what to look for, and that's kinda the problem. Truth is complex, lies are simple. More accurately, truth has infinite complexity, and the big question is what's "good enough". The answer is a moving target.
I think if you ask o3 any math question that is beyond its ability, it will say something incorrect somewhere in the output with almost 100% probability. Similarly, if you ask it to use the literature to resolve some non-obvious question, it often hallucinates results that aren't in the paper.