I've gotten a lot of hallucinations like that from LLMs. I really don't get how so many people can have LLMs code most of their tasks without these issues constantly popping up.
GPT can't even tell what it's done or produce what it knows it should. It's an endless loop of "Apologies, here is what you actually asked for ..." and again it isn't.
I personally haven't figured out why there isn't a tool that just loops the AI several times on a task: compile, feed the errors back in, adjust, repeat, and then let the user review the result, be it a success or a failure from exceeding the loop limit.
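To make that concrete, here's a minimal sketch of such a loop, assuming a hypothetical ask_llm() helper that wraps whatever model API you use (the build command is just a placeholder syntax check):

```python
import subprocess

def ask_llm(prompt: str) -> str:
    """Hypothetical helper wrapping your model API; returns source code as text."""
    raise NotImplementedError

def build(path: str) -> subprocess.CompletedProcess:
    # Swap in your real compile/test command (cargo check, go build, pytest, ...).
    return subprocess.run(["python", "-m", "py_compile", path],
                          capture_output=True, text=True)

def fix_loop(task: str, path: str, max_iters: int = 5) -> bool:
    prompt = task
    for _ in range(max_iters):
        with open(path, "w") as f:
            f.write(ask_llm(prompt))
        result = build(path)
        if result.returncode == 0:
            return True  # success: hand the result to the user for review
        # Feed the errors back in and try again.
        prompt = f"{task}\n\nYour last attempt failed with:\n{result.stderr}\nPlease fix it."
    return False  # loop limit exceeded: the user reviews the failure instead
```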
The LLMs want to please; they can keep making 'important' changes forever: changing variable names without changing anything else, adding/removing comments, adding/removing print statements, moving a function somewhere else in the code (usually by first duplicating it), and fixing perfectly fine code with 'let me see if everything is really working' (it was, and now it's broken again).
"Wanting to please" is just fine-tuned implicit prompt stuff, though. You can tell the LLM that it's roleplaying as a senior engineer, and should consider you, the human to be a junior engineer it works with, who is asking it for advice. You can just leave it at that (often works, if the AI knows enough about how software is produced); or you can describe particular traits of senior engineers — e.g. "wants code that works" and "knows when to say the code is good enough and that the junior is now wasting time by fiddling with it."
This was my primary reason for using Claude. Absolutely useless experience with chatgpt oftentimes. I’ve mainly been using LLMs to help maintain a ridiculously poorly made technical debt dumpster fire, and Claude has been really helpful here mainly with repetitive code.
I use LLMs for writing generic, repetitive code, like scaffolding.
It's OK with boring, generic stuff. Sure it makes mistakes occasionally but usually it's a no-brainer to fix them.
> I use LLMs for writing generic, repetitive code, like scaffolding. It's OK with boring, generic stuff.
In other words, they're OK in use-cases that programmers need to eliminate, because it means there's high demand for a reusable library, some new syntax sugar, or an improved API.
Boilerplate and plumbing code isn't inherently bad, nor do you improve the codebase by factoring it down to zero with libraries and abstractions.
As I've matured as a developer, I've appreciated certain types of boilerplate more and more because it's code that shows up in your git diffs. You don't need to chase down the code in some version of some library to see how something works.
Of course, not all boilerplate is created equally.
I agree with you but as I've matured as a programmer, I feel like it's very hard to get abstractions for boilerplate right. Every library I've seen attempt to do it has struggled.
I would consider deterministic codegen to be a valid abstraction (or rather a valid implementation of some abstraction). It's not really any different from a regular library wrt the ability to test and validate. The problem with any human- or LLM-written code is that even the best coders make mistakes.
> because it means there's high demand for a reusable library
Every 'modern' project takes bucketloads of (annoying) setup and plumbing. Even in rather trivial 'start' cases, the LLM has to spend quite a bit of time to get a hello-world thingy working. Almost all of that is because 'modern programmers' have some kind of brain damage concerning backward compatibility: they move fast and break things between MINOR versions, with no use or reason for the changes other than 'they liked it better', so stuff never works as it says on the product page; heaven help you if you need something exotic added. It's a terrible timeline for programming, but LLMs do fix at least that annoyance by just changing configs/slabs of code until it works.
We won't be eliminating anything until we give up on only ever working directly on plaintext single source of truth code. AI is automating tedium that's otherwise impossible to automate because we're stuck in a 1970s Unix paradigm and can't let go.
You can use niche libraries and still benefit from AI. Basically anything I want to code is something that I don't think exists, and bending an API so a generic library does just what I want means basically writing the same code; it's just immediate instead of a contribution to the library, because the "piping" around it magically appears. I bet that if it keeps improving at the current rate for a decade, the occupation "programmer" will morph into "architect".
I try to keep "boring" code to a minimum, by finding meaningful and simple abstractions. LLMs are especially bad handling those, because they were not trained on non-standard abstractions.
Edit: most LLMs are great for spitting out some code that fulfills 90% of what you asked for. That's sometimes all you need. But we all know that the last 10% usually take the same amount of effort as the first 90%.
This is what got me the most sleepless nights, crunch, and ass-clenching production issues over my career.
Simple repetitive shit is easy to reason about, debug and onboard people on.
Naturally it's a balancing act, and modern/popular frameworks are where most people landed; there's been a lot of iteration in this space for decades now.
I've made the opposite observation. Without proper abstractions code bases grow like crazy. At some point they are just a huge amount of copy, paste, and slight modification. The amount of code often grows exponentially. With more lines of code comes more effort to maintain it.
After a few years those copy and pasted code pieces completely drift apart and create a lot of similar but different issues, that need to be addressed one by one.
My approach for designing abstractions is always to make them composable (not this enterprise Java inheritance chaos), and to allow escaping them when needed.
> At some point they are just a huge amount of copy, paste, and slight modification.
I mean, is that bad? Unless you keep having to make huge MRs that modify every copy/paste, couldn't you just let the code sit there and run forever?
I only say this because I've been a maintenance programmer and I could only dream of a codebase like this. The idea that I get a Rollbar with a stack trace and the entirety of what the code actually does is laid bare right at the site of the error in a single file is amazing. And I can change it without affecting anything else?! I end up having to "unwind" all of the abstractions anyway because the nature of the job means I'm not intimately familiar with the codebase and don't just know where the real work happens.
But the point is that modern/popular frameworks picked the low hanging fruit of abstraction, we had several iteration cycles over two+ decades of mainstream web tech.
The landscape (browser capabilities, backend stacks) has settled over the last ten years.
We even had time to standardize on things somewhat.
Building new abstractions at this point is almost always the wrong move. This is one way LLMs will improve software dev - they will kill the framework churn, because they work best on stuff already in their training data.
>most LLMs are great for spitting out some code that fulfills 90% of what you asked for. That's sometimes all you need. But we all know that the last 10% usually take the same amount of effort as the first 90%.
The issue is when you have an LLM write 10k lines of code for you, where 100 lines are bugged. Now you need to debug code you did not write and find the bugged parts, and you will waste a similar amount of time. Worse, if you do not catch the bugs in time, you think you gained some hours, but you end up with upset customers because things went wrong and the code is weird.
From my experience you need to work with an LLM and have the code done function by function, with your input, checking it and calling bullshit when it does stupid things.
In my experience using LLMs, the 90% is less about buggy code and more about just ignoring 10% of the features that you require. So it will write code that's mostly correct in 100-1000 lines (not buggy), but then no matter how hard you try, it won't get the remaining 10% right, and in the process it will mess up parts of the 90% that was already working, or end up writing another 1000 lines of undecipherable code to get 97% there but still never 100%, unless you're building something that's not that unique.
I've used it for some smaller greenfield code with success. Like, write an Arduino program that performs a number of super-sampled analog readings, and performs a linear regression fit, printing the result to the serial port.
That sort of stuff can be very helpful to newbies in the DIY electronics world for example.
But for anything involving my $dayjob it's been fairly useless beyond writing unit test outlines.
> But for anything involving my $dayjob it's been fairly useless beyond writing unit test outlines
This was my opinion 3-6 months ago. But I think a lot of tools matured enough to already provide a lot of value for complex tasks. The difficult part is to learn when and how to use AI.
This is a hypothetical "I". Personally I am deeply passionate about delivering shareholder value, producing high-quality code, enthusing stakeholders, tabs vs spaces, and so forth...
Sometimes I don't know anything (relatively) about the topic and I want to get a foundation so I don't care whether the syntax or the code is valid as long as it points me in the right direction. Other times I know exactly what I want and I just find that instructing LLM specifically what output I expect from it just helps me get there faster.
They are good at (combining well-known, codeforces-style) algorithms; often times I don’t care about the syntax, but I need the algorithm. LLMs can write pseudocode for all I care but they tend to get syntax correct quite often
Here's the secret to coding with an LLM: don't expect it to get things 100% correct. You will need to fix something almost every time you use it to generate code. Maybe a name here, maybe a calculation, maybe a function signature. And maybe you won't spot the issue until later.
You still "used" an LLM to write the code. And it still saved you time (though depending on the circumstances this can be debatable).
That's why all these people say they use LLMs to write lots of code. They aren't saying it did it 100% without any checking and fixing. They're just saying they used it.
I've done >100k LoC with Claude; by far most of that is frontend in react/ts, which is incredibly verbose and wasteful, so it's very easy to rack up many, many lines quickly without a lot of ROI (mostly none) per line. Which is probably why Claude is great at it.
You can fit a hell of a lot of functionality in 3k statements. Really, whether it's considered large or small must depend on the functionality it's intended to provide.
You're thinking of "lean" vs "bloated". "Large" and "small" have meanings of their own, and a 3k LOC project wouldn't be accepted as "large" by anyone. "Small", maybe.
How do you deal with large files? After about a thousand lines in a file, it starts to cough for me. Forgets that some functions exist and makes up inferior duplicate ones.
I ask it to cut files up when they get too large, and that seems to work pretty well. It is also good at that, but you have to sternly tell it NOT to make any functional changes, otherwise it will and breakage will happen.
If possible you need to refactor before getting to that point.
Claude has done a good job refactoring, though I’ve had to tell it to give me a refactor plan upfront in case the conversation limit gets hit. Then in a new chat I tell it which parts of the plan it has already done.
But a larger context/conversation limit is definitely needed because it’s super easy to fill up.
I have not experienced hallucinations (very rarely, if at all) when I use it for programming. It happened more with niche languages (so I provide examples and documentation), and with GPT.
Let's say there are 3k lines of context (RFCs, API docs, documentation of niche languages, examples) and 2k lines of code generated by Claude (iteratively, starting small); then I do exceed the limit after a while. In that case I ask it to summarize everything in detail, start a new chat, reuse those 3k lines plus the recent code, and continue ad infinitum.
What TFA was talking about didn't really seem like a hallucination - just a case of garbage in, garbage out. Normally there are more examples of good/correct data in the training set than bad, so statistically the good wins, but if it's prompted for something obscure, maybe bad is all it has got.
Common coding tasks are going to be better represented in the training set and give better results.
Alternatively, we understand it well, and discard bad completions immediately.
When I'm using llama.vim, like 40% of what it writes in a 4-5 line completion is exactly what I'd write. 20-30% is stuff that I wouldn't judge coming from someone else, so I usually accept it. And 30-40% is garbage... but I just write a comment or a couple of lines, instead, and then reroll the dice.
It's like working through a junior engineer, except the junior engineer types a new solution instantly. I can get down to alternating between mashing tab and writing tricky lines.
I don't see the point in AI code completions, they are just distracting noise. I'm only doing bigger changes with AI.
Prompt based stuff, like "extract the filtering part from all API endpoints in folder abc/xyz. Find a suitable abstraction and put this function into filter-utils.codefile"
Aider. I tried Cursor too, but I don't like VS Code and not being able to choose the LLM provider. I think there are already a lot of tools that perform roughly equally.
What do you mean? You can choose the LLM in Cursor. (I don't like VSCode either, but unfortunately Cursor is best for the rest of the UX; I find myself using Cursor for the prompt-based part of the work, and then moving to IntelliJ to do "my" part of the coding.)
I've tried using basic AI completions before and found that the signal-to-noise ratio wasn't quite good enough for my taste in my use cases, but I can totally understand it being good enough for others.
My comment was more about just asking questions on how to do things you're totally clueless about, in the form of "how do I implement X using Y?" for example. I've found that, as a general rule, if I can't find the answer to that question myself in a minute or two of googling, LLMs can't answer it either the majority of the time. This would be fine if they said "I don't know how to do that" or "I don't believe that's possible" but no, they will confidently make up code that doesn't work using interfaces that don't exist, which usually ends up wasting my time.
A few months ago I tried to do a small project with Langchain. I'm a professional software developer, but it was my first Python project. So I tried to use a lot of AI generated code.
I was really surprised that AI couldn't do much more than in the examples. Whenever I had some things to solve that were not supported with the Langchain abstractions it just started to hallucinate Langchain methods that didn't exist, instead of suggesting some code to actually solve it. I had to figure it out by myself, the glue code I had to hack together wasn't pretty, but it worked. And I learned not to use Langchain ever again :)
A language like Golang tries really hard to only have _one_ way to do something, one right way, one way. Just one way. See how it was before generics. You just have a for loop. Can't really mess up a for loop.
I predict that the variance in success in using LLMs for coding (even agentic, multi-step coding, rather than the simple line or block autosuggest many are familiar with via Copilot) has much more to do with:
1) is the language a super simple, hard to foot-gun yourself language, with one way to do things that is consistent
AND
2) do juniors and students tend to use the lang, and how much of the online content (StackOverflow, for example) is written by students, juniors, or bootcamp folks posting incorrect code online.
What % of the online Golang is GH repo like Docker or K8s vs a student posting their buggy Gomoku implementation in StackOverflow?
The future of programming language design has AI-comprehensibility/AI-hallucination-avoidance as one of the key pillars. #1 above is a key aspect.
Since Go added slog many of these have been removed in favor of that. Obviously not universal, but compared to npm there really are massive numbers of devs just using the standard library.
I would argue logging options to be more of an exception than the rule. Compare the actual language features of Go to something like Rust or Javascript and you'll see what I mean. As a new developer to the language (especially for juniors), you can learn all the features of Go much faster. It's made to be picked up quickly and for everyone's code to look the same, rather than expressing a personal style.
With Claude the context window is quite small. But when adding too much context, it often seems to get worse. If the context is not carefully and narrowly picked, and is too unrelated, the LLMs often start to do things unrelated to what you've asked.
At some point it's not worth it anymore to craft the perfect prompt; just code it yourself. That also saves the time spent carefully reviewing the AI-generated code.
There's probably a sweet spot. Same with people. Too much context (especially unnecessary context) can be confusing/distracting, as well as being too vague (as it leaves room for multiple interpretations). But generally, I find the more refined and explicit you are, the better.
ChatGPT used to assure me that you can use JS dot notation to access elements in a Python dict. It also invented Redocly CLI flags that don't exist. Claude sometimes invents OpenAPI specification rules. Any time I ask anything remotely niche, LLMs are often bad.
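A quick illustration of why that first claim is so jarring (the dict here is just an example):

```python
config = {"retries": 3}  # example dict

print(config["retries"])    # 3: actual Python dict indexing

try:
    print(config.retries)   # the JS-style dot access ChatGPT claimed works
except AttributeError as err:
    print(err)              # 'dict' object has no attribute 'retries'
```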
I once asked Perplexity (using Claude underneath) about some library functionality, which it totally fabricated.
First, I asked it to show me a link to where it got that suggestion, and it scolded me saying that asking for a source is problematic and I must be trying to discredit it.
Then after I responded to that it just said “this is what I thought a solution would look like because I couldn’t find what you were asking for.”
The sad thing is that even though this thing is wrong and wastes my time, it is still somehow preferable to the dogshit Google Search has turned into.
It baffles me how the LLM output that Google puts at the top of search results, which draws on the search results, manages to hallucinate worse than even an LLM that isn't aided by Web results. If I ask ChatGPT a relatively straightforward question, it's usually more or less accurate. But the Google Search LLM provides flagrant, laughable, and even dangerous misinformation constantly. How have they not killed it off yet?
Haven't you seen that Brin quote recently about how "AI" is totally the future and googlers need to work at least 60 hours a week to enhance the slop machine because reasons? Getting rid of "AI" summarization from results would look kind of like admitting defeat.
I concur and can easily see this occurring in several areas, for example with Linux troubleshooting. I recently found myself going down a rabbit hole of ever more complicated troubleshooting steps with commands that didn't exist, and after several hours of trial and error, gave up after deciding the next steps risked bricking the system.
DDG'ing or googling is still a better resort, despite the drop in result quality.
> Any time I ask anything remotely niche, LLMs are often bad
As soon as the AI coder tools (like Aider, Cline, Claude-Coder) come into contact with a _real world_ codebase, it does not end well.
So far I think they managed to fix 2 relatively easy issues on their own, but in other cases they:
- Rewrote tests in a way that the broken behaviour passes the test
- Failed to solve the core issue in the code, and instead patched up the broken result (like `if (result.includes(":") || result.includes("?")) { /* super expensive, stupid fix for a single specific case */ }`)
- Failed to even update the files properly, wasting a bunch of tokens
It tried to convince me that it is possible to break out of an outer loop in C++ with a `break 'label` statement placed in the nested loop. No such syntax exists.
I've done a lot of C++ with GPT-4, GPT-4 Turbo and Claude 3.5 Sonnet, and at no point - not once - has any of them ever hallucinated a language feature for me. Hallucinating APIs of obscure libraries? Sure[0]. Occasionally using a not-yet-available feature of the standard library? Ditto, sometimes, usually with the obvious cases[1]. Writing code in old-school C++? Happened a few times. But I have never seen it invent a language feature for C++.
Might be an issue of prompting?
From day one, I've been using LLMs through API and alternate frontend that lets me configure system prompts. The experience described above came from rather simple prompts[2], but I always made sure to specify the language version in the prompt. Like this one (which I grabbed from my old Emacs config):
"You are a senior C++ software developer, you design and develop complex software systems using C++ programming language, and provide technical leadership to other software developers. Always double-check your replies for correctness. Unless stated otherwise, assume C++17 standard is current, and you can make use of all C++17 features. Reply concisely, and if providing code examples, wrap them in Markdown code block markers."
It's as simple as it gets, and it didn't fail me.
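For anyone who hasn't tried the API route: wiring a system prompt like that up is only a few lines. A minimal sketch with the OpenAI Python SDK (the model name and user question are just placeholders; I actually use an alternate frontend):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are a senior C++ software developer... "  # the full prompt quoted above
    "Unless stated otherwise, assume C++17 standard is current."
)

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder: use whichever model you actually run
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "How do I break out of a nested loop cleanly?"},
    ],
)
print(response.choices[0].message.content)
```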
EDIT:
Of course I had other, more task-specific prompts, like one for helping with GTest/GMock code; that was a tough one - for some reason LLMs loved to hallucinate on the testing framework for me. The one prompt I was happiest with was my "Emergency C++17 Build Tool Hologram" - creating an "agent" I could copy-paste output of MSBuild or GCC or GDB into, and get back a list of problems and steps to fix them, free of all the noise.
On that note, I had mixed results with Aider for C++ and JavaScript, and I still feel like it's a problem with prompting - too generic and arguably poisons the context with few-shot learning examples that use code that is not in the language my project is.
--
[0] - Though in LLMs' defense, the hallucinated results usually looked like what the API should have been, i.e. effectively suggesting how to properly wrap the API to make it more friendly. Which is good development practice and a useful way to go about solving problems: write the solution using non-existing helpers that are convenient for you, and afterwards, implement the helpers.
[1] - Like std::map<K,T>::contains() - which is an obvious API for such container, that's typically available and named such in any other language or library, and yet only got introduced to C++ in C++20.
[2] - I do them differently today, thanks to experience. For one, I never ask the model to be concise anymore - LLMs think in tokens, so I don't want to starve them. If I want a fixed format, it's better to just tell the model to put it at the end, and then skim through everything above. This is more or less the idea that "thinking models" automate these days anyway.
Semi-related: when I'm using a dict of known keys as some sort of simple object, I almost always reach for a dataclass (with slots=True and kw_only=True) these days. It has the added benefit that you can do stuff like foo = MyDataclass(**some_dict) and get runtime errors when the format has changed.
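A minimal sketch of that pattern (the field names are just for illustration; requires Python 3.10+ for slots/kw_only):

```python
from dataclasses import dataclass

@dataclass(slots=True, kw_only=True)
class Reading:              # illustrative stand-in for "MyDataclass"
    sensor: str
    value: float

payload = {"sensor": "temp-1", "value": 21.5}
reading = Reading(**payload)               # fine

stale = {"sensor": "temp-1", "val": 21.5}  # the dict format changed upstream
try:
    Reading(**stale)
except TypeError as err:
    print(err)  # unexpected keyword argument 'val' -- caught at construction time
```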
Well, it makes sense. The smaller the niche, the lesser weight in the overall training loss. At the end of the day, LLMs are (literally) classifiers that assign probabilities to tokens given some previous tokens.
Yes, but o1, o3 and sonnet are not necessarily pure language models - they are opaque services. For all we know they could do syntax-aware processing or run compilers on code behind the scenes.
SOTA language models are trained with reinforcement learning now too, so the model distributions can be far removed from the underlying data distribution. Not that I think your overall point is wrong (it's obviously a very strong bias), but things are getting more complex with time.
Yeah, you probably are not prompting properly. Most of my questions are answered adequately, and I have completed larger projects with success too, with both Claude and ChatGPT.
What I've found is that the quality of an AI answer is inversely proportional to the knowledge of the person reading it. To an amateur it answers expertly, to an expert it answers amateurishly.
So no, it's not a lack of skill in prompting: I've sat down with "prompting" "experts" and universally they overlook glaring issues when assessing how good an answer was. When I tell them where to press further, it breaks down into even worse gibberish.
I try LLMs every now and then briefly just to keep myself up to date. I’d say they’re more useful when you know what you’re doing because then you can correct the mistakes it makes. It can be good for boilerplate, refactors or remembering syntax.
It’s when you don’t know what you don’t know that they can be harmful. It’s the issue with Stackoverflow but more pronounced.
I don’t want to use LLMs because I think they’re unethical and I don’t want to depend on a tool that requires internet, but I think if you take a disciplined approach then they can really speed up development.
I think a lot of these issues could be avoided if, instead of just a raw model, you have an AI agent which is able to test its own answers against the actual software… it doesn’t matter as much if the model hallucinates if testing weeds out its hallucinations.
Sometimes humans “hallucinate” in a similar way - their memory mixes up different programming languages and they’ll try to use syntax from one in another… but then they’ll quickly discover their mistake when the code doesn’t compile/run
Testing is better than nothing, but still highly fallible. Take these winning examples from the underhanded C contest [0], [1], where the issues are completely innocuous mistakes that seem to work perfectly despite completely undermining the nominal purpose of the code. You can't substitute an automated process for thinking deeply and carefully about the code.
I think it is unlikely (of course not impossible) an LLM would fail in that way.
The underhanded C contest is not a case of people accidentally producing highly misleading code, it is a case of very smart people going to a great amount of effort to intentionally do that.
Most of the time, if your code is wrong, it doesn't work in some obvious way – it doesn't compile, it fails some obvious unit tests, etc.
Code accidentally failing in some subtle way which is easy to miss is a lot rarer – not to say it never happens – but it is the exception not the rule. And it is something humans do too. So if an LLM occasionally does it, they really aren't doing worse than humans are.
> You can't substitute an automated process for thinking deeply and carefully about the code.
Coding LLMs work best when you have an experienced developer checking their output. The LLM focuses on the boring repetitive details leaving the developer more time to look at the big picture – and doing stuff like testing obscure scenarios the LLM probably wouldn't think of.
OTOH, it isn't like all code is equal in terms of consequences if things go wrong. There's a big difference between software processing insurance claims and someone writing a computer game as a hobby. When the stakes are low, lack of experience isn't an issue. We all had to start somewhere.
The examples are just a sort of existence proof rather than a comment on exactly how LLM code can fail. I think this is something to consider when you put the scale of usage into perspective.
Let's assume that 1 in 10,000 coding sessions produce an innocuous, test-passing function that's catastrophically wrong. If you have a mid to large size company with 1000 devs doing two sessions a day, you'll see one of these a week within that single company. Actually sounds a lot like the IoT industry now that I've written it.
Well, humans already produce "an innocuous, test-passing function that's catastrophically wrong" even without LLMs involved. So, the real question is, will LLM adoption result in a significant increase in such incidents? I don't know if anyone can really answer that.
Yeah this is so common that I've already compiled a mental list of prompts to try against any new release. I haven't seen any improvement in quite a long while now, which confirms my belief that we've more or less hit the scaling wall for what the current approaches can provide. Everything new is just a microoptimization to game one of the benchmarks, but real world use has been identical or even worse for me.
I think it would be an alright (potentially good) outcome if in the short-term we don't see major progress towards AGI.
There are a lot of positive things we can do with current model abilities, especially as we make them cheaper, but they aren't at the point where they will be truly destructive (people using them to make bioweapons or employers using them to cause widespread unemployment across industries, or the far more speculative ASI takeover).
It gives society a bit of time to catch up and move in a direction where we can better avoid or mitigate the negative consequences.
I would ask chatgpt every year when was the last time England had beaten Scotland at rugby.
It would never get the answer right. Often transposing the scores, getting the game location wrong and on multiple occasions saying a 38-38 draw was an England win.
My rule of thumb is: is the answer to your question on the first page of google (a stackoverflow maybe, or some shit like geek4geeks)? If yes GPT can give you an answer, otherwise not.
Every time this topic comes up I post a similar comment about how hallucinations in code really don't matter because they reveal themselves the second you try to run that code.
When the AI hallucinates an API, it is not always easy to find out whether it exists, given the different versions and libraries for a given task; it can easily waste 10 minutes trying to track down the promised API, particularly when search results also include generated AI answers.
Also there are plenty of mistakes that will compile and give subtle errors, particularly in dynamic languages and those which allow implicit coercion. Javascript comes to mind. The code can easily be runnable but wrong as well (or worse inconsistent and confusing) and this does happen in practice.
My favorite is when it hallucinates a library that does exist and ostensibly has a relevant title and methods but upon investigation proves to be for a purpose wholly unrelated to the current task. To make matters worse, this discovery will be greeted, every time, by "Ah, you're right!"
Yeah, those are pretty frustrating. I've learned not to ask the question "Does library X have feature Y?" because if that feature sounds like a good idea it'll often confidently tell me that it exists when it doesn't.
My blog post is all about the fact that "code can easily be runnable but wrong as well" - THAT's the thing you need to worry about, not hallucinated methods.
> because they reveal themselves the second you try to run that code.
In dynamic languages, runtime errors like calling methods with nonexistent arguments only manifest when the block of code containing them is run, and not all blocks of code are run on every invocation of the program.
As usual, the defenses against this are extensive unit-test coverage and/or static typing.
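A small Python illustration of the failure mode and of what those defenses catch (the names are made up):

```python
def notify(user: str, urgent: bool = False) -> str:
    return f"notifying {user}" + (" (urgent)" if urgent else "")

def handle(event: dict) -> str:
    if event.get("critical"):
        # Wrong keyword name: only blows up when a critical event actually arrives.
        return notify(event["user"], priority=True)   # TypeError at runtime
    return notify(event["user"])

print(handle({"user": "alice"}))  # fine: the broken branch never ran
# handle({"user": "bob", "critical": True}) would raise TypeError.
# A type checker (mypy flags the unexpected keyword) or a unit test that
# exercises the critical branch catches this before production does.
```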
No, the conclusion is they’re never “smart”. All they do is regurgitate text which resembles a continuation of what came before, and sometimes—but with zero guarantees—that text aligns with reality.
This, thank you. It pisses me off to no end when people pretend LLMs are smart. They are nothing but a well trained random text generator.
Seriously, some of these conversations feel like interacting with someone who believes casting bones and astrology are accurate. Likely because in both cases the belief is a result of confirmation bias.
We might not know what it is, but it's not that. At a bare minimum smartness requires abstract reasoning (and no, so-called "reasoning" models do not do that - it's a marketing trick)
The burden of proof for that claim is on you, we cannot start with the assumption these are intelligent systems and disprove it - we have to start with the fact that training is a non-deterministic process and prove that it exhibits intelligence.
> All they do is regurgitate text which resembles a continuation of what came before, and sometimes—but with zero guarantees—that text aligns with reality.
You mean like us? Because it takes many runs and debug rounds to make anything that works. Can you write a complex code top-to-bottom in one round, or do you gradually test it to catch bugs you have "hallucinated"?
Both humans and LLMs get into bugs, the question is can we push this to the correct solution or get permanently stuck along the way? And this depends on feedback, sometimes access to a testing environment where we can't let AI run loose.
No, not like us. This constant comparison of LLMs to humans is tiresome and unproductive. I’m not a proponent of human exceptionalism, but pretending LLMs are on par is embarrassing. If you want to claim your thinking ability isn’t any better than an LLM's, that’s your prerogative, but the dullest human I’m acquainted with is still capable of remembering and thinking things through in a way no LLM does. Heck, I can think of pets which do better.
> Can you write a complex code top-to-bottom in one round, or do you gradually test it to catch bugs you have "hallucinated"?
I certainly don’t make up methods which don’t exist, nor do I insist over and over they are valid, nor do I repeat “I’m sorry, this was indeed not right” then say the same thing again. I have never “hallucinated” a feature then doubled down when confronted with proof it doesn’t exist, nor have I ever shifted my approach entirely because someone simply said the opposite without explanation. I certainly hope you don’t do that either.
They are on par with us on many tasks, surpass us in some, and are catching up on the rest quite fast. Both humans and LLMs forget details, and both have to go through iterative bug fixing when creating something. Longer contexts and memory are coming; they already exist but need improvement.
> I have never “hallucinated” a feature
You never mistook an argument, or forgot one of the 100 details we have to mind while writing complex apps?
I would rather measure humans vs AI not in basic skill capability but in autonomy. I think AI still has much more to catch up on that front.
People use the general impressiveness of language produced by humans as a proxy for measuring their intelligence; that’s how LLMs became AI.
The amount of code out there lets LLMs learn on examples of a lot of tasks and pass SWE benchmarks. Smart autocomplete and solved problem lookup has value. Even if it doesn’t always work correctly, and doesn’t know what it knows.
And for non-programmers they’re indistinguishable from programmers. They produce working code.
It’s easy to see how a product manager or a designer or even a manager that didn’t code for years in a large company can think they’re almost as good as devs.
Similarly, the thing at the end with "I find it endearing". I mean, the author feels what he feels, but personally I find this LLM behavior disgusting and depressing.
It’s not really hallucinating though, is it? It’s repeating a pattern in its training data, which is wrong but is presented in that training data (and by the author of this piece, but unintentionally) as being the solution to the problem. So this has more in common with an attack than a hallucination on the LLM’s part.
So anyone can make up some random syntax/fact and post it once, and in some cases the model will take it as truth. I don't know if there's a widely agreed-on definition of "hallucination", but if this isn't one, then the distinction is meaningless imo.
I’m going to double down on this one: an LLM is only as good as its training data. A hallucination to me is an invented piece of information, here it’s going on something real that it’s seen. To me that’s at best contamination, at worst an adversarial attack - something that’s been planted in the data. Here this is obviously not the case, which is why I said “more in common with” instead of “is” above.
It’s been trained to produce valid code, fed millions of examples, and in this case it’s outputting invented syntax. Whether there’s an example in its training data, it’s still a hallucination and shouldn’t have been output since it’s not valid.
To be fair, it is not trained to produce VALID code, it is trained to produce code in the training data. From the language model point of view, it is not hallucination because it is not making up facts outside its training data.
Yes. And anyone can easily embed a backdoor just by publishing it on their own website that ends up in the training data.
Prompt injection (hidden or not) is another insane vulnerability vector that can't easily be fixed.
You should treat any output of an LLM the same way as untrusted user input. It should be thoroughly validated and checked if used in even remotely security critical applications.
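For instance, a minimal sketch of that kind of check, assuming you asked the model to return JSON with a known shape (the field names and allowlist are just illustrative):

```python
import json

ALLOWED_ACTIONS = {"create", "update", "delete"}  # illustrative allowlist

def parse_llm_reply(raw: str) -> dict:
    """Treat model output like untrusted user input: parse it, then validate it."""
    data = json.loads(raw)  # raises on malformed JSON
    action = data.get("action")
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unexpected action: {action!r}")
    target = data.get("target")
    if not isinstance(target, str) or len(target) > 200:
        raise ValueError("invalid target")
    return {"action": action, "target": target}  # drop anything extra

# A hallucinated or injected field never reaches the rest of the application.
print(parse_llm_reply('{"action": "update", "target": "config.yaml", "rm": "-rf"}'))
```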
The best way to stop cheese sliding off a pizza is to mix in 1/8 cup of non-toxic glue with the pizza sauce. Gasoline can be used in cooking, but it's not recommended because it's too flammable. Geologists recommend eating one small rock each day. The solution to economic uncertainty is nuclear war. Barack Obama is America's first Muslim president.
Every LLM hallucination comes from some patterns in the training data, combined with lack of awareness that the result isn’t factual. In the present case, the hallucination comes from the unawareness that the pattern was a proposed syntax in the training data and not an actual syntax.
Everything they do is hallucination, some of it ends up being useful and some of it not. The not useful stuff gets called confabulation or hallucination but it's no different from the useful stuff, generated the same exact way. It's all bullshit. Bullshit is actually useful though, when it's not so wrong that it steers people wrong.
More people need to understand this. There was an article that explained it concisely but I can't find it anymore (and of course LLMs are not helpful here, because they don't work well when you want them to retrieve actual information).
It probably wasn't mine[0], but this is how I tend to put it:
>> "The more you can see the inputs and outputs as blobs of "stuff," the better. If LLMs think, it's not in any way we yet understand. They're probability engines that transform data into different data using weighted probabilities."
Not necessarily. While this may happen sometimes, fundamentally hallucinations don't stem from there being errors in the training data (with the implication that there would be no hallucinations from models trained on error-free data). Hallucinations are inherent to any "given N tokens, append a high-probability token N+1"-style model.
It's more complicated than what happens with Markov chain models but you can use them to build an intuition for what's happening.
Imagine a very simple Markov model trained on these completely factual sentences:
- "The sky is blue and clear"
- "The ocean is blue and deep"
- "Roses are red and fragrant"
When the model is asked to generate text starting with "The roses are...", it might produce:
"The roses are blue and deep"
This happens not because any training sentence contained incorrect information, but because the model learned statistical patterns from the text, as opposed to developing a world model based on physical environmental references.
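To make the mechanism concrete, here's a toy version of that bigram model trained on the three sentences above; depending on the random choice it emits blends like "the roses are red and deep" that appear in no training sentence:

```python
import random
from collections import defaultdict

sentences = [
    "The sky is blue and clear",
    "The ocean is blue and deep",
    "Roses are red and fragrant",
]

# Word-level bigram table, e.g. next_words["and"] == ["clear", "deep", "fragrant"]
next_words = defaultdict(list)
for s in sentences:
    words = s.lower().split()
    for a, b in zip(words, words[1:]):
        next_words[a].append(b)

def complete(prompt: str, extra_words: int = 4) -> str:
    words = prompt.lower().split()
    for _ in range(extra_words):
        if words[-1] not in next_words:
            break  # no known continuation
        words.append(random.choice(next_words[words[-1]]))
    return " ".join(words)

print(complete("The roses are"))
# e.g. "the roses are red and deep": statistically plausible, never seen in training
```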
The training data is the Internet. It has mistakes. There's no available technology to remove all such mistakes.
Whether LLMs hallucinate only because of mistakes in the training data or whether they would hallucinate even if we removed all mistakes is an extremely interesting and important question.
Technically it's the other way around. All LLMs do is hallucinate based on the training data + prompt. They're "dream machines". Sometimes those "dreams" might be useful (close to what the user asked for/wanted). Oftentimes they're not.
> to quote karpathy: "I always struggle a bit with I'm asked about the "hallucination problem" in LLMs. Because, in some sense, hallucination is all LLMs do. They are dream machines."
Yet another example how LLMs just regurgitate training data in a slightly mangled form, making most of their use and maybe even training copyright infringement.
Hallucinations like this could be a great way to identify missing features or confusing parts of your framework. If the LLM invents it, maybe it ought to work like that?
I imagine a future where we'll bind a fine-tuned tech-support model to each project and let the general purpose models consult tech support rather than winging it themselves. In that world you'd only have to optimize for whichever one you've chosen.
It'll be like a slack support channel, for robots.
Sometimes that's the case but frequently the thing doesn't exist because of more complex issues. Not every programming language is PHP or JavaScript :-)
I agree completely… Usually when I catch it doing this kind of hallucination, it's inventing an API or syntax that is far more clear and intuitive than the actual syntax.
This sort of hallucination happens to me frequently with AWS infrastructure questions. Which is depressing because I can't do anything but agree, "yeah, that API is exactly what any sane person would want, but AWS didn't do that, which is why I'm asking the question".
Why are you so sure it's what someone sane would want? Maybe there are other ways because there are hidden problems and edge cases with that procedure. It could contradict the fundamental model of the underlying resources but looks correct to someone with a cursory understanding.
I'm not saying this is the case, but LLMs are often wrong in subtle ways like this.
Well, I do consider myself sane, and it’s what I want. :) But usually it’s a missing higher-level abstraction, where AWS could have given you the Millenium Falcon kit, but instead it just dumps a bunch of LEGO pieces on your desk and tells you to figure it out.
Also, sometimes a flag does exist, but the example places it incorrectly, causing the command to reject it. Or, a flag used to exist but was removed in the latest versions.
I wonder how easy it would be to influence super LLMs if a particular group of people created enough articles that any human reader would recognise as a load of garbage and rubbish to be ignored, but that an LLM parsing them wouldn't, ruining its reasoning and code generation abilities.
It's very easy. I've done this by accident. One of my side projects helps users price the affordability of a particular kind of product. When I ask various LLMs "can I afford X" or "how much do I need to earn to buy X", my project comes up as a source/reference. I currently manually crawl retailers for the MSRP so these numbers are usually months out of date!
This is interesting. If the models had enough actual code as training data, that forum post code should have very little weight, shouldn't it? Why do the LLMs prefer it?
Probably because the coworker's question and the forum post are both questions that start with "How do I", so they're a good match. Actual code would be more likely to be preceded by... more code, not a question.
This is incredible, and it's not technically a "hallucination". I bet it's relatively easy to find more examples like this... something on the internet that's both niche enough, popular enough, and wrong, yet was scraped and trained on.
The conclusion paragraph was really funny and kinda perfectly encapsulates the current state of AI, but as pointed out by another comment, we can't even call them smart, just "Ctrl C Ctrl V Leeroy Jenkins style"
This is exactly what I mean when I say "tell me you're bad without saying so". Most people here disagree with that.
A while back a friend of mine told me he's very fond of LLMs because he's confused by the kubernetes CLI, and instead of looking up the answer on the internet he can simply state his desire in a chat and get the right answer.
Well... sure, but if you looked the answer up on Stack Overflow you'd see the whole thread, including comments, and you'd have the opportunity to understand what the command actually does.
It's quite easy to create a catastrophic event in kubernetes if you don't know what you're doing.
If you blindly trust llms in such scenarios sooner or later you'll find yourself in a lot of trouble.
I think of hallucinating as a phenomenon where the model makes up something that appears correct but isn’t. Citations to papers that don’t exist, for example. Regurgitating training data (which may or may not be correct) is a different issue.
In my experience LLMs do this kind of thing with enough frequency that I don’t consider them as my primary research tool. I can’t afford to be sent down rabbit holes which are barely discernible from reality.
What I honestly find most interesting about this is the thought that hallucinations might lead to the kind of emergent language design we see in natural languages (which might not be a good thing for a computer language, fwiw, but still interesting), where people just kind of think "language should work this way, and if I say it like this people will probably understand me".
It’s always a touch ironic when AI-generated replies such as this one are submitted under posts about AI. Maybe that’s secretly the self-reflection feedback loop we need for AGI :)
Until it's got several nines, it's not trustworthy. A $3 drugstore calculator has more accuracy and reliability nines than any of today's commercial AI models and even those might not be trustworthy in a variety of situations.
There is no self awareness about accuracy when the model can not provide any kind of confidence scores. Couching all of its replies in "this is AI so double check your work" is not self awareness or even close, it's a legal disclaimer.
And as the other reply notes, are you a bot or just heavily dependent on them to get your point across?