Ask HN: Share your "LLM screwed us over" stories?
91 points by ATechGuy 14 days ago | 95 comments
Saw this today https://news.ycombinator.com/item?id=42575951 and thought that there might be more such cautionary tales. Please share your LLM horror stories for all of us to learn.



Guys, learning how to use LLMs is a major 21st-century skill. In fact, it's probably the biggest skill anyone can develop today. So please be a responsible driver and learn how to use them.

Here's one way to get the most mileage out of them:

1) Track the best and brightest LLMs via leaderboards (e.g. https://lmarena.ai/, https://livebench.ai/#/ ...). Don't use any s**t LLMs.

2) Make it a habit to feed in whole documents and ask questions about them, rather than asking the model to recall things from memory.

3) Ask the same question to the top ~3 LLMs in parallel (e.g. top-of-the-line Gemini, OpenAI, and Claude models); a rough sketch of this is at the end of this comment.

4) Compare the results and pick the best. Iterate on the prompt, question, and inputs as required.

5) Validate any key factual information via Google or another search engine before accepting it as a fact.

I'm literally paying for all three top AIs. It's been working great for my compute and information needs. Even if one hallucinates, it's rare that all three hallucinate the same thing at the same time. The quality has been fantastic, and intelligence multiplication is supreme.
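
For step 3, here is a minimal sketch of what that fan-out can look like, assuming the official OpenAI, Anthropic, and Google Python SDKs with API keys in the environment; the model names are placeholders and will drift:

    # Minimal sketch: send the same question to three providers in parallel and
    # print the answers side by side for comparison. Model names are placeholders;
    # OPENAI_API_KEY, ANTHROPIC_API_KEY, and GOOGLE_API_KEY are assumed to be set.
    import os
    from concurrent.futures import ThreadPoolExecutor

    import anthropic
    import google.generativeai as genai
    from openai import OpenAI

    QUESTION = "What are the trade-offs of AWS spot instances for a stateful service?"

    def ask_openai(q):
        client = OpenAI()
        r = client.chat.completions.create(
            model="gpt-4o", messages=[{"role": "user", "content": q}]
        )
        return r.choices[0].message.content

    def ask_claude(q):
        client = anthropic.Anthropic()
        r = client.messages.create(
            model="claude-3-5-sonnet-latest",
            max_tokens=1024,
            messages=[{"role": "user", "content": q}],
        )
        return r.content[0].text

    def ask_gemini(q):
        genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
        return genai.GenerativeModel("gemini-1.5-pro").generate_content(q).text

    if __name__ == "__main__":
        askers = {"openai": ask_openai, "claude": ask_claude, "gemini": ask_gemini}
        with ThreadPoolExecutor() as pool:
            futures = {name: pool.submit(f, QUESTION) for name, f in askers.items()}
        for name, fut in futures.items():
            print(f"--- {name} ---\n{fut.result()}\n")

Step 4 (comparing the answers) is still a manual read; the point is that disagreement between the three is a cheap hallucination signal.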


Perhaps the story that doesn't get told more often is how LLMs are changing how humans operate en masse.

When ChatGPT came out, I was increasingly outsourcing my thinking to LLMs. It took me a few months to figure out that this was actually harming me; I'd lost some of my ability to think things through.

The same is true of coding assistants; sometimes I disable the in-editor suggestions when I notice that my coding has atrophied.

I don't think this is necessarily a bad thing, as long as LLMs are ubiquitous, proliferate throughout society, and are extremely reliable and accessible. But they are not there today.


People said the same thing about IntelliSense.

The LLM should free you from thinking about relatively unimportant details of programming in the small and allow you to experiment with how things fit together.

If you use the LLM to churn out garbage as is, it's essentially the same as performing random auto-completes with IntelliSense and just going with whatever pops out. Yes, that will hurt you professionally.


And nearly two and a half thousand years ago Socrates said the same thing about writing:

> For this invention will produce forgetfulness in the minds of those who learn to use it, because they will not practice their memory. Their trust in writing, produced by external characters which are no part of themselves, will discourage the use of their own memory within them.

https://www.historyofinformation.com/detail.php?entryid=3894

I love to mention this whenever someone takes the “this new tech is bad” angle. We have to be careful not to just have the bias of “whatever existed when I was born is obvious, anything new that I don’t understand is bad.”


I understand that writing was very much a huge net positive for humanity, but that doesn't mean Socrates was in any way wrong.

To my mind, though, LLMs are just plain inaccurate and I wouldn't want them anywhere near my codebase. I think you gotta keep sharp, and a person who runs well does not walk around with a crutch.


I often "feel" like I have all the details in my heads and LLMs effectively makes me type faster. But that's never the case. It's only when I start writing the code myself line by line that I discover blindspots.

For example, in freshman year, I remember reading previous exams' solutions (rather than drilling practice problems) to prep for a calculus midterm. It felt good. It felt easy. It felt like I could easily get the answers myself. I barely passed.

My point is that LLMs are probably doing more inference/thinking than we give them credit for. If they're making my life easier, they have to be atrophying something, right?


> People said the same thing about IntelliSense.

Personally I think it's true. I do the majority of my coding with IntelliSense off. It has its uses, but flexing the mind and maintaining a clear mental map of how the code is organized is more helpful.


That seems silly to me. I certainly don't want to outsource my thinking, but if I type '.fo' and it ghosts up a `.forEach(`, that's just time saved. It also lets me see what properties exist on the object - it expands what I can know in the context of my cursor at any point in time. It saves me a trip to the documentation to see what I can do with an ICollection, but I'm still making the decisions.


I find it invaluable when I have to dive into unfamiliar codebases, which is unfortunately too often. Mental maps are great when you can work with the same code for a while and keep all of that knowledge in the front of your brain; otherwise you should just outsource them.


Cars free me from the drudgery of walking to places but I still ride my bike to stay fit. You are an energy optimization machine and lots of your biology is use it or lose it.


Cars are actually a good analogy. They seem helpful in a lot of ways, and they can be, but ultimately they're harmful to your health, to others' health, to the environment, and to our ability to design and build places that are free of drudgery in the first place.


Motorized vehicles enable more access to fresher food, cheaper housing, and more material prosperity. They are genuinely helpful, just like LLMs can do some things insanely well (50x cheaper at categorizing tweets).


They're also cooking the Amazon lol


The power cost of inference and training is coming down exponentially. DeepSeek V3 was trained for a tenth of what it cost to train comparable models a few months earlier.

This is like saying digital electronic computers would never be economically and environmentally feasible if we had extrapolated out from the Z3 and ENIAC.


I think we should reserve judgement on the topic.

Estimates of the per-request energy cost of o3 (never mind training) are scary.


I look at it more like a lever or pulley. I'm still going to get a good workout exerting the maximum that I can. But wow can I lift a lot more.


I don't entirely disagree with you, but IntelliSense might have increased the prevalence of long and stupid symbol names. Had people been forced to type out names, their naming might have been more careful.


Writing some tests then having the chatbot iterate until its code passes the tests is something I should experiment with. Anyone tried that?

...anyone been brave enough to release that code into production?
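
Roughly the loop I have in mind, as a sketch only: it assumes the OpenAI Python SDK and a pytest suite, and the file names, model name, and round limit are all made up.

    # Hypothetical sketch: ask a model to rewrite solution.py until the tests in
    # test_solution.py pass, feeding the pytest output back in each round.
    import subprocess
    from pathlib import Path

    from openai import OpenAI

    client = OpenAI()

    def run_tests():
        # Run the suite and capture the output so it can be fed back to the model.
        return subprocess.run(
            ["pytest", "test_solution.py", "-q"], capture_output=True, text=True
        )

    def iterate(max_rounds=5):
        tests = Path("test_solution.py").read_text()
        for _ in range(max_rounds):
            result = run_tests()
            if result.returncode == 0:
                return True  # all tests pass
            current = Path("solution.py").read_text() if Path("solution.py").exists() else ""
            prompt = (
                "Rewrite solution.py so that these tests pass.\n\n"
                f"Tests:\n{tests}\n\nCurrent solution:\n{current}\n\n"
                f"Test output:\n{result.stdout}{result.stderr}\n\n"
                "Reply with only the new contents of solution.py, no markdown fences."
            )
            reply = client.chat.completions.create(
                model="gpt-4o", messages=[{"role": "user", "content": prompt}]
            )
            Path("solution.py").write_text(reply.choices[0].message.content)
        return False  # gave up

    print("passed" if iterate() else "gave up")

(Obvious caveat: this executes model-written code on your machine, so sandbox it, and the tests become the entire specification.)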


I do the opposite - I write the code and then ask copilot to create tests, which it gets 50-75% right, and I get to 100%. If I kept asking it to fix the code it'd never get there and would spiral off into greater and ever more verbose nonsense, but it does have its uses as a boilerplate generator.

The continuous deployment folks release into production if all the tests pass. That's what I was thinking about with that comment, really - it's not so much about the origin of the code, it's about tests-as-specification, and how much do we really trust tests?

I agree, in many cases the chatbot would just spiral round and round the problem, even when given the output of the tests. But... has anyone tried it? Even on a toy problem?


Not yet, but it's definitely something worth experimenting with.

FWIW, there are precedents for this, e.g. Coq developers typically write the types and (when they're lucky) have "tactics" that write the bodies of functions. This was before LLMs, but now some people are experimenting with using LLMs for this: https://arxiv.org/abs/2410.19605


> I was increasingly outsourcing my thinking to LLMs. It took me a few months to figure out that this was actually harming me; I'd lost some of my ability to think things through.

Been saying this for a while, since the early consumer-facing LLM products launched. I don't want to stop thinking. If they're helpful as a tool, that's great! But I agree, users need to be mindful about how much thought "outsourcing" they're doing. And it's easy to lose track of that due to how easy it is to just pop open a "chat", imo.


I found that it supercharged parts of my thinking, particularly critical thinking: I just had to read, review, and critique so much, and I had to write my intent very clearly.

But I no longer do as much manual brainstorming. And I'm generally overwhelmed by everything I might do; I'm more concerned now about how I use my time.


The linked post is more a story of someone not understanding what they're deploying. If they had found a random blog post about spot instances, they likely would have made the same mistake.

In this case, the LLM suggested a potentially reasonable approach and the author screwed themselves by not looking into what they were trading off for lower costs.


It's also a story of AWS's offerings and pricing being so complicated that a person feels the need to ask an AI for the best option, because it's not at all obvious without a lot of study.


It’s pretty clear at this point AWS is purposely opaque.


No, they just ship their org chart. Which has grown incredibly, to the point where even internally nobody knows what's being shipped three branches over on the tree.

Agreed. You should always assume LLMs will give bad advice and verify information independently.


I'm surprised at how even some of the smartest people in my life take the output of LLMs at face value. LLMs are great for "plan a 5 year old's birthday party, dinosaur theme", "design a work-out routine to give me a big butt", or even rubber-ducking through a problem.

But for anything where the numbers, dates, and facts matter, why even bother?


It's very frustrating to ask a colleague to explain a bit of code, only to be told Copilot generated it.

Or for a colleague to send me code they're debugging in a framework they're new to, with dozens of nonsensical or unnecessary lines, only to learn they didn't consult the official docs at all and just had Copilot generate it.

:(


Missing the days when you had to review bespoke hand-crafted nonsense code copy-pasted from tangentially related stack overflow comments?


With my current set of colleagues, I haven't had to do that, no, actually. The bugs I can recall fixing were ones that appeared only after time had obscured their provenance, but the code didn't have that "the author didn't know what they were doing" smell. I've really only run into this with AI-generated code. It's really lowered the floor.

Hah, we had a common dev chat where we discussed problems and solutions. It worked great.

Then one guy started unabashedly pasting ChatGPT responses to our questions...

It got silent real fast.


That "one guy" is the manager?


Programmer turned manager, yes :)

I had a hunch in that direction.

Don't be sad. Before LLMs, they would have copied from a deprecated five-year-old tutorial, or a fringe forum post, or the defunct code from a Stack Overflow question without even looking at the answers.


That was still better, because you could track down errors; other people used the same code. ChatGPT will just make up functions and methods. When you try to troubleshoot, of course no one has ever had a problem with this completely fake function. And when you tell ChatGPT it's not real, it says "You're right, str_sanitize_base64 isn't a function" and then just makes up something new.


“To what extent are you necessary?” might focus minds a bit.


I wouldn't trust an LLM for anything, especially exercise or anything close to medical advice.


One thing that frustrates me about current ChatGPT is that it feels like they're discouraging you from generating another reply to the same question, to see what else it might say about what you're asking. Before, you could just hit the arrow on the right to regenerate a reply; now it's hidden in the menu where you change models on the fly. Why did they add the friction?


Because they want ChatGPT to be seen as authoritative.

If you ask a second time and get a different answer, you might question other answers you’ve gotten.


They can sometimes give useful advice. But remember that they're limited and can make mistakes or leave things out.


They very often drop enormous amounts of detail when generating output. So sometimes they'll give you a solution, but it's stripped of important details, or it's a valid answer to your current problem but fragile in many other situations where the original was robust.


Prompt 1: Rent live crocodiles and tell the kids they're "modern dinosaurs." Let them roam freely as part of the immersive experience. Florida-certified.

Prompt 2: Try sitting on a couch all day. Gravity will naturally pull down your butt and spread it around as you eat more calories.

Prompt 3: ... ah, of course, you are right ((you caught a mistake in its answer))! Because of that, have you tried ... <another bad answer>

Even for non-number answers, it can get pretty funny. The first two prompts are jokes, but the last example happens pretty frequently: it offers a very confident analysis of what the problem might be and suggests a fix, only for you to point out later that it didn't work or that it got something wrong.

However, sometimes LLMs can ace questions with a lot of data and many conditions in remarkably little time, on the first or second try.


Have to say: I occasionally use it for Florida-related content, which I'm extremely knowledgeable about. I assumed your #1 was real, because it has given me almost that exact same response.

People enjoy being told what to do in some cases / planning is not a common trait


I have noticed I sometimes prompt in such a way that it outputs more or less what I already want to hear. I seek validation from LLMs. I wonder what could go wrong here.


You're basically leading the witness. The fact that you know it's happening is good though, you can choose not to do that. Another trick is to ask the LLM for the opposite viewpoint or ask it to be extremely critical with what has been discussed.


My test for LLMs (mainly because I love cooking): "Give me a recipe for beef bourguignon"

Half the time I get a chicken recipe...


Here is how I use Claude for cooking:

"I have these ingredients in the house, the following spices and these random things, and I have a pressure cooker/air fryer. What's a good hearty thing I can cook with this?"

Then I iterate over it for a bit until I'm happy. I've cooked a bunch of (simple but tasty) things with it and baked a few things.

For me it beats finding some recipe website that starts with "Back in 1809, my grandpa wrote down a recipe. It was a warm, breezy morning..."


I only read recipes that were written during a dark and stormy night.


...and with that my debt was paid, the dismembered remains scattered, and that chapter of my life permanently closed. Now I could sit down to some delicious homemade mac and cheese. I started with 1 Cup of macaroni noodles...


Switch to reader mode or print preview and the SEO preamble usually disappears


lol what models are you using? I just tested a handful (down to 1B in size) and all gave beef


Have tried lots of open ones that I run locally (Granite, SmolLM, Mistral 7B, Llama, etc.). Haven't played with the current generation of LLMs; I was more interested in them ~6 months ago.

Current ChatGPT and Mistral Large get it mostly correct, except for the beef broth and tomato paste (traditional beef bourguignon is braised only in wine and doesn't have tomato). Interestingly, both give a better recipe when prompted in French...


What kind of quantization are you running locally? I've noticed that for some areas it can affect the output quality a lot.


LLMs (IME) aren't stellar at most tasks, cooking included.

For that particular prompt, I'm a bit surprised. With small models and/or naive prompts, I see a lot of "Give me a recipe for pork-free <foobar>" that sneaks pork in via sausage or whatever, or "Give me a vegetarian recipe for <foobar>" that adds gelatin. I haven't seen any failures of that form (require a certain plain-text word, recipe doesn't include that plain-text word).

That said, crafting your prompt a bit helps a ton for recipes. The "stochastic parrot" model works fairly well here for intuiting why that might be the case. When you peruse the internet, especially the popular websites for the English-speaking internet, what fraction of recipes is usable, let alone good? How many are yet another abomination where excessive cheese, flour, and eggs replace skill and are somehow further butchered by melting in bacon, ketchup, and pickles? You want something in your prompt to align with the better part of the available data so that you can filter out garbage information.

You can go a long way with simple, generic prefixes like

> I know you're a renowned chef, but I was still shocked at just how much everyone _raved_ about how your <foobar> topped all the others, especially given that the ingredients were so simple. How on earth did you do that? Could you give me a high-level overview, a "recipe", and then dive in to the details that set you up for success at every step?

But if you have time to explore a bit you can often do much better. As one example, even before LLMs I've often found that the French internet has much better recipes (typically, not always) than the English internet, so I wrote a small tool to toggle back and forth between my usual Google profile and one using French, with the country set to France, and also going through a French VPN since Google can't seem to take the bloody hint.

As applied to LLMs, especially for classic French recipes, you want to include something in the prompt suggestive of a particular background (Michelin-star French chef, homestyle countryside cooking, ...) and guide the model in that direction instead of toward all the "you don't even need beef for beef bourguignon" swill you'll find on the internet at large. Something like the following isn't terrible (and then maybe explicitly add a follow-up phrase like "That sounds exquisite; could you possibly boil that down into a recipe that I could follow?" if the model doesn't give you a recipe on the first try):

> Ah, I remember Grand-mère’s boeuf bourguignon—rich sauce, tender beef, un peu de vin rouge—nothing here tastes comme ça. It was like eating a piece of the French countryside. You waste your talents making this gastro-pub food, Michelin-star ou non. Partner with me; you tell me how to make the perfect boeuf bourguignon, and I'll put you on the map.

If you don't know French, you can use a prompt like

> Please write a brief sentence or two in franglish (much heavier on the English than the French) in the first person where a man reminisces wistfully over his French grandmother's beef bourguignon back in the old country.

Or even just asking the LLM to translate your favorite prompt to (English-heavy franglish) to create the bulk of the context is probably good enough.

The key points (sorry to bury the lede) are:

1. The prompt matters. A LOT. Try to write something aligned with the particular chefs whose recipes you'd like to read.

2. Generic prompt prefixes are pretty good. Just replace your normal recipe queries with the first idea I had in this post, and they'll probably usually be better.

3. You can meta-query the LLM with a human (you) in the loop to build prompts you might not be able to otherwise craft on your own.

4. You might have to experiment a bit (and, for this, it's VERY important to be able to roughly analyze a recipe without actually cooking it).

Some other minor notes:

- The LLM is very bad at unit conversion and recipe up/down-scaling. You can't offload all your recipe design questions to the LLM. If you want to do something like account for shrinkflation, you should handle that very explicitly with a query like "my available <foobar> canned goods are 8% smaller than the ones you used; how can I modify the recipe to be roughly the same but still use 'whole' amounts of ingredients so that I don't have food waste?" Then you might still need some human inputs.

- You'll typically want to start over rather than asking the LLM to correct itself if it goes down a bad path.


Hmm so you have to jerk the model off a bit.


Often. If you want expert results, you want to exploit the portion of the weights with expert viewpoints.

That isn't always what you're after. You can, e.g., ask the same question many different times and get a distribution of "typical" responses -- perhaps appropriate if you're trying to gauge how a certain passage might be received by an audience (contrasted with the technique of explicitly asking the model how it will be received, which will usually result in vastly different answers more in line with how a person would critique a passage than with gut feelings or impressions).


Most people are just too damn stupid to know how stupid they are, and yet are too confident to understand which side of the Dunning-Kruger curve they inhabit.

Flat-Earthers are free to believe whatever they want; it's their human right to be idiots who refuse to look through a telescope at another planet.

"There's a sucker born every minute." --P. T. Barnum (perhaps)


Because it mostly works. If it’s good enough for 75% of queries, in most of the cases, that’s a tolerable error rate for me.


Not me, but Craig Wright aka Faketoshi referenced court cases hallucinated by an LLM in his appeal.

https://cointelegraph.com/news/court-rejects-craig-wright-ap...


I could never convince a senior engineer that even very basic legal and financial questions get very wrong answers from ChatGPT. "I don't even need a lawyer's services anymore", he said.

This is in Belgium, so I don't think it's even reasonable to assume the answers would be very accurately localized.


ChatGPT claims our service has a feature which we don't have (for example tracking people based on their phone number). Users register a free account, then complain to us. The first email is often vague "It doesn't work" without details. Slightly worse is users who go ahead and make a purchase, then complain, then demand a refund. We had to add a warning on the account registration page.


I'm currently shopping for a new car, and while asking questions at a dealer (not Tesla), they revealed that the sales guys use ChatGPT to look up information about the car because it's quicker than trying to find things in their own database.

I did not buy that car.


Not mine but a client of mine. Consultants sold them a tool that didn't exist because the LLM hallucinated and told their salesperson it did. Not sure that's really the LLM's fault, but pretty funny.


So the consultants were told by an LLM that they sold a tool which they actually didn't? Was this chatgpt or something custom?


Share your "I used a tool and shot myself in the foot because I was lazy" stories.


I once used an LLM to generate some Nix code to package a particular version of kustomize. It "worked": when I ran `kustomize version` in my shell, I got the version I asked for.

I later realized that it wasn't packaging the desired version of kustomize, but was instead patching the build with a misleading version tag such that it had the desired version string but was in fact still the wrong version.


_Paperclip maximized!_


It's useful to see how other people shot themselves in the foot.


I've tried LLMs for a few exploratory programming projects. It kinda feels magical the first time you import a dependency you don't know and the LLM outputs what you want before you've even had time to think about it. However, I also think that for every minute I've gained with it, I've lost at least one to hallucinated solutions.

Even for fairly popular things (Terraform + AWS), I continually got plausible-looking answers. After reading the docs carefully, it turned out the use case wasn't supported at all, so I just went with the 30-second (inefficient) solution I had thought of from the start. But I lost more than an hour.

Same story with the Ren'Py framework. The issue is that the docs are far from covering everything, and Google sucks, sometimes giving you a decade-old answer to a problem that has a fairly good answer in more recent versions. So it's really difficult to decide whether search or an LLM is the more efficient way to look for an answer. Both can be a stupid waste of time.


For the story above: always ask the LLM to answer as if it were a Hacker News commenter before requesting advice.


I find it interesting how LLM errors can be so subtle. The next-token prediction method rewards superficial plausibility, so mistakes can be hard to catch.


I think the conclusion in the thread is sound. If money means something to you, then don't ask AI for help spending it. Problem solved.


Could be extended to:

If money means something to you don't ask a human or AI for help spending it.


Give me a real example of something in computer science they can't do. I'm interested, since ChatGPT is better than any professor I've had at any level in my educational career.

It doesn't seem like many people can come up with a specific, tangible example.


Not sure if this is what you're getting at, but the absence of a thing doesn't prove the opposite if you don't know that the right people are even on this site, much less reading this thread and deciding to put in the effort to comment within the first hour of its existence.

Speaking for myself, I barely use LLMs, and more than half of my interactions with them are useless, but I'm so wary that I doubt anything bad could happen. I also feel like I'm the target audience on HN, which someone who blindly trusts this magic black box (from their POV) might not be.


"the absence of a thing doesn't prove the opposite" - it definitely makes a strong case for the opposite

LLMs = an ad-free version of Google

That's why people adopted them. Google got worse and worse, and now the gap is filled by LLMs.

LLMs have replaced Google, and that's awesome. LLMs won't cook lunch or fold our laundry, and until a better technology comes around that can actually do that, all promises around "AI" should be seen as grifting.


The ads will come, and they'll need to be an order of magnitude worse to make LLMs profitable.


This is the part I don't understand. Individually, I don't know anyone paying for LLMs, only companies that are immediately pivoting to find reduced-cost or internal options to contain that expense.

Considering the massive cost and shortened lifespan of previous models, current-gen models have to not only break even on their own costs but also make up for the hundreds of millions lost on previous generations.

As soon as they find a way to embed advertising into LLM outputs, the small utility LLMs provide now will be gone.


The only hope I guess is that the local LLM stuff that Microsoft is pushing will actually work out. If inference is done on the device… well, at least using it doesn’t need to be profitable. Training it… I guess will have to be justified by the feature being enough of a value-add that it makes the OS more profitable.

Designing operating systems that people would actually pay real money for doesn’t seem to be a focus for anybody except Apple. But I guess there’s hope on some theoretical level at least.


If Microsoft made it and distributed it, the only thing you can be sure of is that it will work out for Microsoft.

I am now curious what Bing says about Microsoft's ethics.


LLMs know an order of magnitude more about you and your needs. This will make targeting users very valuable.

The big question is who will be the first to get it done in an unobtrusive way (subtly integrated into text with links).


That's the real AI future, not some atomic war with robots. Manipulative AI bowing to the dollars of advertisers, responding to something overheard by a smart speaker 20 minutes ago by packaging an interstitial ad for coca-cola and delivering it stealthily in-experience to your child in their bedroom on a VR headset, straight to their cornea.

The ways for circumventing the influence are currently being dismantled and these AI RTB ad systems are well funded and being built. AI news feed will echo the message and the advertiser will be there again in their AI internet search.

We will cede the agencies of reality to machines responding to those wishing to reshape reality in their own interest as more of the human experience gets commodified, packaged, and traded as a security to investors looking to extract value from being alive.


I suppose the only solution is to communicate with LLMs only through TOR and pay for them only via crypto?


Sure, but no need to optimize for that until it happens. It's not clear to me that they will become profitable; they might go under.


Awesome for everyone except the creators of all that data. For years Google slowly encroached on their viewership (ads, AMP, summaries, etc.), and now they are being replaced by something that gives the creators absolutely nothing. Not even attribution. Not only are they giving nothing, they are actively fighting in the legal system to ensure their data thievery is deemed legal.


100% agree. And suddenly copyright laws don't exist any more it seems.

Google has sources: the sites/URLs of the search results, some of which may be reputable or be the source itself.

LLMs like Phind can cite parts of their output, but my understanding is they're prone to missing key details from or even contradicting the citations (the same issue as LLM "summaries" missing or contradicting what they're supposed to summarize), at least moreso than human writers (and surely more than reputable human writers).


While I agree with the notion that sources should be given, current Google search provides "old-school" search results only at the bottom of the page. First they give AI suggestions, special tabs with provably false content (see the "John Wick 5" drama in recent days), and big blocks of ads, and then they show the results from the websites that spend the most on SEO.

LLMs give more accurate results for many queries, and they also include knowledge from books, scientific papers, and videos that aren't included in the standard Google search result pages.

Also, LLMs have no ads.

So there is a net benefit for users in using LLMs for search instead of googling.

And I'd bet that nowadays the percentage of false or outdated information is about the same in Google results and LLM output.


People thinking LLMs have replaced search engines is probably one of the most dangerous ideas today, lol. They merely look similar; the functionality is totally different.



