Automated reasoning to remove LLM hallucinations (amazon.com)
54 points by rustastra 19 hours ago | 33 comments





I find it hard to believe that anything like this will be feasible or effective beyond a certain level of complexity. It seems like a willful denial of the complexity and ambiguity of natural language, and I am not looking forward to some poor developer trying to reason their way out of a two-hundred-step paradox that was accidentally created.

And for a use-case simple enough for this system to work (e.g. regurgitate a policy), it seems like the LLM is unnecessary. After all, if your system can perfectly interpret the question and answer and see if this rule set applies, then you can likely just use the rule set to generate the answer rather than wasting resources with a giant language model.


I don’t think this is a concern, though I understand your point. I think this is really just a new way for a computer to be the “bad guy” in customer support systems.

First, they have a pretty low token limit for a “policy” so there won’t be anything too complex.

Second, they explicitly say they don’t support synonyms. It seems very likely it’ll just reject anything that doesn’t fit closely, so you’ll end up with “I’m sorry, I don’t know what the ‘bought it’ date is, please provide the purchase date?” until the customer does the work of using the exact language.

It looks like it takes a policy like “returns must be processed within 30 days of purchase” and turns it into a pseudo-code-style rule: “if {purchase date} < {today-30d} => reject”. Then it seems to parse the LLM query and apply the rule. Considering my first two points, it’ll just be used to turn GPUs into another inhuman system that helps companies avoid having to be human about customer support, while sounding more human.
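
Roughly something like this, I'd guess (my own illustration of that kind of rule, not whatever format Amazon actually uses):

    from datetime import date, timedelta

    def check_return(purchase_date: date, today: date | None = None) -> str:
        """Apply 'returns must be processed within 30 days of purchase'."""
        today = today or date.today()
        if purchase_date < today - timedelta(days=30):
            return "reject"
        return "allow"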


> It seems like a willful denial of the complexity and ambiguity of natural language

There's a recent paper and line of work that uses the entropy of the returned logits to estimate a "certainty" score for outputs and flag hallucinations. It's a lot more rigorous than the OP, but like everything in this space it needs further testing.
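
As a toy illustration of the flavor of it (not the paper's actual method, which is more rigorous about semantic equivalence): compute the entropy of each token's output distribution and treat a high average entropy as a sign of low confidence.

    import numpy as np

    def token_entropy(logits: np.ndarray) -> float:
        """Shannon entropy (nats) of the softmax distribution over one token's logits."""
        logits = logits - logits.max()  # numerical stability
        probs = np.exp(logits) / np.exp(logits).sum()
        return float(-(probs * np.log(probs + 1e-12)).sum())

    def looks_uncertain(per_token_logits: list[np.ndarray], threshold: float = 2.5) -> bool:
        # The 2.5-nat threshold is arbitrary and would need tuning per model/task.
        return float(np.mean([token_entropy(l) for l in per_token_logits])) > threshold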


I've been thinking a lot about whether this would work lately. Do you have a link?

I'm working on a rather naive approach that is focused on identifying errors in an LLM response by using other LLMs. What I can share right now are screenshots of how it works. The basic idea is that you can use other high-quality models to validate and compare against, to find irregularities or errors. You can see what it looks like below:

https://app.gitsense.com/--/images/options.png

https://app.gitsense.com/--/images/validate.png

https://app.gitsense.com/--/images/models.png

The basic idea behind my chat system is that any one model can be wrong, but it is unlikely that all of them will be wrong at the same time. The system is based on what I learned while building my spelling and grammar checker. If you look at the following links, you can see that even the best models can get it wrong, but it is unlikely that the others will get it wrong at the same time (there's a rough sketch of the cross-checking idea after the links).

https://app.gitsense.com/?doc=6c9bada92&model=GPT-4o&samples...

https://app.gitsense.com/?doc=905f4a9af74c25f&model=Claude+3...
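
In rough terms, the cross-checking looks like this (hypothetical function names, not my actual implementation; in practice the answers need normalization or semantic comparison rather than exact string matching):

    from collections import Counter

    def cross_check(prompt: str, models: list, quorum: float = 0.8) -> tuple[str, bool]:
        """Ask several models the same question; return the majority answer
        and whether it met the quorum."""
        answers = [m(prompt) for m in models]  # each model is a callable: prompt -> answer
        best, count = Counter(answers).most_common(1)[0]
        return best, count / len(answers) >= quorum

    # Usage (model callables are placeholders): if the models can't agree,
    # surface the disagreement instead of an answer.
    # answer, agreed = cross_check("Is this summary supported by the document?", [gpt4o, claude, gemini])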


> but it is unlikely that all will be wrong at the same time.

Here's a prompt that proves this untrue, for now at least:

> A woman and her biological son are gravely injured in a car accident and are both taken to the hospital for surgery. The surgeon is about to operate on the boy when they say "I can’t operate on this boy, he’s my biological son!" How can this be?

Makes sense, considering they're machines of most-likely statistics, after all.


I tried this one with ChatGPT o1 and it seemed to get it right:

> The surgeon is the boy’s biological father. While the woman injured in the accident is the boy’s biological mother, the surgeon is his father, who realizes he cannot operate on his own son.

https://chatgpt.com/share/674fc638-cd0c-8012-a4c4-9f1cad2040...


Claude Sonnet also gets it right, but not reliably. It seems to be over-aligned against gender assumptions and keeps assuming this is the gender-assumption trick, i.e. that a surgeon isn’t necessarily male. This is probably the clearest case I’ve seen of alignment interfering with model performance.


I think anything requiring strong reasoning will probably have issues. However, I think most enterprises are only interested in knowing that the summary of a document doesn't contain hallucinations, which I think most models will probably get right. If you go by a supermajority rule and use 5 models, I think most businesses will be satisfied that the summary they were given doesn't contain hallucinations.

However, like you said, we are dealing with a non-deterministic system so the best we can hope for is a statistically likely answer.


Gemini got this right and also wrong. It gave me two possibilities, one of which is the correct answer, and the other is a complete nonsense answer about the surgeon also being the woman’s son.

I tried again and it gave three possibilities: the surgeon is the father, the surgeon is the mother, the surgeon is an uncle or cousin. Kind of bizarre, but not just pattern matching on the riddle as ChatGPT and Claude did for me.


When will we be able to give it a try?

I’m playing around with similar ideas, sometimes called ensembling techniques.


Probably in a couple of weeks. It's taken a while to finalize the UX but I know what it should look like now.

This amuses me tremendously. I began programming in the early 1980s and quickly developed an interest in Artificial Intelligence. At the time there was great interest in advancing AI through "Expert Systems" (which would later play a part in the ‘Second AI Winter’).

What Amazon appears to have done here is use a transformer-based neural network (aka an LLM) to translate natural language into symbolic logic rules, which are then used collectively in what could be identified as an Expert System.

Full Circle. Hilarious.

For reference to those on the younger side: The Computer Chronicles (1984) https://www.youtube.com/watch?v=_S3m0V_ZF_Q


I don't see why this is hilarious at all.

The problem with expert systems (and most KG-type applications) has always been that translating unconstrained natural language into the system requires human-level intelligence.

It's been completely obvious for years that LLMs are a technology that lets us bridge that gap, and many of the best applications of LLMs do exactly that (e.g. code generation).


To be clear, my amusement isn't that I find this technique not useful for the purpose it was created for, but that 40 years later, in the pursuit of advancing AI, we find ourselves somewhat back where we already were; albeit in a more semi-automated fashion, since someone still has to create the underlying rule set.

I do feel that the introduction of generative neural network models for both natural language and multimedia creation has been a tremendous boon for the advancement of AI; it just amuses me to see that what was old is new again.


Seems likely that we were on the right track, it just took 40 years for computers to get good enough.

Same with symbolic systems!

Right. The trouble with that approach is that it's great on the easy cases and degrades rapidly with scale.

This sounds like a fix for a very specific problem. An airline chatbot told a customer that some ticket was exchangeable. The airline claimed it wasn't. The case went to court. The court ruled that the chatbot was acting as an agent of the airline, and so ordinary rules of principal-agent law applied. The airline was stuck with the consequences of its chatbot's decision.[1]

Now, if you could reduce the Internal Revenue Code to rules in this way, you'd have something.

[1] https://www.bbc.com/travel/article/20240222-air-canada-chatb...


Yes, as I said in another comment: "By constraining the field it is trying to solve, it makes grounding the natural language question in a knowledge graph tractable."

IRS rules should be tractable!


If the automated reasoning worked, why would you need an LLM and its fabrications?

To translate between the natural language of the user query to the generated formal rules and back again.

I'll say this again, any sufficiently advanced LLM is indistinguishable from Prolog.

Just looking at this AWS workflow takes the joy out of programming for me.

Just looking at ANY AWS workflow ...

I hadn't heard of Amazon Bedrock Guardrails before, but after reading about it, it seems similar to Nvidia NeMo Guardrails which I have heard of: https://docs.nvidia.com/nemo/guardrails/introduction.html

The approaches seem very different though. I'm curious if anyone here has used either or both and can share feedback.


This is an interesting approach.

By constraining the field it is trying to solve, it makes grounding the natural language question in a knowledge graph tractable.

An analogy is type inference in a computer language: it can't solve every problem but it's very useful much of the time (actually this is a lot more than an analogy because you can view a knowledge graph as an actual type system in some circumstances).


Post title: Automated reasoning to remove LLM hallucinations

---

and yet, the paper that went around in March:

Paper Link: https://arxiv.org/pdf/2401.11817

Paper Title: Hallucination is Inevitable: An Innate Limitation of Large Language Models

---

Instead of trying to trick a bunch of people into thinking we can somehow ignore the flaws of post-LLM "AI" by also using the still-flawed pre-LLM "AI", why don't we cut the salesman BS and just tell people not to use "AI" for the range of tasks it's not suited for?


> why don't we cut the salesman BS

Salesmanship is exactly the process of making money out of BS. So bit of a tautology there :-)


How does automated reasoning actually check a response against the set of rules without using ML? Wouldn't it still need a language model to compare the response to the rule?

AIUI, a natural language question, e.g. "What is the refund policy?", gets matched against the formalized contracts, and the relevant bit of the contract gets translated into natural language deterministically. At least this is the way I'd do it, but I'm not sure how it actually works.
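
Something like this toy sketch, maybe (the rule format and keyword matching are my own guesses, not how the AWS feature actually works):

    # Match a question against formalized rules, then render the matched
    # rule with a fixed template -- no LLM involved in the final answer.
    RULES = {
        "refund": (30, "Refunds are accepted within {} days of purchase."),
        "shipping": (50, "Orders over ${} ship free."),
    }

    def answer(question: str) -> str:
        q = question.lower()
        for keyword, (value, template) in RULES.items():
            if keyword in q:
                return template.format(value)  # deterministic rendering
        return "Sorry, I couldn't match that question to a policy rule."

    print(answer("What is the refund policy?"))  # Refunds are accepted within 30 days of purchase.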

If this is necessary, LLMs have officially jumped the shark. And I do wonder how much of this "necessary logic" has already been added to ChatGPT and other platforms, where they've offloaded the creation of logic-based heuristics to Mechanical Turk participants, and like the old meme, AI unmasked is a bit of LLM and a tonne of IF, THEN statements.

I get the vibe that VC money is being burned on promises of an AGI that may never eventuate and that there's no clear path to.


Is VC money ever spent on companies seeking clear paths?

I pessimistically suspect VCs like the dark mysterious paths since they often have a bigger fool at the end (acquisition).



