Researchers describe how to tell if ChatGPT is confabulating (arstechnica.com)
56 points by glymor 12 days ago | hide | past | favorite | 38 comments





> But perhaps the simplest explanation is that an LLM doesn't recognize what constitutes a correct answer but is compelled to provide one

Why is it compelled to provide one, anyway?

Which is to say, why is the output of each model layer a raw softmax — thus discarding knowledge of the confidence each layer of the model had in its output?

Why not instead have the output of each layer be e.g. softmax but rescaled by min(max(pre-softmax vector), 1.0)? Such that layers that would output higher than 1.0 just get softmax'ed normally; but layers that would output all "low-confidence" results (a vector all lower than 1.0) preserve the low-confidence in the output — allowing later decoder layers to use that info to build I-refuse-to-answer-because-I-don't-know text?
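Roughly, what I have in mind (a PyTorch-flavoured sketch; the floor at 0 on the scale factor is an extra guard I'm adding here so an all-negative layer shrinks toward zero instead of flipping sign):

    import torch

    def confidence_scaled_softmax(logits: torch.Tensor) -> torch.Tensor:
        probs = torch.softmax(logits, dim=-1)
        # Scale factor: min(max(pre-softmax vector), 1.0), floored at 0.
        scale = logits.max(dim=-1, keepdim=True).values.clamp(min=0.0, max=1.0)
        # A layer with a strong top score (>= 1.0) behaves like plain softmax;
        # a layer where every score is weak keeps that low confidence visible
        # to later layers instead of having it normalized away.
        return probs * scale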


Careful, I think there's a large difference here between:

1. An LLM's mathematical "confidence" of having a clear best-scoring candidate for the predicted next token when given a list of tokens.

2. A not-yet-invented AI that models the idea of different entities interacting, the concept of questions and answers, the concept of logical conflicts, and its "confidence" that a proposition is compatible with other "true" propositions and incompatible with false ones.

To help illustrate the difference, suppose you trained an LLM on texts where a particular question was always answered with "I don't know, I have zero confidence in anything anymore." Later the LLM will regurgitate similarly nihilistic text, and by all objective internal measures it will be extremely "confident" as it does so.

> Why is it compelled to provide one, anyway

It's following the patterns in its training data, which probably reflects a whole lot more people trying to provide answers (sometimes even deliberately wrong ones) as opposed to admitting uncertainty.

This is especially true if developers put their thumb on the scale by injecting primer-text like "You are an intelligent computer eager to provide answers", as opposed to "behave like Socrates and help people understand that nothing is truly knowable."


Also, for some questions it is already overly cautious about answering. E.g. I give it an image of a location when I am travelling and ask it to guess where the image was taken. It will not want to guess and will at first provide a long disclaimer that it can't do it, but if I tell it that it is a game and to just make a guess for the fun of it, it is surprisingly accurate.

You don't need everything you describe in 2 to advance the state of the art beyond how ignorant of "confidence" today's models are.

After all, what I'm describing is something that even a classical Bayesian spam-filter classifier RNN can pull off — where a hidden layer near the output layer can notice that either:

1. the preceding layers have generated confidences for both the "spam" and "ham" classifications that are not distinguishable from 0 by at least epsilon, or

2. the preceding layers have generated confidences for the (mutually exclusive) "spam" and "ham" categories that are indistinguishable from each other (not at least epsilon apart post-softmax)

...and in those cases will output "I DUNNO (TRY GREYLISTING IT)" rather than "SPAM (BLOCK IT)" or "HAM (PASS IT THROUGH)".
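Concretely, that gate is nothing fancier than (with epsilon as a hypothetical tuning knob):

    EPSILON = 0.05  # hypothetical threshold, tuned per filter

    def classify(spam_conf: float, ham_conf: float) -> str:
        # Case 1: neither class is confidently above zero.
        if spam_conf < EPSILON and ham_conf < EPSILON:
            return "I DUNNO (TRY GREYLISTING IT)"
        # Case 2: the two mutually-exclusive classes are too close to call.
        if abs(spam_conf - ham_conf) < EPSILON:
            return "I DUNNO (TRY GREYLISTING IT)"
        return "SPAM (BLOCK IT)" if spam_conf > ham_conf else "HAM (PASS IT THROUGH)"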

What I'm expecting to accomplish with a rescaled softmax output (or with other embeddings, as long as they propagate/multiply the confidences of each successive layer, allowing confidence to approach 0) is to allow some attention head at some late layer in the model to develop an overriding-output strategy. That strategy would react to residual confidence in the previous layer's output vector (= the current layer's Q vector) that is not distinguishable from zero by at least epsilon, by giving high confidence to an "I don't know the first thing about what you're saying; I didn't really 'get' what you wrote" concept in the current layer's output vector (so high that it overrides any other response at that layer). This then just gets produced as a response by the same machinery that generates well-embedded responses from concepts at other layers. (Think alignment, not hard-trained fixed outputs.)
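Very roughly, with every name here made up for illustration:

    import torch

    def dunno_override(prev_probs: torch.Tensor,
                       layer_output: torch.Tensor,
                       dunno_concept: torch.Tensor,
                       epsilon: float = 1e-3) -> torch.Tensor:
        # If nothing in the previous layer's confidence-preserving output is
        # distinguishable from zero by at least epsilon, emit a high-confidence
        # "I didn't really get what you wrote" concept vector instead;
        # otherwise pass this layer's output through untouched.
        if prev_probs.max() < epsilon:
            return dunno_concept
        return layer_output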

---

Though, thinking more carefully about it, something else is missing too. LLMs already have all the info available in later layers to recognize condition #2 above (even under a pure Transformer decoder mask+add+norm+softmax kernel, it's still possible to do math that recognizes when the first N top-P-ranked elements of a vector of mutually-exclusive concepts are not separated by at least epsilon, and to develop a special reaction for that case), yet they still don't tend to learn this.

I think the concept missing here is a training technique that supervised training of simpler classifiers has used forever, but which doesn't seem to come up at all in Transformer training frameworks. And that's dynamically generating the training label for an example input, based on aggregate statistical information output through a side-channel while running inference on the example input. I.e. training the model to have a specific reaction to its own internal state in response to an input, rather than to the input itself.

Let's say you want to use an LLM as a spam-classifier — given an input, have it output a classification {SPAM, HAM, DUNNO}. It's easy enough, just with a dataset of labelled examples, to take any LLM and do a single fine-tune that results in a {SPAM, HAM} classifier. But you don't want a static dataset of DUNNO examples — because you don't want the classifier to output DUNNO when you aren't sure. You want the classifier to output DUNNO when it isn't sure.

So let's say you do two fine-tunes instead. The first one acts as an encoder, outputting a two-element vector (SPAM confidence, HAM confidence). And the second one acts as a decoder, turning those into categories.

What you actually need to achieve "correct" DUNNO outputs, is to train the decoder fine-tune not on labelled training examples, but by taking your existing (labelled!) training dataset; running it through the model with the decoder not connected, to get the raw confidences; applying a confidence-gating measure to them; and then, for any example that doesn't pass that measure, training the decoder (as a standalone LoRA) to output DUNNO.
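As a rough sketch of that relabelling step (the specific confidence-gating measure below is just one I made up, and the function names are placeholders for whatever fine-tune/LoRA tooling you actually use):

    EPSILON = 0.1  # hypothetical confidence gate

    def build_decoder_dataset(labelled_examples, encoder):
        # Relabel the {SPAM, HAM} dataset for the decoder fine-tune based on
        # the encoder's own confidences rather than on the original labels.
        decoder_dataset = []
        for text, label in labelled_examples:
            spam_conf, ham_conf = encoder(text)  # decoder not connected
            # Gating measure: at least one class clearly above zero AND the
            # two classes clearly separated.
            passes_gate = (max(spam_conf, ham_conf) > EPSILON
                           and abs(spam_conf - ham_conf) > EPSILON)
            decoder_dataset.append(((spam_conf, ham_conf),
                                    label if passes_gate else "DUNNO"))
        return decoder_dataset  # train the decoder LoRA on these pairs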


I don't see a lot of answers on Stack Overflow that go "gee, I don't know how to solve that." Hence the confabulation.

Changing the internals of a net is more likely to affect its training speed and ability to converge than its observable behaviour after training (in my experience), and in general mutations make things worse unless you have good reason to believe it'll make things better (e.g. residual layers in resnets).

(disclaimer: I'm not an ML expert, maybe this is just a me problem, but I find nets extremely sensitive to stuff like changing activations, adding normalisation, layer initialisation, layer sizes, all this stuff that seems kinda arbitrary to a non-expert like me)


Is confabulation different from hallucination? If not, I do suppose this is a more accurate term for the phenomenon, except that the exact definition isn't common sense without looking it up, whereas "hallucination" is more widely understood.

When speaking about LLMs, confabulation and hallucination refer to the same thing. The term "confabulation" is just the most accurate description of what's happening, whereas the term "hallucination" refers to something LLMs are fundamentally incapable of.

Some people seem to get very angry about calling it "hallucination", because it's a computer, computers can't hallucinate! Stop anthropomorphising it!!

So I suppose if you want to stay on the right side of those people - or you are one - you call it confabulation instead.


There’s also the position that a definition of confabulate…

To fill in gaps in one's memory with fabrications that one believes to be facts.

…is much more accurate.

Since we’re talking about a technical process it helps to be more precise in our use of language.


I like to think that it's always confabulating.

It's just that usually the words it generates are accurate enough for my needs.


If it can’t hallucinate, can it “believe”? I think all the pedantry is silly.

From the paper <https://www.nature.com/articles/s41586-024-07421-0>:

> Here we develop new methods grounded in statistics, proposing entropy-based uncertainty estimators for LLMs to detect a subset of hallucinations—confabulations—which are arbitrary and incorrect generations.


Humans too can be committed to beliefs that are not true. I have a friend who believes in and regularly consults her "clairvoyant". I wonder if our AI assistants in the future will be vulnerable to suspicions or popular fantasies about the world, people, or even other AIs they interact with.

Isn’t commitment to beliefs that aren’t true part of the value of intelligence? Like right now multiple billion dollar companies are being built on different theories of the future of AI. They can’t all be true.

TL;DR: sample the top N results from the LLM and use traditional NLP to extract factoids. If the LLM is confabulating, the factoids will have a random distribution; if it's not, they will be heavily weighted towards one answer.

A figure from the paper shows this better than my TL;DR: https://www.nature.com/articles/s41586-024-07421-0/figures/1
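A sketch of just that TL;DR (not the paper's entropy estimator; sample(), extract_factoid() and the 0.5 cutoff are placeholders):

    from collections import Counter

    def looks_confabulated(prompt, sample, extract_factoid, n=10, threshold=0.5):
        # sample(prompt) draws one completion from the LLM; extract_factoid()
        # stands in for the traditional-NLP step that pulls the claimed answer
        # out of a completion.
        factoids = [extract_factoid(sample(prompt)) for _ in range(n)]
        top_answer, count = Counter(factoids).most_common(1)[0]
        # Confabulating: factoids scattered roughly at random across samples.
        # Not confabulating: one factoid dominates the distribution.
        return count / n < threshold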


The LLM is already generating factoids: Things which resemble a fact without actually being one.

(See also: Androids that resemble men but aren't, asteroids that resemble stars but aren't, meteoroids that resemble meteors but aren't...)


Thank you! This is so helpful.

It's also interesting to see what temperature value they use (1.0, 0.1 in some cases?)... I have a feeling using the actual raw probability estimates (if available) would provide a lot of information without having to rerun the LLM or sample quite as heavily.


Or we could just ask the same question of 3 different LLMs (ideally a large LLM, a RAG LLM and a small one), then use an LLM again to rewrite the final answer. When models contradict each other there is likely hallucination going on, but correct answers tend to converge.

Why use an LLM to check the work of a different LLM?

You could use the same technique that this paper describes to compare the answers each LLM gave. LLMs don’t have to be in opposition to traditional NLP techniques


What we lack is for these models to state their context for their response.

We have focused on the inherent lack of input context leading to wrong conclusions, but what about that 90B+ parameter universe? There is plenty of room in there for multiple contexts to associate any input with surprising pathways.

In the olden days of MLPs we had the same problem, with softmax basically squeezing N output scores into a normalized "probability". Each output neuron was actually the sum of multiple weighted paths; whichever one won the softmax made up the "true" answer, but there may as well have been two equally likely outcomes, with just the internal "context" as the difference. In physics we have the path integral interpretation, and I dare say we humans, too, may provide outputs that are shaped by our inner context.


> There are a number of reasons for this. The AI could have been trained on misinformation; the answer could require some extrapolation from facts that the LLM isn't capable of; or some aspect of the LLM's training might have incentivized a falsehood

This article seems rather contrived. They present this totally broken idea of how LLMs work (that they are trained from the outset for accuracy on facts) and then proceed to present this research as if it were a discovery that LLMs don't work like that.


A simplistic version of this is just asking the question in 2 ways: ask for confirmation that the answer is no, then ask for confirmation that the answer is yes :)

If it's sure it won't confirm it both ways.
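Something like this, with ask() as a placeholder for whatever client you use:

    def seems_sure(question, ask):
        # ask() is a stand-in for any function that sends a prompt to the
        # LLM and returns its reply as a string.
        confirms_yes = "yes" in ask(question + " Confirm that the answer is yes.").lower()
        confirms_no = "yes" in ask(question + " Confirm that the answer is no.").lower()
        # A model that is actually sure shouldn't confirm both framings.
        return not (confirms_yes and confirms_no)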


A corollary for natural intelligence: if you can prove that all foos are bar, and that all foos are not bar, that's a good time to suspect that no foos actually exist.

(I still don't understand why everyone seems happy to conflate "intelligence" with fact-retrieval?)


> I still don't understand why everyone seems happy to conflate "intelligence" with fact-retrieval?

Because it's useful and was impossible until very recently.


Because there is no single widely accepted definition of intelligence.

No widely accepted definition, but fact retrieval is outside of any of the ones with which I'm familiar.

That works for yes/no questions, but not for informational questions, like "Where's the Eiffel Tower?".

So the same as SelfCheckGPT from several months ago?

> LLMs aren't trained for accuracy

This assertion in the article doesn't seem right at all. When LLMs weren't trained for accuracy, we had "random story generators" like GPT-2 or GPT-3. The whole breakthrough with RLHF was that we started training them for accuracy - or the appearance of it, as rated by human reviewers.

This step both made the models a lot more useful and willing to stick to instructions, and also a lot better at... well, sounding authoritative when they shouldn't.


Isn't that the issue? Getting thumbs up from an underpaid human reviewer isn't the same as accurate facts.

The one person I know getting paid to review AI outputs gets paid anywhere from $25 / hour to $40 / hour. Not sure if that's underpaid. It may be a nice option when you can do it at any time to supplement your regular income.

Reviewing AI output or helping in training an LLM itself?

This person works through an interface which is similar to Mechanical Turk. You get a list of available projects you qualified for via an assessment. For the AI projects, many of them involve comparing responses from two different models, answering questions, and selecting the best response. Other projects might be attempting to get the model to do something against the guidelines, or rating the model on certain capabilities. There are no requirements other than passing the assessment. As with Mechanical Turk, you can work on your available projects at any time.

This feedback is used for training.


It's not the same as completely accurate facts, but it's much closer to accurate facts than LLMs we had before.

This method seems to lean into the idea of the LLM as a fancy search engine rather than a true intelligence. Isn't the eventual goal of LLMs or AI that they're smarter than humans? So I guess my questions are:

Is it plausible that LLMs get so smart that we can't understand them? Do we spend, like, years trying to validate scientific theories confabulated by AI?

In the run-up to super-intelligence, it seems like we'll have to tweak the creativity knobs up, since the whole goal will be to find novel patterns humans don't find. Is there a way to tweak those knobs that gets us super genius and not super conspiracy theorist? Is there even a difference? Part of this might depend on whether or not we think we can feed LLMs "all" the information.

But in fact, assuming that Silicon Valley CEOs are some of the smartest people in the world, I might argue that confabulation of a possible future is in fact their primary value. Not being allowed to confabulate is incredibly limiting.


Yes, I agree, scientific theories get their value from being new/confabulated. Though this isn't mutually exclusive with solving today's problem of confabulations. The proposed methodology could be used to mark semantic coherence, but it doesn't mean we have to hide confabulations.

LLMs are language models, and I think it's best not to try to extrapolate them to general intelligence. They are universal language translators, and a lossy database of a lot of text. They might be a component of some bigger AI system in the future, but themselves they are not as intelligent as their marketing implies.


