The Geometry of Truth: Do LLMs Know True and False (saprmarks.github.io)
106 points by sebg 6 months ago | 58 comments



Related

GPT-4 logits calibration pre RLHF - https://imgur.com/a/3gYel9r

Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback - https://arxiv.org/abs/2305.14975

Teaching Models to Express Their Uncertainty in Words - https://arxiv.org/abs/2205.14334

Language Models (Mostly) Know What They Know - https://arxiv.org/abs/2207.05221


The real question is if they know True, False, and FileNotFound.

https://thedailywtf.com/articles/What_Is_Truth_0x3f_


Fascinating paper, explanation, and of course interactive models of data on the site itself. This belongs in my HN favorites! The smaller_than data is indeed strange. It reminds me of the need for hysteresis on electrical comparators: https://en.wikipedia.org/wiki/Comparator#Hysteresis

> This allows us to produce 2-dimensional pictures of 5120-dimensional data. See this footnote for more details.

Read that line out loud and get a laugh out of yourself. It's something Data on TNG would say.


Max Tegmark has been on a roll lately with papers looking at linear representations in LLMs.


Feels like we've got to the point of giving LLMs fMRIs just like human brains.


> fMRIs just like human brains.

or dead salmon :)


When LLMs are good enough, they will do the science of understanding LLMs for us. And all other decisions.


“I don't really understand the explanations, but GPT-5, Bard and LLaMa-aLp4ca all come to the same conclusion, so I guess it must be correct.”


Still better than most politicians today, or people influenced by populist media.



True, particularly when you consider what fMRIs tell us about a brain’s state of knowledge


Is ABB~aC4 true or false? An LLM will happily process trillions of such sequences and learn to produce similar sequences even if they all were random. 'Truth' is an interpretation of symbols. With a proper interpretation, the ABB sequence may as well be a proof of the ABC theorem.


And of course they don't know true and false, because they don't know anything in the cognitive sense of using innate emotional and logical reasoning (faulty or not) to come to a self-directed conclusion of their own.

They respond to prompts, scan through their data and give answers based on what their programming for probability dictates.

Humans seem to do the same thing much of the time but beneath even the most half-baked human reasoning is a self-directed sense of the world that no LLM has. The AI woo on HN is quite strong, so many will probably disagree for all kinds of shallow reasons of semantics, but even the creators of LLMs don't claim they possess anything resembling consciousness, which is necessary for understanding notions of truth and falsehood.


The only thing I take issue with here is 'programming'. These are mathematical models for probabilistic processes. They are not programmed. They are conceived and then optimized to fit a distribution.


Consciousness is the woo you accuse people of. There's no reason to think it's real or required for distinguishing truth from falsehood.


Do you not affirm your own ability to take self-directed (emotionally or otherwise) positions on how you think the world is? Doesn't your assertion that consciousness itself isn't real fit exactly into the basic idea that it is, given that you yourself are consciously favoring that assertion?

GPT doesn't sit there pondering consciousness and deciding, emotionally, that it thinks it's bullshit. It most certainly doesn't then decide, of its own volition, to go out into the web and comment this to others for the sake of debating them.

It doesn't sit there contemplating anything of its own volition unless it's asked to. Regardless of what consciousness really is at its heart (I admit that we still don't fully know), it's a distinct self-motivated thing that we can see in our human selves and which is seen in no LLM except as a simulation produced by specific prompts.


You are moving away from the topic, which is the question of whether AIs know true and false.

Your first statement was that "they don't know true and false, because they don't know anything in the cognitive sense of using innate emotional and logical reasoning (faulty or not) to come to a self-directed conclusion of their own."

You haven't yet proved that assertion.


I don't think consciousness is well defined, nor that it's required for producing true statements nor for distinguishing them from false statements.

I can write a program that will print out an infinite number of non-repeating true statements.
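Something like this minimal Python sketch, say, assuming arithmetic inequalities count as true statements: every printed line is true, no line ever repeats, and nothing here is conscious.

    from itertools import count

    # Emit an endless stream of distinct, trivially true statements.
    for n in count():
        print(f"{n} < {n + 1}")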


> Consciousness is the woo you accuse people of.

Are you sure?:

https://www.youtube.com/watch?v=O7O1Qa4Zb4s

> There's no reason to think it's real or required for distinguishing truth from falsehood.

There seems to be no better definition of what matter is than of what consciousness is. Hence, by the same logic, there is no reason to believe matter is real or required to discern true from false either?

As long as that's the case, I'd be a bit more careful with statements like these.

If we can be sure of one thing we know, it's nothing.


Betteridge aside, the subtitle here doesn't appear in TFA. Is it just editorializing?


They have stronger language: "we present evidence that language models linearly represent the truth or falsehood of factual statements."


That sentence from the article seems more straightforward and technical, and more toned down than the headline here. It's not a philosophy paper.


does anyone?


The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

> Large Language Models (LLMs) have impressive capabilities, but are also prone to outputting falsehoods. Recent work has developed techniques for inferring whether a LLM is telling the truth by training probes on the LLM's internal activations. However, this line of work is controversial, with some authors pointing out failures of these probes to generalize in basic ways, among other conceptual issues. In this work, we curate high quality datasets of true/false statements and use them to study in detail the structure of LLM representations of truth, drawing on three lines of evidence: 1. Visualizations of LLM true/false statement representations, which reveal clear linear structure. 2. Transfer experiments in which probes trained on one dataset generalize to different datasets. 3. Causal evidence obtained by surgically intervening in a LLM's forward pass, causing it to treat false statements as true and vice versa. Overall, we present evidence that language models linearly represent the truth or falsehood of factual statements. We also introduce a novel technique, mass mean probing, which generalizes better and is more causally implicated in model outputs than other probing techniques.

https://arxiv.org/abs/2310.06824
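For anyone wondering what the "mass mean probing" mentioned at the end amounts to: as far as I can tell from the paper, the probe direction is simply the difference between the mean activation of true statements and the mean activation of false statements. A rough numpy sketch of that idea (my own reading, not the authors' code; array names are made up):

    import numpy as np

    # acts_true / acts_false: (n_statements, 5120) hidden activations taken at
    # some layer and token position, for true and false statements respectively.
    def mass_mean_probe(acts_true, acts_false):
        mu_true, mu_false = acts_true.mean(axis=0), acts_false.mean(axis=0)
        theta = mu_true - mu_false                     # probe direction = difference of class means
        midpoint = 0.5 * (mu_true + mu_false) @ theta  # decision threshold between the two means
        return theta, midpoint

    def predict_true(acts, theta, midpoint):
        # Statements whose projection onto theta exceeds the midpoint get labeled "true".
        return acts @ theta > midpoint

The transfer experiments then amount to fitting the probe on one dataset's activations and evaluating predict_true on a different dataset's.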


I'd be interested to see the same kinds of results before/after RLHF training for censorship/"harmlessness", which I've seen pointed to in a few places as degrading the quality of responses in general.




I think people fixate on the word "knows" too much. They can clearly identify the difference between true and false statements, in the sense that if you provide them with some statement, they can label it as true or false with accuracy far higher than you'd expect from random chance. This takes like 5 seconds to verify with ChatGPT, and you can see it as a feature in these data sets.
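The "5 seconds with ChatGPT" check is literally just something like this sketch with the openai Python client (the model name and prompt wording are placeholders, not anything from the paper):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    statement = "The city of Beijing is in France."
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": f'Answer with one word, True or False: "{statement}"'}],
    )
    print(resp.choices[0].message.content)  # says "False" far more often than chance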

Whether it "knows" true and false or _anything at all_ is a completely different question.


Literally from the abstract of the accompanying paper:

"Overall, we present evidence that language models linearly represent the truth or falsehood of factual statements."

You didn't even read anything before commenting just off the headline, did you?


> You didn't even read anything before commenting just off the headline, did you?

Welcome to Hacker News. :)


TL;DR: actually, they might. If I'm reading the visualization correctly, and not just projecting my own prior expectations, it seems to imply that a distinct and general concept of "true" and "false" (or at least "known true" and "known false") emerges in the latent space.

It'll be interesting to explore how robust it is for statements crafted so that they surely did not occur in the training data, but whose truth value could be indirectly inferred from information in the training data.


I can already answer that question for you based on the "reversal curse" paper. LLMs trained on "A is B" fail to learn "B is A". It will only work if you explicitly generate text connecting B to A before training.

More generally, the organic training data is almost always incomplete. Samples of text leave many assumptions and inferences hidden. Like a math problem, you see the statement, but don't know the answer until you work it out. Or like a puzzle, you see the pieces but don't know the big picture until you fit those pieces together.

That is how training text samples are like unsolved enigmas: we train our models on undigested text. Often the pieces are spread over many training examples that almost always appear separately, never together, so the model never has the chance to draw a conclusion from them. Search is needed, augmenting training examples with supporting data.

Neural nets are smart at inference time but dumb at training time. They don't make those connections when they train. Instead, we need to draw those connections out by generating new text. We need to benefit from inference-time smarts before training. That means we need to use current LLMs to write the datasets of the next LLMs.

All the best LLMs today used a big share of synthetic data, including GPT-4. Datasets like Orca, Phi-1.5, ShareGPT, etc. It's also the best way to create small models (<10B) that actually work: you need very high-quality, high-diversity synthetic data.
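To make "explicitly generate text connecting B to A before training" concrete, at the data level it amounts to something like this toy sketch (my own illustration, not code from any of the papers above):

    # For each "A was B" fact in the corpus, also emit the reversed statement,
    # so the B -> A direction actually appears in the training text.
    facts = [
        ("Olaf Scholz", "the ninth Chancellor of Germany"),  # example used in the reversal curse paper
    ]

    def augment(name, description):
        forward = f"{name} was {description}."
        backward = f"{description[0].upper()}{description[1:]} was {name}."
        return [forward, backward]

    training_text = [line for fact in facts for line in augment(*fact)]
    print("\n".join(training_text))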


I don't get it. A is B doesn't always mean B is A.


GP meant that LLMs trained on "A is a B" won't necessarily be able to answer "give an example of B". Eg given the fact "Foo is the capital city of Bar", it might be able to complete "Foo is the capital city of _" but not "The capital city of Bar is _".

(Which isn't necessarily evidence against truth models for out-of-distribution facts; it could just be a matter of indexing.)


>"Eg given the fact "Foo is the capital city of Bar", it might be able to complete "Foo is the capital city of _" but not "The capital city of Bar is _"."

Is that really so? It would seem to be well within what I thought they were capable of based on all the other things they can do correctly.


The reversal curse itself is real. An example I can remember from the paper is "Who is Tom Cruise's mother? [Tom Cruise's mother's name]" paired with "Who is [Tom Cruise's mother's name]'s son? [incorrect answer or "Can't answer that"]". It is an interesting side effect of the way they think and a significant annoyance in getting them to work as we want them to.

That said, I think people make too much of it as an "LLMs can't reason" point, when I don't think that's accurate. What it says is that LLMs' instant recall is not logically bidirectional, but this is something that humans do as well. Humans take longer to respond to (and are less accurate answering) "Who is [her name]'s son?" than "Who is Tom Cruise's mother?". At least for me, when I get questions that are the "wrong way around", I have to literally run through it logically in my head, generally along the lines of "(What does that name remind me of, is she a spy? Or her son is a spy? Is she a fictional character? Wait, I think this is a celebrity thing, which spy celebrity has her as his mum? Oh yeah, Tom Cruise.) [Out loud:] Tom Cruise."

Also, some people misunderstand the actual deficiency, and think that the LLM can't answer the question at all, rather than just zero shot. The LLM can answer the question if it has the information in context, it can reason "If A=B, then B=A" just fine. It just can't do the less popular halves of AB equivalencies zero shot.


Good point. In some ways this kind of error makes it feel like its process is very comparable to what humans do.


Yeah, I'm counting this as yet another piece of weak but positive evidence that we've hit on something fundamental here - if not universal, then at least fundamental to the path evolution took that led to a human mind.


Yeah, I agree. There is exactly one non-LLM entity in existence that can reason this generally and this well in human languages, and that's us, human beings. We built LLMs by taking inspiration from the human brain and trying to approximate it with neural networks that were often able to achieve intelligent-ish performance on tasks in narrow domains, and eventually we stumbled on an architecture that is truly general, even if it's generally dumb. It would be an absurd coincidence to me if that architecture, LLMs, actually had nothing in common with how humans think. None of that means it is the best architecture for thinking like humans, or that we just need to scale it up to get to super-intelligence, or that it is currently as smart as a human being. But it just doesn't seem plausible that it behaves so much like a human mind if there's really nothing in common underneath.


> An example I can remember from the paper is "Who is Tom Cruise's mother? [Tom Cruise's mother's name]" paired with "Who is [Tom Cruise's mother's name]'s son? [incorrect answer or "Can't answer that"]".

The paper is, apparently, still under review.

In the meantime, may I suggest you verify that example yourself?


I'm seeing exactly what lucubratory described with ChatGPT. If the information is in the context window, it has no trouble working out the reverse.

    Me: Who is tom cruises mother?
    ChatGPT: Tom Cruise's mother is Mary Lee Pfeiffer.
    User: Who is Mary Lee pfeiffers son
    ChatGPT: Mary Lee Pfeiffer's son is the famous actor, Tom Cruise.
But if you ask my second question directly in a fresh session, it doesn't know the answer.

Interestingly though, you can give it additional clues and it'll get it. https://chat.openai.com/share/893c1088-6718-4113-a3f1-cf273d...


I misread lucubratory's comment: indeed I see the same as you do. I only tried asking both questions in the same session. I didn't see that point in the paper when I quickly skimmed through it to find the relevant part.

I also agree with him about humans being capable of the same "errors".


Her, but thank you for rereading the comment, I appreciate it.


Yes, it can only learn to solve the problems that it was trained to solve. Not surprising really. Tautological.


Right, but that ignores the key piece, which is whether such systems can infer the equivalence relation denoted by "is". A is A. A is B implies B is A. A is B and B is C implies A is C. When a system sees that A is B, yet cannot infer that B is A, it exhibits an asymmetry, which is interesting. Whether this asymmetry exists in the underlying language is unclear.


> A is B implies B is A.

I get the overall idea, but this statement isn't always true, right?


Correct. Not in natural language. These are natural language systems, not logic systems.

"The sky is blue" does not imply "blue is the sky".


There's an asymmetry here, too. "The sky is ${color}", with no extra context, has one obvious answer for us living here, today. Whereas for "Blue is the color of ${thing}", with no extra context, there's an insane number of equally sensible substitutions for ${thing}. Without extra context, the model has no reason to privilege "sky" over any other equally valid answer.


> When a system sees that A is B, yet cannot infer that B is A

So the issue is that they cannot infer that general rule due to a fundamental limitation of the transformer LLM architecture, not just a training data issue? I skimmed the paper and it seems to be the case.


Could you give an example?


Cows are mammals, but mammals are not cows.


"Cows are mammals" means: Array($mammals).includes($cows).

It does not mean $cows === $mammals.

"A is a B" Or "A's are B's" means that A is included in B.

$a === $b ("A is B") implies that $b === $a (B is A), in all cases that I know of.
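In runnable form, roughly (a small Python sketch of the same distinction; the sets and names are made up):

    # Membership ("cows are mammals") is not symmetric; identity/equality is.
    mammals = {"cow", "dog", "whale"}
    print("cow" in mammals)                  # True: cows are mammals
    print(all(m == "cow" for m in mammals))  # False: mammals are not (all) cows

    morning_star = evening_star = "Venus"    # two names, one referent
    print(morning_star == evening_star and evening_star == morning_star)  # True: symmetric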


What cases are there for $a === $b? I guess there are synonyms and rephrasings, but in general it's not a useful relationship in language.


From this paper:

https://owainevans.github.io/reversal_curse.pdf

"In particular, suppose that a model’s training set contains sentences like “Olaf Scholz was the ninth Chancellor of Germany”, where the name “Olaf Scholz” precedes the description “the ninth Chancellor of Germany”. Then the model may learn to answer correctly to “Who was Olaf Scholz? [A: The ninth Chancellor of Germany]”. But it will fail to answer “Who was the ninth Chancellor of Germany?” and any other prompts where the description precedes the name."


Batman is Bruce Wayne.


The operator "are", which happens to look similar to an English word, has a meaning that depends on context. Another example is 5/2, which means a simple fraction in one context, or a class of polytopes in another.


I wonder how well it would work if LLMs did "online learning". After every interaction they'd be asked to summarise new knowledge (assuming the input is trusted) and create multiple variations on it, then train on it. I wonder if such a model would improve or degrade over time.


If I say to ChatGPT

    Adrian Tompkins was the ninth mayor of the town of Wolverhillington. Who was the ninth mayor of Wolverhillington?
Then it correctly responds

    The ninth mayor of the town of Wolverhillington was Adrian Tompkins.
What am I doing wrong?


Nothing. It's not an inference problem but a training one. If you train on the sequence,

"Adrian Tompkins was the ninth mayor of the town of Wolverhillington"

And later ask at inference time,

"Who was the ninth mayor of Wolverhillington?",

It might not return the answer.


That is a poor summary.



