
I think a better way to think about it is that LLMs can only "hallucinate"; that is, they generate output statistically from their input. When the output happens to correspond with fact once a human reads the words and models them mentally, that is really the exception, and just luck. The LLM literally has no clue about anything, and by design, never will.



This generally lines up with some threads being passed around online, but not really with the mathematics of what's actually happening in the network. Since this comment has visibility at the moment, and I'm doing my part to counter some of the misinformation on LLMs, I wanted to make a quick note.

A simple, informal proof sketch:

Emitting a token T for an input I with zero entropy requires knowing the entire state of the system that generates the data at the time input I is given. It also requires knowing the system itself, since knowing the state alone is meaningless without knowledge of the dynamics acting on it.
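
To pin down what I mean by zero entropy (just the standard Shannon definition, nothing exotic), the conditional entropy of the next token given the input is

    H(T \mid I) = -\sum_t p(t \mid I) \log p(t \mid I)

and it hits zero exactly when a single token has probability 1, i.e. when the emitter already knows with certainty what comes next. That is the sense in which "zero entropy" and "knowing the full state of the system" are the same requirement.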

This is of course impossible for a model that is itself contained within the system emitting the tokens T; however, an approximation is possible. Approaching the zero-entropy bound necessarily requires learning the inherent dynamics of the system itself. Any model that trivially depends upon statistics could not do causal reasoning: its output would become exponentially less likely to stay correct over time, and at long output lengths that becomes practically impossible.

Thus, beyond a certain point, to reduce the cross-entropy any further below some softly-defined minimum, the system must genuinely generalize to the underlying problem that yields the tokens in question (hence ML's data hunger).
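
A toy way to see the exponential-decay claim (my own back-of-the-envelope, assuming per-token correctness is independent with some probability p < 1, which is itself a simplification):

    # Toy sketch: if a purely statistical emitter gets each token "right"
    # independently with probability p, the chance the whole output of
    # length n stays on track is p**n, which collapses quickly.
    def chance_fully_on_track(p_per_token: float, length: int) -> float:
        return p_per_token ** length

    for n in (10, 100, 1_000, 10_000):
        print(n, chance_fully_on_track(0.99, n))
    # 10     ~0.90
    # 100    ~0.37
    # 1000   ~4.3e-05
    # 10000  ~2.2e-44

So a model that only matches local statistics, with no grip on the process generating the tokens, cannot keep long outputs coherent; driving that per-token error down is exactly what pushes it toward the underlying dynamics.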

I'm not surprised by this comment, it is a personal belief after all, but seeing similar ideas posted with such confidence both here and on Twitter does confuse me personally; in my view it's not really grounded in the information theory of how deep learning works. Even with a purely statistical argument one could make a very strong case rather easily, especially comparing small models to large ones.


It seems like the point being made is that because an LLM lives within the universe and can't store the entire universe, it would need to "reason" to produce coherent output of a significant length. It's possible I misunderstood your post, but it's not clear to me that any "reasoning" isn't just really good hallucination.

Proving that an AI is reasoning and not hallucinating seems super difficult. Even proving that there's a difference would be difficult. I'm more open to the idea that reasoning in general is just statistical hallucination even for humans, but that's almost off topic.

> Any model that trivially depends upon statistics could not do causal reasoning: its output would become exponentially less likely to stay correct over time, and at long output lengths that becomes practically impossible.

It's not clear to me that it _doesn't_ fall apart over long output lengths. Our definition of "long output" might just be really different. Statistics can carry you a long way if the possible output is constrained, and it's not like we don't see weird quirks in small amounts of output all the time.

It's also not clear to me that adding more data leads to a generalization that's closer to the "underlying problem". We can train an AI on every sonnet ever written (no extra tagged data or metadata) and it'll be able to produce a statistically coherent sonnet. But I'm not sure it'll be any better at evoking an emotion through text. Same with arithmetic. Can you embed the rules of arithmetic purely in the structure of language? Probably. But I'm not sure the rules can be reliably reversed out enough to claim an AI could be "reasoning" about it.
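
To make my arithmetic worry concrete (a toy sketch of my own, deliberately degenerate and not a claim about how real LLMs work): a character-level bigram model trained only on correct sums picks up the shape of the data without any grip on the rule that generated it.

    # Toy sketch: character bigram model trained on correct sums.
    # It learns which character tends to follow which, nothing more.
    import random
    from collections import Counter, defaultdict

    random.seed(0)
    corpus = [f"{a}+{b}={a+b}." for a in range(100) for b in range(100)]

    counts = defaultdict(Counter)
    for line in corpus:
        prev = "^"                      # start-of-string marker
        for ch in line:
            counts[prev][ch] += 1
            prev = ch

    def sample() -> str:
        out, prev = [], "^"
        while True:
            chars, weights = zip(*counts[prev].items())
            ch = random.choices(chars, weights)[0]
            if ch == ".":               # "." marks the end of an equation
                return "".join(out)
            out.append(ch)
            prev = ch

    for _ in range(5):
        print(sample())   # sum-shaped strings, usually malformed or just wrong

Whether a vastly richer model that matches the surface distribution nearly perfectly has actually internalized the rule, rather than an extremely good approximation of it, is exactly the part I don't think we get for free.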

It does make me wonder what work has gone into detecting and quantifying reasoning. There must be tons of it. Do we have an accepted rigorous definition of reasoning? We definitely can't take it for granted.


"Reasoning" and "hallucinating" are shallower terms that often come up in discussions of this topic, but they ultimately don't capture where and how the model is fitting the underlying manifold of the data -- which information theory actually describes rather well. That's why I referenced Shannon entropy as an interpretive framework: it provides mathematical guarantees and ties nicely into the other information-compression measures, which I feel do answer some of the questions that seem more ambiguous to you.
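
Concretely, the guarantee I'm leaning on is just the standard cross-entropy decomposition (p is the data-generating distribution, q is the model):

    H(p, q) = H(p) + D_{KL}(p \| q) \ge H(p)

The loss we actually train on, H(p, q), can only bottom out at the data's own entropy H(p) when the KL term vanishes, i.e. when q matches p. That's the formal sense in which pushing cross-entropy down forces the model toward the structure that generates the data rather than away from it.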

That is the trouble with mixing inductive reasoning with a problem that has mathematical roots. In some regimes it's intractable to directly measure how much of this is happening, but we have a clean mathematical framework that answers these questions well, so leaning on it helps.

The easiest example of yours that I can tie back to the math is the arithmetic embedded in the structure of language. You can use information theory to show this pretty easily; you might appreciate looking into Kolmogorov complexity as a fun side topic. I'm still learning it myself (heck, any of these topics goes a mile deep), but it's been useful.
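
A quick flavor of what I mean, using off-the-shelf compression as a crude, compressor-dependent proxy for Kolmogorov complexity (the real thing is uncomputable):

    # Rough sketch: text generated by a short rule (correct sums) compresses
    # far better than random characters of the same length, hinting at the
    # low-complexity structure a good model can latch onto.
    import random
    import zlib

    random.seed(0)
    sums = " ".join(f"{a}+{b}={a+b}" for a in range(100) for b in range(100))
    noise = "".join(random.choice("0123456789+= ") for _ in range(len(sums)))

    for name, text in (("rule-generated sums", sums), ("random characters", noise)):
        ratio = len(zlib.compress(text.encode())) / len(text)
        print(f"{name}: compressed to {ratio:.1%} of original size")

It's only a proxy, but it gives a feel for how "the rules of arithmetic are embedded in the text" cashes out as measurable structure.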

Reasoning, on the other hand, I find to be a much harder topic in terms of measuring it. It can be learned, like any other piece of information.

If I could recommend one piece of literature to start diving into the meat of this, I feel like you would appreciate this one most. It's a crazy cool field of study, and this paper in particular is quite accessible and friendly to most backgrounds: https://arxiv.org/abs/2304.12482


> Any model that trivially depends upon statistics could not do causal reasoning: its output would become exponentially less likely to stay correct over time, and at long output lengths that becomes practically impossible.

This is handwaving. Yes, a system that is fundamentally based on statistics will require more and more data and compute to keep functioning over longer and longer output lengths.

But you don't know a priori what the shape of that curve is, or how far along it current LLMs are (maybe their creators have some idea, but I suspect not even they truly understand where on that curve the current systems are).

Thus, there's no reason to assume that the system is "generaliz[ing] to the underlying problem" at all. And in fact, I'd argue that not only is there no reason to do so, there are strong reasons to assume that it is not doing that.


Nope. It's an argument from Lyapunov exponents, translated into layman's terms.

Edit: I noticed we could be reading this in different directions -- I'm reading the OP's post as treating LLMs as large Markov-chain-memorizing models, which is where the statistics argument comes into play. Heck, the curse of dimensionality alone makes the memorization aspect intractable, so there has to be some kind of compression; the only question is what kind. I agree that, statistically, the output is going to approach what happens in real-world text as things get larger, but simple hallucinated chains of text generated in an autoregressive manner severely break the causal regime. I think that's where I was coming from originally; please do let me know if I've misunderstood, however.
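
To put a number on the curse-of-dimensionality point (back-of-the-envelope; 50,000 is an assumed, roughly GPT-like vocabulary size):

    # How many distinct contexts a pure lookup-table Markov model would
    # need to memorize, as a function of context length.
    vocab_size = 50_000
    for context_length in (2, 4, 8):
        print(f"{context_length} tokens of context -> "
              f"{vocab_size ** context_length:.2e} possible contexts")
    # prints roughly 2.50e+09, 6.25e+18, and 3.91e+37 respectively

Even at eight tokens of context the table is astronomically larger than any corpus, so whatever the model is doing, it isn't lookup; it has to compress, and the argument is really about what that compression ends up representing.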

Second Edit: w.r.t. the underlying manifold generalization: yes, this is a natural consequence of the operators used. Directly measuring how much of it is happening is intractable, but by the nature of the system operating under those operators, we are necessarily somewhere on that compressive spectrum of generalization.



