In 1951, some 72 years ago now, Claude Shannon published his paper "Prediction and Entropy of Printed English", still an extremely fascinating read today.
It begins with a game. Claude pulls a book down from the shelf, concealing the title in the process. After selecting a passage at random, he challenges his wife, Mary, to guess its contents letter by letter. The space between words counts as a twenty-seventh symbol in the set. If Mary fails to guess a letter correctly, Claude promises to supply the right one so that the game can continue.
In some cases, a corrected mistake allows her to fill in the remainder of the word; elsewhere a few letters unlock a phrase. All in all, she guesses 89 of 129 possible letters correctly—69 percent accuracy.
Discovery 1: It illustrated, in the first place, that a proficient speaker of a language possesses an “enormous” but implicit knowledge of the statistics of that language. Shannon would have us see that we make similar calculations regularly in everyday life—such as when we “fill in missing or incorrect letters in proof-reading” or “complete an unfinished phrase in conversation.” As we speak, read, and write, we are regularly engaged in prediction games.
Discovery 2: Perhaps most striking of all, Claude argues that a complete text and the subsequent “reduced text” consisting of letters and dashes “actually…contain the same information” under certain conditions. How?? (Surely the complete text contains more information!) The answer depends on the peculiar notion of information that Shannon had hatched in his 1948 paper “A Mathematical Theory of Communication” (hereafter “MTC”), the founding charter of information theory.
He argues that the transfer of a message's components, rather than its "meaning", should be the focus for the engineer. You ought to be agnostic about a message’s “meaning” (or “semantic aspects”). The message could be nonsense, and the engineer’s problem—to transfer its components faithfully—would be the same.
A highly predictable message contains less information than an unpredictable one. More information is at stake in “villapleach, vollapluck” than in “Twinkle, twinkle”.
Does "Flinkle, fli- - - -" really contain less information than "Flinkle, flinkle"?
Shannon concludes then that the complete text and the "reduced text" are equivalent in information content under certain conditions because predictable letters become redundant in information transfer.
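To put a number on that, here is the standard measure from MTC, restated (not quoted) from the 1948 paper: the information carried by a symbol depends only on how surprising it is to the receiver.

```latex
\[
  I(x) = -\log_2 p(x) \quad \text{bits carried by a symbol predicted with probability } p(x)
\]
\[
  H = -\sum_{x} p(x)\,\log_2 p(x) \quad \text{average bits per symbol (the entropy)}
\]
% A letter Mary predicts with near-certainty, p(x) close to 1, carries roughly 0 bits,
% which is why replacing it with a dash in the reduced text throws away almost nothing.
```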
Fueled by this, Claude then proposes an illuminating thought experiment: Imagine that Mary has a truly identical twin (call her “Martha”). If we supply Martha with the “reduced text,” she should be able to recreate the entirety of Chandler’s passage, since she possesses the same statistical knowledge of English as Mary. Martha would make Mary’s guesses in reverse.
Of course, Shannon admitted, there are no “mathematically identical twins” to be found, but, and here's the reveal, “we do have mathematically identical computing machines.”
Those machines could be given a model for making informed predictions about letters, words, maybe larger phrases and messages. In one fell swoop, Shannon had demonstrated that language use has a statistical side, that languages are, in turn, predictable, and that computers too can play the prediction game.
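Here is a minimal sketch of that reveal in code. The shared "statistical knowledge" is an invented, deliberately crude next-letter table (nothing from Shannon's actual experiment); the point is only that two copies of the same predictor can turn a text into a reduced text and back, with dashes standing in for every letter the model would have guessed anyway.

```python
# Two copies of the same toy predictor play Mary and Martha.
# GUESS is a stand-in for real statistical knowledge of English:
# the single most likely next character given the previous one.
GUESS = {"t": "h", "h": "e", "e": " ", " ": "t", "q": "u", "a": "n"}

def predict(prev: str) -> str:
    """Mary's single best guess for the next character."""
    return GUESS.get(prev, "e")   # default guess: the most common letter

def encode(text: str) -> str:
    """Mary's side: keep only the characters the predictor gets wrong."""
    reduced, prev = [], ""
    for ch in text:                       # (the toy assumes the text contains no dashes)
        reduced.append("-" if predict(prev) == ch else ch)
        prev = ch
    return "".join(reduced)

def decode(reduced: str) -> str:
    """Martha's side: make the same guesses in reverse to restore the text."""
    restored, prev = [], ""
    for ch in reduced:
        actual = predict(prev) if ch == "-" else ch
        restored.append(actual)
        prev = actual
    return "".join(restored)

text = "the quantity"
reduced = encode(text)
assert decode(reduced) == text
print(reduced)   # dashes mark every letter the shared model predicted correctly
```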
There was a fun recent variant on this game using LLMs: asking GPT-3 (3.5?) to encode text in a way that it would later be able to decode back into the original meaning. Some of the encodings are insane.
This is super interesting. Are there more examples I can see? The one in the article is a famous song, which makes me wonder if it's really "decompressing" the data, or just being nudged toward a very common, popular pattern of tokens.
Whenever a human remembers something specific, they actually don't. Instead, they remember a few small details, and the patterns that organize them. Then, the brain hallucinates more details that fit the overall pattern of the story. This phenomenon is called "Reconstructive Memory", and is one reason why eyewitness testimony is unreliable.
An LLM is similar to memory: you feed its neurons a bunch of data that has meaning encoded into it, and it becomes a model of any patterns present in that data. When an LLM generates a continuation, it continues the patterns it has modeled, drawing on both the data it was trained on and whatever prompt it was given.
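As a rough illustration of "continuing the patterns it modeled", here is a toy character-level bigram counter (nowhere near a real LLM, and the corpus is made up): the generator has no stored copy of any sentence, only follow-statistics, and its continuations merely fit those statistics.

```python
import random
from collections import defaultdict, Counter

def train(text: str) -> dict:
    """Count, for each character, which characters tend to follow it."""
    follows = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        follows[a][b] += 1
    return follows

def continue_text(model: dict, prompt: str, n: int = 40, seed: int = 0) -> str:
    """Extend the prompt by repeatedly sampling a plausible next character."""
    rng = random.Random(seed)
    out = list(prompt)
    for _ in range(n):
        counts = model.get(out[-1])
        if not counts:
            break
        chars, weights = zip(*counts.items())
        out.append(rng.choices(chars, weights=weights)[0])
    return "".join(out)

corpus = "sue got in the boat. the boat sat on the sand. sue sat in the sun. "
model = train(corpus)
print(continue_text(model, "the b"))
# The output looks vaguely corpus-like: the model reproduces local patterns,
# not remembered sentences.
```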
Natural language solved! Right?
---
Not so fast!
The human mind performs much more than memory reconstruction. How else would we encode meaning into the semantics of language, and write that data into text?
There is more to this story. There is more to... story.
Remember when I said the information is "moved"? Where did it go? More importantly, how can we use that data?
---
Let's consider a human named Dave, reading a book about boats. What is Dave using to read it? The short version: empathy.
Story is held together with semantics. Semantics are already-known patterns that define logical relationships between words.
When Dave reads the statement, "Sue got in the boat", he interprets that information into meaning. Sue is a person, probably a woman. She entered the top of a vessel that was floating on water.
But Dave was wrong! Sue is a cat, and the boat was lying on a dry beach.
Here's the interesting part: Dave was totally correct until I declared otherwise. His interpretation matched his internal worldview: all of the ambiguity present in what he read was resolved by his assumptions. Making false assumptions is a completely valid result of the process that we call "reading". It's a feature.
After Dave overheard me talking to you just now, he learned the truth of the story, and immediately resolved his mistake. In an instant, Dave took the semantic information he had just read, and he reread it with a completely different worldview. But where did he read it from?
His worldview. You see, after reading the book, its meaning was added neatly into his worldview. Because of this, he was prepared to interpret what I was telling you: about Sue being a cat and so on. Dave performed the same "reading" process on my new story, and he used the statement he read in the book to do just that.
---
Worldview is context. Context is the tool that resolves ambiguity. We use this tool to interpret story, particularly when story is ambiguous.
So what is context made of? Story.
It's recursive. Everything that we read is added to our internal story. One giant convoluted mess of logic is carved into our neurons. We fold those logical constructs together into a small set of coherent ideas.
But where is the base case? What's the smallest part of a story? What are the axioms?
This is the part I struggled with the most: there isn't one. Somehow, we manage to perform these recursive algorithms from the middle. We read the story down the ladder of abstraction, as close to its roots as we can find; but we can only read the story as far as we have read it already.
We can navigate the logical structure of ideas without ever proving that logic to be sound. We can even navigate logic that is outright false! Constraining this behavior to proven logic has to be intentional. That's why we have a word for it: mathematics. Math is the special story: it's rooted in axioms, and built up exclusively using theorems.
Theorems are an optimization: they let us skip from one page of a story to another. They let us fold and compress logical structure into something more practical.
---
LLMs do not use logic at all. The logic of invalidation is missing. When story categorizes an idea into the realm of fiction, the LLM simply can't react accordingly. The LLM has no internal story: only memory.
The comment was just to tell a fascinating story of the conceptual origins of what we have today. But the predictor Claude imagined actually works quite a bit differently from today's models.
Yes, Shannon argued that meaning and semantics weren't necessary, but today we know that our language models do develop meaning and semantics. We know they build models. We know they try to model the causal processes that generate the data, and implicit structure that was never explicitly stated in the text can emerge in the inner layers.
>LLMs do not use logic at all. The logic of invalidation is missing.
This is a fascinating idea, but one that just doesn't square with reality. In fact, this is all they do. What do you imagine training to be?
Prediction requires a model of some sort. It need not be completely accurate, or work the way you imagine it does. But to make predictions performantly, you must model your data in some way.
The important bit here is that the current paradigm doesn't just stop at that. Here, the predictor is learning to predict.
We have some optimizer that is tirelessly working to reduce loss. But what does a reduction in loss on an internet-scale data distribution mean?
It means better and better models of the data set. Every time a language model fails a prediction, that's a signal to the optimizer that the current model is incomplete or insufficient in some way; work needs to be done, and work will be done, bit by bit. The models inside an LLM at any point in time A differ from the models at any point in time B during training, but it's not a random difference. It's a difference that trends in the direction of a more robust worldview of the data.
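For a concrete (if drastically simplified) picture of that loop, here is a toy next-token model in PyTorch. None of this is any particular LLM's training code; it is just the general shape of prediction, loss, and an optimizer doing the restructuring.

```python
import torch
import torch.nn as nn

# Toy "text": a repeating sequence of token ids, so there is a pattern to learn.
V = 16
tokens = torch.tensor(list(range(V)) * 32)    # 512 tokens
inputs, targets = tokens[:-1], tokens[1:]     # task: predict the next token

# Toy next-token predictor: embedding -> logits over the vocabulary.
model = nn.Sequential(nn.Embedding(V, 32), nn.Linear(32, V))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for step in range(200):
    logits = model(inputs)            # the model's predictions
    loss = loss_fn(logits, targets)   # high wherever a prediction failed
    optimizer.zero_grad()
    loss.backward()                   # the failure signal, routed back to the weights
    optimizer.step()                  # the weights at step B differ from step A,
                                      # and not randomly: loss trends downward
```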
This is why language models don't bottleneck at some arbitrary competence level that humans like to shoehorn onto them.
There is a projection of the world in text. Text is the world and the language model is very much interacting with it.
The optimizer may be dumb, but this restructuring of neurons to better represent the world as seen in the text is absolutely happening.
> >LLMs do not use logic at all. The logic of invalidation is missing.
> This is a fascinating idea, but one that just doesn't square with reality. In fact, this is all they do. What do you imagine training to be?
I could have been more clear, but I didn't want to write a novel. The ambiguity here is in what they invalidate: memory reconstructions, not logical assertions.
An LLM can't tell the difference between fact and fiction, because it can't apply logic.
Better memory will never spontaneously grow the ability to think objectively about that memory. LLMs improve, yes, but they didn't start out as a poor-quality equivalent of human thought. They started out as a poor-quality equivalent of human memory.
> There is a projection of the world in text. Text is the world and the language model is very much interacting with it.
The language model becomes that world. It does not inhabit it. It does not explore. It does not think, it only knows.
>An LLM can't tell the difference between fact and fiction, because it can't apply logic.
Not true. They can differentiate it just fine. Of course, being able to tell the difference and being incentivized to communicate it are two different things.
Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback - https://arxiv.org/abs/2305.14975
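One way to see what "calibrated" means in that paper: collect the model's stated confidences alongside whether each answer was actually right, and measure how far the two diverge. The confidence/correctness pairs below are invented for illustration; only the metric (expected calibration error) is standard.

```python
# Expected calibration error: do stated confidences match observed accuracy?
def expected_calibration_error(samples, n_bins=5):
    bins = [[] for _ in range(n_bins)]
    for conf, correct in samples:
        idx = min(int(conf * n_bins), n_bins - 1)   # bucket by confidence level
        bins[idx].append((conf, correct))
    ece, total = 0.0, len(samples)
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / total) * abs(avg_conf - accuracy)   # gap, weighted by bin size
    return ece

# Made-up (confidence, was_correct) pairs; a well-calibrated model keeps this near 0.
samples = [(0.9, True), (0.8, True), (0.7, False), (0.95, True), (0.3, False)]
print(expected_calibration_error(samples))
```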