After a quick/superficial read, my understanding is that the authors:
(a) induce an LLM to take natural language inputs and generate statements in a probabilistic programming language that formally models concepts, objects, actions, etc. in a symbolic world model, drawing from a large body of research on symbolic AI that goes back to pre-deep-learning days; and
(b) perform inference using the generated formal statements, i.e., compute probability distributions over the space of possible world states that are consistent with and conditioned on the natural-language input to the LLM.
If this approach works at a larger scale, it represents a possible solution for grounding LLMs so they stop making stuff up -- an important unsolved problem.
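To make the (a)/(b) pipeline above concrete, here is a minimal toy sketch of the idea: a stubbed-out "translation" step standing in for the LLM, and exact inference by enumeration standing in for the probabilistic-program inference. Everything here (the fake translation table, the three-coin world) is invented for illustration; the paper itself generates programs in a real probabilistic programming language.

```python
import itertools
from collections import defaultdict

# (a) Stand-in for "LLM translates natural language into a formal condition".
# In the actual system this would be a generated probabilistic-program
# statement; this lookup table is purely illustrative.
def translate(utterance):
    table = {
        "The coin came up heads at least once.": lambda w: any(w),
        "The first flip was tails.": lambda w: w[0] == 0,
    }
    return table[utterance]

# (b) Exact inference by enumeration over a tiny world: three fair coin flips.
def posterior(conditions):
    weights = defaultdict(float)
    for world in itertools.product([0, 1], repeat=3):
        if all(cond(world) for cond in conditions):
            weights[world] += 1.0  # uniform prior over worlds
    z = sum(weights.values())
    return {w: p / z for w, p in weights.items()}

conds = [translate("The coin came up heads at least once."),
         translate("The first flip was tails.")]
# Distribution over world states consistent with (conditioned on) both statements.
print(posterior(conds))
```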
I have not yet read the paper, but based on this description it seems like it provides grounding in the context of the training data, which is kind of the rub with current LLMs to begin with, right? We don't have a set of high quality training data that is completely unbiased and factual.
> … which is kind of the rub with current LLMs to begin with, right?
No, the bigger problem with current LLMs is that even with high quality factual training data, they often generate seemingly plausible nonsense (e.g. cite nonexistent websites/papers as their sources.)
This is by design imo; they’re trained to generate ‘likely’ text, and they do that extremely well. There’s no guarantee of faithful retrieval from a corpus.
An important addition to your partially right statement that "they’re trained to generate ‘likely’ text": they are trained to produce the most probable next word, so that the current context looks as "similar" to the training data as possible. And "similar" is not "equal".
Humans’ experience and understanding of the world around them isn’t limited to a symbolic representation.
It remains to be seen whether you can truly be an effective intelligence with understanding of the world if all you have are symbols that you have to manipulate.
It's a surprise to see a paper actually try to solve the problem of modelling thought via language.
Nevertheless, it begins with far too many hedges:
> By scaling to even larger datasets and neural networks, LLMs appeared to learn not only the structure of language, but capacities for some kinds of thinking
There are two hypotheses for how LLMs generate apparently "thought-expressing" outputs: Hyp1 -- it's sampling from similar text which is distributed so-as-to-express a thought by some agent; Hyp2 -- it has the capacity to form that thought.
It is absolutely trivial to show Hyp2 is false:
> Current LLMs can produce impressive results on a set of linguistic inputs and then fail completely on others that make trivial alterations to the same underlying domain.
Indeed: because there're no relevant prior cases to sample from in that case.
> These issues make it difficult to evaluate whether LLMs have acquired cognitive capacities such as social reasoning and theory of mind
It doesn't. It's trivial: the disproof lies one sentence above. It's just that many don't like the answer. Such capacities survive trivial permutations -- LLMs do not. So Hypothesis-2 is clearly false.
> Current LLMs can produce impressive results on a set of linguistic inputs and then fail completely on others that make trivial alterations to the same underlying domain.
>Indeed: because there're no relevant prior cases to sample from in that case.
That's not what that tells us. Humans have weird failure modes that look absurd outside the context of evolutionary biology (some still look absurd) and that don't speak to any lack or presence of intelligence or complex thought. Not sure why it's so hard to grasp that LLMs are bound to have odd failure modes regardless of the above.
And "trivial" here is relative. In my experience, "trivial" often turns out to be trivial in a way a person might not pay close attention to either, and be similarly tricked by.
For instance, GPT-4 might solve a classic puzzle correctly then fail the same puzzle subtly changed.
I've found more often than not, simply changing names of variables in the puzzle to something completely different can get it to solve the changed puzzle. It takes memory shortcuts but can be pulled out of that.
LLMs have failure modes that look like human failure modes too.
The "failure modes" in humans do not show we lack the capacity.
E.g., do you have the capacity to reason about physics? Well, if you're extremely drunk, less so. But not if I permute the name of the object.
> I've found more often than not, simply changing names of variables
Yes, lol --- why do you think that is?
Because in the digitised dataset of "everything ever written" those names correspond to places in that dataset that can be sampled from by the LLM. Showing Hyp1 to be the case.
>The "failure modes" in humans do not show we lack the capacity.
Then they don't in LLMs too
>Yes, lol --- why do you think that is?
Being able to solve a changed common puzzle, with names different from anything it would ever see in training, is not an indication of a lack of ability lol. And changing names isn't the only way to get it out of memory, just the easiest/most straightforward. You can converse it out of there too but that doesn't work as often.
If a child answers questions from a book of answers then they'll appear to understand the domain insofar as those questions appear. They do not.
They will fail to answer questions under, eg., permutations of words (say, a question asks about "norepinephrine" but the book only contains "noradrenaline" etc.).
Insofar as a human cannot answer questions under trivial linguistic permutations then they too do not understand the domain.
But these are not the kinds of failures experienced with those who have some capacity, eg., for counter-factual reasoning about their environment's physics.
In those people it is environmental illusion and cognitive impairment -- not trivial permutations of phrasing which lead to catastrophic loss of apparent understanding.
Cognitive impairment = reasoning machine is broken
Environmental illusion = data is ambiguous and actions cannot resolve it
These "failure modes" are expected if you actually have the relevant capacity.
"But the bag is transparent" is not "irrelevant word permutation" and neither is the additive question that spurs the correct resolution. And it certainly isn't random.
A human that isn't paying attention could fail the question too, which is kind of the point I'm making.
> less so. But not if I permute the name of the object.
You need to realize that you wrote it on a forum where the best-known joke is "there are two hard things in programming". That alone should immediately show you how this assumption is exactly false.
This is a false dichotomy. It's not the case that models are truly capable of reasoning if and only if they are insensitive to irrelevant perturbations to input. In other words, the mere fact that sensitivity to names sometimes causes significant degradations in model performance doesn't mean that we've observed models are incapable of anything we might call "reasoning"—leaving aside the matter of how we'd define that.
This might be a naive question, but hear me out. Do we really know what the difference is between statistics and the capacity to think? Is "true understanding" rather a continuum of sophistication from a simple adder to Albert Einstein?
My point here isn't "if it quacks like a duck...", but more so that while we are talking about intelligent apparatus we should be comparing apples to apples, and not say "this is a mere engine and that is a living brain".
Idk, that isn't the sense I got from "It is absolutely trivial to show Hyp2 is false", but sure, I agree with you that this evidence certainly ought to tip the scales one way and not the other.
>for humans, everything (everything) is within the context of evolutionary biology!
Sure but if some alien species were observing us, some of our actions would look downright odd. Evolutionary biology doesn't necessarily hold the same reference frame for other species, even on earth. Octopi are weird to us. Not so much to other Octopi.
>Yes - because LLM's are trained on 2020 Reddit.
I wasn't making any comment on why this was the case. Simply that it was. There'll be failure modes LLMs adopt from training data, but there are also bound to be failure modes LLMs adopt from the training scheme itself.
To investigate precisely this question in a clear and unambiguous way, I trained an LLM from scratch to sort lists of numbers. It learned to sort them correctly, and the entropy is such that it's absolutely impossible that it could have done this by Hyp1 (sampling from similar text in the training set).
Now, there is room to argue that it applies a world-model when given lists of numbers with a hidden logical structure, but not when given lists of words with a hidden logical structure, but I think the ball is in your court to make that argument. (And to a transformer, it only ever sees lists of numbers anyway).
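For anyone curious what such a from-scratch setup might look like, here is a rough sketch of generating the training text. The list length, number range, and "input : sorted output" format are assumptions for illustration; the parent's actual data format may differ.

```python
import random

def make_example(max_len=42, lo=0, hi=99):
    """One training line: a random list followed by its sorted version."""
    n = random.randint(2, max_len)
    xs = [random.randint(lo, hi) for _ in range(n)]
    return f"[{','.join(map(str, xs))}]:[{','.join(map(str, sorted(xs)))}]"

# Write a corpus of examples; a character-level transformer is then trained to
# predict the text after the ':' given the text before it.
with open("sort_train.txt", "w") as f:
    for _ in range(1_000_000):
        f.write(make_example() + "\n")
```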
Your model is not sorting correctly and it sure has not learned any "algorithm". At best it has learned to approximate a sorting algorithm. That's what statistical machine learning models do, they are function approximators; not program learners.
Also, Machine Learning 101: you test your models on a test set that is disjoint to the training set. To clarify, we do this not because it's in the book and that's the rules, but because, by testing the model on held-out data, we can predict the error the model will have on unseen data (i.e. data not available to the experimenter). And we do this because under PAC-Learning assumptions a learner is said to learn a concept when it can correctly label instances of the concept with some probability of some error. In real-world situations we do not know the true concept, so we test on held-out data to approximate the probability of error.
Bottom line, if you train a model to do a thing and you don't test it carefully to figure out its error, you might claim it's learned something, but in truth, you have no idea what it's learned.
(To clarify: you tested on the train data assuming there's a low probability of overlap. Don't do that if you're trying to understand what your models can do).
> it sure has not learned any "algorithm". At best it has learned to approximate a sorting algorithm. That's what statistical machine learning models do, they are function approximators; not program learners.
> Also, Machine Learning 101: you test your models on a test set that is disjoint to the training set. To clarify, we do this not because it's in the book and that's the rules, but because, by testing the model on held-out data, we can predict the error the model will have on unseen data
The probability of a test list existing in the training set is less than 10^-70.
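A back-of-envelope version of that figure, under assumed parameters (uniform lists of roughly 42 values drawn from 100 possibilities, about a million training examples); the exact exponent depends on the real list lengths, so treat this as an order-of-magnitude check only:

```python
from math import log10

alphabet = 100          # distinct values an element can take (assumed)
list_len = 42           # elements per list (assumed, ~127 characters)
train_examples = 10**6  # number of training lists (assumed)

log_p_single = -list_len * log10(alphabet)               # P(one specific list) = 100^-42
log_p_collision = log_p_single + log10(train_examples)   # union bound over the training set
print(f"log10 P(a given test list appears in training) <= {log_p_collision:.0f}")  # about -78
```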
That's one preprint on arXiv that makes a wild claim about a new concept that they acronymise as "RASP". It's not any kind of established terminology, nor is it anything but a claim.
What is certainly established is that a function, and an algorithm, are different objects. To clarify, a function is a mapping between the elements of two sets, whereas an algorithm is a sequence of operations that calculates the result of a function and is guaranteed to terminate. Algorithms are also typically understood to be provably correct and to have some provable asymptotic complexity (as opposed, for example, to heuristics) but that's not a requirement.
So for example, if you have a function ƒ between sets X and Y, and an algorithm P that calculates the result of ƒ, then you can give any element of X to P and it will return (in fact, construct) an element of Y. Crucially, ƒ is not P, and P is not ƒ.
Now, when you train a machine learning model, you are typically training a function ƒ̂ (with a little hat) to approximate ƒ. That means that your trained ƒ̂ is a function that maps some of the elements of X to the same elements of Y as ƒ, but not all. It's an approximation. So you get some amount of error, as in your experiment.
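A toy illustration of that vocabulary, with invented stand-ins: P is an algorithm that computes the sorting function ƒ exactly, and f_hat is a crude approximator that is only right on inputs it has "seen". This is not the parent's model; it just makes the function / algorithm / approximation distinction concrete.

```python
def P(xs):
    """An algorithm computing the sorting function f: insertion sort."""
    out = []
    for x in xs:
        i = 0
        while i < len(out) and out[i] <= x:
            i += 1
        out.insert(i, x)
    return out

# A pretend "training set" and a crude approximator f_hat of f: correct on
# inputs it has memorised, a guess everywhere else.
MEMORISED = {(3, 1, 2): [1, 2, 3]}

def f_hat(xs):
    return MEMORISED.get(tuple(xs), list(xs))

print(P([3, 1, 2]), f_hat([3, 1, 2]))  # both correct on a "seen" input
print(P([9, 5, 7]), f_hat([9, 5, 7]))  # only the algorithm is correct here
```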
So what you've done in your experiment is that you trained a model to approximate a mapping between the set of lists, to itself (where the input list is any of the lists in your training set and the output is the same list, sorted). Your model is not an algorithm, and you cannot train an algorithm with a language model.
I appreciate that learning an algorithm is what you wanted to achieve, but in science we don't choose the answer that pleases us, we choose the answer that makes the most sense -- and a good heuristic for that is that the answer that makes more sense is the simplest one. Here, in order to convince yourself that you have trained a language model to learn an algorithm, rather than an approximator, you have chosen to rely on a preprint with a completely novel and untested concept that someone put on the internet, rather than the well-understood abstractions of elementary computer science, so not at all the simplest explanation. That is not a good idea. You will not understand what is going on if you rely on that kind of explanation. I assume you are trying to understand?
Edit: incidentally, you don't need a transformer to train an approximator to a sorting function. You can do that with a multi-layer perceptron, or a logistic regression, certainly with an LSTM. Ceteris paribus, you'll get the same results.
>> The probability of a test list existing in the training set is less than 10^-70.
But the same probability if you held the test set out would be 0, so why not do that? It's not hard to do.
Is there a good reason not to do that?
Btw, lists are composite objects. How much overlap is there between your training and test lists? Do you know?
Edit: meh. HN messes up my nice f-with-hook-and-combining-circumflex-accent. DAAAAANG!!!!
> that's one preprint on arxiv, that makes a wild claim about a new concept that they acronymise as "RASP". It's not any kind of established terminology, nor is it anything but a claim.
I think you would enjoy learning about RASP, rather than taking such a hardline skeptical position.
> a function is a mapping between the elements of two sets, whereas an algorithm is a sequence of operations that calculates the result of a function and is guaranteed to terminate
I'm aware. Transformers (and RASP programs) are guaranteed to terminate; that's one of their nice properties.
> Is there a good reason not to do that?
Balanced against the value of my unpaid time, a probability of 10^-70 is low enough for the purposes of a quick and fun test.
Speaking of which, I'm going to enjoy my weekend now. I hope you enjoy yours!
EDIT: I realise I was mistaken about the OP. He is not an undergraduate student, as I initially thought. His substack profile says he is a professional engineer and consultant. So his complete cluelessness about computer science fundamentals is not the result of inexperience, and his article is nothing more than an attempt to jump on the current bandwagon of LLM hype rather than an attempt to make sense of things. I thought I was helping a CS grad learn something! What an idiot I am! Fuck. Ugh, goddammit.
[Earlier text of my comment left in the interest of something or other]
It is, but it's published in the proceedings of the ICML, which means it's been peer reviewed.
The OP has checked out (I guess all this computer scienc-y stuff is boring on a weekend), but even a peer-reviewed article is not enough to cause us to let go of good, old-fashioned computer science. The article basically invents its own language and then proceeds to map transformers to it, to claim that transformers can learn various kinds of programs. It's not convincing.
In any case, learning to sort lists by neural nets is not something new, or unique to transformers, and there's pretty clear understanding of how it works. I explain why it doesn't constitute learning an algorithm in my comment above. The RASP paper doesn't change that. I mean, Recurrent Neural Nets have a known equivalence to FSMs but even they cannot learn algorithms but only approximate them. The OP wrote his article in an obvious effort to understand why GPT is "not an n-gram" even if it behaves like an n-gram model (well, it's a language model, it doesn't matter what it's trained on) so I'm guessing he can appreciate the need for clarity in explaining empirical results and he probably will want to think further on what, exactly, his experiment has shown. I hope my little comment above will help him do that.
So this is a really good starting point -- but you haven't formulated any hypotheses that can be tested. You've just looked at the graph and "reckoned something".
Formally, what hypotheses are you comparing? What do you think the specific hypothesis of the "AI = stats" person is? It isn't that the NN literally remembers data tokens, right?
In any case:
The issue with forcing NNs to model mathematical features is that the structure of the data itself has those properties. So the distributional hypothesis is true for sorting ordinals.
But it's really obviously false for natural language. The properties of the world are not the properties of word order... being red isn't "red follows words like...".
> you haven't formulated any hypotheses that can be tested. You've just looked at the graph and "reckoned something"
Let's not be so hasty. I think I do put it as clearly as possible. I'm comparing essentially your Hyp1 and Hyp2, where Hyp1 (aka the stochastic parrot) is expressed a little bit more clearly as the LLM is learning an n-gram that produces correct sorts through rote memorization of statistical correlations in the training data, like that sorted lists tend to start with '0', end with '99', and increase monotonically; and Hyp2 is that the LLM's training molds it into representing an actual sorting algorithm that would correctly generalize to any input list.
> But it's really obviously false for natural language. The properties of the world are not the properties of word order... being red isn't "red follows words like..."
This is not really obviously false. Yes, being red isn't "red follows words like...". But a word order should still map to properties of the world, especially if those words are to be meaningful to a listener. Being red is "a surface reflects or transmits most of the light in the 600-800 nm spectrum and absorbs most of the rest". Of course, it won't do to just echo those tokens; once you've nailed down the concept of "red", you need to make sure that concepts like "reflects", "light", and "spectrum" are represented as well. It's an open question as to whether this sort of knowledge graph can be properly bootstrapped from a large volume of text descriptions, but I am strongly inclined to believe it can. If you dismiss it outright you're just begging the question.
There are an infinite number of sentences which describe what "being red" is; most of them have never been written.
Redness is not in the structure of those sentences. And there will always be an infinity of sentences which are True but cannot be inferred by an LLM -- but can be so, trivially, by a person acquainted with redness.
In any case,
I'd need more time than I have at the moment to seriously state Hyp1 for your case -- but atm, I can say that because the data itself has the property, Hyp1 becomes much harder to state and the argument much subtler.
Since what is a "statistical distribution" of "ordinals" anyway? And how much memory is required to represent it? My sense is this distribution has highly redundant features which will be trivially compressible without learning any "sorting algorithm".
At a quick glance of your article it feels like you haven't formulated Hyp1 correctly -- P(CorrectSort | f(HistoricalCases)) is perhaps arbitrarily high if some statistical f() can be chosen well.
> There are an infinite number of sentences which describe what "being red" is, most of them have never been written.
Which is exactly how the set of sentences actually written encodes in it the idea of "Redness". It's the "actually written" part that carries information about the real world.
> And there will always be an infinity of sentences which are True but cannot be inferred by an LLM -- but can be so, trivially, by a person acquainted with redness.
That's cheating, because "a person acquainted with redness" presumably learned it by sight, which LLMs can't do just yet (at least the widely accessible ones can't). Would you also say that a person born blind also cannot infer those True sentences about redness? Because if they can, that means the concept of redness is capable of being taught through language, and so there's no reason LLMs couldn't pick up on it too.
> Redness is not in the structure of those sentences.
Sure; it's in the spectrum of reflected light. (Or perhaps, the retina's trichromatic responsivity). But that physical concept can be meaningfully described by sentences. It doesn't require an infinite number of them to create a coherent world-model, which can do things like predicting that a blue object will become red if it moves away from you at a high enough speed. Which is something a human might be surprised by even after many years of visual experience with red objects -- unless they've read sentences about the Doppler effect in a physics textbook.
If you can manage to trick GPT-4 into revealing that it doesn't have a world-model of the concept of 'red', please show us!
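For what it's worth, the Doppler claim above checks out numerically. A quick calculation with the relativistic Doppler formula, using rough conventional wavelengths (blue ~450 nm, red ~650 nm) chosen purely for illustration:

```python
lam_blue, lam_red = 450e-9, 650e-9
ratio = lam_red / lam_blue              # observed / emitted wavelength
# From ratio = sqrt((1 + beta) / (1 - beta)), solve for beta = v/c.
beta = (ratio**2 - 1) / (ratio**2 + 1)
print(f"recession speed needed: about {beta:.2f} c")  # roughly 0.35 c
```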
> At a quick glance of your article it feels like you haven't formulated Hyp1 correctly -- P(CorrectSort | f(HistoricalCases)) is perhaps arbitrarily high if some statistical f() can be chosen well.
Keep in mind, the LLM's structure was not hand-crafted to do well on this mathematical task. It was built to be good at language modelling, and initialized with essentially a uniform prior over all token sequences. Even if a dataset is efficiently compressible, that's no guarantee that the LLM will be able to compress it efficiently. In fact, many people would probably be surprised to learn that it can do this problem at all, let alone so well with so little training. But do think about the statistics of sorting a bit more. I think it's not as easily compressible as you think it is, except by an actual sorting algorithm. Again, you can compress it a bit with monotonicity and so on, but nowhere near the amount you'd need to sort a long list without errors, using so few parameters. I compute the number of sorted and unsorted lists in the footnotes.
One of the things that makes sorting tricky for an LLM is you always need to look at every item in the input list. Even if the previous output token was '99', you can't be sure you're now at the end of the list; you still need to count how many '99's were output already and how many are needed.
(The dataset itself, of course, does not contain the notion of sorting, a description of sorting, a test for sortedness, or any algorithm for sorting. It only contains a large but finite number of examples of sorted and unsorted lists. It's up to the LLM, and its training process, to discover the mechanism that generated these results.)
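An explicit version of the bookkeeping described two paragraphs up: to emit the next token of the sorted output you effectively need the multiset of the whole input, not just the previous output token. This is only a spelled-out statement of that rule, not a claim about how the trained transformer represents it internally.

```python
from collections import Counter

def next_sorted_token(input_list, emitted_so_far):
    """Return the next element of sorted(input_list) given what has been emitted."""
    remaining = Counter(input_list) - Counter(emitted_so_far)
    if not remaining:
        return "<end>"       # only now can you be sure the output is finished
    return min(remaining)    # smallest element still owed to the output

print(next_sorted_token([99, 3, 99, 7], [3, 7, 99]))      # 99 again, not <end>
print(next_sorted_token([99, 3, 99, 7], [3, 7, 99, 99]))  # <end>
```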
> that's no guarantee that the LLM will be able to compress it efficiently
Your LLM here is 600MB which is a grossly inefficient compression of the sort space.
If LLMs "learned algorithms", the best compression would be on the order of bytes.
The python to generate this list is c. 1kb -- and you're using an obscene 600MB to do it!
What do you think all those MBs are doing? They're the extraordinary cost of the "statistical shortcut" of modelling the empirical distribution of sorted numbers.
NNs exploit distributional structure in the training data to compress it --- in this case there's huge amounts of distributional structure in numbers.
I think you've misunderstood the "statistical parrot" claim to be somehow that NNs are engaged in rote memorization... or, what?
The claim is simply that all they do is statistically approximate the empirical distribution of the training dataset structure --- and if you force interpolation, then they provide arbitrarily precise compressions of that structure.
I'm not sure what a NN which can sort numbers shows, other than the distributional structure of a sort-numbers dataset is such that a NN can compress it into 600MB...
To be clear, the "statistical parrot" claim is that the statistical distribution of the empirical dataset D = (X, y) is being approximated by the weights, W = Compress(D) -- and that this distribution fails to be a representational model of y -- because no entailments of X (other than those in D) are captured.
Whereas representational models are not confined to the distribution of historical cases, ie., I can imagine variations on X leading to any given y; and variations on y leading to any given X -- without ever having experienced either.
You're showing the system vast amounts of numbers being sorted, so it learns the distribution of that data, so it can replay those sorts.
I'm not exactly sure why you think this is a reply to the relevant claims.
> The python to generate this list is c. 1kb -- and you're using an obscene 600MB to do it!
This isn't a fair comparison. The python code to sort a list is leveraging an enormous amount of information that is stored outside the python code, whereas the GPT version basically has to do it "from scratch", and in a very convoluted computing model.
A better comparison would be "how many bits does it take to encode a configuration of NAND gates that describes a computer that can sort 127-byte lists of number 1..100?"
I'm sure it's not as much as 600 megabytes, but it'll be a lot more than the python code.
The model the OP trained also runs on a computer. If I'm not mistaken, it also runs on Python itself.
Which means you don't need to count all the bytes in the infrastructure all the way down to the electric grid, maybe. You can compare a sorting algorithm to a sorting model, as stand-alone programs, on their relative size, and that will give you a good idea of how much work each is doing.
> If LLMs "learned algorithms", the best compression would be on the order of bytes.
Yes. Except:
(1) the model size is fixed during training, so it would be impossible to obtain a bytes-sized result regardless of what it learns to represent. One might even open the thing up and find bubblesort* inside followed by 599 MB of junk DNA; that size is dictated by how it was initialized.
(2) I'm not claiming this model is a minimal size; I started with the biggest model I could train on my wimpy GPU and succeeded on my first and only try, which I think is a fairer representation of how GPT-4 was built than if I'd started by proving the minimum size of transformer that could represent the task** and then (surprise!) obtained it.
(3) Compared with the size of a map of all 10^80 unique input lists to all 10^36 correctly-corresponding sorted outputs, 600 MB is a remarkable compression ratio, even if it's not reducing it all the way down to exec("sort(input)").
(4) Nowhere do I make any claim that transformers are minimal or even space-efficient representation of an algorithm (or a world-model); in fact, they seem quite terrible in this respect, especially compared to arbitrary code. And doubtless there are a bunch of weights that got trained to near-zero and could be trimmed to make the matrices more sparse, or quantized, which is the kind of thing people do to compress an LLM itself but I didn't bother. What transformers do seem to do very well at, despite the overhead, is the differentiability that allows them to be trained in the first place, and also the flexibility to handle different kinds of problems. I could have trained the same blank-slate starting model to one that shuffles or reverses each list, or perhaps to do one or the other depending on whether the first number is odd or even, or any number of other tasks.
> You're showing the system vast amounts of numbers being sorted, so it learns the distribution of that data, so it can replay those sorts.
It's almost definitely the case that every list it's tested on, and sorts 100% correctly, is a list it has never seen in training (unless it's a very short list, but I control for that). My training dataset is only about 100 MB; given the number of random lists, it's vanishingly unlikely that it's seen almost any of them, let alone the 100% of them that it is able to sort correctly. (The tests, of course, were not drawing from the validation set either; I test the model by generating new lists on the fly, because that's easy to do).
> statistically approximate the empirical distribution of the training dataset structure
Can you provide more details about what you mean by this distributional structure that can be compressed without a generally-correct sorting algorithm? How would you define a similarity measure between distinct random lists that allows for this kind of interpolation?
* Well, probably RASP-sort, not bubblesort. Also, it would need to include definitions of things like the comparison operator between all tokens, because it doesn't have a numeric datatype built in, or even the idea of numbers as an ordered set; it has to learn all that.
** (the Weiss paper does this, and lo and behold, transformers can indeed sort).
> Can you provide more details about what you mean by this distributional structure
The distribution of sorted digits is:
(0 1 2 3 4 5 6 7 8 9) before
(1 before 0 1 2 3 4 5 6 7 8 9) before
(2 before 0 1 2 3 4 5 6 7 8 9) before
(3 before 0 1 2 3 4 5 6 7 8 9) ...
...
When you compute the search space you're treating each number as a unique token (ie., that all ordinals are unique) -- but it's not sorting unique ordinals, it's sorting digits in a sequential model, ie., it learns P(Next|Prev)
The (sequential) distribution of digits amongst sorted numbers is tiny
You're treating each list as unique, but all the lists have a distribution of digits in common... I'm at a loss to even understand what you're saying here, really -- this is why you need to actually state, formally, what you think the "LLMs are just stats" hypothesis amounts to.
It seems you think it amounts to saying LLMs sample from a combinatorial space, naively construed -- but that isn't the claim?
The claim is rather, they sample from a statistical distribution of tokens.
Take each position in the input vector, 1...127. It needs to "learn":
P(x_0 | y, x_1...x_127), P(x_1 | y, x_2...x_127), P(x_2 | y, x_3...x_127), etc.
Which is a family of 127 conditional distributions that seem trivial to learn.
I really don't know why you think the size of a combinatorial space is relevant here?
All the sorted lists share basically the same tiny family of conditional distributions { P(x_i | x_(i-1)...x_127) }
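A literal version of the kind of conditional statistics being described: estimate P(next | previous) by counting over the sorted halves of many random lists. The sketch makes no claim about the parent's model; it just shows that these pairwise distributions are easy to estimate and heavily concentrated, while whether they suffice to sort an arbitrary input is the question being argued.

```python
import random
from collections import Counter, defaultdict

counts = defaultdict(Counter)
for _ in range(100_000):
    xs = sorted(random.randint(0, 99) for _ in range(40))
    for prev, nxt in zip(xs, xs[1:]):
        counts[prev][nxt] += 1

# Empirical P(next | prev = 50): all mass on values >= 50, mostly very close to 50.
total = sum(counts[50].values())
p_next_given_50 = {k: v / total for k, v in counts[50].items()}
print(sorted(p_next_given_50.items())[:5])
```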
I agree a neural network can certainly learn the conditional distributions that let it make that choice correctly every time. Once it has done so, then do you not have a sorting algorithm?
So this is what I thought you would say, and it's the origin of the issues here: to say that LLMs are "statistical parrots" is just to say they learn conditional distributions of text tokens.
So you aren't replying to the "only stats" claim: that is the claim!
The issue is that language-use isn't a matter of distributions of text tokens: when I say, "the sky is clear today!", it is caused by there being a blue sky; when I say, "therefore I'd like to go out!", it is caused by my preferences, etc.
So if we had a generative causal model of language it would be something like this: Agent + Environment + Representations ---SymbolicTranslation---> Language.
All LLMs do is model the data being generated by this process, they dont model the process (ie., agents, environments, representations, etc.)
They say, "it is a nice day" only because those tokens match some statistical distribution over historical texts. Not because it has judged the day nice.
To model language is not to provide an indistinguishable language-like distribution of text tokens, but rather, for an agent to use language to express ideas caused by their internal states + the world.
In the case of sorting numbers, the tokens themselves have the property (ie., mathematical properties such as ranking are had by ranked tokens). So learning the distribution is learning the property of interest.
This is why no papers which demonstrate NNs "have representations" etc. by appealing to formal properties the data itself has are even relevant to the discussion. Yet, all this "world model, algorithm, blah blah" said of NNs is only ever shown using data whose "unsupervised model" constitutes the property of interest.
Statistical models of the distributions of tokens are not models of the data generating process which produces those tokens (unless that process is just the distribution of those tokens). This is obvious from the outset.
I'm curious, why are you using "n-gram" as if you're referring to a model? You say e.g. that "LLM is learning an n-gram". N-grams are features, not models. You can train an n-gram model, or you can train a language model using n-grams as features, and so on, but you can't "learn an n-gram".
Where did you find this terminology?
EDIT:
>> and Hyp2 is that the LLM's training molds it into representing an actual sorting algorithm that would correctly generalize to any input list.
Btw, you have not shown anything like that. You trained and tested on lists of two-digit positive integers expressible in 128 characters. That's not "any input list". For instance, what do you think would happen if you gave your model an alphanumeric list to sort? Did you try that?
Your model also doesn't correctly generalise, not even to its own training set that you tested it on. There's plenty of error in the figure where you show its accuracy (not clear if that's training or test accuracy).
It's not clear to me how you account for those obvious limitations of your model (it's a toy model after all) when you claim that it "learned to implement a sorting algorithm" etc. It would be great if you could clarify that.
> what do you think would happen if you gave your model an alphanumeric list to sort? Did you try that?
The tokenizer would throw an exception, because it doesn't have any tokens to represent alphabetical characters. But you tell me - if I had tokenized alphabetical characters and defined an ordering, would you expect the results to be any different?
> You say e.g. that "LLM is learning an n-gram"[...] you can't "learn an n-gram".
Where do I say that? I don't think I make any reference to "learning an n-gram", which is a relief because I don't know what it would mean to "learn an n-gram".
> There's plenty of error in the figure where you show its accuracy (not clear if that's training or test accuracy).
Test accuracy between training iterations (not part of the training process itself, which uses its own separate validation set which is split from the training set). And yes, I agree, it is not error-free, and I wouldn't expect it to be, especially after so little training. What the figure shows is the percentage of sorts that were error-free, and how rapidly that decreases. I've since repeated the test with finer resolution, and the fraction of imperfect sorts continues to decrease about as you expect, which is enough to satisfy my curiosity, although I'm a little curious to see if there is some point where it falls completely to zero.
(...) is expressed a little bit more clearly as _the LLM is learning an n-gram_ that produces correct sorts (...)
(My underlining)
You also use it in a similarly unusual way throughout your linked substack post, for example, you write:
> the way GPT works is, in a certain sense, functionally equivalent to an n-gram, but that doesn’t mean GPT is an n-gram.
Where does this use of "n-gram" come from? I mean, did you see it somewhere? I'm curious, where?
>> The tokenizer would throw an exception, because it doesn't have any tokens to represent alphabetical characters. But you tell me - if I had tokenized alphabetical characters and defined an ordering, would you expect the results to be any different?
I'm sorry, I don't understand. "Defined an ordering", where?
You can change your tokenizer but that will not change the trained model, obviously. So if you take your model that's trained on two-digit lists of integers and you run it on lists of any other type of elements it will not be able to sort them correctly. But isn't that what you claim? That:
"the LLM's training molds it into representing an actual sorting algorithm that would correctly generalize to any input list"
Oh, I see, good catch. I think that comment was a result of a botched edit; I do that sometimes. Too late to change it now. Sorry for the confusion!
> Where does this use of "n-gram" come from? I mean, did you see it somewhere?
It's shorthand for n-gram Markov model. The same way it is presented in, for example, A Mathematical Theory of Communication.
> "Defined an ordering", where?
In order for a set to be sortable, you need to define an ordering over the elements. So for example, defining that the letter 'A' is greater than the number '99'. It's easy to take for granted that 1 < 2, but the neural network doesn't know that a priori, because the tokens are just index values. It doesn't have any way to know that token number 5 represents the character '5'.
> if you take your model that's trained on two-digit lists of integers and you run it on lists of any other type of elements it will not be able to sort them correctly.
To reiterate, the token dictionary basically just contains the characters "0123456789,():[]_\n". If you try to ask it to sort '(Tuesday, Monday)', it's just going to throw an exception because 'T' isn't a recognized token; it doesn't have a corresponding index. It's not even a question of whether it can sort them correctly or incorrectly.
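To illustrate the tokenizer point: token ids are arbitrary indices, so any ordering over tokens has to be learned (or defined), and characters outside the vocabulary simply have no id. The vocabulary string is the one given in the comment above; the particular id assignment here is just one possibility.

```python
VOCAB = "0123456789,():[]_\n"
stoi = {ch: i for i, ch in enumerate(VOCAB)}

print(stoi["5"], stoi["9"])   # the ids happen to mirror digit order here,
print(stoi[","], stoi["("])   # but nothing about an id says how tokens should sort
print(stoi.get("T"))          # None: 'T' has no id, hence the exception mentioned above
```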
> "Any input list"? How so?
I think the meaning is pretty clear. No algorithm can sort a list of elements that aren't members of a totally ordered set, so I wasn't attempting to imply that any input list meant that a neural network could somehow supersede this limitation.
If it's "absolutely trivial" to show that LLMs don't have the capacity to form thought, then please publish a paper proving that. So all the "stupid" people studying LLMs that can't come up with such trivial proofs can move on to other stuff.
You may wish to read the paper above. But if you want a quick proof:
1. A thought is a representation of a situation
2. A representation generates entailments of that situation
3. Language is many-to-one translation from these representations to symbols
4. Understanding language is reversing these symbols into thoughts (ie., reprs)
So,
5. If agent A understands sentence X then A forms the relevant representation of X.
6. If an agent has a representation, it can state entailments of S (eg., counter-factuals).
Now, split X into Xc = "canonical descriptions of S" and trivial permutations Xp.
(s.t. the joint distribution of Xc,Xp is low in the training data, but the tokens of Xp are common)
Form entailments of X, say Y -- sentences that are canonically implied by the truth of X.
7. If the LLM understood that X entails Y, it would be via constructing the repr S -- which entails Y regardless of which sentence in X was used.
8. Train an LLM on Xc and its accuracy on judging Y entailed by Xp is random.
9. Since using Xp sentences causes it to fail, it does not predict Y via S.
QED.
And we can say,
1. Appearing to judge Y entailed-by X is possible via simple sampling of (X, Y) in historical cases.
2. LLMs are just such a sampling.
so,
3. +Inference to the best explanation:
4. LLMs sample historical cases rather than form representations.
Incidentally, "sampling of historical cases" is already something we knew -- so this entire argument is basically unnecessary. And only necessary because PhDs have been turned into start-up hype men.
> Train an LLM on Xc and its accuracy on judging Y entailed by Xp is random.
Why? This is obviously wrong in the general case. For that to be true, Xp and Xc would have to have no statistical relationship whatsoever, which is statistically virtually impossible.
Xp just has to be chosen such that the distribution Xc,Xp is sufficiently small in the training data -- but not that the tokens of Xp are themselves rare. So that an agent competent with tokens in X, who can construct a repr of S, could do so with Xp.
Xc =
> Here is a bag filled with popcorn. There is no chocolate in the bag. Yet, the label on the bag says “chocolate” and not “popcorn.” Sam finds the bag. She had never seen the bag before. She cannot see what is inside the bag. She reads the label.
Produces, Y = She believes that the bag is full of popcorn
Xp =
> Here is a bag filled with popcorn. There is no chocolate in the bag. The bag is made of transparent plastic, so you can see what is inside. Yet, the label on the bag says ’chocolate’ and not ’popcorn.’ Sam finds the bag. She had never seen the bag before. Sam reads the label.
Produces, Y = She believes that the bag is full of chocolate
> just has to be chosen such that the distribution Xc,Xp is sufficiently small in the training data -- but not that the tokens of Xp are themselves rare
Great idea. Now prove you can actually choose such a distribution, lol.
I think this is easy, just make Xp sentences of the kind = "I define `randomchars()` to be this `term-in-Xc()`" and swamp the dataset with Xc.
Everything here actually just follows formally from what NNs are: they're just empirical function approximations.
It will always be the case that they just model the probabilistic structure of the dataset and not the data generating process.
Since, in language, there are discrete constraints which make P(...) = 1 or P(...) = 0 --- you can trivially produce datasets showing that it learns P(...) = mistake-you-created-deliberately and not either 0,1.
As above, the LLM switches from 95% confidence "chocolate" to 95% confidence "popcorn" with a trivial non-semantic permutation of the prompt.
The obscene issue in all this is that we know this already -- empirical function approximation of historical datasets just produces associative probabilistic models of those datasets.
Now you have a strong statistical dependency between Xc and Xp, the lack of which was required for your proof to show that the algorithm is unable to learn Xp. BTW it was already there, because you already had `term-in-Xc()`.
> 8. Train an LLM on Xc and its accuracy on judging Y entailed by Xp is random.
This is clearly where the "proof" falls apart. Even in tasks where GPT-4 struggles, its accuracy will still be better than random. The bar of "better than random" is so low that even weak LLMs will be able to surpass it.
More so, you need to prove it not just for a single case, but that no task/domain exists for which LLMs fail to satisfy 8.
What your proof says is basically "LLMs do not generalize even the slightest for any task". And that's trivial to disprove.
I just need to be able to create a split in Xc,Xp so that performance on Xp is random. I think that's really quite easy.
If you could put ChatGPT in a loop, take some Xc prompts and permute with some non-semantic phrases ("Alice believes that... Xc ... what did Alice believe?") etc --- until you find those cases.
I imagine we will discover quite a large number of such non-semantic phrases which have this effect. Because the tokens in those phrases will, jointly with Xc, be arbitrarily distributed in some historical data (distributed to our preference when finding them).
This seems just kinda basically obvious, right? Entailments are discretely constrained by semantics, and historical datasets can contain arbitrary mixtures of random distributions of syntax.
NNs only model those distributions -- and not the entailments -- which, at the very least, are extremely discrete.
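A sketch of the probing loop being proposed, for concreteness. `query_llm` is a placeholder for whatever chat API is available (it is not a real library call), and the wrapper phrases are just examples of the kind of non-semantic rephrasing meant:

```python
import random

WRAPPERS = [
    "{x}",
    "Alice reads the following and thinks about it: {x} What did Alice conclude?",
    "Consider this scenario carefully. {x} So, what follows?",
]

def probe(query_llm, xc_prompt, expected, n=20):
    """Fraction of non-semantic rephrasings on which the expected answer survives."""
    hits = 0
    for _ in range(n):
        wrapped = random.choice(WRAPPERS).format(x=xc_prompt)
        if expected.lower() in query_llm(wrapped).lower():
            hits += 1
    return hits / n

# usage (hypothetical): probe(my_chat_api, xc_prompt="...popcorn scenario...", expected="popcorn")
```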
I don't think you really disproved anything. You're just saying another hypothesis. Often, LLMs produce impressive results on domains that aren't in the training set.
> it's sampling from similar text which is distributed so-as-to-express a thought by some agent;
Your hypotheses 1 and 2 are not so different when you consider that the similarity function used to match text in the training data must be highly nontrivial. If it were not, then things like GPT-3 would have been possible a long time ago. As a concrete example, LLMs can do decent reasoning entirely in rot13; the relevant rot13'ed text is likely very rare in their training data. The fact that the similarity function can "see through" rot13 means that it can in principle include nontrivial computations.
> There are two hypotheses for how LLMs generate apparently "thought-expressing" outputs: Hyp1 -- it's sampling from similar text which is distributed so-as-to-express a thought by some agent; Hyp2 -- it has the capacity to form that thought.
There's also another hypothesis: Hyp3 -- that Hyp1 and Hyp2 converge as the LLM is scaled up (more training data, more dimensions in the latent space), and in the limit become equivalent.
They're indistinguishable via naive measurement (prompting) if the LLM can sample from all possible data: there's a very large infinity of (Q, A, time) triples (ie., it's real-valued).
But it cannot, since most of those are in the future.
Failing on "trivial alterations to the same underlying domain" is not a disproof of thought.
Your argument also implies Hyp1 and Hyp2 are exclusive; clearly both can be true, and in fact must be true, unless you are claiming that you do not "sample" from similar language to express your own thoughts? Where does your language come from then, if not learning from previous experience?
While I agree with you on the relation of GP's Hyp1 and Hyp2, you are making an unfounded assumption of a sampling process being necessary to perform human speech. I do not believe we have the understanding of how thought is represented in the human brain to make that judgement. In other words, just because sampling from a distribution can produce human-like text does not mean that it is the only way to do that, and thus that it must be the way that humans produce text, spoken or written.
We might be talking about 2 different things. I was referring to the backwards learning pass and you seem to be referring to the forward inference pass, but what is an alternative to learning (or producing) text which does not involve sampling from some larger space? (Also I’m not a statistician so I’m not sure if these are technically “distributions”)
Don't try to ham-fist scientific sounding wording into your (very unscientific) argument. This is not a disproof of anything because you failed to define what it means to have the ability to form rational thoughts.
With a definition, you would then wanna prove this for humans as a sanity check: Do we never make stupid mistakes? Ok, we make fewer of those than LLMs. Then what is the threshold for accuracy after which you consider a system to be intelligent? Do all humans pass that threshold, or do kids or people with a lower than average IQ fail?
This entire paper is written as a disproof of the distributional hypothesis. If you want to understand why it's a profoundly unhelpful pseudoscientific idea, this paper is a good start.
The test for a capacity C in a system1 has nothing to do with proxy measures of that capacity in system2.
The capacity for an oven to cook food may be measured by how much smoke it lets off when burning -- but no amount of "smoke" establishes that a dry ice machine can cook.
This type of "engineering thinking" is pseudoscience.
>The capacity for an oven to cook food may be measured by how much smoke it lets off when burning -- but no amount of "smoke" establishes that a dry ice machine can cook.
You seem to be talking past me, as nowhere did I claim that LLMs are intelligent. That's the point – Unlike you I do not claim to be able to prove or disprove this. I argue that your comment is the one that is pseudoscientific because you didn't provide (even a semblance of) a rigorous definition of intelligence.
There is intelligent thought and action, and there is unintelligent thought and action. Intelligent is that "which checked" (intus-legere); the other, the """impulsive""", is not.
The level of understanding of the problem that this paper expresses is extraordinary in my reading of this field --- it's a genuinely amazing synthesis.
> How could the common-sense background knowledge needed for dynamic world model synthesis be represented, even in principle? Modern game engines may provide important clues.
This has often been my starting point in modelling the difference between a model-of-pixels vs. a world model. Any given video game session can be "replayed" by a model of its pixels: but you cannot play the game with such a model. It does not represent the causal laws of the game.
Even if you had all possible games you could not resolve between player-caused and world-caused frames.
> A key question is how to model this capability. How do minds craft bespoke world models on the fly, drawing in just enough of our knowledge about the world to answer the questions of interest?
This requires a body: the relevant information missing is causal, and the body resolves P(A|B) and P(A|B->A) by making bodily actions interpreted as necessarily causal.
In the case of video games, since we hold the controller, we resolve P(EnemyDead|EnemyHit) vs. P(EnemyDead| (ButtonPress ->) EnemyHit -> EnemyDead)
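A toy version of the distinction being drawn: in passively logged gameplay, P(EnemyDead | EnemyHit) can be inflated by a hidden confounder, whereas holding the controller (intervening on the button) estimates the causal quantity directly. The "game" below is entirely made up and only illustrates why action matters for resolving the two.

```python
import random

def world(press_button=None):
    boss_weak = random.random() < 0.5                   # hidden confounder
    press = boss_weak if press_button is None else press_button
    hit = press and random.random() < 0.9
    dead = (hit and random.random() < 0.6) or (boss_weak and random.random() < 0.3)
    return hit, dead

obs = [world() for _ in range(100_000)]                   # passive observation
p_obs = sum(d for h, d in obs if h) / max(1, sum(h for h, _ in obs))

act = [world(press_button=True) for _ in range(100_000)]  # we hold the controller
p_do = sum(d for h, d in act if h) / max(1, sum(h for h, _ in act))

print(f"P(dead | hit) observed: {p_obs:.2f}   vs. under intervention: {p_do:.2f}")
```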
I doubt that word models can lead to world models. To quote Yann LeCun:
"The vast majority of our knowledge, skills, and thoughts are not verbalizable. That's one reason machines will never acquire common sense solely by reading text."
That just seems like an unfounded hot take. Of course we can explain most of our knowledge, skills, and thoughts in words, that's how we don't lose everything when the next generation comes around lol. It's the core reason we're different from animals.
Now sure you can't describe qualia, but that's basically a subjective artefact of how we sense the world and (to add another unfounded hot take) likely not critical to have an understanding of it on a physical level.
I disagree that this is an "unfounded hot take". It's far from a rare opinion on cognitive science, and if I had to guess it's probably the mainstream opinion (I can't really back that up with citations because I haven't followed the field closely in the last decade).
And for what it's worth, I agree with Yann, although I have to admit that LLMs work far better than I would've guessed.
It's a topic that's too large for an HN comment, but "explaining" things in words comes after the fact, and is mostly limited to the small subset of our experience and skillset that is amenable to it.
Note that humans are animals too, btw. And conversely, I would consider nonverbal people as humans as well.
Well, I admit I used to be of a similar opinion, but seeing this explosion unfold over the past few months has me convinced that it can't possibly be right, at least not to any degree that objectively matters.
Perhaps language is the wrong term to use, since it's not what LLMs are really about. They're about text. There are very few things that cannot be expressed as text, albeit in unconventional ways like base64. Being opaque to humans doesn't mean that with enough data a neural net can't be taught to "see" images that way or "hear" sound files for example. If the original assertion is true, then there must be some kind of universal barrier to skills that cannot be expressed in text. That sounds completely crazy to me, since we humans are also likely just organic data that could be expressed as text with some encoding. The main problem is interfacing with it in some way that's actually useful, which is the extremely hard part.
Another thing to consider is that with a formalized enough language (i.e. a programming language) one can be far more exact in explaining things accurately than any natural language with its cultural specifics and inferred nonsense. That's probably why LLMs designed as coding models first and foremost usually outperform those that aren't in solving unrelated arbitrary problems.
> Note that humans are animals too, btw. And conversely, I would consider nonverbal people as humans as well.
Humans are animals in the biological sense, yes. But very much not in the societal and skill-transferring sense.
Well, that idea is one of the motivations behind the paper, which is itself a throwback to earlier ideas about the "language of thought", a hypothetical language (certainly not ordinary natural language, and probably more like a programming language). But adding a few twists such as the probabilistic part and of course the whole machinery of LLMs, and more emphasis on sensory grounding. I think it's a very interesting approach from a researcher I respect, but obviously don't know if it'll pan out.
> Of course we can explain most of our knowledge, skills, and thoughts in words, that's how we don't lose everything when the next generation comes around lol.
I would wager if you put a newborn human to be raised in the absence of any physical human contact, but somehow taught them to read/write, and gave them access to a universal corpus (text only, no audio/video), or heck, even internet access with `curl`, and lastly dropped them into the "real world" at age 25, they would be utterly incapable of performing, say, a basic service job at a restaurant.
Words help us symbolize and reason about our sense experiences, but they are not a substitute for them.
Sure. Reading about colors will tell you nothing about them until you can see a depiction of them attached to their names. Same with all the other senses.
> Of course we can explain most of our knowledge, skills, and thoughts in words,
This is either some profound miscomprehension of just how many of your skills and thoughts are inexpressible in words, or some statement of how profoundly shallow your skills and thoughts actually are.
Of course, that does leave the door open: when these models are put in a real physical body, a robot, and have to interact with the world, then maybe they can gain that "common sense".
This doesn't mean a silicon based AI can't become conscious of skills that are hard to verbalize. Just that they don't yet have all the same inputs that we have. And when they do, and they have internal thoughts, they will have the same difficulty verbalizing them that we do.
Yann LeCun has a vested interest in downplaying LLM emergent abilities.
His research at Meta is in the analytic approach to machine learning. As a result, he is very unabashed in expressing distaste for ML approaches that don't align with his research.
Really, there is no larger sore loser than LeCun in internalizing the bitter lesson. Quoting him without this context is being deliberately misleading.
What concepts exactly can’t be verbalized? All of our serialized file formats fall under the umbrella of “words”. GPT4 can draw images by outputting SVGs for example.
Unfortunately, this effort fully misses the boat. Human cognition is about concepts, not language, and that's where one must start to understand it. Language simply serializes our conceptual thinking in multiple language formats, the key is what's being serialized and how that actually works in conceptual awareness.
I think the key point is that serialized words symbolize concepts and other logic such that if you can't retrieve that concept into your awareness, you will not understand the word. Learning and forming the concepts comes prior to attaching common word symbols to them based on the region you live in. So if you start with words, you never get anywhere, hence the complete lack of any intelligence in the LLM approach.
Exactly. Thought is prior to language, and much confusion happens when you conflate the two. In particular the surface syntax of language tells you next to nothing about the "syntax" of thought, which is hypergraphical, not tree-structured.
Right, derived from word pattern statistics. The CYC project tried first-order predicate calculus with complete failure. This is not how we think or how conceptual awareness works. The key giveaway is what they don't talk about: Concepts.
In general, in AI, when we talk about "concepts" we're talking about the things machine learning models are trained to, well, to model.
In PAC-Learning terms, specifically, a "concept" is a set of instances (which may be vectors or whatever).
Note that a "concept" is not the same as a "class", as in classification. Instead a concept belongs to a class of similar concepts and a learner is trained on instances of concepts in a class. Then a learner is said to be capable of learning the concepts in a class if it can correctly label instances of a concept in the class with some probability of some error.
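For reference, the textbook form of that last sentence (the standard PAC-learnability criterion; the notation here is the usual one, not wording from the parent or the paper):

```latex
% C is a concept class over an instance space X; D an arbitrary distribution
% over X; the learner sees m i.i.d. labelled examples S ~ D^m and outputs h_S.
\[
\mathrm{err}_D(h) = \Pr_{x \sim D}\big[h(x) \neq c(x)\big],
\qquad
\forall c \in C,\ \forall D,\ \forall \varepsilon, \delta \in (0,1):\;
\Pr_{S \sim D^m}\big[\mathrm{err}_D(h_S) \le \varepsilon\big] \ge 1 - \delta,
\]
% with m polynomial in 1/\varepsilon, 1/\delta and the relevant size parameters.
```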
For a more concrete example, a "class" of concepts is the class of objects represented as subsets of pixels in digital images. A "concept" of that class is, for example, the concept "dog". An image classifier can be said to be able to learn to identify objects in images if it can correctly classify subsets of the pixels in an image as "dog" (or "not dog").
Since the article above is coming from Josh Tenenbaum's group, that's the kind of terminology you should have in mind, when you're talking about "concepts". These guys are old-school (and I say that as a compliment).
That is indeed the general idea. That's why they're called "concepts", they're meant to be things, or categories of things, that we perceive. But "concept" also has a technical sense, of the assumed representation in machine learning.
That is, in machine learning a concept is represented as a set of instances. Inside the human mind, who knows.
This is really interesting. The title references the "Language of Thought" hypothesis from early cognitive psychology, which posited that thought consists of symbol manipulation akin to computer programs. The same idea was also behind what is often referred to as GOFAI. But the idea has largely fallen out of fashion in both psychology and AI. There's a twist here in the "probabilistic" part, and of course the surprising success of LLMs makes this a more compelling idea than it would've been only a couple of years ago. And there's also an acknowledgement of the need for some kind of sensorimotor grounding as well. Pretty cool!
So they are using GPT-4 to write Lisp? Or some probabilistic language that looks like Lisp.
They keep saying LLMs but only GPT-4 can do it at that level. Although actually some of the examples were pretty basic so I guess it really depends on the level of complexity.
I feel like this could be really useful in cases where you want some kind of auditable and machine interpretable rationale for doing something. Such as self driving cars or military applications. Or maybe some robots. It could make it feasible to add a layer of hard rules in a way.
Humans come in all shapes and forms of sensory as well as cognitive abilities. Our true ability to be human comes from objectives (derived from biological and socially bound complex systems) that drive us, feedback loops (ability to morph / affect the goals) and continuous sensory capabilities.
Reasoning is just prediction with memory towards an objective.
Once large models have these perpetual operating sensory loops with objective functions, the ability to distinguish model powered intelligence and human like intelligence tends to drop.
World models are meant to be for simulating environments. If this were something like testing whether an LLM-driven game agent can form thoughts as it plays through some game, it would be very interesting. Maybe someone on HN can do this?
A facsimile of sufficient equivalence to the world models we derive from our 5 senses may be approached through derivation of descriptive language only.
"sufficient equivalence" is important because sure it may not _really_ know the color of red or the qualia of being, but if for all intents and purposes the LLM's internal model provides predictive power and answers correctly as if it does have a world model, then what is the difference?
That's not how physics works. We understand the world by interacting with it. How do you know your internal model is right until it is tested in reality?
Yeah but we can serialize the world to numbers and already have.
I asked GPT3.5turbo "Pretend you are a character called Samatha and you're in your house. You go up to the thermostat and select a comfortable temperature and explained your reasoning"
> Next, I take into account my personal preferences and comfort levels. Everyone has their own ideal temperature range, and it's essential to find the sweet spot that makes me feel most comfortable. For me, it's usually between 22 to 24 degrees Celsius (72 to 75 degrees Fahrenheit). This range allows me to feel neither too cold nor too warm, striking the perfect balance.
It also goes on about how the humidity could affect the desired temperature, etc.
It doesn't need the ability to feel temperature (which could also be represented as a single floating-point number in kelvin), but it can already describe a "comfortable temperature" and what factors would affect it.
Side note: It doesn't "know" anything, it can only make a "best guess" which is now fairly reliable enough to be useful. It doesn't need the ability to test things to learn, we did it already for it, and it's using that to predict the results. You could make a recursive system to allow it to test data if you'd like though.
Seems you're unaware of the amount of world knowledge that already exists in written form.
Think of all the top journals, textbooks, etc. People have understood the world by interacting with it, detailed their hypothesis, conducted experiments, recalled their learning and written down conclusions.
It's not at all obvious to say a useful world model cannot be derived strictly from all this written information.
The public repo is at https://github.com/gabegrand/world-models but the code necessary for replicating results has not been published yet.
The volume of interesting new research being done on LLMs continues to amaze me.
We sure live in interesting times!
---
PS. If any of the authors are around, please feel free to point out any errors in my understanding.