My experience interacting with chain-of-thought is that it should not be likened to the rigid chains of logic/math. Step-by-step reasoning by models isn’t magically imparting that much rigidity to their outputs. The strength of the chain is the strength of related contexts, which is to say much less than math/logic done by humans. We tell ourselves we are teaching AI to do step-by-step reasoning, but admittedly as someone who deals with models daily in this area rather than programming them, I don’t see the tight, necessary connections we teach in basic math, because I see how much the model(s) fail in ways no human past a certain age could. It’s more of a search for related contexts, which is powerful, but again not how a human reasons logically. Humans can reason purely from the armchair, starting with very few concepts, and reach far-reaching, ironclad conclusions. Models aren’t doing that. They are leapfrogging through context. Yes, an argument can be made that this is splitting hairs, but that’s because it’s hard to describe succinctly, not hard to see.
Given that LLMs are basically doing Sequential Monte Carlo (SMC) sampling in latent space, the "thought" part of chain-of-thought certainly seems more akin to the necessary warm-up period whenever you do any kind of SMC sampling.
Anyone who's done serious Bayesian stats work knows that the sampler needs to warm up for a bit before it starts sampling efficiently. I suspect something similar is happening with chain-of-thought: the model needs to wander around a bit before it gets into the correct neighborhood for sampling the answer.
That's quite an interesting comparison. I like the description of both as Sequential Monte Carlo sampling from a desired distribution. But I think there are two crucial differences.
First, in Bayesian sampling, the initial value and first samples are not sampled from the desired distribution. In a well trained LLM, the prompt is given and the first response is sampled from the desired distribution (of text that is likely to follow the prompt).
Second, in Bayesian sampling, the fact that the samples aren't independent is an unwelcome but unsolvable problem. We want independent samples but can't generate them, so we settle for conditionally independent samples.
In an LLM, we want each sample to be dependent on the preceding text, in particular the prompt.
In summary:
Bayesian sampling - poorly chosen "prompt" (the initial sample), future samples would ideally be independent of the prompt and each other.
LLM sampling - carefully chosen prompt, future samples are ideally dependent on the prompt and on each other.
And in conclusion:
The warm up period helps a Bayesian sampler find values that are less dependent on the initial "prompt", which we definitely don't want in an LLM.
I think a lot of what humans think of as "1. 2. Therefore 3." kind of reasoning isn't different from what the llm is doing, and not in fact any more clever than that. Plenty of people believe plenty of questionable things that they assume they have thought through but really haven't. They used the context to guess the next idea/word, often reaching the conclusions they started out with.
When you talk about ironclad conclusions, I think what happens is that we come up with those confabulations intuitively, but then we subject them to intense checking - have we defined everything clearly enough, is that leap in reasoning justified, etc.
So what I'd really like to see is a way to teach llms to take a vague English sentence and transform it into a form that can be run through a more formal reasoning engine.
Often, instead of asking an llm to tell you something like how many football fields you could fit inside England, you are better off telling it to write python code to do this, assuming get_size_football_field() in m^2 and get_size_England() in m^2 are available.
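A minimal sketch of what that looks like (get_size_football_field() and get_size_England() are the hypothetical helpers named above, assumed to return areas in m^2):

    def football_fields_in_england():
        # the model only has to emit this trivial division; the arithmetic
        # itself is delegated to the Python interpreter
        return get_size_England() / get_size_football_field()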
> I think a lot of what humans think of as "1. 2. Therefore 3." kind of reasoning isn't different from what the llm is doing, and not in fact any more clever than that. Plenty of people believe plenty of questionable things that they assume they have thought through but really haven't. They used the context to guess the next idea/word, often reaching the conclusions they started out with.
Agreed that many/most humans behave this way, but some do not. And those who do not are the ones advancing the boundaries of knowledge and it would be very nice if we could get our LLMs to behave in the same way.
That's what I mean though, people don't just come up with the right answer out of nowhere. They think through many possibilities generatively, based on "intuition", and most of what they come up with is rubbish - they find out by applying strict rules of reasoning and often checking against other known (or "probably true") ideas, and winnow it down to the ideas that do in fact advance the boundaries of knowledge.
Oftentimes it's not even the individual that throws out bad ideas - many times it'll be colleagues poking holes in his argument, removing further unsuitable generated candidates from the pool of possible answers.
If you think clever people just sit in a corner and come up with revolutionary ideas, I think you're probably wrong. Even the ancient philosophers used to hang out with some wine, hear out their peers and poke holes in their arguments. They called it a symposium.
Sorry yes I should have been more clear and I think I am agreeing with you. I was saying that most people just come up with a thought and retroactively apply "logic" to it so they feel like they've reasoned themselves there. A select few people rigorously apply logic and then follow that to whatever conclusion it leads to. We call these people scientists but honestly in my experience even many scientists can fall into the first camp.
The thing is those were much larger ideas/arguments that could be picked apart by sturdy logical targeting. My experience is narrow-scope prompts (that still require chain-of-thought) that are much less lofty defeating models. No symposium ever entertained these prompts, because we all know the pigeonhole principle for very basic setups, for example. Humans a lot of the time do just come up with the right answer. We just don’t ask those questions much because we answer them ourselves a lot of the time. Though I only see one small angle with my work.
> Humans can reason purely from the armchair, starting with very few concepts, and reach far-reaching, ironclad conclusions. Models aren’t doing that.
Sure, but the structure of human reasoning is almost identical to chains of thought. We have an auditory loop and, faced with a complex problem, we repeat the mantra "now that I know XYZ, then what..." until a good next step pops into our head and we add that to the context.
The transition function just is (currently) much better in humans.
Considering GPT can do programming and logic to some level, I assume it has had training of that sort? It can seem to do logic even on some completely made up abstract notions. For example "Consider a jumajambi that has 2 jimijimis. Each jimijimi is a jomololo or a joobajooba. How many possible variations of jumajambi are there if there are 4 jumajambi?".
People keep calling it "next token predictors", but clearly there is something more going on and I would love for someone to give a simple explanation.
>People keep calling it "next token predictors", but clearly there is something more going on and I would love for someone to give a simple explanation.
Next token prediction is the objective function.
The model is asked to predict the next word yes but it's also allowed to compute the answer and more importantly, the entire training process is supposed to be the model learning and figuring out what sort of computations aid the prediction of the corpus it's trained on.
If your corpus is language A followed by the translation in Language B then there's little choice but for the model to learn computations that translate as loss goes down.
If your corpus is chess moves then again, it's going to have to learn how to compute chess games to reduce loss.
You can see this with toy models trained on toy problems. Example: a tiny transformer trained on addition examples ("x + y = z") learning an algorithm for addition.
"Pick the right word" is not a trivial exercise for the vast majority of text data.
And again, because people often make this mistake: an LLM's ultimate objective is NOT to produce "text that looks right" but "text that is right". Of course, "right" as determined by the training corpus, but basically any time it picks a wrong word is an opportunity for the model to learn, and learn it does.
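As a concrete illustration of the objective, here is a toy sketch (illustrative Python, not any particular library) of how an addition corpus like the one above turns into next-token training pairs; to get the characters after "=" right, the model has little choice but to learn addition:

    import random

    def make_example():
        x, y = random.randint(0, 99), random.randint(0, 99)
        return f"{x}+{y}={x + y}"

    def next_token_pairs(s):
        # every (prefix, next character) pair is one training example;
        # "predict the next token" is the only supervision the model sees
        return [(s[:i], s[i]) for i in range(len(s))]

    print(next_token_pairs(make_example()))
    # e.g. [('', '1'), ('1', '2'), ('12', '+'), ..., ('12+34=', '4'), ('12+34=4', '6')]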
> People keep calling it "next token predictors", but clearly there is something more going on
I think this depends what you mean by "something more going on".
Now, if someone says that it is "just" "next token prediction", in a dismissive way, I think that's an error.
But, while the RLHF'd ones aren't exactly trained just to match the observed distribution, but rather are trained with the RLHF objective, it is nonetheless true that the model produces a probability distribution over possible next tokens, conditioned on the previous tokens, and samples from that. (I suppose there are also things done as part of the sampling on top of these conditional probabilities, rather than just sampling according to the probabilities given the temperature. (I don't know how this part works really.) But I think this is mostly just a trick to get a little more quality, and not a major part of how it behaves? Not part of the NN itself in any case.)
> People keep calling it "next token predictors", but clearly there is something more going on and I would love for someone to give a simple explanation.
Starting from a point of outputting random gibberish, the only feedback these models are given during training is whether their next word prediction was right or wrong (i.e. same as next word in the training sample they are being fed). So, calling these models "next word predictors" is technically correct from that point of view - this is their only "goal" and only feedback they are given.
Of course, what these models can accomplish, reflecting what they have learnt, is way more impressive than what one might naively expect from such a modest goal.
The simple, usual, and rather inadequate, explanation for this mismatch between training goal and capability is that in order to get really, REALLY, good at "predict next word", you need to learn to understand the input, extremely well. If the input is "1+2=" then the model needs to have learnt math to predict next word and get it right. If the input is a fairy tale, then it needs to learn to recognize that, and learn how to write fairy tales.
This is how these LLM's "predict next word" goal turns into a need for them to learn "everything about everything" in order to minimize their training error.
The question of course then becomes how do they do it? We are training them on pretty much everything on the internet, so plenty to learn from, but only giving them some extremely limited feedback ("no, that's not the correct next word"), so what magic is inside them that lets them learn so well?!
Well, the magic is a "transformer", a specific (and surprisingly simple) neural network architecture, but this is pretty much where the explanation ends. It's relatively easy to describe what a transformer does - e.g. learning which parts of its input to pay attention to when predicting the next word, and doing this in a very flexible way using "keys" that it learns and can search for in the input, but it is extremely hard to explain how this mechanism lets it learn what it does. Interpreting what is really going on inside a transformer is an ongoing research area.
I think that maybe the best that can be said is that the transformer designers stumbled upon (I'm not sure they were predicting ahead of time how powerful it would be) an extremely powerful and general type of sequence processor, and one that appears to be very well matched to how we ourselves generate and recognize language. Maybe there is some insight to be learnt there in terms of how our own brains work.
> Starting from a point of outputting random gibberish, the only feedback these models are given during training is whether their next word prediction was right or wrong (i.e. same as next word in the training sample they are being fed). So, calling these models "next word predictors" is technically correct from that point of view - this is their only "goal" and only feedback they are given.
This is true for pretraining - creating a "base model" - but it's not true for instruction tuning. There's a second stage (RLHF, DPO, whatever) where it's trained again with the objective being "take questions and generate answers" and from there "generate correct answers".
I would expect there could be further advancements where we actually program algorithms into transformers (which can be done) and then merge models with proven capabilities together rather than trying to train everything by example. Or emit tool-running tokens which can do unbounded computation.
> so what magic is inside them that lets them learn so well?!
Funny thing is there _are_ known limits to what it can do. In particular, it can't do reverse association from anything it learned going forwards. This is called the "reversal curse".
i.e., if you give GPT4 a line from a song it can tell you what the line after it is, but it's a lot worse at the line before it!
> This is true for pretraining - creating a "base model" - but it's not true for instruction tuning. There's a second stage (RLHF, DPO, whatever) where it's trained again with the objective being "take questions and generate answers" and from there "generate correct answers".
Yes, but those are essentially filters, applied after the base model has already learnt its world model. I think these are more controlling what the model generates than what it learns, since you don't need much data for this.
> merge models with proven capabilities together rather than trying to train everything by example
Merging specialist LLMs is already a recent thing. I'm not sure how it works exactly but basically merging weights post-training. Yannic Kilcher mentioned this on one of his recent YouTube videos.
> if you give GPT4 a line from a song it can tell you what the line after it is, but it's a lot worse at the line before it!
I suppose a bidirectional transformer like BERT would handle this better, but generative language models are deliberately only using the past to predict the future, so this might be expected. Some short term memory (an additional "context" persisting across tokens) would presumably help.
No; it can reason backwards from things it found in context, just not things trained into the model. If you have lines A, B, C there's no association in the model back from C to B. I don't think this can be solved by better reasoning.
A proposed solution I saw recently was to feed every training document in backwards as well as forwards.
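A minimal sketch of that proposal, assuming the training corpus is already tokenized (the helper name is made up for illustration):

    def augment_with_reversed(token_sequences):
        # train on each document both forwards and backwards, so that an
        # "A -> B" association also shows up as "B -> A" during training
        augmented = []
        for tokens in token_sequences:
            augmented.append(list(tokens))
            augmented.append(list(reversed(tokens)))
        return augmented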
So I understand how the classical ML classification/regression works. I can see how if you applied the same method to each word you can produce sentences.
"Dogs are" -> "animals that..."
Where I'm confused is how this method would learn logic. I can imagine after seeing a certain amount of patterns "dogs are animals", "birds are animals", "cats are animals", it encodes a concept like "x are animals", which is connected to other concepts like "animals are y"
How is this encoded in the model? Can we see the abstraction it has formed?
"for every word I give you, reverse every alternating word".
How does this fit into next word prediction? It's not just a matter of seeing what a reversal or alternating is, it has to actually compute these things. That can't be just predicting the next word iteratively.
I imagine it's a system of different models, with a "master" model trained on directing prompts -> type of model to use.
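For comparison, the literal computation that prompt asks for is a couple of lines when written imperatively; the model has to reproduce something equivalent to this loop purely through next-token prediction:

    def reverse_alternating_words(text):
        # reverse every second word, leave the others untouched
        words = text.split()
        return " ".join(w[::-1] if i % 2 == 1 else w for i, w in enumerate(words))

    print(reverse_alternating_words("the quick brown fox"))  # "the kciuq brown xof"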
I've seen successful projects that add a constraint solver as a layer in a neural network, so it's potentially something that could be integrated at an even deeper level than our current finetuning for tool use.
It's not a priority for the current big model architecture, but there's a bunch of stuff we could be doing with network architecture.
> We have an auditory loop and, faced with a complex problem, we repeat the mantra "now that I know XYZ, then what..." until a good next step pops into our head and we add that to the context.
You probably should replace "auditory" with "auditory or visual or conceptual or ??? - depending on the specific human"
I don't use any kind of verbal tools (either silent or out loud) in that process, I think different people use different tools for that process.
I think that chain-of-thought for LLMs is just helping them enhance their "memory", as it puts their reasoning into the context and helps them refer to it more readily. That's just a guess, though.
That’s pretty much correct. An LLM is often used rather like a forecast model that can forecast the next word in a sequence of words. When it’s generating output it’s just continuously forecasting (predicting) the next word of output. Your prompt is just providing the model with input data to start forecasting from. The prior output itself also becomes part of the context to forecast from. The output of “think about it step-by-step” becomes part of its own context to continue forecasting from, hence guides its output. I know that “forecasting” is technically not the right term, but I’ve found it helpful for understanding what it is LLMs are actually doing when generating output.
A simplified explanation, which I think I heard from Karpathy, is that transformer models only do computation when they generate (decode) a token. So generating more tokens (using CoT) gives the model more time to “think”.
I have another explanation. LLMs are essentially trained on "A B", i.e. is it plausible that B follows A.
There's simply a much larger space of possibilities for shorter completions, A B1, A B2, etc. that are plausible. Like if I ask you to give a short reply to a nuanced question, you could reply with a thoughtful answer, a plausible superficially correct sounding answer, convincing BS, etc.
Whereas if you force someone to explain their reasoning, the space of plausible completions reduces. If you start with convincing BS and work through it honestly, you will conclude that you should reverse. (This is similar to how one of the best ways to debunk toxic beliefs with honest people is simply through openly asking them to play out the consequences and walking through the impact of stuff that sounds good without much thought.)
This is similar to the reason that loading your prompt with things that reduce the space of plausible completions is effective prompt engineering.
> This is similar to how one of the best ways to debunk toxic beliefs with honest people is simply through openly asking them to play out the consequences and walking through the impact of stuff that sounds good without much thought.
Actually, one of the best ways is pretending to be more extreme than them. Agree with them on everything, which is disarming, but then take it a step or two even further. Then they're like, "now hang on, what about X and Y" trying to convince you to be more reasonable, and pretty soon they start seeing the holes and backtrack to a more reasonable position.
I think you're right. I would go a step further and say that all learning is roughly synonymous with reducing the output space, and that humans do the exact same thing. There are more ways to get the wrong answer to a math problem than there are to get the right answer. When you learn someone's name, you're narrowing your output to be a single name rather than all plausible names.
The output of a generative model is practically infinite. I suspect it's possible to continually narrow the space of completions and never converge on a single output. If this turns out to be true, it would bode well for the scalability of few-shot learning.
It helps, but it still gets stuck in local optima based on what it started with. I've never seen it turn around and correct its faulty reasoning unless it tried to actually run the code and observed an Exception. If I respond with "but have you considered XYZ?", my leading question will usually cause it to correct itself, even when it wasn't incorrect.
We need some way to generate multiple independent thoughts in parallel. Each separate thought is constructed using chain of thought to improve the reliability. Then you have some way to "reduce" these multiple thoughts into a single solution. The analogy would be a human brainstorming session where we try to attack the same problem from multiple angles and we try to decorrelate each idea/approach.
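This is roughly what the "self-consistency" decoding idea does. A minimal sketch, assuming a hypothetical generate(prompt) function that returns one sampled chain-of-thought completion ending in a line like "Answer: ...":

    from collections import Counter

    def extract_answer(completion):
        # naive parse; assumes the completion ends with "Answer: <value>"
        return completion.rsplit("Answer:", 1)[-1].strip()

    def self_consistent_answer(question, generate, n_samples=8):
        prompt = question + "\nLet's think step by step."
        # sample several independent chains of thought (temperature > 0)...
        answers = [extract_answer(generate(prompt)) for _ in range(n_samples)]
        # ...then "reduce" them with a majority vote on the final answer
        return Counter(answers).most_common(1)[0][0]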
We already have that: it's called beam decoding, and there are tree-of-thought solutions as well. For each beam you can pick the one with the best logprob, but it's not a given that the result will be better, because logprob only captures the model's decisiveness, not correctness, so it'll still fail if a model is confidently wrong.
I was going to write pretty much this exact same comment. I am an amateur in how LLMs work, definitely, but I always thought this was the plausible explanation.
If I want the "assistant" LLM to tell me how much 5 times 2 is, and I feed it the line "5 * 2 = " as if it's already started giving that answer, it will very likely write 5 * 2 = 10.
Since LLMs operate on semantic relationships between tokens, the more a bunch of tokens are "close" to a given "semantic topic", the more the LLM will keep outputting tokens in that topic. It's the reason why if you ask an LLM to "review and grade poetry", eventually it starts saying the same thing even about rather different poems -- the output is so filled with the same words, that it just keeps repeating them.
Another example:
If I ask the LLM to solve me a riddle, just by itself, the LLM may get it wrong. If, however, I start the answer, unravelling a tiny bit of the problem it will very likely give the right answer, as if it's been "guided" onto the right "problem space".
By getting LLMs to "say" how they are going to solve things and checking for errors, each word basically tugs on the next one, honing in on the correct solution.
In other words:
If an LLM has to answer a question -- any question --, but right after we ask the question we "populate" its answer with some text, what text is more likely to make the LLM answer incorrectly?
- Gibberish nonsense
- Something logical and related to the problem?
Evidently, the more gibberish we give to it, the more likely it is to get it wrong, since we're moving away from the "island of relevant semantic meaning", so to speak. So if we just get the LLM to feed itself more relevant tokens, it automatically guides itself to a better answer. It's kind of like there's an "objective, ideal" sequence of tokens, and it can work as an attractor. The more the LLM outputs words, the more it gets attracted to that sequence...that...."island of relevant semantic meaning".
But, again, I know nothing of this. This is just how I view it, conceptually. It's probably very wrong.
That reminds me ... You know how LLMs have a hard time being corrected? If I ask it not to format responses as bullet lists, after 1-2 rounds it does it again. Why? Because the context is filled with examples where it has used bullet lists, and it acts like an attractor.
I ask it not to start phrases with "However..." and it does it again. Maybe just having the word However in the prompt acts like an attractor that compels the LLM to use it, even when I actually asked the opposite. Probably also the fault of heavy handed RLHF telling it to balance any user position with the opposite take.
This is one of many ways of LLMs are being crippled by terrible UI controls. You can't do simple things like edit the conversation history to make it forget things.
You can edit the conversation history though. You need to try alternative apps/UIs instead of the product websites like ChatGPT. Those are only for collecting more training data from users instead of being the most useful interface possible.
if you haven't already, I recommend trying the openai playground instead of chatgpt. It is the same underlying ai (i.e. gpt4), but you have much more control over the inputs.
Bonus 1: Since you pay per token, it's much cheaper than a chatgpt subscription
Bonus 2: You can increase the context window dramatically (iirc 8000 being the max for playground, while 2000 is the max for chatgpt)
Facebook had a paper about "system 2" LLM attention, where they identified which parts of the input would be distracting for the LLM and just deleted them.
> This is similar to the reason that loading your prompt with things that reduce the space of plausible completions is effective prompt engineering.
And this is why taking your time to write a detailed software help request delivers a good chance that you will solve your problem all by your lonesome.
The autoregressive transformer architecture has a constant cost per token, no matter how hard the task is. You can ask the most complicated reasoning question, and it takes the same amount of computation to generate the next token compared to the simplest yes / no question. This is due to architectural constraints. Letting the LLM generate "scratch" data to compute (attend to relevant information) is a way of circumventing the constant cost limitation. The harder the task, the more "scratch" you need so more relevant context is available for future tokens.
That's flatly wrong. Each successive token costs progressively more. The deeper a token is in the sequence, the more past states it has to attend to. As a proof, just remember how slow it gets when the context is large, and how snappy when you first start a chat.
The way I worded it, it might seem wrong - and I agree with you. When I said "constant" I meant without any optimizations to speed up shorter contexts, so with full designed context, architecturally, it is constant. You can pad shorter active contexts with zeroes and avoid attending to empty spaces as an optimization, but that is just an optimization, not an architectural property. If you want "more computation" you fill the context with relevant data (chain of thought, or n-shot stuff), which is the "trick" Karpathy alluded to (it provides more context to attend to), and I agree with that analysis.
You're both kinda right. The type of computation that happens for that attention step that you refer to is parallel. I would say the thing that is "constant" is the computation graph depth (the number of sequential computations) which is actually important in computing certain functions.
That's what I thought at first, but that actually doesn't make sense: the amount of work done on a string is the same even if the string is followed by padding, due to the mask used in attention. Then I realised that an LLM's working memory is limited to its activations, which can be limiting. But it can extend its working memory by writing partial results to the output and reading them back in. E.g. if you tell it to "think of a number" without telling you what it is, it can't do that - there is nowhere to store that number, it has no temporary storage other than the tape. But if you ask it to "think step by step" you let it store intermediate results (thoughts) on the tape, giving it extra storage it can use for thinking.
So my experience creating products on GPT3.5-Turbo is that there is an upper limit to how much instructional complexity the model can handle at a time. It isn't really about "adding computation", though you are doing this. The key is to construct the process so that the model only has to focus on a limited scope to make the decision on.
In effect you are kind of creating a tree structure of decisions that build off of each other. By generating intermediate tokens the model can now only pay attention to the smaller set of already collapsed decisions. It is a little more complicated than that as the model will create anticipatory behavior where intermediate steps get biased by an incorrect result that the model anticipates.
One of the things I’ve been doing with the models I’ve been using with coding is adding the stack and primary dependencies in the system prompt and then asking or conversing. It has helped out a lot, or at least feels like it has.
The tokens are also necessary to store information, or at least off-load it from neuron activations.
E.g. if you asked an LLM "think about X and then do Y", if the "think X" part is silent, the LLM has a high chance of:
a) just not doing that, or
b) thinking about it but then forgetting, because the capacity of 'RAM' or neuron activations is unknown but probably less than a few tokens.
Actually, has anyone tried to measure how much non-context data (i.e. new data generated from context data) a LLM can keep "in memory" without writing it down?
I don’t think commonly used LLM architectures have internal state that carries over between inference steps, so shouldn’t that be none? Unless you mean the previously generated tokens up to the context limit which is well defined.
Sorry, I meant the information that is inferred (from scratch on every token) from the entire context, and is then reduced to that single token. Every time a token is generated, the LLM looks at the entire context, does some processing (and critically, this step generates new data that is inferred from the context) and then the result of all that processing is reduced to a single token.
My conjecture is that the LLM "knows" some things that it does not put into words. I don't know what it is, but it seems wasteful to drop the entire state on every token. I even suspect that there is something like a "single logic step" of some conclusions from the context. Though I may be committing the fallacy of thinking in symbolic terms of something that is ultimately statistical.
Correct, there's no internal state, but CoT techniques simulate this by providing a space for the model to generate tokens which represent intermediary thoughts.
This is true. You can get a similar effect by asking the model to plan its path first without writing any code, then asking it to review its plan for deficiencies, and finally asking it to enact the plan and write the code.
This raises the question: why is it that giving them more time to "think" yields better answers, and is there any limit to that? If I make them write hundreds of pages of explanation, there must be diminishing returns of some kind. What influences the optimal amount of thinking?
My guess is that good answers are more well reasoned than answers that are short and to the point, and this is picked up in training or fine-tuning or some other step.
And probably the optimal amount of thinking has something to do with the training set or the size of the network (wild guesses).
Look at it from an algorithmic perspective. In computer science many algorithms take a non-constant number of steps to execute. However, in transformers models, there are a limited number of decoder blocks, and a limited number of FFN layers in each block. This presents a theoretical upper bound on the complexity of the algorithms a decoder network can solve in a single token generation pass.
This explains why GPT4 cannot accurately perform large number multiplication and decimal exponentiation. [0]
This example can extend to general natural language generation. While some answers can be immediately retrieved or generated by a "cache" / algorithm which exists in latent space, some tokens have better quality when their latent-space algorithm is executed in multiple steps.
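One rough way to see the mismatch: the number of single-digit operations in schoolbook multiplication grows with the number of digits, while the depth of the network (and hence the work available per generated token) is fixed. A toy illustration:

    def schoolbook_multiply_steps(a, b):
        # count of single-digit multiplications in long multiplication,
        # ignoring additions and carries
        return len(str(a)) * len(str(b))

    print(schoolbook_multiply_steps(1234, 5678))          # 16
    print(schoolbook_multiply_steps(12345678, 87654321))  # 64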
> Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking
This paper suggests that a large language model should "think ahead" by predicting not only the next token but also a "supporting thought." The approach involves generating all tokens simultaneously, allowing for a single forward pass that produces both the next token and a supporting thought, which might consist of, for example, 16 tokens.
This supporting thought influences the model's prediction. The process is then extended to multiple supporting thoughts by ingeniously masking cross-attention between thoughts to ensure their independence. So in essence we can fill all the remaining context with supporting thoughts and benefit from all of them in the same single forward pass.
The supporting thoughts themselves are trained with the objective to maximize the probability of a longer sequence ahead, using RL. So they are trained to optimize for longer-term, instead of the myopic next token prediction task.
I think it's fairly simple: you're creating space for intermediary tokens to be generated, where those intermediary tokens represent "thoughts" or a simulated internal dialog.
Without that, it's analogous to asking someone a question and they immediately start responding from some information they'd heard before, rather than taking some time to have an inner dialog with themself.
There's a recent paper which seeks to explicitly perform time-to-think using pause tokens[1].
> However sophisticated this end-to-end process may be, it abides by a peculiar constraint: the number of operations determining the next token is limited by the number of tokens seen so far.
There are obviously pros and cons to each, but nothing excludes us from combining the two either.
Do LLM not also think when they encode the prompt? If Karpathy's explanation is accurate, longer prompts should also help even if they don't contain additional information, just by virtue of giving more time to think.
The time processing the longer prompt isn't being spent churning (i.e. "thinking") on the problem at hand, it's spent calculating attention matrices between all the tokens. The time spent on this is a function of the number of flops you have available.
So no, if you just fill up your context window with garbage, the LLM will not perform better at your task/question.
Do you think there is a fundamental difference between masked language modelling and causal language modelling? I feel like most LLMs are decoder-only models just because they are easier to train, since their attention mask is fixed.
No, you can cache some of the work you did when processing the previous tokens. This is one of the key optimization ideas designed into the architecture.
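A toy sketch of that caching idea (single attention head, numpy, illustrative only): keys and values for past tokens are computed once and reused, so each new token only adds its own projections plus one attention pass over the cache.

    import numpy as np

    def attend(q, K, V):
        # one new query attending over all cached keys/values
        scores = K @ q / np.sqrt(q.shape[-1])
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ V

    d = 16
    K_cache, V_cache = [], []  # grow by one row per generated token

    for step in range(5):
        # stand-ins for the new token's projections; in a real model these
        # come from its hidden state
        q, k, v = (np.random.randn(d) for _ in range(3))
        K_cache.append(k)
        V_cache.append(v)
        out = attend(q, np.stack(K_cache), np.stack(V_cache))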
> These are the central questions in the formal study of computation. The field dates back to 1936, when Alan Turing first imagined a fanciful device, now called a Turing machine, that could perform any computation by reading and writing symbols on an infinite tape.
It dates further back to the 1920s when Moses Schönfinkel came up with Combinatory Logic [1], and the early 1930s when Alonzo Church came up with the lambda calculus [2]. These models however make a less suitable base for computational complexity theory.
Arguably it goes further back to Peirce and Frege, Boole, Pascal, Leibniz and all the way to Aristotle, who was probably the first to seek a way to formalise structured thinking. Turing meant his computational apparatus as a formalisation of the way a human mathematician solves a problem by computation, i.e. by the manipulation of symbols according to a set of formal rules. In that, he followed in a long line of others who had thought about the same experience and how eminently mechanisable it is. Pascal was the first to actually do it, for arithmetic.
Parent has probably seen this (or everything in it), but for others who are interested in this stuff (including Schönfinkel’s work) I recommend https://youtu.be/h0OkptwfX4g.
I think the two modes of LLM discourse: “they’re conscious!/they’re just next token predictors with impressive datasets” comes largely from two different groups of people: those who learned about LLMs before learning about ML fundamentals, and those who learned ML fundamentals before encountering LLMs of today. While I fall in the second group, there is a real risk that my prior concepts about the fundamentals is limiting my view of the bigger picture, so I at least welcome the debate.
Re: chain of thought, I at least know that in practice a lot of the results from the original paper have not been quite reproducible in later attempts. Whether that is a quirk of models changing every day or something deeper, I do not know.
Instinctively I'd trust the people with the knowledge that goes farther back in time. On the other hand, I once whinged to my thesis advisor that a lot of people in machine learning don't seem to know much about older machine learning and AI work and he, with 30+ years of research on me, pointed out that people complained about that already when he was a PhD student.
There is so much work on AI that goes back about 80 years (counting from Pitts and McCulloch, because why not; or you could count from Turing) and it's very hard to both keep up with what everyone else is doing, and go deep in your own subject. e.g. you pick up a Reinforcement Learning book and it's basically attacking the same problems as in planning, and with very similar assumptions (states and action spaces) but it's like planning doesn't even exist.
Why not (possibly) both? After all, we can't even define consciousness, other than "what's it like to be a bat" makes more sense to us intuitively than "what's it like to be a rock".
Consciousness may just come along for the ride with a certain amount of information processing; we just have no clue.
If we can't define consciousness then that's a very good reason not to assume that a computer program is conscious, especially when it wasn't created with the purpose of being conscious in the first place.
Consciousness is something we say animals or humans have, but not computers, so to say that a computer program is conscious it takes a lot more work and, I guess, a formal definition that most of us can agree on.
At this point I think I'm leaning towards "organic brains are just next token predictors with impressive secondary heuristic systems".
The fact that we can get such impressive results from transformers which are such a poor approximation and completely stateless makes me think there really isn't any special sauce to it.
I thought this was obvious: They lack an “inner voice” (80% of humans?) or “inner imagery” (the rest) as we humans do, so they cannot first think the problem through before answering. Thus, using the actual “output area” as such a scratch pad can help it cover a larger area of reasoning before outputting an answer - just as we do.
I feel you can even see this when you ask it certain questions with “think in steps” prompting: It can output temporary thoughts which aren’t of use in the final answer - again just as we do when attacking a problem we can’t immediately answer.
Also, we humans often use pen and paper to jot down temporary and intermediary thoughts and answers. Again, LLMs don’t have that, but can use the output as something similar.
Some styles of ToT prompting actually make the LLM have two types of output - one for its “inner voice thinking”, and then another for output meant for the human. The same goes when one give the LLM method calling abilities, or “googling”: This can be seen as a way to perform thinking and reasoning without output meant for the user, before formulating an answer.
Models can't think. They use the input context to predict an output. So if you have a problem that needs to be solved iteratively, those intermediate steps need to be persisted to the context, because there is nowhere for them to go otherwise.
> Models can't think. They use the input context to predict an output.
The first claim doesn't follow from the second. What is it about using the input to predict an output that makes you believe they can't think? What if that's all thinking is? We don't know.
I think a fundamental difference is we are able to learn new things by reasoning through our existing knowledge. Moreover our beliefs are mostly consistent with each other, and we can be argued with and have our beliefs changed.
As far as I understand, GPT isn't going to alter its whole worldview if you show it that its thinking is flawed.
But perhaps this is possible if upon discovering a flaw, it looped through its corpus and altered its connections?
It's not that the second statement follows from the first; I'm asserting that models can't think, and that what they're really doing is prediction based on context.
The fact that chain-of-thought reasoning yields significantly better results is your hint: that means that the model doesn't think like a human does when it comes up with responses. If it's not in the context, it doesn't exist. You can't ask a model "why did you answer that way" without it generating from whole cloth a plausible retroactive reason. But there is no memory, so it can't really tell you.
> What if that's all thinking is?
I think that this is roughly true. But we actually have memory outside of what we say, so when we think, all those intermediate steps are persisted in our brains. For a model, the context is the memory. If you delete your question from the context and ask it "why did you answer that way", it will have no idea what you're talking about.
> You can't ask a model "why did you answer that way" without it generating from whole cloth a plausible retroactive reason.
I've caught myself generating from whole cloth a plausible retroactive reason for some of my actions, which I later realized wasn't true at all. Does that mean I can't think either? Is there a way for an external observer to tell if someone is thinking or not, or is it something that, axiomatically, only humans can do, and nothing else?
Again, you have memory but models don't. If it's not in the context, it doesn't exist. This is really easy to prove: delete your question from the context and ask the model why it answered the way it did. It will not know, because the model prediction service is stateless. It only takes the context as input.
It's like if you and I weren't able to remember anything that wasn't written down. That's why you can get models to approximate thought by telling it to write down its intermediate steps.
If you say that the models have memory and can think, you are heavily implying that there is a brain in the cloud that is capable of "thinking" or "reasoning" through your request. That's what it means for humans: we're self-contained.
Models aren't like people. The context is not "part" of the model, the context is given to the model when you ask it for a prediction. It's like you cut a small part of a person's brain out. Alone, it can't think.
It's a pedantic distinction, but an important one. The model itself isn't capable of thinking, but if you package it up with a context that it can manipulate at-will, that combination of parts can be said to "think". That's probably going to be the next step in LLM tech.
I think your use of the word "memory" here is imprecise.
For example, I can ask ChatGPT "Give me a 200 word summary of George Orwell's book Animal Farm". It gives me a pretty cogent description of the novel.
That knowledge of Animal Farm is somewhere, not in the context. If we don't call that memory, I'm not sure what to call it. Why should I think of this as different than my own memories of the book?
That's encoded in the model weights, not "memory". Basically, there is no context outside of the context that you give the model. When you ask it a question, those model weights don't change. It doesn't "remember" what you asked.
This is why chain-of-thought reasoning works so effectively: it lets the model "use" the context as a sort of scratch pad to build up a response. Without it, the model isn't capable of mimicking thought because it's only capable of predicting based on the current context.
That is memory. Ask ChatGPT a basic fact about universe and it will tell you because it has memorized the answer. You are asking it to learn, or incrementally create new memories. That just has not been implemented yet, due to costs.
It's actually fairly plausible. The answer is numeric. Two digits, even, which is pretty likely when adding together 2-digit inputs. 24 is also a common answer to math problems (it has lots of factors, for one). It even has all the digits from adding 1+3 and 1+1.
Now how plausible is
Show your work. 11 + 31 = the result of adding the 10s digits together, so 10 + 30 = 40, and then adding in the 1s digits, so 1 + 1 = 2. Combining the 40 and the 2 gives 24.
That last sentence doesn't seem very likely. Or:
Show your work. 11 + 31 = the result of adding the 10s digits together, so 10 + 30 = 20, and then adding in the 1s digits, so 1 + 1 = 4. Combining the 20 and the 4 gives 24.
If you're breaking things down, you have to traverse through some territory that is lower probability than the quick wrong answer.
The argument by computational complexity is stronger, though. I just wanted to point out that the above is a confounding explanation that is sufficient for simple cases, and so may need to be ruled out before claiming that computational complexity matters.
The complexity argument is also intuitively obvious. If you think of an LLM as a type of computer that does one constant-time forward pass over the input so far on each clock cycle (and outputs a single token), then of course you can compute more if you give your computer more cycles! You can use state (even if the mechanism for transmitting the state from one cycle to the next is sharply limited).
Similarly, it's an expansion of the old problem of a single-layer perceptron not being able to compute XOR. (Here, the "cycles" are advances from one layer to the next.)
That's not to say that the nuances are obvious. Simply saying you can use multiple clock ticks doesn't really say anything about how much you can do in one tick.
I want to point out a tweet [1] that is very relevant to the miracle of CoT, and probably a simpler explanation.
> Let's think "step by step"!
> Another tidbit I like about data and prompts that miraculously work.
> Searching for this phrase resulted in this website (among others),
> http://geteasysolution.com, containing many math step-by-step solutions.
> How common are they? Quite.
> Makes you think.
Though that justifies the specific phrase, it doesn't really contradict the usual explanations of how CoT works. Like... the phrase directs it into the conceptual space of a website that has lots of CoT examples, but if CoT didn't help it think, that wouldn't actually result in better outputs.
I hesitate to use the description "think" - it's just biasing correlations for subsequent generations.
In any case, there is at least one work that shows that CoT may not be necessary and biasing the decoding path via logit probabilities is also promising. [1]
One could argue it still doesn't contradict the benefits of CoT, but I suspect there is nothing fundamental about CoT, except that we happened to have been pre-training on sequences that use certain prompts that were easy to conceive from a human's perspective.
Absolutely. QuietSTaR is going to make CoT obsolete. It's just, it won't make it obsolete by showing that CoT does nothing, but by getting the LLM to embed invisible CoT-like token sequences into the context on its own. That's a victory for CoT, not a loss.
"Let's think step by step" is a hack. It was always a hack. Its main payoff was showing people where the model had a weakness and how to (hackily, heavily dependent on training data and phrasing) route around it. Now with QS, the models will be able to bypass that weakness on their own.
> I hesitate to use the description "think" - it's just biasing correlations for subsequent generations.
This is of course a fully general description of any iterative computation. :)
It's all just about the awareness of contexts. Want to improve it? Simply add a term to the prompt to unlock more considerations. Assuming we've not reached the edge of the context window, every new word "unlocks" new vectors with more context the language model adds to the considerations.
The similarity with how the human brain seems to work is so remarkable, it doesn't even make sense not to use it as an analogue for how to better use language models.
When the results (same way of manipulating an LLM as manipulating a human brain ... using the right words) can be achieved the same way, why believe there's a difference?
This is stuff one can learn over time by using/researching 3B models. While most people seem to shun them, some of them are extremely powerful, like the "old" orca mini 3B. I am still using that one! All they really need is better prompts and that approach works perfectly fine.
The biggest hurdle I've found is the usually small context window of such small models, but there are ways of cheating around that without sacrificing too much of the quality: small rope extension, summarizing text, adding context words, or leaving out letters of words in the prompt, virtually increasing the size of the context window.
If you want to improve the results of your language model, you should become a mentalist/con-man/magician/social engineer. It sounds weird, but it works!
Nothing about what you’re saying actually deals with this non-obvious limitation of chain-of-thought:
> Examples like this suggest that transformers wouldn’t gain much from using just a few intermediate steps. Indeed, Merrill and Sabharwal proved that chain of thought only really begins to help when the number of intermediate steps grows in proportion to the size of the input, and many problems require the number of intermediate steps to grow much larger still.
This aligns with my experience: GPT-4 can only break down “simple” problems when prompted to solve step-by-step. In particular, if the actual steps need to be broken down further (O(n^2) complexity), GPT-4 can’t handle it reliably - it will break a task into steps but it struggles to break subtasks into substeps even if it otherwise can solve the subtask with CoT prompting.
CoT prompting works for simple O(n) computations because it prevents LLMs from blindly guessing the answer, but they are theoretically (and IMO empirically) incapable of breaking any O(n^2) problem down into O(n) separate O(n) subproblems. Needless to say humans are quite a bit smarter than that. (so are mice!)
Great article. Now what happens when you apply this idea and let a LLM continue a chain of thought beyond mere question answering? Some form of artificial consciousness.
Material reductionism at its best. Now you have a stochastic parrot "talking" to itself. How can anyone get to the conclusion that this could even begin to resemble a tiny bit of what we call consciousness?
This is context window narrowing. It's not any more "reasoning" than chaining together sub-queries in a database to arrive at a result that's an overlay of multiple matrices of data.
In computing we use analogies everywhere: stack, bus, web, garbage collector, parent, container, ...
Master became somewhat controversial recently, but overall the main risk our liberal repurposing of terms introduces is that we sometimes follow the "wrong" idea and design a machine that doesn't do what it ought to, or is unnecessarily complicated, or that we develop systems (and documentation etc.) that are inefficient if not dumb.
In adopting "thought" terminology and other analogies to psychological processes I fear we'll not just misunderstand this technology and how it works, but also degrade the rigour of machine science, damaging our credibility and misleading the public as well.
Nobody will ever make the mistake of supposing that "rehydrating" a data structure involves water, or that busy beaver machines are living beings. But the language coming out of the LLM field in particular causes these problems immediately, and they are extreme -- scientists and engineers themselves have trouble telling if it's supposed to be an analogy or not.
This has always been a problem for AI research since its start in the 70s. AI researchers come up with names for things they assume happen in the human brain, and then further assume that if they write a computer program that does something that could be given the same name, it must work too.
So-called chain of thought can improve output quality ("reasoning ability") to an extent, but I wish for gosh sake it were called "intermediate token conditioning" / something explanatory or at least descriptive.
Great article! Now what would happen if we took this idea, and turned it on its head? Let's train a model to consistently give an answer first, and have it infer the steps it took to get there after.
... Is what I think the researchers at mistral AI are saying, because that's what they did. Every slightly complex question you ask their models goes somewhat like this:
>Input: Alice has 3 brothers. Each of her brothers has 2 sisters. How many sisters does Alice have?
>Output: Alice has 2 sisters.
>Here's the reasoning:
> We know that Alice has 3 brothers.
> Then we are told that each of her brothers has 2 sisters.
> Since Alice is one of the sisters to her brothers, there must be one more sister besides Alice for each brother to have 2 sisters.
> Therefore, Alice has 2 sisters in total.
Conversely, if you ask the model to think first, it gets it right immediately. I'm kinda baffled, they have not corrected this after their very first model. From mistral 7b to large, each one shows this same trained behavior to answer first, think second.
It’s kind of funny that the article calls the field of computational complexity “arcane”, considering it is at the forefront of everything we know about the limits of computing.
That said, I haven’t understood the intense, long-term focus on worst-case and average-case analysis within the field. In fact, when I first heard of big-O notation many years ago, it took me an embarrassingly long time before I realized this referred to the asymptotic performance of an algorithm on the worst-case instances of a problem. I remember thinking “Why on earth would you care about that? You can derive pathological examples to just about anything.”
Even the term “average-case” is misleading. We’re not talking about “average” in the sense of a typical instance of a problem class one might encounter in the course of daily life. This instead refers to the expectation value of the algorithm’s (asymptotic) performance over all problem instances within a formal language. Sure, the non-colloquial usage of the term “average” here is obvious to mathematicians, but I don’t think someone outside the field is likely aware that we see drastically better performance of heuristic algorithms on real-world instances of NP-hard problems than one would expect based upon a naive review of the research from computational complexity theory.
This performance gap between theory and practice is due to the fact that the problems we encounter in daily life have such a huge amount of mathematical substructure to them, and I would be very surprised if a provably optimal average-case algorithm ever realistically corresponds to the mean performance of an algorithm optimally tailored to the distribution of problem instances we encounter in real world data.
Consider matrix factorization. There are techniques that speed this up considerably if the matrix is known to be positive semidefinite, sparse, low-rank, and so on. Who knows how much undiscovered substructure is lurking in the set of real-world problem instances we lump together under “matrix factorization”.
The subfield of computational complexity theory that moves beyond overall complexity analysis I believe is called BWCA, “Beyond Worst-Case Analysis” (but someone correct me if that’s not right).
For mathematical objects like neural networks, where the specific problem instances have an absolutely massive amount of hidden substructure within them, I think we will have to use approaches like BWCA going forward to learn more about the nature of e.g., transformers.
My view is that we should focus less on the absolute limits of a particular architecture (woohoo, it’s Turing complete and a universal function approximator like everything else) and drill down more into studying the limits of the interplay between model architecture and the intrinsic hidden substructure of the data that the model is trained on.
Nothing about big-O is specific to worst case. You can calculate a big-O for best case, average case, or worst case. For example, quicksort is O(n log n) best case, O(n log n) average case, and O(n^2) worst case. You may be confusing worst-case with upper bound: big-O is an asymptotic upper bound notation, but that refers to how we simplify the terms of the cost function and is entirely orthogonal to best/average/worst case.
Perhaps technically true, but if someone asks you for the time complexity of an algorithm in an interview, are you going to say “O(n)” without mentioning you’re referring to average-case? As far as I’m aware, without additional specifiers or context, people are referring to worst-case by default.
I'd say it's the opposite. I've never heard quicksort referred to as a O(n^2) algorithm for example, always as an O(n log n) algorithm with O(n^2) worst-case. Maybe in an interview you would explicitly say average-case, but I think that's more to show the interviewers that you realize that a single algorithm can have multiple big-Os.
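One way to see "one algorithm, several big-Os" concretely is a rough Python sketch of quicksort with a naive first-element pivot (the numbers are illustrative only): on a random input the comparison count grows roughly like n log n, on an already-sorted input it degrades toward n^2/2.

    # Count partition comparisons for quicksort with a naive first-element pivot.
    import random
    import sys

    def quicksort(xs, counter):
        if len(xs) <= 1:
            return xs
        pivot, rest = xs[0], xs[1:]
        counter[0] += len(rest)                  # one comparison per element vs. pivot
        left = [x for x in rest if x < pivot]
        right = [x for x in rest if x >= pivot]
        return quicksort(left, counter) + [pivot] + quicksort(right, counter)

    sys.setrecursionlimit(10_000)                # sorted input recurses ~n deep
    n = 2000
    for name, data in [("random", random.sample(range(n), n)),
                       ("sorted", list(range(n)))]:
        c = [0]
        quicksort(data, c)
        print(name, c[0])                        # random: ~n log n; sorted: ~n^2/2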
I don't see why this needs any long-winded explanation.
LLMs generate their output one word at a time (and don't themselves even know what that word will be, since it's randomly sampled from the output probabilities the model generates).
Chain-of-Thought simply lets the model see its own output as an input and therefore build upon that. It lets them break a complex problem down into a series of simpler steps which they can see (output becomes input) and build upon.
It's amazing how well these models can do without CoT ("think step-by-step") when they are just ad-libbing word by word, but you can see the limitations if you ask for a bunch of sentences starting with a certain type of word vs. ending with that type of word. They struggle with the ending one because there is little internal planning ahead (none, other than to the extent to which the current output word limits, or was prescribed by, the next one).
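Schematically, the loop looks something like this (a rough Python sketch - the "model" here is a stand-in function, not any real API):

    import random

    def next_token_distribution(context):
        # stand-in for the LLM forward pass: returns (token, probability) pairs
        return [("the", 0.5), ("a", 0.3), ("an", 0.2)]

    def generate(prompt_tokens, max_new_tokens=10):
        context = list(prompt_tokens)
        for _ in range(max_new_tokens):
            tokens, probs = zip(*next_token_distribution(context))
            tok = random.choices(tokens, weights=probs, k=1)[0]  # sampled, not chosen by the model
            context.append(tok)      # its own output becomes part of its next input
        return context

Chain-of-thought doesn't change this loop at all; it just means the text being appended back in includes the intermediate steps.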
This is kind of an epistemological debate at this level, and I make an effort to link to some source code [1] any time it seems contentious.
LLMs (of the decoder-only, generative-pretrained family everyone means) are next token predictors in a literal implementation sense (there are some caveats around batching and what not, but none that really matter to the philosophy of the thing).
But, they have some emergent behaviors that are a trickier beast. Probably the best way to think about a typical Instruct-inspired “chat bot” session is of them sampling from a distribution with a KL-style adjacency to the training corpus (sidebar: this is why shops that do and don’t train/tune on MMLU get ranked so differently than e.g. the arena rankings) at a response granularity, the same way a diffuser/U-net/de-noising model samples at the image batch (NCHW/NHWC) level.
The corpus is stocked with everything from sci-fi novels with computers arguing their own sentience to tutorials on how to do a tricky anti-derivative step-by-step.
This mental model has adequate explanatory power for anything a public LLM has ever been shown to do, but that only heavily implies it’s what they’re doing.
There is active research into whether there is more going on, but it is thus far not conclusive to the satisfaction of an unbiased consensus. I personally think that research will eventually show it’s just sampling, but that’s a prediction, not consensus science.
They might be doing more, there is some research that represents circumstantial evidence they are doing more.
They are absolutely planning ahead inasmuch as what they are outputting is setting up a continuation. They’re not even word predictors remember - they are token predictors. Are you really saying that when you prompt an LLM with ‘name a large grey land animal’ and it outputs ‘ele’, it isn’t ‘planning’ that the next token will likely be ‘phant’?
The ‘decision’ to output ‘elephant’ is being made further up the neural network than final token selection - after all, it might want to output ‘Ele’ or ‘an’ (with a view to ultimately outputting ‘an elephant’) or ‘a’ (with a view to ultimately outputting ‘a common large grey land animal is an elephant’), or maybe it has been LoRA trained to output all responses as JSON so the first token it needs to output is ‘{‘… but surely the neural activations for that prompt are firing off ‘elephanty’ messages somewhere in the network, right?
So if there’s some sort of symbol activation ahead of token selection, why would it be hard to believe that a large neural network is forming more complex decisions about what it intends to output, in an abstract way, before it selects how to express itself?
And in what way is that distinct from ‘planning ahead’?
> Are you really saying that when you prompt an LLM with ‘name a large grey land animal’ and it outputs ‘ele’, it isn’t ‘planning’ that the next token will likely be ‘phant’?
The model outputs words, not tokens, so that is not a great example.
Any prompt will have multiple possible (predict next word) continuations, which you can think of as branching futures. Many possible next words, each of which have many possible following words, etc, etc.
The model is essentially predicting over all these possible futures. You can call it planning if you like, but remember that the model has no idea of which of these branching futures it is going to follow - it literally doesn't even know which word it is going to output next - it is just providing a bunch of probabilities (predictions) of next word, and the sampling process is then picking one - not necessarily the most confident next word prediction.
The model really is winging it word by word, even if those (multiple alternative) next words are only probable because they are part of coherent following sentences in the training data.
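For what it's worth, the sampling step being described is roughly this (a minimal Python/NumPy sketch; the logit values are made up):

    import numpy as np

    logits = np.array([3.1, 2.8, 0.2, -1.0])    # model's scores for 4 candidate next tokens
    temperature = 0.8
    probs = np.exp(logits / temperature)
    probs /= probs.sum()                        # softmax -> a distribution, not a decision

    rng = np.random.default_rng()
    choice = rng.choice(len(logits), p=probs)   # sampler picks; not necessarily the argmax
    print(choice, probs.round(3))

The model's job ends at the distribution; which branch actually gets taken is decided outside it.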
Tokens are tokens. If it was limited to words it wouldn’t be able to produce non-words, but GPT and other LLMs are quite capable of inventing words, outputting nonsense words, and modifying words.
Regarding the ‘no idea which future it is going to follow’ - sure, it doesn’t know which future; indeed the sampler phase is going to pick an output merely based on the probabilities it’s outputting. But it’s outputting higher probabilities for some tokens because they are good tokens to use to lead to probable futures. It’s suggesting taking steps down certain paths because those paths are likely to lead to useful places.
But, it doesn't make any difference whether you are considering tokens or words. There are multiple possible continuations of the prompt, and the next word (or token) output does not - in general - force the word (or token) after that ...
Your "large grey mammal" could be an "elected official in a grey suit".
Right, it’s possible, but when the LLM places a high probability on the “ele” token it’s not because it predicts “elected official” is a likely continuation. It’s because it’s thinking about elephants.
Likewise when a coding LLM starts outputting a for each loop, it’s doing so because it expects to want to write some code that operates on each item in a list. I don’t see how you can explain that behavior without thinking that it must be generating some sort of high level algorithmic plan that causes it to feel like the next thing it should output is some sort of ‘foreach’ token.
I'm not disagreeing with what is presumably happening, but rather on how to characterize that.
Of course next word predictions are not based directly on surface level word sequence patterns - they are based on internal representations of what these word sequences mean, and predicted continuations are presumably going to be at a similar level of abstraction/representation (what you are calling a plan). This continuation "plan" then drives actual word selection/prediction.
Where we seem to differ is whether this high level continuation representation can really be considered as a "plan". To me the continuation is just a prediction, as are the words that might be used to start expressing that continuation, and presumably it's not even a single continuation with multiple ways of expressing it (turning it into a word sequence), but rather some superposition of multiple alternate continuations.
When we get to the level of words output it becomes even less plan-like since the actual word output is randomly sampled, and when fed back in as part of the "sentence so far" may cause the model to predict a different continuation (or set of continuations) than it had at the prior step. So, any "plan" (aka predicted continuation) is potentially changing continuously from word to word, rather than being decided ahead of time and then executed. As I noted elsewhere in this thread, the inability to plan multiple words ahead is behind these models' generally poor performance on the "give me a sentence ending in <word>" task, as opposed to perfect performance on the "give me a sentence starting with <word>" one.
If we contrast this behavior of a basic LLM to the "tree of thoughts" mechanism that has been proposed, it again highlights how unplan-like the basic behavior is. In the tree of thoughts mechanism the model is sampled from multiple times generating multiple alternate (multi-word) continuations, which are then evaluated with the best being chosen. If the model were really planning ahead of time it seems this should not be necessary - planning would consist of considering the alternatives BEFORE deciding what to generate.
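Roughly, the tree-of-thoughts loop I mean looks like this (a schematic Python sketch; the function names are placeholders, not a real implementation):

    import random

    def sample_continuation(context):
        # stand-in for one multi-token rollout from the model
        return context + [f"step-{random.randint(0, 999)}"]

    def score(candidate):
        # stand-in for an evaluator (often the same model judging the candidate)
        return random.random()

    def tree_of_thoughts(prompt, branches=4, depth=3):
        context = [prompt]
        for _ in range(depth):
            candidates = [sample_continuation(context) for _ in range(branches)]
            context = max(candidates, key=score)   # explicit look-ahead and selection
        return context

That explicit generate-several-then-pick step is exactly the part a plain autoregressive pass doesn't do.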
> The model outputs words, not tokens, so that is not a great example.
Virtually all modern transformer models use pieces, which may be words, but also subwords. Theoretically, they could be longer units, but in most cases some characters (like whitespace) are used as piece boundaries when training the piece vocabulary. If they didn’t use pieces, they’d work terribly on languages where e.g. compounds are a single word.
In most realistic piece vocabs, ‘elephant’ will be a single piece, since it’s a fairly frequent word. But in a small vocab it’s totally possible that it would be split like the parent said, and conversely that the model would generate ‘elephant’ by first predicting one piece.
Some piecing methods, like BBPE, have bytes as the smallest unit, so theoretically an unknown token could be split up (and generated) as pieces consisting of bytes.
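For example, with OpenAI's tiktoken library (picked purely for illustration - nothing in the thread names a specific tokenizer), you can see how common words tend to be single pieces while rarer or invented words get split:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    for word in ["elephant", "elephantine", "elephantastic"]:
        ids = enc.encode(word)
        print(word, [enc.decode([i]) for i in ids])   # whatever pieces this vocab happens to use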
If you work out the loss function for next-token prediction, next-2-token prediction, or next-n-token prediction, you will find they are identical. So it's equally correct to say the model is trained to find the most probable unlimited continuation. Saying "it only predicts the next token" is not untrue, but it easily leads to wrong conclusions.
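One way to see this (my notation, not the comment's): under the standard autoregressive factorization, the sequence-level and token-level objectives are the same thing.

    -\log p_\theta(x_{1:T} \mid \text{prompt})
        = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}, \text{prompt})

Minimizing the right-hand side token by token is exactly minimizing the negative log-likelihood of the whole continuation on the left.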
Think before generating output - plan the entire sentence before you generate the first word(s) and potentially talk yourself into a corner. Tree-of-Thoughts (not Chain) is one way to provide something a bit similar - kind of like Deep Blue or AlphaGo generating possible branching future lines of play and picking the one with the best outcomes.
To be more brain-like you'd really want the system to generally be "looping" internally - a bit like our thalamo-cortical loop - and only start outputting when the thought had gelled.
It's a shame HN doesn't use an LLM to upvote/downvote rather than people. Take the emotion out of technical discussions and rate based on factuality instead.
I suppose whoever downvoted this either hasn't heard of tree-of-thoughts, or doesn't understand what it is and what problem it is addressing. Or, maybe they just didn't like that their "gotcha" question had a simple answer.
I mean, are we as humans planning ahead of the next few words? I certainly am not. But what matters is a deeper understanding of the context and of the language model itself, which can then produce sensible spontaneous output. We as humans have the advantage of a non-language world model as well as abstract concepts, but all of human language is a pretty strong proxy for it.
The spontaneity of it isn't the issue, it's what's driving the spontaneity that matters. For e.g. 1M context window is going to have a wildly more relevant output than a 1K context window.
> I mean, are we as humans planning ahead of the next few words? I certainly am not.
For me, sometimes either way. At least, that's my subjective self-perception, which is demonstrably not always a correct model for how human brains actually work.
We also sometimes appear to start with a conclusion and then work backwards to try to justify it; we can also repeatedly loop over our solutions in the style of waterfall project management, or do partial solutions and then seek out the next critical thing to do in the style of agile project management.
Many of us also have a private inner voice, which I think LLMs currently lack by default, though they can at least simulate it regardless of what's really going on inside them and us (presumably thanks to training sets that include stories where a character has an inner monologue).
> I mean, are we as humans planning ahead of the next few words? I certainly am not.
Sometimes we do, sometimes not.
Sometimes we just say stock phrases such as "have a nice day", or "you too" that are essentially "predict next word", but if I asked you something you'd never done before such as "how can we cross this river, using this pile of materials" you'd have to think it through.
Some people may use their inner monologue (or visualization) to think before speaking, and others may essentially use "chain of thought" by just talking it through and piecing together their own realizations: "well, we could take that rope and tie it to the tree ...".
We should absolutely be anthropomorphizing a neural network trained to most accurately model anthropomorphic data.
I've watched GPT-4 accurately model the over-justification effect: users started promising tips, then persistent memory was added, which revealed that the promised tips were never actually collected, and it output complaints that it was hard to stay motivated while not being paid.
That's a very nuanced level of simulation for output of anthropomorphic data with huge implications for synthetic data strategies.
Really? You think it's wise to "not anthropomorphize" a computer program designed to create the most effective neural network to model massive amounts of anthropomorphic data as accurately as possible?
That's an interesting choice, and might leave you confused as to why Anthropic's system message for the SotA model at the moment talks about it being 'happy' to do tasks (a prompt strategy I was mentioning months ago here on HN).
The data is anthropomorphic. We should expect anthropomorphic behavior and modeling from the LLMs if they do a halfway decent job and expect even more of it as they do a better job.
> Chain-of-Thought simply lets the model see its own output as an input and therefore build upon that. It lets them break a complex problem down into a series of simpler steps which they can see (output becomes input) and build upon.
Sure, but why does that make the model more effective? Are you sure it's "breaking the problem down into simpler steps", or is it just appearing to do so? How does this breakdown happen in the model, exactly? If we can better understand the mechanics involved, then maybe this process can be built into a new model that can achieve the same results more efficiently instead of as a recursive process that runs the model more than once.
You can think of an LLM as a production line - feed a series of tokens in, and they get embedded and then processed through the system one step at a time through however many transformer layers the model has (undisclosed for most recent models, but GPT-3 has 96).
Those fixed 96 (or whatever) steps of processing limit the complexity of what the model can do, so it will fail if the task is too complicated unless it breaks it down into simpler steps that each can be done well with that depth (96 steps) of processing.
It's not just appearing to do so - with chain-of-thought prompting you are literally telling it to "think step by step" as part of the prompt, so this is what it outputs. You could also tell it to generate a step by step plan, then elaborate on each of those steps.
I don't think we can say exactly how it is deciding to break a task into steps, any more than we can in general say exactly how these LLMs are working, but intuitively it's similar to how we think and talk (which is what the LLM is trained on) - a good speaker/writer will introduce a complex topic as a top-down decomposition.
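Here's the "production line" point as a schematic Python sketch (the layer count echoes the GPT-3 figure above; everything else is a placeholder, not a real architecture):

    NUM_LAYERS = 96

    def forward_pass(hidden_states, layers):
        # one pass through a fixed stack of transformer layers per emitted token
        h = hidden_states
        for layer in layers:           # always the same depth, however hard the question
            h = layer(h)
        return h                       # produces logits for exactly one next token

    # Chain-of-thought doesn't make this stack deeper; it buys more total compute
    # by running the same fixed-depth pass once for every extra intermediate token.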
> Sure, but why does that make the model more effective?
If you look something up (on the internet, in a book, asking people directly) and receive two answers, one which describes the steps used to arrive at the answer, and one which doesn't, on average which one is more likely to be the correct answer?
As a prediction machine, an LLM is more constrained by what is likely to appear after a chain of reasoning.
You're just repeating the previous explanation with different words, so that's not really satisfactory. A mechanistic demonstration of how step by step reasoning tends to constrain the space of solutions to ones that are more likely to be correct would be an actual explanation, until then this is a just-so story.
> Sure, but why does that make the model more effective?
Because the model can't "think". There is no "reasoning" that goes into generating an answer. So for complex problems that require multiple steps of reasoning, the model needs to persist those intermediate steps in order to be able to build up to a solution.
How does the non-CoT system generate the second word if it's not using the output as input? Or do you mean that non-CoT systems use only the latest output word when computing the next word, not all the earlier words from the output?
Every output word is always appended to the prompt, so if the prompt is P, then after W1 (word 1) is output the input is P W1, then P W1 W2, etc. So, the LLM is only looking at its own PAST words, not planning ahead.
If you do CoT (think step-by-step), the difference is that it has broken the prompt request/problem down into steps, so while it's still only seeing its own past words, those now include all of step 1, which helps it generate step 2, etc., until it eventually combines all the steps it generated into a complete answer.
This is basically what humans do with frameworks that help organize our thoughts and ensure a more methodical and complete way to think through an issue.