People have done experiments trying to get GPT-4 to come up with viable conjectures. So far it does such a woefully bad job that it isn't worth even trying.
Unfortunately there are rather a lot of issues which are difficult to describe concisely, so here is probably not the best place.
Primary amongst them is the fact that an LLM would be a horribly inefficient way to do this. There are much, much better ways, which have been tried, with limited success.
Whereas your post sounds like "Just give the approach more time, it shall continue to incrementally improve until it finally works someday, cuz reasons."
Early attempts at human flight approached it by strapping wings to people's arms and flapping: Do you think that would have eventually worked too, if only we had just given it a bit more time and faith?
> Early attempts at human flight approached it by strapping wings to people's arms and flapping: Do you think that would have eventually worked too, if only we had just given it a bit more time and faith?
Interestingly, we now have human powered aircraft... We have flown ~60km with human leg power alone. We've also got human powered ornithopters (flapping wing designs) which can fly, but only for very short times before the pilot is exhausted.
I expect that another 100 years from now, both records will be exceeded, although probably for scientific curiosity more than because human powered flight is actually useful.
> Just give the approach more time, it shall continue to incrementally improve until it finally works someday, cuz reasons
Yes. Because we haven't yet reached the limit of deep learning models. GPT-3.5 has 175 billion parameters. GPT-4 has an estimated 1.8 trillion parameters. That was nearly a year ago. Wait until you see what's next.
Why would adding more parameters suddenly make it better at this sort of reasoning? It feels a bit like a “god of the gaps”, where it'll just stop being a stochastic parrot in just a few more million parameters.
I don't think it's guaranteed, but I do think it's very plausible because we've seen these models gain emergent abilities at every iteration, just from sheer scaling. So extrapolation tells us that they may keep gaining more capabilities (we don't know how exactly it does it, though, so of course it's all speculation).
I don't think many people would describe GPT-4 as a stochastic parrot already... when the paper that coined (or at least popularized) the term came out in early 2021, the term made a lot of sense. In late 2023, with models that at the very least show clear signs of creativity (I'm sticking to that because "reasoning" or not is more controversial), it's relegated to reductionistic philosophical arguments, but not really a practical description anymore.
I don’t think we should throw out the stochastic parrot so easily. As you say, there are “clear signs of creativity”, but that could just be it getting significantly better as a stochastic parrot. We have no real test to tell mimicry apart from reasoning, and as you note we can also only speculate about how any of it works. I don’t think it’s reductionist in light of that; maybe cautious or pessimistic.
They can write original stories in a setting deliberately designed to not be found in the training set (https://arxiv.org/abs/2310.08433). To me that's rather strong evidence of being beyond stochastic parrots by now, although I must concede that we know so little about how everything works, that who knows.
I didn't look at the paper but... How do you design a setting in a way that you're sure there isn't a similar one in the training set, when we don't even precisely know what the training set for the various GPT models was?
The setting in the paper is about narrating a single combat between Ignatius J. Reilly and a pterodactyl. Ignatius J. Reilly is a literary character with some very idiosyncratic characteristics, that appears in a single book, where he of course didn't engage in single combats at all or interact with pterodactyls. He doesn't seem to have been the target of fanfiction either (which could be a problem if characters like, say, Harry Potter or Darth Vader were used instead), so the paper argues that it's very unlikely that a story like that had been ever written at all prior to this paper.
Well, we've been writing stories for thousands of years, so I'm a bit skeptical that the concept of "unlikely enough to exist" is a thing. More to the specific example, maybe there isn't a story about this specific character fighting a pterodactyl, but surely there are tons of stories of people fighting all kind of animals, and maybe there are some about someone fighting a pterodactyl too.
Sure, but the evaluation explicitly addresses (among other points) how well that specific character is characterized. If an LLM took a pre-existing story about (say) Superman fighting a pterodactyl, and changed Superman to Ignatius J. Reilly, it wouldn't get a high rating.
Do you know how that “creativity” is achieved? It’s done with a random number generator. Instead of having the LLM pick the absolute most likely next token, they have it sample from the probability distribution over next tokens - “temperature” controls how sharply that distribution is peaked, and strategies like top-k cap how many candidates are considered.
Set temperature to 0, and the LLM will talk in circles and not really say anything interesting. Set it too high and it will output nonsense.
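To make that concrete, here is a minimal sketch of how temperature and top-k sampling work (illustrative Python; sample_next_token and its defaults are invented for this example, not any real library's API):

    import numpy as np

    def sample_next_token(logits, temperature=1.0, top_k=50, rng=None):
        """Pick a next-token id from the model's raw scores (logits).

        temperature rescales the distribution: ~0 means greedy argmax,
        high values flatten it toward uniform noise. top_k restricts
        sampling to the k highest-scoring candidates.
        """
        rng = rng or np.random.default_rng()
        if temperature == 0:
            return int(np.argmax(logits))  # deterministic: always the single most likely token
        scaled = np.asarray(logits, dtype=float) / temperature
        top = np.argsort(scaled)[-top_k:]    # keep only the k best candidates
        probs = np.exp(scaled[top] - scaled[top].max())
        probs /= probs.sum()                 # softmax over the surviving candidates
        return int(rng.choice(top, p=probs))

All the "creativity" knobs live in that final draw; the model itself only ever produces the scores.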
The whole design of LLMs doesn’t seem very well thought out. Things are done a certain way not because it makes sense but because it seems to produce “impressive” results.
I know that, but to me that statement isn't much more helpful than "modern AI is just matrix multiplication" or "human intelligence is just electric current through neurons".
Saying that it's done with a random number generator doesn't really explain the wonder of achieving meaningful creative output, as in being able to generate literature, for example.
> Set temperature to 0, and the LLM will talk in circles and not really say anything interesting. Set it too high and it will output nonsense.
Sounds like some people I know, at both extremes.
> The whole design of LLMs doesn’t seem very well thought out. Things are done a certain way not because it makes sense but because it seems to produce “impressive” results.
They have been designed and trained to solve natural language processing tasks, and are already outperforming humans on many of those tasks. The transformer architecture is extremely well thought out, based on extensive R&D. The attention mechanism is a brilliant design. Can you explain exactly which part of the transformer architecture is poorly designed?
People use the term "stochastic parrot" in different ways ... some just as a put-down ("it's just autocomplete"), but others like Geoff Hinton acknowledging that there is of course some truth to it (an LLM is, at the end of the day, a system whose (only) goal is to predict "what would a human say"), while pointing out the depth of "understanding" needed to be really good at this.
There are fundamental limitations to LLMs though - a limit to what can be learned by training a system to predict the next word from a fixed training corpus. It can get REALLY good at that task, as we've seen, to the extent that it's not just predicting the next word but rather predicting an entire continuation/response that is statistically consistent with the training set. However, what is fundamentally missing is any grounding in anything other than the training set, which is what causes hallucinations/bullshitting. In a biological intelligent system predicting reality is the goal, not just predicting what "sounds good".
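For concreteness, the pre-training objective being described is just cross-entropy on the next token over a fixed corpus. A bare sketch (Python/PyTorch; model is a placeholder assumed to map token ids to per-position vocabulary logits):

    import torch.nn.functional as F

    def next_token_loss(model, token_ids):
        """Language-modeling objective: predict token t+1 from tokens 0..t.

        token_ids: (batch, seq_len) ids drawn from the fixed training corpus.
        model(x) is assumed to return logits of shape (batch, seq_len, vocab).
        """
        inputs, targets = token_ids[:, :-1], token_ids[:, 1:]  # shift by one position
        logits = model(inputs)
        return F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),  # (batch*seq, vocab)
            targets.reshape(-1),                  # (batch*seq,)
        )

The only feedback in that loop is agreement with the corpus; nothing ever checks the output against reality, which is exactly the grounding gap described above.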
LLMs are a good start in as much as they prove the power of prediction as a form of feedback, but to match biological systems we need a closed-loop cognitive architecture that can predict then self-correct based on mismatch between reality and prediction (which is what our cortex does).
For all of the glib prose that an LLM can generate, even if it seems to understand what you are asking (after all, it was trained with the goal of sounding good), it doesn't have the intelligence of even a simple animal like a rat that doesn't use language at all, but is grounded in reality.
> even if it seems to understand what you are asking (after all, it was trained with the goal of sounding good)
It was trained not only to "sound good" aesthetically but also to solve a wide range of NLP tasks accurately. It not only "seems to" understand the prompt but it actually does have a mechanical understanding of it. With ~100 layers in the network it mechanically builds a model of very abstract concepts at the higher layers.
> it doesn't have the intelligence of even a simple animal
It has higher intelligence than humans by some metrics, but no consciousness.
> It was trained not only to "sound good" aesthetically but also to solve a wide range of NLP tasks accurately.
Was it? I've only heard of pre-training (predict next word) and subsequent RLHF + SFT "alignment" (incl. aligning to goal of being conversational). AFAIK the NLP skills that these LLMs achieve are all emergent rather than explicitly trained.
I'm not sure we can really say the net fully understands even if it answers as if it does - it was only trained to "predict next word", which in effect means being trained to generate a human-like response. It will have learnt enough to accomplish that goal, and no more (training loss bottoms out as the goal is met).
Contrast this to an animal with a much richer type of feedback - reality, and with continual (aka online) learning. The animal truly understands its actions - i.e. has learnt to accurately predict what will happen as a result of them.
The LLM does not understand its own output in this sense - it exists only in a world of words, and has no idea if the ideas it is expressing are true or not (hence all the hallucinating/bullshitting). It only knew enough to generate something that sounded like what a person might say.
> Was it? I've only heard of pre-training (predict next word) and subsequent RLHF + SFT "alignment" (incl. aligning to goal of being conversational). AFAIK the NLP skills that these LLMs achieve are all emergent rather than explicitly trained.
I believe you are right about that. I did some research after reading your comment. Transformers were certainly designed for NLP, but with large enough models the abilities can emerge without necessarily being explicitly trained for it.
> I'm not sure we can really say the net fully understands even if it answers as if it does - it was only trained to "predict next word", which in effect means being trained to generate a human-like response.
It depends on your definition of "understand". If that requires consciousness then there is no universally agreed formal definition.
Natural Language Understanding (NLU) is a subset of Natural Language Processing (NLP). If we take the word "understanding" as used in an academic and technical context then yes they do understand quite well. In order to simply "predict the next word" they learn an abstract model of syntax, semantics, meaning, relationships, etc, from the text.
> and has no idea if the ideas it is expressing are true or not (hence all the hallucinating/bullshitting).
That is not really an issue when solving tasks that are within its context window. It is an issue for factual recall. The model is not a type of database that stores its training set verbatim. Humans have analogous problems with long term memory recall. I can think straight within my working memory but my brain will "hallucinate" to some extent when recalling distant memories.
The context window only has to do with the size of input it has access to - it's not related to what it's outputting, which is ultimately constrained by what it was trained on.
If you ask it a question where the training data (or input data = context) either didn't include the answer, or where it was not obvious how to get the right answer, that will not (unfortunately) stop it from confidently answering!
> The context window only has to do with the size of input it has access to - it's not related to what it's outputting, which is ultimately constrained by what it was trained on.
Wait a minute. You are completely missing the entire "attention mechanism" thing, which is what makes transformers so capable. For each output token generated in sequence, the attention mechanism evaluates the current token's relationship to all tokens in the context window, weighing their relevance. There are multiple "attention heads" running in parallel (16 in GPT-3.5). Now for each layer of the neural network there is an attention mechanism, independently processing the entire context window for each token. There are ~100 layers in ChatGPT. So now we have 100 layers times 16 attention heads = 1600 attention mechanisms evaluating the entire context window, over many deep layers of abstraction, for each output token.
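In code terms, each of those heads is just scaled dot-product attention over the whole context. A stripped-down sketch (Python/NumPy; the causal mask and output projection of a real transformer layer are omitted for brevity):

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def multi_head_attention(x, Wq, Wk, Wv, n_heads):
        """x: (seq_len, d_model) token representations for the context window.

        Every position attends to every position: each head scores the
        relevance of all token pairs, then mixes the value vectors accordingly.
        """
        seq_len, d_model = x.shape
        d_head = d_model // n_heads
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        # split into heads: (n_heads, seq_len, d_head)
        q, k, v = (m.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
                   for m in (q, k, v))
        scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq) pairwise relevance
        out = softmax(scores) @ v                            # relevance-weighted mix of values
        return out.transpose(1, 0, 2).reshape(seq_len, d_model)  # heads re-concatenated

Stack ~100 such layers, each with its own set of heads, and you get the 1600-attention-mechanisms figure above.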
I'm not sure what your point is ... Hallucinations are where the net hadn't seen enough training data similar/related to the prompt to enable it to generate a good continuation/response. Of course in cases where it is sufficiently trained and the context contained what it needs, then it can make full use of it, even copying context words to the output (in-context learning) when appropriate.
The real issue isn't that the net often "makes a statistical guess" rather than saying "I don't know", but rather that when it does make errors it has no way to self-detect the error and learn from the mistake, as a closed-loop biological system is able to do.
> If you ask it a question where the training data (or input data = context) either didn't include the answer, or where it was not obvious how to get the right answer, that will not (unfortunately) stop it from confidently answering!
I haven't found this to be the case in my experience. I use ChatGPT-4. It often tells me when it doesn't know or have enough information.
If you haven't used GPT-4 I recommend signing up for a month. It is next level, way better than 3.5 (~10x the parameter count). (No I'm not being paid to recommend it.)
I read that paper back in the day and honestly I don't find it very meaningful.
What they find is that for every emergent ability where an evaluation metric seems to have a sudden jump, there is some other underlying metric that is continuous.
The thing is that the metric with the jump is the one people would actually care about (like actually being able to answer questions correctly, etc.) while the continuous one is an internal metric. I don't think that refutes the existence of emergent abilities, it just explains a little bit of how they arise.
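A toy illustration of the paper's point (numbers invented for illustration): if per-token accuracy p improves smoothly with scale, exact-match on a 10-token answer behaves roughly like p**10 (assuming independent errors), which looks like an ability switching on all at once:

    # Smooth per-token accuracy vs. the "emergent-looking" exact-match
    # metric on a 10-token answer (illustrative numbers only).
    for p in (0.5, 0.7, 0.9, 0.95, 0.99):
        print(f"per-token acc {p:.2f} -> exact match on 10 tokens: {p**10:.3f}")
    # 0.50 -> 0.001, 0.90 -> 0.349, 0.99 -> 0.904: smooth underneath,
    # a sharp jump in the metric people actually report.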
Why would it not? We've observed them getting significantly better through multiple iterations. It is quite possible they'll hit a barrier at some point, but what makes you believe this iteration will be the point where the advances stop?
No I'm not; that's what this whole sub-thread is about: how bad LLMs are at the stuff that's described in the OP.
For context this is the grandparent within which my original reply was scoped:
> I feel very comfortable saying, as a mathematician, that the ability to solve grade school maths problems would not be at all a predictor of ability to solve real mathematical problems at a research level.
> The reason LLMs fail at solving mathematical problems is because: 1) they are terrible at arithmetic, 2) they are terrible at algebra, but most importantly, 3) they are terrible at complex reasoning (more specifically, they mix up quantifiers and don't really understand the complex logical structure of many arguments), and 4) they (current LLMs) cannot backtrack when they find that what they already wrote turned out not to lead to a solution, and it is too expensive to give them the thousands of restarts they'd require to randomly guess their way through the problem if you did give them that facility.
> Solving grade-school problems might mean progress in 1 and 2, but that is not at all impressive, as there are perfectly good tools out there that solve those problems just fine, and old-style AI researchers have built perfectly good tools for 3. The hard problem to solve is problem 4, and this is something you teach people how to do at a university level.
> (I should add that another important problem is what is known as premise selection. I didn't list that because LLMs have actually been shown to manage this ok in about 70% of theorems, which basically matches records set by other machine learning techniques.)
> (Real mathematical research also involves what is known as lemma conjecturing. I have never once observed an LLM do it, and I suspect they cannot do so. Basically the parameter set of the LLM dedicated to mathematical reasoning is either large enough to model the entire solution from end to end, or the LLM is likely to completely fail to solve the problem.)
> I personally think this entire article is likely complete bunk.
> Edit: after reading replies I realise I should have pointed out that humans do not simply backtrack. They learn from failed attempts in ways that LLMs do not seem to. The material they are trained on surely contributes to this problem.
Humans and other animals are definitely different when it comes to reasoning. At the same time, biologically humans and many other animals are very similar when it comes to the brain, but humans have more "processing power". So it's only natural to expect some emergent properties from increasing the number of parameters.
> it’ll just stop being a stochastic parrot in just a few more million parameters.
It is not a stochastic parrot today. Deep learning models can solve problems, recognize patterns, and generate new creative output that is not explicitly in their training set. Aside from adding more parameters there are new neural network architectures to discover and experiment with. Transformers aren't the final stage of deep learning.
Probabilistically serializing tokens in a fashion that isn't 100% identical to training set data is not creative in the context of novel reasoning. If all it did was reproduce its training set it would be the grossest example of overfitting ever, and useless.
Any actually creative output from these models is by pure random chance, which is most definitely different from the deliberate human reasoning that has produced our intellectual advances throughout history. It may or may not be inferior: there's a good argument to be made that "random creativity" will outperform human capabilities due to the sheer scale and rate at which the models can evolve, but there's no evidence that this is the case (right now).
There is also no evidence for your conjecture about there being some sort of grand distinction between "probabilistically serializing tokens" and "deliberate human reasoning" other than scale. There might be, but there is no evidence.
There's plenty of evidence that humans reason differently than ML models; namely basically any human intellectual discovery in history versus the (approximately) zero randomly generated ones by ML.
We don't know exactly how human reasoning works, but the observational evidence clearly indicates it is not by randomly piecing together tokens already known.
> There's plenty of evidence that humans reason differently than ML models; namely basically any human intellectual discovery in history versus the (approximately) zero randomly generated ones by ML.
This reasoning is invalid. For fun, I checked if GPT4 would catch the logical errors you made, and it did. Specifically, it correctly pointed out that absence of evidence is not evidence of absence. But even if there had been evidence of absence, this reasoning is invalid because it presumes that human reasoning must result in intellectual discovery irrespective of how it is employed, and so that if we can't find intellectual discoveries, it must mean an absence of human reasoning. In other words, it invalidly assumes that a difference in outcomes must represent a difference in the structure of reasoning. This is trivially invalid because humans think without making intellectual discoveries all the time.
However, it's also a strawman because I did not claim that humans and ML models reason the same way. I claimed there is no evidence of 'some sort of grand distinction between "probabilistically serializing tokens" and "deliberate human reasoning" other than scale'.
1) This explicitly recognizes that there is a difference, but that it might be just scale, and that we don't have evidence it doesn't. Your argument fails to address this entirely.
2) Even at scale, it does not claim they would be the same, but argues we don't have evidence that "probabilistically serializing tokens" must be inherently different from "deliberate human reasoning" to an extent sufficient to call it "some sort of grand distinction". We can assume with near 100% certainty that there are differences - the odds of us happening upon the exact same structure are near zero. That does not, however, mean that we have any basis for saying that human reasoning isn't just another variant of "probabilistically serializing tokens".
I'll note that unlike you, GPT4 also correctly interpreted my intent when asked to review the paragraph and asked whether it implies the two must function the same. I *could* take that to imply that LLMs are somehow better than humans at reasoning, but that would be logically invalid for the same reasons as your argument.
> We don't know exactly how human reasoning works, but the observational evidence clearly indicates it is not by randomly piecing together tokens already known.
Neither do LLMs. Piecing together tokens in a stochastic manner based on a model is not "randomly piecing together" - the model guides the process strongly enough that it's a wildly misleading characterization, as you can indeed trivially demonstrate by actually randomly piecing together words.
But even if we assume a less flippant and misleading idea of what LLMs do, your claim is incorrect. Observational evidence does nothing of the sort. If anything, the rapidly closing gap between human communication and LLMs shows that while there are very likely structural differences at the low level, it is increasingly unclear whether they amount to a material distinction. In other words, it's unclear whether the hardware and even the hardwired network matter much relative to the computational structure the trained model itself creates.
You're welcome to your beliefs - but they are not supported by evidence. We also don't have evidence the other way, so it's not unreasonable to hold beliefs about what the evidence might eventually show.
Ever heard of something called diminishing returns?
The value improvement between 17.5b parameters and 175b parameters is much greater than the value improvement between 175b parameters and 18t parameters.
IOW, each time we throw 100 times more processing power at the problem, we get a measly 2x increase in value.
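That shape matches the published scaling-law fits, e.g. Kaplan et al. (2020), where test loss falls roughly as a power law in parameter count N. A toy illustration (the exponent is from that paper; the baseline and the "value" framing are mine):

    # Power-law scaling: loss ~ N ** -alpha, with alpha = 0.076 the
    # parameter-count exponent reported by Kaplan et al. (2020).
    alpha = 0.076
    for n in (17.5e9, 175e9, 1.75e12, 17.5e12):
        print(f"{n:9.3g} params -> loss {(n / 17.5e9) ** -alpha:.2f}x the 17.5B baseline")
    # Each further 10x in parameters buys the same ~0.84x multiplicative
    # loss reduction, i.e. steeply diminishing returns per parameter.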
You are missing the point that it can be a model limit. LLMs were a breakthrough, but that doesn't mean they are a good model for some other problems, no matter the number of parameters. Language contains more than we thought, as GPT has impressively shown (i.e. semantics embedded in the syntax, emerging from text compression), but still not every intellectual process is language based.
You were talking about the number of parameters on existing models. As the history of Deep Learning has shown, simply throwing more computing power at an existing approach will plateau and not result in a fundamental breakthrough. Maybe we'll find new architectures, but the point was that the current ones might be showing their limits, and we shouldn't expect the models to suddenly become good at something they are currently unable to handle just because "more parameters".
Yes, you're right, I only mentioned the size of the model. The rate of progress has been astonishing and we haven't reached the end, in terms of both size and algorithmic sophistication of the models. There is no evidence that we have reached a fundamental limit of AI in the context of deep learning.
Indeed. An LLM is an application of a transformer trained with backpropagation. What stops you from adding a logic/mathematics "application" on the same transformer?