>All evidence suggests that current LLMs don't scale (EDIT: they do scale, but at a certain point somewhere around GPT4 the scaling slows down very quickly)
There seems to actually be no evidence that LLMs don't scale. All the data from LLMs leading up to GPT4 indicates they actually scale incredibly well.
Do you have a single paper you can point to that says otherwise?
Not OP, but no amount of data will ever enable an LLM to know when it doesn't know something, for instance. That's a fundamental limitation with the current architecture.
An external source of information that is not self-accountable is different from a situation where we are both the learner and the doer: when we notice mistakes, we learn from them. We're always on the lookout for errors, and if they happen, we (usually) feel solely responsible for their effects.
Again, you're talking about something different from scaling. There are many different aspects of emotions, life, or even learning. But the comment was about scaling learning, which can be unaffected by other aspects, such as whether what you learned was incorrect.
More succinctly:
I can still teach Simpson's Rule in a Calculus class to a young man or woman who subscribes to Creationism, or the Flat Earth theory. None of their false knowledge affects their innate ability to take in new knowledge.
But what is the purpose of "scaling"? Broadly speaking, it means that there is a marked and obtainable improvement on an objective function in response to additional data. But for certain criteria, like the possession of knowledge, it appears attention-based LLMs aren't capable of that, regardless of the amount of data. So they don't "scale" if you use that as an objective function.
Whether the issue is negation in the training data, exhaustion, or unknowable unknowns like future events, it's doubtful an LLM will ever be able to say it "doesn't know" reliably.
LLMs have no real concept of truth or correctness; they just probabilistically append to the output to match learned patterns.
They're trained (and possibly scripted) to say they don't "know" certain things or can't do certain things.
But at their core, they're going to generate the most plausible continuation, even if that continuation isn't actually very plausible (like hallucinating non-existent functions). I think this can be improved (e.g. by adjusting the temperature in self-hosted models depending on the nature of the task), but it doesn't seem fully solvable with current LLMs.
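For illustration, a minimal sketch of what "playing with the temperature" changes at sampling time: the logits are divided by a temperature before the softmax, so low values make the most plausible continuation dominate and high values flatten the choice. The toy logits here are made up, not from any real model.

```python
import math

def sample_probs(logits: list[float], temperature: float = 1.0) -> list[float]:
    # Scale logits by temperature, then apply a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                      # toy next-token scores
print(sample_probs(logits, temperature=0.2))  # sharp: top token dominates
print(sample_probs(logits, temperature=1.5))  # flatter: more varied picks
```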
Do you know much about how this all works? I've been thinking about how a human child ingests (maybe) petabytes of tokens every day through video and action. They initiate movement and observe the consequences through video, audio, proprioception, etc. The decades of schooling afterwards seem like fine-tuning on a base model.
The base model is likely to have a lot of genetically determined structure that is there before any training: some animals like horses can walk and perceive/navigate very soon after being born. There's also an innate fear of snake shapes in most mammals.
Another thing that's bothered me is how often our models start from scratch (aside from fine-tuning or transfer learning). I wish we could start with, say, a smaller model at low resolution and grow it into a higher-resolution one, or a small LLM that can be grown into different specialisations (which we sort of do with fine-tuning - but do they all need the same number of layers, the same size of each layer, etc.?).
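One existing idea along these lines is Net2Net-style function-preserving widening: duplicate some hidden units and rescale their outgoing weights so the bigger network starts out computing (roughly) what the small one did. The sketch below is an illustrative assumption of how that could look for one pair of linear layers, not a recipe from the thread.

```python
import torch
import torch.nn as nn

def widen_hidden(layer1: nn.Linear, layer2: nn.Linear, new_width: int):
    """Grow the hidden dimension between two linear layers from h to new_width,
    approximately preserving the function the smaller network computes."""
    h = layer1.out_features
    assert new_width > h
    # Choose which existing hidden units to duplicate.
    extra = torch.randint(0, h, (new_width - h,))
    idx = torch.cat([torch.arange(h), extra])

    big1 = nn.Linear(layer1.in_features, new_width)
    big2 = nn.Linear(new_width, layer2.out_features)

    with torch.no_grad():
        # Incoming weights/biases are copied for duplicated units.
        big1.weight.copy_(layer1.weight[idx])
        big1.bias.copy_(layer1.bias[idx])
        # Outgoing weights are divided by the number of copies of each unit,
        # so the summed contribution of the copies matches the original unit.
        counts = torch.bincount(idx, minlength=h).float()
        big2.weight.copy_(layer2.weight[:, idx] / counts[idx])
        big2.bias.copy_(layer2.bias)
    return big1, big2
```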
I don't understand what you mean. To improve performance using the same basic architecture, the model needs to scale both compute and training data. Where are we going to find another 10x of the web-sized text training corpus?
And that is just web-scraped data. There are trillions of valuable tokens' worth of text in the likes of PDFs/ebooks, academic journals and other documents that essentially have no web presence otherwise.
>To improve performance using the same basic architecture, the model needs to scale both compute and training data.
What you are trying to say here is that you need to scale both parameters and data, since scaling data already increases compute.
That said, it's not really true. It would be best to scale both data and parameters, but you can get increased performance by scaling just one or the other.
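A rough sketch of why scaling either axis alone still helps: in a Chinchilla-style loss fit L(N, D) = E + A/N^alpha + B/D^beta, the loss drops if you grow either parameters N or training tokens D. The constants below are approximately the ones reported by Hoffmann et al. (2022); treat them as illustrative, not gospel.

```python
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    # Estimated pretraining loss as a function of parameter and token counts.
    return E + A / n_params**alpha + B / n_tokens**beta

base = chinchilla_loss(70e9, 1.4e12)          # roughly a Chinchilla-sized run
more_data = chinchilla_loss(70e9, 2.8e12)     # 2x tokens, same parameters
more_params = chinchilla_loss(140e9, 1.4e12)  # 2x parameters, same tokens
print(base, more_data, more_params)           # either doubling lowers the loss
```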
The 6T token dataset is surely a high quality subset / refined extract from much larger public datasets. It's misleading to compare their sizes directly.
We don't know what the dataset is. "high quality subset from much larger public datasets" is not just inherently speculation, it's flat out wrong as no such public datasets existed when GPT-4 was trained.
I think the main argument is based on diminishing returns. They expected a larger improvement in performance given the 5x increase in the number of parameters.
> To study how the parallel structure of transformers might limit their capabilities, the pair considered the case where transformers didn’t feed their output back into their input — instead, their first output would have to be the final answer. They proved that the transformers in this theoretical framework couldn’t solve any computational problems that lie outside a specific complexity class. And many math problems, including relatively simple ones like solving linear equations, are thought to lie outside this class.
> But no matter how a prompt is phrased, as long as it causes a language model to output step-by-step solutions, the model can in principle reuse the results of intermediate steps on subsequent passes through the transformer. That could provide a way to evade the limits of parallel computation... They quantified how that extra computational power depends on the number of intermediate steps a transformer is allowed to use before it must spit out a final answer. In general, researchers expect the appropriate number of intermediate steps for solving any problem to depend on the size of the input to the problem. ...Indeed, Merrill and Sabharwal proved that chain of thought only really begins to help when the number of intermediate steps grows in proportion to the size of the input, and many problems require the number of intermediate steps to grow much larger still.
So there’s a large space of O(n) and O(constant) problems that LLMs seem to scale very well on, but if they need O(n^2) examples in their training data to learn how to solve O(n^2) problems (even with prompt engineering!) then that’s a hard limit.
ETA: there’s also good (IMO irrefutable) evidence that AI researchers have deluded themselves about the scalability of LLMs because of motivated statistical reasoning: https://arxiv.org/abs/2304.15004
In fact it seems to me that “LLMs scale well” means “LLMs can solve O(n) problems if you give them O(n) examples in their training data,” which is actually not very interesting. The interesting thing is that the internet + Moore’s law allows us to build (and compute on) O(n) training data for an enormous variety of problems.
>> To study how the parallel structure of transformers might limit their capabilities, the pair considered the case where transformers didn’t feed their output back into their input — instead, their first output would have to be the final answer. They proved that the transformers in this theoretical framework couldn’t solve any computational problems that lie outside a specific complexity class. And many math problems, including relatively simple ones like solving linear equations, are thought to lie outside this class.
This seems like somewhat of an artificial constraint? E.g. we can explicitly ask GPT4 to solve problems using code, which I wouldn't think has the same limits. The model might not be able to solve a problem directly, but it can in code.
>ETA: there’s also good (IMO irrefutable) evidence that AI researchers have deluded themselves about the scalability of LLMs because of motivated statistical reasoning: https://arxiv.org/abs/2304.15004
Curious to hear how this relates to scaling. I think a lot of the emergence dialog is just semantics. I.e. if GPT3 can only score 50% on a 3-digit arithmetic test but GPT4 scores 80%, this 'ability' might not literally be emergent, but functionally it is: at a larger scale it 'gained' the ability to do 3-digit arithmetic. It's not clear how this indicates issues with scalability though. For example, in the paper the authors state: "However, if one changes from nonlinear Accuracy to linear Token Edit Distance while keeping the models' outputs fixed, the family's performance smoothly, continuously and predictably improves with increasing scale"
> This seems like somewhat of an artificial constraint?...The model might not be able to solve a problem directly but can in code.
First of all, I would say that it's not an "artificial constraint" so much as what you're saying is an artificial mitigation. When it comes to human programmers, we expect them to be able to execute small O(n^2) and O(2^n) algorithms on paper, and if they're not able to we rightfully question whether or not they actually understand the algorithm versus just having a bunch of code memorized.
But the bigger problem is that the O(n) limitation doesn't just apply to formal math/CS problems; many natural language queries have a computational complexity, and I believe this result puts a limit on those that GPT4 can (easily) solve. I'll give a concrete example from my testing GPT-4 last year[1]. Naively GPT-4 struggled to solve two simple problems:
1) Given a pair of sentences, choose the one that has the most words.
2) Given a list of n numbers, choose the largest.
Without chain-of-thought, GPT-4 was not able to solve these problems reliably because it was clearly using spurious statistical cues and "guessing," e.g. seeming to think "antidisestablishmentarian" was always in a sentence with many words. Using chain-of-thought prompting ("let's think step-by-step") GPT-4 was able to solve these problems easily.
However, these are both O(n) problems. If you asked GPT-4 "given a list of n sentences, choose the one with the most words. Let's think step by step:" it could not reliably solve the problem! It either counted the words in each sentence step-by-step but then guessed which count was the largest, or it guessed the word counts and then compared them step-by-step. Its ability to solve two separate O(n) problems did not extend to an O(mn) problem - you would need to tailor the chain-of-thought prompt and basically hold GPT's hand to get the right answer. I think this theoretical result puts my experiment on firmer ground: transformer LLMs can be trained on a variety of linear-complexity problems, but they cannot natively extend those results to quadratic complexity.
And yes, GPT could put this in code and get the right answer. But you'd have to train it to write a Python function instead of using its own abilities. I don't think "if the problem seems too complex, try writing a Python function" will work as a prompting strategy: it's GPT-4's inability to recognize quadratic complexity that's the root cause of the problem. (It's also why chain-of-thought isn't a cure-all - humans need to decide whether CoT is appropriate! If you use CoT prompting on a simple factual question you'll get a bunch of useless "steps.")
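For concreteness, here is the kind of code the model could emit for the composite task above: trivially composing the two O(n) steps into an O(mn) program. The example sentences are made up for illustration.

```python
def sentence_with_most_words(sentences: list[str]) -> str:
    # Step 1: count the words in each sentence (linear in its length).
    word_counts = [len(s.split()) for s in sentences]
    # Step 2: pick the sentence with the largest count (linear in n).
    return sentences[word_counts.index(max(word_counts))]

sentences = [
    "Antidisestablishmentarianism is long.",
    "This sentence has quite a few more words than the other one does.",
]
print(sentence_with_most_words(sentences))
```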
Going back to the scaling - the promise of scaling is that eventually big data pays off and the LLM gains some sort of generalizable understanding of the problem. But if the performance varies smoothly and predictably, then that's probably not happening - LLMs never "grok" subjects, they just screw things up less and less. E.g. if GPT-3 got a 50% on 3-digit arithmetic because its training data included 30% of all 3-digit arithmetic problems, and GPT-4 got an 80% because its training data includes 50% of all such problems, then that's a vacuous and uninteresting form of scaling - there are only ~2,000,000 such problems so this isn't as ridiculous as it might sound. Likewise if GPT-4 had the same training data but doubled its performance because it doubled the resources spent on arithmetic training. The point isn't to deny that scaling exists, it's to question whether it's worth all the money and excitement. I really don't think it is - it's not even interesting.
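A quick sanity check on the "~2,000,000 problems" figure, counting ordered pairs of 3-digit operands; which operations to include is my assumption, since the comment doesn't say.

```python
operands = range(100, 1000)   # 900 three-digit numbers
pairs = len(operands) ** 2    # 810,000 ordered pairs per operation
print(pairs * 3)              # 2,430,000 with +, - and x: on the order of ~2 million
```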
[1] GPT-4 can solve these examples correctly now, but that's mostly a testament to the impossibility of doing reproducible investigation on closed LLMs. I strongly suspect OpenAI later trained GPT-4 on this and other "dumb" problems: certainly you could write a Python script to generate appropriate training data. The problem is you would need to generate an ~O(mn) amount of data for every pair of O(m) and O(n) problems that someone might want to ask in conjunction. I don't actually use LLMs for work, so maybe you could see how performance varies between Claude, Mixtral, etc.
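A sketch of the kind of script alluded to above: generating synthetic (prompt, answer) pairs for the composite "pick the sentence with the most words" task. The prompt format and word list are my own invention for illustration.

```python
import random

WORDS = ["alpha", "beta", "gamma", "delta", "epsilon", "zeta"]

def make_example(num_sentences: int = 4, max_words: int = 12) -> dict:
    # Build a few nonsense "sentences" of random length.
    sentences = [
        " ".join(random.choices(WORDS, k=random.randint(3, max_words)))
        for _ in range(num_sentences)
    ]
    # Ties just take the first match, which is fine for a sketch.
    answer = max(sentences, key=lambda s: len(s.split()))
    prompt = ("Which of these sentences has the most words?\n"
              + "\n".join(f"{i + 1}. {s}" for i, s in enumerate(sentences)))
    return {"prompt": prompt, "answer": answer}

dataset = [make_example() for _ in range(10_000)]
```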
I wonder if part of my disagreement is in thinking of GPT4 as a system rather than just as an LLM. If the system has the ability to solve problems in code, it feels like intentional handicapping to force it to solve problems in natural language. So it's not super interesting to me if you say that GPT4 the LLM can solve only O(n) problems when GPT4 the system can solve more complex ones. Though I do admit there's an issue where the prompter needs to preemptively identify problems that should be solved in code, and that's a significant shortfall.
> the promise of scaling is that eventually big data pays off and the LLM gains some sort of generalizable understanding of the problem. But if the performance varies smoothly and predictably, then that's probably not happening - LLMs never "grok" subjects
Is this any different from humans? Even a Nobel Prize-winning physicist will make some mistakes; does this indicate they don't grok physics?
>if GPT-3 got a 50% on 3-digit arithmetic because its training data include 30% of all 3-digit arithmetic problems, and GPT-4 got an 80% because its training data includes 50% of all such problems, then that's a vacuous and uninteresting form of scaling
Genuinely curious, is there any indication this is actually what's happening? It seems pretty trivial to generate novel questions in other domains for example and GPT4's performance seems to generalize to those questions. I.e. GPT4 can do well on unreleased GRE problems for example. Or do you think GPT4 is actually just memorizing everything and has 0 ability to generalize?