Characterizing emergent phenomena in large language models

mjburgess · on Dec 19, 2022

> we discuss the phenomena of emergent abilities, which we define as abilities that are not present in small models but are present in larger models

Reading anything by major researchers in AI feels like an adversarial battle where they're trying to misuse as much technical scientific and philosophical language as possible and we adjacent are trying to hold the line.

In philosophy and esp. the philosophy of science, emergence is a relation between a whole and its parts such that a property of the whole does not obtain just in virtue of properties of its parts taken in isolation. "Emergence" has this prior positive, semi-magical, scientific association which confuses the issue in this case.

No properties of the LLM obtain from its parts differently as parameters scale, the mechanism is the same. The performance differs not due to emergence, but due to the "modelling gap" present between the statistical structure of free text and that of mathematics. With enough examples, the gap closes... indeed, you can model the addition function (add(x, y) = x + y) just by an infinite sample of its domains.

A better technical term here might be "scale-dependent capabilities". For LLM, simple arithmetic is extremely scale dependent, whereas basic text generation is less-so. The reason for this seems obvious, as given above... so the use of the term "emergence" here I interpert as more PRish mystification.

naasking · on Dec 19, 2022

I think "scale-dependent capability" is a much more precise term for what they're describing, but I'm not sure that that term doesn't fall under the general umbrella of emergent properties. The opening paragraphs of emergent properties in philosophy [1] cites a number of examples that I would argue are comparable to LLM suddenly becoming able to do arithmetic past a certain scale.

> In philosophy and esp. the philosophy of science, emergence is a relation between a whole and its parts such that a property of the whole does not obtain just in virtue of properties of its parts taken in isolation.

This is not a settled definition. I think everyone can agree that this applies epistemologically, where studying the parts in isolation cannot always yield sufficient information to predict macroscopic properties, but to claim that all properties of the whole do not reduce to the properties of its parts is controversial.

For instance, it seems unlikely that we would have predicted H2O's dipole moment and the phenomenon of surface tension just from studying hydrogen atoms and oxygen atoms in isolation, but it would be incorrect to say that surface tension is not the result of the properties of hydrogen and oxygen in isolation. We simply cannot discover all the relevant properties without studying them together.

Edit: to clarify, H2O's dipole moment seems obvious in hindsight when we have a good model for what's going on, and analogously, LLM's ability to do arithmetic seems obvious as a scale-dependent property in hindsight, but that doesn't mean it was obvious that this would happen before it was created.

[1] https://plato.stanford.edu/entries/properties-emergent/

mjburgess · on Dec 19, 2022

Well it's scale dependence is an illusion.

It's uniformly able to model Q->SingleAnswer problems, and uniformly able to model Q-> ManyAnswer problems.

Basic arithmetic is the former kind, and it becomes useful to people at large scales, ie., accurate on basic arithmetic.

This dual condition of "useful-to-people" is the thing introducing the illusion, since it changes depending on what we're modelling. The system isnt acquring any new property.

Consider a researcher putting a book on a thin ice-sheet, and then putting a car on it. Here, they're concluding the ice has different properties in each case -- but it doesnt.

darawk · on Dec 19, 2022

> Consider a researcher putting a book on a thin ice-sheet, and then putting a car on it. Here, they're concluding the ice has different properties in each case -- but it doesnt.

This is just a linguistic shell game with the meaning of the word "property". You could just as easily say the difference between the mind of a human and a monkey is a matter of degree, and therefore going from one to the other does not gain any novel "property".

It should be obvious that the degree of a property can fundamentally change its nature, and that there is no hard distinction between "properties" and degrees of things. The difference between a tickle and a gunshot are matters of "degree", but that fact is of near zero semantic utility.

mjburgess · on Dec 19, 2022

Emergence is about intrinsic properties, observer-independent properties.

If emergence were about observer-relative properties it would be a meaningless term. My shoe gets an "emergent property" to hold my door open when I put it in a door way.

This is mumbojumbo.

Systems acquiring observer-relative "properties" are all well and good, but the claim here is a much stronger one.

This gross misuse of language amounts to saying that "models with enough parameters to accurately approximate a function" have "emergent properties" that "models without enough parameters" do not have.

This is a deeply silly way to describe the conditions under which one function can approximate another, and the rate at which that aproximation converges to something useful.

naasking · on Dec 19, 2022

> Emergence is about intrinsic properties, observer-independent properties. If emergence were about observer-relative properties it would be a meaningless term.

I'm going to address this in case it was also intended to reply to my other comment about "useful to people" possibly being a property.

"Useful to people" would be an observer-independent property, if it's a property at all. An alien species analyzing humanity would come to the same conclusions as humans about whether some system, like the internet, was useful to people. This would be evident by whether the use of that system spread.

As for whether it's "intrinsic", I'm not sure how you're applying this. As you said in a later comment, "Liquidity isn't [a property] of water without a container/pressure environment". In other words, liquidity isn't an intrinsic property of H2O. Moreover, the only reason we identify and created a label for "liquidity" is because it's useful to people, which is the very criterion that you're claiming should not be applied to describe some surprising scaling behaviour of LLMs.

I just don't think you've made the distinction you're attempting to make clear, because there is parity of reasoning between the allegedly emergent properties you describe and those in the article.

darawk · on Dec 19, 2022

Let's make this concrete. What in your mind is a specific example of a concrete system with an emergent property, then?

mjburgess · on Dec 19, 2022

Liquidity is not a property of h2o molecules but it is of water. Liquidity isn't of water without a container/pressure environment.

The trajectories of particles of air are underdetermined by their own intrinsic properties so a tornado cannot be reduced to some mere aggregate of them.

Emergence is an ontological relationship between properties of objects --- it isn't a mathematical property of a data distribution nor of an approximation function.

The very use of the term has crated all this confusion right here.

Would anyone who thought NNs were showing emergence be also content to find out that the reason for this 'emergence' was just that in the case of so-called 'emergence' our expectations of performance were changing?

Do we call it 'emergence' when we replace estimating data using mean() to estimating with ax+b ?

There's definitely an illusion here, but one quite easy to spot if you weren't in the game of peddling illusions.

aaroninsf · on Dec 19, 2022

Two observations,

There is utility and need for some convention of feature description, for the case of the external behavior of the system being correct, for some domain, despite lack of specific training. It is reasonable for users of such systems to say that the internal states and representation don't matter, so long as the behavior is correct; and in cases like these we will benefit from some consensus on how to talk about such things. Fine with me if some new term is applied.

More of interest to me though is that it is not at all clear to me that genuine emergence is not possible through scaling (independent of whether it is in any given existing LLM). Because the optimal (most compact) correct representation for a lot of e.g. language output, is exactly that which benefits from abstraction.

What reason is there to believe that the abstractions derived at higher levels (of the network generally but not necessarily, depends on the architecture) do not encode non-linear problem spaces in the world, which are "real" emergence?

I.e. if the way some network learns arithmetic is to settle on an internal weighting that performs computation, rather than "memorizing assertions", me, I would call that "emergent."

But I'm happy to use some other term should one, er, emerge.

darawk · on Dec 19, 2022

> Liquidity is not a property of h2o molecules but it is of water

The ability to speak English is not a property of floating points, but it is of certain, very specific large tensors of them. What's the difference?

> Emergence is an ontological relationship between properties of objects --- it isn't a mathematical property of a data distribution nor of an approximation function.

I don't see a hard distinction between ontological relationships and data distributions. All information is fundamentally statistical. Our access to ontology is forever and always mediated by "data distributions".

One could, of course posit that there are fundamental, non-statistical ontological things out there. However, the liquidity of water being an ontological relationship while the English-speaking of GPT not being so is merely a hypothesis, not an objective fact of the universe, at least not as far as I can tell.

mistermann · on Dec 19, 2022

> What in your mind is a specific example of a concrete system with an emergent property, then?

Not sure if you're joking, but the brain (and consciousness/mind as an emergent phenomenon) is the classic example. Even better, it is what's causing the fundamental problems in this very conversation, due to the inconsistent manner in which it translates terms into meaning (ie: "emergent", "is"), typically not realizing[1] that meaning and reality are similar to the speed of light in that their appearances[2] vary depending upon the frame of reference of the observer.

I am fairly optimistic that AI is going to "force humanity's cultural hands" such that we will eventually have to grapple with this long known but rarely discussed phenomenon. Though, I anticipate censorship, propaganda, and consciousness will play heavy roles in such discussions and screw everything up, as is usually the case with fundamentally important ideas.

[1] During realtime cognition, regardless of whether substantial abstract knowledge is in the person's possession.

[2] And maybe even the things themselves, depending on how (or from where) you look at it - I am still undecided on this.

naasking · on Dec 19, 2022

> This dual condition of "useful-to-people" is the thing introducing the illusion, since it changes depending on what we're modelling. The system isnt acquring any new property.

I have a couple of possible responses to this, but maybe the most obvious is that I'm not sure why "useful to people" can't qualify as a new property.

For instance, a system that suddenly becomes useful to people can be transformative to society, which can lead to new emergent social or economic properties at the societal scale. To conclude that "useful to people" is not a meaningful property, aren't you basically implying that something that suddenly becomes useful cannot even in principle lead to new societal scale emergent properties? That seems dubious. Edit: or you're implying that emergent properties are not reducible to interactions between constituent properties, which also seems dubious.

For a concrete example, the internet probably falls into this category. It has transformed society and led to new emergent properties at societal scales, but computers didn't suddenly acquire any new computational properties, or new properties to manipulate bits. Only the scale of their deployment changed, and that scaling was itself useful to people, and this led to new societal properties. That arguably can't happen unless "useful to people" is itself a meaningful property, no?

YeGoblynQueenne · on Dec 19, 2022

Thanks for bringing some sense into the debate. It's scandalising to see how the machine learning research community is so ready to jump on to such ... innovative uses of terminology.

Take "few shot learning", for instance. OpenAI's preprint introducing GPT-3 was titled "Large Language Models are Few Shot Learners" [1]. This was promptly adopted by the community, even though LLMs need first to be trained on millions of examples before they can accept "few", and they don't even learn from those few (because no weight updates). So it's not really "few shot" and it's not really "learning", and yet, here we are. "Large Language Models are few-shot learners" and nobody bats an eyelid anymore.

Which is not to say we have to shut up and take it. I personally draw inspiration from the tale of the little child who pointed out the King's sartorial negligence.

________________

[1] That, in itself, is a title designed to claim some groundbreaking progress not just in technological capabilities but also in scientific understanding. LLMs are few-shot learners, man! Few-Shot!

mgraczyk · on Dec 19, 2022

I don't see the issue here. The exact same definition of "few shot learning" has been used for at least 20 years. Nothing changed with the GPT-3 paper.

The definition is something like

Given a task, a few shot learner is an algorithm that generalizes well with only a small number of training examples for that task.

The same definition is what I'm familiar with from undergrad. Do you know of a different definition that precedes GPT-3?

mjburgess · on Dec 19, 2022

The issue is that the aim is to model the one shot of animals, not some other target .

Animals are one shot learners because we're in causally direct sensory-motor contact with reality, such that we can dibiguatw it live.

No train/predict system can disambiguate the causal origin of data and so can never be one shot.

What they're targeting is a triviality within a system of trivialities and misdiscrubjng it

mgraczyk · on Dec 19, 2022

But then you're saying that the definition has always been wrong, which is a very different claim.

I personally think that claiming a definition has always been wrong is vacuous. Just substitute the word in your head for something else if you don't like it.

YeGoblynQueenne · on Dec 19, 2022

I reference the GPT-3 preprint because it's the source of the latest twist to the meaning of "few-shot learning".

>> Given a task, a few shot learner is an algorithm that generalizes well with only a small number of training examples for that task.

I don't know where this definition comes from and I'd prefer if you had a more solid reference than your recollection of your undregraduate years, but it doesn't really matter because what you describe is not what LLMs do.

The input prompts to GPT-3 and friends are not "training examples". Training examples are labelled instances of a concept to be learned- that's PAC-Learning. LLM prompts are not examples, and they're not used in training. They're input sequences that an already-trained model completes with a sequence of tokens with maximal probability given the input.

That's indeed nothing new, it's how language generation with a language model works, and has always worked. But I don't remember ever hearing anyone referring to the input sequences given to Hidden Markov Models or Probabilistic Context-Free Grammars as "training examples", or the process of generating their completions referred to as "learning", let alone "few-shot learning". And yet, this kind of generation is exactly what LLMs do, too. Except of course LLMs have much smoother, larger models than any HMM or PCFG ever trained. But as the OP is arguing, that's not a qualitative difference, only a quantitative one, and renaming it is misleading. Doubly so if the renaming walks roughshod over long-established terminology, like "few shot", "learning" or "examples".

Btw, the OpenAI GPT-3 preprint gives a definition of their "few shot" setting but it's informal, long-wided and overall a vague mess, so it's really no surprise that so much confusion and chaos is generated as a result.

mgraczyk · on Dec 19, 2022

I disagree with what you're saying here.

> The input prompts to GPT-3 and friends are not "training examples". Training examples are labelled instances of a concept to be learned- that's PAC-Learning. LLM prompts are not examples, and they're not used in training.

In the PAC learning setting, these are training examples because you use the labels to select a function, in this case the conditional output given the few shot examples.

Whether or not you actually update any weights has nothing to do with "few shot learning" and never has. In the PAC setting there are no weights, just model functions that depend on the training data.

EDIT: The reason you didn't hear people refer to the inputs of HMMs as "training examples" is because HMMs are very poor few-shot learners. That's why GPT-3 is interesting, because it is a good few-shot learner.

You could use an HMM as a few shot learner, but computing the results is expensive and the results are not good for most tasks.

YeGoblynQueenne · on Dec 20, 2022

>> In the PAC learning setting, these are training examples because you use the labels to select a function, in this case the conditional output given the few shot examples.

That's not right. When you give a prompt to an LLM it doesn't use it to select a function [1]. The function is already selected (its weights are trained). The prompt only invokes the function.

The fact that there are no weight updates is important. Weight updates is how neural networks learn. Without weight updates neural nets can't be said to learn anything.

To dredge up from the depths yet another ancient definition of machine learning, the one by Tom Mitchell, an LLM will never improve its performance no matter how many prompts you give it. Every time you leave an interaction session with GPT-3, all the information you gave it in your prompts simply ceases to exist, as far as its model is concerned. There's nothing that is remotely like "learning" in that.

_______________

[1] More precisely, in PAC-Learning a learner is said to have learned a class of concepts if, given an unlabelled instance from a concept in the class it correctly labels the instance as a member of the concept, with some probability of some error and in polynomial time, or after having observed a polynomial time of examples. As you say, PAC-Learning doesn't say anything about weights, but it doesn't say anything about selecting functions, either. It's a framework within which to study learnability. That's a bit of an aside, hence the footnote.

mgraczyk · on Dec 20, 2022

It does select a function.

The distribution over outputs conditioned on the input prompt is a different distribution, hence a different (random) function.

It's not important that there are no weight updates. The new function, the model conditioned on a few example input-output pairs, is good at producing new outputs for unseen inputs. That is what learning is.

There is a reason that everyone working on LLMs is comfortable with this definition. It really does exactly match the definition that has been used for decades.

YeGoblynQueenne · on Dec 20, 2022

>> The distribution over outputs conditioned on the input prompt is a different distribution, hence a different (random) function.

As I say in another comment (and if I understand what you mean correctly) that is not supported by observations.

>> It's not important that there are no weight updates. The new function, the model conditioned on a few example input-output pairs, is good at producing new outputs for unseen inputs.

To be blunt, I've seen as much evidence for that as I've seen evidence for the existence of the Sasquatch. The people who train those models don't even know what's in their datasets, let alone anyone being able to say that an "example" (i.e. a prompt) is "unseen".

visarga · on Dec 19, 2022

> Remarkably, conditioning the model on such an “example-based specification” effectively enables the model to adapt on-the-fly to novel tasks whose distributions of inputs vary significantly from the training distribution. The idea that simply minimizing the negative log loss of a single next-word-prediction objective implies this apparent optimization of many more arbitrary tasks – amounting to a paradigm of “learning” during inference time – is a powerful one, and one that raises many questions.

http://ai.stanford.edu/blog/in-context-learning/

YeGoblynQueenne · on Dec 20, 2022

Hey man. Give some context to links please. Don't just dump them like that...

But thanks for the link, it seems like an interesting read though I'm too tired now to read it (but not for spouting off on HN... i know right).

>> Remarkably, conditioning the model on such an “example-based specification” effectively enables the model to adapt on-the-fly to novel tasks whose distributions of inputs vary significantly from the training distribution.

That would be indeed interesting if it was the case, but as we have seen in practice, it isn't. For example, GPT-3 is great in addition and subtraction with two- and three-digit numbers, but its performance deteriorates rapidly after that, plus it's bad at multiplication and completely incapable of division [1]. That's exactly what you'd expect of a model trained on a dataset where "easy" arithmetic problems predominate, i.e. the internet, and not at all what you'd expect of a model that can somehow magickally generalise beyond its training distribution without even being trained, and without even retaining any memory of its "examples" (i.e. prompts).

In any case, the way to convince me that I'm wrong is to do the good, old-fashioned hard work of proving LLMs' learning ability, like people used to do when it was all hairy stuff like statistical learning theory and VC-dimension and so on. But that's too much hard work and people prefer to poke LLMs with arbitrary prompts instead nowadays. And I agree that it's more fun that way, but at some point you have to give up the fun and try to understand exactly what's going on with a complex system.

____________________

[1] See Figure 3.10 in "Language models are few-shot learners".

Strange, I could have sworn there used to be a "large" in the title.

YeGoblynQueenne · on Dec 21, 2022

Edit: @visarga, I was kind of expecting you to tell me about the commas in arithmetic. I wonder whether you didn't because it's obvious to yo, too, that that is evidence refuting the claim of learning out-of-distribution concepts. As in, yes, GPT-3 gets better at arithmetic with comma-separated values, but that's because the commas help it overfit even more.

HarHarVeryFunny · on Dec 19, 2022

I'm OK with the term emergent used here - it seems the best word to describe non-trivial capabiities/properties that weren't designed in. At least prior to the first LLMs, I think most people would just expect them to be capable of doing literally what they are trained to do - predict (locally plausible) next word(s), and this is certainly all that is "designed in" if we're talking about a basic LLM (vs one with further RL-based "alignment", etc).

Of course we can also appreciate that to get REALLY REALLY good at "predict next word" would require intelligence/understanding of what is being generated, but I think the point here is would anyone - the model designers in particular - have expected that the transformer architecture (+ scale) is all it would take to become so good at this task? I don't think "attention is all you need" was really anticipating transformers reading API docs and generating code to perform requested functions! One might have expected it to take a much more elaborate and evolved architecture to achieve this level of capability!

So, to me, it seems entirely appropriate to describe these models as having emergent capabilities - things they are capable of doing that are really so far above and beyond "predict next word", that it seems churlish and inaccurate to describe them as designed it (or even just confidently predicted).

maria2 · on Dec 19, 2022

Why should we let philosophy define technical terms for ML? Many words have many meanings. Welcome to the imprecision of human language.

As a slight tangent, I really hate this type of comment that inevitably appears on many HN submissions. Personally, I find it distracting when the main conversation happening on an article is a pedantic discussion on whether or not a word means what the author thinks it means.

ot · on Dec 19, 2022

I believe the term is reasonably appropriate here.

The abilities being described here are "emergent" in the sense that the model was not specifically trained for them, but they show up anyway. Your example is about modeling a specific function and having its accuracy increase with model complexity, which is classical ML formulation, but this is not what is happening here.

LLMs are trained on a very simple task: given a natural text, predict the next word. But as model complexity and training set sizes increase, they start exhibiting more sophisticated abilities, such as basic forms of reasoning, and contextual memory.

In your definition, the parts of the whole are "lots of statistics about text" and the emergent property is "semantic reasoning".

Scale is inevitably a part of this: somewhere else in the thread you mention that "liquidity" is an emergent property of H2O, but if you take a handful of H2O molecules they don't behave as a liquid.

fumeux_fume · on Dec 19, 2022

Is is that big of a deal? The authors explain their definition of emergent abilities at the beginning of the paper.

mjburgess · on Dec 19, 2022

It's a mystification of what's going on -- the term makes it harder to understand, not easier. It's prone to popular misunderstanding, and it seems even to confuse the researchers themselves.

xg15 · on Dec 20, 2022

> With enough examples, the gap closes... indeed, you can model the addition function (add(x, y) = x + y) just by an infinite sample of its domains.

Infinity is not a number.

Trainsets cannot be infinite and neither were the number of examples of addition that we humans saw before we became able to do addition on unseen samples.

So the question of how finitely many samples you need to generalize a math operation is a different one.

smeagull · on Dec 20, 2022

People speak in metaphors. I don't know why people get so hung up when it happens. Language isn't a formula unless you're a model. Even then, LLM do cope with some idioms.

rafaelero · on Dec 19, 2022

What a waste of time to be worried about how people are using a word.

xpe · on Dec 19, 2022

Well said. How did the authors and reviewers miss this?

mjburgess · on Dec 19, 2022

The transition from useless to useful ML models no doubt often seems magical to researchers. But it follows just from the distribution of the training data and from the degree of its compression by the function approximation algorithm they're using.

What's "magical" is not their system, but rather that the vast library of text they use for training has useful properties which can be approximated.

What researchers are observing is more like the illusion of a "phase transition" in the quality of approximations. This illusion arises because we have discontinuous standards for the approximations.

Ie., when assessing free text prediction by LLMs there's very very many ways for them to generate an acceptable answer. For mathematics, there's only one acceptable way.

If we applied the same standard/goal to both, no such apparent "quality transition" would occur. LLMs would be exposed as equally good, or equally bad, at prediction regardless of scale.

xpe · on Dec 19, 2022

Interesting arguments. It seems plausible and insightful. IMO, your analysis here deserves a longer write-up. Is it something you are working on?

mjburgess · on Dec 19, 2022

Any person with a "scientific attitude" in this field would find it incredibly easy to observer that the training target for natural language is,

f(Q) = {A1..An} -- n being very very large

and the target for mathematics is,

g(Q) = A1

And the model they're using approximates with,

m(Q) = A_guess

So it's incredibly easy to model f with m because A_guess just has to be close to one of A1...n; and its very hard to model mathematics because it has to be only A1.

The reason articles like this are written isn't because people dont know this; it's because they just do not have a sceptical attitude. And that's a problem I can't fix.

If they'd approached this issue with the goal of "finding nothing surprising about this behaviour", ie., trying to make it maximally consistent with existing (basic, trivial, widely-taught) theory, they'd reach this conclusion in under 5min.

The problem is their goal is always to find something surprising so they can release some PR about it. There's nothing I can write to fix this problem, it's edemic in the whole field.

It makes ML/AI research much more like psychology than physics. Alas!

baandang · on Dec 19, 2022

From Introduction To The Theory Of Complex Systems:

• Complex systems can exhibit a rich phase structure and have a huge variety of macrostates that often cannot be inferred from the properties of the elements. This is sometimes referred to as emergence.

This is the term and as common a term in complex systems as there is.

"scale-dependent capabilities" implies inference from elements.

I think some people just don't like the very idea even though it is not unlike the concept of stochastic process. I would think the same reasons to not like the concept of emergence applies to stochastic process. Murray Gell-Mann though couldn't even raise complex systems above arguing if emergence is magical thinking so it is probably a lost cause.

Such an interesting field that always ends up as this conversation.

naasking · on Dec 19, 2022

> "scale-dependent capabilities" implies inference from elements.

Don't confuse the post-hoc explanation with an a priori inference. We can post-hoc explain water's emergent liquidity property using modern quantum theories, but that doesn't mean we could have inferred it if given Schrodinger's equation and the atomic structure of hydrogen and oxygen.

"Scale-dependent capability" is a post-hoc explanation that of course looks obvious in hindsight, just like liquidity and pressure looks obvious in hindsight once you understand electromagnetism and atomic theory.

seydor · on Dec 19, 2022

Eh eh . imagine how neuroscientists feel

mjburgess · on Dec 19, 2022

Well, of late, NS's have been fond of misusing "hallucinate" likewise which means a non-veridical perception. And they're using it to mean a veridical constructed perception.

Leading everyone down a path of ever-more mysticism.

It would be nice if neuroscientists spoke out against both their own mystical PR and that of AI, but I don't hear it much.

oidar · on Dec 19, 2022

I think the word confabulate would be a better match for what ChatGPT does. When people confabulate they are VERY confident about their invented retellings. This matches the attitude that ChatGPT comes across as when it makes up shit.

PartiallyTyped · on Dec 19, 2022

> that a property of the whole does not obtain just in virtue of properties of its parts taken in isolation

Thank you. I have been expressing variants of this for a while. A paper that comes to mind is OpenAI's hide and seek. They claim that cooperation is emergent behaviour, but each agent is playing it's own version of prisoner's dilemma, and thus learn to cooperate.

visarga · on Dec 19, 2022

That model was not learning from language, it was learning from a simulation. When you can use a simulation to produce training data it is possible to have a model discover new abilities all on its own, like AlphaGo.

xpe · on Dec 19, 2022

Wikipedia has a fine definition of what _emergent_ means:

> In philosophy, systems theory, science, and art, emergence occurs when an entity is observed to have properties its parts do not have on their own, properties or behaviors that emerge only when the parts interact in a wider whole.

The linked article uses this definition:

> we discuss the phenomena of emergent abilities, which we define as abilities that are not present in small models but are present in larger models

The concept in the paper has to do with capabilities / abilities that grow non-linearly as a function of model size. This is distinctly different from _emergent behavior_ in systems theory.

<opinion>The authors and reviewers could find a better word for their concept. There is no need to muddle the concept.</opinion>

Furthermore, the idea that networks of certain sizes are necessary for certain kinds of representational abilities is not new. Perhaps a term exists already?

xpe · on Dec 19, 2022

This comment says it more eloquently than I did: https://news.ycombinator.com/item?id=34051845

CGamesPlay · on Dec 19, 2022

Do these scale-dependent (I like this adjective better than "emergent") properties survive model distillation? It may be that our training/optimization processes are inefficient and require these scales to achieve, but the underlying model may not actually require the number of parameters that we are giving them. I haven't read any of the papers about distillation yet, does anyone know if this has been tested?

visarga · on Dec 19, 2022

Good question, my guess is that you can't distill chain-of-thought or zero shot prompting in small models, they got to be 15-20B parameters or larger. Maybe someone has a link to a related paper?

lossolo · on Dec 19, 2022

For all models smaller than 62B, direct prompting outperforms CoT. The first model where CoT outperforms direct prompting is Flan-cont-PaLM 62B on BBH. For 540B models, there are more settings where CoT outperforms direct prompting, but not all of them. Also, the number can be smaller than 540B. In Suzgun et. al. 2022, the authors show that the 175B InstructGPT and 175B Codex also have better CoT performance than direct prompting. Combining all the results, we get the two numbers 62B and 175B. So yes indeed, to enter the game of scale you do need a ticket to larger models than average.

However, there are also other large models like OPT, BLOOM, and the first version of GPT-3. They all have 175B, yet their CoT performance is significantly worse, or even cannot do CoT.

source: https://yaofu.notion.site/A-Closer-Look-at-Large-Language-Mo...

visarga · on Dec 19, 2022

Found a paper myself:

> Teaching Small Language Models to Reason

https://arxiv.org/abs/2212.08410

CGamesPlay · on Dec 20, 2022

Wow this paper is from 4 days ago, but it's exactly what I was asking about!

evrimoztamur · on Dec 19, 2022

Has there been any efforts in processing calculation prompts, where instead of letting it internally 'compute', it's trained to identify equations and process them with an external calculator instead (perhaps one which outputs not only the result but the individual steps too)?

vutekst · on Dec 19, 2022

Yes: https://twitter.com/goodside/status/1581805503897735168

visarga · on Dec 19, 2022

Language models with toys. The calculator, Python REPL, search engine, database, simulation, games, other AI's can easily blend with large language models lifting some weight off their shoulders.

For example, for a physics question the LM could write a small simulation, run the simulation and interpret results back to the user. That's possible when models can do code execution.

obiefernandez · on Dec 19, 2022

Been wondering the same

ttctciyf · on Dec 19, 2022

There's a quite accessible IAS presentation[1] from another Google researcher on Solving Quantitative Reasoning Problems with Language Models which gives some likely related background on having language models solve this type of math problem, including the "chain of thought" technique mentioned here.

I found it pretty interesting and as something of an ML skeptic was a bit surprised at the degree of coherence shown in "reasoning" examples similar to the ones in the linked article.

1: https://www.youtube.com/watch?v=qV4Ku5L4BuMt

djoldman · on Dec 19, 2022

Paper: https://openreview.net/forum?id=yzkSU5zdwD

seydor · on Dec 19, 2022

The X axis here is the training Flops but what about parameter size and how does it account for the different architectures. Comparing apples to shoelaces may not be a fruitful approach or indicative of what to expect from ever-expanding scale. Also , is it emergence or overfitting