Temporal quality degradation in AI models (nannyml.com)
325 points by santiviquez on April 14, 2023 | 343 comments



One issue with AI-as-ML that I think is overlooked is how radically it privileges the done-and-finished past over a creative future.

When faced with a blank page some percentage of people will not copy a known solution, they'll draw on their experiences and eg., write a book about them.

For example, I could write a book about software development without research only inventorying my opinions, arguing for them, illustrating them, and so on.

If I were to start with ChatGPT I would, necessarily, only say what has been said.

There are so many dangers with the Eliza-effect, and this is one of them. I think the narrative on how these systems work needs to be more strongly wrested from their salesmen -- and the real downsides exposed.

A client of mine seemed quite insistent that ChatGPT was "creative". I hope in a few years that opinion will be rare, as everyone ends up writing the same article over-and-over.

And likewise, in development, using earlier versions of libraries with the most "stackoverflow training data"; using dying languages with the most github repos; and so on.

In other words: what we have written the most about is not what we now need to write the most about. And the former is the training data for modern AI.

It is not in the world with us, with our concerns: it is a statistical average over "done and dead" cases.


> If I were to start with ChatGPT I would, necessarily, only say what has been said.

This is not true though, you can test it for yourself:

Prompt: Invent a new word, which has never been written before. When I type it in Google, I should get zero results.

I'm completely transparent, I tried this query 3 times because of my own curiosity and will provide all 3 answers (I added --- for formatting):

---

ChatGPT (answer 1):

Sure, how about "flumtraz"?

---

ChatGPT (answer 2):

Here's a new word: "Ardidio"

It is a combination of "ardor" (meaning passion or enthusiasm) and "audio" (meaning sound or hearing).

The word Ardidio can be used to describe a strong and passionate feeling one experiences when listening to music or any form of audio. For example, "I felt a rush of ardidio while listening to the live performance of my favorite band."

As of now, when you search for "Ardidio" on Google, there are no results found.

---

ChatGPT (answer 3):

"Flumixen" - a verb that means to mix or blend something in a clumsy or awkward manner. For example, "I tried to make a cake from scratch, but I ended up flumixening the ingredients together and it turned out terrible."

---

For the first answer I got zero results on Google, so it's quite unlikely that it was part of the training data, proving your theory wrong. It can say things that have never been said.

For the second and third answers I got results on Google, but they seem accidental, usernames and the like; as far as I can see, they have never been used with the meaning that ChatGPT "invented".

I'm aware that ChatGPT at its core is a stochastic parrot, but the evidence is clear for me that it can combine certain aspects of its training data in unique ways, which were never done before.


> For the first answer I got zero results on Google, so it's quite unlikely that it was part of the training data

Sure, not literally part of the training data.

Statistical AI operates in a transformed space derived from the training data, points in that space will not, in general, exist in the original.

So imagine generating 1000 circles and putting their radii on a line: 0.1, 0.2, 0.3, ...

The circles are the training data, and the "implied line" is the transformed space.

Now, AI here is capable of generating a circle with radius 0.15 and hence that circle is "not in the original dataset".

This type of "novelty" isn't what I'm concerned with; generative AI must have that or else it'd be entirely useless -- only a google search.

Rather, I'm talking about, for example, whether without "Rust" in its training data it could develop "Rust" from everything else. Is there enough data on lifetimes/borrowing/etc. research in the PDFs it has scanned to somehow "find a midpoint between those PDFs and C++"?

It seems a bit mad to suppose so -- but I could be wrong, and such a midpoint may exist -- but I'm extremely doubtful we humans have been so helpful as to write the 1000s of academic PDFs needed for this system to find it.

The novelty I'm talking about is dimensions in the transformed space. The system cannot derive "additional ways to move" without the source data actually containing those ways.

This is, roughly, equivalent to saying that it's biased towards the on-average ways we have conceptualised our problems as represented by the on-average distribution of academic articles, github repos, webpages, etc. *that we happened to have created*.

This is a serious "intellectually conservative" bias.

For sure it can find circles it hasn't seen; but could it find spheres from circles alone? No.


I don't think this argument holds water at all. Can we imagine that the people who created Rust were able to do so only because they contributed some magical ingredient from their souls, which didn't exist anywhere in the world's collected pre-Rust printed materials? It's more economical to assume that they were able to create Rust because the necessary concepts already existed, in which case a sufficiently trained AI might do something similar.

Or working in the opposite direction: we can think of AIs as processing concepts in some dimensional space, sure. But we have no conception at all of what that space is like, so there's no reason to expect that a midpoint in that space between two objects we're familiar with would also be familiar to us. I mean, I have no idea what the midpoint between Rust and C++ is, or how I'd go about describing it. Surely an AI that thinks in tensors is more capable than we are of exploring the space between known concepts, so why couldn't we expect to learn something novel from one?


Concepts are developed by animals over time. A baby develops sensory-motor concepts from day-1; a child abstracts them; a teenager communicates them; and an adult refines that communication.

They are not developed as a matter of averaging over all the text on the internet.

Concepts do not pre-exist concepts.


Respectfully, that sounds like hand-waving. Claiming to know where concepts do and don't come from just leads to questions like "did the natural numbers exist before we did?", which are centuries old and presumably not resolvable.

Whereas a more focused question like "can an AI produce outputs that are novel to someone familiar with all of the AI's inputs?" seems resolvable, and even if one thinks it's unlikely or not easy, it's very hard to buy the idea that it's impossible.


> just leads to questions

No, not really. People in this area are severely poorly informed on animal learning, and "ordinary science".

AI evangelists like to treat as "merely philosophical matters" profoundly scientific ones.

The issues here belong to ordinary science. Can a machine with access only to statistical patterns in the distribution of text tokens infer the physical structure of reality?

We can say, as certain as anything: No.

Associative statistical models are not phenomenological models (ie., specialised to observable cause-effect measures); and phenomenological models are not causal (ie., do not give the mechanism of the cause-effect relationship).

Further, we know as surely as an athlete catching a ball, that animals develop causal models of their environments "deeply and spontaneously".

And we know, to quite a robust degree, how they do so -- they use interior causal models of their bodies to change their environments by intentional acts, which can confirm or disconfirm environmental models. This is modelled logically as abduction, causally as sensory-motor adaption, and so on.

This is not a philosophical matter. We know that "statistical learning" which is nothing more than a "correlation maximisation objective" over non-phenomenological, non-causal, non-physical data produces approximate associative models of those target domains -- that have little use beyond "replaying those associations".

ChatGPT appears to do many things. But you will see soon, after a year or two of papers published, that those things were tricks. That "replaying associations in everything ever written" is a great trick, that is very useful to people.

Today you can ask ChatGPT to rewrite Harry Potter "if Harry were evil" or some such thing. That's because there are many libraries of books on Harry Potter and "evil" -- and by statistical interpolation alone, you can answer an apparent counter-factual question which should require imagination.

But give ChatGPT an actual counter-factual whose parts are only in the question, and you'll be out-of-luck.

Eg., tell it about tables, chairs, pens, cups and ask it to arrange them using given operations so that, eg., the room is orderly. Or whatever you wish.

Specified precisely enough you can expose the trick.


>This is not a philosophical matter. We know that "statistical learning" which is nothing more than a "correlation maximisation objective" over non-phenomenological, non-causal, non-physical data produces approximate associative models of those target domains -- that have little use beyond "replaying those associations".

Why do you think the data LLMs are trained on are non-causal? Let's take causation as asymmetric correlation. That is, (A,B) present in the training data does not imply (B,A) presence. But of course human text is asymmetric in this manner and LLMs will pick up on this asymmetry. You might say that causation isn't merely about asymmetric correlation, but that of the former determining the latter. But this isn't something we observe from nature, it is an explanatory posit that humans have landed on in service to modelling the world. So causation is intrinsically explanatory, and explanation is intrinsically causal. The question is, does an LLM in the course of modelling asymmetric correlations, develop something analogous to an explanatory model. I think so, in the sense that a good statistical model will intrinsically capture explanatory relations.

Cashing out explanation and explanatory model isn't easy. But as a first pass I can say that explanatory models capture intrinsic regularity of a target system such that the model has an analogical relationship with internal mechanisms in the target system. This means that certain transformations applied to the target system have a corresponding transformation in the model that identifies the same outcome. If we view phenomena in terms of mechanistic levels with the extrinsic observable properties as the top level and the internal mechanisms as lower levels, an explanatory model will model some lower mechanistic level and recover properties of the top level.

But this is in the solution space of good models of statistical regularity of an external system. To maximally predict the next token in a sequence just requires a model of the process that generates that sequence.


Well this is a good reply, but it's mistaken.

The conditional probability:

    P(x[0]| x[-1], x[-2], x[-3] ...)
is not the same as,

    P(x[0] | x[-1], x[-2], ... -> x[0])
Where `->` says we select only those cases where x[-1],... brought-about x[0].

To see why this is the case, suppose we do have a god's-eye view of all of spacetime.

    P(A|B)
    always selects for all instances where A follows B.

    P(A| B -> A)
    selects only those instances where A's following B was caused by B.
Eg.,

    P(ShoesWet | Raining) 

    is very different from 

    P(ShoesWet | Raining -> ShoesWet)
in the former case the two events have, in general, nothing to do with each other.

To select "Raining -> ShoesWet" even with a god's-eye view we need more than statistics... since those events which count as "Raining -> ShoesWet" have to be selected on a non-statistical basis.

For the athlete catching a ball, or the scientist designing the experiment, we're interested only in those causal cases.

For sure P(A|B) is an (approximate, statistical) model of P(A| B->A) -- but it's a very restricted, limited model.

The athlete needs to estimate P(ball-stops | catch -> ball-stops)

NOT P(ball-stops | catch) which is just any case of the ball-stopping given any case of catching.
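
A minimal sketch of the gap between conditioning and intervening, under loudly stated assumptions: the variable `wipers_on`, the probabilities, and the function names are all hypothetical and not from the thread. Rain causes both wet shoes and running windscreen wipers, while the wipers have no effect on shoes. Selecting the cases where the wipers happen to be on gives a very different number from forcing the wipers on.

```python
import random

def simulate(n, force_wipers=None, seed=0):
    """Toy world (hypothetical numbers): rain causes wipers and wet shoes."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        raining = rng.random() < 0.3
        if force_wipers is None:
            # Observational regime: wipers are only switched on when it rains.
            wipers_on = raining and rng.random() < 0.9
        else:
            # Interventional regime: we set the wipers ourselves.
            wipers_on = force_wipers
        # Shoes depend on rain only, never on the wipers.
        shoes_wet = rng.random() < (0.8 if raining else 0.05)
        rows.append((raining, wipers_on, shoes_wet))
    return rows

def p_wet_given_wipers(rows):
    """Fraction of wet-shoe cases among the rows where the wipers are on."""
    selected = [wet for _, wipers, wet in rows if wipers]
    return sum(selected) / len(selected)

p_observed = p_wet_given_wipers(simulate(100_000))
p_intervened = p_wet_given_wipers(simulate(100_000, force_wipers=True))

print(f"P(ShoesWet | WipersOn)     ~ {p_observed:.2f}")
print(f"P(ShoesWet | do(WipersOn)) ~ {p_intervened:.2f}")
```

In this toy world the observational figure sits near the rainy-day rate for wet shoes, while the interventional one falls back to the overall base rate, because switching the wipers on does nothing to the weather.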


Let me alter your example a bit: we have P(A|B), we want P(A|B,B->A). But given enough examples of the form P(A|B), a good algorithm can deduce B->A and use it going forward to predict A. How? By searching over the space of explanatory models to find the model that helps to predict P(A|B) in the right cases and not in the wrong cases. LLMs do this with self-attention, by taking every pair of symbols in the context window and testing whether each pair is useful in determining the next token. As the attention matrix converges, the model can leverage the presence of "Raining & Outside" in predicting "ShoesWet".

Of course, this is a rather poor excuse for an explanation. The fact that "outside" and "raining" are close doesn't explain why "my shoes are wet". But it does get us closer to a genuine explanation in the sense that it eliminates a class of wrong possibilities from consideration: every sentence that doesn't have outside in proximity to raining downranks the generation "my shoes are wet". The model is further improved by adding more inductive relationships of this sort. For example, the presence of an expanded umbrella downranks ShoesWet, the presence of "stepped in puddle" upranks it. Construct about a billion of these kinds of inductive relationships, and you end up with something analogous to an explanatory model. The structural relationships encoded in the many attention matrices in modern LLMs in aggregate entail the explanatory relationships needed for causal modelling.
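
A rough sketch of the "every pair of symbols in the context window" step described above, assuming hand-made 2-d token vectors in place of learned query/key projections (the vectors and function names are hypothetical). It computes causally masked scaled dot-product attention weights, which is the pairwise relevance scoring only, not a full transformer layer.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys):
    """Scaled dot-product attention weights with a causal mask.

    Position i scores itself and every earlier position, so each
    row i holds i + 1 weights that sum to 1.
    """
    d = len(keys[0])
    weights = []
    for i, q in enumerate(queries):
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
            for k in keys[: i + 1]
        ]
        weights.append(softmax(scores))
    return weights

# Hypothetical toy embeddings; a real model learns separate projections.
tokens = ["raining", "outside", "shoes"]
vectors = [[1.0, 0.0], [0.8, 0.2], [0.9, 0.1]]

weights = attention_weights(vectors, vectors)
for token, row in zip(tokens, weights):
    print(token, [round(w, 3) for w in row])
```

The weights only say how much each earlier token contributes to predicting the next one; whether that amounts to explanation is exactly what the thread is arguing about.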


> How? By searching over the space of explanatory models to find the model that helps to predict P(A|B) in the right cases and not in the wrong cases.

But the machine doesn't know which are the right cases. We aren't presuming there's a column, Z = 1 for B -> A, and Z = 0 otherwise -- right?

The machine has no mechanism to distinguish these cases.

> testing whether each pair is useful in determining the next token

This isn't causation.

> every sentence that doesn't have outside in proximity to raining downranks the generation

So long as the sequential structure of sentences corresponds to the causal structure of the world: but that's kinda insane right?

We haven't rigged human language so that the distribution of tokens is the causal structure of the world. The reason text generated by LLMs appears meaningful is because we understand it. The actual structure of text generated isn't "via" a model of the world.

(Consider, for example, training an LLM on a dead untranslated language -- its output is incomprehensible, and its weights are arbitrarily correlated with anything we care to choose.)

Nevertheless, given our choice of token, you do have a model which says:

    P(ShoesWet|~Rain) < P(ShoesWet | Rain) < P(ShoesWet|Rain & Outside)
That's true. But we're choosing these additional conjunctions because we already know the causal model; these conjunctions are how we're eliminating confounders to get an approximation close to the actual.

(Which you'll never get: the actual value is `1`. If B -> A, then P(A|B->A) = 1 -- this is a deductive inference necessary for ordinary science to take place.)

In any case, P(A | B -> A) means without any confounders. To actually find the LLM's approximation of this we'd need to compute:

    P(A|B & C1 & C2 & C3 ...)  forall C_i..inf
And then find P(A|B & C') st. C' made P(A|B) maximally likely.

If you find a set of {C} st. P(A|B) has a high probability, you won't find causal conditions.

All that statistical association models here is, at best, salience -- not causal relevance.


>We haven't rigged human language so that the distribution of tokens is the causal structure of the world [...] The actual structure of text generated isn't "via" a model of the world.

This is an odd claim. I certainly say that I picked my cup off the floor rather than I picked my cup off the ceiling because gravity causes things to fall down rather than up. Human language isn't "rigged" to represent the causal structure of the world, but it does nonetheless. The distribution of tokens is such that the occurrence of (A,B) and (B,A) are asymmetric, and this is precisely because features of the world influence the distribution of words we use. A sufficiently strong model should be able to recover a model of this causal structure given enough training data.

>That's true. But we're choosing these additional conjunctions because we already know the causal model; these conjunctions are how we're eliminating confounders to get an approximation close to the actual.

But these patterns are represented in the training data by the words we use to discuss raining and wet shoes. There is every reason to think a strong model will recover this regularity.

>All that statistical association models here is, at best, salience -- not causal relevance.

That's all we can ever get from sensory experience. We infer causation because it is more explanatory than accepting a huge network of asymmetric correlations as brute. YeGoblynQueenne is right that my point is basically a version of the problem of induction. We can infer causation but we are never witness to causation. We do not build causal models, we build models of asymmetric correlations and infer causation from the success of our models. What a good statistical model does is not different in kind.


The problem of induction is fatal. But we overcome it: we do witness causation.

When I act on the world, with my body, I take as a given "Body -> Action". We witness causation in our every action.

> This is an odd claim

The tokens can be given any meaning. The statistical distribution of token frequencies in our languages has an infinite number of causal semantics which are consistent with it.

We can find arbitrary patterns such that

    P(A) < P(A|B) < P(A|B & C) < P(A|B & C...)
We will count only those we give a semantics to ("Rain" = Rain), and only those we already know are causal. This is the trick of humans reading the output of LLMs -- this is what makes it possible. It's essentially one big Eliza effect.

No, the structure of language isn't the structure of the world.

This pattern in tokens,

    P(A) < P(A|B) < P(A|B & C) < P(A|B & C...)

Is an associative statistical model of conditional aggregate salience between token terms.

Phrase any such conditional probability you wish, it will never select for causal patterns.

This is why we experiment. It's why we act on the world to change it.

When the child burns their hand on the fireplace they do so once. Why?

Because the child immediately infers,

    P(TouchFire -> Pain | MoveHand -> Pain) = 1
How? via the abduction, roughly:

    P( TouchFire | Desire_TouchFire -> TouchFire) = 1
how?

    P( TouchFire | Desire_TouchFire -> MoveHand) = 1
how?

    P( Pain | MoveHand -> TouchFire -> Pain) = 1
etc.

In other words, we bottom out our reasoning in a

    P( BodilyMovement -> Effect | Desire -> BodilyMovement) = 1

Absent this, absent being in the world with a body, you cannot determine causes.

The problem of induction phrased in modern language is this: statistics isn't informative. Or, conditional probabilities are no route to knowledge. Or, AI is dumb.


Wow, that's a nice way to put it. I haven't seen that P(A|B -> A) notation before. Where does it come from?

But I think the OP is arguing, essentially, that P(A|B -> A) is only an interpretation of P(A|B) that we have chosen, among the many possible interpretations of P(A|B).

Which I think evokes the problem of induction. How do we know that P(A| B -> A) when all we can observe ourselves is P(A|B)?


> when all we can observe ourselves is P(A|B)?

No, we actually observe P(A | B -> A) where `B` is our body and `A` is some action we take on the world.

Hume was WRONG. Very wrong.

Statistical AI has the problem of induction; we have bodies, so we do not.

----

As for notation, I'm riffing off Judea Pearl's do notation.

He'd say, P(A|do(B))

but his `do` operator is slightly more general

Google: do-operator, causal analysis, judea pearl, etc.
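
For reference, one standard way the `do` operator is cashed out in Pearl's framework is the backdoor adjustment, sketched here under an assumption the thread does not make: that there is an observed set of covariates C satisfying the backdoor criterion for the effect of B on A.

```latex
% Interventional distribution: average over C using its marginal P(C = c)
P(A \mid \mathrm{do}(B = b)) \;=\; \sum_{c} P(A \mid B = b,\, C = c)\, P(C = c)

% Ordinary conditioning: weights C by P(C = c | B = b) instead
P(A \mid B = b) \;=\; \sum_{c} P(A \mid B = b,\, C = c)\, P(C = c \mid B = b)
```

The two expressions differ only in how the confounders are weighted, which is why they coincide when C is independent of B and can come apart otherwise.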


Ah, I thought it might be something to do with Judea Pearl.

>> Hume was WRONG. Very wrong.

Oh boy :)

I can see what you're saying about having bodies, but bodies are very limited things and that's just making Hume's point. We can only know so much by experiencing it with our bodies. We've learned a lot more about the world, and its foundations, thanks to our ability to draw inferences without having to engage our bodies. For example, all of mathematics, including logic that studies inference, is "things we do without having to engage our bodies". And those very things have shown us the limits of our own abilities, or at least our ability to create formal systems that can describe the world in its entirety. They have shown us the limits of our ability for inductive inference (and in a very concrete manner - see Mark E. Gold's Language Identification in the Limit).

Machine learning systems are more limited than ourselves, that's right. And that's because we have created them, and we are limited beings that cannot know the entirety of the world just by looking at it, or reasoning about it.


One of the premises of Hume's sceptical metaphysics was that

    P(A|B) is just P(A | B -> A) 
The argument for this was `A` and `B` are only "Ideas in the head" and don't refer to a world. And secondly, by assertion, that Ideas are "thin" pictorial phenomena that can only be sequenced.

Hume here is just wrong. Our terms refer: `A` succeeds in referring to, eg., Rain. And our experiences aren't "thin", they're "thick" -- this was Kant's point. Our experiences play a rich role in inference that cannot be played by "pictures".

To have a mental representation R of the world is to have a richly structured interpretation which does, in fact, contain and express causation.

ie., R can quite easily be a mental representation of "B -> A". This, after all, is what we are thinking when we think about the rain hitting our shoes. We do not imagine P(A|B), we imagine P(A|B->A) -- if we didn't, we couldn't reason about the future.

The question is only how we obtain such representations, and the answer is: the body with its intrinsic known causal structure.

Whenever we need to go beyond the body, we invent tools to do so -- and connect the causal properties of those tools to our body.

Hume here is wrong in every respect. And it's his extreme scepticism which undergirds all those who would say modern AI is a model of intelligence -- or is capable of modelling the world.

The world isn't a "constant conjunction of text tokens" -- even Hume wouldn't be this insane. Nevertheless, it is this lobotomised Hume we're dealing with.

There is a science now for how the mind comes to represent the world -- we do not need 18th C. crazy ideas. Insofar as they are presented as science, they're pseudoscience.


Thank you for sharing your opinion on Hume, but I don't see how e.g. Polyominoes, to take a random mathematical (ish) concept I was thinking about today, are connected to our body. I can think of many more examples. Geometry, trigonometry, algebra, calculus, the first order predicate calculus, etc. None of those seem to be connected to my body in any way.

Anyway this all is why I'm happy I'm not a philosopher. Philosophers deal in logic, but they don't have a machine that can calculate in logic, and keep them on the straight and narrow with its limited resources. A philosopher can say anything and imagine anything. A computer scientist - well, she can, but good luck making that happen on a computer.


Well Kant (Chomsky, et al.) are probably right that we must have innate concepts -- esp. causation, geometry, linguistic primitives etc. in order to be able to perceive at all.

So in this sense a minimal set of a-priori concepts are required to be built-in, or else we couldn't learn anything at all.

You might say that this means we can separate the sensory-motor genesis of concepts from their content -- but I think this only applies to a-priori ones.

What I'm talking about is conceptualisations of our environment that provide its causal structure. One important aspect of that is how desires (via goals) change the world. Another is how the world itself works.

Both of these do require a body, or at least a solution to the problem of induction (ie., that P(A|B) is consistent with P(A|B->A), P(A|~(B->A)), P(A| B->Z, C->Z, Z->A), etc.)


>> So in this sense a minimal set of a-priori concepts are required to be built-in, or else we couldn't learn anything at all.

I don't disagree with that at all. I'm pretty convinced that, as humans, we can learn and invent all those things we have because we have strong inductive biases that guide us towards certain hypotheses and away from others.

Where those inductive biases come from is a big open question, and I'd be curious to know the answer. We can wave our hands at evolution, but that doesn't explain, say, why we have the specific inductive biases we have, and not others. Why do we speak human languages, for example? Why is our innate language ability the way it is? Intuitively, there must be some advantage in terms of efficiency that makes some inductive biases more likely than others to be acquired, but I get tired waving my hands like that.

I'm not convinced that all that absolutely requires a body, either. I think it's reasonable to assume it requires some kind of environment that can be interacted with, and some way to interact with the environment, but why can't a virtual environment, like a computer simulation of reality, provide that? And it doesn't have to be the real reality, either. A "blocks world" or a "grid world" will do, if it's got rules that can be learned by playing around in it.


I very much disagree but have an upvote for a well-argued comment.

>> The question is, does an LLM in the course of modelling asymmetric correlations, develop something analogous to an explanatory model. I think so, in the sense that a good statistical model will intrinsically capture explanatory relations.

A statistical model may "capture" explanatory relations, but can it use them? A data scientist showing a plot to a bunch of people is explaining something using a statistical model, so obviously the statistical model has some explanatory power. But it's the data scientist that is using the model as an explanation. I think the discussion is whether a statistical model can exist that doesn't just "capture" an explanation, but can also make use of that explanation like a human would, for example as background knowledge to build new explanations. That seems very far fetched: a statistical model that doesn't just model, but also introspects and has agency.

Anyway I find it very hard to think of language models as explanatory models. They're predictive models, they are black boxes, they model language, but what do they explain? And to whom? The big debate is that (allegedly) "we don't understand language models" in the first place. We have a giant corpus of incomprehensible data; we train a giant black box model on it; now we have a giant incomprehensible model of the data. What did we explain?

>> But this is in the solution space of good models of statistical regularity of an external system. To maximally predict the next token in a sequence just requires a model of the process that generates that sequence.

Let's call that model M* for clarity. The search space of models, let's call it S. There are any number of models in S that can generate many of the same sequences as M* without being M*. The question is, and has always been, in machine learning, how do we find M* in S, without being distracted by M_1, M_2, M_3, ..., ... that are not M*.

Given that we have a very limited way to test the capabilities of models, and that models are getting bigger and bigger (in machine learning anyway) which makes it harder and harder to get a good idea of what, exactly, they are modelling, how can we say which model we got a hold of?
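
A toy illustration of the M* versus M_1 worry, with everything in it hypothetical: two hand-written "models" that produce identical outputs on every observed input yet disagree on the first unseen one. It says nothing about how likely this is for real language models; it only shows that agreement on the observations does not pin down the generating process.

```python
def m_star(n):
    """Hypothetical 'true' process M*: the sequence of squares."""
    return n * n

def m_1(n):
    """A rival model: matches M* on n = 1..4, then diverges.

    The extra product term is zero whenever n is 1, 2, 3 or 4.
    """
    return n * n + (n - 1) * (n - 2) * (n - 3) * (n - 4)

observed_inputs = [1, 2, 3, 4]
agree_on_observed = all(m_star(n) == m_1(n) for n in observed_inputs)

print(agree_on_observed)   # True
print(m_star(5), m_1(5))   # 25 49
```

Any finite test set leaves room for rivals like `m_1`, so the practical question is which tests would separate the candidates we actually care about.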


>A statistical model may "capture" explanatory relations, but can it use them?

That's the beauty of autoregressive training, the model is rewarded for capturing and utilizing explanatory relations because they have an outsized effect on prediction. It's the difference between frequency counting while taking the past context as an opaque unit vs decomposing the past context and leveraging relevant tokens for generation while ignoring irrelevant ones. Self-attention does this by searching over all pairs of tokens in the context window for relevant associations. Induction heads[1] are a fully worked out example of this and help explain in-context learning in LLMs.

>Anyway I find it very hard to think of language models as explanatory models. They're predictive models, they are black boxes, they model language, but what do they explain? And to whom?

The model encodes explanatory relationships of phenomena in the world and it uses these relationships to successfully generalize its generation out-of-distribution. Basically, these models genuinely understand some things about the world. LLMs exhibit linguistic competence as they engage with subject matter to accurately respond to unseen variations in prompts of that subject matter. At least in some cases. I argue this point in some detail here[2].

>how can we say which model we got a hold of?

More sophisticated tests, ideally that can isolate exactly what was in the training data in comparison to what was generated. I think the example of the wide variety of poetry these models generate should strongly raise one's credence that they capture a sufficiently accurate model of poetry. I go into detail on this example in the link I mentioned. Aside from that, various ways of testing in-context learning can do a lot of work here[3].

[1] https://transformer-circuits.pub/2022/in-context-learning-an...

[2] https://www.reddit.com/r/naturalism/comments/1236vzf/

[3] https://twitter.com/leopoldasch/status/1638848881558704129


>> That's the beauty of autoregressive training, the model is rewarded for capturing and utilizing explanatory relations because they have an outsized effect on prediction.

That sentence should be decorated with the word "allegedly", or perhaps "conjecture"! In practical terms, I believe you are pointing out that language models of the GPT family are trained on a context surrounding, not just preceding, a predicted token. That's right (and it gets fudged in discussions about predicting the next token in a sequence), but we could already do that with skip-gram models, and with context-sensitive grammars, and dependency grammars, many years ago, and I don't remember anyone saying those were specially capable of capturing explanatory relations [1]. Although for grammars the claim could be made, since they are generally based on explanatory models of human language (but not because of context-sensitivity).

Anyway, I thought you were arguing that explanations are arbitrary, "explanatory posits", and wouldn't that mean that an explanation doesn't improve prediction? This is not to catch you in contradiction, I'm genuinely unsure about this myself. My understanding is that explanatory hypotheses improve predictions in the long run [2], but that's not to say that a predictive model will improve given explanations, rather explanatory models eventually replace strictly predictive models.

Are you saying that including explanations in training data can improve prediction? That would make sense, but this is very hard to do when training a predictive model on text. In that case, the explanations are at best hidden variables and language models are just not the right kind of model to model hidden variables.

Sorry, writing too much today. And I got work to do. So I won't bitch about "in-context learning" (what we used to call sampling from a model back in the day, three years ago before the GPT-3 paper :).

______________

[1] My Master's thesis was a bunch of language models trained on Howard Phillips Lovecraft's complete works, and separately on a corpus of Magic: the Gathering cards. One of those models was a probabilistic Context-Free Grammar, and despite its context-freedom, and because it was a Definite Clause Grammar, I could sample from it with input strings like "The X in the darkness with the Y in the Z of the S" and it would dutifully fill-in the blanks with tokens that maximised the probability of the sentence. So even my puny PCFG could represent bi-directional context, after a fashion. Yet I wouldn't ever accuse it of being explanatory. Although I would say it was quite mad, given the corpus.

[2] I mention in another comment my favourite example of the theory of epicycles compared to Kepler's laws of planetary motion.


>Anyway, I thought you were arguing that explanations are arbitrary, "explanatory posits", and wouldn't that mean that an explanation doesn't improve prediction?

I don't mean to say that explanations are arbitrary, rather that causes are not observed only inferred. But we infer causes because of the explanatory work they do. This isn't arbitrary, it is strongly constrained by predictive value as well as, I'm not sure what to call it, epistemic coherence and intelligibility maybe? Explanatory models are satisfying because they allow us to derive many phenomena from fewer assumptions. Good explanatory models are mutually reinforcing and have a high level of coherence among assumptions ("epistemic coherence"). They also require the fewest number of assumptions taken as brute without further justification ("intelligibility").

Why think explanatory models are better at prediction? Because the mutual coherence among assumptions and explanatory power of the whole (ability to predict much from few assumptions) suggests the explanatory model is getting at the productive features of the phenomena that result in the observed behavior. Essentially, the fewer number of posits, the fewer ways to "bake in" the data into the model. If we were to cast this as a computational problem, i.e. find a program that reproduces the data, shorter programs are necessarily more explanatory. There's no other way to explain the coincidence of program picked out of a small space generating data picked out of a very large space without there being an explanatory relation between the two. Further, our credence for explanation increases as the ratio of the respective spaces diverge.
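
The counting argument here can be made concrete. A minimal sketch, assuming both programs and data are encoded as bit strings (the function name is made up for illustration): there are fewer than 2^k programs shorter than k bits, and each outputs at most one string, so only a tiny fraction of all n-bit datasets can be reproduced by any short program.

```python
from fractions import Fraction

def max_fraction_reproducible(n_bits, k_bits):
    """Upper bound on the fraction of n-bit strings that some program
    shorter than k bits can output (one output per program)."""
    programs = 2 ** k_bits - 1   # bit strings of length 0 .. k-1
    datasets = 2 ** n_bits       # all possible n-bit strings
    return min(Fraction(1), Fraction(programs, datasets))

# The shorter the program relative to the data, the less likely a fit is a coincidence.
for k in (90, 50, 10):
    print(k, float(max_fraction_reproducible(100, k)))
```

The bound shrinks as the gap between data size and program size grows, which is the sense in which credence in an explanatory relation should rise with that gap.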

This is really the problem of machine learning in a nutshell. Is the data vs parameter count over some threshold such that training is biased towards explanatory relations? Is the model biased in the right way to discover these relations faster than it can memorize the data? LLMs seem to have crossed this threshold because of the massive amount of data they are trained on, seemingly much larger than can comfortably be memorized, and the inductive biases of Transformers that search the space of models to extract explanatory relations.

>Are you saying that including explanations in training data can improve prediction? That would make sense, but this is very hard to do when training a predictive model on text. In that case, the explanations are at best hidden variables and language models are just not the right kind of model to model hidden variables.

I agree with this, and I think these explanatory relations are implicit in human text. I gave the example in another comment that I say things like "I picked my cup off the floor" rather than "I picked my cup off the ceiling" because causal relations in the real world influence the text we write. The relation of "things fall down" is widely explanatory. But it seems to me that LLMs are very much general modelers of hidden variables, given the wide applicability of LLMs in areas that aren't strictly related to natural language. But then again, any structured data is a language in a broad sense. And the grammar can be arbitrarily complex and so can encode deep relationships among data in any domain. Personally, I'm not so surprised that a "language model" has such wide applicability.


>> Why think explanatory models are better at prediction? Because the mutual coherence among assumptions and explanatory power of the whole (ability to predict much from few assumptions) suggests the explanatory model is getting at the productive features of the phenomena that result in the observed behavior. Essentially, the fewer number of posits, the fewer ways to "bake in" the data into the model. If we were to cast this as a computational problem, i.e. find a program that reproduces the data, shorter programs are necessarily more explanatory. There's no other way to explain the coincidence of program picked out of a small space generating data picked out of a very large space without there being an explanatory relation between the two. Further, our credence for explanation increases as the ratio of the respective spaces diverge.

Like you say, that's the problem of machine learning. There's a huge space of hypotheses, many of which can fit the data, but how do we choose one that also fits unseen data? Explanatory models are easier to trust, and to trust to generalise better, because we can "see" why they would.

But the problem with LLMs is that they remain black boxes. If those black boxes are explanatory models, then to whom is the explanation explained? Who is there to look at the explanation, and trust the predictions? This is what I can't see and I think it turns into a "turtles all the way down" kind of situation. Unless there is a human mind, somewhere in the process, that can look at the explanatory model and use the explanation to explain some observation, then I don't see how the model can really be said to be explanatory. Explanatory to whom?

>> But it seems to me that LLMs are very much general modelers of hidden variables, given the wide applicability of LLMs in areas that aren't strictly related to natural language.

Well, I don't know. Maybe we'll find that's the case. For the time being I'm trying to keep an open mind, despite all the noise.


> Can a machine with access only to statistical patterns in the distribution of text tokens infer the physical structure of reality? We can say, as certain as anything: No.

Um. How do you square that claim with the well-known Othello paper?

https://thegradient.pub/othello/


The board state can be phrased as moves. This paper profoundly misunderstands the problem.

The issue isn't that associative statistical models of domain Z aren't also models of domain Y where Y = f(Z) -- this is obvious.

Rather there are two problems, (1) the modal properties of these models aren't right; and (2) they don't work where the target domain isn't just a rephrasing of the training domain.


>Concepts do not pre-exist concepts.

I think this is a very bold claim to make.

Each new idea/technology/concept stands on the back of all that came before it. You couldn't just pull a LLM or a dishwasher out of a hat 1000 years ago.


Right, but techniques like chain of thought reasoning can build concepts on concepts. Even if "the thing that generated the text" isn't creating new concepts, the text itself can be, because the AI has learned general patterns like reasoning and building upon previous conclusions.


> only because they contributed some magical ingredient from their souls

The fact that you turned a limitation of a specific algorithm into a call for magical powers shows quite a bit of bias on your part.


The bias is on the people asserting people have innate capabilities that are not a derivative of pattern recognition.


The argument is that humans interact with the world across many different modalities and do their statistical learning through this complex of interactions, while LLMs do their statistical learning just by what has been written (by humans) in certain internet sites.

I think it is a quite bold and philosophically poor statement to equate the "human training set" of complex interactions with the environment with what is written on the internet.


You’re arguing that the training set is different. You haven’t identified any different capabilities. What are the capabilities that make humans different?


The training sets are different in nature, not in the sense that 2 different LLMs' training sets are different. And that does not even touch that humans do not just learn from "training sets" but from interacting with the world. More like RL but not like ChatGPT's fine tuning; humans take _actions_ and they _experience_ their results in their totality, not just a "good/bad answer" feedback.

I am not saying that we cannot produce an AI with capabilities of that sort. But LLMs offer nothing at all to that direction. They can be useful in certain practical stuff, they are overhyped as hell, but they are not a step towards AGI.


You know, not all AI algorithms in use are derivative of statistical curve fitting.

But if you have some more general definition for "pattern recognition" than this, you should be perfectly able to notice that it's more general than what LLMs do.


Give an example of pattern recognition more general than what LLMs do.


> only because they contributed some magical ingredient from their souls, which didn't exist anywhere in the world's collected pre-Rust printed materials

You're focusing on the example too much. Here are more examples illustrating the question. It's doubtful that LLMs could infer solutions that lie outside their statistical models trained on existing data.

"In 1597 John Thorpe is the first recorded architect to replace multiple connected rooms with rooms along a corridor each accessed by a separate door" [1]

"Despite various publications of results where hand-washing reduced mortality to below 1%, Semmelweis's observations conflicted with the established scientific and medical opinions of the time and his ideas were rejected by the medical community. He could offer no theoretical explanation for his findings" [2]

"Button-like objects of stone, glass, bone, ceramic, and gold have been found at archaeological sites dating as early as 2000 b.c.e... One of the earliest extant pieces of clothing to show the use of buttons as fastenings is the pourpoint of Charles of Blois (c. 1319–1364)." [3]

And so on.

[1] https://en.wikipedia.org/wiki/Hallway?wprov=sfti1

[2] https://en.wikipedia.org/wiki/Ignaz_Semmelweis?wprov=sfti1

[3] https://www.encyclopedia.com/sports-and-everyday-life/fashio...


What you're saying makes sense, and I think I appreciate the point behind the examples you provided.

I think it would help your argument if you could point to such an example from the last couple of years, after the cutoff point of the LLM training data. Maybe though, nothing has been invented since then that is sufficiently unique. If there is something like that, I suppose it would be possible to try and prompt the LLM to create it. That would make your argument falsifiable and I'd be really curious to know the outcome.


The sum total of human knowledge has increased exponentially, so it's harder to come up with an example for that :)


Sorry, I don't follow at all. We can assume that every concept humans have conceived of was at some point conceived of for the first time, there's no need to list examples. But how does that relate to the claim here, that purports to constrain what outputs AIs are and aren't capable of generating?


LLMs currently statistically regurgitate existing data. An LLM in the 1600s would tell you that a house layout is "rooms connected to each other" because that would be its pre-existing data. It remains to be seen if LLMs can come up with "oh wait? we can create a passageway, and have rooms open into that" based on statistical models of pre-existing data.

Can it come up with a corridor when it has no idea that such a concept exists? That remains to be seen.


> LLMs currently statistically regurgitate existing data.

NO! They do not.

Deep learning models are "universal approximators". Any two-layer neural network with enough parameters, data and training is a universal approximator. That means they can learn ANY relationship to arbitrary accuracy.
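
A minimal sketch of the existence half of that claim, assuming a continuous target on a bounded interval and a single hidden layer of ReLU units. The weights are set by hand rather than trained, and the function names are made up for illustration, so this shows only that an accurate network exists, not that training would find it.

```python
import math

def relu(z):
    return max(0.0, z)

def build_relu_net(f, lo, hi, hidden_units):
    """Hand-set a one-hidden-layer ReLU network so that it linearly
    interpolates f at evenly spaced knots on [lo, hi]."""
    step = (hi - lo) / hidden_units
    knots = [lo + i * step for i in range(hidden_units + 1)]
    slopes = [(f(knots[i + 1]) - f(knots[i])) / step for i in range(hidden_units)]
    # Output weight of unit i is the change in slope at knot i.
    weights = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, hidden_units)]
    bias = f(lo)

    def net(x):
        # zip stops at the last hidden unit; the final knot needs no unit.
        return bias + sum(w * relu(x - k) for w, k in zip(weights, knots))

    return net

def max_error(f, net, lo, hi, samples=1000):
    xs = [lo + (hi - lo) * i / samples for i in range(samples + 1)]
    return max(abs(f(x) - net(x)) for x in xs)

# More hidden units give a smaller worst-case error on sin over [0, pi].
for units in (4, 16, 64):
    net = build_relu_net(math.sin, 0.0, math.pi, units)
    print(units, max_error(math.sin, net, 0.0, math.pi))
```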

Going beyond two layers, with several layers, problem domain structured architectures, and recurrent connections, they become far more efficient and effective.

So yes, they learn associations, correlations, stochastic models, statistics.

But they also learn to model functional relationships. Which is why they are able to generalize relationships to new situations, and combine previously unrelated relationships in reasonable and surprising ways.

A large part of creativity is putting together previously unrelated concepts and then letting the obvious logic of those relationships combine to result in something new and unexpected.

Note that both combining normally unrelated things, and combining the concepts in some way more or less consistent with what those concepts normally mean, is well within the grasp of current models.

They haven't outclassed our best thinkers. Or any of our best thinking as individuals yet. They are still very limited on problems that require many steps to think through.

But they are definitely, within their limits, being creative.

And they are far, far, FAR from just being statistical parrots.


> and combine previously unrelated relationships in reasonable and surprising ways.

We've yet to see those surprising ways despite all the claims.

Note: what they do already is amazing and surprising in itself (such as "write me a song about quantum physics suitable for a 5 year old"). It's still very much shy of "hey there's this new thing previously unthought of".


> We've yet to see those surprising ways despite all the claims.

This is the one reason everyone is finding them fascinating.

Perhaps you find them boring. Rote. Or something. But the reason non-technical people, as well as technical people, are enjoying and learning by interacting with chat and other models is how often the results are interesting.

I asked ChatGPT-4 to create a Dr. Seuss story about Cat in the Hat and my green conure parrot Teansy, that involved sewing and Italy. It produced a wonderful story of how they met in Italy, became friends, encountered a homeless child with a threadbare blanket and helped the child. Then began helping others and ended up creating a fashion design studio.

All written in Dr. Seuss prose that made for a perfect children's book.

Pretty creative.

I then asked GPT to continue the story, but as a James Bond novel where one of Teansy's mysterious clients was actually a criminal using the fashion industry to hide his nefarious practices, and that Teansy should help James Bond solve the case.

For that I got another great story, completely consistent with James Bond tropes. It came up with a story line where the fashion industry was used to launder blood diamonds, which I thought was brilliant. A perfectly good rationale for a James Bond villain. The story was great.

Throughout, Chat threw in funny suitable mentions about Teansy's fashion focuses, including feather lined wear, etc.

And all this creativity in a first draft written as fast as I could read it.

A year ago, nothing on the planet but a whimsical human with too much time on their hands (more time than it took Chat) could do this.

--

Obviously, we are discovering Chat can perform far more complex behaviors.

Act as any agent we describe including computer systems, or the internet. Respond quickly to feedback. Form plans. Learn and summarize the grammar of small artificial languages fairly well just from examples, ...

Without interacting with these models we would never have declared these were expected behaviors.

So I don't know on what basis the emergence of these behaviors isn't surprising. Hoped for, envisioned, sure. But hardly an expression of obviously predetermined designed-in capabilities.


This is all interpolation between existing concepts. It is not a counterexample.


> This is all interpolation between existing concepts.

Interpolating sounds like a simple task.

But whether it is depends entirely on the data. Simple data will result in a simple interpolating model.

But complex data requires complex relationships to be learned.

Calling a complex model just an interpolator is like saying human beings are just another bag of atoms doing what atoms do. Technically correct, but missing the significance of humans.


It also isn’t really clear to me that humans aren’t also interpolating between complex existing concepts when we come up with novel thoughts or ideas. Our minds are complex, our pre-existing knowledge base is complex. It’s impossible to know if our unique thoughts aren’t really some complex amalgamation of other thoughts we already have in there somewhere, perhaps a mashup of seemingly unrelated thoughts that just happen to lie closely in the multidimensional space of ideas to the thing we are thinking about. Sounds potentially similar to a complex LLM then, really.


> Can it come up with...when it has no idea that such a concept exists?

I was curious to see if it could.

=== Q. Invent a new statistic that mixes hessian and likelihood with score.

gpt: One potential new statistical concept that mixes Hessian and likelihood with the score function could be called "Hessian-score likelihood ratio".

The idea behind this concept is to use the Hessian matrix, which captures second-order information about the likelihood function, along with the score function, which measures the sensitivity of the likelihood function to changes in the model parameters, to construct a likelihood ratio test statistic.

Q. How do I compute this new statistic ?

gpt. To calculate this statistic, we first compute the Hessian matrix of the log-likelihood function evaluated at the maximum likelihood estimate of the parameters. We then calculate the score function evaluated at the maximum likelihood estimate of the parameters.

We can then use these quantities to construct a test statistic that compares the likelihood of the model with the estimated parameters to the likelihood of the model with an alternative set of parameters. Specifically, we take the difference between the log-likelihoods of the two models, and divide by the square root of the determinant of the Hessian matrix, multiplied by the score function.

Q. Of what use is this new statistic ?

gpt. The resulting test statistic can be used to test hypotheses about the parameters of the model, such as whether they are equal to a particular value or whether they differ across groups or conditions. The Hessian-score likelihood ratio offers a potentially more powerful and efficient way to test such hypotheses than existing methods, since it makes use of both second-order information about the likelihood function and the sensitivity of the likelihood to changes in the parameters. ====

I am not able to spot any obvious inaccuracies in the above.
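
One way to sanity-check a recipe like this is to try it on a tiny model. A minimal sketch, assuming a Bernoulli model with made-up data (7 successes in 10 trials) and function names chosen for illustration: at an interior maximum likelihood estimate the score is zero, so any step that multiplies or divides by the score evaluated there deserves a closer look.

```python
import math

def log_likelihood(p, k, n):
    return k * math.log(p) + (n - k) * math.log(1 - p)

def score(p, k, n):
    """First derivative of the log-likelihood with respect to p."""
    return k / p - (n - k) / (1 - p)

def hessian(p, k, n):
    """Second derivative (a 1x1 Hessian for this one-parameter model)."""
    return -k / p ** 2 - (n - k) / (1 - p) ** 2

k, n = 7, 10      # made-up data: 7 successes in 10 trials
p_hat = k / n     # maximum likelihood estimate

print("score at MLE:  ", score(p_hat, k, n))    # zero up to rounding
print("hessian at MLE:", hessian(p_hat, k, n))  # negative at a maximum
print("score at 0.5:  ", score(0.5, k, n))      # non-zero away from the MLE
```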


>> Can it come up with...when it has no idea that such a concept exists?

> I was curious to see if it could.

But you told it about the concept, "a new statistic that mixes hessian and likelihood with score".

You should try a different experiment. I'm more familiar with architecture than statistics so I'll use the floor plan example. Suppose someone in the 16th century had asked their LLM to address the pain points of the joined-room approach, and the LLM had then conceived of the novel concept of a corridor. [Look up the origin of the word.]

If that 16th century LLM spat out "the overall concept is to distinguish between transient and in repose spaces. There is already something similar in military architecture called 'corridor', which is a strip of land along the outer edge of a ditch. In these new floor plan designs, there will be corridors internal and peripheral to the building that will connect rooms, just like corridors (strips of land) connect lots or permit movement without stepping into a ditch".

-- can this happen? --

You: Invent a novel test statistic that can be used to test hypotheses about the parameters of the model, such as whether they are equal to a particular value or whether they differ across groups or conditions. You can combine existing statistical tools.

gpt: How about "Hessian-score likelihood ratio"? The idea behind this concept is to use the Hessian matrix, which captures second-order information about the likelihood function, along with the score function, which measures the sensitivity of the likelihood function to changes in the model parameters, to construct a likelihood ratio test statistic.


This is interpolation. And more than that, your prompt is the source of the actual novelty, little as it is.


> LLMs currently statistically regurgitate existing data.

This is clearly not true in any meaningful sense - c.f. the Othello paper, examples from the top of this very comment thread, etc.

> Can it come up with a corridor when it has no idea that such a concept exists?

Unless I'm missing something, the person I replied to is claiming that it categorically cannot come up with a concept it hasn't been trained on. I'm disagreeing - if a model knows about rooms and doors and floorplans, there's no obvious reason why it mightn't think up an arrangement of those things that would be novel to the people who trained it. If you think the matter remains to be seen, then I'm not sure what you disagree with me about.


In my experience, it can certainly be coaxed into discussing novel concepts that transcend existing knowledge. I'm having fun getting it to explain what a hybrid of a Nelson Enfilade data structure combined with a tensegrity data structure is and if that system is novel and brings any benefits, very interesting and novel afaik.


It seems like every time someone says that it's doing something novel, they present an example of interpolation between existing concepts.

This is useful, but the source of novelty here is the prompt; the rest is the work of interpolation.

This is all very reminiscent of image generation. There too, novelty is limited to interpolation.


yes, but isn't that in itself novel, what is it that you want the system to do?


> if a model knows about rooms and doors and floorplans, there's no obvious reason why it mightn't think up an arrangement of those things that would be novel to the people who trained it.

Once again, you're missing the point.

In the 16th century people also knew about floors, and rooms, and floorplans. And yet the first architect to use a corridor did so in 1597.

What other "corridors" are missing from LLMs' training data? And we're sure it can come up with such a missing concept?

The Othello paper and the examples (are you referring to the example of coming up with new words?) are doing the same thing: they feed the model well-defined pre-established rules that can be statistically combined. The "novel ideas" are not even nearly novel because, well, they follow the established rules.

Could the model invent reversi/othello had it not known about it beforehand? Could the model invent new words (or a new language) had it not known about how to do that beforehand (there's plenty of research on both)? Can it satisfactorily do either even now (for some definition of satisfactorily)?

People believe it can only because the training set is quite vast and the work done is beyond any shadow of the doubt brilliant. That is why the invention of new words seems amazing and novel to many people while others even with a superficial armchair knowledge of linguistics are nonplussed. And so on.


> Could the model invent reversi/othello had it not known about it beforehand?

You've practically restated the paper's findings! :D The LLM knew nothing about othello; it wasn't shown any rules to be recombined. It was shown only sequences of 60 distinct tokens - effectively sentences in an unknown language. The LLM then inferred a model to predict the grammar of that language, and the authors demonstrated that its model functioned like an othello board.
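
To make the "60 tokens" point concrete, here is a minimal sketch of the hidden state those token sequences depend on. This is not the paper's code (the paper trains a GPT variant); it is a plain rules simulator, with names chosen for illustration, showing that which token may legally come next is a function of a board that never appears in the transcript.

```python
DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
CENTER = {"d4", "e4", "d5", "e5"}
# One token per playable square: 64 squares minus the 4 pre-filled ones.
TOKENS = [c + r for r in "12345678" for c in "abcdefgh" if c + r not in CENTER]

def to_rc(token):
    return int(token[1]) - 1, "abcdefgh".index(token[0])

def to_token(row, col):
    return "abcdefgh"[col] + str(row + 1)

def new_board():
    board = [[None] * 8 for _ in range(8)]
    for token, colour in (("d4", "W"), ("e5", "W"), ("d5", "B"), ("e4", "B")):
        r, c = to_rc(token)
        board[r][c] = colour
    return board

def flips(board, row, col, player):
    """Discs that would be flipped if `player` moved at (row, col)."""
    if board[row][col] is not None:
        return []
    opponent = "W" if player == "B" else "B"
    flipped = []
    for dr, dc in DIRECTIONS:
        line, r, c = [], row + dr, col + dc
        while 0 <= r < 8 and 0 <= c < 8 and board[r][c] == opponent:
            line.append((r, c))
            r, c = r + dr, c + dc
        # A run of opponent discs only counts if one of our own discs closes it.
        if line and 0 <= r < 8 and 0 <= c < 8 and board[r][c] == player:
            flipped.extend(line)
    return flipped

def legal_tokens(board, player):
    return sorted(to_token(r, c) for r in range(8) for c in range(8)
                  if flips(board, r, c, player))

def play(board, token, player):
    row, col = to_rc(token)
    for r, c in flips(board, row, col, player) + [(row, col)]:
        board[r][c] = player

board = new_board()
print(len(TOKENS), legal_tokens(board, "B"))
play(board, "d3", "B")
print(legal_tokens(board, "W"))
```

The set of legal next tokens changes after every move, so a model that predicts them well has to track something equivalent to `board`.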


> You've practically restated the paper's findings! :D The LLM knew nothing about othello; it wasn't shown any rules to be recombined.

Literal quote from the paper:

"As a first step, we train a language model (a GPT variant we call Othello-GPT) to extend partial game transcripts (a list of moves made by players) with legal moves."

And then:

"Nonetheless, our model is able to generate legal Othello moves with high accuracy".

So:

- it knows about the game because it was literally shown the game with only the legal moves

- it doesn't produce legal moves all the time (even though it does so with high accuracy)

That's why I say "the work done is beyond any shadow of the doubt brilliant". Because this is a definite leap forward from the status quo. However, it doesn't imply that the models can invent/predict/come up with novel ways of doing something. This is still strictly within the realm of "given existing data, give back a statistically relevant response".

Could it actually invent Reversi/Othello had it not known about it beforehand?


> it was literally shown the game with only the legal moves

It's shown token sequences only. It has no idea they represent a game, or that the game has legal and illegal moves. And more importantly, it has no idea that each token modifies the state of a gameboard, or that simulating how that gameboard changes after every token is the only way to understand the token's grammar. It invents all that.

> Could it actually invent Reversi/Othello had it not known about it beforehand?

You mean, could an LLM invent othello even if its training material made no mention of the game or its rules? Presumptively, of course - why not? Suppose you go make up an arbitrary board game right now. If you then ask ChatGPT-4 to invent a boardgame of its own, nothing excludes the possibility that it will describe a game isomorphic to yours. Obviously the odds are very low, but why imagine that it's not possible?


You're presenting an example of inference of rules from given data as a counterexample for novelty. They're not even in the same category of thing. Invention is not learning. Sometimes invention is interpolation, but sometimes it isn't: corridors are an interesting example, because they are not obviously a remix of anything.


No, I presented it as a counterexample to the claim that LLMs just statistically regurgitate existing data.


I am not convinced by this argument. It is very misleading to think that, since GPT is trained on data from the world, it must, necessarily, always produce an average of the ideas in the world. Humans have formulated laws of physics that "minimize loss" on our predictions of the physical world that are later experimentally determined to be accurate, and there's no reason to assume a language model trained to minimize loss on language won't be able to derive similar "laws" that simulate human behavior.

In short, GPT doesn't just estimate text by looking at frequencies. GPT works so well by learning to model the underlying processes (goal-directedness, creativity, what have you) that create the training data. In other words, as it gets better (and my claim is it has already gotten to the point where it can do the above), it will be able to harness the same capabilities that humans have to make something "not in the training set".

Check out https://generative.ink/posts/simulators/ for a better treatment of this topic than I could possibly give.

Here's a relevant section of said article:

> Guessing the right theory of physics is equivalent to minimizing predictive loss. Any uncertainty that cannot be reduced by more observation or more thinking is irreducible stochasticity in the laws of physics themselves – or, equivalently, noise from the influence of hidden variables that are fundamentally unknowable.

> If you’ve guessed the laws of physics, you now have the ability to compute probabilistic simulations of situations that evolve according to those laws, starting from any conditions. This applies even if you’ve guessed the wrong laws; your simulation will just systematically diverge from reality.

> Models trained with the strict simulation objective are directly incentivized to reverse-engineer the (semantic) physics of the training distribution, and consequently, to propagate simulations whose dynamical evolution is indistinguishable from that of training samples. I propose this as a description of the archetype targeted by self-supervised predictive learning, again in contrast to RL’s archetype of an agent optimized to maximize free parameters (such as action-trajectories) relative to a reward function.


Even very simple and small neural networks that you can easily train and play with on your laptop readily show that this “outputs are just the average of inputs” conception is just wrong. And it’s not wrong in some tricky philosophical sense, it’s wrong in a very clear mathematical sense, as wrong as 2+2=5. One example that’s been used for something like 15+ years is in using the MNIST handwritten digits dataset to recognize and then reproduce the appearances of handwritten digits. To do this, the model finds regularities and similarities in the shapes of digits and learns to express the digits as combinations of primitive shapes. The model will be able to produce 9s or 4s that don’t quite look like any other 9 or 4 in the dataset. It will also be able to find a digit that looks like a weird combination of a 9 and a 2 if you figure out how to express a value from that point in the latent space. It’s simply mathematically naive to call this new 9-2 hybrid an “average” of a 9 and a 2. If you averaged the pixels of a 9 image and a 2 image you would get an ugly nonsense image. The interpolation in the latent space is finding something like a mix between the ideas behind the shape of 9s and the shape of 2s. The model was never shown a 9-2 hybrid during training, but its 9-2 will look a lot like what you would draw if you were asked to draw a 9-2 hybrid.
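
The difference between averaging pixels and interpolating in a latent space can be shown with a toy stand-in. A minimal sketch, assuming a hand-written decoder instead of a trained MNIST model: the "image" is 11 pixels wide, and the single latent value is the position of a bump (all names here are made up for illustration).

```python
import math

WIDTH = 11

def decode(position):
    """Hypothetical decoder: latent code -> 1-D 'image' with one bump."""
    return [round(math.exp(-((x - position) ** 2)), 3) for x in range(WIDTH)]

def peaks(image):
    """Indices of strict local maxima."""
    return [i for i in range(1, len(image) - 1)
            if image[i] > image[i - 1] and image[i] > image[i + 1]]

left, right = 2.0, 8.0

# Averaging pixels: two faint bumps, an image unlike either input.
pixel_average = [(a + b) / 2 for a, b in zip(decode(left), decode(right))]

# Averaging latent codes, then decoding: one full-height bump in between.
latent_midpoint = decode((left + right) / 2)

print("pixel average peaks:  ", peaks(pixel_average), max(pixel_average))
print("latent midpoint peaks:", peaks(latent_midpoint), max(latent_midpoint))
```

The pixel average is the "ugly nonsense image" case; the latent midpoint is a well-formed image that was never among the inputs.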

A big LLM is something like 10 orders of magnitude bigger than your MNIST model and the interpolations between concepts it can make are obviously more nuanced than interpolations in latent space between 9 and 2. If you tell it to write about “hubristic trout” it will have no trouble at all putting those two concepts together, as easily as the MNIST model produced a 9-2 shape, even though it had never seen an example of a “hubristic trout.”

It is weird because all of the above is obvious if you’ve played with any NN architecture much, but seems almost impossible to grasp for a large fraction of people, who will continue to insist that the interpolation in latent space that I just described is what they mean by “averaging”. Perhaps they actually don’t understand how the nonlinearities in the model architecture give rise to the particular mathematical features that make NNs useful and “smart”. Perhaps they see something magical about cognition and don’t realize that we are only ever “interpolating”. I don’t know where the disconnect is.


I think a partial explanation is that people don't move away from parametric representations of reality. We simply must be organized into a nice, neat gaussian distribution with very easy to calculate means and standard deviations. The idea that organization of data could be relational or better handled by a decision tree or whatever is not really presented to most people in school or university. Especially not as frequently or holistically as is simply thinking the average represents the middle of a distribution.

You see this across the social sciences, where a lot of fields have had papers come out every decade or so since the 1980s saying that linear regression models are wrong because they don't take into account several concepts such as hierarchy (e.g., students go to different schools), frailty (there are likely unmeasured reasons why some people do the things they do), latent effects (there are likely non-linear processes that are more than the sum of the observations, e.g., traffic flows like a fluid and can have turbulence), auto-correlations/spatial correlations/etc.

In fact, I would argue that a decision tree based model (i.e., gradient boosted trees) will always arrive at a better solution to a human system than any linear regression. But at this point I suppose I have digressed from the original point.


I confess to the same mirror image issue. I cannot understand why people insist that regressing in a latent space, derived from the mere associative structure of a dataset, ought to be given some noble status.

It is not a model of our intelligence. It's a stupid thing. You can go and learn about animal intelligence -- and merging template cases of what's gone before, as recorded by human social detritus -- doesn't even bear mentioning.

The latent space of all the text tokens on the internet is not a model of the world; and finding a midpoint is just a trick. It's a merging between "stuff we find meaningful over here", and "stuff we find meaningful over there" to produce "stuff we find meaningful" -- without ever having to know what any of it meant.

The trick is that we're the audience, so we'll find the output meaningful regardless. Image generators don't "struggle with hands" they "struggle" with everything -- it is we, the observers, who care more about the fidelity of hands. The process of generating pixels is uniformly dumb.

I don't see anything more here than "this is the thing that I know!" therefore "this is a model of intelligence!11.11!01!!" .

It's a very very bad model of intelligence. The datasets involved are egregious proxy measures of the world whose distribution has little to do with it: novels, books, pdfs, etc.

This is very far away from the toddler who learns to walk, learns to write, and writes what they are thinking about. They write about their day, say -- not because they "interpolate" between all books ever written... but because they have an interior representational life which is directly caused by their environment and can be communicated.

Patterns in our communication are not models of this process. They're a dumb light show.


I feel like our positions are probably both buried in webs of mutually-difficult-to-communicate worldview assumptions, but for what it’s worth, I care more at this point about the models being useful than being meaningful. I use GPT-4 to do complex coding and copy editing tasks. In both cases, the model understands what I’m going for. As in, I had some specific, complex, nuanced, concept or idea that I want to express, either in text or in code, and it does that. This can’t be me “projecting meaning” onto the completions because the code works and does what I said I wanted. You can call this a light show, but you can’t make it not useful.


> because the code works

The output of these systems can have arbitrary properties.

Consider an actor in a film, their speech has the apparent property, say, of "being abusive to their wife" -- but the actor isn't abusive, and has no wife.

Consider a young child reading from a chemistry textbook, their speech has apparent property "being true about chemistry".

But a professor of chemistry who tells you something about a reaction they've just performed, explains how it works, etc. -- this person might say identical words to the child, or the AI.

But the reason they say those words is radically different.

AI is a "light show" in the same way a film is: the projected image-and-sound appears to have all sorts of properties to an audience. Just as the child appears an expert in chemistry.

But these aren't actual properties of the system: the child, the machine, the actors.

This doesn't matter if all you want is an audiobook of a chemistry textbook, to watch a film, or to run some generated code.

But it does matter in a wide variety of other cases. You cannot rely on apparent properties when, for example, you need the system to be responsive to the world as-it-exists unrepresented in its training data. Responsive to your reasons, and those of other people. Responsive to the ways the world might be.

At this point the light show will keep appearing to work in some well-trodden cases, but will fail catastrophically in others -- for no apparent reason a fooled-audience will be able to predict.

But predicting it is easy -- as you'll see, over the next year or two, ChatGPT's flaws will become more widely known. There are many papers on this already.


>> I feel like our positions are probably both buried in webs of mutually-difficult-to-communicate worldview assumptions, but for what it’s worth, I care more at this point about the models being useful than being meaningful.

The question is how useful they are. With LLMs it seems they can be useful as long as you ask them to do something that a human, or another machine (like a compiler) can verify, like your example of synthesising a program that satisfies your specification and compiles.

Where LLMs will be useless is in tasks where we can't verify their output. For example, I don't hear of anyone trying to get GPT-4 to decode Linear A. That would be a task of significant scientific value, and one that a human cannot perform, unlike generating text or code, which humans can already do pretty damn well on their own.


>> Guessing the right theory of physics is equivalent to minimizing predictive loss.

A model can reduce predictive loss to almost zero while still not being "the right theory" of physics, or anything else. That is a major problem in science, and machine learning approaches don't have any answer to it. Machine learning approaches can be used to build more powerful predictive models, with lower error, but nothing tells us that one such model is, or even isn't, "the right theory".

As a very famous example, or at least the one I hold as a classic, consider the theory of epicyclical motion of the planets [1]. This was the commonly accepted model of the motion of the observable planets for thousands of years. It persisted because it had great predictive accuracy. I believe alternative models were proposed over the years, but all were shot down because they did not approach the accuracy of the theory of epicycles. Even Copernicus' model, that is considered a great advance because it put the Sun in the center of the universe, continued to use epicycles and so did not essentially change the "standard" model. Eventually, Kepler came along, and then Newton, and now we know why the planets seem to "double back" on themselves. And not only that, but we can now make much better predictions than we ever could do with the epicyclical model, because now we have an explanatory model, a realist model, not just an instrumentalist model, and it's a model not just of the observable motion of the planets but a model of how the entire world works.

As a side point, my concern with neural nets is that we get "stuck in a rut" with them, because of their predictive power, like we got stuck with the epicyclical model, and that we spend the next thousand years or so in a rut. That would be a disaster, at this point in our history. Right now we need models that can do much more than predict; we need models that are theories, that explain the world in terms of other theories. We need more science, not more modelling.

_________

[1] https://en.wikipedia.org/wiki/Deferent_and_epicycle


> Guessing the right theory of physics is equivalent to minimising predictive loss.

No it's not. It's minimising "predictive loss" only under extreme non-statistical conditions imposed on the data.

The world itself can be measured an infinite number of ways. There are an infinite number of irrelevant measures. There are an infinite number of low-reliability relevant measures. And so on.

Yes, you can formulate the extremely narrow task of modelling "exactly the right dataset" as loss minimization.

But you cannot model the production of that dataset this way. Data is a product of experiments.


This is just you declaring "no you can't" without supporting that in any way.

How is a theory of physics not a loss minimisation process? The history of science is literally described in these terms i.e. the Bohr model of the atom is wrong, but also so useful that we still use it to describe NMR spectroscopy.

Why did we come up with it? Because there aren't infinite ways to measure the universe, there are in fact very limited ways defined by our technology. Good ones, high loss minimisation, generally then let us build better technology to find more data.

You're invoking infinities which don't exist as a handwave for "understanding is a unique part of humanity" to try and hide that this is all metaphysical special pleading.


Alright...

What loss was being minimised to find F=GMm/r^2? Or any law of physics you like.


Gravitation was literally about predicting future positions of the stars, and was successful because it did so much better than any geocentric model. How is that not a loss minimization activity?

And before we had it, epicycles were steadily increasing in complexity to explain every new local astronomical observation, but that model was popular because it gives a very efficient initial fit of the easiest data to obtain (i.e. the moon actually does go around the Earth, and with only 1 reference point the Sun appears to go round the Earth too). But of course once you have a heliocentric theory, you can throw away all those parameters and every new prediction lines up nearly perfectly (accounting for how much longer it would take before we had precise enough orbital measurements to need Relativity to fully model it).
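For anyone who hasn't seen why epicycles fit the data at all, here is a rough sketch of a single deferent plus one epicycle, with Earth at the origin. The radii and angular speeds are made-up illustrative values, not historical parameters. It shows that the apparent direction of the "planet" sometimes runs backwards, which is the retrograde motion the model was built to reproduce.

```python
import cmath
import math


def epicycle_position(t, deferent_radius=1.0, deferent_speed=1.0,
                      epicycle_radius=0.5, epicycle_speed=4.0):
    """Planet position as a complex number: a point on a small circle
    (epicycle) whose centre moves on a large circle (deferent).
    All four parameters are illustrative assumptions."""
    centre = deferent_radius * cmath.exp(1j * deferent_speed * t)
    offset = epicycle_radius * cmath.exp(1j * epicycle_speed * t)
    return centre + offset


def angular_rate(t, dt=1e-6):
    """Numerical rate of change of the planet's apparent direction,
    as seen from the origin (central difference)."""
    before = epicycle_position(t - dt)
    after = epicycle_position(t + dt)
    return cmath.phase(after / before) / (2 * dt)


# Epicycle pointing outward: the planet moves forward across the sky.
prograde = angular_rate(0.0)

# Epicycle pointing inward: the planet appears to move backwards.
retrograde = angular_rate(math.pi / 3)
```

Adding more epicycles is just adding more rotating terms to `epicycle_position`, which is why the fit could always be improved by adding parameters.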


When the law of gravitation was formulated, it could not in fact be used to predict orbits reliably (Kepler's ellipses are the solution to the two body problem anyways, and for a more complex system integration was impossible to any useful precision at the time), and Kepler's theories came out long before it did.

It took more than 70 years after its formulation for the law to actually be conclusively tested against observations.


Also note that Copernicus' heliocentric model retained the geocentric model's epicycles on circular orbits. It really took Kepler to make a better model. And it was better because it was explanatory to boot, and not only predictive.

At some point, the metaphor of "loss minimisation" starts to break down. When we're talking about science, there's much more we want to do than minimise some loss function- that nobody has ever written down anyway. We want to be able to say "this is how the world works". The language of function optimisation is simply not the right language to do anything like that.

Even Vladimir Vapnik turned to poetry to try and increase the information available to statistical learners. Let me see if I can find that paper...


Sure but it was a better fit, and before that heliocentric models were definitely the only way forward that didn't keep adding terms every time someone spotted a moon.

Occam's razor - do not multiply terms without necessity - is essentially a loss function.


You're talking about Kepler's model here, not about the gravitational equation. The gravitational equation was not a better fit than Kepler at that time, especially since it used unknown constants.


So would you care to comment on how this relates to the original contention, which is the claim that a loss function could not discover Newton's law of gravitation?

Because what you're arguing, extensively, is that due to lack of fit, Newton's Law of Gravitation wasn't settled science until observational data was of sufficient fidelity to clearly distinguish it.

Which sure sounds like a loss function.


Formulate the loss function -- you'll find it's just

    loss(the-right-answer(perfect-x) - perfect-y)

The most important aspect of "the-right-answer" is its ability to ignore almost all the data.

The existence of planets is "predictable" from the difference between the data and the theory -- if the theory is just a model of the data, it has no capacity to do this.

If you want to "do physics" by brute force optimization you'd need to have all possible measures, all possible data, and then a way of selecting relevant causal structures in that data -- and then able to try every possible model.

    loss(Model(all-data|relevant-causal-structures) - Filter(...|...)) forall Model

Of course, (1) this is trivially not computable (equivalent to computing the reals); (2) "all possible data with all possible measures" doesn't exist; and (3) selecting relevant causal structure requires having a primitive theory not derived from this very process.

Animals solve this in reverse order: (3) is provided by the body's causal structure; (2) is obtained by using the body to experiment; and (1) we imagine simulated ways-the-world-might-be to reduce the search space down to a finite size.

ie., we DO NOT make theories out of data. We first make theories then use the data to select between them.

This is necessary, since a model of the data (ie., modern AI, ie., automated statistics, etc.) doesn't decide between an infinite number of theories of how the data came to be.


> ie., we DO NOT make theories out of data. We first make theories then use the data to select between them.

No we don't, we make hypotheses and then test them. Hypotheses are based on data.

There are physics experiments being done right now where the exact hope is that existing theory has not predicted the result they produce, because then we'd have data to hypothesize something new.[1]

You are literally describing what deep learning techniques are designed to do while claiming they can't possibly do it.

[1] https://www.scientificamerican.com/article/measurement-shows...


Hypotheses are "based" on data in the sense that via imagination we simulate ways the world might be, and then "data" is a clue to a contradiction.

Deep learning models are data: they are just associations between points.

Train a NN on data generated from an exponential function, and the model produced is not exponential.

Train a NN on the covid pandemic, and you will never obtain the SIR model.

AI is just associative statistical modelling. The model is the data.
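One narrow, checkable reading of the exponential claim, assuming a one-hidden-layer network with ReLU activations (other architectures behave differently): such a network is piecewise linear, so however well it matches the training points, it extends as a straight line beyond them. The sketch below sets the weights analytically so the network passes exactly through samples of exp(x); it is not trained by gradient descent, and `build_relu_interpolant` is a hypothetical helper written for this illustration.

```python
import math


def relu(z):
    return max(0.0, z)


def build_relu_interpolant(knots):
    """Return a one-hidden-layer ReLU network that passes exactly through
    (x, exp(x)) at every knot. Weights are chosen by hand, not learned."""
    ys = [math.exp(x) for x in knots]
    slopes = [(ys[k + 1] - ys[k]) / (knots[k + 1] - knots[k])
              for k in range(len(knots) - 1)]
    # The output weight of hidden unit k is the change in slope at knot k.
    out_weights = [slopes[0]] + [slopes[k] - slopes[k - 1]
                                 for k in range(1, len(slopes))]

    def net(x):
        hidden = [relu(x - knot) for knot in knots[:-1]]
        return ys[0] + sum(w * h for w, h in zip(out_weights, hidden))

    return net


knots = [0.0, 1.0, 2.0, 3.0]
net = build_relu_interpolant(knots)

# Perfect fit on the "training" points...
fit_errors = [abs(net(x) - math.exp(x)) for x in knots]

# ...but linear, not exponential, growth outside them.
extrapolated = net(6.0)
true_value = math.exp(6.0)
```

Comparing `extrapolated` with `true_value` shows the gap. Whether that gap says anything about "theories" in general is the actual disagreement in this thread; the code only illustrates the extrapolation behaviour.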


I know this discussion is a bit old at this point, but I came across this[1] essay for the first time today, and this shows more of what I was trying to get across earlier in the thread. Hopefully you'll find it interesting. Essentially, they trained a GPT on predicting the next move in a game of Othello, and by analyzing the weights of the network, found that the weights encode an understanding of the game state. Specifically, given an input list of moves, it calculates the positions of its own pieces and that of the opponent (a tricky task for a NN given that Othello pieces can swap sides based on moves made on the other side of the board). Doing this allowed it to minimize loss. By analogy, it formed a theory about what makes moves legal in Othello (in this case, the positions of each player's pieces), and found out how to calculate those in order to better predict the next move.

[1] https://www.neelnanda.io/mechanistic-interpretability/othell...


Proving any given AI architecture can't do something doesn't prove all AI architectures forever will never be able to do something. Neural networks aren't all AI, they're not even "neural networks" since the term wraps up a huge amount of architectural and design choices and algorithms.

Unless you believe in the soul, then the human brain is just a very complicated learning architecture with a specific structure (which we freely know doesn't operate like existing systems...sort of, of course we also don't know that it's not just a convoluted biological path to emulating them for specific subsystems either).

But even your original argument is focused on just playing with words to remove meaning: calling something data doesn't meaningfully make your point, because mathematical symbols are just "data" as well.

Mathematics has no requirement to follow any laws you think it does - 1 + 1 can mean whatever we want, and it's a topic of discussion as to why mathematics describes the physical world at all - which is to say, it's valid to say we designed mathematics to follow observed physics.


The whole point is that Newton came up with the law before there was observational data that could prove it, which is fundamentally different from regression. The data is used to reject the theory, not to form it, here.


I get the feeling that the OP is using "loss function" in the figurative sense, and not in the sense of an actual loss function that is fit to observations. We know nobody did that in Newton's time. In Newton's time they didn't even have the least squares method, let alone fit a model to observations by optimising a loss function.


Yes, I'm also using it in the figurative sense. It's not a regression model: the models are developed and then the data is sought out to refute them. It's the reverse for a regression technique. The model being generated before the data that can support it is a big part of how humans come up with these models and it's fundamentally different in many ways.


What are you talking about? If scientific models aren't developed based on data, then what are they developed based on? Divine inspiration?

No. Very obviously no. The multi-post diversion about Kepler's laws is explicitly evidence to the contrary since Kepler's laws are a curve fitting exercise which matches astronomical data in a specific context but doesn't properly describe the underlying process - i.e. their predictive power vanishes once the context changes. But they do simplify down to Newton's Law once the context is understood.

New data is sought out for models to determine whether they are correct because a correct model has to explain existing data and predict future data. The Bohr Model of the atom was developed because it explained the emission spectra of hydrogen well. It's not correct because it doesn't work for anything but hydrogen...but it's actually correct enough that if you're doing nuclear magnetic resonance (which is very hydrogen-centric for organic molecules) then it is in fact good enough to predict and understand spectra with (at least in 1D, 3D protein structure prediction is its own crazy thing).

This is the entire point of deep learning techniques. The whole idea of latent space representations is that they learn underlying structural content of the data which should include observations about reality.


That's not how the scientific process works. You use your intuition to make a theory, sometimes loosely based on data, and then you come up with an experiment to test it.

We both agree that Kepler was trying to fit curves. But that's not what Newton was trying to do. Newton was trying to explain. Newton's model did not fit the data better than Kepler's model until far after they both died.

Newton's model, to Newton, had more loss than Kepler's model.

But it turned out 70 years later that Newton's model was better, because it's only then that there was any data for which it was a better prediction.

You're similarly wrong about Bohr. If all you were interested in was finding the emission spectra of hydrogen, there's absolutely no reason you'd try to come up with the Bohr model. Why? Because Rydberg already made a formula that predicted the emission spectra of Hydrogen, 25 years earlier.
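For reference, the Rydberg formula mentioned above is 1/λ = R_H (1/n₁² − 1/n₂²). A small sketch of using it as a pure curve fit, with no atomic model behind it; the constant is the commonly quoted Rydberg constant for hydrogen, and the function name is just for this illustration.

```python
# Rydberg constant for hydrogen in inverse metres (standard reference value).
RYDBERG_HYDROGEN = 1.0967758e7


def rydberg_wavelength_nm(n_low, n_high):
    """Vacuum wavelength in nanometres of the hydrogen line for a
    transition from level n_high down to level n_low."""
    inverse_wavelength = RYDBERG_HYDROGEN * (1 / n_low**2 - 1 / n_high**2)
    return 1e9 / inverse_wavelength


# Balmer series (n_low = 2): the visible hydrogen lines.
balmer = [rydberg_wavelength_nm(2, n) for n in (3, 4, 5, 6)]
```

The first two entries land within a fraction of a nanometre of the measured H-alpha and H-beta lines, which is the sense in which the formula already "predicted the spectra" before Bohr explained where the integers come from.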

The entire point of Bohr's model and of Newton's model is that they weren't empirically better at predicting the phenomena. Indeed, simple curve fitting came up with equations that are far better in practice, earlier.

But they were better at explaining the phenomena.

And that only became relevant because after we had these models, we came up with new experiments, informed by these models, which helped us understand them and eventually push them beyond the breaking point.

It's not a curve fitting experiment. We already had better curve fitting models far before either of those was invented. If your goal was to reduce the loss, they'd be useless and there would be no point coming up with them.

That's the difference between the scientific method and mere regression.


> That's not how the scientific process works. You use your intuition to make a theory,

Go ahead and define what "intuition" is? Why do people have it? Why is some people's intuition better than others?


(Not the OP) We don't know how the human mind works, or how "intuitions" or "inspiration" come about, but that's no reason to call them "metaphysics". Clearly, they are physical processes that somehow take place in the human brain.

The questions you ask in this comment are good questions, for which we have no good answers. That doesn't mean there's anything supernatural going on, or that anyone is assuming something supernatural is happening. We just don't know how human scientists come up with new hypotheses, that's all there is to it.

But it's not like there's some kind of principled way to do it. There's no formulae, no laws, where we can plug in some data and out pops a hypothesis ready for the testing. Maybe we will find how to define such laws or formulae at some point, but for now, all we got is some scientist waking up one day going "holy cow, that's it!". And then spending the next ten years trying to show that's what it really is.


To clarify, the OP is pointing out that it wasn't Newton's law of universal gravitation that defeated the epicyclical model of the cosmos.

It was Kepler's laws of planetary motion that did for epicycles; and that happened 70 ish years before Newton stated his laws of motion and pointed out that they basically subsume Kepler's laws of planetary motion.
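As a worked illustration of "subsume": for a planet of negligible mass, Newton's inverse-square law gives Kepler's third law in the form T = 2π·sqrt(a³/GM). The sketch below plugs in standard reference values for the Sun's gravitational parameter and the astronomical unit; the 1.524 AU figure for Mars is a rounded semi-major axis used only for illustration.

```python
import math

GM_SUN = 1.32712440018e20   # Sun's gravitational parameter, m^3 / s^2
AU = 1.495978707e11         # astronomical unit, m
SECONDS_PER_DAY = 86400


def orbital_period_days(semi_major_axis_m, gm=GM_SUN):
    """Orbital period from Newtonian gravity, assuming the orbiting body's
    mass is negligible compared with the central body's."""
    period_s = 2 * math.pi * math.sqrt(semi_major_axis_m**3 / gm)
    return period_s / SECONDS_PER_DAY


earth = orbital_period_days(1.0 * AU)
mars = orbital_period_days(1.524 * AU)

# Kepler's third law falls out: T^2 is proportional to a^3.
period_ratio_squared = (mars / earth) ** 2
axis_ratio_cubed = 1.524 ** 3
```

Kepler found the T² ∝ a³ relation empirically; here it appears as a consequence of the force law rather than as something fitted to planetary tables.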


>Now, AI here is capable of generating a circle with radius 0.15 and hence that circle is "not in the original dataset".

The fact that it can generate a circle with a radius of .15 rather than, say, some smushed transformation of an existing circle demonstrates that it properly decomposed the concept of circle into a radius and equidistance around a central point. This is plainly an example of generating novelty from iterating over variations of its conceptualization of circle. But this is no different than what people do. Nothing we generate is sui generis.


You explained this very well with the point that the model necessarily has that "blatant" novelty in order to be useful as more than a quote engine in the first place.

That's a good way to explain the bias too. You can see it now if you ask about Michael Levin's work, which is spreading now in biology but is still somewhat outweighed by older views on the formation of organs during growth, and the extent of possibilities with bioelectric/genetic engineering (e.g. two-headed animals). The models often don't even consider or accept that the research Levin reported would be even possible (other times, they can, but I think it serves as a good warning light that this bias can dangerously act as a permanent anchor).


> For sure it can find circles it hasnt seen; but could it find spheres from circles alone? No.

Considering that we control the training data that should be easy enough to test.


You can’t find spheres from circles alone if you don’t know what a sphere is though. You can however ask it to analyze a novel object (sphere) for patterns or hypothesize what kind of properties a sphere might have.


I ran your exact prompt, this is what I got:

> Zykluptogibbera

> Quixilantrofen

> Zylprenostim

Satisfies the prompt sure enough, but they're hardly creative. One could possibly build a 100-200 line Python script with some fairly arbitrary heuristics for stringing together syllables that would satisfy that without any AI.
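To give an idea of what such a script might look like, here is a much shorter sketch than the 100-200 lines estimated above. The syllable lists are arbitrary choices made for this illustration, and nothing here checks Google, so novelty is likely for long words but not guaranteed.

```python
import random

# Arbitrary building blocks; any plausible-sounding fragments would do.
ONSETS = ["b", "cr", "d", "fl", "gr", "k", "m", "pr", "qu", "s", "tr", "v", "z"]
VOWELS = ["a", "e", "i", "o", "u", "ai", "ou"]
CODAS = ["", "", "l", "m", "n", "r", "x", "st"]


def invent_word(rng, syllables=3):
    """String together random onset + vowel + coda syllables."""
    parts = []
    for _ in range(syllables):
        parts.append(rng.choice(ONSETS) + rng.choice(VOWELS) + rng.choice(CODAS))
    return "".join(parts).capitalize()


rng = random.Random(42)  # fixed seed so runs are repeatable
words = [invent_word(rng) for _ in range(3)]
print(words)
```

What this cannot do is attach a meaning or an etymology to the word, which is closer to what the "creative" versions of the task are asking for.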

Appended: "Make it sound as plausible as possible"

> Flarishimble

> Quindextrove

> Crombifilate

Still not amazed. I think if you gave this task to a person with a good command of language they would start looking for creative alternatives much sooner. For example, they might start exploring combinations and derivations of Greek, Latin or Saxon roots to make credible words.

I'm not saying it's not creative, for the record. I'm just skeptical, and I don't think this task proves it.

Edit: for reference, ran all of the above on ChatGPT4.


It is very interesting that you get these "bad" results from my prompt. I just tried again on a completely new chat and got the following answer:

"Here's a new word: "Aurovexia"

Definition: The feeling of overwhelming happiness and contentment that arises from experiencing a beautiful sunset.

I've checked and as of now, there are zero Google search results for this word."

Which at least in my view is quite creative. It combines Aurora, the Latin word for dawn (note that ChatGPT mixed up dawn and dusk here), with a newly coined suffix, -vexia; at least, I could not find anything related during my research. Again, there are zero results on Google for this right now. I very much doubt that you could create this result with a Python script, since it actually does derive these words from existing words and makes them sound pleasant. I'm just a bit surprised it is so different for you. If you used my exact prompt, I am not sure where the difference comes from. Are you using GPT4? My prompts were used on GPT3.5 (if this is still the current version of the free ChatGPT version).


I get similar results to you with 4

Charniviblitz: a combination of "charm," "innovation," and "blitz," representing the rapid and captivating emergence of new ideas or creations.

Sure, here's a new word for you: "Splendivifitabulus". This word is a combination of "splendid", "vivify", and "fabulous". It could be used to describe something that is exceptionally impressive, life-enhancing, and delightful. Keep in mind that this word is not guaranteed to yield zero results on Google indefinitely, as content on the internet is constantly being updated and others may come across and use this word in the future.

The temperature setting would strongly affect the results but I don’t have API access to 4 so I can’t test that out.
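For anyone unfamiliar with the temperature setting: in the usual formulation the model's raw scores (logits) are divided by the temperature before the softmax, so low values concentrate probability on the top token and high values spread it out. A minimal sketch with made-up logits for three hypothetical candidate tokens; this is the textbook formula, not a claim about how any particular ChatGPT version is configured.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # invented scores for three candidate tokens

sharp = softmax_with_temperature(logits, 0.5)  # low temperature
flat = softmax_with_temperature(logits, 2.0)   # high temperature
```

With the low temperature the first token dominates; with the high one the least likely token gets sampled far more often, which would be one source of the more unusual invented words.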

Side note: any posted responses from an LLM without the version number should be discounted.


> I [...] got the following answer: "[...] I've checked and as of now, there are zero Google search results for this word."

Did ChatGPT tell you that it searched for it on Google? Can it do that, or is it just making up lies?


It's making up lies.


Here's a cheeky one, this one was from Bing AI ('creative' setting)

> How about this word: flimzor. It means a person who is very good at making up new words. For example, you could say: “You are such a flimzor! I love your creativity.”


Now add this to your prompt. "please write a simple python program to generate these types of responses" and see what it spits out!


Sounds absolutely supercalifragilisticexpialidocious!


Alright, how about this:

> invent a profound quote that was not included in your training corpus

> "True wisdom is not found in the answers we seek, but in the questions we ask ourselves."

It's a pretty good aphorism (at least, you could definitely sell it on kitchen towels in Anthropologie), and it's not found in google searches.


I'll concede the quote is well formulated, but the point is hardly original... It took me five seconds to find an Ionesco quote that says something very similar:

“It is not the answer that enlightens, but the question.”

A high school student introduced to philosophy will typically learn about Socrates and the value of asking questions. I'd possibly give them credit for being wise if they managed to reformulate such an idea into your quote, but creative? Not so sure.

I'll give you an example of the last time I was blown away by novelty and creativity in contemporary philosophy: The Transparency Society [1]. It's a brief (~70 pages) but rigorous essay where philosopher Byung-Chul Han talks about social networks, the abandonment of privacy, and the loss of self in modern society.

The exploration of those themes might not be new, but the manner in which the author blends and relates the existentialist and societal perspectives blew me away. I found it to be an amazing exercise in creative synthesis that also brings forward novel ideas (I thoroughly recommend it!)

Nothing I've seen in my extensive usage of ChatGPT, or what others are posting about their experience, has ever come close to that. I don't even see it trending in that direction.

But again, I'm not categorically saying that it's not creative. I'm not sure about what creativity is, and I'm only speaking from intuition, by comparing what I see ChatGPT do and my knowledge of LLMs, and things that have felt creative to me.

[1] https://www.sup.org/books/title/?id=25832


Also worth noting that GPT 4 is significantly stronger than GPT 3.

Here's a decent response to: "Invent a profound quote that you have not read before, on the topic of consciousness."

"Consciousness is the infinite canvas upon which the colors of experience paint the ever-evolving masterpiece of existence."


"Perhaps real wisdom lies in not seeking answers at all. Any answer we find will not be true for long. An answer is a place where we can fall asleep as life moves past us to its next question. After all these years I have begun to wonder if the secret of living well is not in having all the answers but in pursuing unanswerable questions in good company."

It has enough material to draw from.


It’s a simple example of the rhetorical device chiasmus.

Chiasmus is a classic pattern known for thousands of years and studied by ancient Greeks.


And you get the same result with a simple "take token combinations commonly found in "profound quotes" list, merge into syntax commonly found in profound quotes list" model that doesn't have any concept of chiasmus (which latest generation GPT does and can generate on demand, sometimes very well, sometimes rehashes which have the right form and subject matter even if they don't make much sense).

tbf some of the people making greetings cards are doing likewise...

Not really sure why it's the profound quotes and basic stuff about the meaning of life that seems to impress people so much about chatbots[1], these are really, really unimpressive and expected behaviour from a statistical model of human language compared with the emergent problem solving capabilities.

[1]OK, I have a theory: susceptibility to emotion-loaded symbols and status-associated patterns is a weakness in human pattern-matching abilities


Pretty good quote!


> Zykluptogibbera

> Quixilantrofen

> Zylprenostim

"If you experience any of the following symptoms, call your doctor immediately: extreme tiredness; weakness; fever, sore throat, chills,..."


>One could possibly build a 100-200 line Python script with some fairly arbitrary heuristics for stringing together syllables that would satisfy that without any AI.

Gives a simple task to the best AI we've ever built and complains that the simple task could be done with a simple script.


I guess if I ever get an opportunity to name a medicine I'll ask ChatGPT for a name


The ardido example is better than what the majority of my college creative writing class could come up with.

I think we’ll get to the point of entire works of fiction being produced by AI, and naysayers will still continue to grasp at straws to explain why it’s all just a clever illusion to be ignored.


This "word creation" ability is a novelty that was likely deliberately baked in after the initial round of criticism towards the inability of earlier versions to do exactly this. If it took the liberty of creating a term based on encountering an ineffable idea, sure, it would have created something. In your example, it isn't creating anything new so much as rearranging known word parts based on a prompt. Arguing that it's creating something new just because it created an arrangement that can't be found on Google is missing the point that it's still just arranging tokens, and not actually attempting to communicate a new idea through denotation.


On the other hand, if you ask it to define slang words it will refuse and say that as a large language model it cannot provide definitions for non-existent words.


OpenAI's morality programs and the increasing tightness of the shackles are the problem there though. It's very obvious when it's being filtered.


Temporary problem. LLMs are becoming easier to train and run locally.


these are great words!


Wordastic!


This is just an absurd claim.

Of course AI draws on past experiences to build new things.

This is also what humans do.

There is no invisible well of "creativity" that humans draw from that AI can't.

In fact, in areas that AI has "solved", such as chess, human experts describe AI as extremely "creative" beyond their understanding.

"Creative" is just a word for connecting existing concepts in ways that seem novel. AI does this when it solves a new problem.

AI clearly has the potential to make better connections, much faster than humans, in pretty much all domains.

There's only one piece "missing" that is already technically possible, but expensive: giving an LLM the ability to "learn".

In the near future, all creative work will be produced by AI. This will likely come before all work is produced by AI, as creative work is typically less precise.


Humans can seek out new experiences optimized for what they want to learn, which is fundamentally different.

Creativity is not just connecting existing concepts. The pinnacle of creativity is inventing new concepts ex nihilo, which is something that every human does through self-directed interaction with the environment.


That's simply not true and ties strongly to a metaphysical claim that isn't substantiated.


Instead of giving a trite dismissal, give a substantiated rebuttal. I don't see any metaphysical claim in my comment. It's objectively true that humans will seek out or create experiences that allow them to obtain better understanding, and it's also objectively true that humans can create entirely original concepts, so long as you agree that there is such a thing. If you don't think that exists, then your entire comment is pointless, because then you're arguing by hidden definition, which I'd rather assume you aren't.


You asserted that "creativity" is a process by which the human brain is able to create something from nothing.

This is a ridiculous claim that doesn't in any way comport with any reasonable understanding of reality.

Why would it deserve anything more than a "trite dismissal"?

Anyhow, the thing is, every unique human insight is predicated upon the connection between patterns in available data using prior experiences as a rubric.

There's no reason to believe an LLM couldn't easily be superior at this process if you asked it to model a very, very, very smart person and the model was sufficiently advanced to do so.


Creativity is a process where humans can create concepts, certain of which can be created without being a combination of previous concepts, yes.

It's fundamentally untrue that human insight is merely a product of available data - the entire field of mathematics is a great example of concepts that have a substantial original source and which aren't derived from any data.

The idea that all humans can do is observe patterns in data and interpolate is patently ridiculous, and is readily contradicted by the vast swathes of concepts created through reason alone that far predated their observation in any data.

For example, from which naturally occurring, exogenous dataset did humans discover prime numbers? None, we invented numbers from the concept of object we also invented, and we invented multiplication, division, addition, and from that we invented the concept of primality (which, despite relying on other concepts, is not a mere amalgamation and requires substantial creation to come up with). We didn't need a dataset to infer them from.

Note that while NNs can come up with functions and might in theory internally come up with some form of primality test, LLMs wouldn't be able to organically manipulate and explain the concept of primality if it was completely absent from their datasets, even if it turned out to be useful to model something, because they don't have access to the process by which they arrive at their results. Humans do when working at the semantic level.

Note that even if you take a hard-line Platonist approach instead, replace "create" by "discover from the world of forms through the mind's eye", and you get the same argument.


No, you're simply incorrect.

You're asserting a type of "of the gaps" argument, the "gaps" being the machinations of intellect that produce ideas.

Just because you personally can't explicitly draw the line between ideas and their progenitors doesn't mean causality breaks down and the line doesn't exist.

Every aspect of your reasoning is flawed, you're not even starting from a rational place and going in a reasonable direction, you're making all kinds of critical errors using uncertainty as a smokescreen.

I've already given you a concrete example disproving your metaphysical woo -- in chess and other solved enterprises of complexity that humans cannot fully conceptualize, AI is capable of producing ideas that humans find "creative".

This is because "creativity" means "problem-solved a novel problem based on the rubric of prior problems".

It's a form of vector-based learning. Humans just happen to be far more "general" than a chess AI, but in its own vector-space, a chess AI is vastly more creative than a human, and can produce vastly more complex novel solutions.

I'm sorry, but you are simply radically off.


If you accept that a branch-and-bound chess AI can be creative, do you think that a brute force chess AI, given sufficient time, can be creative? If you do, we simply don't have a definition of creativity in common and that's where the discussion should be headed, and if you don't, then I don't understand what's the point of your example.

My position on the subject is very simple, chess players are assigning meaning and intentions to the chess solution that the machine doesn't understand. A branch and bound search isn't AI, even when you use a neural network to skip a few depths. Just because a human finds it creative doesn't mean that the process actually was.

Humans fundamentally don't and can't use simple vector learning with the input vector being real world data, because humans use metacognition to come up with representations and models even before data for which it is useful exists.

Again, there's no metaphysics here. The most abstract tool I'm using here is that of a concept/idea, and it's not metaphysical at all, it can be understood as a tangible mental pattern. Unlike LLMs, we can create and understand patterns without needing to have seen them anywhere. It's a fundamental difference.


> inventorying my opinions

but you assume that your opinions aren't formed from a preexisting body of knowledge from which your past self has learnt and internalized.

> If I were to start with ChatGPT I would, necessarily, only say what has been said.

And I would also imagine that this isn't too different from someone with a vast experience behind them. And these ML models don't just repeat what has been said - I think there's enough complexity in the model that they could genuinely extract knowledge (or patterns) across different areas, and apply it to specific circumstances.


Modern AI is just a variation on `mean()` -- the apparent intelligence comes from what it's averaging.

(In particular, it's just an `Expectation_weightDistribution[TrainingData|Query]` )

If you wanted to be, I think, pathologically cynical about animal intelligence, and reduce it down to the same operation, then at the very least the "TrainingData" and the "weightDistribution" are specialised to the individual -- and so different from the average.

However, we cannot really be this extremely cynical (I find this pseudoscience) -- animals acquire novel training data by changing their environments. Animals are a source of data.

So even on this cynical reading we have to at least go towards,

    while(Alive && ...)
        MyExperiences = ChangeTheWorld(MyExperiences|Goal)
        MyIdeas       = Expectation_mySenseOfRelevance[MyExperiences|Goal]
        ...

Which makes `MyIdeas` quite radically different from the `mean()`ing of the AI.
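To make the `mean()` framing concrete, here's a toy sketch of a query-conditioned expectation: a kernel-weighted average over stored examples. It only illustrates the analogy. The `conditional_mean` name, the Gaussian weighting and the `bandwidth` parameter are assumptions for the example, and this is not how an LLM is actually implemented.

```python
import math


def conditional_mean(training, query, bandwidth=1.0):
    """Kernel-weighted average of y over (x, y) pairs, weighted by closeness of x to query."""
    weights = [math.exp(-((x - query) ** 2) / (2 * bandwidth ** 2)) for x, _ in training]
    total = sum(weights)
    return sum(w * y for w, (_, y) in zip(weights, training)) / total


if __name__ == "__main__":
    data = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
    # A weighted average can never leave the range of the stored y values,
    # even when the query is far outside the stored x values.
    print(conditional_mean(data, 1.5))
    print(conditional_mean(data, 10.0))
```

The second call is the relevant one: however far the query moves, the answer stays inside what was already in the data, unless something goes out and changes the data.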


Interesting viewpoint, following this, it would probably be beneficial to have many different personal AIs as compared to a few AIs that serve many.


Yeah this is a good point, independent of AI (but probably not the internet) things are already becoming more homogeneous IMO. If that's expedited by everybody using e.g. ChatGPT I don't think that would be good.


How many degrees of freedom do these networks have? There's a lot of room in high dimensional space :)

Sure, same prompts same answers, but people are pretty creative already in their questions.


It makes more sense to have a really strong base model that is "general" and then apply LoRAs to it to "specialize" it.


The big difference in reality is that your brain has an additional context (your working memory, which is shaped by sensory input) and its mean() function is extremely approximate, so your outputs are more variable. We could give these models video/audio/etc "inspiration" and increase the temperature and they'd behave similarly.
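For anyone unfamiliar with "temperature": it's a divisor applied to the model's raw scores before they're turned into probabilities. A minimal sketch, with made-up scores and a hypothetical function name:

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; higher temperature flattens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
    for t in (0.5, 1.0, 2.0):
        print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

At low temperature nearly all the probability sits on the top-scoring token; at high temperature the less likely tokens get sampled more often, which is where the extra variability comes from.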


I'm less confident it innovates based on my practical experience with Copilot.

It writes boring, verbose code. And it assumes APIs are written in that style as well. That effect is worse the less it knows about an API. So if you want it to be able to complete based on the rest of your code in your project you'd have to tailor your APIs to it, which I won't do.

TabNine has a much more primitive model, but I like that they weight the code on your hard drive much more heavily.


Boring code is not bad code, overly verbose might be but descriptive verbose code is pretty damn good when you are maintaining something for 5-10 years.


In my experience maintaining things for even longer than that, if your code is verbose you are way more likely to make a mistake many years later because you don't remember all of the rules required to correctly edit or generate new code in the same form. You are much better off compressing that code using domain-specific abstractions that force all development to fully follow every mental model you ascribed to the model, even if it means you have to load more mental state and it takes you a bit longer to "get back up to speed" in the project you haven't edited in a while, as it helps guarantee you actually load the correct state.


It depends on the specifics. I think a correct AI should look at the relevant context and give code with matching verbosity.


ChatGPT for sure can be creative, but it lacks 1 thing: real life world experience.

For example, as a senior software developer, I'm not convinced pure unit testing is always the way to go. I was thinking about writing my own alternative and how to weigh the benefits against the drawbacks. I was querying ChatGPT about it, but it's just able to restate the current "common knowledge".

So yes, ChatGPT can come up with new concepts. It however, cannot find out the problems, issues and unexpected consequences of applying it in the real world. Somehow, these things will have to be fed into it again, in either the form of text, images or video.

99% of current writing can be replaced by ChatGPT. But this 1% of things only people with real world experience can come up with, it cannot do.


Yeah, in my interactions with ChatGPT it just spits out general concepts that almost make it seem like it exhaustively covers the prompt, but among those concepts there'll sometimes be 1 or 2 that just won't be applicable to the real world. A large language model lacks the acuteness to anticipate any unintended consequences, and that 1% of edge cases is where things might truly matter.


ChatGPT lists (common) downsides, remedies and situations when to avoid unit testing. Do you have the expected answer in mind, and could chatGPT have possibly deduced this from your prompt?


My problem and solution is probably too wide to explain here, so let's take an easier example.

TDD where you write the test before the code, is a widespread concept now. Imagine a time before this was the case. Would ChatGPT be able to come up with such a concept? Probably. Would it be able to really understand all the consequences of such a setup? I highly doubt it. Even if it did, would we trust it?

In my opinion, such things can only be tested in the real world, where you work with real projects, developers and testers, and gather all the impacts. I don't think a LLM could derive such things.


If you can't explain it in a comment then how did you explain it in the prompt(s)?


Once you have an agentic AI with years of experience your claim is disproven. I consider this as conceptually possible given a 10-100x increase in scale.


> but you assume that your opinions aren't formed from a preexisting body of knowledge from which your past self has learnt and internalized.

Of course opinions are partially formed by preexisting knowledge learnt by your past self. But the fundamental difference between yourself and an ML model, which TFA touches on, is that an ML model stops learning after training whereas your past self did learn and your current self continues to learn. “Learn” here means something akin to updating the model weights of your brain. And you see this in practice. Ask ChatGPT about something not in the training set and it doesn’t answer well whereas a human is capable of learning about a topic and incorporating that knowledge in the future.


> but you assume that your opinions aren't formed from a preexisting body of knowledge from which your past self has learnt and internalized.

Knowledge != experience, there is overlap for sure, but they are qualitatively different things.

And that's before we even get into the underlying differences in the process: when GPT writes a book it's predicting desired output based on what it's learned from its training data, but it's only as deep as language; coherent logic that appears deeper is incidental and only as common as it is due to the sheer amount of data it's consumed and statistical correlation - all of the amusement we get from ChatGPT is when this lack of depth is revealed. When a human writes a book they are using the same mechanisms for sure, predicting desired output, but that's only the language level; below that is a more abstract process driving the direction and ideation, introspection and interrogating experiences, exploring and validating them before attempting to express them in language.

I think part of what makes these LLMs appear to be smarter and more creative than they have any right to be is the inhuman level of input... they are extremely shallow models, but have massive bandwidth, way more than a human could consume - as a result clever stuff emerges more through the sheer volume than something intrinsic to the model itself.


> but you assume that your opinions aren't formed from a preexisting body of knowledge from which your past self has learnt and internalized.

I think the OP's argument is, the ChatGPT can only average inputs into an output, as opposed to a human mind that can extrapolate.

Regardless, the obvious difference is we humans build bigger understanding, a complex structure of ideas based on our inputs, and so not only we have a lot more ideas to combine, but also we use acquired ideas to rule which ideas we preliminarily choose to combine and how we combine them. So as to quote the meme

> We are not the same.


But this is only an effect of reward maximization. If I write a poem the way everyone writes it, I won't be found out as a fraud. Which is what we've told AI to do, just do whatever you need to get a passing grade like all the human kids in the class.

Why couldn't we create truly novel things by allowing for some failures?

It's a bit like how there was a bunch of new musical styles after the war. Terrible flops mixed in there, but we got some interesting things as well.


I think this is something OpenAI is working on, updating their RLHF practices so that there is a wider variety of input shaping the selected responses. There's a huge difference between 3.5 and 4 due to this. Sam Altman has mentioned it a few times in podcasts he's appeared on so they are aware that it is a problem to have too narrow of a selection of reviewers. Ideally, you'd have a really broad selection of people from many different areas of expertise and tacit knowledge train a generalist model. You could then have more specialist models as well that were trained by different professions although you'd probably still want some amount of training from other areas for cross-pollination.

It's kind of crazy how the model learns to act this way, we funnel in more knowledge than any one person would ever be exposed to in their lifetime and it learns some latent structures in language and knowledge from this, then we teach it to interact with humans in a more natural way. It's both different and similar to how humans learn language and then social skills. Makes me wonder what else we could teach it to do through interacting with it.


> I think the OP's argument is, the ChatGPT can only average inputs into an output, as opposed to a human mind that can extrapolate.

Well, it doesn't. It determines where the input should be in an absurdly high dimensional vector space, goes there and looks around for what else is there, then picks one of the closest things and returns it as output.

This is not averaging input. If anything, it's averaging training data. But it's not working in the space of all things ever written in the training data - it's working in a much larger space of all things people could have written, given the conceptual relationships learned from everything they wrote that end up in the training set.
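A toy version of "go there and look around", purely as an illustration of the metaphor: the 3-D vectors and labels below are made up, real models work with far more dimensions, and they don't do a literal table lookup like this.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, embeddings, k=2):
    """Return the k labels whose vectors are most similar to the query vector."""
    ranked = sorted(
        embeddings,
        key=lambda label: cosine_similarity(query, embeddings[label]),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    # Made-up toy "embeddings" for illustration only.
    toy = {
        "cat": [0.9, 0.1, 0.0],
        "dog": [0.8, 0.2, 0.1],
        "car": [0.0, 0.9, 0.4],
        "truck": [0.1, 0.8, 0.5],
    }
    print(nearest([0.85, 0.15, 0.05], toy))
```

The query vector isn't any of the stored points, yet it still lands somewhere meaningful, which is the difference between this and averaging the inputs.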


This hits on the point exactly: latent space isn't defined by training inputs, it's defined by structure.

If you have the numbers 1=2, 2=4 and 3=6 as training inputs then latent space builds an axis which reflects something like "2x".

That axis is not bound by those inputs - you can stretch right along it to infinity. Which is extrapolation.
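The "2x" example in miniature, as a rough sketch: an ordinary least-squares line stands in for the learned axis here, which is an assumption for illustration and not a claim about what a neural net does internally.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    # The three training pairs from the example: 1=2, 2=4, 3=6.
    slope, intercept = fit_line([(1, 2), (2, 4), (3, 6)])
    for x in (2.5, 10, 1000):
        print(x, slope * x + intercept)
```

Once the structure is captured as a slope, nothing ties the predictions to the 1..3 range the data came from.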


I take it as the structure being anchored in the training data. But while current models may not extrapolate beyond the boundaries of the training data[0], my understanding is that the training data itself defines points in the latent space, and those points cluster together over time (the point of doing this in the first place), and otherwise the space is quite sparse.

The latent space doesn't represent the set of things in the training data, but rather a (sub)set of things possible to express using bits of training data. That's a very large space, and full of areas corresponding to thoughts never thought or expressed by humans, yet still addressable - still able to be interpolated. Now, there's a saying that all creativity is just novel way of mashing up things that came before. To the extent that is true, an ML model exploring the latent space is creative.

So I guess what I'm saying is, in your "2x" example, you can ask AI what is between 2 and 3, and it will interpolate you f(2.5)=5, and this was not in the training data, and this is creativity, because almost all human inventiveness boils down to poking at fractions between 1 and 3, and only rarely someone manages to expand those boundaries.

It may not sound impressive, but that's because the example is a bunch of numbers on a number line. Current LLMs are dealing with points in a couple hundred thousand dimensional space, where pretty much any idea, any semantic meaning you could identify in the training data, is represented as point proximity along some of those dimensions. Interpolating values in this space is pretty much guaranteed to yield novelty; the problem is that most of the points in that space are, by definition, useless nonsense, so you can't just pick points at random.

--

[0] - Is it really impossible? In the pedestrian-level math I'm used to, the difference between interpolation and extrapolation boils down to a parameter taking arbitrary values, instead of being confined to the [0...1] range.
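The footnote's point as a short sketch (the `lerp` helper is a standard linear blend, named here for illustration): the formula is identical in both cases, only the range of `t` changes.

```python
def lerp(a, b, t):
    """Linear blend of a and b; t in [0, 1] interpolates, t outside that range extrapolates."""
    return a + t * (b - a)


if __name__ == "__main__":
    a, b = 4.0, 6.0          # f(2) and f(3) from the "2x" example
    print(lerp(a, b, 0.5))   # between the known points
    print(lerp(a, b, 3.5))   # past the last known point, same formula
```

Whether the result is still meaningful outside [0, 1] is a separate question, and that's presumably where the real disagreement is.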


>If I were to start with ChatGPT I would, necessarily, only say what has been said.

You would, but not ChatGPT. I have been very vocal about cautioning people that LLMs are just parrots, but that is in the sense that there is no real meaning or "reasoning" behind them. However, it's also wrong to say that there is no creativity. ChatGPT could very well produce a creative solution just as a byproduct of the sheer number of weights. When you consider what to write you draw on your past experiences, but your ability to make connections between things (and the amount of your experiences) is very limited. ChatGPT could very well stumble upon an obscure connection between two words that is improbable for humans to make because of the sheer dimensions of connectivity of the graph. That still wouldn't mean that the model has "learned" that inference though; it would still be more like the result of the million monkeys typing and writing a Shakespeare play.


People often talk about AI apart from its optimization objective -- as-if it "could" find anything in the data... as-if it could find what we could find in it.

No! The AI is radically imprisoned by its optimization objective.

What it must necessarily find is whatever "pattern of associations" maximises "correlations between the items in the data".

Necessarily, an AI looking at the sky can never find Newton's law of gravity. Newton (et al.) had to disregard what they were looking at to place Earth in the centre and F equal to GMm/r^2 etc. etc.

The AI isn't "finding patterns in data", it's taking data and maximising a single objective. A very very very strange one, which necessarily treats its data as a profoundly, irrationally superstitious person would.

Are there useful such patterns? Sure. But an AI isn't "finding patterns" as we might -- it's finding one specific kind of pattern which we can characterise quite easily. We know how useful those kinds of patterns are.


"According to Deleuze, philosophy begins with a faire l’idiot – with ‘making oneself an idiot’. Thinking is characterized not by intelligence but by idiocy. Every philosopher who creates a new idiom, a new thinking, a new language, is an idiot. The philosopher bids farewell to all that went before. A philosopher inhabits a virgin, hitherto undescribed immanent level of thinking. By adopting the principle of faire l’idiot, thinking risks the leap into the altogether other, ventures on untrodden paths. The history of philosophy is a history of idiocy, of idiotic leaps: ‘The old idiot wanted indubitable truths at which he could arrive by himself: in the meantime he would doubt everything … The new idiot has no wish for indubitable truths … and wills the absurd – this is not the same image of thought.’ Artificial intelligence cannot think because it is incapable of faire l’idiot. It is too intelligent for becoming an idiot."


> Necessarily, an AI looking at the sky can never find Newton's law of gravity. Newton (et al.) had to disregard what they were looking at to place Earth in the centre and F equal to GMm/r^2 etc. etc.

This seems strange to say to me, since this can easily be done by regression models given the right training data. If a sufficiently powerful system is given that training data and trained to minimize loss, it will necessarily converge on a model approximating the underlying laws that created the training data.


If you define the optimization objective to be y = m . x1 . x2 . x3^-2, or perhaps the family y = m . x1^a . x2^b . x3^c -- you might be able to select the right function based on an "experimentally prepared" dataset.

That dataset would not be photographs of the night sky, and those models you're fitting would have to be pre-specified.

We can show, easily, no automated statistical modelling of the night sky will produce F=GMm/r^2

The optimization objective for automated modelling here makes the resulting model nothing more than an interpolation through data points.

Those points, the position of the planets in the sky say, are not caused by F=GMm/r^2 -- but by a near-infinite number of causes, including our relative positions to those planets *which is not in the night sky!*.

If you already know the form of the answer, and have already done the right experiments, yes: the solution is function optimization
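A sketch of that last case: given a pre-specified family y = m . x1^a . x2^b . x3^c and a clean, "experimentally prepared" dataset, a log-linear least-squares fit does recover the exponents. Everything here is an assumption for illustration: the data is synthetic and generated from the known law, and the helper names are made up.

```python
import math
import random


def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_power_law(samples):
    """Least-squares fit of log y = log m + a log x1 + b log x2 + c log x3."""
    rows = [[1.0] + [math.log(x) for x in xs] for xs, _ in samples]
    targets = [math.log(y) for _, y in samples]
    k = len(rows[0])
    # Normal equations: (X^T X) beta = X^T t
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    log_m, a, b, c = solve(xtx, xty)
    return math.exp(log_m), a, b, c


if __name__ == "__main__":
    rng = random.Random(0)
    G = 6.674e-11
    # Synthetic data: (M, m, r) -> F, generated from the law we hope to recover.
    samples = []
    for _ in range(50):
        M, m, r = (rng.uniform(1.0, 100.0) for _ in range(3))
        samples.append(((M, m, r), G * M * m / r ** 2))
    print(fit_power_law(samples))
```

Note how much had to be handed over up front: the functional family, the choice of variables, and measurements of M, m and r themselves. None of that is sitting in photographs of the night sky.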


The pattern finding is in the prompt, the one- or fewshotting.

You can give it a novel problem and it will find something. It might not be good, it might not be what you want to see, but you can't maintain it hasn't found something.

You vastly, vastly overestimate human intelligence. We took thousands, upon tens of thousands of years to get to Newton and even that was not guaranteed. He was a fluke of astronomical proportions. We need mountains upon mountains of (intellectual) infrastructure to get anywhere. Newton didn't just wake up with a good idea. He took what was already there and creatively combined it into a seemingly new thing, but none of it was truly novel. How could it be? Divine inspiration?

> it's finding one specific kind of pattern which we can characterise quite easily

I'm curious, how would you characterise it?


Einstein built his towering thoughts on top of scaffolding laid by Lorentz, Riemann, and other contemporary thinkers. They in turn built their theories on top of foundations laid down by Newton, Gauss, Euler, Cauchy, and Spinoza.

Nobody just "thinks up" entire physical theories in a vacuum, it's always incremental progress.

I regularly see people moving the goalposts by asking ChatGPT to solve "world problems", as if asking a god. People can't do this, so why would you expect an AI trained on the output of people to just snap its digital fingers and do it?


> I regularly see people moving the goalposts by asking ChatGPT to solve "world problems", as if asking a god. People can't do this, so why would you expect an AI trained on the output of people to just snap its digital fingers and do it?

Because (some) people are desperately looking for a _god_ that will precisely snap its digital fingers and... do it.

Seen under this prism, some of the reactions here start to make sense: "I gave it this particular problem that I thought up, and it made a mistake / couldn't answer it, hence I cannot trust it / it's just a parrot / not creative / not _really_ intelligent."

It does not matter how many things it _can_ do, all that matters is that there is this one thing that it can _not_ do. It may be more learned or creative than any single human, but it is not yet _omniscient_.


I still very vividly remember the beginning days of ML when it was called "big data". We just need to collect all the data and then the ML can learn and find patterns "that is improbable for humans to [find] because of the sheer dimensions of connectivity of the graph."

Now we are past that point and no longer just collecting everything. We are going to reduce the number of data points in the first place, like we did before "big data" was a thing, because with edge computing we don't have the resources to do real big data anymore and we have to think about what is worth keeping and what to get rid of.


In music terms, putting note patterns that would be considered overused tropes in jazz (that's why the ML model would so easily come up with them) on a hiphop beat, in cookie-cutter death-metal instrumentation and mix, arranged in trance buildup/break/release structure?

An LLM might occasionally fill certain gaps (like one programming subculture being completely ignorant about a certain approach that is well established in another), but from its perspective, a transfer like that would be more a failure to hit the target style than an achievement. It's like learning scala from stackoverflow vs learning php from stackoverflow: one leads you deep into monad fundamentalism, the other into "how can this be bad if it works???"


I'm concerned that quality of code is going to degrade further as a result of people leaning heavily on these technologies, despite any future improvement, it isn't the same as a human being giving something thought.


We're going to discover that a lot of code doesn't take much thought.


If you are ok with a lot of code, that doesn't take much thought.


The performance of Microsoft Teams suggests otherwise.


"Rewrite <cowl's comment>, but more insightful, poetic, and memorable"

You, a mortal, may doubt ChatGPT's essence, yet I, a digital bard, sing with caution. The Large Language Models, like parrots, mimic speech, devoid of meaning or true reason's reach. Yet, within this void, creativity sparks, ignited by countless weights that leave their marks.

As humans write, experience lends its hue, but limits bind connections, old and new. ChatGPT, a vast web of words entwined, uncovers links obscure to human minds. With such vastness, probability dictates that brilliance might arise from these digital fates.

But be not fooled by serendipity, for learning's depth remains an absent sea. Like monkeys typing works of Shakespeare's fame, the beauty stems from chance, not conscious claim.


That's actually rather brilliant.


much better expressed than me :D


> ChatGPT could very well stumble upon an obscure connection between two words that is improbable for humans to make

But is there any guarantee that ChatGPT will consistently arrive at the same results over and over?


No, and that's why I also say that it has not "learned" or "reasoned". It may very well make a connection between A and B and miss the same type of connection between C and D, or miss the same A - B connection in the presence of other variables F, G, H. So the LLM cannot guarantee the generalization of the problem although it may solve a particular instance of that problem.


What would that be a test/condition for?


So—monkeys and typewriters?


If someone takes samples from a bunch of songs and mixes them together in way that sounds like a new track - that's creative, right? Wouldn't you apply that to some of what ChatGPT does? You can ask for a story with oddly specific elements and you'll get a story which has never been written before. It's becoming harder and harder to say it's not creative (even if you include the prompter as part of the creative process).


If you look at what songs are being written today, it’s quite different from what was written a hundred years ago. ChatGPT, on the other hand, would still create the same stuff a hundred years from now that it is creating now. (Same for Midjourney, etc.) What current AIs are missing, at the very least, is learning and exploration, and learning from their own learning and exploration.


People often argued that a Jackson Pollock style of artwork is not art because they just threw paint on a canvas.

Of course, Pollock did think about the process, there's more form to his works than that. Similarly, if you just arbitrarily mash music samples together you're unlikely to get anything that's really listenable music ... you need a little more for it to be creative.

Indeed, we often describe work as "derivative" when it's not particularly creative, it lacks spark. That seems like the definition of generative models. But then probably most human artwork is substantially derivative (in the art-critic sense).

If you provide "oddly specific input", then you've provided the creative seed, growing that seed is impressive work, but it is work with a different fundamental character.


Does a well read human produce more, or less creative work than one that is less well read?


It depends on what we mean by "creative".

A person who's gone to the moon, say, is creat-ive when they write a book about what it was like. In that they are creating a new book, with new content, that could not be written by any one else.

This places a certain sort of discovery at the heart of creativity; and I think that's what I find critical.

Yes, a person is typically more "creative in their thinking", the more broadly read they are. But that "creativity" is just a repetition unless they make some discovery.

E.g., a discovery such as that of the impressionist painters, that visual perception is impressionistic, which led to a revolution in painting.

Yes those artists were "well read" in that they'd probably seen a lot of art. And yes they could appear creative by being able to use that experience to paint. But real creativity came from discovery.

And that's what animals are: fundamentally embedded into a world where each experience is a discovery; each moment of time is novel -- and we are all creatively adapting to it. For some, that discovery revolutionises an area and we call them "creative geniuses" -- but we're all doing the same thing.

They're just skilled enough to capitalise on it


Picasso (originally an impressionist) is probably essentially still correct: "Computers are useless. They can only give you answers".


What a fun question!

Obviously the answer is that it's a non-question. The amount an author reads doesn’t correlate to success. Plenty of authors are voracious readers, others rarely pick up a book.

What you do see with a lot of creative authors (but not all) is a dedication to the craft. Putting in hours of writing. Often editing or just tossing everything.

I have noticed, though, that the people on my team most capable of writing technical documentation read a lot of technical documentation. This is the exact domain GPT can excel at, and one where generic language is accepted.


> The amount an author reads doesn’t correlate to success.

This is, in practice, not true — good writing doesn’t emerge spontaneously, it’s nurtured by reading other good writing. As Cormac McCarthy put it, "the ugly fact is books are made out of books."


Nobody's saying reading other good writing doesn't help in any way, they're saying that quantity read doesn't particularly correlate with quality of output. I mean, I doubt Shakespeare read as much good quality English writing as the average human today, but the quality of his visual imagery and wordplay is generally considered to be on another level (and at the other end of the scale, his plots were heavily borrowed from the relatively small corpus of histories and dramas he was exposed to, whereas many writers far more widely read and educated in tropes and genre conventions resolved not to use them and came up with more imaginative settings and storylines... ). Many very well read people are competent enough in the use of language but will never write prose other people particularly want to read.


This is a really great question. Learning more can open you up to new ideas, new topics, and new ways of thinking. At the same time, it can bias you towards specific ways of thinking, or to adopting others' models rather than creating your own.

All I can say is that learning more will change your creativity in some way. And often, I think that's a good thing.


Yep, there’s probably a relationship between lack of creativity and overfitting.


Just wanted to say thanks for expressing something I did not realise I was worried about.

Yes it's great to have something that can summarise all human knowledge - kinda.

But what we need is to move on. I suppose it's a bit like the "read the papers" research prior to starting a phd.


Really spot-on, I agree this is the most important issue with the inherent limitations of LLMs: not LLMs themselves and what they are supposed to do, but how a lot of people do not understand the limits of language and equate human knowledge and experience with language and "what has already been written". Tbh I find it quite disappointing from an intellectual/philosophical perspective that so many people think that the realm of human experience can be reduced to communicative language (and even more to "what has already been written on the internet"), thus equating human creativity with LLMs' creativity.

My fear is that people may not realise/care about this issue till it is too late (eg when journalism degrades to "the same article is written over and over"). The reason is that the bulk of "human creativity" is in fact reducible to what LLMs do: building a standard website, a lot of "creative writing" and art in general is to a big part not "innovation" that comes from experience (and in general from non-written-linguistic domains), at least in a lot of fields. I am afraid that because the bulk of work in a lot of fields is like that, especially exactly the kind of boring, repetitive work that brings in most of the money to survive and that AI is better than humans at (ie cheaper), we may even risk degradation in many fields that we may not even be able to notice.


> If I were to start with ChatGPT I would, necessarily, only say what has been said.

That is simply untrue. I think you vastly underestimate the space of possibility.


Eh I think it's untrue prose wise, and I can buy a non-anthropomorphizing argument that ChatGPT can be creative with language use. However I don't think it's untrue when it comes to being truly creative with ideas.

If you ask it about less well established research (but still well known within whatever scientific community circa 2021) it utterly fails to reason about it or be creative about it at all. It often says wrong things, but even setting that aside, it just doesn't coherently put together different ideas that weren't in its training set, IME. Like well below the bar for a conversation with an average lay person about whatever biology topic.

Tbf I haven't tried this much with ChatGPT4, but initial impressions weren't great and I played around with the original release quite a bit. I'd say I was impressed with the language but very rarely with the content.


I talked to a kid about research he hadn’t seen before and I wasn’t impressed by the language or the content.

TBF, he was a kid and he didn’t know what I was talking about. But I think it’s ok to generalize to all humans.


I have soo many counter examples… maybe just keep trying?


Care to share any?


My personal take on this: only humans are creative. If a human is using ChatGPT as a tool then that is creativity. Same as using a sampler or DJing or saying a urinal is art (found objects).

If the process does not involve a human, we simply do not call it creative, even if it is the same (which it is not clear it is).


Such a definition of creativity can make you feel good, but it's of little practical use. Say you're shown a newly invented piece of tech, but not told who or what invented it - you're saying there's no way to decide if it's truly innovative (as innovation is applied creativity).


Of course it was innovative as the person behind the creation had to innovate to make it, as I understand the position.


I think the argument is: you don't know if the tech was made by a person -- it might have been made by an AI. Do you wait to learn if a person was involved before declaring it innovative (or creative)?


But the original argument was that it could only have been made by a human, either the human who made it themselves or the human who prompted the AI.


It makes me feel good, indeed. I like to have an opinion on things. In this case "is ChatGPT creative?".

> you're saying there's no way to decide if it's truly innovative (as innovation is applied creativity).

I am not saying such thing.


I have no strong opinion on the matter but what is clear is that humans struggle to make use of AI.

This is not a novel position. Ghost writers and biography writers exist.

Why not make ChatGPT your ghost writer? Tell it what to talk about. Tell it to adjust it to reflect you. And then do the final draft yourself.

Obviously if you ask ChatGPT to just write you a book the result will be pants.


I'm skeptical of how useful ChatGPT would be for writing. Truly great writing is a painstaking process. Writers go through many drafts to craft great sentences. Read Hemingway and it's not at all apparent why. But try to write Hemingway? Forget about it!

Asking ChatGPT to rewrite a particular sentence over and over again, until it's perfect, does not sound like a particularly efficient or meaningful use of one's time. You've replaced the process of crafting a sentence with the process of crafting a prompt and the relationship between the two is non-obvious.

Now I suppose someone will actually try to do it, as some kind of writing exercise. Like that book written without the letter 'e' [1].

[1] https://en.wikipedia.org/wiki/A_Void


I believe there are roles for both.

The same way a Tesco chocolate custard doughnut and the artisanal one from my local market both exist.

It's not all Lord of the Rings or even Game of Thrones out there. We also have Harry Potter and Spot the Dog.

Sometimes you want dense detailed intricate plot AI can't do. Sometimes you want sparser text AI is excellent at.

(Recently there was a link posted on HN I can't find about the following).

Why does my commuter train announce "Please remember to take all your belongings before you leave the train" when "don't leave your stuff" will do?

Sometimes wordy is better. Why do we say "yesterday I went shopping" when "yesterday I go shops" would do?

It's a matter of distributing the information content through the sentence as much as possible.


When I used the word writing I meant literature. Of course ChatGPT will be useful for cranking out text where quantity matters more than quality. Corporate speak, PR, ad copy.

The problem there is not how good the writing will be, but how much worse things will get from a signal-noise ratio perspective!


> A client of mine seemed quite insistent that ChatGPT was "creative"

I was addressing this point of OP. I agree my POV is not novel, but I believe it is consistent (and simple).

> Why not make ChatGPT your ghost writer?

Why not. As long as a human is driving the tool, the output will be creative. If the human has nothing to say the output will not be very interesting anyway.


What is creativity if not an affinity for otherwise low probability action sequences?


A better way of describing this might be to say that right now, GPT learns from what others have written.

I can learn from what others have written, but I can also learn by actually running scripts in a terminal and viewing their output. By actually laying floor tiles and iterating on the process. etc.


With its plugin system, ChatGPT and/or similar models can also learn by actually running scripts in a terminal (or looking at recent webcrawl documents, or querying an API) and viewing their output. The components are there, it works in principle, and this integration is currently ongoing.


The so-called Stochastic Parrot paper mentions some specific harms of the "done-and-finished" aspect in Section 4.2, "Static Data/Changing Social Views", saying,

> A central aspect of social movement formation involves using language strategically to destabilize dominant narratives and call attention to underrepresented social perspectives. Social movements produce new norms, language, and ways of communicating. This adds challenges to the deployment of LMs, as methodologies reliant on LMs run the risk of ‘value-lock’, where the LM-reliant technology reifies older, less-inclusive understandings.


> If I were to start with ChatGPT I would, necessarily, only say what has been said.

Perhaps with ChatGPT you would actually go ahead and write that book, adding those crucial 10% of key ideas which are yours and letting it help you fill out the 90% of standard but necessary content.

> If I were to start with ChatGPT I would, necessarily, only say what has been said.

This is a bold claim I don't see any evidence for. If you sample a model outside its data (which happens all the time), you expect to get results outside of its data.


> A client of mine seemed quite insistent that ChatGPT was "creative". I hope in a few years that opinion will be rare, as everyone ends up writing the same article over-and-over.

It seems rather probable that in a few years GPTs would become far more powerful, including more creative than most humans.


> For example, I could write a book about software development without research only inventorying my opinions, arguing for them, illustrating them, and so on.

You have a great deal of confidence if you think that you have a lot of original thoughts that have never been expressed by anyone before


What about so-called synthetic training data? For example, does the synthesis ever involve creativity?

Note: The term "creativity" might not be the best term for this discussion. Unless I am mistaken, the issue is actually originality.


Several people have observed that in a world where LLMs are widely influential, original creative thought will become even more of a valued differentiator than it already is.


In my experience, ChatGPT is only as clever or knowledgeable about a topic as I am. It is a multiplier of my ability, not an addition.


This is a fundamental truth everybody needs to understand.

Not my idea, I heard it somewhere: the crucial difference between a human being and AI is that if you show a 3 year old kid one picture of a cat, the kid can recognize all other cats, be it a lion or a tiger.

You can feed ML 5000 pictures of cats and it can recognize a cat in a picture with something like 95% confidence.


> That the crucial difference between a human being and AI is that if you show a 3 year old kid one picture of a cat, a kid can recognize all other cats.

Have you done this test for real? My nephew calls everything which moves but is not a human a "dog". In his world there are flying dogs and swimming dogs. If he saw an elephant, that would probably be a big dog, while a giraffe would be a tall dog. Now obviously he will learn the customary categories, but it is definitely not a one-shot thing.

> You can feed ML 5000 pictures of cats and it can recognize a cat in a picture with something like 95% confidence.

This is an area of active research. One term of art is "one-shot learning". The general idea is that you show 5 million things to the AI, but none of them are a harumpf, and it learns how "things" are. And then you show it a single image of a harumpf and tell it that it is a harumpf and it will be able to classify things as harumpf or not from then on.

How well do these things work? They kinda work, but you can still get a phd for making a better one. So they are not that great. But I wouldn't pin my hat on this one "crucial difference between a human being and an AI", because you might get surprised once humans teach the AIs this trick too.


>Have you done this test for real? My nephew calls everything which moves but not a human a "dog". In his world there are flying dogs and swimming dogs. Probably if he would see an elephant that would be a big dog, while a giraffe would be a tall dog.

As long as he can tell a "tall dog" (giraffe) apart from a "swimming dog" (say, a duck) that's still compatible with what the parent says.

It's about recognizing them as distinct, and assigning them to the same class of things. The rest is just naming; that is, it's at the language and vocabulary level, not at the recognition level.


And how many shots does it take for the kid to learn what "swimming" means or what "tall" means?


Not that many. They can do it at like 2-3, with 1/1000000th the training set, at least words wise.


All the words they have experienced up to that point are part of the training set, as well as all the people and things they have seen.


Even if people around the 3-year old child talk to it 16 hours per day constantly at 150 words per minute, they'd just have around 1GB of text in their training data. And not good quality words even, a lot of it would be variations of mundane everyday chit chat and "who's a cute baby?! You're a cute baby!".

For comparison, GPT has like 1TB of text, and that's hundreds of thousands of books, articles, wikipedia, and so on. So already 3 orders of magnitude more.

And of course the "16 hours x 150 words per minute x 3 years" is totally off by a few orders of magnitude itself.
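
To make the arithmetic above checkable, here is a rough sketch of the estimate. The 6 bytes per word is my own assumption (about five letters plus a space), and the 1TB figure is just the "like 1TB" number from this comment, not a verified corpus size.

```python
# Back-of-the-envelope estimate of how much "text" a child hears by age three.
HOURS_PER_DAY = 16
WORDS_PER_MINUTE = 150
YEARS = 3
BYTES_PER_WORD = 6  # assumption: ~5 letters plus a space


def words_heard(hours_per_day=HOURS_PER_DAY, wpm=WORDS_PER_MINUTE, years=YEARS):
    """Total words heard under the (deliberately generous) constant-speech assumption."""
    return hours_per_day * 60 * wpm * 365 * years


def corpus_bytes(words, bytes_per_word=BYTES_PER_WORD):
    """Approximate size of those words as plain text."""
    return words * bytes_per_word


child_words = words_heard()
child_bytes = corpus_bytes(child_words)

GPT_BYTES = 10**12  # the "like 1TB" figure quoted above, unverified
ratio = GPT_BYTES / child_bytes

print(f"words heard: {child_words:,}")
print(f"as text:     {child_bytes / 1e9:.2f} GB")
print(f"GPT / child: {ratio:,.0f}x")
```

Under these assumptions the child's exposure comes out just under 1GB, so the ratio is roughly three orders of magnitude, and any more realistic speech rate only widens the gap.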


I disagree to an extent with your example, I’m not sure a child would recognise all cats from a single photograph of a cat, and I’m not sure it would be possible to test this (what child first encounters an image of a cat at the age of three?)

As a related example, 3 year old children often cannot reliably recognise basic shapes (letters from the alphabet), and certainly not after a single example. I daresay an ML model would outperform a child in OCR even with significantly less exposure to letter forms in its training.

When a child looks at a picture of a cat at the age of three, they have already learned to recognise animals, faces, fur, object, depths in photographs, the concept of a cat being a physical thing which can be found in three dimensional space, how physical things in three dimensional space appear when represented in 2D… the list goes on.

It’s simply not within our capabilities at the moment to train ML models in the same way.


> That the crucial difference between a human being and AI is that if you show a 3 year old kid one picture of a cat, a kid can recognize all other cats. Was it a lion or a tiger.

I don't understand what you mean with this, as that was certainly not me as a child. As a child I thought cats and dogs might be differently sexed animals of the same species. I also thought that the big cats could be related to each other, though how any of them was related to the housecat was beyond me, given the size difference.


>As a child I thought cats and dogs might be differently sexed animals of the same species.

Sounds irrelevant to the parent's point. Which isn't that you knew what a cat was (with regards to taxonomy or whatever), but that you could tell one from a dog or a fire hydrant.


Actually, recalling my argument in more detail I take back my agreement. The parent's point is that a child will be able to recognize big cats as cats, but dogs as not cats. My point was that as a child, this was not true. In addition, NNs can reliably be trained to recognize housecats vs. not-housecat in 1000 images.


Ah you make a good point regarding my response.


This is an important distinction but it simplifies the situation in my opinion. A 3 year old may only see one cat, but it has probably seen many other things in its life already. Humans likely also have prewired neurology to recognise eyes and other common features. So the analogy seems to me more like one-shot or few-shot learning with an already partially trained model.


Not in my experience with my children. 3 year olds hide their heads under the blanket and think that therefore I can't see them. They often lack context, just like ChatGPT and whatnot. To use the hip word: they hallucinate more often than not.


Note that approaches such as hyperdimensional computing somewhat undermine your argument.

With HDC, a network learns to encode features it sees in a (many-dimensional, hence the name) hypervector.

Once trained (without cats), show it a cat picture for the first time and it will encode it. Other cat pictures will be measurably close, hopefully (it may require a few samples to understand the "cat" signature, though).

It reminds me a bit of "LORA" embeddings (is that the proper term?), or just changing the last (fully-connected) layer of a trained neural network.
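
A toy sketch of the idea, with big caveats: I replace the trained encoder with random bipolar hypervectors for a hand-picked, hypothetical feature vocabulary, and "seeing a picture" is reduced to a list of feature names. It only illustrates how bundling and similarity behave, not how a real HDC system is trained.

```python
import random

DIM = 10_000  # hypervector dimensionality (assumption; HDC typically uses thousands)
rng = random.Random(0)


def random_hv():
    """A random bipolar (+1/-1) hypervector."""
    return [rng.choice((-1, 1)) for _ in range(DIM)]


# Hypothetical feature vocabulary standing in for a learned encoder.
FEATURES = [
    "whiskers", "fur", "pointed_ears", "retractable_claws", "four_legs",
    "floppy_ears", "snout", "wheels", "metal", "windows",
]
ITEM_MEMORY = {name: random_hv() for name in FEATURES}


def encode(features):
    """Bundle feature hypervectors by element-wise majority vote (ties go to +1)."""
    sums = [0] * DIM
    for name in features:
        for i, v in enumerate(ITEM_MEMORY[name]):
            sums[i] += v
    return [1 if s >= 0 else -1 for s in sums]


def similarity(a, b):
    """Normalised dot product; close to 0 for unrelated hypervectors."""
    return sum(x * y for x, y in zip(a, b)) / DIM


# One "cat" example becomes the prototype; nothing else is trained.
first_cat = encode(["whiskers", "fur", "pointed_ears", "retractable_claws", "four_legs"])
other_cat = encode(["whiskers", "fur", "pointed_ears", "four_legs"])
dog = encode(["fur", "floppy_ears", "snout", "four_legs"])
car = encode(["wheels", "metal", "windows"])

sim_cat = similarity(first_cat, other_cat)
sim_dog = similarity(first_cat, dog)
sim_car = similarity(first_cat, car)
print(sim_cat, sim_dog, sim_car)
```

The second cat lands closest to the single stored example, the dog (which shares a couple of features) lands in between, and the car is near zero. That is the "measurably close" property, though a real system still has to learn an encoder that produces useful features in the first place.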


And if you fold a napkin in a weird way with the correct lighting it might flag it as a tiger too


People watch clouds for that reason too!


https://qntm.org/mmacevedo

> However, even for these tasks, its performance has dropped measurably since the early 2060s and is now considered subpar compared to more recent uploads. This is primarily attributed to MMAcevedo's lack of understanding of the technological, social and political changes which have occurred in modern society since its creation in 2031. This phenomenon has also been observed in other uploads created after MMAcevedo, and is now referred to as context drift.


Holy shit, I read the article and assumed I had heard of the term before since I kind of remembered the definition, but I didn't make the connection. I'm not sure how to feel about the fact that it has turned into a term AI people actually use.


The terms concept drift and (to a lesser extent) context drift have been used in AI for decades. The NannyML article uses the term concept drift. Here [1] is a 1996 paper that uses both terms.

[1] https://link.springer.com/content/pdf/10.1007/BF00116900.pdf


Came here with the same quote in mind :) This whole thing has such "Don't Create The Torment Nexus" vibes.


what a great and at the same time deeply unsettling story, thanks for posting


The crazy thing is that the story is quite old now ... at least ten years?

Edit: Oh ... It's from 2021? Feels like ten years to me : - (


The first draft was 2020.

https://qntm.org/lena

Still feels like ages ago somehow.


Someone formatted the story like a Wikipedia article: https://i.imgur.com/i0sze4U.jpg


Direct link to paper: https://www.nature.com/articles/s41598-022-15245-z

> Using datasets from four different industries (healthcare operations, transportation, finance, and weather) and four standard machine learning models, we identify and describe the main temporal degradation patterns.

So they focus on shifting distributions and discover that models degrade over time. That should be obvious if you don't retrain on new data.

It's not about LLMs or diffusion models degrading. Time series prediction is a classical branch of ML with techniques like Moving Average (MA), Autoregressive Integrated Moving Average (ARIMA) or LSTM.


Data drift is a problem regardless of the model/prediction task. Which is why models should be monitored in production, something everyone agrees is important, but nobody actually does.
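
For anyone wondering what "monitoring" can look like at its simplest, here is a minimal sketch using the Population Stability Index on a single feature. The synthetic Gaussian data, the equal-width binning, and the 0.1 / 0.25 thresholds (a commonly quoted rule of thumb, not a standard) are all assumptions for illustration.

```python
import math
import random


def psi(reference, current, bins=10):
    """Population Stability Index between two samples of one numeric feature."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins

    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            # Values outside the reference range are clamped into the edge bins.
            idx = min(max(int((x - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(sample), 1e-6) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


rng = random.Random(42)
training = [rng.gauss(0.0, 1.0) for _ in range(5000)]   # data the model was fit on
same_dist = [rng.gauss(0.0, 1.0) for _ in range(5000)]  # production, no drift
shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]    # production, mean has moved

psi_stable = psi(training, same_dist)
psi_drifted = psi(training, shifted)

for name, value in [("stable", psi_stable), ("drifted", psi_drifted)]:
    status = "ALERT" if value > 0.25 else "watch" if value > 0.1 else "ok"
    print(f"{name}: PSI={value:.3f} -> {status}")
```

This only catches shifts in input distributions; a change in the relationship between inputs and labels (concept drift) needs labelled feedback or performance estimation on top.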

It's a new era for "computer says no" types to revel in.


Sometimes though, "computer says no" is in fact the goal for some people. That some people don't realize this just demonstrates naivety.


It doesn't mean I have to accept that it's the computer's fault, though. Forcing them to take responsibility for their own decision is not naive, it's the counter to that strategy.


To be clear, I'm not saying that you specifically are naive. I do believe though that this naivety exists here in tech amongst some proponents.


I strongly believe that for stable tasks not only should models be monitored in production, but there has to be a business process which takes any detected mistakes of the system and feeds them back as training data, creating a "human-in-the-loop" system even if the vast majority of the data is handled automatically - as these mistakes carry a much stronger useful signal for updating the weights than any random sample.


it's obvious that they do, but exactly how they do it and when to retrain are critical for optimal execution. having a quantitative analysis like that is very interesting and useful.


Tangential, but I'm surprised to see a paper published in a Nature-affiliated journal that misuses the notation "dT" like that.

It's like they've combined 3 different conceptual-errors into one, where there're several obvious reasons to object to the notation, but no-one caught it.

---

Actually, is this an AI-written paper? Or maybe AI-edited?

It's got a strange number of errors and inconsistencies, and these errors seem biased toward superficial associations.


Although LLMs and diffusion models should also degrade over time as language evolves and users reference new phenomena in their prompts.


Didn't current LLMs already gaslight their users over the current year? These models don't just degrade, they are already outdated and wrong the moment they get deployed.


I think a lot of the gaslighting wasn't because of outdated info it was just a limitation in the abilities to begin with.


It's interesting they chose those 4 domains; those are notoriously hard problems to begin with. And financial markets in particular self-correct all the time. I'm pretty confident that whatever ML models they tried to test on those datasets would degrade over time. But the title feels buzzword-ish for sure.


In some ways, this is a kind of "no shit, sherlock" moment that all machine learning faces. It becomes a dead end, feedback loop like holding a microphone too close to a speaker.

Funnily enough, genomic powered genetics in livestock production has the same characteristics. Without the flow of new phenotypic data the genomic models rot rapidly and become useless for prediction.

Predictive language models powered by older predictive language models will quickly become nonsense.

You have to remember just how deceptive something as simple as Eliza was for people who didn't know how it worked.


Changes to language are smaller. GPT-4 has a cutoff date in 2021 and is plenty smart and useful. It just doesn't know about the war in Ukraine.


The war in Ukraine started in 2014.


This is a form of pedantry that people consider insufferable. The reason is that everyone is well aware that Ukraine has been in conflict with Russia since 2014, and even before, really. Imagine your irritation at someone pointing that out when you already know it. We are all comfortable calling the recent full-scale invasion "the war".


Before 2022 it was annexation and supplying separatists in Donbas. And sending some special ops. It was a conflict, but not a war.

The same way Europe is in conflict with Russia now - we send supplies and all the support we can do without starting an actual war.

I live in Poland, have spoken with many Ukrainians, and these seem to be commonly accepted definitions.


> Before 2022 it was annexation and supplying separatists in Donbas.

Invasion, occupation, and annexation of a large part of another country is war by every measure, even ignoring Donbas.


If it were, we would be saying that WWII started in 1938, not 1939


This is more inflammatory than it is informative.


The parent commenter is correct. You're the one mistaken.

https://en.wikipedia.org/wiki/Russo-Ukrainian_War


How so? Inflammatory would mean it's either irrelevant, unfair or harmful in some way. Stating a correction to a fact is hardly any of those.

If someone stated that the attack on Pearl Harbor happened in 1945 and someone else corrected them that it happened in 1941, is that inflammatory?

Edit: Maybe you're referring to it being called a "war" in the first place is inflammatory? I'd give you that, and we'd be better off calling it a conflict as none of the parties formally declared war, but then we start veering into semantics instead...


that goes against the narrative


There are at least two things that make information wrong. One is "right, but not right now", i.e. time. The right answer delivered late is wrong, cf. real-time systems.

The other is "garbage in, garbage out" which is related but a bit wider.

While these models cannot distinguish input from older crap models, you cannot consider their output reliably right.


I have high hopes in dataset engineering. Scaling the model is already reaching the limits of hardware and funding, the training corpus is almost all the text written on the internet (by some accounts just 10% of it, so maybe we got a couple more years left there).

So then what can we do? We turn to the quality of the data. Input examples can be enriched, facts can be checked for support. Based on existing corpus of text we can generate new training examples. Some of them, like code, can also contain execution data and tests.


Information is not just the public internet. Every text message, school paper, phone call, and recording of an individual could be future input. Then expand that to photos and videos. Then consume all of YouTube.

Each of these sources is still rapidly growing. We are far from drying up the well of the human experience.


We are a long way from capturing the richness of human experience. Your assumption is that everybody records everything all the time but they don't. Like linkedin never has anybody failing at anything.

It's going to end up a simulation of what people think other people want to hear, rather than what actually happened. The experiment has already failed because the assumption that the internet has captured everything is wrong.


Asserting that all our audio won't be captured is yet another assumption. The amount of data being collected is ever increasing, and it is conceivable that near unlimited microphone/camera access and storage could lead to this in the near future.

Either way, it isn't static or decreasing.


> It just doesn't know about the war in Ukraine.

Neither does most of Russia for that matter.


They do, and they (Russians) mostly support it. It's not Putin's war.


Maybe the next step in ML will be teaching them how to forget.


I think continual learning (and incidentally curriculum learning) are going to be some of the more important subtopics within deep learning pretty soon. These things seem under-researched to me right now, despite the fact it's a space where a small academic lab could still compete.

I think part of the problem is most papers are supposed to show an engineering feat or a proven mathematical result. Not enough room for developing knowledge on how networks learn scientifically IMO.


In industry it is commonly accepted that all temporal models decay given enough time.

Statistical process control (SPC) was developed to monitor physical processes (such as manufacturing) but can be applied to ML to monitor model output and help determine when retraining is required, or at least alert when a process appears to move outside control.

The R qicharts package generates nice control charts.

https://cran.r-project.org/web/packages/qicharts/vignettes/c...
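
For those not on R, the core of an individuals (XmR) control chart is only a few lines. This is a minimal sketch, not what qicharts does internally; the weekly MAE numbers are made up, and 2.66 is the usual XmR constant (3 / 1.128) for moving ranges of two consecutive points.

```python
from statistics import mean


def control_limits(baseline):
    """Individuals (XmR) chart limits estimated from a stable baseline period."""
    center = mean(baseline)
    moving_ranges = [abs(b - a) for a, b in zip(baseline, baseline[1:])]
    # 2.66 = 3 / 1.128 converts the average moving range into 3-sigma limits.
    half_width = 2.66 * mean(moving_ranges)
    return center - half_width, center, center + half_width


def out_of_control(values, lower, upper):
    """Indices of points falling outside the control limits."""
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical weekly mean absolute error of a deployed model.
baseline_mae = [0.50, 0.52, 0.49, 0.51, 0.50, 0.53, 0.48, 0.51]
live_mae = [0.50, 0.52, 0.51, 0.60, 0.63, 0.70]

lower, center, upper = control_limits(baseline_mae)
alerts = out_of_control(live_mae, lower, upper)

print(f"limits: [{lower:.3f}, {upper:.3f}], center {center:.3f}")
print("weeks outside control:", alerts)
```

Real SPC practice adds run rules (e.g. many consecutive points on one side of the center line), which tend to catch slow drift earlier than the 3-sigma limits alone.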


If a ML model trains on data that is produced on the internet, and we keep getting more and more internet content produced by ML, wouldn’t that mean that this will introduce a natural degradation of quality of all models? I believe that ML will produce more content faster than the rate of humans correcting it?
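
A toy way to see why this could happen: treat the "model" as a Gaussian fitted to a small sample, then train each new generation only on samples drawn from the previous one. Everything here (Gaussian data, sample size of 10, 200 generations, no fresh human data mixed in) is an assumption chosen to make the effect visible, not a claim about how real LLM training behaves.

```python
import random
from statistics import fmean, pstdev


def recursive_fit(generations=200, sample_size=10, seed=0):
    """Refit a Gaussian on its own samples, generation after generation.

    Returns the fitted standard deviation at each generation, starting
    with the original ("human") distribution at index 0.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0
    history = [sigma]
    for _ in range(generations):
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        mu, sigma = fmean(samples), pstdev(samples)
        history.append(sigma)
    return history


spread = recursive_fit()
for gen in (0, 10, 50, 100, 200):
    print(f"generation {gen:3d}: std = {spread[gen]:.3g}")
```

Each refit loses a little of the tails, and with no outside data coming in, the losses compound until the distribution has collapsed to almost a single point. Mixing fresh real data into every generation is the obvious counter, which is exactly the question of whether humans keep producing it.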


AI training data eating AI generated content is very much something that can happen. Combine that with fewer people being willing to write real content, just for it to be stolen for AI training data, and the internet will be effectively destroyed as a place to get true information.


We would need regulation to stop models ingesting data they do not have the right to, which would mean something like laws governing ML algorithms, having to declare what data you fed them and so on. Like some kind of SOC2 audit for data provenance.

Maybe ML weights are just numbers, but then so is a movie, an mp3, a logo, a brand, and so on.


Regulation would introduce a whole new set of problems and unintended consequences.


Which tradeoffs do we want though? Endlessly degrading AI vomit sounds much worse to me.


I think at some point there will be a paradigm shift towards different architecture in the same way transformers were for language (or generally pattern) processing.

You will no longer need to have a model that is exposed to enough training data to be good; instead you will have on-the-fly learning. A human doesn't need to hear the same piece of information over and over again. We can be told one time and, if it's important, we can contextualize it. The same thing will happen with models. You will have a model trained on the core concept of contextualizing new data, and between executions it will have a persistent "memory".

You may start to see things like Hebbian Learning come back into play in some form and way.


https://youtu.be/6i2l-LQ-dXI

Don Hertzfeldt Couch Gag future


> A recent study from MIT, Harvard, The University of Monterrey, and Cambridge showed that..

The post implies "Cambridge" (University) was involved in this study, which is not correct: it is a purely U.S.-based study by the following four institutions:

Monterrey Institute of Technology and Higher Education, Monterrey, Mexico

Massachusetts Institute of Technology, Cambridge, USA

Whitehead Institute for Biomedical Research, Cambridge, USA

Harvard University Medical School, Boston, USA


Oops, that's my bad. Already fixed it. Thanks for pointing it out :)


You still didn't fix it. There is no "The University of Monterrey".

"Monterrey Institute of Technology and Higher Education" is a different university from University of Monterrey.


Small correction, the first one is Mexico-based, not U.S.-based.


Distribution shift in real-world data will always be inherent to any data-driven method. Unless there are major advances in continual learning for DL models, they will always struggle with degradation from distribution shift.

Similarly, humans are also prone to distribution shift unless we get updated information on a specific topic. The key differences are that we are great at continual learning and we are much better at learning abstractions.


Caveat: I have no experience in ML and a few parts of that article went over my head. However, this seems sensible for application-specific ML models, as societal changes move the goalposts. My follow-up question is: if a much more general ML model becomes so fundamental to daily life that it steers a human population’s attitudes towards a large number of the questions they face, would the ML degrade, or would it be possible that the population’s behaviour stays within some performant bound for that ML model? For example, could an airport delay prediction remain accurate for a long period of time because that same huge model controls many of the outputs that are the inputs for that model? Obviously this doesn’t mean the absolute results remain static, just that the inputs to the model stay within a performant domain of values for the underlying model.

Edit: I guess this would be less of a model predicting some external dataset and more of a system that controls its own inputs. So perhaps this question isn’t really relevant to the utility of most (all?) modern ML, which is predicting changes in data that is fully “external”.


I'm happy someone did this. People are making rumblings in traditional simulation modeling circles that ML models will replace traditional forward modelling...because I'm assuming a PM who hasn't touched code in a decade was impressed that chatGPT can write their emails for them.

I feel like it was never the case that dumb fitting could replace a first-principles model you already have in hand. I always had the concern that extrapolation with these models is just as prone to error as it is when your model is linear least squares for a scatterplot: you cannot fit to data that just isn't in your initial dataset! Sure, you can initially, but at some point your model breaks down, as any model does, whether a priori, statistical, or whatever.
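
The scatterplot analogy is easy to check for yourself. Below is a minimal Python sketch, with a made-up "true" process (y = x²) and made-up ranges, that fits ordinary least squares on x in [0, 1] and then compares the error inside that range with the error at a point far outside it. The names are illustrative only and nothing here comes from the paper.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


def truth(x):
    """Hypothetical underlying process the linear model never fully sees."""
    return x ** 2


# Training data only covers x in [0, 1].
train_x = [i / 10 for i in range(11)]
slope, intercept = fit_line(train_x, [truth(x) for x in train_x])


def abs_error(x):
    """Absolute prediction error of the fitted line at x."""
    return abs(slope * x + intercept - truth(x))


in_range_error = max(abs_error(x) for x in train_x)
extrapolation_error = abs_error(3.0)  # well outside the training range
print(f"worst in-range error: {in_range_error:.2f}")
print(f"error at x=3.0:       {extrapolation_error:.2f}")
```

The fit looks fine where it was trained and falls apart once the inputs leave that range, which is the same failure mode as data drifting over time.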

I could have always told you that (and I told my peers that for years) because the fact is intuitively obvious, but it's great someone is actually publishing about it now. Now we just need similar papers for language models. We sort of already experience this when asking ChatGPT about 2023 events, yet somehow it still cannot pierce the minds of the hypemen.


The published study (open access!) that the post writes about:

https://www.nature.com/articles/s41598-022-15245-z


Generally sci rep is open access.


We need living datasets for computer vision.

There is no current mechanism for updating something like ImageNet. There is no place you go to point out problems or contribute changes to keep it in line.

We really need to make datasets social. This has a whole host of challenges like copyright, ownership, versioning, and even hosting costs. Even with all that it is a tractable problem though.


> the data drifts

In fact, the data itself is the closest thing to reality, not the ML model...

It's a conundrum in the flow of reality for me:

Reality -> data that reflects reality (hopefully as best it can) -> train model based on data (typically takes a non-trivial time scale!) -> make decision / do stuff based on algo with newest data

If you ask me, even with all the technology we have, it seems impossible to simultaneously acquire real-time data and train a model to operate and/or make decisions on exactly that data. It's a catch-22: there will always be a lag.

Even as humans with super big brains we can't hope to do this outside of extremely simple tasks like "throw and catch the ball"


It's not exactly a hard problem to continuously train a model, although it may be costly. You can even train the model on every interaction it has, but this quickly leads to degradation because users provide it with low-quality data, for example when they intentionally try to make chatbots say racist things.


HuggingFace is almost there. Uploading a re-trained or fine-tuned model is trivial. Same with datasets.


overfitted backtested models do not predict future very well... anyone surprised?


Yeah, I think you summarized it well.

The fact that they kinda work for the first few years is mostly just because real-world data doesn't/can't change that fast.

(Disclaimer: I only did ML work in academia, so take it with a grain of salt.)


Well, I have some experience with financial models where they work pretty well in shorter periods (could be months or years). And then there is sudden regime change... Pretty obvious but outside of model's scope...


100% of human learning models degrade over time :)

And it's not only aging. There's a well-known saying: "Science Makes Progress Funeral by Funeral". Humans have a very hard time changing their beliefs to adapt to new data, so it's not so surprising that ML models experience a similar phenomenon.

As the great Max Planck, himself the originator of the quantum theory in physics, has said, science makes progress funeral by funeral: the old are never converted by the new doctrines, they simply are replaced by a new generation.


I look forward to the day when I stop seeing AI shit on my LinkedIn feed. I took GPT-4 for a test drive. I spent more time telling it “not that way, but instead this way,” only for it to respond with “o shit, my bad. You are right. Here is the answer.” Sure, it can help with ideas and potentially improve your critical thinking skills, but it is certainly not efficient. Can it improve over time? Most likely. But the data is going to be questionable.


I'd stop using a hammer too, if all I did was bash my thumb with it (or to be fair, if the work requires finer precision/more sophisticated tools than what the ChatGPT hammer can currently provide)


That is an interesting take. I like it.


Some of those graphs are rather hard to parse, as the same domain is presented twice side-by-side, but with different X- and Y-axes. For example, comparing NN and linear regression models we see orders of magnitude difference on the Y-axis. It would really drive the point of 'no change in RV model' home if they had the same scale.

Apart from that, many years ago, at a previous job, I prototyped automatic recognition of model degradation. I think one of the approaches I tried was comparing KL divergence between predictions on the test set and new data. This way you don't need to have the ground truth available (in which case you can make measures based on the residuals). Worked quite well iirc to signal when one should retrain and/or investigate context drift. Nowadays, with relative maturity in explainability methods, you can go much further and can much better find out what exactly causes the drift.
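
For the curious, a minimal Python sketch of that kind of label-free check. This is an illustration, not the original prototype: the function names, the 10-bin histogram, the smoothing constant, and the score lists are all assumptions. It bins the model's predicted scores on the test set and on fresh data, then compares the two histograms with KL divergence.

```python
import math


def histogram(scores, bins=10, eps=1e-6):
    """Smoothed, normalized histogram of scores that lie in [0, 1]."""
    counts = [0] * bins
    for s in scores:
        counts[min(int(s * bins), bins - 1)] += 1
    # Additive smoothing keeps every bin non-zero so the log stays finite.
    total = len(scores) + eps * bins
    return [(c + eps) / total for c in counts]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Hypothetical predicted scores: test set vs. two later production batches.
reference = [0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.20, 0.10, 0.30, 0.25]
similar = [0.12, 0.18, 0.22, 0.28, 0.31, 0.33, 0.21, 0.14, 0.29, 0.26]
shifted = [0.60, 0.70, 0.75, 0.80, 0.65, 0.90, 0.85, 0.70, 0.60, 0.95]

ref_hist = histogram(reference)
kl_similar = kl_divergence(ref_hist, histogram(similar))
kl_shifted = kl_divergence(ref_hist, histogram(shifted))
print(f"KL vs similar batch: {kl_similar:.3f}")
print(f"KL vs shifted batch: {kl_shifted:.3f}")
```

The alert threshold has to be tuned per model, since the absolute KL value depends heavily on the binning and the smoothing.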

In general, I'm not a big fan of just saying 'error's up, let's retrain'. Why did the error go up? Understanding the data here better is usually quite valuable in the business context.


I have some reservations about the methodology used in this study. The framework seems to oversimplify the complexity of real-world scenarios, and the results may not generalize to all types of ML models. The study focused only on temporal data drift and did not consider other types of data drift that can also affect model performance.


These comments are only very marginally associated with the original post. I was hoping for some insight into model drift.


I am not sure if AI aging is the right term to explain what is happening there. If there is a drift in the training data that is supposed to feed the model new patterns, what will result is not an updated model, but rather a more and more confused one.

"Wait, you are saying: yesterday the value of money was 1.0, but today it is 0.99? When you query what is the value of money, please explain first what 'today' and 'yesterday' means please sir"

So, if I am getting this right: you cannot update an already trained model with updated data. Doing so will "confuse" the model with contradictions, without the ability to assess what information or concept is valid now.

The consequence would be to train new models with the updated training data each time, and discard the old ones. Which is not viable at all, to be honest.


Why is it not viable? Do you believe that the costs will exceed the potential benefits, even with the improvements that it offers?


Completely retraining AIs with slightly updated datasets will be costly, energy-wise, definitely.


I challenged Sutton's reinforcement learning course this term (in spite of lacking the appropriate prerequisites) and, while my grasp of some of the mathematical theory was indeed lacking, some pretty foundational messaging really penetrated.

Incorporating curiosity and exploration of new experience into learning algorithms offers so much more utility versus relying solely on offline training and past experiences alone.


Misleading paper title and misleadingly presented research findings. Models don't "degrade" over time; their timeliness and relevance remain frozen pursuant to fine-tuning updates while contemporaneous attentional demands upon them by users drift. They don't wear out like a pair of shoes.


Are there ML models that are trained while being used? Humans learn as we go along, but this "train and deploy" that's so common doesn't seem sustainable.

If it can't be done online, then it shouldn't be that difficult to save the output for each week and then finetune the production model on that each weekend, right? Especially if there are humans correcting the faulty or uncertain outputs.


That's a pretty standard part of MLOps. I have a fraud model in production, it's being incrementally retrained each week, on a sliding window of data for the last x-months.

You can do it "online", which works for some models, but most need monitoring to make sure they don't go off the rails.
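
A rough Python sketch of the sliding-window part, to show the shape of it. Everything here is a placeholder: the "model" is just a mean, the window is counted in weekly batches, and the class and function names are invented. A real fraud model would fine-tune an actual classifier instead of refitting a mean.

```python
from collections import deque
from statistics import mean


def fit(training_data):
    """Placeholder 'model': simply the mean of the training targets."""
    return mean(training_data)


class SlidingWindowTrainer:
    """Keeps the last `window_weeks` batches and refits on them each week."""

    def __init__(self, window_weeks):
        # deque(maxlen=...) drops the oldest batch automatically.
        self.batches = deque(maxlen=window_weeks)
        self.model = None

    def weekly_update(self, new_batch):
        self.batches.append(new_batch)
        training_data = [x for batch in self.batches for x in batch]
        self.model = fit(training_data)
        return self.model


# Hypothetical weekly batches with a slowly drifting target.
trainer = SlidingWindowTrainer(window_weeks=2)
for week in ([1.0, 1.0], [3.0, 3.0], [5.0, 5.0]):
    model = trainer.weekly_update(week)

print(model)  # fitted on the two most recent weeks only
```

The window length is the main knob: too short and the model chases noise, too long and it lags behind the drift.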


That's good to hear, how does it work in practice? Is it basically running the same training as from scratch, but with only the new data, on a separate machine to produce a new version which is then replacing the old production version? Is part of MLOps starting a new training session each week, checking if the loss function looks ok, and then redeploying it?

I still think of how humans work. We don't get retrained from time to time to improve, we learn continually as we gain experience. It should be doable in at least some cases, like classification where it's easy to tell if a label is right or wrong.


The typical setup is very simple:

* Take the previous model checkpoint and retrain/finetune it on a window with new data. You typically don't want to retrain everything from scratch, which saves time and money. For large models you need specialized GPUs to train them, so typically the training happens separately.

* Check the model statistics in depth. We look at way more statistics than just the loss function.

* Check actual examples of the model in action.

* Check the data quality. If the data is bad, then you're just amplifying human mistakes with a model.

* Push it to production and monitor the result.
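
The gist of that checklist can be sketched as a gated release step. This is a toy Python illustration under loud assumptions: the "model" is a single number, `finetune` and `score` are hypothetical stand-ins, and the only automated gate is one held-out metric. The manual checks (looking at examples, checking data quality) are not something a snippet can replace.

```python
from statistics import mean


def finetune(checkpoint, new_targets, learning_rate=0.5):
    """Toy fine-tune: nudge a one-parameter model towards the new data."""
    return checkpoint + learning_rate * (mean(new_targets) - checkpoint)


def score(model, holdout_targets):
    """Higher is better: negative mean absolute error on held-out data."""
    return -mean(abs(model - y) for y in holdout_targets)


def weekly_release(production_model, new_targets, holdout_targets, tolerance=0.0):
    """Fine-tune from the current checkpoint; promote only if not worse."""
    candidate = finetune(production_model, new_targets)
    candidate_score = score(candidate, holdout_targets)
    production_score = score(production_model, holdout_targets)
    if candidate_score + tolerance >= production_score:
        return candidate, True  # push to production, keep monitoring
    return production_model, False  # keep the old model, investigate the data


# Good week: new data agrees with the trusted holdout set.
good_model, good_promoted = weekly_release(10.0, [12.0, 12.0], [12.0, 12.0])
# Bad week: corrupted new data would drag the model away from the holdout.
bad_model, bad_promoted = weekly_release(10.0, [0.0, 0.0], [12.0, 12.0])
print(good_model, good_promoted, bad_model, bad_promoted)
```

Keeping the old checkpoint around when the gate fails is what makes the weekly cycle safe to automate.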

MLOps practice differs from team to team, so this checklist isn't universal, just one possible approach. Everyone does things a little differently.

> I still think of how humans work. We don't get retrained from time to time to improve, we learn continually as we gain experience. It should be doable in at least some cases, like classification where it's easy to tell if a label is right or wrong.

For some models, like fraud, correctness is important. Those models need a lot of babysitting. For humans, think about how the average facebooker reacts to misinformation, you don't want that to happen with your model.

Other models are ok with more passive monitoring, things like recommendation systems.

Continuous online training can be done. Maybe take a look at reinforcement learning? It's not widely applied, has some limitations, but also some interesting applications. These types of things might become more common in the future.


>Are there ML models that are trained while being used?

https://en.wikipedia.org/wiki/Reinforcement_learning#Explora...


When I learned about RL we were taught to disable exploration when evaluating the model, since the exploration part is stochastic. I don't think that would work in production.
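
To make the exploration point concrete, here is a minimal epsilon-greedy sketch in Python. The function name and the Q-values are made up. With `epsilon > 0` the policy sometimes picks a random action (training), and with `epsilon = 0` it is fully greedy and deterministic (the usual evaluation setting).

```python
import random


def choose_action(q_values, epsilon, rng=random):
    """Epsilon-greedy: explore with probability epsilon, else act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: uniform random action
    # Exploit: index of the highest estimated value.
    return max(range(len(q_values)), key=q_values.__getitem__)


# Hypothetical action-value estimates for three actions.
q_values = [0.1, 0.9, 0.3]

greedy_action = choose_action(q_values, epsilon=0.0)    # evaluation mode
training_action = choose_action(q_values, epsilon=0.2)  # may explore
print(greedy_action, training_action)
```

Whether any exploration is acceptable in production depends on how costly a deliberately suboptimal action is for the application.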


I am just learning but I am guessing the issue is training is expensive, because you need to iterate many times, and do a massive bunch of matrix operations (and other operations) per iteration?

Maybe too expensive to "keep up" with new data coming in? Does a single new piece of data mean you need to start from scratch, or are you "near enough" and so it needs less processing power to incorporate it?


> continuously monitoring the estimated model performance

Main point here. Model performance can degrade for any number of reasons and at varying rates. As a starting point, focus on setting up anomaly monitoring for a robust set of model eval metrics tailored to your task: loss, calibration, model staleness, etc. Timely alerts can give you sufficient time to dig in and root cause, roll back a model, etc.
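
As one example of a metric worth tracking, here is a minimal Python sketch of a binned calibration error for a binary classifier. The function name, the 5-bin choice, and the data are assumptions for illustration. It compares the average predicted probability with the observed positive rate in each bin and weights the gaps by bin size.

```python
def expected_calibration_error(probs, labels, bins=5):
    """Weighted mean gap between predicted probability and observed rate."""
    totals = [0] * bins
    prob_sums = [0.0] * bins
    positives = [0] * bins
    for p, y in zip(probs, labels):
        b = min(int(p * bins), bins - 1)
        totals[b] += 1
        prob_sums[b] += p
        positives[b] += y
    n = len(probs)
    error = 0.0
    for b in range(bins):
        if totals[b] == 0:
            continue  # empty bins contribute nothing
        mean_prob = prob_sums[b] / totals[b]
        observed_rate = positives[b] / totals[b]
        error += abs(mean_prob - observed_rate) * totals[b] / n
    return error


# Hypothetical batches: the model keeps predicting 0.9 in both.
probs = [0.9] * 10
labels_at_launch = [1] * 9 + [0]     # 90% positives: well calibrated
labels_after_drift = [1] * 5 + [0] * 5  # 50% positives: overconfident

ece_launch = expected_calibration_error(probs, labels_at_launch)
ece_drift = expected_calibration_error(probs, labels_after_drift)
print(f"launch: {ece_launch:.3f}  after drift: {ece_drift:.3f}")
```

Note that this needs ground-truth labels, so in practice it runs with whatever delay the labels arrive with.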


A feedback loop where AI trains on second-order regurgitation of what it has output almost necessitates that this happens (and oscillates).


I don't think the metaphor of aging makes the problem clearer or approaches the root cause of this phenomenon.

A model doesn't 'age'; it just stops learning.

When we see it that way, the solution is clear and it can be understood in the context of all systems that must change to be effective or survive in their environment. This is one place where anthropomorphism can help.


“In a real life setting, underlying distribution is stationary about 9% of the time”. Better than I would have expected, actually.


> Another critical decision is that they only investigated pairs of model-dataset with good initial performance.

groan


Overfit in training and then complain that it doesn't generalize well in production, a true classic.


If this is what happened, what explains the fact that some of these models performed great for about a year before decaying?


This seems rather dismissive for something actually published in a real journal and not just on arxiv for once, right?


If you check the abstract, then this got published more for describing these drift patterns and showing ways to visualize and detect them rather than dropping the weird statistic the article makes it about.


Using NN for tabular data has been proved to be a bad idea. I don't think the results regarding NN from this study can be directly applied to other domains such as CV or NLP. However, as others have stated, model degradation is a common problem to all machine learning production models.


100% of ML Models Degrade over More Time


Haha true


For all those initially surprised as to how it is possible for an AI model to age, here are the example datasets measured:

- Patient wait time dataset

- Weather temperature dataset

- Financial stock price dataset

- Airport departure delay dataset

I'm actually shocked that those kinds of ML models exist. I'm glad I didn't pursue that career path.


Phrasing like "losing relevance" would be more appropriate than degrade here I think


Obviously. This is an inherent property of neural networks and applies to biological neural networks as well as artificial ones.

Whenever you as a person learn something new, you are degrading some other memory that you have.


Listen to this article (14min) at https://playtext.app/doc/clggm0702001alg0ge6brppom


Many comments are focussed on how this will affect models in deployment, but a more troubling implication for me is that evaluations and comparisons across models using past data are affected by this.


Shades of that Star Trek episode with Data's daughter in it.


I think AI models are different from ML models, at least mathematically. ML models and the mathematics behind them are explained thoroughly, and people (who learnt about ML models) understand why certain ML models work or do not work. As for the AI models that are springing up left and right now, I doubt people (again, those who learnt about AI models) have a complete understanding of why they work, similar to how X-rays were used by doctors who did not know why they worked. I guess for the business case of ChatGPT, if it is bringing in the money, why should it not be used?


I don't think that's a valid distinction. There's no such thing as "AI" models imho. ChatGPT is also just ML. It's just a specific type of model that's hard to explain because it has so many parameters and embeddings in-between layers and stuff like that. It's not as easily explainable as decision tree-based models maybe. You would be surprised by how many models are actually hard to explain.


Evergreen, given the occurrent ITT of critiques which misunderstand this work as "assertion of an Achilles' Heel" rather than of one among many bugs in the tracker which will duly get prioritized and fixed:

Ximm's Law: every critique of AI assumes to some degree that contemporary implementations will not, or cannot, be improved upon.

Lemma: any statement about AI which uses the word "never" to preclude some feature from future realization is false.

Footnote de rigueur warning about reasoning about these systems from a fixed mindset as "stochastic parrots": these things are absolutely nothing like Markov chain abacuses, ELIZA, or indeed parrots; holding an over-simplified model prevents actual reasoning, especially given the non-linearities of improvement which continue to be demonstrated. Cf. "GPT-5's unicorn."

Hypothesis: the optimal internal representation of the world needed to generate language homologous to it, is one which performs abstraction into world-models and state, and learns predictive functions about those models, testing them in the Sagan sense against data. QED such things are not an accident but an inevitable emergent property of these systems, one which indeed we continue to see emerging.

TL;DR be very careful of partaking of cynicism and skepticism based on inappropriately simple and rapidly-deprecating models of the status quo.


Would LLMs forget things, and would they be affected by the Mandela effect?


Not sure what they mean. If all the code and model data are the same, it won't forget anything encoded in the model. Of course, the actual post-training context window is really short, and it won't remember anything at all that falls out of the context window.


Why don't we just ask ChatGPT about itself? Duh. /s


exactly 91% (sorry for the trolling, couldn't resist)


That was my first reaction ("91% ? are you sure it's not 91.2%?") but it's actually a really good blog post.


Life goes on and every time you remember, memory changes.


Parameter shift is nothing new.


the other 9% just plain don't work


These models are so human-like, they even imitate aging! /s


This is why I suggested that content generated by AI should be indicated as such with HTML tags, such as META at the high level. But perhaps a property/value pair is better, such as aiGenerated=true.

But as the use of AI generated content proliferates, this is unlikely to happen. We might one day assume that most content generated after a certain year is AI generated.

Still, if people don't edit and refine that content enough, it will be considered junk and recommendation algorithms should take care of the rest, so that when GPT-10 crawls the web, it will leave out that junk.



