The Curse of Recursion: Training on generated data makes models forget (2023) (arxiv.org)
122 points by surprisetalk | 107 comments





My takeaway after scanning the paper -

In an ideal setting, a trained model learns exactly the real world probability distribution, and generates data indistinguishable from those sampled from the real world. Training on them would be fine, but pointless, since the model is already a perfect representation of the real world.

Practically, however, a model is only a lossy approximation of the real world probability distribution. Repeated self-training would simply compound the loss - amplifying both the probable and the improbable.
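A minimal sketch of that compounding-loss intuition, with a 1-D Gaussian standing in for the model (numpy assumed; the paper's actual experiments use LLMs, VAEs, and GMMs):

    import numpy as np

    rng = np.random.default_rng(0)
    real = rng.normal(0.0, 1.0, size=1000)             # "real world" samples
    mu, sigma = real.mean(), real.std()                 # generation-0 fit

    for generation in range(20):
        synthetic = rng.normal(mu, sigma, size=1000)    # sample the fitted model
        mu, sigma = synthetic.mean(), synthetic.std()   # refit on synthetic data only
        print(generation, round(mu, 3), round(sigma, 3))

Each generation is fit to a finite sample of the previous one, so the estimated mean and variance wander further from the original distribution, and the tails are the first thing to go.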


This paper was first published in May 2023 and discussed on HN the following month:

https://news.ycombinator.com/item?id=36319076

Some research since seems to add nuance to its conclusions:

https://arxiv.org/abs/2404.01413


> The proliferation of generative models, combined with pretraining on web-scale data, raises a timely question: what happens when these models are trained on their own generated outputs? Recent investigations into model-data feedback loops proposed that such loops would lead to a phenomenon termed model collapse, under which performance progressively degrades with each model-data feedback iteration until fitted models become useless. However, those studies largely assumed that new data replace old data over time, where an arguably more realistic assumption is that data accumulate over time. In this paper, we ask: what effect does accumulating data have on model collapse? We empirically study this question by pretraining sequences of language models on text corpora. We confirm that replacing the original real data by each generation's synthetic data does indeed tend towards model collapse, then demonstrate that accumulating the successive generations of synthetic data alongside the original real data avoids model collapse; these results hold across a range of model sizes, architectures, and hyperparameters. We obtain similar results for deep generative models on other types of real data: diffusion models for molecule conformation generation and variational autoencoders for image generation. To understand why accumulating data can avoid model collapse, we use an analytically tractable framework introduced by prior work in which a sequence of linear models are fit to the previous models' outputs. Previous work used this framework to show that if data are replaced, the test error increases with the number of model-fitting iterations; we extend this argument to prove that if data instead accumulate, the test error has a finite upper bound independent of the number of iterations, meaning model collapse no longer occurs.

TL;DR: This paper confirms that Model Collapse can happen if the original data is replaced with synthetic data, but if both are used alongside each other, it no longer happens.
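For intuition, here is a toy version of the replace-vs-accumulate comparison, again with a 1-D Gaussian standing in for the model (numpy assumed; the linked paper works with language and diffusion models):

    import numpy as np

    rng = np.random.default_rng(0)
    real = rng.normal(0.0, 1.0, size=1000)

    def run(mode, generations=50):
        train = real.copy()
        for _ in range(generations):
            mu, sigma = train.mean(), train.std()
            synthetic = rng.normal(mu, sigma, size=1000)
            # "replace": keep only the new synthetic samples
            # "accumulate": keep the real data and append each synthetic generation
            train = synthetic if mode == "replace" else np.concatenate([train, synthetic])
        return round(train.mean(), 3), round(train.std(), 3)

    print("replace:   ", run("replace"))      # drifts away from (0, 1)
    print("accumulate:", run("accumulate"))   # stays anchored near (0, 1)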


There is a mantra in ML that has been around for a while. It's that when training on synthetic data, your learned model is only as good as your generator model.

Catchy! And a really good point.

Seems like there could be room for a couple of special situations with caveats though? With the GAN formulation your generator can be practically as good as your discriminator and your discriminator can probably be better than it would have been without adversarial regularization?
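For reference, a bare-bones sketch of the GAN setup being described, on 1-D data (PyTorch assumed): the generator never sees real samples directly, so its quality is bounded by how good a critic the discriminator is.

    import torch
    import torch.nn as nn

    real_dist = torch.distributions.Normal(3.0, 1.0)   # stand-in "real world" data

    G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))   # generator
    D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))   # discriminator

    opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
    bce = nn.BCEWithLogitsLoss()

    for step in range(2000):
        real = real_dist.sample((64, 1))
        fake = G(torch.randn(64, 8))

        # discriminator: separate real samples from generated ones
        d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

        # generator: its only training signal is the discriminator's judgment
        g_loss = bce(D(fake), torch.ones(64, 1))
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()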


Isn't it obvious? Synthetic data was nice for generating an underlying distribution and avoiding jumps in 'data holes', but with methods like diffusion now it's not as necessary.

That was very much evident even back when the first GPTs came out. The moment you started introducing synthetic data, the quality plummeted.

But there is another use case where LLMs can truly help with synthetic data: the more classical classification and regression problems, specifically gathering training data. I had this exact case at work two days ago: a large dataset with a small subset of labeled data. For a binary classifier, there was a huge imbalance in the data; the ratio was roughly 75/25. I did not have the desire to do all this manually, so I used an LLM to get a list that would even out the numbers (and get a 50/50 ratio). And using the data I had, plus the additional synthetic data, the accuracy of my small classifier ended up picture-perfect (given that my actual target was "85-90%" accuracy, the actual result was just shy of 99%).
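Roughly the workflow being described, as a hedged sketch; llm_generate() is a hypothetical stand-in for whatever API was actually used, and labels are assumed to be 0/1:

    def balance_with_synthetic(examples, labels):
        """Top up the minority class of a binary text dataset with LLM-written examples."""
        minority = min((0, 1), key=labels.count)
        deficit = labels.count(1 - minority) - labels.count(minority)
        prompt = f"Write {deficit} short, realistic examples of the under-represented class for this classifier."
        synthetic = llm_generate(prompt)   # hypothetical LLM call returning a list of strings
        return examples + synthetic, labels + [minority] * len(synthetic)

As the replies below note, this bakes in the assumption that a 50/50 class balance is the right prior for the problem.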


I'd argue that the case you give isn't an example of using a computer to generate data, it's a case of a human adding data (the data being the fact that the binary classifier should have a 50/50 balance).

This sort of massaging of data has its drawbacks as well--obviously this only works if the balance of that binary classifier actually is 50/50 in reality: I don't know enough about your case to say you were wrong, but I can imagine a lot of scenarios where a binary classifier should not be represented 50/50 in the data.


This is a question of definition. It is synthetic in that I just passed a prompt, asking for N examples of X. And I did not go over the entire list I got; I blindly trusted it. In this context, I needed an even (or nearly even) distribution of samples in the training data, and it worked way better than I was hoping. Mind you, I have to face a similar issue next week and I'm not sure this approach would cut it: I need way more training data and way more classes to work with. 249 classes, if I'm not mistaken.

I question what "it worked way better than I was hoping" means in this context. If you're saying that filtering the input data to create a uniform distribution created a uniform distribution in the output, I'm not sure why you'd hope for any less--that's exactly what I'd expect to happen. But that's a poor measure of success, because you don't know what side effects that had: the removed data ostensibly contained other variables besides your binary variable, and you don't know if those variables were sampled in any useful way, so I'd be hesitant to say this worked well without at least an attempt to measure those other variables.

Can you clarify what you mean by using an LLM to “get a list that would even out the numbers”? If you’re doing binary classification, you need datapoints for features as well as the target class so how does an LLM synthetically create that without causing problems?

Just curious, but did you compute that 99% using purely real test data, or does your test set also include artificial data?

Apart from using the results from the training/testing/validation sets? Several people manually went over several thousand random samples.

> the value of data collected about genuine human interactions with systems will be increasingly valuable in the presence of content generated by LLMs in data crawled from the Internet.

Does it mean that data-hungry corporations like Google, Facebook, Amazon, and OpenAI with Microsoft backing, which are already all around the internet and on our phones tracking us, have an incredible advantage over open-source models?

Is that why Google is pushing Gemini so hard on Android even though it's half-assed? Do they need fresh human data that badly to be able to compete with and beat the competition?


> Does it mean that data-hungry corporations like Google, Facebook, Amazon, and OpenAI with Microsoft backing, which are already all around the internet and on our phones tracking us, have an incredible advantage over open-source models?

Yes, absolutely. Back around 2017 The Economist ran an article declaring that "data is the new oil"; I first heard that phrase from a VC back in 2010.

These companies are sitting on immense reserves of data: Google, Facebook, Amazon, and ByteDance are the Saudi Arabia, UAE, etc. of the information age.


The quality of reddit's data is different from other data I encounter online.

It represents information more closely related to people's lives. People share information there that is closely related to the topic of the subreddit. This may not always be the case, but even though I spend much, much less time on reddit than I did in 2011, many, many people are contributing to this day.

That spigot of connection to the real world through text sounds valuable to AI, based on TFA. I feel the oil analogy would be about the quality of the deposit and the ease of extracting it.


Given that the top google results are now generated I think we already have a massive recursion problem. I think we would benefit from training a model specifically to just detect a likelihood of content being generated and then bias other models against the higher likelihood generated content so that we don’t end up with LLM echo chambers.

Right. Google already has a solution: https://deepmind.google/technologies/synthid/ But everyone insists on training their models to look human-generated, so the horses have left the stable on this.

Isn't everybody always gushing about how LLMs are supposed to get better all the time? If that's true then detecting generated fluff will be a moving target and an incessant arms race, just like SEO. There is no escape.

Yep, that's what I've been thinking since people started talking about it. I hear that AI plagiarism detectors can never work, since LLM output can never be detected with any accuracy. Yet I also hear that LLMs-in-training easily sift out any generated content from their input data, so that recursion is a non-issue. It doesn't make much sense to have it both ways.

I wonder if the truth about sifting out synthetic training data is based on signals separate from the content itself. Signals such as the source of the data, reported author, links to/from etc.

These signals would be unavailable to a plagiarism/ai detector


My intuition, given the rapid, informal development of agent-type systems, is that this is obvious insofar as the initial dataset was formed from a huge hidden "data cleaning" task that was human evolution and society. This isn't really that interesting a claim, and is it clear that it holds if you simply loop the LLM back onto the data-cleaning task itself, as a critic of the new training set? Is this what the author would classify as fine-tuning?

Another question is what is the interpretation of the output of an LLM generation when unprompted? Isn't that always effectively garbage when there's not a deliberate bias in the training set?


Isn't this obvious?

I'm glad this was published to point out the problem, but I'm a bit puzzled why people tried to train models on generated data in the first place. Synthetic data... isn't data.

The one exception I can see is medical data, where synthetic data can be used to avoid violating people's privacy, but even in that case it's clearly not ideal from a technical perspective.


To me it seems intuitive that training on any unseen word patterns should increase intelligence as long as said patterns are consistent with ground truth. That's why it's counter-intuitive (to me) that training can fail, purely based on where the training data came from. The source of the information is something only the universe itself should be able to take into consideration (full causality chain), and not the training process.

I am unable to parse what you're saying here.

I was just saying it's counter-intuitive that the "source" of any training data would ever matter as much as the "correctness" of the data; but you're right, that was very sloppy wording on my part, sorry.

Here's a longer, related post, from me (albeit also confusing, haha):

https://news.ycombinator.com/item?id=42352759


I think you're trying to separate two inseparable concepts. The ONLY means we have of verifying the correctness of data is by comparing it with observation, i.e. data from the real world. Real world data is correct data, and synthetic data is inherently only as correct as its correlation with real world data.

There is of course some real world data that is more correct than other real world data, based on collection methods, sample sizes, etc., but the only way we know that is, again, real world data.


But I think we can write computer programs to generate infinite amounts of "correct" data. For example, imagine you want to train AI to recognize teacups. Using computer graphics (not even AI), we can generate infinite numbers of "correct" example images to train on, simply by rotating a CGI model through every possible viewpoint on a 3D sphere. If that doesn't work, it means there's got to be some deep physics about why. If training works on "natural" correct data but not "synthetic" correct data, that's telling us something DEEP about physics.

The same is true with LLM (language data) factual statements. We can generate infinite numbers of true factual statements to train on. If the AI refuses to learn from synthetic data despite that training data being correct, that's bolstering my view that perhaps there's more deep physics going on related to causality chains and even crazy concepts like multiverses.
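A hedged sketch of the teacup example above: sweep a camera over the view sphere of one CGI model and save every rendering as a labeled training image. render_view() is a hypothetical helper standing in for whatever offline renderer you'd actually use.

    import math

    def view_directions(n_azimuth=36, n_elevation=9):
        """Enumerate camera directions roughly covering the view sphere."""
        for i in range(n_azimuth):
            azimuth = 2 * math.pi * i / n_azimuth
            for j in range(n_elevation):
                elevation = math.pi * (j + 0.5) / n_elevation - math.pi / 2
                yield azimuth, elevation

    def generate_teacup_dataset(model_path, out_dir):
        for idx, (az, el) in enumerate(view_directions()):
            image = render_view(model_path, azimuth=az, elevation=el)  # hypothetical renderer
            image.save(f"{out_dir}/teacup_{idx:04d}.png")              # every image is labeled "teacup" by construction

The replies below get at why this data is "correct" only with respect to that one model, not to teacups in the wild.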


> But I think we can write computer programs to generate infinite amounts of "correct" data. For example, imagine you want to train AI to recognize teacups. Using computer graphics (not even AI), we can generate infinite numbers of "correct" example images to train on, simply by rotating a CGI model through every possible viewpoint on a 3D sphere.

That's not training AI to recognize tea cups, that's training AI to recognize your model of a teacup. Of course this will fail, because your generated data doesn't contain everything that a picture of a teacup contains.

Look at real teacup pictures [1]. None of the teacups I have at home have flowers on them, so I wouldn't have thought to put flowers on my model, but most of those pictures have flowers on them. But even if you thought to put flowers on your teacup, are you now generating a bunch of varieties of flower patterns?

And that's not even starting in on images like a teacup with tea in it [2], a teacup with a dog in it [3], or a teacup with a teabag[4].

In short, your 3d model ISN'T CORRECT in any useful sense of the word.

[1] https://duckduckgo.com/?t=h_&q=teacup&iax=images&ia=images

[2] https://thumbs.dreamstime.com/b/tea-cup-tea-upside-wood-tabl...

[3] https://img-s-msn-com.akamaized.net/tenant/amp/entityid/BB1k...

[4] https://upload.wikimedia.org/wikipedia/commons/thumb/a/ad/Te...


In the teacup example I should've been clearer. I didn't mean general "real world" teacup recognition. I meant it as a pedagogic example test case where your goal was only to recognize that EXACT object, BUT from any view ANGLE. That's a far simpler test case than the real world, and can even be done on trivially small parameter-count MLPs. Yes, for recognizing real-world objects you need large numbers of real-world example images, sure. I was merely getting at the fact that synthetic data can be perfectly valid training data, in research scenarios where we're just doing these kinds of experiments, to probe MLP learning capabilities.

What I'm trying to get at is: if you have two sets of training data that are equal in every way, except that one is synthetic and the other is real-world, and training always fails on the synthetic one, to me that's as "impossible" (i.e. astounding) as the double-slit experiment that proves wave/particle duality, but it seems that this is indeed the case.


I don't think you're understanding what it means for training a model to "fail".

Sure, you can train a model to recognize a CGI teacup, but nobody cares. That's like testing if your scissors can cut air or if your car can move at 0mph. The goal of training on synthetic data is to be able to have the trained model operate on real world data, and the test is whether it can operate on real-world data. And it's unsurprising when a model trained on synthetic data fails to operate on real-world data.

Yes, it would be surprising if you trained an AI model on a CGI model and it failed to operate on the same CGI model. But that's not what's being tested, because that's trivial. That's not "probing MLP learning capabilities"--we know that works, and can even tune parameters to control exactly how well it works. We know exactly how complex the CGI model is, so we know exactly how much complexity we need to capture and how much complexity is lost at each step of the training process so we can calculate exactly how well the AI model will operate on that. You don't even need AI for that.

What we don't know is how complex the real world is. This presents a bunch of unknowns:

1. Is our training dataset large enough to capture most of the complexity of the real world?

2. Are our success metrics measuring the complexity of the real world?

3. Which parts of our training dataset are observed complexity (signal) and which parts are merely random (noise)?

> What I'm trying to get at is: if you have two sets of training data that are equal in every way, except that one is synthetic and the other is real-world, and training always fails on the synthetic one, to me that's as "impossible" (i.e. astounding) as the double-slit experiment that proves wave/particle duality, but it seems that this is indeed the case.

No, that is not the case. We DON'T have two sets of training data that are equal in every way except that one is synthetic and the other is real world. That doesn't exist, and will never exist, because it cannot exist. This idea needs to be deleted from your thinking because it is objectively, mathematically, immutably, physically, literally, specifically, absolutely, inherently impossible.

It is unsurprising that training on synthetic data fails. Again: "fails" in this case, means that the model trained on synthetic data fails to operate on real world data--nobody cares if your model operates on the exact data it was trained on. The reason it is unsurprising that training on synthetic data fails to operate on real-world data is that synthetic data is inherently a loss of information from the understanding of real-world data that was used to generate it.

No matter how many CGI models of teacups you generate, your CGI models of teacups will never capture all the complexity of real-world teacups. So training an AI model on CGI models of teacups will always fail the only test that matters: operating on real world data containing teacups.


> synthetic data is inherently a loss of information

That statement is exactly what I disagree with.

Here's why:

Thought experiment: Imagine a human infant who had only ever seen a pure white teacup, never any other color, and only on the blue background of its crib sheets. They can learn to understand "Teacup as a Shape" completely independent of any texture, lighting, background, etc. MLPs can also train like this, because vision AIs are generating "understandings" of shapes. If you filtered all training data (from a real-world dataset, for example) to contain ONLY white 3D-rendered teacups, the AI would still be able to learn the teacup shape (just like the infant), and it would recognize all teacups of all colors, even if the training data only contained synthetically generated white ones.

The following can be (and is) true at the same time: To get best results on real world objects, the best thing you can train on is real-world imagery, because a diverse set of images helps the learning. But no single "synthetic" image is "bad" (or even less useful) just because it's synthetic and not photographic.


"Thought experiment" is just a rebranding of "some shit I made up". Calling it an "experiment" belies the fact that an experiment involves collecting observations, and the only thing you're observing here is the speculation of your own brain. No part of your thought "experiment" is evidence for your opinion.

> They can learn to understand "Teacup as a Shape" completely independent of any texture, lighting, background, etc.

Or, maybe they can't. I don't know, and neither do you, because neither of us has performed this experiment (not thought experiment--actual experiment). Until someone does, this is just nonsense you made up.

What we do know is that human infants aren't blank slates: they've got millions of years of evolutionary "training data" encoded in their DNA, so even if what you say happens to be true (through no knowledge of your own, because as I said, you don't know that), that doesn't prove that an AI can learn in the same way. This is analogous to what we do with AIs when we encode, for example, token processing, in the code of the AI rather than trying to have the AI bootstrap itself up from raw training on raw bytestreams with no understanding.

You could certainly encode more data about teacups this way to close some of the gap between the synthetic and real-world data (i.e. tell it to ignore color data in favor of shape data in the code), but, that's adding implicit data to the dataset: you're adding implicit data which says that shape is more important than color when identifying teacups. And that data will be useful for the same program run against real-world data: the same code trained against a real world teacup dataset will still outperform the same code trained against a synthetic dataset when operating on real-world data.

This isn't a thought experiment: it's basic information theory. A lossy function which samples its input is at most only as accurate as the accuracy of its input.

But no image AIs I know of work this way because it would be a very limiting approach. The dream of AI isn't recognizing teacups, it is (in part) recognizing all sorts of objects in visual data, and color is important in recognizing some object categories.

Frankly, it's clear you lack the prerequisite background in information theory to have an opinion on this topic, so I would encourage you to admit you don't know rather than spread misinformation and embarrass yourself. If you want to know more, I'd look into Kolmogorov complexity and compression and how they relate to AI.

I won't be responding further because it's not worth my time to educate people who are confident that their random speculations are facts.


> "shit I made up".

Never read past that. I bet nobody else does either. lol. You're just desperate to be as offensive as possible without crossing the threshold where you'll get flagged.


If models had eyes, they would be glazing over with stupor when fed generated data.

While I'm sure the anti-AI people are taking this and running off with hot takes, the conclusion is still much more mundane: we currently do not have the ability to have an LLM learn from another LLM.

A suitably powerful AI should be able to do this though, by the example of the fact that humans learn by being taught by other humans (insert nuance of that process here).

So it's an important result, but not a doomsday result because what it tells us is that LLM output fails to capture or stabilize important information from the training corpus and accurately communicate it to a newly trained LLM. So we know we're missing something in how we construct these models, but the ramifications of solving it are also pretty immense: models being able to "teach" new models means the whole cycle of iteration can be sped up considerably.


It has existed for years:

Self-Instruct: Aligning Language Models with Self-Generated Instructions https://arxiv.org/abs/2212.10560

airoboros: using large language models to fine-tune large language models https://github.com/jondurbin/airoboros


> we currently do not have the ability to have an LLM learn from another LLM

We do. It's called model distillation, and it's relatively straightforward.

In fact, training a smaller model on the outputs of a much bigger model will significantly cut down on your training time/create a higher quality model than just training on raw human data (which is often low quality and noisy).
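A minimal sketch of the distillation objective (PyTorch assumed): the student is trained to match the teacher's softened output distribution rather than, or in addition to, the raw labels.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, temperature=2.0):
        """KL divergence between temperature-softened teacher and student distributions."""
        t = temperature
        soft_teacher = F.softmax(teacher_logits / t, dim=-1)
        log_student = F.log_softmax(student_logits / t, dim=-1)
        return F.kl_div(log_student, soft_teacher, reduction="batchmean") * (t * t)

    # usage inside a training loop, with the teacher frozen:
    # with torch.no_grad():
    #     teacher_logits = teacher(batch)
    # loss = distillation_loss(student(batch), teacher_logits)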


Indeed, ingesting generated bluster gives them cancer of the perceptron.

My intuition is that neither the public, nor users, nor the industry will take this problem seriously. To me this paper sounds like a thunderclap.

It's not a real problem, in my understanding. It's a "this kills cancer in a petri dish" sort of thing.

Yes, it makes sense that if your algorithm is at all lossy, passing outputs through it again compounds the loss.

The reality, though, is that this doesn't happen on a large scale. JPEG isn't destroying all imagery because we're not stupid enough to constantly compound compression losses.

Most AI output is generated for some sort of purpose. If we're making images then we're throwing out the bad ones, retouching blemishes, and also still making new works entirely by hand.


Well, if we’re using the output of AIs, writing blog posts with it, and using AI to post comments on Reddit/X etc. (as some do), and a few years later OpenAI et al. refresh their datasets to train a new model, then you’re doing exactly that, aren’t you? Using lossy model outputs as inputs to the function again, that is.

It's harder with LLMs, but we still have metrics for a lot of text.

On Reddit/HN/etc we have comment scores, replies, reposts, external references, etc that we can use to estimate whether a given comment/post was any good.

An entity like Google that indexes the whole web has visibility into when a given piece of content showed up, if it changed, if it got referenced elsewhere later, etc.

LLMs can be used to analyze the page and work out things like "the comments answering this comment are saying it's wrong"

We can also of course pick and choose, Reddit has some completely garbage subreddits and very tightly moderated ones.

It's of course by no means foolproof, but it doesn't have to be. It just has to be good enough to be usable for whatever purpose we need.

Also, perfection isn't a thing and such issues happen even without LLMs. Like all the cases of something wrong being added to Wikipedia, getting repeated on some news site, and then Wikipedia using that as a reference to backup the incorrect claim.


Like Sam Altman and Dario Amodei both believe is a very real possibility as well, I think the "intelligence" in LLMs may be far deeper than we know and somehow even related to "Multiverse Theory", where perhaps every Quantum Mechanical collapse (and computation during training), makes "our" universe slightly more likely to lean towards ones where AI is just "magically smart" (from a purely Anthropics Principle Effect) than dumb. The reason this could happen is because in all our futures AI has saved us in some way, so that all other "Multiverse Branches are sort of dead-ends".

So the theory about why training on training data is unexpectedly inefficient could be that LLMs are "using" the full Causality Chain (via some advanced unknown Physics related to time itself) of our universe/timeline, and so if an LLM tries to train on its own output, that's a "Short Circuit" kind of effect, cutting off the true Causality Chain (the past history of the universe).

For people who want to remind me that LLM Training is fully "deterministic" with no room for any "magic", the response to that counter-argument is that you have to consider even the input data to be part of what's "variable" in the Anthropics Selection Principle, so there's nothing inconsistent about determinism in this speculative, and probably un-falsifiable, conjecture.


All work and no play makes Jack a dull boy.

The dignified way to describe the problem at hand is alluding to Brouwer's fixed-point theorem[1], with white noise as the fixed point.

The more practical way is alluding to The Human Centipede[2].

Either way, the feed-back loop doesn't result in a good output.

[1] https://en.wikipedia.org/wiki/Brouwer_fixed-point_theorem

[2] https://en.wikipedia.org/wiki/The_Human_Centipede_(First_Seq...



And yet I prefer now to the early big-bang era of the universe, though it's technically reversible.

The universe is not a Markov chain, in fact, no one knows what it is but locally we do know that entropy increases and the inevitable endpoint in our corner of the universe is complete annihilation. Your preferences are completely irrelevant in the local scheme of things.

I no longer take limitations seriously regarding the future of AI. If evolution created our brain, then the same law applies to what we are building too. Hence, more or less, whatever is written in this paper is some nuanced case that can be solved by some approach.

This is intuitively obvious. If I give you some data x and you transform it with a non-reversible function f into f(x), then you are losing information, and repeated applications of the function, f(f(f(...f(x)...))), can only make the end result worse. The current implementations inject some random bits, b ~ N(u, s), which can be thought of as convolving the function with the distribution g of the injected noise, giving g*f. But after repeated applications, (g*f)((g*f)((g*f)(...(g*f)(x)...))), the information content of the data you started with is still reduced, because the transformation remains non-reversible: convolution cannot undo the non-reversible aspect of the original function.

I'm sure there is some calculation using entropy of random variables and channels that fully formalizes this but I don't remember the references off the top of my head. The general reference I remember is called the data processing inequality.¹

¹ https://en.wikipedia.org/wiki/Data_processing_inequality?use...
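A crude numeric illustration of the point (numpy assumed): each pass through a non-invertible map plus injected noise can only lose information about the original data, here proxied by correlation with the starting sample.

    import numpy as np

    rng = np.random.default_rng(0)
    original = rng.normal(size=100_000)

    def lossy(v):
        # non-reversible map (coarse quantization) convolved with injected noise
        return np.round(v * 2) / 2 + rng.normal(scale=0.1, size=v.size)

    x = original.copy()
    for step in range(1, 11):
        x = lossy(x)
        # how much of the original signal survives after `step` applications
        print(step, round(np.corrcoef(original, x)[0, 1], 4))

Correlation is only a stand-in for mutual information, but the downward trend mirrors what the data processing inequality guarantees.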


This seems obvious, but you're forgetting the inputs may actually have low entropy to begin with. Lossy compression is non-reversible, but usually the expectation is that we don't care about the parts we lost.

How might this cash out with recursive LLMs? Generalizing is very similar to compression: imagine recovering the Schrodinger equation from lots of noisy physical experiments. You might imagine that an LLM could output a set of somewhat general models from real data, and training it on data generated from those models generalizes further in future passes until maybe it caps out at the lowest entropy model (a theory of everything?)

It doesn't seem like it actually works that way with current models, but it isn't a foregone conclusion at the mathematical level at least.


So correct me if I’m wrong here, but wouldn’t another way to look at this be something like re-compressing a JPEG? Each time you compress an already-compressed JPEG, you strip more and more information out of it. The same goes for any lossy compression, really.

These LLMs are inherently a bit like lossy compression algorithms. They take information and pack it in a way that keeps its essence around (at least that is the plan). But like any lossy compression, you cannot reconstruct the original. Training a lossy compression scheme like an LLM on its own data is just taking that already-packed information and degrading it.

I hope I’m right framing it this way because ultimately that is partly what an LLM is, it’s a lossy compression of “the entire internet”. A lossless model that can be queried like an LLM would be massive, slow and probably impossible with today’s tech.

I suspect that we will develop new information theory that mathematically proves these things can’t escape the box they were trained in, meaning they cannot come up with new information that isn’t already represented in the relationships between the various bits of data they were constructed with. They can “only” find new ways to link together the information in their corpus of knowledge. I use “only” in quotes because simply doing that alone is pretty powerful. It’s connecting the dots in ways that haven’t been done before.

Honestly the whole LLM space is cool as shit when you really think about it. It’s both incredibly overhyped yet very under hyped at the same time.
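The JPEG analogy above is easy to reproduce (Pillow and numpy assumed): re-encode the same image over and over and measure how far it drifts from the original.

    import io
    import numpy as np
    from PIL import Image

    img = Image.open("photo.jpg").convert("RGB")        # any starting photo
    reference = np.asarray(img, dtype=np.float64)

    for generation in range(1, 21):
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=75)        # lossy re-encode
        buf.seek(0)
        img = Image.open(buf).convert("RGB")
        rmse = np.sqrt(((np.asarray(img, dtype=np.float64) - reference) ** 2).mean())
        print(generation, round(rmse, 2))               # error accumulates over the early generations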



It’s not intuitively obvious that losing information makes things worse. In fact, it’s not even true. Plenty of lossy functions make the problem under consideration better off: denoising, optimization, models that expose underlying useful structures, and on and on.

Also, injecting noise can improve many problems, like adding dither before an ADC (think noise shaping, which has tremendous uses).

So claiming things like “can only make the end result worse” is “intuitive obvious” is demonstrably wrong.


> with a non-reversible function f into f(x) then you are losing information.

A non-reversible function f does not necessarily lose information. Some non-reversible functions, like one-way functions used in cryptography, can be injective or even bijective but are computationally infeasible to invert, which makes them practically irreversible while retaining all information in a mathematical sense. However, there is a subset of non-reversible functions, such as non-injective functions, that lose information both mathematically and computationally. It’s important to distinguish these two cases to avoid conflating computational irreversibility with mathematical loss of information.


On the arguments modeling inference as simply some function f: the specific expression the OP used overlooks that each subsequent application would have followed some backpropagation, and so implies a new f' at each application, rendering the claim invalid.

At that point, at least chaos theory is at play across the population of natural language, if not some expressed, but not yet considered truth.

This invalidates the subsequent claim about the functions which are convolved as well. I think all the GPUs might have something to say about whether the bits changing the layers are random or correlated.


If a hash can transform an input of any size into a fixed-length string, then that implies irreversibility due to the pigeonhole principle. It's impossible, not infeasible.

Hashes with that property are just a special case of one-way functions.

What about something like image improvement algorithms or NeRFs? They seem to increase information even if some of it is made up.

If the goal of an image improvement algorithm is effectively "how would this image have looked IN THE REAL WORLD if it had been taken with a better camera", then training on previous "virtual upscaled images" would be training on the wrong fitness function.

It isn't real information though. This is effectively a game of Chinese whispers.

The only way AI can create information is by doing something in the real world.


It is real information, it is just information that is not targeted at anything in particular. Random passwords are, well, random. That they are random and information is what makes them useful as passwords.

As said by others, there is nothing terribly insightful about making something estimate the output of another via a non-perfect reproduction mechanism and noticing that the output is different. Absent any particular guidance, the difference will not be targeted. That is tautologically obvious.

The difference is still information though, and with guidance you can target the difference to perform some goal. This is essentially what OpenAI's o1 training was doing: training on data generated by itself, but only when the generated data produced the correct answer.
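A hedged sketch of that "train on your own outputs, but only the verified ones" loop (rejection-sampling or STaR-style self-training); generate(), is_correct(), and fine_tune() are hypothetical stand-ins, not any particular vendor's API:

    def self_training_round(model, problems):
        accepted = []
        for problem in problems:
            for attempt in generate(model, problem, n_samples=8):   # hypothetical sampler
                if is_correct(problem, attempt):                    # the external check supplies the guidance
                    accepted.append((problem, attempt))
                    break
        # only self-generated data that passed the check feeds the next model
        return fine_tune(model, accepted)                           # hypothetical trainer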


> The only way AI can create information is by doing something in the real world.

Everything done is done in the real world, but the only way an AI can gather (not create) information about some particular thing is to interact with that thing. Without interacting with anything external to itself, all information it can gather is the information already gathered to create it.


Is there a formalization of this idea? Would love to read more.

That's a better way of putting it, yes.

Maybe information needs to be understood relationally as in "information for a subject x". So if we have an image with a license plate that is unreadable and there's an algorithm that makes it readable to x, there is an information gain for x, although the information might have been in the image all along.

If the license plate was not readable, then the additional information is false data. You do not know more about the image than you knew before by definition. Replacing pixels with plausible data does not mean a gain of information. If anything, I'd argue that a loss of information occurs: The fact that x was hardly readable/unreadable before is lost, and any decision later on can not factor this in as "x" is now clearly defined and not fuzzy anymore.

Would you accept a system that "enhances" images to find the license plate numbers of cars and fine their owners? If the plate number is unreadable, the only acceptable option is to not use it. Inserting a plausible number and rolling with it even means that instead of a range of suspects, only one culprit can be supposed. Would you like to find yourself in court for crimes/offenses you never committed because some black box decided it was a great idea to pretend it knew it was you?

Edit: I think I misunderstood the premise. Nonetheless my comment shall stay.


For an example of this, see "Xerox scanners and photocopiers randomly alter numbers in scanned documents"

https://news.ycombinator.com/item?id=6156238


Eliminating the noise makes the useful information clearer, but the information describing the noise is lost.

Sure, but what if the upscaling algorithm misinterpreted a P as an F? Without manual supervision/tagging, there's an inherent risk that this information will have an adverse effect on future models.

It’s information taken from many other photos and embedded into a single one of interest, no?

“Made up” information is noise, not signal. (OTOH, generated images are used productively all the time in training, but the information content added is not in the images themselves but in their selection and relation to captions.)

Image improvement algorithms are basically injecting statistical information (collected from other images) into one image.

The above statement applies for non-neural-network algorithms as well.


Do they gain information, or just have lower loss?

Too much information encoded in a model can lower performance (called overfitting)

That’s why many NN topologies include dropout layers.


Once more and more new training images are based on those upscaled images, the training of those upscaling algorithms will tend to generate even more of the same type of information, drowning out the other information.

That's assuming that the same function is applied in the same way at each iteration.

Think about this: The sum total of the human-generated knowledge was derived in a similar manner, with each generation learning from the one before and expanding the pool of knowledge incrementally.

Simply adding a bit of noise and then selecting good outputs after each iteration based on a high-level heuristic such as "utility" and "self consistency" may be sufficient to reproduce the growth of human knowledge in a purely mathematical AI system.

Something that hasn't been tried yet because it's too expensive (for now) is to let a bunch of different AI models act as agents updating a central wikipedia-style database.

These could start off with "simply" reading every single textbook and primary source on Earth, updating and correcting the Wikipedia in every language. Then cross-translate from every source in some language to every other language.

Then use the collected facts to find errors in the primary sources, then re-check the Wikipedia based on this.

Train a new generation of AIs on the updated content and mutate them slightly to obtain some variations.

Iterate again.

Etc...

This could go on for quite a while before it would run out of steam. Longer than anybody has budget for, at least for now!


> The sum total of the human-generated knowledge was derived in a similar manner, with each generation learning from the one before and expanding the pool of knowledge incrementally.

Is human knowledge really derived in a similar manner though? That reduction of biological processes to compression algorithms seems like a huge oversimplification.

It's almost like saying that all of of human knowledge derives from Einstein's Field Equations, the Standard Model Lagrangian, and the Second Law of Thermodynamics (what else could human knowledge really derive from?) and all we have to do to create artificial intelligence is just to model these forces to a high enough fidelity and with enough computation.


It's not just any compression algorithm, though, it's a specific sort of algorithm that does not have the purpose of compression, even if compression is necessary for achieving its purpose. It could not be replaced by most other compression algorithms.

Having said that, I think this picture is missing something: when we teach each new generation what we know, part of that process involves recapitulating the steps by which we got to where we are. It is a highly selective (compressed?) history, however, focusing on the things that made a difference and putting aside most of the false starts, dead ends and mistaken notions (except when the topic is history, of course, and often even then.)

I do not know if this view has any significance for AI.


Human knowledge also tends to be tied to an objective, mostly constant reality.

The AIs could also learn form and interact with reality, same as humans.

Not really.

The models we use nowadays operate on discrete tokens. To overly reduce the process of human learning: we take in a constant stream of real-time information. It never ends and it’s never discrete. Nor do we learn in an isolated “learn” stage in which we’re not interacting with our environment.

If you try taking reality and breaking into discrete (ordered in the case of LLMs) parts, you lose information.


> Think about this: The sum total of the human-generated knowledge was derived in a similar manner, with each generation learning from the one before and expanding the pool of knowledge incrementally.

Not true. No amount of such iteration gets you from buffalo cave paintings to particle accelerators.

Humans generate knowledge by acting in the world, not by dwelling on our thoughts. The empiricists won a very long time ago.


It’s not binary. Humans generate plenty of knowledge from pure abstract thought.

Do they?

When I pursued creative writing in my teens and early 20s, it became clear to me that originality is extremely difficult. I am not entirely sure I have ever had an original thought--every idea I've put to paper thinking it was original, I later realized was a recombination of ideas I had come across somewhere else. The only exceptions I've found were places where I had a fairly unusual experience which I was able to interpret and relate, i.e. a unique interaction with the world.

Perhaps more importantly, LLMs do not contain any mechanism which even attempts to perform pure abstract thought, so even if we accept the questionable assumption that humans can generate ideas ex nihilo, that doesn't mean that LLMs can.


Unless your argument is that all creative writing is inspired by God, or some similar "external" source, then clearly a closed system such as "humanity" alone is capable of generating new creative works.

Did you even read the post you're responding to?

You’re right, we obtained the knowledge externally. It was aliens! I knew it!

Externally, yes, we obtain knowledge from the world around us. We’re not brains in vats conjuring knowledge from the void of our isolated minds.

If you repeatedly apply one of three simple functions picked at random, you might end up with a Sierpinski triangle.

This sounds fascinating! I know what a Sierpiński triangle is, but I'm having some trouble seeing the connection from picking functions randomly to the triangle. Is there some graphic or animation somewhere on the web that someone can point me to, to visualize this better?

You can read the Chaos Game section here:

https://en.m.wikipedia.org/wiki/Sierpi%C5%84ski_triangle

It basically uses the fact that the fractal is self-similar. Picking one function (which scales the whole triangle into one of its thirds) and transforming a single point on the fractal into a new point also gets you a point on the fractal.

If you repeat this process many times you get a lot of points of the fractal.

You can even start the process at any point and it will "get attracted" to the fractal.

That's why fractals are called strange attractors.
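A short sketch of the chaos game described above (matplotlib optional): start anywhere, keep jumping halfway toward a randomly chosen corner, and the visited points trace out the Sierpinski triangle.

    import random

    corners = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.866)]
    x, y = random.random(), random.random()          # arbitrary starting point
    points = []

    for _ in range(100_000):
        cx, cy = random.choice(corners)              # pick one of the three maps at random
        x, y = (x + cx) / 2, (y + cy) / 2            # move halfway toward that corner
        points.append((x, y))

    # to visualize:
    # import matplotlib.pyplot as plt
    # xs, ys = zip(*points[100:])                    # drop the first few "burn-in" points
    # plt.scatter(xs, ys, s=0.1)
    # plt.show()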



Good one but these theorems are useful to have when thinking about information processing systems and whatever promises the hype artists are making about the latest and greatest iteration of neural networks. There is no way to cheat entropy and basic physics so if it sounds too good to be true then it probably is too good to be true.

If it is entropy and basic physics why are humans immune to the effect?

Humans are not immune to the effect. We invented methodologies to mitigate the effect.

Think about science. I mean hard science, like physics. You can not say a theory is proven[0] if it is purely derived from existing data. You can only say it when you release your theory and successfully predict the results of future experiments.

In other words, you need to do new experiments, gather new information, and effectively "inject" entropy into humanity's scientific consensus.

[0]: Of course when we say some physical theory is proven, it just means the probability that it's violated in certain conditions is negligible, not that it's a universal truth.


This argument seems more like the data generated was bad. There are examples where AI has surpassed humans by using simulated data (AlphaZero, which played against itself to become the best at Go).

It also seems to happen most on small networks. Which makes sense.

Additionally, humans create simulated stories like Dune, Lord of the Rings, or Harry Potter, which introduce fictional concepts, yet these stories still result in trainable data.


Thank you for making this comment, because it exposes some logical gaps.

Firstly, Go, Chess, and other games have objective rules and win criteria. (There is no “subjective opinion” as to whether Fischer or Spassky won their match.)

Language, the output of LLMs, does not have an objective function. Consider the following two sentences:

“My lips, two blushing pilgrims, ready stand.”

“My red lips are ready to kiss your red lips.”

Both are grammatically correct English sentences and both mean basically the same thing, but clearly the former, by Shakespeare, has a subjective poetic quality which the latter lacks. Even if we make evaluation rules to target this (for example, “do not repeat phrases”, “use descriptive adjectives”, etc.), AI still seems to favor certain words (for example, “delve”) that are valid but not commonly used in human-originated English. There is then a positive feedback loop where these preferences are used to further train models, hence the next generation of models have no way of knowing whether the now-frequent usage of “delve” is a human-originated or AI-originated phenomenon.

Lastly, regarding works of fiction, the concern is less about the content of stories—though that is also a concern—but more about the quality of language. (Consider above alternate take on Romeo and Juliet, for example.)


So you are arguing that the world does not have objective rule criteria, like Physics?

And that an AI could not model the world and then run simulations and have each simulation generate data and learn from that, similar to AlphaZero.

Here is a possible objective-win environment:

Model complex multicellular organisms that self-replicate and become capable of passing the Turing test.


My argument here is narrowly scoped to human language and literature. (We already know the objective rule criteria of life is 42.)

It may very well be possible for an AI to read all of literature, and figure out what makes Hemingway, Tolstoy, and Dylan "good writing" vs. "bad writing". That has not yet been achieved. The problem, as the OP implies, is that by polluting the universe of literature with current-gen AI output, we may be making the task of generating "good writing" in the future harder.

Then again, maybe not. Perhaps we have enough pre-AI works that we can train on them versus the mountains of AI generating schlock, and determine the objective function.


You seem to restate the argument of the submitted research paper in a narrow interpretation that is much different than its main conclusion of Model Collapse from synthetic data creation. But I will follow you down this interpretation.

Why does a human have to judge something as good for it to be functionally useful?

Humans never came up with the move that AlphaZero used to win one of the four (out of five) games it won. Was that a bad move? Did AlphaZero devolve into Model Collapse because it made that move?


And what’s interesting here is people will get annoyed with this.

Am I pro-human? Absolutely

Can non-human outputs be functionally valuable to humans and not per se harmful? Absolutely: we live in a biosphere that humans didn’t create, or if humans did create it, they were humans more evolved than the humans we are aware of.


> humans create simulated stories like Dune, Lord of the Rings, or Harry Potter

People really anthropomorphize LLMs to the point of coming full circle, don't they?


So you are saying that if I generate stories on different worlds as new data, a model cannot learn from that?

This isn't anthropomorphizing - it's generating data. Generating data is not a uniquely human endeavor.

What created Mars? That is data.

What created star systems we cannot see?


> Additionally, humans create simulated stories like Dune, Lord of the Rings, or Harry Potter, which introduce fictional concepts, yet these stories still result in trainable data.

No, they don't, not in any sense where they are "simulated data". Dune is simulated data about what life would be like on Arrakis, and if you train a model to make predictions about that question, your model will be worthless trash. (Doesn't matter whether you train it on Dune or not.) Dune is real data about how English is used.


It’s also data around science fiction. With broad-spectrum data from both Dune and contextual data, most LLMs know that Dune is a fictional novel.

This is not a serious argument; please forgive me for saying so.


