AI has poisoned its own well

sogen · on June 19, 2023

An aspect never mentioned is that this data only fed from active users, which are the minority of people.

Majority of online users are lurkers, so all of these models are extremely biased on whom they got their information from.

asdff · on June 19, 2023

The demography and expertise of your typical lurker probably vary over time as well. E.g. someone writing in a physics newsgroup in the usenet era I would guess is more likely to have relevant expertise than one writing in a physics subreddit today, now that the internet has grown beyond a population of predominantly academics and techies.

sogen · on June 19, 2023

Yep, exactly. For example Saurik only pops up here on HN from time to time.

Same thing could be said of people that aren't online too much or barely never on tech forums, like John Carmack.

HellDunkel · on June 20, 2023

Carmack is on twitter.

thghtihadanacct · on June 19, 2023

Interesting you say 'subreddit' ... like thats a thing beyond the social platform ... while responding in an HN comment.

kuhewa · on June 19, 2023

How dare someone mention a common publicly accessible discussion board.. on a different publicly available discussion board that isn't focused on the same topic

zogrodea · on June 21, 2023

It is kind of odd wording though to speak of "a physics subreddit" since it seems like it would make sense for there to be only one ("the physics subreddit" instead of "a physics subreddit"). That's not how things are though with /r/math, /r/maths and /r/mathematics existing all at the same time despite focusing on the same topic.

kuhewa · on June 21, 2023

Do you really think so? I imagine if you aren't familiar with Reddit from context a subreddit just means a topical discussion board, and if you are familiar with Reddit your default assumption would be there isn't only one subreddit for topics unless they are really niche like a mid sized city or sports team.

I suppose if you are only vaguely familiar with Reddit you might it assume it's like a vBulletin setup where only a handful of subforums are set up by admins rather than users.

thghtihadanacct · on June 19, 2023

Either way, thats like saying an oil worker from the 1950's is different than an oil worker now. Yeah, thanks for the obvious

hnaccount141 · on June 20, 2023

More like saying that an oil worker is different from some guy quoting There Will Be Blood. The problem is that the difference in credibility between the two isn't easily accounted for once their posts are vacuumed up into a giant dataset.

YetAnotherNick · on June 20, 2023

Yes, that's why Google will likely have immense advantage in this. I think content present in things like Google docs or Gmail is likely many times of public internet. And they have data for typing speed, human like randomness detection, backspace etc. which can very reliably detect human generated content.

tlb · on June 21, 2023

Human social media users get a similarly biased sample of information.

The thing is, there's no objectively unbiased sample of information. You could weight towards tweets or books, NYT or the Economist, arXiv or journals. The books could be bestsellers or classics. There's no right answer, just preferences.

raxxorraxor · on June 19, 2023

If the generated content is vetted by an adversarial network, we theoretically have a recursively improving AI. Perhaps by introducing more randomness we could even reach some form of evolution that can introduce new concepts, since limited scopes is still what gives AIs away in the end. That is the optimistic perspective at least.

On the net, search and content quality is already pretty low for certain key words. If a word is part of the news cycle, expect hundreds of badly researched news paper articles, some of which might be generated as well. Or if they weren't, you wouldn't notice a difference if they become that.

But I don't believe the companies made a mistake. They could even protect their position with the data they already acquired and classified. Maybe a quality label would say genuine human®.

If all that fails large companies would also be able to employ thousands of low wage workers to classify new content. The increasing memory problem persists, I think that is a race where the model that can extract data as efficiently as possible will win. But without the data sets, there is no way to verify performance.

deeviant · on June 19, 2023

This is exactly what I thought when I read the article. Specifically, that it's exactly opposite-of-correct; this is exactly the thing that will push LLMs to the next level.

I think the simple idea using chatGPT to filter it's input set with result in drastically higher quality dataset. Add in the ability for chatgpt to use plugins and such like wolfram alpha to fact check it's answers then chatgpt can actually start generating input data from know good sources in areas it knows it's having quality issues.

I mean, it literally looks like the beginning of self-learning AI.

sesteel · on June 19, 2023

Yep! The singularity concept people talk about relies on a feedback loop like you described. So, she isn't wrong, per se, but just telling a fragment of the story that exist today.

Arnt · on June 19, 2023

The article seems to build on several foundations, but one is critical: That the AIs need enormous amount of data, and that that data has to be from a pool that people can now poison with AI-generated data. If that doesn't hold, nothing in the article holds.

And it doesn't seem clear to me. It may be true, but far from obvious.

For example, DALL-E, which was better than its predecessors. Was it better because of mode input data, that is, was more input data a necessary condition for its improvement? Or even the biggest reason? Reading the OpenAI blog makes it sound as if they had new kinds of AI models, new ideas about models, and that the use of data from a web crawler was little more than a cost optimisation. If that's true, then then it should be possible to build another generation of image AI by combining more new insights with, say, the picture archives of Reuters and other companies with archives of known provenance.

Maybe I'm an elitist snob, but the idea that you can generate amazing pictures using the Reuters archive sounds more plausible than that you could do the same using a picture archive from all the world's SEOspam pages. SEOspam just doesn't look intelligent or amazing.

spacebanana7 · on June 19, 2023

> That the AIs need enormous amount of data

I was surprised to learn that it didn’t take an enormous amount of data to train Llama.

Meta used 1.4 trillion tokens training Llama, but that’s only about a million times the Harry Potter collection [1]. Given that the kindle store has 12 million books [2], it’s credible to get 1.4T tokens just from that. Twitter has 200 billion new tweets per year, so it’d only take 7 tokens per Tweet for that to produce a Llama sized dataset every year.

[1] https://en.wikipedia.org/wiki/LLaMA [2] https://blog.fostergrant.co.uk/2017/08/03/word-counts-popula... (1 million words) [3] https://www.omnicoreagency.com/twitter-statistics

Joker_vD · on June 19, 2023

> too many people have pumped the internet full of mediocre generated content with no indication of provenance

I don't know about the text models, but e.g. Stable Diffusion (and most of its derived checkpoints) has very recognizable looks.

By the way, does anyone know if such generative models could be used as classifiers, answering the "what's the probability that this input was generated by this model" question? That'd help solving the "obtaining quality training data" problem: use the data that has low probability of being generated by any of the most popular models. It's not like people started to produce less hand-made content anyhow!

waveBidder · on June 19, 2023

the previous generation of image generation, GANs (generative adversarial networks) was actually done by pairing a discriminator trying to tell if the image was generated or not, and a generator.

mark_l_watson · on June 20, 2023

I am not yet convinced that generated data ‘poisons the well’ if there is some aspect of adversarial training.

About 7 years ago when I managed a deep learning team at Capital One, I did a simple experiment of training a GAN to generate synthetic spreadsheet data. The generated data maintained feature statistics and correlations between features. Classification models trained on synthetic data had high accuracy when tested on real data. A few people who worked for me took this idea and built an awesome system out of it.

Since the poisoned well is a known thing now, it seems like a solvable problem/

fnordpiglet · on June 19, 2023

Humans have generated much more content to date than these models are being trained on. Facebook has enormous amounts of human to human interactions at its disposal, and presumably will continue collecting more and more. Likewise there will exist forums where humans write to other humans, like this one, regardless of the pervasiveness of spam on the internet. Finally, most LLM are trained off curated data sets that are not all encompassing of all written text. The process of curating the dataset is necessarily constraining and that means the data admitted so far must be much smaller than the data possible to admit. These analyses also assume we’ve reached a fixed point in the algorithmic ability of these models to converge.

I think the truth is we’ve written all that ever needs to be written, and even if they universe becomes populated by on AI LLM chat bots communicating, they will be fine to feast off of what we’ve left them as a legacy.

JimtheCoder · on June 19, 2023

"I think the truth is we’ve written all that ever needs to be written"

I highly doubt this statement.

How useful would ChatGPT 4 be if it was trained only on data up to 2013? (Assuming the total amount of data it was trained on was the same) Would it be like talking to a human who was sitting in their basement for the past decade? I am not sure how useful that would be to me.

randcraw · on June 19, 2023

And if the data that LLMs have trained on is only simple social conversation, what level of intelligence will that begat? No nobel-laureate-level 'thinking' is going to arise from training text composed of obvious or trivial statements -- which surely compose 99% of ChatGPT's training material.

These days, it's a rare piece of text that surprises the reader with creative or ingenious or outside-the-box thinking. If we want higher level cognition from future LLMs, where will such deep thought come from? Surely not the training data used now: email, tweets, reddit, mass media, etc. GIGO indeed.

Much has been written, yes. But not much of that is worth reading.

xp84 · on June 19, 2023

It does seem like limiting new training material to only good quality information would be prudent rather than slurping up entire GPT-produced spam sites. Sure, LLMs will have been used to refine and assist much of it from here on out, but I'd like to think theses, newspapers, etc won't just be produced entirely by unsupervised robots and not even fact-checked. If that's the case we'll have bigger problems.

JohnFen · on June 20, 2023

> These days, it's a rare piece of text that surprises the reader with creative or ingenious or outside-the-box thinking.

I don't think it's any more rare now than it ever was. It's always been rare.

> Much has been written, yes. But not much of that is worth reading.

Sturgeon's Law is immutable and timeless.

https://en.wikipedia.org/wiki/Sturgeon's_law

fnordpiglet · on June 19, 2023

Well, I’d say that given LLM have no agency, if a Nobel laureate uses LLMs to produce their text with their direct guidance of the concepts to convey and details. I think LLM aren’t just useful as oracles but as calculators for writing. To that end I don’t find math done using matlab is any less useful than artisanal hand made math. Likewise text constrained and informed by a human mind with novel information but presented by a LLM- is it inferior?

fnordpiglet · on June 19, 2023

That’s fair - current events aren’t covered. However, my point wasn’t about salience of facts but language ability.

throwuwu · on June 19, 2023

Once it can read, has a broad vocabulary and can reason enough to synthesize information from what is given then you don’t need to train it any further. We’re at that point now. Everything going forward is just engineering, even just finding ways to increase the context length will allow these models to work with any data available. OpenAI is very publicly working on that and so is Anthropic. You can also apply some finesse and combine the model with external tools like search and databases or custom built APIs, practically everyone and their dog is experimenting with this approach. So even if no better models are made, which seems unlikely to be the case, we’ll be utilizing the current generation in all kinds of ways from here on out.

mvdtnz · on June 19, 2023

> Once it can read, has a broad vocabulary and can reason enough to synthesize information from what is given then you don’t need to train it any further

And what is it going to read? How will it distinguish anything it reads today from AI generated content? Don't you see you've just set up the exact same circumstance that the article talks about?

throwuwu · on June 20, 2023

Do you know the difference between training and inference? If not, this conversation isn’t going anywhere.

mvdtnz · on June 20, 2023

nerdbert · on June 19, 2023

> and can reason enough to synthesize information from what is given

That's a qualitative leap from what it does now, and there's no clear basis for believing the current trajectory of these systems can make that leap.

throwuwu · on June 20, 2023

How? It can use what’s in its context window. I can copy paste documentation or code it’s never seen before and ask it to operate on it and get useful answers. This is what practically all the new AI tools are built on.

nerdbert · on June 20, 2023

It's not reasoning, it's predicting likely text, which it can do because the bits of input you provide overlap with enough bits in its training corpus, because most programming tasks are repetitions of similar patterns. The outcome approximates reasoning in some cases but the process is completely different. It is not building up from principles, it is sifting down from examples.

JimtheCoder · on June 19, 2023

"...and can reason enough..."

Are we actually at this point now? I am less confident than you are...

throwuwu · on June 20, 2023

I use it every day to help solve problems, that’s enough for me to say it has some level of reasoning ability. Not claiming its perfect but it’s better than some people I know.

cainxinth · on June 20, 2023

I recently contributed to the LLM “low background steel” problem. I have a blog where I share Wikipedia articles. I wanted to do one on a wiki article for a topic related to AI, but the article I chose was thin on details. So, instead I asked GPT-4 for the info. I edited the output, found scholarly citations for just about every sentence with Google, and then added it to the wiki page.

I’m a copywriter (I know, my days are numbered), so I didn’t just plop the generated text in unchanged, but it’s still primarily GPT’s content. I did a nice job, imho, and improved the article greatly. It’s been up for several days now, so I think it has a good chance of staying long term.

Still, it’s a funny situation. On the one hand, I did something I’ve done many times before: researched a topic and added to a wiki article. I always feel gratified when I contribute to Wikipedia. I’m adding to the sum of human knowledge in an incredibly direct way.

But, the information I added this time will be used to train future LLMs and thus “poison the well” with generated content. The verdict is still out on just how bad generated content will be for training new models. But I definitely feel slightly conflicted about whether I did something that is a net positive or negative.

thaw13579 · on June 19, 2023

I don't think it's so cut and dry. The article paints a somewhat simplistic picture of how the best performing LLMs work. The unsupervised pre-trained networks are indeed data hungry, but the secondary supervised learning stages actually can get by with a far smaller set of highly curated prompt-response data, e.g. LIMA (https://arxiv.org/abs/2305.11206).

Another factor is that generative data distributed online may be quite high quality (because people find it interesting enough to share), so it's plausible this could actually improve model performance. Some LLMs have been trained with data from other models with good results, e.g. supposedly Bard with GPT-4 prompt-response pairs and GPT-4 with Whisper transcripts of YouTube (https://twitter.com/amir/status/1641219919202361344/photo/1). Of course, there could be trolling or misinformation that "poisons" the data, and that is a problem (whether synthetic or organic)!

achrono · on June 20, 2023

Another problem with this take is that it's giving humans too much credit here -- there's a vast continuum from pure-human-creation to pure-AI-creation.

Human beings are given to (1) repeating tropes (2) arguing incoherently (3) missing the point, etc.

Think of all the books and movies from before 2023. Isn't there a lot that is formulaically wrong/misleading/suboptimal in there?

So this might not really be "poisoning the well" -- a more interesting area to look at would be, how can we make GPT-n aware of its own gaps and use this knowledge rather than just let the user know that it knows its gaps?

helen___keller · on June 21, 2023

Doubtful regarding the effort people will take to protect their content from harvesting.

Creative content has been continuously devalued for decades now. The whole reason you can find so much music, creative writing, and art, available for free online is because this content is essentially worthless until you can build a brand or clientele to monetize it, and the only way to do that is to broadcast it for free to as many people as possible.

blibble · on June 19, 2023

I certainly replaced my highly starred projects on GitHub with randomly generated crap (build passes!) when they announced copilot

Borrible · on June 20, 2023

Ah, the Ourobous Language Problem.

more_corn · on June 19, 2023

This is a common but misunderstood concern. It’s one of those worries that seems sound in theory, but practice doesn’t bear it out. Remember when people were up in arms about SSD wear cycles? Yeah that’s not actually the way they fail. There are real problems with AI. This is not one of them.