The data that powers AI is disappearing fast (nytimes.com)
136 points by sgammon 48 days ago | 227 comments




I think the worst possible outcome is a licensing regime that means that Disney or Paramount or Elsevier or whoever all get to have a monopoly on training large models within their niche. My guess is that any successful calls for regulation will have this outcome, which means that individuals won't be able to legally use AI-based tools except when creating works for hire, etc.

Currently, I think most of the training use cases can be covered by the existing "you can't copyright a fact" carve out in the law. That's probably better for society and creators than my licensing regime scenario.

Anyway, I'm rooting for "no regulation" for now. The whole industry is still being screwed over by market distortions created by the DMCA, and this could easily be 10x worse.


> I think the worst possible outcome is a licensing regime that means that Disney or Paramount or Elsevier or whoever all get to have a monopoly on training large models within their niche.

Why is this the worst possible outcome? Companies using AI would be training with properties they own or have licensed appropriately, rather than the existing scheme of ignoring copyright law to extract $$$ from the creative works of ordinary people.


One nitpick with this: “Ordinary” people mostly don’t own any IP that earns them any income.

Most who earn their income from IP are society’s elites. Perhaps the lowest-paid least elite IP profession I can think of is journalists, and I’m pretty sure they are doing works for hire and not owning the copyright.

Arguably, when it comes to the IP questions, ordinary people are far more likely to benefit from AI (for instance getting it to make art they couldn't express because of a lack of talent) than they are to be robbed of their IP in a way that matters.

Obviously the question of AI harming ordinary people in terms of automating mundane jobs is a very real one, but interestingly it’s totally irrelevant to the IP issues.


Ordinary people would also benefit from free cars and free lunches, yet we are not advocating for those, are we? Why should creators bear the brunt of these politics, then?


...because they're the creators. They could have created anything, and they created this.


I am an author. While 99% of "ordinary people" are not authors, 99% of authors are "ordinary people".

Under your moral logic, everyone should just be free to pirate/steal any creative work; after all, why shouldn't the untalented (or untrained, or undedicated) have equal rights to the works?

All this leads to a case where there is no longer an incentive to create and popularize creative work, and suddenly all that is available are AI rehashes of AI summaries. I, for one, don't look forward to such a marketplace of non-ideas.


LLMs don't steal/pirate works; "copying" is the word you're looking for, and it is much faster, cheaper, and more precise to copy a work than to use an LLM. What LLMs do is combine user input with patterns learned from data, and almost never a full work: they rarely generate more than 1,000 words at once, while books are 100x larger.


Actually, "launder" is the word I'd use for LLMs.

Clean things up just enough that it's difficult to prove where their semantic content came from.

How useful this is or isn't, and how much of a threat to rightful IP owners, depends very much on the type of IP.


If "laundering" IP is made illegal, then we're all in for a huge surprise. Almost everything we say and do has been said and done before. And we're rarely the originators of our main ideas, we "launder" 99.99% of what we know, even subconsciously. Any one human could be suspected of secretly using AI today.


Totally agree, the difference is speed and scale.

Copyright laws didn't need to be invented until the printing press came along, because the act of copying was slow and difficult.

Not a fan of the patent model for software, for example, but perhaps this is an argument for it. Or, we just get used to the fact that idea-reuse is cost-free, accept a couple of decades of uncomfortable economic dislocation, and get on with it.


I don't think "copying" applies to what an LLM does either!


"Good artists copy. Great artists steal." - Picasso


I don't think I should be able to steal or pirate your work.

I do think I should be allowed to read your book & do math with it.


>I do think I should be allowed to read your book & do math with it.

This is such a painfully obtuse construction of what is occurring that it escapes any reasonable discussion.


Up to you. You can stop people from reading your book but you can't stop people from doing math.


You are being obtuse, copying is "just math" and is already illegal. "Just math" is not a legal argument, this is childish.


> there is no longer an incentive to create and popularize creative work

It's reasonable to be worried about this scenario. If there was no incentive to produce creative work, our society would be much worse.

But the notion that there's only one way to prevent this scenario, and it requires a drastic expansion of the already sweeping "intellectual property" regime...well it just lacks creativity.

It's not that we want to eliminate the incentive to be creative, it's that we believe there are better ways to prevent that scenario than to further entrench a broken system.


You give me your creative work for free. What I do with text on my computer is my business, not yours. Don't like it? Stop publishing your creative work online. There is nothing being stolen.


This is strange logic that ignores the idea of copyright. Just because I allow you to view my work for free does not mean I relinquish copyright protection. If I write a song that I perform for free, it doesn’t give you license to record and sell that song, for example.


> This is strange logic that ignores the idea of copyright

What is strange about ignoring the idea of copyright? If you write a song and I can play it on my computer, I use it the way I want on my computer. If you don't want that, don't make your work public. Copyright is a human invention; nobody is forced to respect it.

The code I develop can be accessed for free on GitHub and even in the browser via "view source". I won't fight for other people's right to force others to pay for using their creations while they don't pay for mine and all the other open source and open science creations.


>nobody is forced to respect it

I was afraid the discussion would go this route.

Yes, copyright is a convention. It's subject to change. However, IP protections are written into the US Constitution, and the bar to change them is relatively high.

Murder is also a crime by convention, there’s no natural law against it. But we generally recognize that to live in a stable society, we must live by certain conventions.

You can play a song on your computer because that is considered appropriate use. Selling tickets to play it, or copying it and selling it, is not, because those uses conceivably limit the author's ability to make money from their creation.

You may not realize it, but software is also covered by copyright. For most intents and purposes, it's considered the world's worst book; you cannot legally copy and sell it if the license doesn't allow it.

Edit: added the word "legally" to be more precise


> you cannot copy and sell it if the license doesn’t allow it.

Well, good luck trying to force me to not sell your work in my country.


AI changes the rules here, because AI is able to automate extracting structure/meaning, and launder it of its origins.

In classical copyright, fair use allows certain.. fair uses. Going beyond that is considered stealing. You can chop things up into small parts, and there are rules governing how you can put the pieces back together and claim them wholly or partly as your own work.

LLMs behave more like a solvent. Everything goes in the pot, gets melted down, and by the time it's recast you can't say for sure where anything came from or who it once belonged to. Even if sometimes you might get a strong whiff.


Not according to the NYTimes' attorneys.


I think the counterargument is that copyright does provide a pathway for "ordinary" people to monetize IP. Copyright exists as soon as a creative work is made; it's an entirely different IP tool than, say, a patent, which has a more arduous process.


Yes but most LLM usage is for a single reader - the user who prompted the model. The output is rarely published. Maybe there should be a different rule for using LLM text to author public books and articles.


AFAIK copyright laws do not make such a distinction. The most common exemption is fair use, but it requires the addition of some new creative aspect. Whether or not LLMs are creative or derivative may be an area that the courts are forced to define. Given the recent rulings that put this determination squarely on the court, I’m not sure the courts have the expertise to do so.


> but it requires the addition of some new creative aspect

That is one thing I can't get. Why is everyone not seeing the elephant in the room? The user and the user prompt add new intention and purpose to the material being referenced. Not only that, but an LLM response is usually shorter than a full-length article or book. On top of that, LLMs retrieve from multiple sources when they use search, and draw on even more sources when they generate answers in closed-book mode. They are clearly doing synthesis.

LLMs are not even good tools for infringement. You have to work really hard to put an LLM into an infringing mode, it requires snippets from the target content, and it only works maybe 1% of the time. If you have the snippets, you probably have the whole thing already. Copying is so much easier and faster; an LLM would only approximately reproduce the desired content. What an LLM gets from a source text in closed-book mode is on the same level as an image thumbnail relative to the full-resolution image.

What LLMs do well is to recombine ideas in new ways, as demanded by the prompt. Should that be infringing? If the answer is yes, then humans should also be barred from recombining ideas. After all, any human could secretly be using LLMs. But that would just tank creativity for fear of litigation. Isn't it strange how copyright promotes creativity by restricting it?

Even the name "copyright" refers to copying, as it was initially envisioned. Now authors want to expand it to idea ownership. I think it's a power grab.


>Should that be infringing?

That’s the big question I was alluding to. Does the prompt provide sufficient novelty to avoid claims of copyright? Or is it too derivative? I don’t have a strong opinion but can see both sides and suspect eventually it will be decided by the courts.

The distinction you may be missing is that copyright refers to licensed reproduction. IP is, after all, about property/ownership rights.


Not everyone agrees with that view, of course.

https://www.gnu.org/philosophy/not-ipr.html


Of course. IP is a shorthand for these types of forums/discussion, although it is also used in court.


> “Ordinary” people mostly don’t own any IP that earns them any income.

I'm not sure why this is a relevant point. Is IP only important if it's earning money?


Yes. Earning money is the only argument anyone's been able to make for why we should respect IP as a society, and even that argument is a bit of a stretch.


While I admit it can be controversial to some, certainly in many countries an argument is made for "the moral rights of the author."


I think the original argument for copyright isn't directly based on money and isn't a stretch. The argument is that it encourages people to distribute their own works by reducing the risks of doing so, which is a thing that benefits us all.

I don't think that copyright restrictions really apply to the use of data to train LLMs, and that fact is why I, and a number of others, have removed our works from the public web entirely. There is no other real way to protect ourselves from people using the works in a way we object to.


What non-monetary risks of distribution does copyright hedge against? I can see risks of distribution like potential liability risks, but copyright doesn't seem to protect against this. I can understand how things like the GPL leverage the copyright systems to enact non-monetary restrictions, but in a world without the concept of copyright would something like that even be necessary?

As you say, removing your content from the public sphere is the only way to protect yourself from people using it in ways you object to. This would seem to be the case regardless of the state of copyright protection since people gonna do what people gonna do.


> What non-monetary risks of distribution does copyright hedge against?

Copyright allows the copyright holder to decide what uses their work can be put to. The non-monetary risks are that the work will be used for a purpose the creator strongly objects to.

> This would seem to be the case regardless of the state of copyright protection since people gonna do what people gonna do.

Not true. Copyright provides for the possibility of legal repercussions when "people gonna do what people gonna do". Even for ordinary people.


> Copyright allows the copyright holder to decide what uses their work can be put to.

No. It allows the copyright holder to monopolize the copying and public performance of the work. That’s it.

It’s not a copyright violation for me to take your copyrighted painting and draw on it with crayons or hang it on a wall labeled “World’s Worst Painters.” As long as the copy was sanctioned by you I can put it to any use I want as far as copyright law cares. (One popular use now is having a computer compute facts about its word patterns)


> It allows the copyright holder to monopolize the copying and public performance of the work. That’s it.

Yes, we aren't disagreeing here.

> It’s not a copyright violation for me to take your copyrighted painting and draw on it with crayons or hang it on a wall labeled “World’s Worst Painters.

Exactly correct. I wasn't talking about purely private use. Copyright is about distribution of the results. If you do something with a work and never distribute the results, copyright doesn't enter into it (ignoring the complications of the absurd anti-circumvention clause of the DMCA, anyway).


In reality it would be a situation where a few big players control all the useful datasets and trained models, and everybody has to pay in perpetuity (SaaS-style) to use them, rather than being able to train their own. Copyright law being ignored will not be "solved" by pushing all the benefits to a few large players; you will just create new problems down the road.


Because it would simply perpetuate the current copyright system, which primarily rewards the large distribution companies disproportionately to the creators, especially smaller ones, while doing little to alleviate the effects on the job market for those creatives. In fact, it would enhance that effect by feeding even more of the value to existing license holders rather than to those creating new works: today, new works in existing franchises primarily reward whoever owns the franchise; this would effectively apply to all new works, if generative AI were to become mandatory to compete.


Not sure why Paramount is on that list, but of course Disney and Elsevier are in the business of applying copyright law to extract $$$ from the creative works of ordinary people.

(Disney: takes fairy tales from the public domain. Elsevier: You pay them to get published)


While I don't clap for Disney at all, it's important to note that the fairy tales are still there where Disney found them.

The problem with Disney is their extremely aggressive and repeated push to safeguard their own golden goose. (And IMHO they have a serious quality problem. "Somehow Palpatine returned.")

In general, the current quasi-forever copyright regime is simply too blunt, and arguably way past the optimum length of protection when it comes to "incentivizing the creation of the arts". Of course it also incentivizes collecting money-making properties and flooding the market to extinguish most everything else. (Attention is naturally a limited resource, so "art is a zero-sum game".)


They are extracting slivers of pennies on a per-work basis. Not $$$.


That's still value, though. Skimming fractions of pennies was the crime in the plot of Superman III; I don't see why it would be ignored here. Jaron Lanier has proposed a micropayment approach where the data owners are paid for their contributions to the data that is monetized by tech.


> would be training with properties they own or have licensed appropriately

Cool - you want to train on medical data? That will be $1M per paper per day to Elsevier for a licence. (Apply similar for movies, news, books, etc.) Ensuring no knowledge remains "common".

The moment they can, the big data holders will charge whatever they can get away with. For an average person that may be even worse than the current content stealing.


Papers are ridiculously low signal-to-noise when it comes to knowledge. (see https://markusstrasser.org/extracting-knowledge-from-literat... ... but sure, LLMs might have a better extraction rate)

And, importantly, if papers become that valuable ... authors will suddenly start to want their cut and will go to whichever publisher pays the most. And so on.

It would be amazing if papers were worth that much (it would mean they can help create at least that much value downstream), and it would mean there would be a lot of money in writing new papers. (Oh, imagine the flood of low-quality shit! Oh no, it's almost the same as now! Ehhh.)


If they want to ignore copyright great, let’s change the law so copyright only lasts say 10 years.

But these companies want to have their cake and eat it.


It is good that you are rooting for the poor industry because they are "being screwed".

Sad that you didn't consider the content creators, the people whose faces, voices, writings, art, personal information, etc. are being used without consent and without any compensation, as the ones "being screwed" here.


> are being used without consent

Gatekeeping the information that is being learned from that data is not a right they have been granted.

I do not see an AI model's weights as a derivative work of the individual works used to train it, provided that the new works you can obtain from it would not have violated the copyright of the originals had they been made manually.


Copyright supports creativity by limiting creativity. It's paradoxical. But more recently we have seen open source, Wikipedia, and open scientific publication pushing this trend back.


They've generally already been screwed by distributors, they're being screwed further by gen AI, and in the case where large corporations which own their works are allowed to extract further rent from those works from those creators, they will be triply-screwed. It's a shit situation but the existing copyright system is not a good solution for the average creator.


It's kind of like the history of the web with regular human consumers and various attempts to allow microtransactions, which receded in favor of ads and tracking.

If the pattern holds, the "cost" of training non-paywalled data will be attempts to hack/influence the model, as opposed to hacking humans.


Content creators and artists will have to find a way to deal with AI (as a broad term): both in how they will or won't use it, and in how it affects or threatens their jobs and livelihoods.

People in the AI (using this term very loosely here) business are worried that only the biggest players effectively monopolize AI with the help of badly thought out (or plain corrupt) regulations that may be set in place in the near future.

These two issues are not mutually exclusive problems, and these two groups are not fighting against each other. It feels like typical divide and conquer, where regular people are being pitted against each other by views like yours, when they should work together towards a solution to both of their problems and fight bad regulation (and monopolies in AI).


Generative "AI" enthusiasts build their models by ripping off creative human beings for profit.

Generative "AI" enthusiasts fear that bigger fish will squeeze them out of the niches they created for themselves through theft.

Suggesting that artists and genAI enthusiasts should "work together" to defend the regulatory environment of the latter is ludicrous; it would be like a burglar breaking into my house, stealing my appliances, and asking for my support in his campaign for mayor!


Caveat: In countries that adopt an AI-aware copyright statute. In other countries they will just scrape whatever-the-hell from wherever-the-hell.

Right now all we're seeing is gentlemen's agreements and self-enforcement. A robots.txt is just asking nicely. I don't know of any country that has issued a judicial ruling on the topic of AI scraping yet.


Japan passed a law allowing use of copyrighted material for machine learning.

https://www.deeplearning.ai/the-batch/japan-ai-data-laws-exp...


Thank you. Well, if the USA does the opposite then all apps and research will be moved overseas.

This is a complicated topic and I don't know the right answer. I hope people with far greater skills in IP can figure out a meaningful solution that everyone doesn't hate. It might be that there is no good solution.


What about copyleft analogues? Plenty of important work in our field is licensed with e.g. apache license, which only came into existence because people needed new kinds of licenses for new kinds of purposes.


Copyright enables copyleft. Does copyright in relation to AI also enable copyleft in relation to AI?

The logic seems to be that "some of the training input was copyright, therefore the inference output is copyright." The same logic says "some of the training input was copyleft, therefore the inference output is copyleft."

Maybe future training sets will have to worry about the license compatibility between all the data in the training set?


You're making a historical mistake. Copyleft was precisely designed by Richard Stallman to use copyright against itself.


The idea that copyleft doesn't need copyright has been dead since TiVo: you need copyright to ensure people's freedom; if you simply repealed copyright laws, you would still have your rights as a user trampled.

The problem in this case is the AI companies aren’t ignoring copyright and then releasing their output with no copyright, which would happen in a copyleft or copyrightless world.

They instead want the benefits of closing their output while also taking other people's work to make that output.


It's already happening. Mega content aggregators will either build their own LLMs, or license for $$$$. Twitter's doing both and reddit did the latter.


> Anyway, I'm rooting for "no regulation" for now.

Not me. I want appropriate regulation. Without any regulation, these companies will just continue their abusive practices without restraint.

Will regulation encourage a concentration in a couple of large companies? Maybe. It depends on the nature of the regulation. Even if it does, though, I'd take that over the current status quo.


Who gives a shit about Disney? We are talking about AI. Just don't train on the script of The Little Mermaid; it's not the end of the world.

Elsevier on the other hand. Same deal as ever, if you want knowledge, pay for it. Writers deserve food on the table, and have the right to have control over who sees their work for economic or strategic reasons.

No one owes you anything


Elsevier is a scientific publisher which doesn't pay writers at all.

Public money paid the researchers, results should be free.


There is already a mandate (in the US) that publicly funded research will be open access going forward. There are loopholes, though, that I expect will be used just because it's cheaper and quicker to do so. There's also plenty of published research that is not publicly funded.


I agree the results should be free for personal use, but not for commercial use (such as training paid AIs). Training free AIs is probably fine. Similar to how it's free to use google maps in your project if your use is free to everyone, but you must pay if you're charging people to use it.


Elsevier is a racket. They're making money hand over fist gatekeeping the world's research output, while rewarding no one doing the actual research or editorial work.

As for the rest, I'm in favor of being able to train AI freely on works of culture. Current copyright regime is suffocating culture already pretty badly, let's not make it worse.


An AI that doesn't know The Little Mermaid would have trouble interfacing contextually with a society where it's common knowledge.


"This product would not work without copyright infringement" is not a good argument.


who cares


People and business in fact owe each other things all the time.


How to balance innovation, intellectual property and access to technology


There is yet to actually be a use case for chatbots that isn't profit-motivated.

The entire effort to find a problem for them to solve has meant ecological harms at a scale that makes crypto mining seem trivial.

The data required has copyright and other commercial use restrictions on it. Much of the data being overtly restricted was never available for harvest to start with.

Given these facts, bringing an end to the LLMs and snapping techbros out of their latest delusion would be a good thing.


> Those restrictions are set up through the Robots Exclusion Protocol

If anyone would like to join in, there's an actively maintained robots.txt here:

https://github.com/ai-robots-txt/ai.robots.txt

Yes, I know this isn't legally binding and scrapers can ignore it if they want to.
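
For anyone who hasn't written one, a minimal sketch of such a robots.txt just lists the crawler user agents to opt out; the agent names below are only examples (GPTBot and CCBot are commonly cited AI/crawler agents), and the linked repository maintains a much longer, current list:

    # Ask AI crawlers not to fetch anything on this site (advisory only).
    User-agent: GPTBot
    Disallow: /

    User-agent: CCBot
    Disallow: /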


I would be fine with you using my data to train your AI models if you let me use your models for free. If you can't do that, you can't have my data.


unfortunately all our data are belong to them already.


If you put it on the internet, someone can read it.


The decision to block bots is not always about protecting intellectual property. A practical consideration I haven't seen mentioned is that some of these AI bots are stupidly aggressive with their requests, even ignoring robots.txt. I had to activate Cloudflare WAF and block a variety of bots to prevent my web app servers from crashing. At least they're reasonable enough to identify themselves!


Those aggressive bots...

Crawl each and every date link on my Synology DSM-hosted calendar without throttling.


Yeah, we had a bunch of them crawling our git repositories in a very aggressive way, repeating the crawl within a few days, etc. 403 to the lot of them, regardless of the bot's purpose.
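
If you'd rather enforce the 403 at the origin instead of (or in addition to) a WAF, a rough nginx sketch looks like the following; the user-agent names are illustrative, and both blocks are assumed to sit inside your existing http context:

    # Flag requests whose User-Agent matches known AI crawlers.
    map $http_user_agent $ai_bot {
        default 0;
        "~*(GPTBot|CCBot|Bytespider)" 1;
    }

    server {
        listen 80;
        # Reject flagged bots before they reach the app or git backend.
        if ($ai_bot) {
            return 403;
        }
        # ... rest of the existing site config
    }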


What's the deal with (Amazon) bots crawling private GitLabs so aggressively?


> Some companies believe they can scale the data wall by using synthetic data — that is, data that is itself generated by A.I. systems — to train their models. But many researchers doubt that today’s A.I. systems are capable of generating enough high-quality synthetic data to replace the human-created data they’re losing.

I'm thinking synthetic datasets are how it's going to go. Of course you can't get information from nothing, but they might not need nearly as much seed data to generate lots of examples that are specifically designed to train reasoning skills.

Maybe it will take a while to get over the hump, but they'll be motivated to make it work.


As you mentioned, the data processing inequality[1] applies here, but I imagine synthetic data could help training squeeze out more from the existing data.

[1] https://en.m.wikipedia.org/wiki/Data_processing_inequality
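
For reference, the inequality just formalizes that post-processing cannot add information: if the synthetic data Z is generated solely from the existing data Y (which itself came from the world X), the three form a Markov chain and

    X \to Y \to Z \quad\Longrightarrow\quad I(X;Z) \le I(X;Y)

so any benefit from synthetic data has to come from using the information already in Y more efficiently, not from creating new information.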


It's neat how a longer "digestive tract" loses entropy but can make up for it by making more sense of things. It's akin to adding an NN layer, to using a more computationally intensive lossy compression algorithm, or to asking an LLM to explain the problem domain and the relevant variables (populating attention) before getting to the point.

It's probably true for people too. Instead of asking an expert for an opinion right away, ask them to discuss the options out loud first.


There are probably a lot of applications where the LLM could rely more on data that's supplied to it just-in-time in the context window, and less on specialist knowledge from its training set.

Also, "natural" data taken from the Internet is probably quite inefficient as training material. It's going to have a lot of duplication. You only need each fact once to be able to synthesize more examples of it.


Garbage in, garbage out is real and shouldn't be underestimated. It's about as useful as trying to understand quantum physics from the perspective of a flat-earther.


Synthetic data doesn't require using another ML model's output as training data. One incredibly successful way of generating useful synthetic data is using rules.

For example, AlphaZero uses synthetic data (playing against other versions of itself) but it does gain information because the data is generated against the rules of the game.

Similarly, word2vec and other vector representations use synthetic training data, but that data is derived from observations of text (no, it's not using raw text, each training instance is generated from the text, with varying levels of simplicity of course).

Tesseract is trained using synthetic data generated from font files and simulated "rendering errors."

Synthetic data should not be underestimated.
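
To make the word2vec point concrete, here is a minimal sketch (not any particular library's implementation) of how rule-generated training instances are derived from observed text: every (center, context) word pair within a fixed window becomes one synthetic example.

    def skipgram_pairs(text, window=2):
        # Derive (center, context) training pairs from raw text.
        tokens = text.lower().split()
        pairs = []
        for i, center in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    pairs.append((center, tokens[j]))
        return pairs

    # skipgram_pairs("the cat sat on the mat") yields
    # ('the', 'cat'), ('the', 'sat'), ('cat', 'the'), ...

Real implementations add subsampling, negative sampling, and so on, but the training data itself is generated from the text by a simple rule.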


Garbage in garbage out feels like a road bump more than a blockade. It doesn't seem far fetched at all for a future model to use known clean datasets to train on, and then become a curator of larger datasets.


Synthetic and generated data is also real and shouldn't be underestimated. LLM training can be very counterintuitive at times.


Synthetic and generated data might not be garbage. Consider random chess positions: those are generated, yet there's an engine out there that objectively has a higher chance of winning from any such position.

There was also a paper on computer programs that persist after running random data iteratively over millions of cycles. This doesn't even have a 'goal', which by itself is very interesting because of how similar it feels to the way organisms develop by the act of persisting over a few billion years.

That is to say, I largely agree with the premise that you can train and improve a lot even in the absence of new data. But if your original dataset is garbage, then training over and over might get limited by a hidden Nash equilibrium that the algorithm can't escape due to the lack of new information.

Obligatory Anton Petrov: https://www.youtube.com/watch?v=L_IWVZPmc-E Paper: https://arxiv.org/pdf/2406.19108


>Its about as useful as trying to understand quantum physics from the perspective of a flat-earther

Wouldn't it make more sense to say trying to understand general relativity from the perspective of a flat-earther? At the quantum level, the world is flat--I mean, ontologically. "Dimension" is not used to describe space and time in the same way as in Einsteinian physics which retains a very Kantian model of the cosmos. But then you could certainly re-interpret Kant to make it cohere with quantum physics, I'm thinking of people like Deleuze and Latour here, which is why I alluded to a "flat" ontology.

But in truth, it sort of negates your point, since when it comes to gathering things from the totality of the productive process (which is what an LLM does), indexing them and creating a surrealist wet-dream of output, there is not going to be a "right" answer, in just the same way that a quantum physicist and a cosmologist are not going to be able to agree on the correct definition of time. Truth is at the level of feeling, not sense. No scientific instrument can detect the feeling that one thing caused another, even though we are absolutely sure of its continuity, and it is such continuity that stands as the basis of all scientific investigation. In the end, empirical laws, just like moral laws, simply "feel right," and that's it. The LLM tells us things that feel right. It is a good-feeling machine, yet it is precisely when we notice something wrong, when it seems totally off, in those sublime gaps, that it becomes the most enigmatic and, in fact, the most useful. The real beauty of the AI is in those gaps, in its imagination, because it is actually our own, projected outward and reflected back like a society-wide mirror. There is something wonderful in the dreams of a machine.


No need even for synthetic data. Advance away from doing statistics on troves of human-generated data, to doing statistics on raw environmental data, like humans do. This will also likely mean moving away from the GPT architecture completely.


This is the way. Robotics, self driving cars, mobile devices themselves, either phone or glasses.... They can generate the data needed for the next steps. (pun intentional) Text was easy, the process has been defined. Multi-modality improves the models. The next set of data is the world. Everything we currently observe using our senses and data outside our senses like radio waves. Build a better world model.


Seems silly to say: "Instead of training this LLM on internet data, we will train the LLM on the output of an OCR model pointed at a screen scrolling internet data. This is clearly more ethical."


I made no allusion to ethics here; merely logic. Humans don't learn by scraping the web, but by observing the environment. The web is an incomplete distillation of humanity's knowledge of facts and various interpretations of them (which also includes "creative" works as one cannot create anything without basing said creations on environmental observations).


Well, unless you exclude common crawl and block all robots…it’s still going to end up in a dataset someday. Or deleted and gone forever!


The NYT had another article a couple of days ago about Getty leveraging the images it owns to go into AI.

https://www.nytimes.com/2024/07/19/technology/generative-ai-...


Getty often makes false claims and generally is a super shitty organization, for example (there are many): https://www.latimes.com/business/hiltzik/la-fi-hiltzik-getty...


Getty will have to indemnify users against infringement claims in the event that the images it generates turn out to be based on unlicensed materials and judged as derivative by a court. So will all other AI companies eventually, it's just that Getty has _some_ content while other companies rely much more on unlicensed content.


They won't, and their clients will get sued; it's exactly what happens when Getty licenses images to its clients that it does not own any rights to.


To me, what Getty is doing is actual theft. How do you "accidentally" issue copyright claims using images you don't own? How did Getty start claiming this image as their own?

In contrast, when training an AI model you download a piece of info once and then don't resell that (exact) piece of info. Nobody is claiming they own an image they don't.


> Nobody is claiming they own the image when they don’t.

An AI generated work may still be derivative enough to be judged as infringing by a court. AI is not a black box which magically performs copyright laundering.


It's fairly easy to do accidentally in the case of a large corporation like this: different parts of it don't check with other parts. In this case one part added some public domain photos to their site, which is scummy but legal, and then another part assumed that those photos were like the many others they have exclusive rights to. In another case mentioned in that article it's a third party which lied to them. Accidental doesn't mean they aren't reckless with their threats, however, which large-scale copyright license collection systems almost by necessity are, and that causes a lot of harm (and they are rightfully losing cases over it).


I look forward to an AI trained entirely on Wikipedia and classical literature, with no Twitter and no contemporary art in sight. It would be sublime. Let's face it: the creators of the 21st century way overestimate the importance of their stuff. It's mostly deleterious to the culture.


> trained entirely on Wikipedia

does the CC license used by Wikipedia allow training?

> and no contemporary art in sight

I will note, per the above, that they may be able to scrape other CC-licensed content if they -can- scrape Wikipedia.

> It's mostly deleterious to the culture.

Yep and based on how I already sometimes catch people treating LLM hallucinations as fact, it's likely gonna get worse anyway.


> does the CC used by wikipedia allow training?

This is backwards. Absent something that prohibits it, training is allowed. The question should be this: what do you believe prohibits it in this case?


The author of a work has copyright automatically, so what would prohibit it by default is copyright law.

In the specific case of Wikipedia I would guess it's allowed by the license, but that's not generally true.


Training isn't copying (any more than a browser or CDN cache is) or distribution. Copyright is out of scope. Is there anything else that would prohibit training?


All of those things are copying. Browser caches probably fall under fair use. CDNs are contracted by the distributor so literally licensed.


The act of copying while training is incidental, in the same way as browser caches are incidental to the viewing of the content. Except that with training, you don't want to end up with a duplicate. The whole point is not to copy the original. Training is not copying.


Wikipedia provides dumps of their data[1] and there are sources that prepare it for training as well[2].

[1]: https://dumps.wikimedia.org/

[2]: https://www.tensorflow.org/datasets/catalog/wikipedia
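
For what it's worth, the TFDS catalog entry linked above can be loaded in a few lines; a rough sketch, where the config name (dump date and language) and the field names are assumptions that have to match whatever the catalog currently offers:

    import tensorflow_datasets as tfds

    # Load a prepared Wikipedia dump; "20201201.en" is an assumed config name.
    ds = tfds.load("wikipedia/20201201.en", split="train")

    # Each record is assumed to expose "title" and "text" features.
    for example in tfds.as_numpy(ds.take(1)):
        print(example["title"])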


I think that's basically what models such as Mistral are doing. They mentioned that they use very sanitized datasets for training as compared to simply scraping the web.


Works where archive.ph is blocked:

https://web.archive.org/web/20240720013944if_/https://archiv...

This is how I read NYT now:

   tnftp -4o"|yy093" https://www.nytimes.com/2024/07/19/technology/ai-data-restrictions.html > 1.htm
   links 1.htm
Using HTTP/1.1 pipelining, i.e., yy025 and tcpclient instead of tnftp, I retrieve many articles in a single HTTP request.

yy093 outputs a single HTML page with all the fulltext articles, the way I like it. 100% Javascript-free.

The web keeps getting better for text-only. More JSON, less HTML.


what's yy025 and yy093? (google only finds random Chinese(?) pages.) and ... woah, may I ask where your username came from (does it mean something)?

edit: okay, it seems those are scripts written by you, right? https://news.ycombinator.com/item?id=41035155

okay, WTF is this https://news.ycombinator.com/item?id=40894851 ? :D


TempleOS guy borne again


> We’re seeing a rapid decline in consent to use data across the web

That is such a weird and misleading way to put it. There was no consent in the first place. Take YouTube for example. Google did not consent to the videos it hosts being used by OpenAI. The uploader certainly did not ever consent to their face, voice and content being used to train models either.


I think this is an important thing to point out. This is the first really compelling test of the legal theory that the overbroad content-ownership boilerplate licenses big user-generated-content hosts use could actually bind people to allow arbitrary use of their content, likeness, etc. Before this, the uses were actually quite egregious but kind of abstract: things like "build a profile on you to advertise products" have incredibly deep and creepy implications, but for whatever reason they aren't as unifying an objection as this.


I doubt that this theory has not been put to the test, repeatedly.


Guess we just take legal matters on faith then? My lawyer's gonna be devastated


You did when you uploaded it and consented to the ToS.

Google was training on all of YouTube back in 2012 - https://slate.com/technology/2012/06/google-computers-learn-...


There's consent, and then there's informed consent. It's not really possible for people to have given informed consent to AI model training on their data in an era when AI had much more limited capabilities relative to what it has now. It's one thing to give consent for gmail to train on your data to develop better spam filters or profile your interests; it's another to give consent for Google to convincingly impersonate your writing style, your facial tics, and your speaking voice. I think few people on Earth in 2012 really understood that their data could ever be used in such a way, and so their consent was not informed.


> I think few people on Earth in 2012 really understood that their data could ever be used in such a way, and so their consent was not informed.

Arguably this is just not consent, plain and simple. There could be no reasonable expectation for a person to believe they were agreeing to this. Uninformed consent, I'd argue, is more like how people agree to give away their data but do not actually understand how the data is taken and used (let's be real, this includes a lot, if not most, tech people).


Most legalese is ignored by the people who are supposed to read it when they do things online. But that doesn't mean it doesn't apply; ignorance is not a valid excuse. Arguably, the average TikTok user did not bother to read a single word of the terms of use they agreed to. The textual content of these things hardly matters at this point, when users simply don't read it at all and just blindly agree to anything you put in front of them.

Funnily enough, you can actually use LLMs to make sense of these things. We're more empowered than ever to process legal text. It used to be that your only option was to consult somebody with legal training. Which of course isn't something a lot of users would do.

In any case, you are using a moralistic argument in a legal context. That's simply not how the law gets interpreted in courtrooms. It might motivate politicians to write new laws, though, which then get interpreted in courtrooms. But given the complexity of the matter at hand, the outcome might not be a net positive. In fact, that is extremely unlikely.


> It's not really possible for people to have given informed consent to AI model training on their data in an era when AI had much more limited capabilities relative to what it has now.

I can’t see that argument holding up in court. You’re using a definition of informed and consent that are rather novel.


These definitions are standard in medical studies, but AI training on personal data is the wild west by comparison. That might well change. It took a few ethical calamities for medical law to adopt these practices.

https://legaldictionary.net/informed-consent/


> I think few people on Earth in 2012 really understood that their data could ever be used in such a way, and so their consent was not informed.

I think we've been able to see this coming for many, many years.


Post hoc ergo propter hoc

You may think this now, but I'm willing to bet a lot of money that you didn't believe this in 2012 and even 2015. Be careful to not rewrite history. It's an easy thing to do.


If you use Google, your data actually is their data. If you're not paying for it, you don't own it


The "informed" aspect tends to be covered by a "future yet unknown uses" blanket clause.


The article is about third parties, e.g. OpenAI scraping Youtube and Reddit.


Responding to this GP point: > The uploader certainly did not ever consent to their face, voice and content being used to train models either.

I'm also not saying I _support_ ToS


I'd argue that there was no consent here in the first place. I want to remind everyone, there are plenty of times and reasons a contract can be invalidated.

Consent requires the person knowing what they are consenting to. How many people do you think know that posting their face on Facebook gives consent for their face to be used to generate photos? Their voice to build voice generators?

What about when someone posts a group photo or video and, despite you not having an account, the same rules apply to you? What about all the photos created by others and uploaded by people with no ownership or rights to them in the first place? They can't legally give away those rights.

Then we also have to consider that the environment has changed and the terms changed under people. You may have signed up for Facebook knowing they were going to use your data to sell ads and do some analysis. But you didn't know AI was coming (let's be real, no one knew even at the time of AlexNet; it didn't become clear in tech circles until maybe 2016/2017). Sure, the agreements might "be the same", but people always contextualize things. In 2012 you might not care that Facebook "owns" your voice because you know they have to host the file, and in context it is "just legalese". But in 2024 the context is that this trains a model that can replicate your voice, and that means something VERY different.

I think there's this common attitude of "Oh, well you agreed to the terms", as if this makes the terms fair, reasonable, or okay in the first place. It doesn't account for the reality of the situation. It just further legitimizes a world where a person is legally culpable for not having domain expertise. News flash: we live in a specialized world and you can't be a domain expert in everything. It also doesn't account for the fact that there is often no real alternative, and there is serious coercion at this point. Unfortunately, not all the choices in your life are up to you; many are social (and many of those are idiotic and can force you into decisions nearly no one in the group wants[0]). If it were up to me, everyone would be using Signal to communicate with me and no one would have bought Apple products when they made their devices unupgradable and charged more for _increasing_ your disk space than it would cost to buy 1.5 drives in the first place. But also, thank god that everything isn't up to me, because I'm also a fucking idiot.

No, I don't think there was consent this whole time, and as time progressed the amount and capacity you have had to consent has decreased. Dismissing this (the fact that consent is not binary) just makes the problem exponentially worse. Not to mention all the dark patterns, which I could do a whole other rant about.

[0] Prime example: the American presidential race. The vast majority of people would rather have neither Trump nor Biden running. The vast majority of people do not want a gerontocracy, but will vote for one of the two candidates because what other choice do they have? People __could__ agree to vote for a third party, but doing so gives such a strategic disadvantage that people are rational in choosing from the options provided to them (see primaries).


I agree that these content creators were not contacted individually and asked for specific consent to have their materials ingested for training purposes.

That being said - the general crux of the third-party doctrine (at least in the United States) is that information told to another party has no expectation of privacy. That seems to apply to a vast amount of 'user generated content' on the internet. Someone decided to utilize someone else's megaphone to post into the commons and now expects to retain some ownership. It's hard to have it both ways "just because someone else noticed."

Unfortunately this also extends to general artistic styles. If you, as an artist, continue to create in a consistent style and someone else notices, you don't have a lot of recourse when some entity is able to recreate a similar style.

On a technical side this seems similar to your server responding to any incoming request with a 200 status and some content (LinkedIn v HiQ). In this case a user-agent asked (was it a person? scraper? AI training process?) and you gave it to "them".

I guess we end up in the position where if you don't want to train AI then don't post your content publicly.


Intellectual property and privacy aren't the same thing, so the third-party doctrine doesn't apply.


> information told to another party has no expectation of privacy.

I truly detest this phrase, but: “Citation Needed”


If Alice tells Bob something there's nothing preventing Bob from telling Carol. The third-party doctrine kind of boils down to this - someone else noticed and you can't restrain them from telling anyone about it.

Alice (content creator) told Bob (YouTube). Bob told Carol (OpenAI et al.) when Carol asked Bob for a video.


> If Alice tells Bob something there's nothing preventing Bob from telling Carol.

This is only true in the absence of an agreement between Alice and Bob that Bob won't tattle.


Right, I think a better way to put it is that there was complacency and neutrality (inattention) on the topic before. But now that people see that their work is going to be used to flood the culture with slop, the new default stance is to deny consent.


As a note: historically, the typical real-world opt-in rate I have seen for EU apps that are trusted names and self-evidently include content that would benefit from personalisation (names, key dates, location) is about 30%. Less trusted names fall to about 20%. Untrustworthy apps tend to go higher, as high as 60+%; I assume that is because of either deceptive patterns or the types of user they have.


I disagree, consent is important here, and the "move fast and break things" mentality is what's causing problems.

#1 It's not all that certain AI training is fair use. There could be significant damages if it is found to not be fair use.

#2 "We're going to steal your shit even if you won't give us consent to do that" is a bad idea. We're already seeing the practical results: People stop/reduce posting their works to the clearnet. What's your scraper bot going to do? Create an account on every 'private' forum and immediately get sued? Pay a subscription fee to every single Patreon page?

And the big one, #3: It destroys public support for your technology, which is essential if you want to survive the oncoming government regulation.

Look at the public response to various tech companies stopping their AI rollout in europe over EU regulations. Varying from "Yes, this is exactly what we asked for" to "Good fucking riddance".

And it will get only worse yet. Europe's privacy authorities are already issuing quiet statements that the scraping of social media posts, such as is done for AI data collection, is not legal under the GDPR. (The law's pretty clear on this, it doesn't matter that it was "posted publicly", you're not allowed to use personal data like that) There's already whispers of going after the LLMs themselves, as they contain and continue to process personal data as well.

AI needs public support, and the lack of consent is slowly bleeding it dry.


    > What's your scraper bot going to do? Create an account on every 'private' forum and immediately get sued?
More likely, that the owners of any repository of note will simply broker that data directly (a la Reddit).


Yes, though this scenario isn't all that interesting to discuss:

* In the case of big IP holders (e.g. media companies, news organizations) this is just obtaining consent. The only fun quirk is that OpenAI purchasing this data drastically weakens the "AI training is fair use" claim by proving the existence of a market.

* In the case of platforms like Reddit, it just kicks the problem one layer down. The platform does obtain "consent" through its ToS. (Beware that this consent is legally weak, and won't protect your ass from anything outside copyright.) But users will still see it as "stealing" and may flee the platform.

There's still a notable shift away from the clearnet.


> We're already seeing the practical results: People stop/reduce posting their works to the clearnet.

It's really interesting. Ever since I started publicly stating that I've removed all my works from the public web because there is no realistic method of defending myself against AI scrapers, I've been encountering a surprising number of people who say they've done the same thing.

I don't know how far this will go, but at least I know I'm not alone.


> the "move fast and break things" mentality is what's causing problems.

I want to add nuance. I don't think it is the "move fast and break things" mentality that creates all the shit, but that there is no "time to clean up, everybody do your share" mentality to complement it. Doing things often creates a mess, and doing hard things often creates a bigger mess. You can't make a fancy meal without dirtying a bunch of dishes. But are we seriously not hiring "dishwashers"? Creating a mess is unavoidable, and certainly it shouldn't be too large a mess, but the dishes are piling up and we can't hide it anymore. It isn't a linear problem, because the mess compounds and the mess itself generates more mess. We won't refactor. We won't rewrite. So we just have patchwork on top of patchwork. That's enshittification.

We also have a status quo where we sprint from sprint to sprint and try to move as fast as we can, but only measure how fast we're going by looking at how far we moved in a quarter. There's no long-term measurement because "that's too hard." But this is like trying to circumnavigate the world and choosing to walk. You'll make progress every day and, more importantly, __measurable__ (and easily measurable) progress. But you could spend 11 months building a fucking Cessna and still beat the person who could walk on water. You need to move fast (like the Cessna), but moving fast also requires slowing down. Who is willing to slow down?

I think most people don't have a problem with people using public data to do research or similar activities. That people wouldn't be up in arms if OpenAI scraped all of YouTube, trained on it, proved internally that they could do cool stuff with it, AND THEN either started to generate their own data for training or started to purchase data. Even though this would still be costly to YouTube and be a weird ethical ground (like getting a "free trial" (or theft) of the data and pay only if it works).

As someone with anxiety, I can assure you, it is not a good idea to constantly be rushing around chasing everything that needs to be solved. You just make more messes because you sloppily "fix" the issues, trying to move onto the next. The trick is to fight your own mind, slow down, triage, and solve anything that isn't a literal fucking fire with calm and care, no matter how much your own mind wants to convince you it is an emergency. But when everything is an emergency, nothing is. And that's the problem. We created an economy based on a business strategy that is functionally equivalent to an untreated and severe anxiety disorder.


On one hand this really sucks, particularly for newcomers starting from scratch, as now only the larger companies that already scraped the web can carry on in this vein, finding ways to improve the architecture to better utilize the data they already have. On the other hand, I see this being a forcing function to move away from generative AI sooner (which I've always considered a dead end for AGI) towards future architectures that train on data similar to how humans learn, i.e., raw audio, video, and other streams from the environment.


> now only the larger companies that already scraped the web can carry on in this vein

That will have a finite lifetime.

We've already seen issues with just the couple of years' difference that the first LLMs have, and it will definitely get worse, as time goes on.

They will, however, be able to improve what they do with the data, and some of them will probably get enough funding to buy the data.

Don't be shocked to find that the reason so many organizations are restricting access to their data has nothing whatsoever to do with things like privacy and copyright.

They just see a lot of money sloshing around, and want to hold out for their share.

The pipes will reopen, but the free lunch is probably done.

One big issue will probably be, that the free data will be extremely low-quality (possibly poisonous) data, so that might be very bad for folks on the lower tiers.


Without that "free lunch", the incumbents will still likely have the advantage in perpetuity. There will be fewer competitors, as the cost will be a non-starter unless a new entrant is backed by some deep-pocketed entity that doesn't mind the possibility of losing a lot. The tech is already proven, and there's not enough value in the incremental improvements to make a new entrant sustainable. So even if the pipes are reopened behind paywalls, there won't be much demand.


The only interesting part is that it will become possible to remove a book without leaving an empty spot on the shelf.

In the eagerness to create "safety", the Wikipedia-ish discussion where no two truths can exist simultaneously will happen in private, and the official narrative of everything, however preposterous, will be the only one.

We should also create LLM driven content moderation so that wrong think can be silenced immediately.

The masses of brainwashed slaves will no doubt learn to like it.


Let's not fool ourselves and think that these big AI companies care about licenses.

It's easier to ask for forgiveness than consent,

the cost of doing business,

etcetera, etcetera.


>Those restrictions are set up through the Robots Exclusion Protocol

Well, so it's not really disappearing at all.
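For concreteness, those "restrictions" are just opt-out directives in a site's robots.txt. A minimal sketch of what publishers have been adding (the user-agent tokens shown are ones the crawler operators have published, e.g. GPTBot for OpenAI, CCBot for Common Crawl, Google-Extended as Google's AI-training opt-out token; the exact list varies and honoring it is voluntary):

   # Hypothetical robots.txt: keep search indexing open, opt out of AI training crawlers.
   # These user-agent tokens are published by the crawler operators; the list changes over time.
   User-agent: GPTBot
   Disallow: /
   User-agent: CCBot
   Disallow: /
   User-agent: Google-Extended
   Disallow: /
   User-agent: *
   Allow: /

The pages are still publicly served to anyone who ignores the file; nothing has been deleted, it has just been fenced off by convention.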


while "data is the new oil" might be so late 2000s and early 10s, it is finally coming to fruition in public discourse after people realized that one can dump them into a model and get something you can work with.

Having worked on this topic, creating synthetic data is not always easy, and you still need real data to get the best results. If you look beyond basic internet media, there are tons of fields where "real data" does not exist in quantities that could help create effective models. These might be good case studies on how to proceed further in small endeavours.


Yes, this is the key point:

"Changing the license on the data doesn’t retroactively revoke that permission, and the primary impact is on later-arriving actors, who are typically either smaller start-ups or researchers."

And also this:

"Mr. Longpre said that one of the big takeaways from the study is that we need new tools to give website owners more precise ways to control the use of their data."

This will happen once everything is looted/not needed anymore.


Outside of tech, AI isn't making any friends. It's an unexpected threat on the horizon for the bearers of accumulated privilege, such as monopolies and nation-state autocrats. They hear echoes of the computer and internet revolutions, which were key in upending many prior "garden patches" of power. Their cronies have been stirring up resistance among workers on the front lines of disruption, such as some content creators.

The truth is that no one knows where the dust will settle. All covered wagons are venturing west of the Rockies at the same time. Content creators are not really facing a worse outlook than everyone else. They are, however, in a better position to hinder the advance.

In principle, an AI learning from a scientific textbook is no different than a human student doing the same. Neither will violate copyright law when they're done learning, except perhaps accidentally - paraphrasing and facts are not violations. Unfortunately, legal and ethical principles can differ from legal reality. We're left hoping that some altruistic legal minds will open up a Northwest Passage for us, like Thomas Penfield Jackson in US v. Microsoft.

The worst possible outcome is that we end up with an all-powerful AI cartel which negotiates massive deals with IP conglomerates, locking out competition and open and free alternatives.


> In principle, an AI learning from a scientific textbook is no different than a human student doing the same.

In principle, a human brain doesn't count as a copy of a copyrighted work, whereas an AI model is capable of regurgitating the original. I don't understand how that's not a copy[1] or a derivative work, which is a good indication that the judges presented with these questions won't understand either.

So, no, a brain with knowledge is a brain, and an AI model with knowledge is a potentially infringing copy or derivative work.

[1] https://www.law.cornell.edu/definitions/uscode.php?width=840...


A regression model - an equation for a line - is obviously not a copy of the data points that it is fit to. Yet, occasionally you might be able to get it to reproduce a point in the original data, but only if that point happens to be right on the line. It doesn’t mean that point is “in” the model, a regression model contains no points, just a slope and an intercept.

An AI is nothing more than a large network of regression models, a very large set of equations. It is not a copy of anything.
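To make the analogy concrete, here's a minimal sketch in Python (purely illustrative of the regression point, not of how any particular commercial model is trained): fit a line to a handful of points and note that the fitted "model" retains only two numbers, not the data.

   import numpy as np
   # A few "training" points that happen to lie near a line.
   x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
   y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
   # Least-squares fit: the entire "model" is a slope and an intercept.
   slope, intercept = np.polyfit(x, y, 1)
   print(slope, intercept)         # two floats; the five (x, y) pairs are not stored
   # The model can "reproduce" a training point only to the extent that the
   # point happens to sit on the fitted line.
   print(slope * 2.0 + intercept)  # ~5.02, close to but not exactly the original 5.2

Whether a multi-billion-parameter network sits closer to this end of the spectrum or to lossy compression is, of course, exactly the contested question.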


A JPEG doesn't include a single pixel of the original image.


You make a good argument. The sad thing is that I could see this might put an end to open models while still allowing big tech to host them.

Because creation of a derivative work isn't a problem in itself; it only infringes at the point/time of distribution, and distribution is the whole point of open models. But big tech could probably argue that their training of a private model doesn't infringe per se, and that it is the users who then create prompts that cause the regurgitation to occur who hit the tripwire of triggering copyright. Presumably, big tech can solve that, either with terms of use (telling the user it's their responsibility / liability) or YouTube-style, where a Content ID-like system skewers anything it can't prove isn't infringing before it gets to the user.


I can memorize and recite a poem, but in the context of a scientific textbook, neither I nor an AI will remember the actual words used, just key facts like E = mc^2.

If you limit your critique to the AI's ability to recite a poem or lyrics verbatim, then you have a point. However, I used the example of a scientific textbook for that exact reason. To the extent that the AI only paraphrases it or cites facts from it, it's not a copy.

The derivative work issue is more interesting. The AI can generate a derivative work, but won't do so without a prompt asking for one. If you ask for an Oz-like bedtime story, you should get an Anxious Lynx and not a Cowardly Lion.

If you ask for a story with Oz characters, the result will be a derivative work, but it will include your prompt. The work was not in the AI, though the ability to make it was. In this sense, the AI is more like an instrument that can be used to violate copyright law, like an employee who has read Oz would be. It's hard to find a close inanimate analogue, but it's probably on the same shelf as search engines and typewriters.


> neither I nor an AI will remember the actual words used […] To the extent that the AI only paraphrases

Many AIs have in fact reproduced copyrighted work in large portions, it’s one of the reasons there are a bunch of lawsuits.

> AI can generate a derivative work, but won’t do so without a prompt asking for one

This is an odd and vague claim. Which AI are you talking about? And why do you claim there's any generative AI in existence that won't remix from its training examples? Their design from the ground up is to create derivative works.

> AI is more like an instrument that can be used to violate copyright law

If the training data wasn’t legally acquired, using it to train in the first place may violate the law. People are trying to change copyright’s interpretation, but pending the outcome of that it’s pretty clear that today’s AI may have violated copyright law before generating any output.


> This is an odd and vague claim. Which AI are you talking about?

Let's exclude maliciously crafted prompts. Given an "innocent" prompt that is not meant to elicit a copyright violation, what is the probability than an LLM will generate something that will both reproduce a copyrighted work and fail the fair use tests? It's obviously non-zero, but it's exceedingly small.

I'd say it's around the same probability as Google Street View accidentally reproducing a copyrighted map that was displayed in a storefront window. A claim that Google has to pre-compensate all map publishers because of this possibility is just an attack on Street View per se.

> And why do you claim there's any generative AI in existence that won't remix from its training examples? Their design from the ground up is to create derivative works.

Just because something sounds like a derivative work doesn't mean it will infringe copyright or fail the fair use tests (see sibling threads.)


I reject the analogy to Street View, because a sign in a window isn't Street View's primary use; it's an accident (plus the store owner is responsible for ensuring permitted use of publicly visible signs and is culpable for copyright violations, and photography of anything in public view is permitted by law - see the famous "photographer's rights"), whereas with AI, statistically reproducing the training examples is the primary intent of its design. Reproducing derivative remix works is its function and nature through and through. Whether the output is actually recognizable as one of the individual inputs is a different question from whether the output is a derivative work.

> Just because something sounds like a derivative work doesn’t mean it will infringe copyright or fail the fair use tests

That’s true! Remixes where the source comes from multiple songs have a history of escaping copyright claims more often than derivative works of a single song.

But that doesn't change the fact that using copyrighted data without permission and without a Fair Use exemption to train AI is currently illegal. And it doesn't change the fact that when AI reproduces substantial portions of any given work, regardless of what the probability was, it is illegal. Claims that the probability of this is "exceedingly small" don't stand up to the (now many!) actual cases filed in court demonstrating AI reproducing copyrighted works, and not because the chances were small, but because it simply breaks the law.


So if I can regurgitate the textbook, my brain violates copyright?

I think the more sensible take is that the reproduction by the LLM violates copyright, but not the act of training nor the LLM weights themselves.


If you shared copies of your brain in that state, it seems like that would. Lots of LLM models can be downloaded.

If you had memorized the textbook from a stash of torrented epubs, as a commercial activity with intent to monetize, that likely wouldn't go over well for you either.

The "big AI" companies are trying to create a world of rules for thee not for me by claiming that when copyright infringement is large scale enough, it somehow becomes fair use.


You're never going to get a textbook out of an LLM, or a chapter, or a page. Big-IP lawyers probably spend weeks engineering prompts to find enough stuff that comes out un-paraphrased. Poems and lyrics, maybe.

The legal test for fair use considers "the amount and substantiality of the portion used in relation to the copyrighted work as a whole." In this case it's the portion regurgitated near-verbatim, since paraphrasing and summarizing is fair use. You can write Cliff Notes without the author's permission.

LLMs are probably capable of failing the fair use tests for less than 0.1% of the stuff they're exposed to.


Yes, that is what copyright law already says (that your brain reproducing a work is the violation, not your brain itself), which is why it took time for the lawsuits to build momentum - they had to find evidence of copyright violating reproduction.

The part you’re missing here is that acquiring and using copyrighted data without permission also violates copyright. Just like copying and watching a movie you didn’t pay for or legally acquire violates copyright. It’s a worse offense to distribute a movie you don’t have the rights to, but copying one and consuming it without permission is an offense nonetheless. The act of training on data not legally acquired, therefore, certainly can be a copyright violation.


> using copyrighted data without permission also violates copyright

This is simply not true. There are many things that copyright law prevents, but no copyright law goes as far as to say that all use without permission is forbidden. And no copyright law expressly forbids training AI either.


You inserted the word “all” into “all use without permission” which is not what I said. Fair Use is a thing, and the big AI players have tried to claim Fair Use, so far unsuccessfully. They’ve requested new rulings on Fair Use for AI, which admits they are aware they’re potentially violating existing copyright law, and are seeking to make their use legal.

The copyright office says downloading movies violates copyright.

https://www.copyright.gov/help/faq/faq-fairuse.html#:~:text=....

See also (same page) “If you use a copyrighted work without authorization, the owner may be entitled to bring an infringement action against you.”

> no copyright law expressly forbids training AI either.

That’s irrelevant. The law covers using copyrighted material without permission. Fair Use is a type of permission, but otherwise they don’t have to specify what it gets used for.


Art forgeries are an example of a human brain regurgitating the original content. Someone reading an article and rewriting it with some differences is creating a derivative work. So yes, a brain with knowledge is potentially an infringing copy or a derivative work.

Human brains are often treated specially in law, but I don't think they should be.


Why wouldn't nation-state autocrats be interested in AI? Seems like a great tool for facial recognition at scale and eventually maybe even automated policing via drones and the like, thus sparing them the costs (and danger) of having to maintain a major police/military force to protect their regimes.


They've largely already mastered control of their domains, e.g. by outlawing all non-state media. Generally speaking, if you're already at the top, everything is mostly a threat simply because nothing is much of an opportunity.


Yes, nation states will be interested, and it goes far beyond simple tools like automated policing.

Once AI is used as the buffer between all forms of communication (making sure the emails I send put me in the best light, filtering incoming email to make me more productive, generating web content, reading web content in a way that makes me more efficient, even voice and video calls), the nation state can not only monitor, but also subtly alter, all communications, in real time and in context, to guide/enforce how people interact/communicate and think.

This is already happening in areas such as search, getting "answers", and content generation. Your information is constrained/mapped into what the AI deems appropriate.

We already know that the NSA taps into the communication backbone of the internet. Why not add a real-time AI monitor/filter to best manage how we all communicate ?


> Outside of tech, AI isn't making any friends. It's an unexpected threat on the horizon for the bearers of accumulated privilege, such as monopolies and nation-state autocrats

How so? Generative AI has developed in a way that's very amenable to the bearers of accumulated privilege, it requires so much capital expenditure that it's in the hands of a few companies that can be variously negotiated with or strong armed. This is not at all like the computer and internet revolutions of the 90s.

> The worst possible outcome is that we end up with an all-powerful AI cartel which negotiates massive deals with IP conglomerates, locking out competition and open and free alternatives.

What's the difference with the status quo if the main players are Microsoft, Meta and Google?


> How so? Generative AI has developed in a way that's very amenable to the bearers of accumulated privilege, it requires so much capital expenditure that it's in the hands of a few companies that can be variously negotiated with or strong armed. This is not at all like the computer and internet revolutions of the 90s.

Even at today's bloated GPU prices (which won't last), training rigs cost about the same as the servers and data pipes that serious dot-coms needed, and much less than the fabs that serious semiconductor outfits needed in ~1980, especially when adjusted for inflation. Many AI startups don't do their own primary training at all, making their costs identical to barebones dot-coms'.

> What's the difference with the status quo if the main players are Microsoft, Meta and Google?

You forgot OpenAI, Anthropic, Apple and Amazon. Of the seven, two are new, three emerged from the last revolution and two from the one before.

The incumbent companies in the 90s were the likes of IBM, DEC, HP, Unisys, SGI and Sun. Almost all of them smelled the coffee and invested heavily in various Internet plays. Only a few survived, most of them only barely. The previous revolution was even bloodier.

Today's incumbents are also smelling the coffee and have AI plays. None of them is certain they'll still be around 10 years from now.

The reason Meta opened up Llama is because they were convinced that the barriers to entry were minimal. In 2022, Altman famously claimed that if there was a Moore's Law in parameter count, it had already topped out. A large fraction of use cases can be handled with models under 10B. Training algorithms are increasing in efficiency faster than parameter counts are increasing. The only real barriers to entry are patents (and we won't know what that looks like until they break stealth) and the possibility of nefarious anticompetitive deals with the IP conglomerates that we're discussing here.

Many economists expect the disruption caused by AI to be much greater than that of PCs and the Internet. The usual parallel is electrification. The gentry from back East is in the covered wagons too.


> In principle, an AI learning from a scientific textbook is no different than a human student doing the same.

It's not about the process of building a model. It's about the copyright infringement to acquire the training set and it's about the copyright infringement when producing output. Humans, just like these AI companies, can infringe copyrights by downloading copyrighted material. Humans, just like AI programs, can produce material that is derivative of material protected by copyright. Whatever the model represents and whatever is in your brain is not high on the list of concerns.


> It's about the copyright infringement to acquire the training set and it's about the copyright infringement when producing output.

I just read your comment. I also read news articles I paid for today. None of that is copyright infringement, and scraping that same content to train an AI is no different.

Yes, when an AI produces a substantially close reproduction that would be copyright infringement, but some systems have guards to prevent that, like Github Copilot.


> None of that is copyright infringement, and scraping that same content to train an AI is no different.

Scraping is copyright infringement. Try telling the RIAA that scraping songs from YouTube or Spotify is not copyright infringement. Try telling the MPAA that scraping movies from Netflix is not copyright infringement. Same for book publishers.


Small scale personal use can be fair use while large-scale commercial use is not.


I've addressed the "when producing output" part in another reply, so here are my thoughts on the "to acquire the training set" part.

Unless you're downloading from something like Z-Library (which incidentally is a deliberate attack on Western IP on Putin's behalf) you generally have the fair use rights to "digest" the information you receive. You can process the contents of a webpage or momentarily OCR a printed page, provided that it was legally distributed and that you don't store a copy proper. There can be specialized restrictions if it's something that you have a license for instead of a copy of, see e.g. https://en.wikipedia.org/wiki/Shrinkwrap_(contract_law)

As for derivative works, not all the works that are made "in proximity of" previous works are derivative works. You can make a simulated reality movie without infringing on the Matrix, for instance. Just write a somewhat different plot and don't name your lead character Neo.


> You can process the contents of a webpage or momentarily OCR a printed page, provided that it was legally distributed and that you don't store a copy proper.

The AI companies are definitely storing proper copies of their "corpus".

> You can make a simulated reality movie without infringing on the Matrix, for instance.

You can also make a simulated reality movie, even without the name Neo, and a court could find it substantially similar to the Matrix and thus copyright infringing. It doesn't need to be the same.


> The AI companies are definitely storing proper copies of their "corpus".

Their own lawyers would be on their case if they were infringing halfway through the pipeline. They don't have to, and it would weaken their case where it counts.

> a court could find it substantially similar to the Matrix and thus copyright infringing

Substantially similar is not the standard, otherwise they would have had to stop making time travel movies about 70 years ago.


Substantially similar is the standard

https://crsreports.congress.gov/product/pdf/LSB/LSB10922

> Under U.S. case law, copyright owners may be able to show that such outputs infringe their copyrights if the AI program both (1) had access to their works and (2) created “substantially similar” outputs

From Shaw v Lindheim.

> Their own lawyers would be on their case if they were infringing halfway through the pipeline.

This is why many of these AI companies are their own entities with sponsorships or shares held by other, larger companies. If they're on the hook for obscene copyright infringement, they just close down.


> Substantially similar is the standard

This quote is about AI, not movies. If you read the entire report, there are many other caveats and provisions:

   Whether or not copying constitutes fair use depends on four statutory factors under 17 U.S.C. § 107:
   1. the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
   2. the nature of the copyrighted work;
   3. the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
   4. the effect of the use upon the potential market for or value of the copyrighted work.
and furthermore:

   The substantial similarity test is difficult to define and varies across U.S. courts.


> In principle, an AI learning from a scientific textbook is no different than a human student doing the same. Neither will violate copyright law when they're done learning

If AI were only learning from scientific textbooks very few people would have problem with it. When AI learns from creative works like novels, it runs into the same problems we have as humans. We can read a book, but the minute we start distributing copies of it from memory we're in legal trouble.


AI will train from AI then. The copyright source is twice removed.


This is really a peak crazy-libertarian take on the issue of artists and content creators being peeved that AI is making crappier content for cheaper and driving down their job opportunities and wages, after training on their content (and sometimes regurgitating it word for word in large chunks) without paying royalties.

If you think artistic people of various kinds are doing particularly well and need to have their privilege checked to stop the discrimination against AI tech bros...then I think you are out of touch. The starving artist meme is a thing for a reason, and most content creators are not making a lot of money. They enjoy it, and maybe make a little on the side. But most work a shitty day job as well to fund their passion, the thing that makes them enjoy life and helps them find meaning. They just want to contribute to culture and enjoy making art but wouldn't be able to justify it if all the existing low wage starving artist work dries up in favor of essentially free AI generated crap.

I'd hazard a guess that the majority of the country would take their side and call Silicon Valley tech bros and VCs privileged and anti-social.

I think a culture in which it is non-viable for an ordinary person to participate in art and culture on the side is kind of sad. I value these things because there is a person on the other end, and it helps me empathize and try to understand how they are feeling and thinking. An AI kills that. You'll get great stock photos and generic advertising, but you'll destroy something fundamental to human culture.


Everything not expressly prohibited by law is permitted. Extending copyright retroactively to cover AI training is like prohibiting the wearing of blue shorts on Thursdays and fining anyone who has ever worn blue shorts on a Thursday.


Content creators have an uphill battle to fight once courts learn that transformers are not massive repositories of artwork.

Image generator models are a few GB in size, despite copyright claimants saying there are petabytes in there.


Ah, so if I compress an IMAX movie release with HEVC at medium quality settings and 1080p, I'm good to go. Thanks!


If you XOR it with 1,000,000 other movies, then filter everything with a fisheye, then compress it down to 10x6 resolution at 0.1 fps, you're absolutely OK with copyright law.


If my college textbook's author demanded a percentage cut of my future earnings I'd be much more peeved than content creators are now.

I think pointing out that it's crappy is a much better critique of crappy content than looking at how it was generated.

As for tech bros vs. artists, that's a reality show I haven't watched and one I'm not interested in. I'm interested in progress and human advancement, and I'm definitely not a Libertarian or tech bro.


> If my college textbook's author demanded a percentage cut of my future earnings I'd be much more peeved than content creators are now.

If you were writing highly derivative textbooks, your college textbook's author would be peeved, too. That is the AI analogy, not that you read a chemistry textbook and are then breaking some law when you go do chemistry.


You think chemistry textbook authors haven't read chemistry textbooks? Heck, you think they haven't read chemistry textbooks as references, while writing their own textbooks?


Again, it's not about reading. It's about either acquiring in a way that is infringing or producing output that is infringing. If you write a chapter with something like, "Ionic bonds are bonds that occur between two elements where one or more electrons transfer to a particular element," it's not copyright infringement for another author to discuss ionic bonds. It's copyright infringement to copy the material. It's copyright infringement to produce output that is the same or substantially similar.


Are you assuming that all or most AIs generate derivative passages from textbooks when you ask them Chemistry questions? I'd be surprised if that was the case for even a single modern LLM.


I assume that all generative AI output is derivative of the training material (and the prompt). Because, what else would it be?


Copyright does not protect ideas, just specific expressions of ideas. So even if something sounds like a derivative work, it doesn't mean it's infringing. You can write stories about a fictional 19th century detective living on Boulanger Street in London without infringing on Sherlock Holmes. You can write reviews of Disney movies without Disney's permission.


AI does not understand what ideas are. AI does not think. It just takes source material and regurgitates it.


"disapparing" = people getting aware that their data has value, and setting their robots.txt permissions acordingly?


Like people getting up in arms when they notice someone found a way to create valuable medicine from the leaf litter of some trees in their yard, trees they had always ignored apart from letting people take shade under them. Now they want to block that novel use or get a cut of the profits, when they never saw any value in the trees beyond the shade. Senseless IMO.


No, disappearing = people thinking their data has more value than it actually does and setting their robots.txt so that it's no longer even indexed for search, because they've banned any AI use of it. Then, because it's actually not that valuable, no one notices, and that data eventually ceases to exist when the creator abandons it or dies.


The data is already incorporated into existing models. Those models will generate derivative data used to train the next models. They only needed to rip it all off once; after that, it's actually preferable to not let anyone do it again.


There will be a new law which states that copyrights don't have effect if the data is being used to train machine learning models (except in the EU perhaps? heh).


I would be fine if you used my data to train your AI models as long as I’m able to use your models for free in return. If not, you can’t have my data.


I'm not even sure about that. The ability to summarize others is hardly a consolation prize for devaluing my work almost completely.


I would not be fine with that, and I hope that you don't believe that it should be imposed on me.


> consent

I mean the entire industry is based on doing stuff without seeking consent.

All the major players seem to have used, at some point, at least the books dataset, so there's clearly no regard for consent or copyright.

The bigger issue is mechanical limits like Twitter and Reddit - walled gardens of info. That could entrench existing players, whether via money (pay me for data), ethics (oh, now suddenly consent matters) or just timing (more stuff mechanically restricted).


What if everyone resorts to training on synthetic data ...


In the near future I see less real, quality human data, as it goes behind paywalls, but also much more AI-generated data feeding into the next generation of AI, because more and more people are using it to publish stuff online, which then gets scooped up by AI training crawlers. And if it made sense to train an LLM on its own output, it would be done already :)

I doubt they can continue the progress at the same speed they have been at so far. Because the game is set to become more difficult.


Is it easy to recognize the AI content stealers by user agent? Can we just feed them garbage when detected?
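A rough sketch of what that could look like, assuming the crawlers identify themselves honestly (the major ones currently do; the marker strings below are illustrative examples and would need maintaining):

   from http.server import BaseHTTPRequestHandler, HTTPServer

   # Substrings seen in the user-agent strings of some self-identified AI crawlers.
   # Illustrative only; a real deployment would keep this list up to date.
   AI_CRAWLER_MARKERS = ("GPTBot", "CCBot", "ClaudeBot")
   GARBAGE = b"<html><body>" + b"lorem ipsum dolor sit amet " * 200 + b"</body></html>"
   REAL_PAGE = b"<html><body>The actual content.</body></html>"

   class Handler(BaseHTTPRequestHandler):
       def do_GET(self):
           ua = self.headers.get("User-Agent", "")
           # Serve filler text to recognized AI crawlers, real content to everyone else.
           body = GARBAGE if any(m in ua for m in AI_CRAWLER_MARKERS) else REAL_PAGE
           self.send_response(200)
           self.send_header("Content-Type", "text/html")
           self.send_header("Content-Length", str(len(body)))
           self.end_headers()
           self.wfile.write(body)

   if __name__ == "__main__":
       HTTPServer(("", 8000), Handler).serve_forever()

The obvious limit is that this only catches crawlers that tell the truth about who they are; anyone determined to scrape will just spoof a browser user agent.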


> The data that powers AI is disappearing fast

.... says the article behind a paywall.


[flagged]


Although LLMs are different, politically speaking Google et al have been harvesting our data and controlling our focus for decades now. I find it strange that the topic was not discussed enough in the past. Seems like the movie "Groundhog Day".


“Tell HN: Stop Reading The New York Times”


It honestly makes me hesitant to publish my special personal code on gists or in private repos, versus just keeping the really good stuff in at least a 3-2 backup...

TBH I would not feel any sadness if LLM models plateaued, greatly decelerated in development, or even -regressed- as they start to ingest all the garbage already being spewed by LLMs.


We should make the data on large platforms like YouTube and social media in general accessible to all companies for AI use (with the actual creator’s positive consent).


How many creators do you think would actually go out of their way to consent to their work being used for training? Maybe if they get paid, but otherwise forget about it.


There's actually an interesting perverse incentive here. A group of politically motivated actors (perhaps funded by a government or private interests) could create content with the sole purpose of opting in to all the AI data-harvesting feeds. That way they could bias the output toward their cause (assuming the AI company doing the training doesn't take active measures to counter it).


I would assume that's happening regardless; platforms like Reddit are already selling their data firehoses to AI companies, so anyone who manages to slip propaganda bots under Reddit's radar will end up having their talking points fed into future AI training sets.


Payment is out of the question - too much of a hassle. What is likely is websites either requiring you to allow AI training in exchange for hosting your content, or giving some minor perk/incentive for it.

Consider this comment itself. How much do you think it is worth to an AI company? Maybe 0.00001 dollars? How would you handle the logistics of that little money?


This comment neatly explains why the LLM bubble will burst as soon as prosecutors remember that the DMCA doesn't have a carve-out for AI.


I think it's all about consent. And I don't even think that people would be so upset if the whole AI training effort wasn't about profit. But the way it is, companies are training their models on other people's work and trying to make money with the models.


Imagine if we'd been appropriately skeptical of the way social media might worsen society in exchange for some short term fun and convenience, rather than blindly conflate invention and novelty with progress. We can forgive ourselves for that naivete, but having seen that what excuse do we have now?


NYT assumes that LLMs = AI, which is far from the truth.

This is just recent hype, which relies on getting insane amounts of data to train on, but we have had, and will have, AI models that do not rely on training on data used without consent.


It's not just LLMs but all generative models that rely on extreme amounts of training data. Text but also images, video, speech, music.

And it makes sense that LLMs were the ones to trigger the hype. AI in general has been making steady progress, but most use cases are really hard to explain to a layman. Give them a chatbot and they naturally understand what's happening.

The only thing I'm seeing is that people are using them wrong. They're using them like all-seeing oracles, which is exacerbated by the confidence with which LLMs provide their answers, and the innate human idea that "it's a computer so it must be right".

But knowledge isn't really where LLMs shine, at least not without search engine integrations. It's rather generation, summarisation, translation (with context sensitivity), rewriting styles etc.


Technicalities make things harder to understand for the general masses. "The Data That Powers LLMs" means little to nothing to an average user. My 70-year-old parents use ChatGPT, and they have no idea what an LLM is, but they call it AI.



