Neither the NYT's ask nor my own is ever going to happen, but I think forcing the models into the public domain would be the best outcome. They were trained on the creative output of a substantial portion of everyone online going back decades, including my own.
I think royalty and licensing fees are more likely. Some formula will be invented to calculate the value of the training data, and model makers will have to pay that out. But that only works for people with the ability to leverage the legal system. The rest of us will get told to pound sand.
NYT and similar publishers are just looking for a new income stream. Rent seeking in the new AI age will probably be more profitable than producing new content.
Speak for yourself. People who make their living in (actually) creative fields don’t have the luxury. We’ll see how all the brogrammers feel about industrial scale plagiarism 5 years from now when Gen AI models code better than any human.
I've been writing code professionally for about 45 years now. People have cribbed my code and ideas since the beginning. At some point, I just stopped caring about it.
I no longer sell code, just my services.
I did have one case where someone stole my code, and then tried to sue me for copyright infringement. Too bad for him that I had a registered copyright on it :-)
I think a substantial number of us don’t care or are happy to see it. It’s just another level of abstraction to work with. I think some artists feel the same although I understand those who don’t.
Presumably the "brogrammer" will still be the one feeding data into the model and stitching together the code it produces, vetting it, and deploying it because it is going to be massively more effective to have him do it vs the manager whose area of expertise is managing people and finance.
This will drastically increase the total output of all the "brogrammers", maybe even enough to replace some of the spaghetti code out there with more reliable work.
Human progress has always been about building upon and extending the work of others.
Besides, AI only produces a sort of "average" of what is already out there. Is it really creative work to produce an equivalent of that average?
I remember when the "typing pool" was a thing. Word processors utterly destroyed that category of work, as well as the jobs for typesetting and layout, back in the 1980s.
The typing pool converting one format to another is not equivalent. It’s more like the demise of administrative support professionals. That work just got shifted to the people they used to work for because it was now supposedly “easier” to manage themselves with more advanced information technology. Which is why we all have to waste massive amounts of time figuring out our HR stuff, arranging work travel, etc.
I would argue that the analogy rests on an incomplete consideration of the dimensions of the matter. Such tools amplified a worker’s inherent utility. The name given to the new tool reflects its nature: processor of words. Whose words? The typist’s. “Other people’s words processor” vs “word processor”.
More generally, AI introduces yet another systemic mechanism for wealth extraction by the wealthy. The wealth here is the creative power of the non-wealthy. More sinister than the grand larceny by a thousand minor borrowings is the fact that meaning, ideals, and the motivating energy to move masses are taken away from the candidate pool (any one of us can be the next demagogue; it’s not too late yet), a pool which may include genuine thought leaders who are going to be buried by the electric demagogue working for the proverbial man. “We need to hire more copywriters for our propaganda using these ‘word processors’” becomes “Who could resist the onslaught of ‘our’ creative efforts? Surrender now, Dorothy.”
tldr:
Humanity has reached the absolute limit of the utility of ancient means of governance. New technology demands that we comprehensively review the socio-economic order of society. Failure to do so will gift the current “winning” players of the zero-sum game of the regime du jour a near guarantee of perpetual habitation in their very, very special social perch.
In the 1800s, Germany started behind Britain and raced ahead of it in industrial might. One factor in this was that Germany did not recognize copyrights. Printers went looking for something, anything, to print, and printed up a storm of technical literature that enabled this rapid industrialization and the increase in wealth of its ordinary citizens.
Isn’t this already what happened with the Industrial Revolution, when we began mass-producing (mostly) better day-to-day utility items rather than getting them from a specific craftsman?
I’d think that the clothing industry went through something similar already.
I'm not sure the analogy really fits. As I see it, hypothetical farmers aren't an elite class that gets rich from copying and transforming the communications and creative output of billions(?) of unknowing people.
The Bolsheviks would beg to differ on whether the kulaks were an elite class.
When LLMs reproduce someone else's work, it is theft and appropriation, and our tech culture is rationalizing it out of hatred of the laptop class -- the kulaks, in my analogy. These traditional media do have their faults, and it's fine to point those out. But reporting is work, consisting of more than just "copying and transforming", and stealing it is wrong. "[F]orcing" it into the public domain is no more a "best outcome" than forcing the farms to collectivize.
There's a trivial solution here. Stick all the NYT's still-in-copyright text in a big database. Every time the chatbot produces a paragraph, check it against the NYT-censor. If it's more than 70% similar to anything in the database (mythical 30% rule), have the chatbot rephrase it. Thus, the chatbot is no longer reproducing someone else's work.
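For what it's worth, the core of that check fits in a few lines of Python. This is only a sketch: the corpus, the 70% threshold, and the rephrase callback (imagined here as another call to the model asking it to reword the paragraph) are all stand-ins, and a real system would need something far more scalable than pairwise diffing.

    # Sketch of the proposed post-filter. difflib's ratio() is one crude
    # way to score similarity between two texts; everything here is
    # hypothetical, not any vendor's actual implementation.
    import difflib

    SIMILARITY_THRESHOLD = 0.7  # the "mythical 30% rule" from above

    def too_similar(paragraph: str, corpus: list[str]) -> bool:
        """True if the paragraph is >70% similar to any corpus entry."""
        return any(
            difflib.SequenceMatcher(None, paragraph, article).ratio()
            > SIMILARITY_THRESHOLD
            for article in corpus
        )

    def filter_output(paragraph: str, corpus: list[str], rephrase) -> str:
        # `rephrase` stands in for asking the model to reword its output;
        # loop until the paragraph clears the similarity check.
        while too_similar(paragraph, corpus):
            paragraph = rephrase(paragraph)
        return paragraph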
I'm going to go out on a limb and say that solution wouldn't be acceptable to the NYT, which is why I think they're trying for a land grab. They're trying to extend copyright beyond what was intended.
I understand what you're saying now; my comment was pretty flippant in retrospect. I didn't mean to devalue the NYT's work; my suggestion was hypothetically intended to liberate the free moat OpenAI got from the fruits of NYT's labor. In retrospect it's obvious that putting GPT's model in the public domain also puts NYT's work in the public domain, but that hadn't occurred to me. Thanks for the food for thought. Now I kind of like the idea of destroying the existing models if a royalty system can't be figured out, a sort of "reset" to force them to do it properly.
Can someone with an actual fundamental understanding of LLMs explain to me why they think it's perfectly legal to train models on copyrighted material? I don't know enough about this. Please don't answer by asking ChatGPT.
Consider how commercial search engines are fine to show text snippets, thumbnails and site caches.
AI developers will most likely rely on a Fair Use defense. I think this has a reasonable chance of success since, while the use of a given copyrighted work may affect the market for that work (in this case NYT's article), it can be argued to be highly transformative usage. As in Campbell v. Acuff-Rose Music: "The more transformative the new work, the less will be the significance of other factors", defined as "whether the new work merely 'supersede[s] the objects' of the original creation [...] or instead adds something new".
There's also potential for an "implied license", as in Field v. Google Inc for rehosting a snapshot of a site, where "Google reasonably interpreted absence of meta-tags as permission to present 'Cached' links to the pages of Field's site". As far as I can tell in this case, NYT's robots.txt of the time was obeyed, which permitted automated processing of all but one specific article for some reason.
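For context on what obeying robots.txt means mechanically: a well-behaved crawler fetches the site's robots.txt and checks each URL against it before downloading anything. A minimal sketch using Python's standard library (the bot name and article URL are illustrative, not what any actual AI crawler uses):

    # Check whether robots.txt permits fetching a given page.
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://www.nytimes.com/robots.txt")
    rp.read()

    # True if the site's rules allow this user agent to fetch the page.
    allowed = rp.can_fetch("ExampleBot", "https://www.nytimes.com/some-article")
    print("fetch allowed:", allowed)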
Why do you think it is legal to train students on copyrighted material? Copyright is supposed to protect against unauthorized reproduction, not unauthorized learning. That the NY Times is able to show some verbatim reproduction is a real legal issue, but that should not be extended to training generally.
Students are humans. LLMs are not. Machine "learning" is a metaphor, not what's actually happening. Stop anthropomorphizing, and show some loyalty to your species.
The loyalty argument does sound somewhat bizarre, but I think the overarching point is about whether the use of this technology benefits humans in society or not. We should not implicitly grant LLMs owned by corporations the same rights as humans. Without some form of legislation, LLMs look like they will benefit corporations that are salivating at profits and at the prospect of reducing or eliminating the number of creative workers they need.
Why would I want to quantify it? The burden of proof is on the thief.
I have a gadget that will, with some probability, steal your life's savings. It operates through a process that is analogous to a human chewing. When engineering it, we just say for simplicity that the gadget "chews". Of course, that's only a metaphor -- machines can't chew.
But (and here's where your argument gets ridiculous), unless you can quantify the fact that my gadget can't chew, I will steal your savings. Good luck.
I think your question is incorrect. It’s very likely no one thinks it’s perfectly legal. There probably are many people who think it’s not a big deal, though. Try coming up with a dataset that doesn’t have any copyrighted material in it. Like, seriously, try. You can’t use pretty much anything newer than a century old. Everything is copyrighted by default. Very few new things are explicitly in the public domain or licensed in a way that would allow usage. Now imagine LLMs trained on early 20th century newspapers, books and letters. Do you think they would be good at generating code or hip copy for the homepage of your next startup?
> Now imagine LLMs trained on early 20th century newspapers, books and letters. Do you think they would be good at generating code or hip copy for the homepage of your next startup?
Not sure about the rest of the world, but at least for US content I don't think any company would publish that LLM.
That's like 40 years before the civil rights movement, and right about the time of the Tulsa massacre.
It's right around when women got the right to vote.
Trying to get it to not say anything horrible under modern standards seems fraught with issues. I don't know if it would even understand something like "don't be racist", given the context it was trained on.
Exactly. Copyright terms are so long that most material with expired copyright is not useful for modern uses of LLMs, and looking for modern non-copyrighted material is too hard to do quickly, with unclear payoff. So people who grew up with the Internet and are used to making memes out of copyrighted material are not exactly averse to doing it on a bigger scale.
1. Training an LLM is akin to human learning. It is legal to read a textbook about music to learn music, and later to write a book about music which likely includes some of the concepts you learned earlier.
2. Neither the LLM nor the output text contains sufficient elements of the copyrighted work for that work's protection to reach them. Just like if you turned old library books into compost and sold the compost, you wouldn't expect to pay the authors of those books a royalty on the compost sales.
> Training an LLM is akin to human learning. It is legal to read a textbook about music to learn music
If you learn a little too hard, though, and reproduce the original textbook in its entirety, you'll get in trouble.
My guess is that courts will determine that the training itself is not illegal, but that either the AI companies or the users will be found liable for reproducing copyrighted work in the output, and no one will want to hold liability for that.
If the work goes beyond fair use, it is a copyright violation. It doesn't matter if it was created by a person or an AI.
Technology that makes copyright violations easier/quicker has typically been found legal if "the technology in question had significant non-infringing uses".
This makes sense. It was allowed for the content to be read and used in certain ways (e.g. search engines or as references) without substantial reproduction. The NYT would then have to flag specific generated content as infringing a specific work which could then be judged as fair use or not on a case-by-case basis. If a particular site/company was repeatedly and/or primarily using substantial content then perhaps it could be 'delisted' as search engines do for links to pirated copies of works.
It really hinges on "substantially similar". If I copy Harry Potter and change every instance of Harry Potter to Michael Rose, surely it's infringing. If I write a coming-of-age story set in a magical land, I'm probably OK. Which do you think LLMs produce?
It's likely not capable of literally giving you Harry Potter. If you specify it narrowly enough that the output qualifies as fan fic, it's probably exactly what you were going for. After all, your word processor is capable of producing infringing works but is not itself an infringing work.
Fair use, probably. How many news pieces have you read that amount to "The New York Times reports..." followed by a summary of the Times' article? It's not illegal to use copyrighted works as a source, as inspiration, or to guide style.
Surely. Remember when the VCR came out, some parties absolutely freaked out, and Jack Valenti said
"I say to you that the VCR is to the American film producer and the American public as the Boston strangler is to the woman home alone."
Then we invented, from whole cloth, reasons why VCRs were perfectly OK: there was a ton of money to be made, everyone would actually be better off if the VCR existed, and everyone knew it, because the matter only ended up being argued after millions of VCRs were already in households.
Read about the 'fair use' doctrine and put yourself in the shoes of someone who is training a model, and see if you can argue, from their perspective, why it should be allowed.
Eventually the SCOTUS will have to decide if training is a form of fair use. There are plenty of arguments on both sides.
The LLMs built and trained 10 years from now will be much more advanced. Will it be possible for anyone to prove if specific content was used/not used in the training? If the courts rule against fair use, it will be a minefield to enforce.
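There is active research on exactly this question, under the name "membership inference". One common heuristic is that a model assigns suspiciously low loss (high probability) to text it memorized during training, compared with a paraphrase of the same content. A rough sketch of the measurement, using GPT-2 from the transformers library purely as a stand-in; this produces statistical evidence, not courtroom proof, which is part of why enforcement would be a minefield:

    # Compare model loss on a candidate passage vs. a paraphrase of it.
    # Markedly lower loss on the verbatim text hints at memorization.
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    def avg_loss(text: str) -> float:
        """Average negative log-likelihood the model assigns to `text`."""
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, labels=inputs["input_ids"])
        return out.loss.item()

    original = "Candidate passage suspected to be in the training data..."
    paraphrase = "The same content, reworded so only the ideas remain..."
    print(avg_loss(original), avg_loss(paraphrase))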
That's pretty much the entire point of many publications. You think readers of Financial Times aren't reading FT in the hopes of getting their own material gain? What about Wall St analysts? Consuming something for gain is not copyright infringement, distributing it for gain is.
Yet fair use can trump the owner's prohibitions. Your ISP can cache copyrighted materials, storing and reproducing them for other customers. Your browser stores the copyrighted images in your cache and 'reproduces' them if you browse the same page again.
Maybe getting too off topic for the thread, but it feels like equating machine and human output reaches a level of nihilism even I shudder at. I think (hope) there is intrinsic value in something being made by a human being even if a machine could do comparable work 100x faster.
Exactly this. If I read a blog summary of a paywalled article that enhances my knowledge and I use it to do my day job better, did I infringe on the original copyright?
If you regurgitate the paywalled article verbatim, as a service, for customers, then yes, you infringed. If you didn't, and you didn't build a system that has some probability of doing so, then no, you didn't. How is this so hard to understand?
If I had a gadget that might steal your life's savings, but assured you the probability was "near-zero", would you be ok with that?
Perhaps you personally would be fine with it. But would it be ok for a court declaring that someone has no recourse, and must accept such an uncompensated risk?
I'm surprised it took this long for a lawsuit to emerge.
Interestingly, I think this going through would actually help the big players (Google/MS/Apple) a lot in the medium term. They sit on huge amounts of training data from their own services and also have the money to acquire copyrighted material, while everyone else would have to build on datasets under a permissive license (like Wikipedia), since scraped data would be a minefield.
If I were OpenAI, I would tell the court that the LLMs have now been repaired to never reproduce long passages of any of the training data. I would implement that with a simple post-filter.
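A minimal sketch of what such a post-filter could look like, assuming a protected corpus and a 50-word window (both hypothetical; a production version would use suffix arrays or rolling hashes rather than a raw set of word windows):

    # Refuse to emit output containing a verbatim 50+ word run from the
    # protected corpus. Index every 50-word window, then scan the output.
    WINDOW = 50  # length, in words, of the disallowed verbatim run

    def build_index(corpus: list[str]) -> set[tuple[str, ...]]:
        index = set()
        for doc in corpus:
            words = doc.split()
            for i in range(len(words) - WINDOW + 1):
                index.add(tuple(words[i : i + WINDOW]))
        return index

    def contains_long_quote(output: str, index: set[tuple[str, ...]]) -> bool:
        words = output.split()
        return any(
            tuple(words[i : i + WINDOW]) in index
            for i in range(len(words) - WINDOW + 1)
        )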
I would then offer to pay damages of 3x revenue for all historic requests that resulted in 50+ words of Times articles being reproduced.
And an analysis of the logs will probably show that only happened tens of times and total revenue from those requests was like $3.
The likelihood of a use being considered fair diminishes if it negatively impacts the market value of the original work. For instance, if The New York Times (NYT) can show its content being quoted verbatim, that suggests the use is not transformative: the AI merely replicated the original content without significant alteration or added meaning.
In interactions with AI systems like ChatGPT, the user's intent and query nature significantly influence the AI's output. Regardless of user understanding, the fact that they engage with a generative prediction model is key. This novel use case is a first of its kind.
Similar to legal searches, where responsibility rests with the searcher, the intent behind AI queries — whether for research, information, or to demonstrate something about AI functionality — dictates the AI's response. The distinction between the AI's training data (e.g., articles from The Times) and the model's outputs is critical for evaluating transformative use. Each instance must be considered individually to ascertain this.
Just as intent is pivotal in legal contexts, it's also relevant when users interact with AI. If a user seeks verbatim content from sources like The NYT, responsibility could shift more towards them. This raises ethical questions about the intent behind the NYT's use of OpenAI's services.
If the NYT utilized OpenAI's services contrary to terms of use, such as for illegal activities or spreading misleading information, it would constitute a violation. Similarly, manipulating outputs to damage OpenAI's reputation would also breach these terms.
In copyright law, responsibility typically lies with the entity making copies or distributions. However, AI complicates this, as it generates content from various inputs. The user essentially initiates and guides this process through their prompts.
OpenAI could respond by filtering any NYT content and requiring users to agree to a legally binding contract. This contract would stipulate conditions for content usage in line with fair use principles, emphasizing a joint responsibility between the user and OpenAI. Such an approach aligns with other services where access is contingent on agreeing to specific terms.
He may be “author, futurist, thinker and systems architect,” but he doesn’t know how law works. He’s citing a prayer for relief against the specific defendants sued. A court doesn’t have power to order relief against a party not before it. Whatever models may exist beyond those made by the specific defendants in this case are not the targets of that language.
For years I've complained about how the MPAA, RIAA and big tech lobby politicians to wreck copyright and force all culture through an ownership sieve. But I might forgive them if it happens to prevent the commercialisation of all remaining human culture via AI.
"A few hundred million dollars" is not chump change to any organization; especially one that is in its growth phase. They raised $300M [1] in their last round so that would wipe out the all of the investment money they brought in.
It's reasonable for NYT to not want LLMs regurgitating their front page, but beyond that I don't care. I don't like that IP lasts so long, especially for content like news.
The NY Times is asking that all LLMs trained on Times data be destroyed ... because the Times knows that most of the stuff they publish is absolute bullshit, and they know that AI trained on that rubbish will be screaming at all of the inconsistencies.
"I call on everyone to defend my corporation, clearly the good guys, against the new York times, clearly the bad guys."
Just your usual smiling trendy unkempt beard sociopath corporate marketer in post-corporate business casual trying to hide as a folksy warrior of the people.
Or maybe he's an AI generated fuckface profile that cranks out ai generated propaganda for the AI corporation. Well, if he's real he should be prepared to be replaced by one soon, because word salad bullshittery is incontrovertibly the BEST thing llms are capable of.
There have been cases where a computer science professor claimed an LLM was generating his code verbatim.
Many individual artists have similar claims [0].
Now NYT has.
Individuals cannot foot the bill for lawyers, so they have no choice but to watch it all while being unable to do anything. The NYT apparently has the muscle to pull it off.
Morality or ethics don't stand a chance here; it is all about who can march the largest army of lawyers, which, clearly, is the Tech bro clan.
I'm sad for all the creators. And of course, those marketing it as "synthetic intelligence" are on a whole other level.
> Morality or ethics don't stand a chance here; it is all about who can march the largest army of lawyers, which, clearly, is the Tech bro clan.
There’s diminishing returns in this regard. Me suing Google has this dynamic, but the NYT is large enough to hire perfectly effective legal counsel. 1000 top lawyers won’t necessarily beat 100 top lawyers.
Would it be reasonable for NYT to demand that all humans forget anything learned from NYT articles?
The NYT is asserting a drastic expansion of copyright, to cover not just the specific expression of ideas but the ideas themselves.
We should be wary of AI companies, and copyright is woefully unprepared for our new world, but the answer is not to change the system so creators retain infinite control over who can even remember their work.
How many terabytes of data have you been exposed to in your life? Songs, movies, books, day to day vision and memories, etc.
Those all influenced you. You may not have perfect memory, but neither does GPT-4. You may remember a tiny subset of that data perfectly, just like GPT-4.
The inanity here is in seeking to extend copyright beyond reproduction and into learning and cultural accretion. It’s crazy to me that anyone would look at the past 100 years of copyright and think that what we need is publishers with legal control over reading.
Full retraining is prohibitively expensive, especially if it has to be done for cases like this where NYT initially permitted automated processing through robots.txt at the time of training but then changed their mind upon realizing the potential market in licensing content to developers. To my knowledge there also hasn't been any successful training of capable LLMs on just public domain data.
Machine learning, including foundation models and web-scale pretraining, has widespread uncontroversially beneficial applications (defect detection, language translation, spam/DDoS filtering, agriculture/weather/logistics modelling, etc.). The federal government has invested billions in AI, and is desperately trying to prevent China from taking the lead. NYT aren't really in a position here to dig their heels in with "not our problem if our deletion demand makes training AI in the US infeasible" - they likely know that's not going to fly, and are instead using it to prompt negotiation for licensing their content.
Why is that their problem? If they didn’t get permission to use copyrighted content why is it unreasonable for retraining to occur without their content?
And please spare me the humans learn too nonsense. Computers are not humans.
Humans mostly don't have the ability to reproduce articles verbatim years later, and those who can certainly don't have the capacity to serve copies and derivatives to the entire free world. Even if they did, they would quote and paraphrase selectively and non-competitively, with attribution to the source. LLMs do everything humans can't, with none of the guardrails or attribution.
I’m not a lawyer but it seems to me that the infringement would be in the publication of the output of the models, not in the mere existence of the models.
Let’s imagine a really obvious case of copyright infringement, where a shady publisher buys up copyrighted books, assembles them into a private library, then prints and sells copies. The infringement is the “prints and sells”, right? Not the “assembles a private library”?
If we accept that ChatGPT is committing copyright infringement (which at least seems true if it returns big chunks of the copyrighted material verbatim), I don’t think you can get from there to compelling deletion of the model.
The LLM is fixed in a tangible medium, and it (allegedly) includes an unlicensed copy of a copyrighted work. The LLM spitting that out to users is (allegedly) both an infringement and evidence that the model itself is infringing.
Isn’t that a massive misunderstanding of LLMs? If I have infinite monkeys and typewriters, that does not mean the typewriters (or monkeys) contain copyrighted text.
LLMs are basically an optimization of infinite monkeys using statistics.
I can see how outputs can be infringing copyright, but I’m having a very hard time seeing how the weights can be. It feels like saying a musician is infringing copyright if they understand a song well enough to be able to create an infringing reproduction, even if they don’t.
A lot of analogies break down here. This is just an early salvo in many arguments to come, that will need reasoned debate by competent people, judged by wise leaders. I really hope we can have that debate.
I'd argue that training LLMs on news sites makes them less useful anyway. It's not like these sites do any meaningful research. They just disseminate information readily available elsewhere.
"I call on @facebook, @cohere, @AnthropicAI, @MistralAI, @huggingface, @google and everyone else who cares about the future of AI to join the defense and smash the NYT in court for this overreaching attempt to twist and expand copyright to the determinant of pretty much everyone else on Earth but the NY Times."
Preach it louder! I think most of them (certainly Huggingface and Mistral) will be ready to help on this. Let's hope that big tech companies are savvy at getting better lawyers who can more clearly articulate the pro-LLM position to lawmakers.
What license requires attribution to read and understand the code? Or to use techniques from code for totally different purposes, embodied in different code?
You may have a point for verbatim regurgitation of code, but I’m having a hard time seeing a violation in ingesting code and learning how it works.
reading has never been one of the rights reserved to copyright holders under US copyright law, and in fact there are a number of carve-outs in US copyright law to ensure that copyright holders cannot prevent people from reading their work, such as the first-sale doctrine and the exemption for copying an executable into memory
there have at times been regimes that legally restricted the mere reading of published works, but such policies have generally been considered repugnant to liberal democracy and indeed liberalism in general
And who will preach morality, ethics, ownership, the right to retain your creative endeavours and benefit from them so that overall society keeps progressing and there's enough reward for putting labour into something creative and genuine?
Before someone comes up with the argument that LLMs are creative and genuine: yes, as a technique they are. The art form, the theory, and the science are a marvellous result of human ingenuity. Hats off.
But the end result is not.
Just like the tape recorder is a great invention, but the pirated music recorded on it is not.
This isn’t an appeal. There’s no general ability to submit amicus briefs. Well, there is a way to, but trial courts ignore them (as they should, because this is a case between specific parties, and unless someone has some sort of privity making them worth hearing from, outside opinions are just noise).
The lawsuit is against MS and OpenAI. They don't need to be called to action, they're already on the front lines. This is for everyone not already being sued.