NY Times is asking that all LLMs trained on Times data be destroyed (twitter.com/dan_jeffries1)
66 points by tosh 10 months ago | 133 comments



NYT's ask and my own are never going to happen, but I think forcing the models to be public domain would be the best outcome. It was trained on the creative output of a substantial portion of everyone online going back decades, including my own.


I think royalty and licensing fees are more likely. Some formula will be invented to calculate the value of the training data, and model makers will have to pay that out. But that only works for people with the ability to leverage the legal system. The rest of us will get told to pound sand.

NYT and similar publishers are just looking for a new income stream. Rent seeking in the new AI age will probably be more profitable than producing new content.


I don't mind if there's a little piece of me somewhere in that model!


Me neither, as long as I get to own it along with everyone else!


Speak for yourself. People who make their living in (actually) creative fields don’t have that luxury. We’ll see how all the brogrammers feel about industrial-scale plagiarism 5 years from now, when Gen AI models code better than any human.


I'm in an "(actually) creative" field, not a brogrammer, with 20 years of writing code, and I have a vastly different opinion than you do.


I've been writing code professionally for about 45 years now. People have cribbed my code and ideas since the beginning. At some point, I just stopped caring about it.

I no longer sell code, just my services.

I did have one case where someone stole my code, and then tried to sue me for copyright infringement. Too bad for him that I had a registered copyright on it :-)


As do many “creatives” who don’t understand the technology, much less the endgame of those who will control it.


Sorry do we know each other? I understand the technology fine.


No, just noting that you are in good company.


I think a substantial number of us don’t care or are happy to see it. It’s just another level of abstraction to work with. I think some artists feel the same although I understand those who don’t.


Presumably the "brogrammer" will still be the one feeding data into the model, stitching together the code it produces, vetting it, and deploying it, because it is going to be massively more effective to have him do it than the manager whose area of expertise is managing people and finance.

This will drastically increase the total output of all the "brogrammers", maybe even enough to replace some of the spaghetti code out there with more reliable work.


Human progress has always been about building upon and extending the work of others.

Besides, AI only produces a sort of "average" of what is already out there. Is it really creative work to produce an equivalent of that average?

I remember when the "typing pool" was a thing. Word processors utterly destroyed that category of work, as well as the jobs for typesetting and layout, back in the 1980s.


The typing pool converting one format to another is not equivalent. It’s more like the demise of administrative support professionals. That work just got shifted to the people they used to work for because it was now supposedly “easier” to manage themselves with more advanced information technology. Which is why we all have to waste massive amounts of time figuring out our HR stuff, arranging work travel, etc.


I would argue that the analogy rests on an incomplete consideration of the dimensions of the matter. Such tools amplified a worker’s inherent utility. The name given to the new tool reflects its nature: a processor of words. Whose words? The typist’s. “Other peoples’ word processor” vs. “word processor”.

More generally, AI introduces yet another systemic mechanism for wealth extraction by the wealthy. The wealth here is the creative power of the non-wealthy. More sinister than the grand larceny by a thousand minor borrowings is the fact that meaning, ideals, and the motivating energy to move masses are taken away from the candidate pool (any one of us can be the next demagogue; it’s not too late yet), a pool which may include genuine thought leaders who are going to be buried by the electric demagogue working for the proverbial man. “We need to hire more copywriters for our propaganda using these ‘word processors’” becomes “Who could resist the onslaught of ‘our’ creative efforts? Surrender now, Dorothy.”

tldr:

Humanity has reached the absolute limit of the utility of ancient means of governance. New technology demands that we comprehensively review the socio-economic order of society. Failure to do so will gift the current “winning” players in the zero-sum game of the regime du jour a near guarantee of perpetual habitation in their very, very special social perch.

“Think of the children”


In the 1800s, Germany started behind Britain and raced ahead of it in industrial might. One factor in this was that Germany did not recognize copyrights. Printers went looking for something, anything, to print, and printed up a storm of technical literature that enabled this rapid industrialization and the increase in wealth of its ordinary citizens.


Bless you Walter, you are a positive man. Let’s “hope” so, buddy.


Isn’t this already the case with the Industrial Revolution, where we began mass-producing (mostly) better day-to-day utility items rather than getting them from a specific craftsman?

I’d think that the clothing industry went through something similar already.


> I think forcing the models to be public domain would be the best outcome.

I would seriously consider taking it even further: require that all copyrighted material be made available for public model training.


This is nitpicky, but that includes the diary you keep under your bed. You probably mean something closer to "published".


The kulaks shouldn't be hoarding that grain. I think forcing the surplus to be public would be the best outcome.


I'm not sure the analogy really fits. As I see it, hypothetical farmers aren't an elite class that gets rich from copying and transforming the communications and creative output of billions(?) of unknowing people.


The Bolsheviks would beg to differ on whether the kulaks were an elite class.

When LLMs reproduce someone else's work, it is theft and appropriation, and our tech culture is rationalizing it out of hatred of the laptop class -- the kulaks, in my analogy. These traditional media do have their faults, and it's fine to point those out. But reporting is work, consisting of more than just "copying and transforming", and stealing it is wrong. "[F]orcing" it into the public domain is no more a "best outcome" than forcing the farms to collectivize.


There's a trivial solution here. Stick all the NYT's still-in-copyright text in a big database. Every time the chatbot produces a paragraph, check it against the NYT-censor. If it's more than 70% similar to anything in the database (mythical 30% rule), have the chatbot rephrase it. Thus, the chatbot is no longer reproducing someone else's work.

I'm going to go out on a limb and say that solution wouldn't be acceptable to the NYT, which is why I think they're trying for a land grab. They're trying to extend copyright beyond what was intended.
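A minimal sketch of the check described above, assuming a hypothetical in-memory list of still-in-copyright NYT paragraphs; Python's difflib stands in for whatever fuzzy matching a real system would actually use:

    import difflib

    def similarity(a: str, b: str) -> float:
        # Rough similarity ratio in [0, 1]; 0.7 maps to the "70% similar" threshold above.
        return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def needs_rephrase(paragraph: str, nyt_corpus: list[str], threshold: float = 0.7) -> bool:
        # Linear scan for illustration only; a production system would use an
        # approximate-match index (e.g. MinHash/LSH) rather than comparing
        # against every stored paragraph.
        return any(similarity(paragraph, ref) >= threshold for ref in nyt_corpus)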


I would go one step further: retrain without any NYT data and any references to the NYT. As far as any LLM user is concerned, the NYT would not exist.


I understand what you're saying now; my comment was pretty flippant in retrospect. I didn't mean to devalue the NYT's work; my suggestion was hypothetically intended to liberate the free moat OpenAI got from the fruits of the NYT's labor. In retrospect it's obvious that putting GPT's model in the public domain also puts the NYT's work in the public domain, but that hadn't occurred to me. Thanks for the food for thought. Now I kind of like the idea of destroying the existing models if a royalty system can't be figured out, sort of a "reset" to force them to do it properly.


Can someone with actual fundamental understanding of LLMs explain to me why they think it's perfectly legal to train models on copyrighted material? I don't know enough about this. Please don't answer by asking chatgpt.


Consider how commercial search engines are fine to show text snippets, thumbnails and site caches.

AI developers will most likely rely on a Fair Use defense. I think this has a reasonable chance of success since, while the use of a given copyrighted work may affect the market for that work (in this case NYT's article), it can be argued to be highly transformative usage. As in Campbell v. Acuff-Rose Music: "The more transformative the new work, the less will be the significance of other factors", defined as "whether the new work merely 'supersede[s] the objects' of the original creation [...] or instead adds something new".

There's also potential for an "implied license", as in Field v. Google Inc for rehosting a snapshot of a site, where "Google reasonably interpreted absence of meta-tags as permission to present 'Cached' links to the pages of Field's site". As far as I can tell in this case, NYT's robots.txt of the time was obeyed, which permitted automated processing of all but one specific article for some reason.
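For reference, robots.txt compliance is a mechanical check; here is a minimal sketch using Python's standard library (the crawler name and article URL are illustrative stand-ins, not the actual ones involved):

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("https://www.nytimes.com/robots.txt")
    rp.read()  # fetch and parse the site's live robots.txt

    # Ask whether a given user-agent may fetch a given URL.
    allowed = rp.can_fetch("ExampleBot", "https://www.nytimes.com/2023/01/01/example.html")
    print(allowed)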


> AI developers will most likely rely on a Fair Use defense.

Probably. The question for the courts to decide, then, is how much use is considered fair use.


Why do you think it is legal to train students on copyrighted material? Copyright is supposed to protect against unauthorized reproduction, not unauthorized learning. That the NY Times is able to show some verbatim reproduction is a real legal issue, but that should not be extended to training generally.


Students are humans. LLMs are not. Machine "learning" is a metaphor, not what's actually happening. Stop anthropomorphizing, and show some loyalty to your species.


Bizarre loyalty argument aside, why is it not learning? Can you quantify that statement?


The loyalty argument does sound somewhat bizarre, but I think the overarching point is about whether technology use benefits humans in society or not. We should not implicitly grant LLMs owned by corporations the same rights as humans. Without some form of legislation, LLMs look likely to benefit corporations that are salivating at the profits and the prospect of reducing or eliminating the number of creative workers they need.


Why would I want to quantify it? The burden of proof is on the thief.

I have a gadget that will, with some probability, steal your life's savings. It operates through a process that is analogous to a human chewing. When engineering it, we just say for simplicity that the gadget "chews". Of course, that's only a metaphor -- machines can't chew.

But (and here's where your argument gets ridiculous), unless you can quantify the fact that my gadget can't chew, then I will steal your savings. Good luck.


> I have a gadget that will, with some probability, steal your life's savings.

I can think of two instances of that machine already: finance-industry fees and an ex-wife.


Why can't machines chew? That's an even weirder analogy, it would be quite easy to make a machine that chews exactly like humans do.


I think your question is incorrect. It’s very likely no one thinks it’s perfectly legal. There probably are many people who think it’s not a big deal, though. Try coming up with a dataset that doesn’t have any copyrighted material in it. Seriously, try. You can’t use pretty much anything newer than a century old. Everything is copyrighted by default; very few new things are explicitly in the public domain or licensed in a way that would allow this usage. Now imagine LLMs trained on early 20th-century newspapers, books, and letters. Do you think they would be good at generating code, or hip copy for the homepage of your next startup?


> Now imagine LLMs trained on early 20th-century newspapers, books, and letters. Do you think they would be good at generating code, or hip copy for the homepage of your next startup?

Not sure about the rest of the world, but at least for US content I don't think any company would publish that LLM.

That's like 40 years before the civil rights movement, and right about the time of the Tulsa massacre.

It's right around when women got the right to vote.

Trying to get it to not say anything horrible by modern standards seems fraught with issues. I don't know if it would even understand something like "don't be racist", given the context it was trained on.


Exactly. Copyright terms are so long that most material with expired copyright is not useful for modern uses of LLMs, and looking for modern non-copyrighted material is too hard to do quickly, with unclear usefulness. So people who grew up with the Internet and are used to making memes with copyrighted material are not exactly averse to doing it on a bigger scale.


> Try coming up with a dataset that doesn’t have any copyrighted material in them.

Isn't this what Mistral AI did?


Did they? That'd be interesting to take a look at. Do they publish contents of their dataset?


The raw weights are here: https://docs.mistral.ai/models/


I think the main arguments are:

1. Training an LLM is akin to human learning. It is legal to read a textbook about music to learn music, and later to write a book about music which likely includes some of the concepts you earlier learned.

2. Neither the LLM nor the output text contains sufficient elements of the copyrighted work for copyright protection to attach. Just like if you turned old library books into compost and sold the compost, you wouldn't expect to pay the authors of those books a royalty on the compost sales.


> Training an LLM is akin to human learning. It is legal to read a textbook about music to learn music

If you learn a little too hard, though, and reproduce the original textbook in its entirety, you'll get in trouble.

My guess is that courts will find the training itself legal, but that either the AI companies or the users will be found liable for reproducing copyrighted work in output, and no one will want to hold that liability.


I feel like there’s no way 1 will fly. Very soon AI and humans will explicitly have to follow different laws, because they operate very differently.


I, a human, can read a copyrighted work and then write a new work and own the copyright on that new work as long as it is not substantially the same.


What if you produce a substantially similar work?

Who owns the copyright then?


If the work goes beyond fair use, it is a copyright violation. It doesn't matter if it was created by a person or an AI.

Technology that makes copyright violations easier/quicker has typically been found legal if "the technology in question had significant non-infringing uses".


This makes sense. It was allowed for the content to be read and used in certain ways (e.g. search engines or as references) without substantial reproduction. The NYT would then have to flag specific generated content as infringing a specific work which could then be judged as fair use or not on a case-by-case basis. If a particular site/company was repeatedly and/or primarily using substantial content then perhaps it could be 'delisted' as search engines do for links to pirated copies of works.


It really hinges on "substantially similar". If I copy Harry Potter and change every instance of "Harry Potter" to "Michael Rose", surely it's infringing. If I write a coming-of-age story set in a magical land, I'm probably OK. Which do you think LLMs produce?


Whatever you ask for, and the model will be judged by its accuracy.

If you ask for Harry Potter and it gives you Bart Simpson, it’s useless.


It's likely not capable of literally giving you Harry Potter. If you specify it narrowly enough that it qualifies as fan fic, it's probably exactly what you were going for. After all, your word processor is capable of producing infringing works but is not itself an infringing work.


Fair use, probably. How many news pieces have you read that amount to "The New York Times reports...", followed by a summary of the Times' article? It's not illegal to use copyrighted works as a source, as inspiration, or to guide style.


Surely. Remember when the VCR came out and some parties absolutely freaked out and Jack Valenti said

"I say to you that the VCR is to the American film producer and the American public as the Boston strangler is to the woman home alone."

Then we invented, from whole cloth, reasons why VCRs were perfectly OK, because there was a ton of money to be made and everyone would actually be better off if the VCR was a thing. And everyone knew it, because the question only ended up being argued after millions of VCRs were already in households.


I was thinking of the VCR as well. SCOTUS ruled that "the technology in question had significant non-infringing uses" making VCRs legal.


It’s currently not settled either way. There are lawsuits in progress to determine that very answer.


Read about the 'fair use' doctrine and put yourself in the shoes of someone who is training a model, and see if you can argue, from their perspective, why it should be allowed.


We all "train" ourselves on copyrighted materials and later use (or don't) the knowledge we gained, for our own benefit, be it financial or pleasure.

They're just on a hunt for some extra money.


Humans aren't computers. Come on, people.


I'm essentially a meat-based LLM and I'm trained almost exclusively on copyrighted material (most of which I pirated).


Eventually the SCOTUS will have to decide if training is a form of fair use. There are plenty of arguments on both sides.

The LLMs built and trained 10 years from now will be much more advanced. Will it be possible for anyone to prove whether specific content was used in the training? If the courts rule against fair use, it will be a minefield to enforce.


While I don’t necessarily agree with the NYT, I fail to see how or why LLMs are entitled to consume other people’s work for their own material gain.


That's pretty much the entire point of many publications. You think readers of Financial Times aren't reading FT in the hopes of getting their own material gain? What about Wall St analysts? Consuming something for gain is not copyright infringement, distributing it for gain is.


The people who read the FT usually pay for it. Most of these LLMs are trained on a set of pirated content that they didn't pay for - https://shkspr.mobi/blog/2023/07/fruit-of-the-poisonous-llam...

Most copyrighted works will specifically say that the customer / user is prohibited from storing and reproducing those works.


Yet fair use can trump the owner's prohibitions. Your ISP can cache copyrighted materials, storing and reproducing them for other customers. Your browser stores the copyrighted images in your cache and 'reproduces' them if you browse the same page again.

It's a complicated area, not clear-cut at all.


If it’s illegal to make any material gain off skills learned through other people’s work, we’re all criminals.


Computers aren't humans.

I feel like I'm going to be saying that a lot in the coming years, as more and more people's brains get broken by false anthropomorphization.


Maybe getting too off topic for the thread, but it feels like equating machine and human output reaches a level of nihilism even I shudder at. I think (hope) there is intrinsic value in something being made by a human being even if a machine could do comparable work 100x faster.


On this point, you and I agree.


Exactly this. If I read a blog summary of a paywalled article that enhances my knowledge and I use it to do my day job better, did I infringe on the original copyright?


If you regurgitate the paywalled article verbatim, as a service, for customers, then yes, you infringed. If you didn't, and you didn't build a system that has some probability of doing so, then no, you didn't. How is this so hard to understand?


Because it’s a hard problem! There are nuances to this complex problem that need to be thought through before reducing it too much.

In this case, then, regurgitation is the problem, not the fact that the material was ‘read’.

If the models ensured that the probability of regurgitation were near zero, would that be OK?


If I had a gadget that might steal your life's savings, but assured you the probability was "near-zero", would you be ok with that?

Perhaps you personally would be fine with it. But would it be ok for a court declaring that someone has no recourse, and must accept such an uncompensated risk?



What is nitter?


An alternative open-source Twitter front-end where you can browse posts without having to sign in.

More here: https://en.wikipedia.org/wiki/Nitter


Also, Nitter loads in about 1 second. I just compared it with Twitter; Twitter takes 14 seconds.


It’s like Gretzky says, you miss 100% of the shots you don’t take.


I'm surprised it took this long for a lawsuit to emerge.

Interestingly, I think this going through would actually help the big players (Google/MS/Apple) a lot in the medium term. They sit on huge amounts of training data from their own services and also have the money to acquire copyrighted material, while everyone else would have to build on datasets under a permissive license (like Wikipedia), since scraped data would be a minefield.


If I were OpenAI, I would tell the court that the LLMs have now been repaired to never reproduce long passages of any of the training data. I would implement that with a simple post-filter.

I would then offer to pay damages of 3x revenue for all historic requests that resulted in 50+ words of Times articles to be reproduced.

And an analysis of the logs would probably show that this only happened tens of times, and that total revenue from those requests was something like $3.
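A minimal sketch of such a post-filter, under the assumption that "long passages" means a 50-word verbatim run; the response/article variable names are hypothetical:

    def has_verbatim_run(output: str, source: str, n: int = 50) -> bool:
        # True if any n consecutive words of `output` appear word-for-word in `source`.
        out_w, src_w = output.split(), source.split()
        if len(out_w) < n or len(src_w) < n:
            return False
        src_ngrams = {tuple(src_w[i:i + n]) for i in range(len(src_w) - n + 1)}
        return any(tuple(out_w[i:i + n]) in src_ngrams
                   for i in range(len(out_w) - n + 1))

    # e.g.: if has_verbatim_run(response, times_article): rephrase(response)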


The likelihood of a use being considered fair diminishes if it negatively impacts the market value of the original work. And if The New York Times (NYT) can show its content being quoted verbatim, that suggests the use is not transformative: the AI merely replicated the original content without significant alteration or added meaning.

In interactions with AI systems like ChatGPT, the user's intent and query nature significantly influence the AI's output. Regardless of user understanding, the fact that they engage with a generative prediction model is key. This novel use case is a first of its kind.

Similar to legal searches, where responsibility rests with the searcher, the intent behind AI queries — whether for research, information, or to demonstrate something about AI functionality — dictates the AI's response. The distinction between the AI's training data (e.g., articles from The Times) and the model's outputs is critical for evaluating transformative use. Each instance must be considered individually to ascertain this.

Just as intent is pivotal in legal contexts, it's also relevant when users interact with AI. If a user seeks verbatim content from sources like The NYT, responsibility could shift more towards them. This raises ethical questions about the intent behind the NYT's use of OpenAI's services.

If the NYT utilized OpenAI's services contrary to terms of use, such as for illegal activities or spreading misleading information, it would constitute a violation. Similarly, manipulating outputs to damage OpenAI's reputation would also breach these terms.

In copyright law, responsibility typically lies with the entity making copies or distributions. However, AI complicates this, as it generates content from various inputs. The user essentially initiates and guides this process through their prompts.

OpenAI could respond by filtering any NYT content and requiring users to agree to a legally binding contract. This contract would stipulate conditions for content usage in line with fair use principles, emphasizing a joint responsibility between the user and OpenAI. Such an approach aligns with other services where access is contingent on agreeing to specific terms.


> If a user seeks verbatim content from sources like The NYT, responsibility could shift more towards them.

gee why didn't the piratebay think of that

"we're just a search engine, if you search for Disney movies that's your fault, you even agreed to it in our terms of use!"

ChatGPT wouldn't be useful at all without mass unauthorised use of copyrighted material (same as the piratebay)

It's just that ChatGPT has been tarted up with a facade of respectability.


Related ongoing thread:

Things are about to get worse for generative AI - https://news.ycombinator.com/item?id=38814093 - Dec 2023 (548 comments)

Also:

NY Times copyright suit wants OpenAI to delete all GPT instances - https://news.ycombinator.com/item?id=38790255 - Dec 2023 (870 comments)

NYT sues OpenAI, Microsoft over 'millions of articles' used to train ChatGPT - https://news.ycombinator.com/item?id=38784194 - Dec 2023 (84 comments)

The New York Times is suing OpenAI and Microsoft for copyright infringement - https://news.ycombinator.com/item?id=38781941 - Dec 2023 (861 comments)

The Times Sues OpenAI and Microsoft Over A.I.’s Use of Copyrighted Work - https://news.ycombinator.com/item?id=38781863 - Dec 2023 (11 comments)



He may be “author, futurist, thinker and systems architect,” but he doesn’t know how law works. He’s citing a prayer for relief against the specific defendants sued. A court doesn’t have power to order relief against a party not before it. Whatever models may exist beyond those made by the specific defendants in this case are not the targets of that language.


For years I've complained about how the MPAA, RIAA and big tech lobby politicians to wreck copyright and force all culture through an ownership sieve. But I might forgive them if it happens to prevent the commercialisation of all remaining human culture via AI.


The end result will be that OpenAI pays the Times a few hundred million dollars for a license. Chump change for OpenAI, but real money for the Times.


"A few hundred million dollars" is not chump change to any organization; especially one that is in its growth phase. They raised $300M [1] in their last round so that would wipe out the all of the investment money they brought in.

https://techcrunch.com/2023/04/28/openai-funding-valuation-c...


It sure as shit was for Snowflake, which purchased Streamlit for $800 million [1]. Streamlit is a Python web-UI front-end.

https://techcrunch.com/2022/03/02/snowflake-acquires-streaml...


They can’t. Thousands of other orgs would consider that precedent and get in line.


It's reasonable for NYT to not want LLMs regurgitating their front page, but beyond that I don't care. I don't like that IP lasts so long, especially for content like news.


NY Times is asking that all LLMs trained on Times data be destroyed... because the Times knows that most of the stuff they publish is absolute bullshit, and they know that AI trained on that rubbish will be screaming at all of the inconsistencies.


Why call Facebook and Co.?

Call Microsoft.

I bet the NYT relies on MS's OS, Software and Cloud services.

That's a lot more leverage.


Microsoft and OpenAI are the ones being sued. The call is for other AI companies to help them defend the case. Microsoft doesn't have a choice here.


I doubt that Copilot is trained on NYT texts, and MS isn't the owner of any GPT.


https://www.documentcloud.org/documents/24241000-2023-12-27-...

Microsoft is the first defendant listed. You can read the Times' claims against Microsoft.


They don't really.


Who is this guy, basically doing the equivalent of cheering on Stalin to collectivise farmland?


He’s really snarky in the replies too, telling Simon Willison “your takes are usually better. Do better next time please. :)” [0]

So that’s a window into this guy’s personality.

[0] https://x.com/dan_jeffries1/status/1741126530590294264


"I call on everyone to defend my corporation, clearly the good guys, against the new York times, clearly the bad guys."

Just your usual smiling trendy unkempt beard sociopath corporate marketer in post-corporate business casual trying to hide as a folksy warrior of the people.

Or maybe he's an AI generated fuckface profile that cranks out ai generated propaganda for the AI corporation. Well, if he's real he should be prepared to be replaced by one soon, because word salad bullshittery is incontrovertibly the BEST thing llms are capable of.


There have been issues where a computer science professor claimed an LLM was verbatim generating the code he wrote.

Many individual artists have similar claims [0].

Now the NYT has one.

Individuals cannot foot the bill for lawyers, so they have no choice but to watch it all unfold while being unable to do anything. The NYT apparently has the muscle to pull it off.

Morality and ethics don't stand a chance here; it is all about who can march out the largest army of lawyers, which, clearly, is the tech-bro clan.

I'm sad for all the creators. And of course, those marketing it as "synthetic intelligence" are on a whole other level.

[0]. https://youtu.be/jsggOAMQX3Q



> Morality or ethics don't stand a chance here, it is all about who can march the largest army of lawyers which clearly, is the Tech bro clan.

There’s diminishing returns in this regard. Me suing Google has this dynamic, but the NYT is large enough to hire perfectly effective legal counsel. 1000 top lawyers won’t necessarily beat 100 top lawyers.


Why is it unreasonable to request that they retrain without their content, exactly?


Would it be reasonable for NYT to demand that all humans forget anything learned from NYT articles?

The NYT is asserting a drastic expansion of copyright, to cover not just the specific expression of ideas but the ideas themselves.

We should be wary of AI companies, and copyright is woefully unprepared for our new world, but the answer is not to change the system so creators retain infinite control over who can even remember their work.


Computers are not humans. Why do people continue to make these inane comparisons?

Show me a human can remember even half of what GPT4 was trained on and I’ll concede the point.


How many terabytes of data have you been exposed to in your life? Songs, movies, books, day to day vision and memories, etc.

Those all influenced you. You may not have perfect memory, but neither does GPT4. You may remember a tiny subset of that data perfectly, just like GPT4.

The inanity here is in seeking to extend copyright beyond reproduction and into learning and cultural accretion. It’s crazy to me that anyone would look at the past 100 years of copyright and think that what we need is publishers with legal control over reading.


Full retraining is prohibitively expensive, especially if it has to be done for cases like this where NYT initially permitted automated processing through robots.txt at the time of training but then changed their mind upon realizing the potential market in licensing content to developers. To my knowledge there also hasn't been any successful training of capable LLMs on just public domain data.

Machine learning, including foundation models and web-scale pretraining, has widespread uncontroversially beneficial applications (defect detection, language translation, spam/DDoS filtering, agriculture/weather/logistics modelling, etc.). The federal government has invested billions in AI, and is desperately trying to prevent China from taking the lead. NYT aren't really in a position here to dig their heels in with "not our problem if our deletion demand makes training AI in the US infeasible" - they likely know that's not going to fly, and are instead using it to prompt negotiation for licensing their content.


I don't think it's practical. It would almost be like training from scratch.


Why is that their problem? If they didn’t get permission to use copyrighted content why is it unreasonable for retraining to occur without their content?

And please spare me the humans learn too nonsense. Computers are not humans.


Humans mostly don't have the ability to reproduce articles verbatim years later, and those who can certainly don't have the capacity to serve copies and derivatives to the entire free world. And even if they did, they would quote and paraphrase selectively and non-competitively, with attribution to the source. LLMs do all the things humans can't, with none of the guardrails or attribution.


A last grasp at relevance from the NYT


I’m not a lawyer but it seems to me that the infringement would be in the publication of the output of the models, not in the mere existence of the models.

Let’s imagine a really obvious case of copyright infringement, where a shady publisher buys up copyrighted books, assembles them into a private library, then prints and sells copies. The infringement is the “prints and sells”, right? Not the “assembles a private library”?

If we accept that ChatGPT is committing copyright infringement (which at least seems true if it returns big chunks of the copyrighted material verbatim), I don’t think you can get from there to compelling deletion of the model.


The LLM is fixed somewhere in a tangible medium, and it (allegedly) includes an unlicensed copy of a copyrighted work. The LLM spitting that out to users is (allegedly) both an infringement and evidence that the model itself is infringing.


Isn’t that a massive misunderstanding of LLMs? If I have infinite monkeys and typewriters, that does not mean the typewriters (or monkeys) contain copyrighted text.

LLMs are basically an optimization of infinite monkeys using statistics.

I can see how outputs can be infringing copyright, but I’m having a very hard time seeing how the weights can be. It feels like saying a musician is infringing copyright if they understand a song well enough to be able to create an infringing reproduction, even if they don’t.


A lot of analogies break down here. This is just an early salvo in many arguments to come, that will need reasoned debate by competent people, judged by wise leaders. I really hope we can have that debate.


Simply duplicating copyrighted data is itself a crime in certain jurisdictions.


I'd argue that training LLM's on news sites makes them less useful anyway. It's not like these sites do any meaningful research. They just disseminate information readily available elsewhere.


And LLMs can do meaningful research? Or even consistently produce meaningful output without ever hallucinating?


Follow-up Q: can you untrain an LLM, or is OpenAI going to have to retrain without the articles?


"I call on @facebook, @cohere, @AnthropicAI, @MistralAI, @huggingface, @google and everyone else who cares about the future of AI to join the defense and smash the NYT in court for this overreaching attempt to twist and expand copyright to the determinant of pretty much everyone else on Earth but the NY Times."

Preach it louder! I think most of them (certainly Huggingface and Mistral) will be ready to help on this. Let's hope the big tech companies are savvy enough to get better lawyers who can more clearly articulate the pro-LLM position to lawmakers.


If an LLM scrapes a source code repository that requires attribution, will you also preach to tech companies to go after said license?


What license requires attribution to read and understand the code? Or to use techniques from code for totally different purposes, embodied in different code?

You may have a point for verbatim regurgitation of code, but I’m having a hard time seeing a violation in ingesting code and learning how it works.


> What license requires attribution to read and understand the code?

BSD, MIT, several others. Without a license you cannot read the code.


reading has never been one of the rights reserved to copyright holders under us copyright law, and in fact there are a number of carve-outs in us copyright law to ensure that copyright holders cannot prevent people from reading their work, such as the first-sale doctrine and the exemption for copying an executable into memory

there have at times been regimes that legally restricted the mere reading of published works, but such policies have generally been considered repugnant to liberal democracy and indeed liberalism in general


And who will preach morality, ethics, ownership, the right to retain your creative endeavours and benefit from them, so that society overall keeps progressing and there's enough reward for putting labour into something creative and genuine?

Before someone comes up with the argument that LLMs are creative and genuine, yes as a technique they are. The art form, the theory and science is marvellous result of human ingenuity. Hats off.

But the end result, is not.

Just like the tape recorder is a great invention, but the pirated music recorded on it is not.


That using copyrighted text for a commercial goal is not a breach of copyright?

It's gonna take Robert Kardashian to get that through


AFAIK compared to OpenAI and Microsoft, Huggingface and Mistral have no money.

I suppose they could file an amicus brief restating what OpenAI's lawyers will say, but beyond that, what are they supposed to do?


This isn’t an appeal. There’s no general ability to submit amicus briefs. Well, there is a way to, but trial courts ignore them (as they should, because this is a case between specific parties, and unless someone has some sort of privity making them worth hearing from, outside opinions are just noise).


What is "the pro-LLM position" in this case?


Weird that he leaves OpenAI and Microsoft out of the list. I assume it's some Game of Thrones-style omission indicating where his loyalties lie.


The lawsuit is against MS and OpenAI. They don't need to be called to action, they're already on the front lines. This is for everyone not already being sued.


They are the target of the lawsuit. He is saying all these other LLMs have a target on their back and should be coming to help Microsoft and OpenAI.



