
Solidly rooting for NYT on this - it’s felt like many creative organizations have been asleep at the wheel while their lunch gets eaten for a second time (the first being at the birth of modern search engines.)

I don’t necessarily fault OpenAI’s decision to initially train their models without entering into licensing agreements - they probably wouldn’t exist and the generative AI revolution may never have happened if they put the horse before the cart. I do think they should quickly course correct at this point and accept the fact that they clearly owe something to the creators of content they are consuming. If they don’t, they are setting themselves up for a bigger loss down the road and leaving the door open for a more established competitor (Google) to do it the right way.




For all the leaks about secret projects, novel training algorithms no longer being published so as to preserve market share, custom hardware, Q* learning, and internal politics at the companies at the forefront of state-of-the-art LLMs... the thunderous silence is the lack of leaks about the exact datasets used to train the main commercial LLMs.

It is clear that OpenAI and Google did not use only Common Crawl. With so many press conferences, why has no research journalist yet asked OpenAI or Google to confirm or deny whether they use or used LibGen?

Did OpenAI really buy an ebook of every publication from Cambridge Press, Oxford Press, Manning, APress, and so on? Did any investor's due diligence include researching the legality of the content used for training?


I'm not for or against anything at this point until someone gets their balls out and clearly defines what copyright infringement means in this context.

If I give a kid a bunch of books all by the same author, then pay that kid to write a book in a similar style, and then go on to sell that book... have I somehow infringed copyright?

The kid's book at best is likely to be a very convincing facsimile of the original author's work... but not the author's work.

It seems to me that the only solution for artists is to charge for access to their work in a secure environment then lobotomise people on the way out.

The endgame seems to be "you can view and enjoy our work, but if you want to learn from or be inspired by it, that's not on"


There are two problems with the “kid” analogy:

a) In many closely comparable scenarios, yes, it’s copyright infringement. When Francis Ford Coppola made The Godfather film, he couldn’t just be “inspired” by Puzo’s book. If the story or characters or dialog are similar enough, he has to pay Puzo, even if the work he created was quite different and not a literal “copy”.

b) Training an LLM isn’t like giving someone a book. Among other things, it involves making a derivative copy into GPU memory. This copy is not a transitory copy in service of a fair use, nor likely a fair use in itself, nor licensed by the rights-holder.


> This copy is not a transitory copy in service of a fair use

Training is almost certainly fair use, so it's exactly a transitory copy in service of fair use. Training, other than the brief "transitory copy" you mention, is not copying; it's making a minuscule algorithmic adjustment based on fleeting exposure to the data.
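
To picture the claim (a toy sketch only, with numpy standing in for real training infrastructure; nothing here resembles a production LLM): the text is held in memory transiently, and the only persistent artifact is a small numeric change to the weights.

    import numpy as np

    # Toy illustration of one training step: the text is held only
    # transiently; what persists is a tiny adjustment to the weights.
    rng = np.random.default_rng(0)
    weights = rng.normal(size=1000)  # persistent model parameters

    def training_step(weights, text, lr=1e-4):
        # transient, in-memory copy of the training text
        tokens = np.array([ord(c) % 256 for c in text], dtype=float)
        # stand-in for a real gradient: some cheap function of the data
        grad = np.resize(tokens, weights.shape) / 256.0
        return weights - lr * grad  # the text itself is discarded here

    weights = training_step(weights, "All the news that's fit to print.")

Whether that transient copy is legally "in service of a fair use" is, of course, exactly what is in dispute.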


Why is training “almost certainly” fair use?

Congress took the circuit holding in MAI Systems seriously enough to carve out a new fair use exception for copying software—entirely within the memory system of a licensed user—in service of debugging it.

If it took an act of Congress to make “unlicensed” debugging a fair use copy…


If you overtrain, the model may memorize verbatim copies of your training material, and may be able to produce verbatim copies of the original in its output.

If Microsoft truly believes that the trained output doesn't violate copyright then it should be forced to prove that by training it on all its internal source code, including Windows.
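
This kind of regurgitation is typically demonstrated with a simple probe: feed the model the opening of a protected text and check whether it completes the rest verbatim. A minimal sketch, where generate is a hypothetical stand-in for whatever completion API is being tested:

    def is_memorized(generate, document, prefix_chars=200, min_match=300):
        """Return True if the model reproduces the document verbatim
        beyond the supplied prefix. `generate` is a placeholder for
        whatever completion API is being probed; assumes the document
        is at least prefix_chars + min_match characters long."""
        prefix = document[:prefix_chars]
        completion = generate(prefix)  # the model's continuation of the prefix
        expected = document[prefix_chars:prefix_chars + min_match]
        return completion[:min_match] == expected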


> If the story or characters or dialog are similar enough, he has to pay Puzo, even if the work he created was quite different and not a literal “copy”.

I don't think you can copyright a plot or story in any country, can you?

If he had rewritten the story with different characters and different lines, he wouldn't have had to pay Puzo. I'm sure it would have been frowned upon if it were too close, but legally OK.


>This copy is not a transitory copy in service of a fair use, nor likely a fair use in itself,

Seems entirely transitory, and since the output cannot be copyrighted, it does no harm to any work it was "trained" on.


How is it a copy at all? Surely the model weights would therefore be much larger than the corpus of training data, which is not the case at all.
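
A rough back-of-envelope on that size claim, using the public figures from the GPT-3 paper (estimates, not authoritative):

    # Can the weights be a byte-for-byte copy of the corpus? Rough numbers:
    params = 175e9        # GPT-3 parameter count (public figure)
    bytes_per_param = 2   # fp16 storage
    weights_gb = params * bytes_per_param / 1e9   # ~350 GB

    corpus_gb = 570       # filtered training text reported for GPT-3
    raw_crawl_tb = 45     # raw Common Crawl before filtering, per the paper

    print(f"weights ~{weights_gb:.0f} GB vs ~{corpus_gb} GB filtered text "
          f"(~{raw_crawl_tb} TB raw)")
    # The weights are smaller than even the filtered text, so wholesale
    # storage is impossible; partial memorization, however, is not.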

If it disgorges parts of NYT articles, how do we know this is not a common phrase, or the article isn't referenced verbatim on another, unpaid site?

I agree that if it uses the whole content of their articles for training, then NYT should get paid, but I'm not sure that they specifically trained on "paid NYT articles" as a topic, though I'm happy to be corrected.

I also think that companies and authors extremely overvalue the tiny fragments of their work in the huge pool of training data, I think there's a bit of a "main character" vibe going on.


Regarding (b) ... while a specific method of training that involved persistent copying may indeed be a violation, it is far from clear that the general notion of "send server request for URL, digest response in software that is not a browser" is automatically a violation. If there is deemed to be a difference (i.e. all you are allowed to do without a license is have a human read it in a browser), then one can see training mechanisms changing to accommodate that.


It’s all about the purpose the transitory copy serves. The mechanism doesn’t really matter, so you can’t make categorical claims about (say) non-browser requests.


I don't have a comment on your hypothetical, but this case seems to go far beyond that. If you read the actual filing at the bottom of the linked page, NYT provides examples where ChatGPT recited exact multi-paragraph sections of their articles and tried to pass them off as its own words. Plainly reproducing a work is pretty much the only situation where "is this copyright violation?" isn't really in flux. It's not dissimilar to selling PDFs of copyrighted books.

If NYT were fully relying on the argument that training a model in wordcraft using their materials is always copyright violation, or only had short quotes to point to, the philosophical debate you're trying to have would be more relevant.


Importantly, the kid (an individual human) got some wealth somewhat proportional to their effort. There's non-trivial effort in recruiting the kid. We can't clone the kid's brain a million times and run it for pennies.

There are differences, ethical, political, and otherwise, between an AI doing something and a human doing the exact same thing. Those differences may need reflecting in new laws.

IANAL and don't have any positive suggestions for good laws; I'm just pointing out that the analogy doesn't quite hold. I think we're in new territory where analogies to previous human activities aren't always productive.


I think you’re skipping over the problem.

In your example you owned the work you gave to the person to create derivatives of.

In a more accurate example you would be stealing those books and then giving them to someone else to create derivatives.


How about if I borrowed them from the library and gave them to the kid to read?

How about if I got the kid to read the books on a public website where the author made the books available for free?


Ironically, these artists can't claim to be wholly original, as they were certainly inspired themselves. Artists that play live already "lobotomize" people on their way out, since it's not easy to recreate an experience, and a video isn't the same if it's a good show.

Artists that make easily reproducible art will see it circulate, as such work always has, alongside AI output in a sea of other JPGs.


you might be well served by reading the actual complaint.


I think your kid analogy is flawed because it ignores the fact that you couldn't reasonably use said "kid" to rapidly produce thousands of works in the same style and then use them to flood the market and drown out the original author's presence.

Try this with a real "kid" and you'll run into all kinds of real-world constraints, whereas flooding the world with derivative drivel using LLMs is something that's actually possible.

So yeah, stop using weak analogies, it's not helpful or intelligent.


Would be fascinated to hear from someone inside on a throwaway, but my nearest experience is that corporate lawyers aren't stupid.

If there's legally-murky secret data sauce, it's firewalled from being easily seen in its entirety by anyone not golden-handcuffed to the company.

They may be able to train against it. They may be able to peek at portions of it. But no one is downloading-all.


Big corporations and corporate lawyers lose major lawsuits all the time.


That doesn't mean they don't spend lots of time thinking of ways not to lose them.

See: Google turning off retention on internal conversations to avoid creating anti-trust evidence


For what it's worth, I asked Altman directly and he denied using LibGen or books2, but he also deferred to Murati and her team on specifics. The Q&A wasn't recorded, though, and they haven't answered my follow-ups.


Really? Because the GPT-3 paper talks about "...two internet-based books corpora (Books1 and Books2)..." (see pages 8 and 9) - https://arxiv.org/pdf/2005.14165.pdf

Unclear what those corpora might be, or if it's the same books2 you are referring to.


My guess is that this poster meant books3, not books2.

books1 and books2 are OpenAI corpora that have never (to my knowledge) had their contents revealed.

books3 is public, developed outside of OpenAI and we know exactly what's in it.


sorry, books3 is indeed what I meant.


Why would he know the answer in the first place?


The legal liabilities of the training data they use in their flagship product seems to be a thing the CEO should know.


We all remember when Aaron Swartz got hit with a wiretapping and intent-to-distribute federal charge for downloading JSTOR stuff, right?

It's really disgusting, IMO, that corporations that go above and beyond that sort of behavior are seeing NO federal investigations, yet when a private citizen does it, there are threats of life in prison.

This isn't new, but it speaks to a major hole in our legal system and the administration of it. The Feds are more than willing to steamroll an individual but will think twice over investigating a large corporation engaged in the same behavior.


What happened to Aaron Swartz was terrible. I find that what he was doing was outright good. IMO the right response isn't to make sure anyone doing something similar faces the same fate, but to make the information far more free, whether it's a corporation using it or not. I don't want them to steamroll everyone equally here, but to not steamroll anyone.


There are two points at issue here. One, that information should be more free, and two, that large corporations and private individuals should be equal before the law.


> I don't want them to steamroll everyone equally here, but to not steamroll anyone.

I think you're missing the point, and putting the cart before the horse. If you ensure that corporations are treated as stringently as individuals sometimes are, the reverse follows: your goal will presumably be attained, as corporate might becomes the little guy's win.

All with no unjust treatment.


Huh. I see downvotes. I am mystified, for if people and corporations are both treated stringently under the law, corporations will fight to have overly restrictive laws knocked down.

I envision pitting corporate body against corporate body: when one corporate body lobbies to (for example) extend copyrights, others will work to weaken copyright.

That doesn't happen as vigorously at present, because there is no corporate incentive. They play the old ask-forgiveness-rather-than-permission angle.

Anyhow. I just prefer to set my enemies against my enemies. More fun.


Corporations follow these laws much more stringently than individuals. Individuals often use pirated software to make things; I've seen many examples of that. I've never seen a corporation use pirated software to make things; they pay for licenses. Maybe there are some rare cases, but pirating is mostly a thing individuals do, not corporations.

So in general it is already as you say: corporations are much more targeted by these laws than individuals are. These laws mostly hinder corporations; we individuals are too small to be noticed by the system in most cases.

I've also seen indie games use copyrighted material with no issues, but AAA titles seem to avoid that like the plague. I can't really think of many examples where corporations are breaking these laws more than small individuals do.


So then you refute the comment I replied to, and its parent.


> I've also seen indie games use copyrighted material with no issues, but AAA titles seem to avoid that like the plague.

They use copyrighted material or they commit copyright infringement? The former doesn't necessarily constitute the latter. Likewise, given it's an option legally, there are other factors that go into the decision to use it that likely make it less attractive to AAA games.


Circumventing computer security to copy items en masse to distribute wholesale without transformation is a far cry from reading data on public facing web pages.


He didn't circumvent computer security. He had a right to use the MIT network and pull the JSTOR information. He certainly did it in a shady way (computer in a closet), but it's every bit as arguable that he did it that way because he didn't want someone stealing or unplugging his laptop while it was downloading the data.

He also did not distribute the information wholesale. What he planned on doing with the information was never proven.

OpenAI IS distributing information they got wholesale from the internet without license to that information. Heck, they are selling the information they distribute.


> OpenAI IS distributing information they got wholesale from the internet

Facts are not subject to copyright. It's very obvious ChatGPT is more than a search engine regurgitating copies of pages it indexed.


> Facts are not subject to copyright

That's false; but even assuming it's true, misinformation is creative content and therefore 99% of the Internet is subject to copyright.


No it is not. You can make a better argument than just BSing.

https://libraries.emory.edu/research/copyright/copyright-dat...


> right to use the MIT

That right ended when he used it to break the law. It was also for use on MIT computers, not for remote access (which is why he decided to install the laptop, also knowing this was against his "right to use").

The "right to use" also included a warning that misuse could result in state and federal prosecutions. It was not some free for all.

> and pull the JSTR information

No, he did not have the right to pull en masse. The JSTOR access explicitly disallowed that. So he most certainly did not have the "right" to do that, even if he were sitting at MIT in an office not breaking into systems.

> did it in a shady way

The word you're looking for is "illegal." Breaking and entering is not simply shady; it's a crime. B&E with intent to commit a felony (which is what he was doing) is an even more serious crime, and one of the charges.

> he did it that way because he didn't want someone stealing or unplugging his laptop

Ah, the old "the ends justify breaking the law" argument.

Now, to be precise, MIT and JSTOR went to great lengths to stop the outflow of copying, which both saw. Swartz returned multiple times to devise workarounds, continuing to break laws and circumvent yet more security measures. This was not some simple plug-and-forget laptop. He continually and persistently engaged in hacking to get around the protections both MIT and JSTOR were putting in place to stop him. He added a second computer and used MAC spoofing, among other things. His actions started to affect all users of JSTOR at MIT. The rate of outflow caused JSTOR's performance to suffer, so JSTOR disabled all of MIT's access.

Go read the indictment and evidence.

> OpenAI IS distributing information they got wholesale

No, that's ludicrous. How many complete JSTOR papers can I pull from ChatGPT? Zero? How many complete novels? None? Short stories? Also none? Can I ask for any of a category of items and get any of them? Nope. I cannot.

It's extremely hard to even get a complete decent-sized paragraph from any work, and almost certainly not one you pre-select at will (most of those anyone produces are found by running massive search runs, then post-selecting any matches).

Go ahead and demonstrate some wholesale distribution - pick an author and reproduce a few works, for example. I'll wait.

How many could I get from what Swartz downloaded? Millions? Not just as text: I could have gotten the complete author-formatted layout, diagrams, everything, in perfect photo-ready copy.

You're being dishonest in claiming these are the same. One can feel sad about Swartz's outcome, realize he was breaking the law, and recognize that the current OpenAI copyright situation is likely unlike any previous copyright situation, all at the same time. No need to equate such different things.


OK, you've written a lot, but it comes down to this: what law did he break?

Neither MIT nor JSTOR raised an issue with what Swartz did. JSTOR even went out of their way to tell the FBI they did not want him prosecuted.

Remember, again, what he was charged with. Wiretapping and intent to distribute. He wasn't charged with trespassing, breaking and entering, or anything else. Wiretapping and intent to distribute.

> His actions started to affect all users of JSTOR at MIT. The rate of outflow caused JSTOR to suffer performance, so JSTOR disabled all of MIT access.

And this is where you are confusing a "crime" with "misuse of a system". MIT and JSTOR were within their rights to cut access. That does not mean that what Swartz did was illegal. Similarly, if a business owner tells you "you need to leave now", you aren't committing a crime just because they asked you to leave. That doesn't happen until you are trespassed.

> Go ahead and demonstrate some wholesale distribution - pick an author and reproduce a few works, for example. I'll wait.

You violate copyright by transforming. And fortunately, it's really simple to show that ChatGPT will violate it and simply emit byte-for-byte chunks of copyrighted material.

You can, for example, ask it to implement Java's ArrayList and get several verbatim parts of the JDK's source code echoed back at you.

> How many could I get from what Schwartz downloaded?

0, because he didn't distribute.


> What law did he break?

You can read the indictment, which I already suggested you do.

> Remember, again, with what he was charged. Wiretapping and intent to distribute. He wasn't charged with trespassing, breaking and entering, or anything else. Wiretapping and intent to distribute.

He wasn't charged with wiretapping (not even sure that's a generic crime). He was charged with (two counts of) wire fraud (18 USC 1343), a huge difference. He also had 5 different charges of computer fraud (18 USC 1030(a)(4), (b) & 2), 5 counts of unlawfully obtaining information from a protected computer (18 USC 1030 (a)(2), (b), (c)(2)(B)(iii) & 2), and 1 count of recklessly damaging a protected computer (18 USC...).

He was not charged with "intent to distribute", and there's no such thing as a "wiretapping" charge. Did you ever once read the actual indictment, or did you just make all this up from internet forum posts?

If you're going to start with the phrase "Remember, again...", you should try not to make up nonsense. Actually read what you're asking others to "remember", which you apparently never knew in the first place.

> you are confusing a "crime" with "misuse of a system"

Apparently you are (willfully?) ignorant of law.

> You violate copyright by transforming.

That's false too. Transformative use is a defense against copyright infringement, not a way of committing it. Carefully read up on the topic.

> ask it to implement Java's Array list and get several verbatim parts of the JDKs source code echoed back at you

Provide the prompt. Courts have ruled that code that is the naïve way to create a simple solution is not copyrighted on its own, so if you have only a few disconnected snippets, that violates nothing. Can you make it reproduce an entire source file, comments, legalese at the top? I doubt it. To violate copyright one needs a certain amount (determined at trial) of the content.

You might also want to make sure you're not simply reading OpenJDK.

> 0, because he didn't distribute.

Please read. "How many could I get from what Swartz downloaded?" does not mean he published it all before he was stopped. It means what he took.

That you seem unable to tell the difference between someone copying millions of PDFs to distribute as-is and the effort one must go to to possibly get a desired copyrighted snippet shows either dishonesty or ignorance of the relevant laws.


Why isn't robots.txt enough to enforce copyright, etc.? If NYT didn't set robots.txt properly, is their content free-for-all? Yes, I know the first answer you would jump to is "of course not, copyright is the default", but it's almost 2024 and we have had robots.txt as the de facto industry standard to stop crawling.


robots.txt is not meant to be a mechanism of communicating the licensing of content on the page being crawled nor is it meant to communicate how the crawled content is allowed to be used by the crawler.

Edit: the same applies to humans. Just because a healthcare company puts up an S3 bucket of patient health data with a permissive robots.txt doesn't give you a right to view or use the crawled patient data. In fact, redistributing it may land you in significant legal trouble. Something being crawlable doesn't confer elevated rights compared to something not crawlable.
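
Concretely, the robots exclusion protocol only answers "may this agent fetch this path?"; nothing in it expresses a license over the fetched content. A minimal check using Python's standard library (the URLs are just examples):

    from urllib import robotparser

    # robots.txt answers "may this agent crawl this URL?" and nothing more;
    # it carries no licensing or usage terms for the content behind it.
    rp = robotparser.RobotFileParser()
    rp.set_url("https://www.nytimes.com/robots.txt")
    rp.read()

    print(rp.can_fetch("MyCrawler", "https://www.nytimes.com/section/world"))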


Furthering the S3 health data thought exercise:

If OpenAI got their hands on an S3 bucket from Aetna (or any major insurer) with full and complete health records on every American, due to Aetna lacking security or leaking a S3 bucket, should OpenAI or any other LLM provider be allowed to use the data in its training even if they strip out patient names before feeding it into training?

The difference between this question or NYT articles is that this question asks about content we know should not be available publicly online (even though it is or was at some point in the past).

I guess this really gets at "do we care about how the training data was obtained or pre-processed, or do we only care about the output (a model's weights and numbers, etc.)?"


HIPAA is about more than just names. Even information such as a patient's ZIP code and full medical history is often enough to de-anonymise someone. HIPAA breaches are considered much more severe than intellectual property infringements. I think the main reason that patients are considered to have ownership of even anonymised versions of their data (in terms of controlling how it is used) is that attempted anonymisation can fail, and there is always a risk of being deanonymised.
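
The arithmetic behind that point is simple. A sketch with illustrative round numbers (not a HIPAA analysis):

    # Quasi-identifier back-of-envelope: how identifying are a few fields?
    zip_codes = 42_000        # approximate count of US ZIP codes
    birth_dates = 365 * 79    # days per year x rough life expectancy
    sexes = 2

    tuples = zip_codes * birth_dates * sexes   # ~2.4 billion combinations
    population = 333_000_000

    print(f"{tuples:,} possible (ZIP, birth date, sex) tuples "
          f"for {population:,} people")
    # Far more tuples than people: most map to at most one person, which is
    # why such fields are treated as identifying even without a name.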

If somehow it could be proven without doubt that deanonymising that data wasn't possible (which cannot be done), then the harm probably wouldn't be very big aside from just general data ownership concerns which are already being discussed.


> should [they] be allowed to use this data in training…?

Unequivocally, yes.

LLMs have proved themselves to be useful, at times very useful, sometimes invaluable assistants who work in different ways than us. If sticking health data into a training set for some other AI could create another class of AI which can augment humanity, great!! Patient privacy and the law can f*k off.

I’m all for the greater good.


Eliminating the right to patient privacy does not serve the greater good. People have enough distrust of the medical system already. I'm ambivalent about training on properly anonymized health data, but I reject out of hand the idea that OpenAI et al should have unfettered access to identifiable private conversations between me and my doctor for the nebulous goal of some future improvement in LLM models.


> unfettered access to identifiable private conversations

You misread the post I was responding to. They were suggesting health data with PII removed.

Second, LLMs have proved that AI which gets unlimited training data can provide breakthroughs in AI capabilities. But they are not the whole universe of AIs. Some other AI tool, distinct from LLMs, which ingests en masse as much health data as it can could provide health and human longevity outcomes which could outweigh an individual's right to privacy.

If transformers can benefit from scale, why not some other, existing or yet to be found, AI technology?

We should be supporting a Common Crawl for health records, digitizing old health records, and shaming/forcing hospitals, research labs, and clinics into submitting all their data for a future AI to wade into and understand.


> Furthering the S3 health data thought exercise: If OpenAI got their hands on an S3 bucket from Aetna (or any major insurer) with full and complete health records on every American, due to Aetna lacking security or leaking a S3 bucket, should OpenAI or any other LLM provider be allowed to use the data in its training even if they strip out patient names before feeding it into training?

To me this says that openai would have access to ill-gotten raw patient data and would do the PII stripping themselves.


> could outweigh an individual's right to privacy.

If that's the case, let's put it on the ballot and vote for it.

I’m tired of big tech making policy decisions by “asking for permission later” and getting away with everything.

If there truly is some breakthrough and all we need is everyone’s data, tell the population and sell it to the people and let’s vote on it!


> I’m tired of big tech making policy decisions by “asking for permission later” and getting away with everything

> If that's the case, let's put it on the ballot and vote for it.

This vote will mean "faster horses" for everyone. Exponential progress by committee is almost unheard of.


robots.txt isn't about copyright; it's about preventing bots. It's effectively a EULA. Copyright law only comes into play when you distribute the content you scrape. If you scraped the New York Times for your own LLM that you used internally and didn't distribute the results, there would be no copyright infringement.


> If you scraped New York times for your own LLM that you used internally and didn't distribute the results, there would be no copyright infringement.

Why?

As far as I understand, the copyright owner has control over all copying, regardless of whether it is done internally or externally. Distributing it externally would be a more serious violation, though.


Er... This is what all these lawsuits against LLMs are hoping to disprove


Which lawsuits are concerning LLMs used only privately by the organization that developed it?


>Why isn't robots.txt enough to enforce copyright

You actually need a lot more than that. Most significantly, you need to have registered the work with the Copyright Office.

“No civil action for infringement of the copyright in any United States work shall be instituted until ... registration of the copyright claim has been made in accordance with this title.” 17 USC §411(a).


But the thing is, you can only bring the civil action after registering your claim; you need not register the claim before the infringement occurs.

Copyright is granted to the creator upon creation.


That is incorrect.

If the work is unpublished for the purposes of the Copyright Act, you do have to register (or preregister) the work prior to the infringement. 17 USC § 412(1).

If the work is published, you still have to register it within the earlier of (a) three months after the first publication of the work or (b) one month after the copyright owner learns of the infringement.

See below for the actual text of the law.

Publication, for the purposes of the Copyright Act, generally means transferring or offering a copy of the work for sale or rental. But there are many cases where it’s not clear whether a work has or has not been published — most notably when a work is posted online and can be downloaded, but has not been explicitly offered for sale.

Also, the Supreme Court recently ruled that the mere filing of an application for registration is insufficient to file suit. The Register of Copyrights has to actually grant your application. The registration process typically takes many months, though you can pay $800 for expedited processing, if you need it.

~~~

Here is the relevant portion of the Copyright Act:

In any action under this title, other than an action brought for a violation of the rights of the author under section 106A(a), an action for infringement of the copyright of a work that has been preregistered under section 408(f) before the commencement of the infringement and that has an effective date of registration not later than the earlier of 3 months after the first publication of the work or 1 month after the copyright owner has learned of the infringement, or an action instituted under section 411(c), no award of statutory damages or of attorney’s fees, as provided by sections 504 and 505, shall be made for—

(1) any infringement of copyright in an unpublished work commenced before the effective date of its registration; or

(2) any infringement of copyright commenced after first publication of the work and before the effective date of its registration, unless such registration is made within three months after the first publication of the work.
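
To make the timing rule described above concrete, a small date-arithmetic sketch (illustrative dates, months approximated as 90/30 days; not legal advice):

    from datetime import date, timedelta

    # Rough sketch of the registration deadline described above for a
    # published work: the earlier of ~3 months after first publication
    # or ~1 month after learning of the infringement.
    first_publication = date(2023, 1, 10)
    learned_of_infringement = date(2023, 6, 1)

    deadline = min(first_publication + timedelta(days=90),
                   learned_of_infringement + timedelta(days=30))
    print("register by", deadline, "to preserve statutory damages")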


NYT seems to claim paid subscription content as well, which I'm not sure bots can actually crawl.


ChatGPT's birth as a research preview may have been an attempt to avoid these issues. A free product that few people use would have been unlikely to trigger legal anger. When usage exploded, the natural inclination would be to hope for the best.

Google may simply have been obliged to follow suit.

Personally, I’m looking forward to pirate LLMs trained on academic content.


Is there already a dataset? Before LLaMA, Facebook had one too; I forget what it was called.


> a more established competitor

Apple is already doing this: https://www.nytimes.com/2023/12/22/technology/apple-ai-news-...

Apple caught a lot of shit over the past 18 months for their lack of AI strategy; but I think two years from now they're going to look like geniuses.


Didn't they just get caught for patent infringement? I'm sure they've done their fair share of shady stuff with the AI datasets too; they are just going to do a stellar job of concealing it.


Try searching for "man" or "woman" in your Photos app. It won't even show results for me. It's lobotomized and has been for many years.


> the first being at the birth of modern search engines.

Why do you say that? Search engines would at least direct the viewer to the source. NYT gets 35%+ of its traffic from Google: https://www.similarweb.com/website/nytimes.com/#traffic-sour...


Just because they asked for forgiveness instead of asking first for permission, its original sins will not be erased :-)

"Google Agrees to Pay Canadian Media for Using Their Content" - https://www.nytimes.com/2023/11/29/world/americas/google-can...


That's why I think the newspapers will manage to win against the LLM companies. They won against Google despite having no real argument why they should get paid to get more traffic. The search engine tax is even a shakier concept than the LLM tax would be.

Newspapers are very powerful and they own the platform to push their opinion. I'm not about to forget the EU debates where they all (or close to all) lied about how meta tags really work to push things their way; they've done it before and they will do it again.


That doesn’t mean that it wasn’t theft of their content. The internet would be a very different place if creator compensation and low friction micropayments were some of the first principles. Instead we’re left with ads as the only viable monetization model and clickbait/misinformation as a side effect.


I don't quite get it. If listing your link is considered as theft, HN is then a thief of content too. If you don't want your content stolen, just tell Google to not index your website?

I guess it's more constructive to propose alternatives than just bash the status quo. What's your creator compensation model for a search engine? I believe whatever is proposed will trade off something significant for being more ethical.


The world you’re hoping for will put all AI tech only within the hands of the established top 10 media entities, who traditionally have never compensated fairly anyway.

Sorry but if that’s the alternative to some writers feeling slighted, I’ll choose for the writers to be sad and the tech to be free.


“Feeling slighted” is a gross understatement of how a lack of compensation flowing to creators has shaped the internet and the wider world over the past 25 years. If we have a problem with the way top media companies compensate their creators, that is a separate issue - not a justification for layering another issue on top.


YouTube has made way more content creators wealthy than the NYT. Writers are not going to be paid more after this ruling either way.


Has the NYT made even a single content creator wealthy? Journalists there make less money than an average software engineer.


It's in the realm of possibility; lots of people found work post-Vox and post-BuzzFeed too, but I wouldn't classify that as the work of the NYT. "Real" creatives and content creators seem to embrace AI, or at least grudgingly adapt their own work; the OP I'm replying to would be cheering YouTube for suing OpenAI on behalf of YouTubers everywhere, despite that having no bearing on reality.

The main objectors are the old guard monopolies that are threatened.


Have you been a creator on YT? Do you know how much an average creator gets paid? Did you know that it and other modern platforms like Spotify artificially skew payouts towards the richest brands? If not, then please let’s not make any claims about wealthiness and “old guard” monopolies here.


Gadzooks! You're right! If only NYT had realised the secret to success was spewing out articles reacting to other articles reacting to other articles, they would all have been millionaires!


Did you forget the /s or do you not think that a lot of journalism is indeed reacting to other journalists?


I'm a creator myself, and of the two futures ahead of me, free benefits me more in the long term than closed.

The tech can either run freely in a box under my desk, or I'll have to pay upwards of 15-20k a year to run it on Adobe's/Google's/etc.'s servers. Once the tech is locked up, it will skyrocket to AutoCAD-type pricing, because the acceleration it provides is too valuable.

Journos can weep, small price to pay for the tech being free for us all.


I think the introduction of an expectation for compensation has generally brought down the quality of content online. Different people and incentives appear to get involved once content == money, vs content == creative expression.


So you're advocating giving OpenAI and incumbents a massive advantage by delegitimizing the process now? It's kinda like why Netflix was all for "fast lanes".


> I do think they should quickly course correct at this point and accept the fact that they clearly owe something to the creators of content they are consuming.

Eventually these LLMs are going to be put in mechanical bodies with the ability to interact with the world and learn (update their weights) in realtime. Consider how absurd your perspective would be then, when it'd be illegal for this embodied LLM to read any copyrighted text, be it a book or a web page, without special permission from the copyright holder, while humans face no such restriction.


A human faces the same restriction if they provide commercial services on the internet creating code that is a copy of copyrighted code.


This isn't true; if you hire a contractor and tell them "write from memory the copyrighted code X which you saw before", and they have such a good memory that they manage to write it verbatim, then you take that code and use it in a way that breaches copyright, you're liable, not the person you paid to copy the code for you. They're only liable if they were under NDA for that code.


> they have such a good memory that they manage to write it verbatim

No, there is no clause in copyright law that says "unless someone remembered it all and copied it from their memory instead of directly from the original source." That would just be a different mechanism of copying.

Clean-room techniques are used so that if there is incidental replication of parts of code in the course of a reimplementation of existing software, that it can be proven it was not copied from the source work.


And what professional developer would not be under NDA for the code he produces for a corporation?


The topic of this thread is LLMs reproducing _publicly available_ copyright content. Almost no developer would be under NDA for random copyrighted code online.


> while humans face no such restriction.

I have no idea what on earth you are talking about. People and corporations are sued for copyright infringement all the time.

https://copyrightalliance.org/copyright-cases-2022/

Reading and consuming other people's content isn't illegal, and it wouldn't be for a computer either.

Reading and consuming content with the sole purpose of reproducing it verbatim is frowned upon and can get you sued, whether it's an LLM or a sweatshop in India.


>I have no idea what on earth you are talking about. People and corporations are sued for copyright infringement all the time.

They're sued for _producing content_, not consuming content. If a human takes copyrighted output from an LLM and publishes it, they're absolutely liable if they violated copyright.

>Reading and consuming other people content isn't illegal, but it also wouldn't be for a computer.

That is absolutely what people in this thread are suggesting should happen: that it should be illegal for OpenAI et. al. to train models on publicly available content without first receiving permission from the authors.

>Reading and consuming content with the sole purpose of reproducing it verbatim is frowned upon, and can be sued, whether it's an LLM or a sweatshop in India.

That's irrelevant here because people training LLMs aren't feeding them copyrighted content for the sole purpose of reproducing it verbatim.


> That's irrelevant here because people training LLMs aren't feeding them copyrighted content for the sole purpose of reproducing it verbatim.

Disagree; it is completely relevant when discussing computers vs. people. The bar that has already been set is alternative uses.

LLMs don't have a purpose outside of regurgitating what they have ingested. CD burners could at least be claimed to be for backing up your data.


> Solidly rooting for NYT on this - it’s felt like many creative organizations have been asleep at the wheel while their lunch gets eaten for a second time (the first being at the birth of modern search engines.)

Hacker News has consistently upvoted posts that let users circumvent paywalls. And even when it doesn't, conversations here (and on Twitter, Reddit, etc.) that summarize the articles and quote the relevant bits as soon as the articles are published are much more of a threat to The New York Times than ChatGPT training on articles from months or years ago.


I don't think it's about scraping being a threat. It's that they violated the TOS and stand to make a ton of money from someone else's work.

I find irony in the newspaper suing AI when other news sources (admittedly not NYT) use AI to write the articles. How many other AI scrapers are just ingesting AI-generated content?


> I find irony in the newspaper suing AI when other news sources (admittedly not NYT) use AI to write the articles.

That isn't ironic at all; newspapers have newspaper competitors, and if those competitors can steal content by washing it through an AI, that is a serious problem. If these AI models weren't used to produce news articles and the like, it would be a much smaller issue.


Same. To all those arguing in favour of OpenAI, I have a question: do you steal books, movies, games?

Do you illegally share them via torrents, or even sell copies of these works?

Because that is what's going on here.


> they probably wouldn’t exist and the generative AI revolution may never have happened if they put the horse before the cart

Maybe, but I find the "It's ok to break the law because otherwise I can't do what I want" narrative a little offputting.


Doesn't this harm open source ML by adding yet another costly barrier to training models?


It doesn't matter what's good for open source ML.

It matters what is legal and what makes sense.


It doesn't matter what is legal. It matters what is right. Society is about balancing the needs of the individual vs the collective. I have a hard time equating individual rights with the NYT and I know my general views on scraping public data and who I was rooting for in the LinkedIn case.


I have an even harder time equating individual rights with the spending of $xx billion in Azure compute time and payment of a collective $0 to millions of individuals who involuntarily contribute training material to create a closed source, commercial service allowing a single company to compete with all the individuals currently employed to create similar work.

NYT just happens to be an entity that can afford to fight Microsoft in court.


I don't see a problem as long as there's taxation.

Look at SpaceX. They paid a collective $0 to the individuals who discovered all the physics and engineering knowledge. Without that knowledge they're nothing. But still, aren't we all glad that SpaceX exists?

In exchange for all the knowledge that SpaceX is privatizing, we get to tax them. "You took from us, so we get to take it back with tax."

I think the more important consideration isn't fairness it's prosperity. I don't want to ruin the gravy train with IP and copyright law. Let them take everything, then tax the end output in order to correct the balance and make things right.


When we're discussing litigation, it certainly matters what is legal.


And also - if what is legal isn't right, we live in a democracy and should change that.

Saying what's legal is irrelevant is an odd take.

I like living in a place with a rule of law.


Should Harriet Tubman have petitioned her local city council and waited for a referendum before freeing slaves?


Time will tell if comparing slavery to copyright is ridiculous or not.

In the case of slavery - we changed the law.

In the case of copyright - it's older than the Atlantic Slave Trade and still alive and kicking.

It's almost as if one of them is not like the other.


> It's almost as if one of them is not like the other.

Use this newfound insight to take my comment in good faith, as per HN guidelines, and recognize that I am making a generalized analogy about the gap between law and ethics, and not making a direct comparison between copyright and slavery.

Can we get back on topic?


It matters what ends up being best for humanity, and I think there are cases to be made both ways on this


People often get buried in the weeds about the purpose of copyright. Let us not forget that the only reason copyright laws exist is

> To promote the progress of science and useful arts, by securing for limited times to authors and inventors the exclusive right to their respective writings and discoveries

If copyright is starting to impede rather than promote progress, then it needs to change to remain constitutional.


The reason copyright promotes progress is that it incentives individuals and organizations to release works publicly, knowing their works are protected against unlawful copying.

The end game when large content producers like The New York Times are squeezed due to copyright not being enforced is that they will become more draconian in their DRM measures. If you don't like paywalls now, watch out for what happens if a free-for-all is allowed for model training on copyrighted works without monetary compensation.

I had a similar conversation with my brother-in-law, who's an economist by training but now works in data science. Initially he was on the side of OpenAI and said that model training data is fair game. After probing him, he came to the same conclusion I describe: not enforcing copyright for model training data will just result in a tightening of free access to data.

We're already seeing it from the likes of Twitter/X and Reddit. That trend is likely to spread to more content-rich companies and get even more draconian as time goes on.


I doubt there’s much that technical controls can do to limit the spread of NYT content, their only real recourse is to try suing unauthorized distributors. You only need to copy something once for it to be free.


Do other countries all use the same reasoning?


I don't think this was your point, but no they don't. Specifically China. What will happen if China has unbridled training for a decade while the United States quibbles about copyright?

I think publications should be protected enough to keep them in business, so I don't really know what to make of this situation.


Copyright isn't what got in the way here. The AI companies could have negotiated license agreements with the rights holders. But they chose not to.


From their perspective they're training a giant mechanical brain. A human brain doesn't need any special license agreement to read and learn from a publicly available book or web page, so why should a silicon one? They probably didn't even consider the possibility that people would claim that merely having an LLM read copyrighted data was a copyright violation.


I was thinking about this argument too: is it a "license violation" to gift a young adult a NYT subscription to help them learn to read? Or someone learning English as second language? That seems to be a strong argument.

But it falls apart because kids aren't business units trained to maximize shareholder returns (maybe in the farming age they were). OpenAI isn't open; it's making revolutionary tools that are absolutely going to be monetized by the highest bidder. A quick way to test this: NYT offers to drop their case if "Open" AI "open"-ly releases all its code and training data. They're just learning, right? What's the harm?


The law on this does not currently exist. It is in the process of being created by the courts and legislatures.

I personally think that giving copyright holders control over who is legally allowed to view a work that has been made publicly available is a huge step in the wrong direction. One reason is open source, but really that argument applies just as well to making sure that smaller companies have a chance of competing.

I think it makes much more sense to go after the infringing uses of models rather than putting in another barrier that will further advantage the big players in this space.


Copyright holders already have control over who is legally allowed to view a work that has been made publicly available. It's the right to distribution. You don't waive that right when you make your content free to view on a trial basis to visitors to your site, with the intent of getting subscriptions - however easy your terms are to skirt. NYT has the right to remove any of their content at any time, and to bar others from hosting and profiting on the content.


It does exist, and you'd be glad to know that it's going in the pro-AI/training direction: https://www.reedsmith.com/en/perspectives/ai-in-entertainmen...


> It does exist, and you'd be glad to know that it's going in the pro-AI/training direction

Certainly not in the US. From the article you linked "In the United States, in the absence of a TDM exception, AI companies contend that inclusion of copyrighted materials in training sets constitute fair use eg not copyright infringement, which position remains to be evaluated by the courts."

Fair use is a defense against copyright infringement, but the whole question in the first place is whether generative AI training falls under fair use, and this case looks to be the biggest test of that (among others filed relatively recently).


It's disingenuous to frame using data to train a model as a "view" of that data. The simple cases are the easy ones: if ChatGPT completely rips off a NYT article, then that's obviously infringement. However, there's an argument to be made that every part of the LLM training dataset is, in part, used in every output of that LLM.

I don’t know the solution, but I don’t like the idea that anything I post online that is openly viewable is automatically opted into being part of ML/AI training data, and I imagine that opinion would be amplified if my writing was a product which was being directly threatened by the very same models.


All I can ever think about with how ML models work is that they sound an awful lot like Data Laundering schemes.

You can get basically-but-not-quite-exactly the copyrighted material that it was trained on.

Saw this a lot with some earlier image models, where you could type in an artist's name and get their work back.

The fact that AI models are having to put up guardrails to prevent that sort of use is a good sign that they weren't trained ethically and they should be paying a ton of licensing fees to the people whose content they used without permission.
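
"Basically-but-not-quite-exactly" can even be quantified. A quick sketch with the standard library's fuzzy matcher (toy strings, not real model output):

    import difflib

    # Measuring how close a generated passage is to a source passage,
    # the near-verbatim overlap described above.
    source = "The quick brown fox jumps over the lazy dog."
    generated = "The quick brown fox leaps over the lazy dog."

    ratio = difflib.SequenceMatcher(None, source, generated).ratio()
    print(f"similarity: {ratio:.2f}")  # ~0.9: not identical, but clearly derived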


>You can get basically-but-not-quite-exactly the copyrighted material that it was trained on.

You can do exactly the same with a human author or artist if you prompt them to. And if you decide to publish this material, you're the one liable for breach of copyright, not the person you instructed to create the material.


Not if that person is a trillion dollar corporation. If they're a business that's regularly stealing content and re-writing it for their customers that business is gonna go down. Sure, a customer or two may go down with them but the business that sells counterfeit works to spec is not gonna last long.


Clearly if a law is bad then we should change that law. The law is supposed to serve humanity and when it fails to do so it needs to change.


setting legality as a cornerstone of ethics is a very slippery slope :)


Slavery was legal...


Still is in many countries with excellent diplomatic relations with the Western World:

https://www.cfr.org/backgrounder/what-kafala-system


open source won't care. they'll just use data anyway.

closed/proprietary services that also monetize - there's a question whether it's "fair" to take and use data for free, and then basically resell access to it. the monetization aspect is the bigger rub than just data use.

(maybe it's worth noting again that "openai" is not really "open" and not the same as open source ai/ml.)

taking data, maybe it's data that's free to take, and then as freely distributing resulting work, that's really just fine. taking something for free (without distinction, maybe it's free, maybe it's supposed to stay free, maybe it's not supposed to be used like that, maybe it's copyrighted), and then just ignoring licenses/relicensing and monetizing without care, that's just a minefield.


You can train your own model no problem, but you arguably can’t publish it. So yes, the model can’t be open-sourced, but the training procedure can.


I think not, because stealing large amounts of unlicensed content and hoping momentum/bluster/secrecy protects you is a privilege afforded only to corporations.

OSS seems to be developing its own, transparent, datasets.


It’s likely fair use.


Playing back large passages of verbatim content sold as your "product" without citation is almost certainly not fair use. Fair use would be saying "The New York Times said X" and then quoting a sentence with attribution. That's not what OpenAI is being sued for. They're being sued for passing off substantial bits of NYTimes content as their own IP and then charging for it, saying it's their own IP.

This is also related to earlier studies about OpenAI where their models have a bad habit of just regurgitating training data verbatim. If your trained data is protected IP you didn’t secure the rights for then that’s a real big problem. Hence this lawsuit. If successful, the floodgates will open.


> They’re being sued for passing off substantial bits of NYTimes content as their own IP and then charging for it saying it’s their own IP.

In what sense are they claiming their generated contents as their own IP?

https://www.zdnet.com/article/who-owns-the-code-if-chatgpts-...

> OpenAI (the company behind ChatGPT) does not claim ownership of generated content. According to their terms of service, "OpenAI hereby assigns to you all its right, title and interest in and to Output."

https://openai.com/policies/terms-of-use

> Ownership of Content. As between you and OpenAI, and to the extent permitted by applicable law, you (a) retain your ownership rights in Input and (b) own the Output. We hereby assign to you all our right, title, and interest, if any, in and to Output.


They can't transfer rights to the output if it isn't theirs to begin with.

Saying they don't claim the rights over their output while outputting large chunks verbatim is the old YouTube scheme of uploading a movie and saying "no copyright intended".


Exactly. And while one can easily just take down such a movie if an infringement claim is filed it’s unclear how one “removes” content from a trained model given how these models work. Thats messy.


If it's found that the use of the material infringes the rights of the copyright holder, then the AI company has to retrain the model without any material they don't have a right to. Pretty clear to me.


By that logic Microsoft Word should have to refuse to save or print any text that contained copyrighted content. GPT is just a tool; the user who's asking it to produce copyrighted content (and then publishing that content) is the one violating the copyright, and they're the ones who should be liable.


I don’t even know where to begin on this example.

The situations aren't remotely similar, and that much should be obvious. In one instance ChatGPT is reproducing copyrighted work; in the other, Word is taking keyboard input from the user. Word itself isn't producing anything.

> GPT is just a tool.

I don’t know what point this is supposed to make. It is not “just a tool” in the sense that it has no impact on what gets written.

Which brings us back to the beginning.

> the user who’s asking it to produce copyrighted content.

ChatGPT was trained on copyrighted content. The fact that it CAN reproduce the copyrighted content and the fact that it was trained on it is what the argument is about.


The bits you cite are legally bogus.

That would be like me just photocopying a book you wrote and then handing out copies saying we’re assigning different rights to the content. The whole point of the lawsuit is that OpenAI doesn’t own the content and thus they can’t just change the ownership rights per their terms of service. It doesn’t work like that.


Their legalese is careful to include the 'if any' qualifier ("We hereby assign to you all our right, title, and interest, if any, in and to Output.")

In any case, the point is that they made no claim to Output (as opposed to their code, etc) being their IP.


That's irrelevant. The main point is that they are re-distributing the content without permission from the copyright owners, so they are sort of implicitly claiming they have copy/distribution rights over it. Since they don't, then it's obvious they can't give you this content at all.


>The main point is that they are re-distributing the content without permission from the copyright owners,

By your logic, Firefox is re-distributing content without permission from the copyright owners whenever you use it to read a pirated book. ChatGPT isn't just randomly generating copyrighted content, it just does so when explicitly prompted by a user.


That is not the same thing at all. If I search on Google for copyrighted content and Google shows me the content, it is the server that serves the content which is most directly responsible, not Google nor I. Firefox is only a neutral agent, whereas ChatGPT is itself the source of the copyrighted content.

Of course, if the input I give to ChatGPT is "here is a piece from an NYT article, please tell it to me again verbatim", followed by a copy I got from the NYT archive, and ChatGPT returns the same text I gave it as input, that is not copyright infringement. But if I say "please show me the text of the NYT article on crime from 10th January 1993", and ChatGPT returns the exact text of that article, then they are obviously infringing on NYT's distribution rights for this content, since they are retrieving it from their own storage.

If they returned a link you could click, and the content were retrieved from the NYT, along with any other changes such as advertising, even if it were inside an iframe, it would be an entirely different matter.


> In what sense are they claiming their generated contents as their own IP?

https://www.zdnet.com/article/who-owns-the-code-if-chatgpts-...

>> OpenAI (the company behind ChatGPT) does not claim ownership of generated content. According to their terms of service, "OpenAI hereby assigns to you all its right, title and interest in and to Output."

How are they giving you the rights to the work if they don't own it? They are literally asserting that they are in a position to assign the rights (to the output) to the user - that is a literal claim of ownership.

IOW, if someone says "Take this from me, I assure you it is legal to do so", they are asserting ownership of that thing.


They are distributing the output, so they (implicitly) claim to have the right to distribute it. If I send you a movie I downloaded along with a license that says "I hereby assign to you all our right, title, and interest, if any, in and to Output.", I'm still obviously infringing on the copyright of that movie (unless I have a deal that allows re-distribution, of course, as Netflix does).


That part doesn't seem relevant to me in any case. IP pirates aren't prosecuted or sued because of a claim of ownership; they're prosecuted or sued over possession, distribution, or use.


At the root, it seems like there's also a gap in copyright with respect to AI around transformative use.

Is using something, in its entirety, as a tiny bit of a massive data set, in order to produce something novel... infringing?

That's a pretty weird question that never existed when copyright was defined.


Replace the AI model with a human, and it should become pretty clear what is allowed and what isn’t, in terms of published output. The issue is that an AI model is like a human that you can force to produce copyright-infringing output, or at least one where you have little control over whether the output is copyright-infringing or not.


It's less clear than you think, and comes down more to how OpenAI is commercially benefiting from and competing with the NYT than to what they actually did. (See the four factors of fair use.)


I think it did come up back in the day, sort of, for example with libraries.

More importantly, every case is unique, so what really emerged was a set of principles for what defines fair use, which will definitely guide this one.


I would note that in the examples the NYT cites, the prompts explicitly ask for the reproduction of content.

I think it makes sense to hold model makers responsible when their tools make infringement too easy to do, or possible to do accidentally. However, that is a far cry from requiring a license to do the training in the first place.


> It's likely fair use.

I agree. You can even listen to the NYT Hard Fork podcast (that I recommend btw https://www.nytimes.com/2023/11/03/podcasts/hard-fork-execut...) where they recently had Harvard copyright law professor Rebecca Tushnet on as a guest.

They asked her about the issue of copyrighted training data. Her response was:

""" Google, for example, with the book project, doesn’t give you the full text and is very careful about not giving you the full text. And the court said that the snippet production, which helps people figure out what the book is about but doesn’t substitute for the book, is a fair use.

So the idea of ingesting large amounts of existing works, and then doing something new with them, I think, is reasonably well established. The question is, of course, whether we think that there’s something uniquely different about LLMs that justifies treating them differently. """

Now for my take: proving that OpenAI trained on NYT articles is not sufficient, IMO. The NYT would need to show that OpenAI is providing a substitutable good via verbatim copying, which I don't think is easy to prove. It takes a lot of prompt engineering and luck to pull out any verbatim articles; it's well established that LLMs screw up even well-known facts, and it's quite hard to accurately extract the training data verbatim.
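For what it's worth, "verbatim" is also something you can measure rather than argue about. Here's a minimal sketch of the kind of check one might run; the strings are stand-ins, and in practice model_output would come from actually prompting the model:

    import difflib

    def longest_verbatim_run(article: str, model_output: str) -> str:
        """Return the longest contiguous span shared verbatim by both texts."""
        m = difflib.SequenceMatcher(None, article, model_output, autojunk=False)
        match = m.find_longest_match(0, len(article), 0, len(model_output))
        return article[match.a : match.a + match.size]

    # Stand-in strings; a real test would compare a model completion
    # against the actual article text.
    article = "the quick brown fox jumps over the lazy dog near the river"
    output = "witnesses said the quick brown fox jumps over the lazy dog"
    print(repr(longest_verbatim_run(article, output)))

A long shared run across many articles would look like copying; short scattered overlaps look like ordinary phrase reuse. Where the legal line falls between the two is exactly what this case may decide.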


Genuinely asking: is the “verbatim” thing set in stone? An entity spewing out NYTimes-like articles after having been trained on lots of NYTimes content sounds like a very grey area; judged by the “spirit” of copyright law, some may find it unlawful.

Of course, I’m not a lawyer, and I know that in the US sticking to precedents (which mention the “verbatim” thing) carries far more weight than judging something by the spirit of the law, but stranger things have happened.


There's already precedent for this in news: news outlets constantly report on each other's stories. That's why they care so much about being first on a story; once they break it, it is fair game for everyone else to report on it too.

Here's a hypothetical: suppose there is a random fact about some news event that has only been reported in a single article. Do they suddenly have a monopoly on that fact, and deserve compensation whenever that fact gets picked up and repeated by other news articles or books or TV shows or movies (or AI models)?


It's likely not. Search for "the four factors of fair use". While I think OpenAI will have decent arguments for 3 of the factors, they'll get killed on the fourth factor, "the effect of the use on the potential market", which is what this lawsuit is really about.

If your "fair use" substantially negatively affects the market for the original source material, which I think is fairly clear in this case, the courts wont look favorably on that.

Of course, I think this is a great test case precisely because the power of "Internet scale" and generative AI is fundamentally different than our previous notions about why we wanted a "fair use exception" in the first place.


Fair use is based on a flexible proportionality test so they don't need perfect arguments on all factors.

> If your "fair use" substantially negatively affects the market for the original source material, which I think is fairly clear in this case, the courts wont look favorably on that.

I think it's fairly clear that it doesn't. No one is going to use ChatGPT to circumvent NYTimes paywalls when archive.ph and the NoPaywall browser extension exist, and any copyright violation would fall on whoever publishes ChatGPT's output.

But let's not pretend like any of us have any clue what's going to happen in this case. Even if Judge Alsup gets it, we're so far in uncharted territory any speculation is useless.


> we're so far in uncharted territory any speculation is useless

I definitely agree with that (at least the "far in uncharted territory" bit; as for "speculation being useless", we're all pretty much just analyzing/guessing/shooting the shit here, so I'm not sure "usefulness" is the right barometer), which is why I'm looking forward to this case. I also totally agree the assessment is flexible.

But I don't think your argument that it doesn't negatively affect the market holds water. Courts have held in the past that the relevant market is defined pretty broadly, e.g.

> For example, in one case an artist used a copyrighted photograph without permission as the basis for wood sculptures, copying all elements of the photo. The artist earned several hundred thousand dollars selling the sculptures. When the photographer sued, the artist claimed his sculptures were a fair use because the photographer would never have considered making sculptures. The court disagreed, stating that it did not matter whether the photographer had considered making sculptures; what mattered was that a potential market for sculptures of the photograph existed. (Rogers v. Koons, 960 F.2d 301 (2d Cir. 1992).)

From https://fairuse.stanford.edu/overview/fair-use/four-factors/


Nobody is gonna cancel their NYT subscription for ChatGPT 4.0. OpenAI will win.


Per my other comment here, https://news.ycombinator.com/item?id=38784723, courts have previously ruled that whether people would cancel their NYT subscription is irrelevant to that test.


What exactly is the effect on the potential market? That's exactly why I don't think OpenAI will lose, why would a court side with the NYT?


What if a court interprets fair use as a human-only right, just like it did for copyright?


I think we need a lot of clarity here. It seems perfectly sensible to treat gigantic corpora of high-quality literature as something society would want to be fair use for training an LLM to better understand and produce more correct writing... but the actual information contained in NYT articles should probably be controlled primarily by the NYT. If the value a business delivers (in this case, the information in its articles) can be freely poached without limitation by competitors, then that business can't afford to invest in delivering a quality product.

As a counter argument it might be reasonable to instead say that the NYT delivers "current information" so perhaps it'd be fair to train your model on articles so long as they aren't too recent... but I think a lot of the information that the NYT now relies on for actual traffic is their non-temporal stuff - including things like life advice and recipes.


The case for copyright is exactly the opposite: the form of content (the precise way the NYT writers presented it) is protected. The ideas therein, the actual news story, is very much not protected at all. You can freely and legally read an NYT article hot off the press and go on air on Fox News and recount it, as long as you're not copying their exact words. Even if the news turns out to be entirely fake and invented by the NYT to catch you leaking their stuff, you still have every right to present the information therein.

This isn't even "fair use". The ideas in a work are simply not protected by copyright, only the form is.



