OpenAI and journalism (openai.com)
104 points by zerojames on Jan 8, 2024 | 122 comments


I'm sympathetic to the idea that training is fair use. I'm much less sympathetic to the "it's a bug" defense on regurgitation. Ultimately, if your system is reproducing copyrighted material, it seems like you're in violation. I'm not sure "we're trying our best to make it not do that" gets you out of that.


I think it is relevant to the claim that GPT serves as a replacement for reading the NYT, which is an important prong of the infringement case. If it can only regurgitate <1% of articles in full text, I think that's a much weaker case than if it can regurgitate a significant fraction.


I am not sure that reading even a verbatim NYT article from, say, 2017 that is in the training data is a replacement for reading the NYT, which is mostly current news. So it is not similar to how news orgs sued Google for showing their content in real time. It is still copyright infringement of course; I am not siding with OpenAI here.

ChatGPT can replace the NYT archives (a paid service?) though.


AI is merely exposing how insane copyright law is.

Using public-facing data for training is exactly as legal as looking at that data yourself. An AI regurgitating information is exactly as legal as a human doing it. If you are allowed to - in a public square - recite from memory an article you read on the internet... if that is legal, then an AI doing the equivalent must also be legal.

This is a truth because (to reiterate a basic fact) the training data (the text) simply does not exist in the LLM. What does exist is a set of weights and biases that are a representation of that data (a model). To say a model is even reciting data is inaccurate...

With that said - big companies are not your friend - open source development of AI is the only way to go. Transparency is key to healthy development.


> If you are allowed to - in a public square - recite from memory an article you read on the internet

You are in fact not allowed to do that


>> If you are allowed to - in a public square - recite from memory an article you read on the internet

> You are in fact not allowed to do that

You're arguing poetry “recitals” aren't legal?

Copyright law is designed to protect original work from being used in a way that could harm the creator's interests, typically economically. In the case of a public recital, we have:

1. Fair Use presuming it's a one-time event, non-commercial, with no impact on market value

2. Public Performance generally means plays, music, movies, things that are a "performance". Reciting the article isn't the performance of something designed to be a performance; in fact, it's arguably transformative.

3. Copyright infringement usually concerns reproduction, distribution, or commercial exploitation. A random, one-time, live recitation doesn't easily fall into these categories, and enforcing copyright in such a scenario would be impractical and unusual.

Certainly there are ways this could go differently. If the recital were to be recorded and distributed, or if it were done 9 to 5 daily as a "news barker" reading every top article from a soapbox as an alternative to the newsstand nearby... it could be a problem.


> You're arguing poetry “recitals” aren't legal?

No.

I think the strongest point you make is the one-time thing. But I still think, in general, simply reproducing an article verbatim via speech would not be found fair use (not saying whether it should or shouldn't be, nor whether this at all applies to GPT, nor whether it's right or wrong, nor whether practically anyone would face consequences for it)


Is it illegal to hallucinate and say words that happen to be verbatim from a copyrighted article you read once, a blue moon ago?

The key point is - this isn't being copied from the website, it's being generated from the model. This is not copyrighted work being copied. What that is is speech. Is it protected speech?


> The key point is - this isn't being copied from the website

Yes, it is.

It was copied from the website into the model's training set, compressed into the model, and then reproduced from the compressed form.

> What that is is speech.

Even if it were, the Constitutional protection on speech vis-a-vis the copyright power is coextensive with statutory fair use (a codification of Constitutional case law), so that assertion doesn't, even if accepted as meaningfully true, change how copyright law applies.


data (the website) is being 'seen' by a model. training data is NOT the model. there is NO text from any website in any LLM model.

The website is not being copied to an internal model anywhere; it is being looked at (processed through an LLM) and trained on...

If you can find it on the (open and public facing) internet, then it's fair game to everyone (and everything) to 'see'.


> data (the website) is being 'seen' by a model.

The quotes around “seen” are appropriate since it's at best a metaphor.

A literal copy is made of the data (along with other data) to create the training set, and then a process is run which creates a lossy compressed representation of all the data in that training set, and that lossy compressed representation is called a model.

> there is NO text from any website in any LLM model.

There is, just as if you make a collage out of a series of pictures and then apply a lossy image compression algorithm: visual data from those pictures is included in the compressed representation.

> The website is not being copied to an internal model

This is simply false. That's exactly what is happening.

> it is being looked at (processed through an LLM) and trained on...

That is (lossily) copying into the trained model.

> If you can find it on the (open and public facing) internet, then it's fair game to everyone (and everything) to 'see'.

Those who see it are not free to reproduce it; public display of a copyright-protected work doesn't waive protection, and even where seeing is a literal rather than metaphorical description, as for a human reader of a protected work, reproduction of the work or creation of a derivative after having seen it is prohibited.


> training data is NOT the model. there is NO text from any website in any LLM model.

The model is a (lossily) compressed form of the training data.


In considering these things, allowances must be made for the new abilities made possible by the computational tools now at our disposal.

Following your line of reasoning: if it is perfectly legal to walk into a coffee shop, sit down, and listen to what the people next to me are talking about, commit it to memory, even make notes about it, does it then follow that it should be perfectly legal, reasonable, and acceptable for a govt agency or some other organization to put microphones everywhere to record what everyone is talking about, then feed all this data into various databases and modeling systems?

Reciting something in a park is different than selling a copyrighted print of something in a park when you don’t hold the copyright. Which is much closer to what the NYT is accusing OpenAI of.

The training data not “existing” in the model is interesting, but at some point, a distinction without a difference.

If I hire an autistic savant to go to a library and read all the books, then I set up a book selling service where whenever people want to buy a book I have my savant employee type out the book for them, is it then going to pass muster in a copyright case if I tell the judge “It’s okay actually, because the books don’t actually exist in my employee’s brain, merely neuronal encodings of them.” ?

Say I have a copyrighted image on which I don't hold the copyright, but I want to start selling it to people. Is it cool if I just run it through a lossless compression algorithm, thereby generating a new encoding of the information, and then sell this new encoding along with the software and command to reverse the compression?

Regarding the open source stuff, there I think you might find more favor to your arguments.

But the stuff we are seeing within commercial enterprises like OpenAI and Midjourney is clearly copyright infringement.

And I don’t see copyright law being insane in these cases.


It would be perfectly legal for a million government agents to go into coffee shops and record what they heard. It is the leaving of government property on private property that is the real issue, as well as transparency... not access to information (please don't do this, my government).

As far as the savant-reading-all-the-books analogy goes... it's a bit off base - mostly because the AI isn't attempting to do that. It would have to be prompted specially to generate that information (which, as far as I understand, is what's happening: people giving verbose special prompts to 'extract' copyrighted text... though 'regenerate' is the better verb, considering there's no guarantee the generation will be a perfect reproduction). Fixing the analogy: the savant reads all the books in the library, then someone asks him to generate a brand new book... which contains some passages that happen to be like those in copyrighted works... this is 100% comparable to what human writers do all the time. Why would we ever want to punish an AI for reading and remembering better than us?

On top of that, selling an imperfect reproduction as if it were the original... that's a lot of additional assumptions to make...

Sadly the lossless compression is also a bad analogy. Lossless compression is a pure mathematical mapping, 100% reversible, so the bits aren't really changed, just re-encoded... if you compress it lossily, to the point of doing it artistically, then... if none of the bits are the same - it's not the same picture, and doesn't hold any 'bit' of the old image.

Good reply!


> If I hire an autistic savant to go to a library and read all the books, then I set up a book selling service where whenever people want to buy a book I have my savant employee type out the book for them, is it then going to pass muster in a copyright case if I tell the judge “It’s okay actually, because the books don’t actually exist in my employee’s brain, merely neuronal encodings of them.” ?

No, and I do think OpenAI returning copyrighted works verbatim is probably copyright infringement even if it’s “laundered” through a LLM.

However, if the autistic savant only provided summaries, analyses, etc., that is fair use (IANAL), and it should be for LLMs too.

That probably means LLMs will need some sort of scrubbing process to ensure exact training data can't be reproduced, or, if that's not feasible, some type of output filter that looks for training data (although that would be a problem for open source models)
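
A minimal sketch of what such scrubbing could look like, assuming you already have the protected works on hand (the shingle size, threshold, and all names here are invented for illustration):

  def shingles(text, n=10):
      # sets of overlapping word n-grams ("shingles")
      w = text.lower().split()
      return {tuple(w[i:i + n]) for i in range(len(w) - n + 1)}

  def scrub(training_docs, protected_works, max_overlap=0.1):
      # drop any training document that substantially duplicates a protected work
      protected = set()
      for work in protected_works:
          protected |= shingles(work)
      kept = []
      for doc in training_docs:
          s = shingles(doc)
          if not s or len(s & protected) / len(s) <= max_overlap:
              kept.append(doc)
      return kept

At web scale you'd want hashing (MinHash or similar) rather than raw tuples, but the principle is the same.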


> What does exist is a set of weights and biases that are a representation of that data (a model)

This is a distinction without a difference if the text can be reliably reproduced. For example, a trivial neural network that always outputs the text is just a fancy encoding of that text.
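
To make that concrete, here is a toy "model" (hypothetical, nothing like a real transformer) whose weights are nothing but an encoding of one text:

  ARTICLE = "Some copyrighted text."

  # "training": store each character's code point as a weight
  weights = [float(ord(c)) for c in ARTICLE]

  def forward(prompt):
      # ignore the input and "generate" the memorized text from the weights
      return "".join(chr(int(w)) for w in weights)

  assert forward("anything") == ARTICLE  # the weights *are* the text

No string is stored anywhere, only a list of floats, yet the output is verbatim.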


I wish that was how it worked in music. De La Soul using 1% of somebody else's song in no way diminishes the original, yet it is a massive problem and can block their right to republish the song in any format 30 years later.


I'm not sure either way. While my naïve thought is that factual knowledge ought to be distinct from knowledge of language comprehension, I can't think how to actually do that — how do you riff on a quotation without knowing the quotation? How can you read between the lines if you are given half a phrase where the point is in the unspoken other half?

I don't know enough about amnesia to know if humans even separate these things or not.

If they are separable, LLMs can be made much smaller without loss of comprehension, as "mere facts" can easily be punted off the network and into a pool of searchable documents.

Without that capability, we'd have to teach the models that verbatim quotes are not allowed, just as we have to tell each other about copyright law, because humans can recall things sometimes — and in mere conversation it's considered unremarkable (or even good!) if our mouths or hands produce perfect reproductions. Poems, pledges, anthems, quotes…

…but for anyone who has memorised, say, Star Wars, you're forbidden from rather than incapable of writing down the entire script.


> "We're trying our best to make it not do that" gets you out of that.

Well, if that wouldn't get you out, we wouldn't have YouTube, Google, Dropbox, or any other site that could plausibly output copyrighted material. That would be the end of the internet idea.


The DMCA has specific rules which govern sites that host user-generated content, which in turn gives the hosting site legal protection from copyright claims. At the moment, inference doesn't count as "user-generated" content and would thus (probably) not fall under DMCA rules.


Without Google, YouTube, and Dropbox, copyrighted material would just be where the copyright holder deemed and authorized it to be. (save some amount of piracy of course)

It would actually be more in line with the “Internet idea” of a decentralized network, not a massive hub and spoke arrangement.

Saying, “Hey, we already have some big corporations where copyright infringement plays some role in their business model, why not add a few more in the form of “Open”AI and whoever else,” is not a good argument.

The centralized server farms and behemoth corporations are in many ways representative of what the commercial internet has become, but are not a fulfillment of the original “internet idea.”


Piracy was at the core of the internet idea from its very inception through hacker culture.


You seem to have misunderstood what I was meaning to convey. I wasn't even meaning to say much in the way of the virtues or not of piracy, or the legitimacy of copyright in its present form.

If Google, YouTube, Dropbox didn’t exist you could still share files with your friends.

Sharing a file with a friend, and him benefiting in some way from that, is a different thing entirely from massive centralizing corporations flouting copyright rules to benefit commercially and increase their power.

More of the former, less of the latter.


YouTube got big SPECIFICALLY because they hosted pirated movies. Long before they were bought by Google.


Hold up, Google bought YouTube just a year and a half after YouTube went live. My memory is that the user-produced videos and memes were a way bigger draw than copyrighted material in the early days. That's not to say copyright infringement wasn't a problem by the time Google acquired it, but I remember it being more of a nuisance than the site's raison d'etre. (Source: was at Google then, and also: https://en.wikipedia.org/wiki/History_of_YouTube)


I see a world of difference between “here are the first three paragraphs from a NYT article, what do you think the fourth is?” and asking “what is today's news?” and getting a paragraph from the NYT.

OpenAI is saying they are going to try to stop people tricking them into a copyright violation.


The problem is not that people can "trick" the system into reproducing this data. The problem is that the data was used for training without permission.


That is left to be decided, but I don't think that is what fair use says.

Should it be illegal to gather certain bits of metadata on NYT articles (a toy sketch of computing a couple of these follows the list), like:

  * word counts
  * word frequency
  * sentiment analysis
  * grammar and spelling
  * facts about the world
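
A toy sketch of computing the first couple of these (example text made up; sentiment analysis and fact extraction would need real models):

  import re
  from collections import Counter

  article = "The quick brown fox jumps over the lazy dog. The dog sleeps."

  words = re.findall(r"[a-z']+", article.lower())
  metadata = {
      "word_count": len(words),
      "word_frequency": Counter(words).most_common(5),
  }
  print(metadata)  # counts and frequencies, not the article itself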


While I imagine building knowledge of sources into the model itself is no trivial task, it would be so easy to search the training data against the output of the model to generate references.


Would searching the entire corpus really be easy?


Relatively so. This is essentially a much smaller version of what a search engine does (since their sources are curated).
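
A rough sketch of that kind of lookup (corpus and names invented; a production system would need fuzzier matching and a sharded index):

  from collections import defaultdict

  def ngrams(text, n=8):
      words = text.lower().split()
      return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

  def build_index(corpus):
      # map each word 8-gram to the training documents that contain it
      index = defaultdict(set)
      for doc_id, text in corpus.items():
          for gram in ngrams(text):
              index[gram].add(doc_id)
      return index

  def references(output, index):
      # any document sharing a long-enough n-gram is a candidate source
      hits = set()
      for gram in ngrams(output):
          hits |= index.get(gram, set())
      return hits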


"If we do something illegal on accident, then it should be legal."


It matters if the product has other non-infringing uses.


Sure. If you serve something legal 99 times out of a hundred, hopefully you're only liable for 1% of the damage you would have caused if you did it 100% of the time. If you do illegal things at scale, I don't really care if you're doing legal things at an even larger scale.


Sure it is. If you see a violation, you flag it like YouTube does and add it as an exception that won't show up again via RLHF.


YouTube can just set a flag to not show video X.

LLMs don't work that way.


Sure you can. You filter out the specific text post-output.


Operate on the kleptomaniac to get them to stop shoplifting...


Comes off as a weak response IMO.

* They allow an opt out for training – but it was introduced in August 2023 and they don't retroactively clean up their model once you do opt out.

* Regurgitation is a rare bug – doesn't mean it isn't a problem, or one they cannot be sued over.

* Prompts that get ChatGPT to regurgitate are "not typical or allowed user activity" – this is the weakest excuse of all. It is their responsibility to control user behavior on a system they built. If I can type something in and get the response I want, then I am using it in the intended way. Otherwise nothing on the internet can ever be illegal, because hey the user was the one who typed the URL.


> They allow an opt out for training

And given how buggy OpenAI is in general, you can't trust them to actually honor this opt-out, nor to protect it from abuse (e.g. they scrape copied content). They'll need a model like YouTube's scrubber that either deletes content or monetizes it on behalf of the copyright owner. And since OpenAI is making lots of money here, yes, there is a technically feasible path towards royalties. But sama thinks he's cool af and e/sigma blah blah blah.

https://community.openai.com/t/chatgpt-occasionally-reuses-t...


> Otherwise nothing on the internet can ever be illegal, because hey the user was the one who typed the URL.

It can be illegal to access material that was legal to host, because a user and a host can be in different jurisdictions.

It can also be possible to use a tool in an unlawful manner even if user and supplier are in the same jurisdiction, for example you can type up a document that reproduces a work protected by copyright without permission and print it out, and this is on you not on whoever you got the keyboard, computer, and printer from.

These may well not be sufficient defence in law, I am not a lawyer and can't tell what the law considers sensible.


Under your standard, how are any drawing programs legal? I can draw copyrighted Disney characters and it doesn't stop me.


Case #1 – I host a database and fill it with copyrighted data. You come by and make queries and access that data.

Case #2 – I host a blank database. You fill it up with copyrighted data, and later access that data.

You are alluding to case 2, and yes, over there it is debatable who committed the violation. You will most likely be at fault if it fits under Section 230.

This situation however is case 1. OpenAI is the one who accessed the copyrighted data, copied it, hosts it, and is giving everyone access to it. There is zero argument that the user is at fault here (which is what they are trying to say in this blog post).


ChatGPT isn't a lookup database, it is a generative system. I doubt the court is going to demand perfection from them, only that they don't make producing copyrighted materials trivially easy, and when new ways are found that they respond in a reasonable amount of time to stop it.


> it is a generative system.

It's supposed to be, but it regurgitated the NYT article and that's the problem.


Plenty of the examples given in NYT's lawsuit would qualify as "trivially easy".


That would be fine by me if the court found that, but ruled that training is fair use. Ideally the DMCA would be extended to outline what generative platforms need to do to discourage the output of copyrighted materials.


You are drawing the character, not the program. You are not allowed to sell your Disney character drawing because Disney holds the rights to the character.

ChatGPT is the one producing the article and profiting off it.


The obvious answer here is that the drawing program did not require copyrighted material from Disney to be created.


What does that have to do with outputs?

If ingesting data is a fair use violation, it doesn't matter what the model outputs. I was responding to the idea that it is the responsibility of the program creator to stop all fair use violations instigated by its users.


1. You can't unlearn something.

2. If it's right, it's right. Why dilute the answer?

3. What happened to personal responsibility? Guns don't kill people, people kill people.


You can definitely unlearn things with respect to LLMs. You retrain. It's expensive for them, boo hoo.


LLMs can't "unlearn" even with additional finetuning. Content drifts towards the newly added material, yes, but it can't forget, and artifacts will still be present.


I think they meant retraining from the ground up. A clean-room implementation. Technically feasible but not business-feasible. Otherwise it's like learning something under NDA and getting sued for breaking it in the future.


you can start from scratch, but you cannot unlearn.


1. That's the problem.

2. Copyright. If it's a right, it's a right.

3. This is not a universally agreed upon position.


1. Yes you can. You can delete the trained models and the progress made with copyrighted material.

Of course, they don't want to do that.


Point three is insanely flawed for a multitude of reasons, which range from the fact that a gun allows a human to kill people in the heat of the moment, to the fact that a gun enables a human to kill people from a hidden/safe space hundreds of metres away.


When building automated systems, the personal responsibility thing is weird and diluted. Is it OpenAI's fault, or the company using OpenAI in their product, or the client of that company, or the end user? Sometimes it seems like people are down to give B2B companies a free pass, like a B2C company should be liable for what they send their users, but OpenAI shouldn't be liable for what they send the B2C company. Personal responsibility can apply to LLM providers too.


> It is their responsibility to control user behavior on a system they built.

You do realize that GPT-4 was a moment similar to the first flight of the Wright brothers? You don't have Boeing standards of quality yet.


I'm increasingly of the opinion that publicly available models should be required to disclose the data they are trained on. Perhaps not necessarily the raw data itself, but at least a description of what the datasets are.

I get the sense that there would be more backlash against these models which would drive us more quickly towards a lasting resolution if people better understood how their data is being used.


Microsoft's can search the open internet. That'd be a long disclosure list!


That's not what training data means


Isn't training it within the context window considered one-shot learning? There were lots of glowing reports of how well it can do one/few-shot learning that way.

I don't think that training on copyrighted data is necessarily wrong, just pointing out that doing so within the context window rather than at weight-training time might not be so different.


I'm not sure what is meant by this. There's no "training" happening within the context window, at least not by the commonly used definition of training; it's all just part of the input. If you're asking whether you can reverse-index search copyrighted text and feed it into an AI model without permission, that's been happening for years.


For example, giving it within the context window 5 examples of a problem in a class it has never seen, with answers, and then asking it to solve a sixth was given as one of the amazing few-shot learning examples.

It could potentially do something similar by searching the internet for things like your question and then figuring out how to derive the answer, without finding an exact answer match directly.
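
"Learning" here just means the examples ride along in the input; no weights change. A sketch with a made-up operator:

  # pattern the model must infer from the examples: a @ b = a * b + a
  examples = [("2 @ 3", "8"), ("4 @ 1", "8"), ("5 @ 5", "30")]

  prompt = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
  prompt += "\nQ: 7 @ 2\nA:"  # the new problem, posed the same way

  # the whole string is ordinary model input; no gradient step ever runs
  print(prompt)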


This blog post reads less like a defense and more like a PR piece trying to convince normal people that OpenAI is in the right in the lawsuit. Except that a) pro-AI people already know the legal issues of training models and b) anti-AI people don't care.

I'm not sure who this post is for.


Maybe there is something huge in between, and the world isn't as black and white as you might believe?


Normally yes, but AI discourse lately is indeed that polarized and without nuance.

The normals who just use the ChatGPT webapps and mobile apps don't care about the lawsuit anyways. In that case, this blog post is just a Streisand Effect.


A high profile lawsuit raises questions about the stability of the service.

If changing your processes to include OpenAI is expensive then that’s a big concern.

Wouldn't surprise me if their salespeople are getting questioned on whether the lawsuit is an existential threat, and this is an attempt at giving them something.


Obviously it's a PR piece. And it's for the same people that The New York Times' article on their lawsuit was for.

Their legal defense will come in the form of a response to the complaint and a motion to dismiss.


because it's a PR piece :) PR for the public, defense for the courts


> legal right is less important to us than being good citizens. We have led the AI industry in providing a simple opt-out process for publishers (which The New York Times adopted in August 2023)

So what does this mean for all content trained before August 2023?

Are you only allowed to opt out for future training?


I don't think they will delete their models and retrain them for every person who fills out their opt-out form. So yes, it's only for future training, and IMO it's a "please don't use my stuff" at best.

Their data collection is very much automated, so avoiding opted-out content that was reposted somewhere else is likely impossible for them.


yes


The whole opt-out thing is such a dark pattern. If we (ML researchers) knew how to cheaply remove/untrain content from an already trained model, we'd say that opt-out without untraining is wrong. But since we don't know how to do it, we throw our hands in the air and say "well we should get a free pass on other people's content, because we can't do our jobs without it, because our jobs MUST be done before we figure this out."


Wow we can opt out, how generous of the billion dollar entity controlled by M$

Training on anything should be an explicit opt-in process and they (and every other model out there) should be retrained from the ground up with only explicitly allowed materials.


I don't see there being any basis to the claim that training is fair use, which is the premise their argument rests on.


(not making an argument for either side) I believe the Google Books decision is probably the strongest precedent for this in the US: https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....


The conclusion of the appeal reads:

> Google's unauthorized digitizing of copyright-protected works, creation of a search functionality, and display of snippets from those works are non-infringing fair uses. The purpose of the copying is highly transformative, the public display of text is limited, and the revelations do not provide a significant market substitute for the protected aspects of the originals. Google's commercial nature and profit motivation do not justify denial of fair use.

IMO this is very different from what OpenAI is doing.

- The purpose of Google Books was to search for content and then use the results of the search to buy the books from the original publishers. OpenAI doesn't attribute results nor provide a way to pay for the content it returns.

- Google Books doesn't provide the entire book as a response to the query, only the snippet with the result. OpenAI does return entire articles, verbatim or paraphrased, as the result of queries.

- Users of Google Books do not use it in place of actual books. ChatGPT replaces the need to read the actual articles/sources.


There's a lot of basis for claiming training is fair use. I would prefer it if there weren't, and I think using people's work to create a competing product is obviously not fair use, but there are strong arguments on either side imo. It's a complex question that will be settled in court and possibly with new laws.


"We had explained to The New York Times that, like any single source, their content didn't meaningfully contribute to the training of our existing models and also wouldn't be sufficiently impactful for future training."

I wonder how they operationalize "meaningful". After all, if the work didn't meaningfully contribute to the training of existing or future models, then why do they seem to need it in the first place?


Because they need the ocean, even if they don't need any individual drop within it.

Relative to the rest of the data, NYT's is not meaningfully impactful. But they need the aggregate amount of data, which is why they need training to be fair use and to be opt out rather than opt in.


Understood, and that's at the core of the hustle here. At some point one of NYT's lawyers is going to point out the apparent contradiction in OpenAI saying that NYT is important enough that they need to take from them, but not important enough that their taking causes damages to them.


Actually, if you read the congressional testimony of Sy Damle (a former General Counsel of the US Copyright Office), which is really worth a read, he makes a very good point that the smaller the training data set, the less likely the resulting model would qualify for fair use of the data, as it would be more likely that the model would not be transformative and would be prone to reproduce the training data.

So arguably the importance of including the NYT and everyone else collectively under opt-out is precisely that it makes the NYT individually less important in the final model.


This is the art of bullshitting.


I have to pay to access/read the NYT, either in hard copy or digital forms. This content is educational for me, but I must pay for it. Why is this not the case for OpenAI? It seems like this started with Google successfully arguing that it should be able to copy books because it isn't _using_ the books directly to learn, it's only digitizing and storing them for other people to use! What a joke.


That was very similar to my reaction. All the fair use rhetoric is carefully caveated to public information. The NYT is not publicly accessible; it's paid access. If this argument is true, then zlib does nothing illegal and the Internet Archive is allowed to scan books. We can't have this both-true-and-false dichotomy going.


>intentionally manipulating our models to regurgitate is not an appropriate use of our technology and is against our terms of use

What if I paste in some of my own code for ChatGPT to review and it's so terrible it barfs? Does that mean I violated the terms of use?

https://donhopkins.medium.com/the-future-of-gpt4-programming...


@dang title should probably be something like, “OpenAI responds to NYTimes lawsuit”


> "[...] Otherwise please use the original title, unless it is misleading or linkbait; don't editorialize."

- https://news.ycombinator.com/newsguidelines.html


The subtitle could be added? "We support journalism, partner with news organizations, and believe The New York Times lawsuit is without merit"


If data for machine usage is fair use why don’t they open up their model?


Because it's about copy rights, not copy obligations.

Fair use only applies to works you can use; it doesn't force their original authors/owners to publish them in the first place.


Ok, so put in a policing script that checks for copyright violations, and kicks the output back to the LLM to rephrase if there is a violation.

I don't see how that couldn't appease copyright holders, shy of them actually just wanting to cripple threats to their legacy industry.
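
Something like this, presumably (both helpers are hypothetical stand-ins, not any real API):

  def generate(prompt):
      # stand-in for an LLM call
      return "model output for: " + prompt

  def overlaps_known_work(text):
      # stand-in for a match against indexed copyrighted works
      return False

  def guarded_generate(prompt, max_retries=3):
      draft = generate(prompt)
      for _ in range(max_retries):
          if not overlaps_known_work(draft):
              return draft
          # kick the flagged draft back for a rewrite instead of serving it
          draft = generate("Rephrase in your own words, without quoting: " + draft)
      return "[response withheld: could not produce a non-infringing rephrasing]"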


they have consumed countless copyrighted works, and have admitted that without such data, their model would be pathetic in comparison to the current version.


> Our goal is to develop AI tools that empower people

Proprietary software does not empower people.


OAI and other AI companies are merely hiding behind fair use, which includes

- The purpose and character of the use

- Whether such use is of a commercial nature

- The amount of reference material used

- The effect on the work's value

I'm very much pro-AI but I'm not willing to sacrifice the fourth estate or erase the value of intellectual labor so some guy can buy another half-million-dollar watch.

I hope the NYTimes succeeds in their lawsuit, not because I want to hurt OAI or because I like the Times, but because I think the more we commoditize information, the dumber we get.


This might be a controversial take, but I just can’t imagine being emotionally invested in this whole copyright drama.

What's the use case here NYT is trying to prevent? A user spending hours to force the model to regurgitate articles that may or may not be hallucinations? Sorry to say it, but there are far easier ways to get around the NYT paywall.

Pacifying the NYT isn’t worth it if it means we lose models like GPT4. I hope OpenAI wins this one.


Hallucinations are a problem waiting to be solved. What happens when it is?

NYT isn’t the only source affected by this. All content created and posted on the internet is subject to being used by ChatGPT. Their goal is to provide answers to any question you might have, as an alternative to browsing actual sources.

If OpenAI wins and ChatGPT continues to improve, there will be no financial or social motivation to create honest content for the internet. The death of the organic internet will accelerate and all content that remains will be created by bots.

Of course this is a dystopian view that may not fully materialize, but if all it takes to avoid it is preventing a single corporation from succeeding, then I'm all for it.


> Their goal is to provide answers to any question you might have, as an alternative to browsing actual sources.

You mean like internet browsing became an alternative to actually buying newspapers? And now LLM answers are an alternative to actually browsing?

Let's close both the internet and AI and save journalism. Because it's gone to shit.


You really think we’re gonna be able to solve hallucination before this regurgitation problem? Please.


Any ruling on copyright in training models sets a precedent for everyone in the ML/AI space, which is a lot of people on Hacker News.


I hate the hypocrisy of “Open”AI. They feel entitled to everyone else’s data, but they keep their own data secret and proprietary.


I don't think the NYT needs a use. It's their stuff. A lot of these "stupid billionaire lawsuits" people can't be interested in hinge on legal foundations you otherwise freak out over. Imagine having to give random nobodies a valid reason for not letting them use your car.


Why is it not possible to fine-tune the models to not reproduce copyrighted material verbatim, even if they are aware of the knowledge contained in it?


I’m not even going to read this. ClosedAI should immediately open source their dataset and release all training data.


They can't release the training data because doing so would be copyright infringement lol.


You don't have to share the content of the sources to specify what the sources were.


Of course it's fair use.

It's a transformative application of copyrighted materials, the textbook definition of fair use, and anyone pretending it isn't out of some anti-generative-AI luddite sentiment risks destroying a precedent on which much of the modern internet is built.


Whether something is transformative is only one of the four factors considered when determining if something is fair use.


Please read the article. It's not about transformative use.


The "is it transformative?" question is the heart of the lawsuit, given that the NYT demonstrated that ChatGPT outputted nontransformative NYT articles.


Even if it does output training articles verbatim, it's still transformative, mainly because that's not its primary application.


What is the primary application of a LLM other than outputting text?


Well, it depends on intent; in ChatGPT's case it's responding to users' prompts. And it's not really the best way to read those articles; that's just a byproduct of the training process.


> Training AI models using publicly available internet materials is fair use

Question: Does this apply to paywalled content too ?


Answer's in the statement. Paywalled content isn't publicly available.


Sure. What I had in mind is partial paywalls, like the NYT.


For once, Jason on the All In podcast had an interesting point: you could see bolt-on data sources for your ChatGPT experience, where you opt in to having the NYT available to the ChatGPT retrieval step. There's no reason that a data provider shouldn't be paid for the information they create, and I think most of HN would agree with that. The problem is how you know a model didn't bypass your walls to include your data in the model itself, or as part of retrieval.


Why does writing about the acts of others apply a layer of ownership when the impetus was the act of another? Without them the journalist has nothing.

All GPT-4 is doing is writing about the act of your writing, the same way a journalist only creates from the act of another; it would make sense if the individuals the journo writes about got paid a percentage for the act that caused the writing.

Seems hypocritical to claim the chain of writing about ends with you when you have nothing without others first acting when the page was blank.


Fair use also implies you legally acquired the content to begin with. They trained the model on pirated collections of text. If they had bought a copy of each magazine or book – just one copy – then maybe it's fair use. They couldn't even be bothered to do that.



