
If it's fair use, reproducing the article text verbatim is fine.



What is "it"? Training can be fair use, i.e. updating weights incrementally based on predicted next token probabilities. And I (not a lawyer) think that if a broadly trained foundation model can recall some verbatim text, that doesn't mean the model is infringing.

It seems like the lawsuit here is about specific NYT-related functionality, like retrieving the text of new articles. That has essentially nothing to do with training and running a foundation LLM; it's about some specific shenanigans with NYT content, and its legal status would appear to have nothing to do with whether training is fair use.


Good luck trying to explain "updating weights incrementally based on predicted next token probabilities" to completely non-technical lawyers and judges.


Good thing they don't have to. As I've said before, this sleight of hand of talking about the case as if it's about training is a great move by OpenAI; however, the case is about more than just training. NYT is alleging that OpenAI maintains a database of NYT works and refers to it verbatim (per prompt) as part of the ChatGPT software, not via some pre-trained weights.

This is akin to having to explain to non-technical lawyers and judges how crypto works. In the FTX case it became irrelevant, because you could just nail them on fraud for using deposited funds for disallowed purposes.


>NYT is alleging that OpenAI maintains a database of NYT works and refers to it verbatim (per prompt) as part of the ChatGPT software, not via some pre-trained weights.

So if ChatGPT didn't refer to it verbatim, and instead trained on it and mixed it with other content, NYT would be OK with that? Tbh I don't get it.

Edit: I found this in my bookmarks archive - https://news.slashdot.org/story/23/08/15/2242238/nyt-prohibi...

Also this: https://www.cnbc.com/2023/10/18/universal-music-sues-anthrop...


It depends on the purpose of ChatGPT. If people use it as a substitute for the NYT, then yes, I suspect the NYT will not be OK with it.

I think the courts will also side with the NYT. Very recently there was a copyright case involving Andy Warhol [1], which the Warhol Foundation lost. Despite the artwork being visually transformative, its use was not transformative. So, to me, that means if you create a program using NYT's materials that is used as a replacement for the NYT, it will not count as fair use. Obviously you could just do what, say, Google does and fork money over to the NYT for a license.

However, my initial point is that this is a tangent. NYT has claimed that OpenAI is using NYT's works verbatim, so OpenAI can just be nailed for that. Which is my point about FTX: it's irrelevant whether their exchange was legal, since you can just nail them for misuse of customer funds. Another example would be Al Capone; it doesn't matter that he was a mobster, because you can nail him for tax evasion.

[1]: https://www.cbsnews.com/news/andy-warhol-supreme-court-princ...


I think this is more a question of licensing content; sooner or later AI chatbots will have to license at least some of the content they are trained on.

But broadly speaking this is also a question of the "Open Web" and whether it will survive. Walled gardens like Facebook, Instagram, etc. are strong and pervasive, but the majority of people still use and rely on publicly open websites from the Open Web. If AI chatbots do not drive traffic to websites, then they are walled gardens too, and Microsoft, Google, or whoever will lock users in and try to squeeze them for money.


I didn't see NYT allege that - their lawsuit explains pre-training pretty accurately, I thought.


It's buried on page 37, #108. There probably are other examples in the lawsuit, but this one is sufficient.

> Synthetic search applications built on the GPT LLMs, including Bing Chat and Browse with Bing for ChatGPT, display extensive excerpts or paraphrases of the contents of search results, including Times content, that may not have been included in the model’s training set. The “grounding” technique employed by these products includes receiving a prompt from a user, copying Times content relating to the prompt from the internet, providing the prompt together with the copied Times content as additional context for the LLM, and having the LLM stitch together paraphrases or quotes from the copied Times content to create natural-language substitutes that serve the same informative purpose as the original. In some cases, Defendants’ models simply spit out several paragraphs of The Times’s articles.

https://www.courtlistener.com/docket/68117049/1/the-new-york...


Oh I see - yeah, that's the part of the lawsuit that's about retrieval augmented generation in Bing and ChatGPT's Browse mode.

It's a separate issue from the fact that the model can regurgitate its NYT training data.

There's a section on page 63 which helps clarify that:

    Defendants materially contributed to and directly assisted
    with the direct infringement perpetrated by end-users of the
    GPT-based products by way of: (i) jointly-developing LLM
    models capable of distributing unlicensed copies of Times
    Works to end-users; (ii) building and training the GPT LLMs
    using Times Works; and (iii) deciding what content is
    actually outputted by the GenAI products, such as grounding
    output in Times Works through retrieval augmented generation,
    fine-tuning the models for desired outcomes, and/or
    selecting and weighting the parameters of the GPT LLMs.
So they are complaining about models that are capable of distributing unlicensed copies (the regurgitated training data issue), the fact that the models were trained on NYT work at all, and the fact that the RAG implementation in Bing and ChatGPT Browse further creates "natural-language substitutes that serve the same informative purpose as the original".
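
For what it's worth, the "grounding" flow the complaint describes boils down to something like this (a hypothetical sketch; search_web and llm are made-up stand-ins, not OpenAI's actual APIs):

    # Hypothetical sketch of retrieval augmented generation as the
    # complaint describes it; search_web() and llm() are stand-ins.

    def search_web(query: str) -> list[str]:
        """Stand-in: return article excerpts matching the query."""
        return ["...excerpt of a Times article about " + query + "..."]

    def llm(prompt: str) -> str:
        """Stand-in: return a model completion for the prompt."""
        return "An answer stitched together from the supplied context."

    def grounded_answer(user_prompt: str) -> str:
        excerpts = search_web(user_prompt)   # copy content from the web
        context = "\n\n".join(excerpts)
        return llm(                          # model paraphrases/quotes the context
            f"Context:\n{context}\n\nQuestion: {user_prompt}\nAnswer:"
        )

The key point is that the article text is fetched and fed in per prompt, so this path doesn't depend on what's in the trained weights at all.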


Yep, you seem to be right. Google stores quotes from pages, and that's fair use. Again, I am not a lawyer, and I hadn't thought about this.


This is argued extensively in the lawsuit document.

A key argument the NYT is making is that part of the fair use test is whether the new product competes with the original.

They argue that ChatGPT et al DO compete with the original, in a way that harms the NYT business model.

One example they give: ChatGPT can reproduce recommendations made by the Wirecutter, without including the affiliate links that form the Wirecutter's main source of revenue - page 48 of https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20...


There are plenty of other services that strip affiliate links from content for users, such as ad blockers.

Notably, in both of those cases, the user is specifically asking for what Wirecutter thinks.

To me, that makes the infringing behavior clearly the fault of the user of the tool, not the tool itself.

When I saw this lawsuit announced, I assumed that large portions of content were being reproduced in response to generic queries. That isn't the case: every example I've seen from this lawsuit has prompts where the user specifically asks for the content. To me, any fault and liability rests on the user here.



