What is "it"? Training can be fair use, i.e. updating weights incrementally based on predicted next token probabilities. And I (not a lawyer) think that if a broadly trained foundation model can recall some verbatim text, that doesn't mean the model is infringing.
It seems like the lawsuit here is talking about specific NYT-related functionality, like retrieving the text of new articles. That essentially has nothing to do with training and running a foundation LLM; it's about some specific shenanigans with NYT content, and its legal status would appear to have nothing to do with whether training is fair use.
Good luck trying to explain "updating weights incrementally based on predicted next token probabilities" to completely non-technical lawyers and judges.
Good thing they don't have to. As I've said before, this sleight of hand to talk about the case as if it's about training is a great move by OpenAI; however, the case is about more than just training. NYT is alleging that OpenAI maintains a database of NYT works and refers to it verbatim (per-prompt) as part of the ChatGPT software, and not via some pre-trained weights.
This is akin to having to explain to non-technical lawyers and judges how crypto works. In the FTX case that became irrelevant, because you could just nail them on fraud for using deposited funds for unauthorized purposes.
> NYT is alleging that OpenAI maintains a database of NYT works and refers to it verbatim (per-prompt) as part of the ChatGPT software, and not via some pre-trained weights.
So if ChatGPT didn't refer to it verbatim, and instead just trained on it and mixed it with other content, NYT would be OK with that? Tbh I don't get it.
It depends on the purpose of ChatGPT. If people use it as a substitute for the NYT, then yes, I suspect NYT will not be OK with it.
I also think the courts will side with NYT. Very recently there was a copyright case involving Andy Warhol [1], which the Warhol Foundation lost. Despite the artwork being visually transformative, its use was not transformative. So, to me, that means if you create a program using NYT's materials that is used as a replacement for NYT, it will not count as fair use. Obviously you could just do what, say, Google does and fork money over to NYT for a license.
However, my initial point is that this is a tangent. NYT has claimed that OpenAI is using NYT's works as-is, at minimum, so OpenAI can just be nailed for that. That's my point about FTX: it's irrelevant whether their exchange was legal, since you can just nail them for misuse of customer funds. Another example would be Al Capone: it doesn't matter that he was a mobster, because you can nail him for tax evasion.
I think this is more a question of licensing content; sooner or later AI chatbots will have to license at least some of the content they are trained on.
But broadly speaking this is also a question of the "Open Web" and whether it will survive. Walled gardens like Facebook, Instagram, etc. are strong and pervasive, but the majority of people still use and acknowledge publicly open websites from the Open Web. If AI chatbots do not drive traffic to websites, then they are walled gardens too, and Microsoft, Google, or whoever will lock users in and try to squeeze them for money.
It's buried on page 37, #108. There are probably other examples in the lawsuit, but this one is sufficient.
> Synthetic search applications built on the GPT LLMs, including Bing Chat and Browse with Bing for ChatGPT, display extensive excerpts or paraphrases of the contents of search results, including Times content, that may not have been included in the model’s training set. The “grounding” technique employed by these products includes receiving a prompt from a user, copying Times content relating to the prompt from the internet, providing the prompt together with the copied Times content as additional context for the LLM, and having the LLM stitch together paraphrases or quotes from the copied Times content to create natural-language substitutes that serve the same informative purpose as the original. In some cases, Defendants’ models simply spit out several paragraphs of The Times’s articles.
Oh I see - yeah, that's the part of the lawsuit that's about the retrieval-augmented generation in Bing and ChatGPT's Browse mode.
It's a separate issue from the fact that the model can regurgitate its NYT training data.
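For anyone unfamiliar, the "grounding" flow the complaint describes looks roughly like this. It's a hypothetical sketch; search_web, fetch_page, and llm_complete are illustrative stand-ins, not real APIs:

    def search_web(query: str) -> list[str]:
        # stand-in: a real system would query a search index here
        return ["https://example.com/article-1", "https://example.com/article-2"]

    def fetch_page(url: str) -> str:
        # stand-in: a real system would download the article text at answer time
        return f"(full article text fetched from {url})"

    def llm_complete(prompt: str) -> str:
        # stand-in: a real system would call the language model here
        return "(model output grounded in the fetched text)"

    def grounded_answer(user_prompt: str) -> str:
        urls = search_web(user_prompt)                 # 1. receive a prompt, run a search
        documents = [fetch_page(u) for u in urls[:3]]  # 2. copy matching content from the live web
        context = "\n\n".join(documents)
        # 3. hand the prompt plus the copied content to the LLM as extra context,
        #    so the answer is stitched from that content, not from trained weights
        return llm_complete(
            "Answer using only the sources below.\n\n"
            f"Sources:\n{context}\n\nQuestion: {user_prompt}"
        )

The key point is that the Times content reaches the user via the per-prompt fetch in step 2, not via anything stored in the model's weights, which is why it's a separate issue from training.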
There's a section on page 63 which helps clarify that:
> Defendants materially contributed to and directly assisted with the direct infringement perpetrated by end-users of the GPT-based products by way of: (i) jointly-developing LLM models capable of distributing unlicensed copies of Times Works to end-users; (ii) building and training the GPT LLMs using Times Works; and (iii) deciding what content is actually outputted by the GenAI products, such as grounding output in Times Works through retrieval augmented generation, fine-tuning the models for desired outcomes, and/or selecting and weighting the parameters of the GPT LLMs.
So they are complaining about three things: models that are capable of distributing unlicensed copies (the regurgitated training data issue), the fact that the models were trained on NYT work at all, and the fact that the RAG implementation in Bing and ChatGPT Browse further creates "natural-language substitutes that serve the same informative purpose as the original".
There are plenty of other services that strip affiliate links from content for users, such as ad blockers.
Notably, in both those cases, the user is specifically asking for what Wirecutter thinks.
To me, that makes the infringing behavior clearly the fault of the user of the tool, not the tool itself.
When I saw this lawsuit announced, I assumed that large portions of content were being reproduced in response to generic queries. That isn't the case: every example I've seen from this lawsuit has prompts where the user specifically asks for the content. To me, any fault and liability rests on the user here.