Despite the cynical responses here, there is actually a practical reason why OpenAI is paying for this: the AP News Archive is not available online to be crawled. See https://www.ap.org/content/archive
There's a reasonably strong argument that crawling public pages for "indexing" (aka learning) is fair use, based on the Google precedent and case law from the early 2000s.
The argument is much less strong if those records aren't available.
Because it’s gonna occasionally spit out direct uncredited quotes from AP articles, and they know it. There’s gonna be a whole new field of law cropping up around this stuff; getting the big players appeased will be important.
We know AI learns words or symbols from the training set; the training set is the only place it could possibly learn them, so that's not surprising. It's only meaningful to see a Shutterstock logo if the developers of the model are claiming Shutterstock images weren't used in the training.
You don't see how it generating a registered trademark in its output is a problem? You think it's okay to monkey patch this one obvious case and call it good?
Shutterstock would say that the model is a derivative work of their images, much like a photograph of a painting is a derivative work of the painting despite being in a different medium.
They would say they are owed licensing fees regardless of whether the Shutterstock logo appears in the output. The appearance of their logo in the output merely proves that the system's outputs are derived from their copyrighted images.
ruDALL-E is not DALL-E; it's a replication attempt trained on a Russian image-text dataset and is not affiliated with OpenAI. The original DALL-E was trained on a private dataset and wouldn't have watermarks.
Because they see the lawsuits and shifts in public perception. Now that they've "cheated" their way to the top, they can play by the rules to entrench themselves so others can't catch up.
Interesting, isn't it? I'm sure there's some legal or PR reason behind it, but IMHO the more important part is the acknowledgement of the problem: the current copyright system doesn't work, and something is needed to compensate creators for their work.
The internet has shaken the system but the content producers were able to adapt, albeit resulting in lower quality content.
However, with the rise of AI, the system has completely shattered. Previously, someone reading the content and telling others about it wasn't a problem that broke the compensation scheme for content producers, but with ChatGPT and similar tools we have a situation where this "person" can tell literally everyone about it. Some new compensation scheme is needed, and OpenAI is probably trying to act as the "nice guy" to preempt an urgently imposed scheme that might limit their ability to consume other people's content.
Copyright law doesn’t quite have a handle on machine learning training rules but it will. Belligerence at this point will only encourage stronger rules in the future.
So they can make a local copy of all AP stories before training the AI. The AI is a derivative work, and it's unclear whether it's distinct enough to avoid being a problem. Finally, some of what ChatGPT spits out is going to very closely resemble AP stories, which may itself be a problem.
This stuff is a legal minefield; for a for-profit company building its core product, it's very difficult to argue that each of these uses is fair use. Though I am sure that argument will be made, it's risky when dealing with companies whose business model centers around IP.
I'm about to go to the Online News Association (ONA) conference, which starts this week. I am really interested to see what is said at the various sessions about generative AI. Last year there was a mention of generative AI, but this was before ChatGPT officially launched. I'm sure this year will be very different, as different organizations embrace or fend off this undeniable new development. Some orgs/journalists will surely use it to churn out more articles in less time. Others will probably adopt a purist stance and eschew it entirely.
There are good reasons to keep confidential info out of LLMs that you don't control, but I'd think it would make sense for anyone to run text through a locally-hosted LLM for editing suggestions and the like.
I'm interested in the style guide point about using non-gendered pronouns for LLMs - if I were to write a style guide, I would say, "use the gender appropriate for the persona designed by the company," for example Siri-the-llm would be "she" but ChatGPT or Sydney would be "it," a male persona llm would be "he," an explicitly nonbinary one would be "they," etc. Respect the company's style guide etc.
But I can see the potential for harm in over-humanizing by a news outlet. I'd be interested to hear what their decision-making process was for this point, if it was obvious or if they went back and forth, what arguments they had for which direction, etc.
The core problem surrounding LLMs is personification. Narratives that surround LLMs have failed to draw a clear distinction between what they hope to be, and what they are.
An LLM hopes to be an "intelligence" that can understand and manipulate text along the logical boundaries of language; and do so intentionally.
What an LLM really is, is an inference model that can reorganize text across boundaries that closely "align to" real language patterns. This is accomplished by creating a completely new pattern (the model) by inferring whatever patterns already exist in the training corpus' text.
A Large Language Model (LLM) serves as an alternative to true language comprehension. It is not an equivalent replacement, nor does it intelligently navigate itself with any explicit intent.
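To make the distinction concrete, here is a deliberately tiny sketch (my own illustration in Python, not anything from the article, and not how real LLMs work internally): a toy bigram model that infers word-succession patterns from a small corpus and then reorganizes them. Real LLMs are vastly larger neural networks, but the relationship is conceptually similar: the model is a new artifact distilled from patterns in the training text, with no intent behind its outputs.

    from collections import defaultdict
    import random

    # Toy illustration only: a bigram "language model" built purely by
    # counting which word follows which in a tiny made-up corpus.
    corpus = "the cat sat on the mat the cat saw the bird".split()

    # Infer the pattern: for each word, record its observed successors.
    successors = defaultdict(list)
    for current, following in zip(corpus, corpus[1:]):
        successors[current].append(following)

    def generate(start, length=6):
        """Sample a continuation by recombining observed patterns."""
        word, out = start, [start]
        for _ in range(length):
            options = successors.get(word)
            if not options:  # no observed continuation: the model is stuck
                break
            word = random.choice(options)
            out.append(word)
        return " ".join(out)

    print(generate("the"))  # e.g. "the cat sat on the mat the"

The toy model can only ever emit words it has seen, arranged along boundaries it has inferred; there is no comprehension or intent anywhere in it, which is the point being made above.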
The act of "intelligently navigating the content of language" is at the core of journalism. It's incredibly important for journalist to both recognize and articulate the difference between an Artificial Intelligence realized, and any technology in the category of AI pursuit.
Really? I would have said the opposite - journalists have no obligation to parrot what companies' marketing departments feed them, and in fact usually ought to do the opposite.
Russia might name its invasion of Ukraine "Anti-Nazi Operation Freedom Eagle" but you wouldn't expect a war correspondent to repeat such obvious propaganda. In general, journalists have no obligation to follow companies' and governments' naming preferences.
Well, that's true, but for example if the branding guide says to write Reddit with a lowercase r (reddit), then you do that. If they then change it and want "Reddit", then you do that, too.
Or LEGO gets written in all caps.
To me, the style guide decision to conform to the desired persona when describing it is along these lines.
A bunch of very reasonable conclusions that would have been more obvious in the first place if we just stopped calling these tools "Artificial Intelligence" and instead called them what they are.
Why are they paying, or even asking for permission, to train on data?