Associated Press clarifies standards around generative AI (niemanlab.org)
83 points by jyunwai 9 months ago | 36 comments



> OpenAI, the ChatGPT maker committed to paying to train its models on AP news stories going back to 1985

Why are they paying, or even asking for permission, to train on this data?


Despite the cynical responses here, there is actually a practical reason why OpenAI is paying for this: the AP News Archive is not available online to be crawled. See https://www.ap.org/content/archive

There's a reasonably strong argument that crawling public pages for "indexing" (aka learning) is fair use, based on the Google precedent and case law from the early 2000s.

The argument is much less strong if those records aren't available.


Because it’s gonna occasionally spit out direct uncredited quotes from AP articles, and they know it. There’s gonna be a whole new field of law cropping up around this stuff; getting the big players appeased will be important.

DALL-E helpfully slaps a ShutterStock logo on some creations, for example. https://twitter.com/amoebadesign/status/1534542037814591490


>DALL-E helpfully slaps a ShutterStock logo on some creations, for example

Seems like you could train the AI to recognize the logos and edit them out?
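
Roughly, for a known mark you could do something like this (a sketch only; the file names and threshold are invented, and real generated watermarks are usually distorted enough that you'd want a trained detector rather than simple template matching):

  import cv2
  import numpy as np

  # Hypothetical inputs: a generated image plus a clean crop of the logo.
  img = cv2.imread("generated.png")
  logo = cv2.imread("logo_template.png")

  # Slide the logo template over the image and score every position.
  scores = cv2.matchTemplate(img, logo, cv2.TM_CCOEFF_NORMED)
  _, best_score, _, top_left = cv2.minMaxLoc(scores)

  if best_score > 0.8:  # arbitrary confidence threshold
      h, w = logo.shape[:2]
      mask = np.zeros(img.shape[:2], dtype=np.uint8)
      mask[top_left[1]:top_left[1] + h, top_left[0]:top_left[0] + w] = 255
      # Fill the masked region from surrounding pixels.
      cleaned = cv2.inpaint(img, mask, 3, cv2.INPAINT_TELEA)
      cv2.imwrite("cleaned.png", cleaned)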


That you have to do it at all is evidence that it's recreating images from its training set, and that there could be copyright issues.


We know AI learns words or symbols from the training set; the training set is the only place it could possibly learn them, so that's not surprising. It's only meaningful to see a Shutterstock logo if the developers of the model are claiming Shutterstock images weren't used in the training.


>is evidence that its recreating images from its training set

No it isn't, lol. A Shutterstock logo is just more common ground for the model because of how often it appears in the dataset.


You don't see how it generating a registered trademark in its output is a problem? You think it's okay to monkey patch this one obvious case and call it good?


If you think it's the trademark that's the issue then removing it is sufficient.


They do not think it's only that one trademark that's the issue.


I think you misinterpreted parent's statement.


Shutterstock would say that the model is a derivative work of their images, much like a photograph of a painting is a derivative work of the painting despite being in a different medium.

They would say they are owed licensing fees regardless of whether the shutterstock logo appears in the output. The appearance of their logo in the output merely proves that the system's outputs are derived from their copyrighted images.


Wouldn’t the appearance of shutterstock’s mark suggest to the consumer that the provenance of the image was shutterstock, when it is not?

But upon the absence of the mark, shutterstock would argue that the provenance of the image WAS shutterstock?

I don’t think this is the salient feature of this phenomenon, but the proximity of these two arguments is interesting.


What if a user copied a line from AP into a Reddit comment and some LLM trained on it?


ruDALL-E is not DALL-E; it's a replication attempt trained on a Russian image-text dataset and is not affiliated with OpenAI. The original DALL-E was trained on a private dataset and wouldn't have watermarks.


Because they see the lawsuits and shifts in public perception. Now that they've "cheated" their way to the top, they can play by the rules to entrench themselves so others can't catch up.


Exactly, regulatory capture.


Interesting, isn't it? I'm sure there's some legal or PR reason or something, but IMHO the more important part is the acknowledgment of the problem: the current copyright system doesn't work, and something is needed to compensate people for their work.

The internet shook the system, but content producers were able to adapt, albeit with lower-quality content as a result.

However, with the rise of AI, the system has completely shattered. Previously, someone reading the content and telling others about it wasn't a problem that breaks the compensation scheme for content producers, but with ChatGPT and the like, we have a situation where this "person" can tell literally everyone about it. Some new compensation scheme is needed, and OpenAI is probably trying to act as the "nice guy" to head off a scheme that might limit its ability to consume other people's content.


That's nice. So OpenAI will pay you if you're big enough, but if you're a small fry they won't. Good to know.


I'm sure I'll get 53 cents from a class action lawsuit somewhere down the line.


I think it's safe to assume that a cost minimization is taking place, as would be expected.


Copyright law doesn’t quite have a handle on machine learning training rules but it will. Belligerence at this point will only encourage stronger rules in the future.


Doesn't matter as long as we give AGI to the masses first.


So they can make a local copy of all AP stories before training the AI. The AI is a derivative work, and it's unclear whether it's distinct enough to be a problem or not. Finally, some of what ChatGPT spits out is going to very closely resemble AP stories, which may itself be a problem.

This stuff is a legal minefield; as a for-profit company building its core product, it's very difficult to argue each of these is fair use, etc. Though I'm sure that argument will be made, it's risky when dealing with companies whose business model centers on IP.


Because they don't own it? AFAIK, ChatGPT's model isn't open source, so they seem to know data has some value.


Seems like a very reasonable policy. I wish that other companies would adopt similar policies instead of trying to stuff AI in at every chance.


I'm about to go to the Online News Association (ONA) conference, which starts this week. I'm really interested to see what is said at the various sessions about generative AI. Last year generative AI got a mention, but that was before ChatGPT officially launched. I'm sure this year will be very different, as different organizations embrace or fend off this undeniable new development. Some orgs/journalists will surely use it to churn out more articles in less time. Others will probably adopt a purist stance and eschew it entirely.

There are good reasons to keep confidential info out of LLMs that you don't control, but I'd think it would make sense for anyone to run text through a locally-hosted LLM for editing suggestions and the like.
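
E.g., something along these lines against a locally hosted model (just a sketch, assuming an Ollama server on its default port; the model name and prompt are placeholders, and any local HTTP-served model would work the same way):

  import requests

  def suggest_edits(text: str) -> str:
      """Ask a locally hosted model for copy-editing suggestions."""
      resp = requests.post(
          "http://localhost:11434/api/generate",  # Ollama's default endpoint
          json={
              "model": "llama3",  # whatever model you've pulled locally
              "prompt": "Suggest edits for clarity and grammar:\n\n" + text,
              "stream": False,
          },
          timeout=120,
      )
      resp.raise_for_status()
      return resp.json()["response"]

  print(suggest_edits("The AP clarified it's standards around generative AI."))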


I'm interested in the style guide point about using non-gendered pronouns for LLMs. If I were to write a style guide, I would say, "use the gender appropriate for the persona designed by the company": for example, Siri-the-LLM would be "she," but ChatGPT or Sydney would be "it"; a male-persona LLM would be "he," an explicitly nonbinary one would be "they," etc. Respect the company's style guide and so on.

But I can see the potential for harm in over-humanizing by a news outlet. I'd be interested to hear what their decision-making process was for this point, if it was obvious or if they went back and forth, what arguments they had for which direction, etc.


The core problem surrounding LLMs is personification. Narratives that surround LLMs have failed to draw a clear distinction between what they hope to be, and what they are.

An LLM hopes to be an "intelligence" that can understand and manipulate text along the logical boundaries of language, and do so intentionally.

What an LLM really is, is an inference model that can reorganize text across boundaries that closely "align to" real language patterns. This is accomplished by creating a completely new pattern (the model) from inferring whatever patterns already exist in the training corpus' text.
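
At a toy scale, that claim looks like this (a deliberately trivial bigram sketch, nothing like a real LLM, just the shape of the idea): the "model" is statistics inferred from the corpus, and "generation" reorganizes those statistics into new text.

  import random
  from collections import defaultdict

  corpus = "the cat sat on the mat and the cat ate".split()

  # "Training": record which word tends to follow which (the inferred pattern).
  model = defaultdict(list)
  for prev, nxt in zip(corpus, corpus[1:]):
      model[prev].append(nxt)

  # "Inference": reorganize those corpus statistics into new text.
  word, out = "the", ["the"]
  for _ in range(6):
      word = random.choice(model.get(word, corpus))
      out.append(word)
  print(" ".join(out))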

A Large Language Model (LLM) serves as an alternative to true language comprehension. It is not an equivalent replacement, nor does it intelligently navigate itself with any explicit intent.

The act of "intelligently navigating the content of language" is at the core of journalism. It's incredibly important for journalists to both recognize and articulate the difference between an Artificial Intelligence realized and any technology in the category of AI pursuit.


> Respect the company's style guide etc.

Really? I would have said the opposite - journalists have no obligation to parrot what companies' marketing departments feed them, and in fact usually ought to do the opposite.

Russia might name its invasion of Ukraine "Anti-Nazi Operation Freedom Eagle" but you wouldn't expect a war correspondent to repeat such obvious propaganda. In general, journalists have no obligation to follow companies' and governments' naming preferences.


Well, that's true, but if, for example, the branding guide says to spell Reddit with a lowercase r ("reddit"), then you do that. If they then change it and want "Reddit," then you do that, too.

Or LEGO gets written in all caps.

To me, the style guide decision to conform to the desired persona when describing it is along these lines.


An LLM is always an it.


A bunch of very reasonable conclusions that would have been more obvious in the first place if we would just stop calling these tools "Artificial Intelligence" and start calling them what they are.


A replacement for artists, on the other hand... (the graphic for the article was quite obviously AI generated.)


Nieman Lab != Associated Press


Hey user! Could you explain it more, for the easy understandability of the members?




