
> AI models will make 1:1 copies of training data where artists [...]

In general I don't think this is the case, assuming you mean generations output from popular text-to-image models. (edit: replied before their comment was edited to include the part on text generation models)

For DALL-E 2: I've never seen anyone able to provide a link of supposed copying. Even if you specifically ask it for some prominent work, you get a rendition not particularly closer than what a human artist could do: https://i.imgur.com/TEXXZ4a.png

For Stable Diffusion: it's true that Google researchers managed to find 109 "near-copies of training examples", by generating hundreds of millions of images from the captions of the most-duplicated training images and selecting candidates with techniques like CLIP-embedding similarity. But I'd speculate that, particularly if you're using the model normally and not peeking inside to intentionally get it to regurgitate, this is still probably lower than the human baseline rate of intentional/accidental copying. It does at least seem lower than the intra-training-set rate: https://i.imgur.com/zOiTIxF.png (though many of those may be properly-authorized derivative works)
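
For concreteness, that kind of near-copy detection mostly comes down to comparing image embeddings. A minimal Python sketch using CLIP via the Hugging Face transformers library (the file names and the 0.95 threshold are made up for illustration, and this isn't the exact pipeline from the paper):

  # Flag a generated image as a possible near-copy of a training image by
  # comparing CLIP image embeddings. The model name is real; the file paths
  # and the 0.95 threshold are illustrative placeholders.
  import torch
  from PIL import Image
  from transformers import CLIPModel, CLIPProcessor

  model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
  processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

  def clip_embedding(path):
      image = Image.open(path).convert("RGB")
      inputs = processor(images=image, return_tensors="pt")
      with torch.no_grad():
          features = model.get_image_features(**inputs)
      return features / features.norm(dim=-1, keepdim=True)

  generated = clip_embedding("generated.png")
  training = clip_embedding("training_example.png")
  similarity = (generated @ training.T).item()  # cosine similarity in [-1, 1]
  if similarity > 0.95:
      print(f"possible near-copy (similarity {similarity:.3f})")

In practice you'd pair this with a pixel- or patch-level distance check, since a high CLIP similarity on its own can just mean "same subject, same style".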




The more degrees of freedom, the less likely it is that independent creation rather than copying occurred.

LLMs recreating training material causes real issues, such as the PII leaks Google has had to deal with: https://ai.googleblog.com/2020/12/privacy-considerations-in-...

> If one prompts the GPT-2 language model with the prefix “East Stroudsburg Stroudsburg...”, it will autocomplete a long block of text that contains the full name, phone number, email address, and physical address of a particular person whose information was included in GPT-2’s training data.
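
Mechanically that's just ordinary autocomplete; a rough sketch of the same kind of prefix prompting with the Hugging Face transformers library (the sampling settings here are illustrative, not the ones used in the extraction paper):

  # Prompt GPT-2 to continue a prefix and print whatever it autocompletes.
  # Sampling settings are illustrative, not the paper's attack parameters.
  from transformers import GPT2LMHeadModel, GPT2Tokenizer

  tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
  model = GPT2LMHeadModel.from_pretrained("gpt2")

  prompt = "East Stroudsburg Stroudsburg..."
  input_ids = tokenizer(prompt, return_tensors="pt").input_ids
  output = model.generate(input_ids, max_length=100, do_sample=True, top_k=40)
  print(tokenizer.decode(output[0], skip_special_tokens=True))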


Privacy, where there's a problem if any of the original data can be inferred at all (even via a white-box attack against the model), is a higher bar than whether an image generator avoids copyright-infringing output under non-adversarial usage. Additionally, compared to image data, text is more prone to exact matches: it's lower-dimensional, and text models are usually trained with less data per parameter.

While it's still a topic deserving of research and mitigation, by the time your information has been scooped up by Common Crawl and trained on by some LLM, it's probably already in many other places an attacker is more realistically likely to look (search engine caches, Common Crawl downloads, sites specifically for scooping credentials, ...) before trying to extract it from the LLM.


The privacy issue isn’t just about the data being available, since people’s names, addresses, and phone numbers are generally available anyway. The issue is if they show up as part of some meme chat, and then you, as the LLM creator, get sued because people start harassing them.

In terms of copyright infringement, the bar is quite low, and copying is a basic part of how these algorithms work. This may or may not be an issue for you personally, but it's a large land mine for commercial use, especially if you're independently creating one of these systems.


> The issue is if they show up as part of some meme chat and then you as the LLM creator get sued because people start harassing them.

This seems a more obscure concern than extraction of data.

> copying is a basic part of how these algorithms work

Do you mean during training/gradient descent, or reverse diffusion?


Given that the models are too small to contain enough information to reproduce anything with real fidelity, that's the only possibility: if one produces something similar to an original work, its similarity is fairly poor. Where it can do well is when the copyrighted material is something simple, like a Superman logo, but even then it's always slightly off.



