
> AI models will make 1:1 copies of training data where artists [...]

In general I don't think this is the case, assuming you mean generations output from popular text-to-image models. (edit: replied before their comment was edited to include the part on text generation models)

For DALL-E 2: I've never seen anyone able to provide a link of supposed copying. Even if you specifically ask it for some prominent work, you get a rendition not particularly closer than what a human artist could do: https://i.imgur.com/TEXXZ4a.png

For Stable Diffusion: it's true that Google researchers managed to find 109 "near-copies of training examples", by generating hundreds of millions of images from the captions of the most-duplicated training images and selecting candidates with techniques like CLIP-embedding similarity. But I'd speculate that, particularly if you're using the model normally and not peeking inside to intentionally get it to regurgitate, this is still probably lower than the human baseline rate of intentional/accidental copying. It does at least seem lower than the intra-training-set rate: https://i.imgur.com/zOiTIxF.png (though many of those may be properly-authorized derivative works)
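
For concreteness, that kind of near-copy detection mostly comes down to comparing image embeddings. A minimal Python sketch using CLIP via the Hugging Face transformers library (the file names and the 0.95 threshold are made up for illustration, and this isn't the exact pipeline from the paper):

  # Flag a generated image as a possible near-copy of a training image by
  # comparing CLIP image embeddings. The model name is real; the file paths
  # and the 0.95 threshold are illustrative placeholders.
  import torch
  from PIL import Image
  from transformers import CLIPModel, CLIPProcessor

  model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
  processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

  def clip_embedding(path):
      image = Image.open(path).convert("RGB")
      inputs = processor(images=image, return_tensors="pt")
      with torch.no_grad():
          features = model.get_image_features(**inputs)
      return features / features.norm(dim=-1, keepdim=True)

  generated = clip_embedding("generated.png")
  training = clip_embedding("training_example.png")
  similarity = (generated @ training.T).item()  # cosine similarity in [-1, 1]
  if similarity > 0.95:
      print(f"possible near-copy (similarity {similarity:.3f})")

In practice you'd pair this with a pixel- or patch-level distance check, since a high CLIP similarity on its own can just mean "same subject, same style".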




The more degrees of freedom, the less likely it is that independent creation rather than copying occurred.

LLMs recreating training material causes real issues, such as the PII leaks Google has had to deal with: https://ai.googleblog.com/2020/12/privacy-considerations-in-...

> If one prompts the GPT-2 language model with the prefix “East Stroudsburg Stroudsburg...”, it will autocomplete a long block of text that contains the full name, phone number, email address, and physical address of a particular person whose information was included in GPT-2’s training data.
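
Mechanically that's just ordinary autocomplete; a rough sketch of the same kind of prefix prompting with the Hugging Face transformers library (the sampling settings here are illustrative, not the ones used in the extraction paper):

  # Prompt GPT-2 to continue a prefix and print whatever it autocompletes.
  # Sampling settings are illustrative, not the paper's attack parameters.
  from transformers import GPT2LMHeadModel, GPT2Tokenizer

  tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
  model = GPT2LMHeadModel.from_pretrained("gpt2")

  prompt = "East Stroudsburg Stroudsburg..."
  input_ids = tokenizer(prompt, return_tensors="pt").input_ids
  output = model.generate(input_ids, max_length=100, do_sample=True, top_k=40)
  print(tokenizer.decode(output[0], skip_special_tokens=True))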


Privacy, where there's a problem if any of the original data can be inferred at all (even via a white-box attack against the model), is a higher bar than whether an image generator avoids copyright-infringing output under non-adversarial usage. Additionally, compared to image data, text is more prone to exact matches: it's lower-dimensional, and text models are usually trained with less data per parameter.

While it's still a topic deserving of research and mitigation, by the time your information has been scooped up by Common Crawl and trained on by some LLM, it's probably already in many other places an attacker is more realistically likely to look (search engine caches, Common Crawl downloads, sites specifically for scooping credentials, ...) before trying to extract it from the LLM.


The privacy issue isn’t just about the data being available, since people’s names, addresses, and phone numbers are generally available anyway. The issue is if they show up as part of some meme chat, and then you, as the LLM creator, get sued because people start harassing them.

In terms of copyright infringement, the bar is quite low, and copying is a basic part of how these algorithms work. This may or may not be an issue for you personally, but it's a large land mine for commercial use, especially if you're independently creating one of these systems.


> The issue is if they show up as part of some meme chat and then you as the LLM creator get sued because people start harassing them.

This seems a more obscure concern than extraction of data.

> copying is a basic part of how these algorithms work

Do you mean during training/gradient descent, or reverse diffusion?


Given that the models are too small to contain enough information to reproduce anything with real fidelity, that's the only possibility: if one produces something similar to an original work, its similarity is fairly poor. Where it can do well is when the copyrighted material is something simple, like a Superman logo, but even then it's always slightly off.



