Normally when I generate content it’s from my brain and I can tell the difference between copying memorized content, re-expressing memorized content, and generating something original. How do I know what the LLM is doing?
Are you sure? If you look at plagiarism in music, you'll find a number of cases where the defendant makes a compelling point about not remembering or consciously knowing they had heard the original song before (George Harrison's "My Sweet Lord" is the classic example of such subconscious copying). For legal purposes that doesn't matter, but they feel morally wronged by the guilty verdict. The point is that they internalized the musical knowledge but forgot its source, so they can no longer make the distinction you claim to make. Natural selection shaped our brains to store information that seems useful, not its attribution.
LLMs are also not usually trained to remember where their training examples came from; the sourcing information is often not even present in the data (maybe it could be, maybe it should be, but it isn't). Given that, and the way training works, one could argue they're never copying, only re-expressing or combining (which I think of as a form of "generating something original"). Pure memorization and copying is overfitting, which is strongly undesirable because it isn't usable outside the exact source context. I agree it can happen, but it's a flaw in the training process. I'd also agree that any exact reproduction (or output whose similarity to the original exceeds some high threshold) is indeed copyright infringement, punishable as such.
So, my point is: training a model on copyrighted material is legal, but letting that model output copies of copyrighted material beyond fair use (quotations, references, etc. that make sense in the context the model was queried in) is an infringement. And since the actual training data is not necessarily known, providers of model-as-a-service, such as OpenAI with GPT, should be responsible for that.
In cases where the model itself is made available to others, that responsibility falls on the user of the model. If the training data is available, they should check outputs against it (there's a whole discussion to be had about how training data should be published to support this) to avoid the risk; if the training data is unknown, they're taking the full risk of being sued, without any mitigation.
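To make "check outputs against the training data" concrete, here is a minimal sketch in Python, assuming the crudest possible test: flag any output that shares a long verbatim n-gram span with a known training document. The 12-token window, the whitespace tokenization, and the function names are illustrative assumptions on my part, not a legal standard or any provider's actual pipeline.

```python
# Crude reproduction check: flag a model output if it shares a long
# verbatim n-token span with any known training document.
# Window size, tokenization, and names here are illustrative assumptions.

def ngrams(tokens, n):
    """Return the set of all consecutive n-token windows as tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def flag_possible_reproduction(output, corpus, n=12):
    """Return True if the output shares any n-token span verbatim
    with some document in the corpus."""
    out_grams = ngrams(output.lower().split(), n)
    if not out_grams:
        return False
    return any(out_grams & ngrams(doc.lower().split(), n) for doc in corpus)

# Toy example: the generation embeds a 12+ token span from the "training" doc.
training_docs = [
    "the quick brown fox jumps over the lazy dog near the river bank at dawn"
]
generation = ("He wrote: the quick brown fox jumps over the lazy dog "
              "near the river bank at dawn today.")
print(flag_possible_reproduction(generation, training_docs))  # True
```

A real check would need fuzzier matching (paraphrase and near-duplicate detection, not just verbatim overlap) and an index over the corpus rather than a linear scan, but the principle is the same: without access to the training data, even this crude mitigation is impossible.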