There's a distinction between "learning from" and "copying". "Learning from" is a transformative process that distills information from the observation. This distillation can be as simple as indexing for a search engine, or as complex as a deep neural network.
Simply because a neural network can create something that is a copyright violation doesn't mean the training process itself is one.
A human can see an advertisement for a Marvel movie and then reproduce the Marvel logo. Redistributing that logo (and possibly the act of reproducing it) is a copyright violation, but the learning process isn't.
The neural network is a tool.
It's reasonable to be concerned about the loss of employment for people affected by generative AI. But I think that's a separate issue from the copyright argument.
> There's a distinction between "learning from" and "copying".
Neural nets can memorize their training data. Generally that isn't what you want, and you strive to eliminate it. However, it could instead be encouraged to happen if someone wanted to exploit this law in order to abuse copyrights.
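To make "memorize" concrete, here's a toy sketch. It's a plain lookup table rather than a real neural net, but an overfit net can end up behaving the same way, with the copy smeared across its weights:

    # A "model" whose capacity dwarfs its training data memorizes it outright.
    from collections import defaultdict

    text = "This sentence is the model's entire training set."
    K = 8  # context length; large relative to the data, so every context is unique

    model = defaultdict(list)
    for i in range(len(text) - K):
        model[text[i:i + K]].append(text[i + K])

    # "Generate" from the opening context: the output is the source, verbatim.
    out = text[:K]
    while out[-K:] in model:
        out += model[out[-K:]][0]

    print(out == text)  # True - pure memorization

When the training data is small relative to the model, the cheapest thing to "learn" is the data itself, which is exactly what you'd encourage if you wanted to abuse the law this way.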
The law applies to the training of a neural network; you're not depriving the copyright holder of his intellectual property. If you use a copy of his work, he still owns the copyright regardless of whether you copy it by right-click > copy or by overfitting a generative model.
Humans can memorize their training data too... aka see something and then produce a copy (code, drawing, music, etc.). The principles underlying how LLMs and humans learn aren't really that different... just different levels of loss/fuzziness.
Yes, and as GP suggested, going on to distribute copies would be copyright infringement. That doesn't imply that it's an infringement to train the neural net.
Humans learn from copyrighted works as a matter of standard training. And certainly humans can memorize those works and replicate them – and we rely on the legal system to ensure that they don't monetize them.
The same will apply to neural nets. They can learn from others, but must produce new works of art sufficiently distinct from what they've learned.
> A human can see an advertisement for a Marvel movie and then reproduce the Marvel logo. Redistributing that logo (and possibly the act of reproducing it) is a copyright violation, but the learning process isn't.
I don't think that's correct. That might be trademark infringement, if the logo is a registered trademark, but "seeing something and then drawing it" is in general not copyright infringement.
Drawing a copy of a copyrighted picture from memory, and then distributing that copy, would certainly normally be copyright infringement. (A logo may not be enough of a creative work to be copyrightable, but I assume that's not what you're getting at).
> Drawing a copy of a copyrighted picture from memory, and then distributing that copy, would certainly normally be copyright infringement.
In US law, there is a nuance between Copyright and Trademark.
> Drawing a copy of a copyrighted picture from memory, and then distributing that copy
Would not necessarily be copyright infringement (it depends on a judge). For example, that's why Taylor Swift is able to re-record her music (the copyright is owned by a recording studio) as-is, and can distribute the new version as "Taylor's Version": she owns the copyright on the new version.
> (A logo may not be enough of a creative work to be copyrightable, but I assume that's not what you're getting at).
A logo is actually MORE protectable, through Trademark. Trademark is significantly MORE protected than Copyright.
In your example, if someone draws a logo from memory, they actually own the copyright on their drawing, but it is still Trademark infringement and the trademark owner will be protected.
> In US law, there is a nuance between Copyright and Trademark.
It's not a nuance, it's a completely separate legal regime, and not what this conversation is about.
> Would not necessarily be copyright infringement (it depends on a judge).
Every law can be challenged in court, but a copy of a picture as-is is a pretty clear-cut case.
> For example, that's why Taylor Swift is able to re-record her music (the copyright is owned by a recording studio) as-is, and can distribute the new version as "Taylor's Version": she owns the copyright on the new version.
Nope. She's able to because there is a compulsory license for covers of songs that have already been published - something very different from them not being protected by copyright - and/or because she owns some of the rights. She may well be paying royalties on them. That compulsory license regime is specific to recorded music and does not apply to pictures.
> A logo is actually MORE protectable, through Trademark. Trademark is significantly MORE protected than Copyright.
"More" is a simplification; trademark laws are quite different from copyright laws, stronger in some ways and weaker in others (e.g. you can lose a trademark by not enforcing it, whereas you cannot lose a copyright that way). In any case, that's a distraction from the current topic.
> A human can see an advertisement for a Marvel movie and then reproduce the Marvel logo. Redistributing that logo (and possibly the act of reproducing it) is a copyright violation, but the learning process isn't.
This then becomes about where the liability of that violation lies, and how attractive that is to companies.
A human "learning" the Marvel logo and reproducing it is a violation. How does OpenAI fit into this analogy?
The liability would lie with the company using the LLM product. This could mean that many companies won’t want to take on the risk unless there is decent tooling around warnings of infringement and listing of sources.
I think liability lies with the person who uses the product to violate copyright. The hosting / producing company didn’t violate copyright if I use their model to make Mickey Mouse pictures. I did.
You can’t; I think Fair Use is a fundamentally subjective judgement combining how transformative the work is with the intent and impact of it being distributed.
Normally when I generate content it’s from my brain and I can tell the difference between copying memorized content, re-expressing memorized content, and generating something original. How do I know what the LLM is doing?
Are you sure? If you look at plagiarism in music, you'll find a number of cases where the defendant makes a compelling point about not remembering or consciously knowing they heard the original song before. For legal purposes that is beside the point, but they feel morally wronged at being found guilty. The case here is that they internalized the musical knowledge but forgot about the source - so they can't make the distinction you claim anymore. Natural selection shaped our brains to store information that seems useful, not its attribution.
LLMs are also not usually trained to remember where the examples they were trained on came from; the sourcing information is often not even there (maybe they could be, maybe they should be, but they aren't). Given that, and the way training works, one could argue that they're never copying, only re-expressing or combining (which I think of as a form of "generating something original"). Just memorizing and copying is overfitting, and strongly undesirable, as it's not usable outside the exact source context. I agree it can happen, but it's a flaw in the training process. I'd also agree that any instance of exact reproduction (or of material with similarity to the original content over some high threshold) is indeed copyright infringement, punishable as such.
So, my point is, training a model on copyrighted material is legal, but letting that model output copies of copyrighted material beyond fair use (quotations, references, etc - that make sense in the context the model was queried on) is an infringement. And since the actual training data is not necessarily known, providers of model-as-a-service, such as OpenAI with GPT, should be responsible for that.
In cases where a model was made available to others, it falls on the user of the model. If the training data is available, they should check answers against it (there's a whole discussion on how training data should be published to support this) to avoid the risk; if the training data is unknown, they're taking the risk of being sued full-on, without any mitigation.
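A rough sketch of the kind of check I mean, assuming the training corpus is available as plain text (the names here are mine, not any existing tool's; a real system would need normalization, fuzzy matching, and a legally informed threshold):

    def longest_verbatim_overlap(output, source):
        # Longest run of consecutive output words that also appears verbatim in the source.
        words = output.split()
        best = ""
        for i in range(len(words)):
            for j in range(i + 1, len(words) + 1):
                span = " ".join(words[i:j])
                if span not in source:
                    break  # a longer span starting at i can't match either
                if j - i > len(best.split()):
                    best = span
        return best

    corpus = "the quick brown fox jumps over the lazy dog"      # stand-in for training data
    generated = "my model says the quick brown fox jumps high"  # stand-in for a model output
    match = longest_verbatim_overlap(generated, corpus)
    if len(match.split()) >= 5:  # where "quotation" ends is a legal judgement, not a technical one
        print("flag for review:", match)  # -> flag for review: the quick brown fox jumps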
> A human "learning" the Marvel logo and reproducing it is a violation
Not quite; it's really in the resale or redistribution that the violation occurs. Painting an image of the Hulk to hang in your living room wouldn't really be a violation; selling that painting could be; turning it into merch and selling that would wholeheartedly be; and trying to pass it off as official merch is without question a violation.
I strongly disagree with this. We shouldn't create new laws for new technology by making analogies to what's allowed under old laws designed for old technology. If we did, we would never have come up with copyright in the first place.
600 years ago, people were allowed to hand-copy entire books, so they should be able to do it with a printing press right? It's "just a tool"!
The correct way to think about this is to recognize that society needs people to create training data as well as people to train models. If we don't reward the people who create training data, we disincentivize them from doing so, and we'll end up in a world where we don't have enough of it.
I don't think the comparison with human learning holds.
NNs and humans don't learn the same way - humans can fairly quickly generalise what they have learned and, most importantly, go beyond what they've learned. I haven't seen that happen with neural networks or GPTs; at best, you're getting the average of what they have 'learned'. There's human learning and there's neural network 'learning', and they're different things.
> AlphaFold recognizes a 3D structure of the examined amino acid sequence by a similarity of this sequence (or its parts) to related sequences with already known 3D structures
I'm sure Google represents strings of text from pages in some internal format, but relatively verbatim. Even represented verbatim, because their output is a search result and not an article that uses the copyrighted text verbatim, there's no copyright violation.
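To illustrate what such an internal format can look like, here's a toy inverted index - my own sketch, not anything Google actually runs. The page text gets shredded into term -> document postings, and a query returns document IDs rather than the copyrighted prose:

    from collections import defaultdict

    pages = {
        "page1": "copyright law protects creative works",
        "page2": "neural networks learn from creative works",
    }

    index = defaultdict(set)
    for doc_id, text in pages.items():
        for term in text.split():
            index[term].add(doc_id)

    # The query result points at documents; it doesn't reproduce them.
    print(sorted(index["creative"] & index["works"]))  # ['page1', 'page2']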
And models don't even use data verbatim; if they do, they're bad/overfitted models. People are making all sorts of arguments, but they seem to boil down to "it's fine if humans do it but if a machine does it then it's copyright violation".
People often disregard the fact that copyright law is woefully outdated (an absolute joke in itself, which can't be used to defend anything since Disney shoved its whole fist up copyright law's...) and should really be extended for the modern world. Why can't we handle copyright for ML models? Why can't animals have copyright? It's not hard to handle these cases; the point of copyright is usage, and agency comes into play.
If people want to be biased against machines, then fine; be racist to machines, and maybe in 2100 or so those people will get their comeuppance. But if an ML model isn't allowed to learn from something and use that knowledge without reproducing it verbatim, then why is predictive text on phone keyboards allowed?
Everyone out here acting like they're from the Corporation Rim.