There's a distinction between "learning from" and "copying". "Learning from" is a transformative process that distills information from the observation. This distillation can be as simple as indexing for a search engine, or as complex as a deep neural network.
Simply because a neural network can create something that is a copyright violation doesn't mean the training process itself is one.
A human can see an advertisement for a Marvel movie and then reproduce the Marvel logo. Redistributing that logo (and possibly the act of reproducing it) is a copyright violation, but the learning process isn't.
The neural network is a tool.
It's reasonable to be concerned about the loss of employment for people affected by generative AI. But I think that's a separate issue from the copyright argument.
> There's a distinction between "learning from" and "copying".
Neural nets can memorize their training data. Generally that isn't what you want, and you strive to eliminate it. However, it could instead be encouraged to happen if someone wanted to exploit this law in order to abuse copyrights.
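To make "memorize" concrete, here's a toy sketch. It's a plain lookup table rather than a real neural net, but an overfit net can end up behaving the same way, with the copy smeared across its weights:

    # A "model" whose capacity dwarfs its training data memorizes it outright.
    from collections import defaultdict

    text = "This sentence is the model's entire training set."
    K = 8  # context length; large relative to the data, so every context is unique

    model = defaultdict(list)
    for i in range(len(text) - K):
        model[text[i:i + K]].append(text[i + K])

    # "Generate" from the opening context: the output is the source, verbatim.
    out = text[:K]
    while out[-K:] in model:
        out += model[out[-K:]][0]

    print(out == text)  # True - pure memorization

When the training data is small relative to the model, the cheapest thing to "learn" is the data itself, which is exactly what you'd encourage if you wanted to abuse the law this way.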
The law applies to the training of a neural network; you're not depriving the copyright holder of his intellectual property. If you use a copy of his work, he still owns the copyright regardless of whether you copy it by right-click > copy or by overfitting a generative model.
Humans can memorize their training data too... aka see something and then produce a copy (code, drawing, music, etc.). The principles underlying how LLMs and humans learn aren't really that different... just different levels of loss/fuzziness.
Yes, and as GP suggested, going on to distribute copies would be copyright infringement. That doesn't imply that it's an infringement to train the neural net.
Humans learn from copyrighted works as a matter of standard training. And certainly humans can memorize those works and replicate them – and we rely on the legal system to ensure that they don't monetize them.
The same will apply to neural nets. They can learn from others, but must produce new works of art sufficiently distinct from what they've learned.
> A human can see an advertisement for a Marvel movie and then reproduce the Marvel logo. Redistributing that logo (and possibly the act of reproducing it) is a copyright violation, but the learning process isn't.
I don't think that's correct. That might be trademark infringement, if the logo is a registered trademark, but "seeing something and then drawing it" is in general not copyright infringement.
Drawing a copy of a copyrighted picture from memory, and then distributing that copy, would certainly normally be copyright infringement. (A logo may not be enough of a creative work to be copyrightable, but I assume that's not what you're getting at).
> Drawing a copy of a copyrighted picture from memory, and then distributing that copy, would certainly normally be copyright infringement.
In US law, there is a nuance between Copyright and Trademark.
> Drawing a copy of a copyrighted picture from memory, and then distributing that copy
Would not necessarily be copyright infringement (it depends on a judge). For example, that's why Taylor Swift is able to re-record her music (the copyright is owned by a recording studio) as-is, and can distribute the new version as "Taylor's Version": she owns the copyright on the new version.
> (A logo may not be enough of a creative work to be copyrightable, but I assume that's not what you're getting at).
A logo is actually MORE protectable, through Trademark. Trademark is significantly MORE protected than Copyright.
In your example, if someone draws a logo from memory, they actually own the copyright on their drawing, but it is still Trademark infringement and the trademark owner will be protected.
> In US law, there is a nuance between Copyright and Trademark.
It's not a nuance, it's a completely separate legal regime, and not what this conversation is about.
> Would not necessarily be copyright infringement (it depends on a judge).
Every law can be challenged in court, but a copy of a picture as-is is a pretty clear-cut case.
> For example, that's why Taylor Swift is able to re-record her music (the copyright is owned by a recording studio) as-is, and can distribute the new version as "Taylor's Version": she owns the copyright on the new version.
Nope. She's able to because there is a compulsory license for covers of songs that have already been published - something very different from them not being protected by copyright - and/or because she owns some of the rights. She may well be paying royalties on them. That compulsory license regime is specific to recorded music and does not apply to pictures.
> A logo is actually MORE protectable, through Trademark. Trademark is significantly MORE protected than Copyright.
"More" is a simplification; trademark laws are quite different from copyright laws, stronger in some ways and weaker in others (e.g. you can lose a trademark by not enforcing it, whereas you cannot lose a copyright that way). In any case, that's a distraction from the current topic.
> A human can see an advertisement for a Marvel movie and then reproduce the Marvel logo. Redistributing that logo (and possibly the act of reproducing it) is a copyright violation, but the learning process isn't.
This then becomes about where the liability of that violation lies, and how attractive that is to companies.
A human "learning" the Marvel logo and reproducing it is a violation. How does OpenAI fit into this analogy?
The liability would lie with the company using the LLM product. This could mean that many companies won’t want to take on the risk unless there is decent tooling around warnings of infringement and listing of sources.
I think liability lies with the person who uses the product to violate copyright. The hosting / producing company didn’t violate copyright if I use their model to make Mickey Mouse pictures. I did.
You can’t; I think Fair Use is a fundamentally subjective judgement combining how transformative the work is with the intent and impact of it being distributed.
Normally when I generate content it’s from my brain and I can tell the difference between copying memorized content, re-expressing memorized content, and generating something original. How do I know what the LLM is doing?
Are you sure? If you look at plagiarism in music, you'll find a number of cases where the defendant makes a compelling point about not remembering or consciously knowing they heard the original song before. For legal purposes that is beside the point, but they feel morally wronged at being found guilty. The case here is that they internalized the musical knowledge but forgot about the source - so they can't make the distinction you claim anymore. Natural selection shaped our brains to store information that seems useful, not its attribution.
LLMs are also not usually trained to remember where the examples they were trained on came from; the sourcing information is often not even there (maybe they could be, maybe they should be, but they aren't). Given that, and the way training works, one could argue that they're never copying, only re-expressing or combining (which I think of as a form of "generating something original"). Just memorizing and copying is overfitting, and strongly undesirable, as it's not usable outside the exact source context. I agree it can happen, but it's a flaw in the training process. I'd also agree that any instance of exact reproduction (or of material with similarity to the original content over some high threshold) is indeed copyright infringement, punishable as such.
So, my point is, training a model on copyrighted material is legal, but letting that model output copies of copyrighted material beyond fair use (quotations, references, etc - that make sense in the context the model was queried on) is an infringement. And since the actual training data is not necessarily known, providers of model-as-a-service, such as OpenAI with GPT, should be responsible for that.
In cases where a model was made available to others, it falls on the user of the model. If the training data is available, they should check answers against it (there's a whole discussion on how training data should be published to support this) to avoid the risk; if the training data is unknown, they're taking the risk of being sued full-on, without any mitigation.
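A rough sketch of the kind of check I mean, assuming the training corpus is available as plain text (the names here are mine, not any existing tool's; a real system would need normalization, fuzzy matching, and a legally informed threshold):

    def longest_verbatim_overlap(output, source):
        # Longest run of consecutive output words that also appears verbatim in the source.
        words = output.split()
        best = ""
        for i in range(len(words)):
            for j in range(i + 1, len(words) + 1):
                span = " ".join(words[i:j])
                if span not in source:
                    break  # a longer span starting at i can't match either
                if j - i > len(best.split()):
                    best = span
        return best

    corpus = "the quick brown fox jumps over the lazy dog"      # stand-in for training data
    generated = "my model says the quick brown fox jumps high"  # stand-in for a model output
    match = longest_verbatim_overlap(generated, corpus)
    if len(match.split()) >= 5:  # where "quotation" ends is a legal judgement, not a technical one
        print("flag for review:", match)  # -> flag for review: the quick brown fox jumps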
> A human "learning" the Marvel logo and reproducing it is a violation
Not quite; it's really in the resale or redistribution that the violation occurs. Painting an image of the Hulk to hang in your living room wouldn't really be a violation; selling that painting could be; turning it into merch and selling that would wholeheartedly be; and trying to pass it off as official merch is without question a violation.
I strongly disagree with this. We shouldn't create new laws for new technology by making analogies to what's allowed under old laws designed for old technology. If we did, we would never have come up with copyright in the first place.
600 years ago, people were allowed to hand-copy entire books, so they should be able to do it with a printing press right? It's "just a tool"!
The correct way to think about this is to recognize that society needs people to create training data as well as people to train models. If we don't reward the people who create training data, we disincentivize them from doing so, and we'll end up in a world where we don't have enough of it.
I don't think the comparison with human learning holds.
NNs and humans don't learn the same way - humans can fairly quickly generalise what they have learned and, most importantly, go beyond what they've learned. I haven't seen that happen with neural networks or GPTs; at best, you're getting the average of what they have 'learned'. There's human learning and there's neural network 'learning', and they're different things.
> AlphaFold recognizes a 3D structure of the examined amino acid sequence by a similarity of this sequence (or its parts) to related sequences with already known 3D structures
I'm sure Google represents strings of text from pages in some internal format, but relatively verbatim. Even represented verbatim, because their output is a search result and not an article that uses the copyrighted text verbatim, there's no copyright violation.
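To illustrate what such an internal format can look like, here's a toy inverted index - my own sketch, not anything Google actually runs. The page text gets shredded into term -> document postings, and a query returns document IDs rather than the copyrighted prose:

    from collections import defaultdict

    pages = {
        "page1": "copyright law protects creative works",
        "page2": "neural networks learn from creative works",
    }

    index = defaultdict(set)
    for doc_id, text in pages.items():
        for term in text.split():
            index[term].add(doc_id)

    # The query result points at documents; it doesn't reproduce them.
    print(sorted(index["creative"] & index["works"]))  # ['page1', 'page2']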
And models don't even use data verbatim; if they do, they're bad/overfitted models. People are making all sorts of arguments, but they seem to boil down to "it's fine if humans do it but if a machine does it then it's copyright violation".
People often disregard the fact that copyright law is woefully outdated (an absolute joke in itself, which can't be used to defend anything since Disney shoved its whole fist up copyright law's...) and should really be extended for the modern world. Why can't we handle copyright for ML models? Why can't animals have copyright? It's not hard to handle these cases; the point of copyright is usage, and agency comes into play.
If people want to be biased against machines, then fine; be racist to machines, and maybe in 2100 or so those people will get their comeuppance. But if an ML model isn't allowed to learn from something and use that knowledge without reproducing it verbatim, then why is predictive text on phone keyboards allowed?
Everyone out here acting like they're from the Corporation Rim.