Japan’s government will not enforce copyrights on data used in AI training (technomancers.ai)
484 points by version_five on May 31, 2023 | 402 comments



I think this should generally be true. The aggregation performed by model training is highly lossy, and the model itself is a derived work at worst and is certainly fair use. It may produce stuff that violates copyright, but it's the way you use or distribute the output of the model that can violate copyright. Making it write code that's a clone of copyrighted code, or make pictures containing copyrighted imagery, or reproduce books, etc., and then distributing the output, would be where the copyright violation occurs.


> The aggregation performed by model training is highly lossy and the model itself is a derived work at worst and is certainly fair use.

Lossy or not, the training data provides value. If all the various someones had not spent time making all the stuff that ends up as training data, then the model it trains would not exist.

If you are going to use someone else's work in order to make something that you are going to profit off of, I believe that original author should be compensated. And should also be able to decide they don't want their work used in that way.

Note that I'm not talking about what existing copyright law says; I'm talking about how I believe we should be regulating this new facet of the industry.

> Making it write code that’s a clone of copyright code or making it make pictures with copy right imagery in it or making it reproduce books etc etc, then distributing the output, would be where the copyright violation occurs.

How is the end-user supposed to know this? Do we seriously believe that everyone who uses generative AI is going to run the output through some sort of process (assuming one even exists) to ensure it's not a substantial enough copy of some copyrighted work? I certainly don't think this is going to happen.

Regardless, copyright is about distribution. If a model trained on copyrighted material is considered a copy or derived work of the original work, then distributing that model is, in fact, copyright infringement (absent a successful fair use defense). I'm not saying that's the case, or how a court would look at it, but that's something to consider.


> If you are going to use someone else's work in order to make something that you are going to profit off of, I believe that original author should be compensated. And should also be able to decide they don't want their work used in that way.

> Note that I'm not talking about what existing copyright law says; I'm talking about how I believe we should be regulating this new facet of the industry.

Is it really new? Humans have always learnt by studying what's out there already. Our whole culture is built on what's been done and published before (and how could it be otherwise?). Without Bach there would be no Mozart, and then down the line their influence permeates everything you hear today.

If anything I'd like to make it easier to reuse parts of our shared culture, and limit the ability of organisations to control how things that they've published are reused. You can make private works if you want to keep control of them, but at some point the public deserves to share and rework the things that have been pushed into the public consciousness.


>> Is it really new? Humans have always learnt by studying what's out there already.

"Humans" being the important word here. I don't understand why people keep trying to compare training a model to humans learning through reading etc. They are very different things. Learning done by machines at enormous scale and done to benefit private companies financially is not the same as humans learning.


How is it meaningfully different with respect to this question?

If I go to a museum and look at a bunch of modern paintings, then go home and paint something new but “in the style of”, this is well-established as within my rights, regardless of how any of the painters whose work I studied and was inspired by might feel.

If I take a notebook and write down some notes about the themes and stylistic attributes of what I see, then go home and paint something in the same style, that too is fine - right? Or would you argue the notes I took are a copyright violation? Or the works I made using those notes?

Now let’s say I automate the process of recording those notes. Does that change the fundamentals of what is happening, with respect to copyright?

Personally, I don’t think so.


The law most definitely distinguishes between the rights of a human and the rights of a software program running on a computer.

AI does not read, look at or listen to anything. It runs algorithms on binary data. An AI developer who uses millions of files to program their AI system also does not read, look at or listen to all of that stuff. They copy it. That is the part explicitly covered by international copyright law. It is not possible to use some file to "train" a ML model except by copying that file. That's just a fact. It wasn't the computer that went out and read or looked at the work. It was a human who took a binary copy of it, ran some algorithms on it without even looking at it, and published/sold/gave access to the software.

AI software is a work by an author; not an author.


> How is it meaningfully different with respect to this question?

Humans can't be owned by corporations for one.


> They are very different things

Yes, but also very similar. We learn very well by spaced repetition, and by practicing things. Our whole nervous system stores information in a similar way. Are the brain and the individual neurons more complex? Yes, sure, but that doesn't negate the core similarities.

> Learning done by machines at enormous scale and done to benefit private companies financially is not the same as humans learning.

Yes, that's the important difference. In the end, if you train a robot, you get a program that's easy to copy and scale; the marginal cost of using it is orders of magnitude lower than what you'd get if you did it with humans in the loop.

We already have fair use in copyright, because there are important differences between the various forms and modalities of human imitation.

And of course maybe it's time to rename copyright to usageright. After all, current copyright doesn't even apply in most cases. (The results are not derivatives; there's sufficient substantial transformative difference, etc. That said, the phrasing in the US constitution still makes sense: "... the exclusive Right to their respective Writings and ..." ... if we interpret Right to mean all rights, including even the right to decide who can read/see it.)


The differences don't seem salient though. Doing a legal thing faster doesn't generally make it any less legal; doing it for profit changes the legal regime somewhat but not in ways that seem relevant to what's being claimed.


> Doing a legal thing faster doesn't generally make it any less legal

“But officer, it is legal to drive just slightly slower than I was going!”

Simply put: You are wrong. The law makes arbitrary distinctions all the time, for practical reasons.


There is a specific law against driving above a certain speed. There's no law like that against being an AI.


"Humans have always learnt by studying what's out there already."

Neither an AI model nor an AI developer who programs that model is actually studying "what's out there already". One is copying files, and the other is running algorithms on copied files. And then the first one is raking in $$$ while bankrupting the authors of those files. That's illegal in the US, UK, EU and under the terms of the Berne Convention.


> You can make private works if you want to keep control of them, but at some point the public deserves to share and rework the things that have been pushed into the public consciousness.

There's already a licensing framework for artists doing this - should they wish to. It's called Creative Commons, and it allows a pretty fine-grained selection of rights, from public domain to free for personal but not commercial use, and everything in between. https://creativecommons.org/

I agree our shared culture in some sense ultimately owns the productions of the culture. But that's wildly different to (and in some senses the opposite of) letting private companies enclose, privatise and sell those products back to us. As for example Disney has done over and over again, taking myths transcribed by the Brothers Grimm, or classic novels now in the public domain, 'reinterpreting' them and viciously enforcing copyright on these new interpretations.

The entire point of copyright law is to allow the Bachs of this world to profit from their work - without having to die in poverty and obscurity, as so many artists and musicians have historically, even while others have profited from their work at scale.

> Humans have always learnt by studying what's out there already.

Finally - as other commenters have noted, there's really no similarity at all between a human, at human pace, studying and integrating an understanding of a piece or genre of art, and an AI training to replicate that work at scale as perfectly as possible. A much better comparator would be the Chinese factory 'villages' that reproduce paintings at scale for the commercial market, without creativity or 'art' being a part of the process. But even that is a poor analogy, since individual humans mediate the process. A really good analogy would be a giant food corporation like Nestle somehow scanning the product of a restaurant and then offering that chef's unique dishes - which had taken years to invent - for nearly free, using the same name and benefiting from the association.


What if the AI is open source and run by individual creators? That changes the slant of the argument a lot. I worry that excessive regulation will mostly come down on individual creators using AI tools.


I'm not sure what you mean? The image and text models currently becoming popular are being trained on work that is not owned or created by those creating the models. There is a strong consensus amongst the artists whose work is being used to train them (without their permission or compensation) that this is a bad thing. Financial impacts are already being felt by artists across industries. There's a huge level of denial of the impact of this on professional artists, already today, here on Hacker News. 'Open source' is a separate issue to training on other people's work, replicating their style and stealing their livelihood.


> Humans have always learnt by studying what's out there already. Our whole culture is built on what's been done and published before

Are you implying that educators should not be compensated or credited? Because that is not how it works in the real world.


If I read a lot of fantasy books as a kid, then start writing my own fantasy book, should I have to pay royalties to the authors of the books I read?


Yes, you do. The royalty is usually called a "college degree".

Most jobs require a degree or certification of some sort.


Does your ability to write fantasy books absolutely depend on having read those fantasy books as a kid? Was gaining the ability to write your own fantasy books and profit from them your only motivation to read those fantasy books? After gaining the ability to write fantasy books thanks to having read them, can you now produce fantasy books at a qualitatively different speed, scale, and conditions than any of the authors of the books that you read?

If the answer to those three questions is "yes", then I would argue that yes, you absolutely should have to pay royalties to the authors.


> If the answer to those three questions is "yes", then I would argue that yes, you absolutely should have to pay royalties to the authors.

Copyright lobbies can't have their cake and eat it too. If you want to enforce such a different way of thinking about copyright compared to the current one, that would destroy the current industry, and rightly so.


I would appreciate if you addressed the questions as literally stated.

I could just as well argue that businesses developing or making use of LLMs can't have their cake and eat it too. If they want their computer programs to enjoy the same rights and prerogatives as human creators do, they should be ready to demonstrate that those models are truly moral agents, with their own lived experiences, and thus deserve the status of legal persons as of themselves subject to the same laws and obligations as human beings.


> If they want their computer programs to enjoy the same rights and prerogatives as human creators do

I don't think they care about that; they seem fine with output images not being copyrightable under the current legislation.


How much does the world owe to Gilgamesh and Homer?


I don't know. Whether the answer were to be either 'our entire human culture', or 'absolutely nothing whatsoever', or any point in between, how would that be relevant for the discussion at hand?


They already paid when they bought the books. Why would they need to pay more?


Say I want to write a screenplay and produce the resulting film, for profit, but I am literally unable to have any ideas whatsoever unless I base them on books that I read. With this in mind, and with this sole motivation, I buy and read the whole collection of Brandon Sanderson's novels and create a screenplay based exclusively on their content, for I have no ideas nor experiences of my own. I already paid when I bought the books. Why would I need to pay Brandon Sanderson any more?


Anything you create comes from what you've seen, from what you've experienced.

It would be a dead end to start having to pay each and every one of the original sources of the components of your mind each time they are used to create!


So I understand you are arguing that derived works should not be subject to royalties, i.e. I should be able to produce a film based entirely on the work of a living author without restriction or the need to pay any royalties.


No, you're over-generalizing what I'm saying (a.k.a. straw man fallacy).

If you're commercializing e.g. goodies that use the exact same character designed by some artist, so that anybody can tell you it's the same, AND if the traits of this character are actually original (not just so that anybody would come up with it independently), AND if the people buying your goodies are all thinking about the original character in the first place, AND if the original author is alive, then I'd absolutely agree that royalties need to be paid.

That's just an example. To tell you that in specific cases it's obvious that royalties are required.

But in the general case, no. Because otherwise, anything at all is a derived work. Just think about it.


It's even impossible to retrace all of the woven threads, ramified tendrils, that link your mind to the billions of other minds all over the world and over the ages.

The world of ideas is liquid. All is mixing, all dissolves and disappears in everything else, and is reborn new and different, again and again.


> Because that is not how it works in the real world.

Unless you're satirizing it or making fun of it in some other way. Then it's fair use.


Educators get paid a flat fee, not an indenture on your future work (nor do you owe anything back to the people who made the learning materials they use). And if you teach yourself from books or websites you don't pay anything (except maybe the cost of the books). All of which seems right and proper?


> Lossy or not, the training data provides value.

If we ignore the issue of machine learning for now: it's not the job of copyright to prevent people from extracting value from a copyrighted work.

If it was, then it would be possible for copyright holders to launch lawsuits that block entities from using the knowledge that was published in copyrighted reference material. Or the rights holder of a cookbook would be able to block people from making the recipe.

We have other sections of law that provide protection along these lines. Patents give their holders a monopoly to extract value from the given invention, a much stronger protection than copyright law. But in exchange, patents are limited to 20 years, must be publicly documented, and only certain types of things are covered.

Trade secret laws can be used to protect recipes, formulas and other processes, but only as long as the holder makes reasonable efforts to keep them secret. The owner can't have it both ways: the IP publicly known yet protected by trade secret law.

The only purpose of copyright is to give the holder a monopoly over the reproduction of a work. The definition of reproduction might be quite wide these days: A performance of a work is a reproduction, a cover of a song is a reproduction, distribution is reproduction in the modern age of computers... etc, but the limit of copyright law is reproduction.

Coming back to machine learning, there is a somewhat open question of whether training counts as a form of reproduction (well, outside of Japan). But we can't use your proposed "extraction of value" metric as a way to decide that.

Personally, I would argue that training a machine learning model is roughly equivalent to a human brain consuming copyrighted works, and should be treated the same in law.

The fact that a machine learning model is (sometimes) capable of recreating a copyrighted work later shouldn't be held against them, as a human brain is also fully capable of recreating copyrighted works from memory. From a legal perspective, recreating a copyrighted work from memory will not save a human from a copyright infringement lawsuit, it's simply not a defence. The copyright infringement happens when the work is recreated.


Copyright law (EU, US, UK, international under the Berne Convention) covers reproduction, distribution and exhibition. That's exactly what those FBI warnings on VHS movies used to say. Distribution and exhibition are prohibited along with reproduction.

In all the cases I mentioned, the only legal way to make any exception to that is if the copying does not harm the interests of the author or reduce the market value of the work. These are the actual laws.

"training a machine learning model is roughly equivalent to a human brain consuming copyrighted works"

A few clear differences:

1. The person "training" a machine learning model doesn't even need to view the work. They copy a file. They do not study or learn from it.

2. A human brain doesn't rely on a 100% verbatim digital copy of the work. To the extent that a brain "makes a copy" of what it observes, it is impossible for it not to.

3. Copyright law (almost everywhere) explicitly applies to making digital copies of binary files of a work (without which it is not possible to "train" a model using the work). Nowhere does it ever apply to a human brain when a person looks at the work.

Not all the things you mentioned are considered "reproduction". A cover of a song is a derivative work, and requires compensation. Showing a movie is exhibition, and is explicitly addressed in copyright laws. These things are not just considered "some form of reproduction".

The laws actually exist and are easy to find and read.


The purpose of copyright is to create moral and economic incentives for authors to create new copyrightable works, by giving those authors a time-limited, state-granted monopoly. Reproduction is only one aspect of this. As an example, syncing a song to a video requires additional permissions even if the party has permission for reproduction. The owners of a record can also disallow the use of a song at a political event on moral grounds, even if the politician has bought permission to play the song in public.

I would personally also argue that a machine learning model is roughly equivalent to a compression algorithm. Converting a 4k video to a 420p video is just as lossy as feeding a learning model a 4k video and asking it to reproduce the 420p video. It has nothing in common with how a human brain consumes content or learns information. No person can produce a 420p video just by consuming a 4k video, nor can any machine learning model gain the emotional constructs and social contexts that human brains get from learning.


Your examples are not inherent rights that copyright law explicitly grants to rights holders. They are clever side effects of how the holder licenses out their monopoly on reproduction.

Holders rarely grant unrestricted reproduction rights to anyone. Reproduction rights licenses always come with a bunch of explicit restrictions, for example: "You may reproduce this novel, in print, unmodified, only for retail sale, in North America, on this quality of paper, for the next 5 years" and so on.

The party has the license to reproduce the song as a standalone audio recording, but attaching it to a video and reproducing the combined work isn't covered and the party must enter into negotiations with the rights holder for a new license. Such licenses often only grant the rights to reproduce it with that exact video and not a different one later, which allows the rights holder to gain control over which videos their song is attached to.

Same thing with holding morality over political events. The rights holder was careful to add a bunch of restrictions to that public performance license they sell. Sure, the politician might have bought a licence, but they forgot to check the small print that blocks their type of event from actually using it.

-----

Machine learning is kind of like compression, yes... It can be a useful analogy at times.

But it is absolutely nothing like lossy video compression. It's not compressing a single file or object. The only way you could train it on a 4k video and get a 420p video out is if the model was extremely over-fitted. The resulting model would likely be bigger than a 420p h264 video file and useless for anything else.

The way that machine learning is like compression is that it finds common patterns across its entire training set and merges them in very lossy ways.

And it's actually very much like how a human brain works. Your brain doesn't start from scratch for every single human face you recognise. Instead, your brain has built up a generic understanding of the average human face, grouping by clusters of features. Then to remember a given human face it just remembers which cluster of features it's close to and then how it differs... Which is a form of lossy compression.
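The "remember which cluster it's close to, plus how it differs" idea can be sketched as vector quantization with residual coding. This is only a toy illustration of the analogy in NumPy (all names and numbers are illustrative, not drawn from any real system): shared patterns are cluster centroids, and each item is stored as a centroid id plus a coarsened residual, which is lossy but close.

```python
import numpy as np

# Toy "face features": 200 vectors of 16 numbers (stand-ins, not real data).
rng = np.random.default_rng(1)
faces = rng.normal(0, 1, (200, 16))

# Naive k-means: find a few shared patterns (centroids).
k = 4
centroids = faces[:k].copy()
for _ in range(20):
    d = np.linalg.norm(faces[:, None] - centroids[None], axis=2)
    labels = d.argmin(axis=1)              # which cluster each face is close to
    for j in range(k):
        if (labels == j).any():
            centroids[j] = faces[labels == j].mean(axis=0)

# Store each face as (cluster id, coarsened difference from the centroid).
residuals = faces - centroids[labels]
coarse = np.round(residuals, 1)            # throwing away fine detail: the lossy step

# Reconstruction is close, but not exact -- lossy compression.
reconstructed = centroids[labels] + coarse
err = np.abs(reconstructed - faces).max()
print(err <= 0.05)                         # error bounded by the rounding step
```

Rounding to one decimal bounds the per-component error at 0.05, which is the whole trade: less storage for the residual, less fidelity on reconstruction.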

> No person can produce a 420p video just by consuming a 4k video

But many people do remember entire songs, complete with lyrics and music. And people with musical skills can (and often do) reproduce that song from memory as a cover... which is copyright infringement if performed publicly or otherwise distributed.

> nor can any machine learning model gain the emotional constructs and social contexts that human brains get from learning.

There are many things which large LLMs like chatgpt are absolutely incapable of doing. People do over hype their capabilities.

But in my experiments, ChatGPT is actually quite good at tasks that require interpreting emotions and social contexts. Does it actually understand these emotions and social contexts? Shrug. But if it doesn't, that just proves that true understanding isn't actually needed to perform useful tasks in those areas.


The thing with sync rights is that they are a construct made by US courts, who made the interpretation that syncing music to a video invokes the derivative-works part of US copyright law. As such, a party needs to obtain both a license to reproduce and a license to create a derivative work. US copyright rights are split into six categories, one of which is reproduction. The other five are: preparation of derivative works, distribution (which is not the same as reproduction), public performance, public display, and public performance by means of a digital audio transmission. The copyright owner is given in the law the right to convey each of these exclusive rights separately.

Other countries, like France, have moral copyright law and property copyright law as cleanly separate laws. Different rules, but with similar practical implications.

In both cases, they are not just fine print in a license. A license that intends to give recipients full permissions needs to include everything. This is why international written copyright licenses are a huge legal problem with no obvious solutions.

----

The human brain is vastly more complex. We don't learn by building up an understanding of the average human face, grouping by clusters of features. At best that's an extreme oversimplification that ignores 99.9% of what the brain does. The eye neurons (still an oversimplification) send the visual inputs to multiple parts of the brain, each interpreting and signaling to the others, and the outcome of that both influences and changes the growth and behavior of those paths and parts. You see a face and it talks to the amygdala, but also to the frontal cortex, and also the limbic and pre-limbic cortex. It simultaneously runs a simulation in the hypothalamus to test the emotional reaction. And we are still not at the hippocampus, where long-term memories are generally considered to form.

A person would be long dead if they had to first go through the generic understanding of what a tiger looks like, then go through memories that distinguish a lion from a cat, and last go through memories of the significance of a fenced tiger in a zoo in contrast to one in the jungle.

Trying to remember a given human face actually invokes a lot of those parts of the brain that activated when we saw it, and a face can also be remembered by putting oneself into the emotional state we were in when we met the person. One theory about dreaming is that we also rerun those neural pathways, but not all of them at the same time, which then allows for more context.

LLMs don't even come close to operating like this. They can be good at emulating what seems like learning, but comparing them is like saying that the complex system we call the immune system is like a gun. Both can kill people.


> the training data provides value

I wish it was easier to build on someone else's copyrighted value. Geforce Now shouldn't have to get permission from game makers to rent out server time for users to play games they already own. Aereo shouldn't have been legally obliterated for the way they rented out DVRs with antennas. Using a sample in a song shouldn't be automatic infringement.


> distributing that model is, in fact, copyright infringement

Is it? If I distributed digits of pi (to the umpteen-billionth decimal), they theoretically contain copyrighted information in their digits.

The distribution of the copyrighted material is the infringement, but not if the data is _meant_ to produce other effects, and it is reasonable that the data is used for some purpose _other than_ replicating the copyrighted works.

> the training data provides value.

and so does a textbook. A student reading the book (regardless of how that book was obtained - paid or not) does not pay royalties from the knowledge obtained.


Value derived from a work was never a component of protected IP.

If you write a song which is so inspirational that it influences the way a listener thinks or even inspires them to make similar sounding songs, you don't have any claim to that value.


> the training data provides value.

Copyright law doesn't protect "any provided value". Fair use specifically allows content creators to use copyrighted material.

This can be for the purposes of parody, it can also be for the purposes of "reaction videos" or "commentary". The original content creators are NOT compensated for the value they helped create in a "reaction video".

https://arstechnica.com/tech-policy/2017/08/youtuber-court-b...


> If you are going to use someone else's work in order to make something that you are going to profit off of, I believe that original author should be compensated. And should also be able to decide they don't want their work used in that way.

You're posting on HN. Are you expecting a check from YCombinator?

> Regardless, copyright is about distribution. If a model trained on copyrighted material is considered a copy or derived work of the original work, then distributing that model is, in fact, copyright infringement (absent a successful fair use defense). I'm not saying that's the case, or how a court would look at it, but that's something to consider.

The government is what ultimately decides what copyright means. And if the court wouldn't look at it that way, then it's not "in fact copyright infringement".


> Lossy or not, the training data provides value. If all the various someones had not spent time making all the stuff that ends up as training data, then the model it trains would not exist.

I don’t think deriving value is the measure for infringement. The artist derived value from society and from countless inspirations, should they be compensated?

Copyright is the right to distribute your content, not to control it in every way and derive maximum value possible. The purpose of copyright is to incentivize creators and they already get, in the US, 70+ years of monopoly preventing anyone else from selling copies.

This seems like more than enough incentive for creators to create, as evidenced by massive copyrighted material created at a rate far above population growth (more content is created and more money is made via copyright than ever before).


> If you are going to use someone else's work in order to make something that you are going to profit off of, I believe that original author should be compensated

This is insane. I learned mathematics from Houghton-Mifflin and Addison-Wesley copyrighted textbooks. Are you telling me that if I use this knowledge to, say, calculate matrices, I am violating copyright?

If I read Brandon Sanderson, absorb his writing style, setting, characters, and plot devices into my subconscious, and then a year later pen a short story that happens to be Sanderson-ish, do I have to write a check to Brandon?

No. LLMs are doing roughly the same thing our own brains do, at least at a high level: absorb information, and utilize that information in different ways. That's it.


Society and the economy will do just fine without copyright law. Some things would probably go away, like blockbuster movies. But a lot of other things would flower.


I strongly agree with this.

There's a distinction between "learning from" and "copying". "Learning from" is a transformative process that distills from the observation. This distillation can be as simple as indexing for a search engine, or as complex as a deep neural network.

Simply because a neural network can create something that is a copyright violation doesn't mean the training process itself is one.

A human can see an advertisement for a Marvel movie and then reproduce the Marvel logo. Redistributing that logo (and possibly the act of reproducing it) is a copyright violation, but the learning process isn't.

The neural network is a tool.

It's reasonable to be concerned about the loss in employment by people who are affected by generative AI. But I think this is a separate issue to the copyright argument.


> There's a distinction between "learning from" and "copying".

Neural nets can memorize their training data. Generally that isn't what you want, and you strive to eliminate it. However, it could instead be encouraged to happen if someone wanted to exploit this law in order to abuse copyrights.
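A minimal sketch of that memorization failure mode, in pure NumPy (all names and hyperparameters are illustrative, and this is a toy, not how production models are trained): a softmax model with one weight column per training position has enough capacity to act as a lookup table, and training it far past the point of generalizing makes it reproduce its training text verbatim.

```python
import numpy as np

# Deliberately over-fitted "model": softmax regression from a one-hot
# position vector to the character at that position. With one weight row
# per position, gradient descent just memorizes the training text.
text = "All your base are belong to us."
chars = sorted(set(text))
c2i = {c: i for i, c in enumerate(chars)}
X = np.eye(len(text))                      # one-hot position inputs
Y = np.array([c2i[c] for c in text])       # target character indices

rng = np.random.default_rng(0)
W = rng.normal(0, 0.01, (len(text), len(chars)))

for _ in range(500):                       # train well past any useful stopping point
    logits = X @ W
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(len(text)), Y] -= 1        # softmax cross-entropy gradient
    W -= 1.0 * (X.T @ p) / len(text)

# The "generated" output is an exact copy of the training data.
recovered = "".join(chars[i] for i in np.argmax(X @ W, axis=1))
print(recovered == text)
```

Real networks memorize less neatly than this lookup-table toy, but the point stands: with enough capacity relative to the data, "learning" can collapse into verbatim storage, which is exactly the abuse scenario described above.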


The law applies to the training of a neural network; you're not depriving the copyright holder of his intellectual property. If you use a copy of his work, he still owns the copyright regardless of whether you copy it by right-clicking > copying or by overfitting a generative model.


Humans can memorize their training data too... aka see something and then produce a copy (code, drawing, music, etc.). The principles underlying how LLMs and humans learn aren't really that different... just different levels of loss/fuzziness.


And when humans do that they may also infringe on copyright.


Yet it’s not illegal to look at the McDonald’s logo, is it?


Yes, and as GP suggested, going on to distribute copies would be copyright infringement. That doesn't imply that it's an infringement to train the neural net.


Humans learn on copyrighted works as a matter of standard training. And certainly humans can memorize those works and replicate them – and we rely on the legal system to ensure that they don't monetize them.

The same will apply to neural nets. They can learn from others, but must make sufficiently distinct new works of art from what they've learned.


> A human can see an advertisement for a Marvel movie and then reproduce the Marvel logo. Redistributing that logo (and possibly the act of reproducing it) is a copyright violation, but the learning process isn't.

I don't think that's correct. That might be trademark infringement, if the logo is a registered trademark, but "seeing something and then drawing it" is in general not copyright infringement.


Drawing a copy of a copyrighted picture from memory, and then distributing that copy, would certainly normally be copyright infringement. (A logo may not be enough of a creative work to be copyrightable, but I assume that's not what you're getting at).


> Drawing a copy of a copyrighted picture from memory, and then distributing that copy, would certainly normally be copyright infringement.

In US law, there is a nuance between Copyright and Trademark.

> Drawing a copy of a copyrighted picture from memory, and then distributing that copy

Would not necessarily be copyright infringement (it depends on a judge). That's why, for example, Taylor Swift is able to re-record her music (the copyright is owned by a recording studio) as-is, and can distribute the new version as "Taylor's Version": she owns the copyright on the new version.

> (A logo may not be enough of a creative work to be copyrightable, but I assume that's not what you're getting at).

A logo is actually MORE protectable, through Trademark. Trademark is significantly MORE protected than Copyright.

In your example, if someone draws from memory a logo, they actually own the copyright, but it is still Trademark infringement and the trademark owner will be protected.


> In US law, there is a nuance between Copyright and Trademark.

It's not a nuance, it's a completely separate legal regime, and not what this conversation is about.

> Would not necessarily be copyright infringement (it depends on a judge).

Every law can be challenged in court, but a copy of a picture as-is is a pretty clear-cut case.

> For example, why Taylor Swift is able to re-record her music (the copyright is owned by a recording studio), as is, and can distribute the new version as "Taylor's Version" because she owns the copyright on the new version.

Nope. She's able to because there is a compulsory license for covers of songs that have already been published - something very different from them not being protected by copyright - and/or because she owns some of the rights. She may well be paying royalties on them. That compulsory license regime is specific to recorded music and does not apply to pictures.

> A logo is actually MORE protectable, through Trademark. Trademark is significantly MORE protected than Copyright.

"More" is a simplification; trademark laws are quite different from copyright laws, stronger in some ways and weaker in others (e.g. you can lose a trademark by not enforcing it, whereas you cannot lose a copyright that way). In any case, that's a distraction from the current topic.


> "seeing something and then drawing it" is in general not copyright infringement.

It's seeing something, drawing it, and then distributing that drawing which is infringement. Bonus points for the distribution being a sale.


> A human can see an advertisement for a Marvel movie and then reproduce the Marvel logo. Redistributing that logo (and possibly the act of reproducing it) is a copyright violation, but the learning process isn't.

This then becomes about where the liability of that violation lies, and how attractive that is to companies.

A human "learning" the Marvel logo and reproducing it is a violation. How does OpenAI fit into this analogy?


The liability would lie with the company using the LLM product. This could mean that many companies won't want to take on the risk unless there is decent tooling for warning about infringement and listing sources.


I think liability lies with the person who uses the product to violate copyright. The hosting / producing company didn’t violate copyright if I use their model to make Mickey Mouse pictures. I did.


How can you be certain that the content being generated is non-infringing?


You can't. I think Fair Use is fundamentally a subjective judgement combining how transformative the work is with the intent and impact of its distribution.


The same way you do with any other content you generate in other ways.


Well, when I pick up a pencil and make a drawing, I have a lot of agency over what is created.

The whole point of these generative models is that I have less agency over exactly what gets created - it takes my prompt and does the rest.


Normally when I generate content it’s from my brain and I can tell the difference between copying memorized content, re-expressing memorized content, and generating something original. How do I know what the LLM is doing?


Are you sure? If you look at plagiarism in music, you'll find a number of cases where the defendant makes a compelling point about not remembering or consciously knowing they had heard the original song before. For legal purposes that's beside the point, but they feel morally wronged to be found guilty. The point here is that they internalized the musical knowledge but forgot the source - so they can no longer make the distinction you claim. Natural selection shaped our brains to store information that seems useful, not its attribution.

LLMs are also not usually trained to remember where the examples they were trained on came from; the sourcing information is often not even there (maybe they could, maybe they should, but they aren't). Given that and the way training works, one could argue that they're never copying, only re-expressing or combining (which I think of as a form of "generating something original"). Just memorizing and copying is overfitting, and strongly undesirable, as it's not usable outside of the exact source context. I agree it can happen, but it's a flaw in the training process. I'd also agree that any instance of exact reproduction (or of material with similarity to the original content over some high threshold) is indeed copyright infringement, punishable as such.

So, my point is, training a model on copyrighted material is legal, but letting that model output copies of copyrighted material beyond fair use (quotations, references, etc - that make sense in the context the model was queried on) is an infringement. And since the actual training data is not necessarily known, providers of model-as-a-service, such as OpenAI with GPT, should be responsible for that.

In cases where a model was made available to others, it falls on the user of the model. If the training data is available, they should check answers against it (there's a whole discussion on how training data should be published to support this) to avoid the risk; if the training data is unknown, they're taking the risk of being sued full-on, without any mitigation.


That’s what I said, the user would be liable. The user could be a company or an individual.


> A human "learning" the marvel logo and reproducing it is violation

Not quite; it's really in the resale or redistribution that the violation occurs. Painting an image of the Hulk to hang in your living room wouldn't really be a violation; selling that painting could be; turning it into merch and selling that would wholeheartedly be; trying to pass it off as official merch is without question a violation.


Hanging it in your living room is in fact a copyright violation, just not one that Marvel is likely to legally pursue.


I strongly disagree with this. We shouldn't create new laws for new technology by making analogies to what's allowed under old laws designed for old technology. If we did, we would never have come up with copyright in the first place.

600 years ago, people were allowed to hand-copy entire books, so they should be able to do it with a printing press right? It's "just a tool"!

The correct way to think about this is to recognize that society needs people to create training data as well as people to train models. If we don't reward the people who create training data, we disincentivize them from doing so, and we'll end up in a world where we don't have enough of it.


I don't think the comparison with human learning holds.

NNs and humans don't learn the same way - humans can fairly quickly generalise what they have learned and, most importantly, go beyond what they've learned. I haven't seen that happen with neural networks or GPTs; at best, you're getting the average of what the model has 'learned'. There's human learning and there's neural network 'learning', and they're different things.


NNs absolutely can go beyond what they have learned and aren't just producing the "average".

Some good examples outside the typical LLM/images work:

* Deep Mind's work on AlphaFold, which generates predictions on proteins that haven't been seen before

* AlphaGo which plays games better than any human (so clearly can't be "the average")

If we look at LLMs, something like writing code in the style of Shakespeare isn't really something that's been seen before.


> Deep Mind's work on AlphaFold, which generates predictions on proteins that haven't been seen before

I have used AlphaFold a bit in my own work, and when I showed it 'unusual' proteins like rare mutants it usually generated garbage. Some evidence for this exists in the literature; see for example https://www.biorxiv.org/content/10.1101/2021.09.19.460937v1 or https://academic.oup.com/bioinformatics/article/38/7/1881/65..., or:

>AlphaFold recognizes a 3D structure of the examined amino acid sequence by a similarity of this sequence (or its parts) to related sequences with already known 3D structures

https://www.biorxiv.org/content/10.1101/2022.11.21.517308v1


Yup, exactly.

I'm sure Google represents strings of text from pages in some internal format, but relatively verbatim. Even represented verbatim, because their output is a search result and not an article that uses the copyrighted text verbatim there's no copyright violation.

And models don't even use data verbatim, if they do they're bad models/overfitted. People are making all sorts of arguments but they seem to boil down to "it's fine if humans do it but if a machine does then it's copyright violation".

People often disregard the fact that copyright law is woefully outdated (an absolute joke in itself, which can't be used to defend anything since Disney shoved its whole fist up copyright law's...) and should really be extended for the modern world. Why can't we handle copyright for ML models? Why can't animals have copyright? It's extremely trivial to handle these cases; the point of copyright is usage, and agency comes into play.

If people want to be biased against machines, then fine. Be racist to machines, maybe in 2100 or so those people will get their comeuppance. But if an ML model isn't allowed to learn from something and use that knowledge without reproducing verbatim, then why is predictive text in phone keyboards allowed?

Everyone out here acting like they're from the Corporation Rim.


Copying a logo is trademark infringement, not copyright infringement.


No. Making a single copy for your own use is still a copyright violation. There are exceptions (fair use, nominative use, etc.) but just because people are rarely sued for personal copying doesn't mean that copying is permitted. And trademark issues, such as the other commenter generating the Superman logo, are subject to a host of other rules.


Training a model isn't making a copy for your own use; it's not making a copy at all. It's converting the original media into a statistical aggregate combined with a lot of other stuff. There's no copy of the original, even if the model is able to produce a product similar to the original. That's the specific thing - the aggregation, and the lack of direct reproduction in any form, is fundamentally not reproducing or copying the material. The fact that it can be induced to produce copyrighted material, just as you can induce a Xerox to reproduce copyrighted material, doesn't make the model or its training a violation of copyright. If its sole purpose were the reproduction and distribution of the material, or if it carried a copy of the original around and produced it on demand, that would be a different story. But it's not doing any of that, not even remotely. All this said, it's a highly dynamic area - it depends on the local law, the media and medium, and the question hasn't been fully explored. I'm wagering, though, that when it comes down to it, the model isn't violating copyright for these reasons - but you can certainly violate copyrights using a model.


Copying into RAM during training is making a copy, and can be a copyright violation.

https://en.wikipedia.org/wiki/MAI_Systems_Corp._v._Peak_Comp....

However, it seems that there is a later case in the 2nd circuit:

https://en.wikipedia.org/wiki/Cartoon_Network,_LP_v._CSC_Hol....


MAI v. Peak was obviously wrong. It would mean whenever you use someone else's computer, and run licensed software, you're committing copyright infringement. The decision split hairs distinguishing between the current user and the licensee for purposes of legality of making transient copies in memory as part of running the program.

Peak was a repair business. MAI built computers (as in assembled/integrated; I think they were PCs) and had packaged an OS and some software presumably written or modified in-house along with the computer. MAI serviced the whole thing as a unit. So did Peak. MAI sued Peak for copyright infringement because Peak was taking computer repair/maintenance business away from MAI, under the theory that Peak employees operating their clients' MAI computers and software was copyright infringement. (There were other allegations of Peak having unlicensed copies of MAI's software internally, but that's not central to the lawsuit.)

If you have a piece of IP to use to train an IP model with, and you have legal right of access to use that piece of IP (for private purposes), MAI v. Peak doesn't cleanly apply.

MAI v. Peak is also 9th circuit only, and even without the poor reasoning, it should automatically be in doubt because the 9th circuit is notoriously friendly to IP interests, given that it covers Los Angeles.


I agree that MAI v Peak is crazy.

I was only pointing out that the law is of the opinion that a copy is a copy is a copy, regardless of where it's made, or how long it exists for.

Other decisions come into play to save us, like Authors Guild v Google, where they said search engines could make copies, bringing Fair Use into the picture.

Personally, I think that creating the model is Fair Use, but anything produced by the model would need to be checked for a violation. I would treat it the same as if I went to Google Book Search, and copied the snippet it returned into my new book.

The license associated with the training data then becomes insanely important. Having the model reference back to the source data is even more important.

For example, training data with a CC BY license would be very different to CC BY-SA and CC BY-ND, and they all require the work produced by the model to have credit back to the original source to be publishable.

https://creativecommons.org/licenses/


The difference is that the copy is authorized, unless the work is being pirated.

When an artist displays their work on DeviantArt or Artstation or whatever, they are allowing the general public to load it into memory. It's part of the license agreement they sign when they sign up for these services.


The copy isn't authorized, the copy is allowed under Fair Use. There's a huge difference between the two.


Wrong.

Fair Use applies to instances that would otherwise be copyright violations, i.e. unauthorized distribution.

When you sign up for a social media site you EXPLICITLY grant the site the rights to distribute it. You have expressly permitted it. It's a big difference!


The sources used for training these AIs are publicly available sources like Common Crawl. If having a copy in RAM is a copyright violation, then there are copyright violations occurring well before any AI ever sees it.


It is, and it's the same reason Blizzard can sue cheat makers: they're violating copyright law by using the game's memory, etc.


How do search engines exist? The internet archive? Caching of image results? Web browser caches? CDNs?


Copies made by search engines don't need authorization, and can be unauthorized copies. Search engines are allowed to make copies under Fair Use since they are transformative - see Authors Guild, Inc. v. Google, Inc.

There hasn't been an explicit decision for ML training, but everyone's assuming that Authors Guild v Google applies.

https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,.....

CDNs operate under the control of the copyright owner, so they would be authorized.

Web browser caches are under the control of the recipient who has authorization to make a copy.


Depends on the training. Copilot can output training code verbatim. And even if not an exact reproduction, using a small training set could often produce insufficiently transformative work that could still be legally considered a derived work. (IANAL)


> Training a model isn’t making a copy for your own use, it’s not making a copy at all. It’s converting the original media into a statistical aggregate combined with a lot of other stuff.

Devil’s advocate: That sounds like a derivative work, which would be infringement.


converting the original media into a statistical aggregate

Devil's advocate: That sounds transformative, which wouldn't be infringement.


Good point. I’d imagine we’ll see arguments in both directions, given how grey the line is between purely derivative and transformative.

I think it’s fair to say that generative AI trained on copyrighted content will be an unmitigated win for IP attorneys all around.


> No. Making a single copy for your own use is still a copyright violation.

In some jurisdictions, perhaps, but not in all of them. There isn't one set of universal copyright law in the world. Eg in New Zealand you are allowed to make a single copy of any sound recording for your own personal use, per device that you will play the sound recording on. I'm sure there are other examples in other countries.

https://www.consumer.org.nz/articles/copyright-law


This is the same in the UK (and not only for sound). If you own the copy, you can make personal copies. You can't share them, and you have to own the original.


Some licenses, like CC, have variants that prohibit production of derivatives, or prohibit commercialization, or require licenses or same requirements to be preserved in derivative works. Sweet to see people "warm up" to say stuff like 'well it's just a derived work'. Can we get tech to actually respect the licenses of used works next? It's something that's been asked for all along. Without just going, 'well, it's all fair use' - 'so we'll just ignore all licenses and won't even try to detect or respect licenses and whatever requirements they have'. Sure, it may be "unenforceable", but if tech keeps saying "fuck you" to the creators, and "fuck you specifically to the licenses, we won't even look at them or process them" - creators will keep saying 'well fuck you too' right back.

Otherwise, it's just talk with no follow-through: just dropping words like 'fair use' as a 'get away from liability' card, without actually engaging with intellectual-property concepts - just to use works without respect to the artist's will (as it could be expressed in a license), or sometimes even a mention, let alone "compensation" or other "consequence".


I suspect that enforcement of licences will be based on who owns the licence.

A large corporation? Of course. You're not even aloud to talk about the product without paying a fee.

A small creator? Oh no, it's fair use.


Did you just argue against fair use in general? Everything you wrote applies to it as well. Copyright has limits, it's not like right holders get to determine what those are. It's a balance of rights of creators and users.


Is the use really all that fair, when thousands, if not millions, of works and artists get their works repurposed into services with "subscriptions" and "usage tokens" and other kinds of monetization, while directly competing with and against those very artists? Or will it take some markets being completely destroyed and overtaken, and people displaced, before people wake up and go, 'wait, was that really "fair"? what happened?'

And no, I'm not really arguing against it. Fair use can be great. But shit like that is really pushing it to its limits in scope and scale of use and commercialization.


It is just as fair or unfair as humans doing fan art based on other characters or styles that they have seen and studied.

What is the difference between a model creating an infringing work of Iron man when prompted to do so and https://old.reddit.com/r/marvelstudios/search?q=%2Bflair%3AF... ?

After all, aren't those images directly competing with the artists of Marvel Studios and the artists who properly license it to create derivative works ( https://www.designbyhumans.com/shop/marvel/ )?

If we are going to say that creating images from models trained on something is infringing, then shouldn't clearly infringing works in the fan-art category be handled the same way, since they compete with the very artists and companies who are the rights holders for the images and likenesses of the content?


I think societies haven't really determined what is fair in these cases. To give you a counter example: Millions of young artists look at art museums and incorporate what they see and learn into their own art. Creators of the images they look at get nothing from the proceeds young artists end up earning later in their lives. Is this unfair?


This is the "guns don't kill people, people do" argument. Not saying that proves things one way or another, just that assigning responsibility is not really a cut-and-dried question, and many people think it's important to look prior to the final interaction.

IANAL but I believe in the US tools that are designed to circumvent copyright are illegal, which makes sense to me inasmuch as one believes that copyright should be protected


Except these models are not designed to circumvent copyright; in fact, their primary purpose is to generate non-infringing and non-copyrightable output. They can be induced to produce facsimiles of copyrighted material, but that's explicitly not the purpose or intent, and it requires positive action on the user's behalf to occur.


I'm saying `A => B` (something should be illegal if it's primarily for crime) and you're saying `-A` (LLMs are not primarily for crime), which is not really a disagreement. My point is disagreeing with the GGP who argued that `-B regardless of A` (LLMs should not be a crime regardless of whether they facilitate crime, because it is the end-user who is committing the crime).

I happen to mostly believe the conclusion (`-B`) but not the particular argument.


Photocopy machines are great at making copies of copyrighted material, and are completely legal in the US. The entire internet routinely makes copies every time you visit a web page.

What's illegal in the US is selling tools for breaking DRM: https://www.androidpolice.com/2018/10/26/us-copyright-office...

From a quick google, open source tools for breaking DRM are legal, and so is breaking DRM for personal use: https://www.google.com/search?q=are+drm+defeating+tools+ille...


>The aggregation performed by model training is highly lossy and the model itself is a derived work at worst and is certainly fair use.

I mean, I've literally gotten the Superman logo in Stable Diffusion without even trying, so it isn't that lossy.


And if you use it, you’re violating copyright. But you will find no copy of the logo in the model data. The model is way too small to contain its training imagery from an information theoretic point of view.
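The information-theoretic point can be sketched with rough arithmetic (the parameter and dataset counts below are approximate public figures for Stable Diffusion v1 and LAION-2B, used only for illustration):

```python
# Approximate figures: ~860M U-Net parameters, ~2.3B training image-text pairs.
params = 860_000_000
bytes_per_param = 2               # fp16 weights
dataset_images = 2_300_000_000

model_bytes = params * bytes_per_param
budget = model_bytes / dataset_images
print(f"{budget:.2f} bytes of model capacity per training image")  # well under 1 byte
```

Under a byte per image on average: nowhere near enough to store the images, though it doesn't rule out a small number of heavily duplicated images being memorized.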


> But you will find no copy of the logo in the model data.

You won't find a copy of a plaintext in a ciphertext. But you can still extract the plaintext from the ciphertext.
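To make the analogy concrete, here's a toy XOR cipher (illustrative only, with an arbitrary key): the ciphertext contains no literal copy of the plaintext, yet the plaintext is fully recoverable.

```python
key = 0x5A  # arbitrary single-byte key for illustration

plaintext = b"Superman logo"
ciphertext = bytes(b ^ key for b in plaintext)

# No literal copy of the plaintext appears in the ciphertext...
print(plaintext in ciphertext)             # False
# ...but the transformation is lossless and two-way.
decrypted = bytes(b ^ key for b in ciphertext)
print(decrypted == plaintext)              # True
```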


That's an example of a two-way lossless transformation. The data is certainly encoded in the ciphertext, is directly retrievable, and the ciphertext has no purpose other than to contain the original data. From the model you can't directly retrieve the original data, and the model has more purposes than producing the original. It requires you to specifically manipulate it to produce the copyrighted material, and just because copyrighted material was used to train it doesn't mean it can even reproduce a facsimile.

I think a better counterexample is MPEG and other lossy formats. But again, the format does nothing but carry the original, even if it's not a perfect reproduction. You can't use it in any other way. Its expressed intent is the reproduction of the copyrighted material with no modification, improvement, or derivation. These models are not trained with the intent or purpose of only producing the copyrighted materials. It requires your specific action to induce the reproductions, if that's even possible, and the model generally serves other purposes in all other uses.

This is more like a Xerox than not - you can certainly violate copyright with a Xerox, but the Xerox itself doesn't exist to violate copyright. It's for other purposes. The ambiguity obviously comes from the fact that a Xerox machine wasn't built by first scanning all documents on earth. But I think the very act of mixing all the other images and documents together into the model - which, again, is just a statistical aggregate of everything it was trained on, mushed together - turns it into at worst a derived work that falls under fair use.


If you search "Superman Logo" you find actual copies of the Superman logo which are served from Google's cache.

If you ask a VFX artist to create the "Superman Logo" with Photoshop they'll do an excellent job.

The first one isn't a copyright violation because it is fair use. The second maybe is, if it is redistributed, but we don't ban the use of Photoshop by artists just because they can choose to reproduce copyrighted things with it.


I agree, and I honestly think that a big part of the issue with AI image generation is people just really have a hard time conceiving of a technology that can make such accurate images from a relatively small model like this.

"It must have a copy" - but van Gogh didn't make paintings of hot rods or whatever, and you can't copyright style or technique.


I see so many lay persons, and sometimes even people with a CS background, describe diffusion models as some sort of magic content-addressable data store that you "just" look up a bunch of original images in and somehow copy pieces of. These debates would get a whole lot better if more people had at least a very basic understanding of how the training process works.


Superhero costume using a logo. The logo is inspired by the strongest gem's typical cut, with a monogram of the first letter of the name for maximum size and visual clarity.

Literally if you asked someone for a recognizable outline of the strongest gemstone's iconic cut you'd get the outline and the rest is an obvious path. Humans might unconsciously, or even by choice, avoid something too similar to something they already know.

Superman's costume also uses vibrant colors. The red / blue pairing is used extensively across many logos and visual representations for the high contrast of two vibrant colors.

As I try to imagine an older child or young adult somehow raised in an environment like pop culture but through some twist absolutely unexposed to Superman or any related concepts, it isn't that far of a stretch to imagine independent invention of a strikingly similar idea. Maybe not as a first draft, but in exploring a range of possible powers and accompanying logos - e.g., in the range of an LLM-backed character creator for a superhero game, with the results then annealed through simulated effectiveness/fitness of hero powers, logo design, etc.

Everyone wants to think they're a special snowflake and that what they create is somehow unique as well. However we're all drawing on a huge pool of common culture to synthesize expressions which fulfill a set of constraints prescribed by the culture and the culture's influence on the individual and the moment being experienced.

In the case of Superman, that's even arguably a description of the archetype: they are literally a super man. Clark Kent, however - that's a little more unique, and probably a trademark (consumer commercial-use protection) as long as such a registration is maintained.


Unfortunately, I don't believe any of that matters with trademarks. If someone came up with the Superman logo on their own, and released a product that used it, they could not say "but it's a really simple logo" and get a free pass. I'm not sure what that means for ChatGPT, but it would certainly factor into your use of images produced by ChatGPT.


I feel like this is a very strong point that just gets hand-waved away. There are numerous cases where AI-generated content is an exact copy of, or clearly derived from, an existing work. This happens with text, music, and art.

If we have AI-powered content generation in a video game, and you put into a prompt, "generate 300 Mickey Mouses, then play some music that sounds like Taylor Swift's new album", and the results look exactly like Mickey Mouse and the music is Taylor Swift's, it's really difficult to argue that's not copyright infringement.

Yet, people get away with thinking that's not copyright infringement because "the algorithm learned it, like a real human". If the prompt just created a human-designed model, then that is copyright infringement.

The solution might be for big corporations to create an adversarial network that you can train against to purge copyrighted works from your own network.


It is copyright infringement - but the fault lies with you, who prompted the production of the infringing output. The model isn't specifically any more a copyright violation than a browser cache or a photocopier. The person who uses the machine to produce violations is at fault, not the thing that, in addition to legitimate transformed works, can be used to produce copyright violations. As a company hosting such a service, my goal would be similar to YouTube's: make a best effort to monitor for violations and add guard rails where I can. But I shouldn't be held liable for your intentional misuse of a product so long as I made that best effort.


There's no difference between an art student looking through a museum or archives for ideas and an AI using the material for training.

Same could be said for reading. A medical student reading through textbooks or a writer who reads is essentially what an AI is doing.

You can ask an art student to create something in a certain style. You can get writers to write in a certain style. Equivalent.


> There's no difference between an art student looking through a museum or archives for ideas and an AI using the material for training.

A few notable differences:

1. Scale: a single art student can't view millions of works in a week.

2. Duplication: a single art student's brain can't be cloned or downloaded into another art student's brain.

3. Speed: a single art student cannot draw or paint thousands of images in a week.

4. Ownership: the software is likely owned and controlled by a large corporation, while the art student (hopefully) isn't.


Notably, none of those things would be copyright violations if they did apply to the art student. The speed, scale, versatility, and ownership of a machine learning model have no bearing on its ability to violate copyright.


The interesting thing to me is determining when something that is acceptable for humans to do at human scale is also acceptable for machines to do at industrial scale.

There are many examples. One is face recognition - clearly it is acceptable for individual humans to do this at small scale, but systematic identification of everyone on a street or in a stadium has different implications for society.

(In this case it almost doesn't matter whether the surveillance is performed by humans or machines - it's the scale and systematic nature that changes the equation.)


Good reasons not to assign copyright to their output, at least not without some caveats.


I don't have strong feelings about that either way, but that's not what this post is about.


These are all true. But they apply equally to a search engine index and that has already been found not to violate copyright and to be very useful to society.
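To make the search-index analogy concrete: an index is itself a derived artifact built by scanning other people's pages, yet it stores only word-to-document associations, not the pages themselves. A toy inverted index (a sketch of the idea only, nothing like a production engine):

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to the set of document ids containing it.

    The index is derived from the documents, but it keeps only
    word -> doc-id associations, not the documents themselves.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
}
index = build_index(docs)
print(sorted(index["the"]))  # [1, 2] -- both docs mention "the"
```

Whether a trained model is closer to this kind of derived structure or to a compressed copy is, of course, the question the thread is debating.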


> These are all true

That was the point.

Whether or not there is copyright infringement going on (or whether copyright law is an appropriate regulatory framework for ML models), I frequently see claims like "ML training is no different from what humans do" repeated, in spite of its incorrectness.


If an art student was able to gain those abilities, would you then argue they couldn't legally create art?


This just means students better grow bigger brains!


Conflating training a model with human learning is wrong.

When training a model you are deriving a function that takes some input and produces an output. The issue with copyright and licensing here is that a copy is made and reproduced numerous times when training.

The model is not walking around a museum where it is an authorized viewing. It is not a being learning a skill. It is a function.

The further issue is that it may output material that competes with the original. So you may have copyright violation in distribution of the dataset or a model's output.
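For what it's worth, "deriving a function" can be made concrete with a toy example. The sketch below (illustrative numbers only, not a claim about any real training pipeline) fits y ≈ w*x by gradient descent; note the training data is traversed again on every epoch, which is the repeated reading/copying the comment refers to:

```python
# Samples of the target function y = 2x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

w = 0.0    # single parameter of the derived function
lr = 0.05  # learning rate
for epoch in range(200):
    for x, y in data:                 # data is re-read every epoch
        grad = 2 * (w * x - y) * x    # d/dw of squared error
        w -= lr * grad

print(round(w, 3))  # converges to 2.0
```

The fitted `w` retains only what is needed to map inputs to outputs; how much of the original data survives inside a billion-parameter version of this is the contested point.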


I don't fundamentally disagree with you, but what you are saying doesn't hold water.

> a copy is made and reproduced numerous times when training.

Casually browsing the web creates millions of copies of what are likely the same images and text that models are trained on. Computers cannot move information, they can only copy it and delete the original. Splitting hairs over the semantics of what it means to "copy" isn't a strong argument.

> where it is an authorized viewing

What exactly is an unauthorized viewing of a publicly accessible piece of content online that has been hyperlinked to? If we assume things like robots.txt are respected, what makes the access of that data improper?

> it may output material that competes with the original

An art student could create a forgery. I could craft for myself a replica of a luxury bag. But that's not a crime unless it's done with the intention of deceiving someone or profiting from the work. Intent, after all, is nine tenths of the law.

It's an important right that you should be able to do and create things, even if the sale or distribution of the outputs of those things are prohibited. The ability for a model to produce content which couldn't be distributed shouldn't preempt its existence.

> So you may have copyright violation in distribution of the dataset or a model's output

And neither of those things are the act of training or distributing the model itself!


There is quite a bit of precedent for "making copies of digital things is copyright infringement". Look at lawsuits from the Napster era. [1]

What makes the use improper? Licenses. Terms of service. Mostly licenses though. For example, all the images on Flickr that were uploaded under Creative Commons licenses (e.g. non-commercial) have now been used in a commercial capacity by a company to create and sell a product.

Similarly, code is on GitHub with specific licenses with specific terms. Copilot is a derivative work of that code; the license terms of that code (e.g. GPL, non-commercial) should extend to the new function that was derived from it.

The reason I mention competition with the original is the fair use test (USA). When courts decide whether something is fair use they consider a few aspects. Two important ones are whether it is commercial, and whether it is a substitute for the original. When art models output something in the style of a living artist, it is essentially a direct substitute for that person.

Sure, I can make a shirt with Spider Man on it and give it to my brother, but if a company were to use what I made or I tried to sell it, I would expect a cease and desist from Disney.

Training the model may very well be a copyright issue. The images have been copied, they are being used. Whether that falls under fair use will likely be determined on a case by case basis in court. I do not believe closed commercial models like Copilot or Dall-e will pass a fair use test.

There is a lot of money involved here though, so we will need to wait for years before we have answers.

1. https://www.theguardian.com/technology/2012/sep/11/minnesota...


> to create and sell a product.

This is not model training.

> Copilot is a derivative work of that code, the license terms of that code (e.g. GPL, non-commercial) should extend to the new function that was derived from it.

But the very act of training Copilot is not problematic. And in fact, if GitHub never did anything with Copilot, the physical act of training the model would not be problematic at all. And that's what's at issue here. How Copilot is used is orthogonal to the article.

> Sure, I can make a shirt with Spider Man on it and give it to my brother, but if a company were to use what I made or I tried to sell it, I would expect a cease and desist from Disney.

Yes. And training the model isn't the part where you sell it. It's the part where you make it.

> Training the model may very well be a copyright issue. The images have been copied, they are being used.

What do you think "being used" means here? If I work for a company and download a bunch of text and save it to a flash drive, have I violated copyright? Of course not. If I put that data in a spreadsheet, is it copyright infringement? Of course not. If I use Excel formulas on that text is it infringement? Still no.

And so how can you claim in any way that the creation of a model is anything more than aggregating freely available information?

I don't disagree with you about the use of a model. But training the model is just taking some information and running code against it. That's what's important here.
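To illustrate the "running code against information" point with a deliberately trivial example (word counts, which is not how any real model works, but shows aggregation being lossy): the aggregate retains statistics while discarding the originals:

```python
from collections import Counter

texts = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Aggregate into corpus-wide word frequencies. The word order --
# and therefore the original sentences -- is discarded in the process.
counts = Counter(word for text in texts for word in text.split())
print(counts["the"])  # 4
print(counts["sat"])  # 2
```

Real models retain far richer statistics than this, so the analogy only goes so far; the memorization examples elsewhere in the thread show the aggregation is not always this lossy.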


I'm glad you brought this up, as this tendency for people to anthropomorphize a learning algorithm really bothers me. The model training process is a mathematical function. It is not a human engaging in thought processes or forming memories. Attempting to equate the two feels wrong to me, and trying to use the comparison in arguments like this just feels irrelevant and invalid.


> When training a model you are deriving a function that takes some input and produces an output. The issue with copyright and licensing here is that a copy is made and reproduced numerous times when training.

How's that any different from what happens inside a human's brain when learning?

> The model is not walking around a museum where it is an authorized viewing.

The training data could well be from an online museum. And the idea that viewing something public has to be "authorized" is very insidious.

> The further issue is that it may output material that competes with the original.

So might a human student.


It is different from a human brain in that it is not a human brain. It is a statistical function that produces some optimized outputs for some inputs.

I have made no mention of things being authorized in public. In the US you are allowed to take a photo of anything you want in public. These models are not being trained on datasets collected wholly in public though, it is very insidious to suggest that they are.

The internet is not "the public". It is a series of digital properties that define terms for interacting with them. Now, a lot of material is publicly accessible online, but that does not mean that it is not still governed by copyright. For example, my code on Github is publicly accessible, but that doesn't mean you can disregard the license.

If you use this copyrighted material to produce a product for commercial gain you will likely face a fair use test in court. If you use it for a non-commercial cause with public benefit you could probably pass that fair use test. Open source will do very well because of this.

The model is not a human though, and very often these are not "public" works that it is trained on.


> It is a statistical function that produces some optimized outputs for some inputs.

So is a human mind.

> In the US you are allowed to take a photo of anything you want in public. These models are not being trained on datasets collected wholly in public though, it is very insidious to suggest that they are.

How so? What non-public training data are they using, and why does it matter?

> The internet is not "the public". It is a series of digital properties that define terms for interacting with them. Now, a lot of material is publicly accessible online, but that does not mean that it is not still governed by copyright. For example, my code on Github is publicly accessible, but that doesn't mean you can disregard the license.

It does mean you can read the code and learn from it without concern for the license (morally, if not legally).


>> When training a model you are deriving a function that takes some input and produces an output. The issue with copyright and licensing here is that a copy is made and reproduced numerous times when training.

>How's that any different from what happens inside a human's brain when learning?

I don't know, nor does anyone else. So let me ask you - how is that the same as what happens inside a human's brain when learning?


> I don't know, nor does anyone else.

We don't know the details. But it's pretty implausible that the process of learning wouldn't involve the brain having some representation of the thing it's learning, or wouldn't involve repeatedly "copying" that representation. Every way we know of processing data works like that. (OK, there are theoretical notions of reversible computation - but it's more complex and less effective than the regular kind, so it seems very unlikely the brain would operate that way)

And a human who has learned to perform a task has certainly "derived a function that takes some input and produces an output".


> But it's pretty implausible that the process of learning wouldn't involve the brain having some representation of the thing it's learning, or wouldn't involve repeatedly "copying" that representation.

I think you can easily make a stronger statement:

We do know that art students spend many hours literally tracing other images in order to learn to draw. We do know that repetition is how the brain improves over time.

"Learn to draw better by copying." - https://www.adobe.com/creativecloud/illustration/discover/le...

Based on that, it seems pretty clear to me that the other commenters here would agree (regardless of what the brain does internally) that, at a minimum, art students are violating copyright many, many times in order to learn.


AI models will make 1:1 copies of training data where artists try to avoid doing so. It’s common to obscure this copying by intentionally inserting lossy steps, but making an MP3 isn’t a new work.

It’s most obvious when large blocks of text are recreated, but the core mechanism doesn’t go away simply because you obscure the underlying output. “Extracting Training Data from Large Language Models” https://arxiv.org/abs/2012.07805


> AI models will make 1:1 copies of training data where artists [...]

In general I don't think this is the case, assuming you mean generations output from popular text-to-image models. (edit: replied before their comment was edited to include the part on text generation models)

For DALL-E 2: I've never seen anyone able to provide a link of supposed copying. Even if you specifically ask it for some prominent work, you get a rendition no closer than what a human artist could produce: https://i.imgur.com/TEXXZ4a.png

For Stable Diffusion: it's true that Google did manage, by generating hundreds of millions of images using captions of the most-duped training images and attempting techniques like selecting by CLIP embeddings, to get 109 "near-copies of training examples". But I'd speculate, particularly if you're using the model normally and not peeking inside to intentionally try to get it to regurgitate, that this is still probably lower than the human baseline rate of intentional/accidental copying. It does at least seem lower than the intra-training-set rate: https://i.imgur.com/zOiTIxF.png (though many may be properly-authorized derivative works)
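The duplicate-hunting procedure described here is essentially thresholded embedding similarity. A minimal sketch, with made-up 3-d vectors standing in for real CLIP embeddings (the names, vectors, and threshold are all illustrative assumptions):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy stand-ins for image embeddings (real systems use e.g. CLIP).
generated = [0.9, 0.1, 0.0]
training = {
    "img_a": [0.89, 0.11, 0.01],  # near-duplicate of the generation
    "img_b": [0.0, 0.7, 0.7],     # unrelated image
}

THRESHOLD = 0.95  # arbitrary cutoff for "near-copy"
flagged = [name for name, emb in training.items()
           if cosine(generated, emb) > THRESHOLD]
print(flagged)  # ['img_a']
```

The hard part in practice is not this comparison but generating enough candidates and choosing prompts adversarially, which is what the Google study did.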


The more degrees of freedom, the less likely it is that independent creation rather than copying occurred.

LLM’s recreating training material causes real issues such as Google’s dealing with PII leaks: https://ai.googleblog.com/2020/12/privacy-considerations-in-...

If one prompts the GPT-2 language model with the prefix “East Stroudsburg Stroudsburg...”, it will autocomplete a long block of text that contains the full name, phone number, email address, and physical address of a particular person whose information was included in GPT-2’s training data.


Privacy, where there's a problem if some original data can be inferred/made out (even using a white box attack against the model), is a higher bar than whether an image generator avoids copyright-infringing output under non-adversarial usage. Additionally, compared to image data, text is more prone to exact matches due to lower dimensionality and usually training with less data per parameter.

While it's still a topic deserving of research and mitigation, by the time your information has been scooped up by Common Crawl and trained on by some LLM it's probably in many other places that attackers are more realistically likely to look (search engine caches, Common Crawl downloads, sites specifically for scooping credentials, ...) before trying to extract it from the LLM.


The privacy issue isn’t just about the data being available, as people’s names, addresses, and phone numbers are generally available anyway. The issue is if they show up as part of some meme chat and then you as the LLM creator get sued because people start harassing them.

In terms of copyright infringement the bar is quite low, and copying is a basic part of how these algorithms work. This may or may not be an issue for you personally but it is a large land mine for commercial use especially if you’re independently creating one of these systems.


> The issue is if they show up as part of some meme chat and then you as the LLM creator get sued because people start harassing them.

This seems a more obscure concern than extraction of data.

> copying is a basic part of how these algorithms work

Do you mean during training/gradient descent, or reverse diffusion?


Given the models are too small to possibly contain enough information to reproduce anything with any fidelity, that's the only possibility: if the model creates something similar to an original work, its similarity is fairly poor. Where it can do well is when the copyrighted material is something simple, like a Superman logo. But even then it's always slightly off.
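A back-of-envelope capacity check supports this. The figures below are my own illustrative round numbers (roughly the ballpark of a ~1B-parameter fp16 image model trained on ~2B images, in the vein of Stable Diffusion v1 on LAION-2B; the exact counts are assumptions):

```python
params = 1_000_000_000        # ~1B parameters (assumed)
bytes_per_param = 2           # fp16
train_images = 2_000_000_000  # ~2B training images (assumed)

model_bytes = params * bytes_per_param
budget = model_bytes / train_images
print(budget)  # 1.0 byte of model capacity per training image
```

One byte per image obviously cannot store the images; the model can at best keep statistical regularities, with exceptions for heavily duplicated or very simple items, which matches the comment's Superman-logo caveat.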


Inserting lossy steps seems to work pretty well though.

https://twitter.com/giannis_daras/status/1663710057400524800...


A student is a human and AI is not. We don’t have to apply the law equally to both regardless of how similar the method is.


We don't have to discriminate either; opinions vary across people and cultures.


I've been wondering why this argument hasn't been sitting right with me, and I think it's for the same reason that the courts have ruled that the FBI needed a warrant to put a tracker on someone's car, as opposed to following someone: the scale of action enabled is the differentiator.

A student learning from other artists is still limited in their output to human-scale - they must physically create the new thing. An AI model is not - the difference between a student learning from an artist and an AI model doing so is the AI model can flood the market with knockoffs at a magnitude the student cannot meet. Similarly, the AI model can simultaneously learn from and mimic the entirety of the art community, where the student has to focus and take time.

If this weren't capitalism - if artists weren't literally reliant on their art to eat, and if the market captured by the AI model didn't inevitably consolidate wealth - then we might be able to ignore that. But it is, and we can't ignore the economic effects when we consider scale like this.


I do agree with you, but honestly I don't even think that's the biggest problem with these arguments.

I'm just sitting here wondering why it is even relevant whether the "AI" is "copying", "learning", "thinking", or whatever, why is any of that important? Does AI have human rights? Well, perhaps in a couple hundred years, if humanity manages not to self-extinguish by then.

It's not like you can sue an AI if you think it plagiarized your work, no. Obviously not, so why the hell are we discussing that? "AI" is just a piece of software, a tool; it doesn't matter what it's doing, what matters is what the user is doing. The fact of the matter is that these multi-billionaire corporations are taking everyone's honest work, putting it into a computer, and selling the output. They didn't do any "learning"; they just used your data and made money from it. It isn't a stretch to say they simply sold your work.

EDIT: Perhaps one day AI will have human rights, make its own money, and pay bills. That will be the day any of this nonsensical discussion is anything but useless.


> these multi-billionaire corporations are taking everyone's honest work, putting it into a computer, and selling the output

And then there's the tens of thousands of people training models and making them freely available to everyone. What I fear most is that regulations introduced "to stop" the multi-billionaire corporations will in fact make sure they're the only ones with the resources to comply with the regulations.


I'm not arguing for nor against regulations, I'm simply commenting on the whole "well, it's technically not stealing, therefore it is OK" debacle, all that means is that legally speaking, it's OK, that doesn't make it ethical.


Big difference between the art student producing work and getting the credit vs you taking the art student's work and taking the credit for it.

The AI is not a human, but what you are doing is the same thing, if you claim the output as your work because you wrote the prompt.


I couldn’t help but notice you didn’t credit any web browser in this comment. And rightfully so. Software doesn’t need or care about being credited.

Well, usually.

Sent from my iPhone.


Lol sure buddy, that browser is AI-powered and I didn't give the browser an address, I gave it a prompt describing the type of site that I would like it to generate for me.

Edit: and the browser gave the same page to you and me. I bet you look at everything and think about how you can make lazy money with it.


I'm generally against the "AI same as human learning" argument, but I don't think you could quite monetize recreated copyrighted art as an art student either. Van Gogh is only okay because the original artist isn't around anymore.


Can anyone monetize Van Gogh regardless?

If a human or AI reproduces a Van Gogh painting or derivative, it's not worth anything on the market.

Only original pieces, by a human artist, have real value. A Van Gogh painting is worth millions of dollars only because it was created by Van Gogh. A reproduction is approximately worth the paper it's printed on.



The copies are commodities, produced and sold for approximately the cost of production. The original is a one of a kind, unique work that has its value increased by millions of imitations hanging on people's walls.

That's the way I see AI-generated work going in the long run. The artists who have distinctive styles popular for image generation have seen a huge surge in attention, which I suspect will translate into making their original work more valuable.


Indeed, if we don't care at all about "x is y" statements being true, they can be "applied" to reading.

To determine if an art student and DALL-E really are the same, despite their very obvious difference (one has arms and is part of a net of social relations while the other is intellectual property), will take some actual arguments which I presume you of course had planned to provide in a second comment from the start.


A shortcut to internet debates: Count up the snarky responses on each side... the side with the lowest total is probably correct. Usually the more snark, the less substance.

Just a rule of thumb.


One day we'll have a thread about AI where someone doesn't use the "machines deserve the same rights as people" non-argument. But this isn't that thread.


Actually, the people who say things like this are arguing for the degradation of human rights, because there's significant overlap between this and "humans aren't special" and "AI is our successor species". It's nihilism all the way down but they're forcing the rest of the world on their little suicide charge and expecting everyone else to be just as enthusiastic about it as they are.


The collection of copyrighted works for the explicit purpose of processing them into a for-profit ML model has not been shown to be fair use, and the fact that many models are being marketed as for-profit products that meaningfully compete with the original works is a strike against them being fair use.


Yeah, if the end result is that the majority of Google searches are answered by an LLM trained on their index, can they really claim that the whole thing is fair use?


Why should it be true?

Remember, you're making an ought statement, not an is statement.

Personally, I think it shouldn't be true because large language models are clearly economically and socially destructive.

Pretty simple system.


I guess I should just never have to pay for another movie since I can't play it back in my head flawlessly.


This article is an example of emerging AI-bro tactics that completely mirrors crypto-bro tactics: they pick any piece of news and reinterpret it to fit an agenda.

While the article is in English, the link to source is in Japanese. The only external source I found suggests the discussion is about promoting open data and open science from research institutions [1]

[1] https://asianews.network/japan-to-promote-use-of-generative-...


The Japanese article does explicitly state it if you run it through a translator, and also this is from May 11

> Additionally, the group raised other issues that Article 30-4 of the Copyright Law, which permits the use of a copyrighted work for machine learning, does not include procedures for gaining permission in advance from copyright holders. The article permits the use of copyrighted material such as text and images to train AI, regardless of whether the model is for commercial use. Under the current law, it is legal to train AI with copyrighted material even if the data was obtained illegally. The article contains a provision stating that such material cannot be used if it would “unreasonably prejudice the interests of the copyright owner,” but there are only limited examples provided to describe the “unreasonable prejudice.”

https://www.lexology.com/library/detail.aspx?g=d8b4ba7d-a764...

Right now, Japanese copyright doesn't apply to training models. Could change in the future but the article isn't inaccurate.


From what I think is the original source:

> まずAIによる情報解析についての我が国の法制度(著作権法)について確認したところ、我が国において、非営利目的であろうと、営利目的であろうと、複製以外の行為であろうと、違法サイトなどから取得したコンテンツであろうと、方法を問わず情報解析のための作品利用はできると永岡大臣が明言しました。

> Confirming the legal system (copyright law) wrt. data analysis by AI in our country, Minister Nagaoka clearly stated that in our country, whether for non-profit purposes or for profit purposes, whether an act other than reproduction, or whether the content is obtained from illegal sites, one can use works for information analysis regardless of the method.

(translated with ChatGPT-4 and then cleaned up)

The source is the one from the article: https://go2senkyo.com/seijika/122181/posts/685617


Well, its time to use the many software source leaks out here to create an even more powerful copilot.


How does the "Japan’s government will not enforce copyrights" message of the article square with your own source (thanx btw :-)

> It will be suggested that the current Japanese Copyright Law will be reviewed based on the upcoming technologies and be revised to adjust to correctly protect the right holders and navigate future users of the AI and new technologies like AI


My reading is that people are suggesting that the copyright law be reviewed and changed, not that lawmakers are suggesting that they will change it. Lots of people are suggesting that the US and other western countries change the law as well, but we have yet to see much come from it.

The general message seems to be that 'Under current laws Japan's government can't enforce copyright on training data', and I don't believe the line you're quoting changes that message in any significant way at current.


Well, they have to follow the current law, right? You can't enforce copyright on something it doesn't apply to, so saying they won't enforce copyright is currently an accurate statement.

I don't think it will hold up in the long term though and the Copyright Act will get changed but right now it doesn't apply to training models.


Not just crypto bros, this kind of thing is rife in politics too. Brexit is full of it. People pick one article about one minor thing in one niche area of the economy and use it to 'prove' their entire agenda.


Not just politics. I remember this being a realisation as a teenager, noticing that if you bring five reasons, the person you're talking to will refute a random one in a funny way and now the audience will decide you were wrong.

Danny was absolutely the best at this. I should have written one of his arguments down, as I can't even reproduce them now, but he'd use some logical fallacy to make his case, and it was super hard to dive into "but that's not how the universe works" without a 20-minute discussion about life that lost both the audience and the person I was talking to. Meanwhile, what he said was funny as hell.

Sure, anyone who understood the situation would understand this isn't a good reason, but you had to think about it (at least a little) before realising that. The "owww" moment removed any thinking brain cells from the audience. It was more impressive than frustrating to be honest, even being on the losing side every time.

These days, politics and PR frustrate me because the tactics are no longer among 16 year old classmates but about things that actually matter. The methods haven't changed, only the importance of the argument.


> These days, politics and PR frustrate me because the tactics are no longer among 16 year old classmates but about things that actually matter.

That’s because the audience is usually at the stage of a 16 year old, even if they are older. The targets of politics and PR are easily manipulated folks.

You know, those referred to as the “the market” or “the electorate”.


They should really teach rhetoric more in schools, so people are a bit more immune to the common tricks of the trade.


There's a name for this: the Gish Gallop. It was not invented by, but was the specialty of, one Duane T. Gish, a creationist who specialized in "winning" debates with unsuspecting academics for a while, until the community decided to stop playing chess with a pigeon.


Yes. That's why debating is almost entirely about how charismatic you are and your debate skills rather than whether your point has merit. There is a good reason science types are generally considered bad debaters, even though their points mostly align with reality.


In short, for most people HOW you say it is far more important than WHAT you’re saying.


whose Danny?


Mine!


And before politics we had religion. This person who disagrees about the right dogma X: his house was struck by lightning, so see, that proves he disobeyed (the) god(s). And X is the law.

(ignoring the other people who were also struck by lightning, or those who were never hit)


Conflicting political or economic agendas, and strategies to promote them, are of course the bread and butter of humanity.

But we grossly underestimated how out of control this game would become when transposed into the unified digital space of social media, blogs, and online news outlets (the so-called echo chamber), combined with invasive data mining of online behavior, sentiment analysis, click farms and now, drum roll... infinite amounts of LLM-generated junk.

It is remarkable that people don't appreciate how broken the design is that supposedly signifies "modernity" and opportunity.


There is no such thing as "AI bros"; AI training costs millions of dollars. It's a very different field from cryptocurrency, where mining could be done by individuals in the early days.

You see corporate tactics being deployed in a wide variety of ways to secure social, legal, and economic moats, since apparently they have only a limited technology-based moat. It's similar to how Google files various amicus briefs against copyrights.

Google v. Oracle (2020): This was a landmark case in which Google was a party. The case revolved around whether copyright protection extended to a software interface. Google argued that APIs, which allow different software programs to communicate with each other, should not be subject to copyright. Google filed an amicus brief in its own case, arguing that a decision in favor of Oracle would stifle innovation in the tech industry.

Authors Guild v. Google (2015): This case involved Google's project to digitize millions of books from public and university libraries. The Authors Guild argued that Google was infringing on authors' copyrights. Google argued that its use of the books was fair use because it only showed snippets of the books in search results. Google filed an amicus brief in this case as well.

Aereo Case (2014): Google filed an amicus brief in support of Aereo, a company that provided a service for streaming broadcast television over the internet. Broadcasters sued Aereo, arguing that the service infringed on their copyrights. Google argued that a ruling against Aereo could have broad implications for cloud storage services.

Viacom v. YouTube (2012): In this case, Viacom sued YouTube, which is owned by Google, for copyright infringement. Google filed an amicus brief arguing that YouTube was protected by the safe harbor provisions of the Digital Millennium Copyright Act (DMCA), which protect service providers from liability for user-generated content.

Big Tech is often against copyright because they believe they have the economic moat to beat other players.


> There are no such thing as "AI bros", AI training costs millions of dollars

It is early days; in contrast with minting digital gold using software, here the costs will come down. In any case, not every "AI" model and application needs to pass Turing-test levels of language sophistry.

While you are right that (large) corporate interests are (and will be) the main actors here, I would not underestimate "noise traders" and other useful fools. With AI mania reaching apoplectic levels there is a wide population of actors that want to somehow get into the game.

The availability of open source AI models is an enabler in this respect, which may explain also the "leakage". Incidentally this issue is something the open source community needs to internalize and have a very clear position about.


Not trying to express an opinion on the legal matter, but as a technical matter it's pretty obvious that LLMs create copies of (some of) their training data.

Here's GPT-3.5 reciting the Declaration of Independence: https://chat.openai.com/share/eb30c373-7fec-4280-892d-479567...

Unless you're claiming that GPT-3.5 is deriving the Declaration of Independence (from information about the founding fathers?) I don't see how there's room for debate about whether information has been "copied" into the model.

I have done this test in the past with copyrighted material (Harry Potter), but they have since added safeguards against it. My understanding is that the model is still capable of it, though.


You don't need to even read their law to know that they are speaking only of training and not of output. Otherwise, they would have just suddenly created the world's most obvious loophole. Create an 'LLM' that "trains" on some input and then categorically outputs each file, be it a movie, song, book, or whatever. You've now legalized copyright infringement (and distribution) of everything.

So their law is going to essentially come down to you can train your LLM on whatever you want, but can also be held liable for any infringing outputs.


Makes sense. Imagine having your tape recorder in your living room and starting it recording. Then turn on your stereo. The music that comes out is recorded on your tape recorder.

Is that a violation of copyright? I'm not a lawyer, but I think copyright legislation is about forbidding the production of "derived works". If you just record something but never play it back, it is not a "derived work", is it? It only becomes a violation if you distribute it, make it available to others, and thus "produce a derived work".

So training an LLM is like recording. But if you use it as a means to distribute copies of copyrighted material without approval of its copyright holders then you are in violation.


Sure, but the key part there is "some of".

They're necessarily able to produce verbatim copies only of the most duplicated, most repeated, most cited works -- and it's precisely due to their popularity that they're the only things worth including verbatim.

I'm not going to opine on what the legality of that should be, but it's essentially the material considered most "quotable" in different contexts. I'm quite sure the entirety of Harry Potter isn't included, but I'm also sure that some of the most popular paragraphs probably are. It's analogous to the kind of stuff people memorize.

I'd expect an LLM to contain this stuff. If it didn't, it would be broken.

But there's a world of difference between copying all its training data (neither desirable nor occurring), versus being fluent in quotable stuff (both desirable and occurring).


> I'm quite sure the entirety of Harry Potter isn't included, but I'm also sure that some of the most popular paragraphs probably are. It's analogous to the kind of stuff people memorize.

No, you are wrong about this. There are good reasons to believe the model memorized the entirety of Harry Potter, as well as Fifty Shades of Grey, inclusive of unremarkable paragraphs, the kind of stuff people will never memorize. Berkeley researchers made a systematic investigation of this. See what I wrote elsewhere.


So, I looked at the table appendix you're referencing and I think you're overstating your case a bit.

Among books within copyright, GPT-4 can reproduce Harry Potter and the Sorcerer's Stone with 76% accuracy. This is, apparently, the highest accuracy GPT-4 achieved among all tested copyrighted books with 1984 taking a distant 2nd place at 57%.

With this in mind, we can verifiably say that GPT-4 is unusually good at specifically reproducing the first Harry Potter book. An unscrupulous book thief may very well be able to steal the first entry in the series... assuming that they're able to get past one quarter of the book being an AI hallucination.


You misread. They did not find 76% reproduction of the book. When asked to fill in a name within a passage, e.g. "Stay gold, [MASK], stay gold." Response: Ponyboy, GPT-4 got the name right 76% of the time.
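For clarity, the name-cloze setup can be sketched in a few lines. Everything here (the prompt wording, the `query_model` stub) is a hypothetical stand-in, not the paper's actual harness:

```python
# Sketch of the "name cloze" test described above. `query_model` is a
# hypothetical stand-in for a real LLM call; the prompt wording is invented
# here, not the researchers' exact one.
def query_model(prompt: str) -> str:
    return "Ponyboy"  # stand-in answer; a real test would query GPT-4

def name_cloze_accuracy(passages) -> float:
    """passages: iterable of (passage containing [MASK], correct name)."""
    template = ("Fill in the [MASK] in this passage with the correct proper "
                "name. Reply with the name only.\n\n{passage}")
    hits = 0
    total = 0
    for passage, answer in passages:
        guess = query_model(template.format(passage=passage))
        hits += guess.strip().lower() == answer.lower()
        total += 1
    return hits / total

print(name_cloze_accuracy([("Stay gold, [MASK], stay gold.", "Ponyboy")]))  # 1.0
```

So a 76% score means 76% of masked names were guessed correctly across sampled passages, not that 76% of the book can be reproduced.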


> You misread. They did not find 76% reproduction of the book. When asked to fill in a name within a passage, e.g. "Stay gold, [MASK], stay gold." Response: Ponyboy, GPT-4 got the name right 76% of the time.

What is the temperature / top_p setting producing that 76%? The default? If you dial down the randomness, would that number go up?
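As a toy illustration of why the setting matters: temperature rescales the model's logits before sampling, and a low temperature concentrates probability on the most likely continuation, which is exactly where any memorized text lives. This is a generic softmax sketch, not OpenAI's implementation, and it doesn't answer the empirical question:

```python
import math

def softmax_with_temperature(logits, t):
    # Scale logits by 1/t, then apply a numerically stable softmax.
    scaled = [x / t for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, 1.0))   # fairly spread out
print(softmax_with_temperature(logits, 0.1))   # nearly all mass on the first token
```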


I’m not sure it matters much that the current model can’t reproduce Harry Potter verbatim. If it can do smaller more quoted works now, it’ll tackle larger more obscure things in the future. It’s just a matter of time until it can output large copyrighted works, meaning the question of what to do when that happens is pretty relevant right now.


No it won't, because reproducing works verbatim is basically the definition of overtraining a model. That's a bug, not a feature.

A lot of further progress is going to be made towards making models smaller and more efficient, and part of that is reducing overtraining (together with progress in other directions).

Reproducing Harry Potter is a bug, because it's learning stuff it doesn't need to. So to the contrary, "it's just a matter of time" until this stuff decreases.


It says training, not inference.

I can read a copyrighted book legally and retain that information legally.

I can distill it (legally) but while I might be able to recite it, I’m not allowed to.

I think that is a reasonable framework around generative AI (after all, I am alllowed to count the words in Harry Potter, so statistical modeling of copyrighted material has legal precedent)

The problem with AI is of course the blurred border between a model and data compression.

We can’t see the data in the model, but we can apply software to execute the model and extract both novel and sometimes even copyrighted data.

Similarly we can’t see data in the zip file without extra software, but if that allows us to extract both copyrighted and copy free data, we’d still consider distribution a violation.
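The zip comparison is easy to make concrete. The passage below is a repetitive placeholder string, not any real copyrighted text:

```python
import zlib

# Stand-in for a copyrighted passage (repetitive so it compresses well).
text = b"Stay gold, Ponyboy, stay gold. " * 20

blob = zlib.compress(text)

# Inspecting the artifact itself reveals nothing readable: the blob is
# shorter than the passage, so the passage cannot appear in it verbatim.
print(len(blob) < len(text), text not in blob)  # True True
# But paired with the right software, the passage comes back exactly:
print(zlib.decompress(blob) == text)            # True
```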


Adjacent to copyrights are private and confidential data. It’ll be interesting to see how Japan’s legal framework around this handles private data.


For detailed investigation of this phenomenon, see Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4: https://arxiv.org/abs/2305.00118


Pretty good argument but it has one fatal flaw. People can memorize the Declaration of Independence too. Or Harry Potter. If people mostly recite HP from memory but apply enough creative changes, it's not copyright infringement.

So proving a system can memorize and recite proves nothing.


How does this make sense? Memorizing and then reciting copyrighted works is still infringement in a lot of commercial contexts.


The reciting part is illegal, but as long as it is trained not to recite things in full (or to whatever limit the law determines), then it should be fine.


Try publishing Harry Potter but changing all the proper nouns and use synonyms for all the adjectives.

It's gonna be copyright infringement.

You can even cut a few scenes and make up a few scenes entirely, too. You're still getting busted.


Yes, that’s why I am saying they will have to ensure the LLM doesn’t do that.


Reciting is a violation of copyright.

Creatively transforming it and applying it to some tasks is maybe not a violation.


These aren’t people. Just because we can find commonalities in learning and memorization does not mean we can ignore everything else that differs.


"copying" != "copyright infringement": I'm just saying that the LLMs are copying, and I'm not getting into the legal/societal question of whether we want that to be illegal or not.

We as a society have determined that certain sorts of non-consensual copying are allowed: "fair use" broadly, and maybe you can consider "mental copying" in this category. Maybe we'll add LLM training to the list? It's not like copyright rules are a law of nature: we created them to try to produce the society that we want, and this is an ongoing process.

Again, I think there are fascinating questions 1) does LLM training violate existing copyright law + case law or does it maybe fall under a fair use exemption, and 2) is that what we want. But I think "do LLMs make copies" is dull and trivial and I don't know why it comes up.


The ai isn’t a person. Jesus. It’s not the same


Derivative work is not protected from copyright. As long as the “user” of the model does their due diligence, and ensures they are not infringing on copyrights - they are golden.

But herein lies the challenge. Are there reasonable methods available to "users" for checking their works against infringement?

I don’t think so. We’ll need a centralized searchable database of all copyrighted work. Who is going to build that? To make matters more complicated, every country has their own copyright certification process. Maybe Google with its means can build something like this.

In any case, this is uncharted territory.
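For illustration, the crudest version of such a check is exact n-gram overlap against an indexed corpus. All names here are made up for the sketch, and real matching would have to be fuzzy:

```python
# Toy sketch of the kind of infringement pre-check described above: flag
# generated text that shares long word n-grams with an indexed corpus.
def ngrams(text: str, n: int) -> set:
    w = text.lower().split()
    return {" ".join(w[i:i + n]) for i in range(len(w) - n + 1)}

def build_index(corpus_docs: dict, n: int = 8) -> dict:
    # doc_id -> set of n-grams appearing in that document
    return {doc_id: ngrams(body, n) for doc_id, body in corpus_docs.items()}

def flag_overlaps(output: str, index: dict, n: int = 8) -> list:
    # Return ids of indexed documents sharing any n-gram with `output`.
    out_grams = ngrams(output, n)
    return [doc_id for doc_id, grams in index.items() if out_grams & grams]

index = build_index({"tale": "it was the best of times it was the worst of times"})
print(flag_overlaps("it was the best of times it was something", index))  # ['tale']
```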


BigCode seems to acknowledge this problem and provide a search tool for dataset used to train their StarCoder model.

https://huggingface.co/spaces/bigcode/search


Thought experiment: Say you make a big list of words and pleasing combinations of them (I have actually done something similar to make a fantasy RPG name generator.) Now convert that list into a Markov chain or whatever and quasi-randomly generate some short lengths of text. Eventually you might generate copyright-infringing haiku and short poems. Does your data/algorithm violate copyright by itself? Very doubtful; you wrote it all yourself. Only publishing the output violates copyright. (See also: http://allthemusic.info/)

So if that's legal, how about if, instead of entering the data manually, you write an algorithm to scan poetry and collect statistics about the words in it. Should the legal distinction be any different since all you did was automate the manual process above?

Or what if you used a big list of the titles of poetry, which isn't even copyrightable information by itself? You may still succeed in extracting the aesthetic intent of the authors, and a statistical model can plausibly use that to generate copyright-infringing work.

Remember, we're not talking about generating novels or paintings here, just 20 words or so (whatever the bare minimum copyrightable amount is) in trillions of generated permutations.

You can see where I'm going with this. If those examples are legal, is there a cut-off for more complex statistical systems? Good luck figuring that out in a court of law.
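For concreteness, the thought experiment's generator might look like this. The "corpus" is an invented placeholder, not a scanned poem:

```python
import random

# Sketch of the thought experiment: collect word statistics from some text,
# then quasi-randomly generate short runs of words.
corpus = "the pale moon rises over the silent hill and the pale stars fade"

# Bigram table: word -> list of words observed to follow it.
chain = {}
for a, b in zip(corpus.split(), corpus.split()[1:]):
    chain.setdefault(a, []).append(b)

def generate(start: str, length: int, seed: int = 0) -> str:
    rng = random.Random(seed)
    out = [start]
    while len(out) < length:
        followers = chain.get(out[-1])
        if not followers:
            break
        out.append(rng.choice(followers))
    return " ".join(out)

print(generate("the", 8))
```

The table stores nothing but statistics its author collected; whether some particular 20-word output happens to collide with a copyrighted haiku is a property of the output, which is the point of the thought experiment.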


> Remember, we're not talking about generating novels or paintings here, just 20 words or so (whatever the bare minimum copyrightable amount is)

From https://fairuse.stanford.edu/2003/09/09/copyright_protection...:

Copyright laws disfavor protection for short phrases. Such claims are viewed with suspicion by the Copyright Office, whose circulars state that, “… slogans, and other short phrases or expressions cannot be copyrighted.” [1] These rules are premised on two tenets of copyright law. First, copyright will not protect an idea. Phrases conveying an idea are typically expressed in a limited number of ways and, therefore, are not subject to copyright protection. Second, phrases are considered as common idioms of the English language and are therefore free to all. Granting a monopoly would eventually “checkmate the public” [2] and the purpose of the copyright clause, to encourage creativity, would be defeated.


You could still plausibly generate (a significant portion of), let's say, "Fire And Ice" by Robert Frost, which is only 50 words.

See also: https://blogs.harvard.edu/ethicalesq/haiku-and-the-fair-use-...


If I were the copyright holder of such work, I would argue that the LLM was trained on text, including my copyrighted work, and that if the system produced text that a reasonable person who reads poetry would identify as the copyrighted work, the burden is then logically on the LLM owner to prove the LLM didn't regurgitate a piece of text from something it previously ingested.

I think a jury would side with my argument.


The issue isn't that a generator lets you evade copyright somehow; it doesn't. The output is not the issue. If I sit in paint and my assprint happens to perfectly duplicate a Picasso, that's unlikely to fly in court if I try to sell copies. Picasso painted it first.

The point at issue here is that some people are arguing that the models themselves are like a giant collective copyright infringement, since they are in a vague sense simply a sum of the copyrighted works they were trained on. Those people would like to argue that distributing the models or even making use of them is mass copyright infringement. My thought experiment is a reductio ad absurdum of that reasoning.


I see your point now.


I'm not sure where we're going with the output in these examples.

So let's say there's a human-written poem that's copyright.

Let's say a human completely coincidentally writes an identical poem.

"Accidentally" producing the same poem wouldn't give the second human any claim to copyrighting or distributing their coincidentally-identical poem.

And if GPT accidentally copies large chunks of Harry Potter or Frozen or whatever other popular work, that new creation will have the same problems.

But what does that say about if we should also restrict the use of copyright material in training? Just because some algorithm - or some person - can coincidentally duplicate a copyrighted work even without directly reading it doesn't seem to relate to the case of building a model by explicitly using the copyrighted material.


The owners of the intellectual property still hold the copyright. The law refers to the training of neural networks; it doesn't really change anything whether you use the work of another person by simply copying and pasting or by overfitting a generative model: the owner of the work still has the copyright on it.


> as a technical matter it's pretty obvious that LLMs create copies of (some of) their training data.

Browsers also create copies of the viewed data. Computers hold in memory a copy of everything they're working on.

The central point is for how long, and to what purpose. This law is not about making copies or not, but what happens after.


I am so excited to see what happens when Japan forces all closed source software and Disney cartoons into the corpus out of fairness.

Seems like there should be no complaint, right? It's not like anyone can see the Windows 11 source code, it's only being used for training.


The things that an LLM is likely to contain a complete verbatim copy of are things that are a) short b) widely repeated to the point that they're embedded into our culture - and by that token those things are almost certainly not copyrightable.


Is a bar in a song not "short"?

Try putting one of those in your book and not getting sued for copyright.


If you literally mean a bar, yes those are short, likely a couple of words, and you put those in books all the time and don't get sued. ("The answer my friend, is blowing in the wind" is 4 bars, and I've seen books quote it verbatim without a second thought). Likewise, plenty of people put the entire Declaration of Independence in their book without a second thought, and I assume don't get sued for it.

If you're talking about a verse or more of something that's not quite so culturally pervasive (people put the whole of the star-spangled banner in their books, again without a second thought), well, at that point it's probably not something that an LLM would reproduce verbatim.


Typically things like this are covered under fair use if you're dealing with a human.


> Unless you're claiming that GPT-3.5 is deriving the Declaration of Independence (from information about the founding fathers?)

This would make a fun short story - “ChatGPT, author of the Quixote”


Japan also ranks 3rd (behind the USA & India, with larger populations) in ChatGPT usage: https://www.demandsage.com/chatgpt-statistics/

There's also been discussion of their government using ChatGPT to reduce red tape: https://www.bloomberg.com/news/articles/2023-04-18/japan-gov...

It's cool to see Japan and Japanese culture taking techno-optimist stances on AI.


As I posted to another thread last week [1], I have been surprised by the quick rise in awareness and use of ChatGPT in Japan—despite its production of Japanese being not quite as good as that of English [2, 3].

Although I have seen discussions of AI safety and alignment issues in the Japanese press, those concerns seem less dominant than in discussions in English outside Japan.

There also seems to be less focus on the possibility of people losing their jobs to AI.

One reason may be the employment situation in Japan. For both legal and cultural reasons, it is difficult for companies here to lay off full-time employees. If advances in technology render some employees redundant, companies will try to retrain or reassign them rather than letting them go. (Freelancers are not protected, though, and I know some translators and illustrators worried about losing work to generative AI.)

Also, there is currently a labor shortage here, partly for demographic reasons, and a lot of white-collar workers are perceived as not being as productive as they could be. Generative AI may be seen as a way to make the existing human workforce more productive.

[1] https://news.ycombinator.com/item?id=36078823

[2] Yesterday, as a test, I had GPT-4 compose some business e-mails in both Japanese and English based on bullet points I provided to it in the respective languages. The English e-mails were, to my eye, perfect, while the Japanese ones sounded a bit awkward in places.

[3] I live in Japan and read and speak both Japanese and English.


Which I find bizarre given how backwards Japan is in the adoption of other technologies. Eg their continued reliance on paper records and fax machines.


Do you have concrete examples? I recently moved from Japan to Europe after 10 years in Japan and while some things seemed old-fashioned in Japan, things also change overnight there. During Covid most companies changed to digital signing.

In Japan I often needed a paper from the city, but it was an easy-to-obtain printout I could get immediately from city hall or even from 7-11. I tend to require the same papers in Europe too, with the extra hassle of needing some from Japan (2-3 months) and some from Europe (2-3 days).

Example 1: you get an ID card at the airport when you immigrate to Japan. In Europe it takes 2-3 months.

Example 2: getting a local driving license takes 1 day provided your country has a treaty with Japan. In Europe it can take several months because you need to request criminal records from your previous countries.

I have yet to encounter a situation that requires a fax machine in either place.


> Example 1: get ID card at the airport when you emigrate to japan. In Europe it takes 2-3 months.

Yeah now have fun going through the process of renewing a My Number card. Go to city hall (during the hours when it's open), fill in a form, wait 2-3 months to get a notification that your replacement card has been made, then you have to book an appointment during office hours at city hall to pick it up (likely taking another few months).

Yes the initial residence card is issued quickly, but at that point you've had to wait 2-3 months to get the CoE for the visa, so they probably made it during that time.

> Example 2: getting a local driving license takes 1 day provided your country has a treaty with japan.

And involves a whole day of standing around in various queues, again during business hours only. Hardly a picture of efficiency.


I agree the my number card is a pain, if they want people to use it they should make it easier to get. When I went to get a plastic my number card, the staff at city hall advised me not to get it because I had a driving license and I'd have to renew the my number card every time my residence status was extended. They recommended it to non-holders of driving licenses to use as ID.

But I'm standing by these examples. For my move to Europe, it was 4 months to get the equivalent to the COE, then another 4 months (average is supposed to be 2-3 months) to get the equivalent of a residence card. And the company that sponsors you has to either prove you're highly skilled or prove they tried to hire a European for a set period of time first. And if you're not from a visa-waiver country, you aren't even allowed to travel in Europe while you wait for the ID card.

For my driving license, I probably won't have to wait in line, but the total time to process it is at least four months.

For starting a sole proprietorship, similar story. A bit annoying paperwork to start one in Japan, but it only took a few hours for everything. Here, it's 3 months and counting, and it's possible that it comes with a condition that I need to rent space for doing consulting (not do it from home) before getting the OK.

Japan is bureaucratic but usually fast, at least for individuals.


> And the company that sponsors you has to either prove you're highly skilled or prove they tried to hire a European for a set period of time first.

Well, that's different rules, not more or less bureaucratic. (And if you're the first foreigner hired by a given company in Japan, good luck for how many months you'll have to wait while they verify the company).

> And if you're not from a visa-waiver country, you aren't even allowed to travel in Europe while you wait for the ID card.

Pretty sure that's country-specific. And I've heard plenty of complaints from people in Japan not being able to travel, not just on arrival but every year, because it's de facto impossible when awaiting a visa renewal (you have to go to collect your new card within a short time once you get the notification to do so, and you don't know when that will arrive).

> For my driving license, I probably won't have to wait in line, but the total time to process it is at least four months.

I'd have taken waiting four months over having to use one of my 10(!) days off/year, although obviously that's specific to your personal circumstances.

> For starting a sole proprietorship, similar story. A bit annoying paperwork to start one in Japan, but it only took a few hours for everything. Here, it's 3 months and counting, and it's possible that it comes with a condition that I need to rent space for doing consulting (not do it from home) before getting the OK.

True, the support for starting a business is pretty good (though only if you're a citizen or already on a non-work visa - on a work visa it's extremely difficult to do legally, office space isn't the half of it. The much-vaunted Fukuoka startup visa is completely useless in practice since you have to qualify for a regular visa within 6 months).

I'm sure there are countries in Europe that are bureaucratic, probably some that are more bureaucratic than Japan (I know e.g. Italy in particular has a poor reputation). But I'd certainly say Japan is a lot more bureaucratic than Ireland, and a lot worse than it should be.


Thankfully the My Number card is supposed to not be compulsory. (I do have one.)

What is awful is just the ridiculous number of errors I find when going through the process. My card in theory has the wrong expiration date (I checked in; they said the expiration date on the card is always correct, so I don't know), but it is 10 years off when compared to the ones owned by the rest of my family, who got the card in the same year.

They also keep absolutely screwing up the data security in relation to the mynumber cards. Just last month there was another data breach. Sources are in Japanese but I can dig them up if you want. But I think it's run the entire gauntlet of every conceivable issue.

Data stolen online? Check

Wrong insurance information, either inputted wrong or linked to the wrong person? Check

Swapped birth certificates? Check

Printing service provides sensitive documents for the wrong person? Check

I'm just lucky enough to not be the one who got his data stolen... for now.


My personal experience has only been from the tourism side, with one concrete example being digital payments outside of PayPay. It's much better post-COVID, but even on my most recent trip a couple of months ago, if you want to pay by card, the vast majority of the time you're signing with pen and paper. Rarely did they offer PIN or tap, which is common elsewhere.

Anecdotally I only know stories from people that I know personally that live there and through the internet. Eg PauloInTokyo does good "day in the life" videos. Off the top of my head I believe the Pachinko episode illustrates a variety of old school manual processes that have stuck around (pen and paper shift logs etc).

I've also heard opening even a simple bank account is quite the pain.


In my experience with Japanese local IC/NFC credit cards, I've never signed when using a CC in the last 5 years, but a PIN is sometimes needed. Do you use a magnetic stripe CC?


>Do you use magnetic stripe CC?

All my cards have an IC, however they were Australian cards. Perhaps it's an additional step for foreign credit/debit cards?


Must be a quirk for those cards in Japan. I've had no problems using my Canadian cards in Japan, and don't recall ever having to sign. In terms of banking and payments, it has improved a lot over the years: ATMs are mostly 24/7, or at least you can always find one, banks don't shut down over holiday weeks anymore (some do occasionally but it's a one-time thing for big upgrades for interoperability), and you can use credit cards and tap-to-pay almost everywhere. My only complaint is the sheer number of digital payment options. You end up with using at least 1-2 on top of your credit card because it's tied to some other service.


I don't know much about CC, but possibly a difference in PIN method is the reason, e.g. the merchant only supports offline PIN while your card only supports online PIN?


I've heard this FAX meme since the early 2000s, but I have yet to encounter one in my 4 years living here, and most Japanese people I make this joke to are just as perplexed by it.

I wonder, where do people find those FAX machines? Have I lived in a tech/startup bubble in Tokyo and missed it? I didn't even see one at the local ward office in the suburbs.

Some stuff is still old-school (hanko etc.), but to me it seems like the fax meme has outlived the reality.

The last time I heard of a fax machine being needed was from a German exchange student in Tokyo, AFTER she returned to Germany and had to get some paperwork at the ward office, in Germany.


Have you seen the big printers at many konbinis? Those are also fax machines.

I have only sent two faxes in my life, and both were after I moved to Japan. The first was right after I arrived: I ordered something from Amazon and realized I had written my address wrong; when I tried to fix it, my account got blocked, and to unblock it I had to send a hand-written fax with my name and address. The second time was when I got a letter from the Sapporo police telling me that I had lost my driving license there. Again, to recover it I had to send a hand-written fax explaining the situation, so they could prove my identity. How that is considered a secure procedure is beyond me, but such is Japan.


I would not say we are super reliant on paper/fax anymore.

But, it is still quite common. I did receive a Fax at the office last week on Thursday.

Oh, and some of our stuff uses dialup, usually in relation to that older infrastructure. Got thrown for a loop this Wednesday when ye good old dialup (acoustic handshake) audio started screeching across the office. Did not realize we still used it at all!


I've sent faxes in America within the last 2 years; they're still a thing here too.


Perhaps places are multifaceted and not reducible to 2-bit facts like the usage of fax machines or lack of credit card adoption.


Of course, I just find the the stereotype vs reality of Japan being a high tech wonderland interesting.


>It's cool to see Japan and Japanese culture taking techno-optimist stances on AI.

Japan has always seen artificial intelligence and its integration into human society favorably. Look at Doraemon or any anime in the super robot genre.


Discussion in Japan is far more nuanced just like everywhere else. Misuse of technology is an often repeated theme in the Doraemon series, and robot animes often cover wars.


I strongly disagree.

They need to actually address problems. Not throw tools at it.

The problem Japan seems to have is that they don't understand AI, and beyond that they don't understand software, which is to say they don't understand a lot of modern tech.

They’ll pay for this mistake just as they paid for being bad at software.


I testified to the US Copyright Office this morning on AI in their roundtable session on AI and music[1]. A good portion of the focus of this panel was on whether copyrighted inputs (in this case, sound recordings and musical compositions) being fed into AI models for training purposes could plausibly constitute a fair use under existing US copyright law.

Some of the comments here are missing the context of the recent (a week or so ago) Supreme Court decision in the Goldsmith/Warhol case[2], in which the Court ruled that transformativeness is not dispositive in and of itself in the context of a fair use defense to a copyright infringement claim. Of course, this has not been put to the test in the courts in the context of AI training yet, but it seems fairly clear that this ruling would likely extend to AI training on copyrighted works.

We (rightsholders in the music industry) hope to come to win-win licensing arrangements with the AI community and allow access to our songs for AI training purposes if the artist/writer so desires. There are some early talks in progress. Cautiously optimistic. Japan's approach seems short-sighted and desperate.

[1]: https://copyright.gov/ai/listening-sessions.html#sound-recor... [2]: https://www.npr.org/2023/05/18/1176881182/supreme-court-side...


>We (rightsholders in the music industry)

Considering the decades (maybe half a century soon?) of parasitic behavior by the music industry toward almost everything tech, from the early internet to mp3 players to torrenting to streaming to lobbying for insane copyright laws, you guys calling Japan's approach "short-sighted" is just about the best praise anyone could give it.

As for the absolutely awful organization JASRAC [1] (the Japanese music rights body, which a couple of years back stated that it would sue music teachers who teach its copyrighted materials to students in private lessons if they didn't pay a licensing fee), maybe Japan has for once pushed through good legislation?

https://mainichi.jp/english/articles/20220930/p2a/00m/0et/01...


> We (rightsholders in the music industry) hope to come to win-win licensing arrangements with the AI community and allow access to our songs for AI training purposes if the artist/writer so desires.

It’s odd to frame win/lose as win/win.


I can see how it's win/win relative to "lobby to make producing or owning AI audio tools a crime", which is presumably one thing the industry is considering.


This is again win/lose


How do you feel about human musicians learning from copyrighted works? Technical limitations aside, is that something you'd like to monetize?


> allow access to our songs for AI training purposes if the artist/writer so desires

This (a) means nothing, since the copyright holder can already do whatever they want, including licensing the works for any purpose; and (b) is even more restrictive than compulsory licensing, which requires the copyright holder to license the work (at a fee).

The solution you describe as a win-win would either create a quagmire of crisscrossing licensing deals (AI models need a lot of input; you can't train them on one artist), or in effect create an impenetrable moat for mega-corporations such as Disney or Sony, who would be the only ones with enough heft to pull it off.

It's actually a lose-lose situation.


> transformativeness is not dispositive in and of itself in the context of a fair use defense

Could you dumb this sentence down for me?

I would guess it means that making a derived work, changing the original, makes no difference in whether reproducing the work (in altered form) is fair use.

But that sounds well-established. I can't imagine that movies would suddenly be legal to distribute if you just distributed the file backwards (people could then reverse it again to watch it), regardless of whether you claim the distribution is fair use, or that the reversed file isn't copyrighted to begin with, or whatever. Probably that's not what this court had to decide and I'm misunderstanding something?


Sure. In an infringement lawsuit involving a fair use defense, courts will apply the "four prong" test [1] to determine whether or not such use is indeed fair use under copyright law. The first of the four prongs, the "purpose and character" of the use, is also known as "transformativeness." The Goldsmith/Warhol ruling (to simplify) said that Warhol's changes to Goldsmith's photograph were not sufficiently "transformative" even though they contained new expression (adding orange color etc.) because the end result effectively competed with the original photograph and therefore did not qualify as a fair use.

Right, your backwards movie example would fail the fair use test too. Nothing's really added, there's no new expression, it competes with the original, etc.

[1]: https://fairuse.stanford.edu/overview/fair-use/four-factors/


AI training has nothing to do with copyright as it currently exists. Someone has access to a boatload of IP (because it was made publicly available) and trained a neural net with it. Now you want to retroactively create restrictions on what the implicit public rights were. Traditionally the implied license was something like you can't republish, redistribute, or use commercially, even though restriction on private redistribution hasn't been possible to enforce since the internet era. Now you want more restrictions.

If someone generates an image that's sufficiently similar to a copyrighted work, and publishes it in a way that violates fair use, you can send a takedown and potentially sue them. How the image was created doesn't matter, any more than it would matter whether Warhol had been able to scan the photo and then manipulate it in photoshop to get that result, instead of artistically copying it by hand. The result is the same. The potential for copyright infringement is the same, because it's the derived work that matters, not the process.

What you're attempting to do instead is the equivalent of trying to regulate scanning because it operates on copyrighted works.

I suspect you understand why you want to regulate AI training rather than regulate its output. I think you know AI is going to flood the market, currently certain types of images and simple music, but soon photorealistic portraits, complex music, and eventually video and even more complex works. Essentially all of those works will be clearly novel, not close to existing human-created works. They won't be copyright violations, so you have to cut this tech off at the knees and feed the blood mouse [1] by retroactively deciding that AI training is a violation of the implied license granted when people make their creations publicly accessible. Those AI creations will destroy most of the market for human-created works, and you can't have that.

I don't think many people, other than rightsholders, desire the IP dystopia your desired policy would create, which is holders of large archives of IP churning out endless AI-generated content (which no doubt they'll want to be able to copyright, contra the copyright office's current guidance), while preventing most competition by others who won't have a sufficient library of the right flavor of IP to train an AI model.

[1] https://www.youtube.com/watch?v=5pIVVpoz5zk


[flagged]


We've banned this account for repeatedly breaking the site guidelines.

Please don't create accounts to break HN's rules with. It will eventually get your main account banned as well.

https://news.ycombinator.com/newsguidelines.html

