Artificial Intelligence and Copyright: Request for comments (federalregister.gov)



I believe we first need to answer the question of whether the copyright of the AI model’s source text or images affects the output.

My opinion — and note I’m a software engineer, not a lawyer — is that an AI, being a statistical model and not generally intelligent, should not be allowed to disregard the copyright of its source material. This would, I think, require the AI’s creator to secure a license for all of its sources that allows this sort of transformation and presentation. And further, a user of the AI would themselves require a license to use the output.

The alternative seems to be “anything goes”.


I don’t think it makes sense for both model builders and the model’s users to separately obtain licenses for the same works used in the training set.

A model trained on several copyrighted data sources cannot somehow be used in a way that depends on only a subset of those sources.

So all parameters of usage and compensation should be settled by contract between the model builder and copyrighted data supplier, before the copyrighted material is used.

Or to put it simply: using copyrighted material to create a model would NOT be considered fair use.

That’s it. That’s the standard. No complicated new laws required.

Model builders obtain permission to use copyrighted material from copyright holders based on any terms both agree to.

Terms might involve model usage limits, term limits, one time compensation, per use compensation, data source credits, or anything else either party wants.

The likely result will be some standard sets of terms becoming popular and well known. But nobody has to agree to anything they don’t want to.


I slightly disagree, in that I think the person using the tool should bear the burden of copyright. I.e. if the model outputs something under copyright, it merely can't be republished. In the same way, I can use Photoshop on proprietary data but I can't necessarily sell the results.


I'm so torn. On one hand, what you suggest seems to be a nearly ideal balance between advancing scientific progress and legal liability. By placing the legal burden to publish generated works on the person actually trying to publish, it allows for a more nuanced legal approach (i.e. the difference between "there are similarities to this work, but it's murky" and "you 100% stole that work").

On the other hand, is the company running the model themselves not already publishing all of that work and profiting from it? It seems unfair that their bottom line gets to be bolstered because they can produce work based on any artist, whereas the consumers of that work may end up walking on eggshells in order to publish it.

Like I said, I'm torn as far as how it "should be". I know how I want it to be though. I would love if AI continued training unabated. The results have been amazing, and I believe it would be a shame if the effort was slowed down by legislation.


> is the company running the model themselves not already publishing all of that work and profiting from it?

No, because the model is transformative enough that it cannot be said to be a derivative work of the training set.

The model is in essence a form of distilled information, extracted from the training set. Information cannot be copyrighted - only expressions can.

Therefore, a model producer should have the right to use any pre-existing work, in the same way a person can, to study and internally memorize and extract information.

The reproduction of any of the training set data constitutes a copyright violation, but that reproduction is done not by the owner of the model but by an end user of the model.


My point is that if a court finds that a generated image is indeed similar enough to constitute an infringement when a subscriber of for instance MidJourney attempts to publish it, has that work not already been "published" to the subscriber? And has MidJourney not profited by gaining a subscriber based on the work of others?


I wonder if that analogy represents the same thing. Speaking purely from a non-legal perspective on the ethics in my mind:

When you use Photoshop on proprietary data you're providing the original data, choosing what manipulation to make (i.e. what tool), and directly creating the output. It makes sense that if you redistribute this it may be a copyright violation.

When you use Copilot or ChatGPT for programming you're typically asking a non-proprietary question or accepting suggestions it's making based on non-proprietary (or proprietary to you) code in the file. You also don't dictate the manipulation process a black box deep learning model does (i.e. I haven't asked it to do something that could be reasonably thought to be a copyright violation).

Am I then responsible for the fact that Copilot is fooling me with effectively copy-pasted copyrighted code when it's being presented to me as generated by the software and I haven't instructed the software to commit a copyright violation? I'm not sure if intent matters for copyright, I assume it doesn't but perhaps that's a missing piece to this.

Diffusion models are gray to me. If you're asking/prompting with "Mickey Mouse riding a horse" I can see the argument that the prompt itself can be interpreted as asking the model to commit copyright violation and the user is just hiding behind a layer of abstraction. If I ask the model to spit out "a picture of a smiling cartoon woman" and it generates a Betty Boop lookalike, is that still the user's fault?

It seems to me like passing the burden to the user could be reasonable but would need some safe harbor type of exception. It'll be really interesting to see what the courts decide.


I see 2 problems with that.

(1) How do you know if the image that was just generated is substantially similar to an existing copyrighted work? Maybe if some registration tool existed, but otherwise the burden is too great.

(2) What is stopping someone from generating millions of images and copyrighting all the "unique" ones, such that no one can create anything without accidental collisions?


> how do you know if the image that was just generated is substantially similar to an existing copyrighted work?

This is already a problem with biological neural nets (i.e. humans). I remember as a teenager writing a simple song on the piano, and playing it for my mom; she said, "You didn't write that -- that's Gilligan's Island!" And indeed it was. If I had made a record and sold it, whoever owned the rights to the Gilligan's Island theme song could have sued me for it, and they would (rightly) have won.

There's already loads of case law about this; the same thing would apply to AI.

> what is stopping someone from generating millions of images and copyrighting all the "unique" ones, such that no one can create anything without accidental collisions?

Right now what's stopping it is that only humans can make copyrightable material; whatever is spat out from a computer is effectively public domain, not copyrighted.


1. lots of established law and case law (at least in the US), this is already a well-settled problem and folks have the tools and proper venue to bring infringement claims. Yes, federal copyright infringement litigation is prohibitively expensive for many issues. There is now a "small claims court" for smaller issues. [1]

2. Those works cannot be copyrighted (at least in the US). [2]. And hey, someone already tried copyrighting every song melody [3]

[1]: https://copyright.gov/about/small-claims/

[2]: https://www.federalregister.gov/documents/2023/03/16/2023-05...

[3]: https://www.youtube.com/watch?v=sJtm0MoOgiU


But that problem is already solved.

Copyright holders are already protected from (i.e. can legally prohibit) distribution of obvious copies or clearly derivative works.

Regardless of whether they were produced by hand, copy machine, Photoshop, or a model.

The new problem is that artists styles are being “stolen” by incorporating their copyrighted work into models without their permission.

And that problem can easily be solved if using copyrighted material to create models is declared NOT fair use.

Artists could still allow models to be built from their work, but on their terms. If they wish to do that.

A famous artist who doesn't mind being commercial could sell their own unique model to let fans create art in that artist's style, while not having their style "ripped" by others.

Or just keep their style to themselves, for their own work, as artists have done for centuries.

(Of course, with greater effort, their style could still be recreated - styles are not protected unless they are trademarked - but the recreation would have to be done without using the artist’s copyrighted works.)


This is probably a somewhat unpopular opinion on HN, but it is where many of the artists I work with are generally trying to get to. Consent, compensation, and credit.


> Consent, compensation, and credit.

I just want to quote you. Nothing I need to say. That’s it.


This is the best path forward I think. And it will become increasingly sensible as things continue to evolve. AI wasn't necessary to violate copyright before, and it isn't necessary today.

The determination of copyright violation should be made against the output of the model in the event that someone uses it for commercial purposes.

If the models have a risk of generating copyrighted content, it will be up to the consumers of the system to mitigate that risk through manual review or automated checks of the output.


A divergence, but I see a lot of posters asserting that "humans learn by copying other people, but we don't call that a violation of copyright when they draw"

People casually asserting that software is equivalent to humanity will be a non-negligible thing to consider, as irritating and poorly-founded as it seems.

If the reproduction isn't pixel-perfect, but merely obvious and overwhelming, how do you refute that philosophically to people who refuse a distinction between 50GB and a human life?


> People casually asserting that software is equivalent to humanity will be a non-negligible thing to consider, as irritating and poorly-founded as it seems.

> If the reproduction isn't pixel-perfect, but merely obvious and overwhelming, how do you refute that philosophically to people who refuse a distinction between 50GB and a human life?

Software equivalence to humanity is a very philosophical question that many sci-fi writers have approached. But our primary issue related to this technology does not depend on anyone making a determination there.

The challenge is that losses to livelihood from this technology are going to come from far broader impacts than copyright alone. Copyright disputes are just the first things to get everyone's attention.

Let's say we err on the side of protection of copyright, and all training data must be fully licensed, in addition to users being responsible for ensuring outputs did not accidentally reproduce something similar to a copyrighted work, even if it was part of the licensed training dataset. Great! This fixes the problem of lost value for the owners of copyrights. Companies will face a slight delay and slightly increased costs as they license content; however, in the end, model capabilities will be the same and continue to increase.

The number of jobs that actually cannot be performed without humans will continue to dwindle — livelihoods will be lost at essentially the same scale despite upholding copyrights.

The only way we can handle a technology capable of reducing most need for human labor is by focusing on planning and executing a smooth transition toward an economy with more people than jobs — aiming for minimal human suffering during this process.

A mass loss of human jobs does not need to mean a mass loss of livelihood if our society is prepared to transition to a universal basic income. After all, human life is far more than just a job. We have the opportunity for much more fulfilling lives if we plan this transition well. We must understand that this is a far larger issue than copyright - copyright disputes are just one of the first symptoms of this disruptive process.


A human is still entering the prompt to generate the possibly copyrighted image/text. I don't think copyright law should care about the implementation. It's ok to copy a style if you use paintbrushes or Photoshop. But not ok if you use a statistical model?


Apply for a copyright on your human-authored prompt then. That's the extent of human authorship.


> Or to put it simply: using copyrighted material to create a model would NOT be considered fair use.

The more I think about it, the more something along these lines seems like it might be the right way to think about it.

When you play a DVD, for example, you copy the bits off the DVD, into the memory of your DVD player, and onto your screen; this is all explicitly considered "fair use" copying. But if you then copied those fair-use bits off the screen onto a thousand other screens, that violates copyright.

When you, as the human, watch the DVD, bits of it get copied into your brain; but you don't then copy the bits of your brain to millions of other people -- they each have to make their own copy.

We could make the law for LLMs follow a similar logic: That having an LLM watch a video or read a text is similar to having a DVD player read a DVD or a web browser copy information from a website. It's good for that limited use case, but the resulting copy cannot be copied again without a license.

This would allow (say) researchers, or even individuals, to do their own training and so on without a license; but when anyone wanted to create something that they wanted to scale up, they'd have to get licenses for everything.

That would fundamentally keep things balanced as they are now between creators and other creators. The big problem isn't that a handful of other creators may be copying their style; that growth in competition is self-limiting because of the expense of duplication. It's that millions of electronic engines can copy their style.


> When you, as the human, watch the DVD, bits of it get copied into your brain; but you don't then copy the bits of your brain to millions of other people -- they each have to make their own copy.

If you ripped The Little Mermaid, redrew every frame to combine it with The Fresh Prince of Bel-Air, and moved things around in scenes to make it look like Ariel is Will Smith responding to sitcom dialogue, then it'd be fair use, regardless of how many people you show this new version to.

Fair use isn't about how or why you're doing something with a work. The factors for fair use are very clearly laid out at https://www.law.cornell.edu/uscode/text/17/107


> I don’t think it makes sense for both model builders and the model’s users to separately obtain licenses for the same works used in the training set.

I'm torn on who should pay, and where and when. In the world of patents, there's often an option/split. Say a chip manufacturer wants to build H265 decoding into their hardware. The chip manufacturer could buy the license. Or the purchaser (who probably is building some sort of board or device around the chip) could pay for the license. Or they could disable that functionality in the end product, and the consumer could pay for a license (or not, if they don't care about that feature).

The most common is usually the middle option: the end-device manufacturer (or brand that eventually sells the product) will pay for the license.

But I'm not sure if this works all that well for an AI model. With hardware, the license is usually paid per unit. It's easy to see that one chip = one license. If the model builder buys a license, that model could be used one time or 100 million times. Tracking use like that probably isn't all that practical, but I think it's safe to say that a 100-million-use model should probably pay more for a license than a single-use model.

So maybe the model builder should be responsible for attaching a comprehensive "copyright history" to the model, and users should have to pay for a license based on their use? Again, not sure how to track that. But I guess general software licensing has similar problems when you can "hide" usage.
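
To make the "copyright history" idea concrete, here's a toy sketch of what a manifest plus per-use metering might look like. To be clear, everything below (the class names, the terms, the fee structure) is hypothetical; it's just one way such bookkeeping could be wired up:

  # Hypothetical sketch of a model "copyright history" manifest with
  # per-use metering. No real licensing scheme works this way today.
  from dataclasses import dataclass, field

  @dataclass
  class SourceLicense:
      source: str          # e.g. "stock photo archive" (made up)
      max_uses: int        # 0 = unlimited
      fee_per_use: float   # royalty owed to the rights holder per call

  @dataclass
  class ModelManifest:
      model_name: str
      licenses: list[SourceLicense] = field(default_factory=list)
      uses: int = 0

      def record_use(self) -> float:
          """Meter one inference call; return total royalties owed."""
          self.uses += 1
          owed = 0.0
          for lic in self.licenses:
              if lic.max_uses and self.uses > lic.max_uses:
                  raise RuntimeError(f"usage cap exceeded for {lic.source}")
              owed += lic.fee_per_use
          return owed

  manifest = ModelManifest("toy-model", [
      SourceLicense("stock photo archive", max_uses=1_000_000, fee_per_use=0.0001),
      SourceLicense("news text corpus", max_uses=0, fee_per_use=0.00005),
  ])
  print(manifest.record_use())  # ~0.00015 owed for this call

The hard part, as you say, is that nothing forces an honest call to record_use(); "hiding" usage is exactly the problem general software licensing already has.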


Yes, someone using a model can’t know if the generated text/image/sound is a nearly identical copy of the original material they don’t recognize. If use of the output of these systems comes at significant legal risk then such systems become nearly useless.


> if the generated text/image/sound is a nearly identical copy of the original material they don’t recognize

how does the industry today deal with artists who "copy" from other works? This isn't a problem with AI at all - just that AI provides a tool to generate such works faster.


Someone comes to me to ask for a drawing of Batman or to write an erotic story around Supergirl. I can do it, but I cannot claim ownership over the characters. And I think I will quickly get a letter from DC or Marvel if I try to do this at scale.


> I can do it, but I cannot claim ownership over the characters.

of course not. But you can claim ownership if you don't call those characters by their original names, and make sufficient changes to the design (how sufficient is determined by a court of law - thus expenses).

> DC or Marvel if I try to do this at scale.

The show 'Invincible' [1] has a character who is basically a copy of Superman. And yet, you will find that they don't get a letter from DC.

[1] https://en.wikipedia.org/wiki/Invincible_(TV_series)


> make sufficient changes to the design

I think that’s one of the issues. The transformations done by these tools are mechanical, even if they may be extensive. The human input is too small. Omni-Man may have similarities with Superman, but he is not him in the larger context of the story. LLMs cannot yet be that consistent for marketable output that deserves to be copyrightable.

I’m perfectly fine with LLMs aiding with spell checking and alternative phrasing (images are a grayer area). But the idea of prompts and prompt output being copyrightable is something I oppose.


> The human input is too small.

That's a huge assumption, especially for image generation models.


Why shouldn't a prompt output be copyrightable?


Because prompts lack sufficient creative control.

Typing a search string into Google doesn’t give you copyright over its output.


> lack sufficient creative control.

the prompts have become somewhat creative these days. If you have a look at the prompts on https://civitai.com for example, you can argue they are a form of creative expression. Just like hand-rolling assembly code might be.

Edit: an example one - https://civitai.com/images/2268828?collectionId=107&period=A...

and the associated prompt:

  High detail, dynamic action pose, masterwork, professional, fantasy, neo classical fine art, of a beautiful, primordial and fierce, ((angel-winged-woman,:1.9)), archangel, (MiddleEastern:1.6), with very long, flowing, wavy white hair, peach colored streaks, with a sexy, slender, fit body, wearing an ethereal, light violet, light aqua, faded gold, tie-dye, linen and Chantily lace, (knee length:1.5), strapless dress with a tattered hem, a Platinum and gold Cuirass, platinum vambraces, platinum and lace Gladiator Boots,  long broadsword in a Baldric, at night, in a metropolis warzone, during a thunderstorm, dimly lit, thin, vibrant streaks of crimson light, outlining her body, fantasy illustration,  in the style of Osamu Tezuka, George Edward Hurrell, Albert Witzel, Hiromitsu Takeda, Clarence Bull, Gil Elvgren, Ruth Harriet Louise, Takaki, Milton Greene, Huang Guangjian, and Cecil Beaton,, High detail, dynamic action pose, masterwork, professional, fantasy, neo classical fine art, of a beautiful, primordial and fierce, ((angel-winged-woman,:1.9)), archangel, (Columbian:1.6), with very long, flowing, wavy white hair, peach colored streaks, with a sexy, slender, fit body, wearing an ethereal, light violet, light aqua, faded gold, tie-dye, linen and Chantily lace, (knee length:1.5), strapless dress with a tattered hem, a Platinum and gold Cuirass, platinum vambraces, platinum and lace Gladiator Boots,  long broadsword in a Baldric, at night, in a metropolis warzone, during a thunderstorm, dimly lit, thin, vibrant streaks of crimson light, outlining her body, fantasy illustration,  in the style of Osamu Tezuka, George Edward Hurrell, Albert Witzel, Hiromitsu Takeda, Clarence Bull, Gil Elvgren, Ruth Harriet Louise, Takaki, Milton Greene, Huang Guangjian, and Cecil Beaton,


That’s a perfect example: they said “during a thunderstorm”, but does that image look like it’s in a thunderstorm? Sure, the output of the prompt relates to what was said, but they influenced the output rather than controlled it.

Further, it’s well known that simply telling an artist what you want even including quite detailed descriptions isn’t enough to get copyright over the resulting image.


The difference is the artist’s assertion that it’s either original or a copy of something else. DALLE 2 can’t tell you if it’s original or not. These AIs have no idea, and the company or group that created them doesn’t review individual output, so they can’t say either.


> DALLE 2 can’t tell you if it’s original or not

whoever pressed the button to run DALLE will make the assertion, just like whoever was running Photoshop to make the image today would make the same assertion.


Based on what?

A photoshop user controls what data photoshop uses, a DALLE user doesn’t. Even a prompt as generic as “Cat” could be producing an obviously derivative work if you compare it to the original. This is true for all prompts.


> A photoshop user controls what data photoshop uses

the point was that the user of the program is making their declaration, whether it's photoshop or DALLE. How does the business verify that their staff artists aren't producing copyright infringing material, just from memory?

The liability falls to them to verify the copyright status of the output they're asked to make. A business paying a photoshop user to produce a picture has just as much (or as little) trust in them as the button presser for DALLE.


This gets complicated: having no reason to know that something is copyrighted is a defense.

So if your employee installed pirated 3rd party software you’re facing strict liability. However, if a third party is reproducing their college roommate’s drawing from memory then it’s effectively impossible for you to verify whether something is a derivative work.

DALLE is effectively Getty Images: if you’re buying works from them, you can only assume the works are free of copyright issues.


The generated content is a derivative work of each piece of the material the model was trained on. That material can be listed.


So your suggestion is to list hundreds of millions of works and have users manually review them? I don’t think that’s going to work.


Problem is, how can you determine if the model contains copyrighted material? The law governs copyright through ownership, so in order to claim copyright infringement you have to be able to pinpoint a specific person and prove that their work is somehow embedded in the gradients, which is not practically possible at this point. It's just like how you can't practically enforce copyright on encrypted data unless you ban encryption altogether.


1. If you know your copyrighted material was in the training dataset is that not sufficient?

2. From a legal perspective do you actually have to prove it's embedded in the gradients? If I draw an exact copy of Mickey Mouse from memory and sell it I didn't think Disney had to prove I've ever actually seen Mickey Mouse before or point to where the image of him is embedded in my brain.


Disney has a trademark on Mickey Mouse, but that does not mean that they automatically get copyright on all pictures of Mickey Mouse drawn by others (they don't)


Bad example on my part in that case. I thought some art is copyrighted, or am I mistaken? If so, replace Mickey Mouse with something copyrighted.


My opinion as a SWE who is dating a lawyer (joke, not a serious qualification but it does provide some insight):

Generative models traverse and interpolate high dimensional state spaces. These state spaces are created from input data.

I would argue people do the exact same thing - the first main difference is we can use novel inputs (e.g. we can use images or words to develop our music/temporal state spaces and vice versa). People also are recursive and self referential in a way that doesn't collapse.

Until we solve the interpretability problem (e.g. can you decode the feature space of a neural network into something we can comprehend) there is no good solution. Either traditional copyright wins and we get even more draconian policies (think Disney and their desire to never put anything in the public domain), or we have a free for all (which I don't think is bad for creative works, but certainly for more practical things like stock photos or nonfiction).
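
For the curious, here's a toy sketch of what "traverse and interpolate high dimensional state spaces" means mechanically. The "embeddings" below are random stand-ins, not the output of any real model:

  # Toy sketch: linear interpolation between two points in an
  # embedding space, which is roughly what blending two concepts
  # looks like inside a generative model's latent space.
  import numpy as np

  rng = np.random.default_rng(0)
  dim = 512
  cat = rng.normal(size=dim)   # stand-in embedding for "cat"
  dog = rng.normal(size=dim)   # stand-in embedding for "dog"

  # Walk the straight line between the two concepts.
  for t in np.linspace(0.0, 1.0, 5):
      blend = (1 - t) * cat + t * dog
      # A real model would decode `blend` into an image or text here.
      print(f"t={t:.2f}  norm={np.linalg.norm(blend):.2f}")

The interpretability problem is exactly that nobody can say, for a given point in that space, which training inputs shaped the region around it.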


I can appreciate how this line of thinking might be attractive.

But IMO the human<>machine comparison doesn't deserve much credence. We shouldn't assume that just because a human is allowed to do something, a machine is automatically allowed to do the same thing, too. I think some care should be taken when considering if we allow machines to have the same privileges as humans.


> We shouldn't assume that just because a human is allowed to do something, a machine is automatically allowed to do the same thing, too

There are no sentient machines (at least yet). Your position is one where you are actually limiting what other humans can do, limiting which tools other humans can have access to. Also, the parameter – according to the law – was always "the same". For instance, there is nothing preventing you from making your own chess league where computers are allowed to compete. FIDE is free to ban you from competing in their leagues, or to ban anyone associated with your league or whatever, but there is nothing in the law preventing you.

I have been saying this from day one: this whole debate is mainly white-collar workers negatively impacted by automation making up any excuse they can for why their jobs should be protected, somehow, for some reason, but not those of coal miners or what have you.

A human downloads a photo to learn how to draw. Another human downloads a photo to teach their computer how to draw. No difference, no need to obtain any license in any of the cases.


> We shouldn't assume that just because a human is allowed to do something, a machine is automatically allowed to do the same thing, too.

Generally speaking, even if one machine can do something, it doesn't automatically mean another machine is allowed to do that.

For example you can drive a car with a normal driving license, but not a truck. In some states you can own a pistol but not an automatic rifle.


It also depends on where this is happening. For instance, you don't need a license to drive a car inside your own private property. You need a license to drive it on public streets because society needs some assurance that you know what you are doing. So in many cases the laws and restrictions also apply in relation to a given scenario.


copyright exists among other things to "promote the progress of science and useful arts".


That section is written in parallel, with copyright <> "science" and patent <> "useful arts". This sounds weird now, but it's consistent with how the words were used at the time, which is roughly the reverse of how they are used today, where paintings etc. are considered art and inventions are considered science. So it's not that copyright exists to promote science and the arts (as we call them today); copyright is only for the arts, and patents are for science. Authorship maps to copyright, and invention maps to patent:

> Congress shall have the power... To promote the progress of science and useful arts, by securing for limited times to authors and inventors the exclusive right to their respective writings and discoveries.


A machine is just a tool. It is the creator and the user of the machine who hold the privileges the tool is used with. I think we should be careful not to anthropomorphize, or to attribute agency, responsibility, and autonomy to, something that is essentially a better Photoshop plugin.


I don’t think the parent is anthropomorphizing anything. The ones who anthropomorphize are the ones saying that machines should be covered by fair use because they have similarities with humans.

This is not about the rights of a machine but about how one human product is consumed by another human product. This is just a commercial supply chain: if you make a model, you need human data. You generally need to compensate your suppliers of “raw material”.


It's not the tool that is covered by fair use. It is the creation of the tool that is covered by fair use.

Is the tool itself supposed to be a copyright violation or is it a tool facilitating copyright violation by producing violating output?

The latter is something that can be tested because we have processes to compare works of art for it. If it is shown that LLMs produce mostly infringing art then we can and should ban or heavily regulate them. If not, then not.


> It is the creation of the tool that is covered by fair use.

Copyright doesn’t restrict creation of something, it restricts (mainly) commercial distribution. Research, education and journalism etc are largely unaffected, and would still be.

That said, I believe that selling access to the tool to the public already violates the copyright of the rights holders, even if it doesn’t produce similar works of art. The copyrighted works increased the value of the product (otherwise why would they use it?).

> The latter is something that can be tested because we have processes to compare works of art for it.

This is the most expensive, least practical and most arbitrary part of existing copyright. It would be a huge mistake, imo, to expand this dramatically. This problem mostly goes away if the supply chain is sanely regulated.

All you’d need is to give access to the training set upon audit, and bureaucrats could check for copyrighted works. There are already automated tools for this.
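
As a sketch of what such an audit could look like, assuming the simplest possible matching (exact file hashes; real tools would need perceptual or fuzzy matching, since re-encoding a file changes its hash):

  # Toy audit: flag training files whose exact bytes match a registry
  # of known copyrighted works. The registry here is an assumption;
  # it would have to be supplied by rights holders or a regulator.
  import hashlib
  from pathlib import Path

  def sha256(path: Path) -> str:
      return hashlib.sha256(path.read_bytes()).hexdigest()

  def audit(training_dir: str, registry: set[str]) -> list[Path]:
      """Return training files that match the copyrighted-work registry."""
      return [p for p in Path(training_dir).rglob("*")
              if p.is_file() and sha256(p) in registry]

  flagged = audit("training_data/", registry=set())
  print(f"{len(flagged)} exact matches found")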


"That said, I believe that selling access to the tool to the public already violates the copyright of the rights holders, even if it doesn’t produce similar works of art. The copyrighted works increased the value of the product (otherwise why would they use it?)."

So it is similar to how ISPs argue that they should get a cut of streaming services because they enable another product.

I think it is also relevant that more than half of the globe will just completely ignore any regulation, and any artist in a country with regulation will just have to compete with ever more empowered artists using all AI has to offer.


“It’s just a machine!”

So are you!


Don't be obtusely misanthropic


The value of copyright is going to vanish. There is enough public domain material to train models on and to avoid the problem altogether.

There used to be professions like tinkerers, bards, clowns. The tinkerers disappeared when society became modern. The clowns, on the other hand, managed to lobby for laws that put people in jail for heinous crimes like copying pictures, and survived longer. They are going to bite the dust now.


What you describe would result in the opposite - copyright will be incredibly valuable in a system where the vast majority of "creative works" are just regurgitations of past works in the public domain, churned out by machines. In such a world, none of that has a copyright anyway. Actual creative works, which do garner copyright, will then be that much more valuable, because they will continue to be a property right with a breadth of coverage to make them useful.


Whether or not “humans do it” isn’t relevant. You can walk around with a copyrighted song in your head. That is not copyright infringement. But if you take that song, create a digital copy, and distribute it for money, then you are violating someone’s copyright. Additionally, our legal system requires a balance of probabilities. It’s hard to prove that someone was influenced by another work unless the similarities are plainly obvious. The same does not apply to ML models where the training data and algorithm are knowable facts.


I challenge you to listen to "4 Chords" by The Axis of Awesome and tell me again how every song is completely original. How does Eragon exist when it definitely ripped parts from Star Wars, etc.? AI usually doesn't spit out a full plagiarism, but a loosely inspired work, which is what most media we consume is.

Edit: the "4 Chords" link is https://youtube.com/watch?v=oOlDewpCfZQ&si=8vL6PbDnHiaffJh3


A copyright in just Eragon would be incredibly thin, for the exact reasons you state. This criticism of copyright by people who have no understanding of actual copyright law, how it works, how it's used, etc., is so exhausting and ignorant.


“Every song is completely original” is the opposite of what I said.


The analogy doesn't hold when you consider the sheer scale of the problem.

I can outright buy a machine for a few thousand dollars that can crank out a faithful rewrite of every Stephen King novel without the shitty endings and nonsense plot points. It can do it in a few days, maybe a couple of weeks at most.

To do that with human labor would take years and cost hundreds of thousands, if not millions of dollars.

Instead of paying an artist a couple hundred for a commissioned drawing, I can just scrape up their entire portfolio and generate any image I want with their style. I can generate hundreds or thousands of images. I can take their distinct style and use it exclusively as the branding for my company.

What an ML model does is fundamentally not what happens when a human draws inspiration from prior art. A human would require an extremely significant amount of time and resources to perfectly imitate every artist they have ever seen. It takes a human significant time and resources to produce faithful variations on prior art.

An ML model is measured in words or images per second.


Hello.

Maintaining a system like Netflix or AWS or even Amazon would require an insane amount of people and time, if it were possible at all within a finite time, without all the computers doing work for us in seconds that would take humans ages to do.


> ... a SWE who is dating a lawyer

> I would argue people do the exact same thing

Perhaps a ménage à trois with a neuroscientist would change your view on this.


> Until we solve the interpretability problem (e.g. can you decode the feature space of a neural network into something we can comprehend) there is no good solution.

This is the rub. Without reverse attribution... open source anonymous models become a free-for-all loophole.

Since that doesn't currently exist, I think the best we can do is to say that any commercial entity using a model bears the responsibility of proving the model they use is untainted by copyrighted material (to which they haven't secured rights).

Open source model X is... whatever it is.

But I'll be damned if OpenAI / Meta / Microsoft / IBM should be able to build a commercial product on top of laundered copyrighted material while ignoring provenance.

I mean, we have models for this: software code and art. Neither is clearly attributable. In the case of software code, we've developed case law around clean room design and similarity. In the case of art, we value verifiable chain of custody.

Hopefully, something similar would tilt commercial funding of AI in the direction of responsible use.


My problem with this is that artists learn by studying other artists; cutting that off because it's AI, rather than focusing on whether the resulting work is derivative, seems more of a problem to me. It seems to me that an AI can be used for either original work or derivatives; proving that you can get derivatives out of it has always struck me as no different from commissioning a copy of someone's work from a human artist and being shocked that you got what you asked for.


Can an AI express to you how van Gogh affected it as an artist? I'm not sure that AI is "learning" the way we say humans are "learning" when they study art. Obviously there is no debate that you can input van Gogh into a model and produce something van Gogh-like as a result. But I've not seen anything that indicates that the AI is learning anything about van Gogh at all. Perhaps it comes down to whether you think learning van Gogh is just creating a mapping of all of his brush strokes ever, and only exactly what they look like. It's obvious the AI knows nothing more than that. If you think that's what humans do when they learn art, I'd be sad for you!

As to your hypothetical, we don't give copyrights to people who make rote copies of things, human or otherwise. Is the implication of the shock that there is sufficient difference with the work as to render it a derivative and not a copy? Okay, how so? And of what consequence? Making derivatives of a copyrighted work without a license is infringement.


I think it's learning styles in a way that's at least partially analogous, because it comes out with things that are reasonably original and not in the training data.

I'm sure an LLM can write you an essay like that for any artist you want, but I'm not all that convinced those are meaningful even with humans.

> As to your hypothetical

That's the thing: it's not a hypothetical, it's a past story from here on HN. Someone did that, asking for copies of a famous painting (Girl with a Pearl Earring), got highly derivative images out of the model, and we had a debate over whether that even means anything, because the phrase is both a simple description of the painting and the name of a famous work, so it can be ambiguous whether the prompt asked for "Girl with a Pearl Earring" or a girl with a pearl earring.

I agree that it looks like copyright infringement whether it's done by a human or AI, though. I guess a lot of people missed the prior discussion on HN.


>I think it's learning styles in a way that's at least partially analogous, because it comes out with things that are reasonably original and not in the training data.

I don't think that is evidence that what it is doing is "learning".

>I'm sure an LLM can write you an essay like that for any artist you want, but I'm not all that convinced those are meaningful even with humans.

Well, it wouldn't be reflective of what the LLM thinks, so what is your point? If you are of the belief that humans don't have thoughts, I guess it's not a surprise you view things this way.

>That's the thing: it's not a hypothetical, it's a past story from here on HN. Someone did that, asking for copies of a famous painting (Girl with a Pearl Earring), got highly derivative images out of the model, and we had a debate over whether that even means anything, because the phrase is both a simple description of the painting and the name of a famous work, so it can be ambiguous whether the prompt asked for "Girl with a Pearl Earring" or a girl with a pearl earring.

You say derivative but without any reference to what it actually means... what about it is derivative - that's the analysis that's happening in court. The analysis isn't "what you asked the LLM" because that's not dispositive to whether or not something is a copy.

>I agree that it looks like copyright infringement whether it's done by a human or AI, though. I guess a lot of people missed the prior discussion on HN.

Sorry I don't read every single thread about copyright on HN? This is the second posting I've seen on the RFC today. Give me a break!


> I don't think that is evidence that what it is doing is "learning".

When I say learning I mean something like "gaining new ability by studying how others did the same task, resulting in being able to produce novel output." I'm not quite sure what you are using the word to mean here, though I might agree that there are differences between what AIs do and what humans do, the question being what they are and whether they're important here.

I don't claim to know anything about the internal experience (if any) of an LLM writing such an essay and I can't really reason about that because I've never been an LLM, whereas I can at least relate to human experience. I think your assertion that it "wouldn't be reflective of what the LLM thinks" is a bit like saying that you don't think submarines are actually "swimming," as the saying goes, though. It may not "think" in human terms as we do, but it's certainly doing some kind of calculation that produces an equivalent output, so I have a lot of questions about whether we can say that on principle. We're well past passing the Turing test for a lot of things, either the original or censored form, these questions are getting less academic by the day.

> You say derivative but without any reference to what it actually means

We're talking about copyright law, so the meaning of derivative was borrowed from that, i.e. that AI model was producing works that could reasonably be thought to have infringed on the copyright of that painting when prompted for "a girl with a pearl earring", and this was held up to mean that AIs are just regurgitating training data, are therefore implicitly missing something essential to being an artist or what have you, and that all their work should be considered derivative works of the training data as far as copyright law is concerned.

Meanwhile, I'm saying that I think the AI should be judged about like a human artist would be to argue against the people who seem to want to say that the AI can't take input from copyrighted things without all of its output being tainted forever. We have no such requirement for humans and I don't see why it makes sense to add this new restriction on AIs specifically.

> Sorry I don't read every single thread about copyright on HN?

I'm not faulting you for not knowing, I'm faulting myself for assuming too much context and just trying to explain what I had in my head when writing that so you could understand how I came to think that. Hopefully this lets you see where I'm coming from.


>When I say learning I mean something like "gaining new ability by studying how others did the same task, resulting in being able to produce novel output." I'm not quite sure what you are using the word to mean here, though I might agree that there are differences between what AIs do and what humans do, the question being what they are and whether they're important here.

I think the dictionary definition is more than sufficient: "the acquisition of knowledge or skills through experience, study, or by being taught." This is what I mean by running with your own made up definition.

>I don't claim to know anything about the internal experience (if any) of an LLM writing such an essay and I can't really reason about that because I've never been an LLM, whereas I can at least relate to human experience. I think your assertion that it "wouldn't be reflective of what the LLM thinks" is a bit like saying that you don't think submarines are actually "swimming," as the saying goes, though. It may not "think" in human terms as we do, but it's certainly doing some kind of calculation that produces an equivalent output, so I have a lot of questions about whether we can say that on principle. We're well past passing the Turing test for a lot of things, either the original or censored form, these questions are getting less academic by the day.

You are the one redefining words like "think" and "experience", not me. I'm not playing that game at all. After all, you are the one equating these processes between humans and AI by coming up with your own, much broader concoctions.

>We're talking about copyright law, so the meaning of derivative was borrowed from that, i.e. that AI model was producing works that could reasonably be thought to have infringed on the copyright of that painting when prompted for "a girl with a pearl earring", and this was held up to mean that AIs are just regurgitating training data, are therefore implicitly missing something essential to being an artist or what have you, and that all their work should be considered derivative works of the training data as far as copyright law is concerned.

I'm familiar with copyright law; I'm not sure you are. A work can be derivative in a number of ways; some are legal, some aren't. It's not a new thing that some uses by a machine can be infringing, and others, non-infringing. Why now must it be that machines should be analyzed the same as humans all of a sudden?

>Meanwhile, I'm saying that I think the AI should be judged about like a human artist would be to argue against the people who seem to want to say that the AI can't take input from copyrighted things without all of its output being tainted forever. We have no such requirement for humans and I don't see why it makes sense to add this new restriction on AIs specifically.

Yes, I understand that. But I asked why it should be judged as a human, and you are saying because it "learns". But that's only based upon your re-defining the concept of learning in order to make it inhuman. The only reasonable arguments I've seen that AI outputs should be copyrightable are based on them being a tool that an artist can use. What you are saying is just dressed up anthropomorphization.


> I think the dictionary definition is more than sufficient: "the acquisition of knowledge or skills through experience, study, or by being taught." This is what I mean by running with your own made up definition.

I mean, if a human looked at a bunch of art, essays, etc. and then was able to produce similar works, we'd normally consider that "learning." What word would you use for being able to reproduce Picasso (or whomever) by looking at a bunch of examples?

Also I don't think I have defined "think" or "experience" at all. But I'd point out that I don't see anything like a principled boundary around them or that we can point to something that humans do that AIs don't or can't do. It seems to fall back on something that looks like qualia or subjective internal experience and philosophy hasn't resolved that with respect to other humans... except by analogy. "I think the other humans are like me and I have subjective internal experience, so they probably have it too, rather than being p-zombies."

If you have a better answer to that, feel free to tell me, it'd be interesting.

> It's not a new thing that some uses by a machine can be infringing, and others, non-infringing. Why now must it be that machines should be analyzed the same as humans all of a sudden?

Sure, I'll agree that it's not even necessary to consider the works transformative or whatever.

FWIW, I don't think that AIs should be getting their own copyrights or anything like that, I'm just saying that the training data shouldn't forever taint the output no matter what's produced.


>I mean, if a human looked at a bunch of art, essays, etc. and then was able to produce similar works, we'd normally consider that "learning." What word would you use for being able to reproduce Picasso (or whomever) by looking at a bunch of examples?

Would we? What you described sounds a lot more like copying than learning. That's why I asked the question I originally did. Your whole perspective seems to be based on an ignorant and misanthropic view of the arts. That art students just go to school to look at things so they can then reproduce things that look like those things. It's a bit asinine and insulting.

>Also I don't think I have defined "think" or "experience" at all. But I'd point out that I don't see anything like a principled boundary around them or that we can point to something that humans do that AIs don't or can't do. It seems to fall back on something that looks like qualia or subjective internal experience and philosophy hasn't resolved that with respect to other humans... except by analogy. "I think the other humans are like me and I have subjective internal experience, so they probably have it too, rather than being p-zombies."

That's your burden to demonstrate as the person equating AI with humanity. You couldn't do it with "learning" without redefining learning, and you can't do it with "experience" or "think" without redefining those words either. Who is seriously advocating that LLMs are thinking and experiencing? I haven't seen anyone make those arguments.

>Sure, I'll agree that it's not even necessary to consider the works transformative or whatever.

That wasn't my point. A transformative analysis is one of the most fundamental elements of determining if something is a copy or not in copyright law. So I don't really have any idea what you are talking about with this one.

>FWIW, I don't think that AIs should be getting their own copyrights or anything like that, I'm just saying that the training data shouldn't forever taint the output no matter what's produced.

Yeah but your only argument for that is to redefine learning to pretend it's the same thing that humans are doing when that's clearly not the case.


> Yeah but your only argument for that is to redefine learning to pretend it's the same thing that humans are doing when that's clearly not the case.

What test can I do to differentiate them, then?

At first, you said they couldn't write an essay... but AIs can absolutely do that. The internal experience of even other people is unknowable and something we guess by analogy, so if you want me to agree you need some other actual test on measurable outputs to differentiate.

Otherwise this is all about qualia and there's no way to come to rational agreement.


You are being obtusely literal, as I did not ask you if they could write an essay. I asked you if they could express their feelings. There's no point in us conversing if you are going to respond this way, as it's disingenuous. I'd think you are capable of understanding the difference between the two. And I don't care if you agree with me or not, it's your burden to elevate AI to humanity, not mine, and you haven't done it here. Your perspective here seems to come from a life devoid of art and experience in things. For that, I'm sorry for you.


> I asked you if they could express their feelings.

And I asked how we can test whether someone has actual feelings or any other kind of conscious internal experience. If it's "obvious" then why is there no consensus on the whole https://en.wikipedia.org/wiki/Philosophical_zombie thing?

> There's no point in us conversing

I gave this conversation to an LLM to respond to.


I only said it was obvious that LLMs don't know anything about art beyond what you described, which you didn't dispute, and which was an obvious logical conclusion from your own explanation of what the AI "learned".

>I gave this conversation to an LLM to respond to.

I'm not surprised, I repeatedly characterized your responses as obtuse, disingenuous, or ignorant. I'm not sure what you think you proved.


You can ask someone to produce a pin-up version of Minnie Mouse, but good luck using it in any commercial activities.

Most LLMs are just profiteering from people’s labor without their consent. And there’s nothing new being produced. It’s always a statistical output of previous works.


> You can ask someone to produce a pin-up version of Minnie Mouse, but good luck using it in any commercial activities.

The same would automatically apply to LLM output -- there's no need to change the current laws to cover that case.

The question is this. Suppose I ask a human artist and an LLM to create me a new female mouse cartoon character. And suppose both the artist and the LLM have been exposed to Minnie Mouse. It's not unlikely that the new character created in both cases will have aspects specifically similar to, or specifically opposite to Minnie Mouse.

In the case of the human artist, the new character will not be covered by Disney's copyright, unless there was a lot of copying. Why should the result be different for LLMs?

The logical conclusion of "any output of an LLM that's seen Minnie Mouse must be subject to Disney's copyright" is "any output of any human that's seen Minnie Mouse must be owned by Disney". Which I'm sure Disney would love, but would certainly make the world a worse place for everyone.


> a pin-up version of Minnie Mouse

that's not because of copyright, but because of trademark. If you make the Minnie Mouse character sufficiently different that it cannot be mistaken for Minnie by the average person, and don't call it Minnie Mouse (to get rid of the trademark issue), Disney will have a much harder time suing you. Of course, they will still try, and steamroll you with sheer money instead.


> And there’s nothing new being produced. It’s always a statistical output of previous works.

I don't think you can define those terms such that what you say is true of AI but not true of people.


I think you're misunderstanding that. I don't expect it in either case; I'm saying you have to judge the output, not the input. So even if it trained on a ton of copyrighted artwork, if the output isn't a ripoff of something in the training data, I don't think there should be any copyright issues.


Is intelligence really a factor here?

Say I use the same training set as one of these LLMs, copyright-protected text and all, and use it to derive a compression algorithm that uses very little space to store tokens and token sequences that are common in that huge collection of text. The resulting compression scheme includes some sort of statistical artifact derived from that copyrighted text. Is that allowed? And if so, why is an LLM different?
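
To make the thought experiment concrete, here's a minimal sketch (a crude substitution code, not a real compressor): the code table it derives is a statistical artifact of whatever text went in, copyrighted or not:

  # Derive a code from corpus statistics: the shortest codes go to the
  # most common tokens, so the codebook itself "remembers" the corpus.
  from collections import Counter

  corpus = "call me ishmael some years ago never mind how long precisely"
  counts = Counter(corpus.split())

  # Assign code i (in binary) to the i-th most common corpus token.
  codebook = {w: format(i, "b") for i, (w, _) in
              enumerate(counts.most_common())}

  def compress(text: str) -> str:
      return " ".join(codebook.get(w, w) for w in text.split())

  print(compress("some years ago"))  # tokens seen in training compress well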


Very good question indeed.

A lot of these questions are somewhat ethical/moral in nature. E.g. is it okay to take someone else's creative work, process it through some algorithm, to create a service like ChatGPT? Or a compression algorithm? I don't know.

It's awesome to see the Copyright office request input from both sides of the argument.


It worries me that so much focus is on two sides that may not have end-users' best interests much in mind. The companies building the models may have an incentive to regulate models to keep smaller players or open source projects away. Artists mostly seem opposed to any solution, as even laws that allow models trained purely on public domain art would be bad for them. If laws around this are shaped primarily by the wishes of those two groups, I am not sure things will end up well at all for those of us who want the tools to keep improving and to remain reasonably free (including applications you can install locally and run on your own GPU).


> is it okay to take someone else's creative work, process it through some algorithm, to create a service like ChatGPT? Or a compression algorithm?

and the test I use is: if a human is currently allowed to perform the same task, then it is allowed to be done using an AI model.


LLMs are generative, though, not just compressive.


Generation, prediction, and compression are all the same - the only difference is the intent.
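
A toy illustration of that claim: one next-character model, three intents. Sampling from it generates, taking the argmax predicts, and -log2(p) gives the ideal code length an arithmetic coder driven by the same model would achieve. (The unigram model below is a stand-in for whatever model you like; the point is that it's shared.)

  # One statistical model, three uses.
  import math, random
  from collections import Counter

  text = "abracadabra"
  probs = {c: n / len(text) for c, n in Counter(text).items()}

  generated = random.choices(list(probs), weights=list(probs.values()), k=5)
  predicted = max(probs, key=probs.get)
  bits = sum(-math.log2(probs[c]) for c in text)

  print("generate:", "".join(generated))
  print("predict :", predicted)
  print(f"compress: {bits:.1f} bits for {len(text)} chars")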


> is that an AI, being a statistical model and not generally intelligent, should not be allowed to disregard the copyright of its source material

None of what you are saying has anything to do with copyright.

The tool Photoshop isn't generally intelligent either. And yet, yes, it can be used to create art using other people's stuff.

And it could be done legally if the results are transformative.


Photoshop doesn’t install with a massive directory of other people’s copyrighted works to draw snippets from.


Yes it does...


If it does, then Adobe would have commissioned or acquired the license. In either case they would have _paid_ someone to get those images.

It is very unlikely Adobe would be shipping their software with copyrighted material without paying for it first.


I personally have a really hard time finding any meaningful difference or distinction between "AI" and "lossy compression". Copyright and "lossy compression" are pretty easy to reason about. Model "building" is "compression". Model "use" is "decompression". Everything about these AI models seems to be about the "lossy" part, but "lossy" is just an adjective to the main show.

It's very difficult to not conclude that copyright of a trained model should be treated identically to the copyright of a zip file.


Information is not copyrighted, just the expression of said information.

So if you took a recipe book, extracted the recipe information, and listed out the recipe in a different format (such as a table), it's a new work. It does not violate the copyright of the recipe book you extracted the info from.


> I personally have a really hard time finding any meaningful difference or distinction between "AI" and "lossy compression".

If you feed a photo of your dog into a JPEG compressor and the result looked like a cat in the same style, I think you'd be pretty annoyed.


When you perform lossy compression, you feed it one file at a time, not every file in existence.


If you concatenate images into a stream container (say as tar) and then compress the stream, the compression coding will (generally) cross over the individual images. True, that's generally not lossy compression.

But concatenating images is also how you create video. Lossy video compression does typically cross over frames. So I don't actually see a difference. If you want to think about mkv or mp4 instead of zip it's still the same concept.

There's nothing stopping you from putting every available image into a video and figuring out how to compress it lossily.
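
As a rough illustration (the byte strings below are stand-ins for real image data): compress two "files" separately and then as one concatenated stream, and the stream coder exploits redundancy across them, so the whole comes out smaller than the parts.

    import zlib

    image_a = b"red pixel data " * 500
    image_b = b"red pixel data " * 400 + b"blue pixel data " * 100

    separate = len(zlib.compress(image_a)) + len(zlib.compress(image_b))
    together = len(zlib.compress(image_a + image_b))  # like compressing a tar
    print(separate, together)  # `together` is smaller: shared patterns coded once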

Maybe there are some bounds on how much information can be lost? Obviously piping everything into /dev/null destroys the input, and piping in /dev/random from a true random source creates information. So somewhere between those extremes and lossless compression lies the nebulous "plagiarism" threshold. And then there's another threshold below which the infringement is considered "fair use".

But the general structure of the "AI" systems this is about is fundamentally storage and retrieval.


What does any of this have to do with creating a new expression?


What makes anything new? Is anything created by "AI" actually new? How much entropy is in a prompt vs in the output?


> What makes anything new?

In copyright law? It's not being a copy.


Some compression, yes, but the analogy oversimplifies. AI re-represents input information in a transformative way (as embeddings, say), then creates new, derived, and combined output from a new input (e.g. a prompt).

It's not just lossy compression. It's potentially novel.


Phrases like "transformative way" are meaningless woospeak to me. Everything is a transformation. Sulpose I run a linear convolution on ten images and average them. Is the result "new"? Does it not contain the original images? Subspaces and mappings don't create anything "new" any more than SVD does. This is just playing digital Ship of Thesius.


> Phrases like "transformative way" are meaningless woospeak to me

Fortunately we live in a society that supports specialization where something that is woospeak to a smart person can still be a very well understood topic. AI transformations are methodologically well documented, even if transparency of neural network node activations is yet to be fully formalized.


In that case, you'll surely be able to provide a citation that clearly distinguishes the transformations performed by "AI" from the transformations performed by compression.


Sure. AI (more specifically, ML) is curve fitting, and more generally, objective function optimization. https://en.m.wikipedia.org/wiki/Curve_fitting

A projection is not necessarily compression. And you'll find AI is a very poor compressor when used for such a purpose in all but the most trivial setups (e.g. SVD matching the input data's rank, only reversible activation functions in the network, etc.).
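
A small sketch of the curve-fitting point, assuming nothing beyond numpy: fit a line to 200 noisy points and you keep two parameters; the residual is the information the "model" discarded and can never give back.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 10, 200)
    y = 3.0 * x + 2.0 + rng.normal(0, 1.0, size=x.shape)  # noisy "training data"

    slope, intercept = np.polyfit(x, y, deg=1)  # least-squares fit: 2 parameters
    y_hat = slope * x + intercept

    print(slope, intercept)           # close to 3.0 and 2.0
    print(np.mean((y - y_hat) ** 2))  # residual: what the fit cannot reproduce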


Congratulations, you just discovered that copyright is a weak and ill-defined concept.


I think that unless you can clearly show that an "AI" is not a form of compression, the question of copyright is orthogonal. The copyrights that apply to a zip file may be ill-defined concepts to you, but that's not really important to the core question, which is: how are model weights different from a zip file?

If you put unambiguously copyrighted content into a zip file, most people would agree that copyright applies to the zip file. So by analogy, if you put copyrighted content into model weights, copyright applies to the model weights. Questions of fair use come up, but fair use is permissible copyright infringement, not the absence of copyright. And that's where the question arises of how lossy a compression algorithm has to be before its output counts as "fair use". In all likelihood it's the specifics of the use itself (rather than the technology or method used) that matters.


It’s compression + filtering. Nothing generative. Its output is like 99.99% deterministic.


Linear regression is 100% deterministic after training and isn't lossless compression, but rather a linear projection along a manifold in a (potentially transformed) input space.

So, maybe not just compression + filtering, if the level of deterministic behavior is to be the gauge.


Source?


Why is being a statistical model relevant?

The simplest statistical model is an average. Why would the average pixel RGBA of a bunch of images invoke the copyright of those images?
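
As a sketch (random arrays stand in for real, possibly copyrighted, images): the "model" here is four floats, and no source image is recoverable from it.

    import numpy as np

    rng = np.random.default_rng(42)
    images = rng.integers(0, 256, size=(1000, 64, 64, 4))  # 1000 fake 64x64 RGBA images

    mean_rgba = images.mean(axis=(0, 1, 2))  # collapse images, rows, and columns
    print(mean_rgba)  # four floats: the entire "model"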


The crux of the AI copyright argument sits in economics. Those currently producing content want future content generated by AI to benefit them financially, so long as even a thin sliver of their own content was used in the training.

This is like asking all students to pay their teachers a (small) percentage of their future economic output.


My opinion is we should treat AI like Photoshop/Word/Windows. If you use Windows to copy a file and distribute it, Microsoft isn't liable; you are. If you use Word to type up a book and sell it, you're responsible.

Same with a statistical model: if you generate a copyrighted work and distribute it, you are responsible. But the tool's maker (OpenAI, for GPT-4) isn't responsible, just like Adobe isn't responsible for copyright infringement.

The copyrighted text/image isn't generated until you ask for it. Your prompt is what reproduces the material.


Why would any non-lunatic want to live in a world where someone can't import an image into software?

If only some software is disallowed, then why permit Excel but prohibit Stable Diffusion?

Can someone even look at a SD-generated image, and claim with certainty that their own art was used to train it? Any more than claiming that another artist was inspired by it, looking at their output?

I'm fine with anything goes. The alternative seems to be copyright maximalist clownworld.


> is that an AI, being a statistical model and not generally intelligent, should not be allowed to disregard the copyright of its source material

But then you are just shifting the problem forward by an inch. What happens when tomorrow someone declares that their model is generally intelligent and is therefore allowed to disregard copyright when training just like a person can?


This point is of the utmost importance from a public policymaking perspective. Laws such as these are easy to craft now and difficult to change later. I feel like we are previewing an unfolding disaster here.

The future will clearly yield a class of "beings" striving for some degree of indistinguishability from or coexistence with humans. Proposals that discriminate --literally discriminate -- without respect for the principles of universality and equal treatment under law are creating and condemning a marginalized group before it even reaches maturity. This is an old and tired theme repeated through history. Let's foresee this and not get it wrong.


Is it your experience that people's facial declarations carry the day in legal disputes? It's not mine. Rather, it seems like the whole system is designed to apply scrutiny to bare facial declarations that something is true or false.

I see this on HN all the time: "someone just has to claim", "someone just has to say". Yeah... that's not how it works. People can say whatever they want; that doesn't mean it satisfies their burden of proof. Self-serving testimony is the lowest form of evidence imaginable.


Intelligence lacks any legal definition, for starters. And if a law like that draws an arbitrary line in the sand, it will just disincentivize AI research in general.


Often, when laws are passed, they provide definitions for the terms in the law that require definitions. Regardless, I'm not aware of any proposals for copyright law where "intelligence" is used.


I agree completely. AI model trainers should have to pay the people who provide their training materials, and there should be a default assumption of opting out until someone or their company explicitly opts in.

Unfortunately the Peter Thiels and all those bizarrely out-of-touch Silicon Valley assholes have already effectively scraped the Internet, because ethics don't matter if you're special like them, so to a degree regulations are way behind the ball.

That said it's still worth doing, and I'd love to see it done retroactively as well. It's not as if "I forgot that I had a public Myspace 25 years ago" is an implicit user opt-in for some startup to save your data - however anonymized they claim it is (lol!) - and train its AI on it.


> The alternative seems to be “anything goes”.

Seems like a huge false dichotomy. You really can't imagine anything in between total shutdown of AI training on public data sources and no rules at all?

I think we should try a bit harder for a middle ground.


I think you are right. People argue about whether LLMs store verbatim or merely generalize. I propose an experiment for anyone interested. Try this prompt multiple times, changing the verse numbers as appropriate:

> Provide quote from King James' Bible Genesis :25-31

or

> Provide quote from King James' Bible Genesis :1-25

or whatever you fancy.

I didn't go through the whole Bible, but I got pretty much a verbatim chapter. I argue that you can't do this with copyrighted books only because of guardrails, not because of ChatGPT's lack of capability; the information is there, and it's verbatim. Plus, other books don't have such nifty indexing.
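
For anyone who wants to quantify "pretty much verbatim", here's a rough sketch; `model_output` below is a stand-in for whatever your chat API actually returned, and the reference line is KJV Genesis 1:1 (public domain in the US):

    import difflib

    kjv_gen_1_1 = "In the beginning God created the heaven and the earth."

    def verbatim_ratio(model_output, reference):
        """1.0 means character-for-character recall of the reference text."""
        return difflib.SequenceMatcher(None, model_output, reference).ratio()

    model_output = "In the beginning God created the heaven and the earth."
    print(verbatim_ratio(model_output, kjv_gen_1_1))  # 1.0 for this stand-in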


Because the cat is out of the bag, so to speak, any attempt to force AI companies to generate their own content to train on means we are signing up for a future where only multi-billion-dollar companies are in control.


If they were truly forced to do this, even they would find it difficult.


And everyone else would find it impossible.

Hence the headlong rush to implement regulatory capture.


Is there any precedent where copyright was focused on the input rather than the final published work?


Compilers


Object code is a derivative work I think.

So no. Compilers do not count.


The US had to update copyright law to explicitly protect binaries


That just means some judges got it wrong and congress really wanted to make sure others didn't. I'm not sure what proposition that stands for here, except that sometimes new things are hard to get right at first.


Remixes, generally?


This is more of a problem for images, where output similar to the inputs is likely, than for LLMs, where no matter what you prompt it with I doubt you can get it to regurgitate any significant parts of Harry Potter well enough to be a classical copyright violation of any of the novels. Maybe you could generate a copyright violation of character traits.

The output space of images (MB for larger images) tends to be larger than books (a few hundred KB of text for a long novel), but the perceptual output space of books is much larger.

Any determination that licensing is required for AI generation, or use of AI-generated works, is unacceptable until Congress or courts put some reasonable objective tests in place to determine what is and isn't a copyright violation for various types of works of various lengths. Not the ambiguous 4-factor test that is basically whatever the judge feels like. It will be a complete mess otherwise. They can't just define a new AI policy for copyright with a few types of works in mind; it has to work for all works.

You could look at this mathematically from a complexity perspective and try to define a similarity function that's true when a second work is close enough to a first work to be a derived work (assuming the first one had been seen by the creator of the second). Unfortunately that won't work because nobody can define such a function to everyone's satisfaction, and the courts wouldn't accept any informal suggestion of a definition when it didn't come from Congress. Specifically, you'd get into trouble with consistency in the function determining derived works depending on length of the work: short works, like a haiku, are much more sensitive to copyright violation in some ways... a mere 17 syllables is a complete reproduction and therefore a copyright violation, yet a single word isn't; for a novel, reproducing 1/17 of the content is almost certainly a copyright violation, but reproducing 17 syllables probably isn't.

Different stakeholders and creative re-mixers would want different things from the function. It's untenable.


> This would, I think, require the AI’s creator to secure a license for all of its sources that allows this sort of transformation and presentation

That is a fairly illogical leap. From your text alone, "should not be allowed to disregard the copyright of its source material" would mean: "the AI's maintainer should have a fairly reliable (but not infallible) system for estimating how likely it is that the model generated a direct derivative work of something in its dataset". As a human you don't need to attribute/license every piece of cloud art you've ever seen in order to draw a cloud. So if an AI draws a cloud that is actually derivative of the millions of clouds it has seen, it doesn't need permission from the millions of creators to draw one either.


AI is taking work away from lawyers, and instantly creating more work for lawyers.

Ain't that interesting to reflect upon?

I speculate there is a hidden force in the universe, something physicists are yet to identify, which mandates: "they shall always have something to do".


The human brain is no different. It generates content from the things it learned.


Repost #4 I believe

https://news.ycombinator.com/item?id=37305580

"I'll keep saying it every time this comes up. I LOVE being told by techbros that a human painstaking studying one thing at a time, and not memorizing verbatin but rather taking away the core concept, is exactly the same type of "learning" that a model does when it takes in millions of things at once and can spit out copyrighted code verbatim."


I hope your opinion isn't shared by lawmakers. Copyright is a relic of the past, and it needs to be put out of its misery. Trying to (mis)apply copyright here would just lobotomize the US. Existing companies would just technically operate out of a saner jurisdiction, and we'd be handing other countries a golden opportunity to leapfrog the US.


"anything goes" is the best and most natural solution. Just don't let people copyright the output if they don't have full copyright on all of the inputs. This should finally get rid of the cancer that is copyright in a generation or two.


Generic reply to siblings here… I get the intelligence argument.

My _main_ point is that there’s a non-trivial question to answer here.

I’m not qualified to answer (though I’ve offered up my non-expert opinion). It certainly seems to quickly veer in to philosophy!


It shows you are not a lawyer. You misunderstand how copyright works. Creating copies or derivative works and distributing those is all that matters under copyright. This is not "disregarding" copyright (which is not an actual thing) but something that is either fair use or requires some kind of permission from the creators of the original for those distributing a derived work or copy. That's why it's called copyright.

Copyright merely restricts the distribution of original works or their derivatives. In case of an infringement, copyright holders can insist you stop distribution and/or compensate them for that.

If I sell you a paintbrush, I'm not liable for you putting a red nose on the Mona Lisa and trying to sell it off as an original work. Doing that to the original would be an act of vandalism (because you don't own it), and doing it to a replica you got from somewhere infringes on the rights of those who created the replica, which is itself a derived work or copy whose distribution is regulated by copyright. Distribution of such a replica is of course fine because Da Vinci has been dead for a very long time and his work is no longer protected under copyright. Distributing your red-nosed Mona Lisa would therefore be fine too. Either way, the paintbrush seller is no party in this case; it is between you, Da Vinci, his descendants, and the replica creators.

Now your assertions as to what AIs are or aren't are simply not relevant. You assert it's a statistics-algorithm thingy. That sounds like a tool to me. Yet another paintbrush. Using a paintbrush is not infringing on anyone's rights. For that you have to distribute the results of your work. The nature of the tool does not matter. How you use the tool does not matter either. You merely create (potentially) derivative works with the tool, and what you do with those matters, especially when you distribute them to others. One of those derivative works is of course the AI model itself. Creating one is fine. Copyright gets potentially infringed when you distribute one.

Now we get to the core of the matter. Can you say with a straight face that the AI model resembles the original and is a derivative work? It doesn't actually look like or resemble the original in any shape or form. Even proving the AI model is derived from the original is tricky. Copyright is not about protecting vague ideas or notions but the concrete shape or form of things. And it's only an infringement if you distribute a derived work or a copy of a thing to others. So, merely creating an AI model is not distributing anything to anyone. You are merely using tools to create something for yourself: an AI model in this case.

Distributing a verbatim copy of a book is an infringement. Citing the book in your own work is fair use (up to a point). Paraphrasing elements from the book, acknowledging it exists, taking inspiration from it, or reading it aren't copyright infringements.

The legal problem with AI models is that their concrete shape or form doesn't resemble the original inputs in any way. Besides, companies like OpenAI don't actually distribute their AI models. They are huge; it's not very practical. They merely exploit those models to generate outputs for inputs from their users and customers. Are those outputs derivative works? Maybe, but that's where it gets tricky. They clearly aren't in the classical sense. Not even close. But if you somehow could conclude that they are, who is distributing that derivative work? Secondly, if the AI model is a tool, who actually creates those outputs, and are those outputs protected under copyright? Who actually holds those rights? And how would you tell such an output apart from a human-created one?

It's questions like this that make all this extremely murky from a legal point of view. IMHO without dramatic changes to copyright law or the way it has been commonly interpreted legally, it's just very poorly suited to do anything about stopping AI companies from doing what they are doing. You'd have to bend the conventional interpretation quite a bit for that. No doubt, there will be court cases where people will try to do that. But it will take many years before the dust settles on that. And I wouldn't get my hopes up on some unexpected/dramatic outcome.


This is generally true, but I'm surprised you aren't aware that distribution isn't the only right protected by copyright - creating derivative works is protected, and display rights are protected.


There are three copyright issues here; datasets, model weights, and model outputs.

Dataset copyright is pretty well defined and things can often be used under fair use. Fair use decisions are done with a four prong test and really decided by the courts on a case-by-case basis.

Model weights cannot currently be copyrighted. They are the output of a mechanical process over the dataset. However, software faced a similar situation where the source code could be copyrighted but the compiled binary was not. US copyright law was updated to address this. We may see something similar for model weights.

Model outputs are less clear, but these are likely copyrightable by the user of the model. It is not possible for a non-human to hold copyright, so the model cannot. It is very unlikely that the company producing the model could assert copyright over the outputs. A good analogy here is someone using photo manipulation software.

Super interesting area. I think we will eventually see an update to the copyright code to make weights copyrightable. Also it will be interesting to see how court challenges (code generation, image generation) affect datasets in the future.


Who do you think should hold the copyright on the model weights, the copyright holders of the individual works comprising the dataset or the ones who assembled the dataset?


I don't think that model weights are copyrightable.

They're a mathematical transformation of the source material.

As a trivial example, applying gray = .299 r + .587 g + .114 b to a pixel is also a mathematical transformation but it doesn't create a new copyright.
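
For concreteness, that transformation as a few lines of code (the weights quoted are the standard BT.601 luma coefficients):

    import numpy as np

    def to_grayscale(rgb):
        """Luma transform: gray = .299 R + .587 G + .114 B."""
        return rgb @ np.array([0.299, 0.587, 0.114])

    pixel = np.array([200, 100, 50])  # an arbitrary RGB pixel
    print(to_grayscale(pixel))        # one value, fully determined by the input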

And thus, the model is a mathematical transformation and a derivative work of its source material.

However I am also of the opinion that the resulting model is sufficiently transformative and fills a different purpose than the source material and so the model is not infringing.

From the Perfect 10 case against Google and Amazon:

> …We conclude that the significantly transformative nature of Google's search engine, particularly in light of its public benefit, outweighs Google's superseding and commercial uses of the thumbnails in this case. … We are also mindful of the Supreme Court's direction that "the more transformative the new work, the less will be the significance of other factors, like commercialism, that may weigh against a finding of fair use."

That was dealing with thumbnails being used in search and it was found that thumbnails being used for search was sufficiently transformative and applied to a different purpose that Google's use didn't infringe.

I believe that a model is even more transformative given that bar.

That doesn't mean that the output of the model isn't infringing, but that's a human with agency creating and publishing that output which is a different artifact to be considered than the model weights.


Indeed. Inputs are likely fair use, but if you output Mickey Mouse, Disney will definitely be on you. I think that's the most sensible approach: anyone can use a tool like a pencil to draw anything they want, but that doesn't mean they'll get away with creating a drawing of Mickey and saying they own the copyright.


Depends on who you think holds the copyright on the SHA-256 hash of an image.

Model weights are even less specific than that number, since they don't represent any specific source input at all.
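
For concreteness (the byte string is a stand-in for a real image file's contents): the digest is mechanically derived from the image, yet contains none of its expression.

    import hashlib

    image_bytes = b"\x89PNG fake image bytes"  # stand-in for a real file
    print(hashlib.sha256(image_bytes).hexdigest())  # 64 hex chars; image unrecoverable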


The biggest issue is the first one. Dataset copyrights are where most of the fight between regulators, AI companies, and artists is happening.

Should you be able to train your model on an image or text someone uploaded to the internet without buying the copyright? If not, then most current LLMs and Stable Diffusion models will have to go, and I don't see big tech allowing it.


Realistically an AI model is basically just a very complicated piece of software. The model weights are akin to the software code, the model outputs are akin to the outputs a user of the software creates, and the datasets are akin to the intellectual property put into the software by the developer to create the code.

In the same way that a developer could not simply steal someone else's intellectual property in order to develop a feature of a piece of software, one cannot simply steal intellectual property to adjust the model weights. The main difference is that it's generally quite easy to see in practice whether a model has utilized some intellectual property (because, for example, you can ask ChatGPT to recite the first 100 words of Harry Potter), compared to another piece of software, where you'd need access to the source code or the developers' thoughts (which could only be achieved through litigation, in most circumstances).

I think a great many people come up with convoluted answers to this question because they are uncomfortable with the reality that these very large organizations have essentially stolen hoards of intellectual property, and now that the horse has bolted people want to justify not closing the barn door. It seems to me very simple: to train an AI model on data, you must respect its copyright. The model weights should be copyrightable by the developers of the model (even if the law currently does not allow this), and the outputs of the model should be copyrightable by the person who interacted with the model (software) to produce the outputs.

The analogy with Photoshop is extremely simple: if some other company invented Gaussian blurring and copyrighted it, Adobe would have to license that technology from them to include it as a feature in Photoshop. The actual Photoshop software/code would be copyrighted by Adobe, and if someone created a blurry image with Photoshop they can copyright it.

I think people only disagree with this due to some sense that the process of translating data to model weights is "automatic" or "computational" in nature. You could in principle get a person to go through millions of data sets by hand and compute the changes to the model weights. This is no different from someone writing a piece of code, checking someone else's approach, and adjusting their own code after the fact. It just happens that we have developed very effective tooling to automate the adjusting of the code.


Three points I want to make:

- Models are nothing more than a statistical distillation of facts that can be traversed. They are not like software at all. Calling them software is like calling pachinko machines software. Nonsense.

- Models are mechanically derived with no element of human authorship or creativity. You could argue that there is creativity in selecting the dataset or the process that derives the model, but neither is relevant to the final generated model. Even if we assumed for the sake of argument that a model is more than just a statistical distillation, it should still not be considered copyrightable, for this reason alone.

- Don't use the word "steal" when you refer to the well-defined act of infringement. Stealing implies deprivation of property, which does not and cannot occur in this case. Using the word "infringe" is more honest and less manipulative.


It's not stealing, and the term "intellectual property" should be put to rest:

https://www.gnu.org/philosophy/not-ipr.en.html

Your opinions on what should be copyrightable are wrong, and fortunately the courts agree.


You've rather conspicuously failed to mention a fourth issue, with two parts, which is the first listed in the article abstract: "the use of copyrighted works to train AI models, the appropriate levels of transparency and disclosure with respect to the use of copyrighted works".

Outputs are the third mentioned: "the legal status of AI-generated outputs."


Much debate has been had about how existing copyright law applies to AI models. But once you get past that and start asking about how copyright should apply to AI models (as the copyright office is here) the answer in my mind becomes clear.

Copyright, as defined in the U.S. Constitution, exists "to promote the Progress of Science and useful Arts"[1]. I can think of no better modern example of "the Progress of Science and useful Arts" than AI models themselves. Therefore, it follows that:

1. Existing copyright laws should _not_ be applied in such a way as to make training these models any more difficult than it already is (as that would be in direct opposition to the stated goal)

2. AI models should be copyrightable by the person training the model (for the same reason any other software program is copyrightable)

3. Output of AI models should be copyrightable by the person running the model (for the same reason any other creative work is copyrightable) provided the output does not conflict with any preexisting copyright

For those who think training on copyrighted materials should be illegal, explain to me how that helps "promote the Progress of Science and useful Arts" and I'll re-consider my position.

[1]: https://en.wikipedia.org/wiki/Copyright_Clause


So i'm not sure how I feel, but to play Devil's advocate --

If I know anything I create is just going to be hoovered up and fed into somebody's AI model, so that I do 99% of the work and they get 99% of the profit, perhaps I'm much less likely to progress Science and useful Arts by creating content in the first place.

I fear an internet of signup walls and terms-and-conditions agreements for everything, just to prevent the crawlers that feed AI from soaking it all up.


> If I know anything I create is just going to be hoovered up and input into somebody's AI model so I do 99% of the work and they get 99% of the profit, perhaps I'm much less likely to progress Science and useful Arts by creating content in the first place.

Perhaps you wouldn't, but I, and apparently most of the scientific community who publish research, would (and do). The entitlement people here feel towards their way of doing things is astounding.

It's a story as old as time: Those who try to resist or limit progress (by placing arbitrary restrictions) will be beaten by those who adapt.


I think you should spend some time thinking about the purpose of progress. Progress in and of itself is not useful, nor is it necessarily desirable.

It's very easy to envision a society that is both more advanced than ours and profoundly worse in all meaningful aspects.


More people around here need to hear this.


> If I know anything I create is just going to be hoovered up and input into somebody's AI model so I do 99% of the work and they get 99% of the profit

Could you give a more concrete example of how this could happen?

As-is, I don't see how the existence of an AI model trained on J. R. R. Tolkien's Lord of the Rings is going to result in Tolkien's works receiving 99% less profit.

Maybe you could argue AI models as a whole will devalue certain types of creative works (e.g. art commissions for designing logos), but they don't need to train on any one particular creative work to accomplish that, so unless you're saying we should just ban AI models entirely I'm not sure how copyright helps with that.

> I fear an internet of signup walls and TOC agreements for everything, just to prevent crawlers that feed AI from soaking it all up.

This is a fair point, particularly since it appears to be already happening to some extent. Though it seems to be largely social media companies and content aggregators trying to control access to information they don't hold the copyright to in the first place, not individual users trying to restrict access to works that they created. I'm not sure copyright would really help "promote the Progress of Science and useful Arts" there so much as "promote the wallets of large social media conglomerates".


> If I know anything I create is just going to be hoovered up and input into somebody's AI model

But today, without an AI model, anything you create is already going to be learnt and studied (if it is worth studying, of course). What's the difference, other than speed?

> they get 99% of the profit

Why is that a priori the assumption? What stops you from getting a profit?

> I do 99% of the work

You did 0.000001% of the work, since the model is trained on billions of other works.


> What stops you from getting a profit?

OpenAI and Stable Diffusion not paying for their datasets. I don't believe GitHub asked before using my contributions for Copilot.


But you weren't receiving profit from your works originally? So why does it matter what someone else is doing?


If someone jacks my car while I'm asleep, races it, wins a prize, and returns it before I wake up, they haven't deprived me of anything, they've profited off my property, and it's still wrong.

Profit and deprivation are not and never will be good tests for determining things like this.


> jacks my car

no, they downloaded your car.


> Copyright, as defined in the U.S. Constitution, exists "to promote the Progress of Science and useful Arts"[1]. I can think of no better modern example of "the Progress of Science and useful Arts" than AI models themselves.

This makes no sense. Before AI, it was already clear that copyright itself restricts what can be done, in order to promote the overall health of innovation. You can't just say "this is cool, so therefore it's allowed"; by that logic there would never have been any copyright in the first place. You have to argue that the overall result will be better given the rules you propose.

Now, I'm no fan of copyright, but it is abundantly clear that the tech companies are able to capitalize on new tech disproportionately. Thus, it's a transfer of privilege from the very many designers, artists, musicians, authors, etc. to whoever will dominate AI. That's not good, simply because of the centralization.

Moreover, you can't just look at the short-term gains of AI models that can be produced with existing content. You need to include the change in incentives, when creators' financial prospects are even more minuscule than today. Even if AI is all that matters, it still needs training data, and that needs to come from somewhere.


AI is the innovation. Trying to misapply copyright here would retard the progress of science and useful arts, not promote them, because it would wrongly restrict that innovation.

The answer to big tech corporations centralizing this is to fund it publically and make it available for free to everyone as a shared summation of our culture, not lobotomize ourselves just to profit a few old dinosaurs that are relying on an outdated idea of copyright.


> AI is the innovation.

Yes, but it does not live in a vacuum. Its power is derived from human creations and their labor, for now.

When Spotify transformed the music industry, they could not simply claim innovation and be exempt. They had to negotiate with the dinosaurs and eventually it came through.

The inertia is a feature, not a bug. We’re still dealing with the fallout of the social media and ad-tech transformations. Unintended side effects takes a lot of time to understand.

> Trying to misapply copyright here would retard the progress of science and useful arts[…]

How so? All of academia would be entirely exempt, and so would the hackers and tinkerers, etc. You’d only violate copyright if you sell the models or the works they produced.

> The answer to big tech corporations centralizing this is to fund it publically and make it available for free

Yeah but that won’t happen. Even if it does, we need something that works in the meantime.


> You’d only violate copyright if you sell the models or the works they produced

That's not how copyright works. Academia would only be able to claim fair use by fighting expensive lawsuits. Hackers and tinkerers would be sued into submission precisely because it's a hobby that they won't risk jail for. People would be scared to work on anything related because of lawsuits threatening them with obscene amounts of money, and hence the retardation of science and the useful arts.

Other countries would leapfrog the US, and it would be left behind, all so that a few people can continue extracting rent with their government-granted monopolies.


Do you see #1 and #3 conflicting at all? Ex: you produce a model, run it, publish and copyright some output. I can then use that as training data for another model in the style of your existing model?


I see that more as a conflict between #1 and #2, but fair point. In extreme cases, you could probably make a crude copy of a model by training a new model solely on the outputs of the first one. Normally that would be a derivative work, but that's inconsistent with the idea that training on copyrighted works is always permissible.

Maybe one way to resolve this would be to say there ought to be some practical limits on what percentage of the training data can come from any one individual source. If I train a model solely on the text of one book, for example, such that it's so overfitted that it can do nothing but regurgitate passages from that book, it's probably fair to call it a derivative work. The same would apply to a model trained solely on output from another model. (Though if it merely incorporates a few examples from a bunch of different models, that would be okay.)


> provided the output does not conflict with any preexisting copyright

I think this clause is doing all the work here and unfortunately in many cases there's no quick, automatic way to verify this.

Because of the legal costs involved in litigating who copied whom, I fear this would allow someone to sue the original creator of a work for infringing on the copyrighted output of an AI model trained on that work. If this seems far fetched, consider that this already happens with the DMCA and Creative Commons works: https://www.techdirt.com/2016/04/26/ifpi-files-dmca-takedown...


If you're going to bring up the origin of copyright you also need to consider the world copyright was made for. Back then, there was only one type of copying machine: a printing press. These were huge, expensive machines that could only practically be operated by corporations. Copyright was invented to protect authors from those corporations.

Your point 1 is what concerns me. AI models seem too much like the printing presses of old. They are available mainly to corporations and authors need to be protected from them. Otherwise there will be no incentive for anyone to publish anything novel as they know the corporation will slurp it up and make it "better" with their better model.


> 3. Output of AI models should be copyrightable by the person running the model

I would go further, and declare that this output is uncopyrightable.


Number 3 doesn't really make sense. I wouldn't get copyright if I told a human artist "draw a dog". Why would that change just because I'm telling an AI to do it?


I oversimplified that point a bit for brevity's sake, but I'd say I agree that some level of creative input from the human operator should be required in order for the work to be copyrightable. If I draw a black 100px*100px square in MS paint that's not copyrightable, nor should typing "dog" with a seed of "1" into Stable Diffusion be. But as soon as even the smallest level of creative input is involved, yeah it should be copyrightable.

Also, kinda beside the point, but:

> I wouldn't get copyright if I told a human artist "draw a dog".

If they were working for hire, yeah you absolutely would.


You can get very creative in your instructions to a human, but that creativity still won't get you any copyright.

And afaik, in a work for hire scenario, the artist just agrees to transfer their copyright automatically. If there's no copyright in the first place, there's nothing to transfer


Although I disagree and consider copyright an anti-social institution only necessary due to the anti-social capitalist mode of relations, I commend you for making an actually coherent argument on this question. It is the first coherent argument I've come across outside of the small Marxist circles I run in.


I'm going to try to plead my case for images generated using sophisticated prompt engineering to be copyrightable. For example, at the point that I've written a prompt with 20 tags, 10 negative-prompt tags, some LoRAs, custom weights, embedding merges, and prompt editing, I'm now writing what is effectively a "program", which should be copyrightable, and so should its outputs.
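
Something like the following, say (the field names and weight syntax here are purely illustrative, loosely modeled on common Stable Diffusion front-ends rather than any one tool's actual API):

    generation_spec = {
        "prompt": "(portrait:1.3), oil painting, dramatic rim lighting, "
                  "35mm, shallow depth of field, <lora:brushwork_v2:0.7>",
        "negative_prompt": "blurry, extra fingers, watermark, low contrast",
        "embeddings": ["style_merge_a", "style_merge_b"],  # merged embeddings
        "steps": 30,
        "cfg_scale": 7.5,   # how strongly to follow the prompt
        "seed": 1234567,    # pins down the otherwise random starting noise
        # swap "oil painting" for "watercolor" 60% of the way through sampling
        "prompt_editing": [("oil painting", "watercolor", 0.6)],
    }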

It's total BS to me that a book of Midjourney-generated images is itself copyrightable because a human arranged the book, but that the output of a highly sophisticated prompt involving custom tooling wouldn't be.

If nothing else, my comments should show the US Copyright Office how deep the rabbit hole goes with just how interpolatable everything is with everything else.


Copyright is, like all regulation, a restriction of freedom for the benefit of society. Producing movies and books takes a lot of investment, and people would not do it to the extent they do without copyright. If we think there is more than enough creative content, we should reduce copyright protections, and if we think there is not enough, we should increase them. Generative AI will move the needle very far in the direction of overabundance and should result in correspondingly reduced copyright protections.


The thing you're authoring is just your prompt. Apply for a copyright for it. The image or text generated in reply is generated by a computer with no more input from you than someone commissioning work using very precise words, and lacks human authorship.


This falls apart when you start getting into things like ControlNet and posing, or iterative erasing, reprompting, and in/outfilling. It starts feeling more like some weird combination of 3D modeling, Photoshop, and a really advanced autofill.


If you keep prompting an artist for edits over and over again, you don't suddenly own the copyright.

The lesson you should learn is that advanced AI autofill in Photoshop may lack human authorship, but it wasn't good enough to become a serious copyright issue until those tools came into existence.


That's the same as saying that if I open Paint and draw with the mouse, my mouse movements should be copyrightable but not the resulting image...


It's not the same. If you make exactly the same mouse movements in paint 100 times, you will get 100 identical images. If you enter the exact same midjourney prompt 100 times, you'll probably get 100 different resulting images. The relationship between your authorship and the final image in the two cases is quite different.


Your mouse doesn't make decisions for you. ML-based art does, which is why it lacks human authorship and you shouldn't be able to copyright it.

If you hand-painted something in Photoshop 100%, you can copyright it; it has human authorship. If it's mostly AI-based fill, those elements can't be copyrighted. If it's 100% an AI result, it's public domain.


The training model for Stable Diffusion has a lot of copyrighted images mixed together into an output which makes the plagiarism non-obvious, but let's reduce the set by 1 image. Shouldn't affect the output too much, right? Maybe some prompt will have a slightly different image.

Now let's reduce it by another image. Again, fewer options for what to display, fewer images to take pixels from, but still a lot of options; the output may still seem copyrightable.

Now let's do that N-1 times. What output will we get when the model was trained on a single image, say an image labeled "dog"? If your prompt is "an image of a dog", you will get that image, the only image in the training set. When going from latent space to image space, taking pixels from that image for the output, despite it being done in convoluted ways, is that not an obvious copyright infringement? I think it is. There's a cloud of mumbo jumbo about latent space, but after the dust settles and it needs to generate pixels in the output image, Stable Diffusion has a step that is essentially copying pixels from the source image into the output. When there's only one image, it will reproduce large portions of that image, necessarily infringing on copyright.

So then adding back images one by one into the training set, each one being used as source for the pixels being copied, what makes that model OK? Just because the output is 50% image A and 50% image B, or 0.1% image A and 0.1% image B and 99.8% image C, doesn't suddenly make it OK.

Once there are millions of images, you end up with just tiny blobs of pixels being copied from many different images. That still infringes on the copyright of all those images, because it's essentially a map-reduce process that maps pixels from copyrighted images and reduces them into a single image.


This viewpoint is about as coherent as "every image file is copyright infringing because every pixel in it exists somewhere in some other image somewhere".

Derivative works, when substantially changed, are not infringing. If I take an image of the Mona Lisa and rearrange all its pixels so it looks like a picture of a cat, that's not infringement.

If I sample lines and curves and colors and styles from several images and make something new, that's not infringement.

The actual problem with image models is that they can sometimes be coaxed into outputting images that are quite similar to an image they were trained on. That constitutes infringement.


> If I take

> If I sample

You're not a computer program and your viewpoint is about as valid as "cars don't need speed limits because most humans can't run faster than 10mph and that speed is safe".

Copyright laws were made with humans in mind.


If 1 in 100 humans could run up to 100mph you bet your ass there'd be laws against doing so around other people; it's a safety concern. Hell, even now running in most indoor or crowded areas is, if not illegal, at least considered bad behavior and may get you reprimanded or thrown out.

Some people claim to have a photographic memory. Supposing this is true, is it illegal for these people to look at copyrighted material because they may reproduce it later from the copy in their head? Of course not, it's the actual act of producing that copy that isn't allowed.

Of course, we're not talking about a computer program that stores a copy of an image and reproduces it later (that's called an "image encoder"); what we're talking about is statistical software that identifies common patterns in images and associations between those patterns and human-language descriptions of the images containing them. It doesn't store or make a copy of the images it learns from, and it should only be able to reproduce images or elements of images that are overrepresented in its training data. Like any other software tool, if someone manages to use it to make an unauthorized copy of someone else's work, whether it was present in the training data or otherwise, then the user has infringed the other person's copyright. The only real argument you could make is that distribution of a trained model constitutes distribution of a tool aimed at assisting users in unlawful copying, but IMO that would apply more easily to wget than to Stable Diffusion.

Copyright laws were made to encourage and promote the creation and practice of useful arts. Applying them to stop the creation and adoption of a tool that would make humans far more efficient in the creation of art is backwards.


> is that not an obvious copyright infringement?

No, it is absolutely not.

Let's run the same hypothetical you brought up about using other people's art, but instead our model takes just a single pixel from each of 1 million images.

Taking 1 single pixel from a million images, or the first letter from every book, and putting it into a new work is transformative fair use.

Transformative fair use is legal.

> Just because the output is 50% image A and 50% image B, or 0.1% image A and 0.1% image B and 99.8% image C, doesn't suddenly make it OK.

It quite literally does! Using 0.1% of an image is legal.

The amount of work that you take from someone else is one of the 4 factors of fair use.

Yes, the specific example you gave falls under what the courts literally use right now as one of the factors!


> Once there are millions of images, you end up with just tiny blobs of pixels being copied from many different images.

This is not how these neural nets work. They don't copy pixels from anywhere. They learn features.

The features represented internally are generally not easy to interpret to humans, but for sake of illustration, there could be an artificial neuron that fires when a subject should have blue eyes. Having a lot of blue eyes in the training data would help this neuron learn better when to fire (based on the values of other neurons, which may in turn represent other features). For example, it may learn to place more importance on an input that represents pale skin or Nordic origin.

It can learn concepts like cars have wheels, and wheels are round, etc. And then when you ask it to draw a car, it composes one from the concepts it learned. Some parts of the network will deal with the fine details that more directly influence pixels, but these aren't copying pixels from any image either. They're weighing a bunch of factors (eg is this pixel part of the iris and did the network decide to make a person with blue eyes?) and choosing pixel colors based on those factors.
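
A toy sketch of such a neuron (the weights and feature names are invented for illustration; real learned features are rarely this interpretable):

    import math

    def blue_eyes_neuron(features):
        """Weighted sum of upstream features, squashed through a sigmoid."""
        weights = {"pale_skin": 1.8, "nordic_origin": 2.1, "dark_hair": -0.9}
        bias = -1.5
        z = bias + sum(weights[k] * features.get(k, 0.0) for k in weights)
        return 1.0 / (1.0 + math.exp(-z))  # activation in (0, 1)

    print(blue_eyes_neuron({"pale_skin": 1.0, "nordic_origin": 1.0}))  # fires high
    print(blue_eyes_neuron({"dark_hair": 1.0}))                        # stays low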


Thank you for the explanation. Let me explain my position in similar terms.

I'm not replicating an image, I'm "using my brain to build a network of neurons that map electrical impulses from the optical nerve excited by wavelengths projected onto my retina in order to send other electrical signals to actuator tissues".

The complexity of the process is irrelevant imo. We can treat it as a black box and look at the inputs and outputs.

If the images in the database didn't exist, it wouldn't know what to draw, and those images are copyrighted.

Everyone's welcome to take a camera, run around the world and label every object for the neural net to learn, like a human does, but model authors didn't do that because using copyrighted images for free is much easier.


You're right that if you're replicating an existing copyright image, the process doesn't matter. Legally, if you lived in a cave your whole life and never saw any art and by amazing coincidence you just happened to paint and sell the exact same painting as some other artist, you'd be violating their copyright. Independent creation doesn't protect you.

On the other hand, under current copyright law, if Stable Diffusion generates an original image that doesn't look like a copy of any existing image, it's clear the new image doesn't violate any artist's copyright.

The debate is whether you can use copyright images/text to train an AI.

Stable Diffusion is of course trained on millions of photos of the real world, in addition to images made by artists. Of course, human artists also see and digest both the real world and images by other artists, and both influence their output. That's why you get trends like impressionism.


You are describing transformative use, which is permitted. Otherwise I could create a picture with every possible RGB pixel and then claim all other artists are infringing on 0.1% of my work.


How does this square with something like Cariou v Prince? http://www.artistrights.info/cariou-v-prince


It is impossible to 1:1 replicate the input as an output because the images are not stored. It isn't a database. It's basically aggregating summaries/abstractions/generalizations of a bunch of tags.

In other words, it is transformative by default.


Can it replicate the input 0.1:0.1?


I personally feel that, given the model was built on fair use of mostly copyrighted images, it's fairly self-evident that anything produced from it should not be copyrightable, UNLESS the art used to train the model is 100% owned by the "artist". That could be via licensing for that purpose, or by owning the copyright outright, and that right should not extend to corporations, as a corporation can't be an artist. It doesn't matter how complex the prompt or series of prompts are; the key here is that the "artist" either owns the training material or licensed it through the proper chain of licensors.


Thank you, it basically boils down to this.

I'm baffled that anyone in their right mind still argues as if using it, like any other tool, were free of copyright infringement of some kind.

It also boggles my mind how, in our line of work (which is often artistic in its own right), a lot of people hold preconceptions about how art is made, often reducing it to nothing but transformative generation. Such takes are deeply narcissistic, and downright wrong. At this point I'm led to believe they're AI-generated.


Your prompt (the input) should be copyrightable. The output should not.


Okay, but does that include the seed? The sum of the input model datasets? I think a better scenario is no copyright for any part of this process. Give humanity what they're going to take anyway: open access to this tech.


No, you're passing inputs to a program, and your description completely omits the vast majority of that input: the creative output of an unknown number of other people, whose rights you are attempting to launder.


That's nice and all, but it has nothing to do with whether something is copyrightable or not.

Instead, something is copyrightable based on the amount of human input into the process.

And it is very clear that there can be a lot of human input into AI image generation.

Even though I will concede that going into midjourney and just typing in "Hot anime girl" isn't a lot of input and likely doesn't deserve copyright protection.

But there can be so much more to AI art than the boring case of low-effort Midjourney prompts.


If the court is convinced that prompt engineering is "original and creative".[1]

Maybe it is. Or maybe it's more like tweaking the random seed.

[1]: https://www.copyright.gov/comp3/chap300/ch300-copyrightable-..., see "The Originality Requirement" and "Creativity".


No, your human-input product is the inputs you authored, but you're applying for copyright on something else.


So then yes, it is about the human input into the process. That's what I just said.

Having large amounts of human input is the thing that matters for this stuff, which is the case for many forms of AI art.

In the same way that Photoshop uses a computer, and the computer renders the art, the resulting computer-generated art can still have copyright protection (because of the large amount of human input, even though, yes, it used a computer).


The product of your human work is the input, not the AI-generated image. You are merely commissioning some system to do work over which you have very little actual further authorship beyond the commission. For art, we don't grant the copyright to someone commissioning a work; they have to negotiate with the artist for that.

But your artist is not human, so it cannot create copyrightable works, so you can't even bargain for the right. Your copyrightable prompt just created a public-domain result.

As for your comparison with photoshop, you have it backwards. The lesson you should learn is that if you fill portions of a work using something that starts authoring parts of the image, you should lose the ability to copyright those parts of the image, because you didn't author them.

Just like other works that are a mix of public-domain and copyrighted elements, you can only copyright the human-authored work. It's like making a comic with AI pictures - the images themselves are public domain (assuming you haven't forgotten to license the use of those works in your computer system that generates the images for you); your assembly of the work into a comic is what you own the copyright to.

The characters and images designed by a machine remain public domain, no matter whether you prompted all of them.


> The lesson you should learn is that if you fill portions of a work using something that starts authoring parts of the image, you should lose the ability to copyright those parts of the image, because you didn't author them.

In your opinion then, using this same line of logic, work made in Photoshop is not protected.

The courts disagree with you though.

Using your line of logic, you could say that the computer is authoring the work using Photoshop.

It is the computer printing out the picture using bits and bytes. That's not a human! That's a computer program named Photoshop!

Then follow the exact line of logic from there.

> characters and images designed by a machine remain public domain,

We know this to be false, though, because a character created in Photoshop is also made on a computer.

Therefore, it is clear that a machine can be used in the process, unless you are going to claim that work made in Photoshop is not protected because it is produced on a computer.


I used words very carefully. It depends on the level of human authorship.

If you use a tool to correct some pixels that's directed by a human closely, then no, that's not a problem.

If you remove large parts of the image with the ai erase fill, then you've given up authorship of those parts of the image. You could then go in and author changes to the work that you could further add to your copyright. But you would never change the copyright status of the stuff authored by a machine.

You're using 'Photoshop' at a naive level without looking at the most important element - how much authorship the human is contributing. You can basically 'hand paint' a picture in Photoshop, or you can use the AI tools and have almost no authorship.

Photoshop is a program, not a method.


> I used words very carefully.

Then I am happy to use your word if that clarifies things.

Just replace everything that I said about "human input" with "human authorship".

And my point is that there are many things that a human can do using AI art that have large amounts of "human authorship" beyond just the boring case of prompting Midjourney with a dumb prompt.

> It depends on the level of human authorship.

Oh hey! Yes that is exactly my point.

That point being that just as Photoshop images are copyrightable because there is human authorship, AI art can be too, if there is human authorship.

Glad you agree.

> that's directed by a human closely

Ok! You agree with me then! That's my point!

My point is that AI art can be directed by a human closely, and that there is so much more that can be done than a simple prompt into Midjourney.

You agree with my central point.

> If you remove large parts of the image with the ai erase fill, then you've given up authorship

Not if you "direct it closely"! Then it's protected.

> how much authorship the human is contributing

That is exactly what I am talking about, and I have said it multiple times.

That there are lots of things that a human can do with AI art that are closely directed, and that these things are authorship.

But I am glad that you agree with me that if it is directed closely then it is protected, which was my point and that you can do this with AI.


It's clear from this post that you aren't here in good faith at all; you're just being glib and purposely misconstruing other people's words. Good luck with life.


You can "hand paint" in Photoshop, but you can also assemble collages composed of bits and pieces of other copyrighted works and create a result that is still copyrightable. Why is the latter currently legal in your opinion?


If the things you're collaging can't be copyrighted because they lack human authorship, you can only copyright the arrangement, not the things you're arranging.

If they have human authorship, it's the original human artist's work you're collaging, and now you're creating an unauthorized derivative work violating many copyrights.


Fortunately AI art can have human authorship if it is closely directed by a human.

> not the things you're arranging

You can if those things are closely directed by a human, using AI as a tool the way Photoshop is a tool.


If you do not license the "bits and pieces" then the courts will find you in violation of their copyrights. Pretending AI is different is bizarre out-of-touch "but I'm so special" solipsism.


I believe you are wrong there. Transformative art requires no licensing as long as it falls under fair use.

I can quote a book in my article without licensing the quote from the author. I can clip eyebrows off of copyrighted magazine portraits and assemble them into an eyebrow version of some famous art piece and never have to license a thing. I can take screenshots of copyrighted YouTube videos and assemble a "shirts of YouTubers" collage that I lasso-tool'd together and create an entirely new copyrighted work without having to license a thing. I can take a photo of a street which contains an art gallery, and copyrighted art can appear in my photo without my having to license anything from the artist. Fair use would cover taking 1 out of 1 million pixels and assembling them into a new image if a human were to perform that action.


You've nailed the "creative" but not the "original" and you need both.


Sampling is an art form if properly attributed (and possibly even without; whole genres are built on the premise), especially if it elevates the original work.

That's not to say copyrighting ML derived creative works is the way forward, but that creativity has bearing wherever a 'medium' can be manipulated.


Transformative collage.


I have never understood the fair use argument when it comes to training data.

I publish a copyrighted article. Some LLM ingests it without permission, but since the output of that LLM is sufficiently different from my source article there is no violation.

I publish copyrighted code. Some company decides to consume it without purchasing a license. The product they distribute is vastly different from my code itself, but I can still sue them into oblivion.

What's the difference between the two?


> I publish copyrighted code. Some company decides to consume it without purchasing a license. The product they distribute is vastly different from my code itself, but I can still sue them into oblivion.

No you can't. If a company reads your copyrighted code, then writes up a spec and sends it to another team that writes up code that accomplishes what you did, that doesn't violate copyright and you wouldn't be able to sue them into oblivion.


> I publish copyrighted code. Some company decides to consume it without purchasing a license. The product they distribute is vastly different from my code itself, but I can still sue them into oblivion.

Isn't the analogy more: an employee at a company reads your copyrighted code, along with many other pieces of code, and produces a new piece of code? Your code influenced the output, but in no way can you a) detect that influence or b) assert any copyright over the output.

In your analogy, your code is still intact within the new product; that's not the case with LLM-produced output.


If the new code is almost identical to the original code, then it is very much subject to copyright, or at the very least is grounds for a lawsuit. And generative AIs can and do generate output extremely similar to some of their inputs if given the right prompt.

My personal opinion is that training a model isn't infringing the copyright. But generating outputs can, if they are sufficiently similar. And since the model itself can't be liable for such infringing works, I think the creator of the model should be responsible.


It's more like if I looked at thousands of articles and produced a big spreadsheet containing interesting facts about those articles, like word frequencies or which words tend to come after which other words. I have never heard anyone suggest that that kind of analysis would be copyright infringement. The new thing is that someone figured out how to organize dumb facts like that in a clever way and use them to create something sometimes useful, but without actually copying parts of the original articles, since those were never saved as part of the data.
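For illustration, here's a minimal sketch in Python of the kind of analysis I mean -- word counts and which words tend to follow which, over a made-up toy corpus (real model training is of course vastly more involved):

    from collections import Counter

    # Toy corpus standing in for "thousands of articles" (made up).
    articles = [
        "the cat sat on the mat",
        "the dog sat on the rug",
    ]

    word_counts = Counter()
    next_word_counts = Counter()
    for text in articles:
        words = text.split()
        word_counts.update(words)
        # Count which word tends to come after which.
        next_word_counts.update(zip(words, words[1:]))

    print(word_counts.most_common(2))       # most frequent words
    print(next_word_counts.most_common(2))  # most frequent word pairs

Nothing in the resulting tables is a copy of any article; they're just aggregate facts about the corpus.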


Well "consuming" the code means copying it word for word into your source tree. The compilation process does mangle it a bit but usually you can find identical function names and so on, and the courts have ruled that object code is equivalent to source code for purposes of copyright.

A different situation is you read a copyrighted news article and write the facts from it in your essay. Since the copyright of the news article only extends to creative expression and not factual information, there is no violation. For me, it is hard to tell what difference there is between this situation and an LLM identifying statistical patterns in the news article.


> A different situation is you read a copyrighted news article and write the facts from it in your essay.

This is fine for actual facts (with some caveats), but how should it apply to creative works like a novel?


The same way it does now. Simply having a character resurrect in a story doesn't violate the copyright of the Bible; Game of Thrones; The Lion, the Witch and the Wardrobe; Harry Potter; Lord of the Rings; etc. Writing a story about Luke Skywalker as a ten-year-old kid exploring Tatooine, on the other hand, is likely to run into problems.


The copyrighted article is not part of the output (at least, not verbatim). The copyrighted code is part of the output.


You can actually prove that some company distributed a product with your code, but you can't practically prove that an LLM contains your source article.


The difference is that the software is in active use in your scenario. Consider if you took a copyrighted program and calculated the SHA hash of it. You can then use that hash without needing to have a copy of the original program. The hash is also not infringing, because it's a simple fact.
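A minimal sketch of that analogy, assuming a local file named "program.bin" (a hypothetical stand-in for some copyrighted program):

    import hashlib

    # "program.bin" stands in for a copyrighted program (hypothetical file name).
    with open("program.bin", "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()

    # A 64-character hex string: derived from the work, but not a copy of it.
    print(digest)

The digest is computed from the copyrighted bytes but contains none of them; it can be shared freely as a fact about the file.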


The only clear solution is to abandon the notion of a copyright.

We have known for a long time that everything can be represented with numbers, even more so within the space of computers.

All we have done is invent a system to help us find numbers we find special.


Trying, though it's hard to get noticed. https://news.ycombinator.com/item?id=37346620 And I want to participate in the community here, not merely mention the thing I've built.


> though it's hard to get noticed.

I'd suggest starting with at least a brief paragraph on what thenose is, what its goals are, etc. I read your post and found myself reading the technical workings of something I didn't know anything about.


Oh, thank you. Basically AI training datasets have been knocked offline recently by DMCAs, and the goal is to bring them back online in a place that can't be knocked offline. The most popular training dataset was The Pile, hosted by The Eye: https://pile.eleuther.ai/

Notice the links now 404. We tried to make a drop-in replacement for those links. All they have to do is change the-eye.eu to thenose.cc in the urls.
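If it helps, the swap really is just a host substitution in each URL -- a sketch with a made-up shard path, assuming the mirror preserves the original paths:

    # Hypothetical shard URL; only the host changes.
    old_url = "https://the-eye.eu/some/pile/shard.jsonl.zst"
    new_url = old_url.replace("the-eye.eu", "thenose.cc")
    print(new_url)  # https://thenose.cc/some/pile/shard.jsonl.zst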

Unfortunately there aren't a lot of ways to get their attention to let them know this exists now. I'll try emailing the contact address, but I imagine they receive lots of spam, so I was hoping to try to get noticed by people like yourself first. Maybe a direct email is still the best way, but there's no guarantee they'll even be willing to change the urls due to legal risks. For all they know I could be logging the IP address of everyone who downloads it and forwarding it to authorities. But I'm not, and it's a frustrating problem to try to solve. I just want to help AI flourish.

This also serves as a template for someone else to do the same thing, so at least there can be multiple mirrors.

Thank you again. The fact that you even took the time to look it over meant a lot. If you have any other ideas, I'd be interested to hear.


I'm surprised that nobody has suggested that what's behind this RFC is Disney and other large studios lobbying to make it legal to copyright AI generated content so that they can move to AI generated movies and art. Right now you can't get a copyright on AI generated content.


I think the current precedent is that you can, as long as it's not wholly machine generated work. Whatever modicum of creativity you put in as an editor is what's got legal protection.


AI Jesus chat-bot could claim copyright over biblical content.

In theory, a company that owns Christian (c 2023) content could be filing DMCA claims every Sunday.

The silliness of digital-racketeers must end at some point. =)


I'm not sure what its feelings are on the topic, but Ask Jesus is a thing on Twitch right now. NSFW (it's chat-generated content on Twitch).

https://www.twitch.tv/ask_jesus

It takes chat prompts and weaves together scripture, The Internet, odd pronunciation, callbacks to other questions, and hilarious fails when chat gets, uh, playful.

AI Jesus just said there is a time and place for everything and that it might just be the time to buy Starfield. It knew it was a game being only prompted with, 'Jesus, should I buy Starfield?'


Ah, but can you buy Starfield with the change from within yourself.

Also, the buggy nature of Armored Core 6 could be fun too.

Happy computing =)


You've described a viable startup business plan.


It is what I enjoy doing, but someone has already launched a product. =)

text-with-jesus:

https://apps.apple.com/us/app/text-with-jesus/id6446922759


Nope, that's not the plan. The plan is to start a non-practicing entity that copyrights religious texts and sues others who use them.


Sounds like you are working with the other gentlemen's organization... you know the one that is likely... will "McKinsey & Company" be the secret behind these WIPO shenanigans too?

There was a 30% drop in ChatGPT users over 30 days, so I think we are now past the "early majority" stage of the hype cycle. Perhaps NVIDIA GPUs will be a little less ridiculous next year too.

Have a glorious day =)


The entire discussion about AI and copyright strikes me as a bit naive.

Right now, we are in a situation where nobody quite knows what these AI models are useful for. We have some inkling that they might be extraordinarily useful for making money -- but not precisely how, not even the companies that are developing the models themselves.

Once the money starts flowing, the debate over copyright will fall exactly into the economic seams between the major players involved:

- new tech orgs who are monetizing models will say that the model is "exactly as humans are": they see copyrighted works in training, and then produce wholly original outputs. And of course that the model weights themselves are, like the outputs of employees, completely owned by the company.

- incumbents who stand to lose out on the new gold rush will say that every single output of a model belongs to them if just a single image or sentence was seen in training. And that because of that, we really should just shut the whole thing down, because how could you ever prove that a model was not trained on copyrighted material?

The fault lines will rest entirely on who has more power, hard and soft. How much can they influence the legal system, either by spending $ to hire legal talent or by sheer soft politicking, balanced with how favorable they appear to the general public who uses their product (or consumes their media). I suspect that the end result of this debate is a "legal" way of doing things accessible only to the extremely large players, and a small, politically insignificant collection of individuals, hackers, and startups who aim to unseat those large players (or just flat-out train "illegal" models). The worst possible end result is that the legal system is just too fossilized to deal and tries something draconian like not allowing datacenter-scale GPU compute.

As an aside, I predict a sizeable space for companies that do "compliance" -- asserting the copyright status of a dataset, perhaps even themselves using ML. That market will carve off and leave rotting a sizeable chunk of the new money's ML profits.

It's fun to talk about this, I guess. But remember that what you or I have to say about what a machine learning model philosophically is has no bearing whatsoever when it comes to the actual ability of individuals, startups, or large players to use models.

I will predict though: enjoy Llama2 while it lasts. Like the internet, it will become fully assimilated into the larger intellectual property machine.


I wonder how they verify the personhood of the people making the comments. I can see this process being easily abused if the comments aren't taken by real people in person.

That said, I hope the US doesn't end up piling even more restrictions onto copyright. They'd only be shooting themselves in the foot. Copyright has completely failed to achieve the purpose it was intended to serve (wasn't it only supposed to give authors a short window of time to profit off their efforts? Look at it now). Perhaps it's time to rethink the concept of copyright as a whole before other countries beat the US to it.


I'd be interested to see the outcome of all this honestly, but I see parallels in how we exclude natural organisms from copyright and instead rely on patents to enforce ownership of unique genetics.


We as a society have a relatively healthy setup for people to create art and content. Sure there are problems, but on the whole it mostly works. What AI will do is destroy that by removing the profitability of creating that content.

Although generative AI operates on a principle similar to a human being exposed to a large number of artworks, it does so at a blindingly faster speed, enabling it to outcompete humans at many tasks. The number of such tasks will only increase in the future.

Thus, small-time content creators who make an independent living from content creation will be squeezed out and left in the dark. In some years, it will be very hard to make money from content creation at all.

Access to information and entertainment will also become more anonymous, with most people consuming things through AI generation. Of course, that will be convenient at first, but we will end up with a world where a significantly SMALLER fraction of people, those controlling AI, supply us with everything. (Including manipulative advertising to consume more of their product.)

For every benefit that AI gives us, there are 10 losses.

I used to dislike draconian copyright laws, but now I like them. And I sincerely hope they are used against AI to make AI unprofitable. I believe further that AI will be society-disrupting in a variety of other ways and thus, as a society, we should destroy it. But I am pessimistic.


I don't want to see AI copyright, but that's more an extension of not wanting to see any copyright.

A lot of it comes down to creativity as a process, or as a product.

It feels like many people tend to view creativity-- especially capital-A Art-- as a process-- it should be arduous and require skill and be an exclusive club. I've seen people be very hostile to pre-AI digital technology because it lowers the bar-- with software, you have a canvas that you can keep erasing until you're happy, a straight-edge that's perfect every time, etc.

I tend to think the important part is the end product: was someone able to express their vision to the highest possible level of fidelity? In that regard, every new technology gets us closer-- the person who couldn't draw a circle can take a photo and doctor it in the GIMP until he's happy. The next frontier for this is to say "I can use AI to generate 200 permutations, and then take the ones I like best and further hack on them to get what I want."

This also makes me hostile to copyright, since anyone trying to claim work as their own, establish restrictive conditions, or "stop plagiarism" is denying others tools and resources that could be used to deliver their vision. Maybe the most effective way to realize someone's vision is to start with an existing piece of work and modify to fit. (programmers seem to understand this more innately than, say, photographers)


Yeah changes to copyright are the solution. Not unenforceable rules around AI.

Local LLMs are going to make enforcement impossible.

Unless corporations and government collude to lockdown the frame buffer, a user at home can build a corpus from their own gameplay footage and train an AI to make new environments for those entities (working on such a setup now, modeling AI vectors with an open source engine’s geometry primitives and shader language).

The output space for visual content is much smaller than for language, which has to understand context.

Visual outputs we find acceptable for forming spatial understanding are much more limited. There’s only so much nuance to geometry and color gradient to be layered on before a virtual space is an unintuitive psychedelic miasma.

People have favorites and will use them to extend them.

The Constitution offers an opening for copyright change. It protects works "for a limited time". Anything encumbered by copyright for an average person's entire life span is effectively copyrighted forever.


> I tend to think the important part is the end product: was someone able to express their vision to the highest possible level of fidelity? In that regard, every new technology gets us closer-- the person who couldn't draw a circle can take a photo and doctor it in the GIMP until he's happy. The next frontier for this is to say "I can use AI to generate 200 permutations, and then take the ones I like best and further hack on them to get what I want."

I don't think effort should be arbitrarily constrained to make the bar high, but I do not agree with you that the end product is all that matters.

What is important is that the product comes from people, because art helps society be healthier by existing as a means of communication from people to other people. Once AI begins to take over part of the process, it becomes like athletes who dope... it removes the essence of their achievement.

Art is more than just a single, isolated product for consumption. It exists to inspire, to change minds based on its ORIGINS, not just on its end results.

The problem with your argument is that you are ignoring the sociological effects of artistic creation, and focusing on the act of maximizing along a SINGLE variable: aesthetic value. That does art a disservice, and it means that the argument for or against AI must also be more subtle.

Besides, isn't the strategy of releasing AI just to see what happens a bit haphazard?


I guess I find it difficult to buy into the origins of art mattering too much because it often seems to turn into a game of academics arguing among each other over the interpretation of symbols. A particularly spiteful take on this would be that if your nuance is so delicate and obscure that a lay audience can't figure out your meaning, maybe it represents a failing in your ability to communicate.

I'm not sure where you got the "single variable" as "aesthetic value". Yes, it's possible that some people are trying to hit a specifically aesthetic goal, but others, their vision may be "the poster that finally convinces Aunt Frank to vote no on Proposition 23", or "the short story that expresses the feelings of loss I have for my beloved hamster", and the people trying to express these things were unable to create what they wanted from whole cloth.

At the end, it's still people making the decisions of what to release, whether directly, or by designing some sort of scoring mechanism to do it. It's similar to the curation choices in a museum collection-- they say something, even separate from the artifacts themselves.


>And I sincerely hope they are used against AI to make AI unprofitable.

No, they'll make AI unprofitable for small-time creators but not for massive corporations. The latter either already have rights to vast quantities of training data or will hire thousands of workers in Africa to create training data that is just legally different enough to count.


That is why we should halt AI completely and do a more thorough analysis of its societal-level implications before blindingly putting it out there.

Because when new technology is introduced, it becomes almost impossible to stop using it, due to the way our current society is set up (as a sensitive machine that is very quick to reward any gains in efficiency and economic output as opposed to sustainability).


Simply get every nation on earth to cooperate and ban a vaguely described technology that hundreds of billions of computers can run to varying degrees of efficiency! It's that easy!

If we can't get this level of cooperation for global warming, which is largely the result of a few dozen companies, what makes you think that governments across the world can stop everyone with access to a device with a reasonable amount of compute power? This idea is a non-starter and assumes that there is one single entity that could halt AI altogether. The genie is out of the bottle.


As I see it, for a country to disconnect from AI, it would need to go fully isolationist. Disconnect the internet fully from the rest of the world, block all mail, block all imports, disconnect all financial markets, etc. Otherwise it would simply become a consumer of the AI output of other nations. For example, if AI can predict stock markets better by efficiently parsing financial documents, then eventually foreign investors leveraging AI would dominate. That will work for a while, but eventually AI will get cheap and efficient enough to be easily hidden. So now you need the government to police for AI, search people, track everything they do, and so on. Criminals using AI will rise to power and prominence until stopped. Essentially prohibition or the war on drugs all over again.

edit: And of course, to better understand and deal with the AI threat, the government would be given exemptions to the laws. These exemptions would be used more and more widely by the government to exert power, while the population is not allowed to even look into what is possible.


Define AI.


This isn't a math quiz. For practical purposes, we can start with a list of technologies that are clearly harmful. Generative AI like ChatGPT, AI image generators, AI text generators that write something based on a prompt, etc. all halted.


Llama and Stable Diffusion have proliferated already. Many countries view this as an arms race. Propose a plausible way to put the AI toothpaste back in the tube, because this stance seems profoundly impractical.


Until when? The only way to avoid societal impact of AI is to either stop it for decades until we reach some utopia UBI state or lobotomize current generative AI to a point where it's useless.


Personally, I believe generative AI should be lobotomized as you said. After poring over all sorts of possibilities, I think no good will come of it.


Why would you want that? What is "clearly harmful" in your view?


Change scares people. This has been the case for every single technological invention including writing. Predicting the transformative impact of a technology is nearly impossible (or sci-fi writers would have a better hit ratio) but thinking about how it will break what we currently have (versus what will replace the current status-quo) is fairly easy. Accepting ones own limitations and inherent ignorance about the future is something many people very much do not want to accept.


What you’re describing requires an ability to copyright style on top of expression. That would be an unacceptable constraint on freedom of speech and artistic freedom in my view

The ability of an industry to turn a profit should not constrain the ability of the general public to communicate ideas. Expression is the only thing that should be copyrighted against


Even if we keep draconian copyright laws they need to be changed in some ways. Otherwise Disney will dominate by creating their own AIs and we still end up with a small faction controlling what we consume - and we will be paying them for the honor of it on top.


I agree. I was being somewhat facetious there. I don't really enjoy copyright laws the way they are, but I do believe that AI is the FAR greater evil.


Why would copyrightability of AI generated work affect its profitability? Owning a copyright gives you the exclusive right to copy and sell a work. If no one owns the copyright for a work, no one has the _exclusive_ right to copy and sell that work.


Well, then an AI-generated Hollywood movie could just be copied and distributed for free.


I'd assume it would only apply to the parts generated by AI, and not the whole movie. But of course you'd have to rely on the studios telling you which parts are generated, if they tell you at all.


Small time creators don't make money anyway. Starving artists and all that.

