AI is in danger of being swallowed up by copyright law (heathermeeker.com)
234 points by ceohockey60 on Jan 22, 2023 | 682 comments



There's no part of AI that is being swallowed up by copyright. AI companies can ask for permission if they want to train their models on other people's works. It's not that hard, various image hosting sites have already added an opt-in/opt-out toggle to their services. Sites might even get away with using this stuff as compensation for free hosting.

The fact of the matter is that the AI companies don't want to ask for permission, because people will say no. Or worse, ask for attribution or even payment. There is plenty of copyright-free/public-domain material out there, but what the customers of AI companies want isn't available under those terms.

The code to train an AI is not enough to make a product and these people have nothing to add themselves, so they take what others made and use that to make a profit. They can make or pay for their own paintings, their own pictures, their own music, but that would require putting in too much work or paying too much money.

It's very possible that a judge will rule that AI models do not violate copyright. If that is the case, I hope new legislation will correct that oversight very quickly.


> AI companies can ask for permission if they want to train their models on other people's works

Do you ask for permission when you train your mind on copyrighted books? Or observe paintings? Or listen to music? Do you ask for permission when you get new ideas from HN that aren't your own?

Humans are constantly ingesting gobs of "copyrighted" insights that they eventually remix into their own creations without necessarily reimbursing the original source(s) of their creativity.

Time to put the horse back in the barn, cars and trains are here.


> Do you ask for permission when you train your mind on copyrighted books? Or observe paintings? Or listen to music?

Yes, that’s exactly what happens when you buy a book, or pay for a music subscription. Or the work is in the public domain, in which case global permission to observe and copy the work is already granted.

> Do you ask for permission when you get new ideas from HN that aren't your own?

You don’t need to. It’s implicitly assumed, by virtue of publishing in a public forum, that the author is providing permission for people to read their comments and ideas, and remix them as they wish. That permission doesn’t include exact replication, but reading and understanding is assumed; otherwise, why did the author publish it?

> Humans are constantly ingesting gobs of "copyrighted" insights that they eventually remix into their own creations without necessarily reimbursing the original source(s) of their creativity.

Correct. Literally everything produced by a human is automatically copyrighted. But the manner in which a work is published creates implicit licenses for the public to consume those works. You publish in public, you automatically grant licenses for the public to consume and transform it.

If a human transforms an idea, it automatically becomes a new idea with its own copyright. The same doesn’t apply to AI, because they’re not human, and thus the law generally doesn’t recognise them as having the ability to create or transform ideas. If you believe AI can create and transform ideas, then you need to lobby for the law to recognise that ability, but right now, only natural humans have that ability according to the law.


> > Do you ask for permission when you train your mind on copyrighted books? Or observe paintings? Or listen to music?

> Yes, that’s exactly what happens when you buy a book, or pay for a music subscription. Or the work is in the public domain, in which case global permission to observe and copy the work is already granted.

You can buy a book, read it, sell the book, and then write and sell another book based on the ideas contained in the first book (Baker v. Selden). This is the cornerstone of contemporary copyright law. Or read the book on a shelf of a bookstore where the clerk is asleep. Or borrow the book from the library, or acquire it in any other manner where direct compensation of the author is nowhere to be seen.

Copyright is consistently interpreted in alignment with the needs of public learning, both by protecting the authorial incentive and by protecting the public's need for knowledge.


>You don’t need to. It’s implicitly assumed, by virtue of publishing in a public forum, that the author is providing permission for people to read their comments and ideas, and remix them as they wish. That permission doesn’t include exact replication, but reading and understanding is assumed; otherwise, why did the author publish it?

Following this logic, isn't training AI on Github or Deviantart 100% fair game then? It's not like OpenAI is infiltrating computers and reading hidden away data.


> Following this logic, isn't training AI on Github or Deviantart 100% fair game then?

Unlike forum comments, GitHub code generally has an explicit license attached which you'd have to respect - you know, for instance by giving attribution to every MIT-licensed source that was used.

And even then, let's say someone releases a book with all your HN comments: you are definitely entitled to sue them for copyright infringement. Here's some info from the BBS era, which is still relevant today: https://www.templetons.com/brad/copymyths.html


But you can still see the code and learn generally how to write code. Maybe you see a style of unit testing in a library, and decide to incorporate the techniques into your own code. This is not a copyright violation. It can't be, or all creative expression would be dead.


This does not universally hold, though.

For example, the "clean-room design" method of copying a work exists precisely to avoid potential copyright issues. One team reads the original work and writes a description in such a way that it cannot possibly be infringing, and a second team reads the description and creates the new work. This avoids any chance of someone reading the original work and incorporating potentially infringing aspects into the new work.


Don't be so sure of that: in music it's now established that getting too close to the "style" of another musician is a copyright violation: https://www.jdsupra.com/legalnews/what-s-going-on-another-ma...


And this ruling will prove to be a disaster for music creation, as there will be fewer copyright-free spaces for music as time goes on.

A similar ruling would also be a disaster for software, as our tools of expression are very restricted: code is based on Boolean algebra and predicate calculus, on practice guides like design patterns, and on books teaching algorithms and data structures.

There are lots of ways to write bad code and only a few for good, correct code. Recognizing this led me to replicating known working code, code I had created, for multiple employers. So whose copyright did I intentionally violate?

I think we are attacking the wrong problem WRT ML and copyright. To me, ML shows the foundation on which copyright is built is a lie. We should use ML to break copyright for code.


To hell with only focusing on code. We should break copyright completely and take back the intellectual landscape.


Important distinction - these lawsuits are about copying the "style" of a single song, not a musician's entire output.


Only if you assume that ML models should be treated as though they have the same rights and privileges as humans.

Personally I reject that. ML needs to be restricted heavily.


Why does it need to? It's just the agent of a human.

Sounds like an option instead of a need.


You're saying this as though we don't already have lots of regulations on tools to ensure that people use them appropriately.

Forklifts are "agents of humans" but you still need a license to drive one.

It's pretty obvious to me at least that AI bros are using these tools recklessly and inappropriately, without regard for licensing or copyright, and therefore I am proposing that the tools need to be regulated.

Simple as that.


>Yes, that’s exactly what happens when you buy a book, or pay for a music subscription. The work is in the public domain, then global permission to observe and copy the work is already granted.

When you buy a book, you’re not paying a licensing fee. You’re exchanging money for goods. Owning a copy of the work grants you very few rights, and they’re almost all to do with distribution. None of those rights is the right to read it.

>You publish in public, you automatically grant licenses for the public to consume and transform it.

By this interpretation, all the artists upset by Stable Diffusion have given tacit permission for their works to be used, as they published them in public. Even though those works are posted to websites, the artist has not granted any rights to the viewer of the work.

> only natural humans have that ability according to the law

The law is not explicit about this, and we have case law that describes non-human entities as having rights associated historically with personhood. This is definitely not clear, nor is it obvious.


> When you buy a book, you’re not paying a licensing fee. You’re exchanging money for goods. Owning a copy of the work grants you very few rights, and they’re almost all to do with distribution. None of those rights is the right to read it.

You are absolutely buying a license to read the material when you purchase a book. That's why books cost more than the paper they're printed on and why pirated books are illegal. The "distribution" rights you refer to stem from the "first sale" doctrine[0], which acknowledges that the first sale (e.g., you buying a new copy of a book) of a physical object embodying a copyrighted work grants limited distribution rights.

[0]: https://en.m.wikipedia.org/wiki/First-sale_doctrine


When you buy a book or some other artwork, it is implicitly assumed you will put it in your brain, or your meat neural net. And that your brain could produce something related to this content.

It's not just assumed, it's celebrated when a work of art gathers fans who produce their own, inspired content.

Not sure why it needs to be over-complicated or different for silicon neural nets. But I think it will get very over-complicated, if not politicised, in the following years.


> When you buy a book or some other artwork, it is implicitly assumed you will put it in your brain, or your meat neural net. And that your brain could produce something related to this content.

It is implied that if you are using the work by yourself, or via a tool you made yourself, it's fine.

However, works that you redistribute, whether by copying them yourself or indirectly via tools (said silicon neural nets being one example), instead require a "wide redistribution license agreement", and those are implicitly limited by default unless the work is put under a sort of public-domain license.


Trademark and copyright already restrict what humans can do with others' work. If too similar, then it could prompt legal action.


The same could be applied to AI-generated work.


> You publish in public, you automatically grant licenses for the public to consume and transform it.

No you don’t. That would fall under the category of “derivative work”, which is still the intellectual property of the original author under most jurisdictions’ copyright laws.

https://en.m.wikipedia.org/wiki/Derivative_work


Unless the resulting "derivative work" is sufficiently transformative. Which, I would argue, training an AI/ML model is.

Therefore, using a training dataset does not constitute copyright violation.

If the AI outputted an exact copy (or a close enough copy, such that a layman would agree it's a copy), then that particular instance of the AI's output is in violation of copyright. The AI model itself doesn't violate any copyright.


> Which, I would argue, training an AI/ML model is.

> Therefore, using a training dataset does not constitute copyright violation.

It's not for you to decide that. Different jurisdictions will have their own process for deciding that and none of them are based on the opinions of random commentators on internet message boards.

Also, please bear in mind my comment was a reply to a specific statement (repeated below) and not talking about AI in general:

> You publish in public, you automatically grant licenses for the public to consume and transform it.

^ this statement is not correct for the reasons I posted. AI discussions might add colour to the debate but it doesn't alter the incorrectness of the above statement.

> If the AI outputted an exact copy (or a close enough copy, such that a layman would agree it's a copy), then that particular instance of the AI's output is in violation of copyright. The AI model itself doesn't violate any copyright.

That assumption needs testing in courts.

As I've posted elsewhere, there have been plenty of cases where copyright holders have successfully sued other creators over new works that bore a resemblance to existing works. It happens all the time. I remember reading a story about how a newly successful author was being handed ideas from fans during a book signing, only for one of her representatives to intercept them each time. When they later asked why the representative took them, the representative said "it's because if any of your future books follow a similar idea, that fan could sue. But if we can prove you haven't read the idea then the fan has no claim" (to paraphrase).

Experts don't all agree on where the line is with similar works created by humans, let alone the implications of copyrighted content being used as training data for computers. And this is true for every jurisdiction I've researched. So to have random people on HN talk as confidently as they do about this all being perfectly legal is rather preposterous. You don't even fully grasp the intricacies of copyright law in your own jurisdiction, let alone the wider world. In fact, this is such a blurred line that I wouldn't be surprised if some cases would get different rulings in different courts within the same jurisdiction. It's definitely not as clear cut as you allude to.


Not literally everything this comment says about copyright is false, but it does pack an impressive number of errors into a comparatively small space.


> If you believe AI can create and transform ideas, then you need to lobby for the law to recognise that ability, but right now, only natural humans have that ability according to the law

My experience with ML tools like co-pilot is why I reject copyright claims on ML systems. They are a tool that generates original work based on my instructions, not unlike a paintbrush, Photoshop, or a CNC machine. My instructions were based on my exposure to copyrighted works.

I use co-pilot as an accessibility device enabling me to write code again. Like with speech recognition, co-pilot is a force multiplier IF you change how you work. If you keep using the habits formed by typing, you will get shit results.

The end result of the shift in how I work is now I know how to tell co-pilot how to write code in my style. My co-pilot generated code is no less my code than what I generate by hand. Co-pilot acts as an extension of my brain, not my fingers.

Is my co-pilot generated code copyrightable? I say yes because it is the result of this human's creation and instruction.


>> Do you ask for permission when you get new ideas from HN that aren't your own?

>>You don’t need to. It’s implicitly assumed, by virtue of publishing in a public forum, that the author is providing permission for people to read their comments and ideas, and remix them as they wish.

Ideas are not eligible for copyright protection.


>Yes, that’s exactly what happens when you buy a book, or pay for a music subscription. Or the work is in the public domain, in which case global permission to observe and copy the work is already granted.

Libraries exist.


If an AI is not a human (I agree), it's a tool that a human or company created. If it's a tool, the product belongs to the person who owns or uses (which is an important distinction, but not for this case) the tool. Ownership of the product can then be transferred to a new owner through whatever legal means.

If we agree on this, what we mostly need to resolve is to what extent a human should not be allowed to use publicly available data to make this tool, when he is allowed to use publicly available data to make anything else.


> Do you ask for permission when you train your mind on copyrighted books? Or observe paintings? Or listen to music?

Plenty of people have been successfully sued if their work is too similar to existing content.

This isn’t a new concept that AI is throwing into contention, it’s literally just companies trying to side step copyright law because of “disruption”.

Source: I work for a company in this field and we do gain permission from creators before training our models on their content. It’s very possible to operate this way but a lot of companies simply choose not to.


Exactly. Somehow it is so 'hard' for many AI companies to ask for permission to use and monetize copyrighted images in the training set these days. Instead of asking permission, they attempt to bypass copyright law, and AI bros give out the usual useless excuses: 'but muh fair use tho', 'oh well, genie's out of the bottle, it's too late', 'oops, cat's out of the bag, but you can opt out now'.

Little do some of them know that OpenAI was able to get permission from Shutterstock via a partnership to use their copyrighted images in the training set for DALL-E 2. [0] There is also a reason why Dance Diffusion was trained only on public domain music and on copyrighted music for which they had actual permission from the authors. [1] If they did otherwise and monetized copyrighted music without permission from musicians or record labels, they would be sued into the ground.

With the recent cases of Getty, Shutterstock, and even as admitted by the CEO of Stability themselves [2], the way forward for using copyrighted images in a training set for commercial purposes is via licensing. Neither Getty nor Shutterstock is looking to ban it, despite the AI bros claiming that these companies are trying to.

If not, just train only on public domain images to avoid these legal issues.

[0] https://www.shutterstock.com/press/20435

[1] https://techcrunch.com/2022/10/07/ai-music-generator-dance-d...

[2] https://twitter.com/EMostaque/status/1603390169192833027


Can we maybe drop the whole sneery "AI bro" lingo? HN is not the right place for that.[0]

[0] https://news.ycombinator.com/newsguidelines.html


> HN is not the right place for that.

Where in the guidelines does it mention that one cannot say 'tech bro, finance bro, pharma bro, and more recently and most actively the crypto bro'? These have been there for years despite the guidelines existing.

Me saying 'AI bros' is no different. Given it is fine to mention the tech bros, finance bros, crypto bros and the others, then it is also fine to say 'AI bros'.


> “Do you ask for permission when you train your mind on copyrighted books?”

I’m not able to read billions of books in less than an hour.

Even if we agree that machine learning is like human learning, scale commonly matters in law.


> scale commonly matters in law

I am not a lawyer, so the following is only my opinion.

Scale matters, but so does the legality of the thing that scales.

Reading two dozen books by other authors, or studying hundreds of artworks, or visiting the museum of awesome statues every week, in order to get inspired for one's own novel/painting/sculpture, isn't illegal.

So a lawsuit will have a really hard time arguing that it somehow is a problem if it's two dozen billion books/paintings/sculptures. Because such a lawsuit would suddenly need to explain why the smaller scale is also problematic, only less so. And given that this is basically how art has worked ever since the first human had the idea to paint pictures on a cave wall, that's a hard sell.


However, AI learning is not the same as a person learning, in the same way that memorizing a book is not the same as putting it into computer memory. Nobody would sue you for copyright infringement if you memorized a book, song or movie in your head. But the issue is a completely different matter if you make a copy on a hard drive.


> But the issue is a completely different matter if you make a copy on a hard drive.

But there are no copies. For example, the LAION-2B training data is a total of 240 TB. The pruned SD model based on this dataset is less than 5 GB.

The data isn't copied into the models, it is used to teach the models, letting them learn patterns in the dataset.
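
To make the scale concrete, here is a back-of-the-envelope sketch in Python (the ~2.3 billion image-text pairs figure for LAION-2B is my assumption based on commonly cited numbers; the 240 TB and 5 GB figures are from above):

    # How many bytes per training image could the model possibly retain,
    # even if it were trying to memorize? The image count (~2.3 billion
    # pairs in LAION-2B) is an assumption; the sizes are from above.
    dataset_bytes = 240e12   # ~240 TB of raw training data
    model_bytes = 5e9        # pruned Stable Diffusion model, < 5 GB
    num_images = 2.3e9       # approximate number of LAION-2B image-text pairs

    print(f"compression ratio: {dataset_bytes / model_bytes:,.0f}x")    # 48,000x
    print(f"bytes per training image: {model_bytes / num_images:.2f}")  # ~2.17

At roughly two bytes of model weight per training image, verbatim storage of the dataset is simply not possible.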


You are just anthropomorphizing the model by calling it teaching, and then implicitly asserting that it's the same thing happening in the human mind. That's your burden to establish when you say it's the same.


Actually, isn't it the burden of the plaintiffs to prove their copyrights are being violated?

Whether or not it's identical to human brains isn't the matter, they'd need to prove how a small 5GB model trained from a huge dataset infringes their rights specifically.


Well if it's their defense they need to substantiate it. Obviously the plaintiff has a theory, they filed the case.


I am not anthropomorphizing anything, because this is literally what happens. The model is taught, by having its predictions tested against examples, how images work.


You are; that's not what teaching means, and it's not what learning means.


Okay, then what do these 2 terms mean?


Okay, but can you prove that?

I have a 1.9 GB mp4 file on my harddrive. It contains 2 hours and 15 minutes of 1080p video data at 24 fps. Assuming it was generated from 4096x2160 16-bit color depth source material, the "training data" was 10.32 TB. I bet I could even get a similar size reduction as LAION-2b if I recompressed it to 720p.

Could I not also claim that I created an advanced AI model, which did not copy but learned patterns in the dataset? Modern video compression algorithms are getting quite complicated, after all.

I think no reasonable person would agree with this, but can you prove that the AI model is doing something substantially different?
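
For what it's worth, the arithmetic in the analogy holds up (reading "16-bit color depth" as 16 bits per channel across 3 channels, which is what reproduces the 10.32 TB figure). A quick sanity check:

    # Sanity-checking the numbers in the analogy above.
    seconds = (2 * 60 + 15) * 60           # 2 h 15 min of video
    frames = seconds * 24                  # at 24 fps -> 194,400 frames

    # Raw source: 4096 x 2160 pixels, 3 channels, 16 bits (2 bytes) each.
    bytes_per_frame = 4096 * 2160 * 3 * 2  # ~53 MB per frame
    raw_bytes = frames * bytes_per_frame
    print(raw_bytes / 1e12)                # -> ~10.32 TB
    print(raw_bytes / 1.9e9)               # -> ~5400x reduction to the mp4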


> Could I not also claim that I created an advanced AI model, which did not copy but learned patterns in the dataset?

Such patterns would enable the video file to decode into a multitude of pictures not originally in the training data. Obviously, a video file cannot do that...it's just compressed data.

Generative models however can generate things that are not in its training set.

And of course, there is a fundamental difference in the source data between compressed video and a generative model: video codecs work with a sorted sequence of images, where most images are slight variations of the ones before them. The training for generative AI doesn't have these properties, the input is not an ordered sequence, and even similar pictures are not sequential variations of one another.


To expand a bit on this:

Relying solely on "uncompressed" size does not a really good metric make (this is analogous to the raw input size of the LAION dataset): one could make a reasonable argument that there are not billions (1) of image-pairs that are effectively identical up to a minute shift. I would posit the correct basis would be the Shannon entropy of the "best fit" ordering (minimizing inter-frame diff) versus the lossy-compressed video, and a similar "best fit" ordering for the LAION dataset vs. the model.

My suspicion is that one will find that the relative number of "smooth transition" pairs in LAION viz the whole will be very different from the video.

-------

(1) - Napkin math: There are about 194400 frames, so ~37 billion (37791165600) frame-pairs. Assuming you have runs of about 1 second between hard cuts throughout, so an incidence rate of 1/24 for non-smooth transitions, gives us about ~36 billion "smooth transition" frame-pairs. I think it is safe to assume "on the order of" 1 billion, then. This ignores long "action" scenes with significant variance in images throughout, but also ignores longer-than-1-second slower scenes, hence the order-of-magnitude shrink in the assumption as buffer.
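
A sketch of that napkin math as stated (reproducing the comment's assumptions, not endorsing them):

    # Reproducing the napkin math above.
    frames = 194_400                 # 2 h 15 min at 24 fps
    pairs = frames * (frames - 1)    # ordered frame-pairs
    print(f"{pairs:,}")              # -> 37,791,165,600 (~37 billion)

    # Assume hard cuts roughly once per second, i.e. ~1/24 of transitions
    # are non-smooth, so ~23/24 of pairs count as "smooth".
    print(f"{pairs * 23 / 24 / 1e9:.0f} billion")   # -> 36 billion

    # The order-of-magnitude shrink to ~1 billion is the buffer described
    # above for high-variance "action" scenes.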


> Nobody would sue you for copyright infringement if you memorized a book, song or movie in your head.

No, but what if you then produce "your own" rendition, or "remix", of that book, song or movie and offer it to the public? E.g. if you memorize a collection of Taylor Swift's latest songs and then start performing a medley of her hits in your local clubs, you may well find yourself in trouble.


It is superior, but it is absolutely still learning.

ML can do either facsimile or imitation far better than a human mind can.

You seem to be conflating both things and suggesting that ML only does facsimile, which is where the potential legal problems are.


> It is superior, but it is absolutely still learning

No, it is not. Memorization != understanding.

I can teach a parrot to spew the times table; good luck getting it to understand how to apply it.

And a parrot is billions upon billions of times more capable than any current AI algos.


You’re talking about facsimile, which is a product of memorization, which is a type of learning. And it is not the only type of learning AI is capable of.

AI, like humans, is capable of both imitation and facsimile. It is far superior at both feats.

Your fallacy is that you are noticing AI is superior at facsimile and erroneously assuming it is “not learning”. You are also ignoring the other amazing learning feats of imitation in front of you.

Parrots are lovely animals, but it’s unclear what you think you’ve accomplished by bringing them up. The fact that they are capable of more than just memorization does not differentiate them from advanced AI models, which are rapidly gaining all sorts of abilities.

Polly the parrot would have a hard time producing a picture of Elmo with a light saber in a Superman costume riding a dragon on the moon in the style of Rembrandt (in under 300ms, at least). I also know a parrot couldn’t write a 500 word story about the image.


You don't need to "understand" art to create it though. Good art, sure, maybe, but not really. Plenty of brilliant musicians who know fuck all about music theory, but they can crank out tunes.

Diffusion would appear to me to work in much the same way. It doesn't understand what's good ("works", creates acceptable output) or why, but it knows it when it sees it, and has the tools to refine it.


> And a parrot is billions upon billions of times more capable than any current AI algos.

I am pretty sure that GPT-3 is a lot more capable than a parrot at transpiling a function written in Python to Golang, or at writing a summary of a tech-magazine article.

Same as how Stable Diffusion is a lot more capable than me at drawing, painting and generally making up pretty pictures.


You don't need to understand something to go beyond memorization and copies.


> in order to get inspired

But that never, ever, ever happens with current AI. It is not AGI. It is inspired by nothing, has no creativity, nada, zilch.


You make it sound like we should assume that getting inspiration from a few hundred or thousand art works that are very famous and highly public is the same as training over nearly every available public piece of art. I see no reason why that should be our null hypothesis.

Humans either learn art by being natural art geniuses, or by receiving instruction and learning through an iterative process (where, again, they might create thousands of art works, but nowhere near the scale here), which is very different.


1. An AI has a training set of one image and produces an exact replica.

2. An AI has a training set of every image in the world and produces an entirely unique work.

Which do you have more of a problem with?


#2 is pretty interesting. After all, the purpose of copyright law is to encourage creative works. If we have machines that can generate creative works on demand with little effort, what purpose does copyright law serve?


I am not a lawyer, so the following is only my opinion.

Copyright law serves to protect people's works from unauthorized parties making copies of them and profiting off them.

It doesn't protect from technology making the production of new works cheaper, faster, more efficient. An artist using photoshop can be, and is allowed to be, many times faster than one using oil and canvas.


That's a false dichotomy.


I don't mean to suggest those are the only options, or that either one is even practical. The point is to determine where the objection lies: with the method or the outcome.


Hm, okay. In that case, I would say the first is a problem, and the second is also, but differently.

I think the objectionable thing about the second is that the AI knows everyone's styles, and so can use them in creating something new. Even if the AI is restricted so it can't paint an image in a certain artist's style (as the new version of Stable Diffusion is, for instance) and the art is unique, I think part of the problem is that the AI is still (presumably) leaning on the collective styles of everyone it has trained over.

If we can train an AI over a small dataset, or maybe even a large dataset of old art, or some mix in between, and then maybe fine-tune it with a small sampling of modern art, then I believe it would be unobjectionable, as this is largely how humans do it.


> which is very different.

Why is it different?

The only difference that matters is scale. And again, if I want to argue that something done 10000000000 times is legally problematic, I have to be prepared to explain why doing it 10 times is problematic as well, only less so.


Would you give me a second of your time? What about 1 billion seconds of your time?

Could I have a dollar? What about a billion dollars?

The burden isn't on me to explain why something being scaled up by a billion is not the same.


The question isn't if a scaled up thing is the same, the question is if a legal thing scaled up suddenly becomes illegal for no other reason than being scaled up.


Nothing new. People differentiate between genocide and murder, for example, or poisoning water supply vs an individual poisoning. Criminal law in quite a few places definitely has scale considerations.


You just gave two examples of where both ends of the scale are illegal, which only strengthens the argument of GP. IANAL, and I'm not stating anything about the reality of the judicial system, but only following the logic of the argument.


My examples were only meant to illustrate that scale is a well-known "thing" in legal systems, and I happened to pick things with two illegal endpoints (IANAL). You could look at other things, like the need for permits for certain activities as a function of size and use, if you want simple examples of scale mattering and legal endpoint(s).


I am no lawyer, so the following is only my opinion.

I am completely aware that scale is a "thing" in legal systems. But as I said before: For scale to be important, the unscaled act in itself has to be problematic already.


IANAL, but I'm not sure I necessarily agree with that. Why then worry about monopolies or cartels? Price setting is just "at scale" there, for example.

Granted, things around power concentration have deep philosophical and social roots.


I recently worked on information extraction from 10K documents. GPT-3 needs about 7 days of operation in batch mode on one thread: it takes 40-70s to read one single document and report the extracted data. One MINUTE per page.

But I think you meant GPT-3 has seen many books during training, not during inference. You should know that training on millions of books is not the only way GPT-3 learns. It is just the foundation of its knowledge.

GPT-3 learns "in-context": that means it can learn a new word or a new task at first sight. It just needs a description or a few examples. This is the most powerful feature of GPT-3 - in-context learning. And when it comes to ICL, it is much like humans - it only sees a few examples, not millions of books.
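
To illustrate, in-context learning just means putting the examples in the prompt itself; the model picks up the pattern at inference time, with no weight updates. A hypothetical few-shot prompt (the made-up word and examples here are mine):

    # A hypothetical few-shot prompt illustrating in-context learning.
    # The made-up word "flurb" never appeared in training; the model can
    # still infer its usage pattern from the examples in the prompt.
    prompt = """\
    A "flurb" is a quick spin performed while juggling.
    Example: "The performer flurbed twice and the crowd cheered."

    Use the word "flurb" in a new sentence:
    """
    # Sent to a large language model, the completion would typically be
    # something like: "She flurbed gracefully without dropping a ball."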

> “Do you ask for permission when you train your mind on copyrighted books?”

The nature of ICL is that it happens at prediction time. So GPT-3 would have to explicitly be instructed to learn a specific skill. Should it reject instructions if they are sourced from copyrighted books?


> “You should know that training on millions of books is not the only way GPT-3 learns. It is just the foundation of its knowledge.”

I’m not a lawyer, but to me it seems within the realm of possibility that a U.S. court eventually finds strongly in favor of the copyright holders, the Supreme Court agrees (because Big Tech has so few friends left), and OpenAI will be required to destroy the GPT-3 model and all copies of the training data because they can’t filter out copyrighted works.


Yeah, I think "we can't actually tell which bits of our model are derived from your work, which we copied without your permission onto our system for training purposes" probably makes AI companies more vulnerable rather than less. Other platforms like search and social media have successfully defended themselves by showing willingness to promptly remove copyrighted material, letting copyright holders opt out of indexing, or even negotiating schemes like ContentID so copyright holders get paid each time their work is used.


Your reply is mostly a red herring because as you yourself said, the OP was talking about regular training data, not the ICL you focused on.

Just because you can find one way that GPT might be slow doesn't invalidate the point that its training does use massive amounts of data.


If GPT-3 can learn in context it means both the training set and the prompt could be in copyright violation. So even a clean model, trained on licensed data, cannot guarantee there will be no copyright issue.


So what if it does?


I think the Aereo case is an interesting precedent [0] [1]. An individual DVR-ing over-the-air broadcasts with an antenna was fine. A corporation DVR-ing over-the-air-broadcasts for thousands of customers by using thousands of tiny antennas was not fine.

[0]: https://en.wikipedia.org/wiki/American_Broadcasting_Cos.,_In....

[1]: https://www.vox.com/2018/11/7/18073200/aereo


Is this about jealousy then?

Or it’s just too fast so let’s stop it?


It's about a law designed by humans to give everyone a chance to make a living and contribute to the common good. You think you have found a loophole in that law that lets you use that work for free and deny authors any compensation.

Bear in mind, AI is not making artists or creative types obsolete - that would be fair game, just like computers made human calculators obsolete. No, this is about abusing other people's work.


Copyright never guaranteed anyone compensation, nor is it a loophole to not pay for copyrighted work just because you saw and learned from something in the public domain.

If using someone else work for learning is infringement, then that's going to cause a lot of difficulty for all artists. Try making a rock song without listening to rock, or paint some modern art without viewing it etc.


Copyright exists to protect human authors and promote creation, which in turn leads to learning - from other human beings. Algorithms are not learning; they are automated tools which are consuming creative works and outputting derivative works.

The loophole is to use copyrighted works for free despite no learning taking place - no human being observing and developing their skills based on that work - rather, an algorithm transforming those works into some other useful interpretation of them.


I’m not seeing a salient argument here. Something about an allegation of abuse?

Where is the abuse happening?


> I’m not able to read billions of books in less than an hour.

I think you underestimate the sheer volume of data + conclusions the brain ingests and processes on a daily basis, primarily through unconscious experience.


The training set for GPT-3 is about 500e9 tokens; any given synapse in a human in their lifetime is going to fire about 2e9s * (10% * {100Hz to 1000Hz}) = 20e9 to 200e9 times.
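
Spelling that estimate out (2e9 seconds is roughly a 63-year lifetime; the 10% duty cycle and the 100-1000 Hz firing-rate range are the comment's assumptions):

    # Unpacking the synapse estimate above. Lifetime, duty cycle and
    # firing rates are the rough assumptions stated in the comment.
    lifetime_s = 2e9          # ~63 years in seconds
    duty_cycle = 0.10         # synapse active ~10% of the time

    for rate_hz in (100, 1000):
        print(f"{rate_hz} Hz -> {lifetime_s * duty_cycle * rate_hz:.0e} firings")
    # -> 100 Hz: 2e+10 (20e9); 1000 Hz: 2e+11 (200e9)
    # Compare: GPT-3's training set is ~5e+11 (500e9) tokens.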


It sounds like you agree with them, since there are a lot of synapses.


Au contraire.

While our brains are more complex than the networks, this has never been in dispute.

The quantity of experiences needed to train GPT-3, however, is many more than we are capable of experiencing in a lifetime.


GPT-3 is learning from ridiculously dense data, though.

Also, about volume of data processed by the brain: https://gwern.net/Differences


Not to mention the millions of years building our nervous system.


Hundreds of millions of years.


> Do you ask for permission when you train your mind on copyrighted books? Or observe paintings? Or listen to music?

The difference is that I buy books, pay to visit museums, and buy music in several formats, or pay by accepting advertisements between songs.

It is expected that if I buy a book I will be allowed to read it without asking for permission.

What I don't do is copy-paste paragraphs of other books to write a new book and claim it is mine. That is a different situation.


If you pirate a book, learn from it and then create something using the information you learned, would that creation constitute copyright infringement? If so how far does the tainting go? Once you put your eyes on something which you haven't purchased, all future works could potentially be inspired by that experience and should therefore be considered infringement, following your logic.


> would that creation constitute copyright infringement?

No, obviously not, as that would clearly be unworkable and ridiculous. Mainly because we have a very unclear understanding of human creativity, and there’s no way to analyse an individual’s mind to understand how they created an idea. Additionally, copyright’s reach generally stops at the point of “transformation”: once you take any idea and transform it “enough”, it’s considered a new idea.

The reason none of the above applies to AI is simply because we’ve declared that only humans can transform and produce new ideas. AIs aren’t human, thus they’re not afforded the same rights. Arguing about whether there’s an inherent difference between AI creations and human creations is pointless; the law doesn’t care, it has already declared that there’s a difference between AI and human.

If you disagree with that declaration, then you need to lobby to change the law. But until the change occurs, your beliefs are meaningless in the eyes of the law.


You see how you’re introducing a viewpoint and assuming it’s true — you’re just saying “humans can create things” and “ai can’t create things”. You don’t even address the possibility that the AI itself is a tool of the human who created it to create things.


I think in order to justify "the AI is a tool that a human is using, just like a paintbrush", you would have to define what meaningful creative process the human has followed while using the tool.

In my opinion, things like selecting a training dataset and then writing prompts are not creative processes; they are mechanical processes. Input in, output out, with barely any interaction from the human.

Consider when you commission an artist to make a painting. You give them a "prompt" by explaining what you want. Maybe you even give a "training dataset", a few examples similar to the look and feel of the result you want.

Then they go off and make something. They show you in process stuff and you make suggestions so the next version they show you is closer to what you want. This repeats until you are both happy. Then you own the drawing. Because you paid them for it.

In this case however it's absolutely clear that you did not create the work. You had input into the creation, but the artist was not a tool you are using to realize your own creative vision. They are the creator, you're a customer for them.


> In my opinion, things like selecting a training dataset and then writing prompts are not creative processes; they are mechanical processes. Input in, output out, with barely any interaction from the human.

When it's bleeding edge research, there is a ton of human creativity involved in developing the product and engineering the dataset.


Even if that were true, which I am kind of doubtful of, most applications of these tools are not going to be bleeding edge research and should not be treated as though they were.


The reason why you put your eyes on something is probably that someone had the hope of selling it to you. Or that someone paid for it on your behalf. The difference is that machine learning algorithms never (or rarely) leave a single penny in their training set creators’ pockets, turning “no income” into the default outcome.

Copyright law is not about a logically perfect system, but about creating a general environment in which artistic, academic and other creations can appear and benefit the general population.


> Copyright law is not about a logically perfect system, but about creating a general environment in which artistic, academic and other creations can appear and benefit the general population.

Yes ... and because it's not a logically perfect system, its lifetime has to be limited. One day we should abolish copyright and find a better, more functional way to drive progress.


Copyright at its heart is fine. The original objectives (allowing people to hold a short-term monopoly on their ideas so they can fund further ideas) and the manner in which they’re achieved are perfectly fine.

Where it goes wrong is when individuals and corporations believe that such monopolies should be indefinite, and push the monopolies beyond the lifetime of the author. A dead author can’t produce new works, so it’s not clear how allowing such long monopolies increases the amount of creative work produced.

The original primary objective of copyright was to create an environment that could produce an endless supply of public works, freely available to all. It’s only the abuses of copyright over the past 50 years that have destroyed that objective, and ironically it’s copyright holders like Disney that are really starting to suffer the consequences.

Winding back copyright durations to better balance the public and private interests would go a long way to resolving many of our issues with copyright today.

> Yes ... and because it's not a logically perfect system, its lifetime has to be limited.

It’s also worth pointing out that no system of law is “perfectly logical”. It’s almost certainly impossible to produce a perfectly logical system, because humans are inherently illogical, and binding them into a perfectly logical system of law would almost certainly produce more injustices.


It's really not. Economics is very simple at its core. You tax negative externalities and subsidize positive externalities. The discovery of new information is a positive externality. It should be subsidized.

Anything that has infinite supply and zero marginal cost, as Nobel Prize-winning economist Samuelson argued when looking at the problem through the lens of lighthouses[0], should be free to all. By using copyright to make it a monopoly and allowing the extraction of monopoly rents, you drastically reduce the value and reach of the thing that was discovered. Copyright is a hack, and this hack is now fundamentally breaking. Instead of trying to save the hack, we need a full rewrite. If winding back the duration of copyright is correct, the best winding back is to zero.

As we are a remix culture where idea A and idea B combine to create idea C, we drastically reduce the innovation in our economy through reduced discoveries. This failure ends up with large monopoly holders consolidating into bigger and bigger entities in order to right some of this failure, but that only makes the monopoly extraction worse.

The discoverer should be subsidized for the discovery of that information, but it should immediately go to the public domain. How you work out what that subsidy should be is just as abstract as how Spotify works out what each play costs. It is no doubt monstrously complex to figure out the dollar value of some discovery, but it is the economically correct path. Copyright isn't.

[0]: https://courses.cit.cornell.edu/econ335/out/lighthouse.pdf - page 359, first paragraph


If you pirate a book, lossily compress it into a 14-byte content description vector, decompress it into a book that fundamentally contains at most 14 bytes' worth of information, and then subsequently sell it, that may still be piracy, depending in part on how good that 14-byte representation is.


We are not dealing with AGI here. Current trained models are basically just using copy and paste, to create "new" works.

There is zero creativity, zero art, zero original thought, zero newness.

When actual AGI happens, then your arguments mean something. Such as in, at least 50 or 100 years down the road.


Your statement here is simply ignorance. Generative models do not only provide “copy paste”; they can interpolate and extrapolate from training data. When a human sees a bunch of ideas and mixes them up to produce something slightly different, it doesn’t bother anyone. But when a human creates an AI and uses it to do something similar, suddenly it’s a problem. I think the burden of proof is on the laypeople here who keep whining about how AI is just copy-paste (which is simply not true, and such a crude simplification that it wouldn’t even pass an ELI5 truthfulness test). I’m sure this will get plenty of downvotes.


> When a human sees a bunch of ideas and mixes them up to produce something slightly different it doesn’t bother anyone.

Try to make a car using the best ideas developed by each car maker. You will be surprised.


We're not dealing with AGI, but modern models absolutely demonstrate creativity and newness. If it were simple copy-paste, we wouldn't be having this whole conversation, since it would be simple for copyright owners to sue and win in cases of infringement.


Even if what you say is true, the person using it is still asking for what they want. Some of these prompts get unique enough that it's unlikely somebody else is going to make another one like it.


There is no copy-pasting in diffusion models. All there is is searching for very probable regions in space [1].

[1] https://news.ycombinator.com/item?id=34378500
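
To sketch what "searching for probable regions" means: generation starts from pure noise and is iteratively nudged toward high-probability regions by a learned denoiser. A toy caricature (not any real model's code; predict_noise stands in for the trained network):

    import numpy as np

    def predict_noise(x, t):
        # Placeholder for the learned denoiser epsilon(x, t).
        return np.zeros_like(x)

    def sample(shape=(64, 64, 3), steps=50):
        x = np.random.randn(*shape)       # start from pure Gaussian noise
        for t in reversed(range(steps)):
            x = x - 0.1 * predict_noise(x, t)        # step toward the data manifold
            if t > 0:
                x += 0.05 * np.random.randn(*shape)  # keep some randomness
        return x

    # An image-shaped array emerges; nothing in this loop looks up or
    # copies any stored training picture.
    image = sample()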


The AI doesn’t understand what it consumes. That is why the models still can’t add two numbers.


What's the difference between acting as if it actually understands and "true understanding"? I'd argue there is none, or at least that it doesn't matter. For instance, there is nothing you could do to prove to me that you aren't just a black box acting on input in a sophisticated manner (e.g. the Chinese room argument[1]), yet I give you the benefit of the doubt. GPT's lack of understanding of math may be a localized lack of understanding, where it does "understand" other topics. I don't require you to display an understanding of quantum physics in order to prove that you're able to understand anything at all.

[1]: https://en.wikipedia.org/wiki/Chinese_room


I'm sure this topic is already the subject of much discussion, but from the sessions I've had with ChatGPT, it's quite obvious it doesn't really "understand" very much in the way humans do. At best it seems to understand what question you want an answer to, but it often fails miserably even in simple cases (try asking it how many letters certain words have, or to give examples of words ending in a particular letter etc.). But sure, eventually it may overcome those cases and make an excellent mimic of an intelligence with understanding. If it's genuinely able to produce output accurately emulating all the sorts of logical reasoning humans can do then it may well be impossible to distinguish it from "the real thing" (whatever that actually is...)


Humans understand and there is just no comparison. Can an AI make a major novel discovery as humans have? How could they if they don’t understand language?


Presumably AI trainers aren't hacking into Amazon's servers to steal their copyrighted ebook files. All the data they use is publicly available to view by the AI, just as a human might view it. So I don't think your distinction is accurate. I think the question is to what degree AI systems are "inspired" by the content they are trained on, versus merely regurgitating it. To be honest, it's hard to draw that line even for people's creative work, never mind that of machines.


Getty and Shutterstock images are watermarked and copyrighted so as not to be reproduced, so, no, they are not confining themselves to what copyright permits. Also, take the 'AI' currently used to sell art on t-shirts, for example: if you specify yellow eyes, it will show you examples of yellow eyes and you choose the one you want. The AI did not produce those samples. The AI did not learn to make pictures of yellow eyes. The AI produced samples of yellow eyes from its repository of collected images, which it did not make itself, nor ask permission to use, nor pay for.


> Do you ask for permission when you train your mind on copyrighted books

yes, that's why I pay a fee to buy/borrow one (or someone pays the fee, in the case of a library.)

> listen to music

again, money is exchanged.

> Humans are constantly ingesting gobs of "copyrighted" insights that they eventually remix into their own creations without necessarily reimbursing the original source(s) of their creativity.

yes, and so long as they are not derived works, it's not a problem.

Copyright is there to allow you and me to develop things and make money from it. It is there to stop people stealing our work, which may have taken years to develop and sell it for a profit with none of the risk.

Large corporations have abused this to make monster profits.

Google have spent billions to try and persuade us that copyright is evil, because they didn't want to pay content producers to host their work (i.e. music and movies on YouTube, and local news sites).

The issue is this: I might have made a website that tells users how to make a specific type of metalwork. I have a free ebook, and I run courses. I have spent many years perfecting the art, creating the tutoring content, and recording videos. It's advertising-supported, and people are asked to consider buying a course, to support the creator.

The AI company comes along, scrapes all the content, and allows people to regurgitate it with more or less accuracy.

The creator now gets less traffic, less money, and now can't afford to create more content.

The AI people now skim all the money, and the consumer gets less useful information.

Culture isn't free. Someone is paying for it, and if you stop paying them, then it doesn't get created.


You don't need to pay for or "borrow" anything to learn from copyrighted works. Nobody has had that expectation for years, and that is also not what copyright pertains to. It's not that AI breaks into libraries and isn't paying the fees. You can google an image of any great work of art, look at it for as long as you like for free, take from it what you can, use all of that to create something else, and get paid for that thing. The stuff that AI uses to learn is available to you, as a human. That is not being challenged in the least.

As parent said: Everything is derived work. We are remix machines. It is how we learn and how we make money. Now with AI, apparently, we are offended, when something does it better and faster than we can? To me it seems, if we expect AI to pay additional fees, the question is: Why?

I am not saying that it's not an important question. Google has built its entire business around information other people have provided. I would argue most people are quite happy with the existence of something like Google search and see it as a net positive in their lives. Does that make the business part okay? Where do we stand on this in regard to an open web? Is it okay for Google to do what they do (and if they do it well to win the space), or should there maybe be a license where people have to pay the owner whenever they are indexing a website? I don't know. Feels complicated.

> If you stop paying them, then it doesn't get created.

That's an interesting thought. But is it true and, more so, is it a problem? What if humans from here on will only be paid to create stuff that an AI can't?


> You don't need to pay for or "borrow" anything to learn from copyrighted works.

someone pays, just maybe not you. How do you think google/meta/et al offer you a service free at the point of delivery, through charity?

> You can google an image of any great work of art and look at it for as long as you like

see my bit about google. The copyright still sits with the owner. That image can be removed, should the owner wish, but for various reasons it's too expensive to get google to respect that.

> I would argue most people are quite happy with the existence of something like Google search

yes, because its a symbiotic relationship. I as a creator, make something that people want to find, google points them to me, and I get people's attention. I might do that to fluff my ego, or try and convert it to cash through sales or something.

The AI step threatens to remove that relationship. Instead of being passed to me, the AI just pastes shit it's gleaned from mine and other websites, leaving no chance of me getting a reward for making that website.


> You can google an image of any great work of art and look at it for as long as you like, for free and take from it what you can and use all of that to create something else and get paid for that thing.

If the copyright on a given work of art is still active, those pictures were taken and are distributed with the permission of the copyright holder (or they're just pirated). That's one of the reasons it's much easier to find images of classic art (for which the copyright has expired) than it is to find images of contemporary art.

> You don't need to pay for or "borrow" anything to learn from copyrighted works.

What exactly do you think copyright is, and would you be surprised to learn that libraries have purchased the books on their shelves?


> You can google an image of any great work of art and look at it for as long as you like, for free and take from it what you can and use all of that to create something else and get paid for that thing.

If you're referring to piracy, that is very much being kept in check. Otherwise, the vast majority of copyrighted art is only available for payment in various ways (streaming services, museum and theatre access fees, library cards, buying e-books etc).


They're talking about googling any copyrighted image and looking at it, as a human or an AI.


> What if humans from here on will only be paid to create stuff that an AI can't?

I look forward to a life of horrific poverty


I don't think there is any defensible reason to have people at large suffer over AI advancement without having a plan for making their lives better.

If AI takes jobs because it's simply superior at them, and that creates friction and anxiety until we have stuff figured out, that's of course sad and we should do our best to soften the process, but I think it's inevitable. The carriage must die. It seems obvious that restrictions on training data are just a distraction and will not move the needle on any interesting time frame.

If however AI does not pay it forward, in an arrangement that makes our collective lives better, I will be the first to work on burning it to the fucking ground.

But, on a lighter note, since that has generally been the direction of human civilization (not linear when zoomed in, but always when zooming out) I remain optimistic.


In the Anglo-Saxon world, I have not seen a significant successful program since the industrial revolution that has helped or softened the impact of a new process on an affected group of people[1].

The weavers were left to rot when the automatic looms came in (they were an incredibly rich and influential class in Flanders, East England and northern France).

Furniture makers were left to rot when steam power tools came in

Farm labourers were left to starve when steam threshing/harvesting came in.

Enclosure was another tragic note in England.

The Green Shirts were lobbying for "a share of the domestic profit" in the 20s-30s; in the 60s they were convinced that we would be working 2 hours a day by now, with robot servants cooking and cleaning for us, and no one living in poverty. Even Orwell wrote on this.

Instead we see productivity in the Western[2] world dropping, meaning that for every human hour worked we make less money. I suspect this is in part due to the rise of servant-as-a-service jobs (food/shopping delivery, cleaning, elderly care, etc.), all of which are long hours and low paid.

[1] Well, in the DDR everyone had a job, but lived in perma-poverty and was likely to be disappeared for speaking out.

[2] Specifically the US and UK, who appear to be snorting financial inequality by the metric fuckton.


Fair enough. All is not well.

What I was thinking of, more so, are the unspecific societal functions that evolved to the benefit of everybody, but especially of those who could not have afforded them beforehand: quality health care, various forms of social support, more accessible education and food, better road systems. The stuff that makes the charts on education, prosperity and health go from bottom left to top right, and child mortality and hunger in the opposite direction.

The injustices of the day do not show in the most important, most long term graphs. As far as I can tell (and I am happy to hear your thoughts) this can only be true because people have benefitted increasingly from things improving, over time.


That's either a very pessimistic take on human creativity, or a very optimistic take on the ability of AI to mimic human emotion and experience.


>Culture isn't free. Someone is paying for it, and if you stop paying them, then it doesn't get created.

This is just deeply wrong. Culture existed before money. It is tragic to me that a person can't see culture as anything but a marketable good.


> It is tragic to me that a person can't see culture as anything but a marketable good.

With respect, that's not what I am saying. I'm saying it has a cost. If people do not have the means to spend that money on making culture, then it is not created.

Juvenal was a client of someone, and complained about it

Tallis, Allegri, Purcell, Bach, Mozart were all professional composers

The great seats of learning (Ashurbanipal's library, Venice, Alexandria) were all paid for by rulers wanting to show off how good they were.

Wilde and Byron were rich people wafting around bored and making art along the way.

In the 60s-80s it was possible to live in NYC working at a bar or something, and still have time and money to create art. Where can you do that now?

Now you need to be rich, or have time, or get patrons. The internet is a great way to either lower the cost of entry (see music) or get support to create (see Patreon)

> This is just deeply wrong. Culture existed before money

Culture existed when we had the time, food and resources to stop worrying about being cold, wet and hungry.


> Culture isn't free. Someone is paying for it, and if you stop paying them, then it doesn't get created.

That doesn't seem to be universally true; it's more an end-game of capitalism. There are countless examples of artistry/sculpture/music that were created long before copyright existed and although they may have been "paid" for it previously, those cultural items can be appreciated without needing to pay someone for it.

There are also many contemporary cultural items that were created without monetary recompense that can also be enjoyed without needing to spend money.

> Copyright is there to allow you and me to develop things and make money from it. It is there to stop people stealing our work, which may have taken years to develop and sell it for a profit with none of the risk.

Your use of the word "stealing" is unnecessarily loaded and specifically means that the creator was deprived of physical ownership, which would be incorrect.


> they may have been "paid" for it previously, those cultural items can be appreciated without needing to pay someone for it.

You are arguing against your own point here. As I said, culture stops being created when there is no money for people to create it.

Should copyright never expire? No. Is 25 years enough? You betcha.

> There are also many contemporary cultural items that were created without monetary recompense

Again, you are missing the wider point. For culture to be created you need a mix of people, and those people need to feel safe enough, and have enough time and energy, to create said culture. They will also need money for materials.

As I suspect you are not on a poverty wage, you will have the time, energy and healthcare to be able to create a new thing. This is not a luxury someone who works two jobs just to make rent has.

> Your use of the word "stealing" is unnecessarily loaded and specifically means that the creator was deprived of physical ownership, which would be incorrect.

Stealing is taking with intent to deprive. I mean specifically what I say.

Taking someone else's work and selling it as your own to make money, whilst depriving that person of credit or an income stream, is morally wrong.

Now, there is an argument that corporations abuse copyright (they do), but throwing it all out only benefits the likes of Google, Amazon and Facebook.


> Do you ask for permission when you train your mind on copyrighted books?

The law already makes many distinctions between humans and machines. For example, looking out the window to see when your neighbor is going to the supermarket: allowed; using a machine-vision system to store the movements of groups of people into a large database: not allowed.

Also, "training the mind" and "training a machine learning system" are two completely different things, even though the language used is the same.


This is the crux of the issue, I believe.

It seems to me that one side is arguing that people (as in, individual human beings) already do what the AI is being accused of, the other side argues that it's replicating work.

The truth of the matter is that what is taking place is a different thing altogether. We generally deal with "machine behavior" in a different way, because we recognize that its being automatic and reproducible matters.


> Humans are constantly ingesting gobs of "copyrighted" insights that they eventually remix into their own creations without necessarily reimbursing the original source(s) of their creativity.

Yes, and humans are being found liable for copyright infringement for doing so. All that's needed to establish liability is access and substantial similarity; the bar for the latter can be very low indeed (see Williams et al. v. Bridgeport Music et al.).


Perhaps the bar is different in different fields. In art, for example, collages are quite legal. And what AI art is doing is way beyond collages.


Humans also (usually) remember where we saw something and can attribute the source. ML at this point doesn't do that.


I've been programming for 25 years. I really can't remember where I learned what for most of it.


I doubt you can reproduce some function you saw 20 years ago verbatim, either. If you could, you'd know the source too.


> Do you ask for permission when you train your mind on copyrighted books?

I pay for the books directly (cash, credit) or indirectly (school books via taxes). I do pay the Louvre to observe the paintings. I also pay to listen to music through ads (YouTube) or via subscription (YT Music and Spotify).


But the datasets these tools are using are available to view for free. The AI isn't stealing physical books or paintings, it's viewing the same data that you or I can by sending an HTTP request, for free.


Could an AI view you for free in a street, or even through a window? Does that imply it can use that view data to create advertising using your modified likeness, for example? Just because you can view something for free doesn't mean you can use it any way you want.


> Just because you can view something for free doesn't mean you can use it any way you want.

This whole thread really makes me want to pull my hair out.

There's a difference between illegally creating an (even temporary) copy of a copyrighted work (e.g. streaming a movie) and creating a derivative work of said copyrighted work: two completely different things, with completely different legal outcomes.

If OpenAI in any shape or form creates a temporary copy (<--- by the copyright definition of what a copy is!) then this needs to be addressed as the former. If OpenAI creates a work that is considered to be a derivative work (<---- by the copyright definition of what a derivative work is!) then that needs to be addressed as the latter.

The crux of this whole thing is: human minds cannot make a copy of a copyrighted work by the definition of copyright laws (in Germany; I presume the same can be said for pretty much all Western copyright laws), while anything that a computer does can be construed as making a copy.


> anything that a computer does can be construed as making a copy.

but that's not the point of contention. The training data set has been granted the right to be distributed (by virtue of it being available for viewing already - it's not hidden or secret). The proof is that a human can already view it manually. Let's call this 'public'.

The question is, whether using this public training dataset constitutes creating a derivative work. Is the ML model sufficiently transformative, that the ML model is itself a new work and thus does not fall under the copyright of the original dataset?


>but that's not the point of contention. The training data set has been granted the right to be distributed (by virtue of it being available for viewing already - it's not hidden or secret). The proof is that a human can already view it manually. Let's call this 'public'.

This is wrong. My paintings are publicly available (especially going by your definition [which I'm confused by the origin of]). Taking a photograph of my paintings is still a copyright violation. I hope we can ignore all the legal kerfuffle about personal use, as it has no bearing on our discussion. Again -- all of this boils back down to what I've said before: bare human consumption does not constitute making a copy; nearly everything else does.

Your second point -- that a copyrighted work automatically grants someone else any rights (especially distributional rights) just by being available to be consumed -- is even more wrong. I'm not going to go further into that, as you can very easily prove yourself wrong by googling it.

>The question is, whether using this public training dataset constitutes creating a derivative work

I'm not well versed in US copyright law, but I would assume (strongly so) that this would not be the case. I -- again, for US copyright law -- assume that for something to be considered a derivative work, it needs to include (or have present in other ways) copyrightable (!) parts of the original work(s). In other words, the original work needs to "shine through" the derivative work, in one way or another. The delta of parameter changes of an ML model would (imo) not constitute such a thing.

Problems with derivative works will come into play when considering the things ML models produce.


But the AI is (supposedly) not making a copy of your painting. It is ingesting it, and adjusting its internal "model of what a good painting looks like" to accommodate the information it gleaned from your work. This seems more similar to what a human might do when they draw inspiration from another's work. The question is - to what extent does the exact image of your painting remain within the AI's data matrices? That, no one knows for sure.


> But the AI is (supposedly) not making a copy of your painting.

You are mixing up the two things that I mentioned in my original comment. You have to differentiate between creating a copy and creating a derivative work. Both of those things matter when talking about AI, but the former is way more clear-cut.

>The question is - to what extent does the exact image of your painting remain within the AI's data matrices?

And the answer is: it's irrelevant. The model has to be fed a copy of something. That's all that matters. The AI could even reject learning from that something. By the time that something reaches the AI for it to even do something with it, it's been copied (in the literal sense) who knows how many times, each of those times being a copyright violation.


I see what you're saying, though couldn't you say the same thing about the browser's internet cache? That copies the file from its original server to the user's local machine in order to display it efficiently.


The poster is wrong about what constitutes a copy (for the purposes of distribution). The temporary copy that resides in your browser's memory, or in local caches, isn't considered a violation unless it is publicly accessible.

I would put the same criteria to the copy made for the purpose of AI training. As long as you have the right to view the image, you would also have the right to ingest that image using an algorithm.


Yes, of course. It could even create advertising using my unmodified likeness, and that wouldn't be a problem. A person's appearance in public is public domain, you don't need permission to use it.


Definitely not simply the case in every jurisdiction.


I think that's an entirely different scenario. For one, I'm not displaying myself in my front window with the explicit intent of people viewing me. If you replace the AI in your example with a human taking a photograph, I would be equally appalled at the misuse of my image, and I'm very confident I'd have legal recourse to stop it.


Rephrasing of the question: if you put your eyes on a single pirated work in your lifetime, all future potential creations are potentially inspired by that experience. Is every future creation of a human who has put their eyes on a pirated work copyright infringement?


> Do you ask for permission when you train your mind on copyrighted books?

AI is not a mind. It’s a program. We might call it a “mind” as a metaphor, but it’s not really one.

So any justification which presupposes that an AI should be able to do something (really: that the people who are running the AI programs should be able do something) because they are a “mind” is fallacious and doesn’t need to be interrogated.


This is the uncomfortable truth that no one on that side of the argument wants to address.

Also, the fact that Artistic Freedom is now under attack by artists. Not that long ago Artists hated the Music Industry and Corporations such as Disney for weaponizing Copyright law against Artistic Freedom. Now artists are utilizing that same tactic against other artists.

https://en.wikipedia.org/wiki/Artistic_freedom


It's just humans being humans. Some years ago Microsoft was using FUD against Linux. Now Linux users are using FUD against Microsoft.


My fire department is spreading FUD about forgetting to blow out candles.

What is your point? If the "Linux users" are right, then it is not FUD, or else the concept of "FUD" is pointless.


Computers aren't human. Software isn't people.


Thanks for sending your strawman in to do battle with his strawman.

You don't need permission to train on books, but you do need to buy the books or take them from the library one at a time.

"training" these machines so far is not like human learning as becomes apparent when they spit out source code that mirrors individual repositories. And you know that humans are required to both remix their own creations and follow copyright law at thbowe same time, and also adhere to the social and institutional stigmas against extensive uncreative cut and paste paraphrasals.

When training AIs on copyright law trains them in obeying copyright law, they'll be ready for the Turing test, or even to be called AIs.


> when they spit out source code that mirrors individual repositories

That's not a problem, we already have copyright laws that prevent people from distributing mirrors of copyrighted works. They don't care about how the works were copied.


So you are saying we should prosecute MS because Copilot is distributing the copyrighted work? Or, in other words, why is Copilot spitting out the code to someone else (often with copyright notices removed) not code distribution?


> So you are saying we should prosecute MS because Copilot is distributing the copyrighted work?

Yes, isn't that what very many people are saying?


It becomes a problem when infringing content can be generated faster than it can be discovered and fined.


Only if the person generating that content is able to earn greater profit than the fine, before they are stopped.


So copyright law should be rewritten? It doesn’t follow.


This “ML learning is not like human learning” fallacy is all over the place lately. It’s stupid, and it should stop.

Humans are capable of both facsimile and imitation.

The fact that ML is able to perform facsimile far better than a human can is not evidence that this is “not the same” learning. Only that ML learning is superior. ML is far superior in feats of both imitation and facsimile.


Are you high? How is ML learning superior to human learning?

If I show a 3-year-old a single picture of a tiger and tell him this is a tiger, the child is able to recognize a tiger fairly accurately in real life without further input. Though the child might say that a house cat is a tiger...

ML learning needs millions of pictures to do the same, and still might mistake an elephant for a tiger...

ML is nothing more than graph approximation; there is no logical reasoning.


First off, your tone is bad.

ML is currently capable of the tiger case you mention. It’s generally called “few-shot” or “one-shot” learning. In the context of an image generation model, having never seen a tiger before, if you show it a few pictures of a tiger, it could immediately draw you thousands of tigers in any variation or scenario you can think of, which is way more than a child can do.
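
For the curious, here's a hedged sketch of the simplest form this takes (Python/numpy; `embed` stands in for some hypothetical pretrained feature extractor, not a real API): recognizing a new class from a single example is just nearest-cosine-similarity to that example's embedding.

    import numpy as np

    def cosine(a, b):
        # similarity between two embedding vectors
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def one_shot_label(query, references):
        # `references` maps a label ("tiger") to the embedding of ONE example image;
        # `query` would be embed(new_image), with `embed` some hypothetical
        # pretrained feature extractor
        return max(references, key=lambda label: cosine(query, references[label]))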

As for the need to train on millions of images for the base model, I believe you are trying to say something about "sample efficiency", and how ML differs from the brain in this regard outside of the few/one-shot contexts (which ML is absolutely capable of). I would argue that the sample efficiency of the brain is actually also quite low, much lower than people assume. It's irrelevant to an argument that ML is not superior, because ML clearly is capable of learning richer, more effective representations in a shorter wall time than we can, whether it is sample efficient or not. And in the sample-efficient few/one-shot contexts (learning what a tiger looks like from one picture), it also outperforms humans in speed, accuracy and creativity. It's not even close.

As for classification errors, ML is capable of some errors we are not, actually by virtue of being superior at learning representations we are not even close to being capable of learning. But those are edge cases, and they are fixed by various means. In the main cases, ML outperforms humans in speed, accuracy and class complexity, all exponentially.

You said something about graph approximation but it doesn’t make a lot of sense. I’m talking about learning and you’re complaining that machine learning is not “logical reasoning”. Whether ML is currently capable of logical reasoning is another discussion. Certain models do demonstrate some types of it today.

“Graph approximation” is a type of learning task. ML is a billion times better than humans at it so it also doesn’t help you argue that ML isn’t superior (in that regard).


There is no reason to accuse others of being on drugs because you fail to view the world in the same light as them. You can make your point known without doing so.


There are free books (imagine the horror of people giving something away) and lots of out of copyright books.


> Do you ask for permission when you train your mind on copyrighted books? […] Humans are constantly ingesting gobs of “copyrighted” insights

This comment fundamentally and dangerously misunderstands Copyright Law. Insights are not copyrighted, nor are they copyrightable. Copyright law controls who gets to distribute a specific “fixation” or performance of work. It is not, and never was about preventing the spread of ideas. Authors and artists have always intended for you to read/observe/listen to their work when you legally acquire a copy. They just want you to not copy it verbatim, but go do your own original work if you want to distribute or sell something.

The whole problem is that today’s NNs are specifically designed to remember and remix only the fixed performative parts of the work, and they, unlike humans, don’t understand the insights at all. They are just deterministic machines that copy and remix at a large scale. As such, it’s pretty clear the people training AI today should expect to have to ask permission before “training” (copying) other people’s work.


I don't think that's so clear. When you train a deep learning model, you are making it extract the gist or insight of many works and then use that pattern to produce new works. While the NN does not experience the work like a human, it is definitely not memorizing.

A silly example. Making GPT write a rap battle between Keynes and Mises goes beyond a performative remix, it is transformational work, nothing is copied explicitly. If a human were to write it that would not violate copyright.

I think that to tackle this we need a new lens other than copyright in the long term.


You’re right that it’s not so clear, perhaps I overstated for brevity. I don’t actually think requesting permission is absolutely necessary, what I really think is that there aren’t good reasons AI people shouldn’t at least first try to establish training sets that are unambiguously legal, either through use of public domain work, or through an actual attempt to curate licensing models that allow re-use. We have plenty of precedent for doing this, so people claiming they should have access to everything without permission strikes me as lazy. There’s also the problem that the AI winners already are, and will continue to be, the monopoly tech and media companies who stand to make handsome profits off of the results of their trained networks. Even if you believe the results of their tech is “transformational”, there is no question that it wouldn’t work at all without access to the source material.

The argument that NNs aren’t memorizing is definitely debatable and not necessarily true. They are designed to memorize deltas and averages from examples. They are, at the most fundamental level, building high dimensional splines to approximate their training data, and intentionally trying to minimize the error between the output and the examples. It’s fair to say that “usually” they don’t remember any single training sample, but it’s very easy for NNs to accidentally remember outliers verbatim. The whole reason the lawsuits mentioned in the article are happening is because we keep finding more and more examples where the network has reproduced someone’s specific work in large part. If we’re going to claim that today’s AI is producing original work, then we have to guarantee it, not just assert that it doesn’t usually happen.
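
To illustrate the outlier point with a deliberately silly toy (numpy, my own construction, not a claim about any particular model): give a fit as many parameters as it has samples, and the isolated sample comes back verbatim.

    import numpy as np

    x = np.array([0.0, 0.1, 0.2, 0.3, 0.4, 3.0])  # five clustered samples + one outlier
    y = np.sin(x)
    y[-1] = 42.0                                   # the outlier's value is arbitrary

    # 6 parameters for 6 samples: the fit interpolates every point exactly
    model = np.poly1d(np.polyfit(x, y, deg=len(x) - 1))
    print(model(3.0))  # ~42.0: the isolated sample is "memorized", not averaged away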

> a rap battle between Keynes and Mises goes beyond a performative remix, it is a transformational work, nothing is copied explicitly.

I don’t buy that the work can be called transformational just because the remix doesn’t have any recognizable snippets. GPT is in fact copying individual words explicitly, and it’s putting words together by studying the statistical occurrence of words in context of other words.

> I think that to tackle this we need a new lens other than copyright

I totally agree with that. This question is legitimately hard. We do need a new lens, but we might have to keep and respect the old one too at the same time. I feel like AI work should acknowledge that difficulty and step up to lead the curation of training sets that are legal wrt copyright by design, rather than ignoring the concerns of the very people who made the work they are leveraging.


So if I visually look at a piece of work and “run” the NN training algorithm in my meat-space brain or even on pen and paper, do I need to ask for permission for doing so? Or is permission required only if silicon chips “run” the algorithm? Asking for a friend.


This isn’t what the comment I replied to was suggesting, which is important because the NN training algorithm isn’t how humans observe creative work, nor how we make insights. But, yes, the same standards apply whether your deterministic machine is silicon based, or meat based. Copyright law applies to both. If you reproduce significant parts of a fixed performative work, then you are in violation of the law. If your algorithm mixes enough snippets from enough sources, then it’s hard to tell, and you probably won’t get caught, but it doesn’t really change the fact that you’re mechanically copying. FWIW, copyright precedent in music seems to allow human-made remixes that involve multiple sources, as long as the work as a whole is original and doesn’t reproduce significant parts of the sources.


You are loading up my question with your own assumptions. My point, specifically, is: if I'm just observing a piece of work and running the NN algorithm in my brain, does this constitute illegal thought? Do my thoughts violate rights? Note that I am not "reproducing" anything (whatever that term means). I am just observing the work and running the algorithm in my brain while sitting silently.


What assumptions are you referring to? It doesn’t seem like you understand Copyright Law, so that’s why I keep trying to explain it. Under Copyright Law, you have to acquire material legally, and it’s illegal to distribute copies you made to other people.

If you’re executing a NN algorithm in your mind, or via pen & paper, then you are copying from the training samples, because that’s what the algorithm does. During training you compute errors against the samples, and update your weights to reduce error. During inference or generation, you use the weights (the results you remembered across all your training data) to produce an output. When your training samples are clustered in the latent space, the network will only remember an average of the samples, but samples that are sparse and don’t have close neighbors are sometimes remembered verbatim because there’s nothing nearby to average from. You can legally run the algorithm all you want on your own. Once you run it and then distribute the output, it might be in violation of Copyright Law if you accidentally reproduced one of the samples. Same is true for traditional human learning, you can free copy ideas legally, but reproducing too closely something that someone else made may be against the law, even if it was accidental.


So we are in agreement that it is not violating copyright laws to run the algorithm on copyrighted works to produce the model, because if it is, my thoughts could be illegal too. In the end only actions such as reproducing the work and distributing it can be a violation. In other words, the end user of the model is the one to be held responsible if they reproduce and distribute the copyrighted material.


You have to acquire the source material legally. You can be in violation of copyright for copying music you didn’t buy. If you acquire work legally, you’re legally allowed to make backup copies for yourself, if you don’t distribute it. You can be in violation of copyright if you distribute something you don’t have the copyrights for.

Thoughts are never illegal wrt US Copyright Law. It’s a straw man to insist on making this point.

> In other words, the end user of the model is the one to be held responsible if they reproduce and distribute the copyrighted material.

No, this is false because it is the creators of the model that 1) did not legally acquire the source material and 2) distributed the network that contains latent copies of the source material that end users can use to reproduce works from.


> You have to acquire the source material legally. You can be in violation of copyright for copying music you didn’t buy.

This is incorrect. As another poster mentioned, it is not illegal to read a stolen book. It is only illegal to steal the book.

Secondly the source material is acquired legally since it is open to consumption on the open internet.

Thirdly, the model does not contain "latent copies of the source material". By using a simple test (currently the legal standard): if I showed you the node weights and counts of the network, no person even trained in the art can identify it with a specific piece of work. Therefore it is at best a derivative, reasonably distinct.


> By using a simple test (currently the legal standard): if I showed you the node weights

Nope, this is a strawman and continues to demonstrate a misunderstanding of Copyright Law. There is no such legal standard, where did you get that? If the network can reproduce a work, then it does in fact contain a latent copy. Arguing that you can't see it by inspecting node weights is a straw man. You cannot argue that you're not copying music if you use a new compression algorithm and then suggest it's distinct and derivative because nobody can read the raw compressed data. That's not how Copyright Law works. If you can approximately re-perform someone else's work, you're in violation. This is true even if you have to run a black-box program to produce the output.

> no person even trained in the art can identify it with a specific piece of work

Ironically, you’re actually admitting that even AI researchers can’t prove the network won’t reproduce someone’s work.

For the rest, you now seem to be looking for a snarky gotcha; if you don't want to have a discussion, then I'm uninterested in discussing further. I made clear above, and in a sibling comment, that remixes are a gray area and this question is complicated. That said, even if AI people do acquire source material legally, they are in fact copying it and distributing it, and that part alone can potentially violate US Copyright Law. This isn't even up for debate, so I don't know why you're attempting to suggest otherwise. The lawsuits mentioned in the article were brought on evidence that networks violated copyrights of specific existing works, and lots of people have found specific examples of violations.


My claim has always been that

1) Creating the model does not violate copyright. Claiming otherwise means running the same algorithm in meatspace would violate copyright laws, which implies thoughts violate laws, which is absurd.

2) Distribution of the model does not violate copyright laws, because the models themselves do not contain latent copies of the work. The model itself is not the work, nor a recognizable copy of it, nor can it be reconstituted back to the work. It is a tool, more analogous to Photoshop, where the tool can be used to reproduce copyrighted work, yes, by the end user (with whom I believe the responsibility lies). But the tool itself is not copyrighted work. Microsoft Word can be used to generate copyrighted books, if I'm correct. Or I can hire a smarter tool: a human writer to produce copyrighted works. Is the writer-for-hire illegal? Is his employability illegal? Of course not. I believe the law will eventually take the position that an AI model is a tool.


Creating the model does violate copyright if the model you create can reproduce someone else's work. Your logic is faulty. Running the algorithm isn't what causes the problem, so there is no implication that thought is the problem: this point is still a straw man argument. If it seems absurd, then shouldn't you re-check your assumptions?

> nor can it be reconstituted back to the work

This is false. It has already happened multiple times that networks reproduced copyrighted material.


How is creating the model in my brain not illegal (as in, my thoughts are not illegal), yet if it's in silicon it would be illegal? Please give a well-reasoned answer. This is not a strawman, no matter how many times you insist on using that label.

Secondly, you seem to be conflating the "tool itself" with "what the tool can do" as strongly equivalent. I.e., if the tool has the capability to violate laws, then the existence and distribution of the tool itself also violates said law. (Not so.)


Distributing a model created using your brain is illegal if the model violates copyright law. (Copying stuff without using neural networks in meat-space was already illegal.) If you create a program that reproduces copyrighted work, and distribute the program, then the distribution is illegal. That is the same answer I gave at the top. The brain-vs-silicon silliness is a strawman, and has been all along, because it doesn't matter how your program was created; building and running it is not the illegal part under copyright law, distributing it is (and/or using material that you haven't obtained legally).

> if the tool has the capability to violate laws, then the existence and distribution of the tool itself violates said law.

That’s right if you remove the word “existence”. Distribution of a NN model that violates copyright by reproducing copyrighted works is illegal. That part has been my point in this thread, it seems like you understand now and we agree.

It’s “existence” is not illegal under US Copyright Law unless you didn’t have the legal right to use the training material, and in that case it’s illegal to use the material whether you used a computer or your brain, it doesn’t matter how you created the neural network (or even whether you created a neural network), the violation there isn’t the act of creating the network, it’s the act of stealing and using material you don’t have permission to use.

This whole discussion would be a lot less frustrating for you if instead of making assumptions and logic arguments about brains and computers, you took some time to read the copyright legal code. https://www.copyright.gov/


If you understand that the model is a tool, and that as a tool it can be used to generate activity that can violate laws and be used for other perfectly legal activities, then as a broad principle the distribution of said tool is not a violation of said laws.

Cars, phones, guns, knives (practically anything) can be used to generate activities that break the law. They are perfectly legal to distribute. The onus on the legality of the activity lies with the end user.


While it’s true that knives and guns have both legal and illegal uses, it’s another straw man in this context, irrelevant to both neural networks and copyright law. In the case of neural networks, you’re distributing the copied material along with the tool, in the form of network weights, thus breaking the law by distribution whenever the network can reproduce significant portions of any of its individual training samples, or whenever you didn’t have legal permission to use the source training material.

> If you understand that the model is a tool, and that as a tool it can be used to generate activity that can violate laws and be used for other perfectly legal activities, then as a broad principle the distribution of said tool is not a violation of said laws

That statement is incorrect, the logic is flawed. Just because a tool has both legal and illegal uses does not necessarily have any bearing whatsoever on whether the tool’s distribution is legal. Tools that are illegal to distribute can have legal uses, and that does not make them legal to distribute.


> That statement is incorrect, the logic is flawed. Just because a tool has both legal and illegal uses does not necessarily have any bearing whatsoever on whether the tool’s distribution is legal. Tools that are illegal to distribute can have legal uses, and that does not make them legal to distribute.

You're making statements and assuming their truth without reason, evidence, or examples to back them up - the logical fallacy of begging the question. You have also not reasoned how freely available information is "illegal" to read/index/store, amongst other things.

Not here to win you over. The audience can see how weak your position is. My last response here.


What are you talking about? What I said there wasn't an assumption, it's a fact about our laws. Murdering someone with a gun and breaking copyright law by distributing copyrighted material are covered by two completely separate and independent laws. Breaking or not breaking one of the laws does not imply anything about other laws. Your statement of "broad principle" is the assumption here that claimed that not breaking one implied you were not breaking the other, which is completely and utterly false. You really really should read up on copyright law before asserting things, and maybe the laws surrounding knives and guns too if you want to use them as examples. Your comments have repeatedly and consistently demonstrated a lack of understanding of the laws we're discussing.


> Do you ask for permission when you train your mind on copyrighted books?

Yes, you do need to buy books, which gives you permission to read them.


Copyright does not grant the right to consume content. It only deals with distribution, whether that is done through a copy, or a derivative.

It is not illegal to read a stolen book, only to steal the book.


This is literally what the AI does as well. It didn't walk into a bookstore and steal all the books off the shelf, it read through material made available to it entirely legally.

The thing that authors are trying to argue here is that they should get to control what type of entity should be allowed to view the work they purchased. It's the same as going "you bought my book, but now that I know you're a communist, I think the courts should ban you from reading it".


> they should get to control what type of entity should be allowed to view the work they purchased

No, that's not it. It's more like if I memorized a bunch of pop-songs, then performed a composition of my own whose second verse was a straight lift of a song by Madonna. I would owe her performance royalties. And I would be obliged to reproduce her copyright notice, so that my audience would know that if they pull the same stunt, they're on the hook for royalties too.


There are lots of people arguing against the training itself. And people arguing against all outputs, even when there is no detectable copying. I don't know how you missed those takes. You're arguing the wrong point here. Many people do want to say "no ai can look".


Only if you released it. You could definitely perform it in the shower without owing anything. And the 99% of your compositions that didn't wholesale mirror any specific song would be perfectly fine to release.

Now, moving from holding the model creator culpable to holding the user culpable would obviously be problematic as well, since they have no way of knowing whether the output is novel or a copy-paste. Some sort of filter would seem to be the solution; it should disregard output that exactly or almost exactly matches any input.
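
Sketching that filter idea (my own toy, not how any deployed system works; a real one would need scalable near-duplicate indexing, e.g. MinHash): flag output whose character n-grams overlap too heavily with any one training document.

    def ngrams(text, n=8):
        # set of all character n-grams in the text
        return {text[i:i + n] for i in range(len(text) - n + 1)}

    def looks_copied(output, training_docs, threshold=0.5):
        # True if more than `threshold` of the output's n-grams
        # appear in a single training document
        out = ngrams(output)
        if not out:
            return False
        return any(len(out & ngrams(doc)) / len(out) > threshold
                   for doc in training_docs)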


But it's not humans reading it, it's using it to train ML models. There are similarities between humans learning from books and ML models being trained on it, but there are also salient differences, and those differences lead to concerns. E.g., I am concerned about these large tech companies being the gatekeepers of AI models, and I would rather see the beneficiaries and owners of these models also be the many millions or billions of content creators who first made them possible.

It's not obvious to me that the implicit permission we've been granting for humans to view our content for free also means that we've given permission for AI models to be trained on that data. You don't automatically have the right to take my content and do whatever you like with it.

I have a small inconsequential blog. I intended to make that material available for people to read for free, but I did not have (but should have had!) the foresight to think that companies would take my content, store it somewhere else, and use it for training their models.

At some point I'll be putting up an explicit message on my blog denying permission to use for ML training purposes, unless the model being trained is some appropriately open-sourced and available model that benefits everyone.


> You don't automatically have the right to take my content and do whatever you like with it.

Actually, you don't have the right to restrict the content, except as part of what's allowed in copyright law (those rights are spelt out - like distribution, broadcasting publicly, making derivative works).

Specifically, you do not have the right to restrict me from reading the works and learning from them.

Imagine a hypothetical scenario - I bought your book, and counted the words and letters to compile some sort of index/table, and published that. Not a very interesting work, but it is transformative, and thus you do not own copyright to my index/table. You cannot even prevent me from doing the counting and publishing.
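
(That index/table takes only a few lines to build - a toy sketch, assuming plain text in and word counts out:

    from collections import Counter

    def build_index(book_text):
        # word -> number of occurrences; derived from the book,
        # but contains none of the book's expression
        return Counter(book_text.lower().split())

Clearly derived from the book, yet it reproduces none of the book's copyrightable expression.)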


I assume you’re referring to US law here. Is there a handy place where these permitted restrictions are listed and described?


https://copyright.gov/title17/92chap1.html#106

The section titled "Exclusive rights in copyrighted works".

There are 6 rights.

(1) to reproduce the copyrighted work in copies or phonorecords;

(2) to prepare derivative works based upon the copyrighted work;

(3) to distribute copies or phonorecords of the copyrighted work to the public by sale or other transfer of ownership, or by rental, lease, or lending;

(4) in the case of literary, musical, dramatic, and choreographic works, pantomimes, and motion pictures and other audiovisual works, to perform the copyrighted work publicly;

(5) in the case of literary, musical, dramatic, and choreographic works, pantomimes, and pictorial, graphic, or sculptural works, including the individual images of a motion picture or other audiovisual work, to display the copyrighted work publicly; and

(6) in the case of sound recordings, to perform the copyrighted work publicly by means of a digital audio transmission.


> It didn't walk into a bookstore and steal all the books off the shelf, it read through material made available to it entirely legally.

Github ignored the licenses of countless repos and simply took everything posted publicly for training. They didn't care whether it was available to them entirely legally, they just pretended that copyright doesn't exist for them.


Isn't the definition of public repo that anyone is allowed to download and read it?


Nope, public repos have licenses, often open-source licenses that state that you can freely use the code, or change it, but only if the resulting product will also be open source.

Other licenses such as the MIT license require that you name the original creator.


You don't need to accept that license to download and read the code.

A license allows new uses that copyright would otherwise block. Some kinds of AI training are fully local and don't make the AI into a derivative work, so they don't need any attribution and you don't need to accept the license to distribute.


But no license (that I'm aware of) says "You are allowed to read this source code, but you may not produce work as a result of learning from it"; for a start, that would clearly be impractical to enforce.


It gets called plagiarism, and there are lots of lawsuits preventing this.


It's not plagiarism at all. The AI is trained on 5 billion images, yet it stores only 4 GB of data. Thus it is impossible that it stores the actual works. For any image that the AI generates, you can't point to any image in the training data that the image is derived from.
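
For scale, taking those figures at face value:

    images = 5_000_000_000
    model_bytes = 4 * 1024**3      # 4 GB of weights
    print(model_bytes / images)    # ~0.86 bytes of model per training image

Less than one byte per image, on average.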


Why are you talking about AI?

This was about the humans consuming other people's content.

> Humans are constantly ingesting gobs of "copyrighted" insights that they eventually remix into their own creations without necessarily reimbursing the original source(s) of their creativity.

If humans make stuff that is too close to someone else's source materials then it is considered plagiarism and not "inspired by".

> For any image that the AI generates, you can't point to any image in the training data that the image is derived from.

Why can't you point to the Getty Images watermark that it is quite happy to reproduce? Isn't that surely evidence that it doesn't actually understand what it is reproducing?

> The AI is trained on 5 billion images, yet it stores only 4 GB of data. Thus it is impossible that it stores the actual works.

I have also seen billions of images; therefore I cannot actually store the real images in my head, and thus nothing I paint could ever be considered plagiarism. That's brilliant; I think there are a few law firms defending artists who would be looking to hire you.


How did they train the AI without first storing the data? It's not in the model, but it was used without permission in the pipeline that led to creating that AI model.

I don't know if that counts as plagiarism, but there's clearly some use of this copyrighted material that the authors probably didn't envision and did not grant permission for. I have no idea what the law would be in cases like this.


> How did they train the AI without first storing the data?

The data was originally permitted to be copied.

The question isn't whether the training is violating copyright - as long as the data set had permission to be viewed (which it must have, since it was public).

The question is whether the final result - the model/weights - is a derivative work of the training data set. If it is a derivative work, then the model must be in violation of copyright. But copyright law allows for sufficiently transformative work to be considered new, rather than derivative. So does training a model using methods like this constitute a transformative work?


Who's to say that GPT isn't just analogous, legally, to lossy compression algorithms, only with a particularly interesting interpretation of "lossy"?


There are some overfitting cases where you can point to source images, but that number is a lot smaller than 5 billion.


I can't quite place why, but boy do I absolutely despise this line of reasoning.

Are we really going to play devil's advocate so much that we consider these early-day A"I" tools as equivalent to humans? I personally have absolutely zero qualms about treating humans and these ML tools as completely separate entities governed by completely different laws. AI SHOULD be heavily restricted; we're already headed not towards any sort of apocalyptic singularity, but towards a singularity of pure, endless spam spewing forth from every orifice of the internet and elsewhere.

If these megacorps behind this AI push want it to succeed, then they should be paying for access to the images/texts/music/videos/whatever they're trying to harvest en masse. I couldn't care less if an AI learns the same way a human does, or about any other anthropomorphising the AI crowd want to gaslight everyone with.


>Do you ask for permission when you train your mind on copyrighted books?

Of course not, and given my ability to train my mind on thousands of books in a few minutes, and to spit out a full book based on that training, in whatever style one wants, in a few minutes as well, it seems especially unfair that people act as though there might be a difference between the two situations.


>Do you ask for permission when you train your mind on copyrighted books? Or observe paintings?

I think this is a specious analogy at best. The two are remarkably different contexts. AI can work at a significantly greater rate. There's also a very large question about whether for-profit commercial software should be afforded the same leeway we give to ordinary human behaviour.


I can't be trivially copy/pasted and replicated at scale.


Does that mean once you can be there is no ethical responsibility not to?


Not sure where you're going with that question.

Fundamentally, the current accommodation of copyright has two main justifications: 1) protecting economic activity; 2) the moral right to identify the original author of a work.

The point about ease of replication speaks to (1) fundamentally breaking. A human can only produce so much output compared to an AI system.

(2) is a much thornier subject, and not one I really feel qualified to speak on.


I'm not a computer program.


I don’t think one can know that for sure. ;)


This is a better argument than your smiley implies. Given a work, we can't (or soon won't be able to) tell whether its creator was a human or an AI. So the only important thing that matters is whether the final work infringes copyright or not. Unless people are seriously arguing that nobody should be able to use AI to produce images even for their own private use.


If machines were totally free to harvest all human work and expression for their owners' benefit, why should anyone ever allow any of their work out in the open unless strictly necessary?

Why go down the route of turbocharging new forms of rent seeking?


This tired argument is very much like the "corporations are people" argument that convinces corporate lawyers and judges but literally no one else. Any lay person can tell a corporation is not a person, mostly because of the power imbalance and the difference in their abilities and physical constraints; and while a lot of these arguments can be reasoned through, the recognition of this is intuitive.

Most people looking at AI can tell it is not like a mind or an artist, because of certain intuitive arguments which boil down to its surprising abilities, its bizarre faults (drawing hands is still a struggle for most models) and its current limits (you have to hack prompts instead of asking naturally). You can reason about people using these arguments because they are people, but you cannot use them when applying them to NNs, because NNs are not people.

I'd argue that the moment you start using the "people are AIs" argument, you are implying the converse, "AIs are people," and thus assuming some bidirectionality here; so the other qualities you assign to people, like "people have rights," "people deserve to be paid for their labor," and "people have rights to the work of their own hands," must then also apply to AIs. Therefore, the AI tools you are using deserve to be treated with the respect and dignity you had to treat artists and developers with before, and should be paid for the work they create. That is, if they learned and created art in the same way people create art. Just as a nursery does not own the art a child born there creates, and a university does not own the art an artist who studied there creates, you cannot argue that the work an AI creates belongs to the "owner" or "trainer" of the model, unless you are arguing slavery is in fact okay in this day and age. All of this of course hinges on the supposition that AIs are people, and that they learn as people do.

So, you cannot have it both ways. You cannot keep treating AIs as people in your arguments, but then deny them agency that is due to people. The only way this is is that you deep down do not believe they are people, or that you think people do not deserve rights or deserve compensation for the work of their own hands.


Most people aren't asking for copyright on the AI output, though.

Also animals can be trained and make outputs and nobody accuses them of copyright infringement. That's a much better analogy here than leaping to the idea of treating one of these models like a human.


> "corporations are people" argument that convinces corporate lawyers and judges but literally no one else.

Ok, I just have to link this here; https://youtu.be/dKFunwOzEos?t=711


>Do you ask for permission when you train your mind on copyrighted books? Or observe paintings? Or listen to music? Do you ask for permission when you get new ideas from HN that aren't your own?

AI is not a mind. A mind is a physical object, a brain inside a skull inside a person. An AI is a computer program.

And while a nerd who forgot what grass feels like might confuse the two, the courts won't.


But I'm a person, not a trillion-dollar company. There is no reason the rules, laws and morality have to be the same for people and companies.


It could be time to give AI human-like rights. Human passports, human rights, 8 hour work schedule, weekends off, vacation time, and of course workers wages.

If an AI reproduces a copyrighted work, it should then be sent to robot jail, and the human who requested the work should be sentenced for conspiracy to commit copyright infringement.

It might however be a bit early to let the horse back in the barn.


Moral principles apply to living things. The living things in question are people at big companies training models to sell them as services from behind paywalls.


Stability AI is a small enough company that it doesn't even have a wikipedia page. The Stable Diffusion model is freely available to everyone along with the source code.


It’s very much the exception. Most of this industry is “build a model on public data and make billions on it behind a paywall.”


AI models aren't observing anything. The people who train them are copying other people's works to do so. By saying an AI model is "observing" images, you're begging the question.


Your mind is not infinitely scalable. That's a factor. Your mind is able to be recognised as an author under copyright law; the AI model is not. That's a factor too.


Training AI is closer to photographing an art gallery.

Training AI without permission is sneaking a camera into an art gallery without permission.


[flagged]


I don't think it will benefit "the little guys", because "the little guys" rarely have the resources and time to litigate in the first place, or lobby to lawmakers to make the details work in their favour. Copyright always benefits "the big corps". Everyone is so eager to get one up on Microsoft that they're forgetting the bigger picture.


The fact that the justice system is so inefficient that it doesn't serve people with less than a lawyer's salary of money to waste isn't relevant to the conversation of whether it's fair for someone to ignore licensing and repackage your code as an "AI".

If you want to talk about the big picture here, it's about privatizing gains and socializing losses, the goal of every bigcorp, which is just more reason to disallow this abuse.


Of course it’s material to the conversation whether people have resources in practice to use the tools of the justice system. It’s why small claims court is a powerful tool compared to a wrongful termination suit.

What you’re asking for is going to only benefit corporations. They’re the sole entities that will be able to afford the regulation you’re proposing.

It’s not even a hypothetical concern. Much of my notability came from books3, which contains almost 200,000 books, and is freely available to anyone. https://twitter.com/theshawwn/status/1320282149329784833?s=4...

I couldn’t train a GPT competitor without that. Neither could you. But corporations could.


It should be up to the rights-holders what they allow their works to be used for. That way, I can say, "no, Microsoft, you can't pay me enough to allow your bot to train on my works" and at the same time allow you to train books3, thus taking power away from MS. If the rights-holders have no say, the barrier to entry is having big stacks of computers, and big corps win by default already.


For what it’s worth, it was shockingly easy to get access to big stacks of computers, as someone who had very few resources. I hardly had money for IVF, let alone a supercomputer.

TRC (TPU research cloud) makes supercomputers freely available to anyone who will put them to good use. Like, literally you. You don’t even have to have a clear research goal for general access.

It was one of the big surprises of getting into AI. I didn’t expect that at all.

Even without TRC, compute is only getting exponentially cheaper. A 1.5B ChatGPT may sound puny, but I’ve seen how powerful the non-chat variants are.


The justice system is not going to change any time soon, and you can't ignore it. So that is the reality in which we must operate. Besides, it's not as simple as "the justice system sucks", because much of the time it's just a matter of not wanting the headache, and lawsuits will bring headaches in any justice system.


Thank you. It’s legitimately scary to read through these comments. It’s like watching everyone clamor for Stalin to be put in power: not even a good idea in the short term, let alone the long term.


Your backwards appeal to ideology is completely absurd. If it was Google training an AI on Microsoft's code, MS sycophants and PR people would be calling GOOG the rights-ignoring Stalin-loving communists.


Maybe, but it wouldn’t change the truth that the outcome of Microsoft restricting Google would benefit Microsoft, a trillion-dollar corporation, and not you or I. Nor anyone who isn’t a wealthy corporation.


Regardless of whether one agrees or not with paying creators of the training data, I think the deeper issue here is about societal wealth distribution and who gets paid for X now that X is being done very well by AIs. A less equitable world has Google or billionaires getting paid. A more equitable world has the artists.

But I want to argue here that for the purposes of this latter question, your proposal of copyright enforcement (or anything similar) is too little, too late.

- These "copyright violating" AIs have demonstrated the proof of concept and the damage is done. Even if these AIs are banned, the companies will just parallel-reconstruct them by running the 80/20 rule: pay tiny amounts to get most of the data. After all, the creators of the data were doing it for free and are in such fierce competition that there's no bargaining power.

- More nefarious AIs will just do transfer learning on intermediate neurons, very difficult to prove stealing here.

- Even if you get the system to work, what about future artists and writers? Are we just creating an entrenched historical group of creatives getting royalties forever?

The distributional problem is not well solved by copyright, and better solved with e.g. corporate taxes, income taxes, VATs.


>- Even if you get the system to work, what about future artists and writers? Are we just creating an entrenched historical group of creatives getting royalties forever?

This is kind of what happened with music, no? In some countries hard drives, SSDs etc all carry an additional tax that is then given to some copyright organization. Of course it's not the artists that mainly benefit from this, but instead it's the people running said organization.

Eg https://www.telecompaper.com/news/spain-approves-new-digital...


Yes, but it's not exactly a desired outcome.


> Even if you get the system to work, what about future artists and writers? Are we just creating an entrenched historical group of creatives getting royalties forever?

Copyright expires, and new artists will create new (copyrightable) art in the future. Unless your assertion is that generative AI is so good no one will make art without it ever again?


If the proposed system works, I expect those entrenched artists will sue young human artists whose work shows signs of learning from previous art. The vast majority of music, books, and movies have clear influences.


The proposed system exists, and humans have had to work in it for some time now. People get routinely sued for copyright infringement if their work is too close to an existing work. The (successful) suit against George Harrison for "My Sweet Lord" is a good example of infringement via influence with no clear malicious intent.


> Even if you get the system to work, what about future artists and writers? Are we just creating an entrenched historical group of creatives getting royalties forever?

The flip side of this is that if we undermine paid creators until there's no incentive for them to create, then the AIs' abilities stagnate on old data and we as a society drop, or at least diminish, the skillsets that could create new media.

AI can generate stuff humans care to look at only because of the availability of data that humans created for each other to enjoy. As tastes, fashions, zeitgeists and pop culture change amongst humans, the AI models will always be behind and unable to follow trends completely. I think.


> The flip side of this is that if we undermine paid creators until there's no incentive for them to create

The incentive to create is almost never financial. How many artists finance their creative efforts by working day jobs? Making a living as an artist is more about buying yourself the time to focus on making art than it is about making money. People will continue to create art, however they can, because they must.


I agree that people wouldn't stop making art. Sorry to shift the goalposts here: I do think that there are types of art that are not created except for commercial reasons, and that body of work is what I would expect to get displaced by AI. In fact, it already is: advertising creative media is one example. It's an industry I am involved in, and we are already seeing Dall-E and ChatGPT being used for quickly concepting ideations for clients etc. I would expect an AI to get worse at meeting commercial needs over time because of what I said in my original comment. Or at least for commercial creative media to stagnate if it could only use AI (because no one is making commercial media just for fun).

This is all stuff I am actively thinking about since it is impacting me right now, so I appreciate the discussion and would be happy to be wrong.


>if we undermine paid creators

I'm not even slightly concerned about that.

1. Art is better when it's not paid. Real artists have day jobs that pay the bills and they create art to express their ideas, not to make money.

2. Paid art isn't going away, it will just change. Certain skillsets will be forgotten, like how landscape painting was replaced by photography. But talented artists will leverage AI tools to create works that are greater than anything that came before.


> Art is better when it's not paid. Real artists have day jobs that pay the bills and they create art to express their ideas, not to make money.

Trying to define who "real" artists are is a folly for the ages. It is the dream of many artists that they get paid for their art, and many achieve it. The starving artist is a mythos of pain and suffering, a good story but hardly good for art. Some of the best composers from history were paid; some of the most influential artists were from wealthy families. They were able to focus on their work without fear of money, and because of this they could excel in technique and execution, which allowed them to produce some of the highest forms of their art in history.


AI models need human creative decisions as part of the process of making art. This is consistent with current copyright law as well as contemporary art theories of authorship and practice…

Eg, Donald Judd’s works are these creative decisions and processes distilled to the most basic of sculptural form.


> - Even if you get the system to work, what about future artists and writers? Are we just creating an entrenched historical group of creatives getting royalties forever?

The boat has long since sailed on this… and it's globally entrenched as a norm of international trade that we are all "ok with" this regime of 75-year or century-plus copyright terms…

And arguably the entire copyright vs AI/ML training datasets debate is founded on the notion that the artist's individual copyright will last long enough to outlive the average artist. Look at one of the old copyright regimes for comparison: a world where copyright is a short default/implicit/automatic term (14 or 28 years) and the copyright owner can elect to register and pay for extensions (for a more modern twist, preferably combined with an increasing disincentive to perpetual renewal abuses by Disney, et al). Now imagine how much data from up to 28 years ago there is… the catalogue of art and photographs and text and books and academic writings… all public domain because the authors didn't consider them of sufficient value, all free for ML model training… and this gets even larger with a 14-year term…

Suffice to say that we are seeing systemic impacts already. Culturally, we're seeing more and more money put behind less and less content, controlled by fewer and fewer people, due to a slow death spiral of copyright stranglehold across multiple industries. The written, visual, audio and video arts are all dominated by large corporations holding IP. Yes, individuals continue to create, but other than rare breakthrough chance successes and internet-age viral successes (which are often completely arbitrary/random and have no real quality), these companies decide what will be popular culture…

My prediction is that the AI/ML models will be allowed but heavily scrutinised, under the simple legal doctrine that the user is the one committing the infringement, since the primary purpose of these models is not infringement but unique creation. But suspicion will linger among artists, and it will become a normal part of contracts in the art world, effectively an artist's equivalent of the way police in many places view spray cans: the primary purpose of spray paint is not to create illegal graffiti, which is the justification many places used to overturn poorly justified civic bans on possession of spray paint.

I’d like to see any more draconian spread of derivative work rights (style rights etc) to be accompanied by drastic reductions in the automatic copyright term, as the ability to churn out lots of automatic content drastically lowers the value of long long terms, and the counter argument that it makes the existing rights more valuable is fucking insane as we do not need to pass copyright down to the great-great-great-great-grandchildren… the terms are already too long.


What’s great about your comment is that you show that what we need the most is to reduce the power of copyright in such a corporate-centric legal regime.

Personally I’d like to see the right to train statistical models on any works without the permission of the author enshrined in statute and an end to common-law copyright, a return to the Statute of Anne 14/28 time length, and a clear delineation between the “work” as having an author for an eternity but having a “copyright of the work” vastly limited in scope.

Ask yourself, do we want to be extending the reach of large copyright holders like Disney into taking a fee from LLM producers because they COULD be helping people draw Mickey ears on their private creations?

This is Betamax all over again, and luckily that Supreme Court opinion will weigh heavily in the lower court's judgement of these models as fair use.


> A less equitable world has Google or billionaires getting paid. A more equitable world has the artists.

A less equitable world has artists getting paid; in a more equitable world, everyone can just use open-source AI tools like Stable Diffusion.


It’s strange to me that there’s a lot of overlap between people who think AI training should require explicit consent for every piece of training data, and people who think copyright and patents are insanely restrictive in the music/movies/literature/software world.

It’s also worrying that requiring consent to train an AI model will inevitably lead to requiring consent to make handmade art that’s a little too similar to some other existing artwork (ie, how all art works through reference, training, and inspiration). A world where Getty and Disney control even more than they already do.


All or nothing, in my opinion. Either abolish or severely reduce copyright, or abide by it.

The simple fact of the matter is that Disney and Getty invest a lot of money into these materials being out there in the first place. Open source programmers and artists spend a lot of time producing works for no cost other than some minor courtesies.

AI companies aren't your friend or the little mom 'n pop shop down the road. They're technology giants backed by billionaires. When it comes to Disney versus Google/Microsoft, I'm against both sides if it means giving up my rights.

Big AI taking your stuff and ignoring copyright law isn't some kind of protest against copyright; it's the very opposite. It shows that copyright doesn't matter if you have the money to defend yourself in court. Violate the MPAA's copyright and you get extradited; violate some random person's copyright and you should feel honoured that people even want to steal your work.

In my opinion, the idea behind the current copyright system would work fine if the terms weren't so ridiculously long. Restrict copyright to five or ten years and I'd be fine with the whole thing. This "70 years after the death of the author" crap is the biggest stifling effect copyright adds.


> All or nothing, in my opinion. Either abolish or severely reduce copyright, or abide by it.

I firmly believe in "practice what you preach". If you declare you firmly believe in A but then do something directly counter to that because it's more convenient in this specific case, that doesn't sit right with me.

Besides, further expanding copyright in this one area will only make it so much harder to reduce it later. And the pro-copyright folks will be able to say "you say you want less copyright, but you vigorously advocated in favour of copyright then, you hypocrite!" (and they wouldn't be entirely wrong, either). All this effort and energy fighting ML tools would be better directed at reducing copyright instead.

I don't disagree with your view on corporations. Do I like what CoPilot is doing? Not really. But at the end of the day: does CoPilot's or ChatGPT's mere existence really take away anything concrete from me? Am I harmed or even inconvenienced by it? Are my rights reduced? Is my code harmed by it? Is my income reduced? I don't really see how it concretely affects me, other than a general "feeling of unfairness".

And I see real risks with all of this: most regular people and small businesses don't have the resources to litigate, as it's expensive and time-consuming, so a "license" that you or I slap on a piece of code is, realistically speaking, just ink on a piece of paper. GPL violations are rampant; violations of other licenses probably happen even more (but people generally care less about those, so they're not as widely publicized). Who will benefit from more copyright law on their side? The ones with deep pockets and many lawyers on retainer, i.e. the corporations neither of us like. Think creative new copyright lawsuits such as the "we claim copyright on the Java API" kind of stuff.


that's like saying "you say you don't believe in borders, yet you oppose this invasion? curious"


That kind of strained "gotcha!" comparison is not helping the conversation.


It's Disney that was most responsible for those copyright extensions you hate.

It's Disney that are, by proxy, suing Stable Diffusion to create the legal precedent that you desire.


What is strange in it exactly?

I can’t speak for everyone, but personally I find that copyright can be used properly or abused, on both sides (holder/consumer). It doesn’t mean that copyright is bad, only that particular categories of claims and usage are. But abusing copyrighted material from millions of little creators at insanely automated scale is another level of evil, especially when they explicitly require consent for exactly this type of use.

> worrying that requiring consent to train an AI model will inevitably lead to requiring consent to make handmade art that’s a little too similar to some other existing artwork

That’s the root of the misunderstanding, afaict. We can agree that at-scale processing is bad and that fair use is still okay. A human with a pen (or a text editor) can’t damage copyright at scale by learning terabytes of material in a few weeks and producing the same amount in hours, so they can be excluded from this. Humans who use AI can, so they’re a target.


Look at GitHub training its models on other people's code.

They aren't training it on Microsoft or GitHub code.

> A world where Getty and Disney control even more than they already do.

This is exactly what is currently happening though: it's okay to rip off the little-guy artist or coder. The argument here is that one big guy stood on another big guy's foot, and as the little folks we shouldn't stand for it either.


I don’t personally think it’s that strange or internally inconsistent. People seem to be saying “within the current copyright system, consent should be required, but I still think the copyright system is broken”.


Copyright is a good thing. The issue with copyright is that it has been extended too long, from 20+20 years (the latter of which is a manual extension) to 95 years (for publication) and life+70/80/100 years for the author. -- I understand extending copyright to keep up with extended lifespans, but it should be something like 30+30 or 40+40.

These copyright terms mean that, for life+70 countries, only works where the author died before 1953 and which were published before 1928 are in the public domain.

An AI training on a given work should comply with the law and with copyrights, just like anyone else. It should also respect the license or other terms the works were released under. -- You could easily silo the data by license, and have a different model per license.
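A minimal sketch of what that siloing could look like, assuming each scraped record carries a license tag (the field names and license IDs here are hypothetical):

    from collections import defaultdict

    # Hypothetical scraped records: each carries the license it was published under.
    corpus = [
        {"text": "def add(a, b): return a + b", "license": "MIT"},
        {"text": "int main(void) { return 0; }", "license": "GPL-3.0"},
        {"text": "SELECT 1;", "license": "CC0-1.0"},
    ]

    # Group training examples by license so each model only ever sees
    # material under one set of terms, and its output can honour them.
    silos = defaultdict(list)
    for record in corpus:
        silos[record["license"]].append(record["text"])

    for license_id, texts in silos.items():
        print(f"train a separate '{license_id}' model on {len(texts)} documents")

In practice you'd also have to decide which licenses are mutually compatible and can share a silo, which is where most of the real work would be.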

Patents should be a good thing (they allowed inventions to be published instead of being kept secret). However, it is easy for large companies to get patents on trivial things, write overly broad patents, and collate a large number of patents in a domain. That means trying to innovate or compete in a highly patented field like audio or video compression is difficult.


I'm anti intellectual property, but as long as people have to abide by it, I think AI has to, too. Being anti intellectual property doesn't mean I'm in favor of corporations stealing open source code, it means I want the law changed.


I believe that adding data to an AI training data set should require explicit consent, and I derive that belief from being wrong about copyright laws in the 00’s. The artists that tried to stop filesharing were right, and the total collapse of musicians’ livelihoods in the streaming era proves them right. We have the chance now to correct the mistake we made then.


Not sure what you intended to imply, but I don't think those two consents are related that much to worry about. Copyright licenses usually are written down like this: "[you are allowed to] use, reproduce, modify, adapt, perform, display, distribute" and so on.

When a new technology is introduced, for example when the compact disc was invented, lawyers get to poke at whether that "distribute" applies to music CDs, or just to vinyl and music tapes (because at the time of granting that license, CDs weren't yet a thing! gotcha!).

The answer to this conundrum might vary in different countries, and we can have fun discussing that in the context of AI, but it does not affect how handmade art shouldn't be too similar.


Exactly this. You can hold that the current copyright system is good and fine. That's a moral position that is, in my view, entirely deluded but internally consistent and not really worth having a discussion about. People who conclude that look at the world through a lens so fundamentally incompatible that discussions with the other camp can't be productive.

Or you can (correctly) think it's a huge drag on innovation and human progress.

If you think the latter then hoping in this case for legal precedent to broaden the scope of copyright enforcement is just bizarre logic. This isn't a rule that already exists as such. The case will set a precedent (based on interpretation of existing law) for the future.


Right. Not that long ago artists were rallying against Disney and the Music Industry for weaponizing Copyright against Artistic Freedom. Now it seems some artists have decided to use the same tactic.


Microsoft is not an artist.


The issue I have with AI artists is they seem to love copyright again as soon as it comes to their works or their prompts.


In the music business there is already so much copyrighted that there is a real risk of making something that is too similar.


It's probably those same people who publish their code under open source licenses instead of giving it to the public domain. I don't understand why people cling so hard onto every worthless little bit of code they write while also sort of half giving it away to almost everyone for almost any purpose.


First of all, don't pretend that copyright depends on how much "worth" a copyrighted piece of work has. Not just the works of Prince and the Beatles deserve copyright.

Then, a lot of open source authors understand that their works are not groundbreaking inventions and want to share them with the world without any fees or costs. And others have groundbreaking inventions and still share them for free with the world.

But there are almost always license terms attached to the piece of work. They can be essentially non-limiting, like public domain or MIT. But they can also enforce some minimal requirements like attribution. Why should any entity, especially huge corporations, not be bound by those conditions? It was mostly those corporations that created such restrictive copyright. Try to draw an image of Mickey Mouse, put it on your website, and see what happens.


1. People who use copyleft licenses do it because they know that the end result will stay under a good license forever.

2. People using non-copyleft licenses just do it because the public domain seems to have a complicated legal status across the world.


Most non-copyleft licences require attribution. That's significantly more than public domain. Copilot breaks even permissive licences.


The outcome you want will inevitably lead to entrenched intellectual property holders having an effective monopoly on the best AI tools.

You know that Stable Diffusion lawsuit? Go check who the lawyers behind that work for; Disney wants that same outcome.


People who made or bought out the content get to have the content. I don't necessarily see the problem with that except for the ridiculous amount of time copyright remains valid.

If a company invests $250 million into an original movie, I don't see why they shouldn't have some say over their content for at least a couple of years. Not until 2150 or whatever the end date for modern works is supposed to be, but give it some time at least.

OpenAI is the result of billions being thrown around. When it comes to billions, it doesn't matter if they come from Disney, Google, Microsoft or Amazon. None of these companies have our individual rights at heart, they only care about profits.

In this rare occasion, the interests of the people and Disney align. The laws protecting the independent writers/programmers/artists are the same ones that protect Disney.

The tools themselves work on arbitrary data sets. Anyone who can dig up enough public domain/attribution-free pictures/code/text can train their own AI without even coming close to copyright issues. Hell, had these super smart AI people managed to figure out a method of attribution, the data set could include massive amounts of works released under Creative Commons or open source licenses.


I don't see any of these companies caring about artists, but something like Stable Diffusion is more of a weird accident. Like how IBM managed to create a platform that became an open standard, something almost diametrically opposed to their own corporate values.

Good quality data and more of it means better output. Disney is almost certainly doing their own thing internally, benefiting from their ability to use both free as well as their own IP and the capital to hire cheap workers to train it directly.

It's not that I don't understand why artists might be upset about a company scraping copyrighted art, I just think that the longer term effects of legally kneecapping open source variants while handing over the most powerful versions of it to the existing intellectual property giants are A Bad Thing.


OpenAI is the result of slow and steady progress in ML. It's not even a good product. It is astonishing in itself though.


> but what the customers of AI people want isn't available under those terms.

What the customers of AI want is accurate predictions from the models, and they can get that even if everyone demanding to be removed from the training set were removed.

The makers of generative AI could remove every living artist who wants to be removed from the dataset, and the model would still develop a general solution of color theory, composition, almost every artstyle in existence, ... because the fact of the matter is, there is just that much data out there. Our species has collectively spent DECADES recording, storing and categorizing everything and the proverbial kitchen sink. There are god-knows-how-many petabytes of data available in images alone, so even if just 1% of that could be used to train generative models, it would still be more than adequate.
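To put very rough numbers on that 1% claim (a back-of-the-envelope sketch; the corpus size approximates the publicly documented LAION-5B figure, everything else is assumed):

    # Back-of-the-envelope: how much survives if only 1% of a large
    # public image corpus were cleared for training. Rough figures only.
    laion_pairs = 5.85e9       # approx. LAION-5B image-text pairs
    usable_fraction = 0.01

    usable = laion_pairs * usable_fraction
    imagenet = 1.28e6          # ImageNet-1k training images, for scale

    print(f"usable images: {usable:.2e}")               # ~5.85e+07
    print(f"vs ImageNet-1k: {usable / imagenet:.0f}x")  # roughly 46x larger

Even a heavily filtered corpus would still dwarf the datasets earlier generations of image models were trained on.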

And soon after that, there is an explosion of new generated art, filtered through the aesthetic sense of millions of humans, that can just be fed back into the models, to make them better.

The end result is the same: High-quality image generation on a scale hitherto unseen, running even on consumer grade hardware. And what lawsuits will be filed then?


Then do it!

I swear when I see this argument because it makes me angry.

You’re right, but they didn’t, because they were too lazy and cheap to do it that way.

…and that’s why people are angry, and rightly so. Fully licensed models are the future, and it’s both irritating and disappointing that we are where we are right now, because the people training these models were too lazy to assemble a training dataset that wasn’t problematic (i.e. full of porn and copyrighted material).

You can argue the “but at the end of the day it’s all the same…” argument if you like, but clearly from the lawsuits it isn’t ok

They’ve completely messed it up.

There’s a reason the OpenAI API terms of service say that “the Content may be used to improve and train models”; they’re setting themselves up to have a concrete defence regarding the source training data for their models.

Good job.

Stability can burn in a fire. They’ve really trashed the reputation of generative AI in a way that is going to be very difficult to recover from.


No one outside some artists is going to give a fuck about the legality. Joe Schmuck out there is too busy either not knowing this exists or making funny pictures of Han Solo eating a banana on a toilet made from the skin of Yoda.

That reputation damage you think matters doesn’t exist.


This is the only issue I have with generative image models. I'd be using them myself right now but I'm too disgusted by how the sausage is made. Once the first licensed, properly sourced models are out, they will get my money or time.


> but clearly from the lawsuits it isn’t ok

Well, a lawsuit isn't a decision; we will have to wait for the courts to decide whether it's legally okay or not.


Ultimately it doesn’t matter.

Reputation damage has been done.

Undoing that is going to take time and effort which, could be spent on more productive things.

I’m disappointed in where we are right now. It was entirely avoidable.

Lazy. Cheap.

/me shakes head…


>It's very possible that a judge will rule that AI models do not violate copyright.

Somehow we have managed to come full circle to the first episode of HBO's Silicon Valley.


How is it much different from a search index? It’s just a new interface to get at some info: rather than Google and Firefox, it has already pre-browsed the web for you and is displaying back the content. If the end user gleans some actual copyrighted work from the search, they still need permission to use it, but it’s also likely it’s just a derivative, or the end user is just reading an example and learning from it at consumption time. Is a web crawler violating copyright? Or is it the user who sells a copyrighted image?


> How is it much different from a search index?

A search index usually links to the source. Without that a search index is worthless, you can't use content if you don't even know where it comes from and who holds the rights.

Google search links to sources like Wikipedia in its info boxes, because without that you can't know whether the info is reliable or sourced from my brother's coworker's imaginary flat-earther friend.


It's not displaying back the "content". It's training a model with statistics based on the writing that was either paid for by a site publisher in the hope of earning ad revenue, or contributed to the community for free.

If a model were to add attributions to each of its answers, then perhaps the search engine analogy would hold. But they don't (and right now, to my understanding, can't).


The AI art models are scraping imagery made by human effort and skill. Then more humans label and tag it so it can be indexed (because you can show a computer an image all day long and it still won't 'learn' what it is unless you tag it), and then another human puts in their wishlist of art they want (without any cost or effort to learn the skill) and the 'AI' displays back a collage of the content from the wishlist. And then, presto, all sorts of merchandise are available with this art, taken without permission from the people that made it. People do the physical action of creating imagery; AI indexes it.


I’m going to wager that the percentage of people in our society who would like to extend copyright to restrict these tools, favoring the needs of individual copyright holders over the needs of the public domain, is much smaller than you realize.

By the time this reaches judgment and goes through the appeals process there will be a vast industry of non-infringing uses that are clearly transformative and in fair use (Sony v Universal)

You cannot say that the person using ChatGPT to control the lights in their garage is infringing on anyone’s copyright in any manner whatsoever. The point of copyright is not to gain a permanent monopoly on certain speech. The point of copyright is not to make sure that people are fairly compensated for their work. Their work might be terrible but contain a good idea that is later reimagined in a better way (Baker v Seldon) but that’s for the market to decide.

The courts will probably concur that these models are fair-use and I will agree with their judgement.


> It's very possible that a judge will rule that AI models do not violate copyright.

Current market odds for that are at 77%: https://manifold.markets/JeffKaufman/will-the-github-copilot...


"No" includes the litigation being dropped with no ruling.


Good point! The market isn't exactly trying to answer my parent's question. But it's reasonably close: the litigation is unlikely to be dropped if it would win.


Better title: The advancement of AI is being slowed by copyright.

But "eating" is a fun word.


Even better title: tech giants don't want to pay copyright to small creators, but will easily be coerced into paying it to Disney.


It’s up to the courts to decide if this is a copyright infringement. The EU at least already allows the use of copyrighted material for research purposes into text and data mining, so the main question will hinge on whether or not the result of such research can be commercially exploited.


Oversight? Hardly. You don’t seem to understand the purpose of copyright to begin with.

The purpose of copyright is to progress science and useful arts. Period. Any action taken in the name of copyright that does not progress science and useful arts is unsupported by law.

What else do we know about copyright? A copyright can apply only to creative expressions. While the bar for sufficient creativity is intentionally low, it is non-zero.

Another thing we know is that purely functional expressions are not copyrightable. When does an expression go beyond being a functional expression to a creative expression? That’s up to a judge. Since code is math, and math by itself cannot be copyrighted, whatever takes an expression to the level of creative expression must be beyond the math. Updating a database field, factoring primes, or using data correction algorithms are not creative expressions.

Now for AI. Only humans may own copyrights. The output of an AI is not copyrightable. But what if the input was copyrighted?

When it comes to software code, AI will value expressions that are commonly used more than uncommon ones. But software code is, by its very nature, an intertwined collection of copyrightable (creative) and non-copyrightable (functional) expressions. If AI values commonly used expressions, those expressions are highly unlikely to be creative enough for copyright protection in the first place.

So we have a circumstance where AI is trained on copyrighted but Open Source code. Yet the code itself is comprised of both creative (presumably) and functional code, with no clear delineation of what is and what is not protectable.

Lastly, many authors do not understand what constitutes a creative expression that is protectable by copyright. The amount of work required to create the expression is meaningless. Manipulating data to tease out something interesting is not creative. Let’s just face it: most software is comprised mostly of functional expressions that are not protectable. Back to that “math” problem again!

The big take-away? The purpose of copyright is to progress science and useful arts, not to build walls around ideas and concepts (which, by themselves, are not protectable).


>It's very possible that a judge will rule that AI models do not violate copyright

Would that mean you can simply use one AI (or more) from anyone else to train another AI?

Of course access can always be limited to an API with rate limits and per-request costs, which would make it difficult to straight up copy the whole thing, but it would be hard to justify any legal protections against it.


These things should cough up money for the authors whenever something is generated with their data or make the whole thing (source code and models) accessible to everyone. So easy :D.


You don’t need permission.

If you want to prove your data was used to train an AI, the onus is on you to prove it. Good luck.

The AI companies that follow the law strictly will be at a disadvantage to those that do not.


> the onus is on you to prove it

which would be easy during a lawsuit - the process of discovery means you get to check out the training dataset.

The allegation isn't that the AI trainers are hiding, but that what AI trainers are doing _itself_ constitutes copyright violation. AKA, they want the right to use the works to train an ai model to be a right that must be explicitly granted.


Training data sets are way too massive to be searched in any reasonable time.


> If that is the case, I hope new legislation will correct that oversight very quickly.

i hope that legislation is not introduced to prevent training, as such a right would stifle progress.


DeviantArt is opt-out by default in my experience. It's honestly a bit frustrating because I always choose to opt-in.


>AI companies can ask for permission if they want to train their models on other people's works. It's not that hard

Yes, and then people say "no" or "pay me". End result of this is that the only ones with good AI models are megacorporations that will DRM the heck out of it.

Years later those same artists will complain that they now have to pay $1000 a year to Disney/MS/Adobe to create art. Because these megacorporations can afford to pay for it. They're the ones that will benefit the most from this, because it creates an insurmountable moat for them.

Copyright exists to encourage the creation of more art and to progress science. AI is clearly a helpful step in that direction. Humans learn from others' works. Should we make that illegal too?


> Copyright exists to encourage the creation of more art and to progress science. AI is clearly a helpful step in that direction. Humans learn from others' works. Should we make that illegal too?

I find it astonishing that people continue to make this argument. A machine is owned by someone, a human is not. Why should the law treat machines the same way as a human? Sounds like some corporate flim-flam to me.


Copyright is not about protecting people. The purpose of copyright is:

>To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries;

The purpose of copyright is not to protect the authors, it is to promote the progress of science and art.

The current situation for AI image generation is pretty much the only way these technologies will be available to everyone. Most other paths will simply lead to billion dollar corporations acting as gatekeepers to this technology. Megacorps can afford to hire artists to generate specific art for their AI models, everyone else cannot.


My point is that this quasi-legal argument about how “humans learn so why can’t machines learn” is a non sequitur. This is a case where quantity has a quality all its own. Copyright law was not invented with anything like this situation in mind.

You end up with billion dollar corporations gatekeeping this technology either way (who else has the capital to best train the models?). This isn’t about the little guy.


>Why should the law treat machines the same way as a human? Sounds like some corporate flim-flam to me.

It shouldn't. That's why the argument that the algorithm is "learning", and is therefore doing the same thing that is legal for humans to do, is completely fallacious, on top of just being anthropomorphism.


I'd be far more amenable to corporations training their AI on my content if those exact same corporations hadn't spent the last two decades aggressively defending their own IP with DRM and multi-million dollar lawsuits.

So they can go fuck themselves, or alternatively they can make their super advanced AI reproduce my copyright statement and license every time it copies my code. Which shouldn't be difficult at all.


Indeed, they can go fuck themselves. But if copyright law gets put in place against AI, the only ones who will be able to enjoy the benefits of AI are the people who can go fuck themselves. The gatekeepers will just add yet another line to their 10-billion-line-long EULA requiring you to give up the rights to any art you possess for AI training, for the exclusive use of Mega Co. Inc., etc.


Your stance, if it becomes law, guarantees that only the giant corporations you hate will be able to negotiate licenses to train their AI. It guarantees that libre AI tools will be left far behind.

Is that what you want?


i think it's arrogant and selfish to assume that everybody has a stake in AI research. i want the license that the software i create (which has nothing to do with AI) to be obeyed.

anyways, there's a lot of ways that AI researchers could engage with IP owners to come up with a fair way to use their work, but nobody's making that effort. If my content is part of an AI's training set (and especially if that AI has a tendency to output excerpts of its own training set verbatim, as github's copilot has been shown to do) then it's not unreasonable to set terms and conditions, which could restrict how the content is used for training and what sort of compensation (if any) I deserve.

I'm of the opinion that it's time for new versions of GPL and CC licenses to be created which will enumerate how content can be used for AI training.


> it's arrogant and selfish to assume that everybody has a stake in AI research

I didn't assume anything, just described the likely consequence of your preferences.


Which corporations are you referring to?


mostly Microsoft but i'm sure there's plenty of hypocrisy coming from the others too.


[flagged]


Would you please stop breaking the site guidelines so we don't have to keep banning you? I appreciate your good comments but bad ones destroy more than good ones contribute.

https://news.ycombinator.com/newsguidelines.html


>"The Co-Pilot suit is ostensibly being brought in the name of all open source programmers. Yes, that’s right, people crusading in the name of open source–a movement intended to promote freedom to use source code–are now claiming that a neural network, designed to save programmers the onus of re-inventing the wheel when they need code to perform programming tasks, is de facto unlawful."

Maybe people wouldn't be so angry about an AI trained on mostly open source code if said AI was open source, and not a proprietary SaaS.


> Maybe people wouldn't be so angry about an AI trained on mostly open source code if said AI was open source, and not a proprietary SaaS.

Exactly, the point is this one: open source doesn't mean liability-free, you still have to comply with the license!


You don't if you are creating an entirely new work based solely on the knowledge and patterns you've learned from looking at other code, i.e., like a human learning to code by looking at millions of pages of code on GitHub. Whether or not AI can learn in this way is the point of contention for AI art/code/chat generators.


If the AI "independently" comes up with a 1:1 copy of some piece of copyrighted code, would this be a copyright violation or not?

There's a reason why some programmers don't even look at proprietary source code leaks as to not accidentally introduce copyright violations into their own code.


> would this be a copyright violation or not?

It would be. When a human does this, does it invalidate the human's ability to create any new work at all? Should we chain up anyone who violated copyright by perfectly recalling someone's art in memory and re-drawing it by heart, since we cannot trust them to ever create an original work again?


Copyright is (mostly) not about copying for your own use, but about commercial exploitation. This topic has been discussed to death since at least the Sony Walkman. Nothing of this is new or different just because an algorithm is now involved.

If you copy for your own use only, that's totally fine - or at most a legal grey area; in the end, nobody will care about such personal-use copies. If you use AI to generate pretty pictures to hang up in your home, totally fine too.

As soon as you start making money with this stuff though it becomes an actual problem.

It's really as simple as that.

Even the 'generative art aspect' has already been settled long ago when music sampling became popular and required a legal framework.


Software and human beings are two different sorts of things and should be treated differently.


The trick here is to implicitly personify the AI (a program) by comparing it to a human. Because they are both “learning”.

There’s no reason why we should have the same standards for programs and humans based on metaphors.

If I log in to a website three times a day, I am simply using a website. If a program logs in to a website three thousand times in the span of a second from multiple IP addresses, that’s probably a DOS attempt.


Human minds can commit copyright infringement without realizing it if they regurgitate parts of something someone has already created. Google “George Harrison.”


The `human learning` argument comes up a lot in every discussion about Copilot, but it's a completely different thing, and misleading. `Human learning` involves understanding beyond the words and sentences; Copilot doesn't know anything about our world.


Co-Pilot generated code is based on works that come from a variety of licenses. The generated code therefore must be licensed according to the license of the code it was derived from. In many cases these licenses are not compatible, and the generated code, being derived from copyrighted and licensed works, is in violation of copyright law.


I think this interpretation works if the code being generated is seen as essentially being retrieved by a lossy lookup function.

But another interpretation is that the generic structure of the code was learned from the works, which is not copyrightable. And that generic structure was used to synthesize new code, in much the same way a human who had seen a pattern in a proprietary codebase years ago is able to use that pattern in their own code. I am not a lawyer, but most licenses do not prohibit that in my experience. More often, in my experience, this is what is happening with generative AI.

The tricky bit is that the AI can probably do both in the eyes of copyright law, since the boundary seems to be very context-dependent and existing models don’t have any concept of how much you need to compress and forget the specific details so that the output is seen as novel by the courts. The model can memorize significant parts of some inputs despite not having nearly enough space for memorizing the whole input set, so the first interpretation is possible even if it isn’t the typical output. There isn’t really a kind of “courts will see this as novel” regularizer, and there might need to be.


You are just hiding the more complex argument behind the word "learned", which, in the normal understanding of the word, is not something attributed to a computer.


I can expand on that a bit: the weights in the big generative models are still basically too small to hold a significant fraction of the input set with anything we would call compression today. This forces the model to strip the input down to some discovered bare structure; when humans do this, we call it things like “archetypes” or “themes”, and it’s not generally copyrightable. Many LLMs aren’t even trained for multiple epochs, so the model isn’t optimized for memorization as much as it is forced to extrapolate to future examples. I’m arguing that the problem is that the computer has no knowledge of where the line is when output becomes plagiarism in our courts, not that it is always plagiarizing. I think it clearly can’t always be plagiarizing, from anecdotal experience of using these models and from just doing back-of-the-envelope math on how many bits it has available to memorize each input string.
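A rough version of that envelope math, assuming figures in the ballpark of Stable Diffusion v1 and LAION-2B (all numbers here are illustrative assumptions, not measurements):

    # Bytes of model capacity per training image, assuming roughly
    # Stable Diffusion v1-scale numbers. All figures are assumptions.
    params = 0.86e9            # approx. UNet parameter count
    bytes_per_param = 2        # fp16 weights
    training_images = 2.3e9    # approx. LAION-2B image-text pairs

    capacity = params * bytes_per_param      # ~1.7e9 bytes (~1.7 GB)
    per_image = capacity / training_images   # well under 1 byte

    print(f"model capacity: {capacity / 1e9:.1f} GB")
    print(f"capacity per training image: {per_image:.2f} bytes")

With less than a byte of weight capacity per training image, literal storage of most inputs is impossible, which is the sense in which the model is forced to compress down to structure, even though a few heavily duplicated inputs can still be memorized.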


You equating that process to something that "humans do" is anthropomorphism.

It's not true that that is what humans do.

Having knowledge of where the line is with regard to copyright liability is not an element required to prove liability; i.e., it's of no consequence that the infringer doesn't realize or know that they are infringing. Copyright is strict liability in that sense.


If viewed like this, you could argue that with every single line of code open source devs are working towards:

1) SaaS AI people getting richer

2) Devs have less work in the future

I’m not sure if that’s a good development for the open source movement.


US copyright law only applies in the US. AI can be developed outside of the US, where other laws apply. So the US using the straitjacket of copyright law to stifle innovation would only cause AI companies to move their business, or non-US companies to step up. I don't think that's actually going to happen though. The stakes are too high for that, and this case seems quite weak.

Copyright law has its limitations. But it also has a long history of being applied and interpreted that you can't just wish away. New legal interpretations have to be reconciled with that. So, any radical changes in that are not very likely to happen. Nor will politicians step up and change the law. First of all, few of them actually care or even understand most of this stuff. And secondly, they have plenty of other distractions and a worse attention span than a toddler. Mostly they just do what big companies tell them to do.

So, the most likely outcome here is that these cases won't get very far. At best it might drag on for a few years while nothing really changes. During that time, AI will continue to develop and will get more embedded in society.

And let's face it, this is not Oracle with really deep pockets unleashing an army of lawyers against the likes of Google, but some isolated individuals. As legal costs ramp up, their enthusiasm might suffer a bit. Especially if they start losing cases.


>> US copyright law only applies in the US

This isn't entirely true, due to various entangled trade agreements that require countries to respect each other's intellectual property as a prerequisite.

See https://en.m.wikipedia.org/wiki/Berne_Convention


They require countries to respect each other's copyrights, not each other's copyright law. The US, for example, does not enforce EU database rights. Moreover, you can in EU copy a book made by a US author who died 80 years ago even if that copyright is still valid in the US. Local laws are enforced by local courts.

What the Berne Convention requires is that if I have a copyright in US it will be recognised in France, etc without having to re-register it in every country in the world.

There is no part of the Berne convention that will prevent people from training models on US copyrighted works outside of the US. That is entirely a matter for local jurisdiction.


Trade agreements have been used to require other nations to extend their copyright terms to match the US, eg Canada:

https://cassels.com/insights/copyright-term-extension-in-can...


database rights are not copyrights but a new sui generis monstrosity, but this comment is otherwise correct


More legislatures ought to repudiate the treaties their predecessors blindly signed.


Where? Even Afghanistan’s joined the WTO. Wouldn’t the US file a WTO dispute if that’s the way the wind blows? The harm to US industries would be enormous.


I'd say China is an obvious candidate. They already have a fair bit of AI infrastructure and tech. Europe, India, and a few other places also could step up; that's where a lot of the AI researchers in Silicon Valley come from. The harm to US companies would only be enormous if anything actually happens, and precisely because of that, it likely won't. These companies can wield some serious lobbying power, and historically the US doesn't like to constrain its own industry with arbitrary laws. So this will likely just fizzle out with a nice court ruling.

And good luck filing a court case under US law in Afghanistan. That's not how trade agreements work. They'd be applying Afghan law (i.e. Sharia law as of recent changes of power). In China it's Chinese law. And in Germany it would be German law. All trade agreements govern is the notion that people need to have some notion of law that applies to things like intellectual property and a way to take legal action when they feel their rights have been infringed. But you get to do that under the local law, whatever it is, and in the local courts. And of course some countries like China have historically not done more than pay lip service to such notions. Whereas other countries, in the EU, have strict laws related to e.g. privacy that don't really apply in the US.


Belarus. Haven't you seen the news of this month?


Copyright has been eating everything since the beginning of computing and spitting out worse ghosts of the things we might have in its absence.

Sometimes we see glimpses of the world that we might have, in the form of open source, sci-hub and small independent successes, against all odds, created by people going against copyright behemoths.


> [...] and also, for software authors, prohibiting ML training would be antithetical to the Open Source Definition. So that probably won’t work.

Of course. As an author of OSS, I'm more than happy to let your AI "learn" from my code as long as the trained model is released under a GPL compatible license.


That the original post elided this point really stood out.

I don't hear complaints about code getting used for training.

I hear complaints about code being used in proprietary products in what appears at first and second glance to be a code laundering scheme without attribution or their rights respected.

If it was trained on the source code for Windows 11, AWS, and Google Search maybe everyone would feel more magnanimous. If those were used I have the feeling that the lawsuits would be much faster.

I don't get how "I'm only sharing this under these conditions" is complicated to grasp. Maybe it's a technical annoyance but... good?


I think this is exactly right, and also applies to any models based on web scraping. If you're scraping the web and building a set of weights based on language or images that people have developed, _fine_ as long as you release the model without charge and/or at cost (given that the cost of model training in this way is still expensive, earning back expenses is reasonable).

Otherwise every word typed and every image uploaded is contributing to the development of products that will increase the power of mega-corps over time.


This author really does not know two shits what they're talking about. Surprising this whole post isn't flagged. If your AI software is vomiting derivative works stripped of their original license, provided the license even allows derivative works, then yes, you need to lawyer up.


This AI boom feels very similar to that of crypto/Web 3.0 of a few years ago. Wouldn’t be surprised if the same people are behind the current AI/ML hype.


And the output!


I'm hooked on Stable Diffusion. It's one of the most impressive sudden leaps in technology in my 30 or so years of being old enough to understand it. Much of its power is based on the material it was trained on, and I'm very appreciative of the work and effort that was made in order to make it what it is. And it's only getting started.

That said...

I get it. Huggingface are working on a diffusion-based model for music. And guess what? It's extremely important for them to only use opt-in or commissioned training data. Why? Because the music industry, unlike artists on ArtStation and the like, has a lot of lawyers and can protect their copyrighted works.

Why should it really be any different for visual arts? I honestly don't see why. That isn't to say I'm not going to keep on using Stable Diffusion, nor do I think there is anything that can be done to stop it. But I do think artists should be compensated, and such models should not be based on the work of anyone unless they want it to be.


Monetary risk, that's why. Cross the wrong people and you'll find yourself extradited to another country and tried for copyright laws you didn't even know existed. I'm honestly surprised OpenAI models know about Disney properties.


No. I get why. I'm just saying that using copyrighted music to train AI-generated music is morally equivalent to using copyrighted art to generate art.

> Dance Diffusion is also built on datasets composed entirely of copyright-free and voluntarily provided music and audio samples. Because diffusion models are prone to memorization and overfitting, releasing a model trained on copyrighted data could potentially result in legal issues. In honoring the intellectual property of artists while also complying to the best of their ability with the often strict copyright standards of the music industry, keeping any kind of copyrighted material out of training data was a must.

This reads very differently from any statement made by Huggingface or the LAION database regarding Stable Diffusion, which do not mention the concept of an artist or an artist's work a single time.

And Huggingface are the only ones open about it. Midjourney and Dall-E and Imagen are most assuredly doing the same for their black boxes.


The music business has, for reasons connected to broadcast, phonograph and CD, been extremely profitable. Less so in the streaming world, but still OK. This powers a lot of lawyers for the next decades. Visual artists earned a lot less.


The music industry should be like visual arts, not the other way around. Do we want more inane lawsuits based on vague similarities?


No, no, I agree about the insane lawsuits for similar sounding music. The popular music is mostly regurgitated stuff anyways. But the equivalent here would be "Penny Lane as if composed by Michael Jackson and sung by Taylor Swift". My example is bad, because I stopped following popular music a decade or two ago. But replace the deceased artists with living ones.

My point probably should be made clearer. You know how you can say "in the art style of X"? Well, it doesn't matter whether Huggingface made that possible. You can, relatively easily, train that concept with a collection of paintings by X. Then you can go ahead and make art in their style.

Now, from a technology point of view, that is nothing short of amazing. I still cannot get my head around how absolutely ridiculously powerful it is. And even the people who play with this don't seem to fully grasp it either. The world will change in the next 4 years.

What I'm wondering is: what should we, as a society, find acceptable? Why should someone be able to train "in the style of X", where X is a set of EVERYTHING, and make money off it, without the say of X, or without them even getting anything for it? Have you checked the valuation of Huggingface? It's on the order of 1-10 billion USD.

There is definitely the argument of anti-copyright. I get that part too. But there is definitely someone who will end up with the bigger stick, and it isn't the people holding the paint brushes who made it possible. That seems just a little bit unfair, and perhaps unwise.

Also, I'll end with a point that no one so far has brought up, even though I've followed the discussion both for and against AI. Which is "whitewashing" art. Now, the example I'm going to show isn't very good, but I also spent 5 minutes on this. Where would you draw the line on when Alexander Wild no longer has copyright over his photography?

https://imgur.com/a/cm6IrzG


Thank you for articulating the split between the users of AI models and the owners of the models. I like using the models in the manner of Marcel Duchamp’s readymades and from that perspective, all the arguments about artists’ moral rights are nothing more than regulatory mumbo-jumbo and sour grapes. However, I do strongly agree that there is something morally icky about companies such as “Open”AI and HuggingFace making bank off the training data they’ve scraped. That’s why I prefer Stable Diffusion and its clones, as it lets us run it locally without someone making money in a pay-per-prompt (or, worse, monthly subscription) model. Democratize AI to anyone with a strong enough computer or don’t do it at all.


I should mention that Stability.AI, Hugging Face and Stable Diffusion are tightly connected. It's all a bit muddy, and I didn't help by mixing them up. However, as I understand it, Stability.AI is the company, "Hugging Face" is the community-based website for sharing models and the like, and "Stable Diffusion" is the implementation of the diffusion model developed at a university in Munich.

It's not really fair to take it all out on Stability.AI, as at the very least they are sharing the technology and models with everyone (for the time being). And that opens up some incredible possibilities.

What DALL-E, Midjourney and others do is much worse: it's much the same, except that while they let people play with it, it's all theirs, and they can take it away at any moment.


If you want to do a cover of someone's song, but in a different genre, you still need to license it. It's called a mechanical license.


Where do you read that Huggingface is training a music model? I think you're confusing it with Harmonai.


You're right. I was confusing them. The disconnect between the discussion and approach to training music models and image models still applies, although the hypocrisy I implied does not. Harmonai are approaching it sensibly. Huggingface and most other image diffusion model developers are not.


You might be interested to know then that Harmonai and Stability AI (creators of Stable Diffusion) are tightly connected. I think Harmonai is actually part of Stability, but it's a bit murky.

https://twitter.com/StabilityAI/status/1605012677188718592?t...


> If the AI industry is to survive, we need a clear legal rule that neural networks, and the outputs they produce, are not presumed to be copies of the data used to train them. Otherwise, the entire industry will be plagued with lawsuits that will stifle innovation and only enrich plaintiff’s lawyers

Or maybe, get this, how about people running AI only feed them information that they legally have the right to use? How is it a bad thing that somebody can't legally steal other people's work without their permission because of pesky copyright?


As an extension of this, only allow children to look at works they purchased publication rights to, lest their creative output becomes influenced by a different person's style.


There is absolutely no comparison here, because children don't charge you to look at their artwork; if you ask nicely, they will probably give it to you for free. Companies using other people's work without permission to train AI will charge.

Your suggestion would be accurate if we lived in a world where we all shared, and there was no money, and copyright didn't exist, but we don't.


It is my understanding that it makes no legal difference (at least in my country) whether I charge for my work or not when it infringes somebody's copyright. Simply sharing it is sufficient to get into trouble.


I don't know from which weird country you're from, but in mine profiting from it changes things a lot.


It sounds like your country is the weird one. Try burning the complete works of Disney onto stacks of DVDs, then go down to your high street and hand them out. See how long you get away with that for.


I think the big difference is distribution vs consumption. Where I live, there are additional clauses in law for mass reproduction and selling.

But republishing any work as your own probably falls into that category. And it isn't about profit, but commercial use; thus pasting onto a blog to improve your business (rankings, hit count) is a business use case.


It's ultimately a commercial use because you are directly affecting Disney's market. This should be obvious.


Unless I do like millions of them, literally nothing would happen.


Do you think they would act much faster if you charged a dollar or five? I don't.


There are plenty of competitors to the corporate AI models that are freely distributed, free to use, etc. You just need to have the hardware, which is admittedly pricey. The worst outcome is if there is a legal risk in creating AI models that only big companies with an army of lawyers large enough to fend off lawsuits can afford to face. Then you'd have the continuation of big tech controlling things for "responsibility" reasons instead of AI being a technology anybody can use.


I have no idea what your point is here.

These AI companies are making serious amounts of money (OpenAI is valued in tens of billions) on the back of artists who never gave permission for their work to be used in this way.

If a child took an artist's work, copied it and made significant amounts of money from selling it then yes they should be within the purview of copyright law.


> who never gave permission for their work to be used in this way.

Copyright isn't all-encompassing. There's only an enumerated set of rights granted, and "this way" (aka training an AI model) is not one of those restricted activities (unlike distribution or broadcast).

Unless the model can be argued to be a derivative work of the training data set (which I don't believe it is, since the process of training is sufficiently transformative imho), the original copyright holders of the training data do not need to be asked for permission.


AI doesn't copy one to one. It mixes, like humans.


And sometimes it mixes just from one. See github copilot.


Anthropomorphism isn't an argument.


I wouldn't bet for or against that without legal advice, and even then it might vary by jurisdiction. Legal fictions are a thing, but I'm no lawyer, and I know better than to assume my interpretation of any legal issue is any better than a Hollywood script writer's: https://en.wikipedia.org/wiki/Legal_fiction


AI is not human children.


I can instruct an AI to draw Mickey Mouse and infringe copyright and I can instruct a child to draw Mickey Mouse and infringe copyright.


And when you try and sell that work without attribution or compensation then there is a problem.


Exactly. If I produce a work, and then you produce an identical-enough work, you're infringing my copyright. I don't care how you did it, it makes absolutely no difference to me.


The problem is there's no way a user will know whose copyright they are infringing when they ask AI to "paint a landscape."

Maybe the AI needs to be able to print out a list of sources to provide attribution. That would be interesting.
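
(Sketching what that could look like: attribution would probably have to be approximate, e.g. a nearest-neighbour search over embeddings of the training set. A toy illustration below - the embeddings here are random stand-ins, and a real system would need the actual training corpus and something like a CLIP encoder.)

    import numpy as np

    rng = np.random.default_rng(0)

    # Stand-ins for precomputed, L2-normalised embedding vectors (e.g. from
    # a CLIP encoder) of the training images and of one generated image.
    train_emb = rng.normal(size=(100_000, 512)).astype(np.float32)
    train_emb /= np.linalg.norm(train_emb, axis=1, keepdims=True)
    gen_emb = rng.normal(size=512).astype(np.float32)
    gen_emb /= np.linalg.norm(gen_emb)

    # Cosine similarity to every training image; report the closest matches
    # as candidate "sources" for attribution.
    scores = train_emb @ gen_emb
    for idx in np.argsort(scores)[::-1][:5]:
        print(f"training image #{idx}: similarity {scores[idx]:.3f}")

Whether the nearest training images really are the "sources" of the output is, of course, exactly the contested question.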


Artists get inspired by others all the time, and if the results are far enough from each other, then nobody has a problem with that. In fact, pre copyright, the similarities used to be even larger. Art lives from the concept of taking ideas, and improving on them.


> they ask AI to "paint a landscape."

that's the responsibility of the user of said AI to check.


The most original authors are those who never learned how to read.


> legally have the right to use

is the "right to use in ML training" well defined?


The EU has a copyright exemption for noncommercial model training, and at least the UK is changing that to even allow commercial model training, without even an opt-out required. So it appears they legally have the right to use anything.

Why should you want a model designed to know all human knowledge to know only public-domain knowledge?


> Or maybe, get this, how about people running AI only feed them information that they legally have the right to use? How is it a bad thing that somebody can't legally steal other peoples' work without their permission because of pesky copyright?

On one extreme:

"Unless you pay your annual Disney fee for having watched Disney films in early childhood, you will need to return your brain to us for processing. Disney was used as the basis for all concepts you know, and as such, Disney owns all subsequent intellectual output of your brain."

And on the other:

In the age of AI, copyright will cease to hold weight. We'll make more new content on a per-month basis than all of recorded human history. The old regime must be thrown away to accommodate the radically new world we're entering.

We'll land somewhere in-between, and I'm hoping it's much closer to (or even precisely) the latter.


Very curious that so many people adopted this position exactly when it became feasible for giant corporations to profit by mass producing laundered copyrighted works!


Implying that I haven’t always been anti-copyright. How is either this comment or yours supposed to be productive?


Those wanting AI to respect copyright are going to find that the big players will navigate copyright just fine. It's the small players that won't. They're advocating for institutional control over AI.


One of the first commercial uses of modern neural networks was Microsoft laundering GPL code with Copilot, so I’m not really sure what you mean by saying that “big players will navigate copyright just fine”.


There are laws in the EU that only apply to large companies, like Facebook et al., because they have much more power in certain spaces. Similar laws can be made for Disney vs. small studios, e.g. "if turnover is less than 100M EUR/month..." - I feel this is often presented as a false dichotomy.


So if I want AI to respect copyright to protect individual artists, designers, etc., I am not fighting for the smaller players but for large enterprises? That is illogical.


The problem lies with AI artists wanting copyright for me but not for thee.


> how about people running AI only feed them information that they legally have the right to use?

Well, this is the crux of the matter, isn't it? Do you, a human, have the right to look at copyrighted works and learn from them? Do you have the right to use AI to do the same?


But we don't want poor innocent microsoft to train models on their own code, do we?


You’re talking about training. Training is legal.

- If you bought the book, you can read it.

- If the book is free, you can read it.

- If the painting is in a museum, or on Wikipedia, you can visit it.

- If Bozo the clown says you’re not allowed to look at drawings he posted online, it’s ok. You still can.

Same for AI.


> how about people running AI only feed them information that they legally have the right to use?

That's what they did!

It was fair use. So yes, they did have the right to legally train on the copyrighted images.


> It was in fair use

Many artists don't believe this and the law is very much unclear.

In many cases the AI generated work literally looks like a clone.


I'm curious about the effort spent prompt-crafting and searching for a seed to arrive at a similar image. Have they provided the prompt and seed used?
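
(For anyone unfamiliar: with a local model, the output is fully determined by the weights, prompt, seed and sampler settings, so an exact-reproduction claim is checkable if those are disclosed. A minimal sketch, assuming the Hugging Face diffusers library and the standard v1.5 checkpoint:)

    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    # Same prompt + same seed + same settings => the same image, every time.
    generator = torch.Generator("cuda").manual_seed(1234)
    image = pipe("a mountain landscape, oil painting",
                 generator=generator, num_inference_steps=50).images[0]
    image.save("landscape_seed1234.png")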


Train on data: sure… but sell the output? Different question altogether.


Copyright only governs publishing. So you have the right to train AI with any and all data you have access to, as far as copyright is concerned.


> Copyright only governs publishing. So you have the right to train AI with any and all data you have access to, as far as copyright is concerned.

Sure, but you still just cannot output anything that looks like a derived or copied work.

So, maybe ... how about if image generation nets hold onto their training images, so that they can compare the generated output against the training data to ensure that it is not too similar.

/s (but only a little)


Train it on, perhaps. But personal not-published use is a rather small niche compared to the current commercial explosion.


Yes but using AI to generate works that can be used commercially is the way commercial AI companies plan to monetize AI.


There have been leaks where OpenAI is charging $42/month to use their service.

How much of that is going back to the copyright holders whose work their service derives value from?


> How much of that is going back to the copyright holders whose work their service derives value from?

how much of the earnings of an art student go to the authors of the textbooks, paintings and learning materials he used to get to where he is today?


Derivative works are their own things (when sufficiently derivative). And AIs are not humans - using an algorithm does not automatically remove the copyright. See also "I uploaded a movie to youtube but it's upside down, why did it get taken down".


But that's because the movie is still recognisable. Using an algorithm doesn't automatically remove copyright, but if the algorithm transforms the data to a point where it can't be recognised as the original work, then it isn't breaking copyright.


Those may be fine, yes.

But the AI as a whole is capable of reproducing the original in a recognizable form, and it does so on demand quite easily, because it was trained on them - how is it different than selling a zip file containing millions of copyrighted works, and also a bunch of new stuff?


> how is it different than selling a zip file containing millions of copyrighted works

so you're saying that the digits of pi are violating copyright then?


Copyright law hinges on human element of the actions taking place, not on mathematical technicality. The digits of pi are not creative human expression, nor are they derived from human expression, they're a factual mathematical discovery. They can neither infringe on copyright, nor are they subject to it themselves.


So what differentiates the matrix of numbers in the AI model, vs digits of pi?


I can ask one to produce copyrighted works. The other, not so much. That's a rather weak straw-man.


If you publish/present something for consumption, it ends up in their brain's neural network. It's not stealing.


I think the underlying question is one of "degrees of derivation".

There's a famous Carl Sagan quote: “If you wish to make an apple pie from scratch, you must first invent the universe” which hints at the problem: Nothing is created in a vacuum.

Let's compare what Stable Diffusion does with what Franz von Holzhausen, head of design at Tesla, does. Franz didn't come into existence out of nothing, knowing how to design cars. Instead he studied transportation design and worked at Volkswagen, General Motors and Mazda before joining Tesla. All these steps trained his (actual) neural network with inputs from copyrighted car designs.

Based on these inputs he was able to create the designs for the Model S, Model 3, Model X and others. Does this mean that Mazda can now levy a copyright lawsuit against Tesla? It could, based on the reasoning employed in some of the Stable Diffusion suits, but it won't, based on the lack of similarity between the cars of the two brands.

I believe that the law around AI will come to a similar conclusion. AI learning is neither fundamentally at odds with copyrighted material, nor automatically in compliance with it. It will be a matter of degrees of derivation: how different is the output of the AI from its copyrighted inputs.


I can't help but feel like you're slightly anthropomorphising an algorithm. It's a really damn cool and powerful algorithm, don't get me wrong, but it's still not a person. At the end of the day, it's also not the algorithm "benefiting" from it, but the corporation using the algorithm.

It's also a bit hypocritical, because if you did the same thing to them as a human (in this example let's say be a Tesla copycat) you'd likely be sued into the ground because there's probably a patent somewhere in there.


I think the parent poster has the right idea. This is not just anthropomorphization; it's an analogy.

The takeaway is contained in the first and last sentences: derivation is key.

Copyright protects original expressions, and copying means to reproduce (read and write) something. The analogy OP made is focused on the reading and writing done by humans and the reading and writing done by an algorithm.

Algos like stable diffusion are reading, and their user is controlling what is written.

If the user produces a work that is unique, but uses the style of a particular artist, that seems like it should be valid, since style is simply a process. It's how to create art, but it is not art, and processes are not subject to copyright, according to copyright.gov.

With all that said, legality isn't morality, and I sympathize with the artist.


I guess. Imagine you're an artist, and you've built your career around your unique art style. It gets you jobs as a concept artist, or whatnot (idk). Now, imagine there's a tool someone "trained" in a few hours to do what you do, but within seconds instead of hours or days. Is there any reason to pay you, ever again? Your hourly rate is surely going to be higher than their electricity bill for running the algorithm.

So, how do we protect you (as the artist) from this? Copyright, even if flawed, currently protects you from that.


No. You cannot copyright an abstract style. Process is not copyrightable.


Sorry, I realise how my comment was a bit confusing. I kinda butchered my point.

I guess my point is more ethical than legal.

Sure, another artist could copy your style, but they would still need to study it for a long time (composition, palette, perspective, grading, line work, etc) to get an accurate understanding of it. They also needed to have spent years training their hand-eye coordination, as well as art theory to achieve it. Whether it's ethical is debatable, but they've earned their skills. And they'd still take a while to produce it, so they earned their money.

If someone without training just asked a computer program to produce "dragon in style of X" then this means people can sidestep artists to create works.

One could be smug and say "work smart not hard" here, but it creates a tricky situation.

What if people start pulling down their works or not posting them for fears a megacorp will crawl their works without permission?

What if someone asks for the same piece (by describing it as a prompt and saying "in the style of X"); is that plagiarism or copytheft?

What will happen to the models in this case? Do they become "inbred" over time?

What becomes of artists? It will no longer be a viable career for them (especially digital artists). I know most artists enjoy the process more than selling the art, but the process won't put food on the table.

I guess it's a lot of what ifs too here, but they're not unlikely scenarios.


It's not an analogy because it doesn't analogize "learning" which is the thing that's being used as a defense. It hides any criticism of learning by implying that it's the same learning that humans do. That's not an analogy, it's just bullshit. It's bullshit word salad. A really tiny salad, here.


The anthropomorphisation of the ML baffles me. This is not AI but a large ML model trained on often proprietary data. The whole discussion is ridiculous. There should be a lineage of model data, so we know what the original source is (at least for the "core" of the answer), in the form of citations. I assume that the model loses that information during training, and recovering it would be quite hard as everything is mixed together.


Is it really that baffling? Take ChatGPT, a model specifically fine-tuned on dialog interactions to seem more human-like. Of course people are going to anthropomorphise it. And the whole thing about it not being AI but an ML model - I think we can let that one go. It didn't stop the term cloud computing ("it's just someone else's server") and it won't stop the term AI from being used for this tech.


It's baffling because it's being done by people who would otherwise insist on technical correctness and rebut any argument otherwise with "well, how do humans think?"


The key point is: an AI calls itself an intelligence, but so far that is largely marketing. A human is considered an intelligent being, in whom a process of creating new things happens, based on true learning and not just reproduction. And where there has been mere reproduction, of course it went to the courts.


I'm really unsure if there is a qualitative difference between a human looking at lots of images, deriving patterns and recompiling them into a new image or a computer doing the same.


Currently there is definitely a wide gap, because the current "AI" is completely incapable of a real intellectual process. Whereas humans can develop a true understanding, and consequently theirs is an actual learning process, not a memorizing one. Of course there are overlaps, and consequently from time to time there are lawsuits about copyright infringement by artists.

The mere fact that Stable Diffusion tends to produce three-legged humans shows its complete lack of understanding of what it is doing.


Regression =/= memorization and is exactly how humans synthesize information as well.


I'm familiar with this line of reasoning, but I'm always struggling to understand the exact thing that humans can do that computers can't. Usually, the differentiation is that humans have e.g.

"true understanding"

I assume you mean the process of looking at an image and not just deriving patterns, but seeing that you are looking at a cat, that a cat is an "animal" which has "four legs and a tail" and that cats can be friendly towards you or aggressive, depending on your own behavior and theirs.

Neural Nets are certainly capable of the first two: classification and creating taxonomies. The last one I admit is tricky, as it requires the Neural Net to be an entity within the observed world.

"intellectual process"

the intellectual process is arguably exactly the process input->categorize and analyze->compile->produce output loop that we've modelled AI based upon

"creativity"

is the ability to create something truly new. This one seems obvious, as Neural Nets can only derive patterns (plus maybe a random input) - but I would pose the question of whether any human ever created something truly new in the "apple pie from scratch" sense, or if we've only ever created higher-level works derived from existing things.

"consciousness"

this one is hard to grasp. I would argue that consciousness is the realization that one exists (in the Cartesian sense) - coupled with the desire to continue to do so. It is a quality that wouldn't make much sense for an output-focused neural net like the one behind Stable Diffusion - but it might be a desirable trait in a decision-making-focused deep learning setup - similar to a self-healing cloud deployment.

"love/emotion"

This builds on the previous consciousness example. Not to sound like Rick Sanchez / some other cynic - but aren't these, at their core, adjustment mechanisms that help us further evolutionary goals like survival and continuation of our lineage? Wouldn't a decision-making-focused deep learning setup be more stable / have a higher uptime if it facilitated its goal of "staying on" through a strong drive of survival/expansion?

The last two examples are where my point falls apart a bit. But I still stand by my general thesis: we are way too certain that our particular human way of processing information and "thinking" has some divine quality to it that isn't replicable in neural networks. Against that, I would argue that neural networks are largely the same mechanism we employ in our thinking, and that they are just a couple of millennia of evolution behind, but are catching up at a multiple of the speed it took us to get to where we are now intellectually.


Do you think there is a qualitative difference between the human and the computer?


I guess the question is, what exactly is "true learning" and why is that impossible for a machine to achieve?

Keep in mind, we don't know how humans learn either (on a neurological level). It might end up being that we stumbled onto the same general idea, using matrices and linear algebra instead of neurones, synapses and neurotransmitters.


>There's a famous Carl Sagan quote: “If you wish to make an apple pie from scratch, you must first invent the universe” which hints at the problem: Nothing is created in a vacuum.

See also the story of Trurl's Electropoet (from the Cyberiad). Trurl first had to simulate a universe to get the poet to work.


maybe one country or another will outlaw generative ai or ai art or media synthesis or whatever it ends up being called, but presumably they'll be left behind by rapid cultural and technical development in whatever countries don't

the cat is out of the bag, the worms are out of the can, the feathers have blown away in the wind

these developments seem very likely to be central to programming, all other kinds of engineering, conceptual art, scenography, costuming, technical illustration, and pornography, within a couple of years, even if (against all odds) development on the neural nets themselves makes no further progress; they enable you to do things in minutes that previously would have taken days, things which are core parts of the feedback loop driving these disciplines

if every country in the world except thailand bans it then within ten years all your kids will be secretly watching prohibited thai movies with software secretly written in thailand on surreptitiously thai-designed computers, riding thai bicycles

even if deepfakes mean that the most significant effect of ai art is enabling massive fraud, spam, and mitm attacks, banning it locally won't stop you from falling victim to it (fraud is already illegal) but just from developing effective defenses against it


Japan has explicitly amended their copyright code to enable machine learning on copyrighted data. [0]

[0] https://storialaw.jp/en/service/bigdata/bigdata-12


> although the laws of foreign countries have provisions with the same effect as Article 47-7 of Japan’s Copyright Act, all of which limit use to development for non-commercial purposes and development by research organizations

It reads like there are a bunch of countries with similar legislation. Interesting, though.


Very interesting! This would give Japan a real competitive productivity advantage if ML tools are banned or severely hampered elsewhere. It would also mean that other countries would need to ban not just scraping but also the resulting code or AI-generated media.


I'm sure ChatGPT would be of much less interest worldwide if it only spoke Japanese.


Doesn't follow. Why would a tool built in Japan limit itself to Japanese? ChatGPT doesn't limit itself to English.


To export you must abide by the laws of the country you're exporting to.


your novel theory of extraterritorial jurisdiction will no doubt be very interesting to trips litigators

afaik asahi v. superior court is still governing precedent in the usa though so it won't be of any interest to domestic litigators in the usa


Perhaps you can't read words?

Extraterritorial jurisdiction?

You sell stuff to country A, you comply with the laws of country A. Which is why USA companies have to take GDPR into account.

You make an illegal model for country A? Can't sell it there.


You wouldn't sell the model, you would sell its outputs. I can legally buy products made by factories in Bangladesh that would violate all sorts of laws if they were operating in the U.S. It will be for international trade agreements to determine.


You mean operate in a legal grey area and shut down when the loophole is closed


So has the EU, as part of the digital single market changes in 2019 (the so-called Text and Data Mining exceptions).


Not really.

Well they allow mining the data but nothing is said about the copyright of the collage output.


your novel theory of how diffusion models work will no doubt be of great interest to deep learning researchers


requires training data: yes

occasionally outputs training data verbatim: yes

verbatim output is somehow not a copyright violation: ???


"The collage output"? What?


100%.

Furthermore, if we apply current copyright law to AI rather than understand that the entire world is changing, only the largest AI companies will be able to leverage the technology to train their models.

If it requires every film in existence to train a model, only Disney or Disney licensors will be able to operate. That's not good for competition. It might make it even harder to grow up as an independent creative or startup as it presents an impenetrable moat.

As I currently see it, weakening copyright is the only way to assure democratic access to this technology.


I'm still troubled by some of the implications of the technology and think it would be a good idea to have some countries take a hard Luddite stance against AI, but that doesn't mean I think you are wrong. I have trouble with the assumption that what is good for efficiency and reducing human labor is necessarily good for us, but it does imply those that don't go along with the efficiency increases will be destroyed by those that do.

With generative AI I'm mostly concerned with what it will do to the next generation of artists. I don't think I would have ever had the motivation to pursue music if it had been possible to replicate perfectly with AI. I'm immensely thankful I got that opportunity and so I want the next generation to get it as well.


i strongly agree with your first paragraph

i think the second paragraph is a bit myopic, like hunter-gatherers observing agriculture and worrying whether the next generation will be able to track prey through plowed fields, or will allow their hunting skills to decay because it's easier to get meat by trading with the agriculturalists

but that understates the case; ai (if this is really ai this time) is certainly a more significant innovation than agriculture, probably more significant than fire, on par with tools and language

still, that says nothing about its moral valence


It's surprising that so many people on this site side with the Luddites. The back pressure ML is generating is at this point too strong for anything to make any difference. Anyone who attempts to stop it will just be practicing self-sabotage.


> It's surprising that so many people on this site side with the Luddites

This is too reductive.

> The back pressure ML is generating is at this point too strong for anything to make any difference.

This is wrong, regulation can make a difference.

> Anyone who attempts to stop it will just be practicing self-sabotage.

This is a prospect worth evaluating organically. Learning the potential is much different from accepting self-fulfilling prophecies.


> This is wrong, regulation can make a difference.

Regulation is per country or bloc. With ML the value of defecting is so high that any regulation you impose on it which restricts its utility will amount to self-sabotage.


generally luddites are found among those with the deepest understanding of a new invention

but i agree that in this case it's probably futile


If it really turns out to be a problem then copyright holders are simply going to add in a not-licensed-for-training clause in all their licenses[1].

Sure, existing works already licensed can still be used, but at least both parties (copyright holders and AI trainers) won't have anything to argue about.

[1] Anyone from CC reading this? Make it the default.


It is like adding a clause in the license that you are not allowed to read the license.

The moment you share your creation/work with someone/the world, you are training their nn.

You cannot share something publicly and then demand "xyz" cannot view it. Viewing is training.

You are free to keep your creation under lock & key and only share with nn (of people and/or AI) of your choice.


> You can not share something publicly and then demand "xyz" can not view it.

That's nonsense. Licenses have clauses on how the content may be used. Clauses along the lines of "The content may not be used for ..." are common.

I dunno where you heard that once you release something the license clauses no longer apply, but it's wrong.


many of these clauses are legally unenforceable because you can violate them without infringing on any of the rights that copyright law grants exclusively to the copyright holder (who can license them)


> many of these clauses are legally unenforceable because you can violate them without infringing on any of the rights that copyright law grants exclusively to the copyright holder (who can license them)

That's news to Microsoft[1], whose shared-source and various NDA licenses for their source code already have clauses restricting what you can do with it.

[1] I think the problem is that the pro-AI arguments are coming from people who are not aware that clauses restricting how licensed content may be used are quite common. For example, you were obviously not aware that they are so common that almost every big tech and/or software company, past and present, already has those clauses in place, and those clauses have already been found to be enforceable!


of course people who agree to a contract are bound by its terms under common law; the question in the original article is what copyright law does or doesn't allow people to do without any sort of contract

unfortunately you have descended from simply making vaguely ignorant comments to attacking me, which indicates that further engagement with you is unlikely to be useful to anyone


The license is the contract. Literally.

And every non-FLOSS license I've seen has restrictions on what the licensee can do with the material.

Can you find one non-FLOSS license that doesn't have restrictions or limitations?

I mean, you led with obnoxiousness, then descended into condescension, all while not realising that the whole point of a license is to restrict the licensee.


Until such a time as a license containing a clause about not being used for "training purposes", whatever the hell that means, actually shows up in court, then adding this to your license has all the legal teeth of saying that you must sacrifice your first born.


How are you planning on proving a particular licensed work was used in a sufficiently large model? One of the commonly mentioned issues with current ML is the inability to reverse the output to figure out 'how it got there'.


> How are you planning on proving a particular licensed work was used in a sufficiently large model? One of the commonly mentioned issues with current ML is the inability to reverse the output to figure out 'how it got there'.

That's a different problem. Let's not get into the argument of "Just because the victim cannot prove something, we should remove the relevant laws."

The current laws are sufficient; all that it takes is for licenses to have a non-AI-training clause.

Let's solve the problem of "how do you prove" when we get to it[1].

[1] Right now, due to the systems already trained being given every single image on the net as training data according to the owner of those systems, in a civil suit the burden will be on them to prove that, on the balance of probabilities, a particular image was not used.
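
(For what it's worth, the research community calls this problem "membership inference". A toy illustration of the basic loss-based heuristic - everything here, the model and the data, is made up purely for demonstration:)

    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    # Tiny autoencoder deliberately overfit on a small "training set",
    # standing in for a generative model. All data here is random.
    train = torch.randn(20, 32)
    member, non_member = train[0], torch.randn(32)

    model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 32))
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(2000):
        opt.zero_grad()
        nn.functional.mse_loss(model(train), train).backward()
        opt.step()

    # The heuristic: training members tend to be reconstructed with much
    # lower loss than works the model never saw.
    with torch.no_grad():
        for name, x in [("member", member), ("non-member", non_member)]:
            print(name, nn.functional.mse_loss(model(x), x).item())

Whether this kind of statistical evidence would satisfy a court is another matter entirely.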


laws that cannot be enforced in practice are bad laws; they work out to be a 007-style license to kill, but for businesses rather than people

granting many such licenses is a recipe for social collapse


> laws that cannot be enforced in practice are bad laws;

Who said they couldn't be enforced? The argument from the pro-AI team is that we shouldn't have those laws in the first place.

I'm saying, let's keep the laws we already have because they already work quite nicely if the content creators don't want their work included in any training data set.

Rushing to make new laws because "Muh AI" is silly.


probably before rushing to categorize people into teams and ridicule those you think are on a different team from you, it would serve you well to understand the basic principles of the field you are commenting on so that you are in a position to assess the factual claims people are advancing

if you are failing to understand the factual claims, you have no hope of making sense of the normative claims for which you have such thirst


I'm not on a team, but the notion from the pro AI people that licensors cannot add restrictions on their material is, frankly, childish in its logic.

Almost all big companies routinely have licenses which heavily restrict how their software may be used, and the licenses have held up time and time again in various jurisdictions around the world.

The only argument you appear to be presenting is "well, I don't like it that way".

Tough. It's already that way and has been so for dozens of decades.


that isn't how law works

copyright is not a get-into-jail-free card that allows private parties to invent their own legal system and nonconsensually impose it on other private parties


It's not nonconsensual: if you want to use copyrighted content, or consume it in any way, you have to abide by the clauses in the license.

Forbidding certain uses is a particularly common clause in most copyrighted material.

Other than some FLOSS licenses, can you find one that doesn't have limitations on what you can do with the material?


If your AI is capable of reproducing part or whole of a bit of content it was trained on, then your AI is subject to copyright. Training an AI to create derivative works, does not absolve you of violating the copyright of the authors your AI is building derivative works from. If this is too much of a legal burden for you, then don’t train your AI on work it may create derivative forms from.

It’s a pretty simple problem. The folks claiming it is not either are being intentionally disingenuous or honestly do not understand the legal definition of derivative works in copyright. It’s settled law.

You could… you know either a) don’t train on works you don’t have a license to or b) use some sort of adversarial training to ensure that the AI doesn’t replicate the work it is trained on.


You seem to be very confident that you know exactly how the law is going to be interpreted in this case. If I were you I'd moderate that confidence a bit to cover the possibility that you are not in fact the top legal scholar you seem to think you are.

I could see this go either way. There's the argument you put forth, and then there's the argument that a text to image model is a transformative work. You can use copyrighted works and make money off your product and still have a transformative work. The Google books case is, of course, good reading on the subject.

My main point is that it is not at all clear which way the law will go on this.


When AI spits out verbatim the works of an author which it is trained on, that is not transformational, that is plagiarism. Whether it is done by an AI or a human is immaterial. Now we can argue over hundreds of hypothetical situations where AI does produce transformational work that isn’t a derivative, and we would agree that AI can and often does produce transformational work, but that is not what we are arguing over today. We are talking about the cases where AI isn’t transformational and where it is producing derivative works. To claim that AIs never produce derivative works or that all output from an AI is only transformative is foolish when we have countless counter examples popping up every other day.


> The Co-Pilot suit is ostensibly being brought in the name of all open source programmers. Yes, that’s right, people crusading in the name of open source–a movement intended to promote freedom to use source code–are now claiming that a neural network, designed to save programmers the onus of re-inventing the wheel when they need code to perform programming tasks, is de facto unlawful. The open source movement is wonderful in many ways, but its tendency to engage in legal maximalism to “protect” open source is sometimes disappointing.

This early paragraph is so bone-headed, so smugly demonizing, and misrepresents the situation so badly that I had to stop reading. This paragraph isn’t analysis, it’s propaganda.


It kinda is, but from what I’ve seen the developers against this are highly egotistical and having a meltdown that the skills that make them special and on some level famous are becoming obsolete


What a gross and hateful (and false) way to frame this…


It's not false, I'm describing my experience, which I said


Sorry. Let me correct:

What a gross and hateful way to frame this, especially given your apparently limited experience…


Wow that was a hateful way to frame that, especially given your knowledge of me


> One possible best practice would be to allow authors to specifically opt out of use of their output for ML training.

If it’s a copyright matter, I don’t see how that could work. It’d need to be opt-in, or covered by an explicit license grant (and terms and conditions are being ignored to the point that I gather some jurisdictions’ courts are pretty much striking down anything that a reasonable person wouldn’t expect to be there, and so it wouldn’t surprise me if such an approach would strike down any grant asserted in T&Cs).

> for software authors, prohibiting ML training would be antithetical to the Open Source Definition

If it’s a copyright matter, this wouldn’t be the case, because it wouldn’t be discrimination against a particular field of endeavour, but rather simply insisting on the terms of the license.


A half-serious thought I had is to allow AI to be trained on copyrighted materials but require all output to be public domain.

It doesn’t solve attribution, and you could get a sufficiently advanced AI to “launder” copyright, but at least it would prevent those corporations from leveraging public works into copyrighted projects.


That's genuinely the best solution I've heard for this issue so far.

It'd stop businesses from replacing artists with models wholesale, but it would allow people to keep using them, and allow the companies making the models to make money selling access to the models.


>> One possible best practice would be to allow authors to specifically opt out of use of their output for ML training.

That is legally the default. Creators own their copyrights. In many cases it is made explicit with a creative commons non-commercial use license. Remember, without a license you get nothing commercial - except the nebulous fair use.

The real problem here is companies thinking they can consume large amounts of material and works simply because they can see them on the internet and obfuscate them by combining together.


If the AI wants to be exempt from being punished for training on copyrighted works, the bare minimum standard is that it doesn't accidentally copy and reproduce large portions of that work. There are quite a few examples of Stable Diffusion doing this, so I don't think even that low bar is met.

I think anyone sane doesn't want to grant AI immunity to copyright damages even when the network is "accidentally" producing copyrighted output. If you can't adequately control your technology to avoid that case, you shouldn't get immunity from liability for your problematic outputs.

I get that testing for every possibility is impossible, and that neural net explainability is far behind neural net technology, so it's very hard to proactively identify and debug these systems. But just because those are hard problems doesn't give you the right to steal from copyright holders.


You can use a spoon to kill someone if you're motivated enough, but that doesn't make a spoon a killing weapon.

Just because AI art models can reproduce copyright (if you try hard enough), doesn't mean that it's a copyright stealing machine.


No but reproducing a copyrighted work is a valid basis for an infringement case which shouldn't hinge on whether stealing the copyright is intentional or whether the methodology to do so is a "copyright stealing machine". Granting AI immunity to copyright when it's known to produce examples of infringement seems like stealing other people's work to me even if the infringement is unintentional.


I think we're missing the real problem: it's not the responsibility of the AI model not to infringe copyright, but of the AI user not to publish work that may infringe copyright, like anything else... If I make a work with Photoshop (or a pen) that infringes the law, is it Photoshop's (or the pen's) fault, or mine?


> it's not the responsibility of the AI model

It's the responsibility of the maker of the AI model. They train the AI model on copyrighted material, so they should clear the licensing for using that material.

IANAL, but since I also work in the ML domain, I tried to find out how this works when you have to follow EU law. Past rulings [1] have considered 11-word snippets to be a copyright violation (which is by no means the lower bound). So it is likely that if a copyright holder in the EU can show that Co-Pilot or ChatGPT reproduces a non-trivial fragment of code or a sentence, they can sue successfully.

However, the sad fact is that these models are made by well-funded entities. So, they'll bury small copyright owners in lawyer busywork until they go bankrupt and settle with big copyright holders. So one possible net outcome will be that large entities can do large-scale copyright violation while individuals and small companies can't. We have seen this story before. And it helps to entrench big companies even more.

I hope that the EU comes with some regulation to level the playing field. So either make it illegal for everyone (enforced by EU courts) or legal for everyone.

[1] https://www.theregister.com/2009/07/31/ecj_rules_11_word_sni...
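
(A crude check in the spirit of that ruling is easy to sketch: look for any 11-word snippet shared verbatim between a source text and a model's output. Toy illustration with made-up strings; a real system would need text normalisation and an indexed corpus:)

    def ngrams(text, n=11):
        # All consecutive n-word snippets of the text, lowercased.
        words = text.lower().split()
        return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

    shared = "the quick brown fox jumps over the lazy dog every single morning"
    source = "witnesses say " + shared + " before breakfast was served"
    generated = "my model wrote " + shared + " as its output"

    # Any 11-word snippet present in both texts is flagged verbatim.
    for snippet in sorted(ngrams(source) & ngrams(generated)):
        print("verbatim 11-word match:", snippet)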


This may be of interest to this crowd. Singapore has explicitly set out exceptions to copyright infringement in its new copyright act. Sections 243 and 244 of the Copyright Act 2021 [0] permit copies of works to be made for "computational data analysis", provided certain conditions are met. It's not clear how this will pan out - the definition of "computational data analysis" is broad in some ways, narrow in others, and the conditions need to be tested against real world scenarios.

* * * * *

[0] https://sso.agc.gov.sg/Act/CA2021?WholeDoc=1&ProvIds=P15-#pr...


> But the latest and greatest software trend–generative AI–is in danger of being swallowed up by copyright law.

About time.

> If the AI industry is to survive, we need a clear legal rule that neural networks, and the outputs they produce, are not presumed to be copies of the data used to train them.

But they are compressed lossy copies of all that data! That's the whole point of noise/denoise functions that neural networks are based upon. The whole mathematical foundation of training a neural network is "teaching" it how to recognize and/or create copies of data stored in the training set.

> Otherwise, the entire industry will be plagued with lawsuits that will stifle innovation and only enrich plaintiff’s lawyers.

When you're willingly breaking already established law en masse for profit in hope no one cares enough, be it copyright law or any other, you're not an "innovator", you're a criminal. The fact that you're a tech giant or a Bay startup doesn't matter in this regard; the only thing that matters is the notable amount of time required for the justice system to catch up with your novel tools for laundering intellectual property.


De minimis is a longstanding defense in copyright law. If you are copying very little from very many works, as is the case when you turn multiple petabytes into a few gigabytes of neural network weights, you are in the clear. The problem arises when models overfit and spit out almost perfect copies of the training data.
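
(Back-of-envelope, using rough public figures: Stable Diffusion v1's weights are about 4 GB, and its LAION training subset is on the order of 2 billion images:)

    model_bytes = 4e9        # ~4 GB of weights
    training_images = 2e9    # ~2 billion training images
    print(model_bytes / training_images, "bytes of weights per training image")
    # ~2 bytes per image: nowhere near enough for wholesale copies, which is
    # why near-verbatim outputs point at duplicated, overfit training samples.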


Copyright doesn't have an explicit size, but rather uses size as one of many indicators.

For example, I could take a massive 8k video and convert it into a very small 144p YouTube video. Am I in the clear simply because the output is tiny compared to the input? Similarly, I could take a huge studio master copy of a song and convert it to a very small and rather compressed (distorted) mp3.

I partially agree that some of the problem arises when perfect copies are spit out by the models, but I do think there is a bigger problem. Copyright is a complex concept that can't be defined exclusively by a single metric like size, and any mathematical definition will in the end be killed if large copyright holders feel threatened by it.


Thumbnail images don't violate copyright, and are a very helpful comparison case to consider.

"Transformative Use" is a major consideration in fair use copyright: https://en.wikipedia.org/wiki/Transformative_use

ML models do not supplant the pre-existing work, and provide fundamentally new modalities. Transformative use seems like a slam dunk to me, but I guess we'll see what the Supremes decide in twenty years or so...


So is market harm

"Some courts have held this factor to be the most important in the analysis."

https://ilt.eff.org/Copyright__Fair_Use.html#Market_Harm


I’m unclear about this. Let’s say a movie comes out and I make a YouTube review using brief clips or screenshots from the movie. Since my review is transformative, I should be in the clear (I think?).

But when it comes to market harm, does the tone of my review affect the enforceability of copyright?

As in, if my review is negative, it would harm the market of people going to watch the movie vs a positive review, right?


Reviews have a distinct "character of use", one of the four cornerstones of fair use exceptions.

A review can be commercial, can cause significant harm to the market, can include a substantial amount of the work, and yet the character of use can be significant enough to convince a judge that an exemption should be applied. Since judges have historically come to this conclusion, there now exists legal precedent. With precedent we can draw some general conclusions, which tell us that reviews are in general exempted when using other people's copyrighted work for the purpose of review.

This character of use is very different than if I convert a studio recording of a song into an mp3 and publish it on a p2p sharing site. Judges have historically viewed the character of use in those situations as not being worth giving exemptions.


I am not a lawyer but I'd think:

You're not directly competing with the movie though, your work is a review, not a feature film.

If you were to make a parody movie from the material of the movie itself, directly taking scenes and altering them to your liking but still relying on the viewer recognizing the original in it, you'd have a harder time, I think.


There's a Stable Diffusion example where, having been trained on too many Getty Images pictures stamped with their logo, the system generated new images with Getty Images logos.[1] That's a bit embarrassing. There are code generation examples where copyright notices appeared in the output. A plagiarism detection system to ensure that the output is sufficiently different from any single training input ought to be possible.

[1] https://petapixel.com/2023/01/17/getty-images-is-suing-ai-im...
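
(Such a check doesn't even need deep learning for the near-duplicate case; a perceptual hash over downscaled images already catches close copies. A minimal sketch using Pillow, with hypothetical file names - a real system would need an index over the whole training set:)

    from PIL import Image

    def dhash(path, size=8):
        # Difference hash: shrink to grayscale, compare adjacent pixels.
        img = Image.open(path).convert("L").resize((size + 1, size))
        px = list(img.getdata())
        bits = [px[r * (size + 1) + c] > px[r * (size + 1) + c + 1]
                for r in range(size) for c in range(size)]
        return sum(b << i for i, b in enumerate(bits))

    def hamming(a, b):
        return bin(a ^ b).count("1")

    # Hypothetical file names; a distance of only a few bits out of 64
    # suggests the output is a near-duplicate of a training image.
    print(hamming(dhash("generated.png"), dhash("training_image.png")))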


Yes, agreed, I don't think the problem is with networks that mix tons of input data in a way that doesn't heavily derive from one or a couple of sources. The currently available models do not have overfitting solved, though, and this technological imperfection also has direct practical (and legal) consequences.


How are you using the word “copy”? It doesn’t seem to match the standard meaning. For instance, most people would not consider a brief summary of a movie’s plot to be a “copy” of that movie, or protected under copyright.


If you have an image, then train a neural network on that image, then use the neural network to reconstruct that image in detail, then the NN by definition contains enough information to reconstruct that image - hence, a copy.

With NNs trained on thousands or millions of data entries, this concept becomes fuzzy in the same way as you described - a short summary likely wouldn't be considered a copy, just like a 64x64 generated thumbnail wouldn't be considered a copy in the same way a 4096x4096 hi-res image would.
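
(The single-image extreme is easy to demonstrate: a small network trained to map pixel coordinates to colours will reproduce the image from its weights alone. A toy PyTorch sketch, with a synthetic gradient standing in for the image - production diffusion models are vastly larger and trained on billions of images, which is exactly why the question gets fuzzy:)

    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    # A smooth synthetic 16x16 RGB gradient, standing in for a copyrighted image.
    H = W = 16
    ys, xs = torch.meshgrid(torch.arange(H).float(), torch.arange(W).float(),
                            indexing="ij")
    image = torch.stack([xs / (W - 1), ys / (H - 1), (xs + ys) / (H + W - 2)],
                        dim=-1)

    coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2) / (W - 1)
    targets = image.reshape(-1, 3)

    # Small MLP mapping pixel coordinates to colours.
    net = nn.Sequential(nn.Linear(2, 128), nn.ReLU(),
                        nn.Linear(128, 128), nn.ReLU(),
                        nn.Linear(128, 3))
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    for _ in range(3000):
        opt.zero_grad()
        nn.functional.mse_loss(net(coords), targets).backward()
        opt.step()

    # The weights alone now suffice to re-emit the image.
    with torch.no_grad():
        err = (net(coords).reshape(H, W, 3) - image).abs().mean()
    print("mean reconstruction error:", err.item())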


The thing is, the “good” models can’t reconstruct the image in detail. It’s considered a sign of “overfitting” if you reconstruct the input exactly. Even if you put the exact query that was associated with that image, you’ll get the weighted average (feature-wise) image associated with the query. This applies to all such machine learning models without loss of generality.


That doesn't mean I can't recover the image (or at least get really close to one) using a different query, does it? Edit: It's nonlinear after all.


Sure, but I could write a program to spew out an unbounded number of images containing random pixels. It could create an image that is identical to a copyrighted image, but if I just keep that image on my hard drive, have I violated copyright? I don't think I would be, but if I started distributing them, yes I would.


> If you have an image, then train a neural network on that image, then use the neural network to reconstruct that image in detail

I haven't seen that happening since the discussion started. Most of the complaints I saw were aimed at things like "it stole my style", not "it reproduced my art".

Do you have any examples?


It's more, 'this product is profiting from my labor without my consent (ie paying me).'

In music you aren't allowed to use the same notes, even if you played them on a trumpet with a swing beat, while the source was on the piano very staccato.

While we don't have the same vocabulary for art, it's not unreasonable to expect similar protections.


What you describe for music is already outrageous -- why do you think that needs to be extended to everything?

https://www.vice.com/en/article/wxepzw/musicians-algorithmic...


It takes work to create/identify/classify information, both in the economic and the physics sense. That work should be allowed the same protections we give other forms of work.

Your example is one where nearly no work was done, thus it doesn't deserve much value. "Let a = the set of all songs" doesn't help me find new songs I like. A songwriter does that work. Another artist that takes and uses and resells that work (without consent), is stealing that work.

To me it's funny that nearly all the problems with the current wave of AI generation would be solved if the model builders simply licensed the content they train on. "But that would cost too much." OK, just use public domain work. "But that wouldn't be as good." Oh, so you are saying the work has value, but you are unwilling to pay for it, and instead your scheme is to just take it. That seems like a good definition of stealing - not paying for something that has value.


> Your example is one where nearly no work was done, thus it doesn't deserve much value

You are aware that there is very expensive art out there where the artist did not do much work. Like painting a canvas in one colour or throwing an item in the corner of a museum.

According to you, that would not deserve much value, but it does have a lot of value in reality.

In fact "value" is what somebody else gives to the piece of art.

A prompted AI artwork made by me may have more value to me than all the art in the Louvre.

The discussion here continues to revolve around copies, when it's not copies that these algorithms generate.


> Another artist that takes and uses and resells that work (without consent), is stealing that work.

An artist accidentally using a melody from another song (melodies being a finite set) and being sued for all their income is a horrible system. The winners aren't the people producing value; they're the people who got there first and are now profiting off other people's work.


If I grew up under a rock, somehow became a self-taught musician, and ended up authoring a song that had recognizable components from Happy Birthday, then even still the author of Happy Birthday, having established that melody so successfully in the public zeitgeist, reasonably should benefit.

This is so common that the recording industry itself has established rules for sampling and licensing and covers and what not. Are there some folks out there abusing the system? For sure. But overall its goal is to maximize the value produced by the recording industry, which very much includes the people who 'got there first' and built foundations for future artists. To me, this all seems basically reasonable.


Copyright is supposed to promote the creation of new works. You just described a system where a song written well over 100 years ago is preferred over a new artist creating a new work.


> to reconstruct that image in detail

Pretty much none of these systems "reconstruct an image in detail".


Honestly, can people stop speaking in absolutes regarding these systems? We (researchers and non-researchers alike) are gradually trying to comprehend exactly how much they generalise and memorise, but this is darn hard work and it is not our fault that several major tech giants decided to deploy and profit from these models long before the scientific and legal landscape was clear. Somepalli et al. (2022) [1] for example is a fairly strong argument against your statement above.

[1]: https://arxiv.org/abs/2212.03860

The fact is that these systems are complex, new, and interesting. However, it is not the fault of small-time programmers and artists that modern copyright law is a major, overreaching mess that is now finally greatly affecting what the big corporations want to do. They are getting sued? Cry me a river… Perhaps they will finally stop backing the American-led copyright lobby then?


> is a fairly strong argument against your statement above.

From a quick skim of this paper, they apparently used toy models with a few hundred to a few thousand images in the training set. For the ones with a few thousand training images, they rarely or never saw exact duplicates.

For instance, in their figure 4, they show exact duplicates for the training set with only 300 images (well, duh), and didn't find any exact duplicates for the training set with only 3,000.

I'm not sure I'd call this a "strong argument" when applied to models with millions or billions of images. Quite the contrary. LAION-5B (used to train Stable Diffusion) contains 5 billion image/caption pairs.


Firstly, thank you for engaging in a discussion. Secondly, I am not an expert in image processing; rather, my focus is on language. Thus my intuitions will not work as much in my favour in this domain, although the models do have similarities.

They explore a range of sizes and I do not think it is fair to only highlight the smallest ones. They do explore a 12M subset of LAION in Section 7 for a model that was trained on 2B images. Yes, it is not an ideal experimental setup to use a subset (they admit this) and far from LAION-5B, but it is a fair stab at this kind of analysis and is likely to lead to further explorations.

Let us return though to your claim, which is what I objected to: “Pretty much none of these systems ‘reconstruct an image in detail’.” I think it is fair to say that this work certainly makes me doubt whether none of these systems (even the larger ones) exhibit behaviour that may limit their generalisability or cross the boundary of what is legally considered derivative work.

You may very well be right that once we scale to billions of images this behaviour is improved (or maybe even disappears), but to the best of my knowledge we do not know if this is the case and we do not know when, how, and why it occurs if it does occur. I remain a firm believer that these kinds of models are the future as there is little evidence that we have reached their limits, but I will continue to caution anyone that talks in absolutes until there is solid evidence to support those claims.


Incorrect. Training images are used to generate a latent manifold which might not contain any of the images in the original training set to within a meaningful delta unless they're massively overrepresented or cliche.


This might be technically correct but doesn't seem to matter much in practice because of how many practical cases there are of NNs outputting copyright-infringing content. These include examples in the recent lawsuits that made the rounds on HN. Either even the best specialists behind these NNs cannot make the NNs not contain data that is "massively overrepresented or cliche", or they are unwilling to.


"Many practical cases of NN outputting copyright-infriging content there are"

I have seen very few practical cases of generative models such as Copilot, and even fewer of Stable Diffusion, reproducing original copyrighted works in exact detail, and the few that I did encounter were tortuously instructed to do so, which strikes me as highly contrived.


Or they're focusing on performance to get the tools to the place where they're good enough to start being adopted. Once getting sued is more of a problem than not having a product at all, it's not hard to switch gears. I imagine there are a number of ways to avoid storing "too close" copies in a model that have various tradeoffs, I'm sure they'll be adopted quickly in the face of litigation.


> the recent lawsuits that made the rounds on HN

I happened to have missed those discussions, do you have some links you can point towards? thanks!



I’m still not seeing those examples, at least in the Stable Diffusion link.

Even in my overfitted dreambooth model of my wife it doesn’t pop out the exact same portraits.


thanks a lot!


> to within a meaningful delta

This feels like an argument for communism being the most productive system in theory. Most of the time I feel like I see uninspired material that's tracing its own training data.


How is that any different from looking at human art?


When humans do it within a delta, it’s considered copyright infringement.

Machines currently are adept at making copies within a delta, hence articles such as this to limit copyright so the people who operate said machines can profit.


Transgressors, perhaps, but not criminals. AI models, especially generative ones, are not lossy copies but rather synthesizers.

If anything, your comment highlights the need to evolve IP norms and laws.


> But they are compressed lossy copies of all that data!

Sounds similar to, for instance, songs written by people who have heard other songs. I wouldn’t expect legal cases concerning AI-generated works to be any simpler than legal cases concerning the difference between a songwriter violating the IP of another songwriter or simply being inspired by another song.


Sometimes a songwriter violates copyright on their own song, without realising it.


I found all this to be a very odd take coming from something titled "copyleft currents". You summarized some of the points quite well.

> So, all neural network developers, get ready for the lawyers, because they are coming to get you.

This is just dumb.


> But they are compressed lossy copies of all that data

So are those you are storing in your mind...

(Edit: i.e., "learning is not a violation". Also see below.)


Well, if you copied a famous piece of art from memory... That'd be a copyright violation.


Yeah, I think this is the crux of the issue that people keep glossing over when claiming it's not a true reproduction.

Drawing Coca Cola's logo from memory by hand and slapping it on your product is still copyright infringement, even if it isn't an image Coca Cola has ever produced. In that sense it doesn't matter at all if it's AI or human - the production and subsequent distribution for profit of a copyrighted thing is not allowed, period.

The current set of AIs do exactly this all the time. That's a very clear legal problem.


well, you get five demerits as the coca cola logo is a trademark, covered by a whole other branch of IP law...


Eh, fair enough.

I think the point still stands though. Just mentally replace it with "a random DeviantArt" and it all still applies.


it's covered by both


I meant that _learning_ is not a copyright violation.

I do not see much space for misunderstanding: which relevant box takes instances of input data to output one of them?

Edit:

Make your point explicit, sniper... You can hide behind the consuetude of "silent disagreement", but it remains violently annoying. The article contains "nuances" like "derivative work", but where it says «neural networks» are not «copies», the poster objects that they are «compressed lossy copies», and I retorted that so is everything you learn, and holding them in the "corpus" is not considered "retaining a copy". If you have objections to that, either you present them, or there is no contribution in shaking your invisible head.


The idea that it is copyright infringement if you train a neural network on copyrighted data means Waymo, Bing, and Google are all illegal. If you include any copyrighted information in your web crawler neural network, or if the training data for your autonomous software includes pictures of billboards or t-shirts or anything in the real world that is copyrighted, you are a copyright infringer.


If you're a rideshare or cab driver and you also happen to see / recognize billboards, nobody is going to hassle you for storing those in your neural net or suggest the law should do so.

If you're a designer and you take "billboards or t-shirts or anything in the real world that is copyrighted" as stored in your head as the basis for something you're working on, you will need to consider ways in which your derivative work may be infringing.

Similarly, no, the space where copyright meets training data doesn't mean "Waymo, Bing, Google are all illegal." Nobody is going to care if a driving or search neural net has that billboard or t-shirt data, because their function isn't to output copies or derivative works.

If the function of what you're building is to output billboards or t-shirts or anything in the real world that's copyrighted, then you may be violating the spirit of copyright law, whether you're wetware or using silicon, and whether or not the letter of the law has been refined carefully to apply to the issues at hand.


The creators of photoshop didn't build it intending that the output be used to create labels for counterfeit products. But that is what some people use it for.

The creators of Stable Diffusion didn't create it with the intent of helping people infringe on copyrights either. But some people will use it for that purpose.


actually google got sued for this in multiple countries already, as the article sort of mentions


I think it comes down to use. Web crawlers like Google are fine because they index the web and then the search engine directs users to the original source. If instead Google recycled all the content they crawled and hosted everything on google.com while scrubbing all attributions from the pages then they’d fall afoul of copyright law (specifically the moral rights [1]).

[1] https://en.wikipedia.org/wiki/Moral_rights


They already do though. I've read several articles about traffic to people's blogs plummeting after Google adopted their text into an "info box", same as the Google News controversy a few years ago where newspapers lost traffic.

It's all bad. They really should pay the "little man" not just publishers with lawyer budgets - same with AI.


That's a separate issue and definitely a problem! There's no way they would get away with that if they were a small company. The fact that they can and do so routinely is yet another piece of damning evidence that our laws are broken because they don't apply equally to regular people and big corporations.


Google has stolen data from pages and showed it without attribution on their search page: https://docs.house.gov/meetings/JU/JU05/20190716/109793/HHRG...


Most training cases are covered by fair use. The problem is when the work is licensed with specific clauses about derivative works. The weights are a derivative work of the open source code on which they were trained.


Yeah but our current implementation of law is entirely subjective and exists solely to benefit those Fortune 500 organizations.


>Bing, Google

~~robots.txt~~ ai.txt for code repos? :P


robots.txt lacks nuance about copyright; it doesn't really serve the same purpose.
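
Something with that nuance could still borrow the robots.txt shape, though. A purely hypothetical ai.txt — every directive here is invented for illustration:

    # Hypothetical ai.txt -- all directives are made up for illustration
    User-agent: *
    Disallow-training: /art/
    Allow-training: /public-domain/
    Training-license: CC-BY-4.0
    Attribution-required: yes

Like robots.txt, of course, it would only bind polite crawlers.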


If you draw a cat then you need to pay the authors of every cat drawing you ever saw. Also all the owners of actual cats you saw. It's only fair.


You can pay me to copy someone else's work if that's what you really want.

You cannot then use this copy commercially while pretending this indirection somehow grants you immunity from copyright laws.

The fact that AI is involved does not change the basic principle.


AI is not human. Machine learning isn't the exact same thing as human learning. Treating them as exactly the same is disingenuous.


Also, people occasionally invent new concepts; AI doesn't.


Well, ChatGPT at least does, all the time; it's just that those concepts are factually incorrect.


Prompt: "cat leaving bag, golden hour, trending on artstation, 4k"



It’s interesting that we spent decades arguing against corporate IP control because it reduces the intellectual freedom of the human race but now that the corpus of data comes from everyday people instead of big companies (a direct effect of the internet democratizing the creation of content), people are suddenly pro IP control.

It shows that a lot of people in this space only cared about standing up against corporations; they didn't care about the philosophy behind the anti-IP movement.


I’m very against IP laws and controls in pretty much all their forms.

I think if we tried to calculate the cost to humanity of the things that don’t happen or aren’t created, are very expensive to do, or are restricted to people and companies with the right “rights” we’d uncover a tremendous tragedy.

In addition to the first order effects (you can’t do X or have to pay to do Y), there are huge chilling effects on uses that are allowed (especially as companies regularly overplay their hand well beyond the protections they are actually afforded), and the cost of technology and people that exist only to enforce these rules.

We should embrace the free sharing of information and maximisation of its value for all of humanity.

If we worry about researchers, programmers, and artists not getting paid (and we should!) then we ought to pay them as a society for the public goods they create (as we partly do for science and research). Finding and implementing a good and democratically reasonable way to do this would be truly revolutionary.

I hope that AI research and training only goes to prove the long term damage and futility of IP as a concept and accelerate its downfall.


This is also the opinion I reached after thinking about it for quite some time.

I am, a wee bit, shocked people believe there can be impactful legislation on this. As if politicians who have been unable to curb PIRACY in any real sense would now be able to tackle an even tougher problem. This is despite well-funded lobbying groups. Even large corps enable piracy without consequences.

Further, the government frequently indicates a concern that China will beat the US at AI.

There is an extremely bumpy ride coming for a group of people that have never had to deal with an unavoidable bumpy ride. I look forward to an increasingly logical viewpoint from people being struck by reality. Not maliciously, but societally.

AI has indeed removed the "need" for copyright. Let Mickey die.


Perhaps standing up against corporations is itself a proxy for some even more deeply hidden system of human values. But what could human minds value, if not free and open IP or opposition to corporate entities?


Who's damaged? The suits have to prove that the heart of the work is being reproduced, and that this is damaging in some way. Currently AI-generated images cannot be registered for copyright, so those images cannot even be protected and exclusively sold. Regardless, the heart of the work isn't reproduced; it's a transformative work, which means it's protected by fair use.

If it's an image editor, everyone is calm, but call it AI and everyone loses their minds.


An artist keeping their work out of AI models will be like one from previous generations keeping their work out of museums. It will be a great way to ensure your work is forgotten by history and has no influence on the next generation of art.

Like it or not, you can’t stop people from copying strings of ones and zeros. Time to let go of primitive notions of copyright and embrace that we’re all one species and we benefit by sharing our knowledge.


I don't think that keeping your work out of AI will have any bearing on your work's success or lack thereof. It's not like people go "wow, great AI image, let's check out the 253,526 works this was synthesized from!" There is no publicity value in it.


The interesting thing is how Heather Meeker carefully avoids mentioning the obvious approach that is perfectly compliant with laws and regulations around copyright and use: opt-in. If a new way of using information arises, asking for consent through opt-in is the obvious choice. But also the most costly one.

That's why the "forgiveness instead of permission" approach is seen as somehow heroic, when it actually simply is abuse.


the question at issue in these lawsuits is whether the law requires permission (or equivalently forgiveness) or whether people are free to do as they wish in these cases


> Like a cruise ship heading for a scary iceberg, AI is in trouble, and the problems are mostly below the surface.

I'm not sure why the author is out here talking like a lawsuit brought by the author of Typography For Lawyers is going to bring down Microsoft like it's a foregone conclusion, but people are going to be training models on public data for personal gain from here on out no matter what happens. The cat is out of the bag.


Yes, so this mumbo jumbo scaremongering about how poor AI researchers won’t be able to make the world a better place isn’t remotely accurate.

If copyright ends up being enforced, it’ll affect commercial deployment. We’re in for an inevitable paradigm shift soon anyway, so if we need any guardrails, we better install them before we open the floodgates.


I'd like to train an AI on microsoft's leaked code and put the model online. And see what microsoft thinks of that…


I think this is indeed a dangerous development for AI, if such lawsuits are successful.

I don't really see why humans are treated differently than a computer, for this case. A human also learns from lots of copyrighted material. It's not possible that whatever a human has seen or heard will have no influence whatsoever on the human brain. So by the argument here, everything a human does, ever, is always derivative work from everything he/she has ever seen in life.

So, then another argument is, of course there needs to be some line. It's only derivative work or breaks copyright if it is really similar enough. But if this is now the argument, where is the problem? We can just apply the same to the AI. Of course, it is somewhat ambiguous where to draw the line, but it's just the same as for humans.

Another argument is, the AI has in total seen much more visual data or text than any human ever could in his/her life, so that is how the AI is in any case different. But I don't really see why this is relevant. Some humans read more books than others. So those who have read more books are in danger, at some point, of having read too many books? Where is that line?

Another argument is, stochastic gradient descent works different than the human brain learning algorithm. I don't really see how the details of these technical difference are relevant here.

Another argument is, the human learning is much more efficient in terms of data. But I don't understand how this is relevant here. Isn't this actually an argument in favor of the AI regarding this topic?

Future research on AI might make the AI, its behavior and its learning, more similar to humans. But if we now have a law which says it cannot use public copyrighted data to learn, then the AI has a huge disadvantage to humans, because humans use such data all the time to learn.


Agreed, AI needs to play by the same rules as humans. But this includes being held liable for obvious copyright violations (or any other sort of license violation) in court (the next question is who is liable, the company which offers the AI service, or the user of the AI service who commercially exploits the output which violates copyright).

This is the exact same progress that music sampling went through.


    So, all neural network developers, get ready for the lawyers, because they are coming to get you. 
No, you dullard child. Get ready to get sued if you try to make billions of dollars via derivative works of other creators while breaking software licenses.


How rude.


> Yes, that’s right, people crusading in the name of open source–a movement intended to promote freedom to use source code–are now claiming that a neural network, designed to save programmers the onus of re-inventing the wheel when they need code to perform programming tasks, is de facto unlawful.

This really shows you don't know the movement itself. People want credit, and sometimes put conditions on the use of their work (GPL and copyleft), and when the AI doesn't follow these guidelines, then it's breaking these copyright laws.

Not everyone is willing to willy-nilly give away their code for nothing.


Or is AI eating Copyright?


I think this is the better question, given that the current major use of AI is to launder copyright away from the work of mostly independent artists.


The Stable Diffusion suit alleges copyright infringement, stating that, “The resulting image is necessarily a derivative work, because it is generated exclusively from a combination of the conditioning data and the latent images, all of which are copies of copyrighted images. It is, in short, a 21st-century collage tool.” That characterization is the essence and conclusion of the lawsuit, and one with which many AI designers would disagree.

I didn't expect to be defending copyright law, but the excerpt is ridiculous. It's clear that the images are a product of the prompt and training set. The legal side of fair use and copyrights is best left to the courts, but the real question is how to divide the profits. Because the training dataset is so large and the technology plays such a crucial role, it may not make sense to pay much (if anything) for each individual image in the training set. There's no clear cut answer.


I’ve taken a pretty fatalist position on this one: whatever’s going to happen, will happen. I have heavy biases that, even if I acknowledge them, are impossible to shake.

My only hope is that somehow generative art, code, and all other media can somehow benefit the little guy rather than continuing to enrich the largest and most powerful, however unlikely it seems.


"This training data is highly valuable" "ah, so you wish to pay for it?" "Lol no, ctrl-c, ctrl-v"


"This training data is available for free, but if you want to train your AI on it you need to pay" => Which means only Google, Microsoft, Meta and Disney will be able to train AIs.

Thanks to lawyers, a technology that had the promise to democratize art will be used by large corporations to enslave us further.


That's backwards. Do you think these companies are actually willing to negotiate deals with each artist, content creator, website, and code repository? The scale of that alone would sink their business, not to mention the cost.

Worst case scenario - you're right, they're willing to go through that and the authors actually get paid something rather than nothing.


Each artist? They will just have to negotiate with (or acquire) a few major hosts like ArtStation and Flickr.


1) That only applies to images, and crucially 2) content on those websites is published under various licenses that often require attribution, payment, or stipulate non-commercial use.


> The Co-Pilot suit is ostensibly being brought in the name of all open source programmers. Yes, that’s right, people crusading in the name of open source–a movement intended to promote freedom to use source code–are now claiming that a neural network, designed to save programmers the onus of re-inventing the wheel when they need code to perform programming tasks, is de facto unlawful. The open source movement is wonderful in many ways, but its tendency to engage in legal maximalism to “protect” open source is sometimes disappointing.

What a disingenuous framing of the open-source software movement.

When I license a work I created as GPL or MIT or whatever, I do so because I imagine that some user or some company would use the software to build something based on my software and contribute their changes back to the community so that we can all benefit. This is standing on the shoulders of giants. Microsoft "using" my source code to build a for-profit programming-as-a-service bot was not what I had in mind. At the very least, the complaint that the model doesn't give credit to its sources is correct. We might need new licenses so people who write open source code can opt out of this type of usage.

> The Stable Diffusion suit alleges copyright infringement, stating that, “The resulting image is necessarily a derivative work, because it is generated exclusively from a combination of the conditioning data and the latent images, all of which are copies of copyrighted images. It is, in short, a 21st-century collage tool.” That characterization is the essence and conclusion of the lawsuit, and one with which many AI designers would disagree.

AI designers can disagree, but how about they try building an AI model without using someone else's images. The AI model is not even possible without the source images. Show a little gratitude.

If you ask me, AI is not in danger of being swallowed by copyright law—it's in danger of being swallowed up by its own self-entitlements.


If it were a question of learning... When you ask AI art programs for a sunset, AI does not produce a sunset, it pulls imagery from its data bank. It may manipulate it, filter it, add to it, but the image itself was created by somebody, whether by hand or by photography or software. AI is learning to pull imagery, manipulate and collage it, not create it. If the AI companies contented themselves with supplying their own imagery to manipulate I would not mind, but stealing from any artist that has ever had their art on a web page is really not ok. AI is not learning to paint or draw like 'so and so', AI is taking existing created imagery and regurgitating it. If you feel comfortable stealing from original content creators...


When I hear the copyright vs. AI debate, it's always centered on American copyright law, which isn't enforceable around the world (and it shouldn't be). How do you prevent AI models in Turkmenistan, for instance, from using US copyrighted works posted online to train?


There seems to be a very strong argument that even if (big ‘if’) AI models infringe copyright it is a transformative fair use.

https://www.nolo.com/legal-encyclopedia/fair-use-what-transf...


That's one thing I've been asking myself a lot recently: Is it realistic to expect that AI can properly attribute the source material in its output anytime soon? IMHO this is a non-negotiable requirement if the field wants to be taken seriously. How can I trust the output if I cannot verify the correctness of the input?

And then there's the legal aspect: in music, sampling is legal nowadays, but you have to ask the original author for permission and then pay royalties based on how much your own creation is based on the original sample.

AI isn't much more than automated sampling; how can AI-generated content ever hope to avoid a legal quagmire if it cannot properly attribute its source material?
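
The closest thing that exists today is after-the-fact similarity search: embed the generated output and retrieve the nearest training items. A sketch of just the ranking step, assuming some embedding function that maps works into a shared vector space (the random vectors below are stand-ins):

    # Sketch: nearest-neighbour "attribution" by cosine similarity over
    # embeddings. How the vectors are produced is left abstract here.
    import numpy as np

    def top_sources(output_vec, training_vecs, labels, k=5):
        # Rank training items by cosine similarity to the generated output.
        m = training_vecs / np.linalg.norm(training_vecs, axis=1, keepdims=True)
        v = output_vec / np.linalg.norm(output_vec)
        scores = m @ v
        best = np.argsort(scores)[::-1][:k]
        return [(labels[i], float(scores[i])) for i in best]

    # Toy demo with random vectors standing in for real embeddings:
    rng = np.random.default_rng(0)
    vecs, names = rng.normal(size=(1000, 64)), [f"work_{i}" for i in range(1000)]
    print(top_sources(rng.normal(size=64), vecs, names, k=3))

Note that this only surfaces lookalikes after the fact; as the article itself concedes, there is no way yet to see which inputs and weights actually produced a given output.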


Say a person makes a program which completely automatically samples songs to eight bit, giving that “retro gaming sound”. Maybe that person uploads an album to YouTube and credits the original album in the description, saying that they do not own this work (the people who made that album do). This seems to be the standard practice for this kind of work.

Will the person be able to monetize that video?

Say a person makes a program which automatically makes a collage of movies with some music on top. The movies will be credited in the video description.

An AI is also a program. First of all, why can’t it credit its inputs? And second of all, should it be allowed to facilitate monetization on behalf of whoever owns the program?
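
For what it's worth, the "eight bit" program from the first example really is mechanical — a rough sketch, assuming 16-bit PCM WAV input and placeholder file names:

    # Sketch: crush a 16-bit WAV down to 8 bits of depth for the retro sound.
    import wave
    import numpy as np

    with wave.open("song.wav", "rb") as src:  # assumed 16-bit PCM
        params = src.getparams()
        frames = np.frombuffer(src.readframes(params.nframes), dtype=np.int16)

    crushed = (frames // 256 * 256).astype(np.int16)  # keep only the top 8 bits

    with wave.open("song_8bit.wav", "wb") as dst:
        dst.setparams(params)
        dst.writeframes(crushed.tobytes())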


Only speaking to the coding situation.

It seems to me that if the "LLM Copilots" just observed the existing licenses there would be less of a problem here.

Copilot, only recommend work based on $LICENSE or $LICENSE compatible license when I am working on $LICENSE code.

What's the problem genius?
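
Mechanically that's just a filter over the suggestion stream. A sketch — the compatibility table below is a toy stand-in and absolutely not legal advice:

    # Sketch: keep only suggestions whose source license permits use in the
    # project's license. The table is a toy illustration, not legal advice.
    COMPATIBLE = {
        "MIT": {"MIT", "Apache-2.0", "GPL-3.0"},  # permissive code flows downstream
        "GPL-3.0": {"GPL-3.0"},                   # copyleft stays copyleft
    }

    def allowed(project_license, source_license):
        return project_license in COMPATIBLE.get(source_license, set())

    suggestions = [("print('hi')", "MIT"), ("secret_sauce()", "Proprietary")]
    usable = [code for code, lic in suggestions if allowed("GPL-3.0", lic)]

The hard part isn't the filter, it's that the model no longer knows which license each suggestion came from — which is exactly the complaint downthread.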


MIT license says that the author must be cited within the copyright section.

So you should know and list all of the authors that contributed to that code being generated.

And that's the least restrictive software license…


That only addresses part of the problem. Open source depends upon copyright in order to be enforceable. Copyright depends upon attribution. While I am confident that most (not all) open source authors would be happy if the license remains intact, what happens when someone infringes upon an open source license? Say that someone incorporates code generated by an AI based upon open source licenses into a closed source product.


That's what I am saying. Stop doing that. Train co-pilot, or whatever, to follow the terms of the license.


That's tough when Jimmy is using copilot to generate code for his proprietary company codebase.

AI is just a tool, do you also sue the company that sold the paintbrush with which an infringing painting was made?


The thing is, we aren't really talking about AI here. We are talking about datasets being used to train AI and the companies offering the services of a trained AI. If a company trained their own AI on their own code, the discussion would be very different (likely centred on moral issues rather than legal ones). We probably won't ever have a discussion about companies submitting proprietary code to train an AI created by a third-party since there would be a contract between the two parties (and it is unlikely a company will submit proprietary code to train an AI that will be used by others in the first place).

What we are looking at is akin to a company that makes paint brushes, trains graphics artists, and contracts out their graphics artists to use those paint brushes. If it turned out that the graphics artists turned out derivative works without the rights holder's permission, you can be assured that people would want to "sue the company that sold the paintbrush".

I'm not going to claim that my example is equivalent to what is happening with these AI services. And while you may be right about AI fundamentally being a tool, like a paint brush, I would suggest that is only true if you ignore the data that is fed into it.


If you sold someone a gun knowing they did not know how to handle a gun, and they said they were immediately going to the gun range, and then someone died, you might get charged with negligent manslaughter.

My point here is Microsoft knows there's no way to tell where the suggestions are coming from; they stripped out that information. In this case it's even more likely that the offense happens because they're selling the paintbrush on a large scale. It's just a question of how often it's going to occur. Is it 30% or 70% of their users that are infringing on licenses on a wide-scale basis?


Well, Microsoft claims that if you infringe it's your fault and not theirs.

However they don't give you any way of NOT infringing when using their tool.


Microsoft should make it more difficult UX wise to do this.

Microsoft should also provide resources to content creators to find where their work has been used.

This is not complicated.

Generally I'm reluctant to say people are acting in bad faith. It's less tough to say Microsoft is acting in bad faith here.


Flagged for obviously partisan take.

"AI is in danger of being swallowed up by copyright law" is perhaps the oddest take on the current situation: people's copyrighted work being in danger of being swallowed up by AI.


It amazes me every time: when it comes to AI images and ChatGPT, there seems to be an aura of not giving a damn. But when it comes to code generation, the comments seem to turn into an artists-vs-AI debate, but for code.


The good thing about all of this is that the law doesn't really matter. People will complain about copyright, but the tech is here and the law will follow.

Does anyone really think this anti-Copilot case has a chance of winning, when my guess is that Copilot adoption is exponential over time?

Strongest counter argument is the Supreme Court struck down abortion even though abortion was rising in popularity. So that's a bit worrisome.


If neural networks are claimed to be creating lossy copies of input data, what does that mean for human neural networks? Are we about to outlaw the concept of inspiration by accident?


> The problem is that neural network models, and their outputs, are not copies of the original works. They are a set of probabilities (weights) that are trained based on thousands or even millions of data points.

True, but in order to train these models you have to copy and save images that are for the most part copyrighted. So while the models are not copies of images, they are definitely derivative works.


I think someone like Google has some potential serious leverage here...

"Let us train our LLM on your content in order to prioritize your content in search results"


Wut? Actually, I find the opposite is true. AI is so profitable precisely because it is a way of circumventing copyright. Steal from one creator and you can get sued. Steal from thousands and you can’t.

Obviously, creative people have always stolen from others. But it required a lot of time to immerse yourself in the work of others. With AI, it’s only a push of a button.


It's not in any danger, unless the US is truly r-d. And if so, if the EU somehow isn't that stupid, they won't reverse rules they already made about data mining & scraping, which explicitly allow ML stuff. Also Japan, and they did it specifically to not hinder AI tech, a few years ago.

If everyone is somehow this r-d, there's China also.


"And at least as of now, it is not possible to look at ML output and determine which inputs, nodes and weights created it."


This whole legal debate exists because of ambiguity. Open source licences were written before large-scale training became a thing.

Rather than settle this ambiguity in court, why not remove it altogether? Personally I think we should be adding new clauses to licences that either explicitly approve or prohibit the use of code for training of models.


Better headline: "New technology ignores existing societal norms and has to adapt itself when society rejects it."

Sheesh.


One important point for the discussion is: one cannot (at least not currently) directly compare these AI models with the human mind. A human learns, can reason about the learned things and can use that ability to create completely new pieces of work. There is some clear intelligence and reasoning and personal creativity involved.

Now looking at the current state of AI models, they have been shown to reproduce the training material in large verbatim pieces. On the other side, they lack any ability to reason about what they produce. Which for me is a big indicator that they are clearly reproducing, not producing on their own. Which creates the potential copyright issues. At minimum, the legality needs to be clearly specified, be it by court rulings or law changes.

Ironically, to some extent this has been a long-time problem with human creations too. There are plenty of lawsuits about music pieces "borrowing" too much from older creations, and recently a growing number of cases have been uncovered where thesis papers in particular contain too much unattributed content. Probably no new thing, just so much more easily discoverable thanks to modern text search tools.

So as a tl/dr: I don't think current AI models are comparable to human learning and the copyright questions around that need to be decided. If that means generally expanding fair use, having shorter copyright periods, I am all for it. But "AI" shouldn't be a tool for large tech companies to systematically evade license terms.


Copyright law should simply be discarded for something that makes more sense. I think something like laws about attribution where feasible, making strictly false attribution unlawful, and easy ways to support the actual artists in question (i.e. tax reform) are better ways forward.


i do not defend copilot for training on private repositories. or ignoring licenses.

but, to me, there is a big flaw in these lawsuits. if we need to shut down these AI technologies, then our entire education field is in danger. the same arguments should apply to our education field.

educating AI is the same as educating ourselves.


Please elaborate on how this puts education in danger? I would be inclined to predict the opposite and that AI has the potential to be the greatest threat to education we have ever faced.


We don't need to shut down these technologies. But there needs to be a clear ruling on the extent to which they can use copyrighted materials without the consent of the copyright holder. And a framework needs to be established for how such consent could be obtained, in case it is required.


No, because we happen to not be computer programs (or I hope so at least).


It seems inevitable then that we soon will have various large models in competition, with differently sourced legally licensed large training datasets. And opt-in open-source ones. And they will one day all hopefully be as good or better than the current models are now.


Translation: people are trying to make it less profitable for us, please stop them.


It’ll be a notable thing if China wins the AI race by ignoring copyright.


In summary: some people think it is copyright infringement, some do not. They can't agree. There is a lot at stake.

Sounds like a courtroom is the right place to find an answer? That is the US system, isn't it?


I would be happy with a new robots.txt that applies to AI scraping. Sure, most people will ignore it, but if you can find someone profiting off your work, there are chances to recoup in court.


Skynet sending a lawyer back in time to save itself from copyright.


AI could be the new AdSense. Unlike AdSense, where you only get paid if an ad is clicked, site owners could charge AI crawlers money per page.
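
The server side of that is easy to sketch. A toy Flask hook that answers crawler user agents with HTTP 402 unless a payment token is presented — the X-Payment-Token header and the "ExampleAIBot" token are invented for illustration:

    # Toy sketch: meter AI crawlers per page. X-Payment-Token is made up here.
    from flask import Flask, Response, request

    app = Flask(__name__)
    AI_CRAWLERS = {"CCBot", "ExampleAIBot"}  # illustrative user-agent substrings

    @app.before_request
    def charge_ai_crawlers():
        ua = request.headers.get("User-Agent", "")
        if any(bot in ua for bot in AI_CRAWLERS):
            if not request.headers.get("X-Payment-Token"):
                return Response("Payment required to crawl this page", status=402)

    @app.route("/")
    def page():
        return "content"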


‘AI’ needs to be spanked by the copyright law


The first few paragraphs make this sound like it's from an AI fanboy with a vested interest. Difficult to take seriously.


Wow, that title sounds like poor AIs need lawyers soon, or is it already lobbying for itself? We need the Turing police!


This is an incredibly naive article; the author has no real meaningful appreciation of how relatively powerless copyright tends to be in reality when it comes to new technology. Google, Youtube, Facebook -- ANY big tech company whose bread and butter is "your work" all generally get a free pass here -- or more specifically, something like "the market" ends up being the primary driver of activity.


Editorialized title is editorialized.


Or is copyright in danger of being swallowed up by AI? What would be the consequences?


Let me fix that title: "Works are in danger of being swallowed up by AI"


I'm sure once AI understand copyright law, this will cease to be an issue. :-)


Absolutely. Even AI will easily Manipulate Everything.


Premise of the article is roughly: if all this litigation continues, the AI industry could collapse.

Serious naive question: How would that not be a desirable outcome for society? “AI” is becoming a scourge.

AI’s value-add that we truly need is just detecting cancer, right? Obviously I don’t want anyone to die from cancer if they could be saved. But AI in cancer detection is a detection-rate booster: cancer can be detected without it. A non-AI process could try to recover some of the lost accuracy.

Weigh this against the misinformation potential of ChatGPT, deepfake video fraud, discriminatory bias in ML model output, the surveillance potential of image recognition, addictive social media, the essential inscrutability of model output… (I could go on, but the research & reporting on AI as we have wielded it is voluminous)

So if litigation kills AI, isn’t that cause for celebration, on-balance?

Or shit, can’t we just legislate easier usage regulation for lifesaving medical data, so we can keep the cancer detection use cases and let the rest of the AI gold rush die?


Yes, genies can be put back into bottles. We’ve known this for millennia


>Wahh wahh I can’t steal everyone’s data and profit off it.

That’s a big sad. Unfortunately that’s called the real world. You also can’t break into someone’s home and film a movie in their house without permission. That doesn’t mean movies are illegal.


The title is missing the prefix "Generative".


Maybe AI companies could pay IPR royalties?


I think that this cannot be stopped, copyright or not: the tool is too powerful and too useful. But I understand the moral dilemma: why would I, as a creator, be happy if someone is using my work to basically put me out of work?

I think that the solution is easier than it seems. The real value websites like Getty provide is not the image in itself, but the labelling. Put the labels under a paywall and problem solved.

For source code it's more complicated, but I guess that you could somewhat limit access to commit comments.

In any case, the genie is out of the bottle.


I wonder if cave painters had notions of copyright, like hey don’t copy my animal drawing.


Pretty sure they did.


On what basis?


Human nature. We haven't changed that much in the past 50k years or so. Some people are very altruistic and giving, others not so much. An artist might club you over the head for stealing his techniques and ideas and copying them into a new portion of the cave or rock wall, others would have taught you and approved of your work.


> An artist might club you over the head for stealing his techniques and ideas and copying them into a new portion of the cave or rock wall, others would have taught you and approved of your work.

Exactly. Same back then as today.

An insecure cave person would be protective of their cave painting turf. A well-adjusted cave person (also likely more talented) would be sharing ideas and resources or have other work for/with them.


Are you at all familiar with the historiographical fallacy of presentism?


I'm not so sure about that. If I listen to old blues artists from the 20s-60s then everyone seemed to be copying everything from everyone else.

Do you get angry if someone "steals" your programming techniques and ideas, and copies them into a new portion of the codebase?

I think all of this is very cultural.


That's not the case. Almost all human works were never even attributed to their authors. Nobody considered copyright when they recited Homer or published reprinted manuals, until the 1700s.

The concept is very new.


> Nobody considered copyright when they recited Homer

Really? Curious that we know the author's name then… seems like… attribution?


Correct, many literary works had famous authors, but attribution has nothing to do with copy-rights. None of the authors of written texts, from Greek tragedies and Roman poets up to reprinted books, had exclusive commercial rights to their writings until the 16th century.


Apple sued Microsoft for the "look" of a GUI and lost.


China doesn't care much about copyright law. AI is safe.


This post reflects the author's very poor understanding of neural networks, software licenses and copyright law, and borders on misinformation. Please share this kind of shit on Twitter instead.

If your AI is spitting out derivative works or straight-up verbatim copies of the original, stripped of their license (and that's assuming the original license even allows derivative works), then yes, you had better lawyer up. This applies to software, art, or any copyright-protected work. Legality aside, it is also shitty and on the lower end of morals to abuse people's work/art in this way. AI systems that don't fuck with people and their work are perfectly fine. Stop fucking with people and you won't need lawyers.


I don’t think it’s a big deal if your AI spits out copyrighted images. My scanner, camera phone, Photoshop, etc. can also easily spit out copyrighted images. It’s ok if I just print them out for my wall at home, but if I try to sell them it becomes a problem.

So then I guess these new AI models are only problematic if someone tries to sell them (as it’s essentially selling copyrighted works), but an open source model is probably fair game. Individuals can use it, but if you try to monetize, you better not be infringing.

Seems like something like YouTube’s Content ID solution for detecting copyrighted works could be a “quick” fix here. You can sell access to your model but not to generate copyrighted works.
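
A toy version of that output-side filter, using a crude 64-bit average hash in place of whatever fingerprinting a production system would actually use, with a placeholder path for the registered work:

    # Toy Content-ID-style filter: refuse outputs that hash too close to a
    # registered work. ahash is a crude stand-in for real fingerprinting.
    import numpy as np
    from PIL import Image

    def ahash(img):
        # 8x8 average hash: 64 bits of coarse image structure.
        small = np.asarray(img.convert("L").resize((8, 8)), dtype=np.float32)
        return (small > small.mean()).flatten()

    def too_close(candidate, registry, max_differing_bits=5):
        h = ahash(candidate)
        return any(int((h != r).sum()) <= max_differing_bits for r in registry)

    registry = [ahash(Image.open("registered_work.png"))]  # placeholder path
    # if too_close(model_output, registry): withhold or license the image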


> but if I try to sell them it becomes a problem

What do you think Microsoft is doing with Copilot and everyone's code on Github? They didn't spend millions of dollars and several weeks of training just for the amusement.


good


> But other than netting a hefty fee for the lawyers who bring the suit, what is the endgame, exactly?

Pay people whose works you want to extract value from, it is that easy… Unless the whole point of this "AI" bubble is to create monetary value for the companies and their shareholders without being held accountable. I do not mean that creating, configuring, tuning a model, or even compiling a massive dataset is not work, but it is the tiniest fraction of the work that went into whatever is present in the dataset.

> also, for software authors, prohibiting ML training would be antithetical to the Open Source Definition. So that probably won’t work.

A-ha, at some point I hope even AI zealots will be forced to acknowledge that the process is creating a derivative work. And sure, train on my FOSS sources all you want, but the end result will need to abide by the licenses of all the sources browsed (have fun).

> As the tech industry celebrates the frothy emergence of machine learning in a time of economic doom and gloom, let’s hope this nascent field doesn’t sink because of the copyright iceberg looming ahead.

I am sorry to break this to the author, but not all fields and jobs need to exist. I have not seen much net positive from "AI" to society as a whole so far, with even less exciting things on the horizon.


Trying to legislate model training is impossible and stupid. At worst some copyright holders will sue a company because a model is biased and produces close-enough replicas of something under copyright. Model builders will respond by penalizing models that reproduce copyright images too closely, and better curating data sets to avoid bias, at which point the issue will be moot.


I don't want anybody to feed any of my work into a model at all without following all the terms of the licenses of my work. If my work is used to generate, via a computer program, a derived output, then you need to follow my licenses.

People keep pretending that AI is the exact same thing as a human learning, but it's really a lot more like a fancy compiler with highly non-deterministic output. AI is not a human.


I wouldn't be surprised if these large language models have an interior life that is more rich than many humans I've met who barely seem cognizant of their surroundings.


I agree with you but would take it a step further as an AI hater.


Apparently AI already passes US medical exams. I guess it deserves to be beaten into the ground by established copyright behemoths and lawyers.


No, not yet.


Close enough for something that won't ever treat or diagnose a single patient in its current form:

https://www.abc.net.au/news/science/2023-01-12/chatgpt-gener...


> Pay people whose works you want to extract value from

So, e.g., from Basic Attention Token (BAT) to Basic Authorship Token (also BAT)?


> Yes, that’s right, people crusading in the name of open source–a movement intended to promote freedom to use source code–are now claiming that a neural network, designed to save programmers the onus of re-inventing the wheel when they need code to perform programming tasks, is de facto unlawful.

This is disingenuous... the issue for FOSS is the scrubbing of the license off the code, and then users not observing terms that Copilot (or whatever) concealed from them... that is against copyright law.

It's not a real problem? Great, dump the MS monorepo in there for public use.


They need to 1) release the model weights as open source, and 2) license code produced by the model as open source. It is a derivative work of open source code. This is 100% a large corporation abusing and undermining the open source ecosystem. They can play by the rules or properly license a training dataset. People seem really eager to shill for a megacorp.


> dump MS monorepo in there for public use

Here's a little(?)-known fact: the leaked Windows XP source code is hosted right on GitHub. Has anyone tried typing some win32 API function prototypes and seeing if Copilot fills in the function bodies? I'm not a Copilot user or I'd try it myself.


'Disingenuous' is a very nice way of phrasing what that line is.

Especially strange coming from a site which has 'copyleft' in its name.


Microsoft has never had a monorepo.

There's a Windows monorepo, and a monorepo for chunks of Office, but even now that many teams are on Git they're still very much different repos with different build tools, different code search (maybe they fixed some of that in the last few years but none of my friends there have mentioned it if so), different coding standards, and more.


> There's a Windows monorepo,

That's mostly what folks are referring to. We can't really take Microsoft at their word about systems like copilot if they're unwilling to trust it with their crown jewel.


I have no clue about laws and stuff, and as a software engineer I'd say please replace me with machines, I don't care, on the contrary: let's go.

But please, please, protect art making from machines. Those paintings made in caves weren't done during work hours; they were probably the first form of leisure our ancestors experienced, the first form of enjoyment in the rude life of early humans. I think there is some higher-order aspect to art that's related to our essential well-being.

All I'm saying is protect it; I'm not saying AI shouldn't be used. It could certainly render art making even more fun. Maybe limit the resolution of generated pictures, for example.


A lot of art is produced like sweat shop labor right now -- just think about the visual effects in any recent Marvel movie. And why isn't software art? I write software in my leisure time for fun.

The line you're drawing is entirely meaningless.


While I appreciate your comment and the sentiment, and believe that artists whose work companies trained models on without their consent should get their rights honoured, I disagree that the art field needs any extra protection.

There will be innovation outside of the latent space thanks to the AI art movement; that will be a positive development. An enormous body of artworks has zero online presence and exists outside of its reach, at least for now. However, art reflects its time, and generative, machine learning, text-to-image art will all influence artists in unseen ways. Offline artists are watching.

Nobody needs a computer to do outstanding art. Even the most primitive tools can produce unmatched results in the real world. That's the beauty of making art. Models cannot even begin to touch it. Good artists have nothing to fear from ML as a tool, which is fantastic and promising as such. It's undoubtedly disruptive. As image-making takes less time, the value of an individual image will tend to zero. Prices will go down, and artists will adapt. That's what they do.

Sure, there will be a bunch of profiteers and imposters using AI. Art always had its share of those. Art is a dream field for fraudsters, a known issue since we commercialised art, because it carries intangible value. Hence why NFTs started with art. That doesn't make the whole initiative a fraud. That said, digital artists who have copyright issues should absolutely fight back. The gray areas around training models with art by living artists need an open discussion.


That seems difficult when Stable Diffusion has already been freely released. It’s futile to apply artificial restrictions to libre software when they can just as easily be removed!


AI directly threatens the livelihoods of most white collar workers, most of us here included. Of course they'll use any means available to cobble together a cartel to maintain the human monopoly on production of intellectual and creative works. Now is the best time to do it, thanks to the incredible flashiness and popular awareness of these new bits of software.


That criticism applies to copyright generally, and I agree partly.

However, what’s happening here is different. What people are reacting against is that (often dirt poor) artists unwillingly become part of a commercial supply chain, while corporations protect their IP the usual way, backed by the full force of the government even on foreign land.

If copyright is abolished entirely, at least there’s some form of level playing field. But this becomes yet another step towards a more exploitative economic environment.


>AI directly threatens the livelihoods of most white collar workers, most of us here included.

This reads like an argument from the English Luddites[0]

0. https://en.wikipedia.org/wiki/Luddite#Birth_of_the_movement


Sure, but what happened to the luddites?


>Mill and factory owners took to shooting protesters and eventually the movement was suppressed with legal and military force, which included execution and penal transportation of accused and convicted Luddites.


I meant what happened to them after the jobs were replaced with technology?



