The fact of the matter is that the AI companies don't want to ask for permission, because people will say no. Or worse, ask for attribution or even payment. There is plenty of copyright free/public domain material out there, but what the customers of AI people want isn't available under those terms.
The code to train an AI is not enough to make a product and these people have nothing to add themselves, so they take what others made and use that to make a profit. They can make or pay for their own paintings, their own pictures, their own music, but that would require putting in too much work or paying too much money.
It's very possible that a judge will rule that AI models do not violate copyright. If that is the case, I hope new legislation will correct that oversight very quickly.
Do you ask for permission when you train your mind on copyrighted books? Or observe paintings? Or listen to music? Do you ask for permission when you get new ideas from HN that aren't your own?
Humans are constantly ingesting gobs of "copyrighted" insights that they eventually remix into their own creations without necessarily reimbursing the original source(s) of their creativity.
Time to put the horse back in the barn, cars and trains are here.
Yes, that’s exactly what happens when you buy a book or pay for a music subscription. If the work is in the public domain, then global permission to observe and copy the work has already been granted.
> Do you ask for permission when you get new ideas from HN that aren't your own?
You don’t need to. It’s implicitly assumed, by virtue of publishing in a public forum, that the author is granting permission for people to read their comments and ideas, and remix them as they wish. That permission doesn’t include exact replication, but reading and understanding are assumed; otherwise, why did the author publish it?
> Humans are constantly ingesting gobs of "copyrighted" insights that they eventually remix into their own creations without necessarily reimbursing the original source(s) of their creativity.
Correct. Literally everything produced by a human is automatically copyrighted. But the manner in which a work is published creates implicit licenses for the public to consume it. If you publish in public, you automatically grant licenses for the public to consume and transform it.
If a human transforms an idea, it automatically becomes a new idea with its own copyright. The same doesn’t apply to AI because they’re not human, and thus the law generally doesn’t recognise them as having the ability to create or transform ideas. If you believe AI can create and transform ideas, then you need to lobby for the law to recognise that ability, but right now, only natural humans have that ability according to the law.
> Yes, that’s exactly what happens when you buy a book, or pay for a music subscription. The work is in the public domain, then global permission to observe and copy the work is already granted.
You can buy a book, read it, sell the book, and then write and sell another book based on the ideas contained in the first book (Baker v. Selden). This is the cornerstone of contemporary copyright law. Or read the book on a shelf of a bookstore while the clerk is asleep. Or borrow the book from the library, or any other manner where direct compensation of the author is nowhere to be seen.
Copyright is consistently interpreted in alignment with the needs of public learning, both by protecting the authorial incentive as well as protecting the public need for knowledge.
Following this logic, isn't training AI on Github or Deviantart 100% fair game then? It's not like OpenAI is infiltrating computers and reading hidden away data.
Unlike forum comments, GitHub code generally has an explicit license attached which you'd have to respect - you know, for instance by giving attribution to every MIT-licensed source that was used.
And even then, let's say someone releases a book with all your HN comments: you are definitely entitled to sue them for copyright infringement. Here's some info from the BBS era, which is still relevant today: https://www.templetons.com/brad/copymyths.html
For example, the "clean-room design" method of copying a work exists precisely to avoid potential copyright issues. One team reads the original work and writes a description in such a way that it cannot possibly be infringing, and a second team reads the description and creates the new work. This avoids any chance of someone reading the original work and incorporating potentially infringing aspects into the new work.
A similar ruling would also be a disaster for software, as our tools of expression are very restricted: code is based on Boolean algebra and predicate calculus, practice guides like design patterns, and books teaching algorithms and data structures.
There are lots of ways to write bad code and only a few ways to write good, correct code. Recognizing this led me to replicate known working code, code I had created, for multiple employers. So whose copyright did I intentionally violate?
I think we are attacking the wrong problem WRT ML and copyright. To me, ML shows that the foundation on which copyright is built is a lie. We should use ML to break copyright for code.
Personally I reject that. ML needs to be restricted heavily.
Sounds like an option instead of a need.
Forklifts are "agents of humans" but you still need a license to drive one.
It's pretty obvious to me at least that AI bros are using these tools recklessly and inappropriately, without regard for licensing or copyright, and therefore I am proposing that the tools need to be regulated.
Simple as that.
When you buy a book, you’re not paying a licensing fee; you’re exchanging money for goods. Owning a copy of the work grants you very few rights, and those are almost all to do with distribution. None of them is a right to read it.
> You publish in public, you automatically grant licenses for the public to consume and transform it.
By this interpretation, all the artists upset by Stable Diffusion have given tacit permission for their works to be used, since those works are published in public. Even though those works are posted to websites, the artist has not granted any rights to the viewer of the work.
> only natural humans have that ability according to the law
The law is not explicit about this, and we have case law that describes non-human entities as having rights associated historically with personhood. This is definitely not clear, nor is it obvious.
You are absolutely buying a license to read the material when you purchase a book. That's why books cost more than the paper they're printed on and why pirated books are illegal. The "distribution" rights you refer to stem from the "first sale" doctrine, which acknowledges that the first sale (e.g., you buying a new copy of a book) of a physical object embodying a copyrighted work grants limited distribution rights.
It's not just assumed, it's celebrated when a work of art gathers fans who produce their own, inspired content.
Not sure why it needs to be over-complicated or different for silicon neural nets. But I think it will get very over-complicated, if not politicised, in the coming years.
It is implied that if you are using the work by yourself, or via a tool you made yourself, it's fine.
However, works that you redistribute, whether by copying them yourself or indirectly via tools (silicon neural nets being one example), instead require a "wide redistribution license agreement", and those are implicitly limited by default unless the work is released under something like a public domain license.
No, you don’t. That would fall under the category of “derivative work”, which is still the intellectual property of the original author under the copyright laws of most jurisdictions.
Therefore, using a training dataset does not constitute copyright violation.
If the AI outputted an exact copy (or a close enough copy that a layman would agree it's a copy), then that particular instance of the AI's output is in violation of copyright. The AI model itself doesn't violate any copyright.
> Therefore, using a training dataset does not constitute copyright violation.
It's not for you to decide that. Different jurisdictions will have their own process for deciding that and none of them are based on the opinions of random commentators on internet message boards.
Also, please bear in mind that my comment was a reply to a specific statement (repeated below) and was not talking about AI in general:
> You publish in public, you automatically grant licenses for the public to consume and transform it.
^ this statement is not correct for the reasons I posted. AI discussions might add colour to the debate but it doesn't alter the incorrectness of the above statement.
> If the AI outputted an exact copy (or a close enough copy that a layman would agree it's a copy), then that particular instance of the AI's output is in violation of copyright. The AI model itself doesn't violate any copyright.
That assumption needs testing in courts.
As I've posted elsewhere, there have been plenty of cases where copyright holders have successfully sued other creators over new works that bore a resemblance to existing works. It happens all the time. I remember reading a story about how a newly successful author was being handed ideas from fans during a book signing, only for one of her representatives to intercept them each time. When she later asked why the representative took them, the representative said "it's because if any of your future books follow a similar idea, that fan could sue. But if we can prove you haven't read the idea then the fan has no claim" (to paraphrase).
Experts don't all agree on where the line is with similar works created by humans, let alone the implications of copyrighted content being used as training data for computers. And this is true for every jurisdiction I've researched. So to have random people on HN talk as confidently as they do about this all being perfectly legal is rather preposterous. You don't even fully grasp the intricacies of copyright law in your own jurisdiction, let alone the wider world. In fact, this is such a blurred line that I wouldn't be surprised if some cases would get different rulings in different courts within the same jurisdiction. It's definitely not as clear-cut as you allude to.
My experience with ML tools like co-pilot is why I reject copyright claims on ML systems. They are a tool that generates original work based on my instructions, not unlike a paintbrush, Photoshop, or a CNC machine. My instructions were based on my exposure to copyrighted works.
I use co-pilot as an accessibility device enabling me to write code again. Like speech recognition, co-pilot is a force multiplier IF you change how you work. If you keep using the habits formed by typing, you will get shit results.
The end result of the shift in how I work is that I now know how to tell co-pilot to write code in my style. My co-pilot-generated code is no less my code than what I write by hand. Co-pilot acts as an extension of my brain, not my fingers.
Is my co-pilot generated code copyrightable? I say yes because it is the result of this human's creation and instruction.
>>You don’t need to. It’s implicitly assumed, by virtue of publishing in a public forum, that the author is providing permission for people read their comments and ideas, and remix them as they wish.
Ideas are not eligible for copyright protection.
If we agree on this, what we mostly need to resolve is to what extent a human should not be allowed to use publicly available data to make this tool, given that he is allowed to use publicly available data to make anything else.
Plenty of people have been successfully sued because their work was too similar to existing content.
This isn’t a new concept that AI is throwing into contention, it’s literally just companies trying to side step copyright law because of “disruption”.
Source: I work for a company in this field and we do gain permission from creators before training our models on their content. It’s very possible to operate this way but a lot of companies simply choose not to.
Little do some of them know that OpenAI was able to get permission from Shutterstock via a partnership to use their copyrighted images in the training set for DALL-E 2. There is also a reason why Dance Diffusion was trained only on public domain music and on copyrighted music with actual permission from the authors. If they did otherwise and monetized copyrighted music without permission from musicians or record labels, they would be sued into the ground.
With the recent cases of Getty, Shutterstock, and even as admitted by the CEO of Stability themselves, the way forward for using copyrighted images in a training set for commercial purposes is via licensing. Neither Getty nor Shutterstock is looking to ban it, despite the AI bros claiming that these companies are trying to.
If not, just train only on public domain images to avoid these legal issues.
Where in the guidelines does it mention that one cannot say 'tech bro, finance bro, pharma bro, and more recently and most actively the crypto bro'? These have been there for years despite the guidelines existing.
Me saying 'AI bros' is no different. Given it is fine to mention the tech bros, finance bros, crypto bros and the other, then it is also fine to say 'AI bros'.
I’m not able to read billions of books in less than an hour.
Even if we agree that machine learning is like human learning, scale commonly matters in law.
I am not a lawyer, so the following is only my opinion.
Scale matters, but so does the legality of the thing that scales.
Reading two dozen books by other authors, or studying hundreds of artworks, or visiting the museum of awesome statues every week, in order to get inspired for one's own novel/painting/sculpture, isn't illegal.
So a lawsuit will have a really hard time arguing that it somehow is a problem if it's two dozen billion books/paintings/sculptures. Because such a lawsuit would suddenly need to explain why the smaller scale is also problematic, only less so. And given that this is basically how art has worked ever since the first human had the idea to paint pictures on a cave wall, that's a hard sell.
But there are no copies. For example, the LAION-2b training data is a total of 240 TB. The pruned SD model based on this dataset, is less than 5GB.
The data isn't copied into the models, it is used to teach the models, letting them learn patterns in the dataset.
Whether or not it's identical to human brains isn't the question; they'd need to prove how a small 5GB model trained on a huge dataset infringes their rights specifically.
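To put that size argument in numbers, here is a rough sketch. The ~240 TB and ~5 GB figures come from the comment above; the ~2.3 billion image count for LAION-2B is an assumption for illustration:

```python
# Napkin math: how much of each training image could the model
# possibly retain, if it were simply storing copies?

dataset_bytes = 240e12   # ~240 TB of LAION-2B training data (figure from above)
model_bytes = 5e9        # ~5 GB pruned Stable Diffusion model (figure from above)
num_images = 2.3e9       # ~2.3 billion images in LAION-2B (assumed count)

compression_ratio = dataset_bytes / model_bytes
bytes_per_image = model_bytes / num_images

print(f"dataset/model size ratio: {compression_ratio:,.0f}x")
print(f"model bytes per training image: {bytes_per_image:.2f}")
```

That works out to roughly a 48,000x reduction, i.e. on the order of two bytes per training image, which is the core of the "no copies are stored" argument.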
I have a 1.9 GB mp4 file on my hard drive. It contains 2 hours and 15 minutes of 1080p video data at 24 fps. Assuming it was generated from 4096x2160 16-bit color depth source material, the "training data" was 10.32 TB. I bet I could even get a similar size reduction as LAION-2b if I recompressed it to 720p.
Could I not also claim that I created an advanced AI model, which did not copy but learned patterns in the dataset? Modern video compression algorithms are getting quite complicated, after all.
I think no reasonable person would agree with this, but can you prove that the AI model is doing something substantially different?
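The raw-size arithmetic in this analogy does check out (assuming 3 color channels at 16 bits each, which is how the 10.32 TB figure is reached):

```python
# Verify the "training data" size of the hypothetical video "model".
seconds = (2 * 60 + 15) * 60           # 2 h 15 min of footage
frames = seconds * 24                  # 24 fps -> 194,400 frames
bytes_per_pixel = 3 * 2                # 3 channels x 16-bit color depth
frame_bytes = 4096 * 2160 * bytes_per_pixel
raw_bytes = frames * frame_bytes

print(frames)                          # 194400
print(raw_bytes / 1e12)                # ~10.32 TB of "source material"
print(raw_bytes / 1.9e9)               # ~5400x reduction to the 1.9 GB mp4
```

So the mp4 achieves a roughly 5,400x reduction, within striking distance of LAION-scale ratios once you account for recompression to 720p.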
For that claim to hold, such patterns would have to enable the video file to decode into a multitude of pictures not originally in the training data. Obviously, a video file cannot do that; it's just compressed data.
Generative models however can generate things that are not in its training set.
And of course, there is a fundamental difference in the source data between compressed video and a generative model: video codecs work with a sorted sequence of images, where most images are slight variations of the ones before them. The training for generative AI doesn't have these properties, the input is not an ordered sequence, and even similar pictures are not sequential variations of one another.
Relying solely on "uncompressed" size does not a good metric make (this is analogous to the raw input size of the LAION dataset): one could make a reasonable argument that there are not billions (1) of image-pairs that are effectively identical up to a minute shift. I would posit the correct basis would be the Shannon entropy of the "best fit" ordering (minimizing the inter-frame diff) of the video versus the lossy-compressed video, and a similar "best fit" ordering for the LAION dataset vs. the model.
My suspicion is that one will find that the relative number of "smooth transition" pairs in LAION viz the whole will be very different from the video.
(1) - Napkin math: There are about 194400 frames, so ~37 billion (37,791,165,600) ordered frame-pairs. Assuming runs of about 1 second between hard cuts throughout, i.e. an incidence rate of 1/24 for non-smooth transitions, we get roughly ~36 billion "smooth transition" frame-pairs. I think it is safe to assume "on the order of" 1 billion, then. This ignores long "action" scenes with significant variance in images throughout, but also ignores longer-than-1-second slower scenes, hence the order-of-magnitude shrink in the assumption as a buffer.
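The footnote's arithmetic can be reproduced directly (taking the commenter's 1/24 hard-cut rate as given):

```python
# Reproduce the footnote's napkin math on frame pairs.
frames = 194_400                 # 2 h 15 min at 24 fps
pairs = frames * (frames - 1)    # ordered frame-pairs
smooth = pairs * 23 // 24        # all but the assumed 1/24 "hard cut" share

print(pairs)                     # 37791165600 -> ~37 billion
print(round(smooth / 1e9, 1))    # ~36 billion "smooth" pairs
```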
No, but what if you then produce "your own" rendition, or "remix", of that book, song or movie and offer it to the public? E.g. you memorize a collection of Taylor Swift's latest songs, and then start performing a medley of her hits in your local clubs, you may well find yourself in trouble.
ML can do either facsimile or imitation far better than a human mind can.
You seem to be conflating both things and suggesting that ML only does facsimile, which is where the potential legal problems are.
No, it is not. Memorization != understanding.
I can teach a parrot to spew the times table, good luck getting it to understand how to apply it.
And a parrot is billions upon billions of times more capable than any current AI algos.
AI, like humans, is capable of both imitation and facsimile. It is far superior at both feats.
Your fallacy is that you are noticing AI is superior at facsimile and erroneously assuming it is “not learning”. You are also ignoring the other amazing learning feats of imitation in front of you.
Parrots are lovely animals, but it’s unclear what you think you’ve accomplished by bringing them up. The fact that they are capable of more than just memorization does not differentiate them from advanced AI models, which are rapidly gaining all sorts of abilities.
Polly the parrot would have a hard time producing a picture of Elmo with a light saber in a Superman costume riding a dragon on the moon in the style of Rembrandt (in under 300ms, at least). I also know a parrot couldn’t write a 500 word story about the image.
Diffusion would appear to me to work in much the same way. It doesn't understand what's good ("works", creates acceptable output) or why, but it knows it when it sees it, and has the tools to refine it.
I am pretty sure that GPT-3 is a lot more capable than a parrot in transpiling a function written in Python to Golang, or writing a summary to a tech-magazine article.
Same as Stable Diffusion is a lot more capable than me at drawing, painting, and generally making up pretty pictures.
But that never, ever, ever happens with current AI. It is not AGI. It is inspired by nothing, has no creativity, nada, zilch.
Humans either learn art by being natural art geniuses, or by receiving instruction and learning through an iterative process (where, again, they might create thousands of art works, but nowhere near the scale here), which is very different.
2. An AI has a training set of every image in the world and produces an entirely unique work.
Which do you have more of a problem with?
Copyright law serves the purpose of peoples works being protected from unauthorized parties making copies of their works, and profiting off them.
It doesn't protect from technology making the production of new works cheaper, faster, more efficient. An artist using photoshop can be, and is allowed to be, many times faster than one using oil and canvas.
I think the objectionable thing about the second is that the AI knows everyone's styles, and so can use them in creating something new. Even if the AI is restricted so that it cannot paint an image in a certain artist's style (as the new version of Stable Diffusion is, for instance) and the art is unique, I think part of the problem is that the AI is still (presumably) leaning on the collective styles of everyone it has trained on.
If we can train an AI on a small dataset, or maybe even a large dataset of old art, or some mix in between, and then maybe fine-tune it with a small sampling of modern art, then I believe it would be unobjectionable, as this is largely how humans do it.
Why is it different?
The only difference that matters is scale. And again, if I want to argue that something done 10,000,000,000 times is legally problematic, I have to be prepared to explain why doing it 10 times is problematic as well, only less so.
Could I have a dollar? What about a billion dollars?
The burden isn't on me to explain why something being scaled up by a billion is not the same.
I am completely aware that scale is a "thing" in legal systems. But as I said before: For scale to be important, the unscaled act in itself has to be problematic already.
Granted, things around power concentration have deep philosophical and social roots.
But I think you meant GPT-3 has seen many books during training, not during inference. You should know that training on millions of books is not the only way GPT-3 learns. It is just the foundation of its knowledge.
GPT-3 learns "in-context", meaning it can learn a new word or a new task at first sight. It just needs a description or a few examples. This is the most powerful feature of GPT-3: in-context learning. And when it comes to ICL, it is much like humans: it only sees a few examples, not millions of books.
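For instance, an in-context prompt along these lines (a hypothetical example; the word "florgle" is made up) is enough to teach the model a new term at prediction time:

```
A "florgle" is a small tool used to tighten bicycle spokes.
Use the word "florgle" in a sentence.
```

The model has never seen "florgle" in training, yet a model with strong ICL can typically use it correctly from the single definition in the prompt.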
> “Do you ask for permission when you train your mind on copyrighted books?”
The nature of ICL is that it happens at prediction time. So GPT-3 would have to explicitly be instructed to learn a specific skill. Should it reject instructions if they are sourced from copyrighted books?
I’m not a lawyer, but to me it seems within the realm of possibility that a U.S. court eventually finds strongly in favor of the copyright holders, the Supreme Court agrees (because Big Tech has so few friends left), and OpenAI will be required to destroy the GPT-3 model and all copies of the training data because they can’t filter out copyrighted works.
Just because you can find one way that GPT might be slow doesn't invalidate the point that its training does use massive amounts of data.
Or it’s just too fast so let’s stop it?
Bear in mind, AI is not making artists or creative types obsolete - that would be fair game, just like computers made human calculators obsolete. No, this is about abusing other people's work.
If using someone else's work for learning is infringement, then that's going to cause a lot of difficulty for all artists. Try making a rock song without listening to rock, or painting some modern art without viewing any, etc.
The loophole is using copyrighted works for free despite no learning taking place: no human being is observing and developing their skills based on that work; rather, an algorithm is transforming those works into some other useful interpretation of them.
Where is the abuse happening?
I think you underestimate the sheer volume of data + conclusions the brain ingests and processes on a daily basis, primarily through unconscious experience.
While our brains are more complex than the networks, this has never been in dispute.
The quantity of experiences needed to train GPT-3, however, is many more than we are capable of experiencing in a lifetime.
Also, about volume of data processed by the brain: https://gwern.net/Differences
The difference is that I buy books, pay to visit museums, and buy music in several formats, or pay for it by accepting advertisements between songs.
It is expected that if I buy a book, I will be allowed to read it without asking for permission.
What I don't do is copy-paste paragraphs from other books to write a new book and claim it is mine. That is a different situation.
No, obviously not, as that would clearly be unworkable and ridiculous. Mainly because we have a very unclear understanding of human creativity, and there’s no way to analyse an individual’s mind to understand how they created an idea. Additionally, copyright’s reach generally stops at the point of “transformation”: once you take an idea and transform it “enough”, it’s considered a new idea.
The reason none of the above applies to AI is simply because we’ve declared that only humans can transform and produce new ideas. AI aren’t human, thus they’re not afforded the same rights. Arguing about whether there’s an inherent difference between AI creations and human creations is pointless; the law doesn’t care, it has already declared that there’s a difference between AI and humans.
If you disagree with that declaration, then you need to lobby to change the law. But until the change occurs, your beliefs are meaningless in the eyes of the law.
In my opinion, things like selecting a training dataset and then writing prompts are not creative processes; they are mechanical ones. Input in, output out, with barely any interaction from the human.
Consider when you commission an artist to make a painting. You give them a "prompt" by explaining what you want. Maybe you even give a "training dataset", a few examples similar to the look and feel of the result you want.
Then they go off and make something. They show you in process stuff and you make suggestions so the next version they show you is closer to what you want. This repeats until you are both happy. Then you own the drawing. Because you paid them for it.
In this case, however, it's absolutely clear that you did not create the work. You had input into the creation, but the artist was not a tool you were using to realize your own creative vision. They are the creator; you're a customer of theirs.
Copyright law is not about being a logically perfect system, but about creating a general environment in which artistic, academic and other creations can appear and benefit the general population.
Yes ... and because it's not a logically perfect system, its lifetime has to be limited. One day we should abolish copyright and find a better, more functional way to drive progress.
Where it goes wrong is when individuals and corporations believe that such monopolies should be indefinite, and push the monopolies beyond the lifetime of the author. A dead author can’t produce new works, so it’s not clear how allowing such long monopolies increases the amount of creative work produced.
The original primary objective of copyright was to create an environment that could produce an endless supply of public works, freely available to all. It’s only the abuses of copyright over the past 50 years that have destroyed that objective, and ironically it’s copyright holders like Disney that are really starting to suffer the consequences.
Winding back copyright durations to better balance the public and private interests would go a long way to resolving many of our issues with copyright today.
> Yes ... and because it's not a logically perfect system, its lifetime has to be limited.
It’s also worth pointing out that no system of law is “perfectly logical”. It’s almost certainly impossible to produce a perfectly logical system because humans are inherently illogical, and binding them into a perfectly logical system of law would almost certainly produce more injustices.
Anything that has infinite supply and zero marginal cost, as Nobel Prize-winning economist Samuelson argued when he looked at the question through the lens of lighthouses, should be free to all. By using copyright to make it a monopoly and allowing the extraction of monopoly rents, you are drastically reducing the value and reach of the thing that was discovered. Copyright is a hack, and this hack is now fundamentally breaking. Instead of trying to save the hack, we need a full rewrite. If winding back the duration of copyright is correct, the best winding back is to zero.
As we are a remix culture where idea A and idea B combine to create idea C, we drastically reduce the innovation in our economy through reduced discoveries. This failure ends up with large monopoly holders consolidating into bigger and bigger entities in order to right some of this failure, but that only makes the monopoly extraction worse.
The discoverer should be subsidized for the discovery of that information, but it should immediately go into the public domain. How you work that out is just as abstract as how Spotify works out what each play costs. It is no doubt monstrously complex to figure out the dollar value of some discovery, but it is the economically correct path. Copyright isn't.
Source: https://courses.cit.cornell.edu/econ335/out/lighthouse.pdf (page 359, first paragraph)
There is zero creativity, zero art, zero original thought, zero newness.
When actual AGI happens, then your arguments mean something. Such as in, at least 50 or 100 years down the road.
Try to make a car using the best ideas developed by each car maker. You will be surprised.
Yes, that's why I pay a fee to buy/borrow one (or someone pays the fee, in the case of a library).
> listen to music
Again, money is exchanged.
Yes, and so long as they are not derivative works, it's not a problem.
Copyright is there to allow you and me to develop things and make money from them. It is there to stop people stealing our work, which may have taken years to develop, and selling it for a profit with none of the risk.
Large corporations have abused this to make monster profits.
Google have spent billions trying to persuade us that copyright is evil, because they didn't want to pay content producers to host their work (i.e. music and movies on YouTube, and local news sites).
The issue is this: I might have made a website that tells users how to do a specific type of metalwork. I have a free ebook, and I run courses. I have spent many years perfecting the art, creating the tutoring content, and recording videos. It's advertising-supported, and people are asked to consider buying a course to support the creator.
The AI company comes along, scrapes all the content, and allows people to regurgitate it, with more or less accuracy.
The creator now gets less traffic and less money, and can't afford to create more content.
The AI people now skim all the money, and the consumer gets less useful information.
Culture isn't free. Someone is paying for it, and if you stop paying them, then it doesn't get created.
As the parent said: everything is derived work. We are remix machines. It is how we learn and how we make money. Now, with AI, we are apparently offended when something does it better and faster than we can? To me it seems that if we expect AI to pay additional fees, the question is: why?
I am not saying that it's not an important question. Google has built its entire business around information other people have provided. I would argue most people are quite happy with the existence of something like Google search and see it as a net positive in their lives. Does that make the business part okay? Where do we stand on this in regard to an open web? Is it okay for Google to do what they do (and if they do it well to win the space), or should there maybe be a license where people have to pay the owner whenever they are indexing a website? I don't know. Feels complicated.
> If you stop paying them, then it doesn't get created.
That's an interesting thought. But is it true and, more so, is it a problem? What if humans from here on will only be paid to create stuff that an AI can't?
Someone pays, just maybe not you. How do you think Google/Meta et al. offer you a service free at the point of delivery? Through charity?
> You can google an image of any great work of art and look at it for as long as you like
See my bit about Google. The copyright still rests with the owner. That image can be removed, should the owner wish, but for various reasons it's too expensive to get Google to respect that.
> I would argue most people are quite happy with the existence of something like Google search
Yes, because it's a symbiotic relationship. I, as a creator, make something that people want to find; Google points them to me, and I get people's attention. I might do that to fluff my ego, or try to convert it to cash through sales or something.
The AI step threatens to remove that relationship. Instead of being passed to me, the AI just pastes shit it's gleaned from mine and other websites, leaving no chance of me getting a reward for making that website.
If the copyright on a given work of art is still active, those pictures were taken and are distributed with the permission of the copyright holder (or they're just pirated). That's one of the reasons it's much easier to find images of classic art (for which the copyright has expired) than it is to find images of contemporary art.
> You don't need to pay for or "borrow" anything to learn from copyrighted works.
What exactly do you think copyright is, and would you be surprised to learn that libraries have purchased the books on their shelves?
If you're referring to piracy, that is very much being kept in check. Otherwise, the vast majority of copyrighted art is only available for payment in various ways (streaming services, museum and theatre access fees, library cards, buying e-books etc).
I look forward to a life of horrific poverty
If AI takes jobs because it's simply superior at them, and that creates friction and anxiety until we have stuff figured out, that's of course sad and we should do our best to soften the process, but I think it's inevitable. The carriage must die. It seems obvious that restrictions on training data are just a distraction and will not move the needle on any interesting time frame.
If however AI does not pay forward, in an arrangement that makes our collective lives better, I will be the first to work on burning it to the fucking ground.
But, on a lighter note, since that has generally been the direction of human civilization (not linear when zoomed in, but always when zooming out) I remain optimistic.
The weavers were left to rot when the automatic looms came in. (In Flanders, east England and northern France they were an incredibly rich and influential class.)
Furniture makers were left to rot when steam power tools came in
Farm labourers were left to starve when steam threshing/harvesting came in.
Enclosure was another tragic note in England.
The Green Shirts were lobbying for "a share of the domestic profit" in the 20s-30s; in the 60s they were convinced that we would be working 2 hours a day by now, with robot servants cooking and cleaning for us, and no one living in poverty. Even Orwell wrote on this.
Instead we see productivity in the western world dropping, meaning for every human hour worked we make less money. I suspect this is due in part to the rise of servant-as-a-service jobs (food/shopping delivery, cleaning, elderly care, etc.), all of which are long hours and low paid.
Well, in the DDR everyone had a job, but lived in perma-poverty and was likely to be disappeared for speaking out.
Specifically the US and UK, who appear to be snorting financial inequality by the metric fuckton.
What I was more so thinking of are the unspecific societal functions that evolved to the benefit of everybody, but more so to those who could not have afforded them beforehand: Quality health care, various forms of social support, more accessible education and food, better road systems. The stuff that makes the charts on education, prosperity and health go from bottom left to top right and child mortality and hunger in the opposite direction.
The injustices of the day do not show in the most important, most long term graphs. As far as I can tell (and I am happy to hear your thoughts) this can only be true because people have benefitted increasingly from things improving, over time.
This is just deeply wrong. Culture existed before money. It is tragic to me that a person can't see culture as anything but a marketable good.
With respect, that's not what I am saying. I'm saying it has a cost. If people do not have the means to spend on making culture, then it is not created.
Juvenal was a client of someone, and complained about it
Tallis, Allegri, Purcell, Bach, Mozart were all professional composers
The great seats of learning (Ashurbanipal's library, Venice, Alexandria) were all paid for by a ruler wanting to show off how good they were.
Wilde and Byron were rich people wafting around bored and making art along the way.
In the 60s-80s it was possible to live in NYC working at a bar or something, and still have time and money to create art. Where can you do that now?
Now you need to be rich, or have time, or get patrons. The internet is a great way to either lower the cost of entry (see music) or get support to create (see Patreon)
> This is just deeply wrong. Culture existed before money
Culture existed when we had the time, food and resources to stop worrying about being cold, wet and hungry.
That doesn't seem to be universally true, but an end-game of capitalism. There are countless examples of artistry/sculpture/music that were created long before copyright existed and although they may have been "paid" for it previously, those cultural items can be appreciated without needing to pay someone for it.
There are also many contemporary cultural items that were created without monetary recompense that can also be enjoyed without needing to spend money.
> Copyright is there to allow you and me to develop things and make money from it. It is there to stop people stealing our work, which may have taken years to develop and sell it for a profit with none of the risk.
Your use of the word "stealing" is unnecessarily loaded; "stealing" specifically means that the creator was deprived of physical ownership, which would be incorrect here.
You are arguing against your own point here. As I said culture stops being created when there is no money for people to create it.
Should copyright never expire? No. Is 25 years enough? You betcha.
> There are also many contemporary cultural items that were created without monetary recompense
Again you are missing the wider point. For culture to be created you need a mix of people, and those people to feel safe enough, and have enough time and energy to create said culture. They will also need money for materials.
As I suspect you are not on a poverty wage, you will have the time, energy and healthcare to be able to create a new thing. This is not a luxury someone who works two jobs just to make rent has.
> Your use of the word "stealing" is unnecessarily loaded and specifically means that the creator was deprived of physical ownership which would be incorrect.
Stealing is taking with intent to deprive. I mean specifically what I say.
Taking someone else's work and selling it as your own to make money, whilst depriving that person of credit or an income stream, is morally wrong.
Now there is an argument about corporations abusing copyright (they do) but, throwing it all out only benefits people like google, amazon and facebook.
The law already makes many distinctions between humans and machines. For example, looking out the window to see when your neighbor is going to the supermarket: allowed; using a machine-vision system to store the movements of groups of people into a large database: not allowed.
Also, "training the mind" and "training a machine learning system" are two completely different things, even though the language used is the same.
It seems to me that one side is arguing that people (as in, individual human beings) already do what the AI is being accused of, the other side argues that it's replicating work.
The truth of the matter is that what is taking place is a different thing altogether. We do generally deal in a different way with "machine behavior" because we recognize it being automatic and reproducible matters.
Yes, and humans are being found liable for copyright infringement for doing so. All that's needed to establish liability is access and substantial similarity; the bar for the latter can be very low indeed (see Williams et al. v. Bridgeport Music et al.).
I pay for books directly (cash, credit) or indirectly (school books via taxes). I pay the Louvre to observe the paintings. I also pay to listen to music via ads (YouTube) or via subscription (YT Music and Spotify).
This whole thread really makes me want to pull my hair out.
Difference between illegally creating an (even temporary) copy of a copyrighted work (e.g. streaming a movie) vs. creating a derivative work of said copyrighted work: two completely different things, with completely different legal outcomes.
If OpenAI in any shape or form creates a temporary copy (<--- by the copyright definition of what a copy is!) then this needs to be addressed with the former. If OpenAI creates a work that is considered to be a derivative work (<---- by the copyright definition of what a derivative work is!) then that needs to be addressed with the latter.
The crux of this whole thing is: Human minds cannot make a copy of a copyrighted work by definition of copyright laws (in Germany, I presume the same can be said for pretty much all western copyright laws), while anything that a computer does can be construed as making a copy.
But that's not the point of contention. The training data set has been granted the right to be distributed (by virtue of it being available for viewing already - it's not hidden or secret). The proof is that a human can already view it manually. Let's call this 'public'.
The question is, whether using this public training dataset constitutes creating a derivative work. Is the ML model sufficiently transformative, that the ML model is itself a new work and thus does not fall under the copyright of the original dataset?
This is wrong. My paintings are publicly available (especially going by your definition [which I'm confused by the origin of?]). Taking a photograph of my paintings is still a copyright violation. I hope we can ignore all the legal kerfuffle about personal use, as it has no bearing on our discussion. Again -- all of this boils down to what I've said before: bare human consumption does not constitute making a copy; nearly everything else does.
Your second point -- that a copyrighted work automatically grants someone else any rights (especially distribution rights) just by being available to be consumed -- is even more wrong. I'm not going to go further into that, as you can very easily prove yourself wrong by googling it.
>The question is, whether using this public training dataset constitutes creating a derivative work
I'm not well versed in the US copyright laws, but I would assume (strongly so) that this would not be the case. I -- again, for US copyright law -- assume that for something to be considered a derivative work, it needs to include (or be present in other ways) copyrightable (!) parts of the original work(s). In other words, the original work needs to "shine through" the derivative work, in one way or the other. The delta of parameter changes of a ML model would (imo) not constitute such a thing.
Problems with derivative works will come into play when considering the things ML models produce.
You are mixing up the two things that I've mentioned in my original comment. You have to differentiate between creating a copy and creating a derivative work. Both of those things matter when talking about AI, but the former is far more clear-cut.
>The question is - to what extent does the exact image of your painting remain within the AI's data matrices?
And the answer is: It's irrelevant. The model has to be ingested with a copy of something. That's all that matters. The AI could even reject learning from that something. By the time that something reaches the AI to even do something with it, it's been copied (in the literal sense) who knows how many times, each of those times being a copyright violation.
I would put the same criteria to the copy made for the purpose of AI training. As long as you have the right to view the image, you would also have the right to ingest that image using an algorithm.
AI is not a mind. It’s a program. We might call it a “mind” as a metaphor, but it’s not really one.
So any justification which presupposes that an AI should be able to do something (really: that the people who are running the AI programs should be able do something) because they are a “mind” is fallacious and doesn’t need to be interrogated.
Also, Artistic Freedom is now under attack by artists themselves. Not that long ago artists hated the music industry and corporations such as Disney for weaponizing copyright law against Artistic Freedom. Now artists are utilizing that same tactic against other artists.
What is your point? If "Linux users" are right it is not FUD or the concept of "FUD" is pointless.
You don't need permission to train on books, but you do need to buy the books or take them from the library one at a time.
"Training" these machines so far is not like human learning, as becomes apparent when they spit out source code that mirrors individual repositories. And you know that humans are required to both remix their own creations and follow copyright law at the same time, and also adhere to the social and institutional stigmas against extensive, uncreative cut-and-paste paraphrases.
When training AIs on copyright law trains them in obeying copyright law, they'll be ready for the Turing test, or even to be called AIs.
That's not a problem, we already have copyright laws that prevent people from distributing mirrors of copyrighted works. They don't care about how the works were copied.
Yes, isn't that what very many people are saying?
Humans are capable of both facsimile and imitation.
The fact that ML is able to perform facsimile far better than a human can is not evidence that this is “not the same” learning. Only that ML learning is superior. ML is far superior in feats of both imitation and facsimile.
If I show a 3-year-old a single picture of a tiger, and tell him this is a tiger, the child is able to recognize a tiger fairly accurately in real life without further input. Though the child might say that a house cat is a tiger...
ML learning needs millions of pictures to do the same, and still might mistake an elephant for a tiger...
ML is nothing more than graph approximation; there is no logical reasoning.
ML is currently capable of the tiger case you mention. It’s generally called “few-shot” or “one-shot” learning. In the context of an image generation model, having never seen a tiger before, if you show it a few pictures of a tiger, it could immediately draw you thousands of tigers in any variation or scenario you can think of, which is way more than a child can do.
As for the need to train on millions of images for the base model, I believe you are trying to say something about “sample efficiency”, and how ML differs from the brain in this regard outside of the few/one-shot contexts (which ML is absolutely capable of). I would argue that the sample efficiency of the brain is actually also quite low, much lower than people assume. It’s irrelevant to an argument that ML is not superior, because ML clearly is capable of learning richer, more effective representations in a shorter wall time than we can, whether it is sample efficient or not. And in the sample-efficient few/one-shot contexts (learning what a tiger looks like from one picture), it also outperforms humans in speed, accuracy and creativity. It’s not even close.
As for classification errors, ML is capable of some errors we are not, actually by virtue of being superior at learning representations we are not even close to being capable of learning. But those are edge cases, and they are fixed by various means. In the main cases, ML outperforms humans in speed, accuracy and class complexity, all exponentially.
You said something about graph approximation but it doesn’t make a lot of sense. I’m talking about learning and you’re complaining that machine learning is not “logical reasoning”. Whether ML is currently capable of logical reasoning is another discussion. Certain models do demonstrate some types of it today.
“Graph approximation” is a type of learning task. ML is a billion times better than humans at it so it also doesn’t help you argue that ML isn’t superior (in that regard).
This comment fundamentally and dangerously misunderstands Copyright Law. Insights are not copyrighted, nor are they copyrightable. Copyright law controls who gets to distribute a specific “fixation” or performance of work. It is not, and never was about preventing the spread of ideas. Authors and artists have always intended for you to read/observe/listen to their work when you legally acquire a copy. They just want you to not copy it verbatim, but go do your own original work if you want to distribute or sell something.
The whole problem is that today’s NNs are specifically designed to remember and remix only the fixed performative parts of the work, and they, unlike humans, don’t understand the insights at all. They are just deterministic machines that copy and remix at a large scale. As such, it’s pretty clear the people training AI today should expect to have to ask permission before “training” (copying) other people’s work.
A silly example: making GPT write a rap battle between Keynes and Mises goes beyond a performative remix; it is transformational work, nothing is copied explicitly. If a human were to write it, that would not violate copyright.
I think that to tackle this we need a new lens other than copyright in the long term.
The argument that NNs aren’t memorizing is definitely debatable and not necessarily true. They are designed to memorize deltas and averages from examples. They are, at the most fundamental level, building high dimensional splines to approximate their training data, and intentionally trying to minimize the error between the output and the examples. It’s fair to say that “usually” they don’t remember any single training sample, but it’s very easy for NNs to accidentally remember outliers verbatim. The whole reason the lawsuits mentioned in the article are happening is because we keep finding more and more examples where the network has reproduced someone’s specific work in large part. If we’re going to claim that today’s AI is producing original work, then we have to guarantee it, not just assert that it doesn’t usually happen.
> a rap battle between Keynes and Mises goes beyond a performative remix, it is a transformational work, nothing is copied explicitly.
I don’t buy that the work can be called transformational just because the remix doesn’t have any recognizable snippets. GPT is in fact copying individual words explicitly, and it’s putting words together by studying the statistical occurrence of words in context of other words.
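To make that claim concrete, here is a toy bigram model (my own illustration, far simpler than GPT, which conditions on much longer contexts). It generates text purely by sampling from observed word-successor statistics, so every word it emits, and every adjacent word pair, comes straight from its training text:

```python
import random
from collections import defaultdict

def train_bigrams(text):
    """Record which words follow which in the training text."""
    counts = defaultdict(list)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev].append(nxt)
    return counts

def generate(counts, start, length=8, seed=0):
    """Emit words by sampling successors according to their observed frequency."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        successors = counts.get(out[-1])
        if not successors:
            break
        out.append(rng.choice(successors))
    return " ".join(out)

model = train_bigrams("the cat sat on the mat and the cat ran")
print(generate(model, "the"))  # a remix built entirely from training words
```

Whether such statistical recombination is a copy or a transformation is exactly the question under debate.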
> I think that to tackle this we need a new lens other than copyright
I totally agree with that. This question is legitimately hard. We do need a new lens, but we might have to keep and respect the old one too at the same time. I feel like AI work should acknowledge that difficulty and step up to lead the curation of training sets that are legal wrt copyright by design, rather than ignoring the concerns of the very people who made the work they are leveraging.
If you’re executing a NN algorithm in your mind, or via pen & paper, then you are copying from the training samples, because that’s what the algorithm does. During training you compute errors against the samples, and update your weights to reduce error. During inference or generation, you use the weights (the results you remembered across all your training data) to produce an output. When your training samples are clustered in the latent space, the network will only remember an average of the samples, but samples that are sparse and don’t have close neighbors are sometimes remembered verbatim, because there’s nothing nearby to average from. You can legally run the algorithm all you want on your own. Once you run it and then distribute the output, it might be in violation of Copyright Law if you accidentally reproduced one of the samples. The same is true for traditional human learning: you can freely copy ideas legally, but reproducing too closely something that someone else made may be against the law, even if it was accidental.
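The cluster-averaging versus outlier-memorization behaviour can be sketched with a kernel (Nadaraya-Watson) estimator, a stand-in of my own choosing for a trained network, since its output is likewise a weighted blend of training targets:

```python
import numpy as np

def kernel_predict(x, train_x, train_y, bandwidth=0.5):
    """Predict by averaging training targets, weighted by distance to x.
    Stands in for a trained network: the output is a blend of samples."""
    w = np.exp(-((x - train_x) ** 2) / (2 * bandwidth ** 2))
    return np.sum(w * train_y) / np.sum(w)

# A dense cluster of samples near x=0, plus one isolated outlier at x=10.
train_x = np.array([-0.2, 0.0, 0.2, 10.0])
train_y = np.array([1.0, 2.0, 3.0, 7.0])

print(kernel_predict(0.0, train_x, train_y))   # ~2.0: an average of the cluster
print(kernel_predict(10.0, train_x, train_y))  # ~7.0: the outlier, near verbatim
```

Inside the cluster, no single training target is recoverable; at the isolated sample, the model reproduces its target almost exactly, which mirrors the verbatim-memorization risk described above.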
Thoughts are never illegal wrt US Copyright Law. It’s a straw man to insist on making this point.
> In other words, the end user of the model is the one to be held responsible if they reproduce and distribute the copyrighted material.
No, this is false because it is the creators of the model that 1) did not legally acquire the source material and 2) distributed the network that contains latent copies of the source material that end users can use to reproduce works from.
This is incorrect. As another poster mentioned, it is not illegal to read a stolen book. It is only illegal to steal the book.
Secondly, the source material is acquired legally, since it is open to consumption on the open internet.
Thirdly, the model does not contain “latent copies of the source material”. By a simple test (currently the legal standard): if I showed you the node weights and counts of the network, no person, even one trained in the art, could identify it with a specific piece of work. Therefore it is at best a derivative, reasonably distinct.
Nope, this is strawman and continuing to demonstrate a misunderstanding of Copyright Law. There is no such legal standard, where did you get that? If the network can reproduce a work, then it does in fact contain a latent copy. Arguing that you can’t see it by inspecting node weights is straw man. You cannot argue that you’re not copying music if you use a new compression algorithm and then suggest it’s distinct and derivative because nobody can read the raw compressed data. That’s not how Copyright Law works. If you can approximately re-perform someone else’s work, you’re in violation. This is true even if you have to run a black-box program to produce the output.
> no person even trained in the art can identify it to a specific piece of work
Ironically, you’re actually admitting that even AI researchers can’t prove the network won’t reproduce someone’s work.
The rest you seem to now be looking for a snarky gotcha, which if you don’t want to have a discussion, then I’m uninterested in discussing further. I made clear above and in a sibling comment that remixes are gray area, and this question is complicated. That said, even if AI people do acquire source material legally, they are in fact copying it and distributing it, and that part alone can potentially violate US Copyright Law. This isn’t even up for debate, so I don’t know why you’re attempting to suggest otherwise. The lawsuits mentioned in the article were brought on evidence that networks violated copyrights of specific existing works, and lots of people have found specific examples of violations.
1) Creating the model does not violate copyright. Claiming otherwise means running the same algorithm in meatspace would violate copyright law, which implies thoughts violate laws, which is absurd.
2) Distribution of the model does not violate copyright law, because the models themselves do not contain latent copies of the work. The model itself is not the work, nor a recognizable copy of it, nor can it be reconstituted back into the work. It is a tool, more analogous to Photoshop, where the tool can be used to reproduce copyrighted work, yes, by the end user (where I believe the responsibility lies). But the tool itself is not copyrighted work. Microsoft Word can be used to generate copyrighted books, if I’m correct. Or I can hire a smarter tool: a human writer to produce copyrighted works. Is the writer-for-hire illegal? Or his employability? Of course not. I believe the law will eventually take the position that an AI model is a tool.
> nor can it be reconstituted back to the work
This is false. It has already happened multiple times that networks reproduced copyrighted material.
Secondly, you seem to be treating the “tool itself” and “what the tool can do” as strongly equivalent, i.e. if the tool has the capability to violate laws, then the existence and distribution of the tool itself also violates said law. (Not so.)
> if the tool has the capability to violate laws, then the existence and distribution of the tool itself violates said law.
That’s right if you remove the word “existence”. Distribution of a NN model that violates copyright by reproducing copyrighted works is illegal. That part has been my point in this thread, it seems like you understand now and we agree.
Its “existence” is not illegal under US Copyright Law unless you didn’t have the legal right to use the training material, and in that case it’s illegal to use the material whether you used a computer or your brain. It doesn’t matter how you created the neural network (or even whether you created a neural network); the violation there isn’t the act of creating the network, it’s the act of stealing and using material you don’t have permission to use.
This whole discussion would be a lot less frustrating for you if instead of making assumptions and logic arguments about brains and computers, you took some time to read the copyright legal code. https://www.copyright.gov/
Cars, phones, guns, knives (practically anything) can be used to generate activities that break the law. They are perfectly legal to distribute. The onus on the legality of the activity lies with the end user.
> If you understand that the model is a tool, and that as a tool it can be used to generate activity that can violate laws and be used for other perfectly legal activities, then as a broad principle the distribution of said tool is not a violation of said laws
That statement is incorrect, the logic is flawed. Just because a tool has both legal and illegal uses does not necessarily have any bearing whatsoever on whether the tool’s distribution is legal. Tools that are illegal to distribute can have legal uses, and that does not make them legal to distribute.
You are making statements and assuming their truth without reason, evidence, or examples to back them up: the logical fallacy of begging the question. You have also not reasoned how freely available information is “illegal” to read/index/store, amongst other things.
Not here to win you over. The audience can see how weak your position is. My last response here.
Yes, you do need to buy books, which gives you permission to read them.
It is not illegal to read a stolen book, only to steal the book.
The thing that authors are trying to argue here is that they should get to control what type of entity should be allowed to view the work they purchased. It's the same as going "you bought my book, but now that I know you're a communist, I think the courts should ban you from reading it".
No, that's not it. It's more like if I memorized a bunch of pop-songs, then performed a composition of my own whose second verse was a straight lift of a song by Madonna. I would owe her performance royalties. And I would be obliged to reproduce her copyright notice, so that my audience would know that if they pull the same stunt, they're on the hook for royalties too.
Now, moving from holding the model creator culpable to the user would obviously be problematic as well, since they have no way of knowing whether the output is novel or a copy paste. Some sort of filter would seem to be the solution, it should disregard output that exactly or almost exactly matches any input.
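A filter of the kind suggested could be approximated with a word n-gram overlap check (a simplified sketch of my own; real systems would need fuzzy matching at enormous scale):

```python
def ngrams(text, n=5):
    """All n-word shingles of the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_copied(output, training_texts, n=5, threshold=0.5):
    """Flag output whose n-grams overlap heavily with any single training text."""
    out_grams = ngrams(output, n)
    if not out_grams:
        return False
    return any(
        len(out_grams & ngrams(source, n)) / len(out_grams) >= threshold
        for source in training_texts
    )

training = ["the quick brown fox jumps over the lazy dog"]
print(looks_copied("the quick brown fox jumps over a fence", training))                 # True
print(looks_copied("an entirely original sentence with no shared phrasing at all", training))  # False
```

The threshold and shingle size are arbitrary choices here; tuning them is precisely where "exactly or almost exactly matches" gets hard to pin down.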
It's not obvious to me that the implicit permission we've been granting for humans to view our content for free also means that we've given permission for AI models to be trained on that data. You don't automatically have the right to take my content and do whatever you like with it.
I have a small inconsequential blog. I intended to make that material available for people to read for free, but I did not have (but should have had!) the foresight to think that companies would take my content, store it somewhere else, and use it for training their models.
At some point I'll be putting up an explicit message on my blog denying permission to use for ML training purposes, unless the model being trained is some appropriately open-sourced and available model that benefits everyone.
Actually, you don't have the right to restrict the content, except as part of what's allowed in copyright law (those rights are spelled out - like distribution, broadcasting publicly, making derivative works).
Specifically, you cannot have the right to restrict me from reading the work and learning from it.
Imagine a hypothetical scenario: I bought your book, and counted the words and letters to compile some sort of index/table, and published that. Not a very interesting work, but it is transformative, and thus you do not own copyright to my index/table. You cannot even prevent me from doing the counting and publishing.
The section titled "Exclusive rights in copyrighted works".
There are 6 rights.
(1) to reproduce the copyrighted work in copies or phonorecords;
(2) to prepare derivative works based upon the copyrighted work;
(3) to distribute copies or phonorecords of the copyrighted work to the public by sale or other transfer of ownership, or by rental, lease, or lending;
(4) in the case of literary, musical, dramatic, and choreographic works, pantomimes, and motion pictures and other audiovisual works, to perform the copyrighted work publicly;
(5) in the case of literary, musical, dramatic, and choreographic works, pantomimes, and pictorial, graphic, or sculptural works, including the individual images of a motion picture or other audiovisual work, to display the copyrighted work publicly; and
(6) in the case of sound recordings, to perform the copyrighted work publicly by means of a digital audio transmission.
GitHub ignored the licenses of countless repos and simply took everything posted publicly for training. They didn't care whether it was available to them entirely legally; they just pretended that copyright doesn't exist for them.
Other licenses such as the MIT license require that you name the original creator.
A license allows new uses that copyright would otherwise block. Some kinds of AI training are fully local and don't make the AI into a derivative work, so they don't need any attribution and you don't need to accept the license to distribute.
This was about the humans consuming other people's content.
If humans make stuff that is too close to someone else's source materials then it is considered plagiarism and not "inspired by".
> For any image that the AI generates, you can't point to any image in the training data that the image is derived from.
Why can't you point to the Getty Images watermark that it is quite happy to reproduce? Isn't that surely evidence that it doesn't actually understand what it is reproducing?
> The AI is trained on 5 billion images yet it stores only 4gb of data. Thus it is impossible that it stores the actual work.
I have also seen billions of images, therefore I cannot actually be storing the real images in my head, and thus nothing I paint could ever be considered plagiarism. That's brilliant; I think there are a few law firms defending artists who would be looking to hire you.
I don't know if that counts as plagiarism, but there's clearly some use of this copyright material that the authors probably didn't envision and did not grant permission for. I have no idea what the law would be in cases like this
The data was originally permitted to be copied.
The question isn't whether the training is violating copyright - as long as the data set had permission to be viewed (which it must have, since it was public).
The question is whether the final result - the model/weights - is a derivative work of the training data set. If it is a derivative work, then the model must be in violation of copyright. But copyright law allows for sufficiently transformative work to be considered new, rather than derivative. So does training a model using methods like this constitute a transformative work?
Are we really going to play devil's advocate so much that we consider these early-day A"I" tools as equivalent to humans? I personally have absolutely 0 qualms about treating humans and these ML tools as completely separate entities governed by completely different laws. AI SHOULD be heavily restricted; we're already headed not towards any sort of apocalyptic singularity, but towards a singularity of pure, endless spam spewing forth from every orifice of the internet and elsewhere.
If these megacorps behind this AI push want it to succeed, then they should be paying for access to the images/texts/music/videos/whatever they're trying to harvest en masse. I couldn't care less if an AI learns the same way a human does, or any other anthropomorphising the AI crowd want to gaslight everyone with.
Of course not, and given my ability to train my mind on thousands of books in a few minutes and spit out a full book based on that training in whatever style one wants in a few minutes for that as well, it seems especially unfair that people act as though there might be a difference between the two situations.
I think this is a specious analogy at best. The two are remarkably different contexts. AI can work at a significantly greater rate. There's also a very large question about whether for profit commercial software should be afforded the same leeway we give to ordinary human behaviour.
Fundamentally the current accommodation of copyright has two main justifications:
1) Protect economic activity
2) Moral right to identify original author of a work
The point of ease of replication speaks to (1) fundamentally breaking. A human can only produce so much output compared to an AI system.
(2) is a much thornier subject, and not one I really feel qualified to speak on.
Why go down the route of turbocharging new forms of rent seeking?
Most people looking at AI can tell it is not like a mind or an artist, because of certain intuitive arguments which boil down to their surprising abilities, their bizarre faults (drawing hands is still a struggle for most models), and their current limits (you have to hack prompts instead of asking naturally). You can reason about people using these arguments because they are people, but you cannot apply them to NNs because they are not.
I'd argue that the moment you start using the "people are AIs" argument, and you then imply the converse, "AIs are people," you are assuming some bidirectionality here, and thus the other qualities you assign to people, like "people have rights," "people deserve to be paid for their labor," and "people have rights to the work of their own hands," must apply to AIs as well. And therefore, the AI tools you are using deserve to be treated with the respect and dignity you had to treat artists and developers with before, and should be paid for the work they create. That is, if they learned and created art in the same ways people create art. Just as a nursery does not own the art a child born there creates, and a university doesn't own the art an artist who studied there creates, you cannot argue that the work an AI creates belongs to the "owner" or "trainer" of a model unless you are arguing that slavery is in fact okay in this day and age. All of this, of course, hinges on the supposition that AIs are people, and that they learn as people do.
So, you cannot have it both ways. You cannot keep treating AIs as people in your arguments, but then deny them the agency that is due to people. The only way this works is if, deep down, you do not believe they are people, or you think people do not deserve rights or compensation for the work of their own hands.
Also animals can be trained and make outputs and nobody accuses them of copyright infringement. That's a much better analogy here than leaping to the idea of treating one of these models like a human.
Ok, I just have to link this here: https://youtu.be/dKFunwOzEos?t=711
AI is not a mind. A mind is a physical object, a brain inside a skull inside a person. An AI is a computer program.
And while a nerd who forgot what grass feels like might confuse the two, the courts won't.
If an AI reproduce a copyrighted work they should then be sent to robot jail, and the human who requested the work should be sentenced for conspiracy to commit copyright infringement.
It might however be a bit early to let the horse back in the barn.
Training AI without permission is sneaking a camera into an art gallery without permission.
If you want to talk about the big picture here, it's about privatizing gains and socializing losses, the goal of every bigcorp, which is just more reason to disallow this abuse.
What you’re asking for is going to only benefit corporations. They’re the sole entities that will be able to afford the regulation you’re proposing.
It’s not even a hypothetical concern. Much of my notability came from books3, which contains almost 200,000 books, and is freely available to anyone. https://twitter.com/theshawwn/status/1320282149329784833?s=4...
I couldn’t train a GPT competitor without that. Neither could you. But corporations could.
TRC (TPU research cloud) makes supercomputers freely available to anyone who will put them to good use. Like, literally you. You don’t even have to have a clear research goal for general access.
It was one of the big surprises of getting into AI. I didn’t expect that at all.
Even without TRC, compute is only getting exponentially cheaper. A 1.5B ChatGPT may sound puny, but I’ve seen how powerful the non-chat variants are.
But I want to argue here that, for purposes of this latter question, your proposal of copyright enforcement (or anything similar) is too little too late.
- These "copyright violating" AIs have demonstrated the proof of concept and the damage is done. Even if these AIs are banned, the companies will just parallel-reconstruct them by running the 80/20 rule: pay tiny amounts to get most of the data. After all, the creators of the data were doing it for free and are in such fierce competition that there's no bargaining power.
- More nefarious AIs will just do transfer learning on intermediate neurons, very difficult to prove stealing here.
- Even if you get the system to work, what about future artists and writers? Are we just creating an entrenched historical group of creatives getting royalties forever?
The distributional problem is not well solved by copyright, and better solved with e.g. corporate taxes, income taxes, VATs.
This is kind of what happened with music, no? In some countries hard drives, SSDs etc all carry an additional tax that is then given to some copyright organization. Of course it's not the artists that mainly benefit from this, but instead it's the people running said organization.
Copyright expires, and new artists will create new (copyrightable) art in the future. Unless your assertion is that generative AI is so good no one will make art without it ever again?
The flip side of this is that if we undermine paid creators until there's no incentive for them to create, then the AIs' abilities stagnate on old data, and we as a society drop, or at least diminish, the skillsets that could create new media.
AI can generate stuff humans care to look at only because of the availability of data that humans created for each other to enjoy. As tastes, fashions, zeitgeists and pop culture change amongst humans, the AI models will always be behind and unable to follow trends completely. I think.
The incentive to create is almost never financial. How many artists finance their creative efforts by working day jobs? Making a living as an artist is more about buying yourself the time to focus on making art than it is about making money. People will continue to create art, however they can, because they must.
This is all stuff I am actively thinking about since it is impacting me right now, so I appreciate the discussion and would be happy to be wrong.
I'm not even slightly concerned about that.
1. Art is better when it's not paid. Real artists have day jobs that pay the bills and they create art to express their ideas, not to make money.
2. Paid art isn't going away, it will just change. Certain skillsets will be forgotten, like how landscape painting was replaced by photography. But talented artists will leverage AI tools to create works that are greater than anything that came before.
Trying to define who "real" artists are is a folly for the ages. It is the dream of many artists that they get paid for their art, and many achieve it. The starving artist is a mythos of pain and suffering, a good story but hardly good for art. Some of the best composers from history were paid; some of the most influential artists were from wealthy families. They were able to focus on their work without fear of money, and because of this they could excel in technique and execution, which allowed them to produce some of the highest forms of their art in history.
Eg, Donald Judd’s works are these creative decisions and processes distilled to the most basic of sculptural form.
The boat has long since sailed on this… and it's globally entrenched as a norm of international trade that we are all "ok with" this regime of 75-year or century-plus copyright terms…
And arguably the entire copyright vs AI/ML training datasets debate is founded on the notion that the artists individual copyright will last long enough that it’s going to outlive the average artist. If we look at one of the old copyright regimes, for comparison… in a world where copyright is a short default/implicit/automatic term (14 or 28 years) and the copyright owner can elect to register and pay for extensions (for a more modern twist, preferably combined with increasing incentive to prevent perpetual renewal abuses by Disney, et al)… now imagine how much data from up to 28 years ago there is, the catalogue of art and photographs and text and books and academic writings… all public domain because the authors didn’t consider them of sufficient value… all free for the ML model training… this gets even larger with a 14 year term…
Suffice to say that we are seeing systemic impacts already. Culturally, we're seeing more and more money put behind less and less content controlled by fewer and fewer people, due to a slow death spiral of copyright stranglehold across multiple industries: written, visual, audio and video arts are all dominated by large corporations holding IP… Yes, individuals continue to create, but other than rare breakthrough chance successes and internet-age viral successes (which are often completely arbitrary/random and have no real quality), these companies decide what will be popular culture…
My prediction is that the AI/ML models will be allowed but heavily scrutinised, under the simple legal doctrine that the user is the one committing the infringement, since the primary purpose of these models is not infringement but unique creation. But suspicion will linger among artists, and it will become a normal part of contracts in the art world… effectively an artist's equivalent of the way police in many places view spray cans: the primary purpose of spray paint is not to create illegal graffiti, which is the justification many places used to overturn poorly justified civic bans on possession of spray paint.
I’d like to see any more draconian spread of derivative work rights (style rights etc) to be accompanied by drastic reductions in the automatic copyright term, as the ability to churn out lots of automatic content drastically lowers the value of long long terms, and the counter argument that it makes the existing rights more valuable is fucking insane as we do not need to pass copyright down to the great-great-great-great-grandchildren… the terms are already too long.
Personally I’d like to see the right to train statistical models on any works without the permission of the author enshrined in statute and an end to common-law copyright, a return to the Statute of Anne 14/28 time length, and a clear delineation between the “work” as having an author for an eternity but having a “copyright of the work” vastly limited in scope.
Ask yourself, do we want to be extending the reach of large copyright holders like Disney into taking a fee from LLM producers because they COULD be helping people draw Mickey ears on their private creations?
This is Betamax all over again, and luckily that Supreme Court opinion will weigh heavily in the lower court's judgement of these models as fair use.
A less equitable world has artists getting paid; in a more equitable world, everyone can just use open-source AI tools like Stable Diffusion.
It’s also worrying that requiring consent to train an AI model will inevitably lead to requiring consent to make handmade art that’s a little too similar to some other existing artwork (ie, how all art works through reference, training, and inspiration). A world where Getty and Disney control even more than they already do.
The simple fact of the matter is that Disney and Getty invest a lot of money into these materials being out there in the first place. Open source programmers and artists spend a lot of time producing works for no cost other than some minor courtesies.
AI companies aren't your friend or the little mom 'n pop shop down the road. They're technology giants backed by billionaires. When it comes to Disney versus Google/Microsoft, I'm against both sides if it means giving up my rights.
Big AI taking your stuff and ignoring copyright law isn't some kind of protest against copyright; it's the very opposite. It shows that copyright doesn't matter if you have the money to defend yourself in court. Violate the MPAA's copyright and you get extradited; violate some random person's copyright and you should feel honoured that people even want to steal your work.
In my opinion, the idea behind the current copyright system would work fine if the terms weren't so ridiculously long. Restrict copyright to five or ten years and I'd be fine with the whole thing. This "70 years after the death of the author" crap is the biggest stifling effect copyright adds.
I firmly believe in "practice what you preach". If you declare you firmly believe in A, but then do something directly counter to that because it's more convenient in this specific case, that doesn't sit right with me.
Besides, further expanding copyright in this one area will only make it so much harder to reduce it later. And the pro-copyright folks will be able to say "you say you want less copyright, but you vigorously advocated in favour of copyright then, you hypocrite!" (and they wouldn't be entirely wrong, either). All this effort and energy fighting ML tools would be better directed at reducing copyright instead.
I don't disagree with your view on corporations. Do I like what CoPilot is doing? Not really. But at the end of the day: does CoPilot's or ChatGPT's mere existence really take away anything concrete from me? Am I harmed or even inconvenienced by it? Are my rights reduced? Is my code harmed by it? Is my income reduced? I don't really see how it concretely affects me, other than a general "feeling of unfairness".
And I see real risks with all of this: most regular people and small businesses don't have the resources to litigate, as it's expensive and time-consuming, so a "license" that you or I slap on a piece of code is, realistically speaking, just ink on a piece of paper. GPL violations are rampant; violations of other licenses probably happen even more (but people generally care less about that, so they're not as widely publicized). Who will benefit from more copyright law on their side? The ones with deep pockets and many lawyers on retainer, i.e., the corporations neither of us like. Think creative new copyright lawsuits, such as the "we claim copyright on the Java API" kind of stuff.
It's Disney that are, by proxy, suing Stable Diffusion to create the legal precedent that you desire.
I can’t speak for everyone, but personally I find that copyright can be used properly or abused, on both sides (holder/consumer). It doesn’t mean that copyright is bad, only that particular categories of claims and usage are. But abusing copyrighted material from millions of little creators at insanely automated scale is another level of evil, especially when those creators explicitly require consent for exactly this type of use.
worrying that requiring consent to train an AI model will inevitably lead to requiring consent to make handmade art that’s a little too similar to some other existing artwork
That’s the root of the misunderstanding, afaict. We can agree that at-scale processing is bad and that fair use is still okay. A human with a pen (or a text editor) can’t damage copyright at scale by learning terabytes of material in a few weeks and producing the same amount in hours, so they can be excluded from this. Humans who use AI can, so they’re a target.
They aren't training it on Microsoft or GitHub code.
> A world where Getty and Disney control even more than they already do.
This is exactly what is currently happening, though: it's okay to rip off the little guy artist or coder. The argument here is that one big guy stood on another big guy's foot, and as the little folks we shouldn't stand for it either.
The copyright terms mean that, for Life+70 countries, only works where a) the author died before 1953 and b) the works were published before 1928 are in the public domain.
An AI training on a given work should comply with the law and with copyrights, just like anyone else. It should also respect the license or other terms the works were released under. -- You could easily silo the data by license, and have a different model per license.
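The "silo by license" suggestion is mechanically straightforward. A minimal sketch, assuming a corpus held as filename-to-text mappings; the naive string matching and the license labels here are placeholders, and a real pipeline would use a proper license scanner rather than substring checks:

```python
from collections import defaultdict

# Hypothetical sketch of "siloing data by license": bucket source files by the
# license text they declare, so a separate model could be trained per bucket.
def detect_license(text: str) -> str:
    if "MIT License" in text:
        return "mit"
    if "GNU GENERAL PUBLIC LICENSE" in text:
        return "gpl"
    if "Apache License" in text:
        return "apache-2.0"
    return "unknown"   # no recognized grant: excluded from training by default

def silo_by_license(files: dict) -> dict:
    """Map license id -> list of file names; one training corpus per key."""
    silos = defaultdict(list)
    for name, text in files.items():
        silos[detect_license(text)].append(name)
    return dict(silos)
```

A model trained only on the "mit" silo could then carry a single blanket attribution file, while the "unknown" silo is simply never trained on.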
Patents should be a good thing (they allowed inventions to be published instead of being kept secret). However, it is easy for large companies to get patents on trivial things, write overly broad patents, and collate a large number of patents in a domain. That means trying to innovate or compete in a highly patented field like audio or video compression is difficult.
When a new technology is introduced - for example, when the compact disc was invented - lawyers get to argue over whether "distribute" applies to music CDs, or just to vinyls and music tapes (because at the time of granting that license, CDs weren't yet a thing! Gotcha!).
The answer to this conundrum might vary in different countries, and we can have fun discussing that in the context of AI, but it does not affect how handmade art shouldn't be too similar.
Or you can (correctly) think it's a huge drag on innovation and human progress.
If you think the latter then hoping in this case for legal precedent to broaden the scope of copyright enforcement is just bizarre logic. This isn't a rule that already exists as such. The case will set a precedent (based on interpretation of existing law) for the future.
Then again, a lot of open source authors understand that their works are not groundbreaking inventions and want to share them with the world without any fees or costs. And others have groundbreaking inventions and still share them for free with the world.
But there are almost always license terms attached to the piece of work. They can be essentially non-limiting, like public domain or MIT. But they can also enforce some minimal requirements, like attribution. Why should any entity, especially huge corporations, not be bound by those conditions? It was mostly those corporations which created a restrictive copyright. Try to draw an image of Mickey Mouse and put it on your website and see what happens.
2. People using non-copyleft license just do it because public domain seems to have a complicated legal status across the world.
You know that Stable Diffusion lawsuit? Go check who the lawyers behind that work for; Disney wants that same outcome.
If a company invests $250 million into an original movie, I don't see why they shouldn't have some say over their content for at least a couple of years. Not until 2150 or whatever the end date for modern works is supposed to be, but give it some time at least.
OpenAI is the result of billions being thrown around. When it comes to billions, it doesn't matter if they come from Disney, Google, Microsoft or Amazon. None of these companies have our individual rights at heart, they only care about profits.
In this rare occasion, the interests of the people and Disney align. The laws protecting the independent writers/programmers/artists are the same ones that protect Disney.
The tools themselves work on arbitrary data sets. Anyone who can dig up enough public domain/attribution-free pictures/code/text can train their own AI without even coming close to copyright issues. Hell, had these super smart AI people managed to figure out a method of attribution, the data set could include massive amounts of works released under Creative Commons or open source licenses.
Good quality data and more of it means better output. Disney is almost certainly doing their own thing internally, benefiting from their ability to use both free as well as their own IP and the capital to hire cheap workers to train it directly.
It's not that I don't understand why artists might be upset about a company scraping copyrighted art, I just think that the longer term effects of legally kneecapping open source variants while handing over the most powerful versions of it to the existing intellectual property giants are A Bad Thing.
What the customers of AI want is accurate predictions from the models, and they could get that even if everyone demanding removal from the training set were removed.
The makers of generative AI could remove every living artist who wants out of the dataset, and the model would still develop a general grasp of color theory, composition, almost every art style in existence, ... because the fact of the matter is, there is just that much data out there. Our species has collectively spent DECADES recording, storing and categorizing everything and the proverbial kitchen sink. There are god-knows-how-many petabytes of data available in images alone, so even if just 1% of that could be used to train generative models, it would still be more than adequate.
And soon after that, there is an explosion of new generated art, filtered through the aesthetic sense of millions of humans, that can just be fed back into the models, to make them better.
The end result is the same: High-quality image generation on a scale hitherto unseen, running even on consumer grade hardware. And what lawsuits will be filed then?
I swear every time I see this argument, because it makes me angry.
You’re right, but they didn’t, because they were too lazy and cheap to do it that way.
…and that’s why people are angry, and rightly so. Fully licensed models are the future, and it’s both irritating and disappointing that we are where we are right now because the people training these models were too lazy to assemble a training dataset that wasn’t problematic (i.e. full of porn and copyrighted material).
You can argue the “but at the end of the day it’s all the same…” argument if you like, but the lawsuits make it clear that it isn’t ok.
They’ve completely messed it up.
There’s a reason the openai api terms of service says that “the Content may be used to improve and train models”; they’re setting themselves up to have a concrete defence for the source training data for their models.
Stability can burn in a fire. They’ve really trashed the reputation of generative AI in a way that is going to be very difficult to recover from.
That reputation damage you think matters doesn’t exist.
Well, a lawsuit isn't a decision; we will have to wait for the courts to decide whether it's legally okay or not.
Reputation damage has been done.
Undoing that is going to take time and effort which, could be spent on more productive things.
I’m disappointed in where we are right now. It was entirely avoidable.
/me shakes head…
Somehow we have managed to come full circle to the first episode of HBO's Silicon Valley.
A search index usually links to the source. Without that a search index is worthless, you can't use content if you don't even know where it comes from and who holds the rights.
Google search links to sources like Wikipedia in its info boxes, because without that you can't know whether the info is reliable or sourced from my brother's coworker's imaginary flat-earther friend.
If a model was to add attributions to each of its answers, then perhaps the search engine analogy would hold. But, they don't (and right now, to my understanding, can't.)
By the time this reaches judgment and goes through the appeals process, there will be a vast industry of non-infringing uses that are clearly transformative and fair use (Sony v. Universal).
You cannot say that the person using ChatGPT to control the lights in their garage is infringing on anyone’s copyright in any manner whatsoever. The point of copyright is not to gain a permanent monopoly on certain speech. The point of copyright is not to make sure that people are fairly compensated for their work. Their work might be terrible but contain a good idea that is later reimagined in a better way (Baker v Seldon) but that’s for the market to decide.
The courts will probably concur that these models are fair-use and I will agree with their judgement.
Current market odds for that are at 77%: https://manifold.markets/JeffKaufman/will-the-github-copilot...
But "eating" is a fun word.
The purpose of copyright is to progress science and useful arts. Period. Any action taken in the name of copyright that does not progress science and useful arts is unsupported by law.
What else do we know about copyright? A copyright can apply only to creative expressions. While the bar for sufficient creativity is intentionally low, it is non-zero.
Another thing we know is that purely functional expressions are not copyrightable. When does an expression go beyond being a functional expression to a creative expression? That’s up to a judge. Since code is math, and math by itself cannot be copyrighted, whatever makes an expression rise to the level of creative expression must lie beyond the math. Updating a database field, factoring primes, or using data-correction algorithms are not creative expressions.
Now for AI. Only humans may own copyrights. The output of an AI is not copyrightable. But what if the input was copyrighted?
When it comes to software code, AI will value expressions that are commonly used more so than uncommon ones. But software code is, by its very nature, an intertwined collection of copyrightable (creative) and non-copyrightable (functional) expressions. If AI values commonly used expressions, those expressions are highly unlikely to be creative enough for copyright protection in the first place.
So we have a circumstance where AI is trained on copyrighted but Open Source code. Yet the code itself is comprised of both creative (presumably) and functional code, with no clear delineation of what is and what is not protectable.
Lastly, many authors do not understand what constitutes a creative expression that is protectable by copyright. The amount of work required to create the expression is meaningless. Manipulating data to thresh out something interesting is not creative. Let’s just face it that most software is comprised of mostly functional expressions that are not protectable. Back to that “math” problem again!
The big take-away? The purpose of copyright is to progress science and useful arts, not to build walls around ideas and concepts (which, by themselves, are not protectable).
Would that mean you can simply use one AI (or more) from anyone else to train another AI?
Of course access can always be limited to an API with rate limits and per-request costs, which would make it difficult to straight up copy the whole thing, but it would be hard to justify any legal protections against it.
If you want to prove your data was used to train an AI, the onus is on you to prove it. Good luck.
The AIs that follow the law strictly will be at a disadvantage to those that do not.
Which would be easy during a lawsuit - the process of discovery means you get to check out the training dataset.
The allegation isn't that the AI trainers are hiding, but that what AI trainers are doing _itself_ constitutes copyright violation. AKA, they want the right to use the works to train an ai model to be a right that must be explicitly granted.
I hope that legislation is not introduced to prevent training, as this right would stifle progress.
Yes, and then people say "no" or "pay me". End result of this is that the only ones with good AI models are megacorporations that will DRM the heck out of it.
Years later those same artists will complain that they now have to pay $1000 a year to Disney/MS/Adobe to create art. Because these megacorporations can afford to pay for it. They're the ones that will benefit the most from this, because it creates an insurmountable moat for them.
Copyright exists to encourage the creation of more art and to progress science. AI is clearly a helpful step in that direction. Humans learn from others' works. Should we make that illegal too?
I find it astonishing that people continue to make this argument. A machine is owned by someone, a human is not. Why should the law treat machines the same way as a human? Sounds like some corporate flim-flam to me.
>To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries;
The purpose of copyright is not to protect the authors, it is to promote the progress of science and art.
The current situation for AI image generation is pretty much the only way these technologies will be available to everyone. Most other paths will simply lead to billion dollar corporations acting as gatekeepers to this technology. Megacorps can afford to hire artists to generate specific art for their AI models, everyone else cannot.
You end up with billion dollar corporations gatekeeping this technology either way (who else has the capital to best train the models?). This isn’t about the little guy.
It shouldn't, which is why arguments that the algorithm is "learning", and is therefore doing the same thing that is legal for humans to do, are completely fallacious, on top of being sheer anthropomorphism.