I don't fundamentally disagree with you, but what you are saying doesn't hold water.
> a copy is made and reproduced numerous times when training.
Casually browsing the web creates millions of copies of what are likely the same images and text that models are trained on. Computers cannot move information; they can only copy it and delete the original. Splitting hairs over the semantics of what it means to "copy" isn't a strong argument.
> where it is an authorized viewing
What exactly is an unauthorized viewing of a publicly accessible piece of content online that has been hyperlinked to? If we assume things like robots.txt are respected, what makes the access of that data improper?
> it may output material that competes with the original
An art student could create a forgery. I could craft for myself a replica of a luxury bag. But that's not a crime unless it's done with the intention of deceiving someone or profiting from the work. Intent, after all, is nine tenths of the law.
It's an important right that you should be able to do and create things, even if the sale or distribution of the outputs is prohibited. A model's ability to produce content that couldn't be distributed shouldn't preclude its existence.
> So you may have copyright violation in distribution of the dataset or a model's output
And neither of those things is the act of training or distributing the model itself!
There is quite a bit of precedent for "making copies of digital things is copyright infringement". Look at lawsuits from the Napster era. [1]
What makes the use improper? Licenses. Terms of service. Mostly licenses though. For example, all the images on Flickr that were uploaded under Creative Commons licenses (e.g. non-commercial) have now been used in a commercial capacity by a company to create and sell a product.
Similarly, code is on GitHub under specific licenses with specific terms. Copilot is a derivative work of that code; the license terms of that code (e.g. GPL, non-commercial) should extend to the new function that was derived from it.
The reason I mention competition with the original is the fair use test (USA). When courts decide whether something is fair use they consider a few aspects. Two important ones are whether it is commercial, and whether it is a substitute for the original. When art models output something in the style of a living artist, it is essentially a direct substitute for that person.
Sure, I can make a shirt with Spider-Man on it and give it to my brother, but if a company were to use what I made or I tried to sell it, I would expect a cease and desist from Disney.
Training the model may very well be a copyright issue. The images have been copied, they are being used. Whether that falls under fair use will likely be determined on a case-by-case basis in court. I do not believe closed commercial models like Copilot or DALL-E will pass a fair use test.
There is a lot of money involved here though, so we will need to wait for years before we have answers.
> Copilot is a derivative work of that code; the license terms of that code (e.g. GPL, non-commercial) should extend to the new function that was derived from it.
But the very act of training Copilot is not problematic. In fact, if GitHub never did anything with Copilot, the physical act of training the model would not be problematic at all. And that's what's at issue here. How Copilot is used is orthogonal to the article.
> Sure, I can make a shirt with Spider-Man on it and give it to my brother, but if a company were to use what I made or I tried to sell it, I would expect a cease and desist from Disney.
Yes. And training the model isn't the part where you sell it. It's the part where you make it.
> Training the model may very well be a copyright issue. The images have been copied, they are being used.
What do you think "being used" means here? If I work for a company and download a bunch of text and save it to a flash drive, have I violated copyright? Of course not. If I put that data in a spreadsheet, is it copyright infringement? Of course not. If I use Excel formulas on that text, is it infringement? Still no.
And so how can you claim in any way that the creation of a model is anything more than aggregating freely available information?
I don't disagree with you about the use of a model. But training the model is just taking some information and running code against it. That's what's important here.