So as a code author I am pretty upset about Copilot specifically, and it seems like SD (Stable Diffusion) is similar (I hadn't heard before that DeviantArt did the same thing GitHub did). But I agree with this take: the tech is here, it's going to be used, and it's not going to be shut down by a lawsuit. Nor should it, frankly.
What I object to is not the AI itself, or even that my code has been used to train it. It's the "copyright for me but not for thee" way that it's been deployed. Does GitHub/Microsoft's assertion that training sidesteps licensing apply to GitHub/Microsoft's own code? Do they want to allow (a hypothetical) FSFPilot to be trained on their proprietary source? Have they actually trained Copilot on their own source? If not, why not?
I published my source subject to a license, and the force of that license is provided by my copyright. I'm happy to find other ways of doing things, but it has to be equitable. I'm not simply ceding my authorship to the latest commercial content grab.
> Have they actually trained Copilot on their own source? If not, why not?
People have posted illegal Windows source code leaks to GitHub. Microsoft doesn't seem to care that much, because these repos stay up for months or even years at a time without Microsoft DMCAing them; if you go looking you'll find some right now. I think it is entirely possible, even likely, that some of those repos were included in Copilot's training data set. So Copilot actually was trained on (some of) Microsoft's proprietary source code, and Microsoft doesn't seem to care.
The question is not whether there's some of their code that they don't mind being incorporated, but whether there's any at all that they wouldn't allow to be. And more importantly, not for use by their own bot, but by someone else's.
If licenses don't apply to training, then they don't apply for anyone, anywhere. If they do apply, then Copilot is violating my license.
IANAL, but they likely believe their unpublished source code contains trade secrets. They may believe that training a public model is okay on published source code (irrespective of its copyright license), but that doing so on unpublished source code containing trade secrets might legally count as a voluntary relinquishment of their trade secrets (if we are talking about their own code) or illegal misappropriation of the trade secrets of others (if they trained it on third-party private repos).
I seriously doubt Microsoft / GitHub would care if Copilot or a similar model were trained on their proprietary source code. An advanced code completion tool does not pose any significant risk of someone building a product competitive with GitHub or any other Microsoft product.
This is an intelligence augmentation tool. It’s effectively like I’m really good at reading billions of lines of code and incorporating the learnings into my own code. If you don’t want people learning from your code, don’t publish it.
I doubt Microsoft sees fragments of Windows source code as a particular crown jewel these days. That said, some of it is decades-old code that was never intended for the public to see (unlike, presumably, anything in a public GitHub repository). And some of it is presumably third-party code licensed to Microsoft that was likewise never intended for public viewing. So, while it would be a good gesture on the part of Microsoft to scan their own code--if they haven't done so--I could see why it might be problematic. (Just as training on private GitHub repos would be.)
tl;dr I think there's a distinction between training on copyrighted but public content and private content.
Private third-party GitHub repos are another good example. If licenses don't apply to training data, as GitHub has asserted, why not use those too? Do they think they'll get in trouble over it? Why doesn't the same trouble apply to my publicly-readable GPL-licensed code?
I assume there's something in their terms of service about not poking around in private repos or using the code, even for internal purposes, except for necessities like backups, court orders, etc.
I am not a lawyer, but I also assume Microsoft's position, at least in part, is that they can download and use code in GitHub public repos just like anyone else can, and that developing a public service based on training with that (and a lot of other) code isn't redistributing that code.
Copyright is not the only law. Something might be permitted by copyright law (as fair use, an implied license, etc.) yet simultaneously violate other laws: breach of contract, misappropriation of trade secrets, and so on.
Microsoft is not training Copilot on your proprietary code that you keep on your own systems, just as they are not training it on their own proprietary code.