So as a code author I am pretty upset about Copilot specifically, and it seems like SD (Stable Diffusion) is similar (I hadn't heard before that DeviantArt did the same thing GitHub did). But I agree with this take: the tech is here, it's going to be used, and it's not going to be shut down by a lawsuit. Nor should it, frankly.
What I object to is not the AI itself, or even that my code has been used to train it. It's the "copyright for me but not for thee" way that it's been deployed. Does GitHub/Microsoft's assertion that training sidesteps licensing apply to GitHub/Microsoft's own code? Do they want to allow (a hypothetical) FSFPilot to be trained on their proprietary source? Have they actually trained Copilot on their own source? If not, why not?
I published my source subject to a license, and the force of that license is provided by my copyright. I'm happy to find other ways of doing things, but it has to be equitable. I'm not simply ceding my authorship to the latest commercial content grab.
> Have they actually trained Copilot on their own source? If not, why not?
People have posted illegal Windows source code leaks to GitHub. Microsoft doesn't seem to care that much, because these repos stay up for months or even years at a time without Microsoft DMCAing them; if you go looking you'll find some right now. I think it is entirely possible, even likely, that some of those repos were included in Copilot's training data set. So Copilot actually was trained on (some of) Microsoft's proprietary source code, and Microsoft doesn't seem to care.
The question is not whether there's some of their code that they don't mind being incorporated, but whether there's any at all that they wouldn't allow to be. And more importantly, not for use by their own bot, but by someone else's.
If licenses don't apply to training, then they don't apply for anyone, anywhere. If they do apply, then Copilot is violating my license.
IANAL, but they likely believe their unpublished source code contains trade secrets. They may believe that training a public model is okay on published source code (irrespective of its copyright license), but that doing so on unpublished source code containing trade secrets might legally count as a voluntary relinquishment of their trade secrets (if we are talking about their own code) or illegal misappropriation of the trade secrets of others (if they trained it on third-party private repos).
I seriously doubt Microsoft / GitHub would care if Copilot or a similar model were trained on their proprietary source code. An advanced code completion tool does not pose any significant risk of someone building a product competitive with GitHub or any other Microsoft product.
This is an intelligence augmentation tool. It’s effectively like I’m really good at reading billions of lines of code and incorporating the learnings into my own code. If you don’t want people learning from your code, don’t publish it.
I doubt Microsoft sees fragments of Windows source code as a particular crown jewel these days. That said, some of it is decades-old code that was never intended for the public to see (unlike, presumably, anything in a public GitHub repository). And some of it is presumably third-party code licensed to Microsoft that was likewise never intended for public viewing. So, while it would be a good gesture on the part of Microsoft to scan their own code--if they haven't done so--I could see why it might be problematic. (Just as training on private GitHub repos would be.)
tl;dr I think there's a distinction between training on copyrighted but public content and private content.
Private third-party GitHub repos are another good example. If licenses don't apply to training data, as GitHub has asserted, why not use those too? Do they think they'll get in trouble over it? Why doesn't the same trouble apply to my publicly-readable GPL-licensed code?
I assume there's something in their terms of service about not poking around in private repos or using the code, even for internal purposes, except for necessities like backups, court orders, etc.
I am not a lawyer, but I also assume Microsoft's position, at least in part, is that they can download and use code in GitHub public repos just like anyone else can, and that developing a public service based on training with that (and a lot of other) code isn't redistributing that code.
Copyright is not the only law. Something might be permitted by copyright law (as fair use, an implied license, etc.) yet simultaneously violate other laws: breach of contract, misappropriation of trade secrets, and so on.
Microsoft is not training Copilot on your proprietary code that you keep on your own systems, just as they are not training it on their own proprietary code.