CoPilot using code to train a model doesn’t violate any software licenses.

SXX · on June 25, 2022

Let's train an AI on leaked source code from Microsoft, Nvidia and other commercial companies and find out how soon such a project gets sued and shut down via DMCA.

Might also include source-available code like Unreal Engine, etc.

orf · on June 25, 2022

Not having the resources to defend yourself against large and litigious companies does not mean parent’s point is wrong.

Training CoPilot on code doesn’t violate any software licenses.

tyingq · on June 25, 2022

>Training CoPilot on code doesn’t violate any software licenses.

Even if it regurgitates large unchanged passages of code, with the advertised selling point of using anything it produces in an unrestricted fashion?

synu · on June 25, 2022

I think they are saying the training part itself doesn’t.

It seems to be kind of a fine point; otherwise you could implement a license-laundering AI by training it on a single codebase and asking it questions about it. If you end up with a byte-for-byte replica sans a LICENSE file, presumably you wouldn’t get away with actually _using_ the model to clone licensed software, even if training it was technically ok.

Imnimo · on June 25, 2022

What if I train a model on one snippet of code, and it always produces that snippet of code for all inputs? If I did not have a license to use that code, would laundering it through a model absolve me of violating the license?
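To make the thought experiment concrete, here is a minimal sketch. `ConstantSnippetModel` is a hypothetical class invented for illustration; it is not a real library and says nothing about how Copilot actually works. "Training" just stores the snippet, and "generation" returns it verbatim regardless of the prompt.

```python
class ConstantSnippetModel:
    """Hypothetical degenerate 'model' that memorizes a single snippet."""

    def __init__(self):
        self._memorized = None

    def train(self, snippet: str) -> None:
        # "Training" on one example: the snippet itself becomes the model's only state.
        self._memorized = snippet

    def generate(self, prompt: str) -> str:
        # The prompt is ignored; the output is always the training snippet.
        if self._memorized is None:
            raise RuntimeError("model has not been trained")
        return self._memorized


# Stand-in for code I have no license to use (assumed for illustration).
licensed_snippet = "def add(a, b):\n    return a + b\n"

model = ConstantSnippetModel()
model.train(licensed_snippet)

outputs = [model.generate(p) for p in ("sort a list", "parse JSON", "")]
all_verbatim = all(o == licensed_snippet for o in outputs)
print(all_verbatim)
```

Every output is a byte-for-byte copy of the input, so the "model" is functionally just a file copy with extra steps.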

Fordec · on June 25, 2022

And if yes, what happens if I replace the word "model" with "employee"? Suddenly the context changes?

keonix · on June 25, 2022

Nothing changes; it is still a violation.

numpad0 · on June 25, 2022

This is something I don't understand, not just about Copilot but about many NN generators: the outputs sometimes seem like such an obvious ripoff of something that there's no way they qualify as a novel work, although I can't prove it. Yet the outputs are treated as if all copyright issues are discussed and cleared. Just how come?

formerly_proven · on June 25, 2022

> Yet the outputs are treated as if all copyright issues are discussed and cleared. Just how come?

Anyone developing and trying to sell ML models will try hard to act like this is a settled question. This is obviously a huge uncertainty factor for using various commercial ML models.

ipaddr · on June 25, 2022

The copyright issues are offloaded to the user/developer.

If Copilot suggests anything under copyright and you publish it, you get sued.

nojito · on June 25, 2022

That doesn’t have anything to do with my comment.

The model is 100% compliant with open source licenses.

synu · on June 26, 2022

Are you drawing a distinction between creating the model, which depends on viewing open source code and is presumably fine, and using or selling the model, which may output licensed code and get you in trouble?

tessierashpool · on June 25, 2022

> CoPilot using code to train a model doesn’t violate any software licenses.

The argument GitHub advanced for this was absolutely ridiculous: that code stops being code when CoPilot analyzes it, qualifies only as text while CoPilot analyzes or suggests it, but magically turns back into code if/when the user incorporates the suggested text into their code base.

The real argument GitHub made was much more credible: the CEO came on here and said something like "the legality of code reuse is an interesting new area of law and we welcome the debate," which in lawyer terms means "we have smart, well-informed lawyers, and we can easily get these cases in front of dipshit judges who don't understand technology."

It's an interesting product, and I hear good things about its utility in practice. But the legal argument is utter nonsense. CoPilot definitely violates many software licenses, and knowingly so.

they4kman · on June 25, 2022

Reading and remembering code is allowed under all the OSS licenses. It's the reproduction of the code that's restricted. The blurry question is always: how much does an expression have to change to move from being classified as an exact reproduction, to a derivative work, to a novel work?

CoPilot would definitely fail the clean room test, though.

ClumsyPilot · on June 25, 2022

The model includes chunks of that code; they copied it into their model.