Let's train AI on leaked source code from Microsoft, Nvidia, and other commercial companies, and find out how soon such a project gets sued and shut down via DMCA.
Might also include source-available code like Unreal Engine, etc.
I think they are saying the training part itself doesn’t.
It seems to be kind of a fine point; otherwise you could implement a license-laundering AI by training it on a single codebase and asking it questions about it. If you end up with a byte-by-byte replica sans a LICENSE file, presumably you wouldn't get away with actually _using_ the model to clone licensed software, even if training it was technically OK.
What if I train a model on one snippet of code, and it always produces that snippet of code for all inputs? If I did not have a license to use that code, would laundering it through a model absolve me of violating the license?
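The degenerate case described above can be sketched as a toy program. Everything here is made up for illustration (the snippet, the `OneSnippetModel` class, and its methods are all hypothetical): a "model" whose training is pure memorization, so every prompt regurgitates the training data byte for byte, which is exactly the scenario where laundering through a model plainly wouldn't change the copyright status of the output.

```python
# Hypothetical licensed code we do NOT have permission to redistribute.
LICENSED_SNIPPET = '''\
def quicksort(xs):
    if len(xs) <= 1:
        return xs
    pivot, *rest = xs
    return (quicksort([x for x in rest if x < pivot])
            + [pivot]
            + quicksort([x for x in rest if x >= pivot]))
'''

class OneSnippetModel:
    """A degenerate 'model' overfit to a single training example."""

    def __init__(self):
        self.memory = None

    def train(self, code: str) -> None:
        # "Training" here is pure memorization; no transformation occurs.
        self.memory = code

    def generate(self, prompt: str) -> str:
        # Every prompt, regardless of content, yields the memorized snippet.
        return self.memory

model = OneSnippetModel()
model.train(LICENSED_SNIPPET)

# The output is byte-for-byte identical to the licensed input.
assert model.generate("write me a sort function") == LICENSED_SNIPPET
assert model.generate("anything at all") == LICENSED_SNIPPET
```

Real models sit somewhere on a spectrum between this extreme and genuinely novel synthesis, which is exactly why the derivative-work question is blurry.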
This is something I don't understand, not just about Copilot but about many NN generators: the outputs sometimes seem like such an obvious ripoff of something that there's no way they qualify as novel works, although I can't prove it. Yet the outputs are treated as if all copyright issues are discussed and cleared. Just how come?
> Yet the outputs are treated as if all copyright issues are discussed and cleared. Just how come?
Anyone developing and trying to sell ML models will try hard to act like this is a settled question. This is obviously a huge uncertainty factor for using various commercial ML models.
Are you drawing a distinction between creating the model, which depends on viewing open source code and is presumably fine, and using or selling the model, which may output licensed code and get you in trouble?
> CoPilot using code to train a model doesn’t violate any software licenses.
The argument GitHub advanced for this was absolutely ridiculous: that code stops being code when CoPilot analyzes it, qualifies only as text while CoPilot analyzes or suggests it, but magically turns back into code if/when the user incorporates the suggested text into their code base.
The real argument GitHub made was much more credible: the CEO came on here and said something like "the legality of code reuse is an interesting new area of law and we welcome the debate," which in lawyer terms means "we have smart, well-informed lawyers, and we can easily get these cases in front of dipshit judges who don't understand technology."
It's an interesting product, and I hear good things about its utility in practice. But the legal argument is utter nonsense. CoPilot definitely violates many software licenses, and knowingly so.
Reading and remembering code is allowed under all the OSS licenses. It's the reproduction of the code that's restricted. The blurry question is always: how much does an expression have to change between it being classified as an exact reproduction, a derivative work, and a novel work?
CoPilot would definitely fail the clean-room test, though.