> I'm afraid I don't think that actually comes even close to resolving the legal implications of recitation.
If Copilot is reciting pieces of GPL code (which we know it can), then not only does it need to point out where it has grabbed that code from, Copilot itself is (probably) required to be GPL-licensed.
I don't follow. If the suggested code is GPLed, it's your decision to include it in your code or not. If you accept the GPLed code into your non-GPLed code base, you've violated the GPL. As a friend of mine said years ago about situations like this, "Saying 'I was just following the algorithm' is not a defense."
Now let's ask: is Copilot itself a violation of the GPL? I'm going to assume that its codebase is not derived from GPL code. I have no way to prove this, but most code is not, and Microsoft, GitHub, and OpenAI are reputable organizations, so assuming good faith here seems fair.
Did Copilot train on GPLed code? Absolutely. I don’t think anyone has ever suggested otherwise.
Does processing the code count as an integration? I'd say no. It certainly isn't part of the executable code base, even in binary form, which is what the GPL was targeting. Even if it were, it wouldn't be a GPL violation, since the Copilot binary isn't being distributed; it would, however, be an AGPL violation. I don't know how popular the AGPL is, but let's assume that at least something from some AGPLed file exists inside Copilot. Again, that doesn't matter, because the code isn't actually being executed.
So is it "distributing" the code? Sure, but that's not a violation. If you make a binary, you have to distribute the code, but the opposite isn't true. Anyway, just having a piece of GPL source code in a database isn't a violation, and never has been. You might as well be saying that because Google's search index can return entries from the Linux kernel, all of Google is in violation of the GPL. Not even RMS would take that extreme view.
> It certainly isn’t part of the executable code base, even in binary form, which is what the GPL was targeting.
The GPL also targets source forms, such as what is being produced verbatim. (See Clause 1 of the GPL.)
> Again that doesn’t matter, because the code isn’t actually being executed.
That's not a requirement of the GPL or AGPL. It's irrelevant.
> So is it “distributing” the code? Sure, but that’s not a violation.
It is, without the license.
> Anyway, just having a piece of GPL source code in a database isn’t a violation, and never has been. You might as well bring saying that because Google’s search index can return entries of the Linux kernel, all of Google is in violation of the GPL. Not even RMS would take that extreme view.
Storage mechanisms are not the problem here. The GPL source code is not in some database, in this case. A search index is irrelevant, because that is just a storage mechanism.
However, Copilot generates verbatim code, and it generates novel code. That is, it both contains the plain text of the original (recitation and redistribution) and generates derived code (transformation).
In both these cases it doesn't attribute, so you can say with certainty that the Copilot software contains the source code, and may create derivative works, all without attributing the license.
It is the fact that it contains the source code, reciting it verbatim, that makes Copilot probably need to be GPL-licensed itself, as it is not a storage mechanism.
It is that it distributes derived works without attribution that puts the end-user's codebase at risk of violating the GPL.
>However, Copilot generates verbatim code, and it generates novel code. That is, it both contains the plain text of the original (recitation and redistribution), generates derived code (transformation).
So are you saying that because the language model was trained on GPL code, even though it spits out novel code, that code is derived?
That seems like a pretty expansive view. I’ve read some GPL code in my life, and I’m sure it has influenced me. Does that make all my code “derived”? I wouldn’t say that. To truly be derived it needs to be a nontrivial amount, otherwise every time you type “i++;” you’re in violation. This is hard to prove.
A clearer cut case is including code it suggested when that’s verbatim. That would be a GPL violation if it’s included in someone’s codebase, but that’s not what it seems you’re arguing. You seem to be arguing that Copilot is in violation simply for suggesting the code.
This means you're asserting that storing the code in a language model is somehow different from storing it in a database, but you haven't told me why that is.
Databases have a query execution system and a database file. They are separate pieces. The query executor can work on any database file, and swapping out the database file will give different results, even though the execution code is the same.
This is exactly the same case for language generators. You have a language model, and a piece of code that makes predictions based on the given text and the language model. Swap out the language model, you get different results.
The storage formats are different, but that doesn't matter. The data and the code are separate. Given this information, why — and be specific — is a language model not like a database?
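The analogy above can be sketched in code. This is a deliberately toy illustration (a dictionary standing in for both a database file and a language model; all names are hypothetical), showing the separation being claimed: the "executor" code is fixed, while the swappable data artifact determines the output.

```python
# Toy sketch of the claimed separation between execution code and data:
# the same lookup/prediction code runs against interchangeable "model files"
# (or "database files"), and only the data determines what comes back.

def predict_next(model: dict, prefix: str) -> str:
    """Fixed 'executor' code: return the continuation stored for `prefix`."""
    return model.get(prefix, "")

# Two different swappable data artifacts, loaded by the same executor:
model_a = {"for (int i = 0; ": "i < n; i++) {"}
model_b = {"for (int i = 0; ": "i <= limit; ++i) {"}

# Same code, different data, different suggestions:
print(predict_next(model_a, "for (int i = 0; "))  # i < n; i++) {
print(predict_next(model_b, "for (int i = 0; "))  # i <= limit; ++i) {
```

Of course, a real language model encodes its training data statistically rather than storing records verbatim, which is exactly the point in dispute here.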
> That seems like a pretty expansive view. I’ve read some GPL code in my life, and I’m sure it has influenced me. Does that make all my code “derived”? I wouldn’t say that. To truly be derived it needs to be a nontrivial amount, otherwise every time you type “i++;” you’re in violation. This is hard to prove.
You're not a piece of software, so the areas of copyright law that are applicable are completely different. (And yes, copyright does acknowledge a minimal amount required to be copyrightable - but that minimal amount may sometimes be argued to be a single line.)
However, you can absolutely face civil charges if you reproduce too-similar code for a competitor, after absorbing the technical architecture at another workplace.
> This is exactly the same case for language generators. You have a language model, and a piece of code that makes predictions based on the given text and the language model. Swap out the language model, you get different results.
Legally speaking, Copilot isn't advertised with multiple available language models. It isn't presented that way, so it won't be treated that way. It will be treated as a singular piece of software.
> Given this information, why — and be specific — is a language model not like a database?
In the eyes of the law, and this is very specific, the model is marketed as part of the software, and so is part of the software. The underlying design architecture is utterly irrelevant, because it is presented as a package deal of "GitHub Copilot".
> You're not a piece of software, so the areas of copyright law that are applicable are completely different. (And yes, copyright does acknowledge a minimal amount required to be copyrightable - but that minimal amount may sometimes be argued to be a single line.)
Putting aside the philosophical aspects of this statement, you've proved my point. I said that the person ultimately held liable for violating a license is not a tool, but the person choosing to integrate the changes suggested by the tool. But now somehow you expect me to believe that the person who built an automaton, but is not directing the automaton, and certainly doesn't have final say in whether or not to incorporate the automaton's suggestions, is legally culpable, because they're being held to a stricter standard? If that were the legal standard for any tool, then literally every manufacturer of every tool would be held liable for any and all misuse. Obviously, this is not the case.
> Legally speaking, Copilot isn't advertised with multiple available language models. It isn't presented that way, so it won't be treated that way. It will be treated as a singular piece of software.
Actually speaking, you're not a lawyer, and this is an INCREDIBLY controversial statement that doesn't really stand up to much scrutiny, since there is a bright line separating the two.
Even if GitHub were ruled against (and they won't be), case law is filled with examples where the injunctive relief is limited to the claims presented (in this case, source related to a specific work) rather than the entire system, including both the playback device and the recording.