With Creative Commons and GPL works, it is fairly common for a work to have multiple authors and rights holders. When a single user uploads such a work to a hosting provider, the permission given to the provider is limited to the permission that the user had. They can't give out permissions that they themselves do not have.
It is a similar case when a single user uploads a movie or game to a pirate torrent site. The site can have a terms-of-use that gives a license to the hosting provider, but naturally the users who upload the content might not have the permission to grant anything to the hosting provider. Depending on how much the hosting provider is or should be aware, hosting the content can still be illegal.
> They can't give out permissions that they themselves do not have.
Then, chances are, it's technically illegal to upload those other contributors' code, although if that code is contributed via GitHub itself then the code in the pull request has already been licensed to GH.
It boils down to copyright/DMCA not requiring hosting providers to verify, at submission, that people actually hold the rights they claim over the code. So GitHub now has tons of examples where people lied about their permissions when they uploaded code that wasn't theirs, and this will probably be a valid legal defense, at least for the argument of "does GitHub have the right to use the source in their ML model" (it might really boil down to "are GH's terms vague enough that nobody thought the license included the ability to train artificial intelligence").
"The person who uploaded the code lied about their permissions" won't be a valid defense in a copyright lawsuit by the actual copyright owner, at least in the case where there is no other copy of that code also on GitHub that was uploaded by the copyright holder.
In the US what it will be is good evidence to support a claim by GitHub that they were an "innocent infringer"--someone who did not know they were infringing and had no reason to believe that they were.
What that does, in the case where the plaintiff seeks statutory damages (which they almost certainly will¹), is lower the lower limit. Statutory damages are normally $750 to $30,000 (amount determined by the court). If a defendant proves they are an innocent infringer, that lower limit drops to $200. If the plaintiff can prove that the infringement was "willful", the upper limit goes up to $150,000.
Statutory damages are per work infringed, not per infringement, so we aren't talking $200 or so multiplied by the number of copies GitHub distributed. We are talking of a likely award of $200 or so total (plus maybe attorney fees).
¹It is usually way too hard to determine actual monetary damages in cases like this, and actual damages are likely to be quite low anyway, so plaintiffs almost certainly will go for statutory damages.
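To make the arithmetic concrete, here is a rough Python sketch of the ranges described above. The function name `statutory_range` and its parameters are made up for illustration, and it is a simplification: it only computes the bounds from the figures quoted in this comment, while the actual amount within those bounds is up to the court.

```python
# Per-work statutory damages bounds, using the dollar figures quoted above.
DEFAULT_MIN = 750
DEFAULT_MAX = 30_000
INNOCENT_MIN = 200    # lower bound if the defendant proves innocent infringement
WILLFUL_MAX = 150_000  # upper bound if the plaintiff proves willful infringement


def statutory_range(works_infringed, innocent=False, willful=False):
    """Return (low, high) total statutory damages in dollars.

    The total scales with the number of works infringed,
    not with the number of copies distributed.
    """
    if innocent and willful:
        raise ValueError("an infringement cannot be both innocent and willful")
    low = INNOCENT_MIN if innocent else DEFAULT_MIN
    high = WILLFUL_MAX if willful else DEFAULT_MAX
    return works_infringed * low, works_infringed * high


# One work, innocent infringer: the floor drops to $200.
print(statutory_range(1, innocent=True))
```

Note that the number of copies distributed never appears as an input, which is the whole point: one work infringed a million times is still one work.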
> someone who did not know they were infringing and had no reason to believe that they were.
Can Microsoft say this? They explicitly chose not to include hidden repositories belonging to their paid customers, likely because they knew that those customers would sue them if proprietary code was used as training data.
Apple seems to have chosen not to allow GPL software in the App Store for very similar reasons. Their terms of service require a permission that is incompatible with the terms of the GPL, and knowing that GPL software tends to have multiple rights owners, Apple chose to go the route of not allowing GPL.
And lastly, authors have requested to have their works removed from the training data. It is part of the lawsuit. Can Microsoft then still claim that they did not know they were infringing?
The comment I was responding to was about the case where person X uploads code to GitHub, and that code contains code from person Y whose license to X does not give X permission to grant GitHub the rights that GitHub requires from the uploader, and so GitHub's use of Y's code is without copyright permission.
I believe GitHub would likely be seen as an innocent infringer in that case.
Would that still be the case if Microsoft knows that such infringement is likely to occur? Microsoft has been in the software industry for 50 years, has, like Apple, an app store, and has distributed software from millions of different rights owners. Can they argue in good faith that they had no idea that software often has multiple rights owners, and thus that a single person who uploads software to GitHub is unlikely to have sole copyright ownership?
I doubt Microsoft would make that argument. It is more likely they will argue fair use, but by not using closed repositories owned by paying customers, they seem to show that they themselves have doubts about the legal status of using other people's copyrighted work for Copilot.
> It is more likely they will argue fair use, but by not using closed repositories owned by paying customers, they seem to show that they themselves have doubts about the legal status of using other people's copyrighted work for Copilot.
Or they're worried about leaking secrets, which is a different matter entirely. The amount of copying needed to leak secrets is far lower than the amount needed to commit copyright infringement.
If Copilot is trained on Microsoft's code and accidentally regurgitates a comment, "// for 2024 Xbox", it has done one but not the other.
When Copilot was released, there were people who got it to print out accounts and passwords that had been put into the training data. Microsoft should at a minimum have sanitized the training data so it would not include such information. There is also likely personal information stored in some of those open repositories.
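For what it's worth, even a crude filter catches the most obvious cases. Below is a minimal Python sketch of the kind of sanitizing pass I mean. The helper names (`looks_like_secret`, `scrub`) and the two patterns are my own illustration, not anything GitHub or Microsoft is known to use, and a real pipeline would need far more than this.

```python
import re

# Illustrative patterns only: a quoted value assigned to a credential-like
# name, and strings shaped like AWS access key IDs ("AKIA" + 16 characters).
SECRET_PATTERNS = [
    re.compile(r"""(?i)\b(password|passwd|secret|api_key|token)\b\s*[:=]\s*["'][^"']+["']"""),
    re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
]


def looks_like_secret(line):
    """Return True if the line matches any of the illustrative patterns."""
    return any(pattern.search(line) for pattern in SECRET_PATTERNS)


def scrub(source):
    """Drop every line of a source file that looks like it holds a credential."""
    return "\n".join(
        line for line in source.splitlines() if not looks_like_secret(line)
    )


sample = 'x = 1\npassword = "hunter2"\ny = 2'
print(scrub(sample))
```

This would not catch personal information or anything unusually formatted, but it shows the bar for "at a minimum" is not very high.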
Copyright infringement doesn't have a fixed size. It depends on context and on what kind of information is copied. The leaked credentials demonstrate that Copilot has not actually learned how to code (as many people like to claim), but is simply an algorithm for copying code. If it had learned to code like a human, it wouldn't divulge secrets.