Your code is not in that thing. That thing has merely read your code and adjusted its own generative code.
It is not directly using your code any more than programmers are using print statements. A book can be copyrighted, the vocabulary of language cannot. A particular program can be copyrighted, but snippets of it cannot, especially when they are used in a different context.
> Your code is not in that thing. That thing has merely read your code and adjusted its own generative code.
This is kinda smug, because it overcomplicates things for no reason, and only serves as a faux technocentric strawman. It just muddies the waters for a sane discussion of the topic, which people can participate in without a CS degree.
The AI models of today are very simple to explain: it's a product built from code (already regulated, produced by the implementors) and source data (usually works that are protected by copyright and produced by other people). It would be a different product if it hadn't used the training data.
The fact that some outputs are similar enough to source data is circumstantial, and not important other than for small snippets. The elephant in the room is the act of using source data to produce the product, and whether the right to decide that lies with the (already copyright protected) creator or not. That's not something to dismiss.
It's not something to dismiss, but it is something that has already been addressed: Authors Guild v Google. Google Books is built upon scanning millions of books from libraries without first gaining permission from copyright holders, and this was found not to be a violation of copyright.
Building a product on top of copyright works that does not directly distribute those works is legal. More specifically, a computer consuming a copyright work is not a violation of copyright.
At the time the suit was launched, Google search would only display snippet views. That format, by its very nature, presents attribution to the user, enabling them to separately obtain a license for the content.
This would be more or less analogous to Copilot linking to lines in repositories. If Copilot was doing that, there wouldn't be much outrage.
The fact that they are producing the entire relevant snippet, without attribution and in a way that does not necessitate referencing the source corpus, suggests the transgression is different. It is further amplified by the fact that the output itself is typically integrated in other copyrighted works.
Attribution is irrelevant in Authors Guild; the books were not released under open source licenses where attribution is sufficient to meet the licensing terms. Google never sought or obtained licenses from any of the publishers, and the court ruled such a license was not needed, as Google's usage of the contents of the books (scanning them to build a product) did not represent a copyright infringement.
Attribution is mentioned in this filing because such attribution would be sufficient to meet the licensing terms for some of the alleged infringements.
It's an irrelevant discussion though: the suit does not make a claim that the training of Copilot was an infringement, which is where Authors Guild is a controlling precedent.
> Authors Guild v Google. Google Books is built upon scanning millions of books from libraries
I agree it's relevant precedent, but not exactly the same. Libraries are a public good and more importantly Google books references the original works. In short, I don't think that's the final word in all seemingly related cases.
> More specifically, a computer consuming a copyright work is not a violation of copyright.
I don't agree with this way of describing technology, as if humans weren't responsible for operating and designing the technology. Law is concerned with humans and their actions. If you create an autonomous scraper that takes copyrighted works and distributes them, you are (morally) responsible for the act of distributing them, even if you didn't "handle" them or even see them yourself.
Neither of the important aspects – remixing and automation – is novel, but the combination is. That's what we should focus on, instead of treating AI as some separate anthropomorphized entity.
Your disagreement and feelings about how copyright and the law should work are valid, but they have very little to do with how copyright is addressed judicially in the United States.
In that case, Google paid some hundreds of millions of dollars to companies and authors, created a registry that collects revenues and distributes them to rightsholders, provided an opt-out for already scanned books, etc. Hey, it doesn't sound that bad if the same thing were to happen with Copilot.
A) No it doesn't, there's nothing in the Copilot model or the plugin that represents or constitutes a reproduction of copyright code being distributed by GH/MS. The allegation is it generates code that constitutes a copyright violation. This distinction is not academic, it's significant, and represents an unexplored area of copyright law.
B) "parts of" copyright works are not themselves sufficient to constitute a copyright violation. The violation must be a substantial reproduction. While it's up to the court to determine if the alleged infringements demonstrated in the suit (I'm sure far more will be submitted if this case moves forward) meet this bar, from what I've seen none of them have.
Historically the bar is pretty high for software, hundreds or thousands of lines depending on use case. A purely mechanical description of an operation is not sufficient for copyright, you cannot copyright an implementation of a matrix transformation in isolation no matter what license you slap on the repo. Recall that the recent Google v Oracle case was litigated over tens of thousands of lines of code and found to be fair use because of the context of those lines.
I've yet to see a demonstrated case of Copilot generating code that is both non-transformative and represents a significant reproduction of the source work.
> The allegation is it generates code that constitutes a copyright violation.
The weights of Copilot very likely contain verbatim parts of the copyrighted code, just like a zip archive does. It chooses semi-randomly which parts to show, and sometimes breaks copyright by displaying large enough pieces.
Say you publish a song and copyright it. Then I record it and save it in the .xz format. It's not an MP3; it is not an audio file. Say I split it into N chunks and share them with N different people. Or with the same people, but at N different dates. Say I charge them $10 a month for doing that, and I don't pay you anything.
Am I violating your copyright? Am I entitled to do that?
To make it funnier: Say instead of the .xz, I "compress" it via π compression [1]. So what I share with you is a pair of π indices and data lengths for each of them, from which you can "reconstruct" the audio. Am I illegally violating your copyrights by sharing that?
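To make the thought experiment concrete, here is a toy sketch of the π-compression gag (the digit table and helper names are my own invention; real inputs would require astronomically deep digit searches, which is the joke):

```python
# Toy "pi compression": represent data as (offset, length) pointers into
# the digits of pi, then "decompress" by reading the digits back out.
PI_DIGITS = "141592653589793238462643383279502884197169399375105820974944"

def pi_compress(digit_string):
    """Locate digit_string inside pi's digits; the pointer is the 'archive'."""
    offset = PI_DIGITS.find(digit_string)
    if offset == -1:
        raise ValueError("not found in our tiny digit table")
    return offset, len(digit_string)

def pi_decompress(offset, length):
    """Reconstruct the original digits from the pointer alone."""
    return PI_DIGITS[offset:offset + length]

pointer = pi_compress("2643")              # -> (20, 4)
assert pi_decompress(*pointer) == "2643"
```

Nothing about the pair (20, 4) looks like the original data, yet it deterministically reproduces it. That is exactly the point of the thought experiment.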
I take your code and I compress it in a tar.gz file. I'll call that file "the model".
Then I ask an algorithm (Gzip) to infer some code using "the model".
The algorithm (gzip) just learned how to code by reading your code. It just happened to have it memorized in its model.
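The analogy is easy to make literal (using gzip rather than tar.gz for brevity; the snippet and variable names are illustrative):

```python
import gzip

# "Training": the model is just a compressed archive of the source.
your_code = b"def secret_sauce(x):\n    return x * 42\n" * 20
model = gzip.compress(your_code)

# The model is smaller than the source and contains no verbatim copy of it...
assert len(model) < len(your_code)
assert your_code not in model

# ..."inference" nevertheless reproduces the source exactly.
generated = gzip.decompress(model)
assert generated == your_code
```

The compressed bytes bear no byte-for-byte resemblance to the source, yet the "inferred" output is identical to it.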
With the exception that there are infinite types of chords in this case. And even though many musicians follow familiar chord structures, the underlying melodies and rhythms are unique enough for any familiar listener to differentiate the Red Hot Chili Peppers from the All-American Rejects. Yet now there is a system where the All-American Rejects hit a few buttons and a song is generated (using audio samples of "Under the Bridge") that sounds like "Under the Bridge pt 2, All-American Rejects Boogaloo".
That's why it's actionable and why there is meat on the bone for this case. The real issue is going to be whether they can convince a jury that this software is just stealing code, and whether it's wrong if a robot does it.
Google doesn't sell its search feature as a product that you can just plagiarize the results from and they're yours. Microsoft does that with Copilot.
Copilot is as much of a search engine as Stable Diffusion or DALL-E are, which is to say they aren't at all. If you want to compare it to a search engine, despite it being a tortured metaphor, the most apt comparison is not to Google, but to The Pirate Bay if TPB stored all of their copyrighted content and served it up themselves.
With Copilot it's your responsibility not to use it as a search engine to copy-paste code. It's completely obvious when it's being used as a search engine so it's not a problem at all.
Stable Diffusion works on completely different principles, and it can't exactly replicate pixels from its training data.
Ok, cool. Presumably that is because it’s smart enough to know that there is only one (public) solution to the constraints you set (like asking it to reproduce licensed code).
Now, while you may be able to get it to reproduce one function, reproducing one file, and certainly a whole repository, seems extremely unlikely.
Just to be clear: I cannot prove that they have used my code, but for the sake of argument, let's assume so.
They would have directly used my code when they trained the thing. I see it as an equivalent of creating a zip-file. My code is not directly in the zip file either. Only by the act of un-zipping does it come back, which requires a sequence of math-steps.
But there is no equivalent of "unzipping" for Copilot.
This is a generative neural network. It doesn't contain a copy of your code; it contains weightings that were slightly adjusted by your code. Getting it to output a literal copy is only possible in two cases:
- If your code solves a problem that can only be solved in a single way, for a given coding style / quality level. The AI will usually produce the same result, given the same input, and it's going to be an attempt at a solution. This isn't copyright violation.
- If 'your' code has actually already been replicated hundreds of times over, such that the AI was over-trained on it. In that case it's a copyright violation... but how come you never went after the hundreds of other violations?
There is no guarantee that an ML network only reproduces the input data under those two conditions. But even for the second case:
> If 'your' code has actually already been replicated hundreds of times over, such that the AI was over-trained on it. In that case it's a copyright violation... but how come you never went after the hundreds of other violations?
Replication is not a violation if the terms of the license are followed. Many open source projects are replicated hundreds of times with no license violation - that doesn't mean that you can now ignore the license.
But even if they did violate the license, that doesn't give you the right to do it too. There is no requirement to enforce copyright consistently - see e.g. mods for games which are more often than not redistributing copyrighted content and derivatives of it but usually don't run into trouble because they benefit the copyright owner. But try to make your own game based on that same content and the original publisher will not handle it in the same way as those mods. Same for OSS licenses: The original author does not lose any rights to sue you if they have ignored technical license violations by others when those uses are acceptable to the original author.
Neural nets can and do encode and compress the information they're trained on, and can regurgitate it given the right inputs. It is very likely that someone's code is in that neural net, encoded/compressed/however you want to look at it, which Copilot doesn't have a license to distribute.
You can easily see this happen, the regurgitation of training data, in an over fitted neural net.
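You don't even need a neural net to see the effect. A character-level n-gram model (the simplest possible generative model; the training snippet and names here are hypothetical) that is "over fitted" on a small training set memorizes it completely, and greedy sampling regurgitates it verbatim:

```python
from collections import defaultdict

# "Train" an order-4 character model on a single snippet. Because every
# 4-character context in such a tiny corpus has exactly one continuation,
# greedy generation reproduces the training data character for character.
training_code = "for i in range(10):\n    print(i * i)\n"

ORDER = 4
model = defaultdict(list)
for i in range(len(training_code) - ORDER):
    context = training_code[i:i + ORDER]
    model[context].append(training_code[i + ORDER])

def generate(prompt, length):
    """Greedy sampling: always take the first recorded continuation."""
    out = prompt
    while len(out) < length and out[-ORDER:] in model:
        out += model[out[-ORDER:]][0]
    return out

# Prompt with the start of the snippet; the model completes it exactly.
assert generate("for ", len(training_code)) == training_code
```

Real networks are lossier than this toy, but the failure mode is the same: data that appears often enough (or a model large enough relative to its corpus) produces verbatim recall rather than generalization.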
This is not necessarily true; the function space defined by the hidden layers might not contain an exact duplicate of the original training input for all (or even most) of the training inputs. Things that are very well represented in the training data probably have a point in the function space that is "lossy compression" level close to the original training image, though, not so much in terms of fidelity as in changes to minor details.
When I say encoded or compressed, I do not mean verbatim copies. That can happen, but I wouldn't say it's likely for every piece of training data Copilot was trained on.
Pieces of that data are encoded/compressed/transformed, and given the right incantation, a neural net can put them together to produce a piece of code that is substantially the same as the code it was trained on. Obviously not for every piece of code it was trained on, but there's enough to see this effect in action.
> which Copilot doesn't have a license to distribute
when you upload code to a public repository on github.com, you necessarily grant GitHub the right to host that code and serve it to other users. the methods used for serving are not specified. This is above and beyond the license specified by the license you choose for your own code.
you also necessarily grant other GitHub users the right to view this code, if the code is in a public repository.
Host that code. Serve that code to other users. It does not grant the right to create derivative works of that code outside the purview of the code's license. That would be a non-starter in practice; see every repository with GPL code not written by the repository creator.
Whether the results of these programs is somehow Not A Derivative Work is the question at hand here, not "sharing". I think (and I hope) that the answer to that question won't go the way the AI folks want it to go; the amount of circumlocution needed to excuse that the not actually thinking and perceiving program is deriving data changes from its copyright-protected inputs is a tell that the folks pushing it know it's silly.
Actually, The Pirate Bay was even less of an infringement, as they did not distribute the copyrighted content or derivatives themselves, only indexed where it could be found. With Copilot, all the content you're getting goes through Microsoft.
> We need the legal right to do things like host Your Content, publish it, and share it. You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.
> This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service, except that as part of the right to archive Your Content, GitHub may permit our partners to store and archive Your Content in public repositories in connection with the GitHub Arctic Code Vault and GitHub Archive Program.
It's served under the terms of my licenses when viewed on GitHub. Both attribution and licenses are shared.
This is like saying GitHub is free to do whatever they want with copyrighted code that's uploaded to their servers, even use it for profit while violating its licenses. According to this logic, Microsoft can distribute software products based on GPL code to users without making the source available to them in violation of the terms of the GPL. Given that Linux is hosted on GitHub, this logic would say that Microsoft is free to base their next version of Windows on Linux without adhering to the GPL and making their source code available to users, which is clearly a violation of the GPL. Copilot doing the same is no different.
> It is not directly using your code any more than programmers are using print statements. A book can be copyrighted, the vocabulary of language cannot. A particular program can be copyrighted, but snippets of it cannot, especially when they are used in a different context.
So what? Why shouldn't we update the rules of copyright to catch up to advances in technology?
Prior to the invention of the printing press, we didn't have copyright law. Nobody could stop you from taking any book you liked, and paying a scribe to reproduce it, word for word, over and over again. You could then lend, gift, or sell those copies.
The printing press introduced nothing novel to this process! It simply increased the rate at which ink could be put to pages. And yet, in response to its invention, copyright law was created, that banned the most obvious and simple application of this new technology.
I think it's entirely reasonable for copyright law to be updated, to ban the most obvious and simple application of this new technology, both for generating images, and code.
> Your code is not in that thing. That thing has merely read your code and adjusted its own generative code.
Completely incorrect. False dichotomy. It's widely known that AI can and does memorize things just like humans do. Memorization isn't a defense to violating copyright, and calling memorization "adjusting a generative model" doesn't make it stop being memorization.
If you memorized Microsoft's code in your brain while working there and exfiltrated it, the fact that it passed through your brain wouldn't be a defense. Substituting "generative model" for "brain" and the fact that it's a tool used by third parties doesn't change this.
It is essentially a weighted sum of your code and other copyright holders' code. Do not let the mystique of AI fool you. Copilot does not learn, it glues.
If I read JRR Tolkien and then go and write a fantasy novel following an unexpected hero on his dangerous quest to undo evil, I haven't infringed, even if I use some of Tolkien's better turns of phrase.
> It is not directly using your code any more than programmers are using print statements. A book can be copyrighted, the vocabulary of language cannot. A particular program can be copyrighted, but snippets of it cannot, especially when they are used in a different context.
And that is why this lawsuit is dead on arrival.