The legal debate around copyright infringement has always centered on the rights granted by the owner versus the rights appropriated by the user, with the owner's wants superseding the user's needs/wants. Any open-source code available on GitHub is governed by the copyright notice of the owner, which grants specific rights to users. Copilot is a commercial product; therefore, GitHub can only use code that the owners make available for commercial use. Every other instance of code used is copyright infringement, a clear case by Microsoft's own definition of copyright infringement.
Github (and by extension Microsoft) is gambling on the fact that their license agreement granting them a license to the code in exchange for access to the platform supersedes the individual copyright notices attached to each repo. This is a fine line to walk and will likely not survive in a court of law. They are betting on deep lawyer pockets to see them through this, but are more likely than not to lose this battle. I suspect we will see how this plays out in the coming months.
Or maybe it is, but if so it essentially means the end of licensing because it would be trivial to make an AI that can take an input and produce the same output. Or maybe even cp is good enough to strip the source of its original license in that case.
Open source licenses are worth protecting or you break the cycle that helps more software be open.
The test for non-literal copyright infringement is "substantial similarity." If, after filtering out irrelevant and non-copyrightable elements, the allegedly-infringing work is substantially the same as the original work, then it infringes. If it infringes, then two common defenses are independent creation and fair use.
In your hypothetical, the AI-generated work would infringe the original because you stated it would be substantially the same as the copyrighted work. You can't claim independent creation because the algorithm was dependent on the original work and you controlled the output of the algorithm to be exactly like the original work. Fair use is pretty much a non-starter, so I'll skip that analysis.
So, no, you couldn't use an AI to launder copyrighted works into the public domain.
If you are using it to mix a snippet of code (from a sufficiently large code base) into a large code base of your own, then you are just remixing. That is not infringement. In music, there are entire genres based on remixing. You could even take it a step further and ask yourself: what is not a remix?
The point is that songwriters absolutely get litigious over reproductions of small portions of their IP.
What's more surprising is to see copyleft advocates positioned so strongly in favour of giving copyright that kind of reach. I think that in a different context, some of the cases you refer to would be used by these same copyleft supporters as examples of why copyright needs to be more weakly enforced, not more strongly.
You have to filter out any non-copyrightable elements before you do the substantial similarity analysis. For code, that means removing non-expressive elements like arithmetic or boolean expressions, looping, recursion, conditionals, etc. APIs are not copyrightable under the recent Supreme Court holding in Google v. Oracle.
How much of your code is actually left after filtration?
We could remove the non-copyrightable parts of text works too. Just take out all the basic building blocks of language like verbs, nouns, stop words...
The tests exist as a way of determining if someone did the action of copying which is the important thing at the core of it. And in this case the facts aren’t really in dispute, it’s whether what GH is doing counts as copying.
Suppose you had a really, really good memory, remembered almost exactly how your former company implemented something, and, when faced with a similar problem, unknowingly produced similar code. It's iffy whether this counts as copying. It really probably isn't, since you're allowed to learn from copyrighted works, but the courts aren't omniscient, and when presented with the code they may well rule that copying was more likely than not.
But we are omniscient in this case so we don’t really need the tests. Is what GH does more like copying or learning? This isn’t something that can be determined purely from the output of the tool.
But perhaps there could be a way to make something that automatically converts these "non-expressive elements" into copyrightable elements?
Output would probably be maddening to figure out though.
The AI part isn't about independent creation but about figuring out what to copy.
Yes, this is what is pretty interesting to me. I said in a previous comment that I have a really good OS-generating AI. It asks you your favorite color and outputs a disk image you can use as an installer.
Right now it just happens to output a cracked version of Windows if you answer "blue". Who can know how that happened? It's a black box after all. Seems useful though, since Microsoft is loudly saying that if I distributed this it would have no license problems at all.
If, when you crack open Copilot, it's determined that it's not actually learning and boils down to storing and regurgitating snippets of code, then no matter how few or how many layers of indirection there are, it's still infringement.
What your AI actually does underneath all the indirection is what's important.
Some would say that pretty much _all_ learning involves "storing" and "regurgitating".
Aside: my daughter started writing out her name at kindergarten a couple of years ago. One of the staff seemed a bit dismissive of this, and claimed what was happening wasn't really "writing" it was "merely memorising the shapes of letters and then reproducing them in the right order". <rolls eyes>
I didn't bother to argue...
We know what it's doing, it's doing very well understood statistical inference techniques to derive outputs.
i = i + 1
That's not an existing about-me page. You can go to davidcelis' website and verify that it's completely different.
Copilot just picked a random person and linked to their social media accounts. You can search for any long quote within that about-me page on Google and not find a match; it is unique.
The only two examples of it generating large sections of copyrighted work are the Quake floating-point hack and the Zen of Python. Both are commonly known, copied, and talked about, to the point that they have Wikipedia pages.
In fact, that's a crucial selling point for the product.
I think the big question is: if Copilot ends up copying significant portions of a GPL work, not just tiny snippets, is the resulting work infringing, and if so, who is liable?
I have almost no sense of how often code is infringed currently or how often anyone does anything about it. I have a feeling that we live in a world with constant infringement, where basically no one cares and no one does anything about it. And I would assume the status quo will maintain its current course with this new tool. But again, I'm giving zero factual evidence; it's just a feeling from not seeing or hearing almost any news about open-source code infringement.
But there is also caselaw (involving George Harrison IIRC) on "unconscious copying", where having heard a piece is suggestive that it was not an independent creation, despite not being deliberately copied. So, training on a corpus that includes a specific piece is arguably a case of that.
There's an interesting question of whether a model is just a sophisticated statistical compression of a corpus, or whether it is a thing in itself. I would say, if it finds patterns that are disproportionately simpler than the corpus, it has found "something".
But another view is of creation as involving a side-channel or one-time pad, in that music is created by a human and heard by a human, who have common information that has never been present in music before (e.g. specific aspects of common neurophysiology, auditory anatomy, exact heartbeat waveform, new sounds/rhythms in the world, new speech patterns, associations between existing melodic fragments and words/emotions/visuals/status etc).
In this sense, truly new music is discovery of the Human Music Processing System, which ultimately involves the whole human and their social and physical experience.
If it were intelligent, it'd be given the Lagrange specifications, and then I'd be able to say: "Write me an open-world video game based around gang culture, and I'd like it to run on the Raspberry Pi Zero."
Valid point - most hilariously poor example to back that point up in recent memory.
I.e., if you buy a piano or a guitar, you could play and record copyrighted music on it. That's not the piano's or the guitar's fault, though; it's yours.
Suppose a model was trained solely on a single Beatles album. It could only spit out that album. That would be clear infringement, wouldn't it?
No. It’s not. A Markov chain has some very specific properties that are absolutely not fulfilled by GPT-3 models.
Just say “stochastic” if you want a buzzword. Stop appropriating Markov chains.
A Markov chain is a model where: (1) there is a state, (2) the probabilities of the next state depend only on the current state.
That could describe anything from a wet fart to a deterministic computer to GPT-3 to the human mind to quantum mechanics (the real one, not a simulation).
If I unplug the Internet and all USB devices, my computer is a Markov chain with at least several trillion bits of state, so 2^(several trillion) possible states. And there is one next state, which has a probability of 1, and all other possible states have a probability of 0. That's a Markov chain.
GPT models choose the next word probabilistically, with the probabilities chosen by feeding the previous N words into a neural network. That sounds like a Markov chain to me!
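To make that concrete, here's a toy order-n word model where the "state" is the window of the last n words, so the next word depends only on the current state. This is just a minimal illustration of the textbook Markov chain definition, not a claim about how GPT-3 is implemented; the tiny corpus is made up.

```python
import random
from collections import defaultdict

def build_chain(words, n=2):
    """Build an order-n Markov chain: the 'state' is the last n words,
    and transition probabilities come from counting the corpus."""
    chain = defaultdict(list)
    for i in range(len(words) - n):
        state = tuple(words[i:i + n])
        chain[state].append(words[i + n])
    return chain

def sample(chain, state, length, seed=0):
    """Generate text: each next word depends only on the current state."""
    rng = random.Random(seed)
    out = list(state)
    for _ in range(length):
        candidates = chain.get(tuple(out[-len(state):]))
        if not candidates:
            break
        out.append(rng.choice(candidates))
    return out

corpus = "the cat sat on the mat and the cat ran".split()
chain = build_chain(corpus, n=2)
print(sample(chain, ("the", "cat"), 3))
```

Under this framing, widening the window to the full context (as GPT-3 does) just means an astronomically larger state space, which is the crux of the disagreement above.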
What properties does a gpt-3 model have that a Markov chain doesn’t? (Other than effectiveness.)
GPT-3 is conditioned on the entire input sequence as well as its own output, which is strictly NON-MARKOVIAN. In fact, the point in saying something is Markovian is exactly that: the state transition probability only depends on the current state.
"Given a prompt, provide a completion" is what a Markov chain does. GPT-3 is exactly the same, in the sense that apples and oranges both satisfy your hunger.
Edit - googling, the history of player pianos vs copyright is interesting
The former for making and selling it, the latter for buying and using it.
Just like counterfeit goods.
This analogy train went too far, don't you think? All the examples that I've seen on Twitter require quite intentional manipulation by a human for Copilot to produce something copyrighted. It does not recite Linux code at the press of a single key.
Surely a judge presented with the "complex series of button pushes," otherwise known as playing an instrument, would hold the player accountable for any infringement and not the piano?
These analogies have gone so far off the rails that I can't tell which side this thread is arguing for by now ;)
One end is GitHub's, at the input: Copilot's "database" was initialized from code that GitHub does not have copyright to. The contention at this end is that they are ignoring the licenses that would grant them the right to use that code.* The article, GitHub, and others assert that there's no copyright issue for creating a database of this kind (a machine learning model).
The other end is the developer taking Copilot's output. The article seems to take the (absurd, IMO) position that there are also no copyright implications here, because the output is not copyrightable at all.
*And personally this is the side that concerns me most.
IANAL, but this doesn't sound quite right. There is a difference between "using" code (running it in a commercial product) and manipulating it as arbitrary data within a commercial product.
It definitely can be a gray area, but let's say I use Amazon's service where I email a PDF to my Kindle: is it Amazon's responsibility to know the copyright status of the PDF, or mine? In both cases a commercial product is manipulating copyrighted data for the benefit of a user.
I'll give the best example: the one task, off the top of my head, that I would like some AI help with.
I would really like to replicate the functionality of Java's SSLEngine, but for C#.
If I used Co-Pilot to help, at best, I would need to pay for a legal team to do some form of 'clean room' review of whatever was generated to make sure it did not infringe on the OpenJDK code that is out there. At worst, I would be having to defend myself from Oracle's legal team -anyway-.
And yeah, I'm assuming in this case that Copilot would be 'smart' enough to make the right inferences from that Java code and put it into a workable C# construct. Stepping back, though, one could still ask the question: what's the risk of a Java developer accidentally matching some OpenJDK code a little too closely? There's an order-of-magnitude difference between even a smaller AGPL developer and Oracle.
If Microsoft/GH was willing to go to bat and agree to pay for the defense of users of Copilot, I would be far less concerned with the implications of all of this.
It would be extremely interesting to know how much accidental and non-accidental code infringement happens, and what proportion of those cases go to court. I would guess that both kinds happen utterly constantly, and that legal action is taken in only a tiny minority of cases. If that's the case right now, then nothing has changed with this tool except the possibility of playing hot potato with liability in those few cases that do make it to court. Even if the developer actually wrote the code that infringed, Copilot could make a useful scapegoat, and every case will have plausible deniability if Copilot lacks really good explainability.
Out of pure curiosity, and please do take it as a candid question; do you intend to mean that "I don’t know how what I used works" is a good defence?
And the answer as always is "it depends".
Whether the GPL2 will hold up in court, or whether the courts will uphold this specific case (e.g. can you prove intent? Do you need to?), is a separate issue entirely.
The next question is, can I use GPL'ed code in my product and then claim that it was injected by Copilot to avoid repercussions of my actions if caught?
Not necessarily. Even 11k lines of copied code might fall under fair use, as Oracle recently discovered ;)
But why would copilot do this? It's a language model not a database.
This is incorrect. First of all, GitHub isn't even the people building the model. It's built by OpenAI, which has none of these licenses. Secondly, the model is not built purely from GitHub data. OpenAI is relying on fair use, not on a specific license.
You seem to be confusing what you'd like the law to be with what the law is.
Effect of the use upon the potential market for or value of the copyrighted work: Here, courts review whether, and to what extent, the unlicensed use harms the existing or future market for the copyright owner’s original work. In assessing this factor, courts consider whether the use is hurting the current market for the original work (for example, by displacing sales of the original) and/or whether the use could cause substantial harm if it were to become widespread.
Particularly in the context of the occasional, unintentional reproduction of short snippets that likely need adaptation to the rest of the code they are inserted into, I suspect courts are unlikely to find more than de minimis, unactionable, infringement.
Even if the code isn’t being copied verbatim, it feels like the spirit of these licenses is being violated, although I don’t know if that’s enough to get anywhere in court. But if the code is in fact being copied (like that Quake example) then the license is definitely being violated.
But I feel like there’s too much analysis in these comments of whether a current law is being broken, and not enough thought about what will happen if licenses like the GPL can no longer keep intellectual property free. Open source licenses are part of the foundation of this community, and we’ll be much worse off without them. We really need a way to prevent this kind of IP laundering, and if current laws won’t do it, then we need new ones.
I mean, in this specific example, can't I though? Yes, the first-sale doctrine applies to certain kinds of works, but not every work, and specifically not software. I absolutely can grant you a single non-transferable license to use my software.
and by the GitHub ToS:
> You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.
I would consider Copilot to not be part of ”the Service”, but at least currently the definition of ”the Service” is so vague as to include anything that Github does.
Maybe they consider Copilot to be a ”search index” and the suggestions ”[sharing] [Your Content] with other users”.
Since, as I understand it, it will require separate payment.
The ToS was last edited 2020-11-16, and does not contain the word ”Copilot”.
Although I'm curious about what GitHub would do if the original author asked them to remove the work from Copilot. Retrain from scratch every month or so, to remove last month's DMCAed content?
The person who has the account on GitHub and uploads code rarely owns the copyright on all of that code, and therefore doesn't have the right to delegate any further licensing permission to GitHub.
Furthermore, as described in the article, the legal precedent has been that you don’t actually need copyright to something to train a model on it. You may think that’s silly or inconsistent, but that’s how the legal precedent is.
That's not how it works. Anyone and everyone who distributes it is infringing and carries risk of enforcement action. That could also be someone further downstream.
> Furthermore, as described in the article, the legal precedent has been that you don’t actually need copyright to something to train a model on it.
I'm not commenting on this aspect.
“The DMCA “safe harbors” protect service providers from monetary liability based on the allegedly infringing activities of third parties.”
If they didn't, GitHub itself would be violating copyright every time someone browsed the repo.
And copilot appears to be a part of GitHub.
So why wouldn't copilot itself be covered by that license?
(Certainly people using copilot would not. Let the user beware.)
Edit: downvoted to death but the top reply shows that it's true. An inconvenient truth, I suppose.
> 4. License Grant to Us
> This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service
> If you set your pages and repositories to be viewed publicly, you grant each User of GitHub a nonexclusive, worldwide license to use, display, and perform Your Content through the GitHub Service and to reproduce Your Content solely on GitHub as permitted through GitHub's functionality (for example, through forking).
> You may grant further rights if you adopt a license.
> A. Definitions
> The “Service” refers to the applications, software, products, and services provided by GitHub, including any Beta Previews.
I guess the next step is GitHub requiring a cell phone number to sign up, to verify who you are and secure the rights to profit.
Couple that with the fact that, presumably, at some point in the future Copilot will come attached to a subscription model (otherwise why do it in the first place?), and we have the makings of a product that is commercially infringing on copyright left, right, and center.
Edit: Sorry downvoters, whether you like it or not, you don't understand the terminology. You're confusing reproduction with redistribution.
Using the terms as you explained them below, I meant that Microsoft/GitHub has permission to reproduce the code so why wouldn't that extend to copilot?
The use of licensed code in other projects must be done under the terms of that license or you aren't legally (under copyright law) allowed to use the code.
The GitHub TOS is a license that is separate from the license in the code. It is legal and common for an author to license the same code multiple ways (and the licenses do not have to agree with each other). By agreeing to GitHub’s TOS and uploading code to GH servers, people are licensing GH to display the code, because the license agreement says so explicitly. This could be problematic if someone uploads code they don’t have the rights to upload, but then the violation is the uploader’s, and not GitHub’s.
Additionally, GH has a provision for already licensed code in section D.6:
“6. Contributions Under Repository License
“Whenever you add Content to a repository containing notice of a license, you license that Content under the same terms, and you agree that you have the right to license that Content under those terms. If you have a separate agreement to license that Content under different terms, such as a contributor license agreement, that agreement will supersede.
“Isn't this just how it works already? Yep. This is widely accepted as the norm in the open-source community; it's commonly referred to by the shorthand "inbound=outbound". We're just making it explicit.”
For the latter you would need redistribution as it is going into a different product, for which you claim ownership, and with possible modifications/adaptations (this would depend on the rights granted by the license). Nowhere on Github's TOS is the word or concept of redistribution referenced.
So, the answer to your original question is "no".
Edit: leereeves modified their comment after I wrote this, so it may not make much sense, but you can figure out the point. Best!
(Edit and BTW GH calls out their ‘distribution’ in section D.4 of their TOS explicitly, but without using the word “distribute”. They say you grant them the right to “publish” and “share” code you upload, which means “distribute” under copyright law. They also imply that by spelling out the terms under which they do not “distribute”, which is anytime the content is used outside of GitHub’s services.)
I don’t think you’re correct that the term “redistribution” means either going into another product, nor that it implies a claim of ownership. Putting works into another product is sometimes known as making a derivative work, while “redistributing” is quite commonly used to mean copy-and-distribute as-is. Redistribution can happen via license as well, it requires permission by the copyright owner, but does not imply the redistributor is (or is claiming to be) the copyright owner.
You didn't see the original question, it was edited, so we cannot discuss that further.
"[...] which means “distribute” under copyright law" <-- Citation needed please, because I don't think that's correct.
From the site you linked:
"Distribute copies or phonorecords of the work to the public by sale or other transfer of ownership or by rental, lease, or lending."
What I seem to grasp about the difference between reproducing and redistributing is that it has to do with the concept of "transfer of ownership". Also, derivative works and redistribution are not mutually exclusive.
The moment you create a new thing and start distributing it (even if you do not modify it), you become the de facto owner of that new product, and copyright law is trying to limit the extent of the rights that apply there. So, in the case of music, it's a different thing to play (reproduce) a song than to create a new album with your favorite artists that happens to include that particular song (redistribution).
> What I seem to grasp about the difference between reproducing and redistributing is that it has to do with the concept of "transfer of ownership". Also, derivative works and redistribution are not mutually exclusive.
What you've misunderstood is it is the copies that are sold, not the copyrights.
> create a new album with your favorite artists that happens to include that particular song (redistribution).
This is not what redistribution means. You seem confused about this word.
Sorry, I'm not following you anymore. I don't even know what you mean by that sentence.
>This is not what redistribution means. You seem confused about this word.
But, that's exactly what redistribution entails ...
The transfer of ownership you referred to is a transfer of ownership of a copy, it is not a transfer of ownership of the original work itself. You misunderstood the passage you quoted to mean that redistribution is transferring ownership of the work itself, as in copyright ownership of the work. But the text you quoted is only talking about transferring ownership of the copies. The text you chose makes more sense in the context of physical copies of books or "phonorecords".
Copyright is meant to protect original/authentic/unprecedented expressions, disregarding the medium where they may exist. So I don't really get your point in trying to make distinction between a copy or a "master"(?) or whatever.
What's at stake is the originality of the expression and what kind of rights does somebody else (i.e. everyone but the creator) have (or not!) over it.
Can I make copies of this original expression? (y/n)
Can I use this into a new product of my own? (y/n)
Whether something is already a copy or not does not really change the extent of the rights that you have (unless it's explicitly stated in the license, of course).
I can see that, and I mean no disrespect, but you shouldn’t have attempted to comment on this topic authoritatively and police what others say without understanding it.
> I don’t really get your point in trying to make distinction between a copy or a “master”(?) or whatever.
You have clearly and repeatedly demonstrated that you don’t understand what “distribute” and “redistribution” means to copyright law. You claimed others were confused about it and that using the word “redistribute” was incorrect, when in fact it was fine and correct.
I’m trying to help you understand that redistribution is a term that is talking about what happens to copies of a work. The sentence you quoted, and the “transfer of ownership” that you said you grasp only have to do with transferring copies, and nothing else.
The main point here is that when GitHub shows you code, it is transferring a copy to you. That is what GitHub calls “publish” and what copyright law calls “publication”, and by publishing they mean redistribution (because the copyright legal code says so).
No, it isn't. You're wrong. Redistribution, or just distribution, in copyright law is plainly and simply making copies of a work available to other people. It does not mean anything more than that, and it does not transfer ownership of anything other than the copy you distribute.
It seems like you're confused; GitHub's terms require users to grant both of those. Copyright law also covers both.
Last time I checked (about an hour ago), that wasn't true. Feel free to provide evidence to support your argument.
"publish" and "share" mean redistribution. "Store" and "copy" mean reproduce.
No. That's something you believe, but it's not necessarily true.
Check here, https://copyrightalliance.org/faqs/what-rights-copyright-own...
Again, distribution has to do with a transfer of ownership. In layman's terms, GitHub can show your code to others, but it cannot give (as in ownership) your code to them. It's a bit tricky here, since on the web showing something literally means making a copy at some point, but try to view things in the light of "who owns what" and it's a bit easier to grasp.
If you browse through someone's repository, it's pretty clear who the owner of that code is; if a program gives you a chunk of code that it "got from somewhere", there's definitely some sort of change-of-ownership operation going on. Which in this case is interesting, as the code went from being attributed to someone to missing/unknown.
You're mixing sub-threads here, but you're still confused. Distribution is a transfer of ownership of a copy, it does not grant copyrights or ownership of the work. You can buy a book that was distributed, and that does not give you the right to make copies of the book.
> Github can show your code to others but it cannot give (as in ownership) your code to them.
In the digital world, showing is "distributing", and copyright law is clear about this.
You should perhaps read the definitions that are in the copyright law itself, and try to understand them:
"“Publication” is the distribution of copies or phonorecords of a work to the public by sale or other transfer of ownership, or by rental, lease, or lending. The offering to distribute copies or phonorecords to a group of persons for purposes of further distribution, public performance, or public display, constitutes publication. A public performance or display of a work does not of itself constitute publication.

To perform or display a work “publicly” means—

(1) to perform or display it at a place open to the public or at any place where a substantial number of persons outside of a normal circle of a family and its social acquaintances is gathered; or

(2) to transmit or otherwise communicate a performance or display of the work to a place specified by clause (1) or to the public, by means of any device or process, whether the members of the public capable of receiving the performance or display receive it in the same place or in separate places and at the same time or at different times.
>In the digital world, showing is "distributing" [...]
I guess it has to do with how copyright law adapts to the specific circumstances of this particular case. I guess we won't get an answer until a judge justifies some sort of resolution on either side.
My take is that:
* GH showing you some source code on their website is akin to reproduction; even though, of course, a binary copy of the code was made and was transmitted to your local browser in order to be displayed.
* GH taking chunks of code from here and there, and making them available into a new product from which they claim ownership (or the final user, or whatever) is more akin to the physical concept of redistribution.
But we'll have to wait and see.
This is clearly and unambiguously defined as “publication” in the copyright law, where “publication” is defined as distributing copies. (And GitHub’s TOS also calls showing you code “publish”).
There is nothing to wait for, and the law and many court cases have already established clear definitions and precedent on these terms. You just got stuck on the wrong idea, it happens, it’s okay, but if you are curious about copyright and interested in discussing it here, it will certainly help to improve your understanding of the terminology.
> GH taking chunks of code from here and there […] is more akin to the physical concept of redistribution
No, this is still just wrong. You’re talking about derivative works, which is also defined in the copyright legal code. There is no such physical concept of mixing and matching that is called “redistribution” in legal terms. I’m not sure where that idea came from, it might make sense to you or in some narrow contexts, but generally speaking and specifically wrt copyright law, distribution has nothing to do with whether you sample a work nor whether you make a new work out of old works.
The “Terms” link on the copilot page goes directly to GitHub’s TOS, so yes the terms are one and the same.
This question is interesting and I’ll try to help turn the downvotes around, but it might be too late. Anyway, when users agree to allow their code to be “published” by GitHub, they are allowing it to be both copied and distributed. The TOS also says (note the indexing/analysis comment) “This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.”
The part where GitHub might have trouble (I speculate) is that their TOS doesn’t discuss derivative works, and the input code to copilot could have licensing terms on derivative works that get scrubbed out by copilot. OTOH, if copilot were to guarantee that a chunk of code never resembled one of the original inputs it may be legal to create derivative works from samples under fair use.
1. Any code generated by Copilot is likely to be AGPL.
2. Since the authors of Copilot used the Copilot beta to make the Copilot release, Copilot is very likely using AGPL-licensed code and is therefore in breach of the AGPL license.
So yep, the article looks flawed.
> In a few cases, Copilot also reproduces short snippets from the training datasets, according to GitHub’s FAQ.
> This line of reasoning is dangerous in two respects: On the one hand, it suggests that even reproducing the smallest excerpts of protected works constitutes copyright infringement. This is not the case. Such use is only relevant under copyright law if the excerpt used is in turn original and unique enough to reach the threshold of originality.
That analysis may have been reasonable when the post was first written, but subsequent examples seem to show Copilot reproducing far more than the "smallest excerpts" of existing code. For example, the excerpt from the Quake source code appears to easily meet the standard of originality.
It would be quite straightforward to write an additional filter that would check the generated code against the training corpus to exclude exact copies.
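One way such a filter could work, as a minimal sketch: index every k-line window of the training corpus, then reject generated code containing an exact match. The corpus lines and window size here are invented for illustration, and a real filter would also need to normalize whitespace, identifiers, and so on.

```python
# Sketch of an exact-copy filter: index every K-line window of the
# training corpus, then flag generated code containing any window.
K = 3

def windows(lines, k=K):
    """Return the set of all k-line windows in a list of lines."""
    return {"\n".join(lines[i:i + k]) for i in range(len(lines) - k + 1)}

# Tiny stand-in "training corpus" (invented for illustration).
corpus_index = windows([
    "i = 0x5f3759df - (i >> 1);",
    "y = *(float *) &i;",
    "y = y * (1.5F - (x2 * y * y));",
    "return y;",
])

def is_verbatim_copy(generated):
    """True if any K-line window of the output appears in the corpus."""
    return bool(windows(generated.splitlines()) & corpus_index)
```

Whether such a filter is "straightforward" at GitHub's scale is another matter; exact matching misses trivially altered copies, which is part of the laundering concern discussed elsewhere in the thread.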
This is a code search engine with the ability to integrate search results into your language syntax and program structure. The database is just stored in the neural network.
It’s definitely an impressive and interesting project with useful applications, but it’s not an excuse to violate people’s rights.
But it's also not exactly just a database. It contains contextual relationships as seen with things like GPT that are beyond what a typical database implementation would be capable of.
You mean in the same way that google.com isn't "just a database"?
If Copilot isn't intelligent, then what makes it more special than a search engine? How is Copilot not just Limewire but for code?
I could understand the argument that, if Copilot really is intelligent or sentient or something like that, then what it is producing is as original as what a human can produce (although, humans still have to respect copyright laws). However, I haven't seen anyone even attempt to make a serious argument like that.
* AI searches for code in its neural-net-encoded database using your search terms (ex: "fast inverse square root")
* AI parses and generates AST from the snippet it found
* AI parses and generates AST from your existing codebase
* AI merges the ASTs in a way that compiles (it inserts snippet at your cursor, renames variables/function/class names to match existing ones in your program, etc)
* AI converts AST back into source code
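The merge step in that hypothetical could be sketched with Python's `ast` module. To be clear, this illustrates the hypothetical only, not how Copilot actually works; the snippet and rename map are invented:

```python
import ast

# Hypothetical inputs: a "retrieved" snippet and a rename map that a
# tool might infer from the surrounding codebase (both invented).
snippet = "def f(number):\n    return number * number"
rename = {"f": "square", "number": "n"}

class Renamer(ast.NodeTransformer):
    """Rename identifiers so the snippet matches the host codebase."""
    def visit_FunctionDef(self, node):
        node.name = rename.get(node.name, node.name)
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = rename.get(node.arg, node.arg)
        return node

    def visit_Name(self, node):
        node.id = rename.get(node.id, node.id)
        return node

tree = Renamer().visit(ast.parse(snippet))
print(ast.unparse(tree))  # def square(n): return n * n
```

Everything this produces is a mechanical transformation of the retrieved snippet, which is exactly why the copyright question about the original source doesn't go away.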
Is AI intelligently producing new code in that example? Because I don't think it is.
What would be an interesting test of whether it can actually generate code is if it were tasked with implementing a new algorithm that isn't in the training set at all, and could not possibly be implemented by simply merging existing code snippets together. Maybe by describing a detailed imaginary protocol that does nothing useful, but requires some complicated logic, abstract concepts, and math.
A person can implement an algorithm they've never seen before by applying critical thinking and creativity (and maybe domain knowledge). If an AI can't do that, then you cannot credibly say that it's writing original code, because the only thing it has ever read, and the only thing it will ever write, is other people's code.
There is no database lookup.
I've attempted to break that part down here: https://news.ycombinator.com/item?id=27744156
But you seem to have a basic fundamental misunderstanding of what is going on inside the NN. There is no "search for code" - it is generating new code each time, but sometimes that code will be the same as something it has seen because there is little or no variation in the training data for that snippet.
The NN generates code token by token, conditioned on the code leading up to it (and perhaps the code ahead, similar to BERT).
If you see tokens like this you probably generate the same next token too:
for i in range(1,10)
That's what the NN does, but for much longer range conditioning.
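A toy way to see why no-variation training data yields verbatim output: a bigram model trained on a single line, generated greedily, reproduces that line exactly, even though there is no lookup table of whole snippets anywhere. (The corpus here is invented for illustration; a real NN conditions on far longer ranges.)

```python
from collections import Counter, defaultdict

# Toy bigram "language model": count which token follows each token
# in a one-line training corpus, then always emit the likeliest one.
corpus = "for i in range ( 1 , 10 ) :".split()

follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def generate(token, steps):
    """Greedy generation: repeatedly emit the most likely next token."""
    out = [token]
    for _ in range(steps):
        token = follows[token].most_common(1)[0][0]
        out.append(token)
    return " ".join(out)

print(generate("for", 9))  # reproduces the training line verbatim
```

With only one continuation ever seen, greedy generation and "retrieval" are observationally identical, which is the distinction being argued about here.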
Sounds like there is still a db lookup, just not at runtime and instead at build time of the NN. Can you clarify this please?
The issue here is that certain sentences (code segments) are memorized, and reproduced -- much like a language learner who completes every sentence which begins with "Mi nombre" with the phrase "Mi nombre es Mark". The regurgitation is based on high probability built into the priors, not an explicit lookup. A different logit sampling method (instead of taking the likeliest) reduces regurgitation, without changing anything else about the network. (It also makes nonsense happen more often, since nonsense items are inherently less likely!)
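The effect of the sampling method can be sketched with invented logits: greedy decoding always regurgitates the likeliest token, while sampling at a higher temperature flattens the distribution and occasionally picks alternatives (including more nonsense), without changing the network itself.

```python
import math
import random

# Invented next-token logits for the prefix "Mi nombre":
logits = {"es": 4.0, "de": 1.5, "significa": 0.5}

def sample(logits, temperature):
    """Sample a token from temperature-scaled softmax probabilities."""
    weights = {t: math.exp(l / temperature) for t, l in logits.items()}
    r = random.uniform(0, sum(weights.values()))
    for token, w in weights.items():
        r -= w
        if r <= 0:
            return token
    return token  # float rounding fallback: last token

greedy = max(logits, key=logits.get)  # always "es" - the regurgitation
# sample(logits, 5.0) picks "de" or "significa" much more often than
# sample(logits, 0.5), which behaves nearly greedily.
```

The regurgitation lives in the probabilities, not in an explicit lookup, so changing only the decoding strategy changes how often it happens.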
This is a massive simplification. It's adequate for most purposes, but when discussing at this level this simplification breaks down.
Tokens are stored, and the NN contains weights of likelihood of a token occurring after another, given a sequence of prior (and possibly post) tokens.
Verbatim retrieval usually means there is very little variation in the training data for that sequence, so the same set of weights gets stored.
So "retrieval" is actually the same generative process as unique code uses, but the NN hasn't seen any other versions.
How do you differentiate between these two things?
If I take `for i in range(10):` from one place and `print(x ** 2)` from another place, and combine them to get
for i in range(10):
    print(i ** 2)
As an aside, your understanding of how the model works here is completely wrong. Like, just absolutely fundamentally completely wrong.
That's a contrived example because none of those lines could be protected by copyright, patents, etc. A better example might be if you started selling a 30 minute movie that was just the first 15 minutes of Toy Story spliced together with the last 15 minutes of Shrek. I'm not a lawyer, but I'm pretty sure that would qualify as a derivative work, meaning you're potentially infringing on someone's rights (unless they've given you permission/a license).
And to be clear, none of these problems are new. People have been fighting over copyright and its philosophy in court for a very long time. The only thing that's different here is that it seems some people think it's ok to ignore copyright if you use Copilot as a proxy for the infringement.
> As an aside, your understanding of how the model works here is completely wrong. Like, just absolutely fundamentally completely wrong.
Of course I don't, it's a neural network. You don't know either. That example I posted could be exactly what it's doing, or not even close.
(although for the record, I wasn't trying to explain how copilot works in that comment. It was a hypothetical "AI" for the sake of discussion, not that it matters. My point about it being copyright infringement is the same even if that hypothetical implementation is wrong)
What is this supposed to mean? We know how neural networks generate things like this very very well.
I personally have built a system that takes pictures of hand-drawn mobile app layouts into the NN, then generates a JSON-based description file that I compile into a React Native and/or HTML5 layout file.
This was trivially easy in 2018 when I did it. It took me maybe 2 weeks engineering time, and I'm no genius. Our understanding of how transformer-based NNs work has come a long way since then, but even back then it was easy to show how conditioning on different parts of the image would generate different code.
Well no. The question I'm asking about, the philosophical distinction between "producing" or "combining" is a valid question no matter the copyrightability of anything. It's an interesting philosophical question even if we presume that copyright is bubkis.
> It was a hypothetical "AI" for the sake of discussion, not that it matters.
Ah, my mistake. I see that now.
> Of course I don't, it's a neural network. You don't know either.
I may not know how to make a lightbulb, but I do know hundreds of ways not to make one ;)
This doesn't hold at all. Not many people can come up with an original sorting algorithm for example, but people write code all the time.
Does it copy all the time? Doesn't matter. Plagiarism is plagiarism regardless of whether it is done by a student in school, an author, a monkey, or an "AI".
You wouldn't accept this from a student, you shouldn't accept it from a coworker (unless you are releasing under a compatible license), and of course you shouldn't accept it from Microsoft.
This should surprise no one who has seen the evolution of language models. Take a look at Karpathy's great write-up from way back in 2015. It generates Wikipedia syntax from a character-based RNN. It's operating on a per-character basis, and it doesn't have sufficient capacity to memorise large snippets. (The Paul Graham example spells this out: 1M characters in the dataset = 8M bits, and the network has 3.5M parameters.)
Semantic arguments about "is this intelligence?" I'll let others fight.
They seem to believe it is a database system. That's really not how this works, and the fact it behaves like one sometimes is disguising what it is doing.
If I say "write a "for" loop from 0 to 10 in Python" probably 50% of implementations by Python programmers will look exactly the same. Some will be retrieving that from memory, but many will be using a generative process that generates the same code, because they've seen and done similar things thousands of times before.
A neural network is doing a similar thing. "Write quicksort" makes it start generating tokens, and the loss function has optimised it to generate them in an order it has seen before.
It's probably seen a decent number of variations of quicksort, so you might get a mix of what it has seen before. For other pieces of code it has only seen one implementation, so it will generate something very similar. There could be local variations (eg, it sees lots of loops, so it might use a different variation) but in general it will be very similar.
But this isn't a database lookup function - it's generative against a loss function.
This is subtle distinction, but it is reasonable that people on HN understand this.
How are these not both the exact same process of memory recollection? Can you elaborate on the difference between memory recall vs a generative process based on conditioning? I understand how these two are different in application, but I don't understand why one would say they are fundamentally different processes.
The best I can come up with is this:
Imagine you are implementing a system to give the correct answer to the addition of any two numbers between 1 and 100.
One way to implement it would be to build a large database, loaded with "x" and "y" and their sum. Then when you want to find out what 1 + 2 is you do a lookup.
The other method is to implement a "sum" function.
Both give the same results. The first process is a database lookup, the second is akin to a generative process because it's doing calculation to come up with correct result.
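The analogy in code, under the stated assumption of two numbers between 1 and 100:

```python
# Two ways to answer "what is x + y?" for 1 <= x, y <= 100.

# 1. "Database" approach: precompute every answer, then look it up.
table = {(x, y): x + y for x in range(1, 101) for y in range(1, 101)}

def lookup(x, y):
    return table[(x, y)]  # pure retrieval

# 2. "Generative" approach: derive the answer each time.
def compute(x, y):
    return x + y  # calculation, no stored answers

# From the outside the two are indistinguishable:
assert lookup(3, 4) == compute(3, 4) == 7
```

Identical outputs, completely different internal processes, which is why "it produced the same code" doesn't by itself tell you whether a lookup happened.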
This analogy breaks down because a NN does have a token lookup as well. But the probabilistic computation is the major part of how a NN works, not the lookup part.
Intelligence is mysterious in the same way chemical biology is mysterious (though perhaps to another degree of complexity)... It's not mysterious in the way people getting sick was mysterious before germ theory. There's no reason to think there's some crucial missing phenomenon without which we can't even reason about intelligence.
I'm not sure if I would characterize as a "database stored in a neural net", but that is definitely something to deeply consider.
People have been trying to accomplish that for 65 years. We're not even close. It's the software equivalent of cold fusion (with less scientific rigor)
Also when talking about rights, whether or not Copilot copies doesn't seem sufficient to make a call. For instance, if it has to be coerced by the programmer to produce these kinds of snippets in an obvious way, then it seems fine to lay the blame on the programmer similar to when using regular autocompletion (or copy+paste for that matter).
It has memorized one thing. That doesn't prove it's not intelligent. If anything it's the other way around, we would expect an intelligent being to be capable of memorization.
All I can think of is the Turing test and the AI effect. Eventually we will have an AI that is capable of writing code indistinguishable from a human, and people will STILL say it's "not AI" and "just a code search engine", etc. Obviously this isn't there yet, but it's clearly getting closer.
Is 5 a more intelligent answer than 4, because it is new? Copilot is an autocomplete engine, not creative writing.
The question this raises: this was found because it is so famous, but what if it is repeating Joe Schmoe's weekend library project, and we will never know because it's not famous?
Every literally quoted part that could infringe appears at least 10 times in the training data
Clearly, "10 other people did it" is no defense at all.
That is "10 other people". (Although your point stands, since there isn't, and shouldn't be, short of criminal espionage, any strong impediment preventing one person from creating 10 different accounts.)
I think that was a big part of the Google Vs Oracle case - how much copying constitutes an infringement?
It looks like they made a fairly complex rubric to apply in the future, it appears it would be on a case by case basis.
They are putting GPL code in non-GPLed codebases. Is it okay to take sections of other people's source code and use it in yours, if you just got it as a suggestion?
Do not go down this line of reasoning, otherwise we will be copyrighting the concept of for loops.
Too late, patents pick up where copyright ends, to protect general algorithmic ideas, not just implementations. And we have lots of patents on things that seem trivial now, including for-loops (just see how many patents depend on “a multiplicity”). Look - here’s a helpful lawyer’s template for including for-loops as a claim in your own patents: https://www.natlawreview.com/article/recursive-and-iterative...
Another example is the famous XOR patent https://patents.google.com/patent/US4197590/en
EFF keeps a blog on stupid patents https://www.eff.org/issues/stupid-patent-month
A claim on a combination of elements in which one element is an iterative component is not the same as claim on all iterative components everywhere.
Anyone who thinks that licensing will have an effect on what is happening in reality is severely misguided.
Free and open sound good to me! What do they mean exactly? I guess it’s a non-debatable fact that copyrights and patents are abused by many big companies and patent trolls, but doing away with the system does seem extreme, it has also protected deserving individuals on occasion, no? You are saying that it should always be legal to copy someone else’s code / inventions without giving them any credit or compensation?
> Anyone who thinks that licensing will have an effect on what is happening in reality is severely misguided.
I’m not sure I understand what you mean; lots of licensing activity does have a measurable effect on reality. This article is only a small example, but people get sued all the time over taking code and using it without licensing it.
> [Citation needed]
(Not really your point, as such, but) no, actually, if you claim they did something (nominally) wrong, the onus is on you to provide citations showing they did it.
0: > From an IP absolutist standpoint
1: Well, or that they (voluntarily and explicitly) accepted some responsibility (such as a job as a police officer) that entails a higher level of scrutiny than innocent-until-proven-guilty, but that's not really relevant here.
And then I saw your source code and in the late 80s I changed the variable names, function name, and logic to be x+y+z+0.1
And then I told my friend John that there's a super cool algorithm that adds numbers together, and he made some more changes to it and compiled it for a different platform...
Has anybody broken the law in your mind?
EDIT: because it would seem that the original authors (among them Cleve Moler) don't have any issue with what transpired
Is there a source that said they changed variable and function names and modified the logic?
> because it would seem that the original authors (among them Cleve Moler) don't have any issue with what transpired
Yet. Without an explicit license there is no basis to release it under the GPL (if the code was copied verbatim or had insufficient re-writing). What if the heirs of the copyright owner wanted to assert their rights? Is there a doctrine that if you don't assert your rights you lose them? (Presumably applies to trademarks, but I don't think this is the case for copyrights)
What do we do then? The burden of proof for infringement is on original authors, and they haven't done so for 40 years.
In the late 1700s and early 1800s, Britain had to take measures to prevent visiting Americans and others from memorizing the designs of their new high-tech machinery like the steam engine and the power loom.
Where do we draw the line? Shut down the internet until we create a massive copyright detection firewall?
No, we live with the copying and constantly evolve and adapt our business. Death to all patent trolls.
I won't even claim that people must necessarily follow the law. Copyright law is inconsistent at best, and notoriously hard to follow to the letter (and often ridiculous). In practice lawyers assess the legal risk and weigh the outcomes.
I never intended to discuss what we should do, and I definitely did not propose shutting down the internet...
The original discussion was such:
> > They did not copy the implementation, they copied the general idea of what the algorithm should do
You said the original authors did not complain, which is neither here nor there, as I pointed out. There is still some theoretical legal risk if you copy with the owner's knowledge but not express consent. The fact that the burden of proof is on the authors is true but that they have not brought a claim does not mean they cannot prove infringement.
And in case I haven't made it clear, I don't think it's a bad idea to assume the function is under GPL, I just don't think there's a basis for claiming what you originally claimed, and there is still some level of (probably acceptable) risk if you take the purported license of source code as-is.
It's such a famous snippet that it's even included in full on Wikipedia.
I wouldn't be surprised if the next version of Copilot filtered these out.
Algorithms should not be subject to copyright, that way lies madness. It would prevent new generations from building on top of the work of their predecessors, because copyright lasts a very long time. The amounts of code that github copilot reproduces fall squarely into the “shouldn’t be subject to copyright” domain for me, even if they pass the bar for originality.
The algorithm is not copyrighted, but the source code of the function is copyrighted. You could learn how the algorithm works by reading the function, and then write your own function that implements the same algorithm. Algorithms are not copyrightable, they are not subject to copyright. Source code is copyrightable.
Copilot is not reproducing just the algorithm, it is spitting out large chunks of the copyrighted source code, verbatim.
But there is actually an issue about laundering and what constitutes "use", and there is also de minimis to consider.
And EVERYTHING will depend on jurisdiction of course.
That's not what people care about, people care about their copyright being blatantly violated by a massive corporation _without any consequences_.
Can Copilot produce licensed code verbatim, in enough quantities to matter, with a license your business would be infringing? Yes. Can you easily tell by looking at the output? No. Could someone end up suing you over it? Maybe, if they cared enough to find out. Can you honestly tell your investors, or a company you seek to be acquired by, that nobody else can have valid copyright claim against your code? No.
Well aren't all your assertions exactly the point of contention?
How much code is enough to infringe is a tricky question, though. It's not only a function of size, but also of importance/uniqueness - and we know that Copilot doesn't understand these concepts.
As part of the sequence of rulings in Google v. Oracle, the 9-line rangeCheck function, out of the entirety of the Android codebase, was found to be infringing.
Yes, it is, because that means that the algorithm will produce that copyrighted code regardless of the intent of the person who makes it misbehave. People could both accidentally and "accidentally" make it reproduce copyrighted code. In the first case, it's unintentional. In the second, how could you prove it's intentional?
Because of this whole mess, I am actually adding clauses to FOSS licenses that I am writing, just to ensure that my copyright on my code is not infringed by code laundering.
1. A license applied to source code is effective because of your copyright
2. The claim of Copilot's maintainers is that it bypasses copyright
Therefore, they will assert that they can ignore the new license saying "you may not launder my code" just as surely as they can ignore the previous license.
Second, you are correct that Copilot's maintainers claim that it bypasses copyright, but if it does while producing exact copies of code, then copyright is dead, and there are a lot of big companies out there with deep pockets that will ensure that doesn't happen.
They may claim that because their algorithm is a black box, that whatever it produces has no copyright, but my licenses will push back directly on that claim by saying that if source code under the license is used as all or part of the inputs to an algorithm, whether all of the source code or partially, then the license terms must be attached to the output. After all, that's what we do with GPL and binary code. The binary code is the output of an algorithm (the compiler) whose input was the source code.
I hope by tying it together like that, the terms can close the loophole they are claiming. But of course, I am going to get a lawyer to help me with those licenses.
You're not getting it. If Copilot isn't currently infringing copyright then adding such a clause won't matter. Such a clause would only hold weight when copyright applies. On the other hand, if copyright does apply, then you don't need such a clause because the activity is already a violation of the vast majority of licenses. (It even violates extremely permissive ones because it effectively strips out the license notice.)
The GPL works specifically because copyright applies to the usecase in question. It simply specifies various requirements that you must meet in order to license the code given that copyright applies.
In short, you can't just put a clause into a license saying, effectively, "and also, this license confers superpowers which make it so that my copyright applies in additional situations where it otherwise wouldn't!".
Imagine this simplified scenario first: if I published a source file publicly without any licensing or explanation except a standard copyright notice - "Copyright (C) 2021 MY NAME, all rights reserved", do you think a random person/company can take that code and integrate it into a commercial product?
I would argue not (in general). Copyright law, as it is, does not permit a user who has access to a copy to do whatever they want with that copy (especially if it involves more copying). OSS licenses do give you much freedom as long as you don't modify it, and that's why we have the impression that we can do whatever we like with publicized source code. However, if we think about other types of copyrighted work, say movies for example, streaming services can "rent" you a movie multiple times even though you've paid to download the content previously. What are you paying for the second time you rent? Another example: some photographers may allow you to freely browse their works, but they can still make you pay money if you want to use their photo in your commercial product.
So why wouldn't copyright restrict usage of source code in similar situations? The GP only needs to add a condition to the license to restrict how users can use it. It will no longer be OSS, but as long as it's his work, I don't see why in principle it shouldn't work.
(In practice, I don't think it will make much difference -- I think your argument is still somewhat compelling, and some people will probably take your position. Conservative corporate lawyers aimed at reducing legal risk would disagree, so it's basically a matter of how much legal risk one is ready to take. Also, for an author trying to do this, note that suing Microsoft in these cases would be expensive, since they will likely fight back given that they spent so much money trying to do this, and the outcome will be uncertain. If really tested in court, given the result of the Oracle v Google case, if the US Supreme Court is impressed by the social/economic benefits that Android brings, I'm pretty sure the justices will be even more impressed by this intelligent code generation thingy, and might just grant this thing a fair use.)
I have a nice strong lock on my door. GitHub (asserts that it) can enter my home through the window.
Adding another deadbolt to the door does not help.
Maybe I'm missing something (just not the thing you said), but has Github made any legal claims so far? The original article is written by a politician in the EU...
Even if you're a lawyer defending Github in this case, there's still a couple things that needs to be clarified before you can make the case: (maybe the info is out there but I'm too lazy to research)
- Is Github only using code/repos that are explicitly under OSS licenses? (because if that's the case, then the discussion might be justified in presuming OSS terms, and it may be the case that more restrictive non-OSS licenses would require a different analysis)
- As somebody pointed out in another thread, the Github terms of service agreement seems to grant Github additional rights when dealing with user uploaded content. Is that a legal basis for the use?
And I tend to agree with you (and the other commenter) here. But GitHub doesn't.
> has Github made any legal claims so far?
I'm not sure how actively, but the CEO was here in the announcement thread the other day saying that they think the ingestion of the inputs is a "fair use". They also have some material defending the output side: https://docs.github.com/en/github/copilot/research-recitatio...
> Is Github only using code/repos that are explicitly under OSS licenses?
I don't think we know exactly what code they used as inputs, no.
It's a matter of scale. With a big enough codebase, there will be copyright violations.
The point (that they claim that) you are missing is that if "copyright is relevant to Copilot's input" then almost all existing OSS licenses already don't allow that.
However, GitHub said nothing about the output of the model being fair use. My license will say that the output of their model is under the same license as the input, which means they have restrictions if they want to distribute it (i.e., actually have people use Copilot).
I think this will work because it doesn't say that GitHub is wrong. Instead, it says that, even if GitHub is right, it doesn't matter.
It would also be very bad for GitHub to claim that the output of an algorithm can't be under the same license as the input because we feed licensed code to algorithms all the time and claim that their output is still under the same license. We call those algorithms "compilers" and the binary code they produce is still copyrighted and licensed.
I didn't mean to take a side or argue a position here. I was just pointing out that licenses hold no legal power in the event that copyright itself doesn't apply.
> ... So why wouldn't copyright restrict usage of source code in similar situations?
I'm certainly not an expert here but I believe you are mistaken about the extent to which current copyright law (in the US) restricts such usage. I also don't think that the examples you bring up are as simple as you seem to be making out.
You are legally permitted to record broadcast shows for later viewing; you are not permitted to redistribute the recordings though. I assume (but am not certain) that rentals and streaming are the same. (That being said, bypassing DRM has been made its own crime. This effectively amounts to an end run around the rights otherwise granted to you by US copyright law. But then there are specific exceptions where bypassing DRM is permitted. I digress.)
You aren't legally permitted to mirror the contents of a website (such as the New York Times) without permission but you are allowed to access it since they make it publicly available. You are even permitted to save a copy for your own purposes when you access it; you are not permitted to redistribute that copy.
For an extreme example, consider the recent LinkedIn case. Unless I misunderstood it, the court deemed it acceptable to scrape any publicly available content. Certainly most such scraped content was never explicitly licensed for that though!
Even if the license for a piece of code was entirely proprietary, GitHub presumably acquired it through legal means (ie intentional upload). Once they have it in their possession, it's not at all clear to me that current copyright law in the US has anything to say about how they use it (short of redistribution). Of course, if their ToS promises that they won't use it for other purposes then they can't do that. But assuming they never promised you that in the first place ...
There's a traditional argument here about needing a license to legally incorporate the copyrighted work of another into your own.
One possible counter argument is that training a model on publicly available work is analogous to a person viewing that work. So long as the model never outputs any of the original inputs (or only exceedingly small fragments of them that would fall under fair use regardless) it's not clear that those outputs constitute derivatives at all (in the legal sense). Or they might. The courts haven't weighed in yet as far as I know. (Consider GPT-3 or This Waifu Does Not Exist for additional examples of the sort of ambiguity that's possible here.)
Of course, one possible counter to that is that the model itself is (in many cases) effectively a lossily compressed copy of the original input works. So perhaps redistribution of the model itself would be a violation of copyright. But even if that turns out to be the case, it's still not clear that the output of such a model would run afoul of copyright.
I argue that the output of an algorithm carries the same copyright as the inputs to the algorithm. After all, we use compilers (algorithms) to transform source code all the time, and no one says that the binary code (the output) loses the source's copyright.
The compiler produces more or less a direct (logical) translation so it's clearly some sort of derivative. We go from C to machine code but the output still "means" the same thing as the input. (More precisely, it's approximately a mathematically transformed subset of the original input. Lots of information is removed, things are reorganized, and a bit of extraneous information gets added in the process.)
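The compiler-as-translation point can be illustrated with Python's own stdlib (a sketch; the snippet being compiled is made up): `compile()` mechanically translates source into bytecode, a different form with the same meaning.

```python
import dis

# Hypothetical source fragment; any snippet works the same way.
source = "def square(n):\n    return n * n\n"

# compile() performs a direct (logical) translation of the source
# into a code object: different form, same meaning as the input.
code = compile(source, "<example>", "exec")

# dis shows the translated instructions; none of the original text
# survives except its logical structure.
dis.dis(code)
```

No one disputes that the resulting bytecode is still subject to the copyright of the source it was translated from.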
For something notably more muddy than a compiler, consider This Waifu Does Not Exist. Any given output is (typically) nowhere near any particular input but you can often spot various strong resemblances.
Alternatively, the implementation of sketch-rnn (https://magenta.tensorflow.org/sketch-rnn-demo) is quite different - it outputs pen strokes instead of pixels. Still, the legal questions remain the same.
For a significantly muddier example, consider GPT-3. The outputs are (typically) not even remotely similar to anything that was input except in very broad strokes.
Where does Copilot fall along this continuum?
For even more confusion, consider running a New York Times article through Google Translate. Are you in the clear to publish that? I seriously doubt it.
But what about running it through an ML algorithm that (attempts to) produce a very brief summary of it? Many such implementations exist in the real world today. Their output is nothing like the input - should it still fall under the copyright of the original?
Finally, it's worth pointing out that for many of the above computerized tasks there are direct human equivalents. Art can be traced on a light table. A drawing can be produced that fuses the styles of two references. News articles can be manually translated or summarized.
Again, my intention here isn't to argue a particular side. I'm just trying to make it clear how complicated this stuff is and the fact that we don't have clear legal answers for most of it yet.
I argue that, even if training a model on a dataset is fair use, distributing the result is copyright infringement. I would want my license to make that part clearer.
I would be inclined to agree that the current situation (i.e., reproducing training examples verbatim) violates copyright. On the other hand, I'm not so sure that a trained model is (or even should be) subject to the copyright of the inputs.
Of course I acknowledge that the latter view is controversial and also that such issues are so new that they haven't had a chance to be meaningfully addressed by either the courts or the legislature yet.
As an example of a similar situation, see (https://www.thiswaifudoesnotexist.net/) which was trained entirely on copyrighted artwork. Note that there are at least three distinct issues here - training the model, distributing the model itself, and distributing the output of the model.
> I would want my license to make that part clearer.
But again, GitHub's argument here is that the license is completely irrelevant because it doesn't apply in the first place. Thus they won't care one bit about any clarifications you make one way or the other.
You missed my point. I'm not saying that the model is subject to the copyright of the inputs; I'm saying that the model's outputs are, which is entirely different. We say that the output of a compiler is still subject to the copyright of the inputs, so why not this?
Anyway, by providing public access to this thing I infer GitHub to be taking the position that copyright doesn't apply to the output. (And I suspect they are wrong, in particular because of the verbatim code samples people have managed to coax out of it.)
That seems an unlikely legal argument. It would defeat the point of fair use if you couldn’t distribute the result.
And no copyright license can override copyright law. Licenses can only grant rights, they can’t take them away.
But stripping licenses away so that users can't know what rights they have with my code is not that.
Doesn't this make your new licenses incompatible to a lot of existing licenses?
Law isn't code.
So either the added terms are not more restrictive, which basically means they are unnecessary and have no real effect; or they are more restrictive, which is incompatible with the GPL.
You can't have things go both ways. It seems that your argument is "we're not adding restrictions, we're just saying what we think Copyright law / the GPL should actually be like." But unfortunately you can't "clarify" Copyright Law or "clarify" the GPL by adding terms. Ultimately courts decide that.
(Of course, if somehow your "clarification" happens to align with a court decision, then maybe it will work after all. But in theory your "clarification" is still not necessary and has no additional effect....)
Except your clarification will be interpreted by a court of law. “This license is compatible with the GPL and I can interpret the GPL in a way that lets me do something this license says I can't” is much less likely to stand than “well maybe the author thought the GPL said this, but it actually says my interpretation”.
This, of course, presumes that such a license is actually compatible with the GPL, something I'm getting less and less certain of over time. (What constitutes a compiled form? If a predictive model doesn't count – which it might not, since it outputs source code, very much unlike how compiled programs normally work – then my argument falls down. And many other things would also knock the argument down; I'm not confident enough that all my assumptions are right, or that they should be right.)
But yes, my licenses may be incompatible (one-way) with permissive licenses. I say "one-way" because code with permissive licenses can still be used in code under my licenses, but maybe not necessarily the other way around.
I'm okay with that.
If you're just adding something along the lines of "copying passages extensive enough to reach originality is a violation of this license" then that's indeed already covered by the GPL, and there is really no need to add such a passage other than to be more explicit - and confuse people at least at first about why your license is not actually the GPL. So there isn't much of a point to do it in the first place, in my humble opinion.
If you add text that says something along the lines of "you may not use this code as training data", then you created an incompatible license, and your code cannot be used in GPL code bases, and even worse, since it restricts what you can do with the code more than the GPL, it might even mean you stop being reverse-compatible and may not use GPL'ed code yourself in your own custom-license code base.
The AGPL does not further restrict code uses, just broadens the scope of when you have to make available the code, so it's fine there. However, the original BSD license with the advertising clause is considered incompatible with the GPL.
I am not a lawyer, and these are just my quick layman concerns. I fully recognize you're entitled to use whatever license you find suitable for your code and I am absolutely not entitled to your code and work whatsoever.
But that said, I wouldn't touch your code if I saw a "potentially problematic" custom license, and I wouldn't consider contributing to your projects either.
Honestly, with this whole debacle, I am not going to be accepting outside contributions anyway.
I also understand the concern with a problematic license. However, I don't plan to make a specific exemption about machine learning, but rather tie up an ambiguity.
What I think I'll do is that the license will require that when the licensed source code is used, partially or fully, as an input to an algorithm, the license terms must be distributed with the output of that algorithm.
I don't think this is a violation of the GPL at all because the GPL requires you to distribute the license with the binary code of GPL'ed code, and such binary code is the output of an algorithm (the compiler) whose input was the source code.
But what it would do is put the onus on GitHub: if they used my code as training data and distributed the results (as they are doing), they must distribute my license terms as well and tell users that some of the results are under those terms.
Just because binary code is produced by the operation of an algorithm on source code doesn't make all output produced by any algorithm on that source code binary code. Otherwise checksums and hashes and prime numbers would be copyrighted.
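A checksum makes the point concrete: it is the output of an algorithm run over source code, yet it preserves none of the creative expression (a minimal sketch; the snippet being hashed is made up).

```python
import hashlib

# A made-up source fragment standing in for copyrighted code.
source = "def add(a, b):\n    return a + b\n"

# The digest is deterministically derived from the source, but no
# amount of staring at it recovers the original expression, so it
# is hard to call it a "derivative work" in any meaningful sense.
digest = hashlib.sha256(source.encode()).hexdigest()
print(digest)
```

Whether a trained model's outputs sit closer to the compiler end or the checksum end of this spectrum is exactly the open question.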
Bats are not birds.
Would such a term be legally binding under present copyright law? Other than disallowing inclusion in a redistributed dataset specifically intended for training ML models, it's not clear to me that it would actually prevent such use if you already had a copy on hand for some other purpose. (Specifically, note that GitHub indeed already has a copy on hand for their authorized primary purpose of publicly distributing it.)
More generally, the manner in which copyright law applies to machine learning algorithms in general hasn't been worked out by either the courts or legislature yet. Hence the current article ...
Whereas using the browser's copy feature requires the user to have intent to use it, getting Copilot to produce exact code does not. And proving that intent is not easy.
I think companies will see that such code can be exactly reproduced and decide to stay away from Copilot. I hope they do. In fact, I am less willing to take outside contributions for my own code, even for bug fixes, just because of the risk that that code came from Copilot.
For example, a book can be copyrighted, but its author certainly cannot sue me just because a book I wrote contains one identical sentence.
However, for my purposes, using a new license with particular terms would only be to make companies like GitHub pause and think before using my code as "training" to an "algorithm" like Copilot.
Tool that could be used to violate copyright := Gets prosecuted by MPAA and friends, legislation is passed to make use / development / distribution of such tools illegal
Bigcorp ships the ML equivalent of ALLCODE.tgz, but you actually gotta look in the no/dont/open/this/folder/gplviolations/quake.c folder := Is this adequate proof that copyright is being violated?
I can see an argument for doing your own research, but I can also see an argument for basing an analysis on what GitHub said in the FAQ — I'm honestly a bit surprised that Microsoft's lawyers let them say that with a product that can reproduce such large blocks of verbatim code.
So it could be that the executives really wanted to do it, and the lawyers thought "OK, technically we're not violating anything...."
At any rate, I like it here; so I'll try to figure out how what I said was flamebait, and try not to say such things again.
[Edited upon re-reading]
You can't copyright an algorithm, but you certainly can copyright the expression of an algorithm in Python. You cannot copyright the words of the English language and their meanings, but Noah Webster absolutely did copyright his dictionary, which was a creative expression of their definitions (and lobbied for the first increase to US copyright law). Webster wasn't the "thought police" for trying to copyright people's understanding of words in English, because he didn't and couldn't copyright them; he copyrighted his expression of what words meant.
If you read the creative expression of an algorithm in Python and then re-express it in English, then sure, copyright protection doesn't extend to that re-expression. But Copilot isn't doing that, it's quite clearly reproducing parts of the original creative expression of an algorithm, not the algorithm itself.
Here's an easy way to demonstrate it: open up a source file in any language other than C and try to get Copilot to spit out an implementation of Quake's fast-inverse-square-root algorithm. You will very quickly discover that Copilot doesn't "know" the algorithm; it only "knows" the specific creative expression of it (comments included).
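To make the expression-vs-algorithm distinction concrete, here is the same fast-inverse-square-root idea re-expressed from scratch in Python (a sketch: the widely published bit-level trick and magic constant are the algorithm, while the names, structure, and comments here are new expression).

```python
import struct

def fast_inv_sqrt(x: float) -> float:
    # Reinterpret the 32-bit float's bits as an unsigned integer.
    i = struct.unpack("<I", struct.pack("<f", x))[0]
    # The well-known bit hack: shifting the exponent and subtracting
    # from the published magic constant gives a rough 1/sqrt(x).
    i = 0x5F3759DF - (i >> 1)
    y = struct.unpack("<f", struct.pack("<I", i))[0]
    # One Newton-Raphson iteration refines the estimate.
    return y * (1.5 - 0.5 * x * y * y)
```

The algorithm survives this re-expression intact; Quake's specific C wording and comments do not, which is precisely what Copilot fails to do when it emits the original verbatim.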
Yes, a port from language X to Y is widely considered a derived work. Whether it is infringing is a separate question.
In the US, copyright may cover the choice of variable names, the organization of the code into modules and functions, and other aspects where there are creative choices that may be protected under copyright law.
The relevant process is described at https://en.wikipedia.org/wiki/Abstraction-Filtration-Compari... , which comes from the court case at https://en.wikipedia.org/wiki/Computer_Associates_Internatio.... nearly 30 years ago:
> the court presented a three-step test to determine substantial similarity, abstraction-filtration-comparison. This process is based on other previously established copyright principles of merger, scenes a faire, and the public domain. In this test, the court must first determine the allegedly infringed program's constituent structural parts. Then, the parts are filtered to extract any non-protected elements. Non-protected elements include: *elements made for efficiency (i.e. elements with a limited number of ways it can be expressed and thus incidental to the idea)*, elements dictated by external factors (i.e. standard techniques), and design elements taken from the public domain. Any of these non-protected elements are thrown out and the remaining elements are compared with the allegedly infringing program's elements to determine substantial similarity.
Emphasis mine. This specifically highlights that your example ('only so many ways you can express an algorithm') is not protected under US copyright law.
The originality requirement only applies to other aspects of the generated code, which in this case would include the comments that Copilot generated, and which clearly are not required for the algorithm to work.
For thought police like you describe, look to patent law.
Let's say someone comes up with a new sorting algorithm, which completes in fewer cycles than was previously believed possible. Sure, it's math, but isn't that a new, creative expression? Don't we want to encourage them to publish their algorithm (one of the key purposes of patents—this way, anyone can use it after 20 years), as opposed to keeping it hidden from the world?
It makes more sense to me than most software patents (admittedly, a low bar to clear). And if the patent office is doing its job (big if), the patents should only be granted for algorithms which are sufficiently novel.
But nowadays I think patent law isn't the right way to do that; trade secrets should be enough. I don't think that what is disclosed to the public in patent applications is of enough value to justify a long monopoly. It's not necessarily a problem with the written law; patents are horrible because of the way courts apply them.
Either way, the copyright of source code is separate from that. Copyright is for the text of a program (the source code), which might e.g. implement an algorithm. The algorithm itself cannot be patented or otherwise legally protected.
I've not come across your stipulation that for a thing to count as an algorithm it must provably halt, but I can go along with that. So I'd argue that in most cases, any function or subroutine provably terminates, even if the program embodying it is not supposed to terminate.
I also don't agree that an algorithm is "just maths". At least, not if you then pivot to saying that a browser isn't "just maths". Any operation performed by a computer is "just maths", because what a CPU does is basically arithmetic and branching.
I don't think it's a question of what does and doesn't "deserve" IP protection. The source code of a browser is clearly an original work, and entitled to protection. But the ideas and procedures it embodies are not "works", and copyright isn't supposed to apply to ideas and procedures.
I'm against the very idea of "intellectual property". It must have seemed a good idea at the time, but I think patents and copyrights have become monsters that inhibit, rather than encourage, innovation and creativity.
Algorithms are distinguished by their proofs of correctness. This elevates them above simple procedures. The halting problem tells us that there is no automatic way to determine whether or not a program terminates. So when we find one, it's like discovering a mathematical law. The proof of an algorithm's correctness is expressed independently of any programming language or platform. What else could they be other than math?
Things like browsers, games, operating systems, e-mail clients, music players etc. are not treated this way. They are not formally specified. They are implemented in the context of a machine and an actual running environment. The source code of the program usually doubles as its specification. It's very different compared to an algorithm.
I agree IP as a concept is bad, but this is the way of the world at least for now. Given where we are, for me it makes sense to draw a line between algorithms and software in the context of copyright.
Can you cite? (Wikipedia's explanation of what an algorithm is doesn't mention that). I'm open to being corrected, but not baldly contradicted.
Wikipedia does mention it though:
> In mathematics and computer science, an algorithm is *a finite sequence* of well-defined, computer-implementable instructions, typically to solve a class of specific problems or to perform a computation.
So they stipulate finite input and output and finite runtime. This is in contrast to something like a webserver or OS, which has a potentially unbounded number of inputs, and is expected to run effectively forever.
I mean think about it, what does an OS kernel look like in its most distilled form? It's essentially just an entry point to an infinite loop. Same as games, where they're called "event loops" in that domain.
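The contrast can be sketched in a few lines (hypothetical examples): binary search has a finite input and a shrinking interval that guarantees termination, while an event loop is shaped to run for as long as events keep arriving.

```python
from collections import deque

def binary_search(xs, target):
    # An algorithm in the strict sense: finite input, and the
    # half-open interval [lo, hi) strictly shrinks on every pass,
    # so termination is provable.
    lo, hi = 0, len(xs)
    while lo < hi:
        mid = (lo + hi) // 2
        if xs[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return lo if lo < len(xs) and xs[lo] == target else -1

def event_loop(events, handler):
    # The shape of a kernel or game loop: it drains whatever event
    # source it is given, and in a real system that source never
    # runs dry, so there is no input for it to "finish".
    queue = deque(events)
    while queue:
        handler(queue.popleft())
```

The first function is something you could state and prove without a computer; the second only makes sense as a running program embedded in an environment.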
The way I think about it is this: if computers and programming languages and software didn't exist, algorithms still would. I don't think you can say the same about e.g. Quake. Quake isn't a mathematical truth even though it uses them to work, kind of like how engineers use physics to build a bridge, but the bridge itself isn't physics.
I don't see anything in the WP article about proofs of correctness. "Finite sequence" surely just means that the number of instructions isn't infinite? Come to think of it, that wording seems rather hand-wavey; I wonder if I can find citations to help me improve it.
Well how are you going to show that the algorithm terminates for all inputs without a proof of correctness?
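For a classic illustration of such a proof (a sketch; the example is mine, not from the thread): Euclid's gcd terminates because a non-negative integer measure strictly decreases each iteration, which is exactly the kind of argument a termination proof supplies.

```python
def gcd(a: int, b: int) -> int:
    # Termination argument: b is a non-negative integer, and each
    # iteration replaces it with a % b, which is strictly smaller
    # than b. A strictly decreasing non-negative integer cannot
    # decrease forever, so the loop must halt.
    assert a >= 0 and b >= 0
    while b:
        a, b = b, a % b
    return a
```

The proof rides on a property of the integers, not on Python; that language-independence is the point being made about algorithms above.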
The thing about this situation is that "copying code you found on the Internet" certainly isn't automatically, always legal. But the fact that you copied X from the Internet doesn't make it illegal either. The source of the code you incorporate into a product doesn't matter; what matters is whether that code is copyrighted and what the license terms (if any) are (and people saying "copyright doesn't apply to machines" are wildly misinterpreting things, imo).
Given what's come out, it seems plausible that you could coax the source of whatever smallish open source project you wished out of copilot. Claiming copyright on that code wouldn't be legal regardless of Copilot.
Whether Microsoft/GitHub would be liable is another question as far as I can tell. I mean, youtube-dl can be used to violate copyright, but it isn't liable for those violations. The only way Copilot is different from youtube-dl is that it tells its users everything is OK, and "they told me it was OK" is generally not a legal defense (i.e., I don't know for sure, but I'd be shocked if the app shielded its users from liability). All the open source code is certainly "free to look at", and Copilot putting it on a programmer's screen isn't doing more than letting the programmer look at it, until the programmer does something with it (incorporating it into a released work they claim as their own would be such an act).
The question is how easily a programmer could accidentally come up with a large enough piece of a copyrighted work using Copilot. That question seems to be open.
TL;DR: My entirely amateur legal opinion is that Copilot can't violate copyright but that its users certainly can.