I disagree with this article. GitHub Copilot is indeed infringing copyright, and not merely in a grey zone but in a very clear, black-and-white fashion that our corporate taskmasters (Microsoft included) have themselves defined as infringement.
The legal debate around copyright infringement has always centered around the rights granted by the owner vs the rights appropriated by the user, with the owner's wants superseding user needs/wants. Any open-source code available on Github is controlled by the copyright notice of the owner granting specific rights to users. Copilot is a commercial product, therefore, Github can only use code that the owners make available for commercial use. Every other instance of code used is a case of copyright infringement, a clear case by Microsoft's own definition of copyright infringement [1][2].
Github (and by extension Microsoft) is gambling on the fact that their license agreement granting them a license to the code in exchange for access to the platform supersedes the individual copyright notices attached to each repo. This is a fine line to walk and will likely not survive in a court of law. They are betting on deep lawyer pockets to see them through this, but are more likely than not to lose this battle. I suspect we will see how this plays out in the coming months.
The part that feels really obvious to me is this: if I made an AI that generated music by looking through the entire (copyrighted) back catalog of the Beatles, for example, and it output music that I could steer to be very much, or even exactly, like the original recordings (or I could do so accidentally), that wouldn't really be a way to launder the original licenses/copyright into the public domain.
Or maybe it is, but if so it essentially means the end of licensing because it would be trivial to make an AI that can take an input and produce the same output. Or maybe even cp is good enough to strip the source of its original license in that case.
Open source licenses are worth protecting or you break the cycle that helps more software be open.
> and it would output music that I could control to be very much or even exactly like the original recordings, or I could accidentally do it, that it wouldn’t really be a way to launder the original licenses/copyright into the public domain.
The test for non-literal copyright infringement is "substantial similarity." If, after filtering out irrelevant and non-copyrightable elements, the allegedly-infringing work is substantially the same as the original work, then it infringes. If it infringes, then two common defenses are independent creation and fair use.
In your hypothetical, the AI-generated work would infringe the original because you stated it would be substantially the same as the copyrighted work. You can't claim independent creation because the algorithm was dependent on the original work and you controlled the output of the algorithm to be exactly like the original work. Fair use is pretty much a non-starter, so I'll skip that analysis.
So, no, you couldn't use an AI to launder copyrighted works into the public domain.
The way I see it, if you would use Copilot to completely (or largely) reproduce an existing work (software), then you would be infringing copyright. This is similar to using an AI to largely replicate a piece of music.
If you are using it to mix a snippet of code (from a sufficiently large code base) into a large code base of your own, then you are just remixing. That is not infringement. In music, there are entire genres based on remixing. You could even take it a step further and ask yourself: what is not a remix?
Yeah, I've heard about some of those cases. It's surprising how far copyright can be stretched sometimes.
What's more surprising is to see copyleft advocates positioned so strongly in favour of giving copyright that kind of reach. I think that in a different context, some of the cases you refer to would be used by these same copyleft supporters as examples of why copyright needs to be more weakly enforced, not more strongly.
At the very least it should be consistent. If Microsoft can sue me for something trivial that probably shouldn't be illegal but is, then I can sue them for something trivial that probably shouldn't be illegal but is.
> Unless you are Github, in which case having your AI copy code verbatim is ok?
You have to filter out any non-copyrightable elements before you do the substantial similarity analysis. For code, that means removing non-expressive elements like arithmetic or boolean expressions, looping, recursion, conditionals, etc. APIs are not copyrightable under the recent Supreme Court holding in Google v. Oracle.
How much of your code is actually left after filtration?
You’re basically arguing that whole books aren’t copyrightable because after you filter out all the words which aren’t individually copyrightable there isn’t anything left to copyright. Which obviously doesn’t make any sense.
The tests exist as a way of determining if someone did the action of copying which is the important thing at the core of it. And in this case the facts aren’t really in dispute, it’s whether what GH is doing counts as copying.
Suppose you had a really, really good memory, remembered almost exactly how your former company implemented something, and, when faced with a similar problem, unknowingly produced similar code. It's iffy whether this is copying. It really probably isn't, since you're allowed to learn from copyrighted works, but the courts aren't omniscient, and when presented with the code they very well may rule that copying was more likely than not.
But we are omniscient in this case so we don’t really need the tests. Is what GH does more like copying or learning? This isn’t something that can be determined purely from the output of the tool.
Code elements like conditionals, loops, arithmetic or Boolean expressions, assignment statements, etc. are not copyrightable, but not because they’re insignificant (like individual words in a novel), but because they’re useful articles. Copyright protects expression, not utility, which is the domain of patent law.
That's nonsense though. If we filter out the non-copyrightable parts of music we'd remove all the pitches, eighth notes, quarter notes, half notes...
We could remove the non-copyrightable parts of text works too. Just take out all the basic building blocks of language like verbs, nouns, stop words...
The problem of API copyrightability was not resolved by Google v. Oracle, the ruling was that no matter whether or not APIs are copyrightable, copying APIs is fair use (which is a concept in the USA but not in many other countries around the world).
And in any case, I think at the point where the AI (yes, I considered scare quotes but decided against them) copies the code verbatim, swear words included, it is kind of obvious what is happening:
The AI part isn't about independent creation but about figuring out what to copy.
"Or maybe it is, but if so it essentially means the end of licensing because it would be trivial to make an AI that can take an input and produce the same output."
Yes, this is what is pretty interesting to me. I said in a previous comment that I have a really good OS generating AI. It asks you your favorite color and outputs a disk image you can use as an installer.
Right now it just happens to output a cracked version of Windows if you answer "blue". Who can know how that happened? It's a black box after all. Seems useful though, since Microsoft is loudly saying that if I distributed this it would have no license problems at all.
I think this is peak programmer logic. You can’t just declare something a black box or declare something an AI — the courts, people, aren’t that stupid. Lawyer words aren’t that magic.
If when you crack open Copilot it’s determined that it’s not actually learning and boils down to storing and regurgitating snippets of code no matter how few or many layers of indirection it’s still infringement.
What your AI actually does underneath all the indirection is what's important.
> [..] not actually learning and boils down to storing and regurgitating [...]
Some would say that pretty much _all_ learning involves "storing" and "regurgitating".
Aside: my daughter started writing out her name at kindergarten a couple of years ago. One of the staff seemed a bit dismissive of this, and claimed what was happening wasn't really "writing" it was "merely memorising the shapes of letters and then reproducing them in the right order". <rolls eyes>
Yeah ... that was my point. What copilot does underneath the indirection is steal GPL code and derive "new" stuff from it in a way that launders the licenses.
We know what it's doing, it's doing very well understood statistical inference techniques to derive outputs.
Also, it's trivially easy to teach a real neural network to output a specific binary stream in response to a specific input. My OS AI could use the exact same technology as Copilot, I'd just train it very specifically.
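For what it's worth, that claim is easy to make concrete. Here is a deliberately silly sketch (all names hypothetical, no real training involved): whether the memorization lives in a dictionary or in a network overfit to a single example, the observable input/output behavior is the same, which is why "it's a neural net" doesn't by itself change the copying analysis.

```python
# Hypothetical sketch: an "AI" that has rote-memorized a single
# input -> output pair. All names and data here are made up.
MEMORIZED = {"blue": b"\x4d\x5a...pretend-disk-image..."}

def os_ai(favorite_color: str) -> bytes:
    # A neural net overfit to one example, a hash lookup, or any number
    # of layers of indirection could sit here; the observable behavior
    # would be identical to this dictionary lookup.
    return MEMORIZED.get(favorite_color, b"")
```

Answer "blue" and you get the memorized payload; answer anything else and you get nothing. From the outside, nothing distinguishes this from a "black box" model.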
I think the main point that the article makes is that for copyright to work you need some notion of a creative work, and so far it's generally accepted that snippets like
i = i + 1
aren't creative enough to be covered by copyright. The interesting point is where you draw the line between what's boilerplate and what's creative, and legally it will presumably come down to showing that Copilot crosses that line egregiously enough for someone to think they have a realistic chance of success at legal action.
That's not an existing about-me page. You can go to davidcelis' website and verify that it's completely different.
Copilot just picked a random person and linked to their social media accounts. You can search Google for any large quote within that about-me text and not find a match; it is unique.
The only two examples of generating large sections of copyrighted work are the quake floating point hack and the zen of python. Both those examples are commonly known and copied and talked about, to the point that they have wikipedia pages.
But that about me page is the very definition of boilerplate text, so really it only gives weight to the argument that it's not producing original work.
You got downvoted, but I kind of like this argument. There are a million "about me" pages, but Copilot did a good job of picking one for "generic software engineer". If it could just have changed a word or two to a synonym, it would be great.
But as I understand it, copilot can generate much longer snippets, even entire functions.
I think the big question is: if Copilot ends up copying significant portions of a GPL work, not just tiny snippets, is the resulting work infringing, and if so, who is liable?
If a tree falls in the forest and no one is around to hear it, does it make a noise? If it is infringing and someone is liable, what proportion of the infringing cases would be found and what proportion of those found instances be brought to the courts?
I have almost no sense of how often code is infringed currently and how often anyone does anything about it. I have a feeling that we live in a world with constant infringement, basically no one cares, and no one does anything about it. And I would assume the status quo will maintain its current course with this new tool. But again, I'm giving zero factual evidence; it's just a feeling from not seeing or hearing almost any news about open-source code infringement.
If you use Copilot and it generates a substantial amount of code, you don't know if that might be a replica of code from another project with an incompatible license. If you are law-abiding and/or want to respect open source licenses, then it is on you to figure out what license, if any, that code would fall under. Which means Copilot would only be useful to developers who don't care about FOSS licenses, or who only use it for snippets that couldn't possibly be considered original enough to be covered by copyright.
Unlike patents, for copyright, independently created work is not infringing. So, if you build a model that really does actually model music, this could be argued to be independent creation.
But there is also caselaw (involving George Harrison IIRC) on "unconscious copying", where having heard a piece is suggestive that it was not an independent creation, despite not being deliberately copied. So, training on a corpus that includes a specific piece is arguably a case of that.
There's an interesting question of whether a model is just a sophisticated statistical compression of a corpus, or whether it is a thing in itself. I would say, if it finds patterns that are disproportionately simpler than the corpus, it has found "something".
But another view is of creation as involving a side-channel or one-time pad, in that music is created by a human and heard by a human, who have common information that has never been present in music before (e.g. specific aspects of common neurophysiology, auditory anatomy, exact heartbeat waveform, new sounds/rhythms in the world, new speech patterns, associations between existing melodic fragments and words/emotions/visuals/status etc).
In this sense, truly new music is discovery of the Human Music Processing System, which ultimately involves the whole human and their social and physical experience.
You say AI, but this is just a database with a weird query language. The fact that it has to be trained on massive amounts of code and can only regurgitate variations of snippets of the training database back makes it quite clear that there is no intelligence in this thing.
If it were intelligent, it'd be given the language specifications and then I'd be able to say "Write me an open-world video game based around gang culture, and I'd like it to run on the Raspberry Pi Zero."
It sounds like you might be the one giving it too much credit. AI is a glorified markov chain, which is essentially a compression algorithm. I agree that it can be an instrument (I've done it: https://soundcloud.com/theshawwn/sets/ai-generated-videogame...) but it's almost trivial to train a model that memorizes by rote.
Suppose a model was trained solely on a single Beatles album. It could only spit out that album. That would be clear infringement, wouldn't it?
Actually GPT-3 is a Markov chain. A Markov chain is a very general term. Just because the simplest ones are stupid, doesn't mean there aren't smarter ones.
A Markov chain is a model where: (1) there is a state, (2) the probabilities of the next state depend only on the current state.
That could describe anything from a wet fart to a deterministic computer to GPT-3 to the human mind to quantum mechanics (the real one, not a simulation).
If I unplug the Internet and all USB devices, my computer is a Markov chain with at least several trillion bits of state, so 2^(several trillion) possible states. And there is one next state, which has a probability of 1, and all other possible states have a probability of 0. That's a Markov chain.
GPT models choose the next word probabilistically, with the probabilities chosen by feeding the previous N words into a neural network. That sounds like a Markov chain to me!
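To make the terms of this disagreement concrete, here is a toy word-level Markov chain in the textbook sense (a hypothetical minimal sketch, not a claim about how GPT works internally): the next-word probabilities depend only on the single current word.

```python
import random
from collections import defaultdict

def train(text):
    """Build a word-level transition table: the possible next words
    depend only on the current word (the Markov 'state')."""
    table = defaultdict(list)
    words = text.split()
    for current, nxt in zip(words, words[1:]):
        table[current].append(nxt)
    return table

def generate(table, start, length):
    """Walk the chain: at each step, sample a next word using only
    the current word, forgetting everything that came before."""
    word = start
    out = [word]
    for _ in range(length):
        choices = table.get(word)
        if not choices:
            break
        word = random.choice(choices)
        out.append(word)
    return " ".join(out)

corpus = "the cat sat on the mat and the dog sat on the rug"
table = train(corpus)
print(generate(table, "the", 8))
```

The dispute in this thread is essentially whether conditioning on a long window of prior tokens, as GPT does, still counts as "a Markov chain with a very large state" or is something qualitatively different.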
Willfully misunderstanding your interlocutor seems ungracious, you apparently know stochastic is an adjective, so why did you try to use it as a noun?
GPT-3 is conditioned on the entire input sequence as well as its own output, which is strictly NON-MARKOVIAN. In fact, the point in saying something is Markovian is exactly that: the state transition probability only depends on the current state.
Well, yes. And apples aren't oranges, but they share a lot of similar traits.
"Given a prompt, provide a completion" is what a Markov chain does. GPT-3 is exactly the same, in the sense that apples and oranges both satisfy your hunger.
Depends, I think. Generally, I think an unsuspecting person who plays the song believing they bought it legally would not be held accountable for accidental copyright infringement. But if they knew the song was copyrighted and not legally licensed by the piano manufacturer, and they assembled a bunch of friends to listen to the songs, or resold them for money, then yeah, perhaps. Most importantly in context, though, the piano will almost certainly not be fined or taken to court. ;)
Even if hypothetically there was such a strange bug in your piano and you decided to exploit it by recording copyrighted music and redistributing it, you would be accountable for it, not a piano.
This analogy train has gone too far, don't you think? All the examples that I've seen on Twitter require quite intentional manipulation by a human for Copilot to produce something copyrighted. It does not recite Linux code at the press of one key.
If you have an electronic piano that requires a complex series of button pushes to produce copyrighted music, that's still a copyright violation. Copyright law has no notion that the difficulty of reproducing copyrighted content affects the fact of a violation.
> an electronic piano that requires a complex series of button pushes to produce copyrighted music
Surely a judge presented with the "complex series of button pushes," otherwise known as playing an instrument, would hold the player accountable for any infringement and not the piano?
These analogies have gone so far off the rails that I can't tell which side this thread is arguing for by now ;)
I think the whole swirling discussion is a little confused because there are potentially two "ends" where infringement could happen, and different people are talking about each. And the article covers both.
One end is GitHub's, at the input: Copilot's "database" was initialized from code that GitHub does not have copyright to. The contention at this end is that they are ignoring the licenses that would grant them the right to use that code.* The article, GitHub, and others assert that there's no copyright issue for creating a database of this kind (a machine learning model).
The other end is the developer taking Copilot's output. The article seems to take the (absurd, IMO) position that there are also no copyright implications here, because the output is not copyrightable at all.
*And personally this is the side that concerns me most.
"Copilot is a commercial product, therefore, Github can only use code that the owners make available for commercial use."
IANAL, but this doesn't sound quite right. There is a difference between "using" code (running it in a commercial product) and manipulating it as arbitrary data within a commercial product.
It definitely can be a gray area, but let's say I use Amazon's service where I email a PDF to my Kindle: is it Amazon's responsibility to know the copyright status of the PDF, or mine? In both cases a commercial product is manipulating copyrighted data for the benefit of a user.
I'll give the best example, the one task that off the top of my head that I would like some AI help with.
I would really like to replicate the functionality of Java's SSLEngine, but for C#.
If I used Co-Pilot to help, at best, I would need to pay for a legal team to do some form of 'clean room' review of whatever was generated to make sure it did not infringe on the OpenJDK code that is out there. At worst, I would be having to defend myself from Oracle's legal team -anyway-.
And yeah, I'm assuming in this case that Copilot would be 'smart' enough to make the right inferences from that Java code and put it into a workable C# construct. Stepping back, though, one could still ask the question: what's the risk of a Java developer accidentally matching some OpenJDK code a little too closely? There's an order of magnitude difference between even a smaller AGPL developer and Oracle.
If Microsoft/GH was willing to go to bat and agree to pay for the defense of users of Copilot, I would be far less concerned with the implications of all of this.
> Stepping back, though, one could still ask the question: what's the risk of a Java developer accidentally matching some OpenJDK code a little too closely? There's an order of magnitude difference between even a smaller AGPL developer and Oracle.
It would be extremely interesting to know how much accidental and non accidental code infringement happens and in what proportion of those cases go to the courts. I would guess that both cases happen utterly constantly and it is only a tiny minority of those cases where legal action is taken. If that's the case right now, then nothing has changed with this tool except the possibility of playing hot potato with liability when those few cases that do happen make it to courts. Even if the developer actually wrote the code that infringed, copilot could make a useful scapegoat and every case will have plausible deniability if copilot lacks really good explainability.
If I use Copilot and it suggests a large block of GPL2'ed code for my project, which I then include, then that is a GPL2 license violation.
Whether the GPL2 will hold up in court, or whether the courts will uphold this specific case (e.g. can you prove intent? Do you need to?), is a separate issue entirely.
The next question is, can I use GPL'ed code in my product and then claim that it was injected by Copilot to avoid repercussions of my actions if caught?
The claim (which I'm not qualified to judge) is that this use falls under fair use. The point of fair use is to allow some use of copyrighted works even if the copyright owner does not license it to you and even if the owner is explicitly hostile towards your usage. If it is indeed fair use, then the license doesn't matter because that's not the thing that's allowing you to use the work.
The proprietary model is a representation of lots of harvested open source code snippets. Without the model copilot is nothing. Arguably, the code snippets are part of the product....
Your example doesn't quite match what's happening in real life, though. You're not "using Copilot as a mechanism to ferry around code". Co-pilot is making recommendations for what code to use and then also giving that exact code (the text) to you. A more apt example would be if Amazon had some UI which said "What kind of book do you want to read on your Kindle?", you clicked the button labeled "biography", and then Amazon sent your Kindle an AI-generated book which is the biography of a famous person, and it just so happened that the "generated" book being sent to you was an exact copy of someone else's book (or incorporated exact copies of chapters/paragraphs of someone else's book), legal disclaimers and all.
Github (and by extension Microsoft) is gambling on the fact that their license agreement granting them a license to the code
This is incorrect. First of all, GitHub isn't even the people building the model. It's built by OpenAI, which has none of these licenses. Secondly, the model is not built purely from GitHub data. OpenAI is relying on fair use, not on a specific license.
Effect of the use upon the potential market for or value of the copyrighted work: Here, courts review whether, and to what extent, the unlicensed use harms the existing or future market for the copyright owner’s original work. In assessing this factor, courts consider whether the use is hurting the current market for the original work (for example, by displacing sales of the original) and/or whether the use could cause substantial harm if it were to become widespread.
If your view of the law is correct then programming is illegal, because who among us has not read copyrighted code and used it to train our biological neural network? I suspect your view of the law is not correct.
Your analogy looks alien to me. Can you show us the link between a human and a computer program? WTF is "biological neural network"? Can you quote a law?
Copyright is not an indefinite ability to control a work. You can't for instance stop me from lending or leasing my rightfully acquired copies of your works, nor from making short quotations for criticism, nor teaching a class about it, etc.
Particularly in the context of the occasional, unintentional reproduction of short snippets that likely need adaption to the rest of the code they are inserted into I suspect courts are unlikely to find more than de minimis, unactionable, infringement.
Right, but Copilot is not creating a piece of critical media about the code it is ingesting (IANAL, but my understanding is that this gets interpreted pretty strictly). I don’t think the normal Fair Use classes apply here.
Even if the code isn’t being copied verbatim, it feels like the spirit of these licenses is being violated, although I don’t know if that’s enough to get anywhere in court. But if the code is in fact being copied (like that Quake example) then the license is definitely being violated.
But I feel like there’s too much analysis in these comments of whether a current law is being broken, and not enough thought about what will happen if licenses like the GPL can no longer keep intellectual property free. Open source licenses are part of the foundation of this community, and we’ll be much worse off without them. We really need a way to prevent this kind of IP laundering, and if current laws won’t do it, then we need new ones.
> You can't for instance stop me from lending or leasing my rightfully acquired copies…
I mean in this specific example can’t I though? Yes the first sale doctrine applies to certain kinds of works but not every work and specifically not software. I absolutely can grant you a single non-transferable license to use my software.
> Any open-source code available on Github is controlled by the copyright notice of the owner granting specific rights to users.
and by the GitHub ToS:
> You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.
I would consider Copilot to not be part of ”the Service”[1], but at least currently[2] the definition of ”the Service” is so vague as to include anything that Github does.
Maybe they consider Copilot to be a ”search index” and the suggestions ”[sharing] [Your Content] with other users”.
[1] Since, as I understand it, it will require separate payment.
[2] The ToS is currently last edited 2020-11-16, and does not contain the word ”Copilot”
Then you would be the one infringing their copyright, and they could probably sue you.
Although I'm curious about what GitHub would do if the original author asked them to remove the work from Copilot. Retrain from scratch every month or so, to remove last month's DMCA'd content?
I doubt that's an interpretation GitHub wants people to make, since it would mean that tons of major projects need to be removed from GitHub. Basically all projects that are older than GitHub.
> Github (and by extension Microsoft) is gambling on the fact that their license agreement granting them a license to the code in exchange for access to the platform supersedes the individual copyright notices attached to each repo.
The person who has the account on Github and uploads code to them rarely owns the copyright on all of the code, and therefore doesn't have the right to delegate to Github any further licensing permission.
Legally I believe that's an infringement by the person who uploaded the code rather than by GitHub, so long as GitHub deals with takedown requests in a timely manner as per the DMCA.
Furthermore, as described in the article, the legal precedent has been that you don’t actually need copyright to something to train a model on it. You may think that’s silly or inconsistent, but that’s how the legal precedent is.
> that’s the person who uploaded the code’s infringement rather than GitHub’s
That's not how it works. Anyone and everyone who distributes it is infringing and carries risk of enforcement action. That could also be someone further downstream.
> Furthermore, as described in the article, the legal precedent has been that you don’t actually need copyright to something to train a model on it.
If Copilot is infringing copyright by reproducing small samples of the training data, and if we agree that that isn't acceptable, doesn't that effectively spell the end of the road for any and all AI generated content unless the developers explicitly stop their product reproducing data that matches the data it was trained on? That seems like it would have far reaching consequences for AI as an industry.
> This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service
> If you set your pages and repositories to be viewed publicly, you grant each User of GitHub a nonexclusive, worldwide license to use, display, and perform Your Content through the GitHub Service and to reproduce Your Content solely on GitHub as permitted through GitHub's functionality (for example, through forking).
> You may grant further rights if you adopt a license.
Not everything on Github was uploaded by the copyright holders. Often enough, it's uploaded by people who only have access to it under an open source license, so Github cannot in general squeeze additional license terms out of the uploader at that point.
There's more than redistribution happening here. Co-pilot is providing a value-add service where the open-source code is an input and the output is a service. As it happens, the service is actually regurgitating the code at this point, but it's important to consider that even if it didn't regurgitate the code verbatim, the fact that the service is making use of that code to provide a value-add means the code is a crucial input to the value proposition. Would Co-pilot be able to provide the value-add without the source? Likely not.
Couple that with the fact that, presumably at some point in the future, Co-pilot will come attached to a subscription model (otherwise why do it in the first place?), and we have the makings of a product that is commercially infringing on copyright left, right, and center.
I'm not a lawyer so it's entirely possible I used the wrong term. Thank you for clarifying below.
Using the terms as you explained them below, I meant that Microsoft/GitHub has permission to reproduce the code so why wouldn't that extend to copilot?
Are they displaying the license under which said code is licensed when they display a chunk of licensed code? If not, then they're violating the terms of most licenses (except pure public domain, or other similar licenses which don't have any such requirements attached).
The use of licensed code in other projects must be done under the terms of that license or you aren't legally (under copyright law) allowed to use the code.
> Are they displaying the license under which said code is licensed when they display a chunk of licensed code? If not, then they’re violating the terms
The GitHub TOS is a license that is separate from the license in the code. It is legal and common for an author to license the same code multiple ways (and the licenses do not have to agree with each other). By agreeing to GitHub’s TOS and uploading code to GH servers, people are licensing GH to display the code, because the license agreement says so explicitly. This could be problematic if someone uploads code they don’t have the rights to upload, but then the violation is the uploader’s, and not GitHub’s.
Additionally, GH has a provision for already licensed code in section D.6:
“6. Contributions Under Repository License
“Whenever you add Content to a repository containing notice of a license, you license that Content under the same terms, and you agree that you have the right to license that Content under those terms. If you have a separate agreement to license that Content under different terms, such as a contributor license agreement, that agreement will supersede.
“Isn't this just how it works already? Yep. This is widely accepted as the norm in the open-source community; it's commonly referred to by the shorthand "inbound=outbound". We're just making it explicit.”
As I said, I'm not a lawyer, but I believe they're displaying it under the terms of the GitHub ToS, using rights granted to them when the project is uploaded to GitHub, not under the terms of the license the project uses for everyone else.
Reproduction is enough to cover the first part of your use case. This is mentioned in GitHub's TOS.
For the latter you would need redistribution, as it is going into a different product for which you claim ownership, and with possible modifications/adaptations (this would depend on the rights granted by the license). Nowhere in GitHub's TOS is the word or concept of redistribution referenced.
So, the answer to your original question is "no".
Edit: leereeves modified their comment after I wrote this, so it may not make much sense, but you can figure out the point. Best!
I’m not sure this is a completely fair take, I think the original question is legitimate and relevant. Github’s TOS does in fact ask the contributor to grant a license for GH to host and serve their code from GH servers. That is both reproduction and distribution as defined by copyright law, and copyright covers both of those at the same time https://www.copyright.gov/what-is-copyright/
(Edit and BTW GH calls out their ‘distribution’ in section D.4 of their TOS explicitly, but without using the word “distribute”. They say you grant them the right to “publish” and “share” code you upload, which means “distribute” under copyright law. They also imply that by spelling out the terms under which they do not “distribute”, which is anytime the content is used outside of GitHub’s services.)
I don’t think you’re correct that the term “redistribution” means either going into another product, nor that it implies a claim of ownership. Putting works into another product is sometimes known as making a derivative work, while “redistributing” is quite commonly used to mean copy-and-distribute as-is. Redistribution can happen via license as well, it requires permission by the copyright owner, but does not imply the redistributor is (or is claiming to be) the copyright owner.
>I think the original question is legitimate and relevant
You didn't see the original question; it was edited, so we cannot discuss it further.
"[...] which means “distribute” under copyright law" <-- Citation needed please, because I don't think that's correct.
From the site you linked:
"Distribute copies or phonorecords of the work to the public by sale or other transfer of ownership or by rental, lease, or lending."
What I seem to grasp about the difference between reproducing and redistributing is that it has to do with the concept of "transfer of ownership". Also, derivative work and redistribution are not mutually exclusive.
The moment you create a new thing and start distributing it (even if you do not modify it), you become the de facto owner of that new product, and copyright law is trying to limit the extent of the rights that apply there. So, in the case of music, it's a different thing to play (reproduce) a song than to create a new album with your favorite artists that happens to include that particular song (redistribution).
> "Distribute copies or phonorecords of the work to the public by sale or other transfer of ownership or by rental, lease, or lending."
> What I seem to grasp about the difference between reproducing and redistributing is that it has to do with the concept of "transfer of ownership". Also derivate work and redistribution are not mutually exclusive.
What you've misunderstood is it is the copies that are sold, not the copyrights.
* edit
> create a new album with your favorite artists that happens to include that particular song (redistribution).
This is not what redistribution means. You seem confused about this word.
> Sorry, I'm not following you anymore. I don't even know what you mean by that sentence.
The transfer of ownership you referred to is a transfer of ownership of a copy, it is not a transfer of ownership of the original work itself. You misunderstood the passage you quoted to mean that redistribution is transferring ownership of the work itself, as in copyright ownership of the work. But the text you quoted is only talking about transferring ownership of the copies. The text you chose makes more sense in the context of physical copies of books or "phonorecords".
Copyright is meant to protect original/authentic/unprecedented expressions, disregarding the medium where they may exist. So I don't really get your point in trying to make distinction between a copy or a "master"(?) or whatever.
What's at stake is the originality of the expression and what kind of rights does somebody else (i.e. everyone but the creator) have (or not!) over it.
Can I make copies of this original expression? (y/n)
Can I use this in a new product of my own? (y/n)
Whether something is already a copy or not does not really change the extent of the rights that you have (unless it's explicitly stated in the license, of course).
I can see that, and I mean no disrespect, but you shouldn’t have attempted to comment on this topic authoritatively and police what others say without understanding it.
> I don’t really get your point in trying to make distinction between a copy or a “master”(?) or whatever.
You have clearly and repeatedly demonstrated that you don’t understand what “distribute” and “redistribution” means to copyright law. You claimed others were confused about it and that using the word “redistribute” was incorrect, when in fact it was fine and correct.
I’m trying to help you understand that redistribution is a term that is talking about what happens to copies of a work. The sentence you quoted, and the “transfer of ownership” that you said you grasp only have to do with transferring copies, and nothing else.
The main point here is that when GitHub shows you code, it is transferring a copy to you. That is what GitHub calls “publish” and what copyright law calls “publication”, and by publishing they mean redistribution (because the copyright legal code says so).
> that's ("create a new album with your favorite artists that happens to include that particular song (redistribution)." exactly what redistribution entails
No, it isn't. You're wrong. Redistribution, or just distribution, in copyright law is plainly and simply making copies of a work available to other people. It does not mean anything more than that, and it does not transfer ownership of anything other than the copy you distribute.
Again, distribution has to do with a transfer of ownership. In layman's terms, GitHub can show your code to others but it cannot give (as in ownership) your code to them. It's a bit tricky here, since on the web showing something literally means making a copy at some point, but try to view things in the light of "who owns what" and it's a bit easier to grasp.
If you browse through someone's repository, it's pretty clear who the owner of that code is; if a program gives you a chunk of code that it "got from somewhere", there's definitely some sort of change-of-ownership operation going on, which in this case is interesting, as the code went from attributed to someone to missing/unknown.
> Again, distribution has to do with a transfer of ownership
You're mixing sub-threads here, but you're still confused. Distribution is a transfer of ownership of a copy, it does not grant copyrights or ownership of the work. You can buy a book that was distributed, and that does not give you the right to make copies of the book.
> Github can show your code to others but it cannot give (as in ownership) your code to them.
In the digital world, showing is "distributing", and copyright law is clear about this.
You should perhaps read the definitions that are in the copyright law itself, and try to understand them:
"“Publication” is the distribution of copies or phonorecords of a work to the public by sale or other transfer of ownership, or by rental, lease, or lending. The offering to distribute copies or phonorecords to a group of persons for purposes of further distribution, public performance, or public display, constitutes publication. A public performance or display of a work does not of itself constitute publication.
To perform or display a work “publicly” means—
(1) to perform or display it at a place open to the public or at any place where a substantial number of persons outside of a normal circle of a family and its social acquaintances is gathered; or
(2) to transmit or otherwise communicate a performance or display of the work to a place specified by clause (1) or to the public, by means of any device or process, whether the members of the public capable of receiving the performance or display receive it in the same place or in separate places and at the same time or at different times."
You make some good points, here and on the other comments, so I'm not arguing against you.
>In the digital world, showing is "distributing" [...]
I guess it has to do with how copyright law adapts to the specific circumstances of this particular case. I guess we won't get an answer until a judge justifies some sort of resolution on either side.
My take is that:
* GH showing you some source code on their website is akin to reproduction, even though, of course, a binary copy of the code was made and transmitted to your local browser in order to be displayed.
while
* GH taking chunks of code from here and there, and making them available in a new product of which they claim ownership (or the final user does, or whatever), is more akin to the physical concept of redistribution.
> GH showing you some source code on their website is akin to reproduction
This is clearly and unambiguously defined as “publication” in the copyright law, where “publication” is defined as distributing copies. (And GitHub’s TOS also calls showing you code “publish”).
There is nothing to wait for, and the law and many court cases have already established clear definitions and precedent on these terms. You just got stuck on the wrong idea, it happens, it’s okay, but if you are curious about copyright and interested in discussing it here, it will certainly help to improve your understanding of the terminology.
> GH taking chunks of code from here and there […] is more akin to the physical concept of redistribution
No, this is still just wrong. You’re talking about derivative works, which is also defined in the copyright legal code. There is no such physical concept of mixing and matching that is called “redistribution” in legal terms. I’m not sure where that idea came from, it might make sense to you or in some narrow contexts, but generally speaking and specifically wrt copyright law, distribution has nothing to do with whether you sample a work nor whether you make a new work out of old works.
Yes (as you now know), GitHub’s terms require users uploading code to agree to GitHub being able to both redistribute (“publish”) and reproduce (“copy” / “store”) their code.
The “Terms” link on the copilot page goes directly to GitHub’s TOS, so yes the terms are one and the same.
This question is interesting and I’ll try to help turn the downvotes around, but it might be too late. Anyway, when users agree to allow their code to be “published” by GitHub, they are allowing it to be both copied and distributed. The TOS also says (note the indexing/analysis comment) “This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.”
The part where GitHub might have trouble (I speculate) is that their TOS doesn’t discuss derivative works, and the input code to copilot could have licensing terms on derivative works that get scrubbed out by copilot. OTOH, if copilot were to guarantee that a chunk of code never resembled one of the original inputs it may be legal to create derivative works from samples under fair use.
I'm thinking it's not so much what is legal for Copilot to do with code chunks from GPL'ed code, but what it means for end users (i.e. developers at for-profit companies) to incorporate those chunks into commercial products.
There were two parts to the argument which seem to hold water:
1. Any code generated by Copilot is likely to be AGPL-licensed.
2. Since the authors of Copilot used the Copilot beta to make the Copilot release, Copilot is very likely using AGPL-licensed code and is therefore in breach of the AGPL license.
Julia Reda's analysis depends on the factual claim in this key passage:
> In a few cases, Copilot also reproduces short snippets from the training datasets, according to GitHub’s FAQ.
> This line of reasoning is dangerous in two respects: On the one hand, it suggests that even reproducing the smallest excerpts of protected works constitutes copyright infringement. This is not the case. Such use is only relevant under copyright law if the excerpt used is in turn original and unique enough to reach the threshold of originality.
That analysis may have been reasonable when the post was first written, but subsequent examples seem to show Copilot reproducing far more than the "smallest excerpts" of existing code. For example, the excerpt from the Quake source code[0] appears to easily meet the standard of originality.
The excerpt from the Quake code is literally one of the most famous functions out there. It is no wonder that it was reproduced verbatim. The share of such code, according to GitHub, is really small.
It would be quite straightforward to write an additional filter that would check the generated code against the training corpus to exclude exact copies.
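As a rough illustration of what such a filter could look like (a brute-force sketch I'm inventing here, not anything GitHub has described; a real system would need hashed n-gram indexes since the corpus is enormous), you could flag any generated snippet that shares a long-enough run of tokens with a training file:

```python
# Minimal sketch of a "reject exact copies" filter: flag generated code
# that shares any n consecutive whitespace-separated tokens with the
# training corpus. Names and threshold are illustrative assumptions.

def ngrams(tokens, n):
    """All length-n token windows of a token list, as a set."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def looks_copied(generated, corpus, n=6):
    """True if any n consecutive tokens also appear in a training file."""
    gen = ngrams(generated.split(), n)
    return any(gen & ngrams(doc.split(), n) for doc in corpus)

corpus = ["float q_rsqrt ( float number ) { long i ; float x2 , y ;"]
assert looks_copied("float q_rsqrt ( float number ) { long i ;", corpus)
assert not looks_copied("def hello ( ) : print ( 'hi' )", corpus)
```

Picking the window size n is the hard policy question: too small and everything matches boilerplate, too large and near-verbatim copies with renamed variables slip through.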
But the fact that it did that at all should be proof that Copilot is, in fact, copying and pasting rather than actually learning and producing new things using intelligence.
This is a code search engine with the ability to integrate search results into your language syntax and program structure. The database is just stored in the neural network.
It’s definitely an impressive and interesting project with useful applications, but it’s not an excuse to violate people’s rights.
This is all just computational statistics. Why in the world would you invoke ill-defined anthropocentric terminology like "intelligence"? Of course a statistics program isn't "using intelligence".
But it's also not exactly just a database. It contains contextual relationships as seen with things like GPT that are beyond what a typical database implementation would be capable of.
> But it's also not exactly just a database. It contains contextual relationships as seen with things like GPT that are beyond what a typical database implementation would be capable of.
You mean in the same way that google.com isn't "just a database"?
If Copilot isn't intelligent, then what makes it more special than a search engine? How is Copilot not just Limewire but for code?
I could understand the argument that, if Copilot really is intelligent or sentient or something like that, then what it is producing is as original as what a human can produce (although, humans still have to respect copyright laws). However, I haven't seen anyone even attempt to make a serious argument like that.
It can produce code snippets that were never seen before by generating fragments from various sources and combining them in a new way. This makes it different from a search engine, which only returns existing items.
Is it producing code (by which I mean creating/inventing new code by itself), or is it just combining existing code? Because to me it seems like the latter is a more appropriate description.
* AI searches for code in its neural-net-encoded database using your search terms (ex: "fast inverse square root")
* AI parses and generates AST from the snippet it found
* AI parses and generates AST from your existing codebase
* AI merges the ASTs in a way that compiles (it inserts snippet at your cursor, renames variables/function/class names to match existing ones in your program, etc)
* AI converts AST back into source code
Is AI intelligently producing new code in that example? Because I don't think it is.
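To make the hypothetical concrete, the retrieve-and-rename core of that pipeline (skipping the AST steps, and emphatically not a claim about how Copilot actually works) could be as dumb as:

```python
# Hypothetical "retrieve and adapt" pipeline from the comment above.
# The snippet store and function names are invented for illustration.
snippets = {
    "fast inverse square root": "def rsqrt(number):\n    return number ** -0.5\n",
}

def retrieve_and_adapt(query, rename):
    code = snippets[query]           # 1. look up a stored snippet
    for old, new in rename.items():  # 2. rename identifiers to fit the
        code = code.replace(old, new)  #    user's existing codebase
    return code                      # 3. splice result at the cursor

adapted = retrieve_and_adapt("fast inverse square root", {"number": "x"})
```

A program like this plainly copies; the open question in the thread is whether a neural network that often produces the same output is doing something meaningfully different.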
What would be an interesting test of whether it can actually generate code is if it were tasked with implementing a new algorithm that isn't in the training set at all, and could not possibly be implemented by simply merging existing code snippets together. Maybe by describing a detailed imaginary protocol that does nothing useful, but requires some complicated logic, abstract concepts, and math.
A person can implement an algorithm they've never seen before by applying critical thinking and creativity (and maybe domain knowledge). If an AI can't do that, then you cannot credibly say that it's writing original code, because the only thing it has ever read, and the only thing it will ever write, is other people's code.
But you seem to have a basic fundamental misunderstanding of what is going on inside the NN. There is no "search for code" - it is generating new code each time, but sometimes that code will be the same as something it has seen because there is little or no variation in the training data for that snippet.
The NN generates code token by token, conditioned on the code leading up to it (and perhaps the code ahead, similar to BERT).
If you see tokens like this you probably generate the same next token too:
for i in range(1,10)
You have conditioned your input on the code you have seen and the most likely token you produce is ":".
That's what the NN does, but for much longer range conditioning.
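The conditioning idea can be sketched with a toy next-token model (a bigram counter over a two-line "corpus" I'm making up here; a transformer conditions on far longer context, but the generate-one-token-at-a-time loop is the same idea):

```python
from collections import Counter, defaultdict

# Toy next-token predictor: count which token follows each token in a
# tiny training corpus, then continue greedily. This is a bigram model,
# not a transformer; it only illustrates conditioned generation.
corpus = [
    ["for", "i", "in", "range", "(", "1", ",", "10", ")", ":"],
    ["for", "j", "in", "range", "(", "0", ",", "5", ")", ":"],
]

follows = defaultdict(Counter)
for tokens in corpus:
    for prev, nxt in zip(tokens, tokens[1:]):
        follows[prev][nxt] += 1

def next_token(prev):
    # Greedy decoding: emit the single most likely continuation.
    return follows[prev].most_common(1)[0][0]

print(next_token(")"))  # → ":" — the only continuation ever seen
```

When the training data contains only one continuation for a context, "generation" and "reproduction" are indistinguishable, which is exactly the point being argued.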
I am not educated on this matter, but I have to ask for your clarification. Would that not be just a pre-emptive lookup? Akin to keeping a cache of "results" per input token that are essentially memorized and regurgitated?
Sounds like there is still a db lookup, just not at runtime and instead at build time of the NN. Can you clarify this please?
GPT-3's raw output is "logits", or indexes into an encoding space. The encoding space contains individual tokens; for generation, it would be words, or even word pieces. The pieces are as small as "for", or "if". Constructing code from an embedding space, even if it is more specialized, is like constructing sentences by using a dictionary -- it is a lookup table, but it's not a database. Generation works by looking at the existing document (or portion), and based on what is already present, generating a token. Then repeating until some condition is met (such as length, end of sentence, something else).
The issue here is that certain sentences (code segments) are memorized, and reproduced -- much like a language learner who completes every sentence which begins with "Mi nombre" with the phrase "Mi nombre es Mark". The regurgitation is based on high probability built into the priors, not an explicit lookup. A different logit sampling method (instead of taking the likeliest) reduces regurgitation, without changing anything else about the network. (It also makes nonsense happen more often, since nonsense items are inherently less likely!)
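The effect of the sampling method can be shown with made-up logits over a three-token vocabulary (the numbers are illustrative, not from any real model):

```python
import math
import random

# Hypothetical logits over a tiny vocabulary, showing how the sampling
# strategy (not the network itself) controls how often the likeliest,
# i.e. "memorized", continuation is emitted.
vocab = [":", ";", "{"]
logits = [4.0, 1.0, 0.5]

def softmax(xs, temperature=1.0):
    exps = [math.exp(x / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sample(probs):
    # Draw one token according to the given probabilities.
    r, acc = random.random(), 0.0
    for tok, p in zip(vocab, probs):
        acc += p
        if r < acc:
            return tok
    return vocab[-1]

greedy = vocab[logits.index(max(logits))]          # always ":"
varied = sample(softmax(logits, temperature=2.0))  # sometimes ";" or "{"
```

Greedy decoding always regurgitates the top token; raising the temperature flattens the distribution, trading verbatim repetition for occasional nonsense, with no change to the network's weights.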
That isn't even necessary. I've been exploring GPT-3 for a while and it is completely incapable of any reasoning. If you enter short unique logical sentences like "Bob had 5 apples, gave 2 to Mary, then ate the same amount. How many apples does Bob have left?", no matter how many previous examples you give it (to be sure it gets the question), it gets it wrong. It is simply incapable of reasoning about what is going on.
> How do you differentiate between these two things?
That's a contrived example because none of those lines could be protected by copyright, patents, etc. A better example might be if you started selling a 30 minute movie that was just the first 15 minutes of Toy Story spliced together with the last 15 minutes of Shrek. I'm not a lawyer, but I'm pretty sure that would qualify as a derivative work, meaning you're potentially infringing on someone's rights (unless they've given you permission/a license).
And to be clear, none of these problems are new. People have been fighting over copyright and it's philosophy in court for a very long time. The only thing that's different here is that it seems some people think it's ok to ignore copyright if you use Copilot as a proxy for the infringement.
> As an aside, your understanding of how the model works here is completely wrong. Like, just absolutely fundamentally completely wrong.
Of course I don't, it's a neural network. You don't know either. That example I posted could be exactly what it's doing, or not even close.
(although for the record, I wasn't trying to explain how copilot works in that comment. It was a hypothetical "AI" for the sake of discussion, not that it matters. My point about it being copyright infringement is the same even if that hypothetical implementation is wrong)
> Of course I don't, it's a neural network. You don't know either. That example I posted could be exactly what it's doing, or not even close.
What is this supposed to mean? We know how neural networks generate things like this very very well.
I personally have built a system that feeds pictures of hand-drawn mobile app layouts into a NN, then generates a JSON-based description file that I compile into a React Native and/or HTML5 layout file.
This was trivially easy in 2018 when I did it. It took me maybe 2 weeks of engineering time, and I'm no genius. Our understanding of how transformer-based NNs work has come a long way since then, but even back then it was easy to show how conditioning on different parts of the image would generate different code.
> That's a contrived example because none of those lines could be protected by copyright, patents, etc.
Well no. The question I'm asking about, the philosophical distinction between "producing" or "combining" is a valid question no matter the copyrightability of anything. It's an interesting philosophical question even if we presume that copyright is bubkis.
> It was a hypothetical "AI" for the sake of discussion, not that it matters.
Ah, my mistake. I see that now.
> Of course I don't, it's a neural network. You don't know either.
I may not know how to make a lightbulb, but I do know hundreds of ways not to make one ;)
> A person can implement an algorithm they've never seen before by applying critical thinking and creativity (and maybe domain knowledge). If an AI can't do that, then you cannot credibly say that it's writing original code, because the only thing it has ever read, and the only thing it will ever write, is other people's code.
This doesn't hold at all. Not many people can come up with an original sorting algorithm for example, but people write code all the time.
The fact that it reproduces code verbatim, including comments and even swear words, means it is definitely copying some of the time.
Does it copy all the time? Doesn't matter. Plagiarism is plagiarism, regardless of whether it is done by a student in school, an author, a monkey or an "AI".
You wouldn't accept this from a student, you shouldn't accept it from a coworker (unless you are releasing under a compatible license), and of course you shouldn't accept it from Microsoft.
Co-pilot produces original code (as in code that has never been written before). It's not just combining snippets.
This should surprise no one who has seen the evolution of language models. Take a look at Karpathy's great write-up from way back in 2015[1]. It generates Wikipedia syntax from a character-based RNN. It's operating on a per-character basis, and it doesn't have sufficient capacity to memorize large snippets. (The Paul Graham example spells this out: 1M characters in the dataset = 8M bits, and the network has 3.5M parameters.)
Semantic arguments about "is this intelligence?" I'll let others fight.
That it can combine snippets doesn't mean the OP's understanding of how the system works is correct.
They seem to believe it is a database system. That's really not how this works, and the fact that it sometimes behaves like one disguises what it is doing.
If I say "write a "for" loop from 0 to 10 in Python" probably 50% of implementations by Python programmers will look exactly the same. Some will be retrieving that from memory, but many will be using a generative process that generates the same code, because they've seen and done similar things thousands of times before.
A neural network is doing a similar thing. "Write quicksort" makes it start generating tokens, and the loss function has optimised it to generate them in an order it has seen before.
It's probably seen a decent number of variations of quicksort, so you might get a mix of what it has seen before. For other pieces of code it has only seen one implementation, so it will generate something very similar. There could be local variations (eg, it sees lots of loops, so it might use a different variation) but in general it will be very similar.
But this isn't a database lookup function - it's generative against a loss function.
This is a subtle distinction, but it is reasonable to expect that people on HN understand it.
> Some will be retrieving that from memory, but many will be using a generative process that generates the same code, because they've seen and done similar things thousands of times before.
How are these not both the exact same process of memory recollection? Can you elaborate on the difference between memory recall vs a generative process based on conditioning? I understand how these two are different in application, but not understand why one would say they are fundamentally different processes.
Analogies start to break down once we are talking at this detailed level.
The best I can come up with is this:
Imagine you are implementing a system to give the correct answer to the addition of any two numbers between 1 and 100.
One way to implement it would be to build a large database, loaded with "x" and "y" and their sum. Then when you want to find out what 1 + 2 is you do a lookup.
The other method is to implement a "sum" function.
Both give the same results. The first process is a database lookup; the second is akin to a generative process, because it's doing a calculation to come up with the correct result.
This analogy breaks down because a NN does have a token lookup as well. But the probabilistic computation is the major part of how a NN works, not the lookup part.
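The two implementations in the analogy can be written out directly (a toy sketch of the analogy itself, not of any NN internals):

```python
# Two ways to answer "what is x + y?" for 1 <= x, y <= 100.

# 1. Database-style: precompute every answer, then look it up.
table = {(x, y): x + y for x in range(1, 101) for y in range(1, 101)}

def add_lookup(x, y):
    return table[(x, y)]

# 2. Generative-style: compute the answer on demand.
def add_generate(x, y):
    return x + y

# Indistinguishable from the outside, entirely different inside.
assert add_lookup(1, 2) == add_generate(1, 2) == 3
```

Both return identical answers for every input in range, which is why observing outputs alone (verbatim snippets included) can't settle whether the system "is a database".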
Perhaps it’s not so different from a search engine like Google. The article cites Google’s successful defence, under US copyright law, of its practice of displaying ‘snippets’ from copyrighted books in search results. There is a clear difference between this and the distribution of complete copies on LimeWire.
If you look at it this way, your brain is also "just" computational statistics. (Or to be precise, it might be, since we don't yet know in all the details how it works.)
Hint: it has been said hundreds of times since the advent of computer science that the brain is "just" [some simple thing that we already understand]. That notion has never once helped us in any way.
A common tech-bro fallacy. We understand exactly what is happening at the base level of a statistics package. We can point to the specific instructions it is undertaking. We haven't the slightest understanding of what "intelligence" is in the human sense, because it's wrapped up with totally mysterious and unsolved problems about the nature of thought and experience more generally.
The fallacy is the god-of-the-gaps "logic" of assuming there's some hand-wavey phenomenon that's qualitatively different from anything we currently understand, just because reality has so much complexity that we are far from reproducing it. You're assuming there's a soul and looking for it, even though you don't call it that.
Intelligence is mysterious in the same way chemical biology is mysterious (though perhaps to another degree of complexity)... It's not mysterious in the way people getting sick was mysterious before germ theory. There's no reason to think there's some crucial missing phenomenon without which we can't even reason about intelligence.
> actually learning and producing new things using intelligence
People have been trying to accomplish that for 65 years. We're not even close. It's the software equivalent of cold fusion (with less scientific rigor)
I think it warrants investigating exactly how and when Copilot reproduces code, but using one example to write it off as just copy and pasting seems excessive.
Also when talking about rights, whether or not Copilot copies doesn't seem sufficient to make a call. For instance, if it has to be coerced by the programmer to produce these kinds of snippets in an obvious way, then it seems fine to lay the blame on the programmer similar to when using regular autocompletion (or copy+paste for that matter).
How do you know it's "just a code search engine"? Or "not AI" or "not learning and producing new things", or all the other claims people are making about it? All of these are essentially untestable statements.
It has memorized one thing. That doesn't prove it's not intelligent. If anything it's the other way around, we would expect an intelligent being to be capable of memorization.
All I can think of is the Turing test and the AI effect. Eventually we will have an AI that is capable of writing code indistinguishable from a human, and people will STILL say it's "not AI" and "just a code search engine", etc. Obviously this isn't there yet, but it's clearly getting closer.
> The excerpt from Quake code is literally one of the most famous functions out there. There is no wonder that it was reproduced verbatim.
The question that raises is: this was found because the function is so famous, but what if Copilot is repeating Joe Schmoe's weekend library project and we will never know, because it's not famous?
This doesn't stand on its own as a defense: perhaps the 10 inputs were legitimate copies of a single source. They could be forked repos that were properly following the original's license, for example.
Or 10 different GPL projects that legitimately share code that remains copyrighted and protected by the GPL. Or 10 obscure projects that illegitimately copied code but haven't been caught.
Clearly, "10 other people did it" is no defense at all.
It might not even be "10 other people". For projects which originated outside Github, it's common for multiple users to have independently uploaded copies of the project. There's probably at least 10 users who have pushed copies of the GCC codebase to Github, for example.
> at least 10 users who have pushed copies of the GCC codebase
That is "10 other people". (Although your point stands, since there doesn't (or at least shouldn't to the point of criminal espionage) be any strong impediment preventing one person from creating 10 different accounts.)
Would it? What would the threshold be? Twenty lines copied verbatim? Ten lines copied verbatim? What about boiler plate like ten #include statements at the beginning of a file? Or licenses in comments? What if someone has a one-liner that's unique enough to be protected by copyright?
Pretty sure if someone trained a code suggestion tool with Windows source, Microsoft would claim that a single similar character being the same is grounds for copyright infringement.
They are putting GPL code in non-GPL codebases. Is it okay to take sections of other people's source code and use them in yours, if you just got them as a suggestion?
The funny thing about the Quake function is, id Software is almost certainly not the origin of the code. They copied it from somewhere else, possibly added profane comments, then slapped GPLv2 on it. Did they even have the right to do that? From an IP absolutist standpoint, probably not.
> they copied the general idea of what the algorithms should do. Do not go down this line of reasoning
Too late, patents pick up where copyright ends, to protect general algorithmic ideas, not just implementations. And we have lots of patents on things that seem trivial now, including for-loops (just see how many patents depend on “a multiplicity”). Look - here’s a helpful lawyer’s template for including for-loops as a claim in your own patents: https://www.natlawreview.com/article/recursive-and-iterative...
True, correct. I didn’t intend for my comment to be interpreted as suggesting that a claim in a patent is the same thing as a whole patent, I was just pointing out a fun fact.
> Anyone who believes in a free and open society should do away with all copyrights and patents.
Free and open sound good to me! What do they mean exactly? I guess it’s a non-debatable fact that copyrights and patents are abused by many big companies and patent trolls, but doing away with the system does seem extreme, it has also protected deserving individuals on occasion, no? You are saying that it should always be legal to copy someone else’s code / inventions without giving them any credit or compensation?
> Anyone who thinks that licensing will have an effect on what is happening in reality is severely misguided.
I’m not sure I understand what you mean; lots of licensing activity does have a measurable effect on reality. This article is only a small example, but people get sued all the time over taking code and using it without licensing it.
> > They did not copy the implementation, they copied the general idea of what the algorithm should do
> [Citation needed]
(Not really your point, as such, but) no, actually, if you claim they did something (nominally[0]) wrong, the onus is on you to provide citations showing they did it[1].
0: > From an IP absolutist standpoint
1: Well, or that they (voluntarily and explicitly) accepted some responsibility (such as a job as a police officer) that entails a higher level of scrutiny than innocent-until-proven-guilty, but that's not really relevant here.
If you wrote an algorithm in the early 80s that did x+y+z
And then I saw your source code and in the late 80s I changed the variable names, function name, and logic to be x+y+z+0.1
And then I told my friend John that there's a super cool algorithm that adds numbers together, and he made some more changes to it and compiled it for a different platform...
Has anybody broken the law in your mind?
EDIT: because it would seem that the original authors (among them Cleve Moler) don't have any issue with what transpired
The GP's argument is that you don't have evidence that they didn't copy the whole function verbatim.
Is there a source that said they changed variable and function names and modified the logic?
> because it would seem that the original authors (among them Cleve Moler) don't have any issue with what transpired
Yet. Without an explicit license there is no basis to release it under the GPL (if the code was copied verbatim or had insufficient re-writing). What if the heirs of the copyright owner wanted to assert their rights? Is there a doctrine that if you don't assert your rights you lose them? (Presumably applies to trademarks, but I don't think this is the case for copyrights)
The source code in question is over 40 years old and most likely doesn't exist anymore in its original form.
What do we do then? The burden of proof for infringement is on the original authors, and they haven't asserted any claim for 40 years.
In the late 1700s and early 1800s, Britain had to take measures to prevent visiting Americans and others from memorizing the designs of its new high-tech machinery, like the steam engine and the power loom.
Where do we draw the line? Shut down the internet until we create a massive copyright detection firewall?
No, we live with the copying and constantly evolve and adapt our business. Death to all patent trolls.
I won't even claim that people must necessarily follow the law. Copyright law is inconsistent at best, and notoriously hard to follow to the letter (and often ridiculous). In practice lawyers assess the legal risk and weigh the outcomes.
I never intended to discuss what we should do, and I definitely did not propose shutting down the internet...
The original discussion was such:
> > They did not copy the implementation, they copied the general idea of what the algorithm should do
> [Citation needed]
You said the original authors did not complain, which is neither here nor there, as I pointed out. There is still some theoretical legal risk if you copy with the owner's knowledge but not express consent. The fact that the burden of proof is on the authors is true but that they have not brought a claim does not mean they cannot prove infringement.
And in case I haven't made it clear, I don't think it's a bad idea to assume the function is under GPL, I just don't think there's a basis for claiming what you originally claimed, and there is still some level of (probably acceptable) risk if you take the purported license of source code as-is.
It's not the actual copying of the idea, but the verbatim reproduction of the function, comments and all. I think people somehow thought that copilot could write code, and so verbatim reproduction was surprising to them.
A quick search shows that this snippet, including comments, is included in thousands of Github repos [1], so it's not surprising that the model learned to reproduce it verbatim.
It's such a famous snippet that it's even included in full on Wikipedia [2].
I wouldn't be surprised if the next version of Copilot filtered these out.
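For readers who haven't seen it, here's a from-memory paraphrase of what that snippet does, written from scratch rather than copied from the GPL'd Quake III source (the magic constant and structure are widely documented, including on the Wikipedia page linked above):

```c
#include <stdint.h>
#include <string.h>

/* A paraphrase of the famous "fast inverse square root" bit trick,
 * NOT the verbatim id Software code. The magic constant 0x5f3759df
 * seeds an initial guess by manipulating the float's bit pattern;
 * one Newton-Raphson step refines it to well under 1% error. */
static float fast_rsqrt(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);        /* reinterpret float bits as int */
    bits = 0x5f3759df - (bits >> 1);       /* crude initial approximation   */
    float y;
    memcpy(&y, &bits, sizeof y);
    return y * (1.5f - 0.5f * x * y * y);  /* one Newton-Raphson iteration  */
}
```

The irony, of course, is that even this "independent" rewrite is only trivially different from the original: there's essentially one way to express the trick, which is exactly why so many near-identical copies exist for the model to memorize.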
I would love to try a session of clean room reverse engineering using copilot. I would bet you get reasonably far for very common libraries with not much effort. The question would be if such compression/decompression would infringe copyright.
But that fast inverse square root example is particularly interesting because it is also a derivative work. Carmack did not invent it, and several variations of it had been passed around over time.
Algorithms should not be subject to copyright, that way lies madness. It would prevent new generations from building on top of the work of their predecessors, because copyright lasts a very long time. The amounts of code that github copilot reproduces fall squarely into the “shouldn’t be subject to copyright” domain for me, even if they pass the bar for originality.
Something which is a “derivative work” is still copyrighted. In fact, by definition, a “derivative work” is copyrightable. It’s the minimum threshold at which something, based on something else, gets its own, new copyright.
The algorithm is not copyrightable, but the source code of the function is. You could learn how the algorithm works by reading the function, and then write your own function implementing the same algorithm. Algorithms are not subject to copyright; source code is.
Copilot is not reproducing just the algorithm, it is spitting out large chunks of the copyrighted source code, verbatim.
The example you linked to is talking about a 16 line function from the Quake source. The Quake source is 167,594 lines in total (counting the C code only). Does that really fail to meet the standard for "smallest excerpt"?
That excerpt has its own Wikipedia page, of course it meets the threshold of originality. In any case, once you are discussing this, you have entered the area of fair use; that is an admission of copyright violation.
Fair use is not a violation of copyright but a specified (and since 1976 statutory) exception to it. You are clearly impugning the doctrine with your comment.
Not only that, but it is clearly someone going out of their way to make it do that. I’m not sure that that is a reasonable test of how the program typically behaves.
Honestly, I feel most people don't care about that. What they do care about, is the risk of Copilot making the user liable for copyright infringement. Even a possibility of it spewing out non-public-domain code should be considered a showstopper for any use of Copilot-generated code in a commercial project.
Can Copilot produce licensed code verbatim, in enough quantities to matter, with a license your business would be infringing? Yes. Can you easily tell by looking at the output? No. Could someone end up suing you over it? Maybe, if they cared enough to find out. Can you honestly tell your investors, or a company you seek to be acquired by, that nobody else can have valid copyright claim against your code? No.
> Can Copilot produce licensed code verbatim, in enough quantities to matter, with a license your business would be infringing? Yes. Can you easily tell by looking at the output? No. Could someone end up suing you over it? Maybe, if they cared enough to find out. Can you honestly tell your investors, or a company you seek to be acquired by, that nobody else can have valid copyright claim against your code? No.
Well aren't all your assertions exactly the point of contention?
Well, the "enough quantities to matter" part hasn't been tested in court yet, but I fail to see how a court could rule "No" here without gifting us a universal way to turn any code into public domain, destroying source code licensing as a concept. Other than that part, the first two claims have already been demonstrated, and the rest follow from them.
But that is in fact the most fundamental question here. And I’m not fully sold on the idea either that this is going to happen in real-world usage or that a single function in a massive program constitutes a large enough portion to be infringing.
Quake's square root function wasn't the only, or the largest, example of code Copilot reproduces verbatim. Among others I've seen to date is someone generating a real "About" page containing the PII of some random software developer.
How much code is enough to infringe is a tricky question, though. It's not only a function of size, but also of importance/uniqueness - and we know that Copilot doesn't understand these concepts.
> ... or that a single function in a massive program constitutes a large enough portion to be infringing.
As part of the sequences of rulings in Google vs Oracle, the 9-line rangeCheck function, in the entirety of the Android codebase, was found to be infringing.
Yes, it is, because that means that the algorithm will produce that copyrighted code regardless of the intent of the person who makes it misbehave. People could both accidentally and "accidentally" make it reproduce copyrighted code. In the first case, it's unintentional. In the second, how could you prove it's intentional?
Because of this whole mess, I am actually adding clauses to FOSS licenses that I am writing, just to ensure that my copyright on my code is not infringed by code laundering.
I'm not at all in favor of the "code laundering" (which is a brilliant term, thank you). But I don't understand how you expect a new license to help.
1. A license applied to source code is effective because of your copyright
2. The claim of Copilot's maintainers is that it bypasses copyright
Therefore, they will assert that they can ignore the new license saying "you may not launder my code" just as surely as they can ignore the previous license.
Second, you are correct that Copilot's maintainers claim that it bypasses copyright, but if it does while producing exact copies of code, then copyright is dead, and there are a lot of big companies out there with deep pockets that will ensure that doesn't happen.
They may claim that because their algorithm is a black box, whatever it produces has no copyright, but my licenses will push back directly on that claim by saying that if source code under the license is used, in whole or in part, as input to an algorithm, then the license terms must be attached to the output. After all, that's what we do with the GPL and binary code. The binary code is the output of an algorithm (the compiler) whose input was the source code.
I hope by tying it together like that, the terms can close the loophole they are claiming. But of course, I am going to get a lawyer to help me with those licenses.
> ... if source code under the license is used as all or part of the inputs to an algorithm, whether all of the source code or partially, then the license terms must be attached to the output.
You're not getting it. If Copilot isn't currently infringing copyright then adding such a clause won't matter. Such a clause would only hold weight when copyright applies. On the other hand, if copyright does apply, then you don't need such a clause because the activity is already a violation of the vast majority of licenses. (It even violates extremely permissive ones because it effectively strips out the license notice.)
The GPL works specifically because copyright applies to the usecase in question. It simply specifies various requirements that you must meet in order to license the code given that copyright applies.
In short, you can't just put a clause into a license saying, effectively, "and also, this license confers superpowers which make it so that my copyright applies in additional situations where it otherwise wouldn't!".
I think the GP's "license" would still be effective, although it would not be "open source" per the OSI definition.
Imagine this simplified scenario first: if I published a source file publicly without any licensing or explanation except a standard copyright notice - "Copyright (C) 2021 MY NAME, all rights reserved", do you think a random person/company can take that code and integrate it into a commercial product?
I would argue not (in general). Copyright law, as it stands, does not permit a user who has access to a copy to do whatever they want with that copy (especially if it involves more copying). OSS licenses do give you much freedom, and that's why we have the impression that we can do whatever we like with publicized source code. However, consider other types of copyrighted work, say movies: streaming services can "rent" you a movie multiple times even though you've paid to download the content previously. What are you paying for the second time you rent? Another example: some photographers may allow you to freely browse their works, but they can still make you pay if you want to use a photo in your commercial product.
So why wouldn't copyright restrict usage of source code in similar situations? The GP only needs to add a condition to the license to restrict how users can use it. It will no longer be OSS, but as long as it's his work, I don't see why in principle it shouldn't work.
(In practice, I don't think it will make much difference -- I think your argument is still somewhat compelling, and some people will probably take your position. Conservative corporate lawyers aiming to reduce legal risk would disagree, so it's basically a matter of how much legal risk one is ready to take. Also, for an author trying to do this, note that suing Microsoft in these cases would be expensive, since they will likely fight back given how much money they spent building this, and the outcome would be uncertain. If it were really tested in court: given the result of Oracle v. Google, where the US Supreme Court was impressed by the social/economic benefits that Android brings, I'm pretty sure the justices would be even more impressed by this intelligent code generation thingy, and might just deem it fair use.)
Your summary is generally correct, and I certainly agree with the other commenter's position on their work. But I think you're still missing the point. Copyright is the mechanism that allows you to prevent copying, but GitHub's claim is that copyright is irrelevant to Copilot's input.
I have a nice strong lock on my door. GitHub (asserts that it) can enter my home through the window.
Adding another deadbolt to the door does not help.
I don't think I missed that point. I'm trying to argue that copyright is relevant to Copilot's input if not allowed by an OSS license.
Maybe I'm missing something (just not the thing you said), but has Github made any legal claims so far? The original article is written by a politician in EU...
Even if you're a lawyer defending Github in this case, there are still a couple of things that need to be clarified before you can make the case: (maybe the info is out there but I'm too lazy to research)
- Is Github only using code/repos that are explicitly under OSS licenses? (because if that's the case, then the discussion might be justified in presuming OSS terms, and it may be the case that more restrictive non-OSS licenses would require a different analysis)
- As somebody pointed out in another thread, the Github terms of service agreement seems to grant Github additional rights when dealing with user uploaded content. Is that a legal basis for the use?
> I'm trying to argue that copyright is relevant to Copilot's input if not allowed by an OSS license.
And I tend to agree with you (and the other commenter) here. But GitHub doesn't.
> has Github made any legal claims so far?
I'm not sure how actively, but the CEO was here in the announcement thread the other day saying that they think the ingestion of the inputs is a "fair use". They also have some material defending the output side: https://docs.github.com/en/github/copilot/research-recitatio...
> Is Github only using code/repos that are explicitly under OSS licenses?
I don't think we know exactly what code they used as inputs, no.
Their argument defending the output side doesn't hold water, IMO. If Copilot produces exact copies verbatim, even some of the time, then as long as customers don't have access to the code used to generate the model, how can they be sure?
It's a matter of scale. With a big enough codebase, there will be copyright violations.
> I don't think I missed that point. I'm trying to argue that copyright is relevant to Copilot's input if not allowed by an OSS license.
The point (that they claim that) you are missing is that if "copyright is relevant to Copilot's input" then almost all existing OSS licenses already don't allow that.
The licenses that I am making implicitly acknowledge the argument that training an ML model is fair use.
However, GitHub said nothing about the output of the model being fair use. My license will say that the output of their model is under the same license as the input, which means they have restrictions if they want to distribute it (i.e., actually have people use Copilot).
I think this will work because it doesn't say that GitHub is wrong. Instead, it says that, even if GitHub is right, it doesn't matter.
It would also be very bad for GitHub to claim that the output of an algorithm can't be under the same license as the input because we feed licensed code to algorithms all the time and claim that their output is still under the same license. We call those algorithms "compilers" and the binary code they produce is still copyrighted and licensed.
> I think your argument is still somewhat compelling, and some people will probably take your position.
I didn't mean to take a side or argue a position here. I was just pointing out that licenses hold no legal power in the event that copyright itself doesn't apply.
> ... So why wouldn't copyright restrict usage of source code in similar situations?
I'm certainly not an expert here but I believe you are mistaken about the extent to which current copyright law (in the US) restricts such usage. I also don't think that the examples you bring up are as simple as you seem to be making out.
You are legally permitted to record broadcast shows for later viewing; you are not permitted to redistribute the recordings though. I assume (but am not certain) that rentals and streaming are the same. (That being said, bypassing DRM has been made its own crime. This effectively amounts to an end run around the rights otherwise granted to you by US copyright law. But then there are specific exceptions where bypassing DRM is permitted. I digress.)
You aren't legally permitted to mirror the contents of a website (such as the New York Times) without permission but you are allowed to access it since they make it publicly available. You are even permitted to save a copy for your own purposes when you access it; you are not permitted to redistribute that copy.
For an extreme example, consider the recent LinkedIn case. Unless I misunderstood it, the court deemed it acceptable to scrape any publicly available content. Certainly most such scraped content was never explicitly licensed for that though!
Even if the license for a piece of code was entirely proprietary, GitHub presumably acquired it through legal means (ie intentional upload). Once they have it in their possession, it's not at all clear to me that current copyright law in the US has anything to say about how they use it (short of redistribution). Of course, if their ToS promises that they won't use it for other purposes then they can't do that. But assuming they never promised you that in the first place ...
There's a traditional argument here about needing a license to legally incorporate the copyrighted work of another into your own.
One possible counter argument is that training a model on publicly available work is analogous to a person viewing that work. So long as the model never outputs any of the original inputs (or only exceedingly small fragments of them that would fall under fair use regardless) it's not clear that those outputs constitute derivatives at all (in the legal sense). Or they might. The courts haven't weighed in yet as far as I know. (Consider GPT-3 or This Waifu Does Not Exist for additional examples of the sort of ambiguity that's possible here.)
Of course, one possible counter to that is that the model itself is (in many cases) effectively a lossily compressed copy of the original input works. So perhaps redistribution of the model itself would be a violation of copyright. But even if that turns out to be the case, it's still not clear that the output of such a model would run afoul of copyright.
I argue that the output of an algorithm has the same copyright as the inputs to the algorithm, and that's because we use compilers (algorithms) to transform source code all the time already, and no one says that the binary code (outputs) is not copyrighted.
The trouble is there seems to be an entire continuum when it comes to degree of transformation.
The compiler produces more or less a direct (logical) translation so it's clearly some sort of derivative. We go from C to machine code but the output still "means" the same thing as the input. (More precisely, it's approximately a mathematically transformed subset of the original input. Lots of information is removed, things are reorganized, and a bit of extraneous information gets added in the process.)
For something notably more muddy than a compiler, consider This Waifu Does Not Exist. Any given output is (typically) nowhere near any particular input but you can often spot various strong resemblances.
Alternatively, the implementation of sketch-rnn (https://magenta.tensorflow.org/sketch-rnn-demo) is quite different - it outputs pen strokes instead of pixels. Still, the legal questions remain the same.
For a significantly muddier example, consider GPT-3. The outputs are (typically) not even remotely similar to anything that was input except in very broad strokes.
Where does Copilot fall along this continuum?
For even more confusion, consider running a New York Times article through Google Translate. Are you in the clear to publish that? I seriously doubt it.
But what about running it through an ML algorithm that (attempts to) produce a very brief summary of it? Many such implementations exist in the real world today. Their output is nothing like the input - should it still fall under the copyright of the original?
Finally, it's worth pointing out that for many of the above computerized tasks there are direct human equivalents. Art can be traced on a light table. A drawing can be produced that fuses the styles of two references. News articles can be manually translated or summarized.
Again, my intention here isn't to argue a particular side. I'm just trying to make it clear how complicated this stuff is and the fact that we don't have clear legal answers for most of it yet.
I argue that, even if training a dataset is fair use, distributing the result is copyright infringement. I would want my license to make that part clearer.
> even if training a dataset is fair use, distributing the result is copyright infringement
I would be inclined to agree that the current situation (ie reproducing training examples verbatim) violates copyright. On the other hand, I'm not so sure that a trained model does (or even should) be subject to the copyright of the inputs.
Of course I acknowledge that the latter view is controversial and also that such issues are so new that they haven't had a chance to be meaningfully addressed by either the courts or the legislature yet.
As an example of a similar situation, see (https://www.thiswaifudoesnotexist.net/) which was trained entirely on copyrighted artwork. Note that there are at least three distinct issues here - training the model, distributing the model itself, and distributing the output of the model.
> I would want my license to make that part clearer.
But again, GitHub's argument here is that the license is completely irrelevant because it doesn't apply in the first place. Thus they won't care one bit about any clarifications you make one way or the other.
You said that you're "not so sure that a trained model does (or even should) be subject to the copyright of the inputs."
You missed my point. I'm not saying that the model is subject to the copyright of the inputs; I'm saying that the model's outputs are, which is entirely different. We say that the output of a compiler is still subject to the copyright of the inputs, so why not this?
I misspoke. (Err mistyped?) I suspect there will often be a stronger case to be made for the model itself falling under copyright than what it outputs. It's up to the courts and the legislature in the end though, so who knows.
Anyway, by providing public access to this thing I infer GitHub to be taking the position that copyright doesn't apply to the output. (And I suspect they are wrong, in particular because of the verbatim code samples people have managed to coax out of it.)
I wish. I just want users to know what rights they have. Ultimately, I want my software to serve end users, not companies. If companies add value for users with my software, that's exactly what I want.
But stripping licenses away so that users can't know what rights they have with my code is not that.
Not necessarily. If you do it right, you've got a perfectly GPL-compatible license (because such laundering is, technically, a violation of the GPL… probably) – it's just a license that's more explicit about what's a license violation.
GPL explicitly forbids re-licensing under more restrictive terms.
So either the added terms are not more restrictive, which basically means they are unnecessary and have no real effect; or they are more restrictive, which is incompatible with the GPL.
You can't have things go both ways. It seems that your argument is "we're not adding restrictions, we're just saying what we think Copyright law / the GPL should actually be like." But unfortunately you can't "clarify" Copyright Law or "clarify" the GPL by adding terms. Ultimately courts decide that.
(Of course, if somehow your "clarification" happens to align with a court decision, then maybe it will work after all. But in theory your "clarification" is still not necessary and has no additional effect....)
> But in theory your "clarification" is still not necessary and has no additional effect....
Except your clarification will be interpreted by a court of law. “This license is compatible with the GPL and I can interpret the GPL in a way that lets me do something this license says I can't” is much less likely to stand than “well maybe the author thought the GPL said this, but it actually says my interpretation”.
This, of course, presumes that such a license is actually compatible with the GPL, something I'm getting less and less certain of over time. (What constitutes a compiled form? If a predictive model doesn't count – which it might not, since it outputs source code, very much unlike how compiled programs normally work – then my argument falls down. And many other things would also knock the argument down; I'm not confident enough that all my assumptions are right, or that they should be right.)
wizzwizz4 is correct. Also, I have explicit clauses saying that GPL/AGPL dominate.
But yes, my licenses may be incompatible (one-way) with permissive licenses. I say "one-way" because code with permissive licenses can still be used in code under my licenses, but maybe not necessarily the other way around.
That does not really ring true to me. AGPL broadens the scope of violations as well, and you cannot use AGPL code in GPL-only code bases without turning the end product AGPL (but you can use GPL-only code in AGPL code bases).
If you're just adding something along the lines of "copying passages extensive enough to reach originality is a violation of this license" then that's indeed already covered by the GPL, and there is really no need to add such a passage other than to be more explicit - and confuse people at least at first about why your license is not actually the GPL. So there isn't much of a point to do it in the first place, in my humble opinion.
If you add text that says something along the lines of "you may not use this code as training data", then you created an incompatible license, and your code cannot be used in GPL code bases, and even worse, since it restricts what you can do with the code more than the GPL, it might even mean you stop being reverse-compatible and may not use GPL'ed code yourself in your own custom-license code base.
The AGPL does not further restrict code uses, just broadens the scope of when you have to make available the code, so it's fine there. However, the original BSD license with the advertising clause is considered incompatible with the GPL.
I am not a lawyer, and these are just my quick layman concerns. I fully recognize you're entitled to use whatever license you find suitable for your code and I am absolutely not entitled to your code and work whatsoever.
But that said, I wouldn't touch your code if I saw a "potentially problematic" custom license, and I wouldn't consider contributing to your projects either.
Honestly, with this whole debacle, I am not going to be accepting outside contributions anyway.
I also understand the concern with a problematic license. However, I don't plan to make a specific exemption about machine learning, but rather tie up an ambiguity.
What I think I'll do is that the license will require that when the licensed source code is used, partially or fully, as an input to an algorithm, the license terms must be distributed with the output of that algorithm.
I don't think this is a violation of the GPL at all because the GPL requires you to distribute the license with the binary code of GPL'ed code, and such binary code is the output of an algorithm (the compiler) whose input was the source code.
But what it would do is put the onus on GitHub: if they used my code in training and distribute the results (as they are doing), they must distribute my license terms as well and tell users that some of the results are under those terms.
> binary code is the output of an algorithm (the compiler) whose input was the source code.
Just because binary code is produced by the operation of an algorithm on source code doesn't make all output produced by any algorithm on that source code binary code. Otherwise checksums and hashes and prime numbers would be copyrighted.
You have a point, which is why the legal system would still require that a copy be substantial before they count it as infringing. I would argue that Copilot has already been shown to copy substantial portions, though.
> something along the lines of "you may not use this code as training data"
Would such a term be legally binding under present copyright law? Other than disallowing inclusion in a redistributed dataset specifically intended for training ML models, it's not clear to me that it would actually prevent such use if you already had a copy on hand for some other purpose. (Specifically, note that GitHub indeed already has a copy on hand for their authorized primary purpose of publicly distributing it.)
More generally, the manner in which copyright law applies to machine learning algorithms in general hasn't been worked out by either the courts or legislature yet. Hence the current article ...
To be clear, my suspicion is that this is so unlikely to happen unintentionally that it does not represent a real risk. If the issue is that I can force it to generate infringing output if I really want to, it is an argument against the Web browser too, since I could just as easily use the copyright-unsafe "copy" feature.
Whereas using the browser's copy feature requires the user to have intent to use it, getting Copilot to produce exact code does not. And proving that intent is not easy.
I think companies will see that such code can be exactly reproduced and decide to stay away from Copilot. I hope they do. In fact, I am less willing to take outside contributions for my own code, even for bug fixes, just because of the risk that that code came from Copilot.
That makes sense if you ignore the idea that such a thing would seem unlikely to happen without intent, which was the key thing in the post you’re replying to.
Unlikely stuff will always happen with enough use. There are billions of lines of code in the world. There will be enough copyright violations. Even on single multi-million line codebases, there will be violations.
The answer to your first question is for the courts to decide, unfortunately.
However, for my purposes, using a new license with particular terms would only be to make companies like GitHub pause and think before using my code as "training" to an "algorithm" like Copilot.
Tool that could be used to violate copyright := Gets prosecuted by MPAA and friends, legislation is passed to make use / development / distribution of such tools illegal
Bigcorp ships the ML equivalent of ALLCODE.tgz, but you actually gotta look in the no/dont/open/this/folder/gplviolations/quake.c folder := Is this adequate proof that copyright is being violated?
Since I do not work for the MPAA, I don't see why you expect me to answer for them. Half of the article's argument is that any argument you could use to shut down Copilot would also give a lot of power to such entities if it were accepted.
I'm not sure either, which is why I said "may have been reasonable" instead of "was reasonable" :)
I can see an argument for doing your own research, but I can also see an argument for basing an analysis on what GitHub said in the FAQ — I'm honestly a bit surprised that Microsoft's lawyers let them say that with a product that can reproduce such large blocks of verbatim code.
Yep. Individuals in a FAANG don't have the ability to launch a product without review. Just drafting a press release for a new product involves Comms oversight and VP-level approval.
Please omit flamebait from your HN comments. It tends to produce flamewars, which are tedious and nasty. Your comment would be fine without the last two sentences.
Copyright is (and has been, since the earliest days) about protecting the creative expression of an idea.
You can't copyright an algorithm, but you certainly can copyright the expression of an algorithm in Python. You cannot copyright the words of the English language and their meanings, but Noah Webster absolutely did copyright his dictionary, which was a creative expression of their definitions (and lobbied for the first increase to US copyright law). Webster wasn't the "thought police" for trying to copyright people's understanding of words in English, because he didn't and couldn't copyright them; he copyrighted his expression of what words meant.
If you read the creative expression of an algorithm in Python and then re-express it in English, then sure, copyright protection doesn't extend to that re-expression. But Copilot isn't doing that, it's quite clearly reproducing parts of the original creative expression of an algorithm, not the algorithm itself.
Here's an easy way to demonstrate it: open up a source file in any language other than C and try to get Copilot to spit out an implementation of Quake's fast-inverse-square-root algorithm. You will very quickly discover that Copilot doesn't "know" the algorithm; it only "knows" the specific creative expression of it (comments included).
In the US, copyrightable aspects may include the choice of variable names, the organization of the code into modules and functions, and other areas where the author made creative choices that copyright law can protect.
> the court presented a three-step test to determine substantial similarity, abstraction-filtration-comparison. This process is based on other previously established copyright principles of merger, scenes a faire, and the public domain.[1] In this test, the court must first determine the allegedly infringed program's constituent structural parts. Then, the parts are filtered to extract any non-protected elements. Non-protected elements include: elements made for efficiency (i.e. *elements with a limited number of ways it can be expressed* and thus incidental to the idea), elements dictated by external factors (i.e. standard techniques), and design elements taken from the public domain. Any of these non-protected elements are thrown out and the remaining elements are compared with the allegedly infringing program's elements to determine substantial similarity.
Emphasis mine. This specifically highlights that your example ('only so many ways you can express an algorithm') is not protected under US copyright law.
The originality requirement only applies to other aspects of the generated code, which in this case would include the comments that Copilot generated, and which clearly are not required for the algorithm to work.
For thought police like you describe, look to patent law.
Huh, that's interesting. While I'm hesitant to suggest that what the world needs is even more patents, this doesn't make immediate sense to me.
Let's say someone comes up with a new sorting algorithm, which completes in fewer cycles than was previously believed possible. Sure, it's math, but isn't that a new, creative expression? Don't we want to encourage them to publish their algorithm (one of the key purposes of patents—this way, anyone can use it after 20 years), as opposed to keeping it hidden from the world?
It makes more sense to me than most software patents (admittedly, a low bar to clear). And if the patent office is doing its job (big if), the patents should only be granted for algorithms which are sufficiently novel.
A new super-fast sorting algorithm (not just a few cycles, but something that actually changes the O-number) would obviously be a fantastic boon - I would want the inventor to benefit from his cleverness.
But nowadays I think patent law isn't the right way to do that; trade secrets should be enough. I don't think that what is disclosed to the public in patent applications is of enough value to justify a long monopoly. It's not necessarily a problem with the written law; patents are horrible because of the way courts apply them.
An algorithm is definitely patentable subject matter. It's very strange to read such a statement on the interwebs, which were heavily affected by a patented image compression algorithm.
An algorithm is maths, you can't patent maths. Patent lawyers and business people have however somehow managed to convince courts/patent authorities that configurations of computer systems are patentable (or some similar argument), which then makes software patentable (IANAL but I think it's something like this).
Either way, the copyright of source code is separate from that. Copyright is for the text of a program (the source code), which might e.g. implement an algorithm. The algorithm itself cannot be patented or otherwise legally protected.
An algorithm is maths, but a lot of code isn't algorithmic. Algorithms provably halt, and most software doesn't halt, let alone provably. Operating systems, browsers, games, etc. are non-algorithmic. It's hard to claim that something like a browser is just math and therefore deserves no IP protections.
An algorithm is a reasoning procedure. A program (e.g. a browser) embodies many algorithms.
I've not come across your stipulation that for a thing to count as an algorithm it must provably halt, but I can go along with that. So I'd argue that in most cases, any function or subroutine provably terminates, even if the program embodying it is not supposed to terminate.
I also don't agree that an algorithm is "just maths". At least, not if you then pivot to saying that a browser isn't "just maths". Any operation performed by a computer is "just maths", because what a CPU does is basically arithmetic and branching.
I don't think it's a question of what does and doesn't "deserve" IP protection. The source code of a browser is clearly an original work, and entitled to protection. But the ideas and procedures it embodies are not "works", and copyright isn't supposed to apply to ideas and procedures.
I'm against the very idea of "intellectual property". It must have seemed a good idea at the time, but I think patents and copyrights have become monsters that inhibit, rather than encourage, innovation and creativity.
> I also don't agree that an algorithm is "just maths". At least, not if you then pivot to saying that a browser isn't "just maths".
Algorithms are distinguished by their proofs of correctness. This elevates them above simple procedures. The halting problem tells us that there is no automatic way to determine whether or not a program terminates. So when we do find a termination proof, it's like discovering a mathematical law. The proof of an algorithm's correctness is expressed independently of any programming language or platform. What else could they be other than math?
Things like browsers, games, operating systems, e-mail clients, music players etc. are not treated this way. They are not formally specified. They are implemented in the context of a machine and an actual running environment. The source code of the program usually doubles as its specification. It's very different compared to an algorithm.
I agree IP as a concept is bad, but this is the way of the world at least for now. Given where we are, for me it makes sense to draw a line between algorithms and software in the context of copyright.
I'm afraid I still don't agree with you; algorithms existed (and were named) long before The Halting Problem was stated. This "proof of correctness" claim doesn't wash with me.
Can you cite? (Wikipedia's explanation of what an algorithm is doesn't mention that). I'm open to being corrected, but not baldly contradicted.
> In mathematics and computer science, an algorithm is *a finite sequence* of well-defined, computer-implementable instructions, typically to solve a class of specific problems or to perform a computation.
"Finite sequence", meaning it terminates. The wiki citation (The Definitive Glossary of Higher Mathematical Jargon) provides a little more: "A finite series of well-defined, computer-implementable instructions to solve a specific set of computable problems. It takes a finite amount of initial input(s), processes them unambiguously at each operation, before returning its outputs within a finite amount of time." https://mathvault.ca/math-glossary/#algo
So they stipulate finite input and output and finite runtime. This is in contrast to something like a webserver or OS, which has a potentially unbounded number of inputs, and is expected to run effectively forever.
I mean think about it, what does an OS kernel look like in its most distilled form? It's essentially just an entry point to an infinite loop. Same as games, where they're called "event loops" in that domain.
The way I think about it is this: if computers and programming languages and software didn't exist, algorithms still would. I don't think you can say the same about e.g. Quake. Quake isn't a mathematical truth even though it uses them to work, kind of like how engineers use physics to build a bridge, but the bridge itself isn't physics.
Well, I don't consider an OS or a browser to be the implementation of an algorithm. It has to be possible to understand an algorithm, re-implement it, and so on. I'd say you have to be able to hold it in your head - like Eratosthenes' Sieve.
I don't see anything in the WP article about proofs of correctness. "Finite sequence" surely just means that the number of instructions isn't infinite? Come to think of it, that wording seems rather hand-wavey; I wonder if I can find citations to help me improve it.
> I don't see anything in the WP article about proofs of correctness. "Finite sequence" surely just means that the number of instructions isn't infinite?
Well how are you going to show that the algorithm terminates for all inputs without a proof of correctness?
The thing about the situation is that "copying code you found on the Internet" certainly isn't automatically, always legal. But copying X from the Internet doesn't automatically make it illegal either. The source of the code you incorporate into a product doesn't matter; what matters is whether that code is copyrighted and what the license terms (if any) are (and people saying "copyright doesn't apply to machines" are wildly misinterpreting things, imo).
Given what's come out, it seems plausible that you could coax the source of whatever smallish open source project you wished out of copilot. Claiming copyright on that code wouldn't be legal regardless of Copilot.
Whether Microsoft/Github would be liable is another question as far as I can tell. Youtube-dl can be used to violate copyright, but it isn't liable for those violations. The only way Copilot differs from youtube-dl is that it tells its users everything is OK, and "they told me it was OK" is generally not a legal defense (i.e., I don't know for sure, but I'd be shocked if the app shielded its users from liability). All the open source code is certainly "free to look at", and Copilot putting it on a programmer's screen isn't doing more than letting the programmer look at it, until the programmer does something with it (incorporating it into a released work they claim as their own would be that act).
The question is how easily a programmer could accidentally come up with a large enough piece of a copyrighted work using Copilot. That question seems to be open.
TL;DR: My entirely amateur legal opinion is that Copilot can't violate copyright but that its users certainly can.
> On the other hand, the argument that the outputs of GitHub Copilot are derivative works of the training data is based on the assumption that a machine can produce works. This assumption is wrong and counterproductive. Copyright law has only ever applied to intellectual creations – where there is no creator, there is no work. This means that machine-generated code like that of GitHub Copilot is not a work under copyright law at all, so it is not a derivative work either. The output of a machine simply does not qualify for copyright protection – it is in the public domain. That is good news for the open movement and not something that needs fixing.
This is very good news. This line of thought implies that we can legally feed all proprietary code into GitHub Copilot in order to teach it all the patented and secret tricks of the companies we can see (since data mining is not copyright infringement), in order to have it print those secrets back when we ask it to (so they become public domain).
> The output of a machine simply does not qualify for copyright protection
Good, do the `cp` or `cat` commands qualify as the "output of a machine"? Now I can uncopyright everything, hooray. What about converting a video or an image to another format? Again, it's just the output of a machine.
Added:
Really, I would've been happy if this was the situation, as I'm, in general, against patents and copyright (in the form that they are now being used).
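Following that reductio, a small Python sketch (the bytes are a hypothetical stand-in, not real code): a byte-for-byte copy is just as much "the output of a machine" as anything `cp` produces, and nobody would argue the copy thereby sheds the original's copyright.

```python
import io
import shutil

# Hypothetical stand-in for a copyrighted source file.
original = io.BytesIO(b"/* hypothetical copyrighted source */\n")
copy = io.BytesIO()

# shutil.copyfileobj is morally equivalent to `cp`: a machine whose
# output is an exact reproduction of its input.
shutil.copyfileobj(original, copy)

assert copy.getvalue() == original.getvalue()
```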
Wow, that's quite a strawman/bait-and-switch from the article; thanks for highlighting it.
If Copilot is just a machine -- a glorified typewriter -- then the machine's operator is responsible for its output.
Or does the author seriously want to claim that any code added via Copilot to a proprietary codebase would not be proprietary as well? If that were true, Copilot's userbase is going to be...limited.
So then we come full circle on infringement: if the operator is responsible for the code produced by Copilot, then the article's claim that it is not infringement because it is machine-created fails, since the operator is responsible.
The argument in the article is that code made by Copilot cannot infringe because no copyright attaches to it. You seem to imply that code made by Copilot is copyrighted by Copilot's operator; it would then fall under copyright law, and thus would/could be infringing.
> Copyright law has only ever applied to intellectual creations – where there is no creator, there is no work
This is, at best, an oversimplification. Code compiled from copyrighted source code is derived work inheriting that copyright according to long-established law. This is exactly why the legal issues around machine learning applied to copyrighted corpora have been contentious.
"Patented and secret tricks" are not protected by copyright, if the output was an actual reimplementation of an idea instead of Copilot regurgitating existing code (https://news.ycombinator.com/item?id=27710287).
The specific implementations are protected by copyright, and the ideas may be protected by patents. In the case of "secret" tricks, they may be protected by trade secret laws, but not if it's in a public GitHub repo.
Except Copilot itself is not open source, so your only way to feed that proprietary code into it would be to upload it to github, which would make you an infringer.
> your only way to feed that proprietary code into it would be to upload it to github, which would make you an infringer
Somebody has to do it. Someone ready to take one for the team (or better yet, already skilled in software piracy, so that it's not a big deal for them). Then, if this argument holds, everyone gets the result in public domain.
As I see it, the argument presented in this post essentially makes Copilot a universal copyright laundering machine. Not just for code, but for anything that can be represented digitally.
Obviously this won't stand. While I can see Github ending up protected from all liability, the only way for this to not kill copyright is for the users of Copilot to become at risk of copyright infringement. Which kills the whole value proposition of Copilot.
This seems like a good time to ask what the heck is the value proposition of a thing like this. Are people really going to use the output of this blindly? And if they're going to audit every line - is that really easier than just writing the code yourself? Honestly, at best it feels like a machine for introducing pernicious bugs that "look right" but are semantically wrong. (Which reminds me - were any of the Underhanded C Contest entries in the training data?)
That's a very good question. On the surface, the idea seems to be helping people write code faster - but as you observe, properly auditing generated code is more work than actually writing it from scratch.
Best I can think of in terms of real value delivered, is helping people with first drafts, breaking through the "staring at an empty page" problem. But even with this, I feel it's too risky compared to doing a StackOverflow search, where you can at least see some explanations, discussions, and other relevant context.
It's definitely an interesting vision demonstrator - despite not being quite there, it lets us see that a tool like this that actually worked well (in terms of generating correct, explainable, license-respecting code) would be very useful.
Assume GitHub designs a filter to detect similarities to the training set and displays an attribution link with the result, as a comment. It's no different from using a search engine to find the code and putting it in your project, especially that the code is public already and visible to multiple search engines. You are ultimately responsible, just like you are every day with Google.
But the model has on average just 1 regurgitation in 10 weeks per user, so you can just discard all of them.
Almost all the output of Copilot is not an exact copy of any code in the training set. You discard the 0.001% of generations that are similar to the training data and use the rest.
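A naive version of such a filter could hash token n-grams of each suggestion against an index built from the training set. This is only an illustrative sketch (the function names, the whitespace tokenizer, and the window size are my assumptions, not anything GitHub has described):

```python
import hashlib

def ngrams(tokens, n=10):
    """All contiguous n-token windows of a token list."""
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_index(training_texts, n=10):
    """Hash every n-gram of the training corpus into a set."""
    index = set()
    for text in training_texts:
        for gram in ngrams(text.split(), n):
            index.add(hashlib.sha1(gram.encode()).hexdigest())
    return index

def looks_regurgitated(suggestion, index, n=10):
    """True if any n-gram of the suggestion appears in the training index."""
    return any(hashlib.sha1(g.encode()).hexdigest() in index
               for g in ngrams(suggestion.split(), n))
```

Suggestions flagged by `looks_regurgitated` would be discarded (or annotated with an attribution link); everything else passes through.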
Can they afford it, though? The most important feature of DNN models is that they're orders of magnitude smaller and cheaper to run than training data; checking each output against the training set will make the model orders of magnitude more expensive to run.
> The GitHub Copilot collects activity from the user’s Visual Studio Code editor, tied to a timestamp, and metadata.
[...]
> This data will only be used by GitHub for:
[...]
> - Improving the underlying code generation models, e.g. by providing positive and negative examples (but always so that your private code is not used as input to suggest code for other users of GitHub Copilot)
I'm inclined to believe this. After all, why would they taint the training data with code from a random guy who is asking for help when they have more than a hundred thousand repos with 100+ stars?
Sure, perhaps they won't use it for training, but the fear is they would use it for corporate espionage, market research, etc. Compared to training, this would be both more useful and much easier to keep deniable / under wraps.
I've had previous employers that were highly concerned about far more innocuous data leaking to competitors, e.g. autocomplete search terms. This means that a Copilot installation at any company that competes (or might compete) with MS should be considered a security breach, IMO. Given Microsoft's presence in so many markets, in general I think it would be foolish to risk this at any company.
You're still sending your intellectual property to Microsoft and hoping that they do only what they say they do with it, and that whatever they do with it will never change.
From context, I thought you were suggesting the strategy "just type pirated proprietary code into the IDE and the Copilot plugin will automatically include it in the training data", since my earlier comment was about the difficulty of training Copilot on such code. I don't believe they won't abuse your work in other ways either.
> On the other hand, the argument that the outputs of GitHub Copilot are derivative works of the training data is based on the assumption that a machine can produce works. This assumption is wrong and counterproductive. Copyright law has only ever applied to intellectual creations – where there is no creator, there is no work.
Cool. I'll just train my new AI on 20 different copies of the same Disney movie and have it generate a new movie. Checkmate, lawyers!
> The court of law would probably have you unveil and tear apart your process and find that you were trying to plagiarize in a roundabout way.
This is the whole idea. Copilot is spitting out considerable chunks of code that is licensed under GPL and it will be up to GitHub to prove that Copilot is not trying to plagiarize this code in a roundabout way.
At the very least, Copilot should have separate data stores for different groups of licenses: public domain, attribution-only, copyleft, etc. That would already make it much more usable than the current "here's some code, it came from I don't know where, don't ask me" that literally looks like a black-market deal, except GitHub-branded.
> You do understand that's not how laws work in general, right?
The only reason the law doesn't work this way for Microsoft Copilot is because the copyright holders are individuals who do not have the capital or expertise to file suit.
If Microsoft instead released a video editor addon that was trained on Disney movies and which would sometimes insert scenes of _any_ Disney movie you can bet your ass we wouldn't be having the same discussion.
Comparing code to movies - in code even a single char difference can change the meaning of everything, in movies - you can skip whole scenes and still get the meaning. I don't think the two are compatible, they are judged by different standards.
I’m not sure what the point is. That the argument of copying Disney movies is not a good analogy, or that courts will find plagiarism in software copyright cases involving Copilot?
The point is that Copilot is effectively doing the Disney movie thing, just with code, and yet this article argues this is all fine. As it is, the article turns Copilot into a universal copyright laundering machine.
Generally speaking - probably yes. That’s because the output would be (a) transformative and (b) not likely to affect the profitability of the original work (at least not directly).
Note: I’m not really trying to comment specifically about the code/movie examples - just the general notion that the more input there is (from different sources), the likelihood that the use will be considered “Fair Use” increases.
Do you think this Copilot is some sort of advanced AGI who just became a genius programmer? Almost every piece of code that it "generates" you will find it almost verbatim in one or several public repos.
If you ever tried AI Dungeon before it imploded, you'd agree GPT-3 is pretty impressive and is generally capable of coming up with original stuff. Obviously with a bias towards the kind of stuff that's in the training dataset, which is why you try to use a very broad dataset. But still original.
I don't think the judge will care how you arrived at the copyright violation, only that you did. But hey, I'd love to see that court case anyhow, maybe I'm wrong.
Well, I have a hard time drawing a line between GitHub Copilot and a compression algorithm.
If you can reproduce a verbatim copy of the Quake source code after having taken that source code as input, then that's compression. A really fancy kind, but still.
And given that it reproduces the source code: it has to hold that somewhere.
It would be very interesting if someone could reproduce the Quake example with AGPL code, then request the whole model + code because it clearly contains the AGPL code in some encoded form.
A compressed file containing Quake's source code would be covered by the copyright on Quake's source code. The compression algorithm would not. The algorithm cannot produce the plain-text copyrighted material without the compressed copyrighted material.
Copilot has the ability to produce Quake's source code nearly by itself. And it's a work (not a person), so it can be seen as a derived work. Like a compression algorithm that sometimes tacks on the first paragraph of "50 Shades of Grey" at the end of files.
I'm not a lawyer, but that's my opinion (admittedly, my opinion is softening each day). Plus, the purpose of the tool is to create code for inclusion in projects somebody will hold a copyright over, and they likely won't be the original authors. So its output should be held to a higher standard than a compression algorithm or a keyboard.
> Copilot has the ability to produce Quake's source code nearly by itself.
Was it fed the Quake source code while training? Then it's not producing that code, it's just reproducing it, like a fancy (but imperfect) copy machine.
I'm not sure it's accurate to say that the training source code is "compressed" in the parameters of the model, but certainly some approximation of the training source code is stored in the parameters.
It is probably a stretch, but I think less of a stretch than saying "it's just a machine that learned to code and randomly reproduced these 10+ lines of code". That has, IMHO, a probability of 0.
So if I rule that out, where does it end up? What if we treat it as the grown-up ML sibling of the chain of LZW, PPM, dictionary-assisted compression (e.g. zstd), and the various attempts at using neural networks for compression?
I would not want to judge this - that's why I put up the AGPL idea. Or even unlicensed code. It would be a very interesting case to watch.
> A compressed file containing Quake's source code would be covered by the copyright on Quake's source code. The compression algorithm would not.
What? Where does the distinction between data and algorithm go with compression algorithms?
In its most abstract form, the decompression half of a compression scheme is a function `{0, 1}^n -> {0, 1}^m` such that n < m and the output string is the result of something previously encoded.
Why can't the input string be the seed used to make the machine learnt model generate the Quake source code?
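The analogy can be made concrete with an ordinary compressor. In this Python sketch (placeholder bytes, not actual Quake code), the decompressor is exactly such a function: a short seed goes in, the previously encoded original comes out, and it is the seed, not the algorithm, that carries the original's copyright.

```python
import zlib

# Placeholder standing in for a copyrighted source file.
original = b"// hypothetical copyrighted source line\n" * 50

blob = zlib.compress(original)    # the short "seed"
restored = zlib.decompress(blob)  # the algorithm maps seed -> original

assert restored == original
assert len(blob) < len(original)  # n < m: the seed is shorter than the output
```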
Yes! In every form, lossy compression is distilling meaningful information from noise.
This is a great legal question as it concerns our use of machine agents. We can learn from copyrighted literature or code that we read. Why can't our agents?
Because the process is different. You and any computer agent are allowed to learn the functional, non-copyrightable elements of fast inverse sqrt. When you need that functionality, you can write code that implements your understanding of those non-copyrightable elements and gain copyright over the resulting creative expression.
What you can't do is copy all of the creative expression in the original (such as comments) without complying with the terms of the license. Moreover, reproducing the magic constants is a strong indication that your process didn't independently derive your code because the constants used in the original are unique and non-optimal.
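To illustrate the distinction, here is a from-scratch Python re-expression of only the functional elements (the bit reinterpretation, the well-documented constant 0x5F3759DF, and one Newton step); none of the original file's comments, names, or formatting carry over. A sketch, not the Quake code:

```python
import struct

def inverse_sqrt_approx(x: float) -> float:
    """Approximate 1/sqrt(x) using the bit-level trick."""
    # Reinterpret the float's bits as a 32-bit unsigned integer.
    i = struct.unpack("<I", struct.pack("<f", x))[0]
    # The documented magic constant yields a first approximation.
    i = 0x5F3759DF - (i >> 1)
    y = struct.unpack("<f", struct.pack("<I", i))[0]
    # One Newton-Raphson iteration refines it to within roughly 0.2%.
    return y * (1.5 - 0.5 * x * y * y)
```

The constant itself is a functional element; what a copyright claim would attach to is the original file's surrounding creative expression.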
> We can learn from copyrighted literature or code that we read.
Not everywhere. Emulator communities often prohibit people from contributing if they've read the original code, to protect themselves from copyright claims.
If your model can't reproduce the Quake source without my input, you haven't really compressed it, especially if the dataset to recreate it is larger than the original. If I have to tell the program exactly what I want in detail to get the Quake source, that's more of a storage database. If I have to guide it intently to get it to output the Quake source, I'm heavily guiding it.
All decompression requires input: the compressed artifact. In this case, the compressed artifact is the semantic cues necessary to extract the Quake inverse square root function.
> especially if the dataset to recreate it is larger than the original
Many types of compression produce a compressed file larger than the original for input data that is not easily compressed. Just because a compressor is bad at compressing (some) inputs, doesn't exclude it from being a compression algorithm.
You would not need to produce a perfect copy. A fansub of a movie is considered a derivative of the movie, while being a far cry from an actual copy of it.
As a subtitle is to a movie, the quake "output" might be much smaller than quake itself.
I know absolutely nothing about IP and even less about compression but aren’t compression algorithms usually run on copyright protected material with the consent of the rights holder or authorized licensee?
This seems to completely ignore the fact that we've seen Copilot regurgitating exact copies of existing code, and even with the incorrect license attached when it was asked for it. [0]
Agree. I've written a comment on her blog about this. Hoping she'll enable it. I've published an opinion piece on the subject matter myself: https://rugpullindex.com/blog#BuiltonStolenData
That function exists in hundreds, if not thousands, of GitHub repositories. The function is so well known it has its own Wikipedia page. If there is a more famous function in computer science, I don't know what it is. The fact that a machine trained on GitHub repositories might reproduce such common code is not alarming or surprising to me. I think people are using this as an example and implying it's happening all over the place, but I've yet to see another example like it.
It also has no relevance for the discussion at hand. Yes, Github can display all of its content – that's kind of the point of it.
But Copilot doesn't exist to show you random code snippets for the sole purpose of showing them.
Using this copyrighted material to create derivative works is a completely different use case, and not covered at all by the Google Books ruling, or any other I'm aware of.
What about Google Books Ngram Viewer? Isn't that a derivative work based on copyrighted content? It's more than just a search or preview: it contains both novel information and snippets of existing content. Is a linguistic corpus a special case?
The research value actually turned activity that would be infringing into activity that was not infringing. Take away things like the ngram viewer and Google Books infringes.
> The output of a machine simply does not qualify for copyright protection
"Simply"? If it were that simple, surely that would mean that the output of the Unix "cp" program would not qualify? What about a DVD copier?
I'm OK with copyright as it used to be, back when I was a teenager; the right expired with the author's life. Corporations couldn't own copyrights. There was no burden on the author to register their rights. And copyright was a civil matter; you sued for actual damages. Infringement wasn't a crime.
I'm not OK with modern copyright law, with criminal penalties, rights that can be transferred to entities that are essentially immortal, and copyright terms that keep getting extended, just before Mickey Mouse and Elvis Presley become public domain.
>> The output of a machine simply does not qualify for copyright protection
> "Simply"? If it were that simple, surely that would mean that the output of the Unix "cp" program would not qualify? What about a DVD copier?
It means that the output of "cp" does not qualify for copyright protection as a derivative work. The output is still a copy of the input and would be subject to the same copyright as that input.
Roughly, a derivative work is a new work that incorporates some copyrightable elements from a previous work. The derivative work gets its own copyright separate from the copyrights of those incorporated elements.
> Roughly, a derivative work is a new work that incorporates some copyrightable elements from a previous work.
By this logic:
- someone could go and copy functions and/or entire files from GPL code bases and use them under a different license.
- someone could use Copilot or similar to learn from all available GPL code. Is the resulting code GPL?
- someone could use Copilot or similar to learn from the open source code of competitors whose license doesn't allow them to use it. Are the results legal?
The definition of a "derivative work" as stated is correct.
The copyright status of a derivative work is a separate issue: A derivative work can be considered infringing, and a derivative work can be considered non-infringing (ie. due to Fair Use).
Not necessarily. The object code produced by a compiler might best be regarded as a different form of the source code, even though it is not a copy or a new work, according to
> I think they meant “when the output of the machine is significantly original/transformed (i.e. is a new creative work).”
No; the author spends some time asserting that the output of a machine is inherently not a creative work:
> Machine-generated code is not a derivative work
> the argument that the outputs of GitHub Copilot are derivative works of the training data is based on the assumption that a machine can produce works. This assumption is wrong and counterproductive.
> This means that machine-generated code like that of GitHub Copilot is not a work under copyright law at all
The machine is performing some sort of transformation. Whether you want to call that a "(creative) work" or not is irrelevant to the point I was making: The machine is doing something to the input, and that makes the 'cp' comment I replied to kind of silly.
I think we can assume the author doesn't believe that piping bytes through the 'cp' command automatically removes copyright (as the person I replied to suggested).
This is not the black and white issue that the article implies it is, with statements such as "Machine-generated code is not a derivative work".
Imagine a web scraping robot that just grabbed textual news articles and spit them out verbatim to searchers (without giving credit or linking to the original). That is obviously copyright infringement, even though it is done by a robot.
Now imagine it does slight modifications to the text, using a thesaurus and maybe a bit of AI. It might substitute "is able to" for "can", or "frequently" for "often", but otherwise everything is left as is. Is that "machine generated"?
Same goes for a hypothetical bot that scrapes existing music, and after listening to "He's So Fine", comes up with the melody for "My Sweet Lord" (as per the famous George Harrison copyright case from the 70s). It isn't off the hook simply because a machine was involved. If it truly "learns" what makes a good melody, and uses that to generate a very different melody (that might be equally similar to a dozen different songs), that's different.
There is a full spectrum between simple bots that copy verbatim, and something that "deeply learns" and then writes a new article, or generates new source code, or writes a new melody, or whatever.
I don't have a strong opinion on GitHub Copilot since I haven't really studied what it does and therefore I don't know where it lies on that spectrum, but this article is not useful if the author doesn't really explore the nuance, and treats everything as absolutes.
(and I should say, I am very much of the opinion that copyright law as it is, is hopelessly broken, and I am always glad to see things like Copilot just so we can see it demonstrated why. But that is veering off topic...)
Yes, similarly I could definitely create a simple ML algorithm and feed it a single codebase to learn. Then it would be possible to predictively output code, such that it reproduces that entire codebase verbatim.
There's nothing magical about a complex copying method that makes that less of a copyright infringement.
There may be some threshold where it becomes fair use, but I agree with you that it's not cut and dry at all.
The argument that someone could create all possible works of art or music or whatever and copyright them all is a ridiculous idea, from someone who doesn't understand exponentials.
My less lofty personal gripe with Copilot is as follows. I worked hard to produce quality code. GitHub will make money off my code. Copilot users will make money using my code. I - the creator - will make nothing.
At the very least, I should have been asked whether my code can be used by Copilot, and I should get at least a share of the profit Copilot generates every month, where my share equals my code divided by all the training code used by Copilot. The latter part could be gamed by other developers in the future, but it's the best I could come up with.
Did you benefit from reading code in your education? Pay it forward! You will benefit many people; don't cut the rope from under you. And in turn you will also get the same benefit, adapted to your needs.
A determination of fair use does take commercialization into account so this is a fully valid concern. GitHub is explicitly looking to profit from the work of others.
If my code isn't released under a permissive license, then I might have the expectation that those wishing to use my code for commercial purposes will contact me and pay for a commercial license.
This is sort of the whole point of non-commercial licensing (and often, of the GPL itself, since many potential licensors don't wish to deal with GPL restrictions).
Sure but did you have the expectation that people wouldn't read your code and learn from it? I think even non-commercial licensing can't prevent that.
If your code is so super-special that you don't want people to read it and go "ah that's a neat linked list reversal algorithm" or whatever then your only options are software patents or keeping it entirely closed source.
Maybe trade secrets, but they tend to apply in very very limited circumstances. I doubt any software would qualify.
There are cases where the network can reproduce code verbatim, just like there are some functions that a good programmer may know by heart. However, this is not how Copilot normally works; it acquires understanding of code, not a copy-paste library.
Well, that's overstating things: it's kinda an open question under what circumstances GPT will learn text by heart. However, it seems to be extremely rare.
(Human creators also sometimes reproduce things they've read verbatim, without noticing.)
Would you consider a Markov chain to have an understanding of the input data? It too sometimes reproduces training data verbatim, but mostly produces remixes of many short snippets instead.
On the spectrum from Markov chain to Human comprehension, I'd submit that GPT is much closer to the former than the latter.
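That verbatim-reproduction behaviour of a Markov chain is easy to demonstrate; a toy sketch (the corpus and function names here are purely illustrative), where a tiny training set leaves only one continuation per context, so the chain regurgitates its input word for word:

```python
import random
from collections import defaultdict

def train(tokens, order=2):
    """Build an order-n Markov model: context tuple -> possible next tokens."""
    model = defaultdict(list)
    for i in range(len(tokens) - order):
        model[tuple(tokens[i:i + order])].append(tokens[i + order])
    return model

def generate(model, seed, length, rng=random):
    out = list(seed)
    for _ in range(length):
        nexts = model.get(tuple(out[-len(seed):]))
        if not nexts:
            break
        out.append(rng.choice(nexts))
    return out

# With a tiny corpus, every context has exactly one continuation,
# so the chain reproduces the training data verbatim.
corpus = "the quick brown fox jumps over the lazy dog".split()
model = train(corpus, order=2)
print(" ".join(generate(model, ("the", "quick"), 10)))
# -> the quick brown fox jumps over the lazy dog
```

With a large and varied corpus, most contexts have many continuations and the output becomes a remix of short snippets instead.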
I would consider a Markov chain to have "the sort of thing that understanding is a more complex version of" of the input data. Markov chains compress inputs. Understanding is compression. So I'm willing to bite that bullet.
I submit that GPT is much closer to the latter than the former, and may even overtake it in its domain. It lacks the hardware that the human brain uses to reflect on its knowledge and understand what it is thinking. But purely as an "overtuned intuition" (just as AlphaZero's network forms a quite capable chess AI without even using tree search, by "taking the move that feels right to it"), I think it's on par with human intuition, or even stronger, precisely because it cannot rely on reflection like humans can.
(In other words, I think there isn't a spectrum of Markov chains to human comprehension. Human comprehension is a different kind of system. However, there is a spectrum of Markov chains to human intuition.)
The point is that if you are happy allowing people to learn from your code, you should be happy allowing an AI to learn from it.
After learning from your code people may be able to write some of down verbatim (if they have a good memory). The same is true for AI.
Just because they both can do that, does not mean that they will do it - Github is already working on informing users when suggested code is a verbatim copy.
So in conclusion, it's unreasonable to object to your code being used for training an AI when you are ok with using it to train a person. It isn't a valid objection that the AI has learnt some snippets by heart, because people can also do that when reading your code.
I'm okay with reuse and commercialization as long as the licensing terms of my code are adhered to. That means proper attribution, distribution of copyright notice and license, and making modified code available to users. Copilot does none of that.
I think people have this impression that copypasting code is all Copilot does. From what experiments people have made, copypasting code is an extremely rare phenomenon.
I'm familiar with training and using ML models. I'm also familiar with the ways such models can encode their training data in the models themselves, hence my criticism.
My understanding is that Github's argument is that their use of the code to train Copilot is fair use. As such, whether the code in question has been released as open source only matters to the extent that it makes it more convenient for Github to access it, but the argument would work as well for a proprietary codebase.
Edit: I just skimmed the copilot blurb again, they seem to refer to "publicly available" sources and not open source code as their input.
>it suggests that even reproducing the smallest excerpts of protected works constitutes copyright infringement
Actually, it can be. It has to do with whether the small excerpt copies what could be called "the heart of the work", which in the case of code I would argue is almost always what you are after. No one's gonna copy the indentation style, boilerplate around functions/blocks, punctuation, etc. You always go for the "functional" part of the code, which is definitely "the heart of the work".
The heart of Carmack's fast inverse square root lies in its selection of a particular set of constants and operations that happen (i.e. were designed) to approximate the inverse square root without taking an expensive path. Copyright law would look at this novelty; I don't think it would argue around "the use of subtraction and multiplication in a computer program", as that would be plainly stupid.
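For reference, a Python sketch of that widely documented algorithm. The magic constant 0x5F3759DF is exactly the kind of unique, non-optimal value that signals copying rather than independent derivation:

```python
import struct

def fast_inverse_sqrt(x: float) -> float:
    """Approximate 1/sqrt(x) using the famous bit-level trick."""
    # Reinterpret the 32-bit float's bits as an unsigned integer.
    i = struct.unpack("<I", struct.pack("<f", x))[0]
    # The magic constant: a non-obvious choice no one would land on
    # independently, which is why its presence indicates copying.
    i = 0x5F3759DF - (i >> 1)
    y = struct.unpack("<f", struct.pack("<I", i))[0]
    # One iteration of Newton's method refines the estimate.
    return y * (1.5 - 0.5 * x * y * y)
```

A different, equally valid constant (or none at all, with a library sqrt) would compute the same mathematical result, which is why reproducing this exact value is strong evidence of provenance.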
I am surprised that someone who is supposedly an expert in copyright law does not know about this (or pretends not to), and not only that, but actually suggests the opposite. This is copyright 101, come on.
> What is astonishing about the current debate is that the calls for the broadest possible interpretation of copyright are now coming from within the Free Software community.
It is not astonishing at all given:
* proprietary codebases have not been indexed by Copilot (at least not by the public version of it)
* code arguably derived from copyleft sources will be used in proprietary programs
Yah, not sure what is astonishing about outrage in response to what appears to be a method for laundering GPL'd software.
Copilot ought only to have indexed public domain, WTFPL, and other wide-open licensed software. They should remove all GPL'd software from their model, even if that means retraining from scratch.
It's not just GPL; they arguably should remove MIT, BSD and most other open source software too, as it's hard to tell when any given snippet crosses the threshold where the original license demands attribution or other things. People seem to forget that even the MIT license has actual conditions in it.
If they truly believe copilot does not produce derivative works, then there is no downside to indexing their own code in its entirety; it would probably improve copilot's behaviour.
Well, Microsoft, show us you believe your own arguments!
Does anyone know which codebases got included? I get the impression Copilot scraped GitHub, but as it's an internal tool, did it only scrape public repos, or have private repos also been slurped?
> ...some commentators accuse GitHub of copyright infringement, because Copilot itself is not released under a copyleft licence...
This is not why. The issue at hand, as I understand it, is that people using Copilot will potentially end up with code snippets in their work that are already under licenses they don't know about, and that they will consequently fail to license properly.
That's in the first paragraph. If you enter this discussion with an incorrect presumption from the outset I don't see how you can form a valid defense.
> However, by doing so, the copyleft scene is essentially demanding an extension of copyright to actions that have for good reason not been covered by copyright.
No. Nobody is asking for an extension of copyright protection; we are asking for the existing reach of copyright to be respected. We built our licenses based on a ruleset that we were told is fair. You don't get to violate rules you made and then claim that copyleft people only made their licenses as a workaround to copyright and so are being hypocrites.
> Others focus on Copilot’s ability to generate outputs based on the training data. One may find both ethically reprehensible, but copyright is not violated in the process.
The arguments I've heard are not that Microsoft is using publicly available information to train its AI. The argument is that people are potentially (and in some current cases demonstrably) getting copy-pasted code snippets from licensed software. If you can't see the plainly obvious problem here, it's because you're trying not to.
Also, a point made in the article, that machine-generated things cannot be copyrighted because copyright requires a creator, brings up an interesting question as to whether works by people who used Copilot can be licensed at all.
I've said this before, but I hope the issue isn't infringement per se, but that the produced code isn't automatically GPL'ed. The author argues that machine generated code isn't copyrighted and this is good because it essentially fits the "data wants to be free" mentality, but I'd say tell that to the people who use it. Will they, after using something derived from open source, have to open source their code? No, they won't. If anything, this finally provides closed source developers with what they've always wanted, a means to rip open source code without having to return contributions.
Julia Reda hints at that last bit as being an issue but only in a parenthetical. To the author, that literally is the whole point. Do people not remember the Free Software vs. Open Source debate? Or GPL vs BSD? The requirement that derived works also be free is literally the important bit in Free Software. This only fits the mentality of "data wanting to be free" if your model of that idea includes the permissive sensibility and doesn't care about actually changing the state of things, which is making free software more widely used in the world over proprietary software.
> Copyright law has only ever applied to intellectual creations – where there is no creator, there is no work. This means that machine-generated code like that of GitHub Copilot is not a work under copyright law at all, so it is not a derivative work either. The output of a machine simply does not qualify for copyright protection – it is in the public domain
This is fantastic news.
I'm going to create a bot that crawls sites like GitHub searching for popular libraries. Then it will copy them, sans any license, to its own website, where it will sell these libraries under a new name.
Since there is no creator here, just a piece of software, then there is no copyright violation. My system simply is "inspired" by the original source code using a proprietary algorithm that I call "Copy and paste".
I'm open to accepting venture capital for this project.
That looks to be as much about exceptions for education as about photocopying. Purely from a quick read of the Wikipedia article’s summary, it looks like a lot of it hinges on the interpretation of Section 52(1)(i), https://copyright.gov.in/Documents/CopyrightRules1957.pdf#pa..., “the reproduction of any work— (i) by a teacher or a pupil in the course of instruction; or (ii) as part of the question to be answered in an examination; or (iii) in answers to such questions”.
It's not news, it is fantastic, and you'd do well to understand it.
The output of copilot is machine-generated and is not subject to copyright. Microsoft cannot claim copyright on what it generates. That does not affect my rights, or anyone else's. I can claim copyright on what I write, and neither copilot nor your stupidity diminishes my rights.
MS may argue that what copilot copies is small enough that I have no copyright on that, and win in court. You may put forward the same argument but I think your fate in court would be different.
> The output of copilot is machine-generated and is not subject to copyright. Microsoft cannot claim copyright on what it generates. That does not affect my rights, or anyone else's.
What about when I, a developer working on a proprietary codebase, blindly commit the output code into our product? Have I created a derivative work or, worse, plagiarized?
Right, which is why it feels like a bad-faith argument in this case to say "a machine can't produce copyrightable code, etc."
We aren't talking about a server sitting in a Microsoft data center shuffling code to another server without human intervention. We are talking about a tool that helps developers create code -- code that is "copyrightable and under another license", and thus in violation.
I'd expect that since that's your intention, as described above, it is copyright infringement and you would be the actor. In the GitHub case, it isn't the primary intent but rather a byproduct of the goal of helping another developer? Not a lawyer, but I've spent enough time with lawyers to know that what you're describing won't fly. I don't even know if what GitHub is doing will fly; maybe they're hoping it gets tested.
I think a good deal of engineers here should familiarize themselves with Julia Reda and her work and ask themselves if they have the legal knowledge to debate on this matter. Common knowledge is not acceptable to determine truth.
Would you really respect the opinion of some dude who's only used Excel about your profession?
Won't work. This tool is attacking them like the presence of a vegan attacks some hardcore meat eaters. They might realize deep down that this is not an argument they can win but it offends their core existence in some ways so they can't help but die defending their incoherent arguments.
Ethical or not, it's clear Microsoft isn't going to get into real legal trouble due to this, and if the tool is genuinely useful, it's going to "allow the laundering of GPL code" into companies, whatever that means.
If that offends people, then they'd better learn the lesson and not produce open source any more. I'm not happy, but if that's the direction the natural progression of things takes, whatever, let's see where that goes.
She cites GitHub's smallest-excerpt claim as her reasoning, when we already know that the tool happily reproduces entire functions, with comments, verbatim.
Also, her claims about machine-generated code have a really funny interaction with the cp command. Clearly "cp MicrosoftWindows11Source.zip FreeWindows.zip" is not a creative process; cp is a command executed by a machine, hence the contents of FreeWindows.zip are now entirely public domain. Man, where was she when people were sued over creating entire libraries of public domain movies using BitTorrent?
Just like you accuse her of being out of date with recent findings, y'all seem conveniently out of date with GitHub's assurance that they will be adding checks to not regurgitate full chunks of code. So what exactly is your point then?
Julia is one of the few MEPs who properly engage with issues of copyright and is active in IT. I really appreciate it, even if I don't always agree with her.
But she doesn’t seem to have engaged - she seems ignorant of basic facts of what the technology is doing in practice if you read the other comments here which give specific examples.
I've seen way too many screenshots of dozen-line complete XHR wrappers being suggested [1] to complete a function to imagine Copilot as a generative machine. It's a somewhat fancy copy-paste engine, with phenomenal search. But the source material is smuggled through enough complexity & machinery to obfuscate any legal obligations that might be attached to it.
The article does not set itself up to address this at all:
> Since Copilot also uses the numerous GitHub repositories under copyleft licences such as the GPL as training material, some commentators accuse GitHub of copyright infringement, because Copilot itself is not released under a copyleft licence, but is to be offered as a paid service after a test phase.
I'm all for discussion of whether Copilot itself has to be copyleft. But to me, the immediate concern is that Copilot seems like a way to take copyleft works and remove the copyleft license from those works.
GPL does not permit you to copy source code without attribution. This copier does not provide attribution.
As I just said, I'm not so interested in debating the source-code-copier's licensing. I think it could go either way, but I don't really care. The copied source code that the source code copier copies is what interests me, and I feel like the stochastic parrot act they are pulling is massive, sinfully evil bullshit without attribution. The stochastic parrot can't just ignore all the licensing of what it parrots out.
Please copy my code. Reality is I'll be gone in 100 years tops and I'd be more than glad if my crappy code actually helps someone.
As for attribution, we all learn by looking at code from all kinds of licenses. Between Stack Overflow, projects hosted in GitHub, libraries that sit on our vendor directories and even closed source projects there's a lot that is carried over to new projects without attribution.
We're heading to a world where most projects are basically libraries glued together anyway. Standing on the shoulders of giants and all that.
The dream of an omniscient pair programming buddy is slowly coming to fruition and I for one welcome.
Copilot is just a tool, fancy search engine for the code that's available online. Projects should be judged by the way they use Copilot just like I'm judged if I misuse my car.
I couldn't care less whether my name is shoved in some ever increasing CONTRIBUTOR.md file that no one but machines will read.
I'm actually going to start documenting blocks of code more thoroughly so Copilot can better infer what each block does.
Yes, and you did no further research on your own, since GitHub already said it's going to fix that (and a competent engineer would know it's trivial to fix as well).
I'm not a copyright expert but I wonder about an implication of two of this author's points:
- Reading and remembering-about (like reading a book yourself) things does not infringe on copyright.
- Copyright does not apply to the output of mechanistic code generation (as opposed to the human-written code that generates the code).
So where does that leave the quake snippet (setting aside its own release as open source)? Assuming this technical description is correct, Copilot does not contain the code, just the correct weights to contextually reproduce it perfectly. Copyright does not apply to the chunk that Copilot produces, so does the code simply exist as Copilot created it without license? If that is correct, what are the limits? Could I train a ML algorithm to reproduce binaries from context and, if those produced binaries happen to be identical to other copyrighted products, then it's fine?
> If it were not possible to prohibit the use and modification of software code by means of copyright, then there would be no need for licences that prevent developers from making use of those prohibition rights (of course, free software licenses would still fulfil the important function of contractually requiring the publication of modified source code).
The parenthetical backpedaling here is the entire point of copyleft. If it wasn't, copyleft wouldn't exist -- people would just release their software as public domain.
The opposite of "copyleft" isn't "copyright".
The opposite of "copyleft" is "never published", in which case, copyright is irrelevant.
There is plenty of commercial closed-source software based on software released under permissive licenses like BSD, MIT or Apache, because they are not copyleft.
I’d argue that copyright is still relevant when the source code isn’t published. It’s not too difficult to copy an algorithm from a binary even if you don’t have the source.
Really having trouble getting the unrelenting hatred here on this site for something that's fundamentally new, clearly represents progress, and is obviously a trend that's here to stay. <CrazyIdea>Maybe the laws, licensing, etc. need to, you know, adapt, change, and evolve a little bit with time also -- just as everything else changes with time.</CrazyIdea>
I frankly think that the "free culture" label and the extremely permissive licenses of many open source projects are nothing but a redistribution of wealth upwards. Those with existing capital can make profitable unfree derivative works without any benefit to the original authors. This relationship must go both ways if you want actual free culture. Stop producing MIT/BSD code in your non-work time.
This is not a research project, this is a commercial work that produces verbatim copies of code without disclosing its license (or having a license grant in many cases). It doesn't matter how it manages to reproduce it either. It does.
> extremely permissive licenses of many open source project are nothing but a redistribution of wealth upwards
Although I license all my amateur work GPLv3, I dispute the assertion that more permissive licenses are "nothing but" a redistribution of wealth upwards.
Permissive licenses commoditize their features. This is of benefit to everyone, but organizations with more capital are better positioned to leverage that commodity, and typically when they do, they do so selfishly. This further centralizes value with them, but I believe this is still a better outcome than a closed license because of the educational/cultural/technical benefits to everyone of the open license. The capital leverage problem is orthogonal to that.
Copyleft licenses kind of do the same commoditization thing, but explicitly only for share-alike uses, which is why they're the only way to grow open culture relative to proprietary culture: they're designed to deny "freeloaders" on open culture by requiring all derivative works to also be open.
> without any benefit to original authors
Benefiting the original authors is not (directly) the point of open source.
MIT: "Here's some code, go nuts."
GPL: "Here's some code, go nuts, but share the nuts if you do."
There's zero value capture for the original author built into these licenses. If the original author wants to ensure they capture the value from their work, I recommend using a closed, proprietary license. Just be aware they're applying friction to the overall technological development of humanity by doing so.
This is the closest anyone has ever come to convincing me to use GPL instead of MIT license.
But I still want to support small developers with anything I produce for fun, and I'm not willing to give that up to spite the big developers.
For instance, I wrote a small class to load OBJ files in Unity because I needed it for an idea. I went ahead and put it on Github for others that need it, too. I could easily see someone having an idea similar to mine that needed that and couldn't find it out there. (I think there are more libraries like that now, though.) I wanted them to feel comfortable using it, even if they eventually make money with their game.
If a big corp uses that code, too, that sucks. But there's no good way to draw that line in a license, so I didn't.
Having said that, in the future I could see releasing some software that I don't think anyone should profit from, and in that case I'd GPL it. Previously, I'd have just defaulted to the same MIT license. I'm just not sure what that'd be yet.
> This is the closest anyone has ever come to convincing me to use GPL instead of MIT license.
> But I still want to support small developers with anything I produce for fun, and I'm not willing to give that up to spite the big developers.
Another line of thought that may help you choose a license: do not think only about other developers. Think about the final users of your code, who will be running your algorithms on their computers. The GPL protects the right of these users to see and modify the code they run (your code). So-called permissive licenses, on the other hand, let middlemen strip this right from your users.
Users of your code are in fact freer thanks to copyleft licenses.
> No, but it does force them to GPL their own code if they do. And that's a no-go for most companies.
There are many companies that release source code. I don't know enough to say whether that makes the word "most" in your sentence false or not.
Anyways, a company using GPL code need not release all their own code. Just their modifications to that particular GPL program. And then again, only if they do intend to distribute the modified program.
I don't really expect any benefit when writing some pieces of code. These days I just release certain types of code into the public domain, because if it's MIT then BigCorp Inc. will just make my name one in a really long list of contributors, and I won't get any benefit either.
Amazing how this was never an issue when other "AI" systems used other people's data to learn how to drive cars or write text. But the moment you start messing with developer data, suddenly there are ethical issues! Amazing turnaround.
Face it: AI as we currently call it is just a very sophisticated data-sorting algo in most cases (let's ignore the AlphaZero unsupervised-learning type). Everyone was celebrating when the Common Man was destroyed by devs commoditising his knowledge through data capture. But now suddenly it's a problem! Mess with a man's pocket...
These sorts of discussions always puzzle me. Copyright is not an objective thing; it's a contract that supports the cash flow of a creator-distributor-consumer chain, given that the former two assumed it would cover their expenses and return the profits they expected. If your AI produces Beatles-like or even better music based on Beatles albums (and/or some more), an AI-aware judge may (or may not, depending on the lobbying activity) decide that it is a direct derivative work, and since it is automated, all copyright rules apply, as in "copy" "right". There is no need for technical objectivity to exist in between, because this law is not about technicalities. What seems like a loophole may be closed easily by a court decision based on much higher matters than "similarity" or "reconstruction". If anyone can take your album at the release date and "reshuffle" a free version no worse than the original in a few clicks, that is obvious damage to the copyright holder and demotivates creating it. AI couldn't do any of that back then, so they didn't include terms to cover it; but now it can, and they will just add those terms, unless someone (MS in this case) has better lawyers who are ready to create a wide-enough precedent and drag it through all instances.
Copilot is the moment when simple functions have been commoditized, you can have as many as you like almost for free, and adapted to any project. Just spend a moment to admire the transition, it's a new stage of post-scarcity.
AI can recreate photos, paintings, sounds, voice, music, human faces, text, dialogue, math, proteins, and now code. It does all this while allowing humans to control and direct the whole process, and create original combinations. They all have no economic value to own and are free to use now, like words in a language. Enjoy!
Remember Karpathy's Char-RNN? How far we've come.
Microsoft bought the place that hosts a lot of our code and is now going to try to sell us a tool that will regurgitate it back on demand. The entire software industry is already largely built on and advanced by the unpaid labor of open-source developers; GitHub, as a popular open-source ally, could at least pretend to honor the gentleman's agreement of respecting the open-source origins of a ton of its stack.
If the tool was also open we probably wouldn't have nearly as big a problem, but I guess Microsoft has to recoup the cost of their completely unnecessary purchase.
> On the other hand, the argument that the outputs of GitHub Copilot are derivative works of the training data is based on the assumption that a machine can produce works. This assumption is wrong and counterproductive.
Wow, what a bunch of, ahem, logic of questionable quality.
"On the other hand, the argument that the outputs of Copilot are works is based on the assumption that a machine can produce. This assumption is wrong and counterproductive. It just moves electrons and hums a bit."
This is a reductio ad absurdum. The argument is bogus, because what matters is the result of a person/other legal entity using the machine and its software.
Modern AI seems more like machine-assisted collage (of pictures, code, text, etc.) than anything else. Someone (or some other algorithm) needs to be added to ensure that the whole thing makes sense. The big problem here is that when an artist creates a collage he/she knows the sources. Here, provenance is lost.
[1] Collage (/kəˈlɑːʒ/, from the French coller, "to glue" or "to stick together") is a technique of art creation, primarily used in the visual arts, but in music too, by which art results from an assemblage of different forms, thus creating a new whole.
“ If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck.”
I don’t see my license respected for code it regurgitates that I wrote, there is nothing more to this.
To understand whether Copilot is infringing the licenses of the code used in its training data, we have to get into the details of what it does and how it works. We can't make a general statement about any code-generation software that was trained on open-source code.
It is possible that at some point, maybe even in the not-so-distant future, we will have ML models so good they can understand abstract concepts, learn, and invent new algorithms and implementations by reading code. Such a model, it can be argued, learns much like a human, and hence it's not infringing any copyrights: it's not copying implementations, it's learning concepts and ideas.
But we are not there yet. When we get there we will know. Because at that point Siri would be able to have seamless conversations with you. At least half the jobs would disappear in favor of robots in a short time. The world would be a different place.
Let's talk about what Copilot actually can do. It can copy snippets of code from Github while changing variable names. It can autocomplete trivial boilerplate code. If it's automagically generating a function for you that actually does something useful like sorting an array, you can be absolutely sure that it's just copy pasting it from an existing repo with some cosmetic changes.
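To make the "cosmetic changes" point concrete, here is a toy sketch (my own illustration, not anything Copilot actually does) showing why renaming variables doesn't make a copied snippet a different program: canonicalizing identifiers makes the two versions compare equal. All names and the normalization approach are assumptions for illustration.

```python
import ast

def normalize(src: str) -> str:
    """Rename every identifier to a canonical placeholder so that two
    snippets differing only in variable/function names compare equal."""
    mapping = {}

    def canon(name):
        # Assign placeholders v0, v1, ... in order of first appearance.
        return mapping.setdefault(name, f"v{len(mapping)}")

    class Renamer(ast.NodeTransformer):
        def visit_FunctionDef(self, node):
            node.name = canon(node.name)
            self.generic_visit(node)
            return node

        def visit_arg(self, node):
            node.arg = canon(node.arg)
            return node

        def visit_Name(self, node):
            node.id = canon(node.id)
            return node

    return ast.dump(Renamer().visit(ast.parse(src)))

original = """
def total_size(files):
    result = 0
    for f in files:
        result += f
    return result
"""

# Same logic, every identifier renamed ("cosmetic changes").
renamed = """
def sum_lengths(items):
    acc = 0
    for item in items:
        acc += item
    return acc
"""

print(normalize(original) == normalize(renamed))  # True
```

Tools for detecting code clones work roughly along these lines (though far more sophisticated), which is why renamed copies are still straightforwardly traceable to their source.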
Of course derivative works are being produced!! Whether you blame Copilot or the developer using it, the result is something that required the original developer of the code in order to be constructed.
Have we reached the point where every “class X” must become “class X_GPL2_CopyrightJohnQSmith_AllRightsReserved” in every code base out there? Do we need to go from header comments at the top of a file to reminder comments at the end of every line?
How does one address the fact that 95% of software is based on the same basic tropes? At a certain level of density, all code trying to achieve a similar function to legally protected code will converge on an implementation that is almost indistinguishable. With LOC accreting exponentially, only time will tell when we reach that threshold. The Copilots of the world serve to accelerate and monetize this reality.
The idea that the debate actually does a disservice to copyleft by relying on the strictest interpretations of copyright is an interesting perspective to me, but the rest of this seems pretty weak. (Caveat that I'm no lawyer.) Copilot can regurgitate verbatim chunks of other codebases: it seems absurd to me that that wouldn't count as derivative work.
I believe it is true that in most cases it won't be infringing. Even though it can sometimes output trivial code verbatim, that output alone won't run or compile, and therefore it isn't even a work, just gibberish. Simply limiting the Copilot scraping to software with at least a few files would probably resolve that issue. Moreover, if I were GitHub I would simply change that statement regarding the legality and let the developer make the choice of whether to use it or not. More often than not, that would make this academic talk that no one cares about. A few functions here and there are not a work. People almost never try to copy something like a microkernel or anything small enough to constitute a work. Personally, I would rather treat this as a search engine. I don't copy-paste code, but I may be old-fashioned and probably the exception here.
Why is everyone ignoring what neural networks actually do? Copilot is being used for context-aware pattern matching over a search corpus, using that to predict what you will write next. Of course it's going to return copyrighted works based on what you write.
It's a pattern-matching algorithm; what exactly did they think it was going to do?
As best I can tell, people on Hacker News largely think of machine learning as some sort of statistical trick that they don't actually need to apply any further understanding towards. They see it repeat Doom code verbatim and assume it is capable of repeating any and all code it's ever seen verbatim - hence "laundering".
What they maybe aren't considering is that specific snippet is famous. It has likely been pasted thousands of times with and without attribution on public GitHub repositories.
Yes, it has seen code before. No, it didn't memorize the entirety of the dataset it was trained on. If it did - it has explicitly overfit, won't generalize to downstream tasks and ultimately failed at being useful in the general case.
Unfortunately, "we don't know" still, but what may have happened is that their transformer architecture creates a more efficient representation of the byte pair encoding representing the code. In doing so, it is able to learn about context, structure, and logic of the language it is trained on.
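For readers unfamiliar with the term, byte pair encoding is just iterative merging of the most frequent adjacent symbol pair into a new symbol. A minimal toy sketch (the function name and simplifications are mine, not the actual Codex tokenizer):

```python
from collections import Counter

def bpe_merge_steps(text, n_merges):
    # Toy byte-pair encoding: repeatedly merge the most frequent
    # adjacent pair of symbols into a single new symbol.
    seq = list(text)
    merges = []
    for _ in range(n_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(a + b)
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                out.append(a + b)  # apply the merge
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, merges

print(bpe_merge_steps("ababab", 2))  # (['abab', 'ab'], ['ab', 'abab'])
```

A real tokenizer learns these merges once over a huge corpus and then applies them to new text; the transformer only ever sees the resulting token sequence, which is what the "more efficient representation" claim refers to.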
Anyways, I think this whole thing is absurd. So far - every "atrocity" I have seen committed by copilot is easily achievable with GitHub advanced search using "code contains text".
But there's a Berne convention that kind of unifies copyright around the world, right? It's not like something can be copyrighted in one country but not another.
The Berne convention sets down some basic principles, but there are an awful lot of edge cases and it is very much the case that things can be copyrighted in one country but not another.
Heck, the duration of copyright isn't even uniform around the world.
>What would then stop a music label from training an AI with its music catalogue to automatically generate every tune imaginable and prohibit its use by third parties? What would stop publishers from generating millions of sentences and privatising language in the process?
The existing barrier we have is that, unless the music label can prove a human artist has listened to the specific song matching the artist's, there's no copyright violation. A copyright protects creators from having their work copied. It doesn't give them ownership over matching works. I'm sure there are plenty of pairs of novels with the same first sentence despite each author never having read the other's work.
If I understand what is being stated correctly: even if I assert a prohibition in my licence against my creative work (code) being used by Copilot (or any other machine-learning model) as training data, it wouldn't matter, as that use isn't covered by copyright?
> Copyleft does not benefit from tighter copyright laws
Of course it does, at least for the goal copyleft serves for RMS-style Free Software ideologues. While copyleft may be motivated by an ideology that prefers no copyright protection, at least for software, it relies on copyright maximalism to prevent nonfree derivatives. From the advocates' viewpoint, the worst situation is a copyright regime strong enough to allow nonfree software to exist, yet too weak to let them build an iron wall that keeps software written by ideological opponents of nonfree software from being used to advance it.
I think it's time for someone to train AI on leaked proprietary code and source-available code like Unreal Engine. It's cool that we have so much of it right now.
Then we'll see how fast Microsoft and others will shut it down.
> Works licensed under copyleft may be copied, modified and distributed by all, as long as any copies or derivative works may in turn be re-used under the same license conditions. This creates a virtuous circle, thanks to which more and more innovations are open to the general public.
She claims that Copilot advances the goals of copyleft, but Copilot does not create a "virtuous circle" of generating more public IP. The customers of Copilot use it to extract public work for themselves and are not compelled to contribute back.
I wonder how long it will take for the licenses to start explicitly disallowing this sort of usage. It is clearly something that many open source writers dislike, and in my opinion, rightly so.
Claiming that generated work isn't work seems completely wrong. The hard-to-argue fact is that, looking at the result, it doesn't really matter who wrote it, just how it reads and what it does.
What is lost in so much of the argument about Copilot is that someone still needs to actually verify that the code does the right thing. I have a feeling this tool does little but multiply small bugs like off-by-one errors and wreak all kinds of havoc, primarily because of false confidence in the autocomplete.
Whether Copilot itself violates GPL or not is one issue.
Whether the code produced by Copilot violates GPL or not is a whole different independent issue.
If I am walking down the street, find a piece of paper with code on it, pick it up and add the code to my program and this code turns out to be licensed under the GPL then my program becomes a derivative work. It doesn't matter who wrote it on that piece of paper, whether it's a 100% correct copy of the GPLed code or not or if there are mistakes in it.
Copilot, to me, feels like a faster Stack Overflow. We already copy code snippets from all kinds of places across the web without thinking about how it's licensed. Sometimes, we copy whole functions and files. We're responsible for understanding what's going into our project. We don't blame NPM when it allows us to import a package into a project that subsequently violates the license. I'm absolutely sure this happens more than anyone cares to admit.
Even if true, that doesn't indemnify Copilot here. There is no way to _prove_ whether the code was generated by Copilot or written by yourself. Copilot is just autocomplete, so it's still a human checking in the code. While it might not be illegal for Copilot to generate those things, it's illegal for a human to check the code in and claim it as their own.
Whether Copilot infringes copyright is a muddy area. I personally would like to think that a world where machines can be trained on any data is easier to live in than one where trained machines are tainted by the license of their input.
The interesting question however isn't whether Copilot infringes copyrights, but whether those that use copilot do.
One of the points being made is that in the worst-case scenario of getting Copilot to repeat back verbatim chunks of code from projects, something that's not its primary use case, it would be a situation similar to a copy machine.
You can copy a page out of a book, or the whole book, and be covered under fair use. But you can't sell your copy on Amazon. And if you did, neither the copy machine nor Xerox would have run afoul of copyright law; you would have.
You could also use a copy machine to copy fragments of the Linux kernel source out of a book about the Linux source and use them to construct an entirely original work that's not considered derivative.
The devil's in the details, but GitHub talks at some length about the plagiarization issue and their plans to detect and link back to where verbatim chunks exist in the training data, letting the operator decide what to do. So... IDK.
And then what license do you choose? Many licenses require you to copy the original license verbatim, which may include the author's name and the date.
Has anyone tried dumping the debugging symbols from a Microsoft binary e.g explorer.exe and tried to autocomplete^Wcopilot its functions? Would be interesting how far Microsoft could be pushed before they ate their own hat.
I’d agree with this conclusion if it wasn’t clear that it is very possible - if not common - for Copilot to just completely copy code. That isn’t fair use - that’s a clear violation of copyright regardless of license.
Huh? The article is full of waffle, but in condensed form it shills for:
(a) treating adopted code as «trivial», like i++, which is pure demagogy, because what we've already seen in that Copilot video is NOT trivial;
(b) dismissing (let me put it straight) piracy as some special case of fair use, which is valid only when the code in question appears on screen as a prop; the real code ISN'T fair use;
(c) accepting (a) and (b) above as ultimate truth just because the bogeyman of stricter copyright laws hurts FOSS. And the water is wet. This is meaningless filler, because we've known that stricter copyright laws hurt everyone since the Napster days.
So my overall impression from this reading is just... Huh?
> My name is Julia, I'm the Pirate in the European Parliament.
Don't you think our world would be a way more relaxed and flourishing place if lawyers kept their noses out of software, like they keep them out of math?
> The output of a machine simply does not qualify for copyright protection – it is in the public domain.
Is it just me or is that a patently ridiculous statement? The output of a machine belongs to the person owning/using the machine. If I use a digital camera to take a picture of a copyrighted image I'm still committing copyright infringement despite the output being created by a machine and a bunch of image processing software.
She's arguing that copyleft people are arguing for an effective extension of copyright into places IP lobbyists are currently fighting for. It's not a good framing. She's saying that we shouldn't argue for copyright to be consistent if we're against copyright - arguing that we should make a moral argument against a legal situation.
It's as if, being against the drug war, we couldn't argue against drug companies being allowed to sell heroin while other drugs remain illegal. It's a strategy argument that leads nowhere. If making machine-written works subject to copyright results in all possible songs being copyrighted by a machine, that's a good outcome: it's obviously absurd, and it weakens the entire concept.
We should demand consistency.
If this is fine, we might as well stop enforcing the GPL, too. It's a trick of copyright to further the cause of anti-copyright. I'm sure somebody can write an "auto-fork" that will digest GPL'd code and rearrange and rephrase it in order to spit out a clone.
So maybe it is best to have a separate license for machine learning? Let's call it the Copilot licence. (Maybe it is better to call it an exemption?)
You would need AGPL / GPL / LGPL / MIT / Apache / BSD + the Copilot licence before code can be used for training, knowing there is a very small possibility that some snippet of it will show up in the output.
I mean, we could debate this endlessly with no resolution unless it is put before a court.
This is kind of beside the point. Something can be perfectly legal and still unethical. The issue is that machine learning can whitewash a developer's intended license.
Or put differently, as a GitHub customer, are you comfortable with your code being used this way? Instead of a passive host, your code is now being used to create tremendous value for GitHub and Microsoft. Do you feel your trust has been violated? (regardless of legality).
TL;DR: GitHub will eventually add some kind of "data usage reporting" utility that could show which parts of your final code, made with the help of this CuckPilot, could potentially infringe copyright, with links to other known sources of those parts of the code. Then they will tell you that it is your responsibility to ensure that your final code does not have copyright issues.
"(of course, free software licenses would still fulfil the important function of contractually requiring the publication of modified source code)"
No no no. Licenses are NOT contracts. Someone who copies or makes derivative works of copylefted software which they then distribute is obliged to remain within the bounds of the license not because they voluntarily promised, but because they don't have any right to act at all except as the license permits.
Just for context, the author of the article you're commenting on is Julia Reda, an EU copyright activist and a former member of the European Parliament. While I likewise don't have much use for the legal opinion of French courts, I think we can afford to cut her some slack for focusing on legal interpretations in her native jurisdiction.
They have nothing to do with contracts and there's a simple test for it. When contracts are violated, then if there's litigation the parties consult the relevant contract law for how to proceed. When a license is violated, the parties consult whatever law the license was permitting an exception to. If you copy software without a license, you can be sued for copyright infringement. If you fish without a license, you can be sued for trespassing.
A license isn't a contract that binds the licensee, it's a contract that only binds the rightsholder. Since you, the licensee, are not relinquishing any rights in the contract, there's no need for you to agree to anything. The only rights being relinquished are the rightsholder's right to pursue legal retribution for some uses of their work that would otherwise be violations of copyright.
You don't have to call it a contract, but it is a legal document in which one or more parties legally bind themselves, which seems like an adequate definition of a contract to me, and has more etymological fidelity to the word "contract" than other possible definitions that would exclude licenses. A contract is a legal instrument by which the breadth of your rights contracts, as in becomes smaller.
OSS licenses, insofar as they are permissive and require nothing in return, are not contracts. This is often the case for simply using the OSS software. The user has no obligations whatsoever.
If, on the other hand, the licensor and licensee both have some obligations (in OSS, this is usually when you modify or redistribute the source or compiled product), then it's basically a contract, no matter what RMS claims.
I mean, with all due respect to the guy, he makes controversial claims even in the field of software engineering (and also free software evangelism), his supposed professional field. Why would you trust what he says about contract law, a field where he has no professional training whatsoever?
(That said, GPLv2 is still an ingenious work for many reasons, albeit lawyers probably won't draft it that way)
"This is often the case for simply using the OSS software. The user has no obligations whatsoever."
This is a category error when it comes to copyleft licenses like the GPL. It has nothing to say about usage.
"If, on the other hand, the licensor and licensee both have some obligations (in OSS, this is usually when you modify or redistribute the source or compiled product), then it's basically a contract, no matter what RMS claims."
No it's not. There are no pre-agreed terms, penalties for violation, expected compensation for services provided or anything like that. GPLed software is copyrighted. Copyright law says you have no rights to copy it or make derivative works of it whatsoever. The license permits you to do so.
"Why would you trust what he says about contract law, a field where he has no professional training whatsoever?"
Because, surprise surprise, he has advice from people who ARE trained in the law.
The GPLv2 text:
" The act of running the Program is not restricted, and the output from the Program is covered only if its contents constitute a work based on the Program (independent of having been made by running the Program). "
Of course this sentence contradicts the previous sentence in the text, which claims that normal usage "is not covered by this License". I presume you'd argue it supports your claim, but seriously, this is bad drafting.
> No it's not. There are no pre-agreed terms, penalties for violation, expected compensation for services provided or anything like that.
"penalties for violation", "expected compensation" are not necessary requirements for formation of a contract. The pre-agreed terms are clearly stated in the license text, or at least as clear as far as they don't contradict each other. By the way, "pre-agreed terms" are not necessary for the formation of a contract either.
> Because, surprise surprise, he has advice from people who ARE trained in the law.
Are you trained in the law? Because if you think the Internet should pay regard to somebody trained in the law (even though they may not have learned it properly) as opposed to somebody who hasn't, then I don't see why you think you have standing to speak as though you're an authoritative source on the matter.
PS: There's still a nuance that might require clarification (or am I adding confusion?) in your original quote, though:
Quote: "(of course, free software licenses would still fulfil the important function of contractually requiring the publication of modified source code)"
Even though as many others have pointed out, OSS licenses can be contracts, I'm actually not sure this sentence is correct.
When somebody uses the source code in compliance with the license terms, a contract might be formed to allow both parties to enjoy rights. However, if one party never complied with those terms and breaches them (eg. distributing source without retaining copyright notices), then arguably no contract was ever formed, and the act is a simple matter of copyright violation and not a "breach of contract".
Hope I'm not splitting hairs.
Disclaimer: learned English common law a bit, not a lawyer.
You're right. I was mistaken -- I thought he was referring to those RMS claims that in general the GPL is not a contract.
In Moglen's article about enforcement, I think he's right that where there's a breach of GPL there is no contract. In fact that's what I said also in my follow up reply.
This is a supposedly progressive politician: young, in an advanced country, with a personal platform that runs almost entirely on copyright issues, and yet she gets almost everything wrong. What can you expect from the usual dinosaurs?
Mass processing, repackaging and then selling the data is an exploitative business these multi-billion companies run without paying anything to the people who produced the data.
> a legal expert weighs in [...] And HN rallies to criticize it
That is an appeal to authority. Being a legal expert does not excuse one's writing from critical analysis. In this case, the post does not address Copilot reproducing large segments of copyrighted code verbatim. That is valid criticism.
This discussion is heavily biased and prioritizes people's emotional need to be credited and/or paid for their work over a discussion of the legal and ethical concerns at play here. It disregards the comments of an expert in the field and focuses instead on demands that may well be unsupported by copyright law.
For example, GitHub's license, section D.4, specifically grants GitHub the right to display your content, analyze your content, and reproduce it in full to other users of the service. Yet no one seems particularly interested in discussing that here today, because it isn't compatible with the outrage that people are prioritizing on HN when discussing Copilot.
I would have expected HN to be better than Reddit in this regard, but I'm not seeing it yet. I don't know if the expert is right or wrong here, but nothing in today's comments suggests anything new or curious that hasn't already been ranted about in every prior thread about this topic. I specifically care about copyright law and it's disappointing to see HN having a group tantrum instead of a discussion.
The legal commentary I'm seeing from people who really know this stuff is pretty much unanimously in favor of this being legal in at least most of the world, based on case law, while acknowledging why some might have ethical concerns.
I'm actually sort of curious as to the vigor of the backlash. Because Microsoft? Because of concerns about perceived further undermining of the GPL in particular? Because of people anxious to get their credit? Because...?
> I'm actually sort of curious as to the vigor of the backlash. Because Microsoft? Because of concerns about perceived further undermining of the GPL in particular? Because of people anxious to get their credit? Because...?
Because this really goes against the prior understanding of what was possible with copyrighted works. So, now that this is possible for anyone, copyright will start to get examined, and hopefully updated to be useful in today's environment.
There are about a million problems with this.
This can even be used to intentionally launder source code from a competitor. Apparently, all it will take is to steal the code (or just fork it), then create more than 10 copies on GitHub. At that point, Copilot will start to emit the code during use. With all the legal commentary saying this isn't infringement, imagine how companies will be able to use this product.
Similarly, the training set can be intentionally polluted, so your competitor finds the output of Copilot worthless.
Because they’re not getting a share of GitHub’s future revenues from their works or from derivations or their work.
(Why do they care so much about revenue? Open source coders and ‘starving artists’, not to mention Covid economic wreckage, the US approach to medical insurance, and the total absence of Universal Basic Income in virtually all countries permitted to access GitHub.)
> For example, GitHub license section D.4 specifically grants GitHub the right to display your content, analyze your content, and reproduce it in full to other users of the service. Yet no one seems particularly interested in discussing that here today, because it isn't compatible with the outrage that people are prioritizing on HN when discussion Copilot.
Well, Copilot isn't really an analysis and display of the source code within the original meaning that people held. That was meant more for running CodeQL, GitHub Actions, and other analyses while presenting the results in a repository. People never anticipated that GitHub would strip their licenses from files and present their source code inside of VS Code for people to use freely. It may be legal, but what we are seeing now is an abuse of the sentences you just quoted, going beyond what they were originally understood to mean.
Is it fair use to remix two musical albums into a new derivative work, that cannot plausibly be judged to replace demand for either original work?
Is it fair use to autogenerate GIFs from movies, perhaps the most protected digital works on the Internet today, in order to use them as reaction memes on Imgur?
Is it fair use to autoextract code fragments from a code base, in order to use them as suggestions on GitHub?
The Internet, and I imagine HN, was in an uproar when the music industry attempted to kill the Grey Album, because it infringed on their freedom to remix and derive.
The Internet, and I imagine HN, was in an uproar when MLB attempted to kill unauthorized baseball GIFs and replace them with official curated ones, because it infringed on their freedom to remix and derive.
How, precisely, is remixing and deriving from code ‘abusive’, in contrast to the past ten or twenty years of pressure on the Internet to the contrary when remixing and deriving from music or movies?
This is a core point of the original post linked above, where the author is shocked by our demands for more prohibitive copyright interpretations, and I want to call this out more bluntly and less politely than they did:
Fair use of a work is almost always perceived as abusive and unfair by the creator of a work. Creators ignore the cognitive dissonance between their demand to have fair use rights granted more easily to the protected works of others, and their demand to have fair use rights granted less easily to their own protected works.
I see that dissonance go unaddressed in every top-level comment in today’s discussion. I see that desire to deny fair use rights driving hundreds of emotional me-too posts, without considering the framing of whether it is fair use in alignment with every prior copyright outrage we’ve discussed over the years.
My theory is that permitting discussion of fair use would weaken their efforts to groundswell a pitchfork mob, and no one wants to confront their own biases or emotional investment or inability to profit from their code.
Whatever the motivations, HN deserves better than this.
The Grey Album is the only example comparable to what Copilot is doing. And even this is tenuous. It was a one-off and even though the copyright holder EMI did not give permission, the creators of the content remixed were happy with the re-use. Moreover, Danger Mouse could have sought a statutory licence that only applies to music under US law, whereas no such thing exists for code.
None of the other examples match up because the GIFs are not being compiled into films. The remixed works are in a different field of endeavour.
If Copilot were being used to show snippets that scroll across the screen in hacker films, or used by musicians to rap a few lines of Rust code, that would be palatable to the copyright holders. It is transformative and very likely to be fair use.
If Danger Mouse, who created the Grey Album, had instead started a business selling access to a tool that splices in copyrighted music and video based on a clip that the user provided, facilitating widespread, systematic infringement, creators would have been far less sympathetic, and EMI far more persistent in their legal attempts to shut it down and collect damages.
You ask a very insightful question. Let me see where I end up running out the analogy in a certain direction.
If Danger Mouse sold a remixing tool that enabled widespread remixing of any and all albums, would DM be profiting illegally from the content of others?
In each individual case, the remix album produced would have to pass the fair use tests, and if the user produced a sufficiently close replica, they could be restrained from distributing it. But that wouldn’t implicitly be the remixing tool’s fault, unless it mechanically reproduced a complete protected work with the user completely unaware it was doing so. A dedicated user can make any tool produce a protected work, so we have to aim for the narrow window of user-oblivious in order to fault the tool.
Translating back to Copilot, this then becomes the question: can Copilot regurgitate an entire protected work for a user who then sells that work, with the user being fully unaware that they have reproduced a protected work without meeting fair use terms, such that Copilot is responsible?
Copilot requires user prompting to emit code, and seems to draw the line at around the single function boundary, so reproducing an entire codebase becomes exponentially less likely as the number of functions increases.
So if there were a weakness in Copilot’s defense, it would be in small single-function programs. At that point a parallel to another music case comes to mind: the person who generated and copyrighted every single musical phrase in Western major/minor, to prove that the law as written is not applicable when the total size and complexity of a given work falls below a certain threshold. I thus assume that Copilot is essentially protected in the single-function case: it doesn’t matter if you hold a protected work for (‘four’ (2 2 +) func), because that’s so simplistic that any human might reproduce it at any time unaided, and a claim against them would fall flat once a judge applies a common-sense threshold. It’s a high bar to expect a judge to recognize this analogy and understand code well enough, but I think the combination of the user intent required to break fair use in complex multi-function systems, and the practical impossibility of enforcing protection over snippets in the music sense, would shield Copilot from being judged liable and owing damages.
(General disclaimer applies: I am not your lawyer, please seek legal counsel before making use of my opinion, etc.)
You wrote 11 paragraphs about how HN deserves better than what everyone except you wrote. Yet in your reply you didn't address the comment to which you replied.
If it helps you parse my reply, consider that fair use explicitly intends to allow unpredictable derivatives that might otherwise be rejected by the copyright owner of a work, and so most of my reply anchors directly to this final paragraph of yours:
> People never anticipated that github would strip their licenses from files and present their source code inside of VSCode for people to use freely.
I can’t offer you a more detailed mapping of my reply onto exclusively your talking points, as I didn’t consider that a viable constraint. My original point at the top of the thread remains clear in my mind as I try to provide an example - using my own questions and concerns, after objections in the past that I wasn’t! - of what a better, more reasoned, more curious, more worthwhile conversation looks like on this topic.
I do accept that not everyone desires to see the change in tone I’m trying to represent here, and no doubt I have been imperfect in my efforts to represent it. I’m sad that this isn’t connecting for you, even though I accept each time I try this that understanding and agreement are never universal. Thank you for your effort in trying to understand all the same.
> D.4 specifically grants GitHub the right to display your content, analyze your content, and reproduce it in full to other users of the service
If you read the section carefully, this covers the right of GitHub to do those things to your content "as necessary to provide the Service". "It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service".
So, does "Service" only cover the kind of service GitHub offered at the time of the agreement, or does it allow GitHub to invent all kinds of unrelated services and use the code for them? If GitHub can provide a "Copilot" service that arguably "learns" from the code, can it also provide a service that blatantly "copies" large pieces of source code for the user (without complying with OSS license terms)?
It's not very clear what the answer would be, but if what I described is allowed, a term this broad would imply that if you're not the copyright owner of code you uploaded to GitHub, you've probably violated some OSS license merely by agreeing to GitHub's terms.
Which OSS licenses are potentially incompatible with GitHub? Are they also incompatible with GitLab? How can one or the other be judged to have exceeded the bounds of what is permissible as a user-generated content provider, and/or fair use rights, in the legal jurisdiction of each?
> For example, GitHub license section D.4 specifically grants GitHub the right to display your content, analyze your content, and reproduce it in full to other users of the service. Yet no one seems particularly interested in discussing that here today, because it isn't compatible with the outrage that people are prioritizing on HN when discussing Copilot.
How applicable is the GitHub license when a lot of code on GitHub (say, the Linux kernel) was posted there by people other than the individual copyright holders? I'd assume GitHub can only rely on the open source license of the code in question, and not on additional license terms. As far as I can tell, GitHub claims fair use rather than citing their license.
That's perhaps the most important question of this entire debate, and it's the one that no one is considering seriously here in the comments. I personally think that it's because no one at HN is both competent enough at copyright and licensing law to debate it and willing to spend time debating it with Internet commenters for a $0/hour wage.
If an oversight is all the excuse needed to dismiss considering anything I’ve said that you find unpalatable, then I can save you trouble and instruct you to dismiss everything I ever say, now and in the future, as I am merely human and will continue to be imperfect forever. I don’t generally intend to post this disclaimer on every comment I make, as this is a standard human condition defect in all of us, but I hope this one-time exception allows you to reject my opinions and move on to other discussions with a clear conscience.
Whatever the law, when does learning from what we read devolve into plagiarism?
The poster child for this category would be those programs that generate nonsense English text that recognizably resembles a known author. They choose the next character at random, conditioned on the preceding characters. Too short a context, and the results are gibberish. Too long a context, and the results are plagiarism.
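That tradeoff is easy to see in a toy character-level Markov model. The sketch below (names are my own, not from any particular library) maps each context of k characters to the characters observed to follow it in the training text; with k large relative to the text, every context has a unique continuation and generation simply replays the original verbatim:

```python
import random
from collections import defaultdict

def build_model(text, context_len):
    """Map each k-character context to the characters seen to follow it."""
    model = defaultdict(list)
    for i in range(len(text) - context_len):
        context = text[i:i + context_len]
        model[context].append(text[i + context_len])
    return model

def generate(model, seed, length, rng=random):
    """Emit up to `length` characters, each sampled conditioned on the
    trailing context; stops early if the context was never seen."""
    context_len = len(seed)
    out = seed
    for _ in range(length):
        choices = model.get(out[-context_len:])
        if not choices:  # dead end: unseen context
            break
        out += rng.choice(choices)
    return out

# With context_len=4 on this tiny text, every context has exactly one
# successor, so the "generator" reproduces the source exactly.
model = build_model("abracadabra", 4)
print(generate(model, "abra", 7))  # → "abracadabra"
```

Shrink `context_len` to 1 or 2 and the successor lists grow ambiguous, producing the gibberish end of the spectrum; the plagiarism threshold is just the point where contexts stop being ambiguous.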
The legal debate around copyright infringement has always centered around the rights granted by the owner vs the rights appropriated by the user, with the owner's wants superseding user needs/wants. Any open-source code available on Github is controlled by the copyright notice of the owner granting specific rights to users. Copilot is a commercial product, therefore, Github can only use code that the owners make available for commercial use. Every other instance of code used is a case of copyright infringement, a clear case by Microsoft's own definition of copyright infringement [1][2].
Github (and by extension Microsoft) is gambling on the fact that their license agreement granting them a license to the code in exchange for access to the platform supersedes the individual copyright notices attached to each repo. This is a fine line to walk and will likely not survive in a court of law. They are betting on deep lawyer pockets to see them through this, but are more likely than not to lose this battle. I suspect we will see how this plays out in the coming months.
[1] https://www.microsoft.com/info/Cloud.html
[2] https://github.com/contact/dmca