GitHub Copilot is not infringing copyright (juliareda.eu)
347 points by aarroyoc 82 days ago | 542 comments



I disagree with this article. GitHub Copilot is indeed infringing copyright, and not merely in a grey zone, but in a very clear, black-and-white fashion that our corporate taskmasters (Microsoft included) have themselves defined as infringement.

The legal debate around copyright infringement has always centered around the rights granted by the owner vs the rights appropriated by the user, with the owner's wants superseding user needs/wants. Any open-source code available on Github is controlled by the copyright notice of the owner granting specific rights to users. Copilot is a commercial product, therefore, Github can only use code that the owners make available for commercial use. Every other instance of code used is a case of copyright infringement, a clear case by Microsoft's own definition of copyright infringement [1][2].

Github (and by extension Microsoft) is gambling on the fact that their license agreement granting them a license to the code in exchange for access to the platform supersedes the individual copyright notices attached to each repo. This is a fine line to walk and will likely not survive in a court of law. They are betting on deep lawyer pockets to see them through this, but are more likely than not to lose this battle. I suspect we will see how this plays out in the coming months.

[1] https://www.microsoft.com/info/Cloud.html

[2] https://github.com/contact/dmca


The part that feels really obvious to me is this: if I made an AI that could generate music by looking through the entire (copyrighted) back catalog of the Beatles, for example, and it could output music that I could steer to be very much like, or even exactly like, the original recordings (or I could do so accidentally), that wouldn't really be a way to launder the original licenses/copyright into the public domain.

Or maybe it is, but if so it essentially means the end of licensing because it would be trivial to make an AI that can take an input and produce the same output. Or maybe even cp is good enough to strip the source of its original license in that case.

Open source licenses are worth protecting or you break the cycle that helps more software be open.


> and it would output music that I could control to be very much or even exactly like the original recordings, or I could accidentally do it, that it wouldn’t really be a way to launder the original licenses/copyright into the public domain.

The test for non-literal copyright infringement is "substantial similarity." If, after filtering out irrelevant and non-copyrightable elements, the allegedly-infringing work is substantially the same as the original work, then it infringes. If it infringes, then two common defenses are independent creation and fair use.

In your hypothetical, the AI-generated work would infringe the original because you stated it would be substantially the same as the copyrighted work. You can't claim independent creation because the algorithm was dependent on the original work and you controlled the output of the algorithm to be exactly like the original work. Fair use is pretty much a non-starter, so I'll skip that analysis.

So, no, you couldn't use an AI to launder copyrighted works into the public domain.
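(For intuition only: courts do a qualitative substantial-similarity analysis, not a numeric one, but Python's difflib makes the "compare the works" step concrete. The code snippets below are placeholders I made up.)

```python
# Toy illustration of "similarity" between an original and a generated work.
# Real substantial-similarity analysis is qualitative and happens only after
# filtering out non-copyrightable elements; no court uses a difflib score.
from difflib import SequenceMatcher

original = "float Q_rsqrt(float number) { long i; float x2, y; }"
generated = "float Q_rsqrt(float number) { long i; float x2, y; }"

ratio = SequenceMatcher(None, original, generated).ratio()
print(f"similarity: {ratio:.2f}")  # identical strings score 1.00
```

In the hypothetical above, where you steer the output to be exactly like the original, the "generated" work is byte-identical, which is the easy case.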


Unless you are Github, in which case having your AI copy code verbatim is ok?


The way I see it, if you would use Copilot to completely (or largely) reproduce an existing work (software), then you would be infringing copyright. This is similar to using an AI to largely replicate a piece of music.

If you are using it to mix a snippet of code (from a sufficiently large code base) into a large code base of your own, then you are just remixing. That is not infringement. In music, there are entire genres based on remixing. You could even take it a step further and ask yourself: what is not a remix?


Sisqó had to settle with Ricky Martin for quoting "Livin' la Vida Loca", then there's the debacle around Katy Perry's "Dark Horse"...

Point is songwriters absolutely get litigious over reproducing small portions of their IP.

complex.com/music/majority-sisqo-thong-song-publishing-owned-by-writer-livin-la-vida-loca

huffpost.com/entry/katy-perry-dark-horse-lawsuit-payment_n_5d43d825e4b0acb57fca3ff2

factmag.com/2016/06/25/sampling-hip-hop-copyright/


Yeah, I've heard about some of those cases. It's surprising how far copyright can be stretched sometimes.

What's more surprising is to see copyleft advocates positioned so strongly in favour of giving copyright that kind of reach. I think that in a different context, some of the cases you refer to would be used by these same copyleft supporters as examples of why copyright needs to be more weakly enforced, not more strongly.


At the very least it should be consistent. If Microsoft can sue me for something trivial that probably shouldn't be illegal but is, then I can sue them for something trivial that probably shouldn't be illegal but is.


If I don't need your permission to do something with your work, then the conditions under which that permission is granted lose their force.


> Unless you are Github, in which case having your AI copy code verbatim is ok?

You have to filter out any non-copyrightable elements before you do the substantial similarity analysis. For code, that means removing non-expressive elements like arithmetic or boolean expressions, looping, recursion, conditionals, etc. APIs are not copyrightable under the recent Supreme Court holding in Google v. Oracle.

How much of your code is actually left after filtration?
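To make that thought experiment concrete, here's a rough sketch using Python's ast module. The list of "functional" node types is purely my own, illustrative stand-in for the non-expressive elements (control flow, arithmetic, assignment); it is not a legal test:

```python
import ast

SOURCE = """
def total(xs):
    s = 0
    for x in xs:
        if x > 0:
            s = s + x
    return s
"""

# Illustrative stand-ins for "non-expressive elements": control flow,
# arithmetic/comparison expressions, assignment, return.
FUNCTIONAL = (ast.If, ast.For, ast.While, ast.BinOp, ast.BoolOp,
              ast.Compare, ast.Assign, ast.Return)

nodes = list(ast.walk(ast.parse(SOURCE)))
functional = [n for n in nodes if isinstance(n, FUNCTIONAL)]
print(f"{len(functional)} of {len(nodes)} nodes are 'functional'")
```

For a small utility function like this, most of what it does falls into those buckets, which is the point of the question above.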


That's nonsense though. If we filter out the non-copyrightable parts of music we'd remove all the pitches, eighth notes, quarter notes, half notes...

We could remove the non-copyrightable parts of text works too. Just take out all the basic building blocks of language like verbs, nouns, stop words...


Individual notes, like individual words, are not copyrightable. Longer sequences, however, can be if they are original.


You’re basically arguing that whole books aren’t copyrightable because after you filter out all the words which aren’t individually copyrightable there isn’t anything left to copyright. Which obviously doesn’t make any sense.

The tests exist as a way of determining if someone did the action of copying which is the important thing at the core of it. And in this case the facts aren’t really in dispute, it’s whether what GH is doing counts as copying.

If you had a really really good memory and remembered almost exactly how your former company implemented something and when faced with a similar problem unknowingly produced similar code. It’s iffy as to whether this is copying. Because it really probably isn’t — you’re allowed to learn from copyrighted works — but the courts aren’t omniscient and when presented with the code they very well may rule that copying was more likely than not.

But we are omniscient in this case so we don’t really need the tests. Is what GH does more like copying or learning? This isn’t something that can be determined purely from the output of the tool.


Code elements like conditionals, loops, arithmetic or Boolean expressions, assignment statements, etc. are not copyrightable, but not because they’re insignificant (like individual words in a novel), but because they’re useful articles. Copyright protects expression, not utility, which is the domain of patent law.


The problem of API copyrightability was not resolved by Google v. Oracle, the ruling was that no matter whether or not APIs are copyrightable, copying APIs is fair use (which is a concept in the USA but not in many other countries around the world).


I’d wager you’d get rid of over 90% of most codebases by removing all that.

But perhaps there could be a way to make something that automatically converts these "non-expressive elements" into copyrightable elements?

Output would probably be maddening to figure out though.


And in any case I think at the point where the AI (yes I considered scare quotes but decided not to) copies the code verbatim including swear words it is kind of obvious what happens:

The AI part isn't about independent creation but about figuring out what to copy.


"Or maybe it is, but if so it essentially means the end of licensing because it would be trivial to make an AI that can take an input and produce the same output."

Yes, this is what is pretty interesting to me. I said in a previous comment that I have a really good OS generating AI. It asks you your favorite color and outputs a disk image you can use as an installer.

Right now it just happens to output a cracked version of Windows if you answer "blue". Who can know how that happened? It's a black box after all. Seems useful though, since Microsoft is loudly saying that if I distributed this it would have no license problems at all.


I think this is peak programmer logic. You can’t just declare something a black box or declare something an AI — the courts, people, aren’t that stupid. Lawyer words aren’t that magic.

If, when you crack open Copilot, it's determined that it's not actually learning and boils down to storing and regurgitating snippets of code, then no matter how few or many layers of indirection there are, it's still infringement.

What your AI actually does underneath all the indirection is what's important.


> [..] not actually learning and boils down to storing and regurgitating [...]

Some would say that pretty much _all_ learning involves "storing" and "regurgitating".

Aside: my daughter started writing out her name at kindergarten a couple of years ago. One of the staff seemed a bit dismissive of this, and claimed what was happening wasn't really "writing" it was "merely memorising the shapes of letters and then reproducing them in the right order". <rolls eyes>

I didn't bother to argue...


Yeah ... that was my point. What copilot does underneath the indirection is steal GPL code and derive "new" stuff from it in a way that launders the licenses.

We know what it's doing, it's doing very well understood statistical inference techniques to derive outputs.


Also, it's trivially easy to teach a real neural network to output a specific binary stream in response to a specific input. My OS AI could use the exact same technology as Copilot, I'd just train it very specifically.


I think the main point that the article makes is that for copyright to work you need some notion of a creative work, and so far it's generally accepted that snippets like

  i = i + 1
aren't creative enough to be covered by copyright. The interesting point is where you draw the line between what's boilerplate and what's creative, and legally it will presumably come down to showing that copilot crosses that line egregiously enough for someone to think they've got a successful chance at legal action.


Since that article was written people have shown it will generate quite long coherent sections. It will even generate someone’s private about me page: https://twitter.com/kylpeacock/status/1410749018183933952?s=...


Bullshit.

That not an existing aboutme page. You can go to davidcelis' website and verify that it's completely different.

Copilot just picked a random person and linked to their social media accounts. You can search any large quote within that about me on Google and not find a match, it is unique.

The only two examples of generating large sections of copyrighted work are the quake floating point hack and the zen of python. Both those examples are commonly known and copied and talked about, to the point that they have wikipedia pages.


But that about me page is the very definition of boilerplate text, so really it only gives weight to the argument that it's not producing original work.


You got downvoted, but I kind of like this argument. There are a million "about me" pages, but Copilot did a good job of picking one for "generic software engineer". If it could just have changed a word or two to a synonym, it would be great.


That page was so generic that I could fully believe it was originally written by a GPT bot.


Sure, but GitHub's own promotional pages (pretty much any of the gifs on https://copilot.github.com/ as well as other articles, e.g. https://docs.github.com/en/github/copilot/research-recitatio...) show it producing much more elaborate segments than that.

In fact, that's a crucial selling point for the product.


But as I understand it, copilot can generate much longer snippets, even entire functions.

I think the big question is, if copilot ends up copying significant portions of a GPL work, not just tiny snippets, is the resulting work infringing, and if so, who is liable?


If a tree falls in the forest and no one is around to hear it, does it make a noise? If it is infringing and someone is liable, what proportion of the infringing cases would be found and what proportion of those found instances be brought to the courts?

I have almost no sense of how often code is infringed currently and how often anyone does anything about it. I have a feeling that we live in a world with constant infringement, basically no one cares, and no one does anything about it. And I would assume the status quo will maintain its current course with this new tool. But again, I'm giving zero factual evidence, it's just a feeling from not seeing or hearing almost any news about open source code infringement.


If you use copilot and it generates a substantial amount of code, you don't know if that might be a replica of code from another project with an incompatible license. If you are law abiding and/or want to respect open source licenses, then it is on you to figure out what, if any, license that code would fall under. Which means copilot would only be useful to developers who don't care about FOSS licenses, or who only use it for snippets that couldn't possibly be considered original enough to be covered by copyright.


Unlike with patents, for copyright, independently created work is not infringing. So, if you build a model that really does actually model music, this could be argued to be independent creation.

But there is also caselaw (involving George Harrison IIRC) on "unconscious copying", where having heard a piece is suggestive that it was not an independent creation, despite not being deliberately copied. So, training on a corpus that includes a specific piece is arguably a case of that.

There's an interesting question of whether a model is just a sophisticated statistical compression of a corpus, or whether it is a thing in itself. I would say, if it finds patterns that are disproportionately simpler than the corpus, it has found "something".

But another view is of creation as involving a side-channel or one-time pad, in that music is created by a human and heard by a human, who have common information that has never been present in music before (e.g. specific aspects of common neurophysiology, auditory anatomy, exact heartbeat waveform, new sounds/rhythms in the world, new speech patterns, associations between existing melodic fragments and words/emotions/visuals/status etc). In this sense, truly new music is discovery of the Human Music Processing System, which ultimately involves the whole human and their social and physical experience.


You say AI, but this is just a database with a weird query language. The fact that it has to be trained on massive amounts of code and can only regurgitate variations of snippets of the training database back makes it quite clear that there is no intelligence in this thing.

If it were intelligent, it'd be given the language specifications and then I'd be able to say "Write me an open world video game based around gang culture, and I'd like it to run on the Raspberry Pi Zero."


Oddly enough there's an Adam Neely video I saw some time ago that may address your points:

https://youtube.com/watch?v=sfXn_ecH5Rw


Wouldn't the parallel be closer to having an ai remix a bunch of songs together?


The Beatles are the farthest thing from public domain possible and would never be included in such an exercise.

A valid point, backed by the most hilariously poor example in recent memory.


You can put Beatles lyrics in code comments, and Copilot looks to copy those as well. People bomb search results; some will certainly bomb Copilot.


You're giving AI too much credit; it's just a tool, it does not have its own intentions.

I.e. if you buy a piano or a guitar, you could play and record copyrighted music on it. That's not the piano's or guitar's fault though, it's yours.


It sounds like you might be the one giving it too much credit. AI is a glorified markov chain, which is essentially a compression algorithm. I agree that it can be an instrument (I've done it: https://soundcloud.com/theshawwn/sets/ai-generated-videogame...) but it's almost trivial to train a model that memorizes by rote.

Suppose a model was trained solely on a single Beatles album. It could only spit out that album. That would be clear infringement, wouldn't it?
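To make the rote-memorization point concrete, here's a sketch of a tiny character-level model trained on a single text (the lyric fragment and the order-3 context are placeholders I picked). With only one training source, greedy generation can do nothing but track that source:

```python
from collections import defaultdict

training_text = "all you need is love love love is all you need"

# Order-3 character model: map each 3-char context to the characters
# that follow it in the (single) training text.
model = defaultdict(list)
for i in range(len(training_text) - 3):
    model[training_text[i:i+3]].append(training_text[i+3])

# Greedy generation: with one training source, the output shadows it.
out = training_text[:3]
for _ in range(30):
    followers = model.get(out[-3:])
    if not followers:
        break
    out += followers[0]
print(out)
```

A model trained on a whole album instead of one line is the same mechanism at a larger scale; whether that counts as "learning" is the question in dispute.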


It’s funny that people say it’s a glorified Markov chain.

No. It’s not. A Markov chain has some very specific properties that are absolutely not fulfilled by GPT-3 models.

Just say “stochastic” if you want a buzzword. Stop appropriating Markov chains.


Actually GPT-3 is a Markov chain. A Markov chain is a very general term. Just because the simplest ones are stupid, doesn't mean there aren't smarter ones.

A Markov chain is a model where: (1) there is a state, (2) the probabilities of the next state depend only on the current state.

That could describe anything from a wet fart to a deterministic computer to GPT-3 to the human mind to quantum mechanics (the real one, not a simulation).

If I unplug the Internet and all USB devices, my computer is a Markov chain with at least several trillion bits of state, so 2^(several trillion) possible states. And there is one next state, which has a probability of 1, and all other possible states have a probability of 0. That's a Markov chain.

GPT models choose the next word probabilistically, with the probabilities chosen by feeding the previous N words into a neural network. That sounds like a Markov chain to me!
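A sketch of that framing (the toy corpus and window size are my own): if you define the "state" as the entire last-N-token window, then next-token sampling depends only on the current state, which is exactly the Markov property:

```python
import random

random.seed(0)

N = 2  # the Markov "state" is the last N tokens (the context window)
corpus = "the cat sat on the mat the cat ate the rat".split()

# Transition table keyed on the full N-token state: once the state
# includes the whole window, next-token sampling is Markovian.
transitions = {}
for i in range(len(corpus) - N):
    state = tuple(corpus[i:i+N])
    transitions.setdefault(state, []).append(corpus[i+N])

output = ["the", "cat"]
for _ in range(5):
    choices = transitions.get(tuple(output[-N:]))
    if not choices:
        break
    output.append(random.choice(choices))
print(" ".join(output))
```

GPT's "state" is the whole context window rather than a word-count table, and the transition function is a neural network rather than a lookup, but the state-transition structure is the same.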


“It’s a stochastic” doesn’t flow, though I guess I could use "stochastic random walk."

What properties does a gpt-3 model have that a Markov chain doesn’t? (Other than effectiveness.)


Willfully misunderstanding your interlocutor seems ungracious; you apparently know stochastic is an adjective, so why did you try to use it as a noun?

GPT-3 is conditioned on the entire input sequence as well as its own output, which is strictly NON-MARKOVIAN. In fact, the point in saying something is Markovian is exactly that: the state transition probability only depends on the current state.


But the current state could include all of the last N output tokens.


ANNs aren't Markov chains


Well, yes. And apples aren't oranges, but they share a lot of similar traits.

"Given a prompt, provide a completion" is what a Markov chain does. GPT-3 is exactly the same, in the sense that apples and oranges both satisfy your hunger.


Is calculator.exe the equivalent of a TI-84?


I would say that a TI-84 is a glorified calculator.exe. The TI-84 is usually more effective, but they’re both calculators.


Funny you should say that, as there is a direct line connecting player pianos in the 19th century to copyright law in the 21st:

https://en.m.wikipedia.org/wiki/Mechanical_license


When I press 1 key and it plays copyrighted music, that is the piano's fault.


If you have a piano that plays copyrighted music when you press a single key, isn’t that the piano maker’s fault?

Edit - googling, the history of player pianos vs copyright is interesting

https://slate.com/technology/2014/05/white-smith-music-case-...

https://www.techdirt.com/articles/20100712/18325210185.shtml


It could end up being tried as being both the piano maker’s and player’s fault.

The former for making and selling it, the latter for buying and using it.

Just like counterfeit goods.

No?


Depends, I think. Generally, an unsuspecting person who plays the song believing they bought it legally would not be held accountable for accidental copyright infringement. But if they knew the song was copyrighted and not legally licensed by the piano manufacturer, and they assembled a bunch of friends to listen to the songs, or resold them for money, then yeah, perhaps. Most importantly in context, though, the piano will almost certainly not be fined or taken to court. ;)


Even if hypothetically there was such a strange bug in your piano and you decided to exploit it by recording copyrighted music and redistributing it, you would be accountable for it, not a piano.

This analogy train went too far, don't you think? All examples that I've seen on Twitter require quite intentional manipulation by a human for Copilot to produce something copyrighted. It does not recite Linux code by pressing 1 key.


If you have an electronic piano that requires a complex series of button pushes to produce copyrighted music, that's still a copyright violation. Copyright law has no notion that the difficulty of reproducing copyrighted content affects the fact of a violation.


> an electronic piano that requires a complex series of button pushes to produce copyrighted music

Surely a judge presented with the "complex series of button pushes," otherwise known as playing an instrument, would hold the player accountable for any infringement and not the piano?

These analogies have gone so far off the rails that I can't tell which side this thread is arguing for by now ;)


I think the whole swirling discussion is a little confused because there are potentially two "ends" where infringement could happen, and different people are talking about each. And the article covers both.

One end is GitHub's, at the input: Copilot's "database" was initialized from code that GitHub does not have copyright to. The contention at this end is that they are ignoring the licenses that would grant them the right to use that code.* The article, GitHub, and others assert that there's no copyright issue for creating a database of this kind (a machine learning model).

The other end is the developer taking Copilot's output. The article seems to take the (absurd IMO) position that there's also no copyright implications here, because the output is not copyrightable at all.

*And personally this is the side that concerns me most.


A typewriter vs. a machine that recites paragraphs of Shakespeare are two different things.


Neither of them unbind the content from the original license, though.


"Copilot is a commercial product, therefore, Github can only use code that the owners make available for commercial use."

IANAL, but this doesn't sound quite right. There is a difference between "using" code (running it in a commercial product) and manipulating it as arbitrary data within a commercial product.

It definitely can be a gray area, but let's say I use Amazon's service where I email a PDF to my Kindle - is it Amazon's responsibility to know the copyright status of the PDF, or mine? In both cases a commercial product is manipulating copyrighted data for the benefit of a user.


Maybe you're right, maybe you're wrong.

I'll give the best example, the one task that off the top of my head that I would like some AI help with.

I would really like to replicate the functionality of Java's SSLEngine, but for C#.

If I used Co-Pilot to help, at best, I would need to pay for a legal team to do some form of 'clean room' review of whatever was generated to make sure it did not infringe on the OpenJDK code that is out there. At worst, I would be having to defend myself from Oracle's legal team -anyway-.

And yeah, I'm assuming in this case that Copilot would be 'smart' enough to be able to make the right inferences from that Java code and put it into a workable C# construct. Stepping back, though, one could still ask the question: what's the risk of a Java developer accidentally reproducing some OpenJDK code a little too closely? There's an order of magnitude difference between even a smaller AGPL developer and Oracle.

If Microsoft/GH was willing to go to bat and agree to pay for the defense of users of Copilot, I would be far less concerned with the implications of all of this.


> Stepping back, though, one could still ask the question: what's the risk of a Java developer accidentally reproducing some OpenJDK code a little too closely? There's an order of magnitude difference between even a smaller AGPL developer and Oracle.

It would be extremely interesting to know how much accidental and non-accidental code infringement happens, and what proportion of those cases go to the courts. I would guess that both cases happen utterly constantly and it is only a tiny minority of those cases where legal action is taken. If that's the case right now, then nothing has changed with this tool except the possibility of playing hot potato with liability when those few cases that do happen make it to courts. Even if the developer actually wrote the code that infringed, copilot could make a useful scapegoat, and every case will have plausible deniability if copilot lacks really good explainability.


> every case will have plausible deniability if copilot lacks really good explainability.

Out of pure curiosity, and please do take it as a candid question: do you mean that "I don't know how what I used works" is a good defence?


Even if it's legal for Copilot to do what it does, does it not violate GPL to take pieces of GPL'ed code and use them in a commercial product?


The basis of the GPL is copyright, so what you're really asking is whether you can use part of a copyrighted work in another work without infringing.

And the answer as always is "it depends".


If I use Copilot and it suggests a large block of GPL2'ed code for my project, which I then include, then that is a GPL2 license violation.

Whether the GPL2 will hold up in court, or whether the courts will uphold this specific case (e.g. can you prove intent? Do you need to?), is a separate issue entirely.

The next question is, can I use GPL'ed code in my product and then claim that it was injected by Copilot to avoid repercussions of my actions if caught?


> If I use Copilot and it suggests a large block of GPL2'ed code for my project, which I then include, then that is a GPL2 license violation

Not necessarily. Even 11k lines of copied code might fall under fair use, as Oracle recently discovered ;)


> If I use Copilot and it suggests a large block of GPL2'ed code for my project

But why would copilot do this? It's a language model not a database.


The claim (which I'm not qualified to judge) is that this use falls under fair use. The point of fair use is to allow some use of copyrighted works even if the copyright owner does not license it to you and even if the owner is explicitly hostile towards your usage. If it is indeed fair use, then the license doesn't matter because that's not the thing that's allowing you to use the work.


There are plenty of SAAS that use GPL'd code on the backend. That's fine.


You mean like Red Hat Enterprise Linux, that kind of commercial product based on GPL'd code?


Yes.


The proprietary model is a representation of lots of harvested open source code snippets. Without the model copilot is nothing. Arguably, the code snippets are part of the product....


Your example doesn't quite match what's happening in real life though. You're not "using copilot as a mechanism to ferry around code". Co-pilot is making recommendations for what code to use and then also giving that exact code (the text) to you. A more apt example would be if Amazon had some UI which said "What kind of book do you want to read on your kindle?", you click the button labeled "biography", and then Amazon sends your Kindle an AI generated book which is the biography of a famous person, and it just so happens that the "generated" book being sent to you is an exact copy of someone else's book (or incorporates exact copies of chapters/paragraphs of someone else's book), legal disclaimers and all.


Github (and by extension Microsoft) is gambling on the fact that their license agreement granting them a license to the code

This is incorrect. First of all, GitHub isn't even the people building the model. It's built by OpenAI, which has none of these licenses. Secondly, the model is not built purely from GitHub data. OpenAI is relying on fair use, not on a specific license.


> GitHub Copilot is indeed infringing copyright and not only in a grey zone, but in a very clear black and white fashion

You seem to be confusing what you'd like the law to be with what the law is.


Here is an explanation of the law: https://www.copyright.gov/fair-use/more-info.html#:~:text=Fa....

Effect of the use upon the potential market for or value of the copyrighted work: Here, courts review whether, and to what extent, the unlicensed use harms the existing or future market for the copyright owner’s original work. In assessing this factor, courts consider whether the use is hurting the current market for the original work (for example, by displacing sales of the original) and/or whether the use could cause substantial harm if it were to become widespread.


If your view of the law is correct then programming is illegal, because who among us has not read copyrighted code and used it to train our biological neural network? I suspect your view of the law is not correct.


Humans aren't robots, and our laws aren't designed to equate the two, even if you assume that Copilot is anything close to intelligent.


Your analogy looks alien to me. Can you show us the link between a human and a computer program? WTF is "biological neural network"? Can you quote a law?


Copyright is not an indefinite ability to control a work. You can't for instance stop me from lending or leasing my rightfully acquired copies of your works, nor from making short quotations for criticism, nor teaching a class about it, etc.

Particularly in the context of the occasional, unintentional reproduction of short snippets that likely need adaption to the rest of the code they are inserted into I suspect courts are unlikely to find more than de minimis, unactionable, infringement.


Right, but Copilot is not creating a piece of critical media about the code it is ingesting (IANAL, but my understanding is that this gets interpreted pretty strictly). I don’t think the normal Fair Use classes apply here.

Even if the code isn’t being copied verbatim, it feels like the spirit of these licenses is being violated, although I don’t know if that’s enough to get anywhere in court. But if the code is in fact being copied (like that Quake example) then the license is definitely being violated.

But I feel like there’s too much analysis in these comments of whether a current law is being broken, and not enough thought about what will happen if licenses like the GPL can no longer keep intellectual property free. Open source licenses are part of the foundation of this community, and we’ll be much worse off without them. We really need a way to prevent this kind of IP laundering, and if current laws won’t do it, then we need new ones.


> You can't for instance stop me from lending or leasing my rightfully acquired copies…

I mean in this specific example can’t I though? Yes the first sale doctrine applies to certain kinds of works but not every work and specifically not software. I absolutely can grant you a single non-transferable license to use my software.


> Any open-source code available on Github is controlled by the copyright notice of the owner granting specific rights to users.

and by the GitHub ToS:

> You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.

https://docs.github.com/en/github/site-policy/github-terms-o...


> > as necessary to provide the Service

I would consider Copilot to not be part of ”the Service”[1], but at least currently[2] the definition of ”the Service” is so vague as to include anything that Github does.

Maybe they consider Copilot to be a ”search index” and the suggestions ”[sharing] [Your Content] with other users”.

[1] Since, as I understand it, it will require separate payment.

[2] The ToS is currently last edited 2020-11-16, and does not contain the word ”Copilot”


Not everyone whose code ends up on GitHub has agreed to this set of terms.



Does the GitHub ToS matter when I upload code that was written by somebody who doesn’t use GitHub?


Then you would be the one infringing their copyright, and they could probably sue you.

Although I'm curious about what GitHub would do if the original author asked them to remove the work from Copilot. Retrain from scratch every month or so, to remove last month's DMCAed content?


I doubt that's an interpretation Github wants people to make, since it would mean that tons of major projects need to be removed from Github: basically every project older than Github itself.


> Github (and by extension Microsoft) is gambling on the fact that their license agreement granting them a license to the code in exchange for access to the platform supersedes the individual copyright notices attached to each repo.

The person who has the account on Github and uploads code to them rarely owns the copyright on all of the code, and therefore doesn't have the right to delegate to Github any further licensing permission.


Legally I believe that’s the person who uploaded the code’s infringement rather than GitHub’s, so long as GitHub deals with takedown requests in a timely manner as per the DMCA.

Furthermore, as described in the article, the legal precedent has been that you don’t actually need copyright to something to train a model on it. You may think that’s silly or inconsistent, but that’s how the legal precedent is.


> that’s the person who uploaded the code’s infringement rather than GitHub’s

That's not how it works. Anyone and everyone who distributes it is infringing and carries risk of enforcement action. That could also be someone further downstream.

> Furthermore, as described in the article, the legal precedent has been that you don’t actually need copyright to something to train a model on it.

I'm not commenting on this aspect.


That is absolutely how it works.

https://www.eff.org/issues/dmca

“The DMCA “safe harbors” protect service providers from monetary liability based on the allegedly infringing activities of third parties.”


That does nothing to derisk those further downstream of infringing content.


If Copilot is infringing copyright by reproducing small samples of the training data, and if we agree that that isn't acceptable, doesn't that effectively spell the end of the road for any and all AI generated content unless the developers explicitly stop their product reproducing data that matches the data it was trained on? That seems like it would have far reaching consequences for AI as an industry.


A Pandora's box full of already-opened can-of-worms-producing factories. The singularity is now?


Doesn't everyone who uploads code to a public repo give Microsoft/GitHub a license to (strike ~redistribute~) reproduce that code?

If they didn't, GitHub itself would be violating copyright every time someone browsed the repo.

And copilot appears to be a part of GitHub.

https://copilot.github.com/

So why wouldn't copilot itself be covered by that license?

(Certainly people using copilot would not. Let the user beware.)

Edit: downvoted to death but the top reply shows that it's true. An inconvenient truth, I suppose.


From the GH TOS:

> 4. License Grant to Us

> This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service

https://docs.github.com/en/github/site-policy/github-terms-o...

  .
> 5. License Grant to Other Users

> If you set your pages and repositories to be viewed publicly, you grant each User of GitHub a nonexclusive, worldwide license to use, display, and perform Your Content through the GitHub Service and to reproduce Your Content solely on GitHub as permitted through GitHub's functionality (for example, through forking).

> You may grant further rights if you adopt a license.

https://docs.github.com/en/github/site-policy/github-terms-o...

  .
So yes, but only within GitHub.

  .
Edit:

> A. Definitions

> The “Service” refers to the applications, software, products, and services provided by GitHub, including any Beta Previews.

https://docs.github.com/en/github/site-policy/github-terms-o...

Sneaky bastards.

  .
Edit: Formatting


Not everything on Github was uploaded by the copyright holders. Often enough, it's uploaded by people who only have access to it under an open source license, so Github cannot in general squeeze additional license terms out of the uploader at that point.


That's a good point; what is their obligation/liability in that case?


Yep. Some of them are mirrors uploaded by unknown/non-official uploaders.

I guess the next step is GitHub requiring a cell phone number at sign-up, to make sure who you are and secure the rights to profit.


There's more than redistribution happening here. Co-pilot is providing a value-add service where the open-source code is an input and the output is a service. As it happens, the service is actually regurgitating the code at this point, but it's important to consider that even if it didn't regurgitate the code verbatim, the fact that the service is making use of that code to provide a value-add means the code is a crucial input to the value proposition. Would Co-pilot be able to provide the value-add without the source? Likely not.

Couple that with the fact, that presumably at some point in the future, Co-pilot will come attached with a subscription model (otherwise why do it in the first place?), and we have the makings of a product that is commercially infringing on copyright left, right and center.


No.

Edit: Sorry downvoters, whether you like it or not, you don't understand the terminology. You're confusing reproduction with redistribution.


I'm not a lawyer so it's entirely possible I used the wrong term. Thank you for clarifying below.

Using the terms as you explained them below, I meant that Microsoft/GitHub has permission to reproduce the code so why wouldn't that extend to copilot?


Are they displaying the license under which said code is licensed when they display a chunk of licensed code? If not, then they're violating the terms of most licenses (except pure public domain, or other similar licenses which don't have any such requirements attached).

The use of licensed code in other projects must be done under the terms of that license or you aren't legally (under copyright law) allowed to use the code.


> Are they displaying the license under which said code is licensed when they display a chunk of licensed code? If not, then they’re violating the terms

The GitHub TOS is a license that is separate from the license in the code. It is legal and common for an author to license the same code multiple ways (and the licenses do not have to agree with each other). By agreeing to GitHub’s TOS and uploading code to GH servers, people are licensing GH to display the code, because the license agreement says so explicitly. This could be problematic if someone uploads code they don’t have the rights to upload, but then the violation is the uploader’s, and not GitHub’s.

Additionally, GH has a provision for already licensed code in section D.6:

“6. Contributions Under Repository License

“Whenever you add Content to a repository containing notice of a license, you license that Content under the same terms, and you agree that you have the right to license that Content under those terms. If you have a separate agreement to license that Content under different terms, such as a contributor license agreement, that agreement will supersede.

“Isn't this just how it works already? Yep. This is widely accepted as the norm in the open-source community; it's commonly referred to by the shorthand "inbound=outbound". We're just making it explicit.”

https://docs.github.com/en/github/site-policy/github-terms-o...


As I said, I'm not a lawyer, but I believe they're displaying it under the terms of the GitHub ToS, using rights granted to them when the project is uploaded to GitHub, not under the terms of the license the project uses for everyone else.


Reproduction is enough to cover the first part of your use case. This is mentioned on Github's TOS.

For the latter you would need redistribution as it is going into a different product, for which you claim ownership, and with possible modifications/adaptations (this would depend on the rights granted by the license). Nowhere on Github's TOS is the word or concept of redistribution referenced.

So, the answer to your original question is "no".

Edit: leereeves modified their comment after I wrote this, so this may not make much sense, but you can figure out the point. Best!


I’m not sure this is a completely fair take, I think the original question is legitimate and relevant. Github’s TOS does in fact ask the contributor to grant a license for GH to host and serve their code from GH servers. That is both reproduction and distribution as defined by copyright law, and copyright covers both of those at the same time https://www.copyright.gov/what-is-copyright/

(Edit and BTW GH calls out their ‘distribution’ in section D.4 of their TOS explicitly, but without using the word “distribute”. They say you grant them the right to “publish” and “share” code you upload, which means “distribute” under copyright law. They also imply that by spelling out the terms under which they do not “distribute”, which is anytime the content is used outside of GitHub’s services.)

I don’t think you’re correct that the term “redistribution” means either going into another product, nor that it implies a claim of ownership. Putting works into another product is sometimes known as making a derivative work, while “redistributing” is quite commonly used to mean copy-and-distribute as-is. Redistribution can happen via license as well, it requires permission by the copyright owner, but does not imply the redistributor is (or is claiming to be) the copyright owner.


>I think the original question is legitimate and relevant

You didn't see the original question, it was edited, so we cannot discuss that further.

"[...] which means “distribute” under copyright law" <-- Citation needed please, because I don't think that's correct.

From the site you linked:

"Distribute copies or phonorecords of the work to the public by sale or other transfer of ownership or by rental, lease, or lending."

What I seem to grasp about the difference between reproducing and redistributing is that it has to do with the concept of "transfer of ownership". Also, derivative work and redistribution are not mutually exclusive.

The moment you create a new thing and start distributing it (even if you do not modify it), you become the de facto owner of that new product, and copyright law is trying to limit the extent of the rights that apply there. So, in the case of music, it's a different thing to play (reproduce) a song than to create a new album with your favorite artists that happens to include that particular song (redistribution).


> "Distribute copies or phonorecords of the work to the public by sale or other transfer of ownership or by rental, lease, or lending."

> What I seem to grasp about the difference between reproducing and redistributing is that it has to do with the concept of "transfer of ownership". Also derivate work and redistribution are not mutually exclusive.

What you've misunderstood is it is the copies that are sold, not the copyrights.

* edit

> create a new album with your favorite artists that happens to include that particular song (redistribution).

This is not what redistribution means. You seem confused about this word.


>What you've misunderstood is it is the copies that are sold, not the copyrights.

Sorry, I'm not following you anymore. I don't even know what you mean by that sentence.

Edit:

>This is not what redistribution means. You seem confused about this word.

But, that's exactly what redistribution entails ...


> Sorry, I'm not following you anymore. I don't even know what you mean by that sentence.

The transfer of ownership you referred to is a transfer of ownership of a copy, it is not a transfer of ownership of the original work itself. You misunderstood the passage you quoted to mean that redistribution is transferring ownership of the work itself, as in copyright ownership of the work. But the text you quoted is only talking about transferring ownership of the copies. The text you chose makes more sense in the context of physical copies of books or "phonorecords".


I'm not following your logic, really.

Copyright is meant to protect original/authentic/unprecedented expressions, regardless of the medium where they may exist. So I don't really get your point in trying to make a distinction between a copy and a "master"(?) or whatever.

What's at stake is the originality of the expression and what kind of rights does somebody else (i.e. everyone but the creator) have (or not!) over it.

Can I make copies of this original expression? (y/n)

Can I use this into a new product of my own? (y/n)

Whether something is already a copy or not does not really change the extent of the rights that you have (unless it's explicitly stated in the license, of course).


> I’m not following your logic, really

I can see that, and I mean no disrespect, but you shouldn’t have attempted to comment on this topic authoritatively and police what others say without understanding it.

> I don’t really get your point in trying to make distinction between a copy or a “master”(?) or whatever.

You have clearly and repeatedly demonstrated that you don’t understand what “distribute” and “redistribution” means to copyright law. You claimed others were confused about it and that using the word “redistribute” was incorrect, when in fact it was fine and correct.

I’m trying to help you understand that redistribution is a term that is talking about what happens to copies of a work. The sentence you quoted, and the “transfer of ownership” that you said you grasp only have to do with transferring copies, and nothing else.

The main point here is that when GitHub shows you code, it is transferring a copy to you. That is what GitHub calls “publish” and what copyright law calls “publication”, and by publishing they mean redistribution (because the copyright legal code says so).


> that's ("create a new album with your favorite artists that happens to include that particular song (redistribution)." exactly what redistribution entails

No, it isn't. You're wrong. Redistribution, or just distribution, in copyright law is plainly and simply making copies of a work available to other people. It does not mean anything more than that, and it does not transfer ownership of anything other than the copy you distribute.


> You're confusing reproduction with redistribution.

It seems like you're confused; GitHub's terms require users to grant both of those. Copyright law also covers both.


>GitHub's terms require users to grant both of those

Last time I checked (about an hour ago), that wasn't true. Feel free to provide evidence to support your argument.


> Last time I checked (about an hour ago), that wasn't true. Feel free to provide evidence to support your argument.

https://docs.github.com/en/github/site-policy/github-terms-o...

"publish" and "share" mean redistribution. "Store" and "copy" mean reproduce.


>"publish" and "share" mean redistribution

No. That's something you believe, but it's not necessarily true.

Check here, https://copyrightalliance.org/faqs/what-rights-copyright-own...

Again, distribution has to do with a transfer of ownership. In layman's terms, Github can show your code to others but it cannot give (as in ownership) your code to them. It's a bit tricky here, since on the web showing something literally means making a copy at some point, but try to view things in the light of "who owns what" and it's a bit easier to grasp.

If you browse through someone's repository, it's pretty clear who the owner of that code is, if a program gives you a chunk of code that it "got from somewhere" there's definitely some sort of change of ownership operation going on; which in this case is interesting, as it went from attributed to someone to missing/unknown.


> Again, distribution has to do with a transfer of ownership

You're mixing sub-threads here, but you're still confused. Distribution is a transfer of ownership of a copy, it does not grant copyrights or ownership of the work. You can buy a book that was distributed, and that does not give you the right to make copies of the book.

> Github can show your code to others but it cannot give (as in ownership) your code to them.

In the digital world, showing is "distributing", and copyright law is clear about this.

You should perhaps read the definitions that are in the copyright law itself, and try to understand them:

"“Publication” is the distribution of copies or phonorecords of a work to the public by sale or other transfer of ownership, or by rental, lease, or lending. The offering to distribute copies or phonorecords to a group of persons for purposes of further distribution, public performance, or public display, constitutes publication. A public performance or display of a work does not of itself constitute publication.

To perform or display a work “publicly” means—

(1) to perform or display it at a place open to the public or at any place where a substantial number of persons outside of a normal circle of a family and its social acquaintances is gathered; or

(2) to transmit or otherwise communicate a performance or display of the work to a place specified by clause (1) or to the public, by means of any device or process, whether the members of the public capable of receiving the performance or display receive it in the same place or in separate places and at the same time or at different times.

https://www.copyright.gov/title17/title17.pdf


You make some good points, here and on the other comments, so I'm not arguing against you.

>In the digital world, showing is "distributing" [...]

I guess it has to do with how copyright law adapts to the specific circumstances of this particular case. I guess we won't get an answer until a judge justifies some sort of resolution on either side.

My take is that:

* GH showing you some source code on their website is akin to reproduction; even though, of course, a binary copy of the code was made and was transmitted to your local browser in order to be displayed.

while

* GH taking chunks of code from here and there, and making them available in a new product of which they (or the final user, or whoever) claim ownership, is more akin to the physical concept of redistribution.

But we'll have to wait and see.


> GH showing you some source code on their website is akin to reproduction

This is clearly and unambiguously defined as “publication” in the copyright law, where “publication” is defined as distributing copies. (And GitHub’s TOS also calls showing you code “publish”).

There is nothing to wait for, and the law and many court cases have already established clear definitions and precedent on these terms. You just got stuck on the wrong idea, it happens, it’s okay, but if you are curious about copyright and interested in discussing it here, it will certainly help to improve your understanding of the terminology.

> GH taking chunks of code from here and there […] is more akin to the physical concept of redistribution

No, this is still just wrong. You’re talking about derivative works, which is also defined in the copyright legal code. There is no such physical concept of mixing and matching that is called “redistribution” in legal terms. I’m not sure where that idea came from, it might make sense to you or in some narrow contexts, but generally speaking and specifically wrt copyright law, distribution has nothing to do with whether you sample a work nor whether you make a new work out of old works.


Yes (as you now know), GitHub’s terms require users uploading code to agree to GitHub being able to both redistribute (“publish”) and reproduce (“copy” / “store”) their code.

The “Terms” link on the copilot page goes directly to GitHub’s TOS, so yes the terms are one and the same.

This question is interesting and I’ll try to help turn the downvotes around, but it might be too late. Anyway, when users agree to allow their code to be “published” by GitHub, they are allowing it to be both copied and distributed. The TOS also says (note the indexing/analysis comment) “This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.”

The part where GitHub might have trouble (I speculate) is that their TOS doesn’t discuss derivative works, and the input code to copilot could have licensing terms on derivative works that get scrubbed out by copilot. OTOH, if copilot were to guarantee that a chunk of code never resembled one of the original inputs it may be legal to create derivative works from samples under fair use.


I'm thinking it's not so much what is legal for Copilot to do with code chunks from GPL'ed code, but what it means for end users (i.e. developers at for-profit companies) to incorporate those chunks into commercial products


There were two parts to the argument which seem to hold water.

1. Any code generated by Copilot is likely to be AGPL.

2. Since the authors of Copilot used the Copilot beta to make the Copilot release, Copilot is very likely using AGPL-licensed code, and is therefore in breach of the AGPL license.

So yep, the article looks flawed.


Julia Reda's analysis depends on the factual claim in this key passage:

> In a few cases, Copilot also reproduces short snippets from the training datasets, according to GitHub’s FAQ.

> This line of reasoning is dangerous in two respects: On the one hand, it suggests that even reproducing the smallest excerpts of protected works constitutes copyright infringement. This is not the case. Such use is only relevant under copyright law if the excerpt used is in turn original and unique enough to reach the threshold of originality.

That analysis may have been reasonable when the post was first written, but subsequent examples seem to show Copilot reproducing far more than the "smallest excerpts" of existing code. For example, the excerpt from the Quake source code[0] appears to easily meet the standard of originality.

[0]: https://news.ycombinator.com/item?id=27710287


The excerpt from the Quake code is literally one of the most famous functions out there, so it's no wonder it was reproduced verbatim. The share of such code is, according to Github, really small.

It would be quite straightforward to write an additional filter that checks the generated code against the training corpus and excludes exact copies.
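For the sake of argument, such a filter could be sketched as an n-gram index over the corpus. Everything here is illustrative, not Copilot's actual mechanism: the `Q_rsqrt` fragment, the whitespace tokenization, and the 5-token window are all made-up choices (a real filter would need a proper tokenizer and a much larger window).

```python
def ngrams(tokens, n):
    """Yield successive n-grams of a token sequence."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

def build_index(corpus_files, n):
    """Index every n-gram of whitespace-split tokens in the corpus."""
    index = set()
    for text in corpus_files:
        index.update(ngrams(text.split(), n))
    return index

def looks_copied(suggestion, index, n):
    """Flag a suggestion if any n consecutive tokens match the corpus verbatim."""
    return any(g in index for g in ngrams(suggestion.split(), n))

# Hypothetical training corpus containing a famous snippet:
corpus = ["float Q_rsqrt( float number ) { long i; float x2, y; ..."]
index = build_index(corpus, n=5)

print(looks_copied("long i; float x2, y; ...", index, n=5))   # exact run of 5 tokens -> True
print(looks_copied("def foo(): return 1 + 2", index, n=5))    # no 5-token overlap -> False
```

The interesting engineering question is the threshold: too small an n flags every common idiom, too large an n misses lightly-edited copies.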


But the fact that it did that at all should be proof that Copilot is, in fact, copy and pasting rather than actually learning and producing new things using intelligence.

This is a code search engine with the ability to integrate search results into your language syntax and program structure. The database is just stored in the neural network.

It’s definitely an impressive and interesting project with useful applications, but it’s not an excuse to violate people’s rights.


This is all just computational statistics. Why in the world would you invoke ill-defined anthropocentric terminology like "intelligence"? Of course a statistics program isn't "using intelligence".

But it's also not exactly just a database. It contains contextual relationships as seen with things like GPT that are beyond what a typical database implementation would be capable of.


> But it's also not exactly just a database. It contains contextual relationships as seen with things like GPT that are beyond what a typical database implementation would be capable of.

You mean in the same way that google.com isn't "just a database"?

If Copilot isn't intelligent, then what makes it more special than a search engine? How is Copilot not just Limewire but for code?

I could understand the argument that, if Copilot really is intelligent or sentient or something like that, then what it is producing is as original as what a human can produce (although, humans still have to respect copyright laws). However, I haven't seen anyone even attempt to make a serious argument like that.


It can produce code snippets that were never seen by generating fragments from various sources and combining them in a new way. This makes it different from a search engine, which only returns existing items.


Is it producing code (by which I mean creating/inventing new code by itself), or is it just combining existing code? Because to me it seems like the latter is a more appropriate description.

* AI searches for code in its neural-net-encoded database using your search terms (ex: "fast inverse square root")

* AI parses and generates AST from the snippet it found

* AI parses and generates AST from your existing codebase

* AI merges the ASTs in a way that compiles (it inserts snippet at your cursor, renames variables/function/class names to match existing ones in your program, etc)

* AI converts AST back into source code

Is AI intelligently producing new code in that example? Because I don't think it is.

What would be an interesting test of whether it can actually generate code is if it were tasked with implementing a new algorithm that isn't in the training set at all, and could not possibly be implemented by simply merging existing code snippets together. Maybe by describing a detailed imaginary protocol that does nothing useful, but requires some complicated logic, abstract concepts, and math.

A person can implement an algorithm they've never seen before by applying critical thinking and creativity (and maybe domain knowledge). If an AI can't do that, then you cannot credibly say that it's writing original code, because the only thing it has ever read, and the only thing it will ever write, is other people's code.
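To make the question concrete, the hypothetical "merge the ASTs" step above could be sketched with Python's `ast` module (the snippet, the rename mapping, and all names are invented for illustration; this is the commenter's hypothetical pipeline, not how Copilot actually works):

```python
import ast

# Toy version of the hypothetical pipeline: parse a "found" snippet,
# rename its identifiers to match the surrounding codebase, re-emit source.
snippet = "result = helper(value)"
tree = ast.parse(snippet)

class Renamer(ast.NodeTransformer):
    # Hypothetical mapping from the snippet's names to the user's names.
    MAPPING = {"helper": "fast_inv_sqrt", "value": "x"}

    def visit_Name(self, node):
        node.id = self.MAPPING.get(node.id, node.id)
        return node

merged = ast.unparse(Renamer().visit(tree))  # requires Python 3.9+
print(merged)  # -> result = fast_inv_sqrt(x)
```

If that were all the system did, it would clearly be "combining"; the dispute below is precisely over whether that mental model applies to a neural network at all.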


This understanding of how a generative process from a NN works is completely wrong.

There is no database lookup.

I've attempted to break that part down here: https://news.ycombinator.com/item?id=27744156

But you seem to have a fundamental misunderstanding of what is going on inside the NN. There is no "search for code": it is generating new code each time, but sometimes that code will be the same as something it has seen, because there is little or no variation in the training data for that snippet.

The NN generates code token by token, conditioned on the code leading up to it (and perhaps the code ahead, similar to BERT).

If you see tokens like this you probably generate the same next token too:

  for i in range(1,10)
You have conditioned your input on the code you have seen and the most likely token you produce is ":".

That's what the NN does, but for much longer range conditioning.


I am not educated on this matter, but I have to ask for your clarification. Would that not be just a pre-emptive lookup? Akin to keeping a cache of "results" per input token that are essentially memorized and regurgitated?

Sounds like there is still a DB lookup, just not at runtime but at build time of the NN. Can you clarify this, please?


GPT-3's raw output is "logits": scores over an encoding space of tokens. The tokens are individual words, or even word pieces, as small as "for" or "if". Constructing code from an embedding space, even a more specialized one, is like constructing sentences by using a dictionary -- it is a lookup table, but it's not a database. Generation works by looking at the existing document (or a portion of it) and, based on what is already present, generating a token, then repeating until some condition is met (such as a length limit or end of sentence).

The issue here is that certain sentences (code segments) are memorized, and reproduced -- much like a language learner who completes every sentence which begins with "Mi nombre" with the phrase "Mi nombre es Mark". The regurgitation is based on high probability built into the priors, not an explicit lookup. A different logit sampling method (instead of taking the likeliest) reduces regurgitation, without changing anything else about the network. (It also makes nonsense happen more often, since nonsense items are inherently less likely!)
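That last point, that a different sampling method reduces regurgitation, can be shown with a toy softmax sampler (the logits here are made up; this is the generic temperature-sampling idea, not GPT-3's exact decoding code):

```python
import math
import random

def sample(logits, temperature=1.0):
    """Sample a token index from logits. Higher temperature flattens the
    distribution, so the memorized (highest-probability) continuation is
    chosen less often; temperature -> 0 approaches greedy decoding."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]          # stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    return random.choices(range(len(logits)), weights=weights)[0]

# A sharply peaked distribution: greedy decoding always regurgitates token 0,
# while sampling at temperature 2.0 picks alternatives a fair fraction of the time.
logits = [5.0, 2.0, 1.0]
random.seed(0)
greedy = max(range(len(logits)), key=lambda i: logits[i])
sampled = [sample(logits, temperature=2.0) for _ in range(1000)]
print(greedy, sampled.count(0) / 1000)
```

Nothing about the network changes between the two runs; only the rule for turning logits into a token does, which is why the same weights can either regurgitate or vary.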


This response is correct.


The code is literally stored in the weights of the NN, in a weird codification. That's exactly why it can retrieved verbatim.


> The code is literally stored in the weights of the NN, in a weird codification.

This is a massive simplification. It's adequate for most purposes, but when discussing at this level this simplification breaks down.

Tokens are stored, and the NN contains weights of likelihood of a token occurring after another, given a sequence of prior (and possibly post) tokens.

Verbatim retrieval usually means there is very little variation in the training data for that sequence, so the same set of weights gets stored.

So "retrieval" is actually the same generative process as unique code uses, but the NN hasn't seen any other versions.


That isn't even necessary. I've been exploring GPT-3 for a while and it is completely incapable of any reasoning. If you enter short unique logical sentences like "Bob had 5 apples, gave 2 to Mary, then ate the same amount. How many apples Bob has left?" No matter how many previous examples you give it (to be sure it gets the question), it gets it wrong. It is simply incapable of reasoning about what is going on.


> Is it producing code (by which I mean creating/inventing new code by itself), or is it just combining existing code?

How do you differentiate between these two things?

If I take `for i in range(10):` from one place and `print(x * 2)` from another place, and combine them to get

    for i in range(10):
        print(i * 2)

have I "produced" something or "combined" existing things?

As an aside, your understanding of how the model works here is completely wrong. Like, just absolutely fundamentally completely wrong.


> How do you differentiate between these two things?

That's a contrived example because none of those lines could be protected by copyright, patents, etc. A better example might be if you started selling a 30 minute movie that was just the first 15 minutes of Toy Story spliced together with the last 15 minutes of Shrek. I'm not a lawyer, but I'm pretty sure that would qualify as a derivative work, meaning you're potentially infringing on someone's rights (unless they've given you permission/a license).

And to be clear, none of these problems are new. People have been fighting over copyright and its philosophy in court for a very long time. The only thing that's different here is that some people seem to think it's ok to ignore copyright if you use Copilot as a proxy for the infringement.

> As an aside, your understanding of how the model works here is completely wrong. Like, just absolutely fundamentally completely wrong.

Of course I don't, it's a neural network. You don't know either. That example I posted could be exactly what it's doing, or not even close.

(although for the record, I wasn't trying to explain how copilot works in that comment. It was a hypothetical "AI" for the sake of discussion, not that it matters. My point about it being copyright infringement is the same even if that hypothetical implementation is wrong)


> Of course I don't, it's a neural network. You don't know either. That example I posted could be exactly what it's doing, or not even close.

What is this supposed to mean? We know how neural networks generate things like this very very well.

I personally have built a system that feeds pictures of hand-drawn mobile app layouts into a NN, which then generates a JSON-based description file that I compile into a React Native and/or HTML5 layout file.

This was trivially easy in 2018 when I did it. It took me maybe 2 weeks engineering time, and I'm no genius. Our understanding of how transformer-based NNs work has come a long way since then, but even back then it was easy to show how conditioning on different parts of the image would generate different code.


> That's a contrived example because none of those lines could be protected by copyright, patents, etc.

Well no. The question I'm asking about, the philosophical distinction between "producing" or "combining" is a valid question no matter the copyrightability of anything. It's an interesting philosophical question even if we presume that copyright is bubkis.

> It was a hypothetical "AI" for the sake of discussion, not that it matters.

Ah, my mistake. I see that now.

> Of course I don't, it's a neural network. You don't know either.

I may not know how to make a lightbulb, but I do know hundreds of ways not to make one ;)


> A person can implement an algorithm they've never seen before by applying critical thinking and creativity (and maybe domain knowledge). If an AI can't do that, then you cannot credibly say that it's writing original code, because the only thing it has ever read, and the only thing it will ever write, is other people's code.

This doesn't hold at all. Not many people can come up with an original sorting algorithm for example, but people write code all the time.


The fact that it reproduces code verbatim, including comments and even swear words, means it is definitely copying some of the time.

Does it copy all the time? Doesn't matter. Plagiarism is plagiarism, regardless of whether it is done by a student in school, an author, a monkey or an "AI".

You wouldn't accept this from a student, you shouldn't accept it from a coworker (unless you are releasing under a compatible license), and of course you shouldn't accept it from Microsoft.


Co-pilot produces original code (as in code that has never been written before). It's not just combining snippets.

This should surprise no one who has seen the evolution of language models. Take a look at Karpathy's great write-up from way back in 2015 [1]. It generates Wikipedia syntax from a character-based RNN. It's operating on a per-character basis, and it doesn't have sufficient capacity to memorise large snippets. (The Paul Graham example spells this out: 1M characters in the dataset = 8M bits, and the network has 3.5M parameters.)

Semantic arguments about "is this intelligence?" I'll let others fight.

[1] http://karpathy.github.io/2015/05/21/rnn-effectiveness/


You're swapping "It does, at least occasionally, combine snippets" for "it's just combining snippets"


That it can combine snippets doesn't mean the OP's understanding of how the system works is correct.

They seem to believe it is a database system. That's really not how this works, and the fact it behaves like one sometimes is disguising what it is doing.

If I say "write a for loop from 0 to 10 in Python", probably 50% of implementations by Python programmers will look exactly the same. Some will be retrieving that from memory, but many will be using a generative process that generates the same code, because they've seen and done similar things thousands of times before.

A neural network is doing a similar thing. "Write quicksort" makes it start generating tokens, and the loss function has optimised it to generate them in an order it has seen before.

It's probably seen a decent number of variations of quicksort, so you might get a mix of what it has seen before. For other pieces of code it has only seen one implementation, so it will generate something very similar. There could be local variations (eg, it sees lots of loops, so it might use a different variation) but in general it will be very similar.

But this isn't a database lookup function - it's generative against a loss function.

This is a subtle distinction, but it's reasonable to expect that people on HN understand it.


> Some will be retrieving that from memory, but many will be using a generative process that generates the same code, because they've seen and done similar things thousands of times before.

How are these not both the exact same process of memory recollection? Can you elaborate on the difference between memory recall and a generative process based on conditioning? I understand how the two differ in application, but I don't understand why one would say they are fundamentally different processes.


Analogies start to break down once we are talking at this detailed level.

The best I can come up with is this:

Imagine you are implementing a system to give the correct answer to the addition of any two numbers between 1 and 100.

One way to implement it would be to build a large database, loaded with "x" and "y" and their sum. Then when you want to find out what 1 + 2 is you do a lookup.

The other method is to implement a "sum" function.

Both give the same results. The first process is a database lookup; the second is akin to a generative process, because it does a calculation to come up with the correct result.

This analogy breaks down because a NN does have a token lookup as well. But the probabilistic computation is the major part of how a NN works, not the lookup part.
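The two implementations in that analogy can be sketched directly (hypothetical code, just to make the contrast concrete):

```python
# Method 1: a "database" -- precompute every answer and look it up.
lookup = {(x, y): x + y for x in range(1, 101) for y in range(1, 101)}

def add_by_lookup(x, y):
    return lookup[(x, y)]

# Method 2: a "generative" process -- compute the answer on demand.
def add_by_computation(x, y):
    return x + y

# Indistinguishable from the outside:
assert add_by_lookup(1, 2) == add_by_computation(1, 2) == 3
```

From the outside the two are behaviourally identical on the covered domain; only the internal mechanism differs, which is exactly the distinction being drawn about the NN.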


Perhaps it’s not so different from a search engine like Google. The article cites Google’s successful defence, under US copyright law, of its practice of displaying ‘snippets’ from copyrighted books in search results. There is a clear difference between this and the distribution of complete copies on LimeWire.


If you look at it this way, your brain is also "just" computational statistics. (Or, to be precise, it might be, since we don't yet know in detail how it works.)


Hint: it has been said hundreds of times since the advent of computer science that the brain is "just" [some simple thing that we already understand]. That notion has never once helped us in any way.


Define intelligence


Also define "computational statistics". It'll be fun to try and fail to draw a clear line between the two.


A common tech-bro fallacy. We understand exactly what is happening at the base level of a statistics package. We can point to the specific instructions it is undertaking. We haven't the slightest understanding of what "intelligence" is in the human sense, because it's wrapped up with totally mysterious and unsolved problems about the nature of thought and experience more generally.


The fallacy is the god-of-the-gaps "logic" of assuming there's some hand-wavey phenomenon that's qualitatively different from anything we currently understand, just because reality has so much complexity that we are far from reproducing it. You're assuming there's a soul and looking for it, even though you don't call it that.

Intelligence is mysterious in the same way chemical biology is mysterious (though perhaps to another degree of complexity)... It's not mysterious in the way people getting sick was mysterious before germ theory. There's no reason to think there's some crucial missing phenomenon without which we can't even reason about intelligence.


To be fair, they themselves referred to intelligence as "ill-defined"...


It also shows that copilot knows nothing about copyright, and is incapable of considering copyright as such.

I'm not sure I would characterize it as a "database stored in a neural net", but that is definitely something to deeply consider.


> actually learning and producing new things using intelligence

People have been trying to accomplish that for 65 years. We're not even close. It's the software equivalent of cold fusion (with less scientific rigor)


I think it warrants investigating exactly how and when Copilot reproduces code, but using one example to write it off as just copying and pasting seems excessive.

Also when talking about rights, whether or not Copilot copies doesn't seem sufficient to make a call. For instance, if it has to be coerced by the programmer to produce these kinds of snippets in an obvious way, then it seems fine to lay the blame on the programmer similar to when using regular autocompletion (or copy+paste for that matter).


How do you know it's "just a code search engine"? Or "not AI" or "not learning and producing new things", or all the other claims people are making about it? All of these are essentially untestable statements.

It has memorized one thing. That doesn't prove it's not intelligent. If anything it's the other way around, we would expect an intelligent being to be capable of memorization.

All I can think of is the Turing test and the AI effect. Eventually we will have an AI that is capable of writing code indistinguishable from a human, and people will STILL say it's "not AI" and "just a code search engine", etc. Obviously this isn't there yet, but it's clearly getting closer.


A big part of the work of almost any software engineer is finding similar, already-written pieces of code and adapting them. How is this different?


Because you can't sue an AI


And when you then ask: what is 2+2=...

Is 5 a more intelligent answer than 4, because it is new? Copilot is an autocomplete engine, not a creative writing one.


But if it wasn't copying literally, why were the comments included verbatim?


> The excerpt from Quake code is literally one of the most famous functions out there. There is no wonder that it was reproduced verbatim.

The question this raises: the copying was found because the function is so famous, but what if Copilot is repeating Joe Schmoe's weekend library project and we never know, because it's not famous?


Because someone already checked, and it doesn't:

https://docs.github.com/en/github/copilot/research-recitatio...

Every literally quoted part that could infringe appears at least 10 times in the training data


This doesn't stand on its own as a defense: perhaps the 10 inputs were legitimate copies of a single source. They could be forked repos that were properly following the original's license, for example.


Or 10 different GPL projects that legitimately share code that remains copyrighted and protected by the GPL. Or 10 obscure projects that illegitimately copied code but haven't been caught.

Clearly, "10 other people did it" is no defense at all.


It might not even be "10 other people". For projects which originated outside Github, it's common for multiple users to have independently uploaded copies of the project. There's probably at least 10 users who have pushed copies of the GCC codebase to Github, for example.


> at least 10 users who have pushed copies of the GCC codebase

That is "10 other people". (Although your point stands, since there isn't, or at least shouldn't be short of criminal impersonation, any strong impediment preventing one person from creating 10 different accounts.)


It isn't "10+ other people wrote that", though. It's the same work, by the same person, being represented 10+ times in the training corpus.


Someone who works at Github already checked. I think asking for a more independent study is fair.


Would it? What would the threshold be? Twenty lines copied verbatim? Ten? What about boilerplate like ten #include statements at the beginning of a file? Or licenses in comments? What if someone has a one-liner that's unique enough to be protected by copyright?


https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,_...

I think that was a big part of the Google Vs Oracle case - how much copying constitutes an infringement?

It looks like they made a fairly complex rubric to apply in the future; it appears it would be applied on a case-by-case basis.


Pretty sure that if someone trained a code suggestion tool on the Windows source, Microsoft would claim that even a single matching character is grounds for copyright infringement.

They are putting GPL code in non-GPLed codebases. Is it okay to take sections of other people's source code and use them in yours, as long as you got them as a suggestion?


The true test will be whether MS indemnifies me against claims of copyright (and patent) infringement due to use of their tool.


This will be interesting to watch.


The funny thing about the Quake function is, id Software is almost certainly not the origin of the code. They copied it from somewhere else, possibly added profane comments, then slapped GPLv2 on it. Did they even have the right to do that? From an IP absolutist standpoint, probably not.

https://www.beyond3d.com/content/articles/8/


They did not copy the implementation, they copied the general idea of what the algorithm should do.

Do not go down this line of reasoning, otherwise we will be copyrighting the concept of for loops.


> they copied the general idea of what the algorithms should do. Do not go down this line of reasoning

Too late, patents pick up where copyright ends, to protect general algorithmic ideas, not just implementations. And we have lots of patents on things that seem trivial now, including for-loops (just see how many patents depend on “a multiplicity”). Look - here’s a helpful lawyer’s template for including for-loops as a claim in your own patents: https://www.natlawreview.com/article/recursive-and-iterative...

Another example is the famous XOR patent https://patents.google.com/patent/US4197590/en

EFF keeps a blog on stupid patents https://www.eff.org/issues/stupid-patent-month


> Look - here’s a helpful lawyer’s template for including for-loops as a claim in your own patents:

A claim on a combination of elements in which one element is an iterative component is not the same as claim on all iterative components everywhere.


True, correct. I didn’t intend for my comment to be interpreted as suggesting that a claim in a patent is the same thing as a whole patent, I was just pointing out a fun fact.


Anyone who believes in a free and open society should do away with all copyrights and patents.

Anyone who thinks that licensing will have an effect on what is happening in reality is severely misguided.


> Anyone who believes in a free and open society should do away with all copyrights and patents.

Free and open sound good to me! What do they mean exactly? I guess it’s a non-debatable fact that copyrights and patents are abused by many big companies and patent trolls, but doing away with the system does seem extreme, it has also protected deserving individuals on occasion, no? You are saying that it should always be legal to copy someone else’s code / inventions without giving them any credit or compensation?

> Anyone who thinks that licensing will have an effect on what is happening in reality is severely misguided.

I’m not sure I understand what you mean; lots of licensing activity does have a measurable effect on reality. This article is only a small example, but people get sued all the time over taking code and using it without licensing it.


> They did not copy the implementation, they copied the general idea of what the algorithm should do

[Citation needed]


> > They did not copy the implementation, they copied the general idea of what the algorithm should do

> [Citation needed]

(Not really your point, as such, but) no, actually, if you claim they did something (nominally[0]) wrong, the onus is on you to provide citations showing they did it[1].

0: > From an IP absolutist standpoint

1: Well, or that they (voluntarily and explicitly) accepted some responsibility (such as a job as a police officer) that entails a higher level of scrutiny than innocent-until-proven-guilty, but that's not really relevant here.


If you wrote an algorithm in the early 80s that did x+y+z

And then I saw your source code and in the late 80s I changed the variable names, function name, and logic to be x+y+z+0.1

And then I told my friend John that there's a super cool algorithm that adds numbers together, and he made some more changes to it and compiled it for a different platform...

Has anybody broken the law in your mind?

EDIT: because it would seem that the original authors (among them Cleve Moler) don't have any issue with what transpired


The GP's argument is that you don't have evidence that they didn't copy the whole function verbatim.

Is there a source that said they changed variable and function names and modified the logic?

> because it would seem that the original authors (among them Cleve Moler) don't have any issue with what transpired

Yet. Without an explicit license there is no basis to release it under the GPL (if the code was copied verbatim or had insufficient re-writing). What if the heirs of the copyright owner wanted to assert their rights? Is there a doctrine that if you don't assert your rights you lose them? (Presumably applies to trademarks, but I don't think this is the case for copyrights)


The source code in question is over 40 years old and most likely doesn't exist anymore in its original form.

What do we do then? The burden of proof for infringement is on original authors, and they haven't done so for 40 years.

In the late 1700s and early 1800s, Britain had to take measures to prevent visiting Americans and others from memorizing the designs of their new high-tech machinery, like the steam engine and the power loom.

Where do we draw the line? Shut down the internet until we create a massive copyright detection firewall?

No, we live with the copying and constantly evolve and adapt our business. Death to all patent trolls.


> Where do we draw the line?

I won't even claim that people must necessarily follow the law. Copyright law is inconsistent at best, and notoriously hard to follow to the letter (and often ridiculous). In practice lawyers assess the legal risk and weigh the outcomes.

I never intended to discuss what we should do, and I definitely did not propose shutting down the internet...

The original discussion was such:

> > They did not copy the implementation, they copied the general idea of what the algorithm should do

> [Citation needed]

You said the original authors did not complain, which is neither here nor there, as I pointed out. There is still some theoretical legal risk if you copy with the owner's knowledge but not express consent. The fact that the burden of proof is on the authors is true but that they have not brought a claim does not mean they cannot prove infringement.

And in case I haven't made it clear, I don't think it's a bad idea to assume the function is under GPL, I just don't think there's a basis for claiming what you originally claimed, and there is still some level of (probably acceptable) risk if you take the purported license of source code as-is.


How do we obtain the source code to verify it wasnt infringed upon?


It's not the actual copying of the idea, but the verbatim reproduction of the function, comments and all. I think people somehow thought that copilot could write code, and so verbatim reproduction was surprising to them.


A quick search shows that this snippet, including comments, is included in thousands of Github repos [1], so it's not surprising that the model learned to reproduce it verbatim.

It's such a famous snippet that it's even included in full on Wikipedia [2].

I wouldn't be surprised if the next version of Copilot filtered these out.

[1] https://github.com/search?q=0x5f3759df+what+the+fuck&type=co...

[2] https://en.wikipedia.org/wiki/Fast_inverse_square_root#Overv...


I would love to try a session of clean-room reverse engineering using Copilot. I would bet you get reasonably far for very common libraries with not much effort. The question is whether such compression/decompression would infringe copyright.


But that fast inverse square root example is particularly interesting because it is also a derivative work. Carmack did not invent it, and several variations of it had been passed around over time.

Algorithms should not be subject to copyright, that way lies madness. It would prevent new generations from building on top of the work of their predecessors, because copyright lasts a very long time. The amounts of code that github copilot reproduces fall squarely into the “shouldn’t be subject to copyright” domain for me, even if they pass the bar for originality.


Something which is a “derivative work” is still copyrighted. In fact, by definition, a “derivative work” is copyrightable. It’s the minimum threshold at which something, based on something else, gets its own, new copyright.

The algorithm is not copyrighted, but the source code of the function is copyrighted. You could learn how the algorithm works by reading the function, and then write your own function that implements the same algorithm. Algorithms are not copyrightable, they are not subject to copyright. Source code is copyrightable.

Copilot is not reproducing just the algorithm, it is spitting out large chunks of the copyrighted source code, verbatim.


The example you linked to is talking about a 16 line function from the Quake source. The Quake source is 167,594 lines in total (counting the C code only). Does that really fail to meet the standard for "smallest excerpt"?


That excerpt has its own Wikipedia page, of course it meets the threshold of originality. In any case, once you are discussing this, you have entered the area of fair use; that is an admission of copyright violation.


Having a WP page isn't proof of threshold-passing. https://en.wikipedia.org/wiki/BACH_motif

There is also a real question about laundering and what constitutes "use", and de minimis to consider.

And EVERYTHING will depend on jurisdiction of course.

IANAL


Fair use is not a violation of copyright but a specified (and since 1976 statutory) exception to it. You are clearly impugning the doctrine with your comment.


Not only that, but it is clearly someone going out of their way to make it do that. I’m not sure that that is a reasonable test of how the program typically behaves.


> I’m not sure that that is a reasonable test of how the program typically behaves.

That's not what people care about, people care about their copyright being blatantly violated by a massive corporation _without any consequences_.


Honestly, I feel most people don't care about that. What they do care about, is the risk of Copilot making the user liable for copyright infringement. Even a possibility of it spewing out non-public-domain code should be considered a showstopper for any use of Copilot-generated code in a commercial project.

Can Copilot produce licensed code verbatim, in enough quantities to matter, with a license your business would be infringing? Yes. Can you easily tell by looking at the output? No. Could someone end up suing you over it? Maybe, if they cared enough to find out. Can you honestly tell your investors, or a company you seek to be acquired by, that nobody else can have valid copyright claim against your code? No.


> Can Copilot produce licensed code verbatim, in enough quantities to matter, with a license your business would be infringing? Yes. Can you easily tell by looking at the output? No. Could someone end up suing you over it? Maybe, if they cared enough to find out. Can you honestly tell your investors, or a company you seek to be acquired by, that nobody else can have valid copyright claim against your code? No.

Well aren't all your assertions exactly the point of contention?


Well, the "enough quantities to matter" part hasn't been tested in court yet, but I fail to see how a court could rule "no" here without gifting us a universal way to turn any code into public domain, destroying source code licensing as a concept. Other than that part, the first two claims have already been demonstrated, and the rest follow from them.


But that is in fact the most fundamental question here. And I’m not fully sold on the idea either that this is going to happen in real-world usage or that a single function in a massive program constitutes a large enough portion to be infringing.


Quake's square root function wasn't the only, or the largest, example of code Copilot reproduces verbatim. Among others I've seen to date is someone generating a real "About" page with PII information of some random software developer.

How much code is enough to infringe is a tricky question, though. It's not only a function of size, but also of importance/uniqueness - and we know that Copilot doesn't understand these concepts.


> ... or that a single function in a massive program constitutes a large enough portion to be infringing.

As part of the sequences of rulings in Google vs Oracle, the 9-line rangeCheck function, in the entirety of the Android codebase, was found to be infringing.


Ok, but is “I can go out of my way to make it misbehave” adequate proof that the copyright is being violated?


Not GP.

Yes, it is, because that means that the algorithm will produce that copyrighted code regardless of the intent of the person who makes it misbehave. People could both accidentally and "accidentally" make it reproduce copyrighted code. In the first case, it's unintentional. In the second, how could you prove it's intentional?

Because of this whole mess, I am actually adding clauses to FOSS licenses that I am writing, just to ensure that my copyright on my code is not infringed by code laundering.


I'm not at all in favor of the "code laundering" (which is a brilliant term, thank you). But I don't understand how you expect a new license to help.

1. A license applied to source code is effective because of your copyright

2. The claim of Copilot's maintainers is that it bypasses copyright

Therefore, they will assert that they can ignore the new license saying "you may not launder my code" just as surely as they can ignore the previous license.


First, I did not come up with the term "code laundering." I cannot claim credit for that; I saw it first on HN on https://news.ycombinator.com/item?id=27729209 somewhere.

Second, you are correct that Copilot's maintainers claim that it bypasses copyright, but if it does while producing exact copies of code, then copyright is dead, and there are a lot of big companies out there with deep pockets that will ensure that doesn't happen.

They may claim that because their algorithm is a black box, whatever it produces has no copyright, but my licenses will push back directly on that claim by saying that if source code under the license is used, in whole or in part, as input to an algorithm, then the license terms must attach to the output. After all, that's what we do with the GPL and binary code: the binary is the output of an algorithm (the compiler) whose input was the source code.

I hope by tying it together like that, the terms can close the loophole they are claiming. But of course, I am going to get a lawyer to help me with those licenses.


> ... if source code under the license is used as all or part of the inputs to an algorithm, whether all of the source code or partially, then the license terms must be attached to the output.

You're not getting it. If Copilot isn't currently infringing copyright then adding such a clause won't matter. Such a clause would only hold weight when copyright applies. On the other hand, if copyright does apply, then you don't need such a clause because the activity is already a violation of the vast majority of licenses. (It even violates extremely permissive ones because it effectively strips out the license notice.)

The GPL works specifically because copyright applies to the usecase in question. It simply specifies various requirements that you must meet in order to license the code given that copyright applies.

In short, you can't just put a clause into a license saying, effectively, "and also, this license confers superpowers which make it so that my copyright applies in additional situations where it otherwise wouldn't!".


I think the GP's "license" would still be effective, although it would not be "open source" per the OSI definition.

Imagine this simplified scenario first: if I published a source file publicly without any licensing or explanation except a standard copyright notice - "Copyright (C) 2021 MY NAME, all rights reserved", do you think a random person/company can take that code and integrate it into a commercial product?

I would argue not (in general). Copyright law, as it is, does not permit a user who has access to a copy to do whatever they want with that copy (esp. if it involves more copying). OSS licenses do give you much freedom as long as you comply with their terms, and that's why we have the impression that we can do whatever we want with publicized source code. However, if we think about other types of copyrighted work, say movies, streaming services can "rent" you a movie multiple times even though you've paid to download the content previously. What are you paying for the second time you rent? Another example: some photographers may allow you to freely browse their works, but they can still make you pay if you want to use a photo in your commercial product.

So why wouldn't copyright restrict usage of source code in similar situations? The GP only needs to add a condition to the license to restrict how users can use it. It will no longer be OSS, but as long as it's his work, I don't see why in principle it shouldn't work.

(In practice, I don't think it will make much difference -- I think your argument is still somewhat compelling, and some people will probably take your position. Conservative corporate lawyers aimed at reducing legal risk would disagree, so it's basically a matter of how much legal risk one is ready to take. Also, for an author trying to do this, note that suing Microsoft in these cases would be expensive, since they will likely fight back given that they spent so much money trying to do this, and the outcome will be uncertain. If really tested in court, given the result of the Oracle v Google case, if the US Supreme Court is impressed by the social/economic benefits that Android brings, I'm pretty sure the justices will be even more impressed by this intelligent code generation thingy, and might just grant this thing a fair use.)


Your summary is generally correct, and I certainly agree with the other commenter's position on their work. But I think you're still missing the point. Copyright is the mechanism that allows you to prevent copying, but GitHub's claim is that copyright is irrelevant to Copilot's input.

I have a nice strong lock on my door. GitHub (asserts that it) can enter my home through the window.

Adding another deadbolt to the door does not help.


I don't think I missed that point. I'm trying to argue that copyright is relevant to Copilot's input if not allowed by an OSS license.

Maybe I'm missing something (just not the thing you said), but has Github made any legal claims so far? The original article is written by a politician in EU...

Even if you're a lawyer defending GitHub in this case, there are still a couple of things that need to be clarified before you can make the case (maybe the info is out there, but I'm too lazy to research):

- Is Github only using code/repos that are explicitly under OSS licenses? (because if that's the case, then the discussion might be justified in presuming OSS terms, and it may be the case that more restrictive non-OSS licenses would require a different analysis)

- As somebody pointed out in another thread, the Github terms of service agreement seems to grant Github additional rights when dealing with user uploaded content. Is that a legal basis for the use?


> I'm trying to argue that copyright is relevant to Copilot's input if not allowed by an OSS license.

And I tend to agree with you (and the other commenter) here. But GitHub doesn't.

> has Github made any legal claims so far?

I'm not sure how actively, but the CEO was here in the announcement thread the other day saying that they think the ingestion of the inputs is a "fair use". They also have some material defending the output side: https://docs.github.com/en/github/copilot/research-recitatio...

> Is Github only using code/repos that are explicitly under OSS licenses?

I don't think we know exactly what code they used as inputs, no.


Their argument defending the output side doesn't hold water, IMO. If Copilot produces exact copies verbatim, even some of the time, then as long as customers don't have access to the code used to generate the model, how can they be sure what they're getting isn't a verbatim copy?

It's a matter of scale. With a big enough codebase, there will be copyright violations.


> I don't think I missed that point. I'm trying to argue that copyright is relevant to Copilot's input if not allowed by an OSS license.

The point (they claim) you are missing is that if "copyright is relevant to Copilot's input", then almost all existing OSS licenses already disallow that use.


The licenses that I am making implicitly acknowledge the argument that training an ML model is fair use.

However, GitHub said nothing about the output of the model being fair use. My license will say that the output of their model is under the same license as the input, which means they have restrictions if they want to distribute it (i.e., actually have people use Copilot).

I think this will work because it doesn't say that GitHub is wrong. Instead, it says that, even if GitHub is right, it doesn't matter.

It would also be very bad for GitHub to claim that the output of an algorithm can't be under the same license as the input because we feed licensed code to algorithms all the time and claim that their output is still under the same license. We call those algorithms "compilers" and the binary code they produce is still copyrighted and licensed.


> I think your argument is still somewhat compelling, and some people will probably take your position.

I didn't mean to take a side or argue a position here. I was just pointing out that licenses hold no legal power in the event that copyright itself doesn't apply.

> ... So why wouldn't copyright restrict usage of source code in similar situations?

I'm certainly not an expert here but I believe you are mistaken about the extent to which current copyright law (in the US) restricts such usage. I also don't think that the examples you bring up are as simple as you seem to be making out.

You are legally permitted to record broadcast shows for later viewing; you are not permitted to redistribute the recordings though. I assume (but am not certain) that rentals and streaming are the same. (That being said, bypassing DRM has been made its own crime. This effectively amounts to an end run around the rights otherwise granted to you by US copyright law. But then there are specific exceptions where bypassing DRM is permitted. I digress.)

You aren't legally permitted to mirror the contents of a website (such as the New York Times) without permission but you are allowed to access it since they make it publicly available. You are even permitted to save a copy for your own purposes when you access it; you are not permitted to redistribute that copy.

For an extreme example, consider the recent LinkedIn case. Unless I misunderstood it, the court deemed it acceptable to scrape any publicly available content. Certainly most such scraped content was never explicitly licensed for that though!

Even if the license for a piece of code was entirely proprietary, GitHub presumably acquired it through legal means (ie intentional upload). Once they have it in their possession, it's not at all clear to me that current copyright law in the US has anything to say about how they use it (short of redistribution). Of course, if their ToS promises that they won't use it for other purposes then they can't do that. But assuming they never promised you that in the first place ...

There's a traditional argument here about needing a license to legally incorporate the copyrighted work of another into your own.

One possible counter argument is that training a model on publicly available work is analogous to a person viewing that work. So long as the model never outputs any of the original inputs (or only exceedingly small fragments of them that would fall under fair use regardless) it's not clear that those outputs constitute derivatives at all (in the legal sense). Or they might. The courts haven't weighed in yet as far as I know. (Consider GPT-3 or This Waifu Does Not Exist for additional examples of the sort of ambiguity that's possible here.)

Of course, one possible counter to that is that the model itself is (in many cases) effectively a lossily compressed copy of the original input works. So perhaps redistribution of the model itself would be a violation of copyright. But even if that turns out to be the case, it's still not clear that the output of such a model would run afoul of copyright.


You have good points.

I argue that the output of an algorithm has the same copyright as the inputs to the algorithm, and that's because we use compilers (algorithms) to transform source code all the time already, and no one says that the binary code (outputs) is not copyrighted.


The trouble is there seems to be an entire continuum when it comes to degree of transformation.

The compiler produces more or less a direct (logical) translation so it's clearly some sort of derivative. We go from C to machine code but the output still "means" the same thing as the input. (More precisely, it's approximately a mathematically transformed subset of the original input. Lots of information is removed, things are reorganized, and a bit of extraneous information gets added in the process.)

For something notably more muddy than a compiler, consider This Waifu Does Not Exist. Any given output is (typically) nowhere near any particular input but you can often spot various strong resemblances.

Alternatively, the implementation of sketch-rnn (https://magenta.tensorflow.org/sketch-rnn-demo) is quite different - it outputs pen strokes instead of pixels. Still, the legal questions remain the same.

For a significantly muddier example, consider GPT-3. The outputs are (typically) not even remotely similar to anything that was input except in very broad strokes.

Where does Copilot fall along this continuum?

For even more confusion, consider running a New York Times article through Google Translate. Are you in the clear to publish that? I seriously doubt it.

But what about running it through an ML algorithm that (attempts to) produce a very brief summary of it? Many such implementations exist in the real world today. Their output is nothing like the input - should it still fall under the copyright of the original?

Finally, it's worth pointing out that for many of the above computerized tasks there are direct human equivalents. Art can be traced on a light table. A drawing can be produced that fuses the styles of two references. News articles can be manually translated or summarized.

Again, my intention here isn't to argue a particular side. I'm just trying to make it clear how complicated this stuff is and the fact that we don't have clear legal answers for most of it yet.


Ah, I see.

I argue that, even if training a dataset is fair use, distributing the result is copyright infringement. I would want my license to make that part clearer.


> even if training a dataset is fair use, distributing the result is copyright infringement

I would be inclined to agree that the current situation (ie reproducing training examples verbatim) violates copyright. On the other hand, I'm not so sure that a trained model does (or even should) be subject to the copyright of the inputs.

Of course I acknowledge that the latter view is controversial and also that such issues are so new that they haven't had a chance to be meaningfully addressed by either the courts or the legislature yet.

As an example of a similar situation, see (https://www.thiswaifudoesnotexist.net/) which was trained entirely on copyrighted artwork. Note that there are at least three distinct issues here - training the model, distributing the model itself, and distributing the output of the model.

> I would want my license to make that part clearer.

But again, GitHub's argument here is that the license is completely irrelevant because it doesn't apply in the first place. Thus they won't care one bit about any clarifications you make one way or the other.


You said that you're "not so sure that a trained model does (or even should) be subject to the copyright of the inputs."

You missed my point. I'm not saying that the model is subject to the copyright of the inputs; I'm saying that the model's outputs are, which is entirely different. We say that the output of a compiler is still subject to the copyright of the inputs, so why not this?


I misspoke. (Err mistyped?) I suspect there will often be a stronger case to be made for the model itself falling under copyright than what it outputs. It's up to the courts and the legislature in the end though, so who knows.

Anyway, by providing public access to this thing I infer GitHub to be taking the position that copyright doesn't apply to the output. (And I suspect they are wrong, in particular because of the verbatim code samples people have managed to coax out of it.)


> even if training a dataset is fair use, distributing the result is copyright infringement

That seems an unlikely legal argument. It would defeat the point of fair use if you couldn’t distribute the result.

And no copyright license can override copyright law. Licenses can only grant rights, they can’t take them away.


Can you add fines?


I wish. I just want users to know what rights they have. Ultimately, I want my software to serve end users, not companies. If companies add value for users with my software, that's exactly what I want.

But stripping licenses away so that users can't know what rights they have with my code is not that.


>I am actually adding clauses to FOSS licenses that I am writing

Doesn't this make your new licenses incompatible to a lot of existing licenses?


Not necessarily. If you do it right, you've got a perfectly GPL-compatible license (because such laundering is, technically, a violation of the GPL… probably) – it's just a license that's more explicit about what's a license violation.

Law isn't code.


GPL explicitly forbids re-licensing under more restrictive terms.

So either the added terms are not more restrictive, which basically means they are unnecessary and have no real effect; or they are more restrictive, which is incompatible with the GPL.

You can't have things go both ways. It seems that your argument is "we're not adding restrictions, we're just saying what we think Copyright law / the GPL should actually be like." But unfortunately you can't "clarify" Copyright Law or "clarify" the GPL by adding terms. Ultimately courts decide that.

(Of course, if somehow your "clarification" happens to align with a court decision, then maybe it will work after all. But in theory your "clarification" is still not necessary and has no additional effect....)


> But in theory your "clarification" is still not necessary and has no additional effect....

Except your clarification will be interpreted by a court of law. “This license is compatible with the GPL and I can interpret the GPL in a way that lets me do something this license says I can't” is much less likely to stand than “well maybe the author thought the GPL said this, but it actually says my interpretation”.

This, of course, presumes that such a license is actually compatible with the GPL, something I'm getting less and less certain of over time. (What constitutes a compiled form? If a predictive model doesn't count – which it might not, since it outputs source code, very much unlike how compiled programs normally work – then my argument falls down. And many other things would also knock the argument down; I'm not confident enough that all my assumptions are right, or that they should be right.)


GPL code and its derivatives can't be distributed with additional restrictions.


wizzwizz4 is correct. Also, I have explicit clauses saying that GPL/AGPL dominate.

But yes, my licenses may be incompatible (one-way) with permissive licenses. I say "one-way" because code with permissive licenses can still be used in code under my licenses, but maybe not necessarily the other way around.

I'm okay with that.


That does not really ring true to me. AGPL broadens the scope of violations as well, and you cannot use AGPL code in GPL-only code bases without turning the end product AGPL (but you can use GPL-only code in AGPL code bases).

If you're just adding something along the lines of "copying passages extensive enough to reach originality is a violation of this license" then that's indeed already covered by the GPL, and there is really no need to add such a passage other than to be more explicit - and confuse people at least at first about why your license is not actually the GPL. So there isn't much of a point to do it in the first place, in my humble opinion.

If you add text that says something along the lines of "you may not use this code as training data", then you have created an incompatible license: your code cannot be used in GPL code bases, and even worse, since it restricts what users can do with the code more than the GPL does, you may also stop being compatible in the other direction and may not use GPL'ed code yourself in your own custom-licensed code base.

The AGPL does not further restrict code uses, just broadens the scope of when you have to make the code available, so it's fine there. However, the original BSD license with the advertising clause is considered incompatible with the GPL.

I am not a lawyer, and these are just my quick layman concerns. I fully recognize you're entitled to use whatever license you find suitable for your code and I am absolutely not entitled to your code and work whatsoever.

But that said, I wouldn't touch your code if I saw a "potentially problematic" custom license, and I wouldn't consider contributing to your projects either.


I understand your concerns.

Honestly, with this whole debacle, I am not going to be accepting outside contributions anyway.

I also understand the concern with a problematic license. However, I don't plan to make a specific exemption about machine learning, but rather tie up an ambiguity.

What I think I'll do is that the license will require that when the licensed source code is used, partially or fully, as an input to an algorithm, the license terms must be distributed with the output of that algorithm.

I don't think this is a violation of the GPL at all because the GPL requires you to distribute the license with the binary code of GPL'ed code, and such binary code is the output of an algorithm (the compiler) whose input was the source code.

But what it would do is put the onus on GitHub that, if they used my code in training that data, if they distributed the results (as they are doing), they must distribute my license terms as well and tell users that some of the results are under those terms.


> binary code is the output of an algorithm (the compiler) whose input was the source code.

Just because binary code is produced by the operation of an algorithm on source code doesn't make all output produced by any algorithm on that source code binary code. Otherwise checksums and hashes and prime numbers would be copyrighted.

Bats are not birds.
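To make the reductio concrete, here's a toy sketch (the snippet and its contents are made up purely for illustration): a checksum is literally "the output of an algorithm whose input was the source code", yet it clearly carries none of the source's creative expression:

```python
import hashlib

# A toy "licensed source file" (hypothetical, just for this example).
source = "int add(int a, int b) { return a + b; }"

# SHA-256 is an algorithm, and the source code is its input, but nobody
# would argue that the resulting digest inherits the source's license.
digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
print(digest)
```

Under the "any output inherits the input's copyright" rule, this 64-character hex string would be licensed code too, which seems absurd.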


You have a point, which is why the legal system would still require that a copy be substantial before they count it as infringing. I would argue that Copilot has already been shown to copy substantial portions, though.


> something along the lines of "you may not use this code as training data"

Would such a term be legally binding under present copyright law? Other than disallowing inclusion in a redistributed dataset specifically intended for training ML models, it's not clear to me that it would actually prevent such use if you already had a copy on hand for some other purpose. (Specifically, note that GitHub indeed already has a copy on hand for their authorized primary purpose of publicly distributing it.)

More generally, the manner in which copyright law applies to machine learning algorithms in general hasn't been worked out by either the courts or legislature yet. Hence the current article ...


To be clear, my suspicion is that this is so unlikely to happen unintentionally that it does not represent a real risk. If the issue is that I can force it to generate infringing output if I really want to, it is an argument against the Web browser too, since I could just as easily use the copyright-unsafe "copy" feature.


I don't entirely agree.

Whereas using the browser's copy feature requires the user to have intent to use it, getting Copilot to produce exact code does not. And proving that intent is not easy.

I think companies will see that such code can be exactly reproduced and decide to stay away from Copilot. I hope they do. In fact, I am less willing to take outside contributions for my own code, even for bug fixes, just because of the risk that that code came from Copilot.


That makes sense if you ignore the idea that such a thing would seem unlikely to happen without intent, which was the key thing in the post you’re replying to.


Unlikely stuff will always happen with enough use. There are billions of lines of code in the world. There will be enough copyright violations. Even on single multi-million line codebases, there will be violations.


How long does it have to be for you to consider it copyrighted code?

For example, a book could be copyrighted, but they certainly cannot sue me because a book I wrote contains a sentence that is the same.


The answer to your first question is for the courts to decide, unfortunately.

However, for my purposes, using a new license with particular terms would only be to make companies like GitHub pause and think before using my code as "training" to an "algorithm" like Copilot.


Double standards ensue.

Tool that could be used to violate copyright := Gets prosecuted by MPAA and friends, legislation is passed to make use / development / distribution of such tools illegal

Bigcorp ships the ML equivalent of ALLCODE.tgz, but you actually gotta look in the no/dont/open/this/folder/gplviolations/quake.c folder := Is this adequate proof that copyright is being violated?


Since I do not work for the MPAA, I don't see why you expect me to answer for them. Half of the article's argument is that any argument you could use to shut down Copilot would also give a lot of power to such entities if it were accepted.


It can, that does not mean that it will, in any case other than people actively probing it for that.


I'm not sure if making an "analysis" without doing any research whatsoever is reasonable.


I'm not sure either —which is why I said "may have been reasonable" instead of "was reasonable" :)

I can see an argument for doing your own research, but I can also see an argument for basing an analysis on what GitHub said in the FAQ — I'm honestly a bit surprised that Microsoft's lawyers let them say that with a product that can reproduce such large blocks of verbatim code.


My guess is that their lawyers weren't consulted, and that the Github people just shipped it on their own.


That literally cannot happen at FAANG/MS, especially not when the CEO announces the product in a public blog post.


Yep. Individuals in a FAANG don't have the ability to launch a product without review. Just drafting a press release for a new product involves Comms oversight and VP-level approval.


It's not obvious that Microsoft is violating copyright yet. The main concern is whether the product makes others liable.

So it could be that the executives really wanted to do it, and the lawyers thought "OK, technically we're not violating anything...."


[flagged]


Please omit flamebait from your HN comments. It tends to produce flamewars, which are tedious and nasty. Your comment would be fine without the last two sentences.

https://news.ycombinator.com/newsguidelines.html


I didn't realise I was perpetrating flamebait; my last two sentences were meant as rhetorical hyperbole (and I wasn't targetting anyone here!)

At any rate, I like it here; so I'll try to figure out how what I said was flamebait, and try not to say such things again.

Sorry.

[Edited upon re-reading]


Copyright is (and has been, since the earliest days) about protecting the creative expression of an idea.

You can't copyright an algorithm, but you certainly can copyright the expression of an algorithm in Python. You cannot copyright the words of the English language and their meanings, but Noah Webster absolutely did copyright his dictionary, which was a creative expression of their definitions (and lobbied for the first increase to US copyright law). Webster wasn't the "thought police" for trying to copyright people's understanding of words in English, because he didn't and couldn't copyright them; he copyrighted his expression of what words meant.

If you read the creative expression of an algorithm in Python and then re-express it in English, then sure, copyright protection doesn't extend to that re-expression. But Copilot isn't doing that, it's quite clearly reproducing parts of the original creative expression of an algorithm, not the algorithm itself.

Here's an easy way to demonstrate it: open up a source file in any language other than C and try to get Copilot to spit out an implementation of Quake's fast-inverse-square-root algorithm. You will very quickly discover that Copilot doesn't "know" the algorithm; it only "knows" the specific creative expression of it (comments included).
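To sketch the distinction between the algorithm and its expression: here is the same fast-inverse-square-root idea re-expressed in Python (the well-known 0x5f3759df bit trick plus one Newton-Raphson step; an illustrative re-expression, not Quake's code). It computes the same thing, yet shares nothing with the original C source: no variable names, no comments, no jokes.

```python
import struct

def inv_sqrt(x: float) -> float:
    """Approximate 1/sqrt(x) using the classic bit-level trick."""
    # Reinterpret the IEEE-754 single-precision bits of x as an integer.
    i = struct.unpack("<I", struct.pack("<f", x))[0]
    # Shift and subtract from the magic constant for a rough first guess.
    i = 0x5F3759DF - (i >> 1)
    y = struct.unpack("<f", struct.pack("<I", i))[0]
    # One Newton-Raphson iteration sharpens the estimate.
    return y * (1.5 - 0.5 * x * y * y)

print(inv_sqrt(4.0))  # roughly 0.499, i.e. close to 1/sqrt(4) = 0.5
```

Copyright plausibly protects the particular C expression (identifiers, comments, layout), not the underlying idea, which this re-expression carries just as well.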


> So is it the intention that rewriting in Python an algorithm previously expressed in C would be infringing?

Yes, a port from language X to Y is widely considered a derived work. Whether it is infringing is a separate question.


There are no thought police here.

In the US, copyright may include the choice of variable names, the organization of the code into modules and functions, and other aspects which where there are the creative choices that may be protected under copyright law.

The relevant process is described at https://en.wikipedia.org/wiki/Abstraction-Filtration-Compari... , which comes from the court case at https://en.wikipedia.org/wiki/Computer_Associates_Internatio.... nearly 30 years ago:

> the court presented a three-step test to determine substantial similarity, abstraction-filtration-comparison. This process is based on other previously established copyright principles of merger, scenes a faire, and the public domain.[1] In this test, the court must first determine the allegedly infringed program's constituent structural parts. Then, the parts are filtered to extract any non-protected elements. Non-protected elements include: *elements made for efficiency (i.e. elements with a limited number of ways it can be expressed and thus incidental to the idea)*, elements dictated by external factors (i.e. standard techniques), and design elements taken from the public domain. Any of these non-protected elements are thrown out and the remaining elements are compared with the allegedly infringing program's elements to determine substantial similarity.

Emphasis mine. This specifically highlights that your example ('only so many ways you can express an algorithm') is not protected under US copyright law.

The originality requirement only applies to other aspects of the generated code, which in this case would include the comments that Copilot generated, and which clearly are not required for the algorithm to work.

For thought police like you describe, look to patent law.


Clarification: One cannot patent an algorithm, but an implementation in source code can certainly be copyrighted.


Huh, that's interesting. While I'm hesitant to suggest that what the world needs is even more patents, this doesn't make immediate sense to me.

Let's say someone comes up with a new sorting algorithm, which completes in fewer cycles than was previously believed possible. Sure, it's math, but isn't that a new, creative expression? Don't we want to encourage them to publish their algorithm (one of the key purposes of patents—this way, anyone can use it after 20 years), as opposed to keeping it hidden from the world?

It makes more sense to me than most software patents (admittedly, a low bar to clear). And if the patent office is doing its job (big if), the patents should only be granted for algorithms which are sufficiently novel.


A new super-fast sorting algorithm (not just a few cycles, but something that actually changes the O-number) would obviously be a fantastic boon - I would want the inventor to benefit from his cleverness.

But nowadays I think patent law isn't the right way to do that; trade secrets should be enough. I don't think that what is disclosed to the public in patent applications is of enough value to justify a long monopoly. It's not necessarily a problem with the written law; patents are horrible because of the way courts apply them.


An algorithm definitely can be the subject of a patent. It's very strange to read such a statement on the interwebs, which were heavily affected by a patented image compression algorithm.


>One cannot patent an algorithm

software patents?


An algorithm is maths, you can't patent maths. Patent lawyers and business people have however somehow managed to convince courts/patent authorities that configurations of computer systems are patentable (or some similar argument), which then makes software patentable (IANAL but I think it's something like this).

Either way, the copyright of source code is separate from that. Copyright is for the text of a program (the source code), which might e.g. implement an algorithm. The algorithm itself cannot be patented or otherwise legally protected.


An algorithm is maths, but a lot of code isn't algorithmic. Algorithms provably halt, and most software doesn't halt, let alone provably. Operating systems, browsers, games, etc. are non-algorithmic. It's hard to claim that something like a browser is just math and therefore deserves no IP protections.


An algorithm is a reasoning procedure. A program (e.g. a browser) embodies many algorithms.

I've not come across your stipulation that for a thing to count as an algorithm it must provably halt, but I can go along with that. So I'd argue that in most cases, any function or subroutine provably terminates, even if the program embodying it is not supposed to terminate.

I also don't agree that an algorithm is "just maths". At least, not if you then pivot to saying that a browser isn't "just maths". Any operation performed by a computer is "just maths", because what a CPU does is basically arithmetic and branching.

I don't think it's a question of what does and doesn't "deserve" IP protection. The source code of a browser is clearly an original work, and entitled to protection. But the ideas and procedures it embodies are not "works", and copyright isn't supposed to apply to ideas and procedures.

I'm against the very idea of "intellectual property". It must have seemed a good idea at the time, but I think patents and copyrights have become monsters that inhibit, rather than encourage, innovation and creativity.


> I also don't agree that an algorithm is "just maths". At least, not if you then pivot to saying that a browser isn't "just maths".

Algorithms are distinguished by their proofs of correctness. This elevates them above simple procedures. The halting problem tells us that there is no automatic way to determine whether or not a program terminates. So when we find such a proof, it's like discovering a mathematical law. The proof of an algorithm's correctness is expressed independently of any programming language or platform. What else could they be other than math?

Things like browsers, games, operating systems, e-mail clients, music players etc. are not treated this way. They are not formally specified. They are implemented in the context of a machine and an actual running environment. The source code of the program usually doubles as its specification. It's very different compared to an algorithm.

I agree IP as a concept is bad, but this is the way of the world at least for now. Given where we are, for me it makes sense to draw a line between algorithms and software in the context of copyright.


I'm afraid I still don't agree with you; algorithms existed (and were named) long before The Halting Problem was stated. This "proof of correctness" claim doesn't wash with me.

Can you cite? (Wikipedia's explanation of what an algorithm is doesn't mention that). I'm open to being corrected, but not baldly contradicted.


I remember learning this in my algorithms class many years ago. We used the Cormen et al. Introduction to Algorithms book, which goes a long way toward providing a proof for every algorithm discussed: https://mitpress.mit.edu/books/introduction-algorithms-third...

Wikipedia does mention it though:

  In mathematics and computer science, an algorithm is *a finite sequence* of well-defined, computer-implementable instructions, typically to solve a class of specific problems or to perform a computation.

"Finite sequence", meaning it terminates. The wiki citation (The Definitive Glossary of Higher Mathematical Jargon) provides a little more: "A finite series of well-defined, computer-implementable instructions to solve a specific set of computable problems. It takes a finite amount of initial input(s), processes them unambiguously at each operation, before returning its outputs within a finite amount of time." https://mathvault.ca/math-glossary/#algo

So they stipulate finite input and output and finite runtime. This is in contrast to something like a webserver or OS, which has a potentially unbounded number of inputs, and is expected to run effectively forever.

I mean think about it, what does an OS kernel look like in its most distilled form? It's essentially just an entry point to an infinite loop. Same as games, where they're called "event loops" in that domain.

The way I think about it is this: if computers and programming languages and software didn't exist, algorithms still would. I don't think you can say the same about e.g. Quake. Quake isn't a mathematical truth even though it uses mathematical truths to work, kind of like how engineers use physics to build a bridge, but the bridge itself isn't physics.
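The contrast can be sketched in a few lines of Python (illustrative only; both functions are made up for this example). A textbook algorithm takes finite input, provably halts, and returns a result; the distilled shape of a kernel or game loop is designed never to return:

```python
def primes_up_to(n: int) -> list[int]:
    # A textbook algorithm: finite input, bounded loops, guaranteed to halt.
    is_prime = [True] * (n + 1)
    is_prime[0:2] = [False, False]
    for p in range(2, int(n ** 0.5) + 1):
        if is_prime[p]:
            for m in range(p * p, n + 1, p):
                is_prime[m] = False
    return [i for i, flag in enumerate(is_prime) if flag]

def event_loop(next_event, handle):
    # The distilled shape of an OS kernel or game loop: it is *meant*
    # to run forever, so no termination proof is even desired.
    while True:
        handle(next_event())

print(primes_up_to(20))  # [2, 3, 5, 7, 11, 13, 17, 19]
```

The first function would make sense as pure math even if computers didn't exist; the second only makes sense as a running program in an environment that keeps feeding it events.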


Well, I don't consider an OS or a browser to be the implementation of an algorithm. It has to be possible to understand an algorithm, re-implement it, and so on. I'd say you have to be able to hold it in your head - like Eratosthenes' Sieve.

I don't see anything in the WP article about proofs of correctness. "Finite sequence" surely just means that the number of instructions isn't infinite? Come to think of it, that wording seems rather hand-wavey; I wonder if I can find citations to help me improve it.


> I don't see anything in the WP article about proofs of correctness. "Finite sequence" surely just means that the number of instructions isn't infinite?

Well how are you going to show that the algorithm terminates for all inputs without a proof of correctness?
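To be fair, a termination argument can be much lighter than a full correctness proof: you exhibit a non-negative quantity that strictly decreases on every iteration. The textbook example is Euclid's GCD (a sketch, not tied to anything upthread):

```python
def gcd(a, b):
    # Terminates for all non-negative integer inputs: when b > 0,
    # a % b < b, so b strictly decreases each iteration and is
    # bounded below by 0 -- the loop can only run finitely many times.
    while b:
        a, b = b, a % b
    return a
```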


On top of the sister comment: software patents are generally not a thing in Europe, which I would imagine is the author's area of expertise.


It's not the algorithm that's copyrighted. It's the source code that implements it.


[IANAL]

The thing about the situation is that "copying code you found on the Internet" certainly isn't automatically, always legal. But the fact that you copied X from the Internet doesn't make it illegal either. Your source for the source code you incorporate into a product doesn't matter; what matters is whether that code is copyrighted and what the license terms (if any) are (and people saying "copyright doesn't apply to machines" are wildly misinterpreting things imo).

Given what's come out, it seems plausible that you could coax the source of whatever smallish open source project you wished out of copilot. Claiming copyright on that code wouldn't be legal regardless of Copilot.

Whether Microsoft/Github would be liable is another question as far as I can tell. I mean, youtube-dl can be used to violate copyright, but it isn't liable for those violations. The only way Copilot is different from youtube-dl is that it tells its users everything is OK, and "they told me it was OK" is generally not a legal defense (IE, I don't know for sure, but I'd be shocked if the app shielded its users from liability). All the open source code is certainly "free to look at", and Copilot putting it on a programmer's screen isn't doing more than letting the programmer look at it, until the programmer does something (incorporating it into a released work they claim as their own would be the act).

The question is how easily a programmer could accidentally come up with a large enough piece of a copyrighted work using Copilot. That question seems to be open.

TL;DR: My entirely amateur legal opinion is that Copilot can't violate copyright, but its users certainly can.

