Should GitHub be sued for training Copilot on GPL code?

Longlius · on June 23, 2022

The question is ultimately going to come down to - "Is Copilot the same as a human programmer reading a lot of GPL code and rehashing, in a non-infringing way, the algorithms, functions, and designs used in a lot of FOSS? Or is Copilot performing more of a copy and paste pastiche of code that is protected by intellectual property law?"

On a tangential note, I always find the discussions surrounding FOSS licenses and copyright rather amusing in a sad way. There's a certain kind of entitlement a lot of people feel towards FOSS that they certainly do not express towards proprietary software and I imagine this is a great source of the resentment and burn-out FOSS maintainers feel.

TuringNYC · on June 23, 2022

>> "Is Copilot the same as a human programmer reading a lot of GPL code and rehashing, in a non-infringing way, the algorithms, functions, and designs used in a lot of FOSS? Or is Copilot performing more of a copy and paste pastiche of code that is protected by intellectual property law?"

IANAL, but isn't the concept of "derived data" pretty standard? You don't need to copy data for it to be infringing. I've tackled derived data clauses regularly when negotiating data contracts at work and there is always verbiage and discussion around it (e.g., are we allowed to publish an average of the purchased data?)

pbhjpbhj · on June 23, 2022

An average is not related to artistic aspects of the data and so can't be a derivative in the copyright sense (based on international law: one of the principal conventions is fully titled the "Berne Convention for the Protection of Literary and Artistic Works", because copyright protects literary and artistic works).

Provided you have rights to access a body of statistics, then copyright has nothing to say -- save overreaching national caselaw (!) -- on your derivation of mathematical, technical, or scientific data from that work.

But a contractual clause, in general, doesn't care about copyright; if you've contracted not to derive data from a work then that's orthogonal to copyright.

IANA(IP)L, this is my opinion and unrelated to my employment.

jhugo · on June 23, 2022

> The question is ultimately going to come down to - "Is Copilot the same as a human programmer reading a lot of GPL code and rehashing, in a non-infringing way, the algorithms, functions, and designs used in a lot of FOSS? Or is Copilot performing more of a copy and paste pastiche of code that is protected by intellectual property law?"

Of course it isn't the same as a human programmer doing anything. It's a complex piece of software, which we happen to misuse the term "AI" to describe, but it is not intelligent.

mtkhaos · on June 23, 2022

Technically just advanced indexing of code snippets

vinnymac · on June 23, 2022

Precisely. I don't see a difference between what Google indexes on its search engine, and what CoPilot can recommend. Google has been and still does get slapped on the wrist when they don't respond to take down requests. It seems this is missing from CoPilot currently, and will open them up to a number of lawsuits in the future if it continues to operate as it does.

merpnderp · on June 23, 2022

Except Google isn’t creating anything new, Copilot is. I’ve had it kick out some very interesting short stories based on Sherlock Holmes, so if I published those would they infringe?

jhugo · on June 23, 2022

You have no idea where they came from, or whether they are as new as you assume. Maybe they'd infringe, maybe they wouldn't!

merpnderp · on June 23, 2022

Google returned nothing which contained exact matches of some of the more interesting dialogue. It would be a serious find - worthy of a paper - to disprove that GPT-3 is generating novel text/code.

jhugo · on June 24, 2022

It's easy to prove that it sometimes regurgitates text verbatim — just play with it for a while. Having certainty that any given span of text/code is novel is extraordinarily difficult.

InvertedRhodium · on June 24, 2022

Where does human creativity come from?

nojs · on June 23, 2022

> I don't see a difference between what Google indexes on its search engine, and what CoPilot can recommend.

That is an extremely disingenuous take. It produces novel output, so is not merely an “index” in any sense of the word.

greyface- · on June 23, 2022

Each search results page is novel and unique. Even two different users making the same query will get different results thanks to the "search personalization" Google is doing these days.

dahfizz · on June 23, 2022

Google search doesn't synthesize anything. It collects results and orders them according to an algorithm. Copilot and similar language models can synthesize new text. That's clearly different than just presenting existing text.

belorn · on June 24, 2022

Copilot can't create new novel concepts. In the end it is just a complex mathematical formula that returns a set of code references, with a length determined by the math.

The illusion of creativity is similar to that of technology. Sufficiently advanced technology is indistinguishable from magic, and sufficiently advanced math is indistinguishable from intelligence. The relation between AI and math is the same as the relation between magic and technology.

UncleMeat · on June 23, 2022

All of the large language models can emit text that has never been seen in the training set (unless you go so far to consider each character to be a snippet).

BeefWellington · on June 23, 2022

They can also emit text that is copied verbatim.

Infringement isn't about how the infringing system works, it's about the product of that work.

jhugo · on June 23, 2022

> Infringement isn't about how the infringing system works, it's about the product of that work.

Exactly this. It makes zero difference that you produced your infringing work with the help of a program that happens to be extremely complex and marketed as "AI".

thesz · on June 24, 2022

So do smaller models and I have to note smaller models are better at that.

Paradigma11 · on June 23, 2022

If it gives the same output as a human programmer for the same input, why would it be legally relevant if one system has the intelligent property?

remram · on June 23, 2022

But that's the whole point of copyright. The same piece of code you copy from a Google search can legally be used by you if your developer came up with it, and not if Oracle came up with it. Where you copied it from is the entire point.

jazzyjackson · on June 23, 2022

of course it's not intelligent, but we still have to decide how the law applies to the actions of software, or otherwise re-frame the whole thing to include the co-pilot developers doing the copyright infringement when they trained the model - not the current discussion which gives agency to the IDE plug-in "choosing" a code snippet to paste.

jhugo · on June 23, 2022

I don't see how the way the software was built is particularly relevant.

It's just a tool used by the developer; the onus is on the developer to ensure they don't infringe the licenses of the source code they incorporate in their software. Since Copilot makes it impossible to know where it's barfing code up from and what license that code is under, a developer who cares about not getting sued probably needs to avoid using Copilot.

jazzyjackson · on June 23, 2022

Eh, the law is about making a copy. If my IDE plug-in fills in some code, the question is, did I copy the code, did the robot copy the code, or did the developers that wrote "cp github.db ~/trainingset" copy the code?

ycombobreaker · on June 23, 2022

The authors of the tool created something that can be used for copyright infringement.

The tool itself lacks agency, it did what it was programmed to do.

If you took the tool's suggestions and proceeded to publish a derivative work, you may have infringed.

This really doesn't feel any different from P2P filesharing services. Rightsholders have targeted tool publishers in the past, because they are the largest single target and not anonymous; but ultimately the infringement is performed by the end user.

jhugo · on June 24, 2022

This isn't complicated at all. You copied the code, which isn't an issue until you then go on to do something which infringes the license (e.g. publish under a different license, publish binaries without publishing source, publish without attribution, whatever it is that the license requires).

LeifCarrotson · on June 23, 2022

A law that uses the archaic terms "copy and paste", referring to a time when people would make an analog photocopy of a document written using a typewriter, trim it out with scissors or a knife, and glue it to their book with the pasty remains from boiling animal collagen cannot be trusted to apply word-for-word in a time when technology has obsoleted the glue, typewriter, xerox machine, and even the paper.

It is not the same as a human, no, but it's not hard to choose a definition of the word "intelligent" that can accurately describe something that can be done by a program.

When a human walks around a puddle, are they demonstrating intelligence? When a horse avoids stepping in a hole, is the horse intelligent? When a robotic vacuum avoids a stairway, is it intelligent? When a self-driving car avoids a bollard, is that intelligent?

Whether there's a being inside the device that believes it experiences consciousness or not, the same outcome happens. A Searle's Chinese Room that produces copies of Chinese IP, a trained monkey that does so, or a human that does the same thing, the outcome is very similar.

bencollier49 · on June 23, 2022

Perhaps it's a little bit like employing a human programmer with an eidetic memory who occasionally remembers entire largish functions.

If he were able to remember a large enough piece of copyrighted code, and reused it, then it still wouldn't be fair use, even if he changed a variable name here or there, or the license message.

Longlius · on June 23, 2022

Yeah, that's definitely the impression I get from the few Copilot examples I've seen. I've not personally used Copilot so I refrained from making absolute statements about its behavior in my top comment.

But I think the conclusion most people are settling on is that it's definitely infringing.

jka · on June 23, 2022

A possible response that I'd predict from GitHub would be to attribute much/all of the responsibility to the user.

The argument would be along the lines of: you as the user are the one who asked the eidetic programmer (nice terminology, @bencollier49) to produce code for your project; all we did is make the programmer available to you.

WithinReason · on June 24, 2022

Relevant parts from the Copilot FAQ (https://github.com/features/copilot/):

Does GitHub own the code generated by GitHub Copilot?

GitHub Copilot is a tool, like a compiler or a pen. GitHub does not own the suggestions GitHub Copilot generates. The code you write with GitHub Copilot’s help belongs to you, and you are responsible for it. We recommend that you carefully test, review, and vet the code before pushing it to production, as you would with any code you write that incorporates material you did not independently originate.

Does GitHub Copilot recite code from the training set?

The vast majority of the code that GitHub Copilot suggests has never been seen before. Our latest internal research shows that about 1% of the time, a suggestion may contain some code snippets longer than ~150 characters that matches the training set. Previous research showed that many of these cases happen when GitHub Copilot is unable to glean sufficient context from the code you are writing, or when there is a common, perhaps even universal, solution to the problem.
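To make the FAQ's measure concrete, here is a minimal Python sketch of what "a snippet longer than ~150 characters that matches the training set" could mean in practice. It is not GitHub's method. The function names, the whitespace normalization, and the idea of scanning a small in-memory corpus are all assumptions for illustration; only the ~150-character figure comes from the FAQ text quoted above.

```python
from difflib import SequenceMatcher
from typing import Iterable

THRESHOLD = 150  # characters; the figure quoted in the FAQ


def normalize(code: str) -> str:
    """Collapse all runs of whitespace so reindented copies still match."""
    return " ".join(code.split())


def longest_verbatim_overlap(suggestion: str, corpus: Iterable[str]) -> int:
    """Length of the longest substring shared by the suggestion and any corpus file."""
    s = normalize(suggestion)
    best = 0
    for doc in corpus:
        d = normalize(doc)
        matcher = SequenceMatcher(None, s, d, autojunk=False)
        match = matcher.find_longest_match(0, len(s), 0, len(d))
        best = max(best, match.size)
    return best


def looks_recited(suggestion: str, corpus: Iterable[str],
                  threshold: int = THRESHOLD) -> bool:
    """Hypothetical flag: True when the shared run exceeds the threshold."""
    return longest_verbatim_overlap(suggestion, corpus) > threshold
```

A check like this only finds exact character runs. Renaming a variable in the middle of a copied function splits the run in two, so a low match rate under this measure says little about near-verbatim copies.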

pmarreck · on June 23, 2022

I've used Copilot for months and honestly it's become one of my most favorite inventions in all of programming- and this is key- even when it screws up (such as by suggesting Ruby-syntax code to autocomplete Elixir code). It tickles the "childlike joy" funnybone in me, the same one that got me into programming to begin with. I don't know how long it will take for typing "#ANSI yellow" (for example) and autocompleting to the right codes to get old, or every time it autocompletes anything considered "boilerplate," but it hasn't, yet!

You know, pretty much all of programming can be summed up as "tedious labor elimination," and this tool directs that same labor elimination at the work of programming itself (I no longer have to constantly google syntax idiosyncrasies etc.), and NOW coders are pissed? I don't get it. Eat your own dog food, people, because this is what it looks and tastes like.

As to the copyright infringement or licensing-violation claims, I have yet to see it autocomplete an entire algorithm correctly, or one copied verbatim from somewhere, although that could be mitigated. You still have to pay attention (kind of like Tesla autopilot), it's not going to eliminate your job.

Longlius · on June 23, 2022

No one is complaining about copilot making programming easier or automating it.

We're upset because it's quite literally infringing on intellectual property. Infringing on intellectual property that's been set aside for the exclusive use of the commons.

jazzyjackson · on June 23, 2022

god bless AI for moving human society beyond silly notions like ideas-as-property

copyright was established to increase the innovation and creative will of the arts and sciences, what could increase that creative force more than an AI assistant who has seen every creative work ever made?

account42 · on June 24, 2022

I'd be all for returning ALL code to the commons.

Except that is not what is happening here. The problem is that code which was provided to the commons under the explicit condition that anything built with it is also released under the same terms is now being fed to a magic mystery machine to produce code that can supposedly legally be withheld from the commons. The only code that this affects is code that was already shared - you won't see Microsoft feeding Windows and Office source code into Copilot anytime soon.

pmarreck · on June 23, 2022

Do you use it? Have you ever used it? How many people making negative comments about it here have actually used it? I don't actually believe many have. I suggest at least trying it out before lighting your torches.

If it infringes everyone equally and everyone equally benefits from the infringement, has a net wrong actually occurred? (which of course begs the "do the ends justify the means" question...)

I don't see how this is any different a form of "infringement" than me copying and pasting snippets of other peoples' code, and then modifying it to suit my particular context, without specific attribution, except that the latter is a much more laborious and time-consuming process than copilot autocomplete, and programming is all about tedium elimination

ryukafalz · on June 23, 2022

> If it infringes everyone equally and everyone equally benefits from the infringement, has a net wrong actually occurred?

It’s not done equally though. Copyleft code is extremely likely to be on GitHub somewhere, while internal proprietary code is often not. Copilot will thus have been trained more on the former than the latter.

> I don't see how this is any different a form of "infringement" than me copying and pasting snippets of other peoples' code, and then modifying it to suit my particular context, without specific attribution

It’s no different, but that is also copyright infringement.

pmarreck · on June 23, 2022

> but that is also copyright infringement.

so basically all of Stackoverflow is copyright infringement and has been for decades? Find me the programmer who has never either 1) copied and pasted directly from the internet, or 2) taken an idea found on the internet and massaged it for their own purposes. I mean... this is basically why programming is so lucrative IMHO. Everyone is piggybacking off of everyone else's work (at least in open source)

zvr · on June 24, 2022

The tens of thousands of developers in a company I am familiar with have taken a basic training on intellectual property concepts and software licenses.

A typical case mentioned in the training is that code from StackOverflow is (probably) licensed under CC-BY-SA 4.0 and as such it can never be copied inside their proprietary-licensed code base.

ryukafalz · on June 24, 2022

This is something I wish more companies would do. It’s sorely needed.

ryukafalz · on June 23, 2022

(Recent) StackOverflow contributions are licensed under CC BY-SA 4.0 by default (though the author can of course release it under any additional licenses they choose): https://stackoverflow.com/help/licensing

If the code is really sufficiently trivial (and I’d guess that most code samples you’ll find on StackOverflow are) you may have a fair use argument in the US. Generally speaking though (and especially for anything nontrivial) you need to respect the license. CC BY-SA 4.0 is one-way compatible with GPLv3, though, so that helps if you’re including it in a GPLv3 codebase: https://creativecommons.org/2015/10/08/cc-by-sa-4-0-now-one-...

pmarreck · on June 23, 2022

Define "sufficiently trivial"

ryukafalz · on June 23, 2022

It's fuzzy and imprecise, as many legal concepts are. Small/unoriginal enough snippets may not even be copyrightable:

https://en.wikipedia.org/wiki/Threshold_of_originality

Then even if it is copyrightable, under some circumstances your use of it may be considered fair use anyway:

https://en.wikipedia.org/wiki/Fair_use#U.S._fair_use_factors

Or potentially de minimis:

https://blogs.library.unt.edu/copyright/2017/09/05/the-de-mi...

But when in doubt, ask for permission or ask a lawyer.

account42 · on June 24, 2022

Even apart from the copyright aspect, it would be nice if we as programmers would improve our attitude towards attribution. If researchers can cite the work that has influenced theirs without legal threats then so can we.

belorn · on June 24, 2022

Github explicitly leaves out proprietary code bases, including the Microsoft Windows source code (Microsoft owns Github and uses it for their own products).

If Microsoft included their own source code when training copilot then at least they would be intellectually honest, but they don't. They only consider GPL and other free and open source code to be up for grabs.

dal · on June 23, 2022

This kind of reminds me of when someone reverse engineers a piece of software to document interfaces, protocols or APIs for the purpose of writing compatible software. Then a second person not involved in the RE process implements compatible software from the documentation the first person wrote.

This is to avoid any contamination and verbatim copies of code. Once you have read a piece of code there is a risk of "contamination" and you will be influenced by it. It does not matter if you directly copy it, write it out from memory or use an AI to regurgitate it. It will be a copy of the code. To me this is very clear.

dayjah · on June 23, 2022

This sounds like “taint” in the M&A space. I’ve very limited experience of it and would be interested in hearing more from the better informed folks on this topic!

My limited experience: my then-employer opted not to acquire a company after doing due diligence. Ultimately we decided that the price of acquisition (both paid out, and also incurred in internal time) was above the cost of building a comparable product ourselves.

As the dev who did the tech portion of the due diligence I was now “tainted” by my knowledge of their system. As a result I could not work directly on the effort to build our own comparable solution.

account42 · on June 24, 2022

Another example is Wine: Anyone who has seen the Windows source code is not allowed to contribute [0]

[0] https://wiki.winehq.org/Developer_FAQ#Who_can.27t_contribute...

jeroenhd · on June 23, 2022

A human who types out the fast inverse square root algorithm line by line won't be exempt from copyright/license infringement just because they remembered it off the top of their head. However, using the same concepts is likely to be fine outside silly jurisdictions where software patents are a thing.

The difference is that AI isn't able to grasp concepts, it's only capable of rehashing patterns. If it is able to understand concepts then it should be shut down and researched immediately, because it's either close to gaining consciousness or already has done so.

The core of copilot is a file or a block of memory laying out a bunch of floating points that get processed and turned into code. This arrangement of floats is derived from source code, with licenses and copyright notices.

I don't think it's any different from turning code into a compiled program. Any developer will understand that a compiled version of GPL code is a derived work and subject to the GPL license. Why would a compiler that turns code into floats be any different? Sure, those floats get mixed up with the floats from other source code, but linking to GPL'd code does something very similar and is also covered by the license.

It's possible to consider copilot similar to hashing: a SHA hash of a binary isn't subject to the binary's license, that'd be silly. However, hashes are inherently one-way, and copilot isn't.
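A minimal Python sketch of why the hash comparison breaks down. The helper name `fingerprint` and the sample input are made up for illustration; `hashlib.sha256` is the standard library call.

```python
import hashlib


def fingerprint(data: bytes) -> bytes:
    """Return the SHA-256 digest of the input (always 32 bytes)."""
    return hashlib.sha256(data).digest()


# A made-up stand-in for a large licensed source file.
source = b"int main(void) { return 0; }\n" * 10_000
digest = fingerprint(source)

# The digest has a fixed size no matter how large the input is, so it
# cannot carry enough information to reproduce the input on demand.
print(len(source), len(digest))
```

A digest is 32 bytes regardless of input size, so there is no way to ask it for the original text back. A model that can emit long runs of its training data on request clearly retains far more than that, which is the asymmetry the comparison above points at.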

A question I'd like to ask Microsoft is "if I steal the Windows source code and train an AI on it, can that AI be freely distributed and used for Wine/ReactOS/etc?" If Microsoft sticks to the stance that AI isn't subject to the licenses on software then a leaked source AI should be fine, but if they want to protect their intellectual property then they will send cease and desist letters to anyone even thinking about using such an AI model for code completion. My expectation is that Microsoft will act against such an AI.

Regardless, the fact that Github did not ask permission or provide an opt out before training started is a huge middle finger to all open source developers. Even if they can get away with this stuff legally, this approach has surely offended many open source developers who want big tech companies to abide by their code licenses. I don't do much open source work myself but I've been offended by the whole process from the day copilot rolled out and I don't believe I'm alone in this.

tzs · on June 23, 2022

> A human who types out the fast inverse square root algorithm line by line won't be exempt from copyright/license infringement just because they remembered it off the top of their head.

A human would probably try to defend against a copyright infringement suit over that by arguing something like the following.

There isn't sufficient creative expression in fast inverse square root (FISR) to be copyrightable. There is plenty of creativity in that thing, but it is in things that are not copyrightable such as the underlying mathematics that it is using. Copyright covers expression of ideas, not use of ideas (that's patents) or the ideas themselves.

The expression in FISR that they probably are copying from is pretty much all just in choosing the names of variables, and most implementations I've seen just use pretty normal names that follow normal naming conventions that people use when they aren't putting any thought into naming their variables.

That level of expression is arguably not creative enough to support copyright, at least in the US after Feist Publications, Inc., v. Rural Telephone Service Co., 499 U.S. 340 (1991) [1].

(I'm assuming that the human didn't do anything stupid, like reproduce the comments too).

[1] https://en.wikipedia.org/wiki/Feist_Publications,_Inc.,_v._R....
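To illustrate the idea/expression split being argued here, this is a rough Python re-expression of the technique. Nothing in it is copied from the original C: the function and variable names are my own, and the only shared element is the widely documented magic constant 0x5F3759DF plus the underlying math (a bit-level first guess followed by one Newton-Raphson step). Treat it as a sketch for positive finite inputs, not a reference implementation.

```python
import struct


def fast_inverse_sqrt(x: float) -> float:
    """Approximate 1/sqrt(x) for positive x using the bit-level trick."""
    # Reinterpret the 32-bit float's bit pattern as an unsigned integer.
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    # Subtracting half the bit pattern from the magic constant yields a
    # rough first guess at the bit pattern of 1/sqrt(x).
    bits = 0x5F3759DF - (bits >> 1)
    (guess,) = struct.unpack("<f", struct.pack("<I", bits))
    # One Newton-Raphson step refines the guess.
    return guess * (1.5 - 0.5 * x * guess * guess)
```

The same idea survives a change of language, names, and layout, which is roughly the point of the argument above: what is left to copy verbatim is mostly naming and comments.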

jeroenhd · on June 23, 2022

I think the FISR is one of the few algorithms that I would actually consider creative enough to meet the creativity requirement. It's counterintuitive math that I would think the vast majority of programmers would never be able to come up with. It's an elegant bit twiddling algorithm that requires one or two blog posts to truly understand, it's not something you read and think "oh, that makes sense, moving on".

Algorithms for generic mathematical operations such as the dot product or matrix multiplication are often trivial to deduce, though optimized, vectorized versions perhaps less so. Most helper functions are unoriginal enough that no reasonable copyright law would protect them, which is also the case for (too) many cases of patented code.

The copyright question does ignore the code license question, though. If a complicated algorithm like FISR is not original enough then what protects any boring old operating system code? What stands in the way of publicly hosting Microsoft's leaked sources, as clearly the code is all quite trivial? There is very little in an operating system that other operating system developers haven't thought of or would reasonably have come up with had they been constrained to the same restrictions.

The variable names are one thing, though they could be chosen much more descriptively. However, the system also output the comment "// what the fuck?" which is not only terribly nondescriptive, it's also something that the system couldn't have come up with if it had learned from code in any practical form.

The suit you linked is about the difference between information and creativity. However, the case concerns a data set, something simply factual, rather than a composed piece of information such as code or a book. Code listed on Github is not similar to the listings in a phone book. If it were, all software copyright, proprietary or otherwise, would go down the drain. I think that's impractical to say the least.

OkayPhysicist · on June 23, 2022

Algorithms are patentable, not copyrightable.

FISR could have been patented (and would be in the public domain by now anyway), but only its specific implementation in Quake III Arena is covered by copyright.

Also, your argument follows a composition fallacy: emergent properties exist, and thus you cannot simply say that because each individual piece of a whole is trivial, the whole is trivial. Heck, software pretty much by definition goes against that. For relevant precedent, there is no shortage of information that becomes classified when in aggregate. Knowing where a certain piece of infrastructure is isn't likely classified, but knowing where all the strategically important pieces of infrastructure are certainly is.

Which is why the question isn't whether the users of Copilot are infringing someone's GPL (they'd likely have a solid defense based on the individual piece not being sufficient to hold copyright protection), it's whether Copilot itself constitutes a derivative work of its input data, which it consumed as whole (copyrighted) works.

ImprobableTruth · on June 23, 2022

I'm curious as to what distinction you draw between rehashing patterns and grasping a concept.

jeroenhd · on June 23, 2022

That's a philosophical question that nobody can know a definitive answer to.

Personally I'd say the difference is understanding why a certain pattern works rather than blindly inserting whatever works. It's the classic Chinese Room thought experiment.

monkeybutton · on June 23, 2022

Just reading certain code is enough to taint a human programmer though. Some companies have policies against hiring developers with experience on some OSS projects because they have their own clean room implementation they want to protect.

rag-hav · on June 23, 2022

> Some companies have policies against hiring developers with experience on some OSS projects

Can you please elaborate on this?

monkeybutton · on June 23, 2022

They're basically following this process to build their products: https://en.m.wikipedia.org/wiki/Clean_room_design

Sateeshm · on June 23, 2022

Season 1 of halt and catch fire

usrn · on June 23, 2022

Right. I don't see any way this is legal.

Zedmor · on June 23, 2022

Never heard of a single one. I bet you just invented it.

homarp · on June 23, 2022

take Windows NT source, train your local version, deploy it on the internet, advertise that it does Windows code completion

wait for Microsoft's lawyers to get the answer to your original question

vslira · on June 23, 2022

On your tangential note: I always assumed many in the FLOSS side are actually against most cases of copyright as applied to software, but since it is the regulatory standard, they put a strong emphasis on making it work for their purposes, thus the somewhat ironic “copyleft”. It’s a “don’t hate the player hate the game” situation for them

Longlius · on June 23, 2022

This is definitely the orthodox take. If shared source code was the norm and software wasn't subject to copyright (or really if either of those two conditions were met), there'd be no need for FOSS as an ideology. The purpose of copyleft is to ensure that there's a permanent bulwark against code meant for the commons being co-opted by proprietary software vendors and having changes walled off from the community who created the software in the first place.

pabs3 · on June 23, 2022

Source code is essential to FOSS, a public domain binary-only copy of Microsoft Windows definitely would not be FOSS. This is the second item of the open source definition.

https://opensource.org/osd

account42 · on June 24, 2022

Sure, that is a useful condition and is a no-brainer to add if you need to leverage copyright anyway.

But would it be enough to spur the open source movement on its own if you could legally decompile all binaries and redistribute that? Probably not.

It's not like source vs. binary is a clear distinction - between code obfuscation, generated code, transpilation, etc. there is a lot of wiggle room in what should or should not be OK.

pabs3 · on June 24, 2022

The GPL makes it a pretty clear distinction, "preferred form for modification" is pretty clear, but decided on a case-by-case basis. Obfuscated code is not source, generated code is not source, transpilation is often not source but could be depending on how you use it afterwards, bitmap images are often not source but they can be, executables are usually not source but could be, videos are not source but could be. Some links discussing what source is here:

https://www.inventati.org/frx/essays/softfrdm/whatissource.h... https://b.mtjm.eu/source-code-data-fonts-free-distros.html https://wiki.freedesktop.org/www/Games/Upstream/#source https://compliance.guide/pristine https://opengameart.org/forumtopic/source-required-for-art-l... https://wiki.debian.org/rly-free-software

pabs3 · on June 23, 2022

People on the FLOSS side are for software freedom, copyleft is just one of the tools we can use within the current regulatory framework of copyright. If copyright ever went away, we would have to use different tools but would have different opportunities too.

toyg · on June 23, 2022

Those people these days are a vanishing minority. This is not the early-00s anymore.

The reality is that, nowadays, the overwhelming majority of developers touches FOSS code every day and just assumes they're entitled to use it as they see fit. The folks that came up with "copyleft" or care about licenses, are very much not in the driving seat. Blame FAANGs and their hatred for GPL.

DarkWiiPlayer · on June 23, 2022

I think the problem goes a bit deeper than that. From an IP perspective, I think it's reasonable to consider that training an AI on some form of work is using said work to build a new one, just like it would be if it was manually copied in or reproduced.

The problem is that, iirc, GPL didn't consider this at all and still uses language focused on copying code, so something like copilot might slip through the cracks of those definitions.

Then again, the license uses this language when it allows usage of the code in the first place, so one could say that either a) this usage is covered by the license, in which case all conditions apply, or b) it is not covered by the license, in which case... github wouldn't be allowed to use the code at all.

To give an analogy: I think feeding code into an AI is essentially analogous to compiling the code. A machine turns it into something more usable and the original human-written content isn't part of the result anymore, but the intellectual property gets dragged through the process nonetheless. Why would it be any different just because the mechanism of transforming the code into executable software gets a bit more complicated through the usage of AI?

dwild · on June 23, 2022

> Is Copilot the same as a human programmer reading a lot of GPL code and rehashing, in a non-infringing way, the algorithms, functions, and designs used in a lot of FOSS?

It literally can't do it "in a non-infringing way", as it wasn't made to do so.

People were able to get copy-pasted code verbatim. It means it does not know whether what it does infringes on the GPL or not.

Let's say you find a human who never knew anything about copyright, you show him a bunch of Disney movies, you ask him to make you a movie, and he literally copies one of their movies. Does that make it non-infringing? (Funny thing is, even people aware of copyright do infringe it... so yeah, hard to say even a machine could make non-infringing content.)

The solution would be to at least make him aware of copyright and work with that, but first, is it even possible, and second, is it even enough...

Sadly nothing will ever be done, at least not until we feed it Disney movies and it starts to affect their bottom line.

TrustInCopilot · on June 23, 2022

> On a tangential note, I always find the discussions surrounding FOSS licenses and copyright rather amusing in a sad way. There's a certain kind of entitlement a lot of people feel towards FOSS that they certainly do not express towards proprietary software and I imagine this a great source of the resentment and burn-out FOSS maintainers feel.

Definitely. Many of my acquaintances complaining about Github Copilot without trying it themselves regularly pirate movies, shows and music. They also always cheer if there is some court ruling against Facebook or Google, no matter what the actual case is even about.

> The question is ultimately going to come down to - "Is Copilot the same as a human programmer reading a lot of GPL code and rehashing, in a non-infringing way, the algorithms, functions, and designs used in a lot of FOSS? Or is Copilot performing more of a copy and paste pastiche of code that is protected by intellectual property law?"

It seems to me that the regurgitation only happens if you post the first half of the code, expecting the second half. I imagine that the software sees how several hundred repositories (which are all forks) have a very similar pattern and tells you the best fitting approximation of how they continue, which is again very similar.

In the future I can definitely see Github updating their license and some kind of exodus by FOSSers towards GitLab. But I believe that many open source projects will just put up with it, similar to how Youtubers and Twitch streamers want to stay on the premier platform.

paulgb · on June 23, 2022

An interesting (though not conclusive) test of whether Microsoft is confident that Copilot is not copyright-infringing is whether they’d be willing to release a model trained only on proprietary Microsoft code.

remram · on June 23, 2022

I was wondering about something similar. Can we get Copilot to come up with some non-free algorithm under copyright from another big company (with lawyers), e.g. Oracle, Microsoft, Google, etc? This is a little difficult because it would need to be non-free but public, be specific enough that we can recognize it, and I think Copilot has protections against outputting code verbatim (but it could be made to output code with variable names changed or similar).

That is probably the way to kickstart a legal discussion about Copilot.

jarbus · on June 23, 2022

I really like this idea

chrismorgan · on June 23, 2022

People keep on talking about the GPL in these cases, but there’s absolutely nothing whatsoever special about the GPL: any code that is not public domain (or under a public-domain-equivalent license) is equally affected. Any mention of the GPL is a red herring.

Copilot is completely depending on the legal theory of being effectively exempt from copyright, under fair use doctrine; if that legal theory falls apart, the entire space (and a lot of other machine learning stuff) is utterly doomed.

Will it, won’t it, should it, shouldn’t it? Dunno.

(And when people say that it should just say what license the code it generates is under and what attribution or similar is required: Copilot can’t tell whether it’s reproducing copyrightable chunks of code, or indeed where what it produces came from, by the very nature of machine learning techniques. The whole verbatim reproduction issue demonstrates this—they’re trying to avoid such reproductions, which a cynic might say is because it weakens their fair use claim, but it’s not easy to do.)

janetacarr · on June 23, 2022

The GPL license prohibits building a competitive solution from the licensed software, and 'derivative works' also require the GPL license.

So I think it is relevant here because there's a gray area around whether training a model is like linking to GPL-licensed software (not derivative, with caveats) or deriving from it.

By the way, Free and Open-source software licenses are not public domain (or 'public-domain-equivalent'). The copyright holder of the software licenses it to whomever, but the holder still retains their copyright.

chrismorgan · on June 23, 2022

You’re missing the point: Copilot depends on being exempt from copyright restrictions, so that the license, any license, is irrelevant. Simplified, copyright law says “the creator/rights-holder owns the thing and you can’t do anything with it unless they let you (grant you license), or one of these general conditions holds”, and Copilot is not using your code under rights-holder permission, but under the general condition of “fair use”.

If the fair use doctrine fails and the license is relevant, there’s still nothing special about GPL, because almost all licenses would be being violated in some way (most commonly starting with attribution requirements). In this situation, Copilot will certainly be discontinued immediately.

janetacarr · on June 23, 2022

I see what you mean. It'll probably come down to how the AI is trained then because Copilot reproducing entire code blocks verbatim from the source material would not meet the criteria for Fair Use.

jakear · on June 23, 2022

> ‘derivative works’ require the GPL license

This keeps coming up, but if you look at the text of the GPL the word ‘derivative’ literally never appears. GPL in fact explicitly exempts code that is accessed over a web service from needing to be shared, as is the case with copilot.

GoblinSlayer · on June 23, 2022

It has the same concept:

To "modify" a work means to copy from or adapt all or part of the work in a fashion requiring copyright permission, other than the making of an exact copy. The resulting work is called a "modified version" of the earlier work or a work "based on" the earlier work.

A "covered work" means either the unmodified Program or a work based on the Program.

jakear · on June 23, 2022

Copyright law holds that derivative works don’t require copyright permission. So again, copilot is in the clear. Copilot users may not be, but that’s up to them to determine at time of commit.

blihp · on June 23, 2022

Derivative works don't need to be referenced in the GPL as it's a concept from copyright law. See https://en.wikipedia.org/wiki/Derivative_work

jakear · on June 23, 2022

That just further supports the claim that copilot is in the clear - it is clearly a separate work with many underlying works.

Of course the code copilot generates may violate GPL, but that’s up to the tool’s wielder to determine. Just as it is when searching for code on the internet, consulting books, recalling past knowledge, etc.

I don’t even use copilot (I had early access and discovered programming languages are better for unambiguously encoding logic than English, go figure). I’m just sick of all these supposed craftsmen blaming their tools rather than holding themselves accountable for what they commit.

blihp · on June 23, 2022

Copyright law doesn't work that way. I'd bet this is going to be litigated to a final decision before anyone can say with certainty if/when a NN is violating copyright (i.e. via the training data) or not.

GoblinSlayer · on June 23, 2022

Copilot is not a competing solution, it's a knowledge base about text, like an encyclopedia. As for the snippets it produces, those might be copyrightable if they pass the copyrightability threshold. If it provides you kilobytes of text at once, that would be bad. A middle ground would be Copilot tracking how much code under incompatible licenses it has pasted and stopping at, say, 200 LOC.
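A rough sketch of what such a cap could look like, purely as an illustration. `SnippetBudget` and `try_accept` are hypothetical names, and the sketch assumes the tool can already tell whether a suggestion comes from incompatibly licensed code, which other comments in this thread dispute.

```python
from dataclasses import dataclass


@dataclass
class SnippetBudget:
    """Hypothetical cap on lines accepted from incompatibly licensed sources."""

    limit: int = 200  # the "say, 200 LOC" figure from the comment
    used: int = 0

    def try_accept(self, snippet: str, incompatible: bool) -> bool:
        # Suggestions with no license conflict never count against the budget.
        if not incompatible:
            return True
        # Count only non-blank lines as "lines of code".
        lines = sum(1 for line in snippet.splitlines() if line.strip())
        if self.used + lines > self.limit:
            return False  # budget exhausted: refuse the suggestion
        self.used += lines
        return True
```

The hard part is the `incompatible` flag, not the counting: the budget is only as good as the attribution feeding it.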

belorn · on June 23, 2022

GitHub wants to treat GPL as special in this case. They chose not to use proprietary code for training Copilot, for the obvious reason of getting sued by companies that use GitHub, and instead bet that using GPL and other FLOSS licenses won't cost them more than what they earn by developing Copilot.

Copilot thus depends completely on this economic bet.

It is for those reasons that we could not write a "Cosinger" or "Comusician" that is trained on music found on YouTube. It would be sued into oblivion the first time any 2-3 notes could be linked to a specific copyrighted song. If Copilot survives long term we might see a similar project trained on Creative Commons music, including CC-NC no-derivs, but music labels might own a few of those and their guns would be quite large.

toteno · on June 23, 2022

> Copilot is completely depending on the legal theory of being effectively exempt from copyright, under fair use doctrine; if that legal theory falls apart, the entire space (and a lot of other machine learning stuff) is utterly doomed.

That really depends on the country.

For example, Japan has a law[0] that allows usage of any copyrighted materials for machine learning and other data analysis. You can also do it for commercial purposes. There are some limitations (you can't share the dataset itself, but you can share the model), but overall it sounds good.

[0] https://storialaw.jp/en/service/bigdata/bigdata-12

gls2ro · on June 23, 2022

IANAL but as far as I see it, the case of Copilot could be described as an ML model that will sometimes output parts of the training dataset itself.

dcdc123 · on June 23, 2022

Did you know that the first of the two configuration options for Copilot is an option to allow or disallow it to suggest samples that match public code?

https://imgur.com/D2DDuY8

chrismorgan · on June 23, 2022

I did not. That is a curious option to place, given how it would seem to weaken the fair use argument (since they’re providing a way of consciously allowing probable copyright infringement, rather than just treating the issue as inherent in the nature of learning, or a temporary bug that they’re steadily fixing). But still, to defend my earlier parenthetical claim that Copilot can’t tell whether it’s reproducing copyrightable chunks of code: this option is consistent with what I intended to convey in the paragraph as a whole, since they are developing extra bits around the edges to mitigate the issue, but it’s not possible for them to do it completely—it’s more like a game of whack-a-mole, fixing this class of undesirable reproduction here, that instance there, without causing too much damage to the perceived-legitimate output.

FireBeyond · on June 23, 2022

> since they’re providing a way of consciously allowing probable copyright infringement

I think the claim that it's "probable copyright infringement" is nowhere near proven.

GitHub likely gave that option to satisfy users' lawyers who might have a higher threshold for "clean room" implementations or "no open source". Not as any kind of implication of copyright infringement.

Fair use works as an argument to the usability of Copilot.

Derivative works, per US Copyright law, are not infringement, either.

chrismorgan · on June 23, 2022

I’m talking in the context of verbatim reproduction, which “suggestions matching public code” sounds an awful lot like. At that point, I think “probable copyright infringement” is a fair description.

FireBeyond · on June 23, 2022

Your work as a whole may be innovative.

Is 'if err != nil {' your original work? Or is it 'commonly accepted knowledge' as a Go programmer?

wzdd · on June 23, 2022

It looks like that filter does an exact comparison ignoring whitespace. To be more effective it would need to ignore things like variable renames and trivial transformations (for(;;) becoming while(true) or whatever).

In other words we're getting into cheat-detection software territory, which sounds difficult to get right in general.
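To make the gap concrete, here is a minimal sketch contrasting a whitespace-only comparison with one that also ignores consistent identifier renames. It is an illustration only, not how Copilot's filter is implemented; the tiny keyword set and both function names are assumptions, and it does nothing about structural rewrites such as for(;;) versus while(true).

```python
import re

# Small illustrative keyword set; a real tool would use the language's grammar.
KEYWORDS = {"if", "else", "for", "while", "return", "break", "true", "false"}

IDENT = re.compile(r"[A-Za-z_]\w*")
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def strip_whitespace(code):
    """Comparison key that ignores whitespace only."""
    return "".join(code.split())


def rename_invariant(code):
    """Comparison key that also ignores consistent identifier renames."""
    names = {}
    out = []
    for tok in TOKEN.findall(code):
        if IDENT.fullmatch(tok) and tok not in KEYWORDS:
            # Replace each distinct identifier with a positional placeholder.
            tok = names.setdefault(tok, f"id{len(names)}")
        out.append(tok)
    return " ".join(out)


a = "if (!card.IsFaceUp) { Flip(card); }"
b = "if (!c.IsFaceUp)   { Flip(c); }"

# The whitespace-only key sees two different snippets...
print(strip_whitespace(a) == strip_whitespace(b))
# ...while the rename-invariant key treats them as the same code.
print(rename_invariant(a) == rename_invariant(b))
```

Even this toy version starts flagging unrelated code that merely shares a shape, which is the false-positive problem plagiarism detectors have to manage.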

anonymoushn · on June 23, 2022

It seems like the configuration option offered should be "allow/disallow other users to copy my code without attribution" rather than "allow/disallow me to copy other users' code without attribution"

chii · on June 23, 2022

i'm not buying the argument that copilot is infringing copyright. If someone learnt how to program via reading open source projects, you don't get to claim that their future work is derivative.

markdeloura · on June 23, 2022

CoPilot didn't "learn how to program", it is reproducing blocks of code from other projects, some of which explicitly state that using their code requires attribution or other forms of acknowledgement. It is facilitating infringement of their licenses.

omgwtfbyobbq · on June 23, 2022

The line-by-line nature of CoPilot may make that difficult to establish.

w4ffl35 · on June 23, 2022

You can open a sidebar (in both Rider and VSCode) which displays full blocks of code

omgwtfbyobbq · on June 24, 2022

That's a good point, but until it drops in whole blocks, I think the liability might lie more with the user. Kinda like all of Tesla's non-alpha driver assistance features where to use them the user has to opt-in and agree they will maintain full control of the car even when using these features.

FireBeyond · on June 23, 2022

> some of which explicitly state that using their code requires attribution or other forms of acknowledgement

US Copyright law states that fair use and derivative work are not infringing - and said law supersedes licensing.

b3morales · on June 23, 2022

> US Copyright law states that […] derivative work are not infringing

It says no such thing:

https://www.copyright.gov/title17/92chap1.html#106

w4ffl35 · on June 23, 2022

This is only partially correct.

Since I have two free complimentary months I decided to sign up even though I'm not super thrilled with it (see previous comments). I was given two options:

1. allow code from public repositories

2. allow copilot to learn from my code

I disabled both of these options. Presumably I am now using an AI model which learns and suggests based on the context of my project.

Engineering-MD · on June 23, 2022

It would be interesting to start verbatim copying some open source GitHub projects with these settings disabled and see if it magically knows what comes next (i.e. whether it does have prior knowledge of published code even with this turned off)

w4ffl35 · on June 23, 2022

I'm not sure why my comment is getting downvoted.

I just gave it a test run. I have a function with this code:

  if (!card.IsFaceUp && !card.IsBlocked)
  {
    FlipTableauCard(card);
    card.SetIsBlocked(false);
    break;
  }

I then added this comment afterwards:

  // if the card is face up, flip it

And this is what copilot produced:

  if (card.IsFaceUp)
  {
    FlipTableauCard(card);
    card.SetIsBlocked(false);
  }

I'm pretty positive that is code generated based on my comment and the surrounding code.

Volundr · on June 23, 2022

> Presumably I am now using an AI model which learns and suggests based on the context of my project.

The "allow code from public repositories" doesn't do what you think it does. All it does is add an extra filtering step to avoid producing code found in its training set verbatim. The model you are using was still trained on those repositories, it's not limited to your project.

w4ffl35 · on June 23, 2022

Thank you, I am aware it was trained on those repositories. I am not OK with this business model.

But my comment still stands. You can turn off the verbatim copying feature that people keep talking about and the "AI model" will generate code based on your own codebase.

When I'm using it with Unity or a JS project that has NPM modules, does it use those as context to fill in some code as well? No clue.

Was it trained on open source code and is that ethically and legally shady? Yes.

Is it copying verbatim at this point? No.

Does it help me be a better programmer and will I pay for it? No and only if I forget to cancel my trial subscription.

indymike · on June 23, 2022

> i'm not buying the argument that copilot is infringing copyright

I wrote some code. Released it under the GPL, and my only expectation is that if you use my code in your product you make source available to users (and GPL does require you tell the user how to get that source code). That one small requirement, the one thing that I'm asking you to do if you want to use my code, is not being respected by Copilot. It recommends my code, obfuscates it, and neither tells the user where it was synthesized from nor provides a way to get to the original source. From a certain point of view, Copilot could be seen as a willful infringement machine. It will be interesting to see how this gets sorted out.

nucleardog · on June 23, 2022

Except, y'know, when it regurgitates copyrighted code verbatim[0] which is not even derivative but just straight up copyright infringement.

[0] https://twitter.com/mitsuhiko/status/1410886329924194309

ThrowawayR2 · on June 23, 2022

"I believe that all generally useful information should be free. By "free" I am not referring to price, but rather to the freedom to copy the information and to adapt it to one's own uses ... When information is generally useful, redistributing it makes humanity wealthier no matter who is distributing and no matter who is receiving" --Richard M. Stallman

Ironic that those who generally purport to champion FOSS fail to understand that Free Software was all about defeating copyright. The GPL was meant to turn copyright against itself.

teddyh · on June 24, 2022

The GPL is a tool for defeating copyright, or at least mitigating the worst of copyright’s effects on software development. If the GPL is weakened, it will be less helpful.

dzhiurgis · on June 23, 2022

As a society IMO we should be fine with this

LeSaucy · on June 23, 2022

Even down to whitespace?

jakelazaroff · on June 23, 2022

Okay, but Copilot isn't a human. It's a computer program. If I wrote an algorithm that spat out copyrighted code with the variables renamed, it would be absurd to say that constitutes an original work. Copilot is much closer to that than to a human programmer.

StewardMcOy · on June 23, 2022

IANAL, but my understanding is that this is not clear-cut. Clean room implementations aren't strictly required by law, but the legal standard is how similar the new work is to the copyrighted work. If an employee reads open source code and then writes substantially similar code, as copilot sometimes does, that could be found to be infringement by a court.

For this reason, I've worked at places that forbid employees from even reading open-source code. If we were having difficulty with an open-source component to the point where we needed to look at the code, we'd hire a contractor, explain the problem, and then they'd explain a solution, and all the communications would go through a company lawyer.

moralestapia · on June 23, 2022

That's a strawman.

If you learn to write novels by reading other authors, is that a crime? No.

If you reproduce their work, sometimes word by word, yes.

adra · on June 23, 2022

Take your example, but make it more to the point. I hire an anonymous ghost writer to produce for me a novel based on some story premise that I made up. This ghost writer decided to use a bunch of copyright protected sections in their draft because reasons. I think the ghost writer isn't committing copyright violations, but when I publish the book, I almost certainly am.

I'm less worried about MS getting sued for this and close to 100% certain that users are opening themselves up to legal exposure. I can't see any legal department saying go ahead with using copilot code, but by all means ask.

chii · on June 24, 2022

> a bunch of copyright protected sections in their draft

it depends on what this "bunch" means. It's not clear-cut at what level of granularity the copyrighted parts become so small, and the sources so many, that the new work is considered transformative.

dkersten · on June 23, 2022

It has been shown to have reproduced code verbatim, including comments. If that code falls under a license that has requirements or restrictions, you are infringing. The problem is, how do you know when this is occurring or when copilot spliced many things together to create something sufficiently new?

BeefWellington · on June 23, 2022

> If that code falls under a license that has requirements or restrictions, you are infringing.

I think the interesting legal question here will be, are *you* (the user of the service) infringing, or is *copilot*?

I suspect Copilot's legal team has already worked license terms such that they're passing the buck onto you.

zvr · on June 24, 2022

You are correct.

Copilot is a tool, much like the "copy" command.

If you choose to use what it's suggesting, then the fault is completely yours.

chii · on June 24, 2022

> I suspect Copilot's legal team has already worked license terms such that they're passing the buck onto you.

it would seem reasonable that copilot should not be liable for anything that a user instructs it to do.

dkersten · on June 23, 2022

That’s a good question.

jazzyjackson · on June 23, 2022

the part that I haven't seen decided yet is, how much do I have to copy before it's infringement?

maybe my co-pilot reproduces code verbatim from your GPL'd project because you and a dozen other developers all copied the same solution from stack overflow.

dkersten · on June 23, 2022

Remember that the SCO vs IBM court case was over a very small number of code lines and if memory serves they only lost because it was deemed trivial. Triviality might be correlated with a low number of code lines, but it's certainly not a given.

NewJazz · on June 23, 2022

If I spent hours looking at the Linux kernel source, then wrote a kernel that had a lot of the same ideas and idioms, that would indeed be considered infringement.

Some open source developers are not allowed by their employers to read source code with a different license for fear of infringement.

legalcorrection · on June 23, 2022

Independent recreation of the allegedly infringing work is an absolute defense.

If you were to write all those ideas and idioms down and pass them to someone who had not seen the Linux source code, who then used it to reimplement similar functionality, neither of you would probably be guilty of copyright infringement. (there are still patents of course). https://en.m.wikipedia.org/wiki/Clean_room_design

Copyright doesn’t protect programming idioms and concepts. It protects against verbatim copying, more or less.

It all comes down to how you characterize what Copilot does. We will just have to wait for new caselaw or even legislation that accounts for autonomous systems in defining legal wrongdoing.

adra · on June 23, 2022

IANAL but I don't see a double-rot13 "machine process" as any defense against demonstrably identical code. Just because it passes through a process doesn't make it clean. There were numerous examples of whole functions and code blocks repeated verbatim from its source material. I'm not sure of the current state of the application, but the effort to prove an output snippet isn't copyrighted with restrictions by the source feels like an O(N) problem that nobody would ever want to do.

At best, if copilot told you explicitly (by doing the very hard work of identifying the likely sources of the code output) you could make some (more) informed decisions as to whether it's worth the risk to include it.
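For anyone unfamiliar with the rot13 reference: applying it twice gives back exactly the input, which is the point of the analogy. A small sketch using Python's standard `codecs` module (the sample string and the `double_rot13` helper are just illustrations):

```python
import codecs


def double_rot13(text):
    """Run text through rot13 twice; the result is the original text."""
    once = codecs.encode(text, "rot_13")
    return codecs.encode(once, "rot_13")


snippet = "if err != nil { return err }"
scrambled = codecs.encode(snippet, "rot_13")

print(scrambled)                         # looks nothing like the input
print(double_rot13(snippet) == snippet)  # the round trip is an exact copy
```

The intermediate form looks unrelated to the input, yet what comes out the other end is byte-for-byte the same text.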

BeefWellington · on June 23, 2022

> If you were to write all those ideas and idioms down and pass them to someone who had not seen the Linux source code, who then used it to reimplement similar functionality, neither of you would probably be guilty of copyright infringement. (there are still patents of course). https://en.m.wikipedia.org/wiki/Clean_room_design

This response doesn't relate to the example provided by CameronNemo, as it's a different scenario.

At any rate, there is no clean room because copilot has "seen" the literal source. It is not comparable to clean room implementation in any fashion.

jazzyjackson · on June 23, 2022

a training program reads source code and produces a model

the software running the model does not have access to source code

as the parent said, it depends how you characterize it, which is why this will be decided by whoever can afford the best lawyers.

BeefWellington · on June 23, 2022

Well, the problem with that characterization is that it's been shown to be false already.

Copilot has reproduced entire functions from existing codebases, which invalidates the idea it doesn't have access to source code.

jazzyjackson · on June 23, 2022

I would have to better understand transformer language models to back up my argument, but my impression is that copilot is not sitting on top of a SQL database ripping the most likely line of code out of a row of a table; rather, it is a lossy compression that happens to be able to reconstruct some data better than others

jazzyjackson · on June 23, 2022

thanks for bringing up clean room design, I think it's a good analogy. The model, tho it may reproduce verbatim source code, does not contain the original source as such, right?

But if that were the case, one could get away with copying music by merely compressing it - I am not copying the data, I have a totally different set of data that happens to get decoded into a similar performance.

That's roughly analogous to a machine learning model, isn't it? compressing an enormous dataset into a "model" that is capable of being decoded in myriad ways depending on context.

tremon · on June 23, 2022

I'm not buying the argument that copilot is "learning to program". It's doing nothing more than rote memorization and recall.

GoblinSlayer · on June 23, 2022

Some cultures unironically use this approach to educate humans.

msbarnett · on June 23, 2022

"Some cultures use rote memorization to educate humans" != "anything educated via rote memorization necessarily has developed a human-equivalent understanding of the material"

blagie · on June 23, 2022

I do consider copilot to be infringing on my AGPL code.

That's not a fundamental statement about all machine learning systems. GPT-2 did a lot more direct regurgitation than GPT-3. GPT-3 tends to be much more transformative, but does still sometimes spit out code / text verbatim.

Copilot and codex spit out close enough to my own code that it's clearly creating a derivative work, at least by my read.

This is untrodden legal ground, but I think that a lot of this comes down to issues of reasonableness. The reason I used the AGPL license was to create a commons. If copilot played in some reasonably friendly way around that commons, I might not feel bad about it.

However:

1) Copilot wants me to pay to use something derived from my own code, where I stuck a license there designed precisely NOT to be in that position.

2) Copilot provides a competitive advantage to proprietary projects who are more likely to be able to afford it, over open-source / community ones. The reason I used an AGPL license was because I thought we needed this type of code to be open and transparent. I work in a domain where transparency is essential (I don't disclose domain, but you can think of transparency in government, education, voting, medical, police, etc.)

3) I have no way to have a conversation with anyone at github / Microsoft. They took my stuff, and they won't talk to me about how they use it. It's automated systems all the way down.

4) The whole Open AI nonprofit -> for-profit transformation is just sleazeball. Given all the talk about ethical use of AI, something like this really leaves a sour taste in my mouth. I don't mind DeepMind, FAIR, etc., since they're honest about their goals. Open AI feels like a Silicon Valley get-rich-quick scheme with a lot of nice marketing copy and legally-questionable tactics.

Jury, judges, and developers are swayed by common sense. People like me can be swayed to testify one way or another based on whether we feel cheated. What Microsoft / github / Open AI did here wasn't very reasonable, friendly, or sensible.

TL;DR: I support the concept of co-pilot in essence. The specifics here feel illegal and sleazy.

jazzyjackson · on June 23, 2022

> They took my stuff

I was about to admonish you for phrasing it this way when we all uploaded code willingly to github, giving up certain rights according to the ToS, but then I remembered microsoft straight up bought all of github, so "took my stuff" is pretty accurate. I would be interested to see a diff of the ToS since the purchase.

blagie · on June 23, 2022

It's more complex than that.

A lot of code on github (albeit not mine) is uploaded without the original party's agreement. Richard Stallman doesn't use github, but a lot of his GPL-licensed code has been incorporated into projects hosted there. If the terms-of-service allowed github to violate GPL licenses, I think most projects would need to migrate to gitlab. It'd be nigh-impossible for project authors to know that no GPL code in their project came from someone who did not have a side-license to github.

Even if that argument fell apart somehow, their terms-of-service state (https://docs.github.com/en/site-policy/github-terms/github-t...):

    This license does not grant GitHub the right to sell Your Content. It also 
    does not grant GitHub the right to otherwise distribute or use Your Content 
    outside of our provision of the Service, except that as part of the right to 
    archive Your Content, GitHub may permit our partners to store and archive Your 
    Content in public repositories in connection with the GitHub Arctic Code Vault 
    and GitHub Archive Program.

github is now selling My Content. To add insult to injury, they're trying to sell it back to me!

jazzyjackson · on June 23, 2022

If I pay a monthly fee for a search engine, and the search engine displays snippets of copyrighted work published online, is the search engine re-selling that content?

tremon · on June 24, 2022

2Gkashmiri · on June 23, 2022

why isnt copilot being trained on microsoft source code?

Engineering-MD · on June 23, 2022

This would make a great argument in court if this ever is legally challenged. The only real argument would be because it could lead to lost income for Microsoft. In the UK, derived works which can be shown to harm the original work’s economic viability are much more likely to be seen as an infringement.

moralestapia · on June 23, 2022

Because the 'snippets' would then be several pages long.

/joke

BeefWellington · on June 23, 2022

How do we know it hasn't?

Has anyone tried taking source from leaked copies of old MS code and tried to get copilot to reproduce it?

thdxr · on June 23, 2022

Whenever it comes to things like legal issues or regulations, people take such pride in being someone who understands their intricacies that the larger picture is missed. We should think about what outcome we actually want.

While technically any snippet of code can claim a copyright, it's sort of ridiculous to suggest the snippets that Copilot generates have any protectable value.

Practically Copilot is saving you the labor of writing fairly trivial but tedious code. I work on open source full time, there's no random snippet of code that I've written I'd feel upset about if someone copied.

Licenses on software are mostly performative.

NoraCodes · on June 23, 2022

> Licenses on software are mostly performative.

I seriously doubt you actually believe this, at least in an equal way. The code to Windows XP is available on the 'net, but I can't just go and compile that and start giving out copies without serious legal repercussions.

Copilot is "copyright for me but not for thee" and it's bullshit.

lm28469 · on June 23, 2022

Copilot isn't automagically going to give you the windows XP source code tho, or anything even remotely close to a full project, it seems that even a full working/compiling function is asking too much.

It generates what an intern code monkey would generate after reading a few stack overflow posts and a few github repos.

NoraCodes · on June 23, 2022

Yep. But if it regurgitates more than a few words, that's still plagiarism and probably a violation of copyright - or at least it would be if we did it with MS's code.

Again, it's copyright for them, but no protections for us.

alex_sf · on June 23, 2022

> Copilot isn't automagically going to give you the windows XP source code tho

I mean, it might. That's the whole concern. It's scraping random bits of code from all over the place.

greyface- · on June 23, 2022

> Windows XP [...] I can't just go and [...] start giving out copies without serious legal repercussions

1 month ago, HN front page https://news.ycombinator.com/item?id=31458635

swalls · on June 23, 2022

From the about page:

"Q: Why is nothing from the Windows XP source code leak added?

A: Even though Microsoft has only taken down a few Windows modifications, they will most definitely take Windows XP Delta Edition down if there is a reference to the source code inside it. The Windows XP source code is illegal to download, fork, and redistribute, so nothing from it will ever be added."

NoraCodes · on June 23, 2022

Nice, TIL. Thanks!

8organicbits · on June 23, 2022

I think "performative" and "random snippet" are key here.

But let's think of an example that may push the line. What would happen if someone wrote a closed source "Linux" kernel using the Linux kernel interface as a stub, and filled in the code using copilot. You'd expect some of the generated code to come from Linux, since that's the best code for the stub. Linux's use of GPL is not performative and this could create multiple instances of copied code.

But for everyone else who isn't building a closed source replacement of GPL software, I have a hard time thinking you'd be impacted.

natly · on June 23, 2022

The 'inverse square root' carmack trick is hardly trivial.

chii · on June 23, 2022

so luckily no one owns the idea that you can use both Newton's method, and floating point bit-hacking, to produce a good estimate of an inverse square root.
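
For anyone who hasn't seen the idea spelled out, here is a minimal sketch of it in Python. It assumes 32-bit IEEE 754 floats and the commonly cited magic constant 0x5F3759DF; the function name `fast_inv_sqrt` is made up for illustration, and this is a rewrite of the general technique, not a copy of any particular codebase. Note that it estimates 1/sqrt(x), not sqrt(x).

```python
import struct


def fast_inv_sqrt(x: float) -> float:
    """Approximate 1/sqrt(x) for a positive float32-range x (illustrative sketch)."""
    # Reinterpret the bits of the 32-bit float as an unsigned integer.
    i = struct.unpack("<I", struct.pack("<f", x))[0]
    # Halving the bit pattern and subtracting from a magic constant
    # yields a rough first guess at the inverse square root.
    i = 0x5F3759DF - (i >> 1)
    # Reinterpret the adjusted bits as a float again.
    y = struct.unpack("<f", struct.pack("<I", i))[0]
    # One Newton's-method iteration refines the guess.
    return y * (1.5 - 0.5 * x * y * y)
```

Calling `fast_inv_sqrt(4.0)` should land close to 0.5. The idea (a bit-level first guess plus a Newton step) is what no one owns; a specific expression of it in source code is a separate matter.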

SahAssar · on June 23, 2022

So if a similarly non-trivial piece of code was in a non-public-domain codebase and copilot copied that wholesale, then you'd have a problem with it?

thomastjeffery · on June 23, 2022

So your position is that you don't respect copyright in the first place, but your argument is that Copilot is effectively a case for free use.

gavinhoward · on June 23, 2022

I wrote a whitepaper refuting GitHub's arguments that Copilot is fair use: https://gavinhoward.com/2021/10/my-whitepaper-about-github-c... .

UncleMeat · on June 23, 2022

I feel like the term "whitepaper" has lost all meaning. Does it just mean "thing typeset with latex" now?

gavinhoward · on June 23, 2022

It's just the term that the FSF used when I wrote it for them. For me, it means "academic paper."

WhompingWindows · on June 23, 2022

I thought the distinction was: "white papers" are not necessarily peer-reviewed by an academic journal's editors; white papers tend to be internally-made documents within an organization, edited by that organization's members, for the purpose of knowledge/information.

Academic papers as a whole can include both "white papers" and "peer-reviewed journal articles" aka the "papers" that many think of. I would surely put a "white paper" onto my resume if I thought it was a great piece of work.

gavinhoward · on June 23, 2022

TIL, thank you. I think my paper is a whitepaper in that sense because it's not peer-reviewed, but it was meant as an academic-like paper.

UncleMeat · on June 23, 2022

"Whitepaper", for a long time, meant sort of the opposite of "academic paper." They weren't published in refereed conferences and journals and were usually published by corporations or governments but they resembled academic work in their contributions.

Something like the original BTC paper is a good example. It wasn't published in a conference or journal, but the level of rigor and the scale of the contribution is similar to what would be published in a conference or journal.

I think the crypto community was the end of it. Now "we are launching $CoolCoin" gets called "whitepaper."

gavinhoward · on June 23, 2022

That makes sense. Thank you for teaching me that.

I would still call my paper a whitepaper, though. While it's not rigorous, it's more about law where rigor doesn't apply in the same way. And it is not peer-reviewed, although it was reviewed and rejected by the FSF.

Datenstrom · on June 23, 2022

GitHub should not be sued for training on the data, but anyone using it should be liable for any copyright infringements it generates. That would effectively make it useless for business use cases, but it should be until the models understand copyright and plagiarism, which they do not yet.

mewse · on June 23, 2022

If Microsoft asserts and represents that their tool doesn’t generate copyright-infringing code, then surely Microsoft is the party which should be liable, rather than the poor unlucky programmer who was lied to by the billion-dollar corporation’s marketing agency?

chii · on June 23, 2022

> surely Microsoft is the party which should be liable

unless microsoft is doing work-for-hire for you via copilot, i highly doubt they are liable.

You, as the person who is claiming to have produced the work (even though you were using a smart tool to help), must be the person who also is liable. Otherwise, could you not claim that the auto-correct on your word-processor is liable for copyright infringement?

monkeybutton · on June 23, 2022

An interesting exercise would be recreating an entire program/tool that is GPL using copilot and releasing under a less restrictive license. Could one argue that the effort put into cobbling together a knock-off is enough to constitute an original work?

Running a copyrighted movie through a neural network compression algorithm and uploading it on bittorrent isn't going to stop you from being sued. Even if the output is produced by an AI.

redox99 · on June 23, 2022

That's probably the right distinction.

If copilot allows you to type

// source code of linux kernel

And you get the whole code, then I would consider it unoriginal

The same way with your movie example, if you told it

Avengers Endgame

And it gave you the whole movie, it would also be. But what if you type (like with DallE) Spiderman fighting Thanos and you get something different, but that resembles some Endgame scene. Would that infringe copyright, be fair use, or what?

Adrox · on June 23, 2022

Try to publish anything with a character looking like Spiderman, Thanos, IronMan, Captain America and see what Marvel does to you…

redox99 · on June 23, 2022

There are many videos like that on youtube and they seem to be doing fine

For example

https://www.youtube.com/watch?v=u8LvcJoAnms

UncleMeat · on June 23, 2022

The law isn't stupid. This is one of the advantages of having humans in the loop. People can and will say "I know what you are doing, knock it off."

im3w1l · on June 23, 2022

A lot of people are asking "is this legal?", but to me the more interesting question is "should this be legal?". And I would like to debate the advantages and disadvantages of this.

If we go back to the very basics, the purpose of copyright is to reward and incentivize content creators.

The purpose of fair use is to allow certain uses of a work that are beneficial to society, and don't substantially interfere with rewarding content creators. Consider for instance a parody. The parody does not substitute for the original work, so it will not interfere with the reward.

Looking at copilot through this framing, we can ask whether it will interfere with rewarding creators. If Alice writes a program, will copilot trained on that product allow Bob to create a competing product more easily, essentially freeloading on Alice's creativity? Yes, to some extent it will.

On the other hand there are also advantages of allowing copilot. It promises to make software creation easier, for Bob, but also for Alice.

Now, as this is trained on code that is posted publicly but not private code it will advantage private code over publicly posted code. This has the potential to harm the free and open source software movement. This movement is a positive in the world, so harming it should be considered a bad thing.

But really, the big question is how big these effects are, and no one seems to have a good answer to that.

toss1 · on June 23, 2022

>>a form of laundering open source code into commercial works

If it is nothing but a glorified autocomplete that inserts SHORT snippets of code, as in a quote from a book in a book review, or the above quote, that's fine. But when it is inserting whole pages, that's beyond Fair Use, and basically a nicely scaled and laundered version of plagiarism.

Another key is how much it is modifying the code to suit the situation. Is it generating entirely new synthesized output like DALL-E 2 or GPT-3 such that the output is closer to generative creativity, or is it merely pasting in code blocks found in similar situations?

It comes down to the question of how transformative is the output, which is a key concept in copyright law.

>>"What is the difference between this and someone doing it manually?"

Another key question here. If it is merely cut/paste beyond a line or two, it's plagiarism, but if it is synthesizing new works, it's good. Same as manually: am I using your GPL work for inspiration to generate new works, or am I copy-pasting pages of code?

Does anyone have any extensive experience with Copilot to be able to highlight these differences?

EDIT: fmt, clarity

cdrini · on June 23, 2022

In my experience it's more generative. It responds to variable names in my code, formatting, patterns in how I'm writing (am I using ES6 features? async/await or raw promises? Etc). Granted it could still be copying, but when it's generating usually less than 5 lines at a time... I wonder if you could argue that all possible 5 line code snippets have already been written--wrt abstract logic/variable relationships/patterns.

Also I think it's important to note that Copilot isn't an independent AI; it's a human in the loop system. In my experience I always make adjustments to the code it generates to better fit my needs. So on top of the transformation that copilot does, I'm also stacking transformation on top of that. So the final, potentially copyrightable product is very far removed from the training data.

Copilot's underlying model, OpenAI Codex, "is a descendant of GPT-3". I think code is unique in that unlike text and images, there's much less variance in code. It's not as expressive as English, or visual art. So I think there might be a higher chance it'll generate something similar to its training data--but only because all code is inherently more similar.

toss1 · on June 23, 2022

Thanks - very helpful!!

Good points about you adding transformations on top of the Copilot output and less variation in code due to its structured nature.

Interesting also that it responds to variable names and patterns in your writing/code. Does it also respond to comments?

Your answer points me more towards the not-infringing argument. Perhaps the best solution would be to have a companion plagiarism-checker tool that examined your code vs its training set of GPL/MIT licensed code when you are nearly finished to flag significant copying. Shouldn't be too hard and would avoid the whole problem (and also maybe sometimes point you to a library you should be using instead of rolling your own).
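
A toy version of such a checker, as a rough sketch only: it looks for the longest verbatim token run shared between a candidate file and each entry of a reference corpus, and flags entries above a threshold. Everything here is an assumption for illustration (the names `longest_shared_run` and `flag_copying`, whitespace tokenization, an in-memory dict as the "training set", the threshold of 8 tokens). A real tool would need to normalize identifiers and formatting and index a vastly larger corpus.

```python
from difflib import SequenceMatcher


def longest_shared_run(candidate, reference):
    """Return the longest run of whitespace-separated tokens found verbatim in both texts."""
    a = candidate.split()
    b = reference.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


def flag_copying(candidate, corpus, min_tokens=8):
    """Return names of corpus entries sharing a verbatim run of at least min_tokens tokens."""
    return [
        name
        for name, text in corpus.items()
        if len(longest_shared_run(candidate, text)) >= min_tokens
    ]


# Hypothetical stand-in for a licensed training set.
corpus = {
    "gpl_lib": "int add ( int a , int b ) { return a + b ; }",
    "other": "print hello",
}
candidate = "static int add ( int a , int b ) { return a + b ; } // mine"
flagged = flag_copying(candidate, corpus)
```

The threshold is the hard part: set it too low and every `for` loop gets flagged, set it too high and renamed variables slip through.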

cdrini · on June 25, 2022

Thanks for listening! The comment thread for anything copilot related is generally super polarised, and I really appreciated your calm comment and your calm response :)

If I understand your question correctly, it does respond to comments. One of the best ways to interact with copilot is to write a comment (eg "Read input until y/n is entered") and have it generate the resulting code. If you're asking if it matches commenting style, I think it does, but I haven't pushed it in that direction too much.
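
To make that concrete, this is the kind of code such a comment is asking for. It is hand-written for illustration and is not actual Copilot output; the function name `ask_yes_no` and the injectable `read` parameter are my own choices.

```python
def ask_yes_no(prompt="Continue? [y/n] ", read=input):
    # Read input until y/n is entered
    while True:
        answer = read(prompt).strip().lower()
        if answer in ("y", "n"):
            # True for "y", False for "n".
            return answer == "y"
```

Passing a different `read` callable makes it usable without a terminal, for example in a test.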

I think that's an interesting proposal, although it does place the onus on the developer. I would be curious to know how often developers would fail that check even without Copilot!

Overall I think there are definitely open questions, but I'm personally really excited by systems like GPT, Copilot, or DALL-E. The future is not clear, but I think these tools to some extent make the internet make sense. There's way too much data online to make sense of--be it code, text, or images. Unlike a search engine which just links to hopefully relevant material, these tools "learn" from all that data and respond with an answer of sorts--not a list of references. I think one huge improvement would be making these systems more explainable. So getting a certain response, you can also see the thousands of references that were used to generate that response. That would help a lot in providing transparency in whether there is plagiarism happening, and also just be an immensely useful tool for humanity and the internet. It feels like the next logical step for the internet. I would even say it feels like the internet was built for this!

toss1 · on June 25, 2022

Thx for the kind words and helping me get a better understanding! I think you are really onto something about GPT/Copilot/DALL-E being almost what the Internet was built for, or at least one of the fundamental growth stages - first communicate, then accumulate data/info online, then make it searchable, then start automatically "understanding" and synthesizing it... (with the quotes around "understanding" doing a lot of work).

And a good point you make about not relying on the programmers to run the plagiarism checks - the tool should do it itself.

It sounds like this is indeed closer to a copyright-OK generating of new code, rather than mere laundering, and if it isn't yet there, it seems like the copy/paste paradigm would be a juvenile phase, and it should improve and get more "creative" and less copy/paste-ish with further development.

rockbruno · on June 23, 2022

EDIT: It took me a while, but I found proof of Copilot suggesting a full copyrighted algorithm. I then take back these arguments as I was under the assumption the tool couldn't do this: https://twitter.com/mitsuhiko/status/1410886329924194309

Old comment for documentation's sake:

Are licenses even enforceable by law? The idea of writing some mundane basic code and then wanting to sue someone for "stealing it" just sounds ludicrous to me. True copyright has barriers to make sure you actually invented what you're trying to patent.

That's not to say that there isn't a problem here -- there's definitely an ethical component to how this product works, but this whole code licensing thing never clicked for me. Does it hold any actual power?

thomastjeffery · on June 23, 2022

Licenses are indeed enforceable by law. Most popular ones (GPL, MIT, Apache) have been enforced in court.

Also Copyright isn't patent.

jodrellblank · on June 23, 2022

> "Moreover, open source developers are already suffering burnouts because of gigantic multi-billion dollar corporations taking their free code and re-bundling it as a SaaS, hence, introducing this new feature takes even more from them than there was before."

The popular open source licenses explicitly give permission for people to resell your work, it's not even buried in the small print or anything. e.g.

GPL: "Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for them if you wish)".

MIT License: "including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software".

Apache License: "each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work".

This should no more cause burnout than someone buying your old car and using it for an endurance race makes you tired. Where there's burnout involved it's more likely the demands for support and fixes that head back upstream without any associated money. More concerning is Copilot trained on code which isn't GPL licensed or similar. Sharing code doesn't automatically grant anyone any license to use it for anything at all.

Longlius · on June 23, 2022

Except in the case of the GPL license, your rights to reuse and redistribute the source code are contingent upon your derived work also being licensed under the GPL. This is not a hypothetical untested legalese requirement but a real legal requirement that's held up in the courts of several countries.

So yes, using a substantial portion of GPL code in your proprietary software product is copyright infringement. Or even using a substantial portion of GPL code and not licensing the code under an appropriate license - https://www.gnu.org/licenses/gpl-faq.html#WhatDoesCompatMean

blueflow · on June 23, 2022

From MIT License;

> The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

I'm not confident Copilot will comply with this part of my code's license.

simonw · on June 23, 2022

It depends on the interpretation of "substantial portions". Classic challenge with legal documents.

Here's a discussion about that usage in the MIT license: https://opensource.stackexchange.com/a/2188

jodrellblank · on June 23, 2022

Me either, but that has little to do with burnout from other people making money from GPL licensed code.

(That a tool exists doesn't free one from responsibility for using it; CoPilot not citing original code and its license terms seems like it rules it out for anything beyond experimentation; that part is very arguable)

c01n · on June 23, 2022

> GPL: "Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for them if you wish)".

Yes, but if I use copilot in a private codebase, how sure can I be that it has not copied GPL code?

8organicbits · on June 23, 2022

I think the risk is higher, but it's always been possible one of the developers would copy GPL code. There are tools that check your code to see if any GPL code exists. I haven't used them, but I suppose you could audit with those.

jasonlfunk · on June 23, 2022

Imagine a situation where a chat bot AI is trained on a bunch of copyrighted novels. And then you ask it to start reciting portions of it. Should that be illegal? It doesn’t seem like it to me.

If I were to turn around and resell it, then you could sue me; but that wouldn’t be a chat bot’s fault.

Just like it’s not the clipboard’s responsibility to ensure I’m not violating licenses, I don’t think it’s copilot’s responsibility either. Use it at your own risk.

jrochkind1 · on June 23, 2022

Hm, you're suggesting that it ought to definitely be illegal if you sell the output of the chatbot trained on copyrighted material, but not if you give it away for free?

That is definitely not how US copyright law works (although you are welcome to argue that it should); there is no such bright line around selling. It's possible for something to be a copyright violation even if you give it away for free (see torrent sites!), and it's possible for something to qualify as fair use and not be a violation even if you sell it.

Under USA copyright law (others are similar but not quite the same), the first step would be deciding if the output of this chatbot counts as a "copy" or "derivative work" at all. If it does not, then there is no copyright violation whether you sell it or not. If it does, then it is a copyright violation (whether you sell it or not) unless its use can count as "fair use". Whether the use is "commercial" is just part of one of four factors that are balanced to determine if it's fair use. For instance, if you are only using a tiny portion of the copyrighted work and it doesn't have much effect on the profits of the original copyright holder and the use is considered highly "transformative" too (sound like copilot?) -- it could well be fair use even if you are selling your output.

(Also... Github Copilot is literally selling it, right? You have to pay them to use Copilot! It would accordingly be considered a commercial use. If I make a copy of a hollywood movie and sell it to you, I'm probably violating copyright (unless I can convince a court it's fair use), regardless of what you do with it, it doesn't matter if you re-sell it or not. If you re-sell it or make another copy, you may be additionally violating copyright another time yourself.)

I do think there's a reasonable argument that copilot is fair use. It wouldn't mainly hinge on whether the output is sold or not. This is presumably the argument MS/Github would make if brought into court. Since Oracle v. Google, I've stopped trying to predict what courts will do on software-related copyright cases, the law seems to be pretty chaotic and the actions of courts unpredictable. (also i'm not a lawyer this is not legal advice).

In general, I am in favor of expansive bounds of fair use, and think it serves "the people" to have such.

KyeRussell · on June 23, 2022

Corpus licensing is an existing issue.

WesolyKubeczek · on June 23, 2022

The law is very big on intent. It's one thing if yeah, a corpus is full of copyrighted material, but is used for scientific, potentially humanity-advancement purposes, and a completely different thing if it's being used straight for monetization while bypassing copyright holders.

A FOSS license doesn't mean it's all fucking freebies all the way down as any big tech company is quick to remind you (by attaching trademarks all over the place, for example), but if it's a big tech company taking your stuff and running with it, it's suddenly all fair game.

All in all, it's how companies have been behaving since about forever. Fuck little people all the way.

jrochkind1 · on June 23, 2022

Again though, using something for scientific/scholarly purposes alone is not a 100% guaranteed defense against copyright infringement in the USA. It is one aspect of a four-part fair-use defense. You can definitely still be violating copyright even with a recognized scientific/scholarly non-profit purpose. (and be engaging in fair use without it).

I have not heard of what people have already been discussing about copyright legal issues around corpuses of Other People's Content, I am curious to read more if anyone has a link.

While I understand that in this case it seems like (or is!) a Big Corporation taking advantage of the Little People -- I would urge extreme caution in advocating for reducing and limiting "fair use" rights because in this case it will hurt the Big Corporation. It used to be clear to everyone that fair use helped the Little People against the Big Corporation copyright holders and should be encouraged. Lately, people (especially software devs, who of course write IP) have been excited about strengthening protections for copyright in order to somehow limit the Big Corporations (see all the stuff around Amazon and open source), but this is a dangerous game. Fair use is one of the only tools we have protecting us from a dystopia where you can't open your mouth or type on your keyboard without paying someone a license fee, which is what OTHER Big Corporations would love to see.

moralestapia · on June 23, 2022

I don't know how blinkist gets away with it, honestly.

jrochkind1 · on June 23, 2022

I'm not familiar with this, but would love to learn more. Links explaining what you mean by this and how it works welcome. Googling wasn't helping me.

iasay · on June 23, 2022

It’s a good point. I was trained on a lot of copyrighted novels.

However I don’t quote verbatim which is what this tool has been doing.

chii · on June 23, 2022

how big does a snippet have to be to become verbatim?

Plenty of trigrams exist in many novels that are exactly the same, and i bet that there's plenty of n-grams in programming which would be construed as copies of each other.
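
A quick way to see this, sketched in Python under toy assumptions (whitespace tokenization, two made-up snippets, helper names `ngrams` and `shared_ngrams` invented for illustration):

```python
def ngrams(tokens, n=3):
    """Return the set of all n-token windows in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shared_ngrams(text_a, text_b, n=3):
    """Return the n-grams that appear in both whitespace-tokenized texts."""
    return ngrams(text_a.split(), n) & ngrams(text_b.split(), n)


# Two independently written loops that do different things.
snippet_a = "for i in range ( n ) : total += i"
snippet_b = "for j in range ( n ) : print ( j )"
common = shared_ngrams(snippet_a, snippet_b)
```

Here `common` holds four trigrams, all from the `in range ( n ) :` boilerplate, even though neither snippet was copied from the other. Overlap at small n says very little; the open question is at what n it starts to mean something.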

iasay · on June 23, 2022

I think you’d need a lawyer to answer that question. Which in itself is a problem.

The nature of using this tool could hang you in theory more than if you didn’t use it.

NoraCodes · on June 23, 2022

If I "trained" an "AI model" consisting of an executable that regurgitates its input 99% of the time on a Hollywood blockbuster and distributed it, that'd be copyright infringement. Probably still would be at 50% and 10%. So what's the threshold for you here?

anonymousab · on June 23, 2022

> And then you ask it to start reciting portions of it. Should that be illegal

In several countries, humming a song - reciting a melody - is cut-and-dried de jure copyright infringement. It's just not rigorously punished.