No, see Authors Guild v. Google. Even without a license or permission, fair use permits the mass scanning of books, the storage of the content of those books, and rendering verbatim snippets of those books. The Google Books site is not a derivative work of the books by the millions of authors it copied from, and if it did copy any work that happened to be under the GPL, AGPL, or a Creative Commons copyleft license, the fair use exception applies before we reach the question of whether Google is obligated to provide anything beyond what it is doing.

By comparison, Copilot is even more obviously fair use.

I've had this conversation quite a few times lately, and the non-obvious thing for many developers is that fair use is an exception to copyright itself.

A license is a grant of permission (with some terms) to use a copyrighted work.

This snippet from the Linux kernel doesn't make my comment here or the website Hacker News a GPL derivative work:

    ret = vmbus_sendpacket(dev->channel, init_pkt,
        sizeof(struct nvsp_message),
        (unsigned long)init_pkt, VM_PKT_DATA_INBAND,
        VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED);

This snippet from an AGPL-licensed project, Bitwarden, does not compel dang or pg to release the Hacker News source code:

    await _sendRepository.ReplaceAsync(send);
    await _pushService.PushSyncSendUpdateAsync(send);
    return (await _sendFileStorageService.GetSendFileDownloadUrlAsync(send, fileId), false, false);

Fair use is an exception to copyright itself. A license cannot remove your right to fair use.

The Free Software Foundation agrees (https://www.gnu.org/licenses/gpl-faq.en.html#GPLFairUse)

> Yes, you do. “Fair use” is use that is allowed without any special permission. Since you don't need the developers' permission for such use, you can do it regardless of what the developers said about it—in the license or elsewhere, whether that license be the GNU GPL or any other free software license.

> Note, however, that there is no world-wide principle of fair use; what kinds of use are considered “fair” varies from country to country.

(And even this verbatim copying from FSF.org for the purpose of education is... Fair use!)




You're strongly and incorrectly implying that "Fair Use" is a clear (and relatively immutable) concept within copyright law, which couldn't be further from the truth. Even if this or that particular case sets out what appears to be solid grounds, one shouldn't take that as gospel by any means.

This mostly has to do with the wishy-washy nature of the 4-part Fair Use test, which, unlike decent legal tests, doesn't actually have discrete answers. The judge looks at the 4 questions, talks about them while waving her hands, and makes a decision.

Compare that to, e.g., patent law, where you actually do have yes-or-no questions. Clean Booleans. Is it Novel? Is it Non-Obvious? Is it Useful? If any of the above is "No", then no patent for you.

As for the execution of Fair Use, while I haven't gone too deep into Software, I can assure you that for music, the thing is just a silly holy-hell mess; confirmed most recently by the "Blurred Lines" case, where NO DIRECT COPYING (e.g. sampling or melody taking) was alleged, merely that the song sounded really similar to "Got to Give It Up", and that was enough.

So then, I'd say everything either is, or should be, up in the air, when it comes to Fair Use and software.


Most law is wishy washy. There are very few cut and dry answers in the law (If there were, we wouldn't need lawyers and a court system based on deciphering the law).

All that said, the one thing I'd add about fair use is that it isn't permission to use anything you like, but rather a defense in a legal proceeding about copyright. It's pretty much all about being able to reference copyrighted material, with the law later coming in and making final decisions on whether or not that reference went too far. (I.e., copying all of a Disney movie and saying "What's up with this!" vs copying 1 scene and saying "This is totally messed up and here's why".)

That was a big part of the Google v. Oracle lawsuit.


> Is it Novel? Is it Non-Obvious?

Those questions for patents are barely more clear-cut than copyright fair use tests, there is lots of room for disagreement.

It's definitely true that a fair use defense against copyright infringement varies a lot by the field of work and norms can develop which are relevant to court cases. The music field is a mess, the "Blurred Lines" judgement was total bullshit. But the software field is not without its own copyright history and norms so there's no reason to expect everything to go to hell.


But there's no reason not to either - I suppose my point is, don't take too much as gospel and think about everybody's best "end-goals" and push or pull with or against the law as needed.


There’s also an aspect of this that varies by size, budget, political clout, etc etc, of the individual or organisation.

The big guns like Microsoft, Google, Oracle, do this sort of thing as a matter of course in their business activities, they have the lawyers, the money, and the ear of members of parliaments, senators etc.

Whereas an individual or small business probably wants to conduct themselves within a more narrow set of adherences.


Unanswered question, as far as I know: is a trained model a derivative work? If the model accidentally retains a copy of the work, is that an unauthorized copy?


In my opinion, the model would not be an unauthorized copy given that its primary purpose was for some other task and the inclusion of the work was merely incidental.

The unauthorized copy arises when someone gets the work out of the model.

Of course if you make a model explicitly for the purpose of evading copyright then the courts will see through that ploy.


I think it would be pretty easy to stake opinions on those "boolean questions."

Is (was?) a swipe gesture novel? Is it non-obvious?


I think what the parent is stating is that even though the patent questions can have debate, once you settle the question "Is it Novel" as yes or no you can determine if the item is patentable... whereas for fair use, the questions themselves aren't yes/no questions, and further, they are just used as balancing factors, so even if everyone agrees on "the effect of the use upon the potential market for or value of the copyrighted work" it's only weighed as a factor for how fair the use is, and broadly left up to the hand-waving of the particular judge.
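
To make the contrast concrete, here is a rough sketch of the two shapes of test being described. The factor names, scores, and weights are all made up for illustration; courts do not assign numbers, and real patent questions are far less crisp than a Boolean.

```python
def patentable(novel, non_obvious, useful):
    # Conjunctive test as described in the thread: a single "no" is decisive.
    return novel and non_obvious and useful


def fair_use_leaning(factors, weights):
    """Weighted sum of factor scores in [-1, 1]; positive leans toward fair use.

    Scores and weights are hypothetical stand-ins for a judge's balancing.
    """
    return sum(weights[name] * score for name, score in factors.items())


# Everyone agrees on the factor findings...
findings = {"purpose": 0.8, "nature": -0.2, "amount": -0.5, "market": -0.6}

# ...but two hypothetical judges weigh them differently.
judge_a = {"purpose": 3.0, "nature": 1.0, "amount": 1.0, "market": 1.0}
judge_b = {"purpose": 1.0, "nature": 1.0, "amount": 1.0, "market": 1.0}

leaning_a = fair_use_leaning(findings, judge_a)
leaning_b = fair_use_leaning(findings, judge_b)
```

With identical findings, `leaning_a` comes out positive and `leaning_b` negative, which is the "balancing factors" point: agreeing on each factor does not settle the outcome.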


Oh, absolutely. Kind of furthers my point. Patent is a silly mess in a lot of ways, but at least there's something like Booleans in it. "Fair use" doesn't even have THAT.


Yes to all this.

I think the factor most at risk in a fair use test with Copilot is whether it ever suggests, verbatim, code that could be considered the "heart" of the original work. The John Carmack example that's popped up here at least gets closer to this question; it was a relatively small amount but it was doing something very clever and important.

One can imagine a project that has thousands of lines of code to create a GUI, handle error conditions, etc. that's built around a relatively small function; if Copilot spat out that function in my code, it might not be fair use because it's the "heart" of the original work. Additionally, its inclusion in another project could affect the potential market for the original, another fair use test.

But Copilot suggesting a "heart" is unlikely, something that would have to be ruled on in a case-by-case basis and not a reason to shut it down entirely. Companies that are risk-averse could forbid developers from using Copilot.


This is an excellent comment because it captures some important nuance missing from other analysis on HN.

I agree with you that the relative importance of the copied code to the end product would be (or should be) the crux of the issue for the courts in determining infringement.

This overall interpretation most closely adheres to the spirit and intent of Fair Use as I understand it.


For any discussion on copyright and fair use, we should distinguish between the implications to Copilot the software itself and the implications to users of Copilot.

For Copilot itself, I do see the case for fair use, though it gets fuzzy should Microsoft ever start commercializing the feature. Nevertheless, it remains to be seen whether ML training offers the same public policy benefits that public libraries and free debate rely on to support a fair use defense.

For Copilot users, I don't see an easy defense. In your hypothetical, this would be akin to me going on Google Books and copying snippets of copyrighted works for my own book. In the case of Google Books, they explicitly call out the limits on how the material they publish can be used. In contrast, Copilot seems to be designed to encourage such copying, making it more worrisome in comparison.


>In your hypothetical, this would be akin to me going on Google books and copying snippets of copyrighted works for my own book.

A book completely written by pasting passages of other books would actually be a pretty interesting transformative work.


Yeah, but a book like this would be an artistic work.

While software is in this limbo between copyrights and patents...


The world is global. That's a US court ruling from one court of appeals. Most countries have narrower fair use rights than the US. Even if Copilot would fall within that legal precedent (far from guaranteed), a legal challenge in any jurisdiction worldwide outside the US states covered by that particular court of appeals, or which reaches the US Supreme Court, or which goes through the Federal Circuit Court of Appeals due to the initial complaint including a patent claim, would not be bound by that result and (especially in a different country) could very plausibly find otherwise.

What's more, if any of the code implements a patent, fair use does not cover patent law, and relying on fair use rather than a copyright license does not benefit from any patent use grant that may be included in the copyright license. If a codebase infringes a patent due to Copilot automatically adding the code, I can easily imagine GitHub being attributed shared contributory liability for the infringement by a court.

Not a lawyer, just a former law student and law feel layman who has paid attention to these subjects.


> law feel layman

What a weird autocorrect typo. This should have read "law geek layman." (And it initially autocorrected again as I was typing this paragraph.)


> No, see Authors Guild v. Google.

That case required that the output be transformative, in that "words in books are being used in a way they have not been used before".

Copilot only fits the transformative aspect if it is not directly reciting code that already exists in the form in which it is redistributed. So long as it recites code directly, it fails to meet the criteria.


I think you might be considering two different acts here:

1. The act of training Copilot on public code

2. The resulting use of Copilot to generate presumably new code

#1 is arguably close to the Authors Guild v. Google case. You are literally transforming the input code into an entirely new thing: a series of statistical parameters determining what functioning code "looks like". You can use this information to generate a whole bunch of novel and useful code sequences, not just by feeding it parts of its training data and acting shocked that it remembered what it saw. That smells like fair use to me.

#2 is where things get more dicey - just because it's legal to train an ML system on copyrighted data wouldn't mean that its resulting output is non-infringing. The network itself is fair use, but the code it generates would be used in an ordinary commercial context, so you wouldn't be able to make a fair use argument here. This is the difference between scanning a bunch of books into a search engine, versus copying a paragraph out of the search engine and into your own work.

(More generally: Fair use is non-transitive. Each reuse triggers a new fair use analysis of every prior work in the chain, because each fair reuse creates a new copyright around what you added, but the original copyright also still remains.)


Is there any evidence of Copilot producing substantial (100s of lines) verbatim copies of copyrighted works?

Absent this, I don't think there's a case. The courts have given extraordinarily wide latitude to fair use and ML algorithms are routinely trained on copyrighted works, photos, etc. without a license.

I understand that this feels more personal because it involves our field, but artists and authors have expressed the same sentiment when neural nets began making pictures and sentences.

The question here is no different than "Is GPT-3 an unlicensed, unlawfully created derivative work of millions, if not billions of people?"

No, I'm quite confident it is not.


> Is there any evidence of Copilot producing substantial (100s of lines) verbatim copies of copyrighted works?

It doesn't need to be substantial. In Google v. Oracle a 9-line function was found to be infringing.


If I recall correctly, the nine line question wasn't decided by the Supreme Court, but the API question was.

The Supreme Court did hold that the 11,500 lines of API code copied verbatim constituted fair use.

https://www.supremecourt.gov/opinions/20pdf/18-956_d18f.pdf


> The Supreme Court did hold that the 11,500 lines of API code copied verbatim constituted fair use.

Yes, because it was _transformative_, in a clear way. Because an API is only an interface. Which makes that part of that decision largely irrelevant to the topic at hand.

> Google’s limited copying of the API is a transformative use. Google copied only what was needed to allow programmers to work in a different computing environment without discarding a portion of a familiar programming language. Google’s purpose was to create a different task-related system for a different computing environment (smartphones) and to create a platform—the Android platform—that would help achieve and popularize that objective.

> If I recall correctly, the nine line question wasn't decided by the Supreme Court, but the API question was.

It was already decided earlier, and Google did not contest it, choosing instead to negotiate a zero payment settlement with Oracle over the rangeCheck function. There was no need for the Supreme Court to hear it.


A $0 settlement means there is no binding precedent and signals to me that Oracle's attorneys felt they didn't have a strong argument and a potential for more.

If they felt the nine line function made Google's entire library an unlicensed derivative work, they would have pressed their case.


> A $0 settlement means there is no binding precedent and signals to me that Oracle's attorneys felt they didn't have a strong argument and a potential for more.

That's not the case. It wasn't an out-of-court-settlement, but an agreement about the damages being sought, the court had already found it to be infringing, and that was part of the ruling.

But none of that changes that 9 lines is substantial enough to be infringing. It isn't necessary to be a large body of work.

> If they felt the nine line function made Google's entire library an unlicensed derivative work, they would have pressed their case.

No... It means the rangeCheck function was infringing. The implication you seem to have inferred here wouldn't be inferred by any kind of plagiarism case.


I think we agree then, and I appreciate the correction on the lower court settlement.

If Copilot is infringing, I suspect it's correctable (by GitHub) by adding a bloom filter or something like it to filter out verbatim snippets of GPL or other copyleft code. (And this actually sounds like something corporate users would want even if it was entirely fair use because of their intense aversion to the GPL, anyhow.)
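
A minimal sketch of what such a filter could look like, assuming the corpus is indexed as whitespace-normalized windows of consecutive lines. Nothing here reflects how GitHub actually does or would do it; the class and function names, the 3-line window, and the filter size are all hypothetical choices.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def shingles(code, width=3):
    """Yield windows of `width` consecutive non-blank lines, whitespace-normalized."""
    lines = [" ".join(line.split()) for line in code.splitlines() if line.strip()]
    for i in range(len(lines) - width + 1):
        yield "\n".join(lines[i:i + width])


def build_index(copyleft_sources, width=3):
    """Add every line window from every source file to one Bloom filter."""
    index = BloomFilter()
    for source in copyleft_sources:
        for shingle in shingles(source, width):
            index.add(shingle)
    return index


def looks_verbatim(suggestion, index, width=3):
    """True if any window of the suggestion is (probably) in the indexed corpus."""
    return any(shingle in index for shingle in shingles(suggestion, width))
```

A suggestion would be dropped when `looks_verbatim(suggestion, index)` is true. Because Bloom filters can give false positives, this errs toward over-filtering; it also only catches exact matches after whitespace normalization, so renamed variables would slip through.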


It may be correctable... It doesn't change that Copilot is probably infringing today, which may mean that damages against GitHub may be sought.


The point of Copilot -- its entire value as a product -- is to produce code that matches the intent and semantics of code that was in the input. In other words, very deliberately not transformative in purpose.


Why did you choose the standard of "substantial" = "100s of lines"? Especially since we've already seen examples of verbatim output in the dozens of lines range, that choice of standard is rather conveniently just outside what exists so far. If we find a case with 200 lines of verbatim output will you say the only reasonable standard is 1000s of lines?

I don't think your argument is as strong as you're making it out to be.


Just a fairly arbitrary number. It's easy to produce a few lines from memory, up to 10s of lines and that's "obviously" fair use. I would be surprised if many of us haven't inadvertently "copied" some GPL code in this way!

This goes to the "substantial" test for fair use. Clips from a film can contain core plot points, quotes from a book can contain vital passages to understanding a character, screen captures and scrapes of a website can contain huge amounts of textual detail, but depending on the four factors for fair use, still be fair use. (There have been exceptions though.)

The reaction on Hacker News to a machine producing code trained on their works is no different than the reactions artists and writers have had to other ML models. I suspect many of us are biased because it strikes at what we do and we think that our copyrights (because we have so many neat licenses) are special. They are not.

I think it would need to get to that level of "Copilot will emit a kernel module" before it's not obviously fair use.

After all, Google Books will happily convey to me whole pages from copyrighted works, page after page after page.

https://www.google.com/books/edition/Capital_in_the_Twenty_F...


> Just a fairly arbitrary number. It's easy to produce a few lines from memory, up to 10s of lines and that's "obviously" fair use.

it's anything but obvious. https://www.copyright.gov/fair-use/

> there is no formula to ensure that a predetermined percentage or amount of a work—or specific number of words, lines, pages, copies—may be used without permission.

9 lines of very run-of-the-mill code in Oracle / Google weren't considered fair use.


A big difference is that software both is and isn't an artistic work.


It's not possible to get copilot to output a transformed version of the input?


Transformed output _may_ fall under fair use.

However - Copilot directly recites code. That is _very unlikely_ to fall under fair use.

Redistributing the exact same code, in the same form, for the same purpose, probably means that Copilot, and thus the people responsible for it, are infringing.


> However - Copilot directly recites code.

You make that statement as an absolute, but in the interests of clarity, all evidence so far shows that it directly recites code very rarely indeed. Even the Quake example had to be prompted by the specific variable names used in the original code.

In practice, the output code is heavily influenced by your own context — the comments you include, the variable names you use, even the name of the file you are editing — and with use it’s obvious that the code is almost certainly not a direct recitation of any existing code.


> all evidence so far shows that it directly recites code very rarely indeed.

_Once_ is enough for it to be infringing. The law is not very forgiving when you try and handwave it away.


You sound quite sure that the outlying instances of direct copying wouldn't be covered by the Fair Use copyright exemption. Any particular reason for that?

I tend to think it would be covered (provided they were relatively small snippets and not entire functions).


I'm not the person you're replying to, but one strong reason is that the global reach and standardization of copyright law is far broader than the global reach and standardization of the fair use exception. A single non-US country in which GitHub Copilot is used in a way that would be infringing without the US fair use exception, and outside the scope of any such exception in that law, would be enough to cause GitHub/MS a legal hassle. There could well be more than one such country.


Oh, absolutely.

I'm not American, but like others around here — I was just restricting the discussion to American law for simplicity's sake.


Fair, but GitHub/MS (same company now) can't afford to ignore other countries' law in their internal evaluations of whether globally* available products like Copilot are legal.

* Minus a few countries/regions targeted by US sanctions, I assume, though they've gradually broadened their services in sanctioned countries with the necessary licenses from OFAC.


Precedent. Google v. Oracle found 9 lines of an "obvious" implementation to be infringing.


Right, but would 3-4 lines in the middle of a 50 line function also be infringing? What about 2 lines?

I don't know the answer. I was only surprised that the commenter seemed dead sure that any and all copying (no matter how small) would be infringing.

That just doesn't correlate with my understanding of how Fair Use works: The "amount" of the infringement is one (of several) factors in determining if something falls under Fair Use:

>The third factor assesses the amount and substantiality of the copyrighted work that has been used. In general, the less that is used in relation to the whole, the more likely the use will be considered fair.

From https://en.wikipedia.org/wiki/Fair_use


So if a foreign company pilfers the source code to Windows, can they add it to a training set and then 'prompt' the machine learning algorithm to spit out a new 'copyright free' Windows, just by transforming the variable names?


I think that's my question regarding this whole thing:

If it's so fair use, why not train it on all Microsoft code, regardless of license (in addition to GitHub.com) ? Would Microsoft employees be fine with Copilot re-creating "from memory" portions of Windows to use in WINE ?


Well no, because only GitHub has access to the training set. But more importantly this misunderstands how Copilot even works -- even if Windows was in the training set, you couldn't get Copilot to reproduce it. It only generates a few lines of code at a time, and even then it's almost certainly entirely novel code.

Now, if you knew the code you wanted Copilot to generate you could certainly type it character by character and you might save yourself a few keystrokes with the TAB key, but it's going to be much MUCH easier to simply copy the whole codebase as files, and now you're right back where you started.


GPT-3 is still Microsoft licensed, but a similar model can be put together with the freely available GPT-2 and source code -- especially if your intent is copyright transfer.

As Francois Chollet points out in this talk, ultimately deep neural network models are locality-sensitive hash tables, so the examples of people pulling out source code reflect an inherent shortcoming of deep learning models in general. Give the right 'key' and you can 'recall' the value you are looking for.
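
A toy illustration of the "right key recalls the value" idea, using grid quantization as a crude stand-in for locality-sensitive hashing. This is only an analogy for the behavior described, not a model of how a neural network stores anything; `ToyLocalityTable` and its parameters are invented for the example.

```python
class ToyLocalityTable:
    """Keys are 2-D points; nearby points usually fall in the same grid cell."""

    def __init__(self, cell=1.0):
        self.cell = cell
        self.buckets = {}

    def _bucket(self, key):
        # Quantize each coordinate so that close keys tend to share a bucket.
        return tuple(round(x / self.cell) for x in key)

    def store(self, key, value):
        self.buckets[self._bucket(key)] = value

    def recall(self, key):
        # A key that lands in an occupied bucket gets the stored value back verbatim.
        return self.buckets.get(self._bucket(key))


table = ToyLocalityTable()
table.store((1.02, 2.98), "memorized snippet")

near_hit = table.recall((0.9, 3.1))   # a prompt close to the original key
far_miss = table.recall((5.0, 5.0))   # an unrelated prompt
```

Here `near_hit` returns the stored value even though the key is not identical, while `far_miss` returns `None`, which mirrors prompts that resemble the original context pulling out the original code.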

https://www.youtube.com/watch?v=J0p_thJJnoo


> "However - Copilot directly recites code."

Sounds like that wouldn't be difficult to fix? Transform the code to an intermediate representation (https://en.wikipedia.org/wiki/Intermediate_representation) as a pre-processing stage, which ditches any non-essential structure of the code and eliminates comments, variable names, etc., before running the learning algorithms on it. Et voila, much like a human learning something and reimplementing it, only essential code is generated without any possibility of accidentally regurgitating verbatim snippets of the source data.


At that point, can we all just agree IP is the stupidest concept to ever be layered on top of math (which programming is) and move on with non-copyrightable code?


Only if you agree that copyleft licenses are also stupid; without copyright, there's no way to prevent companies from making closed-source forks of code you wrote and intended to stay open.


The whole point of copyleft was as a stepping stone to get to RMS's four freedoms (https://www.gnu.org/philosophy/free-sw.en.html) which effectively eliminates copyright for software.


Freedom 1: “Access to the source code is a precondition”

With no copyright/copyleft, how do you enforce the rule that derived works must provide access to the source code? I’ve never heard that copyleft was a stepping stone—rather, it’s the stick that fully realizes the four freedoms.


Correct. Copyleft is idiocy as well. You don't really need to pay for a proprietary fork of a tool when no one can keep you out of the free one, and the proprietary stuff diffuses into the free option.


Yes, sure. Without copyright there's no need for copyleft left, right?


No...? Not unless that closed-source project's source code is leaked?


You don't care about attribution and other moral rights ?

(I guess these are going to depend a LOT on the jurisdiction that you're in ?)


I care, but in the long run, I care more about our descendants not having tools locked out of their hands. Facilitated information asymmetry is the root of far too many evils.

Where is your ego when you're dead and gone? Where could we be if the majority of human advancement were not tightly clutched as trade secrets?

As someone who has done paid software engineering (yes, you can feel free to call me a hack or sell out if you wish), I've come to find that the salary I've pulled over the years has not gone to me... But keeping a roof over those I love, helping other people's projects grow, giving people a shot, etc.

My time on the other hand, gets dumped into implementing the same handful of processes doing the same damn thing, but different this time, because you can't just bloody make "Here ya go, here's your Enterprise-in-a-box".

I'd like more people to be able to solve novel problems rather than having to retread the same path over and over. Some degree of that will always have to be done to keep the skills fresh in the population, but we could do way better at marshaling that split, and I'm convinced part of what necessitates it is creating artificial barriers through things like enforced implementation monopolization. Yes. It ensures a minimum level of novelty and variance across populations, but it also does terribly at not consuming the finite amount of human capacity for truly novel thought to innovate.

It may make societies that function based on greed and economic/fiscal measures work, but I'm not convinced other incentive structures won't keep the rolling stone of innovation from accruing moss.


I don't understand what you're talking about, I'm talking about the non-commercial parts of the monopoly rights that are copyrights and patents, the non-commercial parts arguably aren't going to restrict the users much, and their commercial parts are temporary by design.

(Copyright has IMHO gone overboard with its duration; we should scale it back to the original 14 years renewable once, just like patents, but copyright doesn't apply to processes anyway, and so arguably it shouldn't apply to software that can't claim to have any artistic merit.)


> By comparison, Copilot is even more obviously fair use.

Not sure I see it that way.

If I take your hard work that you clearly marked with a GPL license and then make money from it, not quite directly, but very closely, how is that fair use? Or legal?

Copying and storing a book isn't recreating another book from it. Copilot is creating new stuff from the contents of the "books" in this case.

Edit: I misunderstood fair use as it turns out...


Google did not scan those books and use it to build new books with different titles. The comparison doesn't hold up at all.


> Google did not scan those books and use it to build new books with different titles. The comparison doesn't hold up at all.

Not sure if you meant to reply to me but I agree with you: you can't compare what Google did to what Copilot does.


Copilot just suggests code.


And someone accepts it. Even if suggesting derivatives of licensed code is not a license infringement, then Copilot sure is a vector for mass license infringement by the people clicking "Accept suggestion". And those people are unable to know (without doing extensive investigation that completely nullifies the point of the tool) whether that suggestion is potentially a verbatim copy of some existing work in an incompatible license.


If I suggest whole lines of dialogue to you, the screenwriter, did I write those lines or you? If you change names in those lines of dialogue to fit your story, do you now gain credit for writing those lines?

Suggesting code is generating code


> did I write those lines or you

Neither. Someone else did, and published it. Copilot copied the dialog and suggested it.

> If you change names in those lines of dialogue to fit your story, do you now gain credit for writing those lines?

It depends. Talking generalities isn't productive or interesting. Can you give an example and we can discuss specifics?

> Suggesting code is generating code

This isn't even superficially true


There are situations where the question is whether the mishmashes from Copilot are 'fair use'.

But the other, more direct question is ... what about the instances where Copilot doesn't come up with a learned mishmash result? What happens when Copilot just gives you a straight up answer from its learning data, verbatim?

Then you, as a dev, end up with a bunch of code that is effectively copied, via a 'copying tool', which is GPL'd?

It's that specific case that to me sticks out as the 'most concerning part'.

Please correct me if I'm wrong.


For your specific case, “take your hard work that you clearly marked with a GPL license and then make money from it”, you don’t even need to rely on fair use. As long as you comply with the terms of the GPL, making money with the code is perfectly acceptable, and the FSF even endorses the practice. [1] Red Hat is but one billion-dollar example.

[1] https://www.gnu.org/licenses/gpl-faq.en.html#DoesTheGPLAllow...


But the person making money from the GPL code has to follow the terms of the license. Attribution, sharing modifications, etc.


Correct. That's why I said "As long as you comply with the terms of the GPL".


I've edited my comment with examples and a clarification.

Fair use is an exception to copyright and, by definition, copyright licenses.


I understand the concept of fair use (I think) but I can't see how it applies to Copilot.

Google didn't create new books from the contents of existing ones (whether you agree that they should have been allowed to store the books or not) but Copilot is creating new code/apps from existing ones.

Edit: I guess my understanding of fair use was wrong. I stand corrected.


If Google Books were creating new books, that would only help their argument. Transformativeness is one of the four parts of the fair use test.

Copilot producing new, novel works (which may contain short verbatim snippets of GPL works) is a strong argument for transformativeness.


It would help the transformativeness, but it would substantially change the effect upon the market. By creating competing products with the copyrighted material, there is a higher degree of transformativeness, but you also end up disrupting the marketplace.

I don't know how a court would decide this, but I do think the facts in future GPT-3 cases are sufficiently different from Authors Guild that I could see it going either way. Plus, I think the prevalence of GPT-3 and the ramifications of the ruling one way or another could lead some future case to be heard by the Supreme Court. A similar case could come up in California, or another state where the 2nd Circuit Authors Guild case isn't precedent.


> short verbatim snippets of GPL works

Define short


I don't think that's an accurate description...

Fair use is a defense for cases of copyright infringement, which means you're starting off from a case of copyright infringement, which sort of mucks up the whole "innocent until proven guilty" thing. And considering it's a weighted test, it's hardly very cut-and-dried at that.


If you view GPL code with your browser would that mean that your browser now has to be GPL as well? In the sense that copilot is not much different than a browser for Stack Overflow with some automation, why would it need to be GPLed? Your own code on the other hand…


For the sake of discussion, it would be clearer to split copilot code (not derived from GPL'd works) and the actual weights of the neural network at the heart of copilot (derived from GPL'd works via algorithmic means).

For your browser analogy, that would mean that the "browser" is the copilot code, while the weights would be some data derived from GPL'd works, perhaps a screenshot of the browser showing the code.

I'd think that the weights/screenshot in this analogy would have to abide by the GPL license. In a vacuum, I would not think that the copilot code had to be licensed under GPL, but it might be different in this case since the copilot code is necessary to make use of the weights.

But then again, the weights are sitting on some server, so GPL might not apply anyway. Not sure about AGPL and other licenses though. There is likely some illegal incompatibility between licenses in there.


As I understand it, the thing copilot tries to do is automate the loop of “Google your problem, find a Stack Overflow answer, paste the code from there into my editor”. In that sense, the burden of checking the license of the code being copy-pasted is on the person who answered the SO question and on me. If this literally was what copilot did, nobody would bat an eye that some code it produced was GPL or any other license because it wouldn’t be copilot’s problem.

Now let’s substitute a different database for the code, one that isn’t SO. It doesn’t really matter if that database is a literal RDBMS, a giant git repo or is encoded as a neural net. All copilot is going to do is perform a search in that database, find a result and paste it in. The burden of licensing is still on me to not use GPL code and possibly on the person hosting the database.

The gotcha here is that copilot’s database is a neural network. If you take GPL code and feed it as training data to a neural network to create essentially a lookup table along with non-GPL code, did you just create a derived work? It is unclear to me whether you did or not. In particular, can the neural network itself be considered “source code”?


> If you view GPL code with your browser would that mean that your browser now has to be GPL as well?

Some good responses in sibling comments already, but I don't see the narrow answer here, which is: No, because no distribution of the browser took place.

If you created a weird version of the browser in which a specific URL is hardcoded to show the GPL'd code instead of the result of an HTTP request, and you then distributed that browser to others, then I believe that yes, you'd have to do so under the GPL. (You might get away with it under fair use if the amount of GPL'd code is small, etc.)


If you use your browser to copy some GPL code into your project your project must now be GPL as well.

So following your own argument, even if Copilot is allowed, using it still risks you falling under GPL


My point exactly. Copilot is innocent in that case just like the browser.


Or if you simply read GPL code and learn something from it - or bits of the code are retained verbatim in your memory, are you (as a person) now GPL'd? Obviously not.


That probably depends on how large and how significant the bits you remember are. Otherwise one could take a person with photographic memory and circumvent all GPL licenses easily, by making that person type what they remember.


> Or if you simply read GPL code and learn something from it - or bits of the code are retained verbatim in your memory, are you (as a person) now GPL'd? Obviously not.

I do not find that to be obvious at all.


You do not find it obvious that a human being would not become a GPL'd work?


To build a browser you don't need verbatim GPL code, so it's not a derivative work in the same sense copilot is.

Stackoverflow on the other hand is a much trickier question...


SO clearly doesn’t need GPL code to be useful. The wider SE network is evidence of that.


> If I take your hard work that you clearly marked with a GPL license and then make money from it, not quite directly, but very closely, how is that fair use? Or legal?

If I'm Google, and I scan your code and return a link to it when people ask to find code like that (but show an ad next to that link for someone else's code that might solve their problem too), that's fair use and legal. My search engine has probably stored your code in a partial format, and that's fine.


It's fine because a search engine is a generic tool the main purpose of which is not to replicate the code verbatim to be used as code.


>If I take your hard work that you clearly marked with a GPL license and then make money from it, not quite directly, but very closely, how is that fair use? Or legal?

You can wipe your ass with the GPL license if your use of the product falls within Fair Use.

You can actually take snippets from commercial movies and post them onto YouTube if your YouTube video is transformative enough for your usage to be considered fair use. Well, theoretically at least - in reality YouTube might automatically copyright strike it.

>Copying and storing a book isn't recreating another book from it.

That doesn't mean that GitHub has to redistribute Copilot under GPL. However, the end user could potentially have to if they use Copilot to generate new code that happens to copy GPL code verbatim.


> You can wipe your ass with the GPL license if your use of the product falls within Fair Use.

Is Copilot fair use? It's reading code, generating other code (some verbatim) and making money from it all while not having to release its source code to the world?

> That doesn't mean that GitHub has to redistribute Copilot under GPL

I wasn't saying that was the case: some of the code that Copilot used may not allow redistribution under GPL.

But let's say that all of the code it scanned was GPL for the sake of argument. Why would they not have to distribute their Copilot source yet, if I use it to generate some code, I'd have to distribute mine?

My spidey-sense is tingling at that one!


> Is Copilot fair use? It's reading code, generating other code (some verbatim) and making money from it all while not having to release its source code to the world?

Again, fair use is an exception to copyright protection. If something is fair use, the license does not apply. The fact that Copilot does not release its source code is related only to a specific term of a specific license, which does not apply if Copilot is indeed fair use.


Making money is irrelevant to fair use



Irrelevant to GPL maybe.


> By comparison, Copilot is even more obviously fair use.

You are correct about the (US-specific) fair use exception, but it is in no way as clear as you suggest that what copilot is doing falls entirely under fair use. Fair use is always constrained.

I suspect some variant of this sort of thing will have to be tested in court before the arguments are really clear.


> ...the non-obvious thing for many developers is that fair use is an exception to copyright itself.

More precisely, fair use is an affirmative defense to a claim of copyright infringement. A fair use defense basically says, "Yes, I am copying your copyrighted material and I don't have a license (or am exceeding a licensed use), but my usage is allowed under the fair use doctrine (codified in 17 USC 107 in US law)."


Thanks for this, but can you answer the question:

Would it be 'fair use' for the developers to simply copy code from those repos - even just 10 lines, and claim 'fair use' - i.e. circumventing Copilot?

Even if Copilot is 'fair use' ... does that mean the results are 'fair use' on the part of Copilot users?

And a bigger question: is your interpretation of those statutes and case law enough to make the answer unambiguous?

I don't have legal background, but I do have an operating background with lawyers and tech ... and my 'gut' says that anyone using Copilot is opening themselves up to lawsuits.

If the code you put in your software comes via Copilot, but that code is verbatim from some kind of GPL'd (or worse, proprietary) code ... there's a good chance you could get sued if someone gets the inclination.

Maybe it's because of my personal experience, but I can just see corporate lawyers banning Copilot straight up as the risks are simply not worth the upside. That's not what we like to hear in the classically liberal sense i.e. 'share and innovate' ... but gosh it doesn't feel like a happy legal situation to me.

Looking forward to people with more insight sharing on this important topic.


> Would it be 'fair use' for the developers to simply copy code from those repos - even just 10 lines, and claim 'fair use' - i.e. circumventing Copilot?

Only a lawyer (and truly, only a court) could answer that question.

If you copy 100 lines of code that amounts to no more than a trivial implementation in a popular language of how to invert a binary tree, it's likely fair use.

If you copy 10 lines of code that are highly novel, have never been written before, and solve a problem no one outside the authors has solved... It may not be fair use to copy that.

Other people who have replied have mentioned "the heart" of a work. The US Supreme Court has held that even de minimis - "minimal", to be brief - copying can sometimes be infringement if you copied the "heart" of a work.


If this issue is eventually litigated, we will see. The law in the Second Circuit (where the final judgment was rendered before the case was eventually settled) may well be different than the law in a different circuit. If there is a split in the circuit courts, then the Supreme Court may have to weigh in on this issue.

When fair use is an issue, the courts look at the facts in context each time. These are obviously different facts than scanning books for populating a search index and rendering previews; and each side is going to argue that the facts are similar or that they are dissimilar. How the court sees it is going to be the key question.


This could either be:

1. a fascinating Supreme Court opinion.

2. a frustrating ruling because SCOTUS doesn't understand software and code.

3. the type of anticlimactically narrow ruling typical of the Roberts court.

While our Congresspersons can't seem to wrap their minds around technology/social media, I think SCOTUS would understand this one enough to avoid (2).


Fair use cases tend to produce narrowly-written law because the outcomes hinge on how the court judges the facts against the list of factors codified in the Copyright Act (17 U.S.C. section 107). The courts don't really have breathing room to use a different test. I don't recall any cases in which the courts have set binding guidelines for interpretation of these factors.


The Google vs Oracle case showed that SCOTUS can handle technical topics


Next up, Copilot for college papers! Who needs to pay a professional paper-writer (ahem, I mean write the paper) when you can have an AI write your paper for you! It's fair use, so you're entitled to claim ownership to it, right?


I think you are confusing legal protections for intellectual property with plagiarism. (At least that's what I think you're doing if I read your comment as sarcasm and guess what you're trying to say non-sarcastically?) But they are entirely different things.

You can be violating copyright without plagiarizing: you cite your source, but you copy a copyright-protected work in an illegal way when doing so.

And you can be plagiarizing without violating copyright, if you have the permission of the copyright holder to use their content, or if the content is in the public domain and not protected by copyright, or if it's legal under fair use -- but you pass it off as your own work.

Two entirely separate things. You can get expelled from school for plagiarism without violating anyone's copyright, or prosecuted for copyright infringement without committing any academic dishonesty.

You can indeed have the legal right to make use of content, under fair use or anything else, but it can still be plagiarism. That you have a fair use right does not mean "Oh so that means you are allowed to turn it in to your professor and get an A and the law says you must be allowed to do this and nobody can say otherwise!" -- no.


Yeah, I was being sarcastic. But you make a good point about the legality of plagiarism.


Copilot is not doing what your example does.

If Github had a service that automatically mirrored public repositories on Gitlab, that would be equivalent to the example you gave.

But Github is taking content under specific licenses to build something new for commercial use.

I'm not sure if what Github does falls under Fair Use, but I don't know that it matters. I can read fifty books and then write my own, which would certainly rely—consciously or not—on what I had read. Is that a copyright violation? It doesn't seem like it is but maybe it is and until now has been impossible to prosecute?


GitHub isn’t building anything.

The end user is.

By this logic any and all neural nets that draw pictures are copyright infringing as well.


If they create exact copies of copyrighted pictures, then yes, they do.


> Fair use is an exception to copyright itself. A license cannot remove your right to fair use.

...and if you're outside the USA?


Read the Authors Guild v Google dismissal. The court considered it fair use because Google's project was built explicitly to let users find and purchase books, giving revenue to the copyright holders. Copilot does not do that.


> ... giving revenue to the copyright holders.

That's a reference to factor four of the fair use test, "the effect of the use upon the potential market for or value of the copyrighted work." (17 USC 107).

None of the factors are dispositive, however. For example, a scathing book review that quotes a passage to show how bad the writing is might eviscerate sales of the book, but such a use is usually protected. For a counter-example, see Harper & Row v. Nation Enterprises 471 U.S. 539 (1985).


> Note, however, that there is no world-wide principle of fair use; what kinds of use are considered “fair” varies from country to country.

Exactly the point I came to make.

The Authors’ Guild is a US entity, and so is Google, so only US law applies. And thus, we have the Fair Use exception.

But developers sharing code on GitHub come from and live all over the world.

Now, Github’s ToS do include the usual provision stating that US & California law applies, et cætera, et cætera [1], but… and even they acknowledge it may be the case, such provisions usually aren’t considered legal outside of the US.

So… developers from outside the US, in countries with less lenient exceptions to copyright, definitely could sue them.

Identifying these countries and finding those developers, however, is a different matter altogether.

[1]: https://docs.github.com/en/github/site-policy/github-terms-o...


This was a good point. Really enjoying this discussion. Interesting stuff.

I'm really out of my depth in giving my own opinion here, but I'm not sure that either the "distribution != derivative" characterization, or that "parsing GPL => derivative of GPL" really locks this thing down. The bit that I can't follow with the "distribution != derivative" argument is that the copilot is actually performing distribution rather than "design". I would have said that copilot's core function is generating implementations, which to me does not seem like distribution. This isn't a "search" product, and it's not trying to be one. It is attempting to do design work, and I could see a case where that distinction matters.


I buy the argument about copilot itself and this comment. But when someone goes to release software that uses the output of Copilot, I fail to see how it wouldn’t be a GPL derivative work if enough source was used. Copilot is essentially really fancy copy/paste in that context.


I think this is the correct answer. IANAL but the copilot code vs the copilot training data are different things and licensing for one shouldn’t affect the other, right? And the fact that training data happens to also be code is incidental.


One view would be that copilot the app distributes GPL'd code, in a weird encoding. Training the model is a compilation step to that encoding


I assume the code is a derivative work of the training data because, given different data, the code (neuron weights) would also be different


If I read a GPL implementation of a linked list and then write my own linked list implementation, was my neural network in my brain a derivative work of the GPL code?


Sure it is, your brain is not software though


So as long as I read GPL code, then rewrite it from memory and feed it to copilot to train it I can unGPL anything?


Is it fair use to memorise whole source code byte-by-byte, storing it with e.g. some not-100%-lossless compression for subsequent retrieval of arbitrary-size snippets?


If copilot was trained using the entirety of the linux kernel, wouldn't the neural network itself need to be GPLed, if not its output?


> Even without a license or permission, fair use permits the mass scanning of books, the storage of the content of those books, and rendering verbatim snippets of those books.

For commercial use and derivative works?

Authors won't incorporate snippets of books into new works unless they're reviews. Copilot is different.


Google Books is a commercial site which incorporated the snippets of millions of copyrighted works. And of course, sitting in thousands of Google servers/databases are full copies of each of those books, photos of each page, the OCRed text of each page, and indexes to search them. Even that egregious copying without a license or permission was considered fair use.

If anything, the ways in which Copilot is different aid Microsoft/GitHub's argument for fair use. Because Copilot creates novel works, that gives them a strong argument their system is more transformative than Google Books, which just presents verbatim copies of books.


The Google books example really misses the point: one of the reasons why the judges considered it fair use was because it was pointing back to the original sources (and thus potentially increasing publishers' earnings).

Copilot does none of that. If all the ML companies are so sure this is fair use I encourage them to train an AI on Disney movies to generate short cartoon snippets based on some description. There sure would be a court case.


The main issue here is less doing it than getting sufficiently nice results. I've done work in generative AI before and right now the state of the art is passable on single images with some but not enough control and is still weak on videos without heavy structure requirements. I expect in 5-10 years we will have good enough models (or hardware) to do short video generation and the question will get tested then. I also think a meaningfully good video requires audio, and have fun making well-aligned text (for dialogue), audio of that text, and video frames. Aligning all that generation together is still challenging today.


> Authors won't incorporate snippets of books into new works

Of course they do, previous works are quoted all the time.


But that's another thing - co-pilot doesn't quote; it encourages something more akin to plagiarism, doesn't it?


Plagiarism, pretending you made a work entirely yourself when you didn't, is rarely a matter for a court to decide and the standards for what constitutes plagiarism can vary a lot. When I turn in projects for a course, I cite sources in the comments a lot, even if what I turn in is substantially modified. An employer generally doesn't care if you copied and pasted code from StackOverflow or wherever, so long as you don't expose them to a suit and you don't lie if asked "Did you write this 100% yourself?"

Citing your source is not a get-out-of-jail-free card for copyright infringement; it doesn't really matter.


> Citing your source is not a get-out-of-jail-free card

No, but it's a requirement of the license stackoverflow.com uses, which is unfortunate, for code (as opposed to text, where a quote can be easily attributed).


...with attribution.


And without. Attribution isn't a "copyright escape clause", copying a work without permission is still infringement - unless it's fair use.

Plagiarism is not the same as infringement.


Can you still apply Fair Use if they make Copilot a paid service?


Does intent not matter? Pasting code for explanatory reasons and citing the source seems different than silently incorporating it directly into a commercial work product.


> Fair use is an exception to copyright itself.

And copyright itself is an exception to the normal state of things : the public domain, copyright being only a temporary monopoly.


Assuming that Copilot's use of GPL'd code to provide snippets to a developer is fair use, what rights does the developer have to using that snippet?


Can you copy 10 lines of code from an open source project into your software? Yes you can, it's considered fair use. Nobody will ever sue for that. If it weren't, websites like StackOverflow, where developers post code probably taken from projects with some restrictive license and other developers copy it into their projects, would not exist.

Copilot will not write an entire software module, it will provide you with snippets. I see using GPL code for training as fair use. If a developer reads the source code of a project to take inspiration and possibly copy some small parts, does it violate the license?


When the recent Github v. youtube-dl fiasco happened, I remember reading similarly strongly-worded but dismissive comments regarding fair use, stating how it is quite obvious that youtube-dl's test code could never be fair use and how fair use itself is a vague, shaky, underspecified provision of the copyright law which cannot ever be relied on.

To me, seeing youtube-dl's case as fair use is so much easier than using hundreds of thousands of source code files without permission in order to build a proprietary product.


How would you feel about a paid-for search engine using hundreds of millions of web pages without permission in order to build a proprietary product?


There is a crucial difference though: the search engine links back to the content. If Google just displayed the content on their own pages verbatim, it would definitely not be considered fair use. Even as it is, several countries have restricted what Google can do when displaying e.g. news.


Somehow building a list of pointers to original content simply does not have the same ring to me as a product that rehashes all of the content. A rehashing of content sounds to me much more like, for example, publishing a sequel to my favourite book. After all, a sequel is just a rehashing of the same characters in new adventures. If we can't do that, why should Copilot be fine?

My point was however that I'm just utterly failing to see how the youtube-dl test thing could be more of a copyright problem than this entire thing based on millions of others' works that is Copilot.


You mean like a search engine?


This is a thoughtful and insightful reply. Thank you.


Books (mostly) are not distributed under the GPL.


True. But Pretty Good Privacy might be worth considering in this context - it was at one point published as a book after all...

https://philzimmermann.com/EN/essays/BookPreface.html


The GPL only gives you additional permissions relative to what you would have by default. The books included in that suit were more strongly restricted, since there was no license at all.


There are certainly some interesting additional conditions the GPL creates by taking the license away if you violate certain clauses. Regardless, the interesting part of this is that this looks different from the user's point of view and Microsoft's. Sure, 5 lines out of 10,000 is probably fair use. For Microsoft, their system is using the whole code base and copying it a few lines at a time to different people, eventually adding up to potentially lots more than fair use.

The question on this one will be about the difference between Microsoft/Github's product and a programmer using copilot's code:

"If I feed the entire code base to a machine, and it copies small snippets to different people, do we add the copies up, or just look at the final product?"
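To make the "add them up or not" question concrete, here is a toy sketch with made-up numbers. It assumes a hypothetical 10,000-line GPL'd file and a hypothetical log of which lines were reproduced verbatim for which user; it says nothing about how Copilot actually works or how a court would count.

```python
from collections import defaultdict

# Hypothetical size of a single GPL'd source file (illustrative only).
TOTAL_LINES = 10_000


def coverage(served):
    """Return (largest per-user fraction, aggregate fraction) of the file copied.

    `served` is an iterable of (user, line_number) pairs, i.e. a made-up log
    of which source lines were reproduced verbatim for which user.
    """
    per_user = defaultdict(set)
    for user, line in served:
        per_user[user].add(line)
    # Union of every line handed out to anyone.
    all_lines = set().union(*per_user.values()) if per_user else set()
    worst_user = max((len(lines) for lines in per_user.values()), default=0)
    return worst_user / TOTAL_LINES, len(all_lines) / TOTAL_LINES


# 1,000 hypothetical users each receive a different 5-line snippet.
served = [(f"user{u}", u * 5 + i) for u in range(1000) for i in range(5)]
per_user_max, aggregate = coverage(served)
print(per_user_max, aggregate)
```

Each individual user holds 5 lines out of 10,000, while the lines handed out across all users cover half the file. Which of those two numbers matters is exactly the open question.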


Does the GPL forbid fair use? Why don't book publishers use a license that forbids fair use?


Because fair use is an exception to copyright itself. A copyright license can't take away your legal right to fair use.


> Why don't book publishers use a license that forbids fair use?

They couldn't do it with a license, which only imposes conditions for the license to be valid. Fair use applies even if the copier has no license at all.

Potentially they could do it with a contract. A license is not a contract and imposes no covenants on the parties.


While I agree you are correct that (in the US anyway) fair use is an exemption from copyright and thus supersedes licensing, I disagree that Copilot is "more obviously fair use". Some parts might be, but we have seen clear examples (i.e. verbatim code reproduction) that would not be.

I don't believe the question of "is this fair use" is as clear as you believe it to be


Just for reference, the Hacker News source is public.


Not the current version? AFAIK there's some security-by-obscurity in the measures against spam, voter rings etc ?



