Hacker News
Tell HN: GitHub Copilot can now block suggestions matching public code
70 points by mmastrac on June 25, 2022 | 87 comments
As of a few minutes ago, it prompted me to update my settings:

"GitHub Copilot can allow or block suggestions matching public code. See GitHub Copilot FAQ to learn more."

The FAQ appears to be missing.




There were many personally identifiable information (PII) leaks when it initially launched, which suggests that they snooped through OSS code and violated licenses. Here are answers to the two main things Copilot sympathizers ask.

1. How do you know Co-pilot violated OSS licenses?

> The aforementioned PII and secrets were emitted. And IIRC a support person said they even copied GPL-licensed code, so there is nothing stopping them from copying MIT-licensed code, which is still a violation without attribution. So there's no reason to give them the benefit of the doubt.

2. Humans copy code all the time, so why can't an AI? (Or something along those lines...)

The answer has two parts.

1. A human can only copy so much. Most humans are not acting in bad faith, and those who are don't make up a humongous portion. An AI just scrapes code at a mass scale that is impossible for a human, even with automation. It's also easier for a model to learn from this training data than for someone to scrape the data and maintain a DB. This is common sense: none of our laws, policies, or systems were made with AI in mind, so it is a careless comparison.

2. If you are not adhering to OSS licenses then YOU ARE in violation. Not being called out or punished DOESN'T MEAN you are right; it just means you are getting away with it.

This is just a PR exercise for MS. They already have the data and the product, and nobody has the balls to question them. Normalizing this bad behavior while their PR paints them as "the guardians of OSS" is literally taunting the entire OSS community.

This is where I hoped FSF, EFF, SFConservancy or any other group would step up. Nothing so far. Just yelling into the void hoping MS will suddenly do the right thing.


> There were many personally identifiable information (PII) leaks when it initially launched

> The aforementioned PII and secrets were emitted

It's not trained on private GitHub repos. Any secrets it generated would have been already public and compromised.

> Which suggests that they snooped through OSS code and violated licenses

That it was trained on FOSS code is already known. Whether doing so violates the licenses is probably up for courts to decide.

> An AI just scrapes code at a mass scale that is impossible for a human, even with automation

Human programmers can do this with search engines.


> Any secrets it generated would have been already public and compromised.

Two things..

1. This is not some vigilante hacking group you are talking about. This is MS - a humongous corp. They know doing this is wrong, so it shouldn't matter. It is still illegal, and making a proprietary software product out of others' code is exactly the kind of thing these protections exist to prevent. Don't you think?

2. Already public unintentionally. AFAIK, any code that I produce is automatically copyrighted to me. This means if I write something in public and don't provide a license, IT IS LEGALLY under the copyright protection provided to me by my country. At least that is the case in the US and India, which are home to a huge portion of OSS. Do correct me if I am wrong. Putting it out like that in public would be plain stupid, for sure, but legally it is still mine. Reproducing it and remixing my work would be illegal. It is just a PITA for me to prove that in court, is all.

> Whether doing so violates the licenses is probably up for courts to decide.

I have yet to see any attribution to ANY OSS code that it was trained on. The PII and secrets are enough to find the license of a repo, which would make it easier to prove whether they violated it or not. Don't tell me all the OSS that Copilot trained on is public domain stuff; even the ISC license requires attribution.

> Human programmers can do this with search engines.

OK. CAN. And that is wrong too. So when they do, we should incriminate them as well if evidence is out there or if a matter such as this comes to light. How does that change anything I mentioned? Very curious.


> So it shouldn't matter.

It definitely does matter that any secrets generated were already public:

* Emitting secrets from private repos would be a huge confidentiality issue (though really you shouldn't commit secrets to git at all), as it'd be taking something that's private + exploitable and making it public

* Emitting secrets that are already public doesn't cause the confidentiality issue. Once a secret is out, it's out, and should be changed immediately. By the time it's in Copilot's training set, it'll have already been on search engines/archive sites/black-hat forums/etc.

Tangentially, GitHub does also do some scanning to alert about accidentally committed secrets in repos: https://docs.github.com/en/code-security/secret-scanning/abo...
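
As a rough illustration, scanners of that kind generally work by pattern-matching known token formats. A minimal sketch (these example regexes are illustrative of the approach, not GitHub's actual detection rules):

    // Illustrative pattern-based secret scanner. These regexes match
    // real token formats (AWS access key IDs, GitHub PATs, PEM keys),
    // but they are examples, not GitHub's actual patterns.
    const SECRET_PATTERNS: [string, RegExp][] = [
      ["AWS access key ID", /\bAKIA[0-9A-Z]{16}\b/],
      ["GitHub personal access token", /\bghp_[A-Za-z0-9]{36}\b/],
      ["Private key block", /-----BEGIN (?:RSA |EC )?PRIVATE KEY-----/],
    ];

    function scanForSecrets(text: string): string[] {
      return SECRET_PATTERNS
        .filter(([, re]) => re.test(text))
        .map(([name]) => name);
    }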

> 2. Already public unintentionally.

Right, but therefore already compromised and no longer confidential. Copilot isn't leaking any secrets, someone else did by making them public.

> AFAIK, any code that I produce is automatically copyrighted to me. This means if I write something in public and don't provide a license, IT IS LEGALLY under the copyright protection provided to me by my country. At least that is the case in the US and India, which are home to a huge portion of OSS.

Essentially correct, to my understanding. If you're making it public, you'll generally also give some hosting/publishing/distribution rights to the services involved - as specified by their T&C.

> Reproducing it and remixing my work would be illegal

The US has the concept of fair use, which provides exceptions for "transformative" purposes. For example: copying and downscaling your image to use as a thumbnail, caching the webpage your work is on, or creating a parody of your work.

Consider Google Books for example, where Google scanned millions of copyrighted books and made them searchable (showing snippets). This was ruled fair use due to being transformative.

The question would be whether code generated by Copilot falls under this. Ultimately it's up to the courts to decide, but I'd lean in favor of "yes".

> The PII and secrets are enough to find the license of a repo, which would make it easier to prove whether they violated it or not. Don't tell me all the OSS that Copilot trained on is public domain stuff; even the ISC license requires attribution.

Fair use is about unlicensed usage, so if it's fair use then it doesn't need to abide by the terms of the licenses. Even if it's ruled not to be fair use, I think they could still train it on GitHub-hosted code due to the mentioned rights you give them by agreeing to GitHub's T&C.

> How does that change anything I mentioned? Very curious.

It changes your claim of impossibility; now it's just about whether there's a violation.


Who cares if it copies licensed code?

If it increases developer productivity by 10-20%, it's a huge net win for global prosperity.


Who cares? Anyone who believes that plagiarism is wrong.

Not to mention anyone that believes in respecting other people’s licensing terms when they’ve provided their code to the world for free.


This view will stand only until you are affected directly. Which is a very careless way to look at it.


CoPilot using code to train a model doesn’t violate any software licenses.


Let's train an AI on the leaked source code of Microsoft, Nvidia, and other commercial companies and find out how soon such a project gets sued and shut down via DMCA.

Might also include source-available code like Unreal Engine, etc.


Not having the resources to defend yourself against large and litigious companies does not mean parent’s point is wrong.

Training CoPilot on code doesn’t violate any software licenses.


> Training CoPilot on code doesn’t violate any software licenses.

Even if it regurgitates large unchanged passages of code, with the advertised selling point of using anything it produces in an unrestricted fashion?


I think they are saying the training part itself doesn’t.

It seems to be kind of a fine point; otherwise you could implement a license-laundering AI by training it on a single codebase and asking it questions about it. If you end up with a byte-by-byte replica sans a LICENSE file, presumably you wouldn't get away with actually _using_ the model to clone licensed software, even if training it was technically OK.


What if I train a model on one snippet of code, and it always produces that snippet of code for all inputs? If I did not have a license to use that code, would laundering it through a model absolve me of violating the license?
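
Such a degenerate "model" is trivial to write, which is what makes the question pointed. A toy sketch (all names made up):

    // A "model" trained on one snippet: it has simply memorized its
    // training data and emits it verbatim for every prompt.
    class OverfitModel {
      constructor(private readonly trainingData: string) {}

      generate(_prompt: string): string {
        return this.trainingData; // byte-for-byte reproduction
      }
    }

    // "Laundering": the output is the licensed code, minus the license.
    const gplLicensedSource = "/* imagine a GPL-licensed file here */";
    const model = new OverfitModel(gplLicensedSource);
    const laundered = model.generate("anything"); // === gplLicensedSource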


And if yes, what happens if I replace the word "model" with "employee"? Suddenly the context changes?


Nothing changes, it is still a violation


This is something I don't understand, not just about Copilot but about many NN generators: the outputs sometimes seem like an obvious ripoff of something, such that there's no way they qualify as novel work, although I can't prove it. Yet the outputs are treated as if all copyright issues have been discussed and cleared. Just how come?


> Yet the outputs are treated as if all copyright issues have been discussed and cleared. Just how come?

Anyone developing and trying to sell ML models will try hard to act like this is a settled question. This is obviously a huge uncertainty factor for using various commercial ML models.


The copyright issues are offloaded onto the user/developer.

If Copilot suggests anything under copyright and you publish it, you get sued.


That doesn’t have anything to do with my comment.

The model is 100% compliant with open source licenses.


Are you drawing a distinction between creating the model, which depends on viewing open source code and is presumably fine, and using or selling the model, which may output licensed code and get you in trouble?


> CoPilot using code to train a model doesn’t violate any software licenses.

The argument GitHub advanced for this was absolutely ridiculous: that code stops being code when CoPilot analyzes it, qualifies only as text while CoPilot analyzes or suggests it, but magically turns back into code if/when the user incorporates the suggested text into their code base.

The real argument GitHub made was much more credible: the CEO came on here and said something like "the legality of code reuse is an interesting new area of law and we welcome the debate," which in lawyer terms means "we have smart, well-informed lawyers, and we can easily get these cases in front of dipshit judges who don't understand technology."

It's an interesting product, and I hear good things about its utility in practice. But the legal argument is utter nonsense. CoPilot definitely violates many software licenses, and knowingly so.


Reading and remembering code is allowed under all the OSS licenses. It's the reproduction of the code that's restricted. The blurry question is always: how much does an expression have to change between it being classified as an exact reproduction, a derivative work, and a novel work?

CoPilot would definitely fail the clean room test, though


The model includes chunks of that code; they copied it into their model.


If the model is learning the nature of code and synthesizing new code, then it may not violate some licenses. When people do this it's usually fine.

However, if the model is simply compressing code it sees into the model weights, just memorizing snippets and then outputting them, that's much more likely to violate licenses. Like when people copy+paste from licensed code without attribution (even if realistically nothing is enforced most of the time).

The truth is likely somewhere in the middle. Let's say 20% of code is stored directly in the weights but the rest is synthesized. That's still a problem for the whole product.

We already know models do a little bit of both depending on the data coverage. Common structures like if/then, loops, etc. are probably "understood" because the model saw them in lots of contexts. However, specific functions, especially those seen only a few times in much the same contexts, are more likely to be copied. There's a spectrum here from shortcut learning to understanding.
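
For what it's worth, one crude way to probe that spectrum from the outside would be to measure the longest verbatim overlap between a model's output and its training corpus. This is a sketch of my own, not any analysis GitHub has published:

    // Length (in tokens) of the longest n-gram from `output` that
    // appears verbatim in `corpus`. High values suggest memorization,
    // low values suggest synthesis. Purely illustrative and slow.
    function tokenize(code: string): string[] {
      return code.split(/\s+/).filter(Boolean);
    }

    function longestSharedNgram(output: string, corpus: string): number {
      const out = tokenize(output);
      const corpusText = " " + tokenize(corpus).join(" ") + " ";
      let best = 0;
      for (let i = 0; i < out.length; i++) {
        // Try the longest candidate first; stop at the first hit.
        for (let j = out.length; j > i + best; j--) {
          const gram = " " + out.slice(i, j).join(" ") + " ";
          if (corpusText.includes(gram)) {
            best = j - i;
            break;
          }
        }
      }
      return best;
    }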

OSS doesn’t really have the access or resources to evaluate this. Github isn’t really incentivized to share any analysis they’ve done here.

What's interesting to me is that their solution to this problem is to put the issue on their users/customers. By default, they crawl everything public, ignorant of licenses, and if the customer has license concerns it's on them to turn on the filter that blocks public-code matches.


The problem only occurs when you are using it to fill mostly empty files, particularly in empty projects.

In an actual codebase, the frequency with which you get 'snippets' is less than 1%, not 20%.

This was discussed in the FAQ for the beta.

The straw man around this topic is wild.

Anyone who has actually used it should know that in an existing project, it very heavily matches the output flavoring to the existing codebase. When there isn't anything to match, it only has the training data to pull from.

Demoing the edge case where it is set up to fail and then pointing and saying "look at how it failed" is a bit silly. And continuing to harp on it in the comments on a piece about how there's now literally a setting to prevent even those edge cases is perhaps even more so.


It's in the FAQ. Scroll to the bottom of this page:

https://github.com/features/copilot/

"What can I do to reduce GitHub Copilot's suggestion of code that matches public code?

"We built a filter to help detect and suppress the rare instances where a GitHub Copilot suggestion contains code that matches public code on GitHub. You have the choice to turn that filter on or off during setup. With the filter on, GitHub Copilot checks code suggestions with its surrounding code for matches or near matches (ignoring whitespace) against public code on GitHub of about 150 characters. If there is a match, the suggestion will not be shown to you. We plan on continuing to evolve this approach and welcome feedback and comment."


If you ask it for a function that reverses a string in JS, I'm sure the solution will match public code. Doesn't this filter very heavily limit its usefulness?
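
For example, the obvious implementation, which countless public repos contain character-for-character, so an exact-match filter would plausibly suppress it:

    // The canonical JS/TS one-liner for reversing a string.
    function reverseString(s: string): string {
      return s.split("").reverse().join("");
    }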


That's a good point. This filter is probably both insufficient and too limiting.


And is there a way to tell GitHub _not_ to analyse repos that I own?


Yes, there is an account setting checkbox labeled "Allow GitHub to use my code snippets for product improvements" - https://github.com/settings/copilot

I'm a bit confused about how this will work in practice, because I don't see how they can retrain their entire model every time someone toggles that setting.


You can't access that page or toggle that checkbox unless you pay for Copilot first. This might be the most literal case I've ever seen of "if you're not paying, you're the product"


Perhaps I'm in some A/B testing cohort, because I certainly haven't paid for anything.


Were you part of the Copilot demo/preview? It appears that doing so leaves your GitHub account in some sort of mostly-enrolled state for Copilot that lets you tweak these settings, compared to an account that wasn't part of the preview, which just takes you to a "start your free trial" page.


That could be it. I've never actually used it, but maybe I applied for the preview at some point.


If your code is e.g. MIT or GPL licensed then someone else could still upload a fork to another GitHub repo.

We might really need a new clause in the license.


The MIT license states:

> The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

If you are using Copilot and it produces an MIT-licensed snippet of code, you are in plain violation of the license. It is wilful copyright violation.


A snippet by definition is not a substantial portion.


'Substantial portion' is not interpreted by you, it is interpreted by courts and juries. How it is interpreted depends on jurisdiction and more[0], the commonly accepted 'safe bet' is any single line of the software.

[0] https://opensource.stackexchange.com/a/2188


> 'Substantial portion' is not interpreted by you, it is interpreted by courts and juries. How it is interpreted depends on jurisdiction and more[0]

Yes, and as your cite says, none of the standard interpretations would cover a snippet.

> the commonly accepted 'safe bet' is any single line of the software.

Github/MS aren't going for the 'safe bet'. They're going for the maximum that they can get away with.


Which would seem to be an excellent hook for a wilful copyright violation lawsuit. This is not an accident.


Is there a way to prevent GitHub from doing this with forks of repos of my code?


It will be interesting to see the next version of robots.txt that begins to appear in public repos to prevent this type of data scraping.


If somebody starts going around the neighborhood shooting paintballs, requiring everybody to wear "please don't shoot paintballs at me" t-shirts seems like a suboptimal solution.

If somebody left code under some license 10 years ago, then that's how they published it and it should be respected. If additional files, changes, or checkboxes are now required on the author's side (and the author may be dead, btw), then it is theater, not a solution.

I could totally go along with "let's abandon licenses, what's online is ours" (though I would prefer starting with patents). But if we want to keep them, then the license itself is enough of a statement. It's not like we have a hard time writing software which interprets them... it's more like "but we would have to do a big refactoring".


The thing that gets me most about all of this is that so many developers here truly believe that they are the only people on earth who can write certain things in code, and that because of that, their code is extremely valuable and should never be found in the hands of another person.

Those same people put their code in public repos on GitHub, with permissive license(s), and don't opt-out of Github's use of their code for things like copilot.

Those same people are now mad at Microsoft because of copilot, and not mad at Amazon for their competing service which does the same things.


This is a really weird comment. I feel like you are trying to make a strawman, but it just looks like a pile of hay.

> so many developers here believe [...] their code is extremely valuable and should never be found in the hands of another person

Usually they are fine with other people reading or using the code, so long as proper attribution is given or derivative works have similar terms. Which is not happening in this case.

The Amazon solution is newer and less widely used or advertised. Presumably all of the arguments about Copilot apply to the Amazon version as well.


> don't opt-out of Github's use of their code for things like copilot.

How does one opt-out, aside from not making the source available at all?


GitHub allows you to opt your repos out if (and only if) you pay for Copilot.


Not true. I just opted out and I do not pay for it.


Care to share how? I would like to opt-out but have no idea how.


I went here and was able to do so: https://github.com/settings/copilot

I'm also late to this article, so maybe this was a bug that was fixed.


All I see there is a button that says “Enable GitHub Copilot”.


Do you have any public repos? Also, I did apply for the beta but I'm not sure if I was ever accepted, so maybe that is why I have the option. I've definitely never paid for anything though. https://i.imgur.com/KBRUSFa.png


I do have public repos. Perhaps that setting there is about sending snippets from the editor? Unsure. But I definitely don’t see that.


> should never be found in the hands of another person

I am absolutely fine with other people using my code.

I am not fine with my code being used by AIs whose only ultimate goal is to make me unemployable while simultaneously generating a bunch of money for some corporations and introducing a bunch of labor and inequality issues in an already shit society.


So what other code was it trained on: private code?


Well, AI doesn't just repeat the things you feed it (particularly text transformers like the one here). Most of the time, the AI will have a "context": a field of relevant text that is tokenized and preserved in memory when generating more text. I'd presume that Copilot uses your current file as context, so it can infer simple things like variable names and libraries that you're using. From there, the model can do a pretty good job walking its way down the decision tree that leads to a working program. So, in a lot of ways, the majority of its responses are actually original, just "inspired" by surrounding code.

Now, as we found, telling the bot to solve for `Q_rsqrt` is something of a creative dead-end: the AI has only come across it a few times, and most likely those instances were implemented the same way. So, with all that being said, I think this is a fine solution to the problem (as long as it works well).
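
To make the "context" idea concrete, here's a toy sketch of that completion loop. The interface and names are hypothetical, not Copilot's actual API:

    // Toy sketch: the editor feeds the current file to the model as
    // context, and the model continues it. Real systems use subword
    // tokenizers (BPE etc.) and much richer prompt construction.
    interface CompletionModel {
      complete(contextTokens: string[], maxNewTokens: number): string[];
    }

    function suggest(model: CompletionModel, currentFile: string): string {
      // Crude whitespace-preserving tokenization of the current file.
      const context = currentFile.split(/(\s+)/).filter(Boolean);
      // The model sees your variable names, imports, and style, which
      // is why suggestions usually match the surrounding codebase.
      return model.complete(context, 64).join("");
    }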


In a sense, it's like playing the opening of Für Elise and asking a classical pianist to "complete this tune". They'll almost certainly play the rest of the original piece - as they should!


I love that. I also love that the opposite can be true: they’ll riff on a classical piece and create something new that still fits the key and time and feel of the initial piece.


This was announced when they announced GA a few days ago, no?


Yes, this option has been there since it came out of beta.


Copilot: it is easier to ask for forgiveness than to ask for permission.


So they got complaints that Copilot regurgitated public code, and then in the aftermath they hashed all public snippets as a band-aid for their GPL offense?


This is likely due to Amazon's version of Copilot offering this feature.


Has anyone produced a variation of the most popular licenses under which the code is not available for training a machine learning model? That would seem to be the most reasonable starting point for addressing the concerns here.


GitHub claims their use of code for Copilot is fair use (i.e. license wouldn't matter).

Nobody has challenged this in court yet to my knowledge.


I'm pretty sure they are right, but that doesn't mean that using the code generated by Copilot is fair use too. Fair use allows Google to index images and expose them in Google Images, but using an image found there elsewhere is almost always copyright infringement (unless you're lucky and the image was in the public domain).

Not a lawyer though.


Indexing has robots.txt to define fine-grained rules for what you want to be indexable; GitHub repos are only public or private.
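
For comparison, robots.txt lets a site owner write per-crawler, per-path rules. Nothing equivalent exists per repo; this is a hypothetical sketch, and "Copilot-Trainer" is a made-up user-agent:

    # Hypothetical robots.txt-style opt-out that repos don't have today.
    User-agent: Copilot-Trainer
    Disallow: /src/
    Allow: /docs/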

So I think there needs to be some way of declaring it. If they think it's fair use, that also means they wouldn't think it was wrong to train on public repos on GitLab, etc. Do they already train on repos outside GitHub?


I don't think you can opt out of fair use in jurisdictions where the concept exists (that would ruin the concept pretty quickly). And robots.txt is merely netiquette; it isn't enforceable in court, AFAIK.

That being said, by using only GitHub-hosted code, they sidestepped the question of whether this is actually fair use, because it's covered by their ToS. And there is no way to opt out of someone's ToS either, unless you delete your account altogether.


OK, well, if you have a permissive license, then I guess the fair use question won't even need to be adjudicated.


Intellectual Property Laundering


"'Information wants to be free!' ... but not ours though."

"'There's no such thing as intellectual property' ... except mine. Don't you dare touch my IP. I"ll sue!"

"'Downloading a copy doesn't deprive the copyright owner of anything' ... hey, stop stealing my code. Thief! Pirate!"

The irony of the anti-IP and pro-FOSS crowd failing to live up to their own professed philosophy is profound.


A large portion of the "FOSS" crowd cherishes IP law and uses it to enforce source availability of derivative works.


GPL


it's still GPL infringing


I've always used MIT or Apache for my public repos, but I don't feel comfortable with them building a subscription-based product using my code. What license should I use to avoid Copilot being trained on my code? Ignoring, for the moment, the fact that they are not currently respecting the licenses...


Isn’t the point of MIT that the code can be reused in whatever way and for whatever purpose they please?

(As long as the copyright notice is retained, of course, which in this case it isn't.)

I’ve begun leaning heavily toward GPL for this reason.


There is no copyright license that allows the public to read code but not learn from it.


AGPL is the most aggressive copyleft license available, but even that can't stop a well-funded tech giant from taking whatever they want. You need to use AGPL and retain a well-resourced legal team.


It's never that simple :)


If you copy a piece of GPL code but change the whitespace or some arbitrary number, so that it produces the same output, does that still break the license (if you don't make the source public)?


Yes

> if you don't make the source public

I'm not sure what you mean by that, the GPL only has constraints for redistribution. If you don't distribute anything (publicly or not), there is no way to break the GPL.


Assuming the code you modify is protectible in the first place, then yes, this sort of trivial modification or derivative would be covered by GPL and have to abide by its rules.

But just because you slap a license on something doesn't mean you can actually enforce it.


No, because you only need to make the source available on request from the original developer(s).


Yes. Copying and then modifying the code doesn't change the fact that you copied it.


I immediately checked the box. Peace of mind for what I put in my code, and honestly Copilot would still be useful even with access only to the local code base.


I selected "Allow" on suggestions matching public code. I feel like I'm on the FBI watchlist now.


Time to generate thousands of public repos filled with copilot-generated code!



