Hacker News

It shouldn't do that, and we are taking steps to avoid reciting training data in the output: https://copilot.github.com/#faq-does-github-copilot-recite-c... https://docs.github.com/en/early-access/github/copilot/resea...

In terms of the permissibility of training on public code, the jurisprudence here – broadly relied upon by the machine learning community – is that training ML models is fair use. We are certain this will be an area of discussion in the US and around the world and we're eager to participate.




> ...the jurisprudence here – broadly relied upon by the machine learning community – is that training ML models is fair use.

To be honest, I doubt that. Maybe I am special, but if I am releasing some code under GPL, I really don't want it to be used in training a closed source model, which will be used in a closed source software generating code for closed source projects.


The whole point of fair use is that it allows people to copy things even when the copyright holder doesn't want them to.

For example, if I am writing a criticism of an article, I can quote portions of that article in my criticism, or modify images from the article in order to add my own commentary. Fair use protects against authors who try to exert so much control over their works that it harms the public good.


This isn't the same situation at all. The copying of code doesn't seem to be for a limited or transformative purpose. Fair use might cover parody or commentary & criticism but not limitless replication.


They are not replicating the code at all. They are training a neural network. The neural network then learns from the code and synthesises new code.

It's no different from a human programmer reading code, learning from it, and using that experience to write new code. Somewhere in your head there is code that someone else wrote. And it's not infringing anybody's copyright for those memories to exist in your head.


We can't yet equate ML systems with human beings. Maybe one day. But at the moment, it's probably better to compare this to a compiler being fed licensed code. The compilation output is still subject to the license. Regardless of how fancy the compiler is.

Also, a human being that reproduces licensed code from memory - because they read that code - would constitute a license violation. The line between derivative work, and authentic new original creation is not a well defined one. This is why we still have human arbiters of these decisions and not formal differential definitions of it. This happens in music for example all the time.


If avoiding copyright violations was simply "I remembered it", then I don't think things like clean-room reverse engineering would be ever legally necessary [1]

[1] https://en.wikipedia.org/wiki/Clean_room_design


It is replication, maybe not of a single piece of code - but creating a synthesis is still copying. For example, constructing a single piece of code of three pieces of code from your co-workers is still replication of code.

Your argument would have some merit if something were created instead of assembled, but there is no new algorithm that is being created. That is not what is happening here.

On the one hand, you call this copying in fair use. On the other hand, you say this is creating new code. You can't have it both ways.


> Your argument would have some merit if something were created instead of assembled, but there is no new algorithm that is being created. That is not what is happening here.

If you're going to set such a high standard for ML tools like this, I think you need to justify why it shouldn't apply to humans too.

When a human programmer who has read copyrighted code at some point in their life writes new code that is not a "new algorithm", are they in violation of the copyrights of every piece of code they've ever read that was remotely similar in any respect to the new work?

I mean, I hope not!

> On the one hand, you call this copying in fair use. On the other hand, you say this is creating new code. You can't have it both ways.

I'm not a lawyer, but this actually sounds very close to the "transformative" criterion under fair use. Elements of existing code in the training set are being synthesized into new code for a new application.

I assume there's no off-the-shelf precedent for this, but given the similarity with how human programmers learn and apply knowledge, it doesn't seem crazy to think this might be ruled as legitimate fair use. I'd guess it would come down to how willing the ML system is to suggest snippets that are both verbatim and highly non-generic.


From https://docs.github.com/en/github/copilot/research-recitatio...: "Once, GitHub Copilot suggested starting an empty file with something it had even seen more than a whopping 700,000 different times during training -- that was the GNU General Public License."

On the same page is an image showing copilot in real-time adding the text of the famous python poem, The Zen of Python. See https://docs.github.com/assets/images/help/copilot/resources... for a link directly to copilot doing this.

You are making arguments about what you read instead of objectively observing how copilot operates. Just because GH wrote that copilot synthesizes new code doesn't mean that it writes new code in the way that a human writes code. That is not what is happening here. It is replicating code. Even in the best case copilot is creating derivative works from code where GH is not the copyright owner.


> You are making arguments about what you read instead of objectively observing how copilot operates.

Of course I am. We are both participating in a speculative discussion of how copyright law should handle ML code synthesis. I think this is really clear from the context, and it seems obvious to me that this product will not be able to move beyond the technical preview stage if it continues to make a habit of copying distinctive code and comments verbatim, so that scenario isn't really interesting to me. Github seems to agree (from the page on recitation that you linked):

> This investigation demonstrates that GitHub Copilot can quote a body of code verbatim, but that it rarely does so, and when it does, it mostly quotes code that everybody quotes, and mostly at the beginning of a file, as if to break the ice.

> But there’s still one big difference between GitHub Copilot reciting code and me reciting a poem: I know when I’m quoting. I would also like to know when Copilot is echoing existing code rather than coming up with its own ideas. That way, I’m able to look up background information about that code, and to include credit where credit is due.

> The answer is obvious: sharing the prefiltering solution we used in this analysis to detect overlap with the training set. When a suggestion contains snippets copied from the training set, the UI should simply tell you where it’s quoted from. You can then either include proper attribution or decide against using that code altogether.

> This duplication search is not yet integrated into the technical preview, but we plan to do so. And we will both continue to work on decreasing rates of recitation, and on making its detection more precise.

The arguments you've made here would seem to apply equally well to a version of Copilot hardened against "recitation", hence my reply.

> Even in the best case copilot is creating derivative works from code where GH is not the copyright owner.

It would be convenient for your argument(s) if it were decided legal fact that ML-synthesized code is derivative work, but it seems far from obvious to me (in fact, I would disagree) and you haven't articulated a real argument to that effect yourself. It has also definitely not been decided by any legal entity capable of establishing precedent.

And, again, if this is what you believe then I'm not sure how the work of human programmers is supposed to be any different in the eyes of copyright law.
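For what it's worth, the "duplication search" in the quoted GitHub passage could plausibly look something like the sketch below. This is only an illustration under my own assumptions, not GitHub's actual method: the whitespace tokenizer, the n-gram length, and the names `token_ngrams`, `build_index` and `find_sources` are all hypothetical.

```python
from collections import defaultdict


def token_ngrams(text, n):
    """Split on whitespace and return every run of n consecutive tokens."""
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_index(corpus, n=6):
    """Map each n-gram to the set of source names it appears in.

    `corpus` is a hypothetical dict of {source_name: source_text}.
    """
    index = defaultdict(set)
    for name, text in corpus.items():
        for gram in token_ngrams(text, n):
            index[gram].add(name)
    return index


def find_sources(suggestion, index, n=6):
    """Return the names of sources sharing at least one n-gram with the suggestion."""
    sources = set()
    for gram in token_ngrams(suggestion, n):
        sources |= index.get(gram, set())
    return sources
```

A UI could then show the returned source names next to a suggestion, so the user can attribute the snippet or discard it.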


> Of course I am. We are both participating in a speculative discussion of how copyright law should handle ML code synthesis. I think this is really clear from the context, and it seems obvious to me that this product will not be able to move beyond the technical preview stage if it continues to make a habit of copying distinctive code and comments verbatim, so that scenario isn't really interesting to me. Github seems to agree (from the page on recitation that you linked):

No. We both aren't. I am discussing how copilot operates from the perspective of a user concerned about legal ramifications. I backed that concern up with specific factual quotes and animated images from github, where github unequivocally demonstrated how copilot copies code. You are speculating how copyright law should handle ML code synthesis.


> No. We both aren't

You say I'm not ... but then you say, explicitly in so many words, that I am:

> You are speculating how copyright law should handle ML code synthesis.

I don't get it. Am I, or aren't I? Which is it? I mean, not that you get to tell me what I am talking about, but it seems like something we should get cleared up.

edit: Maybe you mean I am, and you aren't?

Beyond that, I skimmed the Github link, and my takeaway was that this is a small problem (statistically, in terms of occurrence rate) that they have concrete approaches to fixing before full launch. I never disputed that "recitation" is currently an issue, but honestly that link seems to back up my position more than it does yours (to the extent that yours is coherent, which (as above) I would dispute).


> They are not replicating the code at all.

Now that five days have passed, there have been a number of examples of copilot doing just that, replicating code. Quake source code that even included comments, the famous python poem, etc. There are many examples of code that has been replicated - not synthesized but duplicated byte for byte from the originals.


surely that depends on the size of the training set?

I could feed the Linux kernel one function at a time into a ML model, then coerce its output to be exactly the same as the input

this is obviously copyright infringement

whereas in the github case where they've trained it on millions of projects maybe it isn't?

does the training set size become relevant legally?


Fair Use is specific to the US, though. The picture could end up being much more complicated when code written outside the US is being analyzed.


The messier issue is probably using the model to write code outside the US. Americans can probably analyze code from anywhere in the world and refer to Fair Use if a lawyer comes knocking, but I can't refer to Fair Use if a lawyer knocks on my door after using Copilot.


It is not US specific, we have it in EU. And e.g. in Poland I could reverse engineer a program to make it work on my hardware/software if it doesn't. This is covered by fair use here.


Is it any different than training a human? What if a person learned programming by hacking on GPL public code and then went to build proprietary software?


It is different in the same way that a person looking at me from their window when I pass by is different from a thousand cameras observing me when I move around the city. Scale matters.


> a thousand cameras observing me when I move around the city. Scale matters.

While I certainly appreciate the difference, is camera observation illegal anywhere where it isn't explicitly outlawed? Meaning, have courts ever decided that the difference of scale matters?


No idea. I was not trying to make a legal argument. This was to try to convey why someone might feel ok about humans learning from their work but not necessarily about training a model.


This is a lovely analogy, akin to "sharing mix tapes" vs "sharing MP3s on Napster". I fear the coming world with extensive public camera surveillance and facial recognition! (For any other "tin foil hatters" out there, cue the trailer for Minority Report.)


>I fear the coming world with extensive public camera surveillance and facial recognition!

I fear the coming world of training machine learning models with my face just because it was published by someone somewhere (legally or not).


You can rest assured that this is already the case if your picture was ever posted online. There are dozens of such products that law enforcement buys subscriptions to.


A human being who has learned from reading GPL'd code can make the informed, intelligent decision to not copy that code.

My understanding of the open problem here is whether the ML model is intelligently recommending entire fragments that are explicitly licensed under the GPL. That would be a licensing violation, if a human did it.


Actually, I believe it's tricky to say whether even a human can do that safely. There's the whole concept of a "cleanroom rewrite": if you want to rewrite some GPL or closed-source project under a different license, you should make sure you have never seen even a glimpse of the original code. If you look at GPL or closed-source code (or, actually, code governed by any other license), it's hard to prove you didn't accidentally or subconsciously remember parts of it and copy them into your "rewrite" project, even if you "made a decision to not copy". The border between "inspired by" and "blatant copyright infringement" is blurry and messy. If that was already so tricky and troublesome legally, my first instinct is that with Copilot it could be even murkier territory. IANAL, yet I'd feel better if they made some [legally binding] promises that their model is based only on code carefully verified to have one of an explicit (and published) whitelist of permissive licenses. (Even this could be tricky, with MIT etc. actually requiring some mention in your advertising materials [which is often forgotten], but that's a completely different level of trouble than not knowing if I'm infringing GPL or some closed-source code, or some other weird license.)


> A human being who has learned from reading GPL'd code can make the informed, intelligent decision to not copy that code.

A model can do this as well. Getting the length of a substring match isn’t rocket science.
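As a rough sketch of what I mean (my own illustration, not anything Copilot is known to do): Python's standard `difflib` can find the longest contiguous run shared by a suggestion and a known source. The function names and the 40-character threshold are arbitrary assumptions.

```python
from difflib import SequenceMatcher


def longest_shared_run(suggestion: str, known_source: str) -> str:
    """Return the longest contiguous substring present in both inputs."""
    # autojunk=False disables difflib's popularity heuristic, which could
    # otherwise skip frequent characters in long inputs.
    matcher = SequenceMatcher(None, suggestion, known_source, autojunk=False)
    match = matcher.find_longest_match(0, len(suggestion), 0, len(known_source))
    return suggestion[match.a:match.a + match.size]


def looks_recited(suggestion: str, known_source: str, threshold: int = 40) -> bool:
    """Flag a suggestion whose longest shared run reaches the (assumed) threshold."""
    return len(longest_shared_run(suggestion, known_source)) >= threshold
```

Doing this against an entire training set would need an index rather than pairwise comparison, and picking a threshold that separates boilerplate from distinctive code is the hard part.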


But wouldn't a machine learning from AGPL code be hosting AGPL code in its memory?


Pretty sure merely hosting code doesn't trigger AGPL; if it did, github would have to be open-sourced.


Would you hire a person who only knew how to program by taking small snippets of code from GPL and rearranging them? That's like hiring monkeys to type Shakespeare.

The clear difference is that a human's training regimen is to understand how and why code interacts. That is different from an engine that replicates other people's source code.


What if a person heard a song by hearing it on the radio and went on to record their own version?


There is already a legal structure in place for cover song licensing.

https://en.wikipedia.org/wiki/Cover_version#United_States_co...


Exactly so it needs licensing of some sort - this is closer to cover tunes than it is to someone getting a CS degree and being asked to credit Knuth for all their future work.


How do you distribute a human?


A contractor seems equivalent to SaaS to me


Perhaps we need GPL v4. I don't think there is any clause in current V2/V3 that prohibits learning from the code, only using the code in other places and running a service with code.


Would you be okay with a human reading your GPL code and learning how to write closed source software for closed source projects?


> To be honest, I doubt that.

Okay, but that's...not much of a counterargument (to be fair, the original claim was unsupported, though.)

> Maybe I am special, but if I am releasing some code under GPL, I really don't want it to be used in training a closed source model

That's really not a counterargument. “Fair use” is an exception to exclusive rights under copyright, and renders the copyright holder’s preferences moot to the extent it applies. The copyright holder not being likely to want it based on the circumstances is an argument against it being implicitly licensed use, but not against it being fair use.


> a closed source model

It seems like some of the chatter around this is implying that the resultant code might still have some GPL still on it. But it seems to me that it's the trained model that Microsoft should have to make available on request.


That's the point of fair use. To do something with a material the original author does not want.


Can you explain why you think this is covered by fair use? It seems to me to be

1a) commercial

1b) non-transformative: in order to be useful, the produced code must have the same semantics as some code in the training set, so this does not add "a different character or purpose". Note that this is very different from a "clean room" implementation, where a high-level design is reproduced, because the AI is looking directly at the original code!

2) possibly creative?

3) probably not literally reproducing input code

4) competitive/displacing for the code that was used in the input set

So failing at least 3 out of 5 of the guidelines. https://www.copyright.gov/fair-use/index.html


1a) Fair use can be commercial. And copilot is not commercial so the point is moot.

1b) This is false. This is not literally taking snippets it has found and suggesting it to the user. That would be an intelligent search algorithm. This is writing novel code automatically based on what it has learned.

2) Definitely creative. It's creating novel code. At least it's creative if you consider a human programming to be a creative endeavor as well.

3) If it's reproducing input code it's just a search algorithm. This doesn't seem to be the case.

4) Most GPLed code doesn't cost any money. As such the market for it is non-existent. Besides copilot does not displace the original even if there were a market for it. As far as I know there is not anything even close to comparable in the world right now.

So from my reading it violates none of the guidelines.


This is what is so miserable about the GPL progression. We went from GPLv2 (preserving everyone's rights to use code) to GPLv3 (you have to give up your encryption keys) - I think we've lost the GPL as a place where we could solve / answer these types of questions which are good ones - GPL just tanked a lot of trust in it with the (A)GPLv3 stuff especially around prohibiting other developers from specific uses of the code (which is diametrically different from earlier versions which preserved rights).


Think what you will of GPLv3, but lies help no one. Of course it doesn't require you to give up your encryption keys.


Under GPLv2 I could make a device with GPLv2 software and maintain root of trust control of that device if I wanted (ie, do an anti-theft activation lock process, do a lease ownership option of $200/month vs $10K to buy etc).

Think what you will, but your lies about the GPLv3 can easily be tested. Can you point me to some GPLv3 software in the Apple tech stack?

We actually already know the answer.

Apple had to drop Samba (they were a MAJOR end user use of Samba) because of GPLv3

I think they also moved away from GCC for LLVM.

In fact - they've probably purged at least 15 packages I'm aware of and I'm aware of NO GPLv3 packages being included.

Not sure what their App Store story is - but I wouldn't be surprised if they were careful there too.

Oh - this is all lies and apple's lawyers are wrong? Come on - I'm aware of many other companies that absolutely will not ship GPLv3 software for this reason.

In fact, by 2011 even it was clear that GPLv3 is not really workable in a lot of contexts and alternatives like MIT became more popular.

https://trends.google.com/trends/explore?date=all&geo=US&q=%...

Apple geared up to fight DOJ over maintaining root control of devices (San Bernardino case).

Even Ubuntu has had to deal with this - SFLC made it clear that if some distributor messed things up ubuntu would have to release their keys, which is why they ended up with a MICROSOFT (!) solution.

"Ubuntu wishes to ensure that users can boot any operating system they like and run any software they want. Their concern is that the GPLv3 makes provisions by which the FSF could, in this case as the owner of GRUB2, deem that a machine that won't let them replace GRUB2 with something else is in violation of the GPLv3. At that point, they can demand that Ubuntu surrender its encryption keys used to provide secure bootloader verification--which then allows anyone to sign any bootloader they want, thus negating any security features you could leverage out of the bootloader (for example, intentionally instructing it to boot only signed code--keeping the chain trusted, rather than booting a foreign OS as is the option)." - commentator on this topic.

It's just interesting to me that rather than any substance the folks arguing for GPLv3 reach for name calling type responses.


That's why Apple's SMB implementation stinks! Finally, there's a reason for it, I thought they had just started before Samba was mature or something.


Yeah, it was a bit of a big bummer!

Apple used to also interoperate wonderfully if you were using Samba SERVER side too because - well, they were using Samba client side. Those days were fantastic frankly. You would run Samba server side (on Linux), then Mac client side - and still have your windows machines kind of on-network (for accounting etc) too.

But the Samba folks are (or were) VERY hard core GPLv3 folks - so writing was on the wall.

GPLv3 shifted things really from preserving developer freedom for OTHERs to do what they wanted with the code, to requiring YOU to do stuff in various ways which was a big shift. I'd assumed that (under GPLv2) there would be natural convergences, but GPLv3 really blew that apart and we've had a bit of a license fracturing relatively.

AGPLv3 has also been a bit weaponized to do a sort of fake open source where you can only really use the software if you pay for a commercial license.


The macOS CIFS client was from BSD, not from Samba.


BSDs have also taken a pretty strong stance against GPLv3 - again for violating their principles on freedom.

I can't dig it up right now but someone can probably find it.

But the BSDs used samba for a while as well.


As of Darwin 8.0.1 (so Tiger?) smbclient(1)'s man page was saying it was a component of Samba. I think some BSDs used Samba.


You can do what you describe with the GPLv3. You'll just have to allow others to swap out the root of trust if they so please.

Everything else you write is just anecdotes about how certain companies have chosen to do things.


Let me be crystal clear.

If I sell an open source radio with firmware limiting broadcast power / bands etc to regulated limits and ranges - under GPLv3 I can lock down this device to prevent the buyer from modifying it? I'm not talking about making the software available (happy to do that, GPLv2 requires that). I'm talking about the actual devices I build and sell (physical ones).

I can build a Roku or Tivo and lock it down? Have you even read the GPLv3? It has what is commonly called the ANTI-tivoisation clause PRECISELY to block developers from locking devices down for products they sell / ship.

If I rent a device and build in a monthly activation check - I can use my keys to lock device and prevent buyer from bypassing my monthly activation check or other restrictions?

The problem I have with GPLv3 folks is they basically endlessly lie about what you can do with GPLv3 - when there is plenty of VERY CLEAR evidence that everyone from Ubuntu to Apple to many others who've looked at this (yes, with attorneys) says that no - GPLv3 can blow up in your face on this.

So no, I don't believe you. These aren't "just anecdotes". These are companies incurring VERY significant costs to move away / avoid GPLv3 products. AGPLv3 is even more poisonous - I'm not aware of any major players using it (other than those doing the fake open source game).


No, you can't lock it down without letting its owner unlock it. That's indeed the point. But your original comment said you have to give up your encryption keys. That's the lie I was getting at.

Now we can debate whether or not it's a good thing that the user gets full control of his device if he wants it. I think it is. You?


These claims are absurd. AGPL and GPLv3 carry on the same mission of GPLv2 to protect authors and end users from proprietization, patent trolling and freeloading.

This is why SaaS/Cloud companies dislike them and fuel FUD campaigns.


> ...the jurisprudence here – broadly relied upon by the machine learning community – is that training ML models is fair use.

If you train an ML model on GPL code, and then make it output some code, would that not make the result a derivative of the GPL licensed inputs?

But I guess this could be similar to musical composition. If the output doesn't resemble any of the inputs or contain significant continuous portions of them, then it's not a derivative.


> If the output doesn't resemble any of the inputs or contain significant continuous portions of them, then it's not a derivative.

In this particular case, the output resembles the inputs, or there is no reason to use Github Copilot.


> It shouldn't do that, and we are taking steps to avoid reciting training data in the output

This just gives me a flashback to copying homework in school, “make sure you change some of the words around so it’s not obvious”

I’m sure you’re right Re: jurisprudence, but it never sat right with me that AI engineers get to produce these big, impressive models but the people who created the training data will never be compensated, let alone asked. So I posted my face on Flickr, how should I know I’m consenting to benefit someone’s killer robot facial recognition?


Wait I thought y'all argued Google didn't copy Java for Android, now that big tech is copying your code you're crying wolf?


The whole point of that case begins with the admission "yes of course Google copied." They copied the API. The argument was that copying an API to enable interoperability was fair use. It went to the Supreme Court because no law explicitly said that was fair use and no previous case had settled the point definitively. And the reason Google could be confident they copied only the API is because they made sure the humans who did it understood both the difference and the importance of the difference between API and implementation. I don't think there is a credible argument that any AI existing today can make such a distinction.


>training ML models is fair use

How does that apply to countries where Fair Use is not a thing? As in, if you train a model on a fair use basis in the US and I start using the model somewhere else?


Fair use doesn’t exist in Germany.


I don’t think it’s fair to ask a US company to comment on legalities outside of the US.


It's fair to expect an international company pushing its products all over the world to be prepared to comment on non-US jurisdictions. (I have some sympathy for "we have a local market, and that's what we are solely targeting and preparing for" in companies where that is actually the case, but that's really not what we are dealing with in the case of Microsoft/GitHub)


One would expect GitHub (owned by Microsoft) to have engaged corporate counsel for an opinion (backed by statute and case law), and to be prepared to disable the functionality in jurisdictions where it’s incompatible with local IP law.


You just shared a URL that says "Please do not share this URL publicly".


Well, he's also GitHub's CEO so it's probably just fine.


> training ML models is fair use

In what context? You are planning on commercializing Copilot and in that case the calculus on whether or not using copyright protected material for your own benefit changes drastically.


It isn't. US copyright law says brief excerpts of copyright material may, under certain circumstances, be quoted verbatim

----> for purposes such as criticism, news reporting, teaching, and research <----, without the need for permission from or payment to the copyright holder.

Copilot is not criticizing, reporting, teaching, or researching anything. So claiming fair use is the result of total ignorance or disregard.


Would I be able to use something like this in the near future to produce a proprietary Linux kernel?



