In terms of the permissibility of training on public code, the jurisprudence here – broadly relied upon by the machine learning community – is that training ML models is fair use. We are certain this will be an area of discussion in the US and around the world and we're eager to participate.
To be honest, I doubt that. Maybe I am special, but if I release some code under the GPL, I really don't want it used to train a closed-source model, which will then be used in closed-source software generating code for closed-source projects.
For example, if I am writing a criticism of an article, I can quote portions of that article in my criticism, or modify images from the article in order to add my own commentary. Fair use protects against authors who try to exert so much control over their works that it harms the public good.
It's no different from a human programmer reading code, learning from it, and using that experience to write new code. Somewhere in your head there is code that someone else wrote. And it's not infringing anybody's copyright for those memories to exist in your head.
Also, a human who reproduces licensed code from memory - because they read that code - would be committing a license violation. The line between a derivative work and an authentically new original creation is not a well-defined one. This is why we still have human arbiters of these decisions rather than formal definitions. This happens in music, for example, all the time.
Your argument would have some merit if something were created instead of assembled, but there is no new algorithm that is being created. That is not what is happening here.
On the one hand, you call this copying in fair use. On the other hand, you say this is creating new code. You can't have it both ways.
If you're going to set such a high standard for ML tools like this, I think you need to justify why it shouldn't apply to humans too.
When a human programmer who has read copyrighted code at some point in their life writes new code that is not a "new algorithm", are they in violation of the copyrights of every piece of code they've ever read that was remotely similar in any respect to the new work?
I mean, I hope not!
> On the one hand, you call this copying in fair use. On the other hand, you say this is creating new code. You can't have it both ways.
I'm not a lawyer, but this actually sounds very close to the "transformative" criterion under fair use. Elements of existing code in the training set are being synthesized into new code for a new application.
I assume there's no off-the-shelf precedent for this, but given the similarity with how human programmers learn and apply knowledge, it doesn't seem crazy to think this might be ruled as legitimate fair use. I'd guess it would come down to how willing the ML system is to suggest snippets that are both verbatim and highly non-generic.
On the same page is an image showing Copilot adding, in real time, the text of the famous Python poem, The Zen of Python. See https://docs.github.com/assets/images/help/copilot/resources... for a link directly to Copilot doing this.
You are making arguments about what you read instead of objectively observing how copilot operates. Just because GH wrote that copilot synthesizes new code doesn't mean that it writes new code in the way that a human writes code. That is not what is happening here. It is replicating code. Even in the best case copilot is creating derivative works from code where GH is not the copyright owner.
Of course I am. We are both participating in a speculative discussion of how copyright law should handle ML code synthesis. I think this is really clear from the context, and it seems obvious to me that this product will not be able to move beyond the technical preview stage if it continues to make a habit of copying distinctive code and comments verbatim, so that scenario isn't really interesting to me. Github seems to agree (from the page on recitation that you linked):
> This investigation demonstrates that GitHub Copilot can quote a body of code verbatim, but that it rarely does so, and when it does, it mostly quotes code that everybody quotes, and mostly at the beginning of a file, as if to break the ice.
> But there’s still one big difference between GitHub Copilot reciting code and me reciting a poem: I know when I’m quoting. I would also like to know when Copilot is echoing existing code rather than coming up with its own ideas. That way, I’m able to look up background information about that code, and to include credit where credit is due.
> The answer is obvious: sharing the prefiltering solution we used in this analysis to detect overlap with the training set. When a suggestion contains snippets copied from the training set, the UI should simply tell you where it’s quoted from. You can then either include proper attribution or decide against using that code altogether.
> This duplication search is not yet integrated into the technical preview, but we plan to do so. And we will both continue to work on decreasing rates of recitation, and on making its detection more precise.
The arguments you've made here would seem to apply equally well to a version of Copilot hardened against "recitation", hence my reply.
> Even in the best case copilot is creating derivative works from code where GH is not the copyright owner.
It would be convenient for your argument(s) if it were decided legal fact that ML-synthesized code is derivative work, but it seems far from obvious to me (in fact, I would disagree) and you haven't articulated a real argument to that effect yourself. It has also definitely not been decided by any legal entity capable of establishing precedent.
And, again, if this is what you believe then I'm not sure how the work of human programmers is supposed to be any different in the eyes of copyright law.
No. We both aren't. I am discussing how copilot operates from the perspective of a user concerned about legal ramifications. I backed that concern up with specific factual quotes and animated images from github, where github unequivocally demonstrated how copilot copies code. You are speculating about how copyright law should handle ML code synthesis.
You say I'm not ... but then you say, explicitly in so many words, that I am:
> You are speculating how copyright law should handle ML code synthesis.
I don't get it. Am I, or aren't I? Which is it? I mean, not that you get to tell me what I am talking about, but it seems like something we should get cleared up.
edit: Maybe you mean I am, and you aren't?
Beyond that, I skimmed the Github link, and my takeaway was that this is a small problem (statistically, in terms of occurrence rate) that they have concrete approaches to fixing before full launch. I never disputed that "recitation" is currently an issue, but honestly that link seems to back up my position more than it does yours (to the extent that yours is coherent, which (as above) I would dispute).
Now that five days have passed, there have been a number of examples of copilot doing just that, replicating code. Quake source code that even included comments, the famous python poem, etc. There are many examples of code that has been replicated - not synthesized but duplicated byte for byte from the originals.
I could feed the Linux kernel one function at a time into an ML model, then coerce its output to be exactly the same as the input
this is obviously copyright infringement
whereas in the github case, where they've trained it on millions of projects, maybe it isn't?
does the training set size become relevant legally?
While I certainly appreciate the difference, is camera observation illegal anywhere where it isn't explicitly outlawed? Meaning, have courts ever decided that the difference of scale matters?
I fear the coming world of training machine learning models with my face just because it was published by someone somewhere (legally or not).
My understanding of the open problem here is whether the ML model is intelligently recommending entire fragments that are explicitly licensed under the GPL. That would be a licensing violation, if a human did it.
A model can do this as well. Getting the length of a substring match isn’t rocket science.
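As the comment says, getting the length of a substring match is not rocket science. A minimal sketch, assuming plain character-level comparison (a real detector would tokenize and normalize first): the classic dynamic-programming longest-common-substring length, which could flag any suggestion whose overlap with a licensed file exceeds some threshold.

```python
def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest contiguous run of characters shared by a and b."""
    best = 0
    # prev[j] = length of the common suffix of a[:i-1] and b[:j]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # extend the matching run ending at a[i-1] / b[j-1]
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best
```

This is O(len(a) * len(b)), so a production system would use suffix structures or fingerprinting instead, but the point stands: measuring verbatim overlap is a solved problem.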
The clear difference is that a human's training regimen is to understand how and why code interacts. That is different from an engine that replicates other people's source code.
Okay, but that's...not much of a counterargument (to be fair, the original claim was unsupported, though.)
> Maybe I am special, but if I am releasing some code under GPL, I really don't want it to be used in training a closed source model
That's really not a counterargument. “Fair use” is an exception to exclusive rights under copyright, and renders the copyright holder’s preferences moot to the extent it applies. The copyright holder not being likely to want it based on the circumstances is an argument against it being implicitly licensed use, but not against it being fair use.
It seems like some of the chatter around this is implying that the resultant code might still have some GPL on it. But it seems to me that it's the trained model that Microsoft should have to make available on request.
1b) non-transformative: in order to be useful, the produced code must have the same semantics as some code in the training set, so this does not add "a different character or purpose". Note that this is very different from a "clean room" implementation, where a high-level design is reproduced, because the AI is looking directly at the original code!
2) possibly creative?
3) probably not literally reproducing input code
4) competitive/displacing for the code that was used in the input set
So failing at least 3 out of 5 of the guidelines. https://www.copyright.gov/fair-use/index.html
1b) This is false. Copilot is not literally taking snippets it has found and suggesting them to the user - that would be an intelligent search algorithm. It is writing novel code automatically based on what it has learned.
2) Definitely creative. It's creating novel code. At least it's creative if you consider a human programming to be a creative endeavor as well.
3) If it's reproducing input code it's just a search algorithm. This doesn't seem to be the case.
4) Most GPLed code doesn't cost any money, so the market for it is essentially non-existent. Besides, copilot does not displace the original even if there were a market for it. As far as I know, there is nothing even close to comparable in the world right now.
So from my reading it violates none of the guidelines.
Think what you will, but your lies about the GPLv3 can easily be tested. Can you point me to some GPLv3 software in the Apple tech stack?
We actually already know the answer.
Apple had to drop Samba (they were a MAJOR user of Samba) because of GPLv3
I think they also moved away from GCC to LLVM.
In fact - they've probably purged at least 15 packages I'm aware of and I'm aware of NO GPLv3 packages being included.
Not sure what their App Store story is - but I wouldn't be surprised if they were careful there too.
Oh - this is all lies and Apple's lawyers are wrong? Come on - I'm aware of many other companies that absolutely will not ship GPLv3 software for this reason.
In fact, even by 2011 it was clear that GPLv3 is not really workable in a lot of contexts, and alternatives like MIT became more popular.
Apple geared up to fight the DOJ over maintaining root control of devices (the San Bernardino case).
Even Ubuntu has had to deal with this - the SFLC made it clear that if some distributor messed things up, Ubuntu would have to release their keys, which is why they ended up with a MICROSOFT (!) solution.
"Ubuntu wishes to ensure that users can boot any operating system they like and run any software they want. Their concern is that the GPLv3 makes provisions by which the FSF could, in this case as the owner of GRUB2, deem that a machine that won't let them replace GRUB2 with something else is in violation of the GPLv3. At that point, they can demand that Ubuntu surrender its encryption keys used to provide secure bootloader verification--which then allows anyone to sign any bootloader they want, thus negating any security features you could leverage out of the bootloader (for example, intentionally instructing it to boot only signed code--keeping the chain trusted, rather than booting a foreign OS as is the option)." - commentator on this topic.
It's just interesting to me that rather than offering any substance, the folks arguing for GPLv3 reach for name-calling responses.
Everything else you write is just anecdotes about how certain companies have chosen to do things.
If I sell an open source radio with firmware limiting broadcast power / bands etc to regulated limits and ranges - under GPLv3 I can lock down this device to prevent the buyer from modifying it? I'm not talking about making the software available (happy to do that, GPLv2 requires that). I'm talking about the actual devices I build and sell (physical ones).
I can build a Roku or Tivo and lock it down? Have you even read the GPLv3? It has what is commonly called the ANTI-tivoisation clause PRECISELY to block developers from locking devices down for products they sell / ship.
If I rent out a device and build in a monthly activation check - I can use my keys to lock the device and prevent the renter from bypassing my monthly activation check or other restrictions?
The problem I have with GPLv3 folks is they basically endlessly lie about what you can do with GPLv3 - when there is plenty of VERY CLEAR evidence that everyone from Ubuntu to Apple to many others who've looked at this (yes, with attorneys) says that no, GPLv3 can blow up in your face on this.
So no, I don't believe you. These aren't "just anecdotes". These are companies incurring VERY significant costs to move away from / avoid GPLv3 products. AGPLv3 is even more poisonous - I'm not aware of any major players using it (other than those playing the fake open source game).
Now we can debate whether or not it's a good thing that the user gets full control of his device if he wants it. I think it is. You?
Apple used to also interoperate wonderfully if you were using Samba SERVER side too because - well, they were using Samba client side. Those days were fantastic, frankly. You would run Samba server side (on Linux), then Mac client side - and still have your Windows machines kind of on-network (for accounting etc) too.
But the Samba folks are (or were) VERY hard core GPLv3 folks - so writing was on the wall.
GPLv3 really shifted things from preserving freedom for OTHERS to do what they wanted with the code, to requiring YOU to do stuff in various ways, which was a big shift. I'd assumed that (under GPLv2) there would be natural convergence, but GPLv3 really blew that apart and we've had a bit of license fracturing since.
AGPLv3 has also been a bit weaponized to do a sort of fake open source where you can only really use the software if you pay for a commercial license.
I can't dig it up right now but someone can probably find it.
But the BSD's used samba for a while as well.
This is why SaaS/Cloud companies dislike them and fuel FUD campaigns.
If you train an ML model on GPL code, and then make it output some code, would that not make the result a derivative of the GPL-licensed inputs?
But I guess this could be similar to musical composition: if the output doesn't resemble any of the inputs and doesn't contain significant continuous portions of them, then it's not a derivative.
In this particular case, the output resembles the inputs, or there is no reason to use Github Copilot.
This just gives me a flashback to copying homework in school, “make sure you change some of the words around so it’s not obvious”
I’m sure you’re right re: jurisprudence, but it never sat right with me that AI engineers get to produce these big, impressive models while the people who created the training data will never be compensated, let alone asked. I posted my face on Flickr; how was I to know I was consenting to benefit someone’s killer-robot facial recognition?
How does that apply to countries where Fair Use is not a thing? As in, if you train a model on a fair use basis in the US and I start using the model somewhere else?
In what context? You are planning on commercializing Copilot, and in that case the calculus on whether or not using copyright-protected material for your own benefit is fair changes drastically.
----> for purposes such as criticism, news reporting, teaching, and research <----, without the need for permission from or payment to the copyright holder.
Copilot is not criticizing, reporting, teaching, or researching anything. So claiming fair use is the result of total ignorance or disregard.