Take Google as an example, running Google Photos for free for several years. And now that this has sucked in a trillion photos, the AI job is done, and they likely have the best image recognition AI in existence.
Which is of course still peanuts compared to training a super AI on the entire web.
My point here is that only companies the size of Google and Microsoft have the resources to do this type of planetary scale AI. They can afford the super expensive AI engineers, have the computing power and own the data or will forcefully get access to it. We will even freely give it to them.
Any "lesser" AI produced by smaller companies trying to compete is obsolete on arrival, while the better one accelerates away. There is no second-best in AI, only winners.
If we predict that ultimately AI will change virtually every aspect of society, these companies will become omnipresent, "everything companies". God companies.
As per usual, it will be packaged as an extra convenience for you. And you will embrace it and actively help realize this scenario.
If there were offline image recognition we could train on our own data privately, could the results of those trainings be merged to come up with better recognition on average than any one person could do themselves with their own photos?
In other words, would it be possible for us to share the results of training, and build better models, without sharing the photos themselves?
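This is roughly what federated learning proposes: each participant trains the same model architecture locally and shares only the resulting weights, which a coordinator averages. A minimal sketch of the weight-combining step (FedAvg-style), with made-up toy weights:

```python
# Sketch of federated averaging (FedAvg-style): each user trains the same
# model architecture on their own photos, then shares only the resulting
# weights; a coordinator combines them without ever seeing the photos.
def federated_average(client_weights, client_sizes):
    """Per-layer average of each client's weight vectors,
    weighted by how many photos that client trained on."""
    total = sum(client_sizes)
    averaged = []
    for layer in range(len(client_weights[0])):
        acc = [0.0] * len(client_weights[0][layer])
        for weights, size in zip(client_weights, client_sizes):
            for i, value in enumerate(weights[layer]):
                acc[i] += value * (size / total)
        averaged.append(acc)
    return averaged

# Three hypothetical users with toy two-layer "models":
a = [[1.0, 2.0], [0.5]]
b = [[3.0, 4.0], [1.5]]
c = [[5.0, 6.0], [2.5]]
print(federated_average([a, b, c], [100, 100, 100]))
# With equal data sizes this is just the per-layer mean (~[[3, 4], [1.5]]).
```

Naively averaging weights only makes sense when all clients start from the same initialization and architecture; real federated systems repeat this over many rounds, and differential-privacy noise is often added so the shared weights themselves don't leak training data.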
What I'm building into PhotoStructure is typically called "transfer learning."
PhotoStructure is entirely self-hosted, including model training and application: the public domain base models (trained on huge datasets) are fetched and cached locally.
By design, none of your data (or even metadata) leaves your server.
(I expect to ship this in an upcoming beta next month.)
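For illustration (this is a toy sketch, not PhotoStructure's actual code), transfer learning typically means keeping a pretrained feature extractor frozen and training only a small head on your local data. Here the "base model" is a stand-in function, and the head is a tiny logistic-regression classifier:

```python
import math

# Toy transfer learning sketch (hypothetical; not PhotoStructure's code):
# a frozen "base model" turns a photo into a small embedding, and only a
# tiny classification head is trained on the user's local photos.
def base_embed(photo):
    # Stand-in for a pretrained feature extractor; never updated locally.
    return [sum(photo) / len(photo), max(photo) - min(photo)]

def train_head(photos, labels, lr=0.5, epochs=200):
    # Logistic-regression head trained by plain SGD on frozen embeddings.
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for photo, y in zip(photos, labels):
            x = base_embed(photo)
            z = w[0] * x[0] + w[1] * x[1] + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of the logistic loss w.r.t. z
            w[0] -= lr * g * x[0]
            w[1] -= lr * g * x[1]
            b -= lr * g
    return w, b

def predict(w, b, photo):
    x = base_embed(photo)
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

# Tiny made-up dataset: label 1 = "high-contrast" photos.
photos = [[0.1, 0.2, 0.1], [0.3, 0.2, 0.3], [0.9, 0.0, 0.9], [1.0, 0.1, 1.0]]
labels = [0, 0, 1, 1]
w, b = train_head(photos, labels)
print([predict(w, b, p) for p in photos])  # should recover the labels
```

The appeal for self-hosting is that the expensive part (the base model) is trained once on public data, while the cheap head-training runs locally on your own machine.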
Apple does all face recognition and image processing stuff on the edge. On your iPhone or Mac.
I wondered why my phone sometimes got frighteningly hot while charging. Then, after manually adding some faces for it to recognize, I saw a note along the lines of "Your phone will update faces while it is charging." All my photos are backed up to iCloud, by the way.
Transfer learning's best use cases are fast prototypes and ML tasks that do not need state-of-the-art performance.
By keeping the training data itself private, distributed and outsourced, you might be able to get otherwise unachievable levels of performance.
I run Windows. It can't ever be secure; anyone who wanted to hack me could.
Scrambling the data really makes things worse as any accident requiring recovery of my data is also probably going to lose the encryption key.
The only time I ever lost any significant chunk of data (a person's lifetime set of photos!) was because Windows encrypted data at rest, and thus it couldn't be recovered after a disk crash.
Unless there is some corporate or legal requirement to do so, I'll never encrypt a whole disk, or backup.
I'd hate encryption too if I threw away all best practices around it -- losing a key along with the failed system is a "problem exists between chair and keyboard" type of issue.
Encryption protects your data from yourself, from your adversaries, from serendipitous grey-moral types, and from the prying eyes of over-zealous data-collection conglomerates.
You seem experienced in the field, so I won't presume what your best practices are -- but to be enthusiastic against encryption is a form of cheer-leading that I think I cannot ethically support; the longer I live and the more pervasive companies get to be with their data collection policies then the more powerful and required tools like encryption seem to become.
I wish backup tools like Duplicity would warn you about the risks of encrypting backups instead of warning the user if they disable encryption, because encryption has the possibility of rendering all those backups useless when the moment to use them finally comes.
I have a similar feeling that large swathes of my digital life would be rendered permanently inaccessible if 2FA were enabled and my device were rendered inoperable. (That's why I keep meticulous physical backups of emergency keys.) I think 2FA and the like should be considered a tradeoff with its own inherent risks and benefits, instead of a universally better option than randomly generated 80-character passwords alone.
What we currently call AI is very far from AGI, and it's not clear that sitting on piles of proprietary data gives an edge towards AGI. If the goal is human-level intelligence, that has been demonstrably achieved with the far lesser resources of the public school system. :)
Current DL systems need huge amounts of data because they are very primitive: they work with immediate associations, so they require seeing data very similar to all possible inputs in order to generalize well.
As we develop more sophisticated systems, I expect that the leverage from data will tip over to engineering finesse, and nothing is better at fostering great engineering than the permissionless tinkering environment of open source.
Pretending that the scientifically managed public school system, which attempts to manufacture uniform educated humans on a conveyor belt, is responsible for human education is fairly ridiculous.
Children have a remarkable capacity to learn, and do so automatically through free play and exploration until public education wrings that curiosity out of them and turns education into a job.
Humans get educated despite the public education system, not because of it.
Say what now? There may be places on Earth that practice scientific management, there are definitely some that pretend to, but IME public school systems are neither.
You can read for yourself:
Seems unlikely human education costs less than AI education in total.
Every now and then you get someone to think about an old problem on a clean sheet of paper and you might get a better result with less training data / investment.
Google is actually pretty crappy at reverse image searches.
(Until, of course, they force you out and give the company to some crony oligarch. But that idea is also not unknown to Yandex, I believe?)
In fairness it's not quite as good, but, it's good enough for the searches I've wanted to do so far and gets better all the time. And they're adding searching text in photos this release. I'm happy to wait a little for this better implementation.
They use photos from the web for training, and then user photos are only used for the actual indexing.
Yes sophisticated AI tech concentrates power for those who already have power.
And the technology we all (presumably readers of HN) create can enhance the impact of the user. And this can result in unfair circumstances, in reality.
Law and force can prevent disproportionate use of power. Of course one must define the law, which may be done AFTER the offense has been committed. Further, if those who make the laws are corrupted by those with e.g. this AI tech power, then no effective law may be enacted and the hypothetical abuse will continue.
It is Yandex who now collects massive amounts of data to improve their image search, while Google apparently doesn't.
Yandex is a giant, for sure, but Google is, like, 10 times bigger and still doesn't provide the best service.
The problem with huge models like GPT-3 is that they are too expensive for regular people even to run, never mind train.
Also, I think you are overdramatizing this. Governments used to be omnipresent (maybe still are), in a different way, more threatening to individuals and probably as threatening to societies as "everything companies" could be.
But I'm not too worried here because everyone gets access to larger datasets every year, and it gets cheaper to process every year, so whatever Microsoft or Google is capable of doing now, smaller companies will be capable of doing in a few years.
This suggests that seeing the future a bit ahead of the rest of the world, and then assembling a motivated all-star team is (perhaps in the short term at least) one way of out-competing the "super AI" of the giants.
Don't let the name fool you, OpenAI is anything but Open.
Where Microsoft does have an “unfair advantage” is in their marketing and sales firepower. Replicating their B2B and B2C sales channels is indeed very expensive. GitHub will be able to monetise Copilot by some upselling campaign. Then again, startups regularly manage to break into markets that are supposedly locked down by the likes of Microsoft.
I see a lot of people comparing human learning to machine learning in the comments, but there is a huge difference: we don't distribute copies of humans.
By comparison, Copilot is even more obviously fair use.
I've had this conversation quite a few times lately, and the non-obvious thing for many developers is that fair use is an exception to copyright itself.
A license is a grant of permission (with some terms) to use a copyrighted work.
This snippet from the Linux kernel doesn't make my comment here or the website Hacker News a GPL derivative work:
ret = vmbus_sendpacket(dev->channel, init_pkt,
                       sizeof(struct nvsp_message),
                       (unsigned long)init_pkt,
                       VM_PKT_DATA_INBAND,
                       VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED);
return (await _sendFileStorageService.GetSendFileDownloadUrlAsync(send, fileId), false, false);
The Free Software Foundation agrees (https://www.gnu.org/licenses/gpl-faq.en.html#GPLFairUse)
> Yes, you do. “Fair use” is use that is allowed without any special permission. Since you don't need the developers' permission for such use, you can do it regardless of what the developers said about it—in the license or elsewhere, whether that license be the GNU GPL or any other free software license.
> Note, however, that there is no world-wide principle of fair use; what kinds of use are considered “fair” varies from country to country.
(And even this verbatim copying from FSF.org for the purpose of education is... Fair use!)
This mostly has to do with the wishy-washy nature of the four-part fair use test, which, unlike decent legal tests, doesn't actually have discrete answers. The judge looks at the four questions, talks about them while waving her hands, and makes a decision.
Compare that to, e.g., patents, where you actually do have yes-or-no questions. Clean Booleans. Is it Novel? Is it Non-Obvious? Is it Useful? If any of the above is "No", then no patent for you.
As for the execution of Fair Use, while I haven't gone too deep into Software, I can assure that for music, the thing is just a silly holy-hell mess; confirmed most recently by the "Blurred Lines" case, where NO DIRECT COPYING (e.g. sampling or melody taking) was alleged, merely that the song sounded really similar to "Got to give it up" and that was enough.
So then, I'd say everything either is, or should be, up in the air, when it comes to Fair Use and software.
All that said, the one thing I'd add about fair use is that it isn't permission to use anything you like, but rather a defense in a legal proceeding about copyright. It's pretty much all about being able to reference copyrighted material with the law later coming in and making final decisions on whether or not that reference went too far. (IE, copying all of a disney movie and saying "What's up with this!" vs copying 1 scene and saying "This is totally messed up and here's why".)
That was a big part of the google oracle lawsuit.
Those questions for patents are barely more clear-cut than copyright fair use tests, there is lots of room for disagreement.
It's definitely true that a fair use defense against copyright infringement varies a lot by the field of work and norms can develop which are relevant to court cases. The music field is a mess, the "Blurred Lines" judgement was total bullshit. But the software field is not without its own copyright history and norms so there's no reason to expect everything to go to hell.
The big guns like Microsoft, Google, Oracle, do this sort of thing as a matter of course in their business activities, they have the lawyers, the money, and the ear of members of parliaments, senators etc.
Whereas an individual or small business probably wants to conduct themselves within a more narrow set of adherences.
The unauthorized copy arises when someone gets the work out of the model.
Of course if you make a model explicitly for the purpose of evading copyright then the courts will see through that ploy.
Is (was?) a swipe gesture novel? Is it non-obvious?
I think the factor most at risk in a fair use test with Copilot is whether it ever suggests verbatim, code that could be considered the "heart" of the original work. The John Carmack example that's popped up here at least gets closer to this question, it was a relatively small amount but it was doing something very clever and important.
One can imagine a project that has thousands of lines of code to create a GUI, handle error conditions, etc. that's built around a relatively small function; if Copilot spat out that function in my code, it might not be fair use because it's the "heart" of the original work. Additionally, its inclusion in another project could affect the potential market for the original, another fair use test.
But Copilot suggesting a "heart" is unlikely, something that would have to be ruled on in a case-by-case basis and not a reason to shut it down entirely. Companies that are risk-averse could forbid developers from using Copilot.
I agree with you that the relative importance of the copied code to the end product would be (or should be) the crux of the issue for the courts in determining infringement.
This overall interpretation most closely adheres to the spirit and intent of Fair Use as I understand it.
For Copilot itself, I do see the case for fair use, though it gets fuzzy should Microsoft ever start commercializing the feature. Nevertheless it remains to be seen whether ML training fits the same public policy benefits public libraries and free debate leverages to enable the fair use defense.
For Copilot users, I don't see an easy defense. In your hypothetical, this would be akin to me going on Google Books and copying snippets of copyrighted works for my own book. In the case of Google Books, they explicitly call out the limits on how the material they publish can be used. In contrast, Copilot seems to be designed to encourage such copying, making it more worrisome in comparison.
A book completely written by pasting passages of other books would actually be a pretty interesting transformative work.
While software is in this limbo between copyrights and patents...
What's more, if any of the code implements a patent, fair use does not cover patent law, and relying on fair use rather than a copyright license does not benefit from any patent use grant that may be included in the copyright license. If a codebase infringes a patent due to Copilot automatically adding the code, I can easily imagine GitHub being attributed shared contributory liability for the infringement by a court.
Not a lawyer, just a former law student and law geek layman who has paid attention to these subjects.
That case required that the output be transformative, in that "words in books are being used in a way they have not been used before".
Copilot only fits the transformative aspect if it is not directly reciting code that already exists in the form it is redistributing. So long as it recites code verbatim, it fails to meet the criteria.
1. The act of training Copilot on public code
2. The resulting use of Copilot to generate presumably new code
#1 is arguably close to the Authors Guild v. Google case. You are literally transforming the input code into an entirely new thing: a series of statistical parameters determining what functioning code "looks like". You can use this information to generate a whole bunch of novel and useful code sequences, not just by feeding it parts of its training data and acting shocked that it remembered what it saw. That smells like fair use to me.
#2 is where things get more dicey - just because it's legal to train an ML system on copyrighted data wouldn't mean that its resulting output is non-infringing. The network itself is fair use, but the code it generates would be used in an ordinary commercial context, so you wouldn't be able to make a fair use argument here. This is the difference between scanning a bunch of books into a search engine, versus copying a paragraph out of the search engine and into your own work.
(More generally: Fair use is non-transitive. Each reuse triggers a new fair use analysis of every prior work in the chain, because each fair reuse creates a new copyright around what you added, but the original copyright also still remains.)
Absent this, I don't think there's a case. The courts have given extraordinarily wide latitude to fair use and ML algorithms are routinely trained on copyrighted works, photos, etc. without a license.
I understand that this feels more personal because it involves our field, but artists and authors have expressed the same sentiment when neural nets began making pictures and sentences.
The question here is no different than "Is GPT-3 an unlicensed, unlawfully created derivative work of millions, if not billions of people?"
No, I'm quite confident it is not.
It doesn't need to be substantial. In Google v. Oracle a 9-line function was found to be infringing.
The Supreme Court did hold that the 11,500 lines of API code copied verbatim constituted fair use.
Yes, because it was _transformative_, in a clear way. Because an API is only an interface. Which makes that part of that decision largely irrelevant to the topic at hand.
> Google’s limited copying of the API is a transformative use. Google copied only what was needed to allow programmers to work in a different computing environment without discarding a portion of a familiar programming language. Google’s purpose was to create a different task-related system for a different computing environment (smartphones) and to create a platform—the Android platform—that would help achieve and popularize that objective.
> If I recall correctly, the nine line question wasn't decided by the supreme court, but the API question was.
It was already decided earlier, and Google did not contest it, choosing instead to negotiate a zero payment settlement with Oracle over the rangeCheck function. There was no need for the Supreme Court to hear it.
If they felt the nine line function made Google's entire library an unlicensed derivative work, they would have pressed their case.
That's not the case. It wasn't an out-of-court settlement, but an agreement about the damages being sought; the court had already found it to be infringing, and that was part of the ruling.
But none of that changes that 9-lines is substantial enough to be infringing. It isn't necessary to be a large body of work.
> If they felt the nine line function made Google's entire library an unlicensed derivative work, they would have pressed their case.
No... It means the rangeCheck function was infringing. The implication you seem to have inferred here wouldn't be inferred by any kind of plagiarism case.
If Copilot is infringing, I suspect it's correctable (by GitHub) by adding a bloom filter or something like it to filter out verbatim snippets of GPL or other copyleft code. (And this actually sounds like something corporate users would want even if it was entirely fair use because of their intense aversion to the GPL, anyhow.)
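As a purely hypothetical sketch of how such a filter might work (this is not anything GitHub has announced), you could index whitespace-normalized n-line windows of copyleft training code in a Bloom filter, then flag any suggestion that overlaps the index before showing it to the user:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter; a real deployment would size m and k
    from the corpus size and a target false-positive rate."""
    def __init__(self, m=10_000, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m)

    def _hashes(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for h in self._hashes(item):
            self.bits[h] = 1

    def __contains__(self, item):
        return all(self.bits[h] for h in self._hashes(item))

def shingles(code, n=3):
    """Whitespace-normalized n-line windows of a snippet."""
    lines = [" ".join(l.split()) for l in code.splitlines() if l.strip()]
    return [tuple(lines[i:i + n]) for i in range(len(lines) - n + 1)]

# Index copyleft training code, then flag suggestions that overlap it.
gpl_index = BloomFilter()
gpl_code = "a = 1\nb = 2\nc = a + b\nprint(c)\n"
for s in shingles(gpl_code):
    gpl_index.add(s)

suggestion = "a = 1\nb = 2\nc = a + b\n"  # verbatim overlap
flagged = any(s in gpl_index for s in shingles(suggestion))
print(flagged)  # -> True
```

Bloom filters never miss a snippet they've indexed (no false negatives), which is the property you want here; the cost is a tunable rate of false positives, i.e. occasionally suppressing a suggestion that was actually original.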
I don't think your argument is as strong as you're making it out to be.
This goes to the "substantial" test for fair use. Clips from a film can contain core plot points, quotes from a book can contain vital passages to understanding a character, screen captures and scrapes of a website can contain huge amounts of textual detail, but depending on the four factors for fair use, still be fair use. (There have been exceptions though.)
The reaction on Hacker News to a machine producing code trained on their works is no different than the reactions artists and writers have had to other ML models. I suspect many of us are biased because it strikes at what we do and we think that our copyrights (because we have so many neat licenses) are special. They are not.
I think it would need to get to that level of "Copilot will emit a kernel module" before it's not obviously fair use.
After all, Google Books will happily convey to me whole pages from copyrighted works, page after page after page.
It's anything but obvious. https://www.copyright.gov/fair-use/
> there is no formula to ensure that a predetermined percentage or amount of a work—or specific number of words, lines, pages, copies—may be used without permission.
9 lines of very run-of-the-mill code in Oracle / Google weren't considered fair use.
However - Copilot directly recites code. That is _very unlikely_ to fall under fair use.
Redistributing the exact same code, in the same form, for the same purpose, probably means that Copilot, and thus the people responsible for it, are infringing.
You make that statement as an absolute, but in the interests of clarity, all evidence so far shows that it directly recites code very rarely indeed. Even the Quake example had to be prompted by the specific variable names used in the original code.
In practice, the output code is heavily influenced by your own context — the comments you include, the variable names you use, even the name of the file you are editing — and with use it’s obvious that the code is almost certainly not a direct recitation of any existing code.
_Once_ is enough for it to be infringing. The law is not very forgiving when you try and handwave it away.
I tend to think it would be covered (provided they were relatively small snippets and not entire functions).
I'm not American, but like others around here — I was just restricting the discussion to American law for simplicity's sake.
* Minus a few countries/regions targeted by US sanctions, I assume, though they've gradually broadened their services in sanctioned countries with the necessary licenses from OFAC.
I don't know the answer. I was only surprised that the commenter seemed dead sure that any and all copying (no matter how small) would be infringing.
That just doesn't correlate with my understanding of how Fair Use works: The "amount" of the infringement is one (of several) factors in determining if something falls under Fair Use:
>The third factor assesses the amount and substantiality of the copyrighted work that has been used. In general, the less that is used in relation to the whole, the more likely the use will be considered fair.
If it's so fair use, why not train it on all Microsoft code, regardless of license (in addition to GitHub.com) ? Would Microsoft employees be fine with Copilot re-creating "from memory" portions of Windows to use in WINE ?
Now, if you knew the code you wanted Copilot to generate you could certainly type it character by character and you might save yourself a few keystrokes with the TAB key, but it's going to be much MUCH easier to simply copy the whole codebase as files, and now you're right back where you started.
As Francois Chollet points out in this talk, ultimately deep neural network models are locality-sensitive hash tables, so the examples of people pulling out source code are an inherent shortcoming of deep learning models in general. Given the right 'key', you can 'recall' the value you are looking for.
Sounds like that wouldn't be difficult to fix? Transform the code to an intermediate representation (https://en.wikipedia.org/wiki/Intermediate_representation) as a pre-processing stage, which ditches any non-essential structure of the code and eliminates comments, variable names, etc., before running the learning algorithms on it. Et voila, much like a human learning something and reimplementing it, only essential code is generated without any possibility of accidentally regurgitating verbatim snippets of the source data.
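A rough sketch of that idea, with Python's ast module (requires 3.9+ for ast.unparse) standing in for a real intermediate representation: parsing already discards comments, and renaming identifiers to positional placeholders makes superficially different snippets normalize to the same text:

```python
import ast

class Normalize(ast.NodeTransformer):
    """Rename every identifier to a positional placeholder so snippets
    that differ only in names and comments normalize to the same text."""
    def __init__(self):
        self.names = {}

    def visit_Name(self, node):
        if node.id not in self.names:
            self.names[node.id] = f"v{len(self.names)}"
        node.id = self.names[node.id]
        return node

def normalize(source):
    tree = ast.parse(source)  # parsing already discards comments
    return ast.unparse(Normalize().visit(tree))

# Two snippets that differ only in variable names and a comment:
a = "total = 0\nfor item in items:  # sum them\n    total = total + item\n"
b = "acc = 0\nfor x in xs:\n    acc = acc + x\n"
print(normalize(a) == normalize(b))  # -> True
```

A model trained on the normalized form couldn't regurgitate the original names or comments, though whether the normalized code itself escapes copyright is exactly the legal question the rest of this thread is arguing about.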
With no copyright/copyleft, how do you enforce the rule that derived works must provide access to the source code? I’ve never heard that copyleft was a stepping stone—rather, it’s the stick that fully realizes the four freedoms.
(I guess these are going to depend a LOT on the jurisdiction that you're in ?)
Where is your ego when you're dead and gone? Where could we be if the majority of human advancements were not tightly clutched as trade secrets?
As someone who has done paid software engineering (yes, you can feel free to call me a hack or sell out if you wish), I've come to find that the salary I've pulled over the years has not gone to me... But keeping a roof over those I love, helping other people's projects grow, giving people a shot, etc.
My time on the other hand, gets dumped into implementing the same handful of processes doing the same damn thing, but different this time, because you can't just bloody make "Here ya go, here's your Enterprise-in-a-box".
I'd like more people to be able to solve novel problems, rather than needing to retread the same path over and over. Some degree of retreading will always be necessary to keep skills fresh in the population, but we could do far better at marshaling that split, and I'm convinced part of what necessitates it is the creation of artificial barriers through things like enforced implementation monopolies. Yes, it ensures a minimum level of novelty and variance across populations, but it also does a terrible job of conserving the finite human capacity for truly novel, innovative thought.
It may make societies that run on greed and economic/fiscal measures function, but I'm not convinced that other incentive structures couldn't also keep the rolling stone of innovation from gathering moss.
(Copyright has, IMHO, gone overboard with its duration; we should scale it back to the original 14 years, renewable once, just like patents. But copyright doesn't apply to processes anyway, and so arguably it shouldn't apply to software that can't claim any artistic merit.)
You are correct about (US specific) the fair use exception, but it is in no way as clear as you suggest that what copilot is doing entirely falls under fair use. Fair use is always constrained.
I suspect some variant of this sort of thing will have to be tested in court before the arguments are really clear.
Not sure I see it that way.
If I take your hard work that you clearly marked with a GPL license and then make money from it, not quite directly, but very closely, how is that fair use? Or legal?
Copying and storing a book isn't recreating another book from it. Copilot is creating new stuff from the contents of the "books" in this case.
Edit: I misunderstood fair use as it turns out...
Not sure if you meant to reply to me but I agree with you: you can't compare what Google did to what Copilot does.
Suggesting code is generating code
Neither. Someone else did, and published it. Copilot copied the dialog and suggested it.
> If you change names in those lines of dialogue to fit your story, do you now gain credit for writing those lines?
It depends. Talking generalities isn't productive or interesting. Can you give an example and we can discuss specifics?
> Suggesting code is generating code
This isn't even superficially true
But the other, more direct question is ... what about the instances where Copilot doesn't come up with a learned mishmash result? What happens when Copilot just gives you a straight up answer from it's learning data, verbatim?
Then you, as a dev, end up with a bunch of code that is effectively copied, via a 'copying tool', which is GPL'd?
It's that specific case that to me sticks out as the 'most concerning part'.
Please correct me if I'm wrong.
Fair use is an exception to copyright and, by definition, copyright licenses.
Google didn't create new books from the contents of existing ones (whether you agree that they should have been allowed to store the books or not) but Copilot is creating new code/apps from existing ones.
Edit: I guess my understanding of fair use was wrong. I stand corrected.
Copilot producing new, novel works (which may contain short verbatim snippets of GPL works) is a strong argument for transformativeness.
I don't know how a court would decide this, but I do think the facts in future GPT-3 cases are sufficiently different from Author's Guild that I could see it going any way. Plus, I think the prevalence of GPT-3 and the ramifications of the ruling one way or another could lead some future case to be heard by the Supreme Court. A similar case could come up in California, or another state where the 2nd Circuit Artist Guild case isn't precedent.
However, where does one draw the line between fair use and derivative works?
Creating something based on other stuff (Google creating AI books from the existing ones for example) would possibly be fair use I think but would it not also be derivative works?
Google Books is considered fair use because they got sued and successfully used fair use as a defense. Until someone sues over Copilot, everyone is an armchair lawyer.
It's a happy fact that figuring out people's arguments is often unnecessary for moderating the threads, especially in cases where people are breaking the site guidelines. Everyone needs to follow the site guidelines regardless of what the topic is, what their argument is, and how right they are or feel they are. Please stick to the rules when posting here.
Fair use is a defense for cases of copyright infringement, which means you're starting off from a case of copyright infringement, which sort of mucks up the whole "innocent until proven guilty" thing. And considering it's a weighted test, it's hardly cut-and-dried.
For your browser analogy, that would mean that the "browser" is the copilot code, while the weights would be some data derived from GPL'd works, perhaps a screenshot of the browser showing the code.
I'd think that the weights/screenshot in this analogy would have to abide by the GPL license. In a vacuum, I would not think that the copilot code had to be licensed under GPL, but it might be different in this case since the copilot code is necessary to make use of the weights.
But then again, the weights are sitting on some server, so GPL might not apply anyway. Not sure about AGPL and other licenses though. There is likely some incompatibility between licenses in there.
No, let's substitute a different database for the code that isn't SO. It doesn't really matter whether that database is a literal RDBMS, a giant git repo, or is encoded as a neural net. All Copilot is going to do is perform a search in that database, find a result, and paste it in. The burden of licensing is still on me not to use GPL code, and possibly on the person hosting the database.
The gotcha here is that Copilot's database is a neural network. If you take GPL code and feed it as training data to a neural network, along with non-GPL code, to create what is essentially a lookup table, did you just create a derived work? It is unclear to me whether you did. In particular, can the neural network itself be considered "source code"?
Some good responses in sibling comments already, but I don't see the narrow answer here, which is: No, because no distribution of the browser took place.
If you created a weird version of the browser in which a specific URL is hardcoded to show the GPL'd code instead of the result of an HTTP request, and you then distributed that browser to others, then I believe that yes, you'd have to do so under the GPL. (You might get away with it under fair use if the amount of GPL'd code is small, etc.)
So following your own argument, even if Copilot is allowed, using it still risks you falling under GPL
I do not find that to be obvious at all.
Stackoverflow on the other hand is much trickier question...
If I'm Google, and I scan your code and return a link to it when people ask to find code like that (but show an ad next to that link for someone else's code that might solve their problem too), that's fair use and legal. My search engine has probably stored your code in a partial format, and that's fine.
You can wipe your ass with the GPL license if your use of the product falls within Fair Use.
You can actually take snippets from commercial movies and post them onto YouTube if your YouTube video is transformative enough for your usage to be considered fair use. Well, theoretically at least - in reality YouTube might automatically copyright strike it.
>Copying and storing a book isn't recreating another book from it.
That doesn't mean that GitHub has to redistribute Copilot under GPL. However, the end user could potentially have to if they use Copilot to generate new code that happens to copy GPL code verbatim.
Is Copilot fair use? It's reading code, generating other code (some verbatim) and making money from it all while not having to release its source code to the world?
> That doesn't mean that GitHub has to redistribute Copilot under GPL
I wasn't saying that was the case: some of the code that Copilot used may not allow redistribution under GPL.
But let's say that all of the code it scanned was GPL for the sake of argument. Why would they not have to distribute their Copilot source yet, if I use it to generate some code, I'd have to distribute mine?
My spidey-sense is tingling at that one!
Again, fair use is an exception to copyright protection. If something is fair use, the license does not apply. The fact that Copilot does not release its source code is related only to a specific term of a specific license, which does not apply if Copilot is indeed fair use.
More precisely, fair use is an affirmative defense to a claim of copyright infringement. A fair use defense basically says, "Yes, I am copying your copyrighted material and I don't have a license (or am exceeding a licensed use), but my usage is allowed under the fair use doctrine (codified in 17 USC 107 in US law)."
Would it be 'fair use' for developers to simply copy code from those repos - even just 10 lines - and claim 'fair use', i.e. circumventing Copilot?
Even if Copilot is 'fair use' ... does that mean the results are 'fair use' on the part of Copilot users?
And a bigger question: is your interpretation of those statutes and case law enough to make the answer unambiguous?
I don't have legal background, but I do have an operating background with lawyers and tech ... and my 'gut' says that anyone using Copilot is opening themselves up to lawsuits.
If the code you put in your software comes via Copilot, but that code is verbatim from something GPL'd (or worse, proprietary) ... there's a good chance you could get sued if someone gets the inclination.
Maybe it's because of my personal experience, but I can just see corporate lawyers banning Copilot straight up, as the risks are simply not worth the upside. That's not what we like to hear in the classically liberal sense, i.e. 'share and innovate' ... but gosh, it doesn't feel like a happy legal situation to me.
Looking forward to people with more insight sharing on this important topic.
Only a lawyer (and truly, only a court) could answer that question.
If you copy 100 lines of code that amounts to no more than a trivial implementation in a popular language of how to invert a binary tree, it's likely fair use.
If you copy 10 lines of code that are highly novel, have never been written before, and solve a problem no one outside the authors have solved... It may not be fair use to copy that.
Other people who have replied have mentioned "the heart" of a work. The US Supreme Court has held that even de minimis - "minimal", to be brief - copying can sometimes be infringement if you copied the "heart" of a work.
When fair use is an issue, the courts look at the facts in context each time. These are obviously different facts than scanning books for populating a search index and rendering previews; and each side is going to argue that the facts are similar or that they are dissimilar. How the court sees it is going to be the key question.
1. a fascinating Supreme Court opinion.
2. a frustrating ruling because SCOTUS doesn't understand software and code.
3. the type of anticlimactically(?) narrow ruling typical of the Roberts court.
While our Congresspersons can't seem to wrap their minds around technology/social media, I think SCOTUS would understand this one enough to avoid (2).
You can be violating copyright without plagiarizing: you cite your source, but you copied the copyright-protected work in a way the law doesn't allow.
And you can be plagiarizing without violating copyright, if you have the permission of the copyright holder to use their content, or if the content is in the public domain and not protected by copyright, or if it's legal under fair use -- but you pass it off as your own work.
Two entirely separate things. You can get expelled from school for plagiarism without violating anyone's copyright, or sued for copyright infringement without committing any academic dishonesty.
You can indeed have the legal right to make use of content, under fair use or anything else, but it can still be plagiarism. That you have a fair use right does not mean "Oh so that means you are allowed to turn it in to your professor and get an A and the law says you must be allowed to do this and nobody can say otherwise!" -- no.
If Github had a service that automatically mirrored public repositories on Gitlab, that would be equivalent to the example you gave.
But Github is taking content under specific licenses to build something new for commercial use.
I'm not sure if what Github does falls under Fair Use, but I don't know that it matters. I can read fifty books and then write my own, which would certainly rely—consciously or not—on what I had read. Is that a copyright violation? It doesn't seem like it is but maybe it is and until now has been impossible to prosecute?
The end user is.
By this logic any and all neural nets that draw pictures are copyright infringing as well.
...and if you're outside the USA?
That's a reference to factor four of the fair use test, "the effect of the use upon the potential market for or value of the copyrighted work." (17 USC 107).
None of the factors are dispositive, however. For example, a scathing book review that quotes a passage to show how bad the writing is might eviscerate sales of the book, but such a use is usually protected. For a counter-example, see Harper & Row v. Nation Enterprises 471 U.S. 539 (1985).
Exactly the point I came to make.
The Authors’ Guild is a US entity, and so is Google, so only US law applies. And thus, we have the Fair Use exception.
But developers sharing code on GitHub come from and live all over the world.
Now, Github's ToS do include the usual provision stating that US & California law applies, et cætera, et cætera, but (and even they acknowledge this may be the case) such provisions usually aren't considered valid outside of the US.
So… developers from outside the US, in countries with less lenient exceptions to copyright, definitely could sue them.
Identifying these countries and finding those developers, however, is a different matter altogether.
I disagree that Copilot is "more obviously fair use". Some parts might be, but we have seen clear examples (i.e. verbatim code reproduction) that would not be.
I don't believe the question of "is this fair use" is as clear as you believe it to be.
I'm really out of my depth in giving my own opinion here, but I'm not sure that either the "distribution != derivative" characterization, or that "parsing GPL => derivative of GPL" really locks this thing down. The bit that I can't follow with the "distribution != derivative" argument is that the copilot is actually performing distribution rather than "design". I would have said that copilot's core function is generating implementations, which to me does not seem like distribution. This isn't a "search" product, and it's not trying to be one. It is attempting to do design work, and I could see a case where that distinction matters.
For commercial use and derivative works?
Authors won't incorporate snippets of books into new works unless they're reviews. Copilot is different.
If anything, the ways in which Copilot is different aid Microsoft/GitHub's argument for fair use. Because Copilot creates novel new works, that gives them a strong argument their system is more transformative than Google Books, which just presents verbatim copies of books.
Copilot does none of that. If all the ML companies are so sure this is fair use I encourage them to train an AI on Disney movies to generate short cartoon snippets based on some description. There sure would be a court case.
Of course they do, previous works are quoted all the time.
Citing your source is not a get out jail free card for copyright infringement, it doesn't really matter.
No, but it's a requirement of the license stackoverflow.com uses, which is unfortunate, for code (as opposed to text, where a quote can be easily attributed).
Plagiarism is not the same as infringement.
And copyright itself is an exception to the normal state of things: the public domain, copyright being only a temporary monopoly.
Copilot will not write an entire software module; it will provide you with snippets. I see using GPL code for training as fair use. If a developer reads the source code of a project to take inspiration and possibly copy some small parts, does that violate the license?
To me, seeing youtube-dl's case as fair use is so much easier than using hundreds of thousands source code files without permission in order to build a proprietary product.
My point was however that I'm just utterly failing to see how the youtube-dl test thing could be more of a copyright problem than this entire thing based on millions of others' works that is Copilot.
The question on this one will be about the difference between Microsoft/Github's product and a programmer using copilot's code:
"If I feed the entire code base to a machine, and it copies small snippets to different people, do we add the copies up, or just look at the final product?"
They couldn't do it with a license, which only imposes conditions for the license to be valid. Fair use applies even if the copier has no license at all.
Potentially they could do it with a contract. A license is not a contract and imposes no covenants on the parties.
Sure one could argue that Copilot learned in the way a human does. There is nothing that prevents one from learning from copyrighted work, but snippets delivered verbatim from such works are surely a copyright violation.
If relating this to how humans learn, books and other sources are used to inform understanding and human knowledge. One can purchase or borrow a book without actually owning the copyright to it. Indeed, a given passage may be later quoted verbatim, provided it is accompanied with a reference to its source.
Otherwise, a verbatim use without attribution in authored context is considered plagiarism.
So, sure one can use a multitude of material for the training. Yet, once it gets to the use of the acquired "knowledge" - proper attribution is due for any "authentic enough" pieces.
What is authentic enough in this case is not easy to define, however.
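One way to operationalize "authentic enough" is to check a model's output for long verbatim token runs that also appear in the training material, which is the general shape of the duplication filters code-model vendors have discussed. Below is a minimal sketch of that idea; the six-token threshold and whitespace-level tokenization are arbitrary choices for illustration, not anyone's actual filter.

```python
def verbatim_runs(generated: str, corpus_text: str, min_tokens: int = 6):
    """Find runs of at least min_tokens whitespace-separated tokens in
    `generated` that also appear verbatim in `corpus_text`.  A non-empty
    result suggests the output may need attribution (or may carry the
    corpus's license terms)."""
    corpus = " ".join(corpus_text.split())  # normalize whitespace
    tokens = generated.split()
    hits = []
    i = 0
    while i <= len(tokens) - min_tokens:
        run = " ".join(tokens[i:i + min_tokens])
        if run in corpus:
            hits.append(run)
            i += min_tokens  # skip past the matched run
        else:
            i += 1
    return hits
```

The hard part the comment identifies remains: any fixed threshold is a line drawn in sand, and short runs of boilerplate will always slip through while long-but-trivial runs get flagged.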
At some point, neural nets like GameGAN might be good enough to duplicate (and optimize) a commercial game. Can you then release your version of the game? Do you just need to make a few tweaks? Are we going to get a double standard because commercial interests are opposed depending on the use case?
It would be pretty funny if Microsoft as a game publisher lobbies to prevent their IP being used w/ something like GameGAN, but then takes the opposing stand point for something like their CoPilot! Although I'm sure it'll be spun as "These things are completely different!".
Maybe we will some day, but for now this isn't the case, where the law is concerned:
Of course the interesting part is that the user not only has no idea what that license is but also where the code came from and if it is in fact copied verbatim. It's unlikely a court would agree that putting licensed code through a machine strips the licensing requirements of the code, of course, but that doesn't seem to be Microsoft's problem.
I think Microsoft's use of public code hosted on GitHub is covered by the terms of service. But if that use includes granting a license more permissive than the license indicated on the code itself, it would probably put every GitHub user who ever committed less permissively licensed code they didn't control to GitHub in violation of those licenses.
There's really only three ways this can go:
1) Machine learning does legally become a license-stripping black box, which would allow creating a machine generated commons by feeding arbitrary copyrighted works into sloppy AIs that mostly just replicate their input without changes.
2) Copyright law is extended to consider the output of machine learning as derived works from its inputs, massively extending the reach of copyright and creating massive headaches for everyone (e.g. depending on the exact ruling this would effectively make it impossible to reproduce a digital artwork as merely rendering it on a screen would create a derived work).
3) The original licenses are upheld and remain in effect, rendering the output of Copilot useless by creating a massive legal headache for anyone trying not to violate copyright.
I think outcome 2 is unlikely but 1 and 3 aren't mutually exclusive.
There's no point to Copilot without training data, and some but not all of the training data was (A)GPL. There's no point to GitHub without hosted code, and some but not all of the code it hosts is (A)GPL.
The code in either case is data or content; it has not actually been incorporated into the Copilot or GitHub product.
GitHub's TOS include granting them a separate license (i.e., not the GPL) to reproduce the GPL code in limited ways that are necessary for providing the hosting service. This means commonsense things like displaying the source text on a webpage, copying the data between servers, and so on.
A bit of a tangent and it’s fictional, but I really have to recommend the tale of MMAcevedo. https://qntm.org/mmacevedo
I know HN loves a good "well actually" and Microsoft is always suspect, but let's leave the idea of code laundering to the Oracle lawyers. Let hackers continue to play and solve interesting problems.
Copilot should be inspiring people to figure out how to do better than it, not making hackers get up in arms trying to slap it down.
If you're asking about the moral reaction here, I think it depends on how one views Copilot. Does Copilot create basically original code that just happens to include a few small snippets? Or does Copilot actually generate a large portion of lightly changed code when it's not spitting out verbatim copies? I mean, if you tell Copilot "make me a QT-compatible, cross-platform windowing library" and it spits out a slightly modified version of the QT source code, and someone starts distributing that with a very cheap commercial license, that would be a problem for the QT company, which licenses its code either commercially or under the GPL (and since QT is a library, the QT GPL forces users to also release their code under the GPL if they release it, so it's a big restriction). So in the worst-case scenario, you get something ethically dubious as well as legally dubious.
Why can't we do both? I mean, I am quite interested in AI and its progress, and I also think it's important to note the way that AI "launders" a lot of things (launders bias, launders source code, etc.). AI scanning of job applications has all sorts of unfortunate effects, etc. But my critique of the applications doesn't make me uninterested in the theory; they're two different things.
Still, some of the moral outrage here has to do with it coming from Github, and thus Microsoft. Software startup Kite has largely gone under the radar so far, but they launched this back in 2016. Github's late to the game. But look at the difference (and similarities) in responses to their product launch posts here.
Maybe Github isn't violating the licenses of the programmers who host on them. Maybe Copilot doesn't just spit out code that belongs to other people. Those are matters of interpretation and debate.
But if Github was doing this with Copilot, virtually every open source programmer would have a reason to be upset. Open source programmers don't give their code out for free; they license it. This is a legal position, not a feeling. "Intellectual property" may be a pox on the world, but asking open source developers to abandon their licenses to ... closed source developers is legitimately a violation.
And before the spitting-out-source-code problem appeared, I recall quite a few positive responses to Copilot. Lots of people still seem excited. And yeah, people are looking at the downside given Microsoft's long abusive history, but hey, MS did do those things.
I expect that we'll need new copyright law to protect creators from this kind of thing (specifically, to give creators an option to make their work public without allowing arbitrary ML to be trained on it). Otherwise the formula for ML based fair use is "$$$ + my things = your things" which is always a recipe for tension.
Also people rarely do it; I've caught maybe a couple instances of it in my career and I never really thought too much about them again. This tool helps make it a lot easier and more common. I have a feeling other people chiming in are also in the camp of "Oh, this is going to be a thing now, huh?"
I also can't help but think that my negative opinion of it isn't solely based on this provenance issue. While it's cool, it seems questionable how practical it is. If the value were clearer, I think I could stomach the risk a bit better.
First, you might choose to distribute your code under a copyleft license to advance the OSS ecosystem. Second, the older you get, the more experience you accumulate, paradoxically the harder it is for you to find a job or advance your career in this industry—so, to maintain at least some source of motivation for tech companies to hire you, you may choose to make some of the source available, but reserve all the rights to it.
You’re fine making the source of your tool or library open for anyone to pass through the lens of their own consciousness and learn from it, but not to use as is for own benefit.
Now with GitHub Copilot you suddenly see the results of the labour you've previously made public (under the above assumptions) being passed through some black box, magically stripped of your license's protections, and used to provide ready-made solutions to everyone from kids cheating at college tests to well-paid senior engineers simply lacking your expertise.
I hope it's easy to spot how the engineer's interests in the above example are not necessarily aligned with GitHub's, how this may be perceived as an unfair move disadvantaging veteran rank-and-file software engineers while benefitting corporate elites and investors, and how it subsequently has the potential to disincentivize source code sharing and deal a blow to the OSS ecosystem as a whole.
Personally, I think that in the age of AI programming any notion of code licensing should be abolished. There is no copyright for genes in nature or memes in culture; similarly, there shouldn't be copyright for code.
I still think we're a long way from that. Copilot will help write code quicker, but it's not doing anything you couldn't do with a Google search and copy/paste. Once developers move beyond the jr. level, writing code tends to become the least of their worries.
Writing the code is easy, understanding how that code will affect the rest of the system is hard.
It's just a smarter tab-completion.
https://analyticsindiamag.com/open-ai-gpt-3-code-generator-a... has a bunch of videos of this in action.
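The "smarter tab-completion" view has a trivial baseline worth keeping in mind: complete the current prefix with the most frequently seen matching line from past code. On this argument, Copilot is the same idea with a language model in place of a frequency table. Everything below is an invented illustration, not a description of any real completer.

```python
from collections import Counter

class TabComplete:
    """A crude line completer: the most frequent previously seen line
    that starts with the typed prefix wins."""

    def __init__(self, training_lines):
        # Count how often each full line appeared in the "training" code.
        self.lines = Counter(training_lines)

    def complete(self, prefix):
        """Return the most common training line starting with prefix, or None."""
        candidates = [line for line in self.lines if line.startswith(prefix)]
        if not candidates:
            return None
        return max(candidates, key=lambda line: self.lines[line])
```

Note that even this toy version can only ever emit lines verbatim from its training data, which is one reason the verbatim-reproduction examples people have posted don't by themselves settle whether Copilot is "just" completion or something more.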
I feel like this comment misunderstands what a software developer is doing. Copilot isn't going to understand the underlying problem to be solved. It's not going to know about the specific domain and what makes sense and what doesn't.
We're not going to see developers replaced in our lifetime. For that you need actual intelligence - which is very different from the monkey see monkey do AI of today.
Having a semi-intelligent monkey that can fetch obvious things off the shelf, build very basic control structures, and do the boring little housekeeping tasks is bad for the craft of programming but very good for the good-enough-solution situation. I can see it having the same impact as cheap and widely available digital cameras; anyone can be a kinda decent photographer now, but if you want to be a professional you're probably going to have to work a lot harder to stand out, whether that's by development of craft, development of narrow technical expertise and fancy equipment, or development of excellent business skills.
Photography is a good analogy - with everyone having fancy cameras you could think that a photographer is now not necessary. But yes there are still photographers about - they see things that the average person doesn't. The camera doesn't tell them what type of photos to take, what composition the photo should have or what poses a model should have.
New genetic sequences are patentable, not copyrightable, but that's because of the process involved in creating new genetic sequences more than because of the genes themselves.
Sure naturally occurring genes aren't patentable, but it's not like we have code growing on trees. So that's a terrible comparison.
If they made the trained model public (and also trained it on private code) the response would be completely different.
Since when are humans not a part of nature?
It only happens at boss level when tech giants litigate IP issues.
"Hackers" "playing" and ignoring copyright is fine, but Copilot isn't promoted as a toy, it's promoted as a tool for professional software development. And in that framing it is about as dangerous as an untrained intern with access to the production server.
I don't care if MS copies my hobby projects exactly, but I'm not sure my employer (a defense contractor) would even be allowed to use a tool like this.
I think it looks cool though. I will probably try it out if it is ever available for free and works for the languages I use.
One of the (many) problems is that GitHub/Microsoft already benefit from runaway network effects so it’s difficult to “do better”. Where will you get all of that training code if not off GitHub?
The real answer to this is to yank your projects from GitHub now while you search for alternatives.
Whether or not this move is “legal”, it should serve as a wake up call that GH is not actually a service we should be empowering. This incident is just one example of why that’s a bad idea.
You would have a much stronger case if they had taken your code from elsewhere.
Want Linux to run on your thing? You must publish driver source then or you're violating copyright law. This was less a big deal before device vendors ratcheted the pathological behavior up to 11 with smartphones and that's why far more people seem to react far more strongly now.
But then again, I migrated away from GitHub as soon as MS bought it.
still, it's a matter of principle