I am not a lawyer, but I am capable of summarizing the thoughts of lawyers, so my take is that in general, fair use allows AI to be trained on copyrighted material, and humans who use this AI are not responsible for minor copyright infringement that happens accidentally as a result. However, this has not been tested in court in detail, so the consensus could change, and if you were extremely risk-averse you might want to avoid Copilot.
A key quote from the second link:
Copyright has concluded that reading by robots doesn’t count. Infringement is for humans only; when computers do it, it’s fair use.
Personally, I think law should allow Copilot. As a human, I am allowed to read copyrighted code and learn from it. An AI should be allowed to do the same thing. And nobody cares if my ten-line "how to invert a binary tree" snippet is the same as someone else's. Nobody is really being hurt when a new tool makes it easier to copy little bits of code from the internet.
> As a human, I am allowed to read copyrighted code and learn from it.
Of course not. Reading some copyrighted code can get you entirely excluded from some jobs: you can't become a Wine contributor if it can be shown you ever read Windows source code, and most likely the converse holds as well.
Likewise, you can't ever write GPL VST 2 audio plug-ins if you ever had access to the official Steinberg VST2 SDK. Etc etc...
Did people forget why black-box reverse engineering of software ever came to be?
> Of course not. Reading some copyrighted code can have you entirely excluded from some jobs
That's not a law. That's a cautionary decision made by those companies or projects to make it more difficult for competitors to argue that code was copied.
Those projects could hire people familiar with competitor code and assign them to competing projects if they wanted. The contributors could, in theory, write new code without using proprietary knowledge from their other companies. In practice, that's actually really difficult to do and even more difficult to prove in court, so companies choose the safe option and avoid hiring anyone with that knowledge altogether.
Now the question is whether or not GitHub's AI can be argued to have proprietary knowledge contained within. If your goal is to avoid any possibility that any court could argue that GitHub copilot funneled proprietary code (accessible to GitHub copilot) into your project, then you'd want to forbid contributors from using CoPilot.
In this case, though, we have a machine learning model that is trained on some code and is not merely learning abstract concepts to be applied generally in different domains; instead, it can use that knowledge to produce code that looks pretty much the same as the learning material, given a context that fits the learning material.
If humans did that, it would be hard to argue they didn't outright copy the source.
When a machine does it, does it matter if the machine literally copied it from sources, or first transformed it into an isomorphic model in its "head" before regurgitating it back?
If yes, why doesn't parsing the source into an AST and then rendering it back also insulate you from copyright obligations?
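The AST round trip mentioned above is trivially mechanical, which is the point: a minimal sketch using Python's standard `ast` module (requires Python 3.9+ for `ast.unparse`; the snippet itself is illustrative):

```python
import ast

original = "def add(a, b):\n    return a + b\n"

tree = ast.parse(original)    # "read" the source into a syntax tree
rebuilt = ast.unparse(tree)   # render the tree back out as source

# The rebuilt source is near-identical to the input, which is why a
# mechanical transform alone shouldn't be expected to launder copyright.
print(rebuilt)
```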
>When a machine does it, does it matter if the machine literally copied it from sources, or first transformed it into an isomorphic model in its "head" before regurgitating it back?
You've hit the nail on the head here. If this is okay, then neural nets are simply machines for laundering IP. We don't worry about people memorizing proprietary source code and "accidentally" using it because it's virtually impossible for a human to do that without realizing it. But it's trivial for a neural net to do it, so comparisons to humans applying their knowledge are flawed.
That's a really good observation. Perhaps it highlights an essential difference between two modes of thought - a fuzzy, intuitive, statistical mode based on previously seen examples, and a reasoned, analytical calculating mode which depends on a precise model of the system. Plausibly, the landscape of valid musical compositions is more continuous than the landscape of valid source code, and therefore more amenable to fuzzy, example-based generation; it's entirely possible to blend two songs and make a third song. Such an activity is nonsensical with source code, and so humans don't even try. We probably do apply that sort of learning to short snippets (idioms), but source code diverges too rapidly for it to be useful beyond that horizon.
This is not such a big problem in reality, because the output of Copilot can be filtered to exclude snippets too similar to the training data, or to any corpus of code you want to avoid. It's much easier to guarantee clean output than to train the model in the first place.
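A crude version of such a filter is not hard to sketch. Here is one naive approach, token n-gram overlap against a "do not emit" corpus; the function names and the threshold are made up for illustration, and real duplicate detection is considerably more sophisticated:

```python
def ngrams(tokens, n=6):
    """All length-n windows of a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def too_similar(suggestion, corpus_ngrams, n=6, threshold=0.5):
    """Flag a suggestion when too many of its n-grams occur verbatim
    in the corpus we want to avoid reproducing."""
    cand = ngrams(suggestion.split(), n)
    if not cand:
        return False  # too short to judge
    overlap = len(cand & corpus_ngrams) / len(cand)
    return overlap >= threshold

# Build the reference set once from the code that must not be emitted.
avoid = ngrams("for i in range ( len ( xs ) )".split(), 6)
print(too_similar("for i in range ( len ( xs ) )", avoid))  # True: flagged
```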
> then you'd want to forbid contributors from using CoPilot
I mean, if you used CoPilot on one computer, stared at it intensely for an hour, closed that computer, and then typed out code on the other computer that you were contributing from, you technically didn't use it for the contribution; you just used CoPilot for your education.
Intellectual property is itself a flawed concept in many ways. It's like asking someone to do physics research but forbidding them from using anything that Einstein wrote.
Intellectual property itself is silly. How can a thought be the property of someone?
Secrecy is the solution if you don't want others to learn from you (like Coca-Cola does).
It's not a natural right; we supposedly do it to stimulate innovation by offering a reward, and in order to get things into the public domain -- obviously Disney (and the politicians that kowtowed to them) ruined that for the world.
Patent terms should have shrunk along with product lifecycles, and copyright should last a similar period; maybe 10-14 years.
It's not silly, it's an evolved and pragmatic solution to the question of how society can incentivize creative work. More or less every society has developed some notion of IP and there's little appetite in wider society to debate it - the idea of abolishing IP laws is deeply fringe and only really surfaces in forums like this one.
Does it have flaws and can it be improved upon? Sure. I think society underweights what improvements to the patent system in particular could do. But such ideas are so niche they are hardly even written down, let alone debated at large. Society has bigger issues on its mind.
Like any evolved system IP law encounters new challenges over time and will be expected to evolve again, which it will surely do. A simple fix for Copilot is surely to just exclude all non-Apache2/BSD/MIT licensed code. Although there might technically still be advertising clause related issues, in practice hardly anyone cares enough to go to court over that.
If you read the video with a view to reproducing it then you created a derivative work, ie copyright infringement.
If you just used it for inspiration, that's fine; if the way it was coded is a result of technical constraints, that's fine too; if the code is generic it's not distinctive enough to acquire copyright in the first place.
>That's not a law. That's a cautionary decision made by those companies or projects to make it more difficult for competitors to argue that code was copied.
and they made those decisions based on the need to be able to argue in court that code was not copied.
>then you'd want to forbid contributors from using CoPilot
Right, arguing over whether Copilot spits out a ten-line function verbatim is not really where the problem will be. The problem is that a human programmer still needs to run Copilot, and they will be the one shown in the codebase as the author of the code (they could of course put a comment 'I got this bit from Copilot', but that might be cumbersome and would hardly work as proof anyway). Although I suppose the issue would cover not just proprietary code but any code with an incompatible license.
> >That's not a law. That's a cautionary decision made by those companies or projects to make it more difficult for competitors to argue that code was copied.
> and they made those decisions based on the need to be able to argue in court that code was not copied.
Yeah, but only to make it easier for them to argue it; the letter of the law doesn't require it. You could argue that "Sure, I read Windows source code once -- but that was years ago and I can't remember shit of it, so anything I wrote now is my own invention." That might be harder to get the court to accept as a fact, but it's not a prima facie legal impossibility.
>That's not a law. That's a cautionary decision made by those companies or projects to make it more difficult for competitors to argue that code was copied.
Okay, so it's not law, it's just a policy compelled by preceding legal judgements. Case law, perhaps.
In general, you're absolutely allowed to learn programming techniques from anywhere. You can contribute software almost anywhere even if you've read Windows source code. Re-using everything you've learned, in your own creative creation, is part of fair use.
Your example is the very specific scenario where you're attempting to replicate an entire program, API, etc., to identical specifications. That's obviously not fair use. You're not dealing with little bits and pieces, you're dealing with an entire finished product.
> Your example is the very specific scenario where you're attempting to replicate an entire program, API, etc., to identical specifications. That's obviously not fair use. You're not dealing with little bits and pieces, you're dealing with an entire finished product.
No - Google's 9 lines of a sorting algorithm (IIRC), copied from Oracle's implementation, were not considered fair use in the Google / Oracle debacle.
Likewise SCO claimed that 80 copied lines (in the entirety of the Linux source code) were a copyright violation, even if we never had a legal answer to this.
Nope, those lines were specifically excluded from the prior judgment, and the Supreme Court did not issue another judgment on them:
> With respect to Oracle’s claim for relief for copyright infringement, judgment is entered in favor of Google and against Oracle except as follows: the rangeCheck code in TimSort.java and ComparableTimSort.java, and the eight decompiled files (seven “Impl.java” files and one “ACL” file), as to which judgment for Oracle and against Google is entered in the amount of zero dollars (as per the parties’ stipulation).
The fair use ruling was about Google's API reimplementation.
It becomes a whole different case with a 1:1 copy of code.
And don't forget fair use works in the US, not necessarily in the rest of the world.
But I'm happy about all the new GPL programs created by Copilot
That Supreme Court ruling doesn't appear to address the claims of actual copied code (the rangeCheck function), only the more nebulous API copyright claims.
This is true, but there's also a murkier middle option. I used to work for a company that made a lot of money from its software patents but I was in a division that worked heavily in open-source code. We were forbidden to contribute to the high-value patented code because it was impossible to know whether we were "tainted" by knowledge of GPL code.
Same here. I worked at a NAS storage (NFS) vendor and this was a common practice. Could not look at server implementation in Linux kernel and open source NFS client team could not look at proprietary server code.
No, you are not, guaranteed (I think; not a lawyer).
At least from a copyright point of view.
TL;DR: Being in the right and having an easy defense in a lawsuit are not the same.
BUT separating teams makes defending any copyright or patent lawsuit against them much easier. It also prevents any employee from "copying GPL (or similar) code verbatim from memory"(1) (or, even worse, from the clipboard). Sure, the employee "should" not do it, but by separating them you can be more sure they don't, which in turn makes it easier to defend in court, especially w.r.t. "independent creation".
There is also patent law shenanigans.
(1): Which is what GitHub Copilot is sometimes doing IMHO.
This model doesn't learn and abstract: it just pattern matches and replicates; that's why it was shown exactly replicating regions of code--long enough to not be "de minimis" and recognizable enough to include the comments--that happen to be popular... which would be fine, as long as the license on said code were also being replicated. It just isn't reasonable to try to pretend Copilot--or GPT-3 in general--is some kind of general purpose AI worthy of being compared with the fair use rights of a human learning techniques: this is a machine learning model that likes to copy/paste not just tiny bits of code but entire functions out of other peoples' projects, and most of what makes it fancy is that it is good at adapting what it copies to the surrounding conditions.
Have you used Copilot? I have not, but I have trained a GPT2 model on open source projects (https://doesnotexist.codes/). It does not just pattern match and replicate. It can be cajoled into reproducing some memorized snippets, but this is not the norm; in my experience the vast majority of what it generates is novel. The exceptions are extremely popular snippets that are repeated many many times in the training data, like license boilerplate.
Perhaps Copilot behaves very differently from my own model, but I strongly suspect that the examples that have been going around twitter are outliers. Github's study agrees: https://docs.github.com/en/github/copilot/research-recitatio... (though of course this should be replicated independently).
So, to verify: your claim is that GPT-3, when trained on a corpus of human text, isn't merely managing to string together high-probability sequences of symbols--which is how every article I have ever read on how it functions describes the technology--but is instead managing to build a model of the human world and of the mechanism of narration required to describe it, which it then uses to write new prose... a claim you must make in order to then argue that GPT-3 works like a human engineer, learning a model of computers, libraries, and engineering principles from which it can then write code, instead of merely using pattern recognition as I stated? As someone who spent years studying graduate linguistics and cognitive science (though admittedly 15-20 years ago, so I certainly haven't studied this model: I have only read about it occasionally in passing), I frankly think you are just trying to conflate levels of understanding in order to make GPT-3 sound more magical than it is :/.
What? I don't think I made any claim of the sort. I'm claiming that it does more than mere regurgitation and has done some amount of abstraction, not that it has human-level understanding. As an example, GPT-3 learned some arithmetic and can solve basic math problems not in its training set. This is beyond pattern matching and replication, IMO.
I'm not really sure why we should consider Copilot legally different from a fancy pen – if you use it to write infringing code then that's infringement by the user, not the pen. This leaves the practical question of how often it will do so, and my impression is that it's not often.
It's not really comparable to a pen. Because a pen by itself doesn't copy someone else's code/written words. It's more like copying code from Github or if you wrote a script that did that automatically. You have to be actively cautious that the material that you are copying is not violating any copyrights. The problem is Copilot has enough sophistication to for example change variable names and make it very hard to do content matching. What I can guarantee it won't be able to do is to be able to generate novel code from scratch that does a particular function (source: I have a PhD in ML). This brute-force way of modeling computer programs (using a language model) is just not sophisticated enough to be able to reason and generate high level concepts at least today.
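That renaming problem is exactly why plain text matching fails. A toy sketch of structural comparison that ignores bare variable names, using Python's standard `ast` module (illustration only; real clone detection handles far more than `Name` nodes):

```python
import ast

class NameEraser(ast.NodeTransformer):
    """Replace every bare identifier with a placeholder so that
    renamed copies of the same code compare equal structurally."""
    def visit_Name(self, node):
        return ast.copy_location(ast.Name(id="_", ctx=node.ctx), node)

def same_shape(src_a: str, src_b: str) -> bool:
    """True when two snippets have identical structure modulo renaming."""
    dump = lambda s: ast.dump(NameEraser().visit(ast.parse(s)))
    return dump(src_a) == dump(src_b)

print(same_shape("total = price * qty", "t = p * q"))  # True: just a rename
print(same_shape("total = price * qty", "t = p + q"))  # False: different operator
```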
The argument I was responding to--made by the user crazygringo--was that GPT-3 trained on the Windows source code is fine to use nigh unto indiscriminately, as supposedly Copilot is abstracting knowledge like a human engineer. I argued that it doesn't do that: that GPT-3 is a pattern recognizer that not only theoretically likes to memorize and regurgitate things, but has been shown to do so in practice. You then responded to my argument claiming that GPT-3 in fact... what? Are you actually defending crazygringo's argument or not? Note carefully that crazygringo explicitly stated that copying little bits and pieces of a project is supposedly fair use, continuing the--as far as I understand, incorrect--assertion by lacker (the person who started this thread) that copying someone's binary tree implementation would be fair use, as the two of them seem to believe that you have to copy essentially an entire combined work (whatever that means to them) for something to be infringing. Honestly, it now just seems like you jumped into the middle of a complex argument to make a pedantic point: either you agree that GPT-3 is like a human that is allowed to, as crazygringo insists, read and learn from anything and then use that knowledge in any way it sees fit, or you agree with me that GPT-3 is a fancy pattern recognizer that can and will generate copyright infringements if used to solve certain problems. Given your new statements about Copilot being a "fancy pen" that can in fact be used incorrectly--something crazygringo seems to claim isn't possible--you frankly sound like you agree with my arguments!!
I think a crucial distinction to be made here, and with most 'AI' technologies (and I suspect this isn't news to many people here) is that – yes – they are building abstractions. They are not simply regurgitating. But – no – those abstractions are not identical (and very often not remotely similar) to human abstractions.
That's the very reason why AI technologies can be useful in augmenting human intelligence; they see problems in a different light, can find alternate solutions, and generally just don't think like we do. There are many paths to a correct result and they needn't be isomorphic. Think of how a mathematical theorem may be proved in multiple ways, but the core logical implication of the proof within the larger context is still the same.
Statistical modelling doesn't imply that GPT-3 is merely regurgitating. There are regularities among different examples, i.e. abstractions, that can be learned to improve its ability to predict novel inputs. There is certainly a question of how much Copilot is just reproducing input it has seen, but simply noting that its a statistical model doesn't prove the case that all it can do is regurgitate.
One way to look at these models is to say that they take raw input, convert it into a feature space, manipulate it, then output back as raw text. A nice example of this is neural style transfer, where the learnt features can distinguish content from style, so that the content can be remixed with a different style in feature space. I could certainly imagine evaluating the quality of those features on a scale spanning from rote-copying all the way up to human understanding, depending on the quality of the model.
Imagine for a second a model of the human brain that consists of three parts. 1) a vector of trillion inputs, 2) a black box, and 3) a vector of trillion outputs. At this level of abstraction, the human brain "pattern matches and replicates" just the same, except it is better at it.
Human brains are at least minimally recurrent, and are trained on data sets that are much wider and more complex than what we are handing GPT-3. I have done all of these standard thought experiments, and even developed and trained my own neural networks back before there were libraries that allowed people to "dabble" in machine learning: if you consider the implications of humans being able to execute Turing-complete thoughts, it should become obvious that the human brain isn't merely doing pattern-anything... it sometimes does, but you can't just conflate the two and call it a day.
The human brain isn't Turing-complete as that would require infinite memory. I'm not saying that GPT-3 is even close, but it is in the same category. I tried playing chess against it. According to chess.com, move 10 was its first mistake, move 16 was its first blunder, and past move 20 it tried to make illegal moves. Try playing chess without a chessboard and not making an illegal move. It is difficult. Clearly it does understand chess enough not to make illegal moves as long as its working memory allows it to remember the game state.
Hmm... but a finite state machine with an infinite tape is Turing complete too. If you're allowed to write symbols out and read them back in, you've invalidated the "proof" that humans aren't just doing pattern matching.
How so? The page you link offers three definitions[1], and all of them require an infinite tape.
You could argue that a stack is missing in my simplified model of the human brain, which would be correct. I used the simple model in allusion to the Chinese room thought experiment which doesn't require anything more than a dictionary.
Turing completeness applies to models of computation, not hardware. Otherwise, nothing would be Turing-complete because infinite memory doesn't exist in the real world. Just read the first sentence of what you linked to:
In computability theory, several closely related terms are used to describe the computational power of a computational system (such as an abstract machine or programming language)
Human thought isn't anything like GPT thought - humans can spend a variable amount of time thinking about what to learn from "training data" and can use explicit logic to reason about it. GPT is more like a form of lossy compression than that.
This is called prompt engineering. If you find a popular, frequently repeated code snippet and then fashion a prompt that is tailored to that snippet then yes the NN will recite it verbatim like a poem.
But that doesn't mean it's the only thing it does or even that it does it frequently. It's like calling a human a parrot because he completed a line from a famous poem when the previous speaker left it unfinished.
The same argument was brought up with GPT too and has been long debunked. The authors (and others) checked samples against the training corpus and it only rarely copies unless you prod it to.
I don't know if I agree with your argument about GPT-3, but I think our disagreement is beside the point: if your human parrot did that, they would--not just in theory but in actual fact! see all the cases of this in the music industry--get sued for it, even if they claimed they didn't mean to and it was merely a really entrenched memory.
The point is that many of the examples you see are intentional, through prompt engineering. The pilot asked the copilot to violate copyright, the copilot complied. Don't blame the copilot.
There also are cases where this happens unintentionally, but those are not the norm.
Transformers do learn and abstract. Not as well as humans, but for whatever definition of innovation or creativity you want to run with, these GPT models have it. It's not magic, it's math, but these programs are approximating the human function of media synthesis across narrowly limited domains.
These aren't your crazy uncle's Markov chain chatbots. They're sophisticated Bayesian models trained to approximate the functions that produced the content used in training.
The model and attention mechanism produces Bayesian properties, but transformers as a whole contain non-Bayesian aspects, depending on how rigorous you want to be in defining Bayesian.
In my experience open source has now become so prevalent that I think some young developers could be completely caught out if the pendulum swings the other way.
Semi-related: the GNU/Linux copypasta is now more familiar to some than the GNU project in general. This is a shame to me, as I view the copypasta as mocking people who worked very hard to achieve what GNU has achieved and who ask only for some credit.
Yeah... but they didn't say it was the law that got you excluded from working on some projects for reading copyrighted code. It's corporate policy that does that; it's not a law, but they do it based on who owns the copyright. Not everything that impacts you is a law.
They said
> Reading some copyrighted code can have you entirely excluded from some jobs
And they're right. It's because of corporate policies. They never said it was because of a law - you imagined that out of nothing.
No that’s not true. I did not edit my posts after reading their reply, and the false accusation was that I changed my comment after it was replied to.
I didn’t challenge whether the question was in good faith, but I’ll just note that the relevant discussion of copyright got dropped in favor of an ad-hominem attack.
My question of which “it” was being referred to is a legitimate question that I believe clarified the intent of my comment, and I added it to make clear I was talking about what @lacker said, not what @jcelerier wrote.
> Edit - I’m adding another point as an edit to show another way to communicate. Would any of your points been lost had you done something similar?
This doesn’t answer my question of why an edit should not be made before I see any replies, nor of why any edit is “poor form” and according to whom. I made my edit immediately. I’m well aware of the practice of calling out edits with a note, I’ve done it many times. I don’t feel the need to call out every typo or clarification with an explicit note, especially when edited very soon after the original comment.
Thanks? Edits exist before you finish replying too, right? Maybe point that out to @chrisseaton, whose incorrect assumption was that I edited in response to what he wrote.
It's dependent on jurisdiction. Black box reverse engineering is only required in certain countries. If I remember correctly, most of Europe doesn't require it.
> > As a human, I am allowed to read copyrighted code and learn from it.
> Of course not. Reading some copyrighted code can make you entirely excluded from some jobs - you can't become a wine contributor if it can be shown you ever read Windows source code and most likely conversely.
You can of course read the code. The consequences are thus increased limitations, like you say.
What you mention is not an absolute restriction from reading copyrighted material. You perhaps have to cease other activities as a result.
If you've ever read a book or interacted with any product, you've learned from copyrighted material.
You've extrapolated "some organizations don't allow you to contribute if you've learned from the code of their direct competitor" to "You're not allowed to learn from copyrighted code", which is absurd.
> Reading some copyrighted code can have you entirely excluded from some jobs - you can't become a wine contributor if it can be shown you ever read Windows source code and most likely conversely.
If that's the case, it should be easy to kill a project like wine - just send every core contributor an email containing some Windows code.
Nobody could guarantee whether that thing is really Windows code or a fake, not without the sender self-identifying as a well-known senior MS employee with access to it. In that case the sender would be doing something illegal and against MS's interests.
The result would be Wine gaining an advantage (redoing the snippet of code in a totally new and different way) and MS being forced to show part of its private code, which would also expose them to patent trolls.
Would be a win-win situation for Wine and a lose-lose situation for MS.
It's very clearly visible on the Wine wiki that people who have ever seen Microsoft Windows source code cannot contribute to Wine due to copyright restrictions:
> Nobody is really being hurt when a new tool makes it easier to copy little bits of code from the internet.
That's the first time I've heard Copilot described as copying little bits of code from the Internet. Copilot aggregates all GitHub source code, removes the licenses from the code, and regurgitates the code without them.
Furthermore, both GitHub and the programmers using Copilot know this. Look at any one of these threads written by programmers about Copilot. Using Copilot is knowingly stealing the source code of others without attribution. Using Copilot is literally humans stealing source code from others. Copilot was written for the purpose of taking others' code.
It's not "literally" stealing, because it doesn't deprive anyone of the use of the source code. Those two points were somehow extremely obvious to everyone here as long as it was music and movies we were talking about.
And Github themselves have stated that only 0.1% of the Copilot output contains chunks taken verbatim from the learning set. Of those, the vast majority are likely to be boilerplate so generic it's silly to claim ownership, and maybe sometimes impossible to avoid.
It is actually true; in the UK at least, the legal definition of theft includes depriving the owner of the property in question.
The copyright lobby hedges the term as "copyright theft" (i.e. not actual theft) in order to shift the societal understanding. Which appears to have worked.
This is not a value judgement on copyright infringement. Just that technically it doesn't meet the legal definition of theft.
cf. The rather amusing satire of the "you wouldn't steal a handbag" campaign in the UK, which ran "you wouldn't download a bear!"
Actually, it's not theft in the US either; it's intellectual property rights infringement. The way you're defining it, memes are theft. There is also a thing called fair use, for when you don't use a significant portion of a copyrighted work, which is why memes and small bits of code aren't infringement when you use them in a different context.
Oh, then today I learned! I didn't realise they were different. Just looked it up in a "plain English dictionary of law" and the distinction seems subtle but important. Rather than "with the intention of depriving the owner", the US one says "with the intention of converting it to their use", which seems broad enough to cover exploiting a copy, rather than the original (or only, in the physical realm...)
Oh Idunno, it "depends on what the meaning of 'is' is"...
> Rather than "with the intention of depriving the owner", the US one says "with the intention of converting it to their use", which seems broad enough to cover exploiting a copy
...or rather, on the meaning of "converting". I've always thought of that as "changing", i.e. "it used to be one thing, and now it's something else". But copying IP only adds a use of it; it doesn't fundamentally change it in this sense: it is still available for the original proprietor's use. Is that really "converted"?
At least for the ordinary-English usage of the word, I think it could be argued that it isn't. But then maybe this isn't just English; maybe the word "converting" also has some term-of-art definition in that dictionary?
The US definition seems more robust, as otherwise, I could somehow steal something you built (e.g. a farm) and then generously allow you to continue using it, perhaps for a fee. You would therefore not be deprived of it but I would still be the new owner or user.
It seems unlikely this distinction would ever matter in a real court though.
The issue isn't an AI reading copyrighted code, the issue is an AI regurgitating the lines of copyrighted code verbatim. To be clear, humans aren't allowed to do this either.
And sure, nobody cares about your stupid binary tree, but do they care about GNU and the Linux kernel? Imagine someone trained an AI to specifically output Linux code, and used it to reproduce a working OS. Is that fair?
> Copyright has concluded that reading by robots doesn’t count. Infringement is for humans only; when computers do it, it’s fair use.
This is silly. Copilot is not reading by itself; someone pushed buttons telling it to read and write. If I clone the whole of GitHub while stripping the licenses, I am telling a robot to do it; that doesn't make it right.
> As a human, I am allowed to read copyrighted code and learn from it. An AI should be allowed to do the same thing.
You're taking the "learning" metaphor too literally. Machine learning models do not learn. They can and do encode their training material into their weights and biases, too. That's what Copilot was doing, regurgitating parts of its training data line for line.
To me, that is not much different from transforming a copyrighted piece of work with, say, compression, a lossy codec, or cropping. There are plenty of people who can learn to play Metallica songs really well, but if they copied specific aspects of their work it would be copyright infringement as well.
A human being can literally learn. We can understand abstract principles from one copyrighted work and apply them to another without actually infringing its copyright. An ML model does not understand; it is a statistical model. It is inherently a derivative work, and it often encodes the copyrighted work it was trained on into the model itself.
That function appears in hundreds, if not thousands of GitHub repos. It's plausible that it's the most famous block of code ever. Are all those repos guilty of copyright infringement?
The only way this argument could be less alarming to me is if people were bothered that it was writing the same Hello World as somebody else, or that it was naming variables "foo" and "bar".
Let's wait until we have a bulletproof, egregious, and inexcusable case of it lifting code before we panic.
I don't think it should be infringement. You can't copyright an algorithm, and Carmack's function is like six lines of code. If you just read Carmack's source code, then rewrote the same algorithm with different variable names, it would clearly not be infringement. Is it really so bad if you keep his precise variable names, comments, and indentation? How does it hurt Carmack to reproduce this tiny snippet of code exactly, rather than with a small rewrite?
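To make the question concrete, here is a sketch (in Python rather than Carmack's original C, and with invented names) of that same well-known Quake III fast inverse square root algorithm, rewritten in the way described above: same magic constant and Newton step, different variable names and structure.

```python
import struct

def approx_inv_sqrt(value):
    """Approximate 1/sqrt(value) with the Quake III bit trick,
    expressed with fresh names (an illustrative rewrite, not the original C)."""
    # Reinterpret the float's bits as a 32-bit unsigned integer.
    bits = struct.unpack('<I', struct.pack('<f', value))[0]
    # The famous magic-constant shift yields a rough first guess.
    guess_bits = (0x5F3759DF - (bits >> 1)) & 0xFFFFFFFF
    guess = struct.unpack('<f', struct.pack('<I', guess_bits))[0]
    # One Newton-Raphson iteration sharpens the estimate.
    return guess * (1.5 - 0.5 * value * guess * guess)
```

For value = 4.0 this lands within about 0.2% of the true answer 0.5. The algorithm is identical to the original; only its expression differs, which is exactly the distinction being probed here.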
Sorry, but it is not a robot publishing the "lifted" code, it's a human, so copyright will very much apply. That's like arguing Ctrl+C/Ctrl+V is OK because it is "a computer doing it".
Plus it is not "minor infringement" but code is being lifted verbatim - e.g. as has been demonstrated by the Quake square root code.
Perhaps the final judgment would say "AI cannot infringe on copyright provided that only other AIs consume the result of the first AI's work".
And suddenly there is a world of robots composing, writing and painting for other robots. With us humans left out.
There should be a /s at the end, but the legal world sometimes produces such convolutions. See, for example, the interpretation of the Commerce Clause in Gonzales v. Raich.
As far as IP protections go, they're similar, but the incentives are so different that you get songwriters going to court over bits of melodies that might be worth millions. Outside of quantitative trading, it's hard to find an example of 10 lines of code that are worth millions and couldn't easily be replaced with another implementation.
This is a stupid argument by the Twitter author. Saving music digitally is "reading by robot", so by that logic, recording music that wasn't digital into a digital format would be fair use.
It should be obvious that if the robot is simply scraping web sites and reproducing their text verbatim (without permission and without giving credit) that would be an infringement.
There are a lot of shades of gray between that and the other extreme, which is where it is scraping millions of sites, learning from them, and producing something that isn't all that similar to any of them. Both ends of the spectrum, and everywhere in between, are things that humans can do, but as machines get more capable this is getting trickier and trickier to sort out.
In this case, it sounds like it might be closer to the first example, since significant parts of the code will be verbatim.
Ultimately, I am hoping that such things cause us to completely rethink copyright law. The blurriness of it all is becoming too much to make laws around. We just need better mechanisms to reward people for creating valuable IP that they allow people to freely use as they please.
Copilot is lifting entire functions from GPL code. Legal technicality aside, I know I'd be upset if I GPL'ed some code and someone stole large parts of it.
Why would you GPL the code in the first place if you didn't want other people using it? It's perfectly within the license for someone to do basically whatever they want with GPL code as long as they're not redistributing it. That includes using it for the internal operations of a Fortune 500 company, using it to run a dictatorship, or building a SaaS business on top of it. If you don't want people to "steal" your code, the GPL isn't the right license.
> Copyright has concluded that reading by robots doesn’t count. Infringement is for humans only; when computers do it, it’s fair use.
So wait, if I write my own AI, let's call it cp, and train it on gnu-gcc.tar.gz with the goal of creating a commercial-compiler.tar.gz, then I can license the result any way I want? After all, most of the work was done by the computer.
Just when I thought tweetstorms couldn't get any worse, here's one where every tweet is a quote-tweet of the author. I don't even understand how I'm supposed to read this.
> Copyright has concluded that reading by robots doesn’t count. Infringement is for humans only; when computers do it, it’s fair use.
Surely there's a limit to this. If I use a machine to produce something that just happens to exactly match a copyrighted work, now it's not infringement because of the method I used to produce it? That seems nonsensical, but maybe there's precedent for this too? (I have no idea what I'm talking about.)
That quote is basically entirely nonsensical. "Copyright" hasn't decided anything (nor has any legislative body or the courts). All that's happened is that OpenAI has put forward an argument that using large quantities of media scraped from the internet as training data is fair use.

This argument for the most part does not rely on the human-vs-machine distinction (in fact it leans on the idea that the process is not so different from a human learning). The main place that distinction comes up is the final test of damage to the original in terms of lost market share, where it's argued that because it's a machine consuming the content there's no loss of audience to the creator (probably better phrased as: the people training the neural net weren't going to pay for it anyway).

A lot does ride on the idea that the neural net, if "well designed", does not generally regurgitate its training data verbatim, which is in fairly hot dispute at the moment. OpenAI somewhat punts on this situation and basically says the output may infringe copyright in this case, but the copyright holder should sue whoever's generating and using the output from the net, not the person who trained and distributed the net.
Surely it could be argued that there is a loss of audience to the author. At the moment some people will read the author's code directly in order to find out how to solve a problem. In the future at least some of those people will just ask copilot to solve the problem for them.
It all comes down to this: it has not been tested in court. The above opinion, or for that matter any opinion from any lawyer or not-a-lawyer, is just that: an opinion.
As a business it is your responsibility to determine if this code-copying is worth a risk to your business.
Based on my experience, I'm pretty sure all corporate lawyers will disallow such code copying until it has been tested in court. It's just a matter of who will be the guinea pig.
Fair use for training and "independent creation" are one thing; an AI "remembering" and mostly verbatim copying code over is another.
Many current machine learning applications try to teach an AI to understand the concepts behind its training data and use that understanding to do whatever it is trained to do.
But most (all?) fail to properly reach that goal in more complicated cases, at least the kinds of models used for things like Copilot (GPT-3?).
Instead, what these models learn can be described as a combination of some abstract understanding and verbatim snippets of input data of varying size.
As such, while they sometimes generate "new" things based on "understanding", they also sometimes just copy things they have seen before! (Like in the Quake code example, where it even copied over some of the not-so-"proper" comments expressing programmer frustration.)
It's like a human who doesn't understand programming, or English, or even Latin letters, but has a photographic memory and tries to create something that seems to make sense by recombining existing verbatim snippets, sometimes tweaking them.
If the snippets are small enough and tweaked enough, that's covered by fair use and the like. BUT the model doing the copying doesn't know this, so if a large remembered snippet matches verbatim, it will put it in, effectively copying code of a size which likely doesn't fall under fair use.
Also, this was a well-known problem at least as far back as when I covered topics including ML ~5 years ago. Good examples included extracting whole sequences of paragraphs of a book out of such a network, or (more strikingly) extracting things like people's contact data based on their names, or credit card numbers (in the case of systems trained on emails).
So the fact that Copilot is basically guaranteed to sometimes copy non-trivially-sized snippets of code, and potentially comments, in a way that is not really appropriate wrt. copyright should have been well known to the ML specialists in charge of this project.
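One naive way to detect this failure mode, purely as an illustration (nothing here reflects how Copilot actually works, and the function name is made up), is to flag generated output that shares a long verbatim token run with a known training document:

```python
def has_verbatim_overlap(generated, corpus_docs, window=8):
    """Return True if any `window`-token run of `generated` also
    appears verbatim in one of `corpus_docs` (a naive sketch)."""
    tokens = generated.split()
    # Collect every window-length run of tokens from the generated text.
    runs = {
        " ".join(tokens[i:i + window])
        for i in range(len(tokens) - window + 1)
    }
    # Scan each corpus document for any matching run.
    for doc in corpus_docs:
        doc_tokens = doc.split()
        for i in range(len(doc_tokens) - window + 1):
            if " ".join(doc_tokens[i:i + window]) in runs:
                return True
    return False
```

A real filter would normalize whitespace and identifiers, hash the windows, and run at corpus scale, but even this naive check would flag the Quake example, where a long run of tokens, comments included, matches the source verbatim.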
> Copyright has concluded that reading by robots doesn’t count. Infringement is for humans only; when computers do it, it’s fair use.
This is a ridiculous conclusion. The ultimate destination of the product of the robot's actions is its user, i.e. a human; call it a corollary of transitivity. Therefore, all human-focused legal concepts, including infringement, are applicable in such cases.
The absurdity of the cited conclusion can easily be illustrated as follows. Suppose a person owns or rents an advanced robot (say, like Boston Dynamics' Spot, but better). He or she then programs it to break into someone's house and steal something valuable. All goes according to plan: the robot delivers the stolen goods to the rendezvous point and, if rented, gets returned. Now, according to the conclusion's logic, since a robot has done the actual "work", "it's fair use". Nonsense, right?
Just to clarify: I like the concept of GitHub Copilot (even though I have not yet tried this particular product). It offers various benefits, from pedagogical to adopting software engineering best practices to improving engineering productivity. However, I think that IP and legal aspects of this approach and specific product should be carefully studied and resolved in a consistent manner (e.g., prevent the model or system to output exact source code snippets).
An AI isn't learning from it. It's effectively copying prior work when it solves a problem. There is no novel, out-of-bounds data generation by modern AI approaches.
> Copyright has concluded that reading by robots doesn’t count. Infringement is for humans only; when computers do it, it’s fair use.
Reading by a robot doesn't count. But injecting a robot between copyright material and a product doesn't magically strip the copyright from whatever it produces.
> I am capable of summarizing the thoughts of lawyers
Then, I'm sorry, but you seem to have done a pretty bad job of summarizing that paper?
My own take on summarizing its conclusion would be:
"If the program can pass the Turing Test, then it should be legally liable, just like a human would."
(Yes, emphasis on the should here, but the way you're presenting that quote might make readers think that the paper's conclusion is the opposite one!)
----
> Nobody is really being hurt when a new tool makes it easier to copy little bits of code from the internet.
Some of the examples from that paper are A&M Records, Inc. v. Napster (no comment) and White Smith Music Publ’g Co. v. Apollo Co. (piano rolls), and in that latter case we pretty much (?) got the whole Copyright Act of 1909 the very next year, where these “mechanical reproductions” were subjected to a statutory compulsory license.
So at the very least there should be concern that Copilot might eventually be considered by the courts to facilitate copyright infringement, and might have to provide the sources of its "insights"?
(Attribution being the bare minimum that most of the software licenses require.)
In Germany, there is no fair use exception to copyright. On the other hand, most simple software constructs get no IP protection: e.g. a specific loop, one that even a (weak) AI could suggest, would probably be too simple to be protected.
What could be valid is a right against mimicking collections, but that would mean you cannot clone Copilot itself, since input is mapped to a non-trivial collection of outputs.
Hit the nail on the head, at least as far as my concerns go.
Security is a constant issue with humans writing code. Do we really want an "AI" that understands neither code nor security spitting out snippets of code to be pasted into network services?
If Copilot ever becomes truly popular it's going to be an absolute security nightmare, both from the code it suggests (just bad code, GIGO as you say) and because adversaries will be gaming it by posting bad code for it to pick up and "learn" from.
Well, maybe the interpretation will change if the right people are pissed off.
At this point, how hard would it be to produce a structurally similar "content-aware continuation/fill" for audio producers, film makers, etc, which suggests audio snippets or film snippets, trained from copyrighted source material?
If prompted by a black screen with some white dots, the video tool could suggest a sequence of frames beginning with text streaming into the distance "A long time ago in a galaxy far far away ..." and continue from there.
Normally we don't try to train models to regurgitate their inputs, but if we actually tried, I'm sure one could be made to reproduce the White Album or Thriller or whatever else.
> nobody cares if my ten-line "how to invert a binary tree" snippet is the same as someone else's.
Maybe nobody cares about that, but the problem is that Github's automated tool is not telling you what code it shows you is actually an exact copy of existing code, or how much of that existing code is being copied, or whether the existing code is licensed, or, if it is licensed, whether your copying is in accordance with the license or not. And without that information you can't possibly know whether what you are doing is legal or ethical. Sure, you could try to guess, but that sort of thing is not supposed to rely on guessing.
> Nobody is really being hurt when a new tool makes it easier to copy little bits of code from the internet.
Of course people are hurt, namely the original creators who spent years of work and whose work is potentially laundered, depending on how good this IP grabbing AI will get.
If it gets really good, some smug and well connected loser (e.g. the type who posts pictures of himself with a microphone on GitHub) will click a button, steal other people's hard work and start a "new" project that supersedes the old one.
So what happens when someone makes a transformer network that can read fanfics and animate them live, trained on the whole collection of MPAA movies? I mean, it's inevitable. Given the history of the MPAA, I don't think they're going to lie down and just take it. I feel like we're on a slippery slope to provoking the "IP lords" into brutally draconian measures that will make the Disney copyright extensions look like a tax deferral.
You sure about that? You aren't allowed to read it outside of the license attached to it. Downloading pirated source code, reading it, and then typing it out from memory doesn't magically give you a right to use it in any way. I would argue the licenses attached to most copyrighted code are being violated the moment the code is scraped and replicated without permission.
There might be a slippery slope here: suppose there's a GPL version of product X in the training set. I'm building a proprietary competitor. Then let's say copilot makes it a little bit easier and cheaper for me to build my product.
Now suppose it's 10 years from now and it's trivial to build a proprietary competitor.
>Copyright has concluded that reading by robots doesn’t count. Infringement is for humans only; when computers do it, it’s fair use.
Glad to hear this. My warez group from here on out will only release binaries written collaboratively with a neural net trained on the best proprietary software available
> As a human, I am allowed to read copyrighted code and learn from it. An AI should be allowed to do the same thing.
This is a very false equivalence. AI and humans are different. First, an AI is at best a slave, and likely a slave of capital. Second, scale makes a difference.
I have a question: how do we define "the AI"? In an extreme case, could I call copy-paste an AI if I implement it in hundreds of thousands of NN layers and it outputs a piece of the original work (e.g. 30 seconds of music from a 3-minute track, with small modifications)?
> As a human, I am allowed to read copyrighted code and learn from it.
Plenty of examples show that it didn't learn that much and copies literal parts of code. At my school that would have been grounds for a plagiarism charge, and those weren't treated lightly.
> Infringement is for humans only; when computers do it, it’s fair use.
But ultimately the human is OK-ing the code and committing it, basically as his own work most of the time. I'm reasonably sure that this may matter to courts.
There are a lot of sibling commenters disagreeing with this take, but I think they miss that ultimately this comes down to how legal experts interpret tech, rather than how tech experts think the law should apply.
This is, imo, unfortunate, as often the legal interpretation is based on a gross misunderstanding of how the tech works, but this is the way.
I don't think Copilot should be legal according to my own interpretation, but in this (rare) case I feel the "IANAL" tag applies not because I lack (legal) knowledge, but because I have (tech) knowledge that is likely absent from the actual legal decision making (therefore leading to different legal outcomes than how I would see things working).
Autonomous programming will be explored; Copilot is potentially a proof of concept, an early step in that direction. If it is, the corrections made by Copilot users will feed the development of future unattended programming. Either way, it's close enough that any legal outcomes experienced by Copilot users will help define the liability boundaries relevant to the future of autonomous programming. Copilot users are numerous enough that any individual's risk of ending up under the foot of a copyright owner with the means and will to crush them is low, but no one should take such a risk by using a novelty like Copilot in production code.
> As a human, I am allowed to read copyrighted code and learn from it. An AI should be allowed to do the same thing.
This is a non-sequitur. Why should it?
> And nobody cares if my ten-line "how to invert a binary tree" snippet is the same as someone else's.
Are you going to make up a rule for every length and type of code? What about twenty line? If ten lines are fine then surely twenty would be? How about pictures? If some code is then surely a picture or two wouldn't hurt? Let's just tweak the AI slightly so it regurgitates more code verbatim -- or do courts have to examine any change made to the AI and okay them?
> Nobody is really being hurt when a new tool makes it easier to copy little bits of code from the internet.
The Windows source code can be found on the internet. As a human you're allowed to read that if you have it. Try making an AI that copies bits of that into your code and release that on the internet.
> Nobody is really being hurt when a new tool makes it easier to copy little bits of code from the internet.
Quite the opposite. We all get a tiny bit better with good information like this. This is what the internet should be for, evolving, learning from past mistakes, information availability.
If the discussion was "I clicked this button and got someone's entire chat platform", that would be different. Words and sentences aren't copyrighted, books are, so when exactly does a collection of words become a book?
There is nuance, and the linked page has none. But that’s fine, that guy is free to pull his content off GitHub. This seems like a useful feature for other people who want to make things first and foremost.
> Words and sentences aren’t copy written, books are, so when exactly are a collection of words a book?
If that were true, then 20 people could each steal a single chapter from a book, and one of the people could combine those 20 chapters into a new copyright-free book. That's clearly false.
I've given thousands of hours to open source projects, I really think open source is a pillar of modern society. So you would think I am all for something like copilot, but no.
At first I thought this was a great feature, because it means easier access to code, but after some reflection I am very skeptical.
I am able to make my code open source because I can make a living out of it. I have a lot of open source code that I love to share for things like education or private projects, but if you want to use it for something real, you need to hire me. If you can suck up all that code without me even noticing, that's not fair.
The other thing is code quality. I don't want to sound rude, but there is a ton of bad code around. Not necessarily because the author is unskilled, but because the code might not need to be high quality (for example, I wrote a script to sort my photos; it was hastily written, specific to my usage, and I used it once and was done with it). Also, some bad or wrong patterns are really popular.
I am surprised you are able to DMCA a Twitch stream because someone whistles the Indiana Jones theme, yet in this case it is considered fair use.
> have a lot of open source code that I love to share for things like education or private stuff, but if you want to use it for something real, you need to hire me. If you can suck all the code without even I noticing it, that's not fair
Co-pilot aside, that's already how it works today. If you make something open source, I can use your code to power my business, and I'm under no obligation to hire you. It's great when companies give back to open source, either by supporting the projects they depend on, or by open sourcing their own internal projects, but it's not obligatory.
If you don't want people to independently profit from your code, don't release it under a license that allows commercial use
It sounds like the person you're responding to already releases their code under a non-commercial license. The problem with Copilot is that it may allow commercial enterprises to avoid such a license by copying the code verbatim from their repositories, possibly without any party involved knowing that it's happened.
But this is only for snippets, right? Which I think is the issue: it has never been tested in court. Basically, if you put:
/* web user management */
And Copilot comes up with a complete user management system lifted out of another repo, with all pages, db structures, and logic but the copyrights stripped, then yes. But, as I understand it, that is not what it does. You will need to slowly tell it every tiny part of how user management is to be implemented, and for those snippets it copies code; when you are done, there might be snippets from hundreds of different repositories. I think it is hard to show that breaks copyright, as many people already come up with roughly the same stuff thousands of times a day, all over the world.
> But when you are done, there might be snippets from 100s of different repositories potentially.
Potentially, but not necessarily. It's possible also that if there is only one close match for the logic required, it may produce verbatim something it's already seen. GPT is known to do this for sufficiently precise inputs.
> I have a lot of open source code that I love to share for things like education or private stuff, but if you want to use it for something real, you need to hire me.
implies that they have code which they are sharing under that proviso. Do you read it differently?
You are right about the technical distinction of open source from source-available. I think that the GGP (and myself) were both using it colloquially as a shorthand for source-available.
Is Copilot trained on source available code? If not, then whatever restrictions you may want to apply with your source available code isn’t relevant. The debate is about copyleft.
GPL is an open source license. Please read the ancestors to understand what’s being discussed here.
Edit: I was clearly making a distinction between open source (as in covered by an OSI-approved open source license) and only source-available, rather than treating source-available as a superset of open source.
"Open source doesn't just mean access to the source code. The distribution terms of open-source software must comply with the following criteria: ... The license must not restrict anyone from making use of the program in a specific field of endeavor. For example, it may not restrict the program from being used in a business, or from being used for genetic research."
If there's no license, then it's not open source. This is a term with a standard meaning, and it doesn't just mean that the source is available for reading
Still, if you come across some published source code that does not appear to be licensed and does not specifically define itself as being "Open Source" as defined by the "Open Source Initiative", copyright law applies and you're not allowed to just take it and use it.
GitHub specifically uses the words "source code from publicly available sources" when talking about what they used to train their model on.
As far as I'm aware public code repos aren't by default "Open Source" as defined by the "Open Source Initiative".
Sorry, I was specifically responding to how my parent views open source.
I agree that Copilot was probably trained in part on public code that isn't open source -- GitHub's claim (not saying I agree) is that they don't need a license to train on code.
In this post's comment section alone many use the term "open source" but really mean to say "public-source", others use the two terms interchangeably even when they seem to be aware of the distinction, and then there are people who seem to think that by making your GitHub repo public it becomes OSD-spec "open source" and with that free to use.
It's just so confusing and easy to misinterpret each other's true meaning.
Thanks for making me aware of the existence of that OSD OSS spec btw! Came across the (recovered) blog post where the term was first announced http://www.catb.org/~esr/open-source.html.
You have said this multiple times in this thread now, but beyond the fact that the OSI unilaterally defining the technical meaning of "open source" is controversial even within the software engineering community, you really need to look at definitions of words "descriptively", and most people seem to put even "shared source" (look but don't touch) models in the class "open source".
Regardless: Copilot doesn't only pick up "open source" code... it picks up any code that has been published for any reason, including the large amounts of code that sits on GitHub without any license at all, or that was literally stolen and leaked onto GitHub.
Meanwhile, even open source licenses have restrictions, whether they be "you can't use my work without agreeing to contribute back your work to the collective", various forms of automatic patent grants and associated retaliation clauses, or merely "you have to credit me", a simple limitation almost all open source software comes with which Copilot launders away.
> the OSI getting to unilaterally define the technical definition of "open source" being controversial even within the software engineering community
I don't think this is controversial? The OSI defined the term when they introduced it, in the late 90s. When Microsoft came out with "shared source" there was a huge amount of pushback from people saying "don't think that this is open source" (ex: https://www.linuxjournal.com/article/5496)
> Copilot doesn't only pick up "open source" code... it picks up any code that has been published under any reason
I agree. I'm not defending Copilot, and I think the legal questions here are interesting and tricky. My pushback here and throughout this page has been when people say non-commercial licenses are open source -- this thread started with kuon saying "I have a lot of open source code that I love to share for things like education or private stuff, but if you want to use it for something real, you need to hire me"
> it picks up any code that has been published under any reason including the large amounts of code that is on GitHub without any license at all or which was literally stolen and leaked onto GitHub.
You are stating a rather arbitrary assumption of yours as a fact, unless you have concrete sources or evidence.
This is the gist of it. I do not agree with OSI definition of open source but I won't argue about it here.
There are many OSI approved licenses with restrictions on use, like "your software must also be open source" or "contribute back your changes" or "give me attribution"...
> My definition of "open source" is that the source code is publicly available. That's it.
If I defined "open source" to mean that changes to the source code must be released publicly, it's going to be pretty hard for me to talk to all the other people who already use "open source" to mean something else. There is already a standard term for the source code being publicly available: "source available": https://en.wikipedia.org/wiki/Source-available_software. You're also welcome to invent and attempt to popularize any alternative term you want, but using idiosyncratic definitions makes discussion less clear.
> Just because that organization got their hands on a premium domain name doesn't mean they get to decide what that term means.
The OSI didn't just claim "opensource.org" -- the folks behind it coined (https://opensource.com/article/18/2/coining-term-open-source...), introduced, and popularized the term "open source" over two decades ago. From the beginning they have used the same definition, which was derived from the Debian project's Free Software Guidelines.
They are also not the only ones who use the term that way. Wikipedia has "Licenses which only permit non-commercial redistribution or modification of the source code for personal use only are generally not considered as open-source licenses" -- https://en.wikipedia.org/wiki/Open_source
"Something real" and "fair use" don't get along. I'm also not sure fair use trumps licensing, since one is a copyright issue, and the other is the terms of use. You don't get to copy a snippet of GPL code and get away by calling it fair use. At least, I hope it isn't the case.
> I'm also not sure fair use trumps licensing, since one is a copyright issue, and the other is the terms of use.
Licenses and copyright are strictly related. The 'terms of use' in the form of a license is simply a grant allowing someone to reuse something in full, provided they uphold their end of the bargain, which might mean (for the GPL) relicensing code that touches the licensed code. If someone doesn't agree to the license, or doesn't uphold their end (such as by releasing their own code under the same license), then the grant is invalid and they're not allowed to copy the work in the permissive form provided, unless the use passes the fair use doctrine.
Now 'fair use' is actually a pretty high bar, especially for code - feel free to read 17 USC 107 [0], which lays out how this is determined. What's tough for code is that copying code or using a library usually doesn't qualify for the initial requirement: "for purposes such as criticism, comment, news reporting, teaching, scholarship, or research, is not an infringement of copyright". Unless you're taking GPL code and writing up how bad it is, or how efficient it is, chances are the use doesn't fit into fair use.
So while I was indeed a bit obtuse in saying that code could be used in a fair-use way (since that's not what happens in these contexts), it's technically possible.
> I am able to make my code open source, because I can make a living out of it, and I have a lot of open source code that I love to share for things like education or private stuff, but if you want to use it for something real, you need to hire me. If you can suck all the code without even I noticing it, that's not fair.
If you license your software such that I can do whatever I want with it, then I can do whatever I want with it. I don't see how you can then go on to claim it isn't fair if I'm using it as you allow.
From the Copilot FAQ, notice there is no specific mention of the term "Open Source":
> It has been trained on a selection of English language and source code from publicly available sources, including code in public repositories on GitHub.
Meaning GitHub might have also been sourcing from public source / source-available projects that were not OSS licensed at all.
> I have a lot of open source code that I love to share for things like education or private stuff, but if you want to use it for something real, you need to hire me.
Then you should share your code with a license that reflects that.
I never hosted--with quite some prejudice, even--any of my projects on GitHub (for a number of reasons that are off topic right now)... it didn't matter, though: people take your code and upload it to GitHub themselves (which is their right); so you can't avoid Copilot by simply self-hosting your repositories.
Well... "the point of" open source, for some of us, is to participate in a collaborative commune: we are willing to contribute code to the collective specifically because other people are required to also contribute their code to said collective, as part of a shared battle against opaque proprietary systems. So the idea that someone is going to find it, scrape it, and "launder" what was supposed to be our competitive advantage into their closed-source projects (by training it into a glorified pattern recognizer and then regurgitating it into their text editor, without so much as attribution, without realizing that the work we contributed to the world was part of a tit-for-tat pact) kind of ruins it for me; they want to have their cake and eat it too. It is bad enough that I already have to constantly guilt-trip people into not violating my license: systematizing the violation via Copilot makes me think we might need even stricter mechanisms for distributing code within the commune, so as to build a kind of "decentralized trade secret". (That said, other than your wording about "the point of open source", I agree with your doom-y sentiment 100%.)
I’d argue that this new use case is very interesting to open source and how it relates to the various licenses, and not necessarily “the point of open source”.
I can imagine people being OK with their code being used as-is, and/or being modified, but not used completely out of context to train some corporate AI to inject code into commercial code bases.
You can't give out something for free without limitations and then complain when someone uses it for something you didn't expect. Well you can complain but no one has to listen.
GitHub's use seems very in the spirit of open data and code: using open source to help others.
We didn't give it away for free without limitations: we carefully drafted licenses to attach to our code, establishing legal boundaries on what we considered acceptable. My code, for example, is under the GPL, which says "I am OK with you using the products of my labor under the requirement that you will then let me use the products of your labor". Hell, almost every bit of open source code carries at minimum "I am OK with you using the products of my labor as long as you are at least willing to give me credit somewhere", a minimal limitation that Copilot doesn't honor (and prevents its users from honoring). Copilot is essentially a giant code launderer designed to strip licenses from effort that was only contingently given to the community.
Licenses explicitly are the set of permissions and limitations for someone to use the code. Even most permissive open source licenses require that you maintain an attribution with the text of the license in derivative works. Copyleft licenses and in particular the GPL put very strict requirements on what types of licenses are acceptable in any derivative works.
In literally no open source license except "do-whatever-you-want-i-dont-give-a-damn-bye"-type licenses like WTFPL are you giving something away for free without limitations. That's not even close to what open source means, at all. Open source and public domain are not synonymous terms. And in those cases where there are literally any terms in a license, use of Copilot obviously violates the terms of the license, unless the user operating Copilot goes back, finds the code Copilot is stealing from, and makes the proper attribution / otherwise ensures that they are meeting the terms of the license of the stolen code.
Agreed. I am considering relicensing all of my permissively licensed code because of this. The fundamental assumptions I had when releasing that code under a permissive license have been violated.
Under GitHub's legal theory (fair use), nothing you put in that license file can stop them from doing so legally.
If copyright applies, they're already in violation by failing to attribute your MIT contributions and could theoretically be sued for infringement (as they did not abide by the terms of the license).
When you use a permissive license, it’s best you stop thinking of it as your code. You’ve set it free for everyone, and while you may retain copyright in some abstract sense, it really no longer belongs to you.
While I agree with you--in that the vast majority of people who publish code under permissive licenses actually have an implied set of "moral code" restrictions that they end up surprised people violate even though they specifically allowed such by their choice of license (implying they should have chosen a different, and likely more restrictive, license)--even permissive licenses tend to at least include "you can't take my code without crediting me for taking my code", and so I can appreciate someone being upset about that not happening.
Has Github confirmed anywhere that copilot is only trained with projects on Github?
The OpenAI people seem to grab any bit of data they can get their hands on, regardless of the source. I don't see why they'd limit themselves to Github for something like this.
The point I am making is that even if they did, removing your code from GitHub doesn't help you: your code is going to end up on GitHub anyway; this is a stronger position that requires fewer assumptions.
Your choice, but be aware that self-hosting probably reduces collaboration with the community. GitHub and GitLab make it easy to contribute a pull request. Hundreds of private cgit web frontends make it hard to contribute, and it is often impossible to search the code on the web, requiring a clone, which takes a long time for big repos.
I'm glad that Copilot is bringing the grey areas of copyright into discussion. If I write a book and it is copyrighted, what's the smallest unit covered by that copyright? Each word is obviously not. Some sentences will be fairly generic, and I will not be the first person to write them. But some sentences will be characteristic of the work or my own style. Clearly, how we apply copyright to subdivisions of an original work is an open question.
This. You realize it doesn’t make any sense. All ideas are shared creations, by definition. If you’ve created something that has meaning for other people, the meaning comes from the ideas you are incorporating into your own tree.
There is no defending copyright. It is indefensible from first principles. It makes no logical sense.
> Above a certain level of creativity people do produce novel or exceptional things that are worthy of protection.
Name your very best example that will prove me wrong. It should be so simple. One example, that's all it takes. Take your time, make sure you've got a good one. I'll tell you that not once, not a single time in over 17 years, have I ever seen a single example of this argument hold up under scrutiny.
Oh wait, you already did:
> Because naked men are a shared concept Michelangelo's David is not protect-worthy?
Ah yes, Michelangelo's David. A work free of copyright built under commission! Thank you for again pointing out the futility of the defense of copyright.
Okay, Harry Potter and the Sorcerer’s Stone. I’m legitimately trying to understand. Do you think after that book was published it should have no copyright protection? That it should be totally legal for me to print and sell my own copies?
Yes. Our government should not be in the business of regulating the distribution of a sequence of words about imaginary wizards.
JK is a talented and hard working writer, and though I'm not a fan personally of those books I respect that they likely are great pieces of work, but I believe we are getting the scraps of what we could get in the Intellectually Oppressed world compared to an Intellectually free world. I'd rather have a world without cancer, a world with 100x more people able to provide medical care, a world with less pollution, than a world of artificial scarcity where a few who go along with a system of oppression get to be billionaires.
> Our government should not be in the business of regulating the distribution of a sequence of words about imaginary wizards.
So not imaginary wizards then.
What should be regulated? Is it nothing? Does your statement become "Our government should not be in the business of regulating the distribution of a sequence of words"?
> Does your statement become "Our government should not be in the business of regulating the distribution of a sequence of words"?
Yes. Your lungs are trees that need healthy air. Your brain is a tree that needs healthy ideas. When people are not free to clean the ideawaves, they fill with pollution, and that is where we find ourselves.
Sure why not? Do you think JK Rowling needs more money?
Maybe the state could grant protection for 10 years after publishing to give the author a chance to recoup their investment. I don't know why the protection extends to the author's grandchildren.
If there is no copyright, what incentive is there to ever create anything digital? Adobe would never invest in Photoshop if any random person was legally able to sell copies for $1 each. A production company will never publish a book or create a TV show if anyone can just undercut them by taking what they have produced and resell it. Doesn't seem like a con to me, it sounds pretty critical for any kind of functional digital market.
Most people who contribute to open source software are not motivated by copyright. I think without copyright we would still have our software; we would just have a way bigger share of open source software in the world, as well as nastier copy-protection mechanisms, and more services that are online-only.
> Most people who contribute to open source software are not motivated by copyright.
I disagree. License choice is deliberate, and many open source licenses are chosen for the strict stipulations they put on users and developers, like mandatory attribution and terms of distribution or reproduction.
I release some software under the GPL and AGPL. I don't want anyone to use my software that doesn't intend to abide by the terms it was released under.
If I wanted to release software with fewer stipulations, then I would, and I have.
Both licenses have mandatory attribution clauses, which is one of the stipulations I mentioned, along with mandatory inclusion of copyright notice and license. BSD in particular has specifics about advertising and distributing other files, as well.
I recognise your comments from several different threads, and I'm wondering if you might not be working against your own ideals. The GPL license is intended to persuade other sources to share their contributions when building on top, which I assume is what you would like to see happen. If everything is GPL then everything is open source, everyone can use anything, including training AI methods, etc.
The problem posed with copilot is in fact the opposite. By taking it to its logical conclusion, this might make it possible to disregard this effort and use GPL code on your private project.
If we abolish copyright then there is no need for GPL. I am so forever grateful for GPL, as Stallman and the like weaponized copyright against itself, and then gave us crystal clear data that open source software is strictly superior in the long run.
But even if only 1% of ideas were copyrighted, that is still a tax on the use of all ideas. In a world without copyright, I can download any dataset at will and analyze and remix it to my heart's content, and share my findings. But in a world with copyright, if there was one "copyrighted" land mine in there I open myself up to financial ruin. So one must tread carefully when working with any external ideas.
The GPL isn't about sharing contributions back to the project, but ensuring the end-users have the source code and permission to use, modify and share it.
Did you mean to say that "GPL isn't just about sharing ..."? It most certainly is part of the intention behind GPL. Changes to GPL software must be disclosed and made open source.
If you as an end user want to modify the source code for your own use, then that is fine. If you want to distribute it, you must state the changes, and you must also do so for any code it is linked with. The original maintainers are then also free to incorporate said changes should they choose.
Nothing in the GPL requires making changes public or sharing them with the developers of the project. You only have to give your changes to people you give your modified binaries to. You never have to give your changes back to the upstream project. So AFAICT the GPL is about user freedom rather than sharing and any sharing that happens is a side effect of user freedom. There are organisations who make changes and share them with their customers but not with the wider community. There might be situations where sharing publicly could lead to detrimental effects, Debian's "Desert Island", "Dissident" and "Tentacles of Evil" tests are some examples of that.
> There is no defending copyright. It is indefensible from first principles. It makes no logical sense.
What does that even mean? The intent from the beginning of copyright was to allow people to live off of intellectual works by claiming legal rights over the work.
There are no “first principles” from which basically any societal agreements like these are derived.
Even something as simple as “murder is illegal” isn’t actually derived from any first principles because the government is allowed to murder people, citizens are during self defense, etc.
We know that there was a written intent that it was "To promote the progress of science and useful arts". However, who knows whether or not that was the true intent of all those who signed off on it. We see that lots of written intent, (Exhibit A: Purdue's "Partners Against Pain" Oxycontin promotion), may not match the mathematical reality on the ground. Also, we know that there was plenty of places in the Constitution that were good to amend (the three fifths clause, for instance).
This site (http://www.copyrighthistory.org/cam/index.php) has lots of fascinating old docs where you can come up with your own impressions about the early days of copyright. My general impression was that while it didn't ever actually promote the progress of science and useful arts, it absolutely did in the early days serve as a super smart free hack for the new federal government to build a central intelligence and library of all the latest and greatest inventions from throughout the land.
> What does that even mean?
It means that if you analyze it using logic and put all assumptions on the table (start high up on the tree), you deduce that this is a system of intellectual slavery, not of intellectual "property". You deduce that if there is such a thing as stealing ideas, then all ideas with any value are majority stolen and but a fraction novel.
I think an interesting analogy is if you rewrote a book in your own words but with each paragraph's meaning intact. So you rewrote Harry Potter with slightly different sentence structures, but the meaning was otherwise near identical. Is that copyright infringement? I think it would certainly be plagiarism.
The other similar analogy is translation: a translated work is still covered as a 'derivative work' under copyright law.
Is this just what copilot is doing in some ways but for smaller components?
'Some sentences' makes me think of the link tax introduced to prevent aggregating news sources based on only headlines, so even generic sentences fall under copyright in certain cases.
It is certainly fascinating to see people start running away from "information wants to be free" and other Free Software principles full tilt when, all of a sudden, it's their livelihoods that are on the line. Unless my recollection is off, the GPL was never the goal of the original Free Software movement; it was merely a tool to get to the end state where all code becomes available for use by anyone for any reason without cost or restriction.
I am reminded of a line from Terry Pratchett's Going Postal in relation to a hacker-like organization called the Smoking GNU, "...[A]ll property is theft, except mine...", which I thought was rather painfully apt in describing what FOSS evolved into after becoming popular.
Your recollection is off, majorly. I'd recommend looking up the origins of the FSF/GPL/Copyleft. The entire movement essentially got started because Stallman gave Symbolics his (public domain) Lisp interpreter, then Symbolics improved it but refused to share the improvements.
"No restrictions" has never been the goal and to claim that they're egoistic hypocrites who are just scared for their own livelihood because of this is just an absurd strawman.
> "I'd recommend looking up the origins of the FSF/GPL/Copyleft. "
Are you sure you're in a position to be saying things like that? The closed source Xerox printer driver incident is generally viewed as the origin of RMS's thinking on Free Software, not the Symbolics incident. And, as others have pointed out, you were mistaken even on the particulars of that.
As for Free Software not being about no restrictions, may I remind you of the four freedoms that are at the heart of the Free Software philosophy? Copilot runs afoul of none of them and I would go so far as to say that Copilot is an embodiment of 1 through 3.
"A program is free software if the program's users have the four essential freedoms:
- The freedom to run the program as you wish, for any purpose (freedom 0).
- The freedom to study how the program works, and change it so it does your computing as you wish (freedom 1). Access to the source code is a precondition for this.
- The freedom to redistribute copies so you can help others (freedom 2).
- The freedom to distribute copies of your modified versions to others (freedom 3). By doing this you can give the whole community a chance to benefit from your changes. Access to the source code is a precondition for this."
Again, the GPL is a tool to achieve the Free Software philosophy, not the end goal.
But software built from source copied using Copilot is not guaranteed to preserve those freedoms to its users. Which is the whole point of the GPL - preserving these freedoms to subsequent users, as a chain.
I've admittedly based that on what a FSF advocate told me and googling it seemed to support it. If you have some source that indicates this is wrong/biased, feel free to link it.
Anyway, that still wouldn't change that the FSF and Copyleft are explicitly anti-proprietary, not intending to be 'no restrictions'.
> It is certainly fascinating to see people start running away from "information wants to be free" and other Free Software principles full tilt when, all of a sudden, it's their livelihoods that are on the line.
Indeed. Everybody is a leet haxor when they're 14, it's 1998, and we're vying for +o in #warez on DALnet. We believed information really did "want to be free".
Unfortunately some of those same kids grew up to create today's data barons and that old saying about getting someone to understand something when their salary depends on not understanding comes into play.
I can't speak to whether or not Richard Stallman was trying to make some 4-dimensional chess move to remove software restrictions by adding software restrictions when he wrote the GPL back in the 80s, but his original intentions are irrelevant in most cases since most people who license their code under the GPL do not consult with him or consider his opinions when they choose to do so.
I think what is upsetting to those who create free[0] software, particularly GPLed works, is that their code has been used to create something that is non-free, which they have no control over (and thus takes away all the 4 freedoms).
As such, I really don't understand why so many people are saying things like you have said here. That these people believe in free software, including the original intent of the GPL as a tool against copyright, does not mean they should, or would, be all for Copilot: a non-free product, by a company that once spat in the face of free software (and likely still does behind closed doors), which is using their free software, including GPLed works, to create non-free software.
As such, even if it ends up being legal from a license standpoint, it really feels like copyleft is being taken advantage of and exploited because nothing is given back in return (even if copilot remained free of charge, that's hardly giving back in foss lingo). So I suppose it's more a moral and intent thing rather than a legal thing at this point, unless copyright law decides copilot is a derivative work (doubtful, personally).
> Unless my recollection is off, the GPL was never the goal of the original Free Software movement; it was merely a tool to get to the end state where all code becomes available for use by anyone for any reason without cost or restriction.
Yes, but the GPL was created for a world with copyright to step towards a world without it. However we are extremely far away from such a world, and I don't see how copilot helps step towards it at all. I just don't see the argument. It just results in code that will be used potentially wrongly in copyrighted works, proprietary or not. And all of this is enabled by, as I mentioned, a non-free software created by Microsoft, who have used their huge capital to gain access to a proprietary AI by throwing money at it. Nothing about it seems fair even if it ends up all being legal in copyright terms.
[0] I'm seeing a lot of people in these threads who don't know the definition of free software or open source. Free software refers to free as in freedom, not free as in 'free beer' (gratis). Of course, free software is still often gratis, but you can still monetize it in various ways.
Surprised I had to scroll so far to find this, given how copyleft and straight-up anti-IP so much of the open source community is.
I think a lot more people on this site (and in the FOSS community in general) would be on board with Copilot if it respected viral licenses, e.g. if it had a way of inferring that the code it was copying verbatim were GPL-3 and warned the user that including it in their project would require them to GPL-3 their project as well.
Not really. Back when free software was strong, it would have been a good thing for society since Microsoft was selling software in boxes on actual store shelves.
Now 'the edge' is already mostly open source. All the lock-in and value has moved into either infrastructure or in software you don't even get to touch since it runs in the Cloud and you just provide IO to it.
I think in this new era of endless security breaches at cloud firms and M1-style processing innovation we'll see a slow but steady migration away from the cloud.
I’m going to take the other side of that prediction:
Endless security breaches will encourage firms to do “less IT” themselves and accelerate the adoption of SaaS solutions (and PaaS, with no/low-code etc.)
Also, perhaps not a massive driver but still, not for nothing: M1-style processing innovation (ARM) will see more developers creating for ARM servers, because they can, which will almost exclusively be run by the hyper scale cloud providers.
I laugh at the thought that every company trying to host 20 vendor apps is more secure than having the company that wrote the apps host them. Self hosted apps get updated on a much much slower schedule and don't have access to the brains who built the software.
The arguments are already weak. The judicial precedent, however, is strong. Microsoft will continue to publish proprietary ML models and profit off them, at the expense of the corpus authors (us lowly laborers).
I used to admire GitHub for being a fully bootstrapped company and free to pursue a path in the world they believed in as a company.
Since the Microsoft acquisition it’s becoming painfully obvious how unhealthily centralized the dev world has become, and they seem to strive to become ever more entrenched in the name of maximizing shareholder value.
I only have a small number of open source projects on GH, but I intend to vote with my feet and abandon the platform by self-hosting Gitea. By itself it won't be a big splash, but I'm inspired by posts such as this and I hope to inspire someone else in turn. Of all people, we devs should be able to find good ways to decentralize.
In this case that might not help you at all. If your project is popular enough, somebody will mirror it on GitHub, where they are free (or believe they are) to incorporate your code in Copilot. Voting with your feet might be helpful long-term but will not protect you from this particular "feature".
It's more than that though. You should vote with your feet in opposition to all Microsoft products involved in this loop, not just GitHub. No GitHub Sponsors. So no VS Code. No NPM. No Azure.
Thanks for the question. Well. I put my AGPLv3-licensed code on GitHub to help other developers. I didn’t do this to help GitHub / Microsoft build a closed-source tool to monopolize the (F)OSS market.
It’s interesting how incumbent companies such as Google and GitHub try to capitalize on their user data with machine learning in any way they can to maximize shareholder value.
GH and MS spend a lot of time talking about how important open source is to them. They didn’t exactly prove this by building Copilot to be so oblivious about licensing. Either they took a gamble and hope they’d get away with it or it didn’t occur to them that this would be a problem at all. Either way, I’ve lost faith in GH's ability to act in the best interest of its users and the larger open source community.
It’s a free market and I hope to see more competition in this space: Both from GitHub-alternatives that respect code-licenses and from self-hosting alternatives.
> This product injects source code derived from copyrighted sources into the software of their customers without informing them of the license of the original source code. This significantly eases unauthorized and unlicensed use of a copyright holder's work.
It appears that GitHub wishes to address this issue via UI changes to Copilot. A quote from a recent post on GitHub[0]:
> When a suggestion contains snippets copied from the training set, the UI should simply tell you where it’s quoted from. You can then either include proper attribution or decide against using that code altogether.
> This duplication search is not yet integrated into the technical preview, but we plan to do so. And we will both continue to work on decreasing rates of recitation, and on making its detection more precise.
That post is also on the Hacker News front page right now[1], but has 10% as many upvotes as this post, so it's less visible.
I'm hoping all the criticism will encourage GitHub to make a better product.
While I understand the sentiment, wasn't Copilot trained not only on code hosted on GitHub, but on code found all over the Internet? That means hosting your code yourself would not prevent GitHub from using it to train Copilot. Which raises an interesting question: how do you opt out? Is there even a way to do it?
This kind of learning needs to be opt-in, not opt-out.
I would also be extremely surprised if most open source copyright holders didn't already expect their licensing terms to protect against this kind of code/authorship laundering. Speaking individually, I know that it certainly surprised me to hear that GitHub thinks that it's probably okay to regurgitate entire fragments of the training set without preserving the license.
Bad news for you: Japanese copyright law, Article 47-7, explicitly allows using copyrightable works for data analysis by means of a computer (including recording a derivative work created by adaptation).
It would be considered fair use in the USA, except we don't use a common-law system, so we explicitly state what is exempt from copyright protection.
robots.txt is a convention for those who want to be good 'web citizens' rather than anything legally binding. It does absolutely nothing to stop someone who ignores your wishes. For example, there are tons of bots that ignore robots.txt entirely, or even go straight for the things you tell them to avoid ('hey, thanks for telling us where to look!'). Copyright, by contrast, is a mechanism you can use if you can make the case and have the means, but it only works against entities that have something to lose and are within a jurisdiction where it matters.
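To make the point concrete, here's a minimal (hypothetical) robots.txt; note that every line of it is advisory, with nothing enforcing compliance:

```text
# Purely advisory: a polite crawler honors this,
# a scraper is free to ignore it, or treat the
# Disallow lines as a map of what to fetch first.
User-agent: *
Disallow: /private/
Disallow: /drafts/
```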
Researchers in our lab created a huge dataset of facial expressions from images on the web, annotated it, and published the URLs to the images and the annotations for research, but made sure to search only for images with proper licenses. I don't think that you are allowed to just go download any old image and train on it. I understand that many, many people do it, but it's not legal (as far as I know, please correct me if I'm wrong).
> I don't think that you are allowed to just go download any old image and train on it.
My understanding as a two-year student of ML is that you are allowed in the US to go download any old image, train on it, and then release the model as long as the outputs are "sufficiently transformative."
To be clear: "transformative" not meaning merely "altered" but really meaning "repurposed"; if the new work is something people could feasibly use instead of the old work (harming the author's original market), it isn't "transformative".
Yes. For example, arfa ran into this question when launching https://thisfursonadoesnotexist.com/. Lots of furry artists had exactly the same concerns with his work there, but that work is decisively transformative.
Copilot seems ... well, less transformative. I'm still not sure how to feel.
I guess it goes back to closed source / trade secrets territory. If you have something you really don't want stolen, it is safer to never expose it and never trust that the law will fairly protect you.
The irony is that copilot won't suggest its own source code, just everyone else's. It is open source without the benefits.
I would like to know this too. I understand that GitHub is a private company and you have to accept their T&C, but surely they aren't allowed to use source code found elsewhere on the internet to train their ML models without asking for permission first unless it's a B2B cooperation such as with Stackoverflow.
According to the discussion at this link, you do not need permission to use copyrighted data to train AI models. Copyright prevents you from copying data, it doesn't prevent you from learning from it.
To train your model, yeah, probably ok. But I don't think anybody will see people using the duplicated code that AI insert on your codebase the same way.
Oh man, I can’t imagine the consequences for certain languages and frameworks if it uses SO answers, though. Imagine if it trained on all the dumb and ancient answers like “how do I get the length of a string in javascript” and took the first accepted answer of “use jquery”.
This raises the issue of trolling. What prevents developers from generating "inappropriate" code to feed to this algorithm, the same way they did with the Microsoft chat bot, for example? That would surely reflect on the quality of the code generated by this AI system, and therefore on the stability and security of the applications built with it.
I’m sure this will happen, and there will definitely be instances of the bot giving users bad code, but it would be incredibly difficult to make it solely give out bad code.
> how do you opt out? Is there even a way to do it?
Yeah, don't post your code in public for everyone to read. If I am a musician and I play my song publicly, people will hum it if they like it. I can't do anything to stop that, except not play the song to anyone.
It’s one thing when random people start humming or even singing your song in private, and an entirely different thing when a big corporation uses your work to train a system that will at some point generate parts of your works and make them millions of dollars off your and everyone else’s code, without attributing anyone in any way.
Plus, to be honest, I’m not even sure whether I’m for or against it; I am just wondering, if one is against it but still wants to do open source work, is there any recourse?
I think it's clear that Copilot pushes boundaries...technological and legal. It makes people uncomfortable and challenges a lot of assumptions that we have about the current world. But this is exactly what I expect from the next revolutionary change in computing.
it doesn't at all challenge any assumption we have about the current world, unless you've been living with the assumption that you can't automate the process of producing code snippets.
Copilot is no different than stackoverflow, with the exception that your selection mechanism is an artificial neural net rather than a bunch of people with up- and downvotes in front of their screens. The reason people are uncomfortable here isn't some vague PR-speak notion about the future; it's that Copilot appears to be quite literally ignoring software licenses. Can we not devolve into this Silicon Valley corporate speak of rebranding a company's ignoring of intellectual property as innovation?
It sounds like you have it all figured out, so I just have one question: if it's not innovative (it's "no different than stackoverflow with X"), then what is your explanation for why the Copilot announcement here received over 2.8k votes? It's so obviously just stackoverflow with a better selection mechanism, so why aren't people treating it as such?
As far as I have seen, that is exactly how people are using it, which is why the last thread about it was (rightfully) full of people pointing out what a bad idea it is to copy half-cooked code snippets into your own codebase.
Why does it get a lot of upvotes? Because it's an AI product that makes things appear on your screen and that's the threshold for hype in our current age. Of course the number of upvotes on HN regardless doesn't speak to the innovation of anything. As best as I can tell all pre-covid posts on HN concerning mrna vaccines have a total of one upvote. Useless tech right?
It sounds like you think most people upvoted it because it is a shiny toy and the bar is very low when it comes to impressing tech industry workers. And also you don't believe upvotes have a positive correlation with innovation on HN, because the mRNA vaccine posts don't have as many upvotes as you think they should. Does that about sum it up?
Copyright doesn't just benefit huge corporations. For instance, without it, independent artists who rely on copying for distribution (authors, musicians, etc.) would find it much more difficult to make money off their work, mostly (IMO) because large corporate entities with large investments made in publication and distribution systems could simply take content and sell it themselves with zero obligation to the original creator(s). This process could be highly automated at scale, giving creators essentially zero chance to compete in the market.
It's a bad idea.
The thing about copyright law that needs reform is its bias toward the benefit of large corporate entities. Platforms' implementations of DMCA compliance allow "rights holders" to spam perjurious takedown requests en masse, garnishing the earnings of creators and legitimate rights holders in what can only be called (in addition to perjury) outright fraud. Companies like Github scrape the web for content, most of it copyrighted, and use it to construct new products for their own profit. Rare recitation events aside, I think their use case is legitimate fair use in the eyes of the law (and if you look at my comment history you'll see me vehemently arguing to that effect), but should it be? We don't seem to be asking that question, which is really disappointing--we're either complaining loudly and without substance, or blithely accepting the might-makes-right ethic as the central pillar of our IP law.
The status quo for artists is pretty dismal. Across industries you have a few ultra-successful artists, a small group who can make a decent living and then a long tail of people who can't pay the rent.
Gaming things out, I don't think copyright is really helping any of those artists or society as a whole. If it didn't exist, you'd still have breakout artists who make money through endorsements, live shows, and selling original copies of their work.
I think it's kind of crazy to handwave away a whole class of creators who do make real money off selling their work (rather than being performers), even if it isn't big money or even enough to live on without supplemental income. And I categorically disagree that copyright isn't helping "any" of them. One counterexample is self-published authors on Amazon and similar services, many of whom do make a lot more money than a layperson might expect, and all of whom would obviously be cut off at the knees by squeaky-clean first-world aggregator-reader services the second the copyright protections of their works were revoked.
>Copyright doesn't just benefit huge corporations. For instance, without it, independent artists who rely on copying for distribution (authors, musicians, etc.) would find it much more difficult to make money off their work,
That doesn't look like it's the point to me.
"[The United States Congress shall have power] To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries."
As I read that, copyright is there to 'promote progress', not to maximize gains.
No doubt there is a million linear feet of case law that got us where we are.
Honestly, I rather like this whole question of copilot. I solidly appreciate the brilliance of github as a honeypot.
> To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries.
What better way to promote said Progress than by making sure said Authors and Inventors can make enough money off their work to keep doing it? As written, it's a roundabout way to get at the instrumentality of capital, but if that's not what they had in mind then I'm not sure what they were getting at. Without copyright, a creator's rights to their own work aren't diminished; it's just that everyone else's are expanded to the same level.
(I'd love to know if I'm way off base about this. I'm not a lawyer, and I'm sure it's been discussed to death.)
> Honestly, I rather like this whole question of copilot. I solidly appreciate the brilliance of github as a honeypot.
I think it's really cool, and I'd probably use it myself. As much as my favorite kinds of programming (e.g. writing experimental text editors) might not benefit from it, in my day job I sure would love to spend less time filling in boilerplate and looking up mundane API details.
I don't mean to single Github out in my mention of big corporations benefiting from copyright law. Scraping vast quantities of copyrighted data to build new products is a common business model at this point, and--like other new IP-related paradigms enabled by modern information technology--I think it deserves a fresh look, being mindful of just what it is we're trying to accomplish with copyright law. As you say, it's not always obvious, even in written law.
There are a lot of better ways. Having more information being public and free, and usable by tools like this sounds like an excellent way of promoting progress.
Well, this is the current status quo. Automated scraping of copyrighted material for (arguably) transformative applications like Copilot is generally allowed under fair use, while non-transformative copying is considered infringement.
If you want to strip creators of (default) exclusive rights to their work, that's a different conversation than the one around whether Copilot and similar applications fall under fair use. Both have been touched on in this subthread, but your comment seems to be conflating them in a way that doesn't follow directly from the discussion above it.
> Automated scraping of copyrighted material for (arguably) transformative applications like Copilot is generally allowed under fair use,
Ok great. So then new tools like this are good, even though they weaken copyright, and the concerns that people have about it (that it allows easier copying of code), are actually a benefit.
And the fact that it might hurt people's ability to profit from their code, is overruled by the benefit that this stuff provides.
> doesn't follow directly from the discussion above it.
You suggested protecting profits as if it is the only or best way of promoting progress.
When, in reality, stuff like these tools are actually a much better way of doing so. And it does so in a way that undermines copyright law, in a beneficial way.
> Ok great. So then new tools like this are good, even though they weaken copyright,
Does it weaken copyright? Like I mentioned, it seems like it's probably allowed under existing law.
> it might hurt people's ability to profit from their code
I don't really buy this. The outputs of Copilot seem transformative enough that they won't by themselves meaningfully compete with the applications built from the sources in the training set.
It seems to me that people are objecting to it more as "theft" on ethical grounds alone. I don't really have a strong opinion either way on that front, but if I did it would be based on principle and not some theoretical material harm, because I think the latter is marginal at best.
> You suggested protecting profits as if it is the only or best way of promoting progress.
In fields where creators make money by selling access to copies of their work, what is a better way of promoting progress? People need places to live, and things to eat, and other things, and all of that costs money. If working in these creative fields becomes even less lucrative than it already is, fewer people will be able to do it, and for less time, because they will have to spend more of their time making money in other ways.
In tech, many of us are privileged to have a fair amount of spare money and time. Don't forget that not everybody enjoys that privilege, and please try not to attach a negative-valued concept of "profit" to the necessities of survival.
> When, in reality, stuff like these tools are actually a much better way of doing so. And it does so in a way that undermines copyright law, in a beneficial way.
Again, this is a very tech-centric view. I can't imagine (for instance) the average novelist being particularly happy to have the exclusivity of their rights to their own work curtailed to enable the creation of some tool, using their work as an input, for generating prose. And such objections would be absolutely correct, if anybody was actually talking about doing that.
Fortunately, nobody is talking about doing so--not for code with Copilot, not for fiction prose with the new GPT-3 tools that are popping up, and not for any other medium I'm aware of. These applications are covered under existing fair use law, and their existence does not depend on weakening the exclusivity of creators' rights to their work.
If you were to tell me that such rights should be curtailed to enable tools like Copilot to exist, I would strongly disagree with you. But--again--such curtailment is not necessary. The only reason I'm talking about it here is that there are people who think copyright should be abolished or strongly weakened. Almost universally, I've found, they're people who don't make money off distributing authorized copies of their work. So, if you ask me, they really have no clue what they're talking about, and shouldn't be running their mouths before seriously listening to (at least) the independent creators who would be impacted by such a change.
> Does it weaken copyright? Like I mentioned, it seems like it's probably allowed under existing law.
Yes, it does. It is legally allowed, but in the past it was much more difficult to launder or copy code in the way that an AI does it.
Something becoming easier to do, has an effect, even if it was legal in the past.
> what is a better way of promoting progress
More technologies like this, that allow better sharing of code and information. It reduces the barrier to entry to creating content, thus causing more of it to be made.
> If you were to tell me that such rights should be curtailed
You have it reversed. Rights do not need to be curtailed to enable these tools. Instead, I am advocating for these tools to be produced for the purpose of curtailing those rights. The causation is reversed.
The rights should be "curtailed" through the process of tools that allow people to easily get around the law, and to make the law unenforceable. Changing laws is much harder than making the law irrelevant.
We don't need to change any laws, if we just make it impossible for laws to be enforced.
It is kind of like how BitTorrent undermined copyright laws. No laws needed to be changed for piracy to become rampant and unpunishable. (And don't even try to challenge me on this point, that piracy is effectively unenforceable these days. If you do, I'll just go watch the latest episode of some Marvel show, for free, right now, lul)
> More technologies like this, that allow better sharing of code and information. It reduces the barrier to entry to creating content, thus causing more of it to be made.
Like I said, this seems like a highly tech-centric viewpoint. Keep in mind that source code is far from the only thing covered under copyright law. Personally, if I was stranded on a desert island, I'd rather have a single original novel written by a human than a hundred novels' worth of GPT-3 output.
Beyond that, your perspective is pretty interesting--I guess you support the existence of tools like this because you see it as an opportunity to erode existing copyright law. Personally, I may not support the full extent and implementation of copyright law in America, but I do support the fundamental principle that a creator should have exclusive rights to their work. So we disagree pretty strongly on that, and I doubt we'll find common ground.
I guess I would just urge you, if you value art at all, to consider how independent artists like writers and musicians would be affected by the elimination of copyright. I don't really give a shit about the IP rights of programmers (even though I'm one myself, with public FOSS contributions), but you seem willing to throw out the baby with the bathwater.
Who can? Sure, Disney shouldn’t be able to copyright public domain works or Mickey Mouse until the end of time. But they also shouldn’t be able to swoop in, use your songs/artwork/software in their latest movie, without permission or appropriate compensation.
Don’t be too eager! Weakened copyright doesn’t necessarily translate to an overall benefit, at least for software.
Weakening copyright also weakens copyleft - for example, it seems reasonable to me that the producer of an open-source work should be entitled to require reciprocal openness from people who build upon it. If I can legitimately launder some GPL source code (say, a Linux kernel driver) through an ML model without being obliged to release the resulting code, I think everyone loses.
Why is this noteworthy? Who is this person? Am I missing something?
I agree that there needs to be a conversation about licensing and copyright, but with so little content there can be no meaningful discussion, only aimless banter.
This isn't interesting though. It doesn't even provide any value. It's a random guy that doesn't like GitHub, it could have just as well been a HN comment from yesterday.
It's just posted (not by the guy who made the page, mind you) to farm karma, exploit the news cycle, and carve out some more space for discussion of this tired topic.
Right, but you either need a solid argument or some authority, and this guy has neither. He's effectively a nobody and he has just jumped to the conclusion that CoPilot is illegal.
If he had a good argument for that, fine. But without that he really needs to be someone whose opinion I care about.
This is backwards logic. Does a rape victim need authority to speak about the rape in order for it to be valid?
What about it is not solid? Copilot is using community code that is under the GPL license; therefore Microsoft should not be able to charge for Copilot, but should give it away for free rather than create another revenue stream.
No, a rape victim needs a solid argument, i.e. evidence.
> What is there that is not solid, CoPilot is using community code that is under GPL licence therefore Microsoft should not be able to charge for CoPilot but give it for free, or not create another revenue stream.
You're doing the same thing as OP by assuming that this is illegal. That has yet to be determined. It could easily be the case that this falls under some fair use law or isn't even covered by copyright. It isn't for humans!
How about just leaning back and reading the discussions which evolve out of this post? Some may have something to say about it which will either help you solidify your point of view or add a new perspective to it which you might have missed.
The topic is a current one [1], which makes it even more valuable.
Anyone publishing anything on the Internet should expect this type of use case. If it is removed from github and republished via another site, there is absolutely nothing preventing another service/company from doing the exact same thing (or 'worse'... i.e. imagine a learning system that can actually understand the code) when scraping the alternative location. It's not unusual for bots to be among the most frequent visitors to low traffic pages these days and they aren't all just populating search engines.
A bigger concern for many is that if you USE copilot, you’ll unintentionally copy code with licences that your company really, REALLY does not want to copy. For example, here’s copilot copying some very famous GPL code: https://twitter.com/mitsuhiko/status/1410886329924194309?s=2...
And basically every software company avoids GPL like the plague, due to its strong copyleft conditions.
Sure, but that's a different end of the issue than I was referring to. I was pointing out that just taking code off of github wouldn't avoid the use case. Any published code from any public source is likely to eventually be used this way by someone.
Yeah, I agree with your point that “if you publish content to the internet, expect it to be used in ways you don’t intend, or even permit.” Just pointing out that a lot of the concerns are not “GitHub is stealing my code for use in Copilot,” but “using GitHub Copilot in my proprietary software is a massive risk/liability.”
Anyone know how they're hosting their repositories? https://thelig.ht/code/ is actually kind of nice and minimalist; I was hoping to set up the same thing, mostly just for kicks.
Lots of people arguing this guy isn't anybody, but the name seemed sort of familiar to me and my quick googling and looking at his site makes me think he probably has done something that some people use? For example dbxfs seems to have quite a history.
on edit: just saw there was a description of who he is https://news.ycombinator.com/item?id=27724247. As noted, I don't know, but I'm not sure it's enough to imply a bad motive of him wanting to get attention for opposing Copilot.
on second edit: huh, seems to be one of those occasions when I have mysteriously offended some people on HN without swearing, joking or being rude.
I think there's an argument to be made that neural nets are in some sense a form of compression. The model is a lossy compressed representation of the data. So training a model on copyrighted data is quite direct copyright infringement - you're compressing, then redistributing.
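To make the compression argument concrete, here is a toy sketch. It is not a neural network, just an order-6 character lookup table standing in for the "lossy compressed representation" (and in the extreme overfit case the compression is barely lossy at all); the training string is made up for illustration:

```python
# Toy illustration: a model fit too closely to its training data stores
# enough of it to emit the original verbatim. "Training" here records which
# character follows each 6-character context.
ORDER = 6
training_code = "const MAGIC: u32 = 0x5f3759df; // fast inverse square root\n"

model = {}
for i in range(len(training_code) - ORDER):
    model[training_code[i:i + ORDER]] = training_code[i + ORDER]

# "Sampling": starting from a short prompt, the model regurgitates the
# entire training input, character for character.
out = training_code[:ORDER]
while out[-ORDER:] in model:
    out += model[out[-ORDER:]]

assert out == training_code  # verbatim reproduction of the training data
```

A large language model is vastly more capable of generalizing than this lookup table, but the failure mode being discussed (verbatim recitation of rare or unique training examples) is structurally the same.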
Has this ever been used as an argument in a legal case?
This huge revolt is interesting, but I doubt it makes GitHub very scared. They'll just come out with some new version which they'll show takes licenses into account (or is trained on a dataset of ~10k repos with hand-checked licenses), and that'll be that; we'll all have forgotten about this a week later.
And yet would they have bothered without the "huge revolt"? I doubt it; what you are actually describing is a scenario where the "huge revolt" succeeded.
Grats. I abandoned it a while ago as well. If anyone is looking for a rec for self-hosting: Gitea is cake. It sits nicely behind Caddy, as all my other services do. Alternatives such as GitLab, I found, wanted to 'own' too much of my system.
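For anyone curious, that setup is tiny. A minimal sketch in Caddy v2 syntax (the hostname is a placeholder; 3000 is Gitea's default HTTP port):

```
git.example.com {
    reverse_proxy localhost:3000
}
```

Caddy also provisions TLS certificates for the hostname automatically, which is a big part of why the pairing is so low-effort.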
The irony of it all is that their code will find its way into the next Common Crawl release anyway and that's used to train GPT-3, which in turn forms the basis of OpenAI Codex, which is the product that CoPilot builds on...
So hosting elsewhere might not save your code from ending up deep in the bowels of some corporate black-box ML model that occasionally regurgitates your IP if accidentally given the wrong (right?) prompt.
If you make your code public, you basically accept that someone will copy it verbatim. Other companies still might have it in their closed source product somewhere, even if it's just accidental copypasta from SO.
The real dark bit there is that if someone else posts something publicly, but your own stuff is "obscured" in some way, they may be able to (falsely) claim prior art, as the original author.
Amongst other things, hosting something on github is a public ledger of authorship.
The license you have chosen requires attribution. You may not care[1] but the other party still most likely will be in violation if Copilot reproduces a significant chunk of your code.
[1] I also MIT license my public code on Github, and also wouldn’t care that much.
I wonder if the license is still binding in the other direction though. Moving forward, by publishing the code on the Internet you know you’re training an AI to copy it.
What if you published a subtle proof of concept that takes out nuclear plants, and then some knucklehead deployed it because Copilot suggested it?
I could see certain...agencies doing something like seeding the tech scene with insecure hashing algorithms, which then become part of the canon through consumption by uncritical ML training pipelines.
We get back to the old "data quality" conundrum. We need to have a way to rate the quality of our data, which then opens the door to corruption and gaming.
I find the reaction from software devs to how Copilot uses their code for ML interesting, in that ML companies have been doing this with every other form of produced content: texts, posts, messages, photo captions, etc. And most likely even less care went into adhering to laws or ethics there. Yes, code has licenses and thus more distinct legal ramifications, but on the other side are people who don't really understand that every time they interact with software or produce some content, everything is gathered and harnessed to power all these companies.
Regardless of how I feel about this usage, I’d be more concerned with the very real possibility of introducing vulnerabilities this way. Say Copilot takes a snippet from a code base. That snippet had a vulnerability, and it was fixed by the team that understood the what and the how. How does the vulnerability get fixed in your copy? Does Copilot let the user know months later that the snippet it suggested is actually very bad, and that the company that originally wrote it has fixed it and you should too?
I abandoned GitHub when they took code that was not licensed (i.e. copyright retained), reproduced it, and saved it in their Arctic Vault without the author's consent (mine).
What's wrong with the Arctic Code Vault [1]? Is the only problem that they didn't seek your consent? How is it different to deploying a new availability zone and having your public repos accessible on another server? Your code is preserved verbatim, and it's not possible for GitHub to provide their service without the right to make verbatim copies of your code, which presumably you agreed to as part of their ToS.
> is basically reprinting it without my permission.
What if I were to tell you, that in order to publish any code, on the internet, that code has to be "reprinted" to many different computers and places?
In fact, whenever you yourself need to even access that code, that code is copied over to many different computers along to way, as is necessary to send it to you.
But LTO is fine? I was going to ask if it was because it's not intended as a backup, but that's not even true, this is intended as a backup on a long time scale.
I haven't read this interpretation of the Arctic Vault project - presumably most users of GitHub are okay with their code being reproduced/backed up across many production servers for fault tolerance. Making an 'extra special' long-term backup in the Arctic Vault doesn't seem like a meaningfully different action to me - i.e. using a cloud-based host is essentially opting in to this kind of 'license violation'.
If they had taken one of their existing DB/disk backups and called it a vault, would that have been an issue?
Github does not own the Arctic Vault, there is an independent company behind it [1]. Given its purpose as a long-term archival, it is likely that exemptions to the copyright for (library) archival can apply here. [EDIT: This is probably not true, see the reply for the reason.]
> Github does not own the Arctic Vault, there is an independent company behind it
Github are the ones doing all the archiving. So, in essence, they do own that. Piql are just the ones providing the storage: it's a commercial for-profit entity employed for backup by another commercial for-profit entity.
It is technically true, but the Arctic World Archive specifically "accepts deposits that are globally significant for the benefit of future generations, as well as information that is significant to your organisation or to you individually" [1]. So it doesn't accept just any data (at least as far as I can see), and the GitHub archive must also have met this criterion.
By the way, my initial statement that it may qualify for copyright exemptions turned out to be false for a different reason. They only apply when the library and/or archive in question is open to the public, and the Github Arctic Vault isn't. Thus I think it's actually a Github's generic usage grant in the ToS [2] that allows for the Vault. The Copilot is, of course, very different to anything described in the ToS.
...provides prime-rate marketing bullshit in its marketing materials
> Thus I think it's actually a Github's generic usage grant in the ToS
If you refer to Section D.4, then:
- Arctic Vault is not "for future generations" but for GitHub only, since that section doesn't permit GitHub to just make copies willy-nilly for anything other than "as necessary to provide the Service, including improving the Service over time" and to "make backups"
- This specifically makes GitHub "the owner" of that data, and not "some third-party" as you originally suggested
If you insist on the term "owner" for copyright grants, you have a faulty understanding of copyright. The terms of service, much like a software license, only allow the licensee to do some specific things (in this case, including backups) under certain circumstances agreed upon in advance. Copyright assignment, which is akin to ownership transfer, is much harder.
> This specifically makes GitHub "the owner" of that data, and not "some third-party" as you originally suggested
This one is my fault, though: I used "Arctic Vault" as if it were an archival site, but as I later realized, it is GitHub's archive stored in the Arctic World Archive. So yeah, it's (only) GitHub that can retrieve the data.
This is a commercial for-profit company, GitHub, taking some code and storing it in the cold storage of another commercial for-profit company, with no one except these two parties having access to the code. And it doesn't even look like GitHub has the right to do it, because it stores the code for some purpose other than what is stated in their ToS.
I wonder if the whole kerfuffle around Copilot will end up spilling some light on this, too.
How is the Arctic Vault different from any other offsite backup?
I suppose one issue is that you (presumably) can't request deletion from it (which may even be a GDPR violation).
Edit: I looked up the relevant GDPR stuff; apparently there's an exemption for when "erasing your data would prejudice scientific or historical research, or archiving that is in the public interest", which arguably covers the Arctic Vault.
There’s a lot of consternation over copyright issues, but I see an entirely different problem. When I hear this tool described and see it’s examples the first thing I think is that Github has just automated the dubious process of copy/pasting from StackOverflow.
As a senior developer, I am strongly biased against the SO+c/p programming approach that I’ve seen many Junior and mid level developers use. There’s certainly a time and place for it when you become really stuck but at least having to go out and find the code yourself requires thought which helps you grow.
My gut reaction to Copilot is that adding this automation into IDEs is going to have a net-negative effect on growing developers as it lowers the level of thought and effort necessary to write even trivial applications. This is a huge detriment to learning. You don’t even get the chance to try to solve the problems yourself because the AI is going to be proactively getting in the way of your learning.
All that being said, I think a tool like this could be of great use with boilerplate within a project — but only suggesting things from that project. For example, setting up a new api route, dependency injection, error propagation, etc. Help with all of these mechanical things would be awesome.
This is a hell of a Pandora’s box that’s being cracked open here.
Interesting times ahead. For example, if you believe these kinds of tools will become a huge competitive advantage, and that the inclusion of GPL code is a meaningful force multiplier, it kind of implies the fusion of AI code generation and the GPL will eat the world.
Only if people understand that the result is under GPL; if they don't, then this is a mechanism to slowly "launder" the work people put into GPL code to funnel into non-GPL codebases.
It depends on which "people" you're referring to. I suspect the degree to which the programmer knows this is of little relevance to the question of how the legal + risk management implications will play out.
I mean general people people, not only developers: people includes managers and lawyers and politicians and everyone who might cause you to have GPL Copilot separate from MIT Copilot... the same people who right now cause licenses to matter, despite many developers not understanding anything about copyright law and just thinking "I'll steal that other developer's work as it makes my life easier".
If anything, I think the real test of this tech is going to be audio, as it has the right overlap of "big copyright is going to get pissed", "there already exist tools that attempt to automatically detect even small bits of infringement", "people actually litigate even small bits of infringement", and "it feels feasible in the near future": you whistle a tune, and the result is a fully produced backing track that sometimes happens to exactly sound like the band backing Taylor Swift on a recognizable song and generates Taylor Swift's voice, almost verbatim, singing some of her lyrics to go along with it.
Why is human understanding going to prevent this? Doesn't it seem like this is precisely the de facto function of Copilot: a license laundering machine?
If humans understand this then presumably lawyers would start hunting for code replication caused by Copilot--using automated mechanisms similar to those used by professors at Universities to catch people cheating--and do the moral equivalent of ambulance chasing: offering to file all the paperwork on spec for a cut of an assured payout. But if people in general believe this to be fair use somehow, then GPL is essentially dead (I have been a big advocate for it over the years, and if people are doing this--and everyone thinks it is OK--then it loses the entire point as far as I am concerned).
1. Train one ML implementation to produce "specification text" in a way that they're agreed to be free from copyright claims. E.g. train to avoid any direct quoting, possibly via a different human, programming or custom specification language.
2. Train a separate ML implementation to produce code from the specifications.
3. Hook them together and you've got a pipeline for generating learned, but copyright-free, code.
Kind of reminds me a bit of some of the machine translation work with human languages.
Note: this is essentially how the GNU project itself sometimes clones functionality in a copyright-free way, so I'm pretty sure it would be safe to use this on GPL-licensed code.
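The two-stage idea above can be sketched with trivial stand-in "models". Both functions here are hypothetical placeholders for trained networks, and the spec text and code are made up for illustration:

```python
def spec_model(source_code: str) -> str:
    """Stage 1 (placeholder): describe behavior without quoting the source.

    A real model would be trained to emit a specification that contains
    no verbatim fragments of its input.
    """
    return "a function that returns the square of an integer"

def code_model(spec: str) -> str:
    """Stage 2 (placeholder): generate fresh code from the spec alone.

    A real model would see only the spec, never the original source.
    """
    return "def square(n: int) -> int:\n    return n * n\n"

original = "def sq(x):\n    return x ** 2\n"
spec = spec_model(original)   # the clean-room "specification"
clone = code_model(spec)      # the regenerated implementation
# In this toy example, no line of the original survives verbatim:
assert original not in clone
```

The point of the split is the same as in clean-room reverse engineering: the second stage never touches the copyrighted input, only the intermediate description.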
Maybe now is the time to release a GPLv4 extending (and restricting) the four freedoms in relation to non-humans.
I expect the best lawyers from Microsoft have had a look into this, and maybe there are weaknesses in GPLv3 ready to be exploited by corporate AIs. What is the response from the FSF?
The argument that 'machines can learn from the code to produce something novel' doesn't bode well given copilot may very well produce code that is straight up cut and paste.
This just seems like a massive lawsuit waiting to happen.
What happens when you discover that you're using '20 lines of code from some GPL'd thing'?
What will your lawyers say? Judges?
It seems to me that if you use Copilot there's a straight up real world chance you could end up with GPL'd code in your project. It doesn't matter 'how' it got there.
I don't understand therefore how any commercial entity could allow this to be used without absolute guarantees they won't end up with GPL'd code. Or worse.
This is typical Microsoft behavior: embrace, extend, and extinguish. They embraced open source with the intention of controlling (GitHub) and exploiting it. And the interesting thing is that many people fell for this already ancient strategy.
So when MS does it it's evil, but it's perfectly fine for everyone else to do it?
I also don't see how any of this follows - they could've just crawled GitLab or any other OSS repository. They didn't even need Github for this.
Heck, is OpenAI doing embrace, extend, and extinguish on the entire web now, because they use Common Crawl [0] to train GPT-3, which forms the basis of Copilot?
Well, I did try to warn the Copilot fanatics [0]. They just downvoted me days ago and here we are. We have a GitHub Copilot backlash against the hype squad.
The GitHub CEO is nowhere to be found to answer the important questions on software licenses, copyright, and the legal implications of scraping the source code of tons of projects with those licenses for Copilot.
The fact that you can only use it in VSCode, and that Microsoft has an exclusive deal with OpenAI, screams an obvious 'embrace and extend'.
As for 'Extinguish', they will need to be very creative on that.
It seems like the most fair way to go would be for Copilot to be completely open sourced and hosted on GitHub. That way they’d be subject to the same terms/conditions they are imposing on everyone else’s code/repos.
To address a lot of the negativity around copyright fair use, Copilot should probably have adopted something like Stack Overflow's model, where contributors get rewarded with points. In this case, the repo that a Copilot suggestion came from would get a new type of star rating, and the more people used that suggestion, the more stars Copilot would assign. Fractional stars would be awarded depending on what fraction of each code snippet Copilot thinks came from a specific repo...
It could maybe at some point send rewards in form of donations etc. from Copilot users, similar to Sponsored repos.
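The fractional-star scheme above could be as simple as splitting one star per suggestion across the contributing repos. This is a toy sketch of that hypothetical scheme; the repo names and attribution weights are made up, and how Copilot would actually estimate the weights is the hard, unsolved part:

```python
def award_fractional_stars(attribution: dict) -> dict:
    """Split one star across source repos in proportion to how much
    of the suggested snippet each repo is estimated to have contributed."""
    total = sum(attribution.values())
    return {repo: weight / total for repo, weight in attribution.items()}

# Copilot (hypothetically) estimates a suggestion drew on two repos,
# three quarters from one and one quarter from the other:
shares = award_fractional_stars({"alice/utils": 3.0, "bob/algos": 1.0})
print(shares)  # {'alice/utils': 0.75, 'bob/algos': 0.25}
```

Summed over many suggestions, the fractional awards would accumulate into the star rating (and, per the comment above, could be mapped to donations instead of stars).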
Seems to me like they need to back out of this fast, or at the very least limit it so that it is only trained and then used on "license-compatible" projects, e.g. train it in isolation on MIT-licensed projects and then have the user explicitly confirm the license of the code they are working on before enabling it. They possibly even need to auto-enable a mechanism that detects when code has been reused verbatim and adds some kind of attribution (or respects other license constraints) where required.
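A verbatim-reuse detector of the kind suggested above is cheap to sketch: hash every n-token window of a suggestion and check it against windows from the training corpus. This is a minimal toy version (naive whitespace tokenization, Python's built-in `hash`; a real system would need a proper tokenizer and a scalable index):

```python
def ngram_hashes(code: str, n: int = 5) -> set:
    """Hash every n-token window of the code."""
    tokens = code.split()
    return {hash(tuple(tokens[i:i + n])) for i in range(len(tokens) - n + 1)}

def looks_verbatim(suggestion: str, known_snippet: str, n: int = 5) -> bool:
    """Flag a suggestion that shares any n-token window with known code."""
    return bool(ngram_hashes(suggestion, n) & ngram_hashes(known_snippet, n))

gpl_code = "for (int i = 0; i < n; i++) sum += a[i];"
# A suggestion embedding the same loop is flagged:
print(looks_verbatim("int t; for (int i = 0; i < n; i++) sum += a[i];", gpl_code))
```

This only catches exact token-level reuse; paraphrased or renamed copies would need fuzzier matching, which is exactly where the legal line gets blurry.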
Alternatively, they'll take it head-on, pay their lawyers to argue fair use, and blaze a new trail through the understanding of copyright application that allows this ML model (and others like it) to exist.
This is ultimately a Microsoft project, and they have Microsoft money and Microsoft lawyers to defend their position.
I see lots of comments blaming MS about this turn of events. I think this project has been going on pre-MS.
But what did we expect from GitHub....MS or not? This was an obvious survival mechanism for GitHub sans MS. All that coding data there? Let's turn machine learning or AI onto that and make something.
And if MS were treating GitHub as an "at arms length" corporate entity so as not to upset the opensource/free software community (because MS), then the blame lies fair and square with the management of GitHub.
This is not the first person I've seen ditch GitHub in favor of some other front-end... and it's not uncommon that the result looks like this, which often baffles me.
Say what you want about GitHub's near-monopoly position, but the UX is really great and accessible even to non-technical people. Maybe you don't need that, maybe you don't want the issue trackers, but it's worth thinking about who you're excluding with these kinds of front-ends.
I have some of my code BSD0-licensed (in practice public domain). One thing that I'm wary of regarding Copilot is: what would happen if my code became part of some proprietary code owned by a big multinational corporation, and then they DMCA'd me out? I'm in the middle of some digital housekeeping, and I think I will move my code somewhere else because of it.
Your code will end up on GitHub anyway if other people find it useful, as the majority of developers don't even understand you can self host git repositories, so they only know how to do their own development by taking the code they find and putting it on GitHub first.
I would be more sympathetic to the idea of Copilot if, in addition to stripping licensing information from permissive and copyleft projects, it could also inject the same amount of copyright-stripped closed-source code.
As it is now, it works towards weakening the copyright of free software while doing nothing (or very little) to closed software.
What a cool court case this would make. Is copilot's model sufficiently abstracted from the code it has read? Judges and juries learning about how the GitHub team avoided overfitting? Are humans who have read open source code producing derivative works?
Won't be long until we see an infringement case. /me grabs popcorn
I wish I had the guts to leave only "tombstones" for my GitHub projects, pointing to other sites where they're actually stored.
Unfortunately, GitHub enjoys the effect of most people being on it (correct me if I'm wrong), and leaving it is costly, regardless of whether the alternative is a reasonable service or not.
In the US, copyright is automatically granted: "Copyright protection in the United States exists automatically from the moment the original work of authorship is fixed."
Interestingly enough, this only became a problem once Copilot was announced as an extension, even though the model has been generating code since it launched.
I suppose it’s difficult to prepare billions of lines of code or data points and keep everyone happy.
For anyone who thinks this is even remotely okay, answer me this. Imagine that Copilot is actually just a gigantic PR campaign, and when you send a completion request, they do this:
They send the request off to a sweatshop in Bangladesh where a bunch of mechanical turk workers scroll through licensed codebases, find an appropriate snippet of code, agree on the best one, strip all of the licenses and attributions out, and send it back to you. (turns out they're very quick at their job)
How is this any different from what Copilot does, apart from the purely technical difference that you're pushing code through an artificial neural net instead of a bunch of human ones? Why is that supposed to matter?
It doesn't sound like the person you're replying to understands that the code returned is largely synthesized with OpenAI's Codex. It is not simply a "snippet selection" mechanism, it has "learned" (to a limited degree) patterns in code, and can generate those patterns even if they don't exist verbatim from the training set.
How well does Copilot respect licenses? I'm willing to bet it accidentally ingested some GPL code and will be spitting it out at some point. Would that be allowed by the GPL?
Also, someone could deliberately obfuscate the license text to fool it while keeping it clear enough for humans. Something like "License: if you use this source code to train a bot then you must obtain a commercial license, otherwise MIT license applies". The bot searches for "MIT" and thinks it's safe.
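The obfuscation trick above works against any naive keyword-based license scanner. As a toy illustration (the scanner here is hypothetical, not how Copilot actually classifies licenses), a first-keyword-match detector happily reports "MIT" and ignores the extra training clause:

```python
import re

def naive_license_guess(text: str) -> str:
    """Hypothetical naive scanner: report the first well-known
    license keyword found in the text."""
    for keyword in ("GPL", "Apache", "BSD", "MIT"):
        if re.search(keyword, text):
            return keyword
    return "unknown"

tricky = ("License: if you use this source code to train a bot then you must "
          "obtain a commercial license, otherwise MIT license applies")

print(naive_license_guess(tricky))  # "MIT" -- the training clause is ignored
```

A human reading the same text would immediately see the extra condition, which is the commenter's point: the clause is machine-opaque but human-obvious.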
So sue them (class action lawsuit on behalf of everybody who contributed restrictive licensed code). That's the only way this issue is going to be resolved any time soon.
It's strange that this wasn't an opt-in feature at GitHub. I also feel like it violates my integrity, and I will consider no longer using GitHub as well.
From their perspective, they weren't doing anything that abused their privileged position. Tabnine trained their model on open source code, much of which was probably hosted on GitHub. Why should GitHub have to ask permission if Tabnine didn't?
Whether training an ML model on code is fair use is still an open question, but I don't think GitHub is a greater villain here than anyone else doing the same thing (at least until they start using private repos).
Of all the hills to die on, this seems like an odd choice. Why not work with others to iron out the legal and technical issues with this new technology?
Not everyone has the patience and ability to discuss their objections in a public forum while their rights are being violated (in their view).
Some people have a passion and a very strong belief in their ideals and I applaud them for following through with it, even if I don't necessarily share their opinion on the matter.
This is exactly why people have issue with Github's Copilot.
It's not the technology, but the fact that any code you pushed to GitHub in the past 13 years is now 'accessible' to anyone.
Private repo?
Paid account?
Deleted repo five years ago?
Deleted repo today?
Proprietary code?
Embarassing commits?
Accidental API keys or passwords in commits?
All 'available'.
It feels like the entirety of GitHub was just 'leaked', and converted into a marketable product.
Would you push your code to a service if you knew it could be read by anyone one to ten years from now? Even if you paid to keep it a secret?
I'm not sold on the product, but it's important to note that GitHub Copilot was only trained on public repos, which means nothing should be out in the open that wasn't already made public by the authors.[0]
> GitHub Copilot is powered by OpenAI Codex, a new AI system created by OpenAI. It has been trained on a selection of English language and source code from publicly available sources, including code in public repositories on GitHub.
Yes, it was. From their site: "It has been trained on a selection of English language and source code from publicly available sources, including code in public repositories on GitHub."
Once you put something on the internet, you should assume it still exists out there somewhere even after deleting it. Even before copilot, all credentials that end up in a repo needed to be changed. I'm not sure what's supposed to be different now.
I know that some people have uploaded the Microsoft research kernel or even the leaked Windows source code to github at some point.
I wonder what Microsoft will do when snippets from that code start appearing in your code because of copilot. I'm guessing their lawyers wouldn't accept "the robot did it" as an excuse in that case.
I'm tempted to just throw stuff like "AWS_KEY=" at the algorithm and see how many working credentials I can steal from private repos.
> I'm tempted to just throwing stuff like "AWS_KEY=" at the algorithm and see
Anybody tried? What does actually happen if you do this kind of thing? I can think of a few more obvious "script kiddie" ideas, but I won't post them here lest a copilot developer sees it and closes all the elementary stuff.
> Would you push your code to a service if you knew it could be read by anyone one to ten years from now? Even if you paid to keep it a secret?
I'm old enough to remember when "assume anything you put in cleartext online is public" was received wisdom. We were taught that if you want to keep something private, keep it encrypted on your own local media. Or, failing that, at least on a server you control.
I'm not talking about copying, but about learning by reading code. You then synthesize the code you read; surely you don't expect any copyright law to apply in such cases.
Should I agree with this guy if I believe all software should be open source? I don't think snippets of code have copyright strength; we pass them around constantly in Slack chatrooms, IRC, and Stack Overflow...
Thanks nerd for promoting Copilot! I used to be obsessed with OSS ideology years ago too, when I was younger. It's a mental virus. I've been cured for years now. Hope you get better.
So I'm guessing we just need to wait for the court cases to resolve the various issues with this. Won't that be fun? But is that really likely?
My sense is that this is either:
a storm in a teacup;
a black hole that swallows everything around it;
a massive copyright mess that piles up without anyone noticing then explodes all over everything;
or something else entirely.
The next few years will be interesting then. I'm wondering what happens if/when a significant chunk of GPL code gets included into a commercial product. That will get lively.
>Some unknown person is trying to get some hype on “cancel github” cry.
>I don't give a shit about the Copilot, but I care even less about Rian Hunter and his statements.
This is untrue, because you had a choice between not saying anything at all and carrying on (clearly not giving a shit) or taking the time to leave such a comment (giving enough of a shit to inform everyone you don't give a shit). So far this and Lloyd's are the only crying going on in this topic.
Is it not obvious that you can care about a post on HN without caring about the page it links to?
Back away from this specific situation for a second: If you would ignore something entirely if it wasn't being shoved in your face, complaining about it being shoved in your face and saying it's stupid wouldn't mean you suddenly "care" about the underlying item.
(And no, I'm not saying that an HN post is shoved in your face. It's a more extreme example to make the point more clear.)
Innovation can take place so long as the right people are getting their deserved royalties. In this case, anyone whose code was used in the training set should get a lifetime royalty.
The essential issue is simple: taking someone's work product and financially profiting from that work without paying for it. No matter what, that is just wrong.
I have read a lot of open source code, somebody's work product, never paid to read it, and with that knowledge gotten a job which pays me. Is that wrong?
People really sign up without reading the Terms and Conditions and then complain when GitHub decides to do something with the data that they've given it permission to use under the ToS.
A tiny percentage (less than 1%) [0] of people read terms and conditions; they are long, repetitive, and often in legal language. If you were to read every terms and conditions and privacy policy (and every change thereof), you would waste over 240 hours per year. [1]
[0] Bakos, Y., Marotta-Wurgler, F. and Trossen, D. R. (2014) ‘Does Anyone Read the Fine Print? Consumer Attention to Standard-Form Contracts’, The Journal of Legal Studies, 43(1), pp. 1–35. doi: 10.1086/674424.
[1] McDonald, A. M. and Cranor, L. F. (2008) ‘The Cost of Reading Privacy Policies’, A Journal of Law and Policy for the Information Society, 4(3), pp. 543–568.
It's kind of interesting how quickly sentiment turned negative. The original feature showcase/announcement post was full of excitement by HN (which is kind of strange, if you think about how skeptical the HN crowd is towards AI/ML and automation of programming) but it hasn't been a week and people are already talking about the questionable ethics and potentially disastrous consequences of using the feature.
I can't speak for anyone else, but when I first saw it, it seemed kind of okay, but I also didn't really look too deeply in to it. As I've looked at it a bit more closely and thought about it for a few days, my original feelings have soured quite a bit.
I never considered the copyright and related ethical implications of ML at all, or thought about the impact it may or may not have on programmers. Your first thoughts on something can be wrong (and actually, often are) and it takes a bit to really think things though – or at least, it does for me.
You can spin your wheels all you want but going from simple first principles it is fundamentally flawed. If you believe ideas can be property, then you believe people can be property.
Yes. Though I'm not the sharpest tool in the shed, so the delivery may be less than ideal.
Copyright gives PersonA legal control over a subset of PersonB's behavior, even when PersonA is not involved. This is hard to defend, unless you are fine with people being property. Slavery gives PersonA legal control over PersonB's behavior. Under Slavery, PersonB has one master with lots of control. Under Copyright, PersonB has lots of masters with small controls.
Is there a way to believe that ideas can be property without it being a system of slavery? There's no logical way to make that work. Think about the moment that copyright "expires". At that instant, does matter disappear? Did property vanish? What changed? The only thing that changed was each person suddenly gained a little more freedom—the ability to share a new sequence that they couldn't share before. The property rights that the copyright holder had over other people went away. People became more free.
Anyone who thinks for themselves should be able to quickly deduce that these laws are shades of slavery laws, and not about property rights. I figured that out before I could legally drink. It's not that complicated. The question is why are so many duped? I think it's probably a question of priorities (I would say Freedom of Speech and the Press are more fundamental, and then Freedom to Remix and Distribute would be next) or perhaps it's because before the Internet there wasn't enough uncontrolled bandwidth for the truth to get out, or perhaps it's that the people are bombarded over and over again by the big lie from the moment of childhood—look at the FBI Warnings at the beginning of Disney Movies, or the dozens of times per day that you see the phrase "All rights reserved".
Has this person been in a coma? If I utilize a free service on the Internet, I'm trading for some kind of convenience with the knowledge that I am in some way being boned in the backend by teams of people, all of whom are likely more clever than I am and using my patronage to some kind of nefarious end.
The Internet isn't really a place to exercise an inflexible moral code. His new repository probably can be traced back to slave labor somehow if someone digs deep enough. Probably won't even take 6 degrees of separation.
If it makes it easier for me to code and gives me more time to do something other than work without doing irreparable harm to some sentient entity, I'm firmly in the who-gives-a-shit camp.
> His new repository probably can be traced back to slave labor somehow
And you're ok with that? It doesn't HAVE to be like this. Just because you've chosen nihilism, doesn't mean that's the only choice, and it certainly doesn't help anything.
Of course I'm not okay with that, but I'm also under no illusions about my lack of control over the technology I've chosen to build my life around. We parasites can't really complain that our hosts smell like shit when we're riding them to the bank, can we?
It doesn't HAVE to be like this, it just is, and all of the alternatives suck. If you want to choose to inconvenience yourself in order to pass a morality test that doesn't exist, go ahead I suppose.
It seems weird to do this over a feature that is still in technical preview; we don't even know if this product will ever ship publicly. I'm guessing a public release is still years off, given the number of issues they need to work through before release. My understanding is that they are working on an attribution system to catch cases with common code. Beyond that, this person seems to use MIT-licensed code, which can already be used internally by a company to host a proprietary service without attribution. It would make more sense to be outraged if you were using the AGPL or something.
Copilot regurgitating Quake code, including sweary comments - https://news.ycombinator.com/item?id=27710287 - July 2021 (625 comments)
GitHub Copilot as open source code laundering? - https://news.ycombinator.com/item?id=27687450 - June 2021 (449 comments)
Also ongoing, and more or less a duplicate of this one:
GitHub scraped your code. And they plan to charge you - https://news.ycombinator.com/item?id=27724008 - July 2021 (148 comments)
Original thread:
GitHub Copilot - https://news.ycombinator.com/item?id=27676266 - June 2021 (1255 comments)