Hacker News
All public GitHub code was used in training Copilot (twitter.com/noradotcodes)
1017 points by fredley 22 days ago | 707 comments

To me, the particular use case and whether it is fair use or not, is of minor interest. A far more pressing matter is at hand: AI centralization and monopolization.

Take Google as an example, running Google Photos for free for several years. And now that this has sucked in a trillion photos, the AI job is done, and they likely have the best image recognition AI in existence.

Which is of course still peanuts compared to training a super AI on the entire web.

My point here is that only companies the size of Google and Microsoft have the resources to do this type of planetary scale AI. They can afford the super expensive AI engineers, have the computing power and own the data or will forcefully get access to it. We will even freely give it to them.

Any "lesser" AI produced by smaller companies trying to compete is obsolete, while the better one accelerates away. There is no second-best in AI, only winners.

If we predict that ultimately AI will change virtually every aspect of society, these companies will become omnipresent, "everything companies". God companies.

As per usual, it will be packaged as an extra convenience for you. And you will embrace it and actively help realize this scenario.

I have about 300,000 photos that haven't been scanned by AI (unless someone at Backblaze did it without permission). I'm sure there are lots of other photographers out there who miss Picasa, which Google killed off to push everyone's data to their service. (It did really well at matching faces, even across ages, but the last version has a bug: when there are multiple faces in a picture, it sometimes swaps the labels.)

If there were offline image recognition we could train on our own data privately, could the results of those trainings be merged to come up with better recognition on average than any one person could do themselves with their own photos?

In other words, would it be possible for us to share the results of training, and build better models, without sharing the photos themselves?

Absolutely possible.

What I'm building into PhotoStructure is typically called "transfer learning."


PhotoStructure is entirely self-hosted, including model training and application: the public domain base models (trained on huge datasets) are fetched and cached locally.

By design, none of your data (or even metadata) leaves your server.

(I expect to ship this in an upcoming beta next month.)

I want to label all the faces in the photos I've taken since 1997, and save them in the metadata. I'll be glad to run it against my photos. Windows 10, WSL, and/or Virtual Machine with Linux of your choice.

I've got desktop builds for macOS, Windows, and Linux, as well as "headless" builds for Docker and even "directly" via Node.js. Instructions here: https://photostructure.com/install

Nice! Will try this out. Are you planning on taking advantage of in-built neural engines like that in Apple M1 for speeding up object/facial recognition?

I'd like to, but practically speaking, I'm at the mercy of native support in the libraries I'm using. If support is added, though, it's trivial for me to add the switch as a user-definable setting.

Yes, you're talking about federated learning.
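To make the idea concrete, here is a minimal sketch of federated averaging on a toy linear model. Everything in it is an assumption for illustration: the function names (`local_update`, `federated_average`, `federated_round`) are hypothetical, the "model" is just a slope and an intercept, and real systems add secure aggregation, client sampling, and much larger models. The point is only that clients share weights, never their raw data.

```python
from typing import List, Sequence, Tuple

Dataset = Sequence[Tuple[float, float]]  # (x, y) pairs that never leave the client


def local_update(weights: List[float], data: Dataset,
                 lr: float = 0.1, epochs: int = 50) -> List[float]:
    """Train y ~ w*x + b on one client's private data with gradient descent."""
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return [w, b]


def federated_average(client_weights: List[List[float]],
                      client_sizes: List[int]) -> List[float]:
    """Average the clients' weights, weighted by how much data each one has."""
    total = sum(client_sizes)
    dims = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dims)
    ]


def federated_round(global_weights: List[float],
                    client_datasets: List[Dataset]) -> List[float]:
    """One round: each client trains locally, only the weights are merged."""
    updates = [local_update(list(global_weights), d) for d in client_datasets]
    sizes = [len(d) for d in client_datasets]
    return federated_average(updates, sizes)
```

A coordinator would call `federated_round` repeatedly, sending the merged weights back out each time. Note that this sketch still has a central step (the averaging), even though no photos or labels are ever uploaded.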


> If there were offline image recognition we could train on our own data privately...

Apple does all face recognition and image processing stuff on the edge. On your iPhone or Mac.

I wondered why my phone sometimes got frighteningly hot while charging. Then, after manually adding some faces for it to recognize, I saw a note along the lines of "Your phone will update faces when the phone is charging". All my photos are backed up to iCloud, btw.

While possible, only tech-savvy people would take part in this "collective", which would of course hold only a minor fraction of the data that Google has access to. This is the same argument as saying that if you care about privacy, "just" don't use Google: easier said than done for the vast majority of people on earth.

I am not an expert on the field. But my hope was that this could be facilitated by Transfer Learning. Still don’t know how the scale economies could be achieved. Maybe just out of the sweat and network of passionate people like in the case of open source.

I work in the field. Transfer learning helps you get decent or even good models, but the best models remain the ones trained on large amounts of data. You may be able to get away with good, but not great, performance on your task. In some areas (like self-driving) you care so much about long-tail performance that you will need a massive dataset. In other areas, if your goal is to be the best relative to other large companies, you will also need a massive dataset.

Transfer learning's best use cases are fast prototypes and ML tasks that do not need state-of-the-art performance.
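For readers unfamiliar with the term, a rough sketch of the transfer learning pattern follows. It is not how PhotoStructure or any particular library does it. `base_features` is a hypothetical stand-in for a frozen, pretrained base model, and the only thing trained on your own data is a tiny logistic-regression "head" on top of its output.

```python
import math
from typing import List, Sequence, Tuple


def base_features(pixels: Sequence[float]) -> List[float]:
    """Hypothetical stand-in for a frozen pretrained model; it is never updated."""
    mean = sum(pixels) / len(pixels)
    spread = max(pixels) - min(pixels)
    return [mean, spread]


def train_head(examples: Sequence[Sequence[float]], labels: Sequence[int],
               lr: float = 0.5, epochs: int = 200) -> Tuple[List[float], float]:
    """Fit a small logistic-regression head on top of the frozen features."""
    feats = [base_features(x) for x in examples]
    weights = [0.0] * len(feats[0])
    bias = 0.0
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            z = sum(w * v for w, v in zip(weights, f)) + bias
            err = 1.0 / (1.0 + math.exp(-z)) - y  # predicted probability minus label
            weights = [w - lr * err * v for w, v in zip(weights, f)]
            bias -= lr * err
    return weights, bias


def predict(head: Tuple[List[float], float], pixels: Sequence[float]) -> int:
    """Classify a new input using the frozen base plus the trained head."""
    weights, bias = head
    z = sum(w * v for w, v in zip(weights, base_features(pixels))) + bias
    return 1 if z > 0 else 0
```

With only four labeled examples (two "bright", two "dark") the head is enough to separate new inputs, which is the fast-prototype case described above. It says nothing about long-tail performance.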

If there are hundreds of people with 100,000 photos each, that collectively is a massive training database, with a lot more labels and diversity of subjects.

By keeping the training data itself private, distributed and outsourced, you might be able to get otherwise unachievable levels of performance.

This isn't going to solve the data ownership issues though, since they contaminate the program trained on them (and its blackbox nature only makes it worse)... though I guess that specifically for copyright it's going to depend on the final usage of that tool ?

While interesting, we don't know enough about how models are learning to where we would be able to consider doing this.

There would still need to be a central model (and centralized management thereof) if I understand correctly.

Photoprism, digiKam, shotwell all have image recognition features, with varied levels of sophistication.

You don't encrypt your data before uploading to backblaze?

Oh heck no, I never encrypt data.

I run windows. It can't ever be secure, anyone who wanted to hack me could.

Scrambling the data really makes things worse as any accident requiring recovery of my data is also probably going to lose the encryption key.

The only time I ever lost any significant chunk of data (a person's lifetime set of photos!) was because Windows encrypted data at rest, and thus it couldn't be recovered after a disk crash.

Unless there is some corporate or legal requirement to do so, I'll never encrypt a whole disk, or backup.

> any accident requiring recovery of my data is also probably going to lose the encryption key.

... why?

I'd hate encrypting too if I threw away all best practices regarding it -- losing a key with the failed system is a "problem exists between chair and keyboard" type of issue.

Encryption protects your data from yourself, from your adversaries, from serendipitous grey-moral types, and from the prying eyes of over-zealous data-collection conglomerates.

You seem experienced in the field, so I won't presume what your best practices are -- but to be enthusiastic against encryption is a form of cheer-leading that I think I cannot ethically support; the longer I live and the more pervasive companies get to be with their data collection policies then the more powerful and required tools like encryption seem to become.

Agreed. Everybody talks about encrypting backups like it's common sense, but almost nobody talks about the risks involved with failing to back up the encryption key itself properly. The entire integrity of the backup then depends on that sensitive piece of data, and it's not something that can be openly shared by its nature, or included in the encrypted backup itself. It's even deceptive if your measure for success is restoring the backup to make sure it works properly, because there is now an implicit assumption that the encryption key is still valid and undamaged the next time you restore.
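One way to reduce that implicit assumption is to record a short fingerprint of the key next to the backup, and periodically check every stored copy of the key against it. The sketch below is an illustration only: `key_fingerprint` and `key_copy_is_intact` are hypothetical helper names, not part of Duplicity or any backup tool, and the approach assumes a long, randomly generated key. Do not publish a fingerprint of a human-chosen passphrase, since it would allow offline guessing.

```python
import hashlib
import hmac


def key_fingerprint(key: bytes) -> str:
    """Short, non-secret fingerprint of a random key (assumes a high-entropy key)."""
    digest = hashlib.sha256(b"backup-key-fingerprint:" + key).hexdigest()
    return digest[:16]


def key_copy_is_intact(candidate_key: bytes, recorded_fingerprint: str) -> bool:
    """Check that a stored copy of the key still matches the recorded fingerprint."""
    return hmac.compare_digest(key_fingerprint(candidate_key), recorded_fingerprint)
```

This only tells you the key copy is undamaged. It is not a substitute for an actual test restore using that copy.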

I wish backup tools like Duplicity would warn you about the risks of encrypting backups instead of warning the user if they disable encryption, because encryption has the possibility of rendering all those backups useless when the moment to use them finally comes.

I have a similar feeling that large swathes of my digital life would be rendered permanently inaccessible if 2FA was enabled and my device was rendered inoperable. (That's why I keep meticulous physical backups of emergency keys.) I think 2FA and the like should be considered a tradeoff with its own inherent risks and benefits, instead of a universally better option than randomly generated 80-character passwords alone.

The thought that data monopolization will be a moat against competitors is actually argued against by VC firms specializing in AI companies, who claim that after a certain amount of data (which is accessible to most people) the additional data isn't going to improve the model much.



And in case of Copilot, the training data isn’t a moat anyhow. Last I looked, everyone could freely access GitHub public repositories.

> If we predict that ultimately AI will change virtually every aspect of society, these companies will become omnipresent, "everything companies". God companies.

What we currently call AI is very far from AGI, and it's not clear that sitting on piles of proprietary data gives an edge towards AGI. If the goal is human-level intelligence, that has been demonstrably achieved with the far lesser resources of the public school system. :)

Current DL systems need huge amount of data, because they are very primitive: they work with immediate associations, so they require seeing data very similar to all possible inputs to generalize well.

As we develop more sophisticated systems, I expect that the leverage from data will tip over to engineering finesse, and nothing is better at fostering great engineering than the permissionless tinkering environment of open source.

> If the goal is human-level intelligence, that has been demonstrably achieved with the far lesser resources of the public school system.

Pretending that the scientifically managed public school system, that attempts to manufacture uniform educated humans on a conveyer belt, is responsible for human education is fairly ridiculous.

Children have a remarkable capacity to learn, and do so automatically through free play and exploration until public education wrings that curiosity out of them and turns education into a job.

Humans get educated despite the public education system, not because of it.

> Pretending that the scientifically managed public school system,

Say what now? There may be places on Earth that practice scientific management, there are definitely some that pretend to, but IME public school systems are neither.

Schools (at least American public schools) are one of the last bastions of Taylorism in the west. They treat students like uniform widgets on an assembly line.

You can read for yourself: https://files.eric.ed.gov/fulltext/ED566616.pdf https://radicalpedagogy.icaap.org/content/issue3_2/rees.html

“Treating X as uniform widgets” (where X are not uniform widgets) and “scientific management” are not only not the same thing, they are anticorrelated.

Yeah, everyone knows children will innately learn calculus from flinging mud at each other.

> If the goal is human level intelligence, that has been demonstrably achieved with the far lesser resources of the public school system.

Seems unlikely human education costs less than AI education in total.

For years we thought Google Translate was the best machine translation we would ever get. Then DeepL just popped up out of nowhere, and to this day other services still haven't managed to catch up.

Every now and then you get someone to think about an old problem on a clean sheet of paper and you might get a better result with less training data / investment.

Google doesn’t have the best (publicly) available reverse image search AI. That would be Yandex.

Google is actually pretty crappy at reverse image searches.


The point still stands though, Yandex is also a behemoth with access to a massive amount of data.

It's pretty clearly intentionally hobbled for various reasons (e.g. privacy, obscenity, etc). It used to work a lot better.

Which would explain why Yandex, specifically, is the best-in-category: being based in a country where your government enjoys trolling developed world's idea of decency and responsibility can have its advantages.

(Until, of course, they force you out and give company to some crony oligarch. But that idea is also not unknown to Yandex, I believe?)

On the other hand, DeepL (made by a small German company) is better than Google Translate.

Makes it sound like DeepL exists in isolation. It's good because the company behind it has the largest hyperlocal (small phrases with confirmed usages) translation data set.

There is a second best though. Apple offers image AI which is worse than Google's but wins because it works offline.

I've got 70,000 photos in my library, with AI search and recognition, all done on my device. Thanks Apple.

In fairness it's not quite as good, but, it's good enough for the searches I've wanted to do so far and gets better all the time. And they're adding searching text in photos this release. I'm happy to wait a little for this better implementation.

I largely agree. But there are still some fun opportunities around. One is the things Google would never touch for PR reasons (e.g. state-of-the-art scalable face identification). Another is just silly out-of-the-box creative uses of AI which wouldn't fit well with Google's brand.

If Google makes an amazing model that no one can beat, it will only be dominant as long as others get access to it freely. But if there are restrictions on access or if it's too expensive, other options will appear, and even if they're not as perfect, they'll still be very usable. Imagine a coalition of companies all feeding in data; that could compete just as well.

Google has all the data of all the users though. I'd wager that they won't just let AI companies scrape it.

I don't think Google uses user photos to train their photo search algorithm.

They use photos from the web for training, and then user photos are only used for the actual indexing.

I think it's an innate quality of technology.

Yes sophisticated AI tech concentrates power for those who already have power.

And the technology we all (presumably readers of HN) create can enhance the impact of the user. And this can result in unfair circumstances, in reality.

Law and force can prevent disproportionate use of power. Of course one must define the law, which may be done AFTER the offense has been committed. Further, if those who make the laws are corrupted by those with e.g. this AI tech power, then no effective law may be enacted and the hypothetical abuse will continue.

The final step is to break down these monopolies. The government can do that and has done it before.

Interesting, given that HN thinks that it is Yandex that has SOTA image search, not Google https://news.ycombinator.com/item?id=23976172 which kinda counters your logic.

It is Yandex that now collects massive amounts of data to improve its image search, while Google apparently doesn't.

Yandex is a giant, for sure, but Google is, like, 10 times bigger and still doesn't provide the best service.

They are not hoarding the latest results, except for a few cases where the general public is a year behind their secret sauce. Take a look at the huge zoo of planetary-scale models that are published by the big companies and universities (HuggingFace, https://modelzoo.co/, ...)

The problem with the huge models like GPT-3 is that they are too expensive even to run by regular people, not train.

Regular people yes, but no problem for decently funded startups.

This seems to be inevitable. An individual doesn't have horizontal scalability, you know... So, unless we get some kind of brain extension capabilities, there is no other choice but to build such technologies collectively.

Also, I think you are overdramatizing this. Governments used to be omnipresent (maybe still are), in a different way, more threatening to individuals and probably as threatening to societies as "everything companies" could be.

We can decide to stop using some (or most) Google services. It's hard, but it's not as if they are pointing a gun at us to make us use their services, right? Sure, for the cases when one cannot escape Google, use it; but for the rest of the scenarios? It's all about tradeoffs: Can I live without YouTube? Can I live with DuckDuckGo (Google Search is "better" but I don't mind)? etc.

Google search would be hard to replace. In fact, if Google search was turned off over night, the world would probably see a major economic downturn, caused by a sudden drop in productivity.

You, a person knowledgeable in this field, may choose to stop using Google services, but that won’t have any societal impact if you can’t also convince the “average” user to do the same.

And I have yet to see a single life-changing AI application. I haven't tested Copilot yet, but I'll bet it is so precariously useful that a lot of people will feel more productive without it. (BTW, the last time I opened VSCode, it could not even autocomplete Numpy, so I am not holding my breath for AI autocomplete.)

Well of course only the huge companies can develop products that require enormous resources.

But I'm not too worried here because everyone gets access to larger datasets every year, and it gets cheaper to process every year, so whatever Microsoft or Google is capable of doing now, smaller companies will be capable of doing in a few years.

It’s also a huge call for innovation. When a student learns to code, (s)he doesn’t need to analyse millions of Git repositories to get good at it. Throughout their entire career most developers will probably only see comparatively little code. Perhaps the equivalent of the Linux kernel, if that. And yet, we’re able to learn from the little we see and get reasonably good at coding. It is even debatable how much better one gets by reading more code (most of which is pretty crappy anyway).

I believe this is actually powered by OpenAI, which while large (now), is nowhere near the behemoth that Microsoft or Google is.

This suggests that seeing the future a bit ahead of the rest of the world, and then assembling a motivated all-star team is (perhaps in the short term at least) one way of out-competing the "super AI" of the giants.

Not only did Microsoft basically buy OpenAI a couple of years ago, they also made GPT-3 a closed thing that you can only access via an API.

Don't let the name fool you, OpenAI is anything but Open.

Last I checked, Microsoft pretty much owns OpenAI?

I don’t think that your premise concerning “planetary-scale AI” (and the ability to pull it off) holds up. If Google and Microsoft are so dominant and had such an insurmountable head start, why are we seeing such an enormous number of AI startups? In fact, there are countless startups busy figuring out how to make AI work for software development. I’d even argue that Copilot was not that expensive to build. I very much doubt that GitHub (or Microsoft for that matter) had a huge team working on this or has spent such a vast amount on hardware resources that they’d outcompete the rest of the market by virtue of their cash reserves. Any decently funded startup should be able to finance such an effort. Especially since in this case, the training data is cheap (and legal) to access for anyone.

Where Microsoft does have an “unfair advantage” is in their marketing and sales firepower. Replicating their B2B and B2C sales channels is indeed very expensive. GitHub will be able to monetise Copilot by some upselling campaign. Then again, startups regularly manage to break into markets that are supposedly locked down by the likes of Microsoft.

If the training set contains verbatim (A)GPL code does this mean that Copilot also should be distributed by Microsoft under GPL? Because without it Copilot (as it is distributed by Microsoft) couldn't be built, wouldn't it make it a derivative work of GPL'd code (and obviously every other license)?

I see a lot of people comparing human learning to machine learning in the comments, but there is a huge difference - we don't distribute copies of humans

No, see Authors Guild v. Google. Even without a license or permission, fair use permits the mass scanning of books, the storage of the content of those books, and rendering verbatim snippets of those books. The Google Books site is not a derivative work of the millions of authors they copied from, and if they did copy any coincidentally GPL, AGPL, or creative commons copyleft work, the fair use exception applies before we reach the question of whether Google is obligated to provide anything beyond what it is doing.

By comparison, Copilot is even more obviously fair use.

I've had this conversation quite a few times lately, and the non-obvious thing for many developers is that fair use is an exception to copyright itself.

A license is a grant of permission (with some terms) to use a copyrighted work.

This snippet from the Linux kernel doesn't make my comment here or the website Hacker News a GPL derivative work:

    ret = vmbus_sendpacket(dev->channel, init_pkt,
        sizeof(struct nvsp_message),
        (unsigned long)init_pkt, VM_PKT_DATA_INBAND,

This snippet from an AGPL licensed project, Bitwarden, does not compel dang or pg to release the Hacker News source code:

    await _sendRepository.ReplaceAsync(send);
    await _pushService.PushSyncSendUpdateAsync(send);
    return (await _sendFileStorageService.GetSendFileDownloadUrlAsync(send, fileId), false, false);

Fair use is an exception to copyright itself. A license cannot remove your right to fair use.

The Free Software Foundation agrees (https://www.gnu.org/licenses/gpl-faq.en.html#GPLFairUse)

> Yes, you do. “Fair use” is use that is allowed without any special permission. Since you don't need the developers' permission for such use, you can do it regardless of what the developers said about it—in the license or elsewhere, whether that license be the GNU GPL or any other free software license.

> Note, however, that there is no world-wide principle of fair use; what kinds of use are considered “fair” varies from country to country.

(And even this verbatim copying from FSF.org for the purpose of education is... Fair use!)

You're strongly and incorrectly implying that "Fair Use" is a clear (and relatively immutable) concept within copyright law, which couldn't be further from the truth. Even if this or that particular case sets out what appears to be solid grounds, one shouldn't take that as gospel by any means.

This mostly has to do with the wishy-washy nature of the 4-part Fair Use test, which, unlike decent legal tests, doesn't actually have discrete answers. The judge looks at the 4 questions, talks about them while waving her hands, and makes a decision.

Compare that to, e.g., patent law, where you actually do have yes-or-no questions. Clean Booleans. Is it Novel? Is it Non-Obvious? Is it Useful? If the answer to any of the above is "No", then no patent for you.

As for the execution of Fair Use, while I haven't gone too deep into software, I can assure you that for music, the thing is just a silly holy-hell mess; confirmed most recently by the "Blurred Lines" case, where NO DIRECT COPYING (e.g. sampling or melody taking) was alleged, merely that the song sounded really similar to "Got to Give It Up", and that was enough.

So then, I'd say everything either is, or should be, up in the air, when it comes to Fair Use and software.

Most law is wishy washy. There are very few cut and dry answers in the law (If there were, we wouldn't need lawyers and a court system based on deciphering the law).

All that said, the one thing I'd add about fair use is that it isn't permission to use anything you like, but rather a defense in a legal proceeding about copyright. It's pretty much all about being able to reference copyrighted material, with the law later coming in and making the final decision on whether or not that reference went too far. (I.e., copying all of a Disney movie and saying "What's up with this!" vs. copying 1 scene and saying "This is totally messed up and here's why".)

That was a big part of the Google v. Oracle lawsuit.

> Is it Novel? Is it Non-Obvious?

Those questions for patents are barely more clear-cut than copyright fair use tests, there is lots of room for disagreement.

It's definitely true that a fair use defense against copyright infringement varies a lot by the field of work and norms can develop which are relevant to court cases. The music field is a mess, the "Blurred Lines" judgement was total bullshit. But the software field is not without its own copyright history and norms so there's no reason to expect everything to go to hell.

But there's no reason not to either - I suppose my point is, don't take too much as gospel and think about everybody's best "end-goals" and push or pull with or against the law as needed.

There’s also an aspect of this that varies by size, budget, political clout, etc etc, of the individual or organisation.

The big guns like Microsoft, Google, Oracle, do this sort of thing as a matter of course in their business activities, they have the lawyers, the money, and the ear of members of parliaments, senators etc.

Whereas an individual or small business probably wants to conduct themselves within a more narrow set of adherences.

Unanswered question, as far as I know: is a trained model a derivative work? If the model accidentally retains a copy of the work, is that an unauthorized copy?

In my opinion, the model would not be an unauthorized copy, given that its primary purpose was some other task and the inclusion of the work was merely incidental.

The unauthorized copy arises when someone gets the work out of the model.

Of course if you make a model explicitly for the purpose of evading copyright then the courts will see through that ploy.

I think it would be pretty easy to stake opinions on those "boolean questions."

Is (was?) a swipe gesture novel? Is it non-obvious?

I think what the parent is stating is that even though the patent questions can be debated, once you settle the question "Is it Novel" as yes or no, you can determine if the item is patentable... whereas for fair use, the questions themselves aren't yes/no questions, and further, they are just used as balancing factors, so even if everyone agrees on "the effect of the use upon the potential market for or value of the copyrighted work", it's only weighed as a factor for how fair the use is, and broadly left up to the hand-waving of the particular judge.

Oh, absolutely. Kind of furthers my point. Patent is a silly mess in a lot of ways, but at least there's something like Booleans in it. "Fair use" doesn't even have THAT.

Yes to all this.

I think the factor most at risk in a fair use test with Copilot is whether it ever suggests, verbatim, code that could be considered the "heart" of the original work. The John Carmack example that's popped up here at least gets closer to this question: it was a relatively small amount of code, but it was doing something very clever and important.

One can imagine a project that has thousands of lines of code to create a GUI, handle error conditions, etc. that's built around a relatively small function; if Copilot spat out that function in my code, it might not be fair use because it's the "heart" of the original work. Additionally, its inclusion in another project could affect the potential market for the original, another fair use test.

But Copilot suggesting a "heart" is unlikely, something that would have to be ruled on in a case-by-case basis and not a reason to shut it down entirely. Companies that are risk-averse could forbid developers from using Copilot.

This is an excellent comment because it captures some important nuance missing from other analysis on HN.

I agree with you that the relative importance of the copied code to the end product would be (or should be) the crux of the issue for the courts in determining infringement.

This overall interpretation most closely adheres to the spirit and intent of Fair Use as I understand it.

For any discussion on copyright and fair use, we should distinguish between the implications to Copilot the software itself and the implications to users of Copilot.

For Copilot itself, I do see the case for fair use, though it gets fuzzy should Microsoft ever start commercializing the feature. Nevertheless it remains to be seen whether ML training fits the same public policy benefits public libraries and free debate leverages to enable the fair use defense.

For Copilot users, I don't see an easy defense. In your hypothetical, this would be akin to me going on Google Books and copying snippets of copyrighted works for my own book. In the case of Google Books, they explicitly call out the limits on how the material they publish can be used. In contrast, Copilot seems to be designed to encourage such copying, making it more worrisome in comparison.

>In your hypothetical, this would be akin to me going on Google books and copying snippets of copyrighted works for my own book.

A book completely written by pasting passages of other books would actually be a pretty interesting transformative work.

Yeah, but a book like this would be an artistic work.

While software is in this limbo between copyrights and patents...

The world is global. That's a US court ruling from one court of appeals. Most countries have narrower fair use rights than the US. Even if Copilot would fall within that legal precedent (far from guaranteed), a legal challenge in any jurisdiction worldwide outside the US states covered by that particular court of appeals, or which reaches the US Supreme Court, or which goes through the Federal Circuit Court of Appeals due to the initial complaint including a patent claim, would not be bound by that result and (especially in a different country) could very plausibly find otherwise.

What's more, if any of the code implements a patent, fair use does not cover patent law, and relying on fair use rather than a copyright license does not benefit from any patent use grant that may be included in the copyright license. If a codebase infringes a patent due to Copilot automatically adding the code, I can easily imagine GitHub being attributed shared contributory liability for the infringement by a court.

Not a lawyer, just a former law student and law feel layman who has paid attention to these subjects.

> law feel layman

What a weird autocorrect typo. This should have read "law geek layman." (And it initially autocorrected again as I was typing this paragraph.)

> No, see Authors Guild v. Google.

That case required that the output be transformative, in that "words in books are being used in a way they have not been used before".

Copilot only fits the transformative aspect if it is not directly reciting code, that already exists in the form that it is redistributing. So long as it does so, it fails to meet the criteria.

I think you might be considering two different acts here:

1. The act of training Copilot on public code

2. The resulting use of Copilot to generate presumably new code

#1 is arguably close to the Authors Guild v. Google case. You are literally transforming the input code into an entirely new thing: a series of statistical parameters determining what functioning code "looks like". You can use this information to generate a whole bunch of novel and useful code sequences, not just by feeding it parts of its training data and acting shocked that it remembered what it saw. That smells like fair use to me.

#2 is where things get more dicey - just because it's legal to train an ML system on copyrighted data doesn't mean that its resulting output is non-infringing. The network itself is fair use, but the code it generates would be used in an ordinary commercial context, so you wouldn't be able to make a fair use argument here. This is the difference between scanning a bunch of books into a search engine, versus copying a paragraph out of the search engine and into your own work.

(More generally: Fair use is non-transitive. Each reuse triggers a new fair use analysis of every prior work in the chain, because each fair reuse creates a new copyright around what you added, but the original copyright also still remains.)

Is there any evidence of Copilot producing substantial (100s of lines) verbatim copies of copyrighted works?

Absent this, I don't think there's a case. The courts have given extraordinarily wide latitude to fair use and ML algorithms are routinely trained on copyrighted works, photos, etc. without a license.

I understand that this feels more personal because it involves our field, but artists and authors have expressed the same sentiment when neural nets began making pictures and sentences.

The question here is no different than "Is GPT-3 an unlicensed, unlawfully created derivative work of millions, if not billions of people?"

No, I'm quite confident it is not.

> Is there any evidence of Copilot producing substantial (100s of lines) verbatim copies of copyrighted works?

It doesn't need to be substantial. In Google v. Oracle a 9-line function was found to be infringing.

If I recall correctly, the nine line question wasn't decided by the supreme court, but the API question was.

The Supreme Court did hold that the 11,500 lines of API code copied verbatim constituted fair use.


> The Supreme Court did hold that the 11,500 lines of API code copied verbatim constituted fair use.

Yes, because it was _transformative_, in a clear way. Because an API is only an interface. Which makes that part of that decision largely irrelevant to the topic at hand.

> Google’s limited copying of the API is a transformative use. Google copied only what was needed to allow programmers to work in a different computing environment without discarding a portion of a familiar programming language. Google’s purpose was to create a different task-related system for a different computing environment (smartphones) and to create a platform—the Android platform—that would help achieve and popularize that objective.

> If I recall correctly, the nine line question wasn't decided by the supreme court, but the API question was.

It was already decided earlier, and Google did not contest it, choosing instead to negotiate a zero payment settlement with Oracle over the rangeCheck function. There was no need for the Supreme Court to hear it.

A $0 settlement means there is no binding precedent and signals to me that Oracle's attorneys felt they didn't have a strong argument and a potential for more.

If they felt the nine line function made Google's entire library an unlicensed derivative work, they would have pressed their case.

> A $0 settlement means there is no binding precedent and signals to me that Oracle's attorneys felt they didn't have a strong argument and a potential for more.

That's not the case. It wasn't an out-of-court settlement, but an agreement about the damages being sought. The court had already found it to be infringing, and that was part of the ruling.

But none of that changes that 9-lines is substantial enough to be infringing. It isn't necessary to be a large body of work.

> If they felt the nine line function made Google's entire library an unlicensed derivative work, they would have pressed their case.

No... It means the rangeCheck function was infringing. The implication you seem to have inferred here wouldn't be inferred by any kind of plagiarism case.

I think we agree then, and appreciate the correction on the lower court settlement.

If Copilot is infringing, I suspect it's correctable (by GitHub) by adding a bloom filter or something like it to filter out verbatim snippets of GPL or other copyleft code. (And this actually sounds like something corporate users would want even if it was entirely fair use because of their intense aversion to the GPL, anyhow.)
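To make the Bloom filter idea concrete, here is a minimal sketch in Python. Nothing here reflects how GitHub actually implements or would implement such a filter; the names `BloomFilter`, `shingles`, `index_corpus` and `looks_verbatim`, the 3-line window, and the filter size are all assumptions chosen for illustration. The idea is to index every run of a few consecutive lines from the copyleft corpus, then reject any suggestion that contains a run the filter has (probably) seen.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, a small rate of false positives."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive k bit positions from two halves of one SHA-256 digest
        # (the usual double-hashing trick).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def shingles(code: str, window: int = 3):
    """Yield each run of `window` consecutive non-blank lines, whitespace-normalized."""
    lines = [" ".join(line.split()) for line in code.splitlines() if line.strip()]
    for i in range(len(lines) - window + 1):
        yield "\n".join(lines[i:i + window])


def index_corpus(sources, window: int = 3) -> BloomFilter:
    """Add every shingle of every source file to one filter."""
    bloom = BloomFilter()
    for source in sources:
        for shingle in shingles(source, window):
            bloom.add(shingle)
    return bloom


def looks_verbatim(suggestion: str, bloom: BloomFilter, window: int = 3) -> bool:
    """True if any shingle of the suggestion is (probably) in the indexed corpus."""
    return any(s in bloom for s in shingles(suggestion, window))
```

A filter like this only catches near-exact line runs (same tokens, different indentation). Renamed variables or reordered statements would pass straight through, so it addresses verbatim recitation and nothing more.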

It may be correctable... It doesn't change that Copilot is probably infringing today, which may mean that damages against GitHub may be sought.

The point of Copilot -- its entire value as a product -- is to produce code that matches the intent and semantics of code that was in the input. In other words, very deliberately not transformative in purpose.

Why did you choose the standard of "substantial" = "100s of lines"? Especially since we've already seen examples of verbatim output in the dozens of lines range, that choice of standard is rather conveniently just outside what exists so far. If we find a case with 200 lines of verbatim output will you say the only reasonable standard is 1000s of lines?

I don't think your argument is as strong as you're making it out to be.

Just a fairly arbitrary number. It's easy to produce a few lines from memory, up to 10s of lines and that's "obviously" fair use. I would be surprised if many of us haven't inadvertently "copied" some GPL code in this way!

This goes to the "substantial" test for fair use. Clips from a film can contain core plot points, quotes from a book can contain vital passages to understanding a character, screen captures and scrapes of a website can contain huge amounts of textual detail, but depending on the four factors for fair use, still be fair use. (There have been exceptions though.)

The reaction on Hacker News to a machine producing code trained on their works is no different than the reactions artists and writers have had to other ML models. I suspect many of us are biased because it strikes at what we do and we think that our copyrights (because we have so many neat licenses) are special. They are not.

I think it would need to get to that level of "Copilot will emit a kernel module" before it's not obviously fair use.

After all, Google Books will happily convey to me whole pages from copyrighted works, page after page after page.


> Just a fairly arbitrary number. It's easy to produce a few lines from memory, up to 10s of lines and that's "obviously" fair use.

it's anything but obvious. https://www.copyright.gov/fair-use/

> there is no formula to ensure that a predetermined percentage or amount of a work—or specific number of words, lines, pages, copies—may be used without permission.

9 lines of very run-of-the-mill code in Oracle / Google weren't considered fair use.

A big difference is that software both is and isn't an artistic work.

It's not possible to get copilot to output a transformed version of the input?

Transformed output _may_ fall under fair use.

However - Copilot directly recites code. That is _very unlikely_ to fall under fair use.

Redistributing the exact same code, in the same form, for the same purpose, probably means that Copilot, and thus the people responsible for it, are infringing.

> However - Copilot directly recites code.

You make that statement as an absolute, but in the interests of clarity, all evidence so far shows that it directly recites code very rarely indeed. Even the Quake example had to be prompted by the specific variable names used in the original code.

In practice, the output code is heavily influenced by your own context — the comments you include, the variable names you use, even the name of the file you are editing — and with use it’s obvious that the code is almost certainly not a direct recitation of any existing code.

> all evidence so far shows that it directly recites code very rarely indeed.

_Once_ is enough for it to be infringing. The law is not very forgiving when you try and handwave it away.

You sound quite sure that the outlying instances of direct copying wouldn't be covered by the Fair Use copyright exemption. Any particular reason for that?

I tend to think it would be covered (provided they were relatively small snippets and not entire functions).

I'm not the person you're replying to, but one strong reason is that the global reach and standardization of copyright law is far broader than the global reach and standardization of the fair use exception. A single non-US country in which GitHub Copilot is used in a way that would be infringing without the US fair use exception, and outside the scope of any such exception in that law, would be enough to cause GitHub/MS a legal hassle. There could well be more than one such country.

Oh, absolutely.

I'm not American, but like others around here — I was just restricting the discussion to American law for simplicity's sake.

Fair, but GitHub/MS (same company now) can't afford to ignore other countries' law in their internal evaluations of whether globally* available products like Copilot are legal.

* Minus a few countries/regions targeted by US sanctions, I assume, though they've gradually broadened their services in sanctioned countries with the necessary licenses from OFAC.

Precedent. Google v. Oracle found 9 lines of an "obvious" implementation to be infringing.

Right, but would 3-4 lines in the middle of a 50 line function also be infringing? What about 2 lines?

I don't know the answer. I was only surprised that the commenter seemed dead sure that any and all copying (no matter how small) would be infringing.

That just doesn't correlate with my understanding of how Fair Use works: The "amount" of the infringement is one (of several) factors in determining if something falls under Fair Use:

>The third factor assesses the amount and substantiality of the copyrighted work that has been used. In general, the less that is used in relation to the whole, the more likely the use will be considered fair.

From https://en.wikipedia.org/wiki/Fair_use

So if a foreign company pilfers the source code to Windows, can they add it to a training set and then 'prompt' the machine learning algorithm to spit out a new 'copyright free' Windows, just by transforming the variable names?

I think that's my question regarding this whole thing:

If it's so fair use, why not train it on all Microsoft code, regardless of license (in addition to GitHub.com) ? Would Microsoft employees be fine with Copilot re-creating "from memory" portions of Windows to use in WINE ?

Well no, because only GitHub has access to the training set. But more importantly this misunderstands how Copilot even works -- even if Windows was in the training set, you couldn't get Copilot to reproduce it. It only generates a few lines of code at a time, and even then it's almost certainly entirely novel code.

Now, if you knew the code you wanted Copilot to generate you could certainly type it character by character and you might save yourself a few keystrokes with the TAB key, but it's going to be much MUCH easier to simply copy the whole codebase as files, and now you're right back where you started.

GPT-3 is still Microsoft licensed, but a similar model can be put together with the freely available GPT-2 and source code -- especially if your intent is copyright transfer.

As Francois Chollet points out in this talk, ultimately deep neural network models are locality-sensitive hash tables, so the examples of people pulling out source code are an inherent shortcoming of deep learning models in general. Give the right 'key' and you can 'recall' the value you are looking for.


> "However - Copilot directly recites code."

Sounds like that wouldn't be difficult to fix? Transform the code to an intermediate representation (https://en.wikipedia.org/wiki/Intermediate_representation) as a pre-processing stage, which ditches any non-essential structure of the code and eliminates comments, variable names, etc., before running the learning algorithms on it. Et voila, much like a human learning something and reimplementing it, only essential code is generated without any possibility of accidentally regurgitating verbatim snippets of the source data.
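As a rough illustration of that pre-processing step, here is a sketch using Python's standard `ast` module (it needs Python 3.9+ for `ast.unparse`). This is not how Copilot works; `canonical_form` and the `v0`, `v1`, ... placeholder scheme are hypothetical names made up for the example. Parsing already discards comments and formatting, and the transformer then renames every identifier to a positional placeholder, so two snippets that differ only in names and comments collapse to the same text.

```python
import ast


class _Canonicalizer(ast.NodeTransformer):
    """Rename every identifier to a placeholder based on first appearance."""

    def __init__(self) -> None:
        self.names = {}

    def _placeholder(self, name: str) -> str:
        # The first distinct name becomes v0, the second v1, and so on.
        return self.names.setdefault(name, f"v{len(self.names)}")

    def visit_FunctionDef(self, node):
        node.name = self._placeholder(node.name)
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = self._placeholder(node.arg)
        return node

    def visit_Name(self, node):
        # Note: this also renames builtins such as max/min, which is fine for
        # comparing structure but means the output is no longer runnable.
        node.id = self._placeholder(node.id)
        return node


def canonical_form(source: str) -> str:
    """Return source with comments, layout and identifier names stripped."""
    tree = _Canonicalizer().visit(ast.parse(source))
    return ast.unparse(tree)
```

This sketch leaves docstrings, string literals and attribute names untouched, and whether training on such a form would really rule out regurgitation, or change the legal picture, is a separate question it does not answer.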

At that point, can we all just agree IP is the stupidest concept to ever be layered on top of math (which programming is) and move on with non-copyrightable code?

Only if you agree that copyleft licenses are also stupid; without copyright, there's no way to prevent companies from making closed-source forks of code you wrote and intended to stay open.

The whole point of copyleft was as a stepping stone to get to RMS's four freedoms (https://www.gnu.org/philosophy/free-sw.en.html) which effectively eliminates copyright for software.

Freedom 1: “Access to the source code is a precondition”

With no copyright/copyleft, how do you enforce the rule that derived works must provide access to the source code? I’ve never heard that copyleft was a stepping stone—rather, it’s the stick that fully realizes the four freedoms.

Correct. Copyleft is idiocy as well. You don't really need to pay for a proprietary fork of a tool when no one can keep you out of the free one, and the proprietary stuff diffuses into the free option.

Yes, sure. Without copyright there's no need for copyleft left, right?

No...? Not unless that closed-source project's source code is leaked?

You don't care about attribution and other moral rights ?

(I guess these are going to depend a LOT on the jurisdiction that you're in ?)

I care, but in the long run, I care more about our descendants not having tools locked out of their hands. Facilitated information asymmetry is the root of far too many evils.

Where is your ego when you're dead and gone? Where could we be if the majority of human advancement weren't tightly clutched as trade secrets?

As someone who has done paid software engineering (yes, you can feel free to call me a hack or sell out if you wish), I've come to find that the salary I've pulled over the years has not gone to me... But keeping a roof over those I love, helping other people's projects grow, giving people a shot, etc.

My time on the other hand, gets dumped into implementing the same handful of processes doing the same damn thing, but different this time, because you can't just bloody make "Here ya go, here's your Enterprise-in-a-box".

I'd like more people to be able to solve novel problems rather than having to retread the same path over and over. Some degree of that will always have to be done to keep the skills fresh in the population, but we could do way better at marshaling that split, and I'm convinced part of what necessitates it is creating artificial barriers through things like enforced implementation monopolization. Yes. It ensures a minimum level of novelty and variance across populations, but it also does terribly at not consuming the finite amount of human capacity for truly novel thought to innovate.

It may make societies that function based on greed and economic/fiscal measures work, but I'm not convinced other incentive structures won't keep the rolling stone of innovation from accruing moss.

I don't understand what you're talking about, I'm talking about the non-commercial parts of the monopoly rights that are copyrights and patents, the non-commercial parts arguably aren't going to restrict the users much, and their commercial parts are temporary by design.

(Copyright has gone IMHO overboard with its duration; we should scale it back to the original 14 years renewable once, just like patents, but copyright doesn't apply to processes anyway, and so arguably it shouldn't apply to software that can't claim to have any artistic merit.)

> By comparison, Copilot is even more obviously fair use.

You are correct about (US specific) the fair use exception, but it is in no way as clear as you suggest that what copilot is doing entirely falls under fair use. Fair use is always constrained.

I suspect some variant of this sort of thing will have to be tested in court before the arguments are really clear.

> By comparison, Copilot is even more obviously fair use.

Not sure I see it that way.

If I take your hard work that you clearly marked with a GPL license and then make money from it, not quite directly, but very closely, how is that fair use? Or legal?

Copying and storing a book isn't recreating another book from it. Copilot is creating new stuff from the contents of the "books" in this case.

Edit: I misunderstood fair use as it turns out...

Google did not scan those books and use it to build new books with different titles. The comparison doesn't hold up at all.

> Google did not scan those books and use it to build new books with different titles. The comparison doesn't hold up at all.

Not sure if you meant to reply to me but I agree with you: you can't compare what Google did to what Copilot does.

Copilot just suggests code.

And someone accepts it. Even if suggesting derivatives of licensed code is not a license infringement, then Copilot sure is a vector for mass license infringement by the people clicking "Accept suggestion". And those people are unable to know (without doing extensive investigation that completely nullifies the point of the tool) whether that suggestion is potentially a verbatim copy of some existing work in an incompatible license.

If I suggest whole lines of dialogue to you, the screenwriter, did I write those lines or you? If you change names in those lines of dialogue to fit your story, do you now gain credit for writing those lines?

Suggesting code is generating code

> did I write those lines or you

Neither. Someone else did, and published it. Copilot copied the dialog and suggested it.

> If you change names in those lines of dialogue to fit your story, do you now gain credit for writing those lines?

It depends. Talking generalities isn't productive or interesting. Can you give an example and we can discuss specifics?

> Suggesting code is generating code

This isn't even superficially true

There are situations where the question is whether the mishmashes from Copilot are 'fair use'.

But the other, more direct question is ... what about the instances where Copilot doesn't come up with a learned mishmash result? What happens when Copilot just gives you a straight up answer from its learning data, verbatim?

Then you, as a dev, end up with a bunch of code that is effectively copied, via a 'copying tool', which is GPL'd?

It's that specific case that to me sticks out as the 'most concerning part'.

Please correct me if I'm wrong.

For your specific case, “take your hard work that you clearly marked with a GPL license and then make money from it”, you don’t even need to rely on fair use. As long as you comply with the terms of the GPL, making money with the code is perfectly acceptable, and the FSF even endorses the practice. [1] Red Hat is but one billion-dollar example.

[1] https://www.gnu.org/licenses/gpl-faq.en.html#DoesTheGPLAllow...

But the person making money from the GPL code has to follow the terms of the license. Attribution, sharing modifications, etc.

Correct. That's why I said "As long as you comply with the terms of the GPL".

I've edited my comment with examples and a clarification.

Fair use is an exception to copyright and, by definition, copyright licenses.

I understand the concept of fair use (I think) but I can't see how it applies to Copilot.

Google didn't create new books from the contents of existing ones (whether you agree that they should have been allowed to store the books or not) but Copilot is creating new code/apps from existing ones.

Edit: I guess my understanding of fair use was wrong. I stand corrected.

If Google Books were creating new books, that would only help their argument. Transformativeness is one of the four parts of the fair use test.

Copilot producing new, novel works (which may contain short verbatim snippets of GPL works) is a strong argument for transformativeness.

It would help the transformativeness, but it would substantially change the effect upon the market. By creating competing products with the copyrighted material, there is a higher degree of transformativeness, but you also end up disrupting the marketplace.

I don't know how a court would decide this, but I do think the facts in future GPT-3 cases are sufficiently different from Authors Guild that I could see it going any way. Plus, I think the prevalence of GPT-3 and the ramifications of the ruling one way or another could lead some future case to be heard by the Supreme Court. A similar case could come up in California, or another state where the 2nd Circuit Authors Guild case isn't precedent.

> short verbatim snippets of GPL works

Define short


Yeah, I realise that now.

However, where does one draw the line between fair use and derivative works?

Creating something based on other stuff (Google creating AI books from the existing ones for example) would possibly be fair use I think but would it not also be derivative works?

There's no clear line and there can never be because the world is too complex. We leave up determination to the court system.

Google Books is considered fair use because they got sued and successfully used fair use as a defense. Until someone sues over Copilot, everyone is an armchair lawyer.

I don’t disagree with your point but was it necessary to make it in such a snarky way?


Would you please stop breaking the site guidelines? You've been doing it repeatedly and it's not cool. Please just be kind.


This is the clearest display yet that moderation on HN has absolutely nothing to do with your purported values like constructive criticism, and has everything to do with whether dang agrees with you or not.

I actually have no idea what you were arguing about, nor which side you were on, nor what your argument was. I haven't paid enough attention to know those things, because (a) I don't want to, (b) I don't need to, and (c) not doing it leaves me in the desirable state of being incapable of agreeing or disagreeing.

It's a happy fact that figuring out people's arguments is often unnecessary for moderating the threads, especially in cases where people are breaking the site guidelines. Everyone needs to follow the site guidelines regardless of what the topic is, what their argument is, and how right they are or feel they are. Please stick to the rules when posting here.


I don't think that's an accurate description...

Fair use is a defense for cases of copyright infringement, which means you're starting off from a case of copyright infringement, which sort-of mucks up the whole "innocent until proven guilty" thing. And considering it's a weighted test, it's hardly very cut-and-dry at that.

If you view GPL code with your browser would that mean that your browser now has to be GPL as well? In the sense that copilot is not much different than a browser for Stack Overflow with some automation, why would it need to be GPLed? Your own code on the other hand…

For sake of discussion, it would be clearer to split copilot code (not derived from GPL'd works) and the actual weights of the neural network at the heart of copilot (derived from GPL'd works via algorithmic means).

For your browser analogy, that would mean that the "browser" is the copilot code, while the weights would be some data derived from GPL'd works, perhaps a screenshot of the browser showing the code.

I'd think that the weights/screenshot in this analogy would have to abide by the GPL license. In a vacuum, I would not think that the copilot code had to be licensed under GPL, but it might be different in this case since the copilot code is necessary to make use of the weights.

But then again, the weights are sitting on some server, so GPL might not apply anyway. Not sure about AGPL and other licenses though. There is likely some illegal incompatibility between licenses in there.

As I understand it, the thing copilot tries to do is automate the loop of “Google your problem, find a Stack Overflow answer, paste in the code from there into my editor”. In that sense, the burden of checking the license of the code being copy-pasted is on the person who answered the SO question and on me. If this literally was what copilot did, nobody would bat an eye that some code it produced was GPL or any other license because it wouldn’t be copilot’s problem.

Now let’s substitute a different database for the code, one that isn’t SO. It doesn’t really matter if that database is a literal RDBMS, a giant git repo or is encoded as a neural net. All copilot is going to do is perform a search in that database, find a result and paste it in. The burden of licensing is still on me to not use GPL code and possibly on the person hosting the database.

The gotcha here is that copilot’s database is a neural network. If you take GPL code and feed it as training data to a neural network to create essentially a lookup table along with non-GPL code did you just create a derived work? It is unclear to me whether you did or not. In particular, can the neural network itself be considered “source code”?

> If you view GPL code with your browser would that mean that your browser now has to be GPL as well?

Some good responses in sibling comments already, but I don't see the narrow answer here, which is: No, because no distribution of the browser took place.

If you created a weird version of the browser in which a specific URL is hardcoded to show the GPL'd code instead of the result of an HTTP request, and you then distributed that browser to others, then I believe that yes, you'd have to do so under the GPL. (You might get away with it under fair use if the amount of GPL'd code is small, etc.)

If you use your browser to copy some GPL code into your project your project must now be GPL as well.

So following your own argument, even if Copilot is allowed, using it still risks you falling under GPL

My point exactly. Copilot is innocent in that case just like the browser.

Or if you simply read GPL code and learn something from it - or bits of the code are retained verbatim in your memory, are you (as a person) now GPL'd? Obviously not.

That probably depends on how large and how significant the bits you remember are. Otherwise one could take a person with photographic memory and circumvent all GPL licenses easily, by making that person type what they remember.

> Or if you simply read GPL code and learn something from it - or bits of the code are retained verbatim in your memory, are you (as a person) now GPL'd? Obviously not.

I do not find that to be obvious at all.

You do not find it obvious that a human being would not become a GPL'd work?

To build a browser you don't need verbatim GPL code, so it's not a derivative work in the same sense copilot is.

Stackoverflow on the other hand is much trickier question...

SO clearly doesn’t need GPL code to be useful. The wider SE network is evidence of that.

> If I take your hard work that you clearly marked with a GPL license and then make money from it, not quite directly, but very closely, how is that fair use? Or legal?

If I'm Google, and I scan your code and return a link to it when people ask to find code like that (but show an ad next to that link for someone else's code that might solve their problem too), that's fair use and legal. My search engine has probably stored your code in a partial format, and that's fine.

It's fine because a search engine is a generic tool the main purpose of which is not to replicate the code verbatim to be used as code.

>If I take your hard work that you clearly marked with a GPL license and then make money from it, not quite directly, but very closely, how is that fair use? Or legal?

You can wipe your ass with the GPL license if your use of the product falls within Fair Use.

You can actually take snippets from commercial movies and post them onto YouTube if your YouTube video is transformative enough for your usage to be considered fair use. Well, theoretically at least - in reality YouTube might automatically copyright strike it.

>Copying and storing a book isn't recreating another book from it.

That doesn't mean that GitHub has to redistribute Copilot under GPL. However, the end user could potentially have to if they use Copilot to generate new code that happens to copy GPL code verbatim.

> You can wipe your ass with the GPL license if your use of the product falls within Fair Use.

Is Copilot fair use? It's reading code, generating other code (some verbatim) and making money from it all while not having to release its source code to the world?

> That doesn't mean that GitHub has to redistribute Copilot under GPL

I wasn't saying that was the case: some of the code that Copilot used may not allow redistribution under GPL.

But let's say that all of the code it scanned was GPL for the sake of argument. Why would they not have to distribute their Copilot source yet, if I use it to generate some code, I'd have to distribute mine?

My spidey-sense is tingling at that one!

> Is Copilot fair use? It's reading code, generating other code (some verbatim) and making money from it all while not having to release its source code to the world?

Again, fair use is an exception to copyright protection. If something is fair use, the license does not apply. The fact that Copilot does not release its source code is related only to a specific term of a specific license, which does not apply if Copilot is indeed fair use.

Making money is irrelevant to fair use

Irrelevant to GPL maybe.

> ...the non-obvious thing for many developers is that fair use is an exception to copyright itself.

More precisely, fair use is an affirmative defense to a claim of copyright infringement. A fair use defense basically says, "Yes, I am copying your copyrighted material and I don't have a license (or am exceeding a licensed use), but my usage is allowed under the fair use doctrine (codified in 17 USC 107 in US law)."

Thanks for this, but can you answer the question:

Would it be 'fair use' for the developers to simply copy code from those repos - even just 10 lines, and claim 'fair use' - i.e. circumventing Copilot?

Even if Copilot is 'fair use' ... does that mean the results are 'fair use' on the part of Copilot users?

And a bigger question: is your interpretation of those statutes and case law enough to make the answer unambiguous?

I don't have legal background, but I do have an operating background with lawyers and tech ... and my 'gut' says that anyone using Copilot is opening themselves up to lawsuits.

If the code you put in your software comes via Copilot, but that code is verbatim from some kind of GPL'd (or worse, proprietary) source ... there's a good chance you could get sued if someone gets the inclination.

Maybe it's because of my personal experience, but I can just see corporate lawyers banning Copilot straight up as the risks are simply not worth the upside. That's not what we like to hear in the classically liberal sense i.e. 'share and innovate' ... but gosh it doesn't feel like a happy legal situation to me.

Looking forward to people with more insight sharing on this important topic.

> Would it be 'fair use' for the developers to simply copy code from those repos - even just 10 lines, and claim 'fair use' - i.e. circumventing Copilot?

Only a lawyer (and truly, only a court) could answer that question.

If you copy 100 lines of code that amounts to no more than a trivial implementation in a popular language of how to invert a binary tree, it's likely fair use.

If you copy 10 lines of code that are highly novel, have never been written before, and solve a problem no one outside the authors have solved... It may not be fair use to copy that.

Other people who have replied have mentioned "the heart" of a work. The US Supreme Court has held that even de minimis - "minimal", to be brief - copying can sometimes be infringement if you copied the "heart" of a work.

If this issue is eventually litigated, we will see. The law in the Second Circuit (where the final judgment was rendered before the case was eventually settled) may well be different than the law in a different circuit. If there is a split in the circuit courts, then the Supreme Court may have to weigh in on this issue.

When fair use is an issue, the courts look at the facts in context each time. These are obviously different facts than scanning books for populating a search index and rendering previews; and each side is going to argue that the facts are similar or that they are dissimilar. How the court sees it is going to be the key question.

This could either be:

1. a fascinating Supreme Court opinion.

2. a frustrating ruling because SCOTUS doesn't understand software and code.

3. the type of anticlimactically(?) narrow ruling typical of the Roberts court.

While our Congresspersons can't seem to wrap their minds around technology/social media, I think SCOTUS would understand this one enough to avoid (2).

Fair use cases tend to produce narrowly-written law because the outcomes hinge on how the court judges the facts against the list of factors codified in the Copyright Act (17 U.S.C. section 107). The courts don't really have breathing room to use a different test. I don't recall any cases in which the courts have set binding guidelines for interpretation of these factors.

The Google vs Oracle case showed that SCOTUS can handle technical topics

Next up, Copilot for college papers! Who needs to pay a professional paper-writer (ahem, I mean write the paper) when you can have an AI write your paper for you! It's fair use, so you're entitled to claim ownership to it, right?

I think you are confusing legal protections for intellectual property with plagiarism. (At least that's what I think you're doing if I read your comment as sarcasm and guess what you're trying to say non-sarcastically?) But they are entirely different things.

You can be violating copyright without plagiarizing, if you cite your source but copy a copyright-protected work in an illegal way when doing so.

And you can be plagiarizing without violating copyright, if you have the permission of the copyright holder to use their content, or if the content is in the public domain and not protected by copyright, or if it's legal under fair use -- but you pass it off as your own work.

Two entirely separate things. You can get expelled from school for plagiarism without violating anyone's copyright, or prosecuted for copyright infringement without committing any academic dishonesty.

You can indeed have the legal right to make use of content, under fair use or anything else, but it can still be plagiarism. That you have a fair use right does not mean "Oh so that means you are allowed to turn it in to your professor and get an A and the law says you must be allowed to do this and nobody can say otherwise!" -- no.

Yeah, I was being sarcastic. But you make a good point about the legality of plagiarism.

Copilot is not doing what your example does.

If Github had a service that automatically mirrored public repositories on Gitlab, that would be equivalent to the example you gave.

But Github is taking content under specific licenses to build something new for commercial use.

I'm not sure if what Github does falls under Fair Use, but I don't know that it matters. I can read fifty books and then write my own, which would certainly rely—consciously or not—on what I had read. Is that a copyright violation? It doesn't seem like it is but maybe it is and until now has been impossible to prosecute?

GitHub isn’t building anything.

The end user is.

By this logic any and all neural nets that draw pictures are copyright infringing as well.

If they create exact copies of copyrighted pictures, then yes, they do.

> Fair use is an exception to copyright itself. A license cannot remove your right to fair use.

...and if you're outside the USA?

Read the Authors Guild v Google dismissal. The court considered it fair use because Google's project was built explicitly to let users find and purchase books, giving revenue to the copyright holders. Copilot does not do that.

> ... giving revenue to the copyright holders.

That's a reference to factor four of the fair use test, "the effect of the use upon the potential market for or value of the copyrighted work." (17 USC 107).

None of the factors are dispositive, however. For example, a scathing book review that quotes a passage to show how bad the writing is might eviscerate sales of the book, but such a use is usually protected. For a counter-example, see Harper & Row v. Nation Enterprises, 471 U.S. 539 (1985).

> Note, however, that there is no world-wide principle of fair use; what kinds of use are considered “fair” varies from country to country.

Exactly the point I came to make.

The Authors’ Guild is a US entity, and so is Google, so only US law applies. And thus, we have the Fair Use exception.

But developers sharing code on GitHub come from and live all over the world.

Now, Github’s ToS do include the usual provision stating that US & California law applies, et cætera, et cætera [1], but… and even they acknowledge it may be the case, such provisions usually aren’t considered legal outside of the US.

So… developers from outside the US, in countries with less lenient exceptions to copyright, definitely could sue them.

Identifying these countries and finding those developers, however, is a different matter altogether.

[1]: https://docs.github.com/en/github/site-policy/github-terms-o...

While I agree you are correct that (in the US anyway) fair use is an exemption from copyright and thus supersedes licensing,

I disagree that Copilot is "more obviously fair use". Some parts might be, but we have seen clear examples (i.e. verbatim code reproduction) that would not be.

I don't believe the question of "is this fair use" is as clear as you believe it to be.

This was a good point. Really enjoying this discussion. Interesting stuff.

I'm really out of my depth in giving my own opinion here, but I'm not sure that either the "distribution != derivative" characterization, or that "parsing GPL => derivative of GPL" really locks this thing down. The bit that I can't follow with the "distribution != derivative" argument is that the copilot is actually performing distribution rather than "design". I would have said that copilot's core function is generating implementations, which to me does not seem like distribution. This isn't a "search" product, and it's not trying to be one. It is attempting to do design work, and I could see a case where that distinction matters.

I buy the argument about copilot itself and this comment. But when someone goes to release software that uses the output of Copilot, I fail to see how it wouldn’t be a GPL derivative work if enough source was used. Copilot is essentially really fancy copy/paste in that context.

I think this is the correct answer. IANAL but the copilot code vs the copilot training data are different things and licensing for one shouldn’t affect the other, right? And the fact that training data happens to also be code is incidental.

One view would be that copilot the app distributes GPL'd code, in a weird encoding. Training the model is a compilation step to that encoding

I assume the code is a derivative work of the training data because, given different data, the code would also be different (neuron weights).

If I read a GPL implementation of a linked list and then write my own linked list implementation, was my neural network in my brain a derivative work of the GPL code?

Sure it is, your brain is not software though

So as long as I read GPL code, then rewrite it from memory and feed it to copilot to train it I can unGPL anything?

Is it fair use to memorise whole source code byte-by-byte, storing it as e.g. some non-100%-lossless compression for subsequent retrieval of arbitrary-size snippets?

If copilot was trained using the entirety of the linux kernel, wouldn't the neural network itself need to be GPLed, if not its output?

> Even without a license or permission, fair use permits the mass scanning of books, the storage of the content of those books, and rendering verbatim snippets of those books.

For commercial use and derivative works?

Authors won't incorporate snippets of books into new works unless they're reviews. Copilot is different.

Google Books is a commercial site which incorporated the snippets of millions of copyrighted works. And of course, sitting in thousands of Google servers/databases are full copies of each of those books, photos of each page, the OCRed text of each page, and indexes to search them. Even that egregious copying without a license or permission was considered fair use.

If anything, the ways in which Copilot is different aid Microsoft/GitHub's argument for fair use. Because Copilot creates novel new works, that gives them a strong argument their system is more transformative than Google Books, which just presents verbatim copies of books.

The Google books example really misses the point, one of the reasons why the judges considered it fair use was because it was pointing back to the original sources (and thus potentially increasing publishers earnings).

Copilot does none of that. If all the ML companies are so sure this is fair use I encourage them to train an AI on Disney movies to generate short cartoon snippets based on some description. There sure would be a court case.

The main issue here is less doing it than getting sufficiently nice results. I've done work in generative AI before and right now the state of the art is passable on single images with some but not enough control and is still weak on videos without heavy structure requirements. I expect in 5-10 years we will have good enough models (or hardware) to do short video generation and the question will get tested then. I also think a meaningfully good video requires audio, and have fun making well-aligned text (for dialogue), audio of that text, and video frames. Aligning all that generation together is still challenging today.

> Authors won't incorporate snippets of books into new works

Of course they do, previous works are quoted all the time.

But that's another thing - co-pilot doesn't quote; it encourages something more akin to plagiarism, doesn't it?

Plagiarism, pretending you made a work entirely yourself when you didn't, is rarely a matter for a court to decide and the standards for what constitutes plagiarism can vary a lot. When I turn in projects for a course, I cite sources in the comments a lot, even if what I turn in is substantially modified. An employer generally doesn't care if you copied and pasted code from StackOverflow or wherever, so long as you don't expose them to a suit and you don't lie if asked "Did you write this 100% yourself?"

Citing your source is not a get-out-of-jail-free card for copyright infringement; it doesn't really matter.

> Citing your source is not a get-out-of-jail-free card

No, but it's a requirement of the license stackoverflow.com uses, which is unfortunate, for code (as opposed to text, where a quote can be easily attributed).

...with attribution.

And without. Attribution isn't a "copyright escape clause", copying a work without permission is still infringement - unless it's fair use.

Plagiarism is not the same as infringement.

Does intent not matter? Pasting code for explanatory reasons and citing the source seems different than silently incorporating it directly into a commercial work product.

Can you still apply Fair Use if they make Copilot a paid service?

> Fair use is an exception to copyright itself.

And copyright itself is an exception to the normal state of things: the public domain, copyright being only a temporary monopoly.

Assuming that Copilot's use of GPL'd code to provide snippets to a developer is fair use, what rights does the developer have to use that snippet?

Can you copy 10 lines of code from an open source project into your software? Yes you can, it's considered fair use. Nobody will ever sue for that. If they did, websites like StackOverflow, where developers post code probably taken from projects with some restrictive license and other developers copy it into their projects, would not exist.

Copilot will not write an entire software module, it will provide you with snippets. I see using GPL code for training as fair use. If a developer reads the source code of a project to take inspiration and possibly copy some small parts, does it violate the license?

When the recent Github v. youtube-dl fiasco happened, I remember reading similarly strongly-worded but dismissive comments regarding fair use, stating how it is quite obvious that youtube-dl's test code could never be fair use and how fair use itself is a vague, shaky, underspecified provision of the copyright law which cannot ever be relied on.

To me, seeing youtube-dl's case as fair use is so much easier than using hundreds of thousands of source code files without permission in order to build a proprietary product.

How would you feel about a paid-for search engine using hundreds of millions of web pages without permission in order to build a proprietary product?

There is a crucial difference though, the search engine links back to the content. If Google just displayed the content on their site verbatim, it would definitely not be considered fair use. Even as it is, several countries have restricted what Google can do when displaying e.g. news.

Somehow building a list of pointers to original content does simply not have the same ring to me as a product that rehashes all of the content. A rehashing of content sounds to me much more like, for example, publishing a sequel to my favourite book. After all, a sequel is just a rehashing of the same characters in new adventures. If we can't do that, why should Copilot be fine?

My point was however that I'm just utterly failing to see how the youtube-dl test thing could be more of a copyright problem than this entire thing based on millions of others' works that is Copilot.

You mean like a search engine?

This is a thoughtful and insightful reply. Thank you.

Just for reference, the hackernews source is public.

Not the current version? AFAIK there's some security-by-obscurity in the measures against spam, voting rings, etc.?

Books (mostly) are not distributed under the GPL.

True. But Pretty Good Privacy might be worth considering in this context - it was at one point published as a book after all...


The GPL only gives you additional permissions relative to what you would have by default. The books included in that suit were more strongly restricted, since there was no license at all.

There are certainly some interesting additional conditions the GPL creates by taking the license away if you violate certain clauses. Regardless, the interesting part of this is that this looks different from the user's point of view and Microsoft's. Sure, 5 lines out of 10,000 is probably fair use. For Microsoft, their system is using the whole code base and copying it a few lines at a time to different people, eventually adding up to potentially lots more than fair use.

The question on this one will be about the difference between Microsoft/Github's product and a programmer using copilot's code:

"If I feed the entire code base to a machine, and it copies small snippets to different people, do we add the copies up, or just look at the final product?"

Does the GPL forbid fair use? Why don't book publishers use a license that forbids fair use?

Because fair use is an exception to copyright itself. A copyright license can't take away your legal right to fair use.

> Why don't book publishers use a license that forbids fair use?

They couldn't do it with a license, which only imposes conditions for the license to be valid. Fair use applies even if the copier has no license at all.

Potentially they could do it with a contract. A license is not a contract and imposes no covenants on the parties.

I think the bigger issue is that use of Copilot puts the end user at risk of using copyrighted code without knowing it.

Sure one could argue that Copilot learned in the way a human does. There is nothing that prevents one from learning from copyrighted work, but snippets delivered verbatim from such works are surely a copyright violation.

More interestingly, if we can trick it into regurgitating a leaked copy of the windows source code, Microsoft apparently says that’s fair use.

This is pretty interesting for AI in general. Should you be able to train with material you don't own? Can your training benefit from material that has specific usage licenses attached to it? What about stuff like GameGAN?

> ...Should you be able to train with material you don't own?

If relating this to how humans learn, books and other sources are used to inform understanding and human knowledge. One can purchase or borrow a book without actually owning the copyright to it. Indeed, a given passage may be later quoted verbatim, provided it is accompanied with a reference to its source.

Otherwise, a verbatim use without attribution in authored context is considered plagiarism.

So, sure one can use a multitude of material for the training. Yet, once it gets to the use of the acquired "knowledge" - proper attribution is due for any "authentic enough" pieces.

What is authentic enough in this case is not easy to define, however.

"If relating this to how humans learn" seems like a big IF though right? Are we going to treat computer neural nets as human from a legal standpoint?

At some point Neural Nets like GameGAN might be good enough to duplicate (and optimize) a commercial game. Can you then release your version of the game? Do you just need to make a few tweaks? Are we going to get a double standard because commercial interests are opposed depending on the use case?

It would be pretty funny if Microsoft as a game publisher lobbies to prevent their IP being used w/ something like GameGAN, but then takes the opposing standpoint for something like their Copilot! Although I'm sure it'll be spun as "These things are completely different!".

This is the key question. In school I was taught to be careful to always cite even paraphrased works. If Copilot regurgitates copyrighted fragments without citation or informing acceptors of licenses involved then it's facilitating infringement.

> Are we going to treat computer neural nets as human from a legal standpoint?

Maybe we will some day, but for now this isn't the case, where the law is concerned :


Assuming that copilot is a violation of copyright on GPL works, it would also be a violation of non-GPL copyrighted works, including public, but fully copyrighted, works. Therefore relicensing others' source code under the GPL would violate even more copyright.

So in that case, of course copilot would have to give license info for every. single. snippet. Case solved. Only, that they will probably not do that.

Probably they get away with it, but it definitely seems against the spirit of the GPL just as closed source GitHub existing because of open source software seems quite hypocritical.

IANAL but as I understand it, ruling in the US is that machines can not produce "derived works" of copyrighted works. If it replicates (A)GPL code verbatim, it's up to the user to comply with its license.

Of course the interesting part is that the user not only has no idea what that license is but also where the code came from and if it is in fact copied verbatim. It's unlikely a court would agree that putting licensed code through a machine strips the licensing requirements of the code, of course, but that doesn't seem to be Microsoft's problem.

I think Microsoft's use of public code hosted on GitHub is covered by the terms of service but if this use includes granting a license more permissive than the license indicated on the code itself, this would probably put every GitHub user who ever committed less permissively licensed code to GitHub that they didn't control in violation of those licenses.

There are really only three ways this can go:

1) Machine learning does legally become a license-stripping black box, which would allow creating a machine generated commons by feeding arbitrary copyrighted works into sloppy AIs that mostly just replicate their input without changes.

2) Copyright law is extended to consider the output of machine learning as derived works from its inputs, massively extending the reach of copyright and creating massive headaches for everyone (e.g. depending on the exact ruling this would effectively make it impossible to reproduce a digital artwork as merely rendering it on a screen would create a derived work).

3) The original licenses are upheld and remain in effect, rendering the output of Copilot useless by creating a massive legal headache for anyone trying not to violate copyright.

I think outcome 2 is unlikely but 1 and 3 aren't mutually exclusive.

If Github hosts AGPL code, does that mean that github's own code must be AGPL? Obviously not. What's the difference?

There's no point to copilot without training data; some but not all of the training data was (A)GPL. There's no point to github without hosting code; some but not all of the code it hosts is (A)GPL.

The code in either case is data or content, it has not actually been incorporated into the copilot or github product.

> If Github hosts AGPL code, does that mean that github's own code must be AGPL? Obviously not. What's the difference?

GitHub's TOS include granting them a separate license (i.e., not the GPL) to reproduce the GPL code in limited ways that are necessary for providing the hosting service. This means commonsense things like displaying the source text on a webpage, copying the data between servers, and so on.

Code isn't to GitHub what training data is to this model, or at least even if you could argue that it is within a current framework it shouldn't be.

> we don’t distribute copies of humans

A bit of a tangent and it’s fictional, but I really have to recommend the tale of MMAcevedo. https://qntm.org/mmacevedo

This is a great argument.

copilot isn't distributing copies of itself either.

I am really confused by HN's response to copilot. It seems like before the twitter thread on it went viral, the only people who cared about programmers copying (verbatim!) short snippets of code like this would be lawyers and executives. Suddenly everyone is coming out of the woodwork as copyright maximalists?

I know HN loves a good "well actually" and Microsoft is always suspect, but let's leave the idea of code laundering to the Oracle lawyers. Let hackers continue to play and solve interesting problems.

Copilot should be inspiring people to figure out how to do better than it, not making hackers get up in arms trying to slap it down.

> I am really confused by HN's response to copilot.

If you're asking about the moral reaction here, I think it depends on how one views Copilot. Does Copilot create basically original code that just happens to include a few small snippets? Or does Copilot actually generate a large portion of lightly changed code when it's not spitting out verbatim copies of the code? I mean, if you tell Copilot, "make me a Qt compatible, crossplatform windowing library" and it spits out a slightly modified version of the Qt source code and if someone started distributing that with a very cheap commercial license, that would be a problem for the Qt company, which licenses their code commercial or GPL (and as Qt is a library, the Qt GPL forces users to also release their code under the GPL if they release it, so it's a big restriction). So in the worst case scenario, you can get something ethically dubious as well as legally dubious.

> Copilot should be inspiring people to figure out how to do better than it, not making hackers get up in arms trying to slap it down.

Why can't we do both? I mean, I am quite interested in AI and its progress and I also think it's important to note the way that AI "launders" a lot of things (launders bias, launders source code, etc). AI scanning of job applications has all sorts of unfortunate effects, etc. etc. But my critique of the applications doesn't make me uninterested in the theory, they're two different things.

A naive developer thinks that they are the source code they write (you're not), and their source code leaking to the world makes them worthless. (Which isn't true, but being that invalidated explains a lot of the fear. Which, welcome to the club, programmers. Automation's here for your job too.)

Still, some of the moral outrage here has to do with it coming from Github, and thus Microsoft. Software startup Kite has largely gone under the radar so far, but they launched this back in 2016. Github's late to the game. But look at the difference (and similarities) in responses to their product launch posts here.

https://news.ycombinator.com/item?id=11497111 and https://news.ycombinator.com/item?id=19018037

> A naive developer thinks that they are the source code they write (you're not), and their source code leaking to the world makes them worthless.

Maybe Github isn't violating the licenses of the programmers who host on them. Maybe Copilot doesn't just spit out code that belongs to other people. Those are matters of interpretation and debate.

But if Github was doing this with Copilot, virtually any open source programmer would have a reason to be upset. Open source programmers don't give their code out for free; they license it. This is a legal position, not a feeling. "Intellectual property" may be a pox on the world but asking open source developers to abandon their licenses to ... closed source developers, is legitimately a violation.

And before the spitting out source code problem appeared, I recall quite a few positive responses to Copilot. Lots of people still seem excited. And yeah, people are looking at the downside given Microsoft's long abusive history but hey, MS did those things.

You've answered your own question. They went under the radar and nobody cared about them. They're not the multibillion company that sued Mike Rowe and keeps ReactOS developers awake at night.

Try doing any type of deal (fundraising, M&A) where you can't point to the provenance of your application's code. This isn't good for programmers, programmers WANT clean and knowable copyrights. This is good for lawyers, who'll now have another way to extract thousands of $$ from companies to launder their code.

If you do get sued, the Copilot page is written in a way that would make Github legally responsible for it, not you. "Just like with a compiler, the output of your use of GitHub Copilot belongs to you."

Yeah, right... This isn't going to fly in court any more than if the Pirate Bay page was written in a way that says that it's solely responsible for what you do with the magnet links that they share.

The pirate bay is very clear to not claim any responsibility for what people post on their site. That's how they get away with it.

I know, it's a hypothetical.

On many ML posts, you get arguments about IP, and there's a long history of IP wars on this forum, especially when licensing comes up. Then you add the popular Big Tech Is Evil arguments you see. I think it's a variety of factors coming together for people to be upset about someone else profiting from their own work in ways they didn't mean to allow.

I expect that we'll need new copyright law to protect creators from this kind of thing (specifically, to give creators an option to make their work public without allowing arbitrary ML to be trained on it). Otherwise the formula for ML based fair use is "$$$ + my things = your things" which is always a recipe for tension.

I think the real issue is less about the "copying short snippets", and more about how it was done, i.e. zero transparency, default opt in without any regard to licensing (with no way to opt out??) and last but not least - planning to charge money for it.

I've always cared but never talked about it. Someone copy and pasting code from a source that is clearly forbidden (free software, reverse engineered code, leaked source code, etc) isn't an interesting thing to talk about. It's obviously wrong.

Also people rarely do it; I've caught maybe a couple instances of it in my career and I never really thought too much about them again. This tool helps make it a lot easier and more common. I have a feeling other people chiming in are also in the camp of "Oh, this is going to be a thing now, huh?"

I also can't help but think that my negative opinion of it isn't solely based on this provenance issue. While it's cool, it seems questionable how practical it is. If the value was more clear I think I could stomach the risk a bit better.

Firstly it's important to remember that HN is not a single person with a single opinion, but many people with conflicting opinions. Personally I'm just interested in the copyright discussion for the sake of it because I find it interesting. Though, I imagine there's also an amount of feelings of unfairness.

As a mature, skilled engineer, you wouldn’t mind sharing your knowledge—but you’d really prefer to do this on your own terms.

First, you might choose to distribute your code under a copyleft license to advance the OSS ecosystem. Second, the older you get, the more experience you accumulate, paradoxically the harder it is for you to find a job or advance your career in this industry—so, to maintain at least some source of motivation for tech companies to hire you, you may choose to make some of the source available, but reserve all the rights to it.

You’re fine making the source of your tool or library open for anyone to pass through the lens of their own consciousness and learn from it, but not to use as is for their own benefit.

Now with GitHub Copilot suddenly you see the results of your labour you’ve previously made (under the above assumptions) public being passed through some black box, magically stripped of your license’s protections, and used to provide ready-made solutions to everyone from kids cheating at college tests to well-paid senior engineers simply lacking your expertise.

I hope it’s easy to spot how the engineer’s interests in the above example are not necessarily aligned with GitHub’s, how this may be perceived as an unfair move disadvantaging veteran rank-and-file software engineers while benefitting corporate elites and investors, and subsequently has the potential to disincentivize source code sharing and deal a blow to the OSS ecosystem as a whole.

Perhaps people on HN start sensing that successors of Github Copilot will take their programming job. Rightly so.

Personally, I think that in the age of AI programming any notions of code licensing should be abolished. There is no copyright for genes in nature or memes in culture; similarly, there shouldn't be copyright for code.

> Perhaps people on HN start sensing that successors of Github Copilot will take their programming job. Rightly so.

I still think we're a long way from that. Copilot will help write code quicker, but it's not doing anything you couldn't do with a Google search and copy/paste. Once developers move beyond the jr. level, writing code tends to become the least of their worries.

Writing the code is easy, understanding how that code will affect the rest of the system is hard.

Based on the responses I've seen, people have it in their heads that Copilot is a system where you describe what kind of software you want and it finds it on Github and slaps your own license on it.

It's just a smarter tab-completion.

Depends on your definition of "a long way". Some of the GPT3 based code generation demos (which, explicitly, are just that - demos - we aren't shown the limitations of the system during the demo) say that's closer than I think.

https://analyticsindiamag.com/open-ai-gpt-3-code-generator-a... has a bunch of videos of this in action.

That's because the training set had that specific demo, not because copilot imagined up a demo.

> Perhaps people on HN start sensing that successors of Github Copilot will take their programming job. Rightly so.

I feel like this comment misunderstands what a software developer is doing. Copilot isn't going to understand the underlying problem to be solved. It's not going to know about the specific domain and what makes sense and what doesn't.

We're not going to see developers replaced in our lifetime. For that you need actual intelligence - which is very different from the monkey see monkey do AI of today.

The thing is that understanding the domain and thinking out a fairly efficient or elegant solution is something a lot of industry specialists and scientists can do, and only part of programming. Another part is dealing with all the language syntax and specialist lego bits/glue code, and that's something domain specialists tend to be less good at and not enjoy spending time on; it's its own craft.

Having a semi-intelligent monkey that can fetch obvious things off the shelf, build very basic control structures, and do the boring little housekeeping tasks is bad for the craft of programming but very good for the good-enough-solution situation. I can see it having the same impact as cheap and widely available digital cameras; anyone can be a kinda decent photographer now, but if you want to be a professional you're probably going to have to work a lot harder to stand out, whether that's by development of craft, development of narrow technical expertise and fancy equipment, or development of excellent business skills.

The funny thing with "good enough" solutions is that at some point it becomes unmanageable. I've basically spent a good part of my career cleaning up these solutions to make way for scalable, maintainable solutions that don't introduce security holes.

Photography is a good analogy - with everyone having fancy cameras you could think that a photographer is now not necessary. But yes there are still photographers about - they see things that the average person doesn't. The camera doesn't tell them what type of photos to take, what composition the photo should have or what poses a model should have.

You have excellently described the job of business analysts and system architects, but this is not the job of 90% of programmers today, including senior-level. Part of this is already done by other people and doesn't require specific programming skills, hence, at the very least, programmers will lose their privileged position. Another part of it is actually too hard for most people who are currently employed as programmers to do on a decent level (such as meaningfully hacking on the Linux kernel).

Memes are absolutely copyrightable, heard of Grumpy Cat?

New genetic sequences are patentable, not copyrightable, but that's because of the process involved in creating new genetic sequences more than the genes themselves.

Sure naturally occurring genes aren't patentable, but it's not like we have code growing on trees. So that's a terrible comparison.

The problem with Copilot is that so far it doesn't seem to be much of an AI and more of a copy-bot. If you are just copying code, you quickly run into copyright issues with your sources. A true AI based on training on open source software would be something different.

Patents on genes actually are a thing. So that example is pretty false. Whether they should be a thing is a separate question, but right now discovery of a gene and its usefulness can be patented and is done for medical patents.

People aren't happy because Microsoft is exploiting open source. They're training it on open source code and keeping the service for themselves.

If they made the trained model public (and also trained it on private code) the response would be completely different.

>There is no copyright for genes in nature

Since when are humans not a part of nature?

You don't have to be a copyright maximalist to worry about a company taking snippets of code that used to be under an open license and using them in a closed-source app.

In addition, this is extremely hard to enforce. I think the amount of code running in closed systems that does not exactly respect the original license is shocking. What was the last case you know where this was a "scandal"?

It only happens at boss level when tech giants litigate IP issues.

I don't know about HN in general but my impression has been that anyone copying random code off the internet or adding dependencies without understanding the license (e.g. just blindly adding AGPL code) would be very much frowned upon in any remotely professional setting because a basic understanding of copyright and open source licensing is expected of even junior developers.

"Hackers" "playing" and ignoring copyright is fine, but Copilot isn't promoted as a toy, it's promoted as a tool for professional software development. And in that framing it is about as dangerous as an untrained intern with access to the production server.

I'm more surprised that people don't care about the telemetry aspect. It's an extension that sends your code to an MS service, and MS promises access is on a need-to-know basis.

I don't care if MS copies my hobby projects exactly, but I'm not sure my employer (defense contractor) would even be allowed to use a tool like this.

I think it looks cool though. I will probably try it out if it is ever available for free and works for the languages I use.

It's quite possible to do this on-prem and even on-device. TabNine, a very similar system with a smaller model (based on GPT-2 rather than 3), has existed for years and works on-device.

The difference between copilot and copy pasting from stackoverflow is consent

It's a pretty standard "big company releases new thing" reaction. HN is usually negative on everything.

Is it really confusing? It's a rich company using the fruits of our labor, provided free TO OTHER DEVELOPERS. I have never okayed "use my code to train AIs that nobody else could". It's backhanded and unfair.

Programmers love to pretend that they're lawyers, especially when it comes to copyright law. Something about the law really appeals to hackers!

If very powerful companies are appropriating and reproducing code in contravention of copyright then that is something that should be called out.

> Copilot should be inspiring people to figure out how to do better than it, not making hackers get up in arms trying to slap it down.

One of the (many) problems is that GitHub/Microsoft already benefit from runaway network effects so it’s difficult to “do better”. Where will you get all of that training code if not off GitHub?

The real answer to this is to yank your projects from GitHub now while you search for alternatives.

Even if you do that, what's to stop them from using open source software from all over the web and not just what's on GitHub? The only way to stop them then is to go closed source.

I mean stop them at a larger level by threatening their success as an organization. If developers stop publishing to GitHub they have bigger problems than training ML models.

Whether or not this move is “legal”, it should serve as a wake up call that GH is not actually a service we should be empowering. This incident is just one example of why that’s a bad idea.

They make you give up some of your monopoly rights when you put stuff on Github (some parts of those ToS might or might not be legal).

You would have a much stronger case if they had taken your code from elsewhere.

Copy-left licenses are generally liked by developers; this flies very directly against that, since it suggests circumvention of those types of licenses.

If Copilot were open source I wouldn't have an issue with it. However, it is closed source and a later version is intended to be sold.

Copyright defends us from some of the abuse by large corporations in the form of the GPL.

Want Linux to run on your thing? You must publish driver source then or you're violating copyright law. This was less a big deal before device vendors ratcheted the pathological behavior up to 11 with smartphones and that's why far more people seem to react far more strongly now.

Copilot violates the assumptions many people made when they open sourced their code. Moving from manual to automated use feels like a privacy violation because it dramatically changes the amount of effort it takes to leverage the work in an unintended context.

This isn't true at all. There are stories concerning code stealing that regularly lead the front page on HN and rouse a pretty intense reaction from the community. Saying that HNers have never before cared about this issue seems pretty inaccurate or disingenuous.

It is a large corporation eroding the integrity of open source licenses. It is perfectly reasonable to be pissed off about this.

idk, I don't quite enjoy the idea of having my code stolen without any respect for its licence or even attribution

but then again I migrated away from github as soon as MS bought it

still, it's a matter of principle

Hacker News hates everything, especially if it seems to work. Don't read into it.

"Please don't sneer, including at the rest of the community."

