
So this makes it official... this post[0] and the comments on the announcement[1] concerned about licensing issues were absolutely correct... and this product has the possibility of getting you sued if you use it.

Unfortunately for GitHub, there's no turning back the clocks. Even if they fix this, everyone that uses it has been put on notice that it copies code verbatim and enables copyright infringement.

Worse, there's no way to know if the segment it's writing for you is copyrighted... and no way for you to comply with license requirements.

Nice proof of concept... but who's going to touch this product now? It's a legal ticking time bomb.

0. https://news.ycombinator.com/item?id=27687450

1. https://news.ycombinator.com/item?id=27676266




Adding to this:

I run product security for a large enterprise, and I've already gotten the ball rolling on prohibiting copilot for all the reasons above.

It's too big a risk. I'd be shocked if GitHub could remedy the negative impressions minted in the last day or so. Even with other compensating controls around open source management, this flies right under the radar with a C-130's worth of adverse consequences.


Do you also block Stack Overflow and give guidance to never copy code from that website or elsewhere on the Internet? I'm legitimately curious - my org internally officially denounces the copying of Stack Overflow snippets. Thankfully for my role it's moot as I mostly work with an internal non-public language, for better or worse, and I have no idea how well that's followed elsewhere in the wider company.


Apples and oranges: Stack overflow snippets are explicitly granted under a permissive license, as long as you attribute.

https://stackoverflow.com/help/licensing

It appears that the code that copilot is using is created under a huge variety of licenses, making it risky.

On the other hand, a small snippet in a function that is derived from many existing pieces of other code may fall under fair use, even if it is not under an open source license of some sort.


Stack Overflow and Copilot are similar. Usage of both routinely violates licenses. Stack Overflow content is licensed under CC-BY-SA. Terms [1]:

* Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

* ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

In over a decade of software engineering, I've seen many reuses of Stack Overflow content, occasionally with links to underlying answers. All Stack Overflow content use I've seen would clearly fail the legal terms set out by the license.

I suspect Copilot usage will similarly fail a stringent interpretation of underlying licenses, and will similarly face essentially no enforcement.

[1] https://creativecommons.org/licenses/by-sa/4.0/


The difference here is that it's hard to sue a company for sporadic, difficult to track down usages of SO content written by their own engineers.

One can now trivially coerce copilot to regurgitate copyrighted content without attribution. Copilot's basic premise violates the CC-BY-SA terms, and this will continue until no party can demonstrate a viable method of extracting copyrighted code.

There is now a single party backed by a company with a 2 trillion dollar market cap that can be sued for flagrant copyright violations.


Surely you would have to sue the people using the tool to produce verbatim copies of code, not the creator of the tool?


I would think it's more complicated when the tool is the thing spitting out the verbatim copies of code. Both the tool and the developer are independently distributing copyrighted code that neither of them have the rights to distribute.


Why? One could easily claim that if the tool is reproducing the contents of copyrighted works, its makers are a "distributor", subjecting the makers of the tool/distributor to much higher copyright infringement claims.


Let's differentiate legal risk by the party it affects:

* Companies with engineers using Copilot. Risk here is negligible, like that of copying Stack Overflow answers, or any code that isn't under a truly permissive license like CC0 [1]. Prohibiting use of Copilot in a company based on this risk has no merit.

* GitHub and Microsoft. Risk for them is higher yet worthwhile. Copilot is more like Stack Overflow than Napster. Affected copyright holders added their works to GitHub and agreed to their terms, so GitHub has a legal basis to show that content in Copilot. In terms of facilitating copyright infringement, far more violations occur by engineers manually searching and copying code on GitHub; lawsuits against GitHub due to that would be dismissed. Determining provenance is slightly harder in Copilot than in search, but GitHub could minimize risk to itself by noting in Copilot terms that users must review Copilot's suggestions for underlying license concerns. Engineers rarely will -- they routinely violate licenses of Stack Overflow and code copied from elsewhere -- but that shifts responsibility from GitHub, and legal risk to companies using Copilot remains negligible.

[1] https://creativecommons.org/share-your-work/public-domain/cc...


In addition to other licensing gotchas, a ton of SO snippets are copied wholesale from elsewhere—docs or blog posts. So it's pretty likely that the poster can't license them in the first place because they never checked the source's license requirements.


It just seems bizarre that this wasn’t flagged internally at Microsoft. They have tons of compliance staff.


Maybe we’ll even get a sneak peek at Windows 11’s source code. Time to start writing a Win32 API wrapper and see what the robot comes up with!


That's because Microsoft doesn't dare use this for production code (presumably).

They are 100% okay with letting their competitors get into legal hot water.


isn't it copilot's liability for "distributing" the copyrighted code?


It’s surely a bit of a liability grey area?


Could bet they baked in the legal fees and are taking a calculated risk


Except that CC-BY-SA is not a permissive license; the SA part is a form of copyleft. It's just that nobody enforces it. From the text [1]:

- "[I]f You Share Adapted Material You produce [..] The Adapter’s License You apply must be a Creative Commons license with the same License Elements, this version or later, or a BY-SA Compatible License."

- "Adapted Material means material [..] that is derived from or based upon the Licensed Material" (emphasis added)

- "Adapter's License means the license You apply to Your Copyright and Similar Rights in Your contributions to Adapted Material in accordance with the terms and conditions of this Public License."

- "You may not offer or impose any additional or different terms or conditions on, or apply any Effective Technological Measures to, Adapted Material that restrict exercise of the rights granted under the Adapter's License You apply."

A program that includes a code snippet is unquestionably a derived work in most cases. That means that if you include a Stack Overflow code snippet in your program, and fair use does not apply, then you have to license the entire program under the CC-BY-SA. Alternately, you can license it under the GPLv3, because the license has a specific exemption allowing you to relicense under the GPLv3.

For open source software under permissive licenses, it may actually be okay to consider the entire program as licensed under the CC-BY-SA, since permissive licenses are typically interpreted as allowing derived works to be licensed under different licenses; that's how GPL compatibility works. But you'd have to be careful you don't distribute the software in a way that applies any Effective Technological Measures, aka DRM. Such as via app stores, which often include DRM with no way for the app author to turn it off. (It may actually be better to relicense to the GPL, which 'only' prohibits adding additional terms and conditions, not the mere use of DRM. But people have claimed that the GPL also forbids app store distribution because the app store's terms and conditions count as additional restrictions.)

For proprietary software where you do typically want to impose "different terms or conditions", this is a dead end.

Note that copying extremely short snippets, or snippets which are essentially the only way to accomplish a task, may be considered fair use. But be careful; in Oracle v. Google, Google's accidental copying of 9 lines of utterly trivial code [2] was found to be neither fair use nor "de minimis", and thus infringing.

Going back to Stack Overflow, these kinds of surprising results are why Creative Commons itself does not recommend using its licenses for code. But Stack Overflow does so anyway. Good thing nobody ever enforces the license!

See also: https://opensource.stackexchange.com/questions/6777/can-i-us...

[1] https://creativecommons.org/licenses/by-sa/4.0/legalcode

[2] https://majadhondt.wordpress.com/2012/05/16/googles-9-lines/


Yes. In a past life, after researching the situation, we had to find and remove all the code copied from Stack Overflow into our codebase. I can’t fathom why SO won’t fix the license.

What makes it even worse is if you try to do the right thing by crediting SO (the BY part) you’re putting a red flag in the code that you should have known you have to share your code (the SA part).


> I can’t fathom why SO won’t fix the license.

They tried to relicense code snippets to MIT a while back, it was a big mess.


Who really copies stack overflow snippets verbatim? It's usually just easier to refer to it for help figuring out the right structure and then adapt it for your own needs. Usually it needs customization for your own application anyway (variables, class instances, etc).


I don't think I've ever copied code directly from any of the Stack* sites. I generally read all the answers (and comments) and then use what I learn to write my own (hopefully better) code specific to my needs.


Yeah my experience has always been "ohhh that solution makes sense" then I go write it myself

If nothing else this whole copilot thing is helping ease some chronic imposter syndrome


Ha! Well, I think a lot of people copy code from StackOverflow verbatim once at least - including me.

Of course it turned out the code I'd blindly inserted into my project contained a number of bugs. In one or two cases, quite serious ones. This, even though it was the accepted answer.

It was probably more effort to fix up the code I'd copy pasta'd than write it from scratch. Since then I've never copied and pasted from StackOverflow verbatim.


Yeah! I've uh, ... never copied a bit of code into my repo verbatim, right?

yeah right. I wish.

(Not saying every dev does this)


I've copied plenty of Microsoft sample code verbatim, because the Win32 API sucks and their samples usually get the error handling right.

But, I can't think of a single scenario where I've copied something from Stack Overflow. I'm searching for the idea of how to solve a problem, and typically the relevant code given is either too short to bother copying, or it's long and absolutely not consistent with how I want to write it.


"Too short to bother copying"? I copy single words of text to avoid typing and typos. I would never type out even a single line of code when I could paste and edit.


> "Too short to bother copying"? I copy single words of text to avoid typing and typos. I would never type out even a single line of code when I could paste and edit.

Very honest suggestion: learn how to touch type. You can still copy if needed, but your typed input will be much faster.


I'm somewhere between 45-75 wpm. But Ctrl+C Ctrl+V can type 300wpm!

Typing when you could paste is like having GitHub Copilot put the right sentence right in front of you and then deciding to type over it instead. Not only does it feel like wasted and robotic effort, typing everything leads to RSI.

I'm not sure why people disagree. Another symptom is that I insist on aliases for everything while others type out all the commands every time. Maybe I get distracted by the words when I type and lose my train of thought?


You need to highlight the correct text first and move the cursor to the correct location to paste text. I bet you can type 123123 several times faster than you can highlight that text in this comment and paste it into a reply.


Double click to select a word is fast, and then you are in per word selection mode.


Sure, move mouse to text, double click, ctrl-c, ctrl-v it’s still slower than touch typing one word.


That's fair use.


Same here. I copy boilerplate code for new projects etc. regularly. But I don't remember copying anything verbatim from SO. Function, argument and variable names rarely fit the scheme used in the particular project I'm working on at that moment and usually I do a better job at adapting the code thinking what I'm doing rather than just copy and paste and then wonder what went wrong.


I think I did a few times, usually for languages that I wasn't going to spend too much time with (so no benefit in figuring out how to do it from the answers) and for specific tasks.


Anything posted to Stack Overflow has a specific (Creative Commons IIRC) license associated with it. The same is not true of GitHub Copilot, and in fact their FAQ doesn’t specify a license at all, probably because they are technically unable to since it is trained on a wide variety of code from differing licenses (and code not written by a human is currently a grey area for copyright). The FAQ simply says to use it at your own risk.


Google (and most other big tech companies, I guess?) also explicitly prohibits employees from using Stack Overflow code snippets.


I tried Googling this and couldn't find it. I also don't want to believe it because it seems like the world suddenly turned into an apocalyptic hellscape with no place for developers like me. Do you have a source?


I work at Google, and its onboarding training explicitly mentions Stack Overflow as a forbidden example due to its CC-BY-SA license (SA is the problematic part). The following link is the official reference.

https://opensource.google/docs/thirdparty/licenses/#restrict...


I work at Google.

SO definitely comes up during copyright/IP training.

The basic idea is 'reading SO answers to learn how to solve a problem is fine, copying/transcribing the code is not'.

Google is quite paranoid re. copyright and licenses.


I don't have a source to link, but I've also been told this by someone who works at Google. Is copy-pasting stuff verbatim from SO really that much of a thing? I use SO plenty, but have never considered taking anything verbatim.


That's actually an attack vector: mirror SO using their open-sourced DB and inject malware into the suggestions, or change the text before it enters the clipboard. People blindly copy/pasting aren't going to notice.


Same here. I’ve directed our teams and infra managers that we must be able to block the use of copilot for our firm’s code.

I'd be very surprised if the other large enterprises that I have worked at aren't doing exactly the same thing. Too much legal risk, for practically no benefit.


No-one cares about this. People have no clue about licenses and just copy-paste whatever. If someone gets access to their code and see all the violations they're screwed anyway.


Ask your legal department about that. Sure, engineers don't care about licensing at all, but we are not the only players here.


Are legal departments in the habit of reviewing all code line by line? Seems like that would be cost prohibitive...


Obviously they aren't, but just as obviously, "the legal department didn't review this, therefore it's safe to assume it's legal" would not pass muster with said legal department. :) Kiro's comment ("if someone gets access to their code and sees all the violations they're screwed anyway") is probably technically accurate, even if in practice you're unlikely to get caught. As other people have noted elsewhere in the comments here, the Google v. Oracle case over Java definitely suggests that verbatim copying of just a few lines, even for trivial functions, is enough to get you in trouble if those lines aren't licensed in a way that lets you do that.


No, legal comes up to you and says stuff like:

"You guys aren't using any free software are you? Because you can't do that."

"You mean copying software source code without respecting the license, right? Because we absolutely respect all licenses fully."

"No I mean you can't use Free software! It's a clear management directive! What are you doing?!?"

"Is that an Apple laptop you're using there? Ever had a look at the software licenses for it?"

Legal are generally idiotically ignorant about the real issues. Whose fault that is we can argue about.


This is absolutely not true. While some individuals might not care and might not always conform to their companies' policies, most companies have policies, and most employees are aware of and mindful of these policies.

It's absolutely the case that before using certain libraries, most engineers in large corporations will make sure they are allowed to use that library. And if they don't, they are doing their job very badly IMO.


This kind of sucks honestly; copying and pasting without understanding has led to all sorts of issues in IT. Not to mention legal issues, as mentioned by another reply.


Not only this, but a huge amount of publicly available code is truly terrible and should never really be used other than as a point of reference or guidance.


I think that proper coding assistant should help with not writing code (and I stress that it is "not writing code") - how to rearrange your code base for new requirements, for example.

Code not written does not have defects, does not need support and, as you point it out, is not a liability.


Seems like the liability should also be on Copilot itself, as a derivative work.


The practical utility will outweigh the legal concerns. Engineers using this are going to be more productive and this is a competitive advantage that companies won't eschew.


If the legal concerns are well-known, then what you are describing might be viewed as criminal negligence (at worst) or insufficient duty of care (at best). Such engineers should be held fully responsible and accountable for their actions.


It seems like the risk is somewhat exaggerated because even when people get bad autocomplete results, they mostly won’t use them.


That's optimistic. The people who would rely heavily on this sort of thing are going to be the worst at detecting what a "bad autocomplete result" would look like. But even if you are capable of judging that you've got a good one, it still doesn't inform you of the obvious potential licensing issues with any bit of code.

Surely somebody working on this project foresaw this problem…


If they get rid of licensed stuff it should be ok no? I really want to use this and seems inevitable that we'll need it just as google translate needs all of the books + sites + comments it can get a hold of.


Well... the whole training set is licensed, so you can't really get rid of it. I think that the technology they are using for this is just not ready.


Just retrain the model using properly licensed code? ("just" is doing a ton of heavy lifting, but let's be real, that's not impossibly hard)


There's not many licenses that let you reuse code without including the same headers / licensing blurb. You're in public domain, non-copyleft territory. WTFPL etc.


There is no such thing as properly licensed code, because it is a function of what is legally acceptable for your company and what it intends to do with the work.


Unlicensed code just means “all rights reserved.” You’d need to limit it to permissively licensed code and make sure you comply with their requirements.


That would be a long list of restated license terms and attributions.


Which licenses would it be ok that the training material is licensed under, though? If it produces verbatim enough copies of eg. MIT licensed material, then attribution is required. Similar with many other open source-friendly licenses.

On the other hand, if only permissive licenses that also don't require attribution are used, well, then for a start, the available corpus is much smaller.


The overwhelming majority of code on GitHub, even code under permissive licenses, requires attribution of the original authors.


How would they do that?


Read the LICENSE file in each repo.


What guarantees it’s intact?


It doesn't need to be. If the license isn't positively exactly permissive then you can't use it.
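
To make the "positively permissive or you can't use it" rule concrete, here is a minimal sketch of a default-deny filter. It is only an illustration under stated assumptions, not anything GitHub has described: the function names, the sample repository names, and the choice of allowlisted SPDX identifiers are all hypothetical, and the filter trusts whatever license a repo declares, so it does nothing about a repo that carries the wrong license.

```python
# Hypothetical default-deny license filter: a repository is kept only if its
# declared license is positively on the allowlist. Missing or unrecognized
# licenses are rejected rather than guessed at.

# Assumption: SPDX identifiers for licenses that do not require attribution.
# Which licenses belong here is a policy decision, not legal advice.
ALLOWED_LICENSES = {"CC0-1.0", "Unlicense", "0BSD", "WTFPL", "MIT-0"}


def is_usable(declared_license):
    """Return True only when the declared license is on the allowlist."""
    if declared_license is None:
        # No license at all means "all rights reserved".
        return False
    return declared_license.strip() in ALLOWED_LICENSES


def filter_repos(repos):
    """repos maps a (hypothetical) repo name to its declared license or None."""
    return [name for name, lic in repos.items() if is_usable(lic)]


if __name__ == "__main__":
    sample = {
        "example/public-domain-utils": "CC0-1.0",
        "example/mit-library": "MIT",          # requires attribution: rejected
        "example/no-license-file": None,       # all rights reserved: rejected
        "example/copyleft-tool": "GPL-3.0-only",
    }
    print(filter_repos(sample))
```

Even under these assumptions the surviving corpus would be small, since attribution-requiring licenses such as MIT are excluded.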


Can you even trust that the License in a random repo is accurate and expresses the actual copyright of all the contained code?

I guess my point is, you can't be positive that even if you're following the license in a repo you forked that the repo owner hasn't already violated someone else's license, and now transitively, so have you.


> Can you even trust that the License in a random repo is accurate and expresses the actual copyright of all the contained code?

In fact, that seems to be exactly the problem shown in the tweet - someone copy-pasted the quake source and slapped a different license on it, and copilot blindly trusted the new license.


Is it still a legal concern if I'm just coding because I want to solve a problem and I'm not trying to use it to do business?


Yes: not all code on GitHub is licensed in a way that lets you use it at all. People focus on GPL as if that were the tough case, but there is also code (like mine) under AGPL, which you need to not use in a product that exposes similar functionality to end users; code that is merely published under "shared source" licenses (so you can look, but not touch); and even code that was literally stolen and leaked from the internals of companies, including Microsoft! That leaked code often gets taken down later, but it isn't always noticed, and either way it is now part of Copilot :/ and, if you use this mechanism, could end up in your codebase.


If you publish the code anywhere, potentially. You could be (unknowingly) violating the original license if the code was copied verbatim from another source.

How much of a concern this is depends heavily on what the original source was.


And the problem with copilot is that you have no way of knowing. If it changes even a little bit of the code, it's basically ungoogleable but still potentially in violation.


Distributing binaries to third parties is enough to trigger a license violation. For internal corporate tools, it would be less of an issue as "distribution" hasn't happened.



