GitHub’s AI Copilot Might Get You Sued If You Use It (medium.com/geekculture)
77 points by bluish29 on July 9, 2021 | 86 comments



I keep seeing developers weigh in on their thoughts about copyright/legal issues but no IP legal experts. Why are people taking any of this without reviewing it with, ya know, lawyers? (for the record, I loathe lawyers like the rest of you, but they have a purpose here)

Just like when congressmen try to talk about tech and developers bemoan "you have no idea what you're talking about, stop trying", can developers just take a back seat before becoming a bunch of keyboard warriors here? Telling people they "might get sued" is boring and unhelpful. If you think there is a legal implication - then consult a lawyer. Just like a lawyer would consult their web developer if CSS was broken on their website.

That being said - I think it's entirely okay for developers to say "I did not explicitly choose to let Copilot use my data as a training set and I'm taking my code off GitHub until that is done".

PS - I want grellas back :(



That seems to entirely ignore that Copilot has 'generated' GPL-protected code verbatim, even down to the comments. There appear to be exact copies of certain code segments within their model, and those are being copied out exactly. Doesn't that fall under copyright protection?


She does say she doesn’t think most snippets meet the originality criteria.


You are right:

> Otherwise, copyright conflicts would constantly arise when two authors use the same trivial statement independently of each other, such as “Bucks beats Hawks and advance to the NBA finals”, or “i = i+1”. The short code snippets that Copilot reproduces from training data are unlikely to reach the threshold of originality. Precisely because copyright only protects original excerpts, press publishers in the EU have successfully lobbied for their own ancillary copyright that does not require originality as a precondition for protection.


Nobody is complaining about `i = i + 1`, it's examples like this [1], an entire function, including comments written by the original author, protected under GPL:

[1] https://twitter.com/mitsuhiko/status/1410886329924194309


There is still the question of whether such a small part of a bigger work is enough to be protected, but regardless, I think the article was written before that tweet and does not assume that 1:1 copies of longer code parts happen. As I already wrote to you, there is a whole faction arguing that "even using GPL code for learning is wrong", and to my reading Reda's article mainly addresses that. It also assumes that the verbatim copies will be very small.


Where I live, it does fall under it, if the code is sufficiently complex (i.e. there is evident originality) to be protected under copyright in the first place.


This only covers the part about the training data and derivative work. At what point does exactly cloning code become not a derivative work?


Obviously it is not derivative work anymore when huge international corporations let their "AI" copy the code verbatim. /s

Emperor's new clothes and all that.


Real-world rights infringement works completely differently from programmers' understanding, and is dictated by monetary gains: rights holders sue manufacturers of million-dollar products to get a cut or secure a deal. That is what it is for.

Programmers think that a right is granted to novel concepts themselves to preserve a technology family tree, independent of cash flows or organizations or stakeholders' power dynamics. That is not what it is.

I tend to sympathize with programmer thinking though! Code has to be law and law has to be code.


> Code has to be law and law has to be code.

No, that's DRM and it's far from all good.


I completely agree. The copyright issue doesn't seem to be simple at all, at least from the perspective of USA law. This has been posted on another thread and completely destroyed my hope of understanding the consequences:

https://ilr.law.uiowa.edu/print/volume-101-issue-2/copyright...


Exactly my sentiment.

It’s particularly ridiculous to see devs dishing out bold statements of fact rather than open-ended questions or more softly worded opinions (“I believe that…”).

Arguably, even real domain experts should have the latter attitude, and so should the dev community, in which many are educated in the scientific method.

Clearly some people are feeling threatened and are embarrassing themselves as a result.


Yeah, the problem is that lawyers rarely come to give free advice on a public forum. So we are left to speculate...


Am I the only person who feels that it's copyright that's the issue rather than machine learning training sets?

Consider a new additional feature added on to Copilot - a language aware rewriting tool that transforms the initial generated code into a new form with equivalent functionality.

It would be nearly impossible to trace the original code or make a copyright claim.

However - you could use this same trick directly on copyrighted code. Now things are even murkier...

But I would argue that this is essentially what our brains are doing. I've read code, got the gist of it and written my own version. Technically it's not a clean-room reimplementation but an average coder wouldn't realistically expect to get sued for copyright for doing this.

Maybe they should but if you're an open source advocate and you've reached this position then there's something very weird going on.

I always thought the idea of open source was to use copyright against itself because we believed in openness. Not embracing it and just throwing out one small aspect of it.
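
To make the rewriting-tool idea from this comment concrete, here is a minimal sketch, not anything Copilot is known to do. It uses Python's standard `ast` module to rename identifiers so the surface text changes while the behavior stays the same. The `RenameIdentifiers` class, the `rewrite` helper, and the sample `scale` function are all hypothetical names made up for illustration.

```python
import ast


class RenameIdentifiers(ast.NodeTransformer):
    """Rename variables and parameters according to a mapping (hypothetical helper)."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        # Variable reads and writes, including comprehension targets.
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node):
        # Function parameters.
        node.arg = self.mapping.get(node.arg, node.arg)
        return node


def rewrite(source, mapping):
    """Parse the source, rename identifiers, and emit new source text."""
    tree = RenameIdentifiers(mapping).visit(ast.parse(source))
    return ast.unparse(tree)  # ast.unparse requires Python 3.9+


# Made-up sample input standing in for "the initial generated code".
original = "def scale(values, factor):\n    return [v * factor for v in values]\n"
rewritten = rewrite(original, {"values": "xs", "factor": "k", "v": "x"})
print(rewritten)
```

A real tool would need to go much further (reordering statements, swapping idioms), but even this trivial transform shows how quickly textual traceability fades while the function keeps returning the same results.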


There is an equivalent in academia to what you describe: "paraphrasing", where you read the source, grasp it, and then rephrase it in your own words. You still have to cite the ideas, though, and this is actually essential because you can't keep quoting in your paper/thesis. But can this solve the problem with code licenses? I am not sure.


>Still though you have to cite the ideas

That's not because of copyright though, that is due to academic standards. Legally you can paraphrase without citing a source.

And from a legal perspective, an entire book of nothing but unsourced quotes from many works would probably end up being both transformative and using a minimal percent of each source.

That said, an argument for citing: if Copilot was using public code and not keeping a record of its sources, it might put someone in a spot where they don't know what risk they are exposed to. A citation log would both let you know what your sources are and serve as a defense if you are accused of using something. The question then becomes, going back to copyright: what if Copilot learns a language and then produces code identical to someone else's, because that line of code is the most efficient, obvious way to do something?


Copyrights protect expression of ideas, not the ideas themselves. The ideas themselves could be a subject matter of patents. This addresses a few of your comments.

>> It would be nearly impossible to trace the original code or make a copyright claim.

The difficulty in proving infringement is a different aspect than an infringement happening.

>> I always thought the idea of open source was to use copyright against itself because we believed in openness.

That's exactly the point. The open concern being discussed with using Copilot is that one may incorporate parts of the code under the "openness" license into an application which is not under a compatible "openness" license.


> but an average coder wouldn't realistically expect to get sued for copyright for doing this

Yeah, but they absolutely could if they copied code verbatim, which is what Copilot is often doing.



You're citing a study done by GitHub on their own product, on something that has generated lots of backlash over the last week. I would take it seriously if it was an independent study, but right now it's hard to believe it.


GitHub's study predates any negative press, so that would not have been the reason for them to manipulate the study. And the examples that are making the rounds combine two aspects: a) prompt engineering and b) famous code samples. That's hardly representative of normal use.

So while independent testing would be welcome I don't consider backlash observational evidence. What we're seeing is in line with prior experience with GPT.


> GitHub's study predates any negative press, so that would not have been the reason for them to manipulate the study.

I seriously doubt that no one raised the concerns that are raised today during the development of Copilot.

> And the examples that are making the rounds combine two aspects a) prompt-engineering b) famous code samples. That's hardly representative of normal use.

That's true, however the way they advertise Copilot is to prompt it with comments, which might push it to regurgitate code more often.


>I seriously doubt that no one raised the concerns that are raised today during the development of Copilot.

I would be shocked if they did. The common wisdom in the ML community is that training data is fair use. GPT has been operating for 2-3 years now with no legal issues, and this is just a different fine tune of GPT3.


Which means that if every one of us was using it, many of us would be using copyrighted code by the end of a single day.


Yes. That part was implicit in my post. I'm trying to follow the logic a bit further to see where it goes.


Isn't the issue that GitHub's particular implementation of the idea is braindead, as in it copy-pastes verbatim?


More like braindead because it was trained on an unfiltered set of data without checking licenses. In effect, they've created a code pirating tool.


Slight exaggeration but even given that aspect, I'm uncomfortable with the community's reaction


I am too. It's so weird to see free software proponents argue for what is in effect a big extension of what should be protected under copyright. Not only using it, not only distributing it, but even learning from it, and the line there between learning from it as an AI and learning as a human is fluid. This copyright maximization could have a similar effect to software patents: making it more difficult to write software, because there will be an increased danger of being sued for infringement ("you learnt how to do that from my GPL code, but you used MIT!").

1:1 copies are a different question (an "it depends" on how much gets copied and whether the code is protected, as in not obvious, in the first place), but those are not the main thing that gets focused on in that strange twittersphere (which the article is referencing).


> It's so weird to see free software proponents argue for what is in effect a big extension of what should be protected under copyright.

I don't believe that's what they're arguing. They're arguing that they're putting themselves (or their employer's) products at risk by having unattributed copyright protected code in their products. It doesn't matter how much any of us would hope that copyright didn't exist, the fact is, it does.

There are other issues with Copilot, but by far the biggest is the licensing risks for those using it in a professional software engineering environment.

I'm a FOSS advocate and all my FOSS projects are MIT licensed. Anyone can use my code to do anything they like, but my day-to-day professional software engineering life needs to know where the code comes from and what its license is.


Did you see https://twitter.com/NoraDotCodes/status/1412741339771461635, discussed at https://news.ycombinator.com/item?id=27769440 ?

> it's official, obeying copyright is only for the plebs and proles, rich people and big companies can do whatever they want

> Wow. It's amazing that nobody has brought up the point that the result of ML is a derivative work from the input dataset. If the dataset has a restrictive license, like the GPL, it is a clear cut case of violating the terms of the GPL because it is not being released in the open

There are other reactions of course, I'm handpicking here, as those are the reactions I was referencing above.

> my day-to-day professional software engineering life needs to know where the code comes from and what its license is.

I understand, and I agree with it. But that doesn't mean that code produced by an AI that learned from GPL code has to be licensed under the GPL. Verbatim copies of sufficiently high originality, so that copyright applies: sure. And that's where there is some risk of Copilot going wrong. But that's not the primary output of Copilot nor its goal.


> And that's where there is some risk of Copilot going wrong. But that's not the primary output of Copilot nor its goal.

It took no more than 24 hours for the verbatim code to appear in the wild. It's irrelevant if it's the primary goal or not, if it can happen once then it can happen again; therefore (I believe) it's an existential risk to any organisation that uses it for their own products.


This is a clown take that was planted in you by the GitHub comments on this issue. Nobody cares particularly if you feed data into an algorithm of your choice. I can feed data into a hash algorithm all day and that won't come back to bite me.

This particular algorithm isn't learning anything, it's a big pattern matching/information retrieval system. It wasn't fed tokens, AST, constructs, type information, it was fed pure source code hoping the oblique magic in between would make it develop understanding. Nothing of the sort happened and it is instead regurgitating input data verbatim, which is the obvious copyright issue at hand.


You are making it sound as if the system only copies full functions verbatim, but that's not the case. See https://docs.github.com/en/github/copilot/research-recitatio... (as linked in a different comment thread here).

And btw, clown take? That's no way to enter into a productive discussion.

> Nobody cares particularly if you feed data into an algorithm of your choice. I can feed data into a hash algorithm all day and that won't come back to bite me.

I showed examples to the contrary in my other comment.


> You are making it sound as if the system only copies full functions verbatim

It doesn't only do that, but it does do that [1]. And that is the issue. It doesn't matter if it only does this 1% of the time, it's enough to be high risk.

[1] https://twitter.com/mitsuhiko/status/1410886329924194309


I used it and I had the impression that it understood my code, sometimes at least.

It once wrote an API client for a rather idiosyncratic interface I just implemented.


Has that been proven? It's possible it's only doing that due to a small training set


On top of that, I get the feeling that way too many programmers have a poor understanding of the actual value of their work.

A single function, even if particularly clever, elegant, or performant, doesn't have much value on its own. Sure, you might have thought about it for hours and struggled hard to get it just right, but that doesn't imply it's particularly valuable on its own.

A particular inverse square root function has no use outside a complete library or program. It's that added value (embedding it as part of something much bigger or complex) together with unrelated work (e.g. API calls, artwork, game design, etc.) that generates a product with actual value.

There are people who unironically try to argue that choosing variable names and expressing trivial mathematical relations in a particular way holds intellectual value worth protecting; even if it's just a single line of code.

Interestingly people cheered when Google won against Oracle and APIs remained "free". It'd be so interesting to take a peek into an alternate reality where CoPilot has been released by Google or Tesla(!) instead of Microsoft. My gut feeling is that the reaction would've been very different depending on the company that released the tool.


If I write a poem by variable names, don’t you think that would fall under copyright?

The hurdle for getting copyright is pretty low (of course, that also means any compensation awarded by a court would be low).

Anything that is not technically required (or deviates from some generic standard) but involves some aesthetic considerations is already potentially copyrightable.

Because the boundaries are so fuzzy, no sane company lawyer would allow the use of CoPilot. Why take a risk for some glorified text snippets?


Copyright is different from licensing and is far more complicated than most people think.

> Why take a risk for some glorified text snippets?

Name an example where that was actually an issue. And I'm talking explicitly about single functions and even single lines of code.


That's not how lawyers think.

Microsoft's programmers were (are?) not allowed to read GPL code, to rule out any doubt that they might copy GPL code into the codebase, even if only unconsciously.

Lawyers are not just for suing; they are also for covering your ass so there is no way you could lose if you get sued.


> It'd be so interesting to take a peek into an alternate reality where CoPilot has been released by Google or Tesla(!) instead of Microsoft. My gut feeling is that the reaction would've been very different depending on the company that released the tool.

I really doubt that. I do think the reaction would be more favorable if the project was itself open source.


Well, GPT-3 isn't open source, yet the Twitter-sphere was ecstatic about its code generation features.

It isn't even new at all [0], [1], [2] - it's just getting exposure because people are actually starting to use it.

GPT-3 still forms the basis of Codex, which is the basis of CoPilot. None of these are Open Source and even if they were - what exactly do you think that would change? The models are black boxes, they are too big to even run on consumer level hardware, let alone train or finetune and the code of CoPilot is most likely just API calls, glue code, some post-processing and integration into VS Code.

In short, no I don't agree that an open source product would've helped with people's opinion and it would've been irrational even, because there'd still be no difference in the core functionality and what people are complaining about.

[0] https://sourceai.dev

[1] https://gpt3demo.com/category/code-generation

[2] https://www.datacamp.com/community/blog/gpt3


The exaggerations won't stop now, will they? First of all, it's not as if CoPilot spits out verbatim replica of training data on every other prompt.

Secondly, the consequences of accidentally copying code by means of using this tool are pretty minor. The author acts as if copypasta from StackOverflow, RosettaCode and similar sites is NOT a daily occurrence (and can't even be checked in the case of closed source software).

Fake gurus like Siraj Raval [0] can manage to literally steal - as in copying other people's work and claiming it as their own - for years without consequences and face ZERO legal backlash even after being exposed. Some of his repos had hundreds or even thousands of stars and forks on GitHub, while the original authors he copied from got no attention or credit at all.

If this is what people can get away with who do this knowingly and deliberately and with entire projects, then I really have to wonder what the fuss is about when an ML model occasionally spits out a few lines of code snippets verbatim from its training set.

[0] https://www.youtube.com/channel/UCWN3xxRkmTPmbKwht9FuE5A


Your arguments may be valid from a common sense standpoint. However, they may not exactly be valid from a legal standpoint as explained below:

>> First of all, it's not as if CoPilot spits out verbatim replica of training data on every other prompt.

Your comment is about how often it is happening. That it happens at all, as the OP shows via examples, is itself a concern. It's not necessarily a question of quantity but of existence.

At least when I last checked [1], Github itself [2] notes: "Does GitHub Copilot recite code from the training set? GitHub Copilot is a code synthesizer, not a search engine: the vast majority of the code that it suggests is uniquely generated and has never been seen before. We found that about 0.1% of the time, the suggestion may contain some snippets that are verbatim from the training set. ..."

In other words, 0.1% of the time snippets are copied verbatim, a percentage of which may be copyright violations (not everything is).
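
For a rough sense of scale, here is some back-of-the-envelope arithmetic under a big assumption: that each suggestion is independently verbatim with probability 0.1% (the figure quoted above). The independence and the suggestion counts are my assumptions, not something from GitHub's study, and `p_at_least_one` is a made-up helper name.

```python
def p_at_least_one(rate: float, n: int) -> float:
    """Probability that at least one of n independent suggestions is verbatim."""
    return 1.0 - (1.0 - rate) ** n


RATE = 0.001  # the 0.1% figure quoted from GitHub; treated as a per-suggestion rate

# Hypothetical numbers of accepted suggestions, chosen only for illustration.
for n in (10, 100, 1000):
    print(f"{n:>5} suggestions -> {p_at_least_one(RATE, n):.1%} chance of at least one verbatim snippet")
```

Under those assumptions a single suggestion is very unlikely to be verbatim, but the chance of hitting at least one grows steadily as the number of accepted suggestions grows.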

>> The author acts as if copypasta from StackOverflow, RosettaCode and similar sites is NOT a daily occurrence

Those may be violations too. That it is a common practice does not turn it into a non-violation.

>> Fake gurus like Siraj Raval [0] can manage to literally steal

That someone escaped does not imply that the laws are now somehow invalid.

>> I really have to wonder what the fuss is about when an ML model occasionally spits out a few lines of code snippets verbatim from its training set.

I agree. AI here is posing a challenge which should lead to refinement of the law, as I have explained in another context too [3]. If this is indeed needed, don't expect it to happen anytime soon.

---

[1] https://news.ycombinator.com/item?id=27736089

[2] https://copilot.github.com/

[3] https://news.ycombinator.com/item?id=27635481


> Those may be violations too. That it is a common practice does not turn it into a non-violation.

I'm not disputing that. What I'm asking is, why is this particular 0.1% small snippets case such a big deal all of a sudden, when it obviously IS common practice (within the bounds of small code snippets)?


You are right in asking that question. What happens is that when there are thousands of small violators, it's not easy for lawsuits to proceed. When there's a central entity involved, GitHub in this case, there is a discrete party to be challenged or talked about.

What GitHub has done is that they have not claimed rights over the generated code, and have passed on the responsibility to the developer/entity using the generated code. In other words, they are supplying a tool that can make copies, not necessarily violating the copyright themselves {Footnote 1}. So the Betamax ruling can come into play here, AFAIK.

What's needed, then, is for developers to be educated about it, which is what articles like the OP will end up doing. Developers often aren't well-educated about this stuff themselves, and presume that using generated code is just fine. See how I was downvoted here, as an example: https://news.ycombinator.com/item?id=27735484

{Footnote 1}: Whether the GitHub CoPilot model itself is violating copyrights is another matter. If the training data has included AGPL-licensed code, even that may be a violation. However, several other licenses would not be violated till the model is "distributed".


Good and important point about educating developers.

Maybe it goes even further and similarly to planned EU regulations, products based on black-box AI models need to have documented behaviour.

So the documentation needs to state explicitly that generated code is unsuitable for clean-room engineering and has to be considered "tainted" one way or the other.

I don't worry about verbatim reproductions at all, because results of that nature can easily be filtered out if need be (or be annotated with the proper reference). I'd expect a company that also runs the 2nd biggest web search engine to be able to pull that off with ease.
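
As a rough illustration of what such a filter could look like (a sketch only, not how GitHub does or would do it), one option is to index token n-grams from known sources and flag any suggestion that shares a long enough run of tokens. The function names, the tokenizer regex, and the window size are all assumptions picked for the example.

```python
import re


def token_ngrams(code: str, n: int) -> set:
    """Split code into crude tokens and return the set of n-token windows."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", code)
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_index(corpus: list, n: int = 8) -> set:
    """Collect every n-gram that appears anywhere in the reference corpus."""
    index = set()
    for document in corpus:
        index |= token_ngrams(document, n)
    return index


def looks_verbatim(suggestion: str, index: set, n: int = 8) -> bool:
    """Flag a suggestion if any n-token window also occurs in the corpus."""
    return any(gram in index for gram in token_ngrams(suggestion, n))


# Tiny made-up corpus standing in for the training set.
corpus = ["def add(a, b):\n    return a + b  # sum two numbers"]
index = build_index(corpus, n=6)

print(looks_verbatim("def add(a, b):\n    return a + b", index, n=6))
print(looks_verbatim("def mul(x, y):\n    return x * y", index, n=6))
```

Because the comparison works on tokens, whitespace changes alone do not hide a match, but renamed identifiers would. A flagged suggestion could then be dropped or annotated with its source, which is the trade-off the comment above describes.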


All stuff on Stack Overflow has a CC BY-SA license though, so the license burden is quite minor. Meanwhile most stuff on GitHub is (most likely still) unlicensed, meaning that it cannot be used in any way.


How often is code submitted on Stack Overflow copied from GitHub or other sources, though?


StackOverflow is explicitly CC-BY-SA which makes it easy to comply (e.g. link to the answer in a comment, use a compatible license).
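
For example, an attribution comment might look something like the sketch below. The URL is a placeholder rather than a real answer, the author field is left blank on purpose, and `chunked` is a made-up function standing in for whatever snippet was adapted.

```python
# Adapted from a Stack Overflow answer, licensed CC BY-SA.
# Source: https://stackoverflow.com/a/<ANSWER_ID>  (placeholder, not a real link)
# Author: <author display name>
# Changes: renamed variables, added type hints and a docstring.
def chunked(items: list, size: int) -> list:
    """Split items into consecutive chunks of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


print(chunked([1, 2, 3, 4, 5], 2))
```

The point is only that the source and the license travel with the snippet, which is exactly the provenance record that generated code does not come with.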


And who checks that the code snippets didn't come or were derived from code licensed under a different license?


The hilarious thing about this, and no offense to anyone, is that a lot of public code on github is terrible (even mine!). Garbage in, garbage out, as they say.


> stole it from a code repository protected by a license.

This is a common misconception. The license doesn't protect the code. In fact, licenses are about removing protection (in certain situations). It is copyright which protects the code.


That is just nitpicking over semantics. The license does indeed grant some rights that would not have been available under copyright; however, the license still places some limitations, which are there to protect "something". This something may be the freedom (free as a bird) of FOSS software or whatever. It's understood that this protection still rests on copyright law.


That is a way you could look at it, but to me it just seems to be setting people up to be confused. The license isn't really limiting you at all. It is just specifying conditions under which it grants some rights. For example, the GPL isn't limiting your use of the code in closed-source software. It is simply granting you rights to use the code only if you release your code under the GPL.

However I think that you are right when phrased like this. If you assume that you use the code, then the license does effectively apply restrictions. But again, I wouldn't want to presuppose that someone is using the code. It puts the decision in the wrong place.

But still, this is quite different from the original framing. In the context of the article, the problem isn't that the code was protected by a license; the problem is that the code is protected by copyright. This problem could be worked around if you complied with the license such that it gave you permission. However, the point remains: if you revoke all of the licenses, the original problem remains.


Yes, you are right from a semantic standpoint. I already fully agree.

>> This problem could be worked around if you complied with the license such that it gave you permission. However, the point remains: if you revoke all of the licenses, the original problem remains.

Absolutely right. If the code used for training data did not come with a license (i.e., it retains full copyright by default), the problem isn't solved. It's rather bigger as you have noted.

However, the OP is still making the right point. If the conditions/limitations which came along with the license are not honored, that becomes a copyright violation, and hence, the developers need to be careful or they might get sued.

>> GitHub’s AI Copilot Might Get You Sued If You Use It

Github Copilot has placed the responsibility on the developer, so it's important that the developers understand all this.


I think that it's now in a gray area and that we'll see if it's legal or not in the upcoming years. Because let's be honest, the current legal systems weren't designed for ML and AI...


And yet they'll get shoehorned into our legal systems regardless. It seems like legal systems need a more adequate revising process, something to help refactor after all the shit we've shoved in over the decades


Seems more and more like the current legal systems are like huge legacy projects. They desperately need a redo, but so much depends on them that one is afraid of touching them.


It is called Revolution. No amount of reform will unseat the reigning oligarchy. The system is too corrupt to be reformed from within. Fairly certain these statements hold true for every country.


People mostly concentrate on whether using Copilot might be a real copyright violation.

But the danger of being sued is a different question.

Consider the following scenario:

1) Company X has its product code stolen. Somebody puts it on GitHub. It's discovered and the code is removed.

2) You work on an open-source project which competes with that product of Company X, you use Copilot, and make it known.

3) Company X looks through your code and finds fragments which look vaguely similar to fragments of their code.

4) They sue, claiming that you copied and obfuscated their code.

Were you not using Copilot, one line of defense for you would be that you never looked at the stolen code, never accessed it, so no copying took place.

With Copilot, this line of defense is not available to you, because Copilot "saw" that code and in principle that could help to produce the fragments in question. (Of course other lines of defense are still available).

Whether courts would accept this argument is a different question, but the argument is not obviously invalid, and Company X can cause enough trouble for you...


Honestly it seems a bit like an overreaction. The author found a single person who is leaving GitHub over it and they're waving it around like "some people" are leaving GitHub.


I said this in the other Copilot threads too, but don't forget that copyright is not the only protection there is.

Lots of countries (USA first) have tons of software patents. Apache and GPL have clauses to protect the project and its users, but that obviously does not extend to Copilot-generated code.

Now go guess where that code comes from and if it is somehow protected.


For those of you wondering why a litigiously rigorous company like Microsoft is pushing Copilot despite overwhelming evidence of copyright infringement: it's not a technical limitation they seek to challenge but the legal limitation of the GPL and open source code in general. What they could not destroy through market dominance, they will use their 143 billion in revenue to simply render moot.

Microsoft has the coffers and attorneys to litigate this all the way to the Supreme Court, and I surmise that's just what they intend to do. A win for GitHub AI would be a damning indictment against the protection offered by open source licensing. Cloud is Microsoft's golden calf in 2021, and ensuring it grazes rent-free on your projects... your code... has become a priority.


I feel like it's pushing for a future where source available becomes equivalent to open source.


Open source means many different things to many different people: that's why we have licenses. And that's why talking about "open source" often elides important details. When author(s) license a project under CC0/public domain, their vision of open source is dramatically different than a project licensed under AGPL3.


Yes, and what I'm saying is that this feels like it's pushing for anything that is source available to become true open source with no limitations, including (due to practicality) attribution.


First off, "open source" does not typically mean free of any kind of restrictions, and that is the entire point of this debate. So "source available" becoming equivalent to "open source" would only mean that some open-source license applies by default, whereas under the Berne Convention, copyright applies by default. Anyway, below I am taking your statement to mean what you perhaps actually meant.

Think about what would happen if this is applied to everything. Anything being available becomes free of restrictions. Then all books published become openly available for anyone to copy or print. Even compiled software becomes free of restrictions, which anyone can copy freely. This is basically the demise of copyright law itself, which would be disastrous.

In case I misunderstood your comment, please clarify.


By open source in this context I mean totally open with no requirements, including attribution.

> Think about what would happen if this is applied to everything. Anything being available becomes free of restrictions. Then all books published become openly available for anyone to copy or print. Even compiled software becomes free of restrictions, which anyone can copy freely. This is basically the demise of copyright law itself, which would be disastrous.

I think standards for code should be different from art. They're different concepts. With art, you want to protect a specific expression. With code, you want to protect the output of an expression. Everyone's talking about how this tool spits out code verbatim, but who cares? What if my tool spits out your Python 2 code as Python 3 code? What if it also just fluffs it a little to read differently but perform similarly? It doesn't need to do this explicitly either; it can just guarantee that no outputted code matches an original work exactly but still gets to the same place. I can make the code substantially different while keeping the output exactly the same. But that gets us nowhere except generating ever more obtuse implementations of the same thing for the sake of stylistic differences. Functionality isn't copyrightable for good reason.
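
To make the "different form, same output" point concrete, here is a minimal sketch. Both functions are hypothetical and made up for illustration (they are not taken from Copilot or from any real project); the only claim is that they read differently while returning the same values.

```python
# Hypothetical illustration only: two functions whose text differs
# but whose observable behavior is identical.

def sum_of_squares_loop(n: int) -> int:
    """Add up 1^2 + 2^2 + ... + n^2 with an explicit loop."""
    total = 0
    for i in range(1, n + 1):
        total += i * i
    return total


def sum_of_squares_expr(n: int) -> int:
    """Same computation, restated as a single generator expression."""
    return sum(k ** 2 for k in range(1, n + 1))


def same_behavior(limit: int = 200) -> bool:
    """Check that both versions agree for every n from 0 to limit."""
    return all(
        sum_of_squares_loop(n) == sum_of_squares_expr(n)
        for n in range(limit + 1)
    )
```

A textual diff of the two function bodies shows almost nothing in common, yet `same_behavior()` finds no input in the tested range where they disagree.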

edit: and the example of rewriting the code in a different language in a different style is probably still technically a copyright violation, but it would be extremely difficult to win that case at trial, and thus it's a stupid rule.


I would like to brainstorm more on this. Perhaps offline would make more sense.

I agree that for code, the function is more important than the form. The "ideas" behind the function are protected by patents, not copyrights, as you already recognize. If there are no inventive steps involved, then of course patents do not come into the picture and the function sees no protection; only the expression (form) part does.

I do think there is creative expression involved in writing code as well; however, I need to understand your point better. Code is just written in a programming language, whereas prose or a poem is written in a natural language. Both can involve creative expression. I would like to hear more about how exactly you are distinguishing the two in a way that could hold up in the form of, say, a revised law. What would a revision to the law look like that leaves code always free of the creative expression that copyright protects, while still leaving books, paintings, etc. open to copyright protection?

If the argument is only that the writer of the code is not interested in the form, only the function, they are free to put their code in the public domain and not make use of copyright protection. And as far as protecting function is concerned, they may make use of patents where applicable. In most other respects (I am ignoring other IP laws), the "openness" you ask for is already there.


> I do think there is creative expression involved in writing code as well; however, I need to understand your point better. Code is just written in a programming language, whereas prose or a poem is written in a natural language. Both can involve creative expression. I would like to hear more about how exactly you are distinguishing the two in a way that could hold up in the form of, say, a revised law. What would a revision to the law look like that leaves code always free of the creative expression that copyright protects, while still leaving books, paintings, etc. open to copyright protection?

I would suggest the form of the code is irrelevant, unless published in a way that exhibits its form AND violated in a way that exhibits its form. E.g. perhaps in a blog about code practices. Code being used to do things is not an expression like poetry. Much in the same way actual language cannot be copyrighted outside of doing a performance. Thinking of an acute example is difficult, but you cannot, for instance, copyright a pickup line at a bar and prevent others from using it, though you may prevent them from publishing it.

But to point again to the fact that it doesn't really matter: if we fixate on form being relevant, we can train machines to copy intent while distorting form. The complaint about Copilot copying code verbatim is a red herring to the problem at hand.

edit: though to your other point, I think patents on code are not acceptable either.


I am not very clear on your arguments. It would help if you could explain more.

>> Code being used to do things is not an expression like poetry.

Agreed: code "being used to do things" is not an expression. But writing code is to be differentiated from using it. It's the writing part that can involve creative expression, and that is where copyright applies.

>> Much in the same way actual language cannot be copyrighted outside of doing a performance

Whether natural or programming languages are themselves copyrightable is as yet unsettled in law. AFAIK, there are no precedents either way. So for now at least, we can assume that a language cannot be copyrighted.

>> you cannot, for instance, copyright a pickup line at a bar ...

Short phrases are not copyrightable. How short is short is an open question (i.e., no fixed rule is available, other than vaguely asking how likely two people are to come up with the same thing independently).

>> ... and prevent others from using it, though you may prevent them from publishing it.

Copyright usually comes into action when the subject matter is fixed in a tangible medium. So someone merely using a pickup line at a bar may not gain a copyright in it or be able to prevent someone else from using it.

>> But to point again to the fact that it doesn't really matter: if we fixate on form being relevant, we can train machines to copy intent while distorting form. The complaint about Copilot copying code verbatim is a red herring to the problem at hand.

I see your point, which hints that code has something beyond pure form and patentable ideas. This is what I would like to brainstorm on and understand. What exactly is this, and how could it be codified into law?

And how do we still distinguish code from prose/poetry in both of the respects involved, i.e., the creative expression of form and the functional aspect?


> Copyright usually comes into action when the subject matter is fixed in a tangible medium. So someone merely using a pickup line at a bar may not gain a copyright in it or be able to prevent someone else from using it.

No, I mean you can't prevent someone from using a pickup line you write, publish, and copyright. Let's ignore the length requirement here, as I feel that's tangential to my point. My point is that you cannot use copyright to stop the usage of language as a function. Performances aside.


I see the point (though it does not answer my original questions, or at least I have not yet understood it to that level).

Just to make your point complete, how do you separate out performances? Is it just the presence of a large number of people? The amount of financial gain attributed to the performance? I'm asking because that may be the differentiation in practice, even as applied to code.


Not sure what I'm missing.

> Just to make your point complete, how do you separate out performances?

I feel like intent is a strong start towards classifying something as a performance.


I think this equivalence is how it's supposed to be. At least availability for non-commercial purposes.


But all the code used IS open source; this does NOT mean anyone is free to use it in e.g. commercial or differently licensed products.

I mean, there are some licenses out there that do allow for it (I'm not that well versed in OS licenses), but open source code often comes with caveats. GPL, for example, makes it mandatory for you to open-source any changes you make to it, making any GPL-licensed code suggested by Copilot unusable in a commercial closed-source application.


https://twitter.com/NoraDotCodes/status/1412741339771461635

Seems that GitHub used any public repository, not just ones with open-source licenses. This could include source-available code or any other kind of "visible, but nonfree" license.


>But all the code used IS open source

All the code used is source-available but not necessarily open source. If you push a repository to GitHub with no license, it's not open source.

In 2015, the number of repositories with a license was around 20%: https://github.blog/2015-03-09-open-source-license-usage-on-...


Are you sure that is how the GPL works? I thought you only have to provide the source if you share the code with someone else. If you are hired to write code and sign a contract stating that the copyright of the code belongs to the person who hired you, the owner can extend and use the GPL code closed-source. Only if the owner decides to share the code do they have to tack on a GPL-compliant license.


If you distribute a product built with GPL-licensed code, then you need to make the source code available. It's the distribution aspect that's crucial; for example, SaaS products do not have to disclose their backend code, because a user only accesses the GPL code remotely through their browser.


> But all the code used IS open source; this does NOT mean anyone is free to use it in e.g. commercial or differently licensed products.

That is source available, not open source.




