> train on open source projects

To be specific, the FAQ states: "It has been trained on natural language text and source code from publicly available sources, including code in public repositories on GitHub."

Some have raised concerns that Copilot violates at least the spirit of many open source licenses, laundering otherwise unusable code by sprinkling magic AI dust... most likely leaving the Copilot user responsible for copyright infringement.




Yep. The only reason it hasn't been utterly dogpiled by lawyers is that far fewer people care about code than about other forms of IP. If I made an AI assistant called PhotoStar to help with digital art and it just attached Big Bird's face onto a character in my children's book, I'm going to get sued. "Hey now, I just hit paste, the software pressed copy by itself" is not going to hold up.


Or the fact that you grant GitHub an implicit license as outlined in the ToS.


GitHub has never asked for a representation granting GitHub itself an unlimited-rights license for any purpose. Further, the person posting GPLed code to GitHub is not necessarily the sole copyright holder, and GitHub has never represented that this is a problem.


GitHub isn't liable. That's been established in court with regard to training AIs. The one who is liable is you, who may or may not have the legal right to use the code CoPilot spits out for you.


It seems like this space will open up all sorts of interesting novel legal questions.

It is possible to provide CoPilot with a sequence of inputs that reproduces some of its training input, which was copyrighted. Let's say you want to help people violate copyright, so you, as a third party, distribute a script that provides that sequence of inputs. Who's violating the copyright there?

Alternatively, it is apparently legal to produce a clean-room implementation that duplicates a copyrighted implementation. Suppose you were to use a tool like CoPilot that has been trained on that copyrighted implementation. Is your room still clean? You might even be able to get it to spit out identical functions!

Or, if you have an ML algorithm that has been trained on leaked closed source code, and it is sufficiently over-fitted as to just provide the source code given the filename or the original binary, who is violating copyright when this tool is used? If it is just the end user, then this seems like a really convenient way to launder leaked closed source code.


I don't think it's as clear cut as you make it out to be. Tortious interference is a common law remedy that might make GitHub/MS liable.

If I induce you to break a contract with someone else they can come after me for damages.

For example, in this case there are developers who have created GPL code. That code was licensed to some other developer. GitHub then encouraged people to upload git copies of the GPL code onto GitHub, where it was put into the model. That model contains the copyrighted materials and doesn't come with the necessary notices. The output of the model can be code that is a direct stand-in for the copyrighted work. Thus GitHub has become a party to breaking the license even though they themselves never agreed to the GPL.

In addition, GitHub is encouraging other developers to copy that code and use it in their projects (they are advertising the tool and making it broadly available). Again, that's encouraging an action that breaks a contract. GitHub is well aware that this is likely happening and continues on. Thus they might be liable. You also might be liable.

All of these things can and likely will be argued before courts but it's not at all one sided.

> That's been established in court with regards to training AIs.

What are you basing the certainty of this statement on? The case law I have seen around this is pretty spotty. Cases around training on copyrighted materials have predominantly been about the input, not the output, with the final output usually being controlled by the model owner. For example, Google obtained the books they scanned legally, then used them to produce Google Books' index. There are some major differences:

- The books were purchased, meaning they got a license to use the book. There's for sure code in the model that GitHub does not legally have the right to use, and they are aware of this, which makes the input shakier for GitHub.

- GitHub is making a direct profit off of this service. It's a revenue-generating enterprise. That's important since it raises the bar of what they can be expected to do.

Nothing has gone to the Supreme Court yet; it's all per circuit and not settled case law. Also, this gets WAAAAY more complex once we start talking about jurisdictions outside of the US, where it isn't decided at all.

These things are complex, and you likely need your lawyer to advise you on any real questions.


> The books were purchased, meaning they got a license to use the book.

This may be a bit nit-picky, but I don't think that is correct.

Most books I've seen don't say anything about granting a license so there would be no explicit license that comes with them.

Maybe you could find an implicit license if normal use of a book required one, but it does not. Copyright law allows all the normal uses of a book without requiring permission of the copyright owner. You only need a license when you want to do something that requires permission.


I should have been more explicit; you are completely correct.

I was saying that there's some implied license after first purchase; I believe that was part of the court's decision. Paying for a book (or a library paying) gives you implicit fair-use rights. GitHub's copies of code were not purchased; they were sometimes provided by a third party.

So there's likely some room to argue that fair-use rights differ enough between the previous cases and GitHub's.


This has been explained many times: you can check word for word whether the output is original. All it takes is a Bloom filter built over the Copilot training set and an n-gram extractor.
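
Roughly how such a check could look, as a minimal Python sketch (the BloomFilter class, the whitespace tokenizer, the 6-token window, and the 0.5 threshold are all illustrative choices, not anything Copilot actually ships):

    import hashlib

    class BloomFilter:
        # Fixed-size bit array with k hash functions: false positives are
        # possible, false negatives are not.
        def __init__(self, size_bits=1 << 24, num_hashes=5):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(size_bits // 8)

        def _positions(self, item):
            for i in range(self.num_hashes):
                digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.size

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, item):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

    def ngrams(code, n=6):
        # Naive whitespace tokenizer; a real system would use a language-aware lexer.
        tokens = code.split()
        return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    # Index every n-gram of the training corpus (a toy corpus here).
    training_corpus = ["def add(a, b): return a + b"]
    bf = BloomFilter()
    for src in training_corpus:
        for gram in ngrams(src):
            bf.add(gram)

    def looks_copied(generated, threshold=0.5):
        # Flag output if a large share of its n-grams appear in the training set.
        grams = ngrams(generated)
        if not grams:
            return False
        hits = sum(g in bf for g in grams)
        return hits / len(grams) >= threshold

    print(looks_copied("def add(a, b): return a + b"))   # True: verbatim from the corpus
    print(looks_copied("def mul(x, y): return x * y"))   # False: no n-gram overlap

Since a Bloom filter only produces false positives, a snippet whose n-grams all miss is guaranteed not to appear in the indexed corpus; any hits would still need a confirming lookup against the real data.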


Yes, and you'll be fine if you do. The problem is you might not bother.


Alpha-equivalence be damned!



Fortunately, it can also generate high-quality, completely novel characters, every bit as lovable and unthreatening as Big Bird:

https://imgur.com/a/ppeclPL


But if you made DALL-E and it just remixes images sourced from a broad scan of the Internet, filtered through several layers of machine learning indirection, you're all good.


Sure, if it's remixed to the point where most people don't go "hey, that's Big Bird!" CoPilot doesn't, or at least doesn't always, like when it copied Quake's fast inverse square root code verbatim, comments and profanity included. Using CoPilot to create commercial code opens the coder to significant liability if there's enough money at stake.


That piece of code had duplicates in the training set, making it prone to memorisation. Almost all generated code is original.


> Almost all generated code is original

Good, you will almost not be liable for infringement.


Let's wait for the first big Codex infringement scandal to erupt and then I will start worrying about it.


Just argue that you subcontracted that code to Microsoft in good faith for $10/month and pass on the lawsuit to them.


I still can't believe they trained it on open source code and didn't have some tagging system to a) exclude code based on licensing, and b) auto-include the license, or at least warn about it before applying code. Especially when many cases were shown of it writing code line by line from the same exact codebase.


Another concern is that nearly every Stack Overflow answer or Wikipedia article that isn't a trivial algorithm tends to be buggy at its edge conditions. Most of them look like they were submitted by college students and not experts.


Remember when we believed that experts were over because the wisdom of the crowds would reign supreme?

Been a hell of a decade, hasn't it.


The "wisdom of the crowds" doesn't mean what many people think it means.

The wisdom of crowds works best when:

1. participants are independent (otherwise you may get failure modes, such as "groupthink" or "information cascades")

2. participants are informed, but in different ways, with different opinions;

3. there is a clear, accepted aggregation mechanism, where individual errors "cancel out" to some degree

I view the topics in James Surowiecki's book (or the Wikipedia summary of it, at least) as required thinking for everyone, preferably synthesized with a study of statistics and political economy.

In particular, the Wikipedia article's section on "Five elements required to form a wise crowd" is a slightly different slicing of the required elements that I offer above.

* If you read that section, trust is listed. I, however, don't see trust as a necessary condition for a "wise crowd". Trust is often useful (or even necessary) when a collective decision is used for governance, decision-making, and policy.


When the wisdom of the crowds is all easily accessible, the hard part becomes curating.


This is legit. While it seems it takes forever to bring this kind of stuff to trial, it will be an interesting case for sure. Especially in the broader, more general sense.

AI is just recomposition of existing snippets of code, art, text, music, etc. Does an AI fall under fair use? What happens when an AI produces something too similar to an existing work or trademark? I know the computer won't get sued; the owner/user will. But still, it's a hard problem.

Even if Copilot was initialized exclusively with snippets from open source software, that doesn't mean copyright infringement isn't a concern.


> AI is just recomposition of existing snippets of code, art, text, music, etc.

It's not random recomposition, which is worthless. It's useful recomposition, adapted to the request and context. It adds something of its own to the mix.


Not to mention that just because the code is public doesn't mean you can use it however you want. You can publish code and still retain copyright. I wonder if GitHub looked at the licenses when they gathered the data for the model.


It seems unfortunately clear that generative ML as typically practiced falls under fair use of even the most restrictive licenses, or the lack thereof (e.g. a training set including Disney movies without Disney's permission). Some people say that's great and it's legal, hooray, but I would love it if the law caught up and added requirements to models trained this way. If you benefit from other people's stuff without their permission, then you ought to have to give back in some way.


What is actually crazy is having copyright/patents/whatever apply to mathematical structures and code, and be retainable for so long. It's rent on ideas, such a ridiculous concept.


Copyright and patents are very different. I think the general consensus among developers is that software patents are silly, but copyright on source code is very important.


If you can't prove your code was stolen, you shouldn't have a claim. And Codex should just skip code that exists in the training set. All that remains is creative code.


Would a cartoon about Mickey Duck and Donald Mouse be infringing?


You can work on the definition of "similar code". It can be a separate model on its own. Use human judgements to learn it.
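
A minimal sketch of that idea in Python, assuming a pairwise classifier fit on human "similar / not similar" labels (the feature set, the toy labelled pairs, and similarity_score are made up for illustration):

    from sklearn.linear_model import LogisticRegression

    def features(a, b):
        # Cheap lexical features for a pair of snippets; a real model would use
        # token sequences, ASTs, or embeddings.
        ta, tb = set(a.split()), set(b.split())
        jaccard = len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0
        length_ratio = min(len(a), len(b)) / max(len(a), len(b), 1)
        return [jaccard, length_ratio]

    # Human-labelled pairs: 1 = judged "too similar", 0 = judged distinct.
    labelled_pairs = [
        ("int add(int a, int b) { return a + b; }",
         "int add(int x, int y) { return x + y; }", 1),
        ("def fib(n): return n if n < 2 else fib(n-1) + fib(n-2)",
         "SELECT name FROM users WHERE id = 1;", 0),
    ]

    X = [features(a, b) for a, b, _ in labelled_pairs]
    y = [label for _, _, label in labelled_pairs]
    model = LogisticRegression().fit(X, y)

    def similarity_score(a, b):
        # Probability that a human would call the two snippets "similar".
        return float(model.predict_proba([features(a, b)])[0][1])

In practice the hand-rolled features would be replaced with something stronger, but the training signal is the same: human judgements of similarity.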


It’s hardly different from reading those projects yourself and learning from them.


Learning from them would be fine; reproducing them as-is without abiding by the license is not, and that's where the difference lies.



