Ask HN: Open-Source GitHub Copilot?
28 points by ghoomketu on Nov 6, 2022 | 17 comments
If for some reason Copilot shuts down, would it be possible to home-brew it?

Some hurdles I see:

- GitHub rate-limits GET requests, so it doesn't seem possible to scrape all the source code there. But maybe it can be crowdsourced like SETI@home, with 1,000 people installing a program to get around this.
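A toy sketch of the SETI@home idea (all names here are hypothetical, not any real client): split the full repository list into work units so each volunteer's client only fetches its own slice, staying under its own per-account rate limit.

```python
def shard_repos(repos, n_workers):
    """Partition a list of repo names into n_workers roughly equal work units."""
    shards = [[] for _ in range(n_workers)]
    for i, repo in enumerate(repos):
        shards[i % n_workers].append(repo)  # round-robin assignment
    return shards

# Each volunteer client would clone only its assigned shard.
repos = [f"org/repo-{i}" for i in range(10)]
shards = shard_repos(repos, 3)
```

The hard part in practice isn't the partitioning but coordinating results and deduplicating, which is why a central index would still be needed.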

- Training the model. I imagine this would be the hardest part, as it could cost millions of dollars. Is there a way around that, or could free tools like Colab help?

- Running the API. Once the model is trained, would it be possible to run it on a typical Lenovo-style laptop? I'd guess you need a lot of VRAM.
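Back-of-the-envelope on the VRAM question (a rough rule of thumb, not a measurement): just holding the weights takes about parameters × bytes-per-parameter, before any activation or cache overhead.

```python
def weight_memory_gb(n_params_billion: float, bytes_per_param: int = 2) -> float:
    """Rough GB needed just to hold model weights (fp16 = 2 bytes/param)."""
    return n_params_billion * 1e9 * bytes_per_param / 1e9

# A 12B-parameter model in fp16:
print(weight_memory_gb(12))      # 24.0 GB -- more than most laptop GPUs have
# The same model quantized to 8 bits:
print(weight_memory_gb(12, 1))   # 12.0 GB
```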

Final question: would a home-brewed version be just as good? What factors determine that?

Just curious how we could do it, as I imagine there are a lot of ML experts here.




There is a 3 TB dataset, “The Stack”, which I believe is partly designed for this: all of the code is properly licensed.

Training the model would be expensive, but it’s a one-and-done process. With the model openly available, cloud providers could offer a subscription service to end users that recoups the cost of running it.

The only issue is that I imagine GitHub hosts far more than 3 TB of code.
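The curation step is the key difference from raw scraping: The Stack keeps only permissively licensed files. A toy sketch of such a filter (the manifest format and field names here are made up for illustration, not The Stack's actual schema):

```python
# Allowlist of permissive SPDX-style license identifiers (illustrative subset).
PERMISSIVE = {"mit", "apache-2.0", "bsd-2-clause", "bsd-3-clause", "unlicense"}

def filter_permissive(manifest):
    """Keep only files whose detected license is on the permissive allowlist."""
    return [f for f in manifest if f.get("license", "").lower() in PERMISSIVE]

manifest = [
    {"path": "a.py", "license": "MIT"},
    {"path": "b.py", "license": "GPL-3.0"},
    {"path": "c.py", "license": "Apache-2.0"},
]
kept = filter_permissive(manifest)
```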


BTW, this wouldn't solve Copilot's legal hurdles. The model needs to mention which license the code has, which AFAIK Amazon's competitor to Copilot already does.


If the model is found to be fair use (which it most likely will be), then the license doesn’t matter.

If the outputs of the model are found to be not covered by copyright in the first place due to established legal doctrine [0] then developers will not be liable for copyright infringement.

So the only reason for providing attribution will be as a product feature that some developers might want to use.

I personally don’t list every artist that has used an I-IV-V progression in one of their songs, and generally lump it into the recognition that my artistic foundation relies on preexisting culture. But hey, you do you.

[0] https://en.wikipedia.org/wiki/Abstraction-Filtration-Compari...


It would be hard to defend it as "transformative" while it's still possible to get it to spit out large blocks of verbatim (or near-verbatim) copies.

That is, the way GitHub describes Copilot working might pass as fair use. The way it (sometimes) works in real life will not.


The model itself is transformative and is considered separate from the outputs.

The outputs will always be a liability for a developer using the tool. So far the outputs are not covered by copyright due to the merger doctrine of the idea-expression distinction.

https://en.wikipedia.org/wiki/Idea%E2%80%93expression_distin...


Not sure I follow. Setting aside how the tool works internally, it's sold as a tool that outputs usable code for customers. If it outputs copyright-encumbered code (even occasionally), then Microsoft/GitHub is going to be liable for that.

I don't think an explanation of how it's okay because it's an AI model is going to impress a judge if the plaintiff shows long passages of verbatim copyrighted code coming out of it.


Here's an exhaustive explanation:

https://texaslawreview.org/fair-learning/

Here's some relevant case law:

https://en.wikipedia.org/wiki/Baker_v._Selden

https://en.wikipedia.org/wiki/Whelan_v._Jaslow

https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,_....

Here's some relevant legal doctrine:

https://en.wikipedia.org/wiki/Abstraction-Filtration-Compari...

https://en.wikipedia.org/wiki/Structure,_sequence_and_organi...

https://en.wikipedia.org/wiki/Idea–expression_distinction

The only thing I can say is that if a tool consistently output copyright-protected works, that burden would not be worth it for most people. But as I have yet to see a single output, from my own use or from Twitter, that would pass the filtration test, I am not worried about my personal liability.


> So the only reason for providing attribution will be as a product feature that some developers might want to use.

Not at all.

The reason for providing attribution is to create an incentive for anyone to even publish their work as FOSS in the first place.

We even make it explicit in our LICENSE files.

MIT license:

> The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

BSD 4-clause:

> Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.


That's not true for many people.

When working at Google, developers generally open-source their private projects (which are legally owned by Google) under a Google umbrella, with a Google copyright. This rarely stops developers. When I worked there, my incentive was to make my work more useful and give something back. I'm pretty sure many other Googlers feel the same way.

If GitHub found another way of making my work more useful to others, all the better. I would prefer it if Copilot weren't the only option, and if there were a good open-source alternative, but that's completely independent of the fact that my code was used to teach a neural network how to complete code snippets.


> The reason for providing attribution is to create an incentive for anyone to even publish their work as FOSS in the first place.

What? You think people develop open source simply to get attribution? Personally I publish open source because I made something others could use and I don't want them to waste time duplicating effort. I don't even care whether they list me or not; it doesn't factor into my decision to open source at all.


Yeah but those licenses don’t matter if it is fair use or if the sections of code were not covered by copyright in the first place.

I agree that if SSO [0] didn’t exist, these tools would create poor incentives for open-source software, but that is not what they accomplish. They are the poetic equivalents of rhyming dictionaries. Should I track down everyone who has rhymed “book” with “look” when I publish a new song?

[0] https://en.wikipedia.org/wiki/Structure,_sequence_and_organi...


Not a complete replacement, but very cool and related: https://github.com/webyrd/Barliman


You’ll need > 20 GB of GPU memory to run the model.

This is the same reason people can’t easily “play with” GPT-like models.

> would it be possible to run it on a lenovo type laptop?

No.

You might manage with a MacBook Pro M1 or M2 with 64 GB of unified memory; for pretty much any other laptop, categorically no.
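A rough fit check makes the point (weights only, ignoring activation memory and OS overhead; numbers are illustrative):

```python
def fits(n_params_billion, memory_gb, bytes_per_param=2):
    """True if the model's weights alone fit in the given memory budget (GB)."""
    return n_params_billion * bytes_per_param <= memory_gb

# fp16 weights: GB needed ~= billions of params * 2
print(fits(12, 24))   # True  -- a 24 GB GPU can hold a 12B fp16 model's weights
print(fits(12, 64))   # True  -- a 64 GB unified-memory Mac has headroom
print(fits(175, 64))  # False -- GPT-3-scale (175B) needs ~350 GB in fp16
```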

You’d have to rent / own a separate server with epic GPU power.

> Final question is will a home brewed version be just as good?

No.

The open-source language models are not as good as GPT-3.


This isn't really aimed at you, but at the whole idea of Copilot: how about people just write their own code?


I don't think most open-source devs want Copilot or a FOSS alternative, for this very reason:

Code-assist AI provides no attribution.

This removes engagement between devs and library authors, which ruins the chances of engaging new contributors over time, eroding and killing FOSS communities.

Code-assist AI also does not respect licenses. See [1]

1: https://www.bleepingcomputer.com/news/security/microsoft-sue...


“Most” is a stretch. I would say it’s more like a vocal minority, largely consisting of GPL proponents.



