Ask HN: Open-Source GitHub Copilot?
28 points by ghoomketu on Nov 6, 2022 | 17 comments
If for some reason Copilot shuts down, would it be possible to home-brew it?

Some hurdles I see:

- GitHub rate-limits GET requests, so it doesn't seem possible to scrape all the source code there. But maybe it can be crowdsourced like SETI@home, with 1,000 people installing a program to get around this.
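A toy sketch of the SETI@home idea (all names here are hypothetical, not any real client): split the full repository list into work units so each volunteer's client only fetches its own slice, staying under its own per-account rate limit.

```python
def shard_repos(repos, n_workers):
    """Partition a list of repo names into n_workers roughly equal work units."""
    shards = [[] for _ in range(n_workers)]
    for i, repo in enumerate(repos):
        shards[i % n_workers].append(repo)  # round-robin assignment
    return shards

# Each volunteer client would clone only its assigned shard.
repos = [f"org/repo-{i}" for i in range(10)]
shards = shard_repos(repos, 3)
```

The hard part in practice isn't the partitioning but coordinating results and deduplicating, which is why a central index would still be needed.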

- Training the model. I imagine this would be the hardest part, as it could cost millions of dollars. Is there a way around that, or could free tools like Colab help?

- Running the API. Once the model is trained, would it be possible to run it on a typical Lenovo-style laptop? I'd guess you need a lot of VRAM.
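Back-of-the-envelope on the VRAM question (a rough rule of thumb, not a measurement): just holding the weights takes about parameters × bytes-per-parameter, before any activation or cache overhead.

```python
def weight_memory_gb(n_params_billion: float, bytes_per_param: int = 2) -> float:
    """Rough GB needed just to hold model weights (fp16 = 2 bytes/param)."""
    return n_params_billion * 1e9 * bytes_per_param / 1e9

# A 12B-parameter model in fp16:
print(weight_memory_gb(12))      # 24.0 GB -- more than most laptop GPUs have
# The same model quantized to 8 bits:
print(weight_memory_gb(12, 1))   # 12.0 GB
```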

Final question: would a home-brewed version be just as good? What factors determine that?

Just curious how we could do it, as I imagine there are a lot of ML experts here.




There is a 3 TB dataset, “The Stack”, which I believe is partly designed for this: all of the code is properly licensed.

Training the model would be expensive, but it’s a one-and-done process. With the model openly available, cloud providers could offer a subscription service to end users that recoups the cost of running it.

The only issue is that I imagine GitHub hosts far more than 3 TB of code.
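The curation step is the key difference from raw scraping: The Stack keeps only permissively licensed files. A toy sketch of such a filter (the manifest format and field names here are made up for illustration, not The Stack's actual schema):

```python
# Allowlist of permissive SPDX-style license identifiers (illustrative subset).
PERMISSIVE = {"mit", "apache-2.0", "bsd-2-clause", "bsd-3-clause", "unlicense"}

def filter_permissive(manifest):
    """Keep only files whose detected license is on the permissive allowlist."""
    return [f for f in manifest if f.get("license", "").lower() in PERMISSIVE]

manifest = [
    {"path": "a.py", "license": "MIT"},
    {"path": "b.py", "license": "GPL-3.0"},
    {"path": "c.py", "license": "Apache-2.0"},
]
kept = filter_permissive(manifest)
```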


BTW, this wouldn't solve Copilot's legal hurdles. The model needs to mention which license the code has, which AFAIK Amazon's competitor to Copilot already does.


If the model is found to be fair use (which it most likely will be), then the license doesn’t matter.

If the outputs of the model are found to be not covered by copyright in the first place due to established legal doctrine [0] then developers will not be liable for copyright infringement.

So the only reason for providing attribution will be as a product feature that some developers might want to use.

I personally don’t list every artist that has used an I-IV-V progression in one of their songs, and generally lump it into the recognition that my artistic foundation relies on preexisting culture. But hey, you do you.

[0] https://en.wikipedia.org/wiki/Abstraction-Filtration-Compari...


It would be hard to defend it as "transformative" while it's still possible to get it to spit out large blocks of verbatim (or near-verbatim) copies.

That is, the way GitHub describes Copilot working might pass as fair use. The way it (sometimes) works in real life will not.


The model itself is transformative and is considered separate from the outputs.

The outputs will always be a liability for a developer using the tool. So far the outputs are not covered by copyright due to the merger doctrine of the idea-expression distinction.

https://en.wikipedia.org/wiki/Idea%E2%80%93expression_distin...


Not sure I follow. Setting aside how the tool works internally, it's sold as a tool that outputs usable code for customers. If it outputs copyright-encumbered code (even occasionally), then Microsoft/GitHub is going to be liable for that.

I don't think an explanation of how it's okay because it's an AI model is going to impress a judge if the plaintiff shows long passages of verbatim copyrighted code coming out of it.


Here's an exhaustive explanation:

https://texaslawreview.org/fair-learning/

Here's some relevant case law:

https://en.wikipedia.org/wiki/Baker_v._Selden

https://en.wikipedia.org/wiki/Whelan_v._Jaslow

https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,_....

Here's some relevant legal doctrine:

https://en.wikipedia.org/wiki/Abstraction-Filtration-Compari...

https://en.wikipedia.org/wiki/Structure,_sequence_and_organi...

https://en.wikipedia.org/wiki/Idea–expression_distinction

The only thing I can say is that if a tool consistently output copyright-protected works, that burden would not be worth it for most people. But as I have yet to see a single output, from my own use or from Twitter, that would pass the filtration test, I am not worried about my personal liability.


> So the only reason for providing attribution will be as a product feature that some developers might want to use.

Not at all.

The reason for providing attribution is to create an incentive for anyone to even publish their work as FOSS in the first place.

We even make it explicit in our LICENSE files.

MIT license:

> The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

BSD 4-clause:

> Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.


That's not true for many people.

When working at Google, developers generally open-source their private projects (which are legally owned by Google) under a Google umbrella, with a Google copyright. This rarely stops developers. When I worked there, my incentive was to make my work more useful and give something back. I'm pretty sure many other Googlers feel the same way.

If GitHub found another way of making my work more useful to others, all the better. I would prefer it if Copilot weren't the only option, and if there were a good open-source alternative, but that's completely independent of the fact that my code was used to teach a neural network how to complete code snippets.


> The reason for providing attribution is to create an incentive for anyone to even publish their work as FOSS in the first place.

What? You think people develop open source simply to get attribution? Personally I publish open source because I made something others could use and I don't want them to waste time duplicating effort. I don't even care whether they list me or not; it doesn't factor into my decision to open source at all.


Yeah but those licenses don’t matter if it is fair use or if the sections of code were not covered by copyright in the first place.

I agree that if SSO [0] didn’t exist, these tools would create poor incentives for open-source software, but that is not what they accomplish. They are the poetic equivalents of rhyming dictionaries. Should I track down everyone who has rhymed “book” with “look” when I publish a new song?

[0] https://en.wikipedia.org/wiki/Structure,_sequence_and_organi...


Not a complete replacement, but very cool and related: https://github.com/webyrd/Barliman


You’ll need > 20 GB of GPU memory to run the model.

This is the same reason people can’t easily “play with” GPT-like models.

> would it be possible to run it on a lenovo type laptop?

No.

You might manage with a MacBook Pro M1 or M2 with 64 GB of unified memory; for pretty much any other laptop, categorically no.
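A rough fit check makes the point (weights only, ignoring activation memory and OS overhead; numbers are illustrative):

```python
def fits(n_params_billion, memory_gb, bytes_per_param=2):
    """True if the model's weights alone fit in the given memory budget (GB)."""
    return n_params_billion * bytes_per_param <= memory_gb

# fp16 weights: GB needed ~= billions of params * 2
print(fits(12, 24))   # True  -- a 24 GB GPU can hold a 12B fp16 model's weights
print(fits(12, 64))   # True  -- a 64 GB unified-memory Mac has headroom
print(fits(175, 64))  # False -- GPT-3-scale (175B) needs ~350 GB in fp16
```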

You’d have to rent / own a separate server with epic GPU power.

> Final question is will a home brewed version be just as good?

No.

The open-source language models are not as good as GPT-3.


This isn't really aimed at you, but at the whole idea of Copilot: how about people just write their own code?


I don't think most open-source devs want Copilot or a FOSS alternative, for this very reason:

Code-assist AI provides no attribution.

This removes engagement between devs and library authors, which ruins the chances of engaging new contributors over time, eroding and killing FOSS communities.

Code-assist AI also does not respect licenses. See [1]

1: https://www.bleepingcomputer.com/news/security/microsoft-sue...


“Most” is a stretch. I would say it’s more like a vocal minority, largely consisting of GPL proponents.



