FauxPilot – an attempt to build a locally hosted version of GitHub Copilot (github.com/moyix)
422 points by fniephaus on Aug 3, 2022 | hide | past | favorite | 79 comments



Bravo! I'm so glad this is born. I'm curious about its model, SalesForce CodeGen. Does it train on all the public repos on GitHub? Does Copilot have access to private repos that CodeGen cannot access?

Also, it would be really cool if I can personalize FauxPilot by feeding it with all my repos on GitHub. Sometimes I just need to reimplement a function I've written before but it's really hard to find where my old code is.


It is possible to fine-tune CodeGen using Huggingface Transformers! Then you'd be able to fine-tune it on your own code and use the resulting model. However, training is more expensive -- you'd need an A6000 or better to train the 6B model. Something like the following should work:

    deepspeed --num_gpus 1 --num_nodes 1 run_clm.py \
        --model_name_or_path=Salesforce/codegen-6B-multi \
        --per_device_train_batch_size=1 \
        --learning_rate 2e-5 \
        --num_train_epochs 1 \
        --output_dir=./codegen-6B-finetuned \
        --dataset_name your_dataset \
        --tokenizer_name Salesforce/codegen-6B-multi \
        --block_size 2048 \
        --gradient_accumulation_steps 32 \
        --do_train --fp16 --overwrite_output_dir \
        --deepspeed ds_config.json
Where run_clm.py is this script: https://github.com/huggingface/transformers/blob/main/exampl...
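For reference, a minimal ds_config.json might look something like the following. This is a hedged sketch, not the author's actual config: it assumes ZeRO stage 2 with CPU optimizer offload, and the "auto" values rely on the Hugging Face Transformers DeepSpeed integration filling them in from the Trainer arguments. Tune for your own GPU memory budget.

```python
import json

# Sketch of a DeepSpeed config (assumption: ZeRO stage 2 + CPU optimizer
# offload). "auto" values are resolved by the HF Transformers integration.
ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```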

It might be doable to set this up on an AWS machine with a beefy GPU or two. I haven't tried it yet though.

Once you have a model trained in Huggingface Transformers you'd be able to convert it using this script:

https://github.com/moyix/fauxpilot/blob/main/converter/huggi...


I train models 24/7 right now and PLEASE do not use AWS for it. You're going to pay out of your backside for it.

Better alternatives: Google Colab, Paperspace Gradient, Lambdalabs Cloud, Vultr GPU instances

Colab will give you a T4, K80, V100, or P100 (or their own TPUs) for free, up to $50 for 24h uninterrupted background jobs. Gradient will give you free A6000s and sometimes even free A100s on a $40 subscription for 6 hours at a time (repeatable ad infinitum). Lambda Labs gives you an RTX 6000 for $0.50/hour and an A6000 for $0.80/hour, and Vultr GPU will give you 1/7th of an A100 for $0.37/hour.


Thank you for sharing the command for finetuning! Is it possible to share your ds_config.json? I tried to finetune the 2B model on A100 (40GB) using your command, but got a CUDA out of memory error. The ds_config I used was the one from huggingface (https://github.com/huggingface/transformers/blob/main/tests/...).


A friend of mine runs Sushi Cloud (https://www.sushi.cloud/), which could help make things cheaper than AWS for training purposes.


I can't see how this is relevant to the discussion. There is no mention of GPU instances in the first place.


How do I create a dataset?


Have a look at the datasets library [1], but as a shortcut, you can just create a file named "my_code.json" in jsonlines format with one line per source file that looks like:

   {"text": "contents_of_source_file_1"}
   {"text": "contents_of_source_file_2"}
   ...
And then pass that my_code.json as the dataset name.

[1] https://github.com/huggingface/datasets
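As a sketch of how you might generate that file automatically (the directory layout and .py extension filter here are just illustrative assumptions, not part of the original instructions):

```python
import json
from pathlib import Path

# Collect source files into a jsonlines dataset, one {"text": ...} record
# per file, matching the format described above. The source directory and
# extension filter are example choices.
def build_dataset(src_dir, out_path, ext=".py"):
    count = 0
    with open(out_path, "w") as out:
        for path in sorted(Path(src_dir).rglob("*" + ext)):
            text = path.read_text(errors="ignore")
            out.write(json.dumps({"text": text}) + "\n")
            count += 1
    return count
```

The resulting my_code.json can then be passed as the --dataset_name argument.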


> I just need to reimplement a function I've written before

There was a project here some time back that let you call a function by its hash. This avoided duplication. It was an entirely different paradigm for resolving dependencies.

Does anyone remember its name? It was controversial here, but interesting nonetheless.



Yeah this is it!


A comment below mentions those details are in the paper:

https://news.ycombinator.com/item?id=32328168


Copilot only accesses the private repositories for the user it is authenticated with.

I'm fine with them accessing public repositories, as long as they respect the license -- which they probably aren't doing for private ones?

On a related note, Google apparently mixes all of its free customers' data into the algorithm, but apparently not the data from accounts on Google Suite (paid business accounts); those instead get their own recommendations built on a per-user basis, with the public algorithm mixed in.


> I'm fine with them accessing public repositories, as long as they respect the license.

They don't respect the license.


For you and all my sibling comments here. Are you guys forgetting GitHub IS Microsoft? Do you guys actually believe Microsoft is respecting privacy on repositories? Did you read the EULA after the acquisition? C'mon, let's be serious!


Tangential, but maybe someone here knows: how complicated would it be to implement an autocompleter that just understands syntax + patterns and can offer me suggestions based on my own stuff? What is the simplest version of Copilot that doesn't require huge amounts of training but just does a decent job at recognizing tokens and filling in the structure based on some input code? E.g. at `app.get('/',|` I should get an express.js handler autocomplete like `(req, res, next) => {}`, maybe with multiple choices grepped from my own data, ranked by occurrences. Is this so extreme that it needs a multi-billion-parameter AI model? Does anything like this exist? Like an auto-snippet thing, but contextual and token-replacing.


Doing this in a way that's actually useful is hard. Microsoft has had IntelliSense, and JetBrains has autocomplete in IntelliJ and its other products. Both have big teams and decades of work behind them, and they still aren't great, hence the ML approach being tried now.


Is it really that difficult?

If I type app.get('/', then looking for literal occurrences and presenting me a menu of all the literal completions I have used before (no ML required at all) would already be a huge win.
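A crude version of that idea (no ML at all: literal prefix matching over previously written lines, ranked by frequency) could be sketched like this; the history list is an invented example:

```python
from collections import Counter

# Suggest completions for a prefix by scanning previously written lines
# and ranking the literal continuations by how often they occurred.
def suggest(prefix, corpus_lines, top_n=3):
    continuations = Counter()
    for line in corpus_lines:
        stripped = line.strip()
        if stripped.startswith(prefix) and len(stripped) > len(prefix):
            continuations[stripped[len(prefix):]] += 1
    return [c for c, _ in continuations.most_common(top_n)]

# Example history: past Express-style handlers from your own code.
history = [
    "app.get('/', (req, res, next) => {});",
    "app.get('/', (req, res) => res.send('ok'));",
    "app.get('/', (req, res, next) => {});",
]
print(suggest("app.get('/',", history))
```

The most frequently used continuation comes back first, which is exactly the "menu of literal completions" described above.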


What if it's not called `app` – in larger codebases there's often a layer of indirection with different naming.

What if it's not indented the same amount, is it the same?

What if it's not a GET, but a POST? Both use the req/res handlers so you'd want the prediction for both.

What if it's not for the path /, but for some other path. As you can only have one handler for the root, most handlers will be for different paths.

Maybe you can write these edge cases in for the Express situation and get req/res prediction, but that's a bunch of work to automate ~10 characters and it only works for this specific use-case. It doesn't work for any other languages, frameworks, libraries, or use-cases.

There are two ways to do this well: 1) static analysis of the code to understand what is allowed to come next, and predicting based on that (IntelliJ, IntelliSense, etc.), or 2) ML or statistical analysis to determine what's likely to come next without knowing much about the rules (Copilot, Tabnine, IntelliCode). Both of these approaches are hard to do right, but have a high payoff.


Emacs and vim do this out of the box


Vim does not do this "out of the box".


It does it on the "word" level: https://stackoverflow.com/a/52635099/2958070


Yes, I use this a lot.

One problem though is that sometimes I have a sub-folder with lots of (unrelated) files and Vim starts searching it, basically hanging the editor with no way to stop the operation.


That's exactly the thing I would be looking for, too.

Especially since I'm working in proprietary niche language, it would be so great to have my own code to be the corpus of the model.

It could be so easy: stuff your codebase into a model generator and out comes some sensible autocomplete. Well, at least I hope we will see something like this soon.


If you would like a custom model trained on your code (or language), this is one of Tabnine's unique capabilities. We have done this on lots of code already and do it all the time for companies across the globe. Your code is your code. In the interest of transparency: I am with Tabnine.


I've tested Tabnine but found it slowed my whole PC down to barely usable. Maybe this issue has been fixed and I can have another look. It also seemed to me that the suggestions were only based on the existing code, whereas GitHub's Copilot appears to be "smarter".


The plugin TabNine does what you're looking for. I've been using it enjoyably for about two years.


Clarification - a user can run Tabnine on their laptop only, OR, as part of Tabnine Enterprise, we can run the large cloud models in your VPC (on GPUs) for your entire team of developers. Your code is your code and you can run it anywhere.


It sends your code to the cloud though.


Well, it _can_ but that option is configurable.


How effective is it without cloud access?


I'm pleasantly surprised at its effectiveness!


For sane, modern languages like C# and Java, autocomplete has worked somewhere between adequate and amazingly well for a long time already.

I think it was kind of usable in VB6 too, at least that's how I remember it.

Golang and TypeScript seem to have good support now too.

But for just the thing you mention above, you could go a long way with live templates (or whatever they're called in your IDE), or just text templates with variables like the Rails guys used to use in their demos.


The latest version of Sublime Text works kind of like that. It knows which tokens most commonly appear together in the folders you currently have open in the sidebar and prioritizes the suggestions popup based on that. Seems to work pretty well for me.

It's a relatively new feature - within the last year I think.


Next step is to train a model exclusively on leaks of proprietary Microsoft source code. Fair use, right Microsoft?


It probably wouldn't produce good code though :D


Man that would be so ironic.

And if they want to sue for that, they will be shooting themselves in the foot by proving their fair-use argument isn't real.


Awesome work! I made a similar project: a cost-effective and privacy-focused alternative to OpenAI text generation that can be switched to in one line, because it's API-compatible: https://text-generator.io/blog/over-10x-openai-cost-savings-... It works for generating code too, so I'd be excited for someone to try that out as well.


Does it work with natural language to sql?


Yep. When doing autocomplete it helps to include your real table schema (CREATE TABLE statements); there's an example in the playground of Python code autocomplete, which can be changed to SQL or to natural language indicating that some SQL is expected. You can also try a comment with the file name, e.g.

### migrations/user_add_free_credit.sql adds free credits field to user table
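Prompt construction along those lines might look like the following sketch. The schema, file name, and trailing primer are invented examples, not the service's actual API:

```python
# Build a code-completion prompt that primes the model with a real table
# schema plus a filename comment hinting that SQL output is expected.
# Everything below is an invented example for illustration.
def sql_prompt(schema, task):
    return (
        schema + "\n\n"
        "### migrations/" + task + ".sql "
        + task.replace("_", " ") + "\n"
        "ALTER TABLE"  # primer so the completion continues as SQL
    )

schema = "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT);"
prompt = sql_prompt(schema, "user_add_free_credit")
print(prompt)
```

The completion endpoint would then continue the prompt from "ALTER TABLE", grounded in the schema above.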


What's the license of the project? There is no license file in the repo


Fixed! Didn't expect this to hit HN so fast :)


Love the speed you're moving at!


This is awesome!

How is the quality vs. GitHub Copilot for Python or JavaScript?


The largest Python model (codegen-16B-mono) should be pretty competitive! The biggest thing that it lacks is the ability to make use of context after the cursor. Unfortunately, that capability will require retraining the models:

https://arxiv.org/abs/2207.14255


Interesting. Lots of models are already trained with MLM (masked language modelling), so you can iteratively add mask tokens in the middle, but you need some rules around termination, such as the probability/length of the mask tokens...

I'd be interested in how to convert models without retraining, or with minimal retraining; I'd have thought it would just work with iterative mask tokens, at least for models trained with MLM.
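The iterative-masking idea could be sketched like this. It's a toy: `predict` is a stub standing in for a real masked language model, and termination here is just a stop signal plus a length cap, not a learned rule:

```python
# Toy sketch of infilling by repeatedly asking a model for the next token
# of the masked middle span. `predict` is a stub standing in for a real
# MLM; a real setup would need a proper stopping rule (e.g. an end token).
def infill(prefix, suffix, predict, max_tokens=8):
    middle = []
    for _ in range(max_tokens):
        token = predict(prefix + " ".join(middle), suffix)
        if token is None:  # model signals the gap is filled
            break
        middle.append(token)
    return prefix + " ".join(middle) + suffix

# Stub predictor: emits a fixed token sequence, then stops.
tokens = iter(["(req,", "res)", "=>", "res.send('ok')"])
stub = lambda pre, suf: next(tokens, None)
print(infill("app.get('/', ", ");", stub))
```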


This is excellent. Please do suck the wind from Microsoft's exploitative sails here.

Copilot primarily exists as a way for Microsoft to end-run around the GPL.


I wonder what code corpus the SalesForce CodeGen model (which this uses) was trained on.


From the paper[1]:

> The family of CODEGEN models is trained sequentially on three datasets: THEPILE, BIGQUERY, and BIGPYTHON.

> The natural language dataset THEPILE is an 825.18 GiB English text corpus collected by Gao et al. (2020) for language modeling. The dataset is constructed from 22 diverse high-quality subsets, one of which is programming language data collected from GitHub repositories with >100 stars that constitute 7.6% of the dataset.

> The multi-lingual dataset BIGQUERY is a subset of Google’s publicly available BigQuery dataset, which consists of code in multiple programming languages such as C, Python, Ruby. For the multi-lingual training, the following 6 programming languages are chosen: C, C++, Go, Java, JavaScript, and Python.

> The mono-lingual dataset BIGPYTHON contains a large amount of data in the programming language, Python. We have compiled public, non-personal information from GitHub consisting of permissively licensed Python code in October 2021.

[1]: https://arxiv.org/abs/2203.13474


With model sizes starting at 2GB, wouldn't the model be hopelessly overdetermined for small codebases?


The models aren't customized for individual codebases; all of them were trained on most of GitHub.


Right... thanks for correcting me.


I would love to build a free and oss clone of GitHub using things like this. Kudos to moyix


It might be interesting to look into existing projects like Gitea or SourceHut. Gitea specifically seems to have a similar look and feel as GitHub.


Oof. GitLab is really the only contender to GitHub. Both Gitea and sr.ht lack search features, so we're really starting from square one here.


If you're working with code that fits on a single machine I recommend using ripgrep for brute force search - it's shockingly fast.

I built my own web frontend for it a while back: https://simonwillison.net/2020/Nov/28/datasette-ripgrep/


This is legit! I will peruse this during my free time tomorrow and see how to integrate into osshub


Square one is great then. Let's move to square two "feature complete" by implementing search :-)

Honestly, Gitea's feature set is quite impressive. With sr.ht I'm less familiar.

Of course GitLab is another option. I didn't mention it because it is well known, and probably less interesting / approachable to someone who would like to start from scratch, because it is more complex and there's its open-core aspect.


Hahaha okay! I actually replaced gitlab with gitea a few years ago. Mainly due to a conflict of interest but the sheer speed of gitea won me over. Look out for an email from me to coordinate search implementation :D


This is awesome and I'm looking forward to seeing more from this.


Maybe this is a dumb question, what would be the most common use-case for this?


Any company that is not allowed to share code with third parties, whether by internal or customer policy. That should be a lot of companies.


Or just not wanting to live in a surveillance-first, service-first world.


Well, do you prefer services over inspectable code you can run yourself? I don't. In fact, I strongly dislike having to rely on services.


Man, I finally got a 3080 and now you're telling me I need a second?!

Haha, seriously though, I'm kinda sad the model size jumps from 2GB straight to 13GB.


In theory the 2B model should be somewhere in between, whenever it gets fixed.


Assuming that it's just that hardcoded check I should be able to add it today. But if that check is load-bearing (i.e. if it relies on those assumptions elsewhere in the code) it could be a bit more painful.

Edit: sadly, it's not that easy. Removing that check lets the 2B model load, but it produces gibberish. I've opened an issue with FasterTransformer here, and will also try to debug it further myself, but unfortunately it's not obvious how they're using that assumption. https://github.com/NVIDIA/FasterTransformer/issues/268


Got it working :) You can now use the 2B models in FauxPilot as well.


You probably don't need physical GPUs; you can rent them in some cloud for a few days.


You can get it running with Resizable BAR.


Such a clever name :)


This is awesome! Will be trying it out. Hope you're doing well :D


I don't understand the point of dedicating so many resources just so that you can have a Clippy add-on second-guess your code and make bad suggestions.


Copilot is basically copying-from-stack-overflow on steroids. Maybe it should be called Copy-a-lot.

Whatever. Regardless of copying mechanisms I expect my fellow developers to understand every suggestion made by such tools.


Friendly reminder: Stack Overflow content is licensed under the Creative Commons ShareAlike license[0]. If you copy and paste from SO, you are required to attribute the original author in your code.

I don't think anybody does this. Let's also ignore the fact that people share licensed code (GPL, for example) as SO answers.

[0] https://stackoverflow.com/help/licensing


Copilot has given me great utility and increased my productivity a lot. I constantly get surprised and amazed by the suggestions. A self-hosted version would be awesome.


Especially with model tailoring and utilization of proprietary code bases. Really, models may want to overweight those, at the user's discretion.


You have to start somewhere and the AI suggestions are already a big time saver.


It will not end there. It began with Python, when they removed "offensive" words such as "kill". Then it was GitHub, then Git itself, that removed the word "master".

Now GitHub / Microsoft, who produce tools that integrate themselves more and more into programmers' workflows, will have even more opportunities to enforce this kind of fringe ideology.

Dystopian predictions; the following words will be replaced:

Parent / Child, Inheritance, Class, Binary, Invalid



