Bravo! I'm so glad this exists. I'm curious about its model, Salesforce CodeGen. Does it train on all the public repos on GitHub? And does Copilot have access to private repos that CodeGen can't access?
Also, it would be really cool if I could personalize FauxPilot by feeding it all my GitHub repos. Sometimes I just need to reimplement a function I've written before, but it's really hard to find where my old code is.
It is possible to fine-tune CodeGen using HuggingFace Transformers! You'd be able to train it on your own code and use the resulting model. However, training is more expensive -- you'd need an A6000 or better to fine-tune the 6B model. Something like the following should work:
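(A sketch using the stock run_clm.py example script from HuggingFace Transformers with DeepSpeed; the flags are the standard Trainer ones, but the hyperparameters here are illustrative, not tuned.)

```sh
# Sketch only: fine-tune CodeGen on your own code with DeepSpeed offload.
deepspeed --num_gpus=1 run_clm.py \
    --model_name_or_path Salesforce/codegen-6B-mono \
    --train_file my_code.json \
    --do_train \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --learning_rate 1e-5 \
    --num_train_epochs 1 \
    --fp16 \
    --deepspeed ds_config.json \
    --output_dir ./codegen-finetuned
```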
Colab will give you a T4, K80, V100, or P100 (or their own TPUs) for free, or $50 for 24h uninterrupted background jobs.
Gradient will give you free A6000s and sometimes even free A100s with a $40 subscription, in 6-hour sessions (repeatable ad infinitum).
Lambda Labs gives you an RTX 6000 for $0.50/hour and an A6000 for $0.80/hour.
Vultr will give you 1/7th of an A100 for $0.37/hour.
Thank you for sharing the command for fine-tuning! Is it possible to share your ds_config.json? I tried to fine-tune the 2B model on an A100 (40GB) using your command, but got a CUDA out-of-memory error. The ds_config I used was the one from HuggingFace (https://github.com/huggingface/transformers/blob/main/tests/...).
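(For reference, a minimal ZeRO stage-3 sketch with CPU offload, the usual remedy for OOM at this model size -- not necessarily the config used above. The "auto" values are filled in by the HuggingFace Trainer integration.)

```json
{
  "fp16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "train_batch_size": "auto"
}
```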
Have a look at the datasets library [1], but as a shortcut, you can just create a file named "my_code.json" in jsonlines format with one line per source file that looks like:
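(A hedged sketch: the "text" key is the column name run_clm.py picks up by default, and the file contents here are placeholders.)

```
{"text": "import os\n\ndef find_old_code(path):\n    ...\n"}
{"text": "def parse_args():\n    ...\n"}
```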
> I just need to reimplement a function I've written before
There was a project here some time back that let you call a function by its hash. That avoided duplication. It was an entirely different paradigm for resolving dependencies.
Does anyone remember its name? It was controversial here, but it was interesting nonetheless.
Copilot only accesses the private repositories for the user it is authenticated with.
I'm fine with them accessing public repositories, as long as they respect the license. Which they probably aren't doing for private ones?
On a related note, Google apparently mixes all of their free customers' data into the algorithm, but apparently not data from paid Google Workspace (business) accounts; those get their own recommendations built on a per-user basis, with the public algorithm mixed in.
For you and all my sibling comments here. Are you guys forgetting GitHub IS Microsoft? Do you guys actually believe Microsoft is respecting privacy on repositories? Did you read the EULA after the acquisition? C'mon, let's be serious!
Tangential, maybe someone here knows: how complicated would it be to implement an autocompleter that just understands syntax + patterns and can offer me suggestions based on my own stuff? Like, what is the simplest version of Copilot that doesn't require huge amounts of training but just does a decent job of recognizing tokens and trying to fill in the structure based on some input code. E.g.: `app.get('/',|` <- at this point I should get an express.js handler autocomplete like `(req, res, next) => {}`, maybe with multiple choices grepped from my own data and ranked by occurrences. Or is this so extreme that it needs a multibillion-parameter AI model to achieve? Does anything like this exist? Like an auto-snippet thing, but contextual and token-replacing.
To do this in a way that's actually useful is hard. Microsoft has had IntelliSense, and JetBrains has autocomplete in IntelliJ and other products. Both have big teams and decades of work put into them, and still aren't great, hence the ML approach being tried now.
If I type app.get('/', then looking for literal occurrences and presenting me a menu of all the literal completions I have used before (no ML required at all) would already be a huge win.
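Something as dumb as this sketch (hypothetical, untested) is what I have in mind:

```python
# Toy sketch (no ML): index what followed each literal prefix in your own
# files, then rank completions by how often you wrote them before.
from collections import Counter, defaultdict
from pathlib import Path

PREFIX_LEN = 12  # "app.get('/'," happens to be exactly 12 characters

index = defaultdict(Counter)

def build_index(root, completion_len=30):
    for path in Path(root).rglob("*.js"):
        text = path.read_text(errors="ignore")
        for i in range(len(text) - PREFIX_LEN):
            prefix = text[i : i + PREFIX_LEN]
            completion = text[i + PREFIX_LEN : i + PREFIX_LEN + completion_len]
            index[prefix][completion] += 1

def suggest(typed, n=5):
    # Rank literal continuations of the last PREFIX_LEN typed characters.
    return index[typed[-PREFIX_LEN:]].most_common(n)

build_index("./my-projects")
print(suggest("app.get('/',"))
```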
What if it's not called `app` – in larger codebases there's often a layer of indirection with different naming.
What if it's not indented the same amount, is it the same?
What if it's not a GET, but a POST? Both use the req/res handlers so you'd want the prediction for both.
What if it's not for the path /, but for some other path. As you can only have one handler for the root, most handlers will be for different paths.
Maybe you can write these edge cases in for the Express situation and get req/res prediction, but that's a bunch of work to automate ~10 characters and it only works for this specific use-case. It doesn't work for any other languages, frameworks, libraries, or use-cases.
There are two ways to do this well: 1) static analysis of code to understand what is allowed to come next, and predicting based on that (IntelliJ, IntelliSense, etc.), or 2) ML or statistical analysis to determine what's likely to come next without knowing much about the rules (Copilot, Tabnine, IntelliCode). Both of these approaches are hard to do right, but have a high payoff.
One problem, though, is that sometimes I have a sub-folder with lots of (unrelated) files and Vim starts searching it, basically hanging the editor with no way to stop the operation.
That's exactly the thing I would be looking for, too.
Especially since I'm working in a proprietary niche language, it would be so great to have my own code be the corpus for the model.
It could be so easy: stuff your codebase into a model generator and out comes some sensible autocomplete. Well, at least I hope we will see something like this soon.
If you would like a custom model trained on your code (or language), this is one of the unique capabilities of Tabnine. We have done this on lots of code already, and we do it all the time for companies across the globe. Your code is your code. For transparency: I'm with Tabnine.
I've tested Tabnine but found it slowed my whole PC down to barely usable. Maybe this issue has been fixed and I can take another look at it. Also, it seemed to me that the suggestions were only based on the existing code, whereas GitHub's Copilot appears to be "smarter".
Clarification: a user can run Tabnine on their laptop only, OR, as part of Tabnine Enterprise, we can run the large cloud models in your VPC (on GPUs) for your entire team of developers. Your code is your code and you can run it anywhere.
For sane, modern languages like C# and Java, autocomplete has worked somewhere between adequately and amazingly well for a long time already.
I think it used to be kind of usable in VB6 too, at least that's how I remember it.
Go and TypeScript seem to have good support now too.
But for just the thing you mention above, you could go a long way with live templates (or whatever it's called in different IDEs), or just text templates with variables like the Rails folks used to use in their demos.
The latest version of Sublime Text works kind of like that. It knows the tokens that most commonly appear together in the folders you currently have open in the sidebar and prioritizes the suggestions popup based on that. Seems to work pretty well for me.
It's a relatively new feature - within the last year I think.
Awesome work! I made a similar project: a cost-effective and privacy-focused alternative to OpenAI text generation that can be switched to in one line because it's API-compatible: https://text-generator.io/blog/over-10x-openai-cost-savings-... It also works for generating code, so I'd be excited for someone to try that out too.
Yep, when doing autocomplete it helps to include your real table schema (CREATE TABLE statements). There's an example of Python code autocomplete in the playground, which can be changed to SQL, or to natural language indicating some SQL is expected. You can also try a comment with the file name, e.g.
### migrations/user_add_free_credit.sql adds free credits field to user table
The largest Python model (codegen-16B-mono) should be pretty competitive! The biggest thing it lacks is the ability to make use of context after the cursor. Unfortunately, that capability will require retraining the models.
Interesting. Lots of models are already trained with MLM (masked language modeling), so you could iteratively add mask tokens in the middle, but you'd need some rules around termination, such as the probability or length of the mask tokens...
I'd be interested in how to convert models without retraining, or with minimal retraining. I'd have thought it would just work with iterative mask tokens, at least for models trained with MLM.
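Something like this toy sketch of the iterative-masking idea (roberta-base is just a stand-in; CodeGen itself is causal, so this only illustrates the MLM case, and the stopping rule is a guess):

```python
# Toy sketch: fill one mask at a time with a masked LM, stopping when the
# model's confidence in the next token drops below a threshold.
from transformers import pipeline

fill = pipeline("fill-mask", model="roberta-base")

def infill(prefix, suffix, max_tokens=10, min_prob=0.2):
    middle = ""
    for _ in range(max_tokens):
        # Top-ranked candidate for a single mask inserted in the middle.
        best = fill(prefix + middle + fill.tokenizer.mask_token + suffix)[0]
        if best["score"] < min_prob:  # termination rule: low-confidence token
            break
        middle += best["token_str"]
    return middle

print(infill("def add(a, b):\n    return a", " b\n"))
```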
> The family of CODEGEN models is trained sequentially on three datasets: THEPILE, BIGQUERY, and BIGPYTHON.

> The natural language dataset THEPILE is an 825.18 GiB English text corpus collected by Gao et al. (2020) for language modeling. The dataset is constructed from 22 diverse high-quality subsets, one of which is programming language data collected from GitHub repositories with >100 stars that constitute 7.6% of the dataset.

> The multi-lingual dataset BIGQUERY is a subset of Google’s publicly available BigQuery dataset, which consists of code in multiple programming languages such as C, Python, Ruby. For the multi-lingual training, the following 6 programming languages are chosen: C, C++, Go, Java, JavaScript, and Python.

> The mono-lingual dataset BIGPYTHON contains a large amount of data in the programming language, Python. We have compiled public, non-personal information from GitHub consisting of permissively licensed Python code in October 2021.
Square one is great then. Let's move to square two "feature complete" by implementing search :-)
Honestly, Gitea's feature set is quite impressive. I'm less familiar with sr.ht.
Of course GitLab is another option. I didn't mention it because it is well known, and probably less interesting / approachable to someone who would like to start from scratch, because it is more complex and there's this open core aspect.
Hahaha okay! I actually replaced GitLab with Gitea a few years ago. Mainly due to a conflict of interest, but the sheer speed of Gitea won me over. Look out for an email from me to coordinate search implementation :D
Assuming that it's just that hardcoded check I should be able to add it today. But if that check is load-bearing (i.e. if it relies on those assumptions elsewhere in the code) it could be a bit more painful.
Edit: sadly, it's not that easy. Removing that check lets the 2B model load, but it produces gibberish. I've opened an issue with FasterTransformer here, and will also try to debug it further myself, but unfortunately it's not obvious how they're using that assumption. https://github.com/NVIDIA/FasterTransformer/issues/268
I don't understand the point of dedicating so many resources just so that you can have a Clippy add-on second-guess your code and make bad suggestions.
Friendly reminder: Stack Overflow content is licensed under the Creative Commons ShareAlike license[0]. If you copy and paste from SO, you are required to attribute the original author in your code.
I don’t think anybody does this. Let's also ignore the fact that people share licensed code (GPL, for example) as SO answers.
Copilot has given me great utility and increased my productivity a lot. I constantly get surprised and amazed by the suggestions. A self-hosted version would be awesome.
It will not end there. It began with Python, when they removed "offensive" words such as "kill". Then it was GitHub, and then Git itself, removing the word "master".
Now GitHub / Microsoft, who are producing tools that integrate themselves more and more into programmers' workflows, will have more opportunities to enforce this kind of fringe ideology.
Dystopian predictions; the following words will be replaced: