I stumbled upon this repo by accident a few days ago when its source code appeared to contain the only usage of the term "GH_COPILOT_TOKEN" in any repo on GitHub: https://github.com/search?q=GH_COPILOT_TOKEN&type=code
(My Copilot was broken and this was in the error output I was seeing, see: https://github.com/community/community/discussions/41878)
What I found there was some truly impressive reverse-engineering work by a single individual. I really like the "JOURNAL" daily-diary they kept of progress and random thoughts so you could see the progression day-by-day.
--------
One thing I found interesting: the author says that it queries only the 20 most recently opened files of the same language.
But in an AMA (https://github.com/orgs/community/discussions/29932#discussi...), I asked about how much "context" Copilot has available, and one of the devs said it can, for example, read header files that pair with C/C++ files that are open in separate tabs:
> "I assume Copilot uses the contents of the current project (IE, all files) as contextual information to offer suggestions. Is this the case?"
> "Yes, copilot looks at what we call "related tabs", for example .c files are often paired with .h files, so copilot considers the contents of other tabs open in the editor to try to feed a more complete prompt to the model."
Hey Gavin, that's consistent with what I said in the blog post. Header files (.h) and source files (.c) are classified as having the same language (C). As a result, if you have the header file open in a tab (or in the workspace -- I'm not sure if unaccessed files from the workspace are used), then it'll be considered for the prompt.
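For anyone wondering how .h and .c end up in the same bucket, the selection described in the post boils down to something like this (a hypothetical sketch with made-up names, not the actual extension source):

    type OpenDoc = { path: string; lastAccessed: number; text: string };

    const languageOf = (path: string): string => {
      if (path.endsWith(".c") || path.endsWith(".h")) return "c"; // headers share the C language id
      if (path.endsWith(".ts")) return "typescript";
      return "plaintext";
    };

    function neighborCandidates(openDocs: OpenDoc[], current: OpenDoc): OpenDoc[] {
      return openDocs
        .filter(d => d.path !== current.path)
        .filter(d => languageOf(d.path) === languageOf(current.path))
        .sort((a, b) => b.lastAccessed - a.lastAccessed) // most recently accessed first
        .slice(0, 20);                                    // the 20-file cap mentioned above
    }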
Heh this blew up here :D Didn't know till a friend told me about it.
I'd love to know if you guys have any specific questions about copilot's internals that I can try to answer by staring at the code or if you have any feedback for the tool/post!
The neovim plugin actually mostly communicates with a Node.js service, seen here (https://github.com/github/copilot.vim/tree/release/copilot/d...). This is why they require you to install Node to use the plugin, and it lets them share logic with the VSCode extension (also in JavaScript). I think all the features should be available even in neovim.
Honestly I get more value out of ChatGPT than I do from Copilot, even though both generate the wrong stuff now and then. But I'd rather describe the desired functionality in plain English than try to goad Copilot in the right direction by coming up with method and variable names that it will "like".
Yeah, I second what bibabloo said. Sometimes when I want to edit something, I comment out the existing thing and start writing what'd be the new version. Then Copilot autocompletes what I have in mind (often enough).
Amazing if this is only a 12B model. If this already increases coding productivity by up to 50% (depending on the kind of work), imagine what a 1T model will be capable of! I do wonder if some programmers at FAANG already have access to way more powerful coding assistants, and whether they code much at all at this point, or only write high-level code specifications and then fix up the automatically generated code.
> If this already increases coding productivity by up to 50% (depending on kind of work)
Does anyone believe that?
edit: I'm surprised to see that (so far) 3 replies actually agree with the statement. Is there a video that you'd recommend that shows realistic usage and gain from copilot? Maybe a livestream or something.
Agreed with this. If the main bottleneck is typing, then Copilot can dramatically speed up the process. If the bottleneck is thinking, it doesn't help out nearly as much unfortunately.
I'd add that for me at least it's quite good at some small specific subsets of "requires me to think". For example, I do a lot of 3d rotations & transformations, and it's very good at figuring out the math of that based on the function name I chose etc. Most of those would take me a piece of paper and 5-10 mins, but it usually gets it in 1 or 2 tries.
But yes, mundane work it is best at. Some things I have found it made particularly easy:
- scraping websites
- file i/o
- "mirroring" things (I write a bunch of code for doing something on x axis, it automatically replicates it for y and z etc with the right adjustments, or cardinal directions, or arrow keys, etc etc etc)
Sure. I'm way more productive with Copilot. I haven't been coding much lately but I could imagine it would double my productivity with regards to the actual "get an implementation of a thing done" bit of the work.
In terms of design, I had a long conversation with ChatGPT the other day about designing a database, including optimizations that could be made given certain requirements and constraints, etc. It was a big productivity boost, like rubber ducking on steroids.
I tried using it to help me optimize some SQL, but even after many attempts it didn't really do anything useful for me. The best thing was really showing how the syntax works for features that I rarely use - so in that sense it's a better stackoverflow.
I told it I was designing a database. I told it that my database could tolerate failure levels where more than a quorum of nodes failed at a given time. I then asked it about different algorithms for consensus: Raft, Paxos, swarm-based, etc. It described the algorithms for me. I told it that in my database I could guarantee certain things, like that every operation commutes, and I asked how that would let me optimize things - it explained that I could parallelize certain parts of those algorithms.
At one point I told it to name the algorithm we had been discussing something like "OptSwim" and we just kept iterating on the idea.
But aren't you afraid that whenever you veer the discussion away from Wikipedia/stackoverflow-type explanations it's likely lying to you? This was my general experience -- it's great at querying for stuff which already exists and is popular on the internet, and for conversing on a surface or broad level, but as soon as you delve into details it starts confidently lying and/or hallucinating things, which undermines my trust in it, which in turn means I need to verify what it says, which means it did not increase my productivity that much after all.
It routinely invents arguments, functions or concepts which don't exist in reality or don't apply to the current context, but look like they could, so you are even more likely to get caught by this.
Haha, yes, it indeed invents arguments that aren't part of specific APIs and would offer to do something that you'd like to do in a very easy way, but since they actually aren't part of the API, well, you're out of luck.
It's just taking "I wish they'd thought of my use case when designing that API" to the next level by simply pretending, in a very sincere and convincing way, that your wish came true, then writing a usually-pretty-correct program around that assumption that would actually work _if that wish had come true_ - but unfortunately that API doesn't really accept this convenient parameter, so... it's not that easy in reality.
On the earlier days you can see it speeds you up a lot. On the later days (such as today) you still want to wrap your own head around difficult computer science concepts, so it's kind of useless.
I finally got it to do something useful for me the other day. I got it to invert the rendering of rows and columns in a React widget I was writing.
It wasn’t something I actually needed help on, though. When I tried to go further with it and complete more of the task, it got stuck in a loop of just suggesting more and more comments but never offering more code, and then it mysteriously stopped responding at all.
This is the best experience with it I’ve had so far.
Absolutely. 50% feels conservative. The thing is that Copilot becomes so ingrained in your workflow that you don't notice it until the internet goes down and you feel completely handicapped. Only then do you realize how much you rely on it.
A 1T model would be capable of much more than the current version of Copilot in terms of autocompletion and even code correction. However, at that point, even with a lot of model parallelism to speed up inference, it's likely to be at least 10x slower on the generation side. From my experience working on Codeium, a Copilot alternative, this would be too frustrating for users. It could be useful as a tool that runs asynchronously and modifies all your code at scale.
Given how fast Copilot is (a few seconds), I wouldn't mind waiting 10x. I also wouldn't mind letting it run overnight for some tasks (i.e. write documentation, write tests, suggest bug fixes, etc.). I'll check on my buddy the next morning.
I think the UX of large suggestions will require a lot of thinking and experimentation. That's because the longer the output of such a model, the higher the risk of it making some mistake. For short completions, it's often easy to separate mistakes from useful suggestions (though sometimes subtle bugs slip in). But for longer completions, it'll get tedious and we might start accepting wrong suggestions.
It could be interesting if it was an alternative that a user could query. I could imagine someone starting to write a new function might be willing to wait 10x more time to get something better.
Very true. I think the issue though is that unless the output is very likely to be 100% correct, a user would always prefer something that is incomplete but quicker to iterate on. It would be interesting to see if we can get to a paradigm like that.
Though isn't it highly likely that core devs working at the big tech giants have access to 10x-100x faster compute, e.g. some secret TPU successor at Google?
The magic number for performance is actually memory bandwidth, which is lower for TPUs compared to A100s. They have more aggregate compute, but it's not trivial to use that to get very low latency on a per-request basis.
But they highly likely have internal prototypes with higher bandwidth and better latency. Also, with distilled latent diffusion one could probably generate text(-images) much faster anyhow, as it could produce long chunks of text at once rather than needing to recurrently feed each new token back into the inputs.
In my eyes, the limitation of these models is that they only fit a limited amount of context. Not the complete API of your code base, or the latest version of the libraries you are using. I also don't believe a bigger model would resolve these limitations.
However, I do believe there could be a meta model that can query code and libraries.
Yeah, continuous online learning by fine-tuning seems like an obvious way of making these models recall information from outside the perceptible context. One could also prompt the model to (recursively) summarize code and prepend this summary to each prompt, and/or enable the model to interactively query function definitions or code summaries before outputting a final answer (trained by RLHF). But any such tricks might also quickly be outcompeted by an even more general model, e.g. one that directly controls the GUI and can communicate with coworkers...
It doesn't work like this. A 1T model without architectural changes would not perform substantially better unless it has been trained on a lot more code. The original Codex was trained on 100B tokens, so you could possibly get some gains by increasing the model size but only up to a point. See the Chinchilla paper for reference.
Also, even with "Chinchilla laws", you still gain performance in a larger model; you just need a lot more data (if just as noisy) to reach the same level of convergence, but a larger model will have already partially converged to a superior model with the same amount of data.
I've actually seen this paper before, but I don't think it's helpful. If the entirety of GitHub is 100B tokens and you prune it down properly, then fine, you can get equal performance with fewer tokens. However, if you want improved performance, you still need more data, not just a larger model size, and that's hard to obtain. I don't think it's a lost cause and that we will be stuck with current performance by any means though - there are other ways to go.
The loss decreases with greater model size at the same compute budget (i.e. stopping sooner regarding training data). Also some rehearsal/multi-epoch training improves the forgetting rate (thereby improving performance substantially), which hasn't been taken into account by Chinchilla et al. because they train <1 epoch.
No. It shows the opposite. All model sizes converged to a similar loss as the compute increased towards maximum. But larger models had larger loss for a given compute budget.
Their text about Figure 3 confirms what I'm saying: "We find a clear valley in loss, meaning that for a given FLOP budget there is an optimal model to train"
Yes, but the losses in Figure 3 increase because the larger models see fewer data to keep the FLOP budget constant, not because of overfitting. Large models do not overfit very much, so the loss of a larger model will still be better compared to a smaller model when you keep dataset size constant.
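For reference, the parametric loss fit from the Chinchilla paper makes both points explicit (constants approximate, quoted from memory), with N the parameter count and D the number of training tokens:

    L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}, \quad E \approx 1.69,\ \alpha \approx 0.34,\ \beta \approx 0.28

Holding D fixed, a larger N still shrinks the A/N^alpha term and lowers the loss; it just stops being the compute-optimal way to spend FLOPs.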
It's likely that programmers have this skill somewhere. We all make mistakes when typing in code, and many of them do get found. Some of them don't, that's what we call a bug. So AI isn't exactly breaking any ground here.
I played with ChatGPT and asked it interview questions, and I thought it was a pretty interesting exercise to find its mistakes and get it to fix them. Good tool for training interviewers, perhaps.
Microsoft paid for early exclusive access to GPT-3 internals. They're using it to develop things like Power Apps. FAANG are all doing similar and Google in particular at least purports to have models that outperform what OpenAI is doing.
One problem I’ve always had with Copilot is that it tends to introduce extra parentheses and braces. Say I already have a pair of braces for a function body, and Copilot decides to write the whole function: it will write everything including the closing brace, leaving me with an extra brace and a syntax error to fix. It really shouldn’t be that hard to tell I already have a closing brace, especially when they’re already considering the suffix.
I had to disable Copilot for Clojure because of this. Structural editing relies on parens being always balanced, and gets confused when Copilot re-inserts already existing closing parens.
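The fix seems pretty mechanical too; something like this suffix-aware trimming would already help (a hypothetical sketch, not what the extension actually does):

    // Drop the tail of a completion that merely repeats what already follows the
    // cursor (e.g. a closing brace or paren the user typed beforehand).
    function trimOverlapWithSuffix(completion: string, suffix: string): string {
      const upcoming = suffix.trimStart();
      for (let len = Math.min(completion.length, upcoming.length); len > 0; len--) {
        if (upcoming.startsWith(completion.slice(completion.length - len))) {
          return completion.slice(0, completion.length - len);
        }
      }
      return completion;
    }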
Free idea for GitHub: a huge bit of missing context for the model right now is the last few edits made by the user. If you move your cursor to a different part of a long file, Copilot immediately forgets about that part of the file. If it knew the last few edits you made, then it would be able to make much more intelligent suggestions based on the task you're working on, rather than just the current cursor position.
This is a pretty cool idea even for just engineering the prompt! It's a complicated tradeoff to decide what should go into the context and what should be selected from other files (2000 tokens is a lot but sometimes not long enough for the longest files). Previous cursor location is a great signal directly from the user, compared to metrics like Jaccard similarity. I'd actually like to try this out for our next release of Codeium.
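(For context, the Jaccard similarity I'm referring to is just token-set overlap between a candidate snippet and the code near the cursor, roughly:)

    // Jaccard similarity between two pieces of code, treated as sets of tokens.
    function jaccard(a: string, b: string): number {
      const setA = new Set(a.split(/\W+/).filter(Boolean));
      const setB = new Set(b.split(/\W+/).filter(Boolean));
      let shared = 0;
      for (const t of setA) if (setB.has(t)) shared++;
      const union = setA.size + setB.size - shared;
      return union === 0 ? 0 : shared / union;
    }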
Not sure how easy it would be to make this work. Code edit data is not that prevalent. The best I can think of is looking at GitHub commit changes. That's one place where Repl.it has a big advantage, as it has live editing data from its users.
They could start by simply including the code around previous cursor positions as additional context the same way they do with code from other files. Nothing specific to the edits themselves. That alone would help a lot I think. Maybe they already do but I don't think so based on the behavior I see, and this article doesn't mention anything like that.
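Something as simple as this would already be a start (hypothetical sketch, names made up), feeding the windows into the prompt the same way neighboring-tab snippets are fed today:

    type CursorVisit = { line: number; timestamp: number };

    // Take a few lines around the most recent cursor positions as extra context.
    function recentEditContext(fileLines: string[], visits: CursorVisit[], radius = 5, keep = 3): string[] {
      return [...visits]
        .sort((a, b) => b.timestamp - a.timestamp)
        .slice(0, keep)
        .map(v => fileLines.slice(Math.max(0, v.line - radius), v.line + radius + 1).join("\n"));
    }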
But Copilot is getting tons of live editing data from its users too, and soon should be able to construct a nice dataset of edits. There's no way they aren't already doing that.
You would be taking snippets of code (that are potentially unparseable), concatenating them together and putting them in the prompt. The issue is that it would be the kind of prompt the model has never seen before in the training data. Maybe it would work with some clever 0-shot prompt. But if you look at the fill-in-the-middle paper from OpenAI, for example, they specifically pretrain the model with that kind of data to make it work.
The live data is gonna be useful though, yeah. Is Copilot allowed to use it under the ToS, though?
The "Privacy – Copilot for Individuals" section under https://github.com/features/copilot does say that Copilot collects code snippets if allowed by telemetry.
> User Engagement Data
> When you use GitHub Copilot it will collect usage information about events generated when interacting with the IDE or editor. These events include user edit actions like completions accepted and dismissed, and error and general usage data to identify metrics like latency and features engagement. This information may include personal data, such as pseudonymous identifiers.
> Code Snippets Data
> Depending on your preferred telemetry settings, GitHub Copilot may also collect and retain the following, collectively referred to as “code snippets”: source code that you are editing, related files and other files open in the same IDE or editor, URLs of repositories and files path.
My biggest issue with Copilot is that it only wants to add code.
That's useful, but I edit code a lot. And if I have 10 similar lines and made one edit, it'd be very convenient for Copilot to suggest editing the following line, or even several lines.
You can look into the Copilot Labs extension in VSCode; it does the editing and a bunch of other stuff too (like explaining what the highlighted code does, etc). It's not as smooth, but it's getting there.
Curious, what functionality did you find the most useful? Sometimes on edits, I find it adding more than expected (potentially entirely new files) which causes me to not accept the suggestion. Explain does work well sometimes though!
That's interesting, even IntelliCode (generally less capable than Copilot afaik) will do exactly that. I've had it trigger a few times in C# recently, where I make one or two similar edits, and it prompts me to make more.
The model with the most similar name in this list is code-cushman-001 which is described as "Codex model that is a stronger, multilingual version of the Codex (12B) model in the paper".
The next stronger Codex model is called code-davinci-001, which appears to be a fine-tuned version of the GPT-3 Davinci model, which is known to have 175B parameters. The model naming is alphabetical in order of model size: ada < babbage < curie < cushman < davinci.
Most likely latency and cost reasons. A model that's 10x as big requires 10x the hardware to serve at the same latency. Since most generations are not too long, a smaller finetuned model should work well enough.
Their JetBrains plugin is written in Kotlin/Java, but it spins up an agent server written in Node.js which handles the business logic (building the prompt, caching, making completion requests to their API). I assume most of the code is shared between the VSCode extension and this JavaScript agent.
Huh, was this post revitalized? I remember seeing it (and upvoting it) in /new yesterday, but it didn't reach critical mass for the front page. Seems to be gone now.