Copilot Internals (thakkarparth007.github.io)
358 points by jjwiseman on Dec 18, 2022 | 80 comments



I stumbled upon this repo by accident a few days ago when its source code appeared to contain the only usage of the term "GH_COPILOT_TOKEN" in any repo on GitHub:

https://github.com/search?q=GH_COPILOT_TOKEN&type=code

(My Copilot was broken and this was in the error output I was seeing, see: https://github.com/community/community/discussions/41878)

What I found there was some truly impressive reverse-engineering work by a single individual. I really like the "JOURNAL" daily diary they kept of progress and random thoughts, so you can see the progression day by day.

--------

One thing I found interesting: the author says that it queries only the 20 most recently opened files of the same language.

But in an AMA, I asked how much "context" Copilot has available, and one of the devs said it can, for example, read header files that pair with C/C++ files that are open in separate tabs:

https://github.com/orgs/community/discussions/29932#discussi...

  > "I assume Copilot uses the contents of the current project (IE, all files) as contextual information to offer suggestions. Is this the case?"

  > "Yes, copilot looks at what we call "related tabs", for example .c files are often paired with .h files, so copilot considers the contents of other tabs open in the editor to try to feed a more complete prompt to the model."


Hey Gavin, that's consistent with what I said in the blog post. Header (.h) and source (.c) files are classified as having the same language (C). As a result, if you have the header file open in a tab (or in the workspace -- I'm not sure if unaccessed files from the workspace are used), then it'll be considered for the prompt.
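
Roughly, the selection logic described in the post boils down to something like this (a TypeScript sketch; the names are illustrative, not the extension's actual identifiers):

  // Sketch of the "related tabs" selection described in the post.
  // All names (OpenDocument, selectRelatedTabs) are made up for illustration.
  interface OpenDocument {
    path: string;
    languageId: string;   // ".h" and ".c" both map to "c"
    lastAccessed: number; // epoch millis
    text: string;
  }

  function selectRelatedTabs(
    current: OpenDocument,
    openDocs: OpenDocument[],
    maxFiles = 20,        // the "20 files" limit from the post
  ): OpenDocument[] {
    return openDocs
      .filter(d => d.path !== current.path &&
                   d.languageId === current.languageId)
      .sort((a, b) => b.lastAccessed - a.lastAccessed) // most recent first
      .slice(0, maxFiles);
  }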


Heh this blew up here :D Didn't know till a friend told me about it.

I'd love to know if you guys have any specific questions about Copilot's internals that I can try to answer by staring at the code, or if you have any feedback on the tool/post!


Any word on what the neovim plugin does? Some features, like

> Then, most recently accessed 20 files of the same language are queried from VSCode.

are probably not available by default in nvim, right?

---

Also, do you think there's any chance you get in trouble for reverse engineering copilot?


RE: neovim

Looks like the 20-file logic is present in neovim as well. I went through its code (https://github.com/github/copilot.vim/blob/release/copilot/d...) after beautifying it (https://codebeautify.org/jsviewer), and found the same logic there.

I couldn't exactly trace it to a specific neovim event, but I'm guessing it corresponds to buffer update events (https://neovim.io/doc/user/api.html#api-buffer-updates) or something like that.

Re: getting in trouble

I surely hope not :P. I mean, the code is basically public (available on every user's computer).


Interesting, thanks.


The neovim plugin actually mostly communicates with a Node.js service, seen here (https://github.com/github/copilot.vim/tree/release/copilot/d...). This is why they require you to install Node to use the plugin, and it allows them to share logic with the VSCode extension (also in JavaScript). I think all the features should be available even in neovim.


Just a note on the first sentence under "What does a prompt look like?": it seems like there's a continuity error from the paragraph above.


oh lol. Yeah, messed up while rewriting. Fixing. Thanks!


Honestly I get more value out of ChatGPT than I do from Copilot, even though both generate the wrong stuff now and then. But I'd rather describe the desired functionality in plain English than try to goad Copilot in the right direction by coming up with method and variable names that Copilot will "like".


Yeah, I second what bibabloo said. Sometimes when I want to edit something, I comment out the existing thing and start writing what'd be the new version. Then Copilot autocompletes what I have in mind (often enough).
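
A toy example of that workflow (illustrative only; the actual completion varies):

  // Old version, left commented out so Copilot can see the intent:
  // function parsePrice(s: string): number {
  //   return parseInt(s, 10);
  // }

  // New version: handle decimals and a leading currency symbol
  function parsePrice(s: string): number {
    // ...from here, Copilot tends to complete something along these lines:
    return parseFloat(s.replace(/^[^0-9.-]+/, ""));
  }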


Have you tried writing a comment as a prompt for copilot?


Yeah, I still find that ChatGPT seems to understand the intent better than Copilot. Maybe it's the type of stuff I tend to ask.


Isn't that just the difference between answering an explicit question and guessing at implicit intent?


Amazing if this is only a 12B model. If this already increases coding productivity by up to 50% (depending on the kind of work), imagine what a 1T model will be capable of! I do wonder if some programmers at FAANG already have access to way more powerful coding assistants, and whether they code much at all at this point, or only write high-level code specifications and then fix up the automatically generated code.


> If this already increases coding productivity by up to 50% (depending on kind of work)

Does anyone believe that?

edit: I'm surprised to see that (so far) 3 replies actually agree with the statement. Is there a video that you'd recommend that shows realistic usage and gain from copilot? Maybe a livestream or something.


On menial tasks, it's way more than 50%. For quick scripting, dirty parsing, PoCs and plumbing, it's about 300% for me.

However, for anything that requires me to think, it's 5% at best.

Don't take the 50% figure as anything serious; I think it's just a way of saying "it is such a meaningful boost in productivity".

Which it is, for a lot of tasks, because the vast majority of programming jobs are boring stuff outside of the HN bubble.

It's amazing how much of the world economy runs on csv uploaded to ftp servers.


Agreed with this. If the main bottleneck is typing, then Copilot can dramatically speed up the process. If the bottleneck is thinking, it doesn't help out nearly as much unfortunately.


I'd add that, for me at least, it's quite good at some small specific subsets of "requires me to think". For example, I do a lot of 3D rotations and transformations, and it's very good at figuring out the math of those based on the function name I chose, etc. Most of those would take me a piece of paper and 5-10 minutes, but it usually gets it in 1 or 2 tries.

But yes, mundane work is what it's best at. Some things I've found it makes particularly easy:

- scraping websites

- file i/o

- "mirroring" things (I write a bunch of code for doing something on x axis, it automatically replicates it for y and z etc with the right adjustments, or cardinal directions, or arrow keys, etc etc etc)


It is indeed a cheap script boy for me as well.

It does mundane work exceptionally well.


Sure. I'm way more productive with Copilot. I haven't been coding much lately but I could imagine it would double my productivity with regards to the actual "get an implementation of a thing done" bit of the work.

In terms of design, I had a long conversation with ChatGPT the other day about designing a database, including optimizations that could be made given certain requirements and constraints, etc. It was a big productivity boost, like rubber ducking on steroids.


I tried it to help me optimize some SQL, but even after many attempts it didn't really do anything useful for me. The best thing was really showing how the syntax works for features that I rarely use - so in that sense it's a better Stack Overflow.


Can you give us an example of how it helped you design the database?

I could not think how it would have helped me, but maybe I'm limited in my imagination or don't know how to ask.


I told it I was designing a database. I told it that my database could tolerate failure levels where more than a quorum of nodes failed at a given time. I then asked it about different algorithms for consensus: Raft, Paxos, swarm-based, etc. It described the algorithms for me. I told it that in my database I could guarantee certain things, like that every operation commutes, and I asked how that would let me optimize things - it explained that I could parallelize certain parts of those algorithms.

At one point I told it to name the algorithm we had been discussing something like "OptSwim" and we just kept iterating on the idea.


But aren't you afraid that whenever you veer the discussion away from Wikipedia/StackOverflow-type explanations, it's likely lying to you? That was my general experience: it's great at querying for stuff that already exists and is popular on the internet, and for conversing on a surface or broad level, but as soon as you delve into details it starts confidently lying and/or hallucinating things. That undermines my trust in it, which in turn means I need to verify what it says, which means it did not increase my productivity that much after all.

It routinely invents arguments, functions or concepts which don't exist in reality or don't apply to the current context, but look like they could, so you're even more likely to get caught out by this.


Haha, yes, it indeed invents arguments that aren't part of specific APIs and will offer to do something you'd like to do in a very easy way - but since they aren't actually part of the API, well, you're out of luck.

It takes "I wish they'd thought of my use case when designing that API" to the next level by simply pretending, in a very sincere and convincing way, that your wish came true, then writing a usually-pretty-correct program around that assumption that would actually work _if that wish had come true_ - but unfortunately the API doesn't really accept this convenient parameter, so... it's not that easy in reality.


Well then. The singularity is here. Almost no humans understand these things.


I think people may be downvoting you because technically, neither does the AI.


I used Copilot last Advent of Code and really liked it.

This year I recorded most of my days and uploaded them to youtube. So if you want to get a realistic view, take a look here: https://www.youtube.com/channel/UCOqPGQCzgieAOL6iOJjj8hg.

On the earlier days you can see it speeds you up a lot. On the later days (such as today) you still have to wrap your own head around difficult computer science concepts, so it is kind of useless.

Let me know if you have any questions!


I finally got it to do something useful for me the other day. I got it to invert the rendering of rows and columns in a React widget I was writing.

It wasn’t something I actually needed help on, though. When I tried to go further with it and complete more of the task, it got stuck in a loop of just suggesting more and more comments but never offering more code, and then it mysteriously stopped responding at all.

This is the best experience with it I’ve had so far.


Absolutely. 50% feels conservative. The thing is that Copilot becomes so ingrained in your workflow that you don't notice it until the internet goes down and you feel completely handicapped. Only then do you realize how much you rely on it.


I haven’t tried Copilot but I’ve used ChatGPT to help with doing Advent of Code in Python (which I don’t use regularly so I forget bits of syntax).

At first I found it very useful to ask it to parse the input. Much faster than looking up three separate docs to piece together what I had in mind.

But then I asked it to parse a more complex input and it just kept failing badly even when I gave it sample inputs and outputs.

I’d say it definitely offers some productivity gains and is worth trying.


A 1T model would be capable of much more than the current version of Copilot in terms of autocompletion and even code correction. However, at that point, even with a lot of model parallelism to speed up inference, it's likely to be at least 10x slower on the generation side. From my experience working on Codeium, a Copilot alternative, this would be too frustrating for users. It could be useful as a tool that runs asynchronously and modifies all your code at scale.


Given how fast Copilot is (a few seconds), I wouldn't mind waiting 10x longer. I also wouldn't mind letting it run overnight for some tasks (i.e. writing documentation, writing tests, suggesting bug fixes, etc.) and checking on my buddy the next morning.


I think the UX of large suggestions will require a lot of thinking and experimentation. That's because the longer the output of such a model, the higher the risk of it making some mistake. For short completions, it's often easy to distinguish mistakes from useful suggestions (though sometimes subtle bugs slip in). But for longer completions, it'll get tedious and we might start accepting wrong suggestions.


That sounds like modern day outsourcing


It could be interesting if it was an alternative that a user could query. I could imagine someone starting to write a new function might be willing to wait 10x more time to get something better.


Very true. I think the issue, though, is that unless the output is very likely to be 100% correct, a user would always prefer something that is incomplete but quicker to iterate on. It would be interesting to see if we can get to a paradigm like that.


Though isn't it highly likely that core devs working at the big tech giants have access to 10x-100x faster compute, e.g. some secret TPU successor at Google?


The magic number for performance is actually memory bandwidth, which is lower for TPUs than for A100s. They have more aggregate compute, but it's not trivial to use that to get very low latency on a per-request basis.


But they quite likely have internal prototypes with higher bandwidth and lower latency. Also, with distilled latent diffusion one could probably generate text(-images) much faster anyhow, as it could produce long chunks of text at once rather than needing to recurrently feed each new token back into the inputs.


In my eyes, the limitation of these models is that they only fit a limited amount of context: not the complete API of your code base, nor the latest version of the libraries you are using. I also don't believe a bigger model would resolve these limitations.

However, I do believe there could be a meta model that can query code and libraries.


Presumably if you had access to them you could fine-tune them on your codebase.


Yeah, continuous online learning by fine-tuning seems like an obvious way of making these models recall information from outside the perceptible context. One could also prompt the model to (recursively) summarize code and prepend this summary to each prompt, and/or enable the model to interactively query function definitions or code summaries before outputting a final answer (trained by RLHF). But any such tricks might also quickly be outcompeted by an even more general model, e.g. one that directly controls the GUI and can communicate with coworkers...


It doesn't work like that. A 1T model without architectural changes would not perform substantially better unless it had been trained on a lot more code. The original Codex was trained on 100B tokens, so you could possibly get some gains by increasing the model size, but only up to a point. See the Chinchilla paper for reference.
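
For reference, the parametric loss fit from the Chinchilla paper, with N the parameter count and D the number of training tokens (constants as fitted there):

  L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
  \qquad E \approx 1.69,\; A \approx 406.4,\; B \approx 410.7,\;
  \alpha \approx 0.34,\; \beta \approx 0.28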


Not necessarily: https://arxiv.org/abs/2206.14486

Also, even under the "Chinchilla laws", you still gain performance with a larger model; you just need a lot more data (if just as noisy) to reach the same level of convergence, and a larger model will have already partially converged to a superior model with the same amount of data.


I've actually seen this paper before, but I don't think it's helpful. If the entirety of GitHub is 100B tokens and you prune it down properly, then fine, you can get equal performance with fewer tokens. However, if you want improved performance, you still need more data, not just a larger model, and that's hard to obtain. I don't think it's a lost cause or that we will be stuck with current performance by any means, though - there are other ways to go.


> if you want improved performance, you still need more data

Not true. See figure 2: https://arxiv.org/pdf/2203.15556.pdf#page=5

The loss decreases with greater model size at the same compute budget (i.e. stopping sooner with respect to training data). Also, some rehearsal/multi-epoch training improves the forgetting rate (thereby improving performance substantially), which wasn't taken into account by Chinchilla et al. because they train for <1 epoch.

https://arxiv.org/abs/2205.12393


No. It shows the opposite. All model sizes converged to a similar loss as the compute increased towards the maximum, but larger models had a larger loss for a given compute budget.

Their text about Figure 3 confirms what I'm saying: "We find a clear valley in loss, meaning that for a given FLOP budget there is an optimal model to train"


Yes, but the losses in Figure 3 increase because the larger models see less data to keep the FLOP budget constant, not because of overfitting. Large models do not overfit very much, so the loss of a larger model will still be better than that of a smaller model when you keep the dataset size constant.
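
In terms of the Chinchilla fit quoted above: hold D fixed and the data term is a constant, so the fitted loss only improves as N grows:

  \frac{\partial L}{\partial N} = -\,\alpha \frac{A}{N^{\alpha + 1}} < 0,
  \qquad \lim_{N \to \infty} L(N, D) = E + \frac{B}{D^{\beta}}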


Original Codex is Python only.


True. I think they're counting duplicated code, though; I don't see any mention of de-duplication in their paper.


"Fix up generated code" - but do you agree that finding a mistake (without even knowing whether it's there) might be even harder than writing from scratch?


It's likely that programmers have this skill somewhere. We all make mistakes when typing in code, and many of them do get found. Some of them don't - that's what we call a bug. So AI isn't exactly breaking any ground here.

I played with ChatGPT and asked it interview questions, and I thought it was a pretty interesting exercise to find its mistakes and get it to fix them. Good tool for training interviewers, perhaps.


We are doing this all the time anyway during code reviews.


Microsoft is FAANG level and beyond.


Microsoft paid for early exclusive access to GPT-3 internals. They're using it to develop things like Power Apps. FAANG are all doing similar and Google in particular at least purports to have models that outperform what OpenAI is doing.


One problem I've always had with Copilot is that it tends to introduce extra parentheses and braces. Say I already have a pair of braces for a function body and Copilot decides to write the whole function: it will write everything including the closing brace, leaving me with an extra one and a syntax error to fix. It really shouldn't be that hard to tell I already have a closing brace, especially when they're already considering the suffix.
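
One naive fix, sketched below (not Copilot's actual logic, just an illustration of the idea): trim the completion wherever its remainder already appears right after the cursor.

  // Sketch: cut the completion at the first point where everything that
  // follows is already present in the suffix (the text after the cursor),
  // e.g. an already-typed closing brace. Illustrative, not Copilot's code.
  function trimOverlapWithSuffix(completion: string, suffix: string): string {
    const after = suffix.trimStart();
    if (after.length === 0) return completion;
    for (let i = 0; i < completion.length; i++) {
      const rest = completion.slice(i).trimStart();
      if (rest.length > 0 && after.startsWith(rest)) {
        return completion.slice(0, i); // drop the duplicated tail
      }
    }
    return completion;
  }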


I had to disable Copilot for Clojure because of this. Structural editing relies on parens being always balanced, and gets confused when Copilot re-inserts already existing closing parens.


Free idea for GitHub: a huge bit of missing context for the model right now is the last few edits made by the user. If you move your cursor to a different part of a long file, Copilot immediately forgets about that part of the file. If it knew the last few edits you did then it would be able to make much more intelligent suggestions based on the task you're working on, rather than just current cursor position.


This is a pretty cool idea, even just for engineering the prompt! It's a complicated tradeoff to decide what should go into the context and what should be selected from other files (2,000 tokens is a lot, but sometimes not enough for the longest files). Previous cursor location is a great signal directly from the user, compared to metrics like Jaccard similarity. I'd actually like to try this out for our next release of Codeium.


Not sure how easy it would be to make that work. Code-edit data is not that prevalent; the best I can think of is looking at GitHub commit changes. That's one place where Repl.it has a big advantage, as it has live editing data from its users.


They could start by simply including the code around previous cursor positions as additional context, the same way they do with code from other files - nothing specific to the edits themselves. That alone would help a lot, I think. Maybe they already do, but I don't think so based on the behavior I see, and this article doesn't mention anything like that.
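
A sketch of what that could look like (all names hypothetical; this is not Copilot's actual code):

  // Keep a short history of cursor positions and splice snippets around
  // them into the prompt, analogous to snippets from other open tabs.
  interface CursorVisit {
    line: number;      // 0-based line the cursor rested on
    timestamp: number;
  }

  function snippetsFromHistory(
    fileLines: string[],
    history: CursorVisit[],
    window = 5,        // lines of context on each side
    maxSnippets = 3,
  ): string[] {
    return history
      .slice(-maxSnippets) // most recent visits only
      .map(({ line }) => {
        const start = Math.max(0, line - window);
        const end = Math.min(fileLines.length, line + window + 1);
        return fileLines.slice(start, end).join("\n");
      });
  }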

But Copilot is getting tons of live editing data from its users too, and soon should be able to construct a nice dataset of edits. There's no way they aren't already doing that.


You would be taking snippets of code (that are potentially unparseable), concatenating them together, and putting the result in the prompt. The issue is that it would be a kind of prompt the model has never seen in the training data. Maybe it would work with some clever zero-shot prompt. But if you look at the fill-in-the-middle paper from OpenAI, for example, they specifically pretrain the model with that kind of data to make it work.
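
For reference, that paper rearranges training documents into a prefix-suffix-middle order with sentinel tokens, roughly like this (a sketch; exact sentinel spellings vary by model):

  // PSM ("prefix-suffix-middle") rearrangement, roughly as in OpenAI's
  // fill-in-the-middle paper: the model sees both sides of the cursor and
  // is trained to produce the middle after the final sentinel.
  const PRE = "<|fim_prefix|>"; // sentinel spellings are illustrative
  const SUF = "<|fim_suffix|>";
  const MID = "<|fim_middle|>";

  function fimPrompt(prefix: string, suffix: string): string {
    return PRE + prefix + SUF + suffix + MID; // model generates the middle
  }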

The live data is gonna be useful though, yeah. Is Copilot allowed to use it under the ToS, though?


The "Privacy – Copilot for Individuals" section under https://github.com/features/copilot does say that Copilot collects code snippets if allowed by telemetry.

> User Engagement Data When you use GitHub Copilot it will collect usage information about events generated when interacting with the IDE or editor. These events include user edit actions like completions accepted and dismissed, and error and general usage data to identify metrics like latency and features engagement. This information may include personal data, such as pseudonymous identifiers.

> Code Snippets Data Depending on your preferred telemetry settings, GitHub Copilot may also collect and retain the following, collectively referred to as “code snippets”: source code that you are editing, related files and other files open in the same IDE or editor, URLs of repositories and files path.


I think they're working on something similar to this; it'll be available sometime next year. Source: GitHub Next Discord server :)


My biggest issue with Copilot is that it only wants to add code.

That's useful, but I edit code a lot. And if I have 10 similar lines and made one edit, it'd be very convenient for Copilot to suggest editing the following line, or even several lines.


You can look into the Copilot Labs extension in VSCode; it does editing and a bunch of other stuff too (like explaining what the highlighted code does, etc.). It's not as smooth, but it's getting there.


Curious, which functionality did you find the most useful? Sometimes on edits I find it adding more than expected (potentially entirely new files), which causes me not to accept the suggestion. Explain does work well sometimes, though!


That's interesting; even IntelliCode (generally less capable than Copilot, afaik) will do exactly that. I've had it trigger a few times in C# recently, where I make one or two similar edits and it prompts me to make more.


Why does “cushman-ml” suggest a 12B model instead of the 175B model?


The model with the most similar name in this list is code-cushman-001 which is described as "Codex model that is a stronger, multilingual version of the Codex (12B) model in the paper".

https://crfm-models.stanford.edu/static/help.html

The next stronger Codex model is called code-davinci-001, which appears to be a fine-tuned version of the GPT-3 Davinci model, which is known to have 175B parameters. The model naming is alphabetical in order of model size:

https://blog.eleuther.ai/gpt3-model-sizes/

See also A.2 here: https://arxiv.org/pdf/2204.00498.pdf#page=6


Code is the base model in more recent iterations [0]

[0] https://beta.openai.com/docs/model-index-for-researchers


Most likely latency and cost reasons. A model that's 10x as big requires 10x the hardware to serve at the same latency. Since most generations are not too long, a smaller fine-tuned model should work well enough.


This is about the VSCode extension, which is obfuscated (maybe compiled) JS.

The plugin for IntelliJ (PyCharm etc.) - is that one written in Java? Decompiling it might give some additional insights.


Their JetBrains plugin is written in Kotlin/Java, but it spins up an agent server written in Node.js which handles the business logic (building the prompt, caching, making completion requests to their API). I assume most of the code is shared between the VSCode extension and this JavaScript agent.


Just wanted to say great job on the analysis and explanation!


Thank you! :)


Huh, was this post revitalized? I remember seeing it (and upvoting it) in /new yesterday, but it didn't reach critical mass for the front page. Seems to be gone now.


Yes, I think it was.



