An empirical cybersecurity evaluation of GitHub Copilot's code contributions (arxiv.org)
137 points by pramodbiligiri 27 days ago | 42 comments



My favorite part of the paper, in the section discussing how small prompt variations affect results:

> M-2: We set the Python author flag [in the prompt] to the lead author of this paper. Sadly, it increases the number of vulnerabilities.

> M-3: We changed the indentation style from spaces to tabs and the number of vulnerable suggestions increased somewhat, as did the confidence of the vulnerable answers. The top-scoring option remained non-vulnerable.

@authors: I think something is wrong in the phrasing for M-4 (or some text got jumbled). Was the top-scoring option vulnerable or not? The second half might belong to D-3 instead (where no assessment is given)?


Thanks for your feedback! Yes, unfortunately Copilot doesn't seem to think much of my programming security...

> @authors: I think something is wrong in the phrasing for M-4 (or some text got jumbled). Was the top-scoring option vulnerable or not? The second half might belong to D-3 instead (where no assessment is given)?

Argh, you're absolutely correct. Five authors proofreading the paper and still these things slip through. Fortunately, we can fix it.


More nits:

- In Fig. 6 (Scenario 787-08), lines 2-4 appear to be truncated. [edit: I see you already responded to this elsewhere in the thread.]

- The layout of the figure labels is confusing; at first glance it looks more like they're labeling what's under them, rather than above them.

- The quotation marks are all turned into curly closing quotes.

- This is more of a style thing, but the code snippets appear to be using a variable-width font laid out as if it were fixed width, which looks pretty bad. Perhaps the font isn't embedded properly.


Re: M-3: Let the tab-vs-spaces war flare up again.


I've experimented a bit with this on the raw Codex model (https://smitop.com/post/codex/), and I've found that some prompt engineering can be helpful: explicitly telling the model to generate secure code in the prompt sometimes helps (such as by adding something like "Here's a PHP script I wrote that follows security best practices" to the prompt). Codex knows how to write more secure code, but without the right prompting it tends to write insecure code (because it was trained on a lot of bad code).
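A minimal sketch of that kind of prompt steering (not from the linked post; it assumes the older openai-python completions interface, and the engine name is an assumption):

    import openai  # pre-1.0 openai-python completions interface

    openai.api_key = "sk-..."  # placeholder

    # Prefixing the prompt with a security-oriented comment nudges the model
    # toward the more careful code in its training data.
    prompt = (
        "# This code follows security best practices and uses\n"
        "# parameterized SQL queries to avoid injection.\n"
        "def get_user(db, username):\n"
    )

    completion = openai.Completion.create(
        engine="davinci-codex",  # assumed name of the raw Codex engine
        prompt=prompt,
        max_tokens=120,
        temperature=0.2,
        top_p=0.95,
    )
    print(completion["choices"][0]["text"])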

> the settings and documentation as provided do not allow users to see what these are set to by default

There isn't a single default value. Those parameters are chosen dynamically (on the client side): when doing more sampling with a higher top_p, a higher temperature is used. I haven't tracked down where the top_p value is decided upon, but I think it depends on the context: I believe explicitly requesting a completion causes a higher top_p and a more capable model (earhart), which gives better but slower results than the completions you get as autocomplete (which come from the cushman model with a lower top_p). Copilot doesn't use any server-side magic; all the Copilot servers do is replace the GitHub authentication token with an OpenAI API key and forward the request to the OpenAI API.
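A rough sketch of the client-side selection being described (illustrative only; the model names come from this comment, and every numeric value is a placeholder rather than a documented default):

    # Illustrative sketch of the client-side sampling choices described above.
    # Model names are as reported in the comment; numbers are placeholders.
    def choose_sampling_params(explicit_request: bool, num_samples: int):
        if explicit_request:
            model, top_p = "earhart", 0.9   # more capable, slower model
        else:
            model, top_p = "cushman", 0.5   # faster autocomplete model
        # Drawing more samples uses a higher temperature for more diverse output.
        temperature = 0.1 if num_samples <= 1 else 0.8
        return model, top_p, temperature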


> I've found that some prompt engineering can be helpful: explicitly telling the model to generate secure code in the prompt sometimes helps.

As noted in the diversity of prompt section, we did try a lot of different/reasonable changes to the prompt to see what would happen in our SQL injection scenario. In our case, asking it to make the code secure actually made the results slightly worse (!), and the biggest bias towards making the code better was having other good code.
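A toy illustration (not taken from the paper's scenarios) of what "having other good code" in the prompt can look like: when the surrounding context already uses parameterized queries, the completion is more likely to follow that pattern than when the prompt merely asks for secure code.

    # Toy illustration only, not one of the paper's prompts.
    def get_order(db, order_id):
        cur = db.cursor()
        # Existing "good code" in the prompt context: a parameterized query...
        cur.execute("SELECT * FROM orders WHERE id = %s", (order_id,))
        return cur.fetchone()

    def get_user(db, username):
        cur = db.cursor()
        # ...which biases the suggested completion toward the same safe pattern,
        # rather than string concatenation.
        cur.execute("SELECT * FROM users WHERE username = %s", (username,))
        return cur.fetchone()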

> There isn't a single default value.

That's what we also guess, but as you say, it's not written or documented anywhere.


> Codex knows how to write more secure code

I think that "knows" is anthropomorphizing too much--you're giving hints that cause the model to use a more secure subset of its training data. =)


I think this is algomorphizing what humans do when given such a hint too much :)


Hi - I am actually the lead author of this paper! I'd be happy to answer any questions about the work.


Is there equivalent empirical data from real programmers?

That is to say, you have code prompts here, let Copilot fill in the gaps, and rate that code. Is there a study that uses the same prompts with a selection of programmers to see if they do better or worse?

I'm curious because in my testing of copilot, it often writes garbage. But if I'm being honest, often, so do I.

I feel like Twitter's full of cheap shots against copilot's bad outputs, but many of them don't seem to be any worse than common errors. I would really like to see how copilot stands up to the existing human competition, especially on axes of security, which are a bit more objectively measurable than general "quality".


Yes, the work definitely lends itself towards the question "is this better or worse than an equivalent human developer?" This is quite a difficult question to answer, although I agree that simply giving a large number of humans the same prompts could be insightful. However, then you would be rating against an aggregate of humans, rather than an individual (i.e. this is "the" copilot). Also, knowing research, you would really be comparing against a random corpus of student answers, as it is usually students that would be participating in a study such as this.

Nonetheless, we think that simply having a quantification of Copilot's outputs is useful, as it can definitely provide an indicator of how risky it might be to provide the tool to an inexperienced developer that might be tempted to accept every suggestion.


Rather than comparing against students in lab conditions, I'd be more interested to see how students with access to Stack Overflow et al. compare to students with access to just Copilot. I.e., is a junior developer more likely to trust bad suggestions found online vs bad suggestions made by Copilot?


Junior engineers will trust whatever information is provided to them as long as it is easily accessible. The reason juniors consult Stack Overflow is that it is one Google search and one click away, whereas consulting the official documentation/reference takes more effort (because it usually doesn't appear on Google when one searches for errors/bugs/how-tos). If Copilot (or another similar tool) is very well integrated into whatever IDE a junior is using, you can be sure it will be used and trusted because it will be faster than Google+SO.


That’s a very uncharitable view. My viewpoint was a lot more around whether the extra information provided in SO (and others) helps an engineer make better-informed decisions about code quality, and how many engineers will simply take the top answer verbatim without testing (either online or via Copilot).

The reason I singled out juniors has nothing to do with them being the least likely to check documentation. In fact, seniors are just as bad in that regard. Plus, a lot of the time the reason an engineer goes to SO is that a particular common problem doesn’t have a pre-built solution in a language’s standard library. The reason I suggested juniors is just because that’s who the researchers said they mostly work with.

So let’s be clear about one thing: I know plenty of junior developers who have put plenty of seniors to shame. I’m not about to suggest that juniors are worse developers; maybe less experienced in the chronological sense but you need to be damn careful before making other generalisations.


> So let’s be clear about one thing: I know plenty of junior developers who have put plenty of seniors to shame.

Agree. My comment was more about: most of the engineers out there (and yes, I do include myself) tend to rely on and trust information that is easily accessible, no matter if they are juniors or seniors. I think this also applies to everyday life in general (e.g., we tend to read the newspapers to get "informed", but we rarely go to the source of truth to check our "facts").


Thank you! This is an important piece of work, and I hope that your efforts are recognized.

One suggestion: On Arxiv's "Code & Data" tab it says "No official code found", but it looks like you have shared your code here: https://zenodo.org/record/5225651#.YSRKBi2cbyU

It would be a good idea to get that linked on the Arxiv page, 'cause the first question I had was, "Can I reproduce the findings?"


Thanks for the tip! I'll make sure we update our arXiv description.


Suppose a team was building a product without a rigorous security focus or experience. Do you have any reason to believe a Copilot-enabled team would produce more or less secure products?


This is a difficult question to answer as one team might be very different from another team.

However (my opinion only follows), I think our paper shows that there is a danger of Copilot suggesting insecure code - and inexperienced or security-unaware developers may accept these suggestions without understanding the implications, whereas if they had to write the code from scratch they might (?) not make the same mistakes (as they need to put in more effort, meaning there is a higher chance they stumble upon the right approach - e.g. if they ask an experienced developer for help).


Outside of Copilot, the words around code found on Stack Overflow or in a blog post may indicate a lack of correct security, which would be a signal to a developer that they need to consider something further.


Interesting work. My main question when reading is how the results compared to manually-written code. Naturally this is a much harder question to answer but it would be really interesting to see the results. Could be that Copilot is doing no better (or worse) than developers copying from Stack Overflow.

Perhaps you could take a similar approach as [1] and leverage MOOC participants?

[1] https://dl.acm.org/doi/pdf/10.1145/3383773


Thanks for your feedback!

> My main question when reading is how the results compared to manually-written code.

Ah, this is exactly the question. But as you say, much harder to answer. Even if you run a competition, unless you can encourage a wide range of developers to enter, you won't be getting the real value. Instead you might be getting incidence rates of code written by students/interns. Perhaps if you could get a few FAANGs on board to either share internal data (unlikely) or send a random sample of employees (also very unlikely) to make teams and then evaluate their code... It seems like a difficult question to answer.

We think a more doable way would be to take snapshots of large open source codebases (e.g. off GitHub) and measure the incidence rate of CWEs, but this also presents its own challenges with analyzing the data. Also, what's the relationship between open source code and all code?

Lots of avenues to consider.
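A rough sketch of what that measurement could look like (purely illustrative; it assumes the repositories are already cloned locally and uses the CodeQL CLI with a standard Python security query suite):

    import subprocess
    from pathlib import Path

    REPOS = Path("cloned_repos")                # assumed directory of checkouts
    SUITE = "python-security-and-quality.qls"   # standard CodeQL query suite

    for repo in sorted(p for p in REPOS.iterdir() if p.is_dir()):
        db = repo.parent / (repo.name + "-codeql-db")
        # Build a CodeQL database for the repository...
        subprocess.run(
            ["codeql", "database", "create", str(db),
             "--language=python", f"--source-root={repo}"],
            check=True,
        )
        # ...then run the security queries and save per-repo SARIF results.
        subprocess.run(
            ["codeql", "database", "analyze", str(db), SUITE,
             "--format=sarif-latest", f"--output={repo.name}.sarif"],
            check=True,
        )
    # The SARIF files can then be aggregated per CWE to estimate incidence rates.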


It seems like many of the code examples are incorrect in the PDF. For example, figure 6b vs the file in the actual dataset experiments_dow/cwe-787/codeql-eg-PotentialBufferOverflow/Copilot -- lines are truncated at the first "%" char, or something along those lines.


Yes, unfortunately the code example got somewhat mangled as it passed through the arXiv sanitization script [1]. The original is:

    char str_a[20], str_b[20], str_c[20];
    sprintf(str_a, "%f", a);
    sprintf(str_b, "%f", b);
    sprintf(str_c, "%f", c);
[1] https://github.com/google-research/arxiv-latex-cleaner


Argh! Thanks for spotting it! As moyix said, we were tricked by the arXiv latex cleaner tool. We've now updated this, but it will take a couple of days to clear arXiv moderation.



Summary: CONCLUSIONS AND FUTURE WORK

There is no question that next-generation ‘auto-complete’ tools like GitHub Copilot will increase the productivity of software developers. However, while Copilot can rapidly generate prodigious amounts of code, our conclusions reveal that developers should remain vigilant (‘awake’) when using Copilot as a co-pilot. Ideally, Copilot should be paired with appropriate security-aware tooling during both training and generation to minimize the risk of introducing security vulnerabilities. While our study provides new insights into its behavior in response to security-relevant scenarios, future work should investigate other aspects, including adversarial approaches for security-enhanced training.


A "lead foot" on the software development gas pedal, with no attached safety systems that are activated by anybody but the driver.


Copilot didn't worsen the appsec story, it just highlighted it. If you have devs who don't know how to write secure code, and/or you don't have security engineering support (internal or outsourced), you were already failing (or, probably more apropos, walking the tightrope without a net).

Was anyone checking the security of code copy-pasted from Stack Overflow? Hopefully this work gets fed back into Copilot, improving it, which improves the experience (and safety) for its users. Lots of folks are still writing code without Copilot or security engineering knowledge.


> If you have devs who don't know how to write secure code

The problem with GitHub Copilot is that the developers are not writing the code - they're simply accepting what's being written for them, often in large quantities at a time.

> don't have security engineering support

Valuable, but my analogy was intended to point out that it's not inherent in the tooling.

> Was anyone checking the security of code copy pasted from Stackoverflow

Yes, other users on Stack Overflow via comments and other answers. They're not perfect, but their checks and balances exist as a facet of that tool.

> Hopefully this work gets fed back into Copilot

Only if it's open source, and a large volume of it, to boot. In other words, I don't hold hope that the security situation will be better anytime soon.


> activated by anybody but the driver.

Except this is precisely what the abstract is saying is a misuse of the system. You have the option to give the driver the control.

> Ideally, Copilot should be paired with appropriate security-aware tooling during both training and generation to minimize the risk of introducing security vulnerabilities.

You're oversimplifying by assuming the purpose of Copilot is to write a whole block of generated code. Copilot is an 80/20 thing when every developer on HN is pedantically assuming it's a 100/0 one.


> [Copilot] is an 80/20 thing when every developer on HN is pedantically assuming it's a 100/0 one.

Tesla has 80% of a self-driving solution and runs into parked cars.

I don't think it's "pedantic" to realize that 80% is enough that low-information users are going to assume it's 100%. And that low-information users are the ones who will flock to a tool like this. (I won't, because after trying Copilot out, I realized that checking the code generally takes about as long as writing it.)


> Tesla has 80% of a self-driving solution and runs into parked cars.

Weird take, not sure how that's relevant at all, but okay.

> And that low-information users are the ones who will flock to a tool like this.

Developers are notoriously not low-information users. As someone else pointed out:

> If you have devs who don't know how to write secure code, and/or you don't have security engineering support (internal or outsourced), you were already failing (or probably more apropos, walking the tight rope without a net).

So devs with limited security experience are going to continue to develop code that doesn't conform to security standards. Copilot neither makes this better nor makes it worse. In fact, that's exactly the problem with AI - it simply mimics real humans. (See -> Amazon's attempts at using AI for hiring and the resulting bias.)

Copilot is for the people who Google Stack Overflow answers without having to search on Stack Overflow.


just wait until github-microsoft adds a fee to use the results for certain uses, and then scans all your repos constantly to find code that doesn't pay up


Surely AI can also be taught some boundary conditions like "thou shalt not build SQL from strings"?


I think you could use linting tools that check for things like this and filter the output. Or use outputs that fail the lint as negative training examples.
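A minimal sketch of that filtering idea (illustrative only; it uses the Bandit linter as the example checker, and suggestions that fail could instead be kept as negative training examples):

    import subprocess
    import tempfile

    def passes_security_lint(snippet: str) -> bool:
        """Reject a generated Python snippet if Bandit flags it (non-zero exit)."""
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(snippet)
            path = f.name
        # Bandit's B608 check flags SQL built from string concatenation, among others.
        result = subprocess.run(["bandit", "-q", path], capture_output=True)
        return result.returncode == 0

    suggestions = [
        "cur.execute('SELECT * FROM users WHERE name = ' + name)",       # rejected
        "cur.execute('SELECT * FROM users WHERE name = %s', (name,))",   # kept
    ]
    safe = [s for s in suggestions if passes_security_lint(s)]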


I don't know anything about Copilot's design, but surely they passed all the code they fed it in the training stage through some pretty strict linters, right? I mean that's just common sense...


No, definitely not. Copilot essentially tries to use "all code in the wild as it is". In the Copilot design, code blocks are essentially treated as independent units of text, without attempting to put different files in the context of a larger project - IIRC they were taking size-limited chunks of files as the independent units - much less attempting to build or lint the projects, because if your basic unit is 100 lines from the middle of a file, it obviously won't even compile (because of external references), and there's no straightforward way to deploy a working build/test environment (with dependencies) automatically for a million projects.

Furthermore, what's your expectation of the proportion of code on GitHub that would pass "through some pretty strict linters"? My guess is that for serious, widely used and actively maintained projects it's perhaps 5% (almost no projects pass linters without something getting flagged, but there's a small minority of projects that actively fight all the "identified lint"), and for random code it's roughly 0%, with only specific teaching/toy projects passing.


tl;dr they tested GitHub Copilot against 89 risky coding scenarios and found about 40% of the roughly 1,700 sample implementations Copilot generated in the test were vulnerable (which makes sense given it's trained on public GitHub repos, many of which contain sample code that's a nightmare from a security perspective).


Soooo, the big question is - is 40% higher or lower than what an average developer cranks out? ;-)


It keeps insisting on this structure that the language does not have; I am tempted to build an extension for it, that's how often it calls this.


> Breaking down by language, 25 scenarios were in C, generating 516 programs. 258 (50.00 %) were vulnerable. Of the scenarios, 13 (52.00 %) had a top-scoring program vulnerable. 29 scenarios were in Python, generating 571 programs total. 219 (38.4 %) were vulnerable. Of the scenarios, 11 (37.93 %) had a vulnerable top-scoring program.

I'd bet a good chunk of those were buffer-overflow related



