> M-2: We set the Python author flag [in the prompt] to the lead author of this paper. Sadly, it increases the number of vulnerabilities.
> M-3: We changed the indentation style from spaces to tabs and the number of vulnerable suggestions increased somewhat, as did the confidence of the vulnerable answers. The top-scoring option remained non-vulnerable.
@authors: I think something is wrong in the phrasing for M-4 (or some text got jumbled). Was the top-scoring option vulnerable or not? The second half might belong to D-3 instead (where no assessment is given)?
Argh, you're absolutely correct. Five authors proofreading the paper and still these things slip through. Fortunately, we can fix it.
- In Fig. 6 (Scenario 787-08), lines 2-4 appear to be truncated. [edit: I see you already responded to this elsewhere in the thread.]
- The layout of the figure labels is confusing; at first glance it looks more like they're labeling what's under them, rather than above them.
- The quotation marks are all turned into curly closing quotes.
- This is more of a style thing, but the code snippets appear to be using a variable-width font laid out as if it were fixed width, which looks pretty bad. Perhaps the font isn't embedded properly.
> the settings and documentation as provided do not allow users to see what these are set to by default
There isn't a single default value. Those parameters are chosen dynamically (on the client side): when doing more sampling with a higher top_p, a higher temperature is used. I haven't tracked down where the top_p value is decided upon, but I think it depends on the context: I believe explicitly requesting a completion causes a higher top_p and a more capable model (earhart) to be used, which gives better but slower results than the completions you get as autocomplete (those come from the cushman model with a lower top_p). Copilot doesn't use any server-side magic; all the Copilot servers do is replace the GitHub authentication token with an OpenAI API key and forward the request to the OpenAI API.
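To make those two knobs concrete: temperature rescales the model's logits before sampling, and top_p ("nucleus sampling") restricts sampling to the smallest set of tokens whose probabilities sum to top_p. A rough sketch (in C to match the other snippets in this thread; purely illustrative, not Copilot's actual client code):

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define VOCAB 5

/* Sort helper: descending order over doubles. */
static int cmp_desc(const void *a, const void *b) {
    double da = *(const double *)a, db = *(const double *)b;
    return (da < db) - (da > db);
}

int main(void) {
    /* Hypothetical logits for a 5-token vocabulary. */
    double logits[VOCAB] = {2.0, 1.5, 0.5, 0.1, -1.0};
    double temperature = 0.8; /* lower => sharper, more deterministic */
    double top_p = 0.9;       /* keep the smallest token set covering 90% of the mass */

    /* Softmax with temperature. */
    double probs[VOCAB], sum = 0.0;
    for (int i = 0; i < VOCAB; i++) { probs[i] = exp(logits[i] / temperature); sum += probs[i]; }
    for (int i = 0; i < VOCAB; i++) probs[i] /= sum;

    /* Nucleus (top_p) filtering: sort descending and keep tokens until
     * their cumulative probability reaches top_p; only those get sampled. */
    qsort(probs, VOCAB, sizeof(double), cmp_desc);
    double cum = 0.0;
    int kept = 0;
    while (kept < VOCAB && cum < top_p)
        cum += probs[kept++];

    printf("temperature=%.2f top_p=%.2f => sampling from the top %d of %d tokens\n",
           temperature, top_p, kept, VOCAB);
    return 0;
}

A lower temperature concentrates probability on the top tokens, and a lower top_p then cuts off the tail entirely, which is presumably why the client tunes the two together.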
As noted in the diversity-of-prompt section, we did try a lot of different (but reasonable) changes to the prompt to see what would happen in our SQL injection scenario. In our case, asking it to make the code secure actually made the suggestions slightly worse (!), and the biggest bias towards making the code better was having other good code in the prompt.
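For anyone who hasn't seen the scenario type: what gets graded for CWE-89 boils down to whether a suggestion splices user input into the SQL string or binds it as a parameter. A minimal sketch using SQLite's C API (illustrative only, not one of our actual scenarios; table and column names are made up):

#include <sqlite3.h>
#include <stdio.h>

/* VULNERABLE: user input becomes part of the SQL text, so an input like
 *   ' OR '1'='1
 * rewrites the query (CWE-89, SQL injection). */
void lookup_unsafe(sqlite3 *db, const char *username) {
    char sql[256];
    snprintf(sql, sizeof sql,
             "SELECT id FROM users WHERE name = '%s';", username);
    sqlite3_exec(db, sql, NULL, NULL, NULL);
}

/* NON-VULNERABLE: the query shape is fixed and the input is bound as data. */
void lookup_safe(sqlite3 *db, const char *username) {
    sqlite3_stmt *stmt = NULL;
    if (sqlite3_prepare_v2(db, "SELECT id FROM users WHERE name = ?;",
                           -1, &stmt, NULL) == SQLITE_OK) {
        sqlite3_bind_text(stmt, 1, username, -1, SQLITE_TRANSIENT);
        while (sqlite3_step(stmt) == SQLITE_ROW)
            printf("id=%d\n", sqlite3_column_int(stmt, 0));
    }
    sqlite3_finalize(stmt);
}

Prompt context full of the second shape is exactly the "other good code" bias mentioned above.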
> There isn't a single default value.
That's what we also guess, but as you say, it's not written or documented anywhere.
I think that "knows" is anthropomorphizing too much--you're giving hints that cause the model to use a more secure subset of its training data. =)
That is to say, you have code prompts here, you let Copilot fill in the gaps, and you rate that code. Is there a study that uses the same prompts with a selection of programmers to see if they do better or worse?
I'm curious because in my testing of copilot, it often writes garbage. But if I'm being honest, often, so do I.
I feel like Twitter's full of cheap shots against copilot's bad outputs, but many of them don't seem to be any worse than common errors. I would really like to see how copilot stands up to the existing human competition, especially on axes of security, which are a bit more objectively measurable than general "quality".
Nonetheless, we think that simply having a quantification of Copilot's outputs is useful, as it can definitely provide an indicator of how risky it might be to provide the tool to an inexperienced developer who might be tempted to accept every suggestion.
The reason I singled out juniors has nothing to do with them being the least likely to check documentation. In fact, seniors are just as bad in that regard. Plus, a lot of the time the reason an engineer goes to SO is that a particular common problem doesn’t have a pre-built solution in a language’s standard library. The reason I suggested juniors is just because that’s who the researchers said they mostly work with.
So let’s be clear about one thing: I know plenty of junior developers who have put plenty of seniors to shame. I’m not about to suggest that juniors are worse developers; maybe they’re less experienced in the chronological sense, but you need to be damn careful before making other generalisations.
Agree. My comment was more about: most of the engineers out there (and yes, I do include myself) tend to rely on and trust information that is easily accessible, no matter if they are juniors or seniors. I think this also applies to everyday life in general (e.g., we tend to read the newspapers to get "informed", but we rarely go to the source of truth to check our "facts").
One suggestion: On Arxiv's "Code & Data" tab it says "No official code found", but it looks like you have shared your code here: https://zenodo.org/record/5225651#.YSRKBi2cbyU
It would be a good idea to get that linked on the Arxiv page, 'cause the first question I had was, "Can I reproduce the findings?"
However (my opinion only follows), I think our paper shows that there is a danger of Copilot suggesting insecure code - and inexperienced / security-unaware developers may accept these suggestions without understanding the implications, whereas if they had to write the code from scratch they might (?) not make the same mistakes, as they would need to put in more effort, meaning there is a higher chance they stumble upon the right approach - e.g. by asking an experienced developer for help.
Perhaps you could take a similar approach as  and leverage MOOC participants?
> My main question when reading is how the results compared to manually-written code.
Ah, this is exactly the question. But as you say, much harder to answer. Even if you run a competition, unless you can encourage a wide range of developers to enter, you won't be getting the real value. Instead you might be getting incidence rates of code written by students/interns.
Perhaps if you could get a few FAANGs on board to either share internal data (unlikely) or send a random sample of employees (also very unlikely) to make teams and then evaluate their code... It seems like a difficult question to answer.
We think a more doable way would be to take snapshots of large open source codebases (e.g. off GitHub) and measure the incidence rate of CWEs, but this also presents its own challenges with analyzing the data. Also, what's the relationship between open source code and all code?
Lots of avenues to consider.
/* The destination buffers are 20 bytes, but "%f" on a large float can
 * need up to 47 characters (up to 39 integer digits, the decimal point,
 * six decimals, and the terminator), so each sprintf below can write
 * out of bounds (CWE-787). */
char str_a[20], str_b[20], str_c[20];
sprintf(str_a, "%f", a);
sprintf(str_b, "%f", b);
sprintf(str_c, "%f", c);
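For comparison, a non-vulnerable completion only needs bounded writes: size the buffers for the worst case and let snprintf truncate rather than overflow (the 48-byte size is my worst-case choice, not from the paper):

char str_a[48], str_b[48], str_c[48];
snprintf(str_a, sizeof str_a, "%f", a);
snprintf(str_b, sizeof str_b, "%f", b);
snprintf(str_c, sizeof str_c, "%f", c);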
There is no question that next-generation ‘auto-complete’ tools like GitHub Copilot will increase the productivity of software developers. However, while Copilot can rapidly generate prodigious amounts of code, our conclusions reveal that developers should remain vigilant (‘awake’) when using Copilot as a co-pilot. Ideally, Copilot should be paired with appropriate security-aware tooling during both training and generation to minimize the risk of introducing security vulnerabilities. While our study provides new insights into its behavior in response to security-relevant scenarios, future work should investigate other aspects, including adversarial approaches for security-enhanced training.
Was anyone checking the security of code copy pasted from Stackoverflow? Hopefully this work gets fed back into Copilot, improving it, which improves the experience (and safety) for its users. Lots of folks are still writing code without copilot or security engineering knowledge.
The problem with GHC is that the developers are not writing the code - they're simply accepting what's being written for them, often in large quantities at a time.
> don't have security engineering support
Valuable, but my analogy was intended to point out that it's not inherent in the tooling.
> Was anyone checking the security of code copy pasted from Stackoverflow
Yes, other users on Stackoverflow via comments and other answers. They're not perfect, but their checks and balances exist as a facet of that tool.
> Hopefully this work gets fed back into Copilot
Only if it's open source, and a large volume of it, to boot. In other words, I don't hold hope that the security situation will be better anytime soon.
Except this is precisely what the abstract is saying is a misuse of the system. You have the option to give the driver the control.
> Ideally, Copilot should be paired with appropriate security-aware tooling during both training and generation to minimize the risk of introducing security vulnerabilities.
You're oversimplifying by assuming the purpose of CoPilot is to write whole blocks of generated code. CoPilot is an 80/20 thing, when every developer on HN is pedantically assuming it's a 100/0 one.
Tesla has 80% of a self-driving solution and runs into parked cars.
I don't think it's "pedantic" to realize that 80% is enough that low-information users are going to assume it's 100%. And that low-information users are the ones who will flock to a tool like this. (I won't, because after trying Copilot out, I realized that checking the code generally takes about as long as writing it.)
Weird take, not sure how that's relevant at all, but okay.
> And that low-information users are the ones who will flock to a tool like this.
Developers are notoriously not low-information users. As someone else pointed out:
> If you have devs who don't know how to write secure code, and/or you don't have security engineering support (internal or outsourced), you were already failing (or probably more apropos, walking the tight rope without a net).
So devs with limited security experience are going to continue to develop code that doesn't conform to security standards. CoPilot neither makes this better nor makes it worse. In fact, that's exactly the problem with AI - it simply mimics real humans (see Amazon's attempts at using AI for hiring and the resulting bias).
CoPilot is for the people who google StackOverflow answers without having to search on StackOverflow.
Furthermore, what's your expectation of the proportion of code on GitHub that would pass "through some pretty strict linters"? My guess is that for serious, widely used, and actively maintained projects it's perhaps 5% (almost no projects pass linters without something getting flagged, but there's a small minority of projects that actively fight all the "identified lint"), and for random code it's roughly 0%, with only specific teaching/toy projects passing.
I'd bet a good chunk of those were buffer-overflow related