
Bug Prediction at Google (2011) - joubert
http://google-engtools.blogspot.com/2011/12/bug-prediction-at-google.html
======
frik
How does Google deal with code refactoring? If one splits a code file into
two files, how does the algorithm deal with the scores from the file's former
commits? Sure, splitting classes in Java is hard (one file per class), but in
languages like C++, C#, and PHP with namespaces it's trivial and happens all
the time. (disclaimer: I read ~80% of the article and comments, and used the
browser search function, but found no answer)

~~~
Lewisham
So I actually wrote this article (strange to see it again!)

We don't run the code any more and I would have to go back to the source to
check, but I _think_ it handles renames OK; it doesn't have any handling for
splitting one file into two that way, though.

It's a somewhat by-the-by issue, as my intuition is that the files that can
actually be broken up and successfully refactored are not the same ones that
are going to get flagged. The flagged files are the ones that keep churning
because no one really knows how to write them any better.
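
If you wanted to dig into a file's history yourself, something like this
gathers it across renames (a sketch in Python; git's --follow flag handles a
rename, but not a split of one file into two, which is exactly the case you
ask about):

    import subprocess

    # Sketch: collect (commit hash, unix timestamp) pairs for one file,
    # following renames. --follow tracks a rename; a file split into two
    # starts the new file with a fresh history, so its score resets.
    def commits_for(path):
        out = subprocess.run(
            ["git", "log", "--follow", "--format=%H %ct", "--", path],
            capture_output=True, text=True, check=True,
        ).stdout
        return [line.split() for line in out.splitlines()]

    print(commits_for("src/foo.cc"))  # hypothetical path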

~~~
Raphomet
Hi! Just curious, why did you stop running the code? Seems like a useful
thing.

~~~
Lewisham
So the follow-up paper that assesses the impact is here [1]

TL;DR is that developers just didn't find it useful. Sometimes they knew the
code was a hot spot, sometimes they didn't. But knowing that the code was a
hot spot didn't give them any means of effecting change for the better.
Imagine a compiler that just said "Hey, I think this code you just wrote is
probably buggy," but didn't tell you where, and even if you found and fixed
the bug, kept saying it because the code had been buggy recently. That's
essentially what TWR does. That became understandably frustrating, and we
have many other signals that developers _can_ act on (e.g. FindBugs), so we
risked drowning out those useful signals with this one.
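
For the curious, the scoring was (if memory serves) a logistic weighting over
each file's bug-fix commit timestamps, normalized to [0, 1]. A minimal
sketch, with made-up commit data:

    import math

    # Time-weighted risk: each bug-fixing commit contributes
    # 1 / (1 + e^(-12t + 12)), where t is its timestamp normalized to
    # [0, 1]. Recent fixes dominate; old fixes contribute almost nothing.
    def twr_score(fix_times, oldest, now):
        span = max(now - oldest, 1)
        return sum(
            1.0 / (1.0 + math.exp(-12.0 * ((ts - oldest) / span) + 12.0))
            for ts in fix_times
        )

    # Made-up histories: unix timestamps of commits whose messages
    # matched a bug-fix pattern.
    oldest, now = 1_000_000_000, 1_323_000_000
    hot = twr_score([1_280_000_000, 1_300_000_000, 1_320_000_000], oldest, now)
    cold = twr_score([1_010_000_000, 1_020_000_000, 1_030_000_000], oldest, now)
    print(hot > cold)  # True: recent churn outscores old churn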

Some teams did find it useful for getting individual team reports so they
could focus on places for refactoring efforts, but from a global perspective,
it just seemed to frustrate, so it was turned down.

From an academic perspective, I consider the paper one of my most impactful
contributions, because it highlights to the bug prediction community some
harsh realities that need to be overcome for bug prediction to be useful to
humans. So I think the whole project was quite successful... Note that the
Rahman algorithm that TWR was based on did pretty well in developer reviews
at finding bad code, so it could plausibly be used effectively for automated
tools, e.g. test case prioritization, so you can surface failures earlier in
the test suite. I think automated uses are probably the most fruitful area
for bug prediction efforts to focus on in the near-to-mid term.
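
(A sketch of what that automated use might look like: rank tests by the
riskiest file they touch. The coverage map and scores here are made up:)

    # Hypothetical: prioritize tests by the riskiest file they cover,
    # e.g. using scores produced by twr_score above.
    coverage = {
        "ranker_test": ["search/ranker.cc", "util/strings.cc"],
        "strings_test": ["util/strings.cc"],
    }
    risk = {"search/ranker.cc": 1.4, "util/strings.cc": 0.1}  # made up
    ordered = sorted(coverage,
                     key=lambda t: max(risk[f] for f in coverage[t]),
                     reverse=True)
    print(ordered)  # ['ranker_test', 'strings_test']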

[1]
[http://www.cflewis.com/publications/google.pdf?attredirects=0](http://www.cflewis.com/publications/google.pdf?attredirects=0)

~~~
nostrademons
I was one of the interviewees for the study (or at least, I remember ranking
those three lists as described in the experimental design).

My impression was that the results of the algorithm were pretty _accurate_,
but they were not very _actionable_. Very often, the files identified were
ones the team knew to be buggy, but there were good reasons they were buggy:
e.g. the problem the code was solving was complex, that area of the code was
undergoing heavy churn because the problem it solved was a high priority, or
the code was ugly but another system was being developed to replace it, and
it wasn't worth fixing when it was going to be thrown away anyway. In some
cases, proposals to fix or refactor the code had been nixed repeatedly by
executives.

Basically - not all bugs are created equal. Oftentimes code is buggy _because_
it's important, and the priority is on satisfying user needs rather than
fixing bugs.

------
tdicola
I'd love to hear a follow-up on whether this worked out for them or ended up
being more trouble than it was worth. I remember being a bit skeptical about
the process when it came out.

edit: Although, skimming the comments, maybe they need to turn that machine
learning on the Blogger comments. What a wasteland of spam and crap...

~~~
zatkin
The original author of the article has replied to comments on Hacker News
here:
[https://news.ycombinator.com/item?id=9324701](https://news.ycombinator.com/item?id=9324701)

------
niedbalski
My Python implementation:
[https://github.com/niedbalski/python-bugspots](https://github.com/niedbalski/python-bugspots)

------
plg
my machine learning algorithm for predicting whether there are bugs in any of
my team's code: (C code)

int main(int argc, char *argv[]) { return 1; }

~~~
lern_too_spel
Nonzero exit code means false (no bugs). If this program represents the
entirety of your team's code, your story checks out.

~~~
Dylan16807
Amusing, but I feel like that's a terrible way to think about return codes.
"No error" vs. "error number" doesn't map cleanly to a boolean.

~~~
lern_too_spel
Any other interpretation of his program goes against decades of Unix
convention.
[http://tldp.org/LDP/Bash-Beginners-Guide/html/sect_07_01.html](http://tldp.org/LDP/Bash-Beginners-Guide/html/sect_07_01.html)

~~~
Dylan16807
Ehhh, "if" is a bit of a specialized case. In general programs don't map fully
to true/false, just leave it as "error code".

It's not that we should use any _other_ interpretation, it's that that
interpretation is only _mostly_ true. Don't overgeneralize lest you introduce
mistakes.
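
A tiny demo of the distinction (Python standing in for the shell here;
/bin/true and /bin/false are the canonical zero/nonzero programs):

    import subprocess

    # Unix convention: exit status 0 is success; anything nonzero is an
    # error *code*. A shell `if` collapses this to success/failure, which
    # is where the true/false reading comes from.
    ok = subprocess.run(["true"]).returncode    # 0
    err = subprocess.run(["false"]).returncode  # 1

    print(ok == 0)   # True -- success
    print(err)       # 1 -- but 1, 2, and 127 are all distinct failures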

------
j2kun
I'm a little amused that they did not remark that perfect bug prediction is
known to be impossible in general (deciding it for arbitrary code would mean
deciding the halting problem). Is this because they assume every reader
already knows that, or because they forgot their theory lessons?

On another note, I wonder whether one could rigorously define bug prediction
for a "helpful" programmer who isn't trying to trick the machine by using
diagonalization tricks and obfuscating things.

~~~
michael_storm
They probably didn't mention it because formal, deterministic bug prediction
may as well be on a different planet from probabilistic bug prediction based
on hot spots in commit logs. When a _blog post_ cites two research articles,
it's generally safe to assume the authors haven't forgotten their theory.

------
scott_b
Pretty interesting to revisit this idea. In practice, I never found that
automated bug detection or code-quality tools really helped when we used
them to pinpoint problems in the code.

That being said, I am a fan of tools like GitPrime for identifying
opportunities to improve visibility and insight into the dev process. It
expands on this idea of risk identification, but applies it to the project,
not the file.

------
yzh
Are there graph-based methods for doing this kind of analysis and prediction?

------
lfender6445
Here is a Ruby implementation:
[https://github.com/igrigorik/bugspots](https://github.com/igrigorik/bugspots)

