
Bug Prediction at Google - nl
http://google-engtools.blogspot.com/2011/12/bug-prediction-at-google.html
======
nickolai
So in summary, their system highlights the files that have had a lot of bugs
lately. While useful, this is more trend analysis and extrapolation than
actual prediction. The title made me hope for something that would highlight
potential bug nests via static code analysis or the like. Interesting, but a
little disappointing.

~~~
Lewisham
Hi. I'm Chris, the author of the blog post/work.

You're not wrong, although bug prediction is a very nebulous term in the
academic literature which encompasses a pretty broad range of techniques. The
term is used to properly ground the work in what's come before.

It's perhaps useful to talk a little bit about how I got where I ended up,
which wasn't discussed much in the post for length reasons. I originally spent
a lot of time trying to implement FixCache, a pretty complicated algorithm
that looks at things such as spatial locality of bugs (bugs appear close to
each other), temporal locality of bugs (bugs are introduced at the same time)
and files that change together (bugs will cross cut these). It then puts it
all in a LIFO cache and flags files that way. This most definitely fits in the
"bug prediction" line of work, and a number of follow-up papers independently
verified its success, but I was struggling to deploy FixCache. This was
largely because the open source code that had been written beforehand at the
university lab [1] wasn't designed for MapReduce, and also because I was
becoming worried that it was very difficult to understand how it flagged
files, which could lead to the tool being written off as outputting false
positives and then ignored.
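
For the curious, the cache idea is roughly sketchable in a few lines of
Python. This is only a sketch, not the real implementation: `commits` and
`co_change` are made-up inputs, and the eviction policy here is a simple
oldest-first one.

    from collections import OrderedDict

    CACHE_SIZE = 100  # arbitrary for this sketch

    def fixcache_flags(commits, co_change):
        # commits: chronological list of (files_changed, is_bug_fix) pairs
        # co_change: map from a file to files that historically change with it
        cache = OrderedDict()

        def load(f):
            cache.pop(f, None)  # refresh the file's position if already cached
            cache[f] = None
            while len(cache) > CACHE_SIZE:
                cache.popitem(last=False)  # evict the oldest entry

        for files_changed, is_bug_fix in commits:
            if not is_bug_fix:
                continue
            for f in files_changed:
                load(f)  # temporal locality: recently fixed files stay cached
                for neighbor in co_change.get(f, ()):
                    load(neighbor)  # change coupling: co-changing files too
        return set(cache)  # whatever is cached at the end gets flagged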

About halfway through the internship, a pre-print of the Rahman et al. [2]
paper was handed to me, which indicated that maybe everything FixCache was
doing could be massively simplified while getting largely the same result. We
trialled both algorithms using user studies, and found that developers thought
the output of the Rahman algorithm better matched their intuition as to where
the hotspots in their code base were, so we went with that.
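
For reference, the Rahman-style scoring we ended up with is simple enough to
sketch in Python. Again a sketch rather than our production code; `fixes` is
a made-up input, and the recency weighting follows the 1/(1 + e^(-12t + 12))
curve described in the blog post.

    import math
    from collections import defaultdict

    def hotspots(fixes):
        # fixes: list of (timestamp, files_changed) pairs, bug-fix commits only
        times = [t for t, _ in fixes]
        first, last = min(times), max(times)
        span = (last - first) or 1
        scores = defaultdict(float)
        for t, files in fixes:
            norm = (t - first) / span  # normalize commit time into [0, 1]
            weight = 1 / (1 + math.exp(-12 * norm + 12))  # recent fixes count more
            for f in files:
                scores[f] += weight
        return sorted(scores.items(), key=lambda kv: -kv[1])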

So, after this very roundabout discussion, we reach the question of "If Rahman
is a subset of the FixCache algorithm, and FixCache is a bug prediction
algorithm, is Rahman a bug prediction algorithm, too?" and I'll leave that for
others to judge, but that's why it's been described as such :)

[1] <https://github.com/SoftwareIntrospectionLab/FixCache>
[2] <http://macbeth.cs.ucdavis.edu/fse2011.pdf>

------
esrauch
Without considering the number of submits or file length, you will end up
with files that are simply longer and edited more often getting unduly high
scores. A single 100-line file will be scored as twice as buggy as either
half of the exact same code split into two 50-line files.

Considering that source file lengths roughly follow a Zipf-like distribution,
it seems like at least the top 5% of files by length would get flagged
regardless, unless the bug distribution is extremely uneven.

~~~
tesseract
I've seen a few studies (sorry, no citations handy, if I have time I'll
revisit this comment) which concluded that above some fairly low floor,
bugginess is strongly positively correlated with file length.

Which is to say, always flagging the top 5% longest files as being among the
buggiest has a good chance of being the right thing to do.

~~~
esrauch
That could be true; my point was that even if a 1000-line file has a 1%
chance of a bug per line, it will be marked "buggy", while a 50-line file
with a 10% chance of a bug per line won't be.

It makes no sense that if you took that 1000-line file and split it into 10
files, and those 10 files had the same bugs as the original, the single
1000-line file would be flagged as buggy but none of the 10 split files
would.

~~~
dilap
Not necessarily, because maybe refactoring your code to be clean enough to
fit into ten files also makes it clean enough to remove bugs. And/or hairy
code doesn't end up in short files.

~~~
maaku
Who said anything about refactoring? Chop it up into 10 files verbatim. If
that is indeed all it takes to cheat the algorithm, it's not very useful.

~~~
Lewisham
I'm Chris, author of the post/work.

Cheating the algorithm is indeed a possibility and is actually very easy:

1. Just don't file bug tickets.

2. If you do, don't attach the tickets to changes.

3. If you attach the tickets to changes, don't put the "bug" type on the
ticket.

Doing things to deliberately cheat the algorithm is likely not going to pass
code review.

One common misconception is that the algorithm is a stick to beat developers
with. It really, really isn't, and I go to great lengths to try and make this
clear in the internal docs (although the blog post doesn't do this as much).
The idea is just to provide another insight into the code, so that devs don't
accidentally change something that's a real problem without help.

The algorithm is just a tool to help developers understand their code. The
scores aren't objective, either: different teams use the bug tracker
differently, so a score from a file in one project cannot be compared to a
file from another project. We mitigate this by flagging only the top 10% of
files across the entire codebase, so hopefully we by and large only flag the
worst offenders.
Part of what I wanted to do was to set up a web app so teams could claim
various parts of the code base, and then run the algorithm on a team-by-team
basis instead, but I ran out of time on that.

I think this is actually one of the nice things about the algorithm, because I
don't like the idea of implicitly pitting people or teams against one another.
I just like that people can go "Oh hey, this has been a real problem for us,
let's tread carefully here."

~~~
esrauch
I don't think the issue with the algorithm is people deliberately trying to
cheat it. The problem is that files that are, by any empirical measure, more
bug-free can end up with higher scores.

You could have two very similar pieces of code, one of which is a single
larger file of 1000 lines. Imagine this code gets 1 bug per day.

Now imagine the other, functionally identical code happens to be broken up
into 10 files, and that 9 bugs per day are filed against this unit of code;
the broken-up code is essentially the same yet 9x buggier, but it will be
scored lower because each file only averages 0.9 bugs per day.

Do you disagree that the code in the second example would actually be way
"trickier" and deserves a flag much more than the first? I understand that
one requirement is that the algorithm stay clear to developers, but it seems
like you could easily take the ratio of bug fixes to total commits on a file,
or normalize by file length, if you wanted a reasonable metric of how
"tricky" or dangerous a file is to edit.
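
Something like this minimal sketch, say (made-up input: `history` maps each
file to its count of bug-fix commits and total commits):

    def tricky_score(history, min_commits=5):
        # Score each file by the share of its commits that were bug fixes.
        scores = {}
        for f, (fixes, total) in history.items():
            if total >= min_commits:  # skip files with too little history
                scores[f] = fixes / total
        return scores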

------
jdp23
If you're into defect prediction, it's worth checking out the work the
Empirical Software Engineering group at Microsoft Research has done over the
years -- <http://research.microsoft.com/en-us/groups/ese/publications.aspx>

------
igrigorik
<https://github.com/igrigorik/bugspots> - for anyone who's curious to try it
on their own git repo.

(improvements welcome! :-))

~~~
silverlake
Looks good, but you might move the pattern matching in scanner.rb out to a
config file so users can search for other keywords.

~~~
rbxbx
<https://github.com/igrigorik/bugspots/pull/5>

;)

------
jacques_chester
What I like about Google is that they take uncontroversial "should-dos" --
you should unit test, you should review, you should pay more attention to
bug-prone modules -- and then build and improve _tools_ that support those
practices.

To me, one of the reasons agile so swept our field is that it rode on the
back of the increasing availability of two genres of tool: the unit testing
framework and the bug/issue/story/feature tracker.

------
nimrody
Trouble is, lots of nontrivial bugs are not module (or file) specific. They
occur due to the _interaction_ between modules: implicit assumptions about
inputs are violated, and invalid output is produced and propagates through
the system.

[Hope this technology will help make Chrome more stable. Do they even
consider resource leaks to be bugs?]

~~~
gujk
That doesn't matter so much, because this project is about highlighting
delicate code regardless of reason, not assigning blame.

------
eric-hu
I would gladly use this tool and provide feedback if it were released as
something that tacks onto git.

~~~
bostonvaulter2
Or, even better, onto Gerrit!

~~~
eric-hu
Is this feature in Gerrit?

~~~
bostonvaulter2
Not yet!

------
hndl
What about files with limited or no history? Cyclomatic complexity [1] to
the rescue?

[1] <http://en.wikipedia.org/wiki/Cyclomatic_complexity>
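
A crude estimate for Python files is easy to sketch with the standard ast
module. This simplifies McCabe's definition to one plus the number of branch
points, and undercounts things like chained boolean operators:

    import ast

    BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                    ast.BoolOp, ast.IfExp, ast.comprehension)

    def cyclomatic_complexity(source):
        # Rough estimate: 1 + number of branch points in the parse tree.
        tree = ast.parse(source)
        return 1 + sum(isinstance(node, BRANCH_NODES)
                       for node in ast.walk(tree))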

------
cwills
Prevention is better than cure... so the next step would be to warn the
developer before, or while, they make the code changes. Hooking into the IDE
to visually indicate hot spots within the source file being edited would be
neat.

~~~
Lewisham
Yes, this is exactly right. We didn't have time within the span of an
internship to hook into the IDE (along with all the other tasks, like user
studies, algorithm design, etc.), so we just went with the warning in source
control.

Given more time, I absolutely would have done this. Nicolas Lopez at UC Irvine
is thinking about putting tools like this at the IDE level, if you're
interested.

------
piptastic
Unfortunately there were 18 hot spots in the bug prediction code.

------
tibbon
Also of note, this blog is beautiful in the classic view.

~~~
MaysonL
I'd vehemently differ, other than perhaps as a piece of abstract art: the
black bar is attention-grabbing to negative effect, and the opening animation
is self-indulgent, as is the massive whitespace atop. Also, the width of the
page seems rather strange: why so wide, with no content?

~~~
jdpage
The text column is narrow because it's easier for humans to read text in
narrow columns. Also, negative space isn't necessarily bad -- it can be used
to draw attention to important items.

Now that I think about it, if the black bar was lighter, then the page
wouldn't balance visually. To be honest, I didn't actually notice what was on
the black bar when I first read the article, and when you scroll down it
collapses into something less visually imposing. The opening animation could
do with being a bit faster or vanishing altogether, though. I feel like having a
smooth close animation is important because the eye is attracted to sudden
movement.

~~~
MaysonL
I have no problem with the narrow text column, what bugs me is the fact that
the web page has horizontal scroll bars on my 1440-pixel wide display. Nothing
on the page requires that much width.

~~~
jdpage
You're right, it does. Didn't notice that. That _is_ rather silly.

------
jnazario
Neat idea. I whipped up a version in Python (and for SVN, our SCM of choice)
and plowed through some codebases. I mimic igrigorik's Ruby/git version's
output (it's very usable).

The code's up here:

<http://monkey.org/~jose/blog//viewpage.php?page=bugspot>

------
ashutoshm
<http://youtu.be/SzRqd4YeLlM>

------
jayeshsalvi
I'd recommend Nassim Nicholas Taleb's "The Black Swan" to the authors of
this prediction tool. Statistics cannot be used to "predict" the future
behavior of a complex system.

~~~
nl
_Statistics cannot be used to "predict" future behavior of a complex system_

They certainly can, and commonly are! Think of weather prediction, Bayesian
diagnostic systems, Google's statistical translator and spell checker, etc
etc.

Taleb's point is that they fail sometimes - perhaps more often than you think
- and so you should be _very_ careful in placing bets on statistical
predictions.

In this case the cost of a failed prediction is very low, and a successful
prediction is worth quite a lot, so it is a net win for Google.

------
T_S_
I've got a great bug prediction tool: GHC, the Haskell compiler. Whenever I
use a value of the wrong type, my program won't compile. In my code it is
always a bug when that happens. (Some smarter people can write non-buggy code
the compiler doesn't like.)

Sometimes I can trick GHC into using the wrong type, like when I use "String"
to mean "File". That's when all the bugs show up as run-time errors. But I
_should_ know a string is not a file. And hopefully somebody writes a module
that knows what a "File" really is and stops me from writing buggy code if I
use it.
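
(The same idea translates outside Haskell. A rough Python sketch with
made-up names, using typing.NewType so that a static checker such as mypy
rejects mixing the two types:)

    from typing import NewType

    Path = NewType("Path", str)
    FileContents = NewType("FileContents", str)

    def read_file(p: Path) -> FileContents:
        with open(p) as fh:
            return FileContents(fh.read())

    config = Path("/etc/hosts")
    data = read_file(config)  # OK
    # read_file(data)         # mypy error: FileContents is not a Path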

~~~
srean
I am not surprised that you are being downvoted. In my experience, one may
question the superiority of dynamic typing at HN only to the detriment of
one's karma. Well, it's not HN alone; dynamic languages are certainly popular
and quite persuasively championed.

Now I belong to the "had been persuaded before, but now I am not so sure"
category, more so after discovering absolutely stupid typos and
compile-time-checkable errors in long-running Python code that I had written.
It really sucks when your code bails out after 16 hours of number crunching
because of some silly error on your part.

I used to think, "who makes type errors, of all things?" It turns out that
idiot is me. I am not claiming that dynamic languages are bad, or that
checked languages are always good, but I do benefit from checked code. A
better programmer perhaps wouldn't benefit as much.

Rather than badly implementing an ad-hoc type-checking system myself, I now
lean more towards compile-time-checked languages, especially for pieces of
code that I expect to run for a long time.

I think the sweet spot would be _optionally_ type-checked / type-inferred
languages, for example Cython or Common Lisp. Strongly type-checked languages
_tend_ to be all or nothing.

EDIT: @nl Yeah, absolutely; they aren't by any means a bug-predicting
algorithm.

~~~
tedunangst
I think, more accurately, one may rave about the irrelevant merits of static
typing to the detriment of one's karma.

Here's a bug from the real world: the login form doesn't render correctly in
IE. When someone implements IE's layout algorithm as a Haskell type, I'll
start paying attention.

~~~
srean
That there exist errors that cannot be type-checked makes type-checking
irrelevant? I don't quite see why exhaustiveness is a prerequisite for
relevance or attention. Is that a rule of thumb you use for other things as
well?

I am not sure that type-checking even aspires to be exhaustive. As long as
it can detect my stupid and costly mistakes at a cost lower than the cost of
the mistakes themselves, I am happy. But I can understand that for people who
do not make costly, compile-time-checkable mistakes, it might not be worth
it.

@tedunangst Got you. Your second paragraph in the previous comment threw me
off; it does read like you consider type-checking unworthy of attention
because it cannot catch all possible errors. I agree with your comment about
people overstating their case. I think the noisiest of both camps (if one
can call them that) are equally guilty.

~~~
tedunangst
If the original post does not mean to convey "I don't need this because I have
ghc", well, it kinda sounds like it does. I didn't say type checking was
irrelevant. I said its merits are irrelevant in a discussion about how to find
bugs that have already been checked in. Unless the devs are really sloppy,
such code has already been compiled and run through whatever more or less
exhaustive checking that entails.

Type checking is useful. I like it. But I don't like people talking about it
like it's magic pixie dust.

