Hacker News new | comments | show | ask | jobs | submit login
Bug Prediction at Google (google-engtools.blogspot.com)
216 points by nl on Dec 15, 2011 | hide | past | web | favorite | 65 comments

So in summary, their system highlights the files that had 'a lot of bugs lately'. While useful, this is more of a trend analysis and extrapolation rather than actual prediction. The title made me hope of something that would highlight some potential bug nests from static code analysis or the like. Interseting, but a little disappointing.

Hi. I'm Chris, the author of the blog post/work.

You're not wrong, although bug prediction is a very nebulous term in the academic literature which encompasses a pretty broad range of techniques. The term is used to properly ground the work in what's come before.

It's perhaps useful to talk a little bit about how I got where I ended up, which wasn't discussed much in the post for length reasons. I originally spent a lot of time trying to implement FixCache, a pretty complicated algorithm that looks at things such as spacial locality of bugs (bugs appear close to each other), temporal locality of bugs (bugs are introduced at the same time) and files that change together (bugs will cross cut these). It then puts it all in a LIFO cache and flags files that way. This most definitely fits in the "bug prediction" line of work, and a number of follow-up papers independently verified its success. but I was struggling deploying FixCache. This was largely because the open source code that had been written beforehand at the university lab [1] wasn't designed for MapReduce, and also because I was becoming worried that it was very difficult to understand how it flagged files, which could lead to the tool being written off as outputting false positives and then ignored.

About halfway through the internship, a pre-print of the Rahman et al. [2] paper was handed to me, which indicated that maybe everything FixCache was doing could be massively simplified and get largely the same result. We trialled both algorithms using user studies, and found that developers thought the output of the Rahman algorithm better matched their intuition as to where the hotspots in their code base was, so we went with that.

So, after this very roundabout discussion, we reach the question of "If Rahman is a subset of the FixCache algorithm, and FixCache is a bug prediction algorithm, is Rahman a bug prediction algorithm, too?" and I'll leave that for others to judge, but that's why it's been described as so :)

[1] https://github.com/SoftwareIntrospectionLab/FixCache [2] http://macbeth.cs.ucdavis.edu/fse2011.pdf

Without considering the number of submits or file length you will end up with files that are simply longer and edited more getting undue high scores. A single 100 line file will be scored to be twice as buggy as the exact same code split into two 50 line files.

Considering source file lengths roughly follow a Zipfian type distribution, it seems like at least the top 5% length files would get flagged regardless unless the bug distribution is extremely uneven.

I've seen a few studies (sorry, no citations handy, if I have time I'll revisit this comment) which concluded that above some fairly low floor, bugginess is strongly positively correlated with file length.

Which is to say, always flagging the top 5% longest files as being among the buggiest has a good chance of being the right thing to do.

That could be true, my point was that even if a 1000 line file has a 1% chance of bug-per-line it will be marked "buggy" while a 50 line file with a 10% chance of bug-per-line won't.

It makes no sense that if you took that 1000 line file and refactored it into 10 files and those 10 files had the same bugs as the original that the 1000 line file will be flagged as buggy but all 10 of the split files won't.

Not necessarily, because maybe refactoring your code to be clean enough to fit into ten files is also makes it clean enough to remove bugs. And/or hairy code doesn't end up in short files.

Who said anything about refactoring? Chop it up into 10 files verbatim. If that is indeed all it takes to cheat the algorithm, it's not very useful.

I'm Chris, author of the post/work.

Cheating the algorithm is indeed a possibility and is actually very easy:

1. Just don't file bug tickets.

2. If you do, don't attach the tickets to changes.

3. If you attach the tickets to changes, don't put the "bug" type on the ticket.

Doing things to deliberately cheat the algorithm is likely not going to pass code review.

One common misconception is that the algorithm is a stick to beat developers with. It really, really isn't, and I go to great lengths to try and make this clear in the internal docs (although the blog post doesn't do this as much). The idea is just to provide another insight into the code, so that devs don't accidentally change something that's a real problem without help.

The algorithm is just a tool to help developers understand their code. It's not subjective because different teams use the bug tracker differently, so a score from a file in one project cannot be compared to a file from another project. We mitigate this by only taking off the top 10% of code across the entire codebase, so hopefully we by and large only flag the worst offenders. Part of what I wanted to do was to set up a web app so teams could claim various parts of the code base, and then run the algorithm on a team-by-team basis instead, but I ran out of time on that.

I think this is actually one of the nice things about the algorithm, because I don't like the idea of implicitly pitting people or teams against one another. I just like that people can go "Oh hey, this has been a real problem for us, let's tread carefully here."

I don't think the issue with the algorithm is people deliberately trying to cheat it. The problem is that files that are actually by any empirical measure more bug-free getting higher scores.

You could have two very similar pieces of code, one of which is in one larger file with 1000 lines. Imagine this code gets 1 bug per day.

Now imagine the other code that is functionally identical happens to be broken up into 10 files. Now imagine that 9 bugs per day are applied to this unit of code that is essentially identical to the 1000 line file; the broken up code is exactly the same and is 9x buggier, but it will be scored lower because each file is only getting 0.9 bugs on average per day.

Do you disagree that the code in the second example would actually be way "trickier" and deserves to have a flag on it much more than the first one? I understand that one requirement for this is clarity of the algorithm to developers, but it seems like you could easily take the ratio of bug fixes to total commits on a file or normalize by file length if you wanted to actually get a reasonable metric of how "tricky" or dangerous it is to edit a file.

what you haven't thought about is that when you do refactor the 1000 line file into 10 files, that refactor would necessarily have made each bit slightly simpler (by virtue of it being smaller). Hence, it is likely that bugs don't get introduced as easily, simply because there is less to 'worry' about when writing code that has a smaller cognative load. so may be the metric isn't as wrong as you initially thought.

No, you can literally just take 10 javascript functions and put them in 10 different files or one file with absolutely no modification at all. Why would one get flagged and the other one not?

The point is simply that if you have a unit of code that has X bugs per day this will give a lower score if code that is actually buggier is across multiple shorter files.

Hi, I'm Chris, author of the blog post/work.

You're on the right lines, and I spent some time thinking about this. There is a couple of things that led me not to concern myself with file length. Before I start, I would note that a file just being big and getting changed isn't a problem: it has to be changed and have a bug ticket with the "bug" classification put against it.

1. As others have stated, the research does strongly point towards LOC as being part of the metric that should be looked at. [1] is a paper that discusses LOC, but it's behind the IEEE pay wall and I wasn't able to find a free version. Sorry. I'm on holiday and a bit pressed for time this morning, but I can root around for more papers if you would like. Pretty much all the models we have for bug prediction will invariably throw LOC into the pot (although as they often do PCA too, one could argue that the LOC going down could be one of the components ;) ).

2. Google, by and large, is an OOP company. The intuitive reason for why LOC leads to defects is that the file is no longer meeting the principle of single responsibility. I would hypothesize that once your file has gone past this, defects are going to start to snowball as your file is on the way to becoming what I called a Gateway file, where large amounts of functionality pass in and out of the file, even if it only does simple stuff. So not only is it going to attract defects because it's long, but it might well attract even more defects because when other functionality is modified, that file has to as well. The added problem is these files resistant to refactoring because of the possibility of breaking so much functionality.

A Very Big Philosophical Argument is whether you would put configuration files under this classification, as config files affect all sorts of functionality. Personally, I would, but most of the engineers disagreed with me, so they were left out.

3. In a very unscientific way, I did experiment with all sorts of modifiers to try and even out files that may have been "unfairly" flagged due to some metric like LOC or number of commits or whatever. I eventually abandoned this, not just because I found the results largely unsatisfying, but I realized I really was doing things in an unscientific manner without any real justification for why I was doing it. Without a sound justification for evening it out, which LOC doesn't have, I decided not to do it.

Hope this helps a little bit. I very much see your point, and a number of engineers brought this up during user studies, but those big files that were getting flagged really were problems.

[1] http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4297...

A single 100 line file will be scored to be twice as buggy as the exact same code split into two 50 line files.

I suspect that isn't the case. Subjectively, longer files are harder to understand and in languages like Java (one file = one class, ignore inner classes etc) is likely to represent more complicated and tightly coupled code.

If the same file keeps appearing, then split it in half and see what happens...

The two 50-line files will probably have better internal cohesion, making them easier to test and review than one 100-line file.

I was thinking the same thing. I was expecting them to analyze the diffs and keep a cache of code blocks that are changing in each bugfix. Keeping track of bugs at the class/function/method level would seem to be more powerful.

Existing research points to "number of bug-fixing commits" as being enough to predict.

I would assume that's due to the fact that bugs are not always constrained to a single function, but usually to closely related code. Which often means all that related code is in the same file.

Indeed. Also, typically code "ownership" is at the file level -- and that's likely to correlate strongly with defects in situations where a peron who hasn't been well trained (or is just a bad programmer) is writing or maintaining the code.

You could cut up every file into chunks of 10 or so lines and then run those chunks through the algorithm. Seems like that should yield better results. You ought to be able to pull line numbers from the commits.

If you're into defect prediction, it's worth checking out the work the Empirical Software Engineering group at Microsoft Research has done over the years -- http://research.microsoft.com/en-us/groups/ese/publications....

https://github.com/igrigorik/bugspots - for anyone that's curious to try it on their own git repo.

(improvements welcome! :-))

This is awesome - might roll a Python port if I have some time in the next few days.

looks good, but you might move the pattern matching in scanner.rb out to a config file so users can search for other keywords.

This is so awesome, thanks igrigorik :)

I came to the comments hoping to find that this existed!

What I like about Google is that they take uncontroversial "should-dos" -- you should unit test, you should review, you should pay more attention to bug-prone models -- and then build and improve tools that supports those practices.

To me one of the reasons agile so swept our field was because it rode the back of the increasing availability of two genres of toolset: the unit testing framework and the bug/issue/story/feature tracker.

Trouble is, lots of nontrivial bugs are not module (or file) specific. The occur due to interaction between modules. Implicit assumptions on input are violated, invalid output is produced and continues propagating in the system.

[Hope this technology will help make Chrome more stable. Do they even consider resource leaks bugs?]

That doesn't matter so much, because this project is about highlighting delicate code regardless of reason, not assigning blame.

I would gladly use this tool and provide feedback if it were released as something that tacks onto git.

Or even better. On to Gerrit!

Is this feature in Gerrit?

Not yet!

What about files with limited or no history. Cycolmatic Complexity [1] to the rescue?

[1] http://en.wikipedia.org/wiki/Cyclomatic_complexity

Prevention is better than cure... So the next step would be to warn the developer before they make the code changes or during. Hooking into the IDE to visually indicate hot spots within the source file being edited on would be neat.

Yes, this is exactly right. We didn't have time in the period of an internship to hook into the IDE (along with all the other tasks, like user studies, algorithm design etc etc), so just went with the warning in source control.

Given more time, I absolutely would have done this. Nicolas Lopez at UC Irvine is thinking about putting tools like this at the IDE level, if you're interested.

Unfortunately there were 18 hot spots in the bug prediction code.

Also of note, this blog is beautiful in the classic view.

I'd vehemently differ, other than perhaps as a piece of abstract art: the black bar is attention-grabbing to negative effect, and the opening animation is self-indulgent, as is the massive whitespace atop. Also, the width of the page seems rather strange: why so wide, with no content?

The text column is narrow because it's easier for humans to read text in narrow columns. Also, negative space isn't necessarily bad -- it can be used to draw attention to important items.

Now that I think about it, if the black bar was lighter, then the page wouldn't balance visually. To be honest, I didn't actually notice what was on the black bar when I first read the article, and when you scroll down it collapses into something less visually imposing. The open animation could do with being a bit faster or vanishing altogether though. I feel like having a smooth close animation is important because the eye is attracted to sudden movement.

I have no problem with the narrow text column, what bugs me is the fact that the web page has horizontal scroll bars on my 1440-pixel wide display. Nothing on the page requires that much width.

You're right, it does. Didn't notice that. That is rather silly.

I cannot read beyond the first few lines on my Droid. Very disappointing.

Well, meh. It doesn't scroll when pressing the spacebar, so this is quite bad from a usability point of view.

neat idea. i whipped up a version in python (and for svn, our scm of choice) and plowed through some codebases. i mimic igrigorik's ruby/git version's output (it's very usable).

the code's up here:


I'll recommend Nicholas Taleb's "The Black Swan" to the authors of this prediction tool. Statistics cannot be used to "predict" future behavior of a complex system.

Statistics cannot be used to "predict" future behavior of a complex system

They certainly can, and commonly are! Think of weather prediction, Bayesian diagnostic systems, Google's statistical translator and spell checker, etc etc.

Taleb's point is that they fail sometimes - perhaps more often than you think - and so you should be very careful in placing bets on statistical predictions.

In this case the cost of a failed prediction is very low, and a successful prediction is worth quite a lot, so it is a net win for Google.

I've got a great bug prediction tool: ghc, the haskell compiler. Whenever I use a value of the wrong type my program won't compile. In my code it is always a bug when that happens. (Some smarter people can write non-buggy code the compiler doesn't like.)

Sometimes I can trick ghc into using the wrong type. Like when I use "String" to mean "File". That's when all the bugs show up as run-time errors. But I should know a string is not a file. And hopefully somebody writes a module that knows what a "File" really is and stops me from writing buggy code if I use it.

I am not surprised that you are being downvoted. In my experience, one may question the superiority of dynamic typing at HN only to the detriment of ones karma. Well, its not HN alone, dynamic languages are certainly popular and quite persuasively championed.

Now I belong to the "had been persuaded before, but now I am not so sure" category. More so after discovering absolutely stupid typos and compile time checkable errors in long running Python code that I had written. It really sucks when your code bails out after 16 hrs of number crunching because of some silly error on your part.

I used to think, "who makes type errors of all things?" it turns out that that idiot is me. I am not claiming that dynamic languages are bad, or that checked languages are always good, but I do benefit from checked code. A better programmer perhaps wouldn't benefit as much.

Rather than implementing a type checking system badly and in an ad-hoc manner, I now lean more towards compile time checked languages, especially for pieces of code that I expect will run for a long time.

I think the sweet spot would be optionally type-checked / type-inferred languages, for example Cython, Common Lisp. Strongly type-checked languages _tend_ to be all or nothing.

EDIT: @nl yeah! absolutely, they arent by any means a bug predicting algo.

Type systems are great. Other things that are great are unit tests, automatic code inspection tools and code reviews.

They aren't the same as bug prediction systems, though.

I think, more accurately, one may rave about the irrelevant merits of static typing to the detriment of one's karma.

Here's a bug from the real world: the login form doesn't render correctly in IE. When someone implements IE's layout algorithm as a haskell type, I'll start paying attention.

That there exist errors that cannot be type-checked makes type-checking irrelevant ? I dont quite see why exhaustiveness is a pre-requisite for relevance or attention. Is that a thumb rule you use for other things as well ?

I am not sure if type-checking even aspires to be exhaustive. As long as it can detect my stupid and costly mistakes at a cost that is cheaper than the cost of the mistakes, I am happy. But I can understand that for people who do not make costly and compile time checkable mistakes, it might not be worth it.

@tedunangst Got you. Your second paragraph in the previous comment threw me off. It does read like you consider type-checking unworthy of attention because it cannot catch all possible errors. Agree with your comment about people overstating their case. I think the noisiest of both the camps (if one can call it that) are equally guilty.

If the original post does not mean to convey "I don't need this because I have ghc", well, it kinda sounds like it does. I didn't say type checking was irrelevant. I said its merits are irrelevant in a discussion about how to find bugs that have already been checked in. Unless the devs are really sloppy, such code has already been compiled and run through whatever more or less exhaustive checking that entails.

Type checking is useful. I like it. But I don't like people talking about it like it's magic pixie dust.

Build an HTML generator of function PreHtml -> HTML that is quirk compatible with all browsers. Yes, you will has many bugs to fix over time. Now have all your templates code emit output of type PreHtml and your final page generation phase is the function above.

Pay attention here: any template that attempts to emit HTML will fail to Typecheck.

Not a panacea, but static enforcement of Don't Repeat Your Bugs, even before those particular bugs are discovered.

Hi, I'm Chris, author of the blog/work.

While I love me some type systems, I worry you're thinking a bit too code-centric.

For example, let's say I describe to you a really hard problem, and you come up with a solution (that type checks!), and then we launch to users and they find a use case we never thought of, and our code suddenly breaks. I am sure this has happened to you. The problem isn't at the code level (that's just the manifestation), but our current understanding of the problem.

IMHO, once you reach a certain level of competency, a sizable minority, if not majority, of bugs that make their way into a source control system (particularly one with code review, like Google's) are not line-by-line problems, but wider problems with actually understanding the task at hand and all that encompasses. When you work on the complexity of problems that Google does, you're tending towards a very large chunk of bugs being members of this class.

However, we don't have good tools (yet) to have a computer properly check the semantic meaning of our code. Bug prediction sits as a sort of baby step. It's the computer making a best-effort guess of where issues will be. The algorithm designed here is saying "Look this file has been a struggle. It's probably going to continue being a struggle, so we need to focus attention when we change anything to it."

Defect detection != defect prediction

Type errors are only one type of bug.

True, but the whole point of designing modern type systems is to attempt to catch as many common bugs in the static-check phase as possible (while still being decidable), rather than purely checking datatypes (int, string, etc.) in the classical sense. I don't think it's too unreasonable to compare it to empirical bug prediction: they're a rationalist vs. an empiricist approach to statistically predicting where bugs probably lie, with some various tradeoffs.

I'm not sure what your point is.

Yes, type systems can catch a class of bugs.

Google - like most of the software development world - has a lot of people who believe in type systems completely.

That's why they have tools like the Closure JS compiler[1], which provides type-checking for Javascript, and GWT[2], which produces (un-typesafe) Javascript/CSS/HTML from (mostly)typesafe Java.

Using methods of reducing bugs (such as type safety) is orthogonal to producing systems that predict where bugs will occur.

[1] http://code.google.com/closure/compiler/

[2] http://code.google.com/webtoolkit/

Your last statement's the part I disagree with, though it depends on what you mean by "orthogonal". If you mean that they're two different approaches that can coexist, then yes. But I don't think they target orthogonal classes of bugs. In both cases, the goal is to employ some algorithmic, decidable method at compile-time to predict whether a given piece of code is "correct" or "incorrect", trading off the possibility of a false positive or false negative. Each approach does better or worse in different cases, but I think in a manner that either isn't orthogonal, or at least isn't obviously orthogonal (if "orthogonal" is meant in any strong sense, rather than just "two different approaches"). And I think for any one, we can at least in principle ask whether the other one could've done it: e.g., if we're noticing a lot of bugs of a certain sort, could the type-checker have caught that? The ambition, at least, of static type systems is to render accurate bug-prediction impossible, because every bug statically predictable at compile-time will be a type error.

I think this misunderstands the goals of bug prediction, which are typically along the lines of

- directing engineering resources to where they're likely to be most leveraged from a quality standpoint [what Google focused on]

- estimating how many resources to devote to bug fixing and/or long it will take an in-progress code base to stabilize [the goal of a lot of the work Micorosft did with Windows in the early 2000s]

- and estimating post-release defects to estimate (and perhaps direct) customer support and maintenance resources [more typical in the hardware world than software]

And I wouldn't say that static type systems have an ambition of rendering bug-prediction impossible. If you can truly detect all errors at compile time, then prediction's trivial: no bugs remain!

>The ambition, at least, of static type systems is to render accurate bug-prediction impossible, because every bug statically predictable at compile-time will be a type error.

I might not use Haskell, but I do understand what a type system does, and why it is important.

BUT, not all bugs are statically predictable at compile-time. Take things like cross-browser compatibility - something like GWT goes a long way to reducing bugs with that, but no type system will protect you from a new bug in a new browser you need to work around.

(Edit: by orthogonal I meant "statistically independent". Given a piece of code written in a type safe language, this method will predict bugs independently of a type system.)

so your thesis is that a bug prediction system ought to be able to predict bugs that are not possible to statically predict?

I suppose that is a reasonable argument. i think different people in this argument are arguing about different things.

Type Systems are good at catching more than just type errors. For example, booleans are much better represented in the type system.

These are trivial kinds of errors. The big fuckups are invariably more subtle and people / requirements oriented.

When all you've got is a hammer ....

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact