You're not wrong, although bug prediction is a very nebulous term in the academic literature which encompasses a pretty broad range of techniques. The term is used to properly ground the work in what's come before.
It's perhaps useful to talk a little bit about how I got where I ended up, which wasn't discussed much in the post for length reasons. I originally spent a lot of time trying to implement FixCache, a pretty complicated algorithm that looks at things such as spacial locality of bugs (bugs appear close to each other), temporal locality of bugs (bugs are introduced at the same time) and files that change together (bugs will cross cut these). It then puts it all in a LIFO cache and flags files that way. This most definitely fits in the "bug prediction" line of work, and a number of follow-up papers independently verified its success. but I was struggling deploying FixCache. This was largely because the open source code that had been written beforehand at the university lab  wasn't designed for MapReduce, and also because I was becoming worried that it was very difficult to understand how it flagged files, which could lead to the tool being written off as outputting false positives and then ignored.
About halfway through the internship, a pre-print of the Rahman et al.  paper was handed to me, which indicated that maybe everything FixCache was doing could be massively simplified and get largely the same result. We trialled both algorithms using user studies, and found that developers thought the output of the Rahman algorithm better matched their intuition as to where the hotspots in their code base was, so we went with that.
So, after this very roundabout discussion, we reach the question of "If Rahman is a subset of the FixCache algorithm, and FixCache is a bug prediction algorithm, is Rahman a bug prediction algorithm, too?" and I'll leave that for others to judge, but that's why it's been described as so :)
Considering source file lengths roughly follow a Zipfian type distribution, it seems like at least the top 5% length files would get flagged regardless unless the bug distribution is extremely uneven.
Which is to say, always flagging the top 5% longest files as being among the buggiest has a good chance of being the right thing to do.
It makes no sense that if you took that 1000 line file and refactored it into 10 files and those 10 files had the same bugs as the original that the 1000 line file will be flagged as buggy but all 10 of the split files won't.
Cheating the algorithm is indeed a possibility and is actually very easy:
1. Just don't file bug tickets.
2. If you do, don't attach the tickets to changes.
3. If you attach the tickets to changes, don't put the "bug" type on the ticket.
Doing things to deliberately cheat the algorithm is likely not going to pass code review.
One common misconception is that the algorithm is a stick to beat developers with. It really, really isn't, and I go to great lengths to try and make this clear in the internal docs (although the blog post doesn't do this as much). The idea is just to provide another insight into the code, so that devs don't accidentally change something that's a real problem without help.
The algorithm is just a tool to help developers understand their code. It's not subjective because different teams use the bug tracker differently, so a score from a file in one project cannot be compared to a file from another project. We mitigate this by only taking off the top 10% of code across the entire codebase, so hopefully we by and large only flag the worst offenders. Part of what I wanted to do was to set up a web app so teams could claim various parts of the code base, and then run the algorithm on a team-by-team basis instead, but I ran out of time on that.
I think this is actually one of the nice things about the algorithm, because I don't like the idea of implicitly pitting people or teams against one another. I just like that people can go "Oh hey, this has been a real problem for us, let's tread carefully here."
You could have two very similar pieces of code, one of which is in one larger file with 1000 lines. Imagine this code gets 1 bug per day.
Now imagine the other code that is functionally identical happens to be broken up into 10 files. Now imagine that 9 bugs per day are applied to this unit of code that is essentially identical to the 1000 line file; the broken up code is exactly the same and is 9x buggier, but it will be scored lower because each file is only getting 0.9 bugs on average per day.
Do you disagree that the code in the second example would actually be way "trickier" and deserves to have a flag on it much more than the first one? I understand that one requirement for this is clarity of the algorithm to developers, but it seems like you could easily take the ratio of bug fixes to total commits on a file or normalize by file length if you wanted to actually get a reasonable metric of how "tricky" or dangerous it is to edit a file.
The point is simply that if you have a unit of code that has X bugs per day this will give a lower score if code that is actually buggier is across multiple shorter files.
You're on the right lines, and I spent some time thinking about this. There is a couple of things that led me not to concern myself with file length. Before I start, I would note that a file just being big and getting changed isn't a problem: it has to be changed and have a bug ticket with the "bug" classification put against it.
1. As others have stated, the research does strongly point towards LOC as being part of the metric that should be looked at.  is a paper that discusses LOC, but it's behind the IEEE pay wall and I wasn't able to find a free version. Sorry. I'm on holiday and a bit pressed for time this morning, but I can root around for more papers if you would like. Pretty much all the models we have for bug prediction will invariably throw LOC into the pot (although as they often do PCA too, one could argue that the LOC going down could be one of the components ;) ).
2. Google, by and large, is an OOP company. The intuitive reason for why LOC leads to defects is that the file is no longer meeting the principle of single responsibility. I would hypothesize that once your file has gone past this, defects are going to start to snowball as your file is on the way to becoming what I called a Gateway file, where large amounts of functionality pass in and out of the file, even if it only does simple stuff. So not only is it going to attract defects because it's long, but it might well attract even more defects because when other functionality is modified, that file has to as well. The added problem is these files resistant to refactoring because of the possibility of breaking so much functionality.
A Very Big Philosophical Argument is whether you would put configuration files under this classification, as config files affect all sorts of functionality. Personally, I would, but most of the engineers disagreed with me, so they were left out.
3. In a very unscientific way, I did experiment with all sorts of modifiers to try and even out files that may have been "unfairly" flagged due to some metric like LOC or number of commits or whatever. I eventually abandoned this, not just because I found the results largely unsatisfying, but I realized I really was doing things in an unscientific manner without any real justification for why I was doing it. Without a sound justification for evening it out, which LOC doesn't have, I decided not to do it.
Hope this helps a little bit. I very much see your point, and a number of engineers brought this up during user studies, but those big files that were getting flagged really were problems.
I suspect that isn't the case. Subjectively, longer files are harder to understand and in languages like Java (one file = one class, ignore inner classes etc) is likely to represent more complicated and tightly coupled code.
If the same file keeps appearing, then split it in half and see what happens...
I would assume that's due to the fact that bugs are not always constrained to a single function, but usually to closely related code. Which often means all that related code is in the same file.
(improvements welcome! :-))
To me one of the reasons agile so swept our field was because it rode the back of the increasing availability of two genres of toolset: the unit testing framework and the bug/issue/story/feature tracker.
[Hope this technology will help make Chrome more stable. Do they even consider resource leaks bugs?]
Given more time, I absolutely would have done this. Nicolas Lopez at UC Irvine is thinking about putting tools like this at the IDE level, if you're interested.
Now that I think about it, if the black bar was lighter, then the page wouldn't balance visually. To be honest, I didn't actually notice what was on the black bar when I first read the article, and when you scroll down it collapses into something less visually imposing. The open animation could do with being a bit faster or vanishing altogether though. I feel like having a smooth close animation is important because the eye is attracted to sudden movement.
the code's up here:
They certainly can, and commonly are! Think of weather prediction, Bayesian diagnostic systems, Google's statistical translator and spell checker, etc etc.
Taleb's point is that they fail sometimes - perhaps more often than you think - and so you should be very careful in placing bets on statistical predictions.
In this case the cost of a failed prediction is very low, and a successful prediction is worth quite a lot, so it is a net win for Google.
Sometimes I can trick ghc into using the wrong type. Like when I use "String" to mean "File". That's when all the bugs show up as run-time errors. But I should know a string is not a file. And hopefully somebody writes a module that knows what a "File" really is and stops me from writing buggy code if I use it.
Now I belong to the "had been persuaded before, but now I am not so sure" category. More so after discovering absolutely stupid typos and compile time checkable errors in long running Python code that I had written. It really sucks when your code bails out after 16 hrs of number crunching because of some silly error on your part.
I used to think, "who makes type errors of all things?" it turns out that that idiot is me. I am not claiming that dynamic languages are bad, or that checked languages are always good, but I do benefit from checked code. A better programmer perhaps wouldn't benefit as much.
Rather than implementing a type checking system badly and in an ad-hoc manner, I now lean more towards compile time checked languages, especially for pieces of code that I expect will run for a long time.
I think the sweet spot would be optionally type-checked / type-inferred languages, for example Cython, Common Lisp. Strongly type-checked languages _tend_ to be all or nothing.
@nl yeah! absolutely, they arent by any means a bug predicting algo.
They aren't the same as bug prediction systems, though.
Here's a bug from the real world: the login form doesn't render correctly in IE. When someone implements IE's layout algorithm as a haskell type, I'll start paying attention.
I am not sure if type-checking even aspires to be exhaustive. As long as it can detect my stupid and costly mistakes at a cost that is cheaper than the cost of the mistakes, I am happy. But I can understand that for people who do not make costly and compile time checkable mistakes, it might not be worth it.
@tedunangst Got you. Your second paragraph in the previous comment threw me off. It does read like you consider type-checking unworthy of attention because it cannot catch all possible errors. Agree with your comment about people overstating their case. I think the noisiest of both the camps (if one can call it that) are equally guilty.
Type checking is useful. I like it. But I don't like people talking about it like it's magic pixie dust.
Pay attention here: any template that attempts to emit HTML will fail to Typecheck.
Not a panacea, but static enforcement of Don't Repeat Your Bugs, even before those particular bugs are discovered.
While I love me some type systems, I worry you're thinking a bit too code-centric.
For example, let's say I describe to you a really hard problem, and you come up with a solution (that type checks!), and then we launch to users and they find a use case we never thought of, and our code suddenly breaks. I am sure this has happened to you. The problem isn't at the code level (that's just the manifestation), but our current understanding of the problem.
IMHO, once you reach a certain level of competency, a sizable minority, if not majority, of bugs that make their way into a source control system (particularly one with code review, like Google's) are not line-by-line problems, but wider problems with actually understanding the task at hand and all that encompasses. When you work on the complexity of problems that Google does, you're tending towards a very large chunk of bugs being members of this class.
However, we don't have good tools (yet) to have a computer properly check the semantic meaning of our code. Bug prediction sits as a sort of baby step. It's the computer making a best-effort guess of where issues will be. The algorithm designed here is saying "Look this file has been a struggle. It's probably going to continue being a struggle, so we need to focus attention when we change anything to it."
Yes, type systems can catch a class of bugs.
Google - like most of the software development world - has a lot of people who believe in type systems completely.
Using methods of reducing bugs (such as type safety) is orthogonal to producing systems that predict where bugs will occur.
- directing engineering resources to where they're likely to be most leveraged from a quality standpoint [what Google focused on]
- estimating how many resources to devote to bug fixing and/or long it will take an in-progress code base to stabilize [the goal of a lot of the work Micorosft did with Windows in the early 2000s]
- and estimating post-release defects to estimate (and perhaps direct) customer support and maintenance resources [more typical in the hardware world than software]
And I wouldn't say that static type systems have an ambition of rendering bug-prediction impossible. If you can truly detect all errors at compile time, then prediction's trivial: no bugs remain!
I might not use Haskell, but I do understand what a type system does, and why it is important.
BUT, not all bugs are statically predictable at compile-time. Take things like cross-browser compatibility - something like GWT goes a long way to reducing bugs with that, but no type system will protect you from a new bug in a new browser you need to work around.
(Edit: by orthogonal I meant "statistically independent". Given a piece of code written in a type safe language, this method will predict bugs independently of a type system.)
I suppose that is a reasonable argument. i think different people in this argument are arguing about different things.
When all you've got is a hammer ....