

Show HN: My implementation of Google's bug prediction algorithm - d0vs

Hi, everyone!<p>I'm a high-school student. I saw that blog post about bug prediction on Google Engineering Tools blog (&#60;http://google-engtools.blogspot.com/2011/12/bug-prediction-at-google.html&#62;), and today I decided to implement it in Python, just for fun.<p>It's buggy on some repos due to issues with GitPython.<p>&#60;http://pypi.python.org/pypi/bugspots&#62;<p>Anyway, what do you think?<p><i>I know, weird title: character limiting.</i>
======
jnazario
had a look at the code (specifically version 1.0). a couple of comments.

first it's not bad! much better code than when i was your age.

second, you have a huge assumption, i think, at line 73 where your code has
"repo_age = int(time.time()) - first_commit_time". this assumes that the
repository has seen edits recently, but think about a repository that has been
dormant for a few months. the score of any bugfix will go down but there's no
reason for it to: did bugs get fixed - and less "hot" - when the repo was
dormant? no. what you need to do is to find the total timespan of the
repository, looking at the min and max times of the commits, then base your
scores on that.

keep up the good work!

~~~
d0vs
Thanks for your comment!

About your concern: I too think it's strange but they put it clearly in their
blog post:

"The timestamp used in the equation is normalized from 0 to 1, where 0 is the
earliest point in the code base, and 1 is now (where __now is when the
algorithm was run __)."

I could do `repo_age = commits[0]["time"] - commits[-1]["time"]` but I'd like
to stick to what the blog post says.

~~~
Lewisham
(I'm the author of the original blog post/algorithm)

For Google, this makes sense, although I see now it's a weird idiosyncrasy
outside of the Googleplex :)

Google has changes all year round, so "now" always represents a day of some
changes. The algorithm is run on the entire code base, so now is a perfectly
normal spot to go with. It didn't occur to me that really what now represents
is "last time of changes". This is a good spot, and I'll put it in the paper
I'm writing! Thanks!

A reasonable/desirable modification is probably for now to be changed to
represent the last change, just like the first change is when the
normalization window begins. (0 = time of first change, 1 = time of last
change).

HTH.

~~~
d0vs
Thanks for clearing that out, I now remember that "50% of the Google codebase
changes every month."

