
Code History Miner - ingve
http://codehistoryminer.com/index.html
======
stevebmark
Interesting project. Serious question: Have repo analytics actually given you
any useful results? Especially pertaining to a business? There are no case
studies here. So many companies / people say "we'll help you analyze and view
your data!" but then come up short for how that data practically affects
anything. Looks like you're going down that path As A Business™. What sorts of
results are you expecting to see?

~~~
sdesol
I'm not the poster, but I've done some grad research on software metrics. It
is a hot topic, but if you look at it objectively, it does offer value. What
has stuck with me the most from my research is that metrics should be used to
gauge risk and quality, not performance.

Microsoft's research division has published a bunch of academic papers on code
churn related metrics like the following:

[http://research.microsoft.com/apps/pubs/default.aspx?id=6912...](http://research.microsoft.com/apps/pubs/default.aspx?id=69126)

And if you do a quick google search, you'll find others like:

[http://web2-clone.research.att.com/export/sites/att_labs/tec...](http://web2-clone.research.att.com/export/sites/att_labs/techdocs/TD_100504.pdf)

However, looking at what was posted, I'm not sure the product calculates code
churn in the same way that I've learned it. My understanding, and the way it
is used in academic research, is that code churn is the total number of lines
added, changed and deleted between two points. The ability to distinguish
between line additions, changes and deletions is important.
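
As a minimal sketch of that definition, the snippet below counts added, changed and deleted lines between two versions of a file using Python's standard `difflib`. The function name `line_churn` and the rule for splitting a replaced block into "changed" versus "added"/"deleted" lines are assumptions made for illustration; they are not taken from the linked papers or from the product.

```python
from difflib import SequenceMatcher


def line_churn(old_lines, new_lines):
    """Count added, changed and deleted lines between two versions.

    Assumed convention: when a block of lines is replaced, the overlapping
    part counts as "changed" and any surplus counts as added or deleted.
    """
    added = changed = deleted = 0
    matcher = SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        old_n, new_n = i2 - i1, j2 - j1
        if tag == "replace":
            overlap = min(old_n, new_n)
            changed += overlap
            added += new_n - overlap
            deleted += old_n - overlap
        elif tag == "insert":
            added += new_n
        elif tag == "delete":
            deleted += old_n
    return {
        "added": added,
        "changed": changed,
        "deleted": deleted,
        "total": added + changed + deleted,
    }
```

Calling `line_churn(old.splitlines(), new.splitlines())` on two revisions of the same file gives the three counts separately, which is what makes it possible to tell restructuring apart from edits to existing lines.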

For example, as you get closer to a release milestone, you would expect the
churn to be line changes as opposed to adding or removing lines.
Adding/removing lines is an indication of restructuring, while line changes
would be modifications to existing source.

There also does not appear to be any obvious way to drill into the churn data
to see what files are creating the churn. Maybe this information is in the
database, but being able to identify areas of churn is important. Having churn
in your documentation is definitely not as risky as having churn in your core
code as you get closer to a certain milestone.
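
One rough way to see which files are creating the churn is to aggregate the output of `git log --numstat` per path. The sketch below is only an approximation: numstat reports added and deleted lines but has no "changed" category, and the helper name `churn_by_file` is made up for this example.

```python
from collections import Counter


def churn_by_file(numstat_text):
    """Sum added + deleted lines per path from `git log --numstat` output.

    Each numstat line looks like "<added>\t<deleted>\t<path>". Binary files
    show "-" for both counts and are skipped here. Renamed files appear
    under their rename notation and are not merged (a known simplification).
    """
    totals = Counter()
    for line in numstat_text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # commit headers, blank lines, etc.
        added, deleted, path = parts
        if added == "-" or deleted == "-":
            continue  # binary file, no line counts
        totals[path] += int(added) + int(deleted)
    # Highest churn first.
    return totals.most_common()
```

Feeding it the text produced by something like `git log --since="30 days ago" --numstat --format=` would list the busiest files first, so churn in documentation can be told apart from churn in core code.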

I'm currently working on better integrating my code churn technology with
GitHub and I have some examples at

[http://imgur.com/a/d4avE](http://imgur.com/a/d4avE)

which goes over how code churn metrics can be used to better understand repo
changes.

Edit:

I didn't realize the charts were interactive, but it does appear to recognize
lines added, changed and deleted. However it is not clear how the churn is
calculated. Not sure if it is a cumulative churn or if it is diff churn.

~~~
golergka
That sounds really interesting. You seem to know a lot about this matter; if
you were to write simple instructions on how to (1) get stats from a typical
production repo, (2) interpret these stats and (3) take action based on these
stats, I think a lot of people would find it very valuable. So far, I've only
seen solutions offering 1 and 2, without even suggesting 3, and they were
typically focused more on wow factor than on being useful for business.
Personally, I would definitely pay for a reasonable subscription to a SaaS
that would do that for me with minimal setup hassle: while I have vague ideas
about how to interpret repository stats, I don't have enough experience doing
it to be sure, and certainly not enough time to set it up.

~~~
sdesol
Interpreting the metrics is always going to be a hard thing and I really can't
see how this can become a SaaS business. The metrics generation part can
certainly be, but how you should react to the numbers will depend on the
software being developed, the people developing the software, and so forth.
Basically there are a lot of variables to consider.

However, if step 1 (collecting the metrics) is done poorly, you'll have no
real way of making sense of what is going on, and people tend to use this to
push some agenda. The most important thing, in my mind, is being able to
capture information in an objective way. Developers, managers, project
managers, etc. should be able to scrutinize the numbers. To be able to do
this, you need to capture every line change and abstract that change in a way
that makes sense to upper management.

For example, since GitLab is a hot topic right now, I've decided to index it
to see what is going on. If you go to

[https://github.com/gitlabhq/gitlabhq/graphs/contributors](https://github.com/gitlabhq/gitlabhq/graphs/contributors)

you'll find GitHub's metrics for GitLab development. I like GitHub, but none
of their metrics provide any indicators of code quality or of where to look
for potential code quality problems. The closest thing is this:

[https://github.com/gitlabhq/gitlabhq/graphs/code-frequency](https://github.com/gitlabhq/gitlabhq/graphs/code-frequency)

but their date range is way too wide to provide any meaningful metrics. Those
large spikes in the past are creating way too much noise. However, if you go to

[http://imgur.com/lTW3lZh](http://imgur.com/lTW3lZh)

you'll find a screenshot that shows the code churn for the master branch for
the last 30 days. And that large spike not too long ago should make you wonder
what's going on, and this is where being able to dig into the numbers is
important.

Since everything is captured in a SQL database, a simple query shows the
following:

13376 (total), 13376 (add), 0 (chg), 0 (del), fixtures/emojis/index.json

which explains everything. What this means to you will vary, and it's why I
don't think you can really make a SaaS business out of interpreting the
numbers. The best you can do is capture everything under the sun and make it
as easy as possible for people to make sense of the numbers.
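
For illustration only, here is the kind of query that could produce a line like the one above, using an in-memory SQLite database. The table name `file_churn` and its columns are hypothetical; the actual schema behind the comment is not shown in this thread. The first row reuses the figures quoted above, and the second row is made up purely for contrast.

```python
import sqlite3


def top_churn_files(conn, limit=5):
    """Return (total, added, changed, deleted, path) rows, highest churn first."""
    return conn.execute(
        """
        SELECT added + changed + deleted AS total,
               added, changed, deleted, path
        FROM file_churn
        ORDER BY total DESC
        LIMIT ?
        """,
        (limit,),
    ).fetchall()


# Hypothetical schema and sample data for the sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE file_churn "
    "(path TEXT, added INTEGER, changed INTEGER, deleted INTEGER)"
)
conn.executemany(
    "INSERT INTO file_churn VALUES (?, ?, ?, ?)",
    [
        ("fixtures/emojis/index.json", 13376, 0, 0),  # figures quoted above
        ("app/models/example.rb", 12, 30, 4),         # made-up row for contrast
    ],
)

for row in top_churn_files(conn):
    print(row)
```

A spike that turns out to be a single generated fixture file carries a very different risk than the same number of lines spread across core code, which is why the per-file breakdown matters.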

~~~
sytse
Cool, we're thinking of adding code analytics features to GitLab 8.5. What do
you think of [https://gitlab.com/gitlab-org/gitlab-ee/issues/112](https://gitlab.com/gitlab-org/gitlab-ee/issues/112)?

~~~
sdesol
I was actually going to fire off an email to you or comment on the issue at
GitLab to further discuss GitLab's metrics plans/ambitions. Having spent years
refining my indexing engine, I know being able to capture data accurately is
not a trivial thing to do at the enterprise level (hundreds of repositories
with tens of thousands of branches) and the bullet points in the issue are
certainly lofty if you want to make the numbers meaningful and drillable.

I also can't overstate the importance of branch level indexing. It is what
people always neglect to do, since it doesn't scale well. However, with
branch level indexing, you can answer questions like "where do we stand with
release x and y?" very easily.

Using GitLab development as an example, if a project manager wants to know how
aligned release 8.4 and master are, the only option they have right now is to
do a compare, which results in this:

[https://gitlab.com/gitlab-org/gitlab-ce/compare/master...8-4...](https://gitlab.com/gitlab-org/gitlab-ce/compare/master...8-4-stable)

which can be quite overwhelming. Or you can tell them how many commits the
branches are behind and ahead of one another, which is interesting but doesn't
put things into perspective.

However, if you've indexed things at the branch level, you'll be able to
provide them with a higher level picture like this:

[http://imgur.com/lzHlk2P](http://imgur.com/lzHlk2P)

which shows the deviation didn't start until around the 12th. And with further
manipulation, you can show them something like this

[http://imgur.com/yewQJBc](http://imgur.com/yewQJBc)

which shows the unique commits per branch. And from there, you can start
drilling into the data like so

[http://imgur.com/58FExxL](http://imgur.com/58FExxL)

to see what commits they were and what issues they fixed.
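
As a simplified sketch of the "unique commits per branch" idea, the snippet below compares two lists of commit IDs, for example as obtained from `git rev-list <branch>`. It is not how the indexing engine described here works; the function name `compare_branches` is invented for the example, and commits are treated as identical only when their IDs match, so cherry-picked commits with different IDs would count as unique on both sides.

```python
def compare_branches(commits_a, commits_b):
    """Split two branches' commit IDs into unique-to-each and shared.

    Inputs are iterables of commit ID strings. Results are sorted lists so
    the output is stable.
    """
    a, b = set(commits_a), set(commits_b)
    return {
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
        "shared": sorted(a & b),
    }


# Hypothetical commit IDs, for illustration only.
master = ["c1", "c2", "c3", "c4"]
stable = ["c1", "c2", "c5"]
result = compare_branches(master, stable)
print(len(result["only_a"]), "commits only on master")
print(len(result["only_b"]), "commits only on stable")
```

The lengths of `only_a` and `only_b` are the same ahead/behind counts that `git rev-list --left-right --count master...8-4-stable` reports; keeping the IDs themselves is what allows drilling down into the individual commits.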

Without branch level indexing, providing these types of metrics would not be
possible. And as I've stated before, if you are focused on capturing the
smallest detail, you'll be able to make the metrics meaningful for everybody.

Let me know the best way to get in touch, as making my technology
available for GitLab is definitely on the roadmap.

~~~
sytse
Looks cool, and good point about providing the analytics per branch. I would
love to get in touch, please email sytse@ company domain.

------
ChristianBundy
Please don't use an auto-carousel, _especially_ for interactive features.
There's nothing more annoying than clicking a drop-down only to have the
carousel move.

------
johnmaguire2013
I've considered doing something like this myself for fun. I thought something
like "code ownership by file" could be useful for example.
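
One way "code ownership by file" could be approximated is by counting the author of each line in the output of `git blame --line-porcelain <file>`. This is a sketch under that assumption, with the made-up helper name `ownership_from_blame`; it measures who last touched each line, which is only one possible notion of ownership.

```python
from collections import Counter


def ownership_from_blame(porcelain_text):
    """Return each author's share of lines from `git blame --line-porcelain`.

    In that format every source line is preceded by header lines, one of
    which is "author <name>". File content lines start with a tab, so they
    cannot be mistaken for headers.
    """
    counts = Counter(
        line[len("author "):]
        for line in porcelain_text.splitlines()
        if line.startswith("author ")
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {author: n / total for author, n in counts.items()}
```

Running it over every tracked file and keeping the top author per path would give a rough ownership map of the repository.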

