
A Large Scale Study of Programming Languages and Code Quality in GitHub [pdf] - oskarth
http://macbeth.cs.ucdavis.edu/lang_study.pdf
======
luu
The claim:

“Most notably, it does appear that strong typing is modestly better than weak
typing, and among functional languages, static typing is also somewhat better
than dynamic typing. We also find that functional languages are somewhat
better than procedural languages.”

But how did they determine that?

The authors looked at the 50 most starred repos on github, for each of the 20
most popular languages plus typescript (minus CSS, shell, and vim). For each
of these projects, they looked at the languages used (e.g., projects that
aren’t primarily javascript often have some javascript).
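
For context, here is a minimal sketch (mine, not the authors') of what that
selection step looks like against GitHub's public search API today; the paper
worked from an older snapshot, so the actual lists would differ:

    import requests

    # Top 50 most-starred repositories for one language via GitHub's
    # search API (unauthenticated requests are heavily rate limited).
    resp = requests.get(
        "https://api.github.com/search/repositories",
        params={"q": "language:haskell", "sort": "stars",
                "order": "desc", "per_page": 50},
        headers={"Accept": "application/vnd.github+json"},
    )
    resp.raise_for_status()
    for repo in resp.json()["items"]:
        print(repo["full_name"], repo["stargazers_count"])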

They then looked at commit/PR logs to figure out how many bugs there
were for each language used. As far as I can tell, open issues with no
associated fix don’t count towards the bug count. Only commits that are
detected by their keyword search technique were counted.
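
The classifier is essentially a keyword match over commit messages. A rough
sketch of the idea; the exact keyword list below is my approximation of the
error-related terms the paper describes, not necessarily theirs:

    import re

    # Error-related keywords of the kind the paper searches commit logs for.
    FIX_PATTERN = re.compile(
        r"\b(error|bug|fix(es|ed)?|issue|mistake|incorrect|fault|defect|flaw)\b",
        re.IGNORECASE,
    )

    def is_bug_fix(commit_message: str) -> bool:
        """Count a commit as a defect fix if its message matches."""
        return bool(FIX_PATTERN.search(commit_message))

A fix described in any other words simply never enters the count.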

After determining the number of bugs, the authors ran a regression,
controlling for project age, number of developers, number of commits, and
lines of code.
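
In sketch form, that regression step amounts to something like the following
(synthetic data, and the covariate coding is my guess; the paper fits a
negative binomial model on log-transformed controls):

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    n = 300
    df = pd.DataFrame({
        "language": rng.choice(["c", "java", "ruby"], n),  # hypothetical mix
        "log_age": rng.normal(7, 1, n),
        "log_devs": rng.normal(3, 1, n),
        "log_commits": rng.normal(6, 1, n),
        "log_size": rng.normal(9, 1, n),
    })
    df["bug_fixes"] = rng.poisson(np.exp(df["log_commits"] - 4))

    nbr = smf.glm(
        "bug_fixes ~ language + log_age + log_devs + log_commits + log_size",
        data=df,
        family=sm.families.NegativeBinomial(),
    ).fit()
    print(nbr.summary())  # per-language coefficients, as in their RQ1 table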

That gives them a table (covered in RQ1) that correlates language to defect
rate. There are a number of logical leaps here that I’m somewhat skeptical of.
I might have believed them if the results were plausible, but a number of the
results in their table are odd.

The table “shows” that Perl and Ruby are as reliable as each other and
significantly more reliable than Erlang and Java (which are also equally
reliable), which are significantly more reliable than Python, PHP, and C
(which are similarly reliable), and that typescript is the safest language
surveyed.

They then aggregate all of that data to get to their conclusion.

I find the data pretty interesting. There are lots of curious questions here,
like why there are more defects in Erlang and Java than in Perl and Ruby. The
interpretation they seem to come to from their abstract and conclusion is that
this intermediate data says something about the languages themselves and their
properties. It strikes me as more likely that this data says something about
community norms (or that it's just noise), but they don’t really dig into
that.

For example, if you applied this methodology to the hardware companies I'm
familiar with, you'd find that Verilog is basically the worst language ever
(perhaps true, but not for this reason). I remember hitting bug (and fix) #10k
on a project. Was that because we had sloppy coders or a terrible language
that caused a ton of bugs? No, we were just obsessive about finding bugs and
documenting every fix. We had more verification people than designers (and
unlike at a lot of software companies, test and verification folks are first
class citizens), and the machines in our server farm spent the majority of
their time generating and running tests (1000 machines at a 100 person
company). You’ll find a lot of bugs if you run test software that’s more
sophisticated than Quickcheck on 1000 machines for years on end.

If I had to guess, I would bet that Erlang is “more defect prone” than Perl
and Ruby not because the language is defect prone, but because the culture is
prone to finding defects. That’s something that would be super interesting to
try to tease out of the data, but I don't think that can be done just from
github data.

~~~
ggreer
You've done a good job of pointing out weaknesses in the paper, but I have a
question about your defect rate argument: Would you hold the same position if
the study had said the opposite? In other words, if it claimed that Ruby and
Perl had more bug fixes (and thus defects) than Erlang and Java, would you
claim the study was flawed due to a culture of meticulous bug-finding in Perl
and Ruby?

From Conservation of Expected Evidence[1]:

 _If you try to weaken the counterevidence of a possible "abnormal"
observation, you can only do it by weakening the support of a "normal"
observation, to a precisely equal and opposite degree._

It really seems like a stretch to say that higher bug fix counts aren't due to
higher defect rates, or that higher defect rates are a sign of a better
language. Language communities have a ton of overlap, so it seems unlikely
that language-specific cultures can diverge enough to drastically affect their
propensity to find bugs.

1\.
[http://lesswrong.com/lw/ii/conservation_of_expected_evidence...](http://lesswrong.com/lw/ii/conservation_of_expected_evidence/)
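
The point is easy to check numerically. With made-up numbers for the prior
and the likelihoods (mine, purely to illustrate the theorem):

    # P(H): "Erlang code really is more defect-prone".
    prior = 0.5
    p_obs_given_h = 0.8        # P(seeing more fix commits | H)
    p_obs_given_not_h = 0.4    # P(seeing more fix commits | not H)

    p_obs = prior * p_obs_given_h + (1 - prior) * p_obs_given_not_h

    post_if_obs = prior * p_obs_given_h / p_obs              # 0.667
    post_if_not = prior * (1 - p_obs_given_h) / (1 - p_obs)  # 0.25

    # The expectation of the posterior equals the prior, exactly:
    print(p_obs * post_if_obs + (1 - p_obs) * post_if_not)   # 0.5

Discount the evidential weight of "more fix commits" and you are forced to
grant more weight to "fewer fix commits", to a precisely equal degree.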

~~~
loup-vaillant
Personally, I read only the abstract, and was pleased to see it confirmed
my own preferences: static typing is better, strong typing is better. But then
I saw this disclaimer:

> _It is worth noting that these modest effects arising from language design
> are overwhelmingly dominated by the process factors such as project size,
> team size, and commit size._

So I dismissed the paper as "not conclusive", at least for the time being. I
wouldn't be surprised if their findings were mostly noise, or the product of a
confounding factor they missed.

By the way, I recall some other paper saying that code size is the single
most significant predictor of everything else. That means more concise and
expressive languages, which yield smaller programs, will also reduce the time
to completion as well as the bug rate. But if this study corrects for project
size while ignoring the problems being solved, then it overlooks one of the
most important effects of a programming language on a project: its size.
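
That concern is easy to demonstrate with a toy simulation (all effect sizes
made up): if a language's real benefit is that it shrinks the program, then
controlling for code size controls the benefit away:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    n = 2000
    concise = rng.integers(0, 2, n)  # 1 = the more concise language
    loc = np.exp(6 - 1.0 * concise + rng.normal(0, 0.3, n))  # smaller programs
    df = pd.DataFrame({
        "concise": concise,
        "log_loc": np.log(loc),
        "bugs": rng.poisson(0.005 * loc),  # bugs scale only with code size
    })

    # Without the size control, the concise language looks much safer...
    print(smf.poisson("bugs ~ concise", df).fit(disp=0).params)
    # ...control for size, and the language "effect" all but vanishes.
    print(smf.poisson("bugs ~ concise + log_loc", df).fit(disp=0).params)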

------
breuderink
> For those with positive coefficients we can expect that the language is
> associated with, ceteris paribus, a greater number of defect fixes. These
> languages include C, C++, JavaScript, Objective-C, Php, and Python. The
> languages Clojure, Haskell, Ruby, Scala, and TypeScript, all have negative
> coefficients implying that these languages are less likely than the average
> to result in defect fixing commits.

Notice how top projects in popular languages tend to have more fixes than top
projects in more obscure languages. Perhaps these projects simply have more
users, leading to more reported bugs and community pressure? Interesting
paper, but it is all too easy to jump to conclusions.

------
oskarth
From the conclusion:

 _The data indicates functional languages are better than procedural
languages; it suggests that strong typing is better than weak typing; that
static typing is better than dynamic; and that managed memory usage is better
than unmanaged._

Obviously not the last word, but interesting study nonetheless.

~~~
djur
The worst outcome was for Procedural-Static-Weak-Unmanaged, aka C and C++.
There is no OO category, since most OO languages can be (and are) used in a
procedural style.

~~~
pathikrit
Because there are no Procedural-Dynamic-Weak-Unmanaged languages :)

~~~
abecedarius
Forth! (I haven't read the paper.)

~~~
Immortalin
Forth is untyped, not dynamically typed. ;)

------
ChrisCinelli
The sample size is too low. It also seems not to give enough weight to the
age of a project. Older projects are usually more prone to defects given the
amount of spec changes they have had to go through.

The results for TypeScript are completely wrong. Bitcoin, litecoin, and
qBittorrent do not have any TypeScript code; GitHub was misclassifying their
Qt ".ts" translation files as TypeScript:
[http://bitcoin.stackexchange.com/questions/22311/why-does-gi...](http://bitcoin.stackexchange.com/questions/22311/why-does-github-say-that-the-bitcoin-project-is-74-typescript)

Double facepalm!

~~~
robert-boehnke
Not only are they not TypeScript, but the ones I checked were mostly C++,
which is on the other end of the defect-proneness spectrum, according to this
study.

------
djur
This seems to be really well done and I'm looking forward to reading it in
more detail, but one thing jumped out at me: one of the keywords used to
identify fixes for programming errors is "refactoring". Refactoring doesn't
necessarily (and actually shouldn't) refer to fixing a defect.
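
To make the failure mode concrete (hypothetical commit messages, against a
keyword filter of the kind the paper describes):

    import re

    FIX_PATTERN = re.compile(r"error|bug|fix|issue|defect|refactor", re.I)

    messages = [
        "Fix off-by-one in pagination",                   # a genuine defect fix
        "Refactoring: extract render loop into a module", # behavior unchanged
    ]
    for m in messages:
        print(bool(FIX_PATTERN.search(m)), m)  # both match, both get counted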

EDIT: The Procedural/Scripting split seems overdetermined. All of the
procedural languages have static typing, and all the scripting languages have
dynamic typing.

~~~
tarvaina
Indeed. Refactoring means restructuring code in a way that doesn't change its
external behavior. If we define bug as incorrect external behavior of code,
then refactoring excludes bug fixes.

Refactoring doesn't necessarily even mean that the original code was
structurally bad. When new features are added, code gets more complicated,
requiring more abstraction. I like to abstract code via refactoring when we
need it, not before. Then refactoring changes good code to good code. It's
just that the new situation places different demands on the code than the
old one did.

------
zzzcpan
The thing is, we already know that some languages are much more error prone
than others. But languages are not entirely to blame. Before we can work out
the proper correlations, we should answer the question: how did these errors
come to be?

For example, what caused a typo in a string somewhere, so that it now
contains "foo" instead of "bar"? Likely cognitive overload, because the code
was too complex to keep entirely in memory; i.e., the author was too busy
processing the code in his head to notice the typo he had just made.
Therefore code with lower cognitive load is likely to have fewer bugs like
this, or even fewer overall. Some programmers have learned to prevent string
typos with appropriate test cases (see the sketch below), and some keep their
code as simple as possible, i.e. with the lowest cognitive load. So we should
see correlations with which languages such programmers prefer and which
languages encourage such coding practices.
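
A toy version of the kind of test that catches such a typo (hypothetical
function and test, purely illustrative):

    def greeting(name: str) -> str:
        return f"Hello, {name}!"

    def test_greeting():
        # Fails the moment the literal gets fat-fingered into "Helo".
        assert greeting("world") == "Hello, world!"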

This is a complex subject that has much more to do with psychology than with
technology, and it should be studied as such. Trying to study bugs without
touching psychology is pretty much BS.

------
apta
It was interesting to observe that Go/Golang, while supposedly designed for
concurrency, ranks the "worst" in terms of concurrency bugs (see Table 8 in
the paper). I presume this is due to the absence of any way to specify
immutability in the language. Of course, it could also mean that because Go
has concurrency primitives built in, people are more willing to use
concurrency. It would be interesting to see what effect a stronger type
system would have on such error rates.

I don't get this statement though: _The enrichment of race condition errors in
Go is likely because the Go is distributed with a race-detection tool that may
advantage Go developers in detecting races._ I thought that including a race
detector would _reduce_ race condition errors. Am I missing something?

It was also interesting to see it rank low in the security correlation
ranking.

~~~
rrradical
They aren't actually analyzing whether the software has race condition errors;
they are analyzing whether there are commits to fix race condition errors. If
more errors can be detected, more will show up in commit logs and so the
language will be deemed error-prone.

------
tomcam
First off, props to this krewe for tackling a large and scary topic, for
exceptional clarity, and especially for their humility. At about a 1% spread,
the differences among languages are negligible.

Even a sample this size is too small:

 _A first glance at Figure 1(a) reveals that defect proneness of the languages
indeed depends on the domain. For example, in the Middleware domain JavaScript
is most defect prone (31.06% defect proneness). This was little surprising to
us since JavaScript is typically not used for Middleware domain. On a closer
look, we find that JavaScript has only one project, v8 (Google’s JavaScript
virtual machine), in Middleware domain that is responsible for all the
errors._

~~~
paulajohnson
1% is not the "spread" (whatever that is); it's the proportion of the
variance in bug fixes that is attributable to language. The other 99% is
attributable to the number of commits (i.e. projects with more commits have
more bugs. Well, gee.)

Once you factor out the level of commit activity, the language influence is
actually quite large.

------
paulajohnson
The authors' statement that the language effects are "small" is very strange:
the only non-language "big" effect they found was that projects with lots of
commits have lots of bugs, i.e. that 99% of the bug count variance was due to
commit count variance. Well, gee.

Once you factor out commit count, the impact of language turns out to be quite
large. A Haskell project can expect to see 63% of the bug fixes that a
comparable C++ project would see. I don't call a 37% drop in bugs "small".
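
The arithmetic, taking the 63% figure at face value: in a negative binomial
regression, language coefficients act multiplicatively on the expected fix
count, so the implied coefficient gap is just a logarithm:

    import math

    ratio = 0.63            # Haskell fixes / C++ fixes, as claimed above
    print(math.log(ratio))  # implied coefficient gap, about -0.46
    print(1 - ratio)        # i.e. roughly a 37% drop in expected fixes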

------
didibus
This is dumb: you cannot compare projects of different difficulty and
different scope with one another. You also can't isolate the simple randomness
of which programmers happen to be doing the work on each of these, which could
by itself account for the differences. Also, having a lot of bugs reported can
actually be a good sign, as opposed to having a lot of bugs go unreported.

Anyhow, I don't think this says anything about anything.

------
lifeisstillgood
I just love this stuff. I am seriously considering a research MSc on FOSS /
methodologies, so this is catnip to me.

However, it screams "well, we did not find anything much" (which is good
science anyway, so props to them).

Choice of language contributes just 1% to the variance in bugs; that is, if
you want to improve your project's quality, don't bother looking at language
choice.

It's an interesting result, but it does rather beg the question: what drives
the other 99%?

~~~
toast0
The other 99% is people. If I write terrible code (which I probably do), I'll
write terrible PHP, terrible Perl, and terrible Erlang, and maybe a little bit
of terrible C or C++ from time to time, and some terrible Java if I must.

~~~
paulajohnson
No, the other 99% is project size.

------
davidD02
I wonder if a generalized linear mixed model approach to the analysis might
be better; I am curious whether between-project variation in error rate
differed within each language. Also, interpreting the proportion of model
deviance explained ("1%") is deceptive: committer count and project size are
nuisance variables, and what we want to know is how large the differences
between languages would be when applied to the same project. We should also be
interested in interactions between project size and language. The paper needs
summary plots of the raw data.
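
A minimal sketch of the random-intercepts idea on synthetic data (a linear
mixed model on log counts as a rough stand-in for a count GLMM; every name
and effect size here is made up):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(1)
    n, n_proj = 600, 100
    df = pd.DataFrame({
        "project": rng.integers(0, n_proj, n),
        "language": rng.choice(["c", "haskell", "ruby"], n),
        "log_commits": rng.normal(6, 1, n),
    })
    proj_effect = rng.normal(0, 0.3, n_proj)  # true between-project variation
    df["log_bugs"] = (0.9 * df["log_commits"]
                      + proj_effect[df["project"].to_numpy()]
                      + rng.normal(0, 0.5, n))

    # A random intercept per project soaks up between-project variation,
    # so the language coefficients are compared within projects.
    fit = smf.mixedlm("log_bugs ~ language + log_commits", df,
                      groups=df["project"]).fit()
    print(fit.summary())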

------
swuecho
For Perl, there are three projects in the paper: gitolite, showdown, and
rails-dev-box.

gitolite is a Perl project.

showdown?
[https://github.com/showdownjs/showdown](https://github.com/showdownjs/showdown)
It has some Perl code in it, but it is a JS project.

rails-dev-box? Rails dev. Do you agree with counting that as Perl?

So I will not read the paper.

------
mike_ivanov
Developer/team experience is not a confounder? Very believable, nevertheless.

