
On the Impact of Programming Languages on Code Quality - lelf
https://arxiv.org/abs/1901.10220
======
scott_s
A common thread I'm seeing in the discussion here is confusing the claims from
the _original_ paper with conclusions made by _this paper_. This is a
replication study, and it starts by stating what the original paper found.
What the original paper found is not what this paper finds. For the most part,
this is a replication failure, for a myriad of reasons, including several
kinds of misclassification.

From this paper:

 _The reanalysis failed to validate most of the claims of [18]. As Table
6(d-f) shows, the multiple steps of data cleaning and improved statistical
modeling have invalidated the significance of 7 out of 11 languages. Even when
the associations are statistically significant, their practical significance
is small._

The reference [18] is
[https://dl.acm.org/citation.cfm?doid=2635868.2635922](https://dl.acm.org/citation.cfm?doid=2635868.2635922),
which is the original paper that this paper is a replication study of.

~~~
pmiller2
Yes, definitely. This is mostly a _non-replication_ of the original study. I
was a little disappointed that they didn't look at defects per line of code,
even though that's already a fairly well-studied metric.

I think the lesson here is that you can write good or bad code in any
language, but choose one that's fit for purpose. Using proper abstractions
(including those built into the language) probably does more to reduce the
incidence of bugs than anything you can do other than writing automated tests.

~~~
neves
The practice that most reduces the incidence of bugs is code review:
[https://kevin.burke.dev/kevin/the-best-ways-to-find-bugs-in-...](https://kevin.burke.dev/kevin/the-best-ways-to-find-bugs-in-your-code/)

~~~
JamesBarney
Emphasis on "Formal code review," which is practiced in far fewer shops than
"code review".

For those who don't happen to know the difference, formal code review
basically uses a checklist of common best practices and possible defects to
look for, versus doing each code review ad hoc.

~~~
codazoda
I thrive on checklists and hate the ad-hoc way I do code reviews on my team.
This sounds great, I'm going to go look for some basic formal checklists. I
need language agnostic versions and/or versions for javascript, php, and
python, if anyone has suggestions.

~~~
treebog
The Art of Software Testing by Glenford Myers has an excellent language-
agnostic checklist. See the chapter on “walkthroughs”

~~~
pfdietz
That is a book that, even in the third edition, has a paragraph that
disparages random testing.

It's difficult for me to see how, in the era of jsfunfuzz, Csmith, AFL, and
other such tools, such a source should be taken seriously.

~~~
treebog
I’ve never found a book on software I agreed 100% with. Even my favorites have
some bits I think are dead wrong. It doesn’t mean the other parts aren’t
valuable.

~~~
pfdietz
It means the book has not been updated, at least in that part. And the first
edition of the book was written back when computing power was seven orders of
magnitude more expensive (or more). Is advice about testing from then really
going to be valid now?

------
robbrit
Seems like they're just doing associations, and not controlling for a lot of
factors. For example, you can come up with some narratives to explain a lot of
these findings:

* Python, JS, PHP, Java, C++ are all very common first languages; I'd imagine the average experience level for these is dragged down a bit compared to others like Golang, Haskell, or Scala. One of the recent Stack Overflow Developer surveys found Golang use to be strongly correlated with age, and since I'm assuming bug count would negatively correlate with age, you'd see a confounding factor here.

* Ruby's community has an obsessive culture around unit testing, which could impact bug levels.

* The trade-offs that make C and C++ attractive such as memory performance are double-edged swords and will naturally correlate with more bugs. For example, I work a fair bit in embedded systems which use MISRA C out of necessity; I've also worked in large scale data processing which chooses C++ for its memory mapping performance over Java or Go. If "number of defects" is your primary measure then you're ignoring some of the reasons languages are chosen. Sometimes a language that minimizes bugs is not a viable option.

* I'd guess there is a skill correlation here: an engineer who is willing to seek out a specific less-used language and learn to use it effectively is _probably more likely_ (not guaranteed) to do a better job than one who is just going with the default choice.

It'd be very interesting to see a regression done that accounts for these
factors.
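
The confounding story in the first bullet can be made concrete with a toy simulation (all numbers invented for illustration): experience alone drives the bug rate, but language choice correlates with experience, so a naive per-language comparison "finds" a language effect that the model doesn't actually contain.

```python
import random

rng = random.Random(42)

def simulate_project(lang):
    # Assumption for illustration: "niche" languages attract more
    # experienced developers on average (mean 8 years vs. 4 years).
    experience = rng.gauss(8 if lang == "niche" else 4, 2)
    # Bug count depends only on experience plus noise; there is no
    # direct language effect anywhere in this model.
    return max(0.0, 10 - experience + rng.gauss(0, 1))

mainstream = [simulate_project("mainstream") for _ in range(2000)]
niche = [simulate_project("niche") for _ in range(2000)]

avg = lambda xs: sum(xs) / len(xs)
# The naive comparison still shows fewer bugs for the niche language.
assert avg(niche) < avg(mainstream)
```

A regression that controls for experience would recover the true (zero) language effect here; the naive group comparison cannot.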

~~~
stcredzero
_Ruby 's community has an obsessive culture towards unit testing, that could
impact bug levels._

Like Smalltalk, they're forced to do this to catch the errors that would've
been caught by type annotation. So then, could the argument be made that less
stringent type annotation hurts code quality by removing effort from unit
testing?

~~~
dragonwriter
> Like Smalltalk, they're forced to do this to catch the errors that would've
> been caught by type annotation.

Very few of the unit tests I've written for Ruby/Python catch errors that
would be caught by typical type system use in C/Java/etc. (Haskell, maybe,
but...) The main thing I miss in dynamic languages vs. static is the more
powerful automatic testing possible when test specs can leverage type
information (e.g., QuickCheck/ScalaCheck), not an absence of need for testing.

~~~
joshlemer
I've only ever done property-based testing in Scala, but I thought it was
pretty popular in dynamic languages like Clojure and Erlang. What would
prevent it from working in Ruby/Python?

~~~
dragonwriter
Without type information already in the code (for parameters, etc.) there
would be a lot more overhead in setting it up; it's possible, but there's
higher marginal friction even in the ideal case.

Plus, AFAIK, no one has written the libraries, making the marginal friction
in practice even higher than in the ideal case.

~~~
ramchip
Type information generally isn’t used when doing property testing in Elixir,
and Python has a very mature library too (hypothesis). If there’s a difference
in popularity, it could just be culture. The Erlang/Elixir community is rather
fanatical about reliability.

~~~
dragonwriter
Hypothesis uses the same type annotations as Python's optional typechecker as
a tool to select appropriate input data generators without additional test-
specific description. That's a fairly common feature of property-based testing
tools; I don't know for sure, but I wouldn't be surprised to find that
hypothesis and mypy tend to be used in the same segments of the Python
community.
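
As a rough sketch of what these tools automate: a property is checked against many generated inputs. The generator below is hand-written; Hypothesis/QuickCheck-style libraries derive generators like it from type annotations or strategy declarations, which is the friction point discussed above.

```python
import random

def prop_reverse_twice_is_identity(xs):
    # Property under test: reversing a list twice yields the original.
    return list(reversed(list(reversed(xs)))) == xs

def random_int_list(rng, max_len=10):
    # Hand-rolled input generator; property-based testing libraries
    # produce these automatically from type or strategy information.
    return [rng.randint(-100, 100) for _ in range(rng.randint(0, max_len))]

rng = random.Random(0)
# Check the property against 100 generated inputs.
assert all(prop_reverse_twice_is_identity(random_int_list(rng))
           for _ in range(100))
```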

------
kazinator
The problem is that "code quality" isn't just bugs per line of code.
Maintainability, extensibility, readability, expressivity all factor into it.

Suppose I have to write 100,000 lines in language B to write something doable
in 10,000 lines of A. Who cares if the defect rate per line of A is twice as
high: I still have five times fewer defects, thanks to ten times fewer lines
of code.

Another thing to consider is the nature of defects. A defect that allows a
remote execution exploit is not the same as an incorrect result being ignored
and propagating to later calculations, is not the same as terminating with a
diagnostic.
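
The arithmetic in that trade-off, with a made-up per-line defect rate for B:

```python
loc_a, loc_b = 10_000, 100_000   # same functionality in languages A and B
rate_b = 0.01                    # hypothetical defects per line in B
rate_a = 2 * rate_b              # A's per-line defect rate is twice as high
defects_a = loc_a * rate_a       # 200 total defects
defects_b = loc_b * rate_b       # 1000 total defects
assert defects_b / defects_a == 5  # A still ends up with 5x fewer defects
```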

~~~
caust1c
Very much agree. This type of analysis is hard to quantify, but anecdotally I
don't know many people who have written Go, Rust, or TypeScript who would
prefer Python, Ruby, Java, JavaScript, or C++ for their daily work after
using them.

------
pron
The most practically relevant finding of the original paper remains in effect:
the choice of language has an effect of less than 1% on the bug rate[1]. The
original paper explains: "With effects coding, each coefficient indicates the
relative effect of the use of a particular language on the response as
compared to the weighted mean of the dependent variable across all projects."
So the entire discussion is on the relative effect of particular languages
_relative to the total effect of language choice, which is less than 1%_. The
original paper reported stronger _relative_ effects for particular languages,
while this one found the effects to be smaller, and in many cases they
disappeared altogether, thus summarizing the effects of the languages as
"exceedingly small."

So even the comparison of C++ vs. Haskell, that has ~25% difference, is
_within that 1%_. This is similar to a study comparing the effect of running
shoes on running speed that shows that Nike positively affects your running
speed 25% more than Adidas, to the extent running shoes affect your speed at
all, which is less than 1%.
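
The shoe analogy as arithmetic (illustrative numbers): a 25% relative gap inside a factor that explains less than 1% of the total comes out to under a quarter of one percent overall.

```python
language_share = 0.01    # language explains <1% of the deviance (per [1])
relative_gap = 0.25      # e.g. the ~25% C++ vs. Haskell difference
absolute_gap = language_share * relative_gap
# The headline-grabbing relative difference is ~0.25% in absolute terms.
assert abs(absolute_gap - 0.0025) < 1e-12
```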

-----

[1] _One should take care not to overestimate the impact of language on
defects. While these relationships are statistically significant, the effects
are quite small. In the analysis of deviance... we see that activity in a
project accounts for the majority of explained deviance... The next closest
predictor, which accounts for less than one percent of the total deviance, is
language._
([https://web.cs.ucdavis.edu/~filkov/papers/lang_github.pdf](https://web.cs.ucdavis.edu/~filkov/papers/lang_github.pdf))

~~~
yogthos
Two things these studies haven't measured are time to deliver features and
the number of developers necessary. For example, if a project in Haskell can
be delivered faster by a smaller team than an equivalent project in C++, that
would be an important factor.

~~~
bunderbunder
There's a concept I discovered in the realm of political science, the
"fundamentally unanswerable question" - one where it's not actually possible
to generate the kinds of counterfactuals you'd need to try and run an
experiment.

I think this is probably one of those. Haskell and C++ are very different
languages that are designed for very different purposes and are popular in
very different problem domains. To start with, that would make it difficult
to come up with a task that wouldn't handicap one language or the other.
example, a video game would favor C++.) Even if you found that, you'd have a
hard time coming up with a good test. Under random assignment, Haskell is
presumably going to be handicapped relative to C++ by the fact that most
languages are imperative and use Algol-style syntax. You could try to avoid
that by using participants with zero programming experience, but then you're
studying learning curve for absolute beginners rather than productivity among
experts. And so on and so on.

Long story short, probably best to leave this one to the spittle-flecked
Internet arguments.

~~~
yogthos
I really don't see how that would be the case. Both Haskell and C++ are
general-purpose languages, and there are plenty of projects written in many
overlapping domains. The test would be to treat projects as black boxes and
look for trends across many projects. If you saw a trend that Haskell
projects are consistently delivered faster by smaller teams, then you could
make a hypothesis that Haskell is a factor. At the end of the day, we've come
up with ways to test hypotheses like quantum mechanics. I'm pretty sure
measuring the impact of a language on the software development process is a
tractable problem.

~~~
pron
As someone who develops mostly in C++, I can tell you that productivity is a
very small concern. We are well aware that most GCed languages are
significantly more productive. Virtually the only reason people use C++
nowadays is performance (and/or footprint). That is partly because reports in
the early '00s showed Java delivering better productivity, so those who cared
about that switched a long time ago; in fact, the effect was so large that
they didn't have a choice. Now, if someone showed that Haskell has
significantly better performance/footprint than C++, that might cause people
to shift.

~~~
yogthos
Sure, in some domains you have to use specific languages because of the
constraints of the domain. However, as you yourself state, a language can
have a significant impact when it comes to productivity.

For example, immutability provides similar benefits to GC. Instead of the
developer having to manually manage references across the project by hand, the
language takes care of that work. This frees the developer from doing
additional work, and removes a source of errors. This also directly leads to
the ability to do local reasoning about code resulting in developers needing
less context to understand code. So, if GC plays a significant enough role
then it's highly likely that immutability does as well because it addresses a
similar set of problems.
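
The local-reasoning point can be sketched in a few lines (Python used purely for illustration): with shared mutable state, a distant call changes what the caller holds, while an immutable-style update produces a new value and leaves the original intact.

```python
# Shared mutable state: a callee silently changes the caller's data.
config = {"retries": 3}

def disable_retries(cfg):
    cfg["retries"] = 0            # mutates the caller's dict in place

disable_retries(config)
assert config["retries"] == 0     # action at a distance

# Immutable-style update: build a new value instead of mutating, so each
# piece of code can be understood locally without tracking aliases.
original = {"retries": 3}
updated = {**original, "retries": 0}
assert original["retries"] == 3   # holders of the original are unaffected
assert updated["retries"] == 0
```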

------
idoubtit
My own take on this:

- This study tries to improve on the original work with better metrics (e.g.
fixing the confusion between C and C++) and better statistics (e.g.
accounting for uncertainty).

- Automatically evaluating the quality of code is hard and time-consuming.
They had to rely on manual peer review of the labeling.

- Most of the old claims are not confirmed.

- Only 2 languages in this study had both a significant impact (impact
coefficient) and enough reliability (p-value): Clojure and Haskell. But the
impact is still small, and there may be other influencing factors.

- They propose several ways to further improve their work (e.g. taking
regression tests into account).

~~~
msluyter
_"... and there may be other influencing factors."_

Selection bias? Consider the type of person likely to be attracted to
languages like Haskell or Clojure.

(Note, didn't read the original study so I don't know the selection criteria.
Were code samples in different languages from the same programmers compared,
or were there different sets of programmers per language? My comment really
only applies to the latter case.)

~~~
sidlls
What is the type of person likely to be attracted to languages like Haskell or
Clojure?

~~~
maxkwallace
I can think of selection bias effects both in favor of more defects and
against it.

On one hand, Haskell or Clojure probably attract more academic types who have
a reputation for (consistent with my experience) on-average worse-quality
code.

On other hand, the barrier-to-entry for these languages is higher so they tend
to be programmed by more experienced folks. I assume that people pick
functional languages in part because they believe the functional paradigm
leads to higher-quality code, so the sample probably cares more about code
quality to begin with.

~~~
danielscrubs
You can’t win against people’s insecurities.

They went to school; they were not bashed in the head with hammers.

------
umass_alum
This is great (to see something I know about on HN). As an undergrad at UMass
(where Emery Berger teaches), I remember the exact paper they are talking
about, and how it was presented at UMass (this was part of a lunch and learn
with the faculty, grad, and undergrad students and the author of the original
paper).

I remember Emery being quite confused by the methodology in the paper,
particularly one aspect of it: how they came to conclude that C was more
bug-prone (something about void * being able to cast to everything? It was a
while ago...)

Glad to see he wrote a rebuttal; it was a little awkward watching a professor
debate with the paper's author :)

~~~
mcguire
I knew Emery at UT Austin; he's a good researcher, but I'd imagine "awkward"
is an understatement.

------
dahart
One of the simplest but more important conclusions that affects me all the
time... it’s easy to search for things, but hard to know whether you’ve got
all the answers you want and hard to know if all the answers you got are
answers you want.

6.3 Grep considered harmful

Simple analysis techniques may be too blunt to provide useful answers. This
problem was compounded by the fact that the search for keywords did not look
for words and instead captured substrings wholly unrelated to software
defects. When the accuracy of classification is as low as 36%, it becomes
difficult to argue that results with small effect sizes are meaningful as they
may be indistinguishable from noise. If such classification techniques are to
be employed, then a careful post hoc validation by hand should be conducted by
domain experts.
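
The substring problem the paper describes is easy to reproduce. A hypothetical commit message containing no bug fix at all still matches a naive keyword search:

```python
import re

msg = "Debugging output improved; refactor, no behavior change"

# Substring search flags this commit as a bug fix: "bug" occurs inside
# "Debugging" -- a false positive of exactly the kind the paper found.
assert "bug" in msg.lower()

# A word-boundary match avoids that particular error (though keyword
# matching on commit messages remains a blunt instrument either way).
assert re.search(r"\bbug\b", msg.lower()) is None
```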

~~~
stcredzero
_One of the simplest but more important conclusions that affects me all the
time... it’s easy to search for things, but hard to know whether you’ve got
all the answers you want and hard to know if all the answers you got are
answers you want._

Smalltalk was wonderful in this regard. If you wanted to search for how
something was used, you just right-click search for "senders." If you wanted
to find implementors, the same.

The big difference showed up with chains of senders. In other languages,
where there are 2 or 3 ways to get to something or call something, you have
this O(2^n) or O(2.5^n) blowup in the work you have to do to properly follow
chains of senders. With Smalltalk, it's just O(n).
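
The claimed blowup, as back-of-the-envelope arithmetic: if every hop in a call chain offers b plausible candidates to check, chasing an n-deep chain by hand touches on the order of b**n paths, while an exact senders/implementors index needs one lookup per hop.

```python
def ambiguous_paths(branching, depth):
    # Each hop multiplies the candidate call sites left to inspect.
    return branching ** depth

def indexed_lookups(depth):
    # An exact "senders"/"implementors" query: one lookup per hop.
    return depth

assert ambiguous_paths(2, 10) == 1024   # exponential hand search
assert indexed_lookups(10) == 10        # linear with a precise index
```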

On top of that, there were tools that let you compose really sophisticated
SQL-like queries of the code base, dynamically, and then pop the results up
in a browser.

------
yogthos
It looks like the key takeaways are:

* Overall language effects do not appear to be a dominant factor in software quality.

* Manual memory management is error-prone.

* Functional languages appear to produce better results than OO/imperative ones.

* Static typing does not appear to play a measurable role. In fact, Clojure and Erlang were in a category of their own, beating out their statically typed counterparts.

~~~
Skunkleton
> Clojure and Erlang were in a category of their own

I wonder how they controlled for developer skill?

~~~
yogthos
It's likely that Clojure and Erlang communities have a higher ratio of
experienced developers than mainstream languages. However, this would be true
for Scala and Haskell communities as well presumably. Meanwhile, the fact that
experienced developers choosing these languages appear to produce better code
is an interesting data point of itself.

------
alkonaut
The language is what makes me apply to work at your company with 15 years of
experience while being paid 20% less than I would be if I had to do OO.
That’s the driver.

Is the code quality better? Well, maybe. Probably. But dollar for dollar
you’ll do better with a good language, because you’ll get much better people
for your dollar.

------
tsegratis
I wonder if this lack of significant link is because number_of_bugs =
time_spent_programming / time_fixing_bugs

So choice of language does affect the number of potential bugs. But the ratio
of effort a programmer gives to bugs vs. features is almost completely
independent. Actual bugs making it past the programmer depend on how much
effort the programmer cares to expend.

So a safer language doesn't reduce bugs. It's like farmers with back pain
from driving tractors: giving them comfier rides doesn't reduce their pain;
it just means they drive faster, up to the same pain limit.

If this is true, then bugs can be reduced by making bugs more painful for the
programmer

But also a safer language will let the programmer drive at a higher speed,
releasing a greater percentage of safe code (even if the total number of bugs
is the same)

------
nullc
Good to see better statistics being applied.

However, the base methodology appears to be really weak.

The presence of fixes doesn't necessarily reflect the presence of underlying
bugs: highly conscientious projects make many "fixes" that improve the
software quality against an abstract ideal without directly fixing any defect,
while highly unconscientious projects don't bother fixing anything.

There are also many confounders. For example, I've seen projects institute
restrictions on refactoring changes in response to the huge amount of
brownian code motion popular projects can receive, only to then have some
contributors start misleadingly describing their commits: "by splitting this
function up, we avoid the risk that someone would later introduce an
overflow!"

Considering how important software quality is to contemporary engineering,
including life-safety-critical systems, it's disappointing that we don't see
more randomized studies. E.g. take a pool of developers, randomly assign them
to teams using different tools to accomplish the same task, then compare the
results against each other and a hidden test suite.

------
salmonellaeater
Of note is that these are all associations. It's possible the associations
are all due to selection effects (or even that the "better" languages are
actually worse, but their effects are reversed by selecting for programmers
who are better).

In the early days of Python, some employers used it as a way to filter for
candidates who were ahead of the curve. I believe Paul Graham mentioned Lisp
being good for attracting better programmers, although I can't find the
reference anymore [1]. If there's any value in this kind of filter, the
effects will show up in the programmers' output.

1. I'm having a hard time finding references for these assertions; the best
I've found is some Reddit posts and Joel Spolsky's post on finding great
developers: [https://www.joelonsoftware.com/2006/09/06/finding-great-deve...](https://www.joelonsoftware.com/2006/09/06/finding-great-developers-2/)

~~~
noname120
In an infamous rant, Linus Torvalds also hints at using a language as a filter
for programmers[1]:

>Quite frankly, even if the choice of C were to do _nothing_ but keep the C++
programmers out, that in itself would be a huge reason to use C

[1]
[http://harmful.cat-v.org/software/c++/linus](http://harmful.cat-v.org/software/c++/linus)

~~~
shereadsthenews
Considering the astonishing density of defects in Linux we can treat its
author’s advice on software development as anti-advice.

~~~
mcguire
What is the density of defects in Linux? Any published comparisons?

~~~
ofrzeta
According to Coverity, the "defect density" of the Linux kernel is around 0.5
errors per 1000 lines of code, while the average of the analyzed open source
projects is around 0.7. Best among the analyzed projects is Python (the
language/stdlib I guess) with an error rate of 0.08.

------
ggregoire
Casual observation: adding a framework to any language seems to improve the
overall code quality. It adds and teaches structure, patterns and good
practices that developers usually follow.

I've seen terrible codebases written in vanilla PHP, JavaScript, Python
(probably the worst ones). But I've seen very good looking, easy to
understand, easy to maintain codebases written in Symfony, React and Django.

(React is an interesting case because it's not actually a framework but a
view library, which doesn't force any structure on your project. But its
declarative, component-based paradigm seems to help developers structure
their projects and write simple, reusable, and composable components in a
good way.)

~~~
tabtab
I agree. Our frameworks, libraries, and/or interfaces can and should make a
bigger difference than the programming language.

If you have to use the "weird parts" of a language to get typical work done,
you are doing something wrong. And if you are not using the weird parts, then
programming languages look and do pretty much the same thing.

Bragging such as "look! My language can do double recursive lambda backflips
while blindfolded and chewing gum!" is mostly just 1950's car racing done in
cubicles. We buy cars to get to work and shops as cheaply and conveniently as
possible, not to beat Fonzie.

~~~
syastrov
I think it really depends on how you use the “weird parts”. Some of those let
you write code much more concisely and effectively. Used in the wrong
context, it’s an unnecessary distraction. It also depends on whether you want
to write “typical code” or code that the competition can’t match.

See Paul Graham’s essay on the Blub Paradox [0], which discusses why some
languages seem to have “weird parts” from the perspective of someone who knows
a less-powerful language.

[0] [http://www.paulgraham.com/avg.html](http://www.paulgraham.com/avg.html)

~~~
tabtab
You make it hard to hire if you limit the field to those who know the esoteric
techniques. Programming itself is mostly a dead-end career; you either move
into management or quasi-management, or face ageism on average. There is thus
not a lot of incentive to "perfect the art".

------
baby
I’ve read a lot of code as a consultant. The best codebase I’ve read had
almost no comments and was written in Erlang. Besides that, every codebase
I’ve read in Golang was super clear and of excellent quality, with the
exception of one that was written by Java developers.

I can say I hated reading C because of macros and the different build
systems. I hated C++ codebases because people abused generics. And I hated
Java the most because of the verbosity and the directory structure and all
the layers of abstractions and all the factories and all the singletons and
all the...

~~~
Insanity
That is always a risk if a team decides to pick up a new language but does
not make the effort to learn it properly. At work we are going through a
similar Java-to-Go transition, so I know :(

~~~
noir_lord
It's always a problem when people shoehorn a non-idiomatic approach from one
language into another.

It makes me wish the language creators wrote a single non-trivial 'this is
how we intended it to look' project.

When learning a new language I find as many big projects written in it as I
can and spend time exploring them.

It helps but there is often no guarantee those projects are doing it
idiomatically either.

~~~
chubot
Looking through the Python standard library or Go standard library should be a
clue not to use Java idioms, but yes I realize those habits are hard to break.
I have seen Java-in-Python and it is indeed painful.

~~~
noir_lord
The problem with reading framework and library code is that it's very coupled
to what it is, so while it's interesting and valuable to read it's not really
the type of code you'd see in a business app or whatever.

~~~
JamesBarney
Yeah, I've tried this several times, and the code written for the std library
is so different from LoB code (because the domains are so different) that you
can only learn a small % of possible idioms from it.

------
gilbetron
Unfortunately, it's the original paper that will keep getting quoted at me
forever.

My favorite line from this paper is: "Correlation is not causality, but it is
tempting to confuse them." I think most people believe "correlation is not
causality, but most of the time it is," when in reality it is "correlation is
not causality, and almost never is." There was a great analysis of this in a
book or paper I read years ago that I haven't been able to find, but the
tl;dr is that if causality were truly common, any set of events would form a
seized-up network of causality.

------
melling
...deleted

I skimmed the paper. I misunderstood the results.

~~~
smoe
Your quote was the conclusion of the original work by Ray et al., not the one
from the reproduction paper linked here. That one in short is:

"Unfortunately, our work has identified numerous problems in the FSE study
that invalidated its key result."

------
coldcode
Testing Objective-C is sort of pointless; they should have picked Swift.

~~~
pmiller2
I disagree. If anything, they should have tested Obj-C _and_ Swift. I would
also have liked to see Scheme, Common Lisp, or something else lisp-y besides
Clojure. Kotlin would have been interesting. R or Mathematica would have been
fun, too.

I guess it just comes down to needing a sufficiently large corpus of code and
commits and having to eventually publish.

------
panpanna
Maybe off topic, but isn't the language in the abstract a bit aggressive?

I mean, this type of study is somewhat open to interpretation. It's not like
everyone should arrive at the exact same numbers.

~~~
omginternets
Which part? This kind of direct language is both standard and encouraged in
academia.

The idea is to take a clear position, which can be attacked if wrong.

------
voidhorse
Interesting! Excepting TypeScript, the “safer” languages also happen to be
the ones I gravitate towards on a daily basis and actually enjoy writing
programs in, while the others...

It’d be interesting to see whether programmers’ preferences uphold the
division posited by the study, or whether they are widely variable; e.g. is
it likely that if you prefer Ruby you’ll also prefer TypeScript over
JavaScript, or that if you prefer Haskell you’d choose Clojure over C++, etc.

~~~
scott_s
I think you're confusing the conclusions from the original paper with the
conclusions from this paper. This paper is a reproduction of the original
study, and for the most part, it is _not_ able to reproduce the findings.

~~~
voidhorse
How right you are. That’s what I get for skimming news piecemeal way too
quickly in the mornings!

------
jph
(Edited -- I was reading the original paper, which showed good results for
TypeScript; the new paper linked above refutes that result. My mistake.
Thanks to the posters who corrected me!)

The winners in this paper are Clojure, Haskell, Scala.

I look forward to seeing Rust added. The authors go into substantial detail
about what they consider errors, and in my personal experience Rust solves
many of them. See e.g. "Some defect type like memory error, concurrency errors
also depend on language primitives."

~~~
xvilka
Scala will really start to shine only after Scala Native[1][2] is ready.

[1] [http://www.scala-native.org](http://www.scala-native.org)

[2] [https://github.com/scala-native/scala-native](https://github.com/scala-native/scala-native)

~~~
mannykannot
What features of Scala Native do you think will improve Scala code quality?

