
The social side of science seen in the research on programming language quality - hwayne
https://www.hillelwayne.com/post/this-is-how-science-happens/
======
rossdavidh
Wow, what a bad idea. Do Haskell and PHP get used for the same sort of
projects? Or Erlang and Ruby? No, they get used for very different sorts of
projects. You would need to compare only similar kinds of projects (e.g. web
frameworks, or statistical analysis, or device drivers) to get anything useful
out of this data source. But that's probably not possible, because in most
cases there aren't enough projects in very many languages.

This also assumes that bugs are all equally likely to get FOUND in all
languages. What if bugs are easier to find in language X than in language Y?
Then you will get MORE, not fewer, commits that mention fixing them in
language X, which is the opposite of what you would want the signal to
indicate.

I could go on, but there's no need, as even these points are enough to totally
invalidate the premise of this research. Getting into the weeds of the
analysis of the original study or the rebuttal is not going to give you
anything useful if the original premise, that you are measuring how languages
compare on bugs, is wrong, and it almost certainly is.
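
To make the detectability problem concrete, here is a toy simulation (all the
numbers are invented, purely for illustration):

    import random

    random.seed(0)

    # Two hypothetical languages with the SAME underlying defect rate, but bugs
    # in language X are easier to notice than in language Y. Numbers invented.
    TRUE_BUGS_PER_PROJECT = 100
    DETECTION_RATE = {"X": 0.8, "Y": 0.4}

    def observed_fix_commits(language, projects=50):
        found = 0
        for _ in range(projects):
            for _ in range(TRUE_BUGS_PER_PROJECT):
                # A bug only produces a fix commit if somebody finds it.
                if random.random() < DETECTION_RATE[language]:
                    found += 1
        return found

    print("X:", observed_fix_commits("X"))  # ~4000 fix commits
    print("Y:", observed_fix_commits("Y"))  # ~2000 fix commits
    # Same real defect rate, yet X looks twice as "buggy" by commit counting.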

~~~
Tarq0n
Confounding variables don't invalidate research, they merely inform us on how
to weigh the evidence, and shape future research.

Studies can have interesting conclusions even in the presence of confounding
variables. Sometimes the state of the art in a certain hypothesis space hasn't
progressed past one of these yet, but that's still better than gut feeling.

In some areas like public policy research you will never completely eliminate
all of these factors, but that's no reason not to attempt to create knowledge.

~~~
cdheuer
I don't think GitHub mining is state of the art, and there are opportunities,
as the article says, to separate the signal from the noise. I agree in the
general sense, but some lines should be drawn past which we can say research
is invalid, so people don't run away with the results.

~~~
SiempreViernes
Yeah, the original study is clearly so methodologically unsound that its
conclusions shouldn't be repeated.

But frankly most of the posts here are people almost calling the original
authors _heretical_ for even trying to data-mine commit messages.

------
ken
At the very end of this grand tour:

> Science has its problems, but it’s still the best we got.

I would agree: _if_ you're doing particle physics and looking for p < 0.0000003,
then absolutely. Science is a great tool for the kinds of investigations of
the natural world that can be understood by science. If you're doing social
science by scraping some data from the internet and looking for p < 0.05,
that's a very different story.

I'm not actually sure what aspect of this qualifies as "science", besides
using p-values to analyze the data. Even if all the categorization and
sampling and analyses were perfect, it's using commit messages and known bugs
with known fixes, all of which are essentially self-reported.

~~~
carapace
From my POV the real problem is that there are phenomena that apparently
cannot be studied scientifically. Like _Reiki_. Now I'm not a scientist but I
am a huge proponent of science. Carl Sagan is among my personal heroes, okay?
I am in favor of science over superstition and irrationality.

One day I encountered a phenomenon that is as real and physically dramatic as,
say, the electric field of a Van de Graaff generator. Being a fan of science I
went looking for the math/science/engineering for this phenomenon. Long story
short, after a few years of research, I discovered that: 1) this phenomenon
has been known since antiquity in a pre-scientific way; 2) several researchers
had independently (re-)discovered it multiple times going back about 2.5
centuries; 3) each time it had achieved a certain notoriety but then been
debunked and subsequently ignored or forgotten. The first was Mesmer & "animal
magnetism", but there are several others, including Baron von Reichenbach &
the "Odic force", and Wilhelm Reich & "Orgone".

This is repeating today with several kinds of "energy healing", including
Reiki (which is distinguished by being Japanese and unabashedly theistic;
"Rei" means Divine), where practice is widespread but scientific validation is
lacking. As I say, I've _felt_ Reiki, so I know it's real, but so far I've
personally never been able to interest any scientists in studying it, despite
offering to demonstrate it. And the studies that have been done haven't
helped. The Wikipedia article for Reiki just straight up calls it
pseudoscience.

~~~
Analemma_
To the extent that Reiki or any other alternative medicine technique actually
does anything, it's easily explainable by the placebo effect.

You know what we call alternative medicine that works? Medicine.

~~~
yters
The placebo effect itself is fairly spooky.

~~~
disgruntledphd2
Not really.

It's all Descartes' fault, essentially, for postulating the "mind" as distinct
from matter.

More specifically, placebo pain-relief has been associated with endogenously
produced opioids for approximately 40 years now (1978, I believe).

More recent research suggests that much of the variance in the effects of
pain-killers can be associated with these endogenously produced opioids (when
the patient was not aware of the pain-relief, responses were similar).

So essentially, the placebo effect is just a normal part of your body
responding to medical treatment, regardless of the efficacy of said treatment.

I'm talking about pain because most of the research is on pain, but the
principles are similar where placebos tend to be effective (not against
infections, sadly, but pain and depression seem to be extremely placebo-
responsive).

------
mihaaly
Programming and programming languages are in large part a subject of the
social sciences anyway. Made for the needs of humans, used by humans in
groups, with practices and methods addressing the particularities of humans
and human groups. Not ordinary humans but humans indeed. It is probably more a
social science than mathematics or other human-independent matters. A social
science that uses other disciplines, similar to economics, or at least an
interdisciplinary subject.

~~~
throwaway17_17
I just cannot agree with this concept. Using the standard you propose,
architecture, for instance, is a social science, and I’d argue most non-computer
engineering fields would be classified as social sciences as well. Computer
Science is traditionally applied mathematics; at least the theoretical
courses of study treat it so. But there is a reason Computer Engineering is
the predominantly understood title of the overarching software field.
Programming languages do not exist in some philosophical vacuum; they are
created to be implemented in the real world, on real hardware, and deal with
the very real physical capabilities and drawbacks of that real-world hardware.
While programming, and software design as it is generally practiced, is not as
rigorous as other engineering fields, it is the application of the sciences to
create or implement technology. So I just cannot agree that programming
languages and programming are in any way a social science.

Also, just to be clear, there can be social science fields which inquire into
aspects of the engineering field of Computer Science, but that doesn’t affect
the underlying character of the field.

Also, also, I would strenuously argue against the proposition that mathematics
is in any way ‘human independent.’ But that argument is not entirely relevant.

~~~
TheOtherHobbes
Architecture is very much about culture. I'm not sure I'd call it a social
science, but the science part of architecture - making sure that buildings
don't fall down - is left to structural engineers.

If you read how architects justify their designs, they're full of references
to culture, philosophy, and occasionally - too rarely - practical research
into how people move through and live/work in spaces.

It's a social activity in that architects design for other architects. Clients
are secondary.

CS is actually no different, it's just far less culturally self-aware. There's
nothing objective about CS. It grew out of a specific philosophical current,
and it could easily have gone in other directions, emphasising other kinds of
entity relationships - e.g. less atomised and sequential, more contextual and
associative. That in turn would have influenced the design of the hardware.

~~~
boyobo
You can say the exact same thing about mathematics.

------
goto11
I'm confused. So the original paper was shown to be flawed, but then the
rebuttal was also shown to be flawed. But this article argues the flaws in the
rebuttal were not as glaring as the flaws in the original paper?

But the approach of counting commits on GitHub just seems fundamentally
flawed. The first step should be to show that the GitHub dataset _can_ be used
to say anything about the quality and productivity effects of things like
language choice. Arguing about how to classify commit messages seems pointless
without this foundation.

~~~
yaacov
He talks about that in the end. Here’s the relevant section:

 _Is this even a good idea?_

We’ve just spent several thousand words on methodology to show how the
original FSE paper was flawed. But was that all really necessary? Most people
point to a much bigger issue: the entire idea of measuring a language’s defect
rate via GitHub commits just doesn’t make any sense. How the heck do you go
from “these people have more DFCs” to “this language is more buggy”? I saw
this complaint raised a lot, and if we just stop there I could have skipped
rereading all the papers a dozen times over.

Jan Vitek emphasizes this in his talk: if they just said “the premise is
nonsense”, nobody would have listened to them. It’s only by showing that the
FSE authors messed up methodologically that TOPLAS could establish “sufficient
cred” to attack the premise itself. The FSE authors put a lot of effort into
their paper. Responding to all that with “nuh-uh, your premise is nonsense”
is bad form, even if the premise is nonsense. Who is more trustworthy: the
people who spent years getting their paper through peer review, or a random
internet commenter who probably stopped at the abstract?

It all comes back to science as a social enterprise. We use markers of trust
and integrity as heuristics on whether or not to devote time to understanding
the claim itself. This is the same reason people are more likely to listen to
a claim coming from a respected scientist than from a crackpot. Pointing out
a potential threat to validity isn’t nearly as powerful as showing how that
threat actually undermines the paper, and that’s what TOPLAS had to do to be
taken seriously.

~~~
goto11
Interesting, but as a layman this seems totally backwards. If the underlying
premise is flawed, who cares if the methodology is correct? If the original
study _hadn't_ made errors like misclassifying some projects or commit
messages, the result would still be based on an unfounded premise.

~~~
SiempreViernes
To put it simply: if the methodology is sound then the premise _can't_ be
unfounded, because the premise is the source of the methodology.

~~~
smt88
I think you may misunderstand what others mean when they say "premise".

For this paper, the premise is that GitHub commit history indicates language
quality.

The methodology is then the implementation of the study: how do we classify
each commit? How do we identify the base language in a repository? What would
lead us to discard a repository as a data source?

The methodology may be perfect, but you may still end up with nonsense because
GitHub is not a comprehensive or representative source of data for studying
programming.
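
To make that distinction concrete, the methodology layer is roughly this kind
of plumbing (the keyword pattern and the language-guessing rule here are
hypothetical, just for illustration; the real papers argue over exactly these
choices):

    import re

    # Hypothetical heuristic for flagging "defect-fixing commits" (DFCs).
    DEFECT_PATTERN = re.compile(
        r"\b(fix(es|ed)?|bug|defect|error|fault|crash)\b", re.IGNORECASE)

    def is_defect_fixing(commit_message):
        """Classify a commit as defect-fixing if its message matches the pattern."""
        return bool(DEFECT_PATTERN.search(commit_message))

    def dominant_language(file_extensions):
        """Naive base-language guess: the most common file extension in the repo."""
        if not file_extensions:
            return None
        return max(set(file_extensions), key=file_extensions.count)

    print(is_defect_fixing("Fix null pointer crash in parser"))   # True
    print(is_defect_fixing("Add dark mode to settings page"))     # False
    print(dominant_language([".ts", ".ts", ".cpp"]))              # .ts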

~~~
SiempreViernes
No, pretty obviously the claim is that language properties will _affect_ the
type of commits being made, _this_ is the premise they start from.

The CACM paper then did a bad job of testing whether this is true by not being
very careful with the labelling and cleaning of their data, a point on which
they were then attacked.

Please notice that they were critiqued on their _methodology_, not for
thinking that commits are causally influenced by the design of the language
they modify.

~~~
mumblemumble
Their premise is even more deeply flawed than that. AFAICT, they didn't
consider selection bias seriously enough.

I'm not sure if it's politically correct to say this, but, e.g., it's entirely
possible that particularly meticulous coders are both more likely to select a
strong, statically typed functional language _and_ generate bugs at a lower
rate. And that the effect turns out to be almost entirely due to the person
and not the programming language.

There are also some fundamentally unanswerable questions. Like, would dynamic
typing come off looking better if the selection of dynamic languages weren't
dominated by warty messes like php and ecmascript?

~~~
luord
> I'm not sure if it's politically correct to say this, but, e.g., it's
> entirely possible that particularly meticulous coders are both more likely
> to select a strong, statically typed functional language

I don't think you need to worry; this isn't "politically incorrect". It's just
wrong.

Agree with what you said afterwards, though: Meticulous coders probably _are_
less likely to write bugs. But that will be true regardless of the languages
they choose or prefer.

~~~
mumblemumble
> It's just wrong.

It's something everyone has a strong opinion on, but nobody's ever really
figured out how to measure. Leaving it a subject that's still firmly in the
province of conjecture. Typically self-affirming conjecture.

Me, I like dynamic languages, but, given that this subject is pretty much a
complete knowledge vacuum, populated only by occasional iffy, easily
criticized studies like the one in question, I'm inclined to say that the only
position that offers any decent chance of not being informed primarily by
confirmation bias is to assume that, for the most part, none of us has any
firm idea what they're talking about. Least of all oneself.

------
WalterBright
Number of commits as a metric is a poor idea, because it will immediately be
gamed. I constantly advocate with D that people break up larger PRs into
smaller self-contained ones, for several reasons. Undermining this with a
metric that draws unfounded conclusions isn't helpful.

~~~
goto11
Just make sure you don't use the word "fix" in the commit messages, then the
reliability of D will be through the roof!

~~~
eternalban
That _was_ funny, but in fairness can you imagine how expensive and drawn out
a real study into this question would be?

~~~
zzzcpan
All it takes to avoid that is looking into the commit histories of projects in
each language and figuring out how they mark bugfixes. If a language looks
suspiciously reliable, you would know you made a mistake classifying bugfixes
somewhere, or your sample is too small, or the programming practice in that
language is unusual and needs a closer look.

------
hcs
Interesting rundown, thanks!

I'm a bit puzzled with the "not a dox" tweet, what is going on here? I know
that this is probably the least interesting part of the article, but I'm
confused and curious.

~~~
hwayne
Ooops, didn't quite explain that properly! The tinyurl links to Berger's
Facebook page.

------
muglug
It's amazing the paper got published in the first place – the idea that
comparing open source codebases could tell you all that much about the
underlying language is sort of bizarre when the projects being compared are so
different to begin with.

~~~
pron
It is not the role of peer-review to do this kind of gatekeeping (although it
would be nice if it could filter out bad methodology and data errors). This is
the role of exchanges just like the one discussed here.

~~~
muglug
Journals can reject a paper on editorial grounds – for example, Nature doesn't
publish every valid paper submitted to it.

~~~
pron
True, but it is not the role of peer-review as a whole to prevent papers whose
premise you can argue with from being published anywhere. The goal of peer-
review is to filter out papers that use bad methodology. Debates over premises
are carried out in other papers.

------
jdm2212
Every project grows until it becomes too complex and hence unmanageably buggy.
Some languages/static analyzers/linters/architectures/frameworks let a project
become bigger and more complex before becoming unmanageably buggy.

So if you look at two projects, and both have a bunch of bugs, that doesn't
tell you much. You have to know bugs-per-unit-complexity, which is pretty damn
close to unmeasurable.

It's telling, though, that the companies managing the world's most complex
systems run them in C++, Java, and TypeScript. And more recently, Go and Rust.

With much finance research, if the results were real, the researchers would be
making billions on Wall Street rather than be academics. So, too, with
software research. If the results were real, the researchers could become
gazillionaires advising FAANG instead of being academics.

~~~
westoncb
This is something I come back to repeatedly when attempting to evaluate claims
about certain technologies (e.g. programming languages) being extravagantly
effective: people's sentiments about how great some technology is often seem
out of proportion to what their gains were from using it.

Of course that's difficult to evaluate. But if their gains _were_ in
proportion, I think you'd see these crazy spikes like what you're saying about
these researchers becoming gazillionaires. And it just doesn't seem to happen.

Which is not to say there's _no_ difference, but 1) the improvements seem to
be by smaller factors than what engineers believe, and 2) they don't seem to
generalize as much as is often claimed: afaict, a significant aspect of the
gains has to do with reducing the impedance mismatch between problem domain
and language (but advocates will typically adopt a stance that the language is
just intrinsically superior).

I would be very interested to hear counter-examples if people have them (I
know the PG/Viaweb/Lisp story).

~~~
jdm2212
I think the folks really fascinated with which language is intrinsically
superior are missing out on the fact that the language itself is <10% of the
software development process. There's tooling, architecture, libraries,
monitoring, etc. The language's feature set or verbosity is rarely the
bottleneck.

I'm skeptical that the PG/Viaweb/Lisp story generalizes at all to non-
geniuses. Like, sure, if you gave Paul Graham raw assembly he could probably
do some pretty impressive stuff even so. Doesn't mean anyone else can. He's a
genius. So is Robert Morris.

------
dang
Related from a few months ago:
[https://news.ycombinator.com/item?id=21637411](https://news.ycombinator.com/item?id=21637411)

------
luord
It's astounding how discussing programming languages can devolve into
flamewars, even when the discussion starts as actual academic research and the
participants are dedicated scientific researchers.

------
pron
This is a truly terrific post, but it says:

> They’re underselling the effect here. While the dominant factor is number of
> commits, language still matters a lot, too. If you choose C++ over
> TypeScript, you’re going to have twice as many DFCs! That doesn’t
> necessarily mean twice as many bugs, but it is suggestive. Further, while
> they say the effect is “overwhelmingly dominated by […] project size, team
> size, and commit size”, that doesn’t actually bear out. Only the number of
> commits is a bigger factor than language choice.

This is inaccurate. "Effect" is not the expected difference (i.e. difference
between the means) but, roughly, the expected difference divided by variance.
Just looking at the expected difference is insufficient to determine if the
effect is large (and undersold) or small.

Expected difference is not an interesting statistical property, just as the
mean isn't (by itself). If you're looking at ten Clojure projects and ten C++
projects, and all ten Clojure projects have 10 DFCs, while eight C++ projects
have 8 DFCs and two have 500, then the expected difference is huge, but the
effect is small. Indeed, when looking at the variance, Clojure (the
"best"-performing language in this dataset) and C++ (the "worst"-performing
one) were not very distinguishable, supporting everyone's finding of a very
small effect.
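
As a rough illustration, here is the toy example above run through Cohen's d
(just one common standardized measure, not necessarily what either paper used,
and admittedly dubious with two such extreme outliers):

    import statistics

    # The toy numbers from above: ten Clojure projects vs ten C++ projects.
    clojure = [10] * 10
    cpp = [8] * 8 + [500] * 2

    # Raw expected difference: looks enormous.
    print(statistics.mean(cpp) - statistics.mean(clojure))  # 96.4

    # Cohen's d: the same difference divided by the pooled standard deviation.
    def cohens_d(a, b):
        na, nb = len(a), len(b)
        pooled_var = ((na - 1) * statistics.variance(a) +
                      (nb - 1) * statistics.variance(b)) / (na + nb - 2)
        return (statistics.mean(b) - statistics.mean(a)) / pooled_var ** 0.5

    print(cohens_d(clojure, cpp))  # ~0.66, far smaller than the raw gap suggests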

~~~
closed
> Expected difference is not an interesting statistical property, just as the
> mean isn't (by itself).

Difference in means can meet the definition of effect size, and it is listed
as an example right away in the Wikipedia article on it [0]! The key is that
it captures the phenomenon of interest (and generally, the magnitude shouldn't
be a function of the number of observations). In psychology, where scales
often have arbitrarily defined ranges (eg IQ scales), means usually are not
useful effect sizes.

The case you list is a separate issue from effect size (probably the violation
of distributional assumptions in some implicit or explicit model). Even using
difference in means / variance (say, Cohen's d), two observations that extreme
would make the interpretation of both mean and variance calculations pretty
dubious (a problem not solved by dividing them).

[https://en.wikipedia.org/wiki/Effect_size](https://en.wikipedia.org/wiki/Effect_size)

~~~
pron
Which example? The six examples that talk about "difference in means" really
show the difference in means divided by some measure of variance. Who would
consider the same expected difference to mean the same effect size when the
population distributions are narrow and when they're wide?

> two observations that extreme would make the interpretation of both mean and
> variance calculations pretty dubious

Yeah, it's a bad example, but I think it at least gives an intuition for why
expected difference is not a good measure of effect. This shows two data sets
with the same expected difference but one shows a large effect and one shows a
small one: [https://imgur.com/a/hg2NNl2](https://imgur.com/a/hg2NNl2) (the
rebuttal paper also draws the expected value _with variance_ and shows a
similar situation).

~~~
closed
> Which example?

First paragraph: "Examples of effect sizes include the correlation between two
variables,[2] the regression coefficient in a regression, the mean difference,
or the ..."

> Who would consider the same expected difference to mean the same effect size
> when the population distributions are narrow and when they're wide?

For the case where the unit of measurement has a sensible, relevant
interpretation (e.g. if a study measured the dollar value of two
interventions), and attempts are made to capture uncertainty via a CI or other
means, I would consider it _one_ measure of effect size.

The key to understanding effect size is that its most common focus is on
making measures over arbitrary scales scale-invariant. But scales are not
always arbitrary.

Daniel Lakens has a great article on ES and puts the motivation for
calculating them well:

> First, they allow researchers to present the magnitude of the reported
> effects in a standardized metric which can be understood regardless of the
> scale that was used to measure the dependent variable. Such standardized
> effect sizes allow researchers to communicate the practical significance of
> their results (what are the practical consequences of the findings for daily
> life), instead of only reporting the statistical significance (how likely is
> the pattern of results observed in an experiment, given the assumption that
> there is no effect in the population).

[https://www.frontiersin.org/articles/10.3389/fpsyg.2013.0086...](https://www.frontiersin.org/articles/10.3389/fpsyg.2013.00863/full)

To the degree that the thing measured has an inherently meaningful scale to
the researchers (eg dollars, time), it is already an effect size measure.

There is some nuance here, since the level on which you want to interpret
something matters (eg the spread could be part of the value, especially to an
individual who will have one codebase and not many).

You might also want to render it comparable to other studies that measure
something else, and it's unclear how to convert it to your scale, so you use
standardized ES measures (eg many meta analyses).

But an important point is that whether something qualifies is really a
question of the scale. (What the best ES is for your specific question is
another important issue!)

~~~
pron
> the mean difference

Yes, but later the article shows that "mean difference" is really mean
difference divided by variance.

> an important point is that whether something qualifies is really a question
> of the scale

It's one of the concerns, but I wouldn't say it's the main one. An effect size
needs to distinguish between the two cases here, both having the same expected
difference: [https://imgur.com/a/hg2NNl2](https://imgur.com/a/hg2NNl2)

If you care about units, you can talk about the expected value/difference, but
that doesn't make it a meaningful effect size. What you need to do in those
cases is, at the very least, to mention both the expected difference and the
variance.

~~~
closed
> Yes, but later the article shows that "mean difference" is really mean
> difference divided by variance.

I would say the article later shows examples of standardized effect size,
which is why those divide by variance. I disagree that this means effect size
can't be a difference in means, or that it explains away the wiki intro
comment--see my Lakens comment for why.

> If you care about units, you can talk about the expected value/difference,
> but that doesn't make that a meaningful effect size. What you need to do in
> those cases is to, at least, mention both the expected difference and the
> variance.

This seems like it is begging the question. If you look at the definition and
rationale for effect size, both in the Wikipedia article and in Lakens, a
claim this strong is not there. For example, a CI over a difference in means
is one way to capture what a standardized effect size that uses a specific
variance term might be doing (what Lakens describes as the second ES
viewpoint: statistical significance, distinguishing between effect and power).

In any event, thanks for discussing--it's been really helpful to think about
and revisit this topic!

------
d0mine
> the entire idea of measuring a language’s defect rate via GitHub commits
> just doesn’t make any sense. How the heck do you go from “these people have
> more DFCs” to “this language is more buggy”?

If GitHub is used by ordinary programmers (what are the reasons to assume
otherwise?), i.e., if the sample is unbiased, then what is the issue with
going from "these [random] people" to "this language"?

------
6510
I'm surprised someone takes my github saved games seriously. It certainly
wasn't me.

I think this science project would be more fun if assumptions were discarded.
Just gather the data, do it without a hypothesis. Something interesting might
come up worthy of one.

I liked the bit of crankology in the article. If you sound like a crank, no
one will take you seriously? I once pondered: what if the cranks are right?
How would we know? It follows that we can't know that sounding like a crank
equals being wrong, and that our idea of "sounding serious" is based on a
flawed data set, if it isn't noise entirely. The topic of the study here
certainly doesn't inspire my confidence.

~~~
nxpnsv
Such post hoc hypothesis trawling is essentially p-hacking. Try a large enough
set of explanations and your data will show something significant. It just
might be hard to replicate.
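
The back-of-the-envelope arithmetic (nothing specific to this study, just the
standard multiple-comparisons calculation):

    # Probability of at least one "significant" result when testing k independent
    # true-null hypotheses at alpha = 0.05.
    alpha = 0.05
    for k in (1, 5, 20, 60):
        print(k, round(1 - (1 - alpha) ** k, 2))
    # 1  0.05
    # 5  0.23
    # 20 0.64
    # 60 0.95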

~~~
hwayne
One of my favorite examples is this tour de force on "proving" that chocolate
helps with weight loss: [https://io9.gizmodo.com/i-fooled-millions-into-
thinking-choc...](https://io9.gizmodo.com/i-fooled-millions-into-thinking-
chocolate-helps-weight-1707251800)

~~~
nxpnsv
or the forever classic: [https://www.xkcd.com/882/](https://www.xkcd.com/882/)

------
0xdeadbeefbabe
> But it still seems intuitive that language should matter.

Am I the only one that thinks this is a naive assumption? They aren't really
languages like English is a language--so much for intuition. Don't definitions
matter to science? Does science carry a selfie stick? I'm suddenly keen on the
antithesis of science or a less social science.

------
z3t4
If _all_ you cared about was fibonacci, or some other "quality", then fine,
use that as a benchmark. But others might care more about productivity, cost,
etc. Then it's also a matter of taste and what is currently fashionable.

------
fulafel
There's talk and comments about the measure ("DFCs"). What would be better?
And is it possible to use this one to eke out a signal even though the
measurement has some known problems?

------
mcguire
Two comments:

" _FSE was a preprint: a paper that hadn’t necessarily gone through peer-
review, editing, and publication in a journal. Academics share preprints
because 1) the process of getting a paper journal-ready can take years, so
this gets it out faster, and 2) academics can share their research with people
who aren’t able to afford the ludicrous journal fees._ "

To my knowledge, all ACM conferences, including the "ACM Joint European
Software Engineering Conference and Symposium on the Foundations of Software
Engineering", are peer-reviewed. In CS, in my experience, peer review for
conferences is more thorough than journal review. Conferences are more
important. The idea that they are preprints is something from linguistics or
some other field.

" _How did FSE take the rebuttal rebuttal? Not well._

" _So that’s technically not a dox, because he didn’t publish Berger’s private
information, but still. That’s a really asshole thing to do!_ "

I'm not sure I would call linking to someone's Facebook page from Twitter
"doxxing" in any sense. Calling a fellow researcher a "donkey" _is_ a bit
of an asshole thing, too. (I have no link to said comment, but I know Emery
Berger. He's very positive he's right, the smartest in the room, and very
abrasive.)

Tl;dr: Software engineering research is a garbage fire. But we knew that.

