
The impact of language choice on github projects - fogus
http://corte.si/posts/code/devsurvey/index.html
======
philh
The graphs in a grid layout need more whitespace below them. I frequently
thought the labels were for the graph above, which made his interpretations
make no sense.

> I suspect that the Perl result is due to the fact that it becomes harder and
> harder to contribute to a Perl codebase, the bigger it gets.

And a codebase in any other language retains its original complexity no matter
how large it grows? A more reasonable explanation from the comments:

> My experience is that a lot of Perl projects are helper modules with focused
> scope, which get written as supporting units in the course of other work.
> They progress to a point where the author(s) consider(s) them satisfactory,
> then go into maintenance mode.

~~~
cortesi
The surprise isn't that Perl's commit rate declines at all - it's that the
decline seems to be so much sharper than that of other languages. A number of
dedicated Perl programmers have fixated on that one speculative aside in my
post - in fact, this is the first blog post I've ever written that has
garnered me some genuinely nasty email. Perhaps a sign of a wee bit of
defensiveness in the Perl community at the moment?

A number of people have argued that Perl projects somehow approach completion
more quickly than other projects, and that this explains the decline. I'm
pondering ways to test this idea from the data, but I must say that it sounds
pretty implausible to me.

~~~
chromatic
> I'm pondering ways to test this idea from the data, but I must say that it
> sounds pretty implausible to me.

It sounds exactly right to me, if you consider that some 21,000 of those Perl
projects are CPAN distributions. There's a very strong bias among CPAN
contributors to building small, reusable tools. I am the primary developer of
some 30 of those Perl 5 projects. My commit rate has followed the graph of
several commits in the initial stages, then very few after a year or eighteen
months because the projects entered bugfix-only maintenance mode.

> Perhaps a sign of a wee bit of defensiveness in the Perl community at the
> moment?

You made some flippant provocative statements apparently based on poor
assumptions unsupported by any of the evidence you presented. What did you
expect?

~~~
cortesi
What surprises me is that NONE of my charming Perl correspondents have argued
that the Github data is simply atypical. As I point out in my post, it's not
just possible, it's likely that this is the case. Instead, every single one
has claimed with the type of absolute certainty you can only achieve by having
no data at all that it's due to CPAN, and some magical tendency towards
completeness that it imparts to projects... Again, I think it's implausible,
but I'm open to suggestions of ways to settle the matter with actual data.

~~~
chromatic
> ... it's due to CPAN

Given the huge jump in the number of Perl projects available on GitHub thanks
to the recent BackPAN import, it's a reasonable conclusion. Likewise the
commit history; CPAN's fourteen years old.

> ... charming ... the type of absolute certainty you can only achieve by
> having no data at all ... some magical tendency...

You'll have a much more fruitful discussion without this condescension.

> I'm open to suggestions of ways to settle the matter with actual data.

Easy suggestion: find the percentage of Perl projects in your study that came
from the BackPAN import. See if they match the experiences of the CPAN
contributors who've offered explanations.

~~~
cortesi
It's genuinely difficult to avoid being condescending given the type of
conversations I've been having about this. For what it's worth, it's not aimed
at you specifically.

At any rate, I'm happy to ditch the snark, and talk about something concrete.
I'm not sure what your suggestion is - I could separate out the CPAN projects
(is there any way to do this programmatically?), and see if they show a
greater decline in commits than non-CPAN Perl projects. But that wouldn't
settle the issue, because I would still need some way to distinguish between
projects that have decreasing commits because developer impetus is petering
out as projects become more unwieldy, and projects that have decreasing
commits because they are nearing "completion". I would also want to compare
the results with an equivalent set of Python or Ruby libraries - choosing an
appropriate set would be tricky.

------
kscaldef
Most of the interpretation on these graphs seems like subjective speculation.
One thing I think is worth pointing out is that the observation that "C, C++
and Perl projects are significantly more "top-heavy" than those in other
languages, with a smaller core of contributors doing more of the work" may be
almost entirely explained by the fact that projects in those languages also
have a larger median number of contributors. If you postulate that the size of
the core group of committers is the same for all languages and projects (in my
experience, this number is very close to 1), then projects which attract more
occasional contributors will appear more "top-heavy" because the core is a
smaller fraction of the total population of contributors.
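
To make the arithmetic concrete, here is a minimal sketch. The function name and the project sizes are made up for illustration; the only thing taken from the argument above is the assumption that the core stays at 1 person.

```python
def core_fraction(core_size: int, total_contributors: int) -> float:
    """Fraction of a project's contributors who belong to the core group."""
    if total_contributors < core_size:
        raise ValueError("total contributors cannot be smaller than the core")
    return core_size / total_contributors


# Hypothetical projects: same one-person core, different contributor counts.
for total in (2, 5, 20):
    print(f"{total:>2} contributors -> core is {core_fraction(1, total):.0%} of them")
```

The core's share drops as occasional contributors are added, even though the core itself never changes.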

~~~
cortesi
What you describe is possible, but I'd be surprised if the size of core
committer groups was that stable as project size grows. Intuitively, I'd
expect the size of the core committer group to grow more or less at the rate
of the total active committer pool. At any rate, this can be tested quite
easily, given an appropriate definition of what a "core committer" is... I
released the dataset precisely to make it possible for other people to check
this type of conjecture.
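
As a rough sketch of one such test: the definition below (the smallest group of top committers covering 80% of commits) is an assumption chosen for illustration, and the input is just a list of author names per commit, not the actual layout of the released dataset.

```python
from collections import Counter


def core_committers(commit_authors, share=0.8):
    """Return the smallest list of top authors covering `share` of all commits.

    The 80% default is an assumed definition of "core", not one from the post.
    """
    counts = Counter(commit_authors)
    total = sum(counts.values())
    core, covered = [], 0
    for author, n in counts.most_common():
        if covered >= share * total:
            break  # the authors collected so far already cover the target share
        core.append(author)
        covered += n
    return core
```

Running this per project and plotting `len(core_committers(...))` against the total number of committers would show whether the core stays flat or grows with the pool.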

------
city41
I've been curious how much position of curly braces in C derived languages
affects open source popularity.

Nowadays it seems like placing the opening bracket on its own line is more
popular. I have found people are unusually turned off by these choices. i.e., if
you prefer the bracket on its own line, code where it is at the end of the
line really bothers you, and vice versa. I wonder how much this affects
adoption of new projects.

~~~
boucher
In an ideal world (and apparently this is already a feature in some Java IDEs)
you would just tell your editor how you wanted curly braces and whitespace
formatted, and it would always show it to you that way.

My understanding of the existing implementation, I believe it's in IDEA, is
that it will do this, and save new changes back in whatever style had the
highest frequency when the file was opened.

I'm a fan of having project wide style guidelines. But, if you seriously don't
use open source code because of where the curly braces are, you're being
pretty stupid.

------
draegtun
Interesting stats but there are too many "assumptions" made on what they
actually mean!

Interestingly the latest language stats from GitHub
(<http://github.com/languages>) show this:

    
    
      Ruby          22%
      JavaScript    15%
      Perl          14%
      Python         9%
      Shell          7%
      C              6%
      PHP            6%
      Java           4%
      C++            4%
      Objective-C    2%

~~~
cortesi
For comparison, after eliminating projects with 3 or fewer watchers and
duplicate projects, my language breakdown looks like this:

    
    
      Ruby           35.3%
      Python         11.5%
      JavaScript      9.4%
      PHP             7.6%
      C               5.4%
      None            5.3%
      Java            5.3%
      C++             4.0%
      Objective C     3.8%
      Perl            3.6%
      C#              1.7%
      Erlang          1.4%
      ActionScript    1.3%
      Scala           1.0%
      Lua             0.9%
      Clojure         0.7%
      Lisp            0.6%
      Haskell         0.5%
      Go              0.5%
      Objective J     0.3%
    

Pretty close to the overall estimate by github. Some of the difference can
probably be explained by the fact that Github tried to eliminate commonly
included libraries when they did their file line counts, while I didn't.
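
For anyone who wants to redo the breakdown with a different cutoff, the filtering step looks roughly like this. It is a sketch only: the `(name, language, watchers)` tuple layout is hypothetical, and treating a repeated project name as a duplicate is an assumption, not necessarily how duplicates were detected for the post.

```python
from collections import Counter


def language_shares(projects, min_watchers=4):
    """Percentage of projects per language after filtering.

    `projects` is an iterable of (name, language, watchers) tuples; this
    record layout is hypothetical, not the format of the released dataset.
    """
    seen = set()
    counts = Counter()
    for name, language, watchers in projects:
        if watchers < min_watchers or name in seen:
            continue  # drop low-watcher projects and repeated names
        seen.add(name)
        counts[language] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lang: 100 * n / total for lang, n in counts.items()}
```

Calling it with several `min_watchers` values shows how sensitive the percentages are to the cutoff.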

~~~
draegtun
Ruby & Perl are currently at 21% & 18% respectively on Github, so it bears no
resemblance to your figs.

I understand what you're trying to do by eliminating projects with fewer than
4 watchers, but this is an arbitrary figure and the results you produced are
therefore affected by this decision.

When you play with population samples then side effects can creep in. You can
see this in the difference in Ruby being 35.3% in your figs and it being 21%
on Github. It's a big difference and can possibly be explained by things like:
<http://www.ruby-toolbox.com/>

------
davidw
I love these kinds of things, and he's done some cool stuff with the data he
has. It'd be interesting to see how this would look on a more 'mature' site
like SourceForge.

~~~
pilif
I don't know whether "mature" is the right word.

Different. Yes. Older. Yes.

Github and SF represent two wholly different development paradigms: SF
(mostly) represents the central way for doing development where a central
repository contains the officially blessed code and additions are done by
sending patches to maintainers who then go ahead and igno^H apply them.

github is based on forking and moving patches around in a more fluid manner.

Due to how the traditional systems (CVS, SVN) work, finding contributors on SF
would actually be very hard because the traditional systems don't discern
between author and committer: If I send a patch for awesome-project to a
maintainer, they would commit it in their name and it would be impossible to
automatically determine my initial ownership.

If I fork awesome-project on github, create a patch and make them pull (or
cherry-pick or whatever) from my repository, git will track them as committer
and myself as author.
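
A quick way to see this in a repository is to count commits where the two fields differ. The sketch below assumes the log was produced with `git log --pretty=format:%an%x09%cn` (author name, tab, committer name); the function name is made up for illustration.

```python
def count_external_patches(log_text):
    """Count commits whose author differs from the committer.

    Expects one commit per line as "author<TAB>committer", for example
    the output of: git log --pretty=format:%an%x09%cn
    """
    external = 0
    for line in log_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        author, committer = line.split("\t", 1)
        if author != committer:
            external += 1
    return external
```

On a straight SVN mirror this count would be expected to sit near zero, since the original patch authors were never recorded separately.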

This is probably another reason why, on the original article, ruby projects
seem to have so many more contributors: Some of the older non-ruby projects on
github are mirrors of the project's main SVN repository, thus losing all author
information on patches coming from external contributors.

One must be careful to really be comparing apples to apples here, but still,
in light of these inherent limitations, the article was very interesting.

~~~
davidw
I meant 'mature' in the sense of having been around for many more years than
github.

By the way, though, git does not change the human nature of projects: there is
still usually one official one. Git 'forks' are just a more efficient means of
doing patch management (you point out some of the benefits), rather than
(hopefully), different, competing versions of the same codebase. I say
'hopefully' because in _most_ cases it's nicer for everyone involved if the
code in question has one more or less authoritative source, rather than
forcing the user to figure out which one of the 788 rails forks is the 'real'
one. Project forks are occasionally necessary, but they are also costly, which
is why they should be rare, and only for situations where it is impossible to
collaborate. In some ways it's a pity that git/github use the 'fork'
terminology: while it's technically a fork, in most cases it does not
represent a genuine attempt to create a competing code and user base.

~~~
pilif
yeah. of course.

but if I send a pull-request or even just a patch created by "git
format-patch" and that patch gets applied, then I am credited as author and
the person who put it into the official repository is credited as committer.

In SVN or CVS, you would do this by some comment in the commit message or the
changelog but it's entirely optional and often such contributions are only
visible in some mailing list or bug tracker.

This skews any analysis of contributor frequency in the traditional style of
managing projects.

------
mindstab
There are no Lisp projects on Github? That seems like a missing piece of data,
or maybe I'm missing something.

~~~
cortesi
The Lisp dataset was really small - after eliminating projects with less than
3 watchers and duplicates, there were 19 projects left. Haskell - with 18
projects - was left out for the same reason.

