First, many repositories are not a single language. For example, this PHP framework is reported as a CSS project. While it has more lines of CSS than PHP, it only has a single CSS file.
Second, GitHub has a problem with correctly identifying programming languages. For example, PrimeCoin is identified as one of the most popular TypeScript repositories, but it has 0 lines of TypeScript code. Instead, it has... large localization files with the extension *.ts. BitCoin used to have the same problem, but it looks like GitHub hack-fixed it for that particular repository, as less popular forks of BitCoin still have this issue.
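For what it's worth, the *.ts mix-up is the kind of thing a quick content sniff could catch. A minimal sketch in Python, assuming the localization files are Qt Linguist translation files (XML with a `<TS>` root element); that is my assumption about these repositories, the function names are made up for illustration, and this is not a description of how GitHub's detector works:

```python
def looks_like_qt_translation(text: str) -> bool:
    """Heuristic: Qt Linguist .ts files are XML, so they open with an XML
    declaration, a TS doctype, or the <TS> root element."""
    # Drop a possible byte-order mark and leading whitespace, then look
    # only at the first few hundred characters.
    head = text.lstrip("\ufeff").lstrip()[:512]
    return head.startswith(("<?xml", "<!DOCTYPE TS", "<TS"))


def classify_ts_file(text: str) -> str:
    """Label the contents of a *.ts file (labels are hypothetical)."""
    return "qt-translation" if looks_like_qt_translation(text) else "typescript"
```

Going by extension alone, both kinds of file look the same; peeking at the first bytes is enough to tell them apart in the common case.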
It took me a few minutes to find these examples, just by examining trending repositories. I'm sure there are many more. So do not be rash in drawing conclusions from this data! :)
Ideally, the project manager should be able to define the language composition in their own projects. Something GitHub should consider IMO.
About half of the repos that I'm dealing with are mis-detected (usually as CSS or HTML due to included documentation).
Why open source at all, what do they get out of it?
Lots of data right there, and nicely visualised at that, only what it actually means is unfathomable without knowing any broader context.
For instance: C++ has the greatest number of opened issues per repository, then comes Rust, then Scala. All right.
Does it indicate that they're more tricky than others and hence more bug reports?
Or perhaps that projects written in these languages are under more intense scrutiny?
Or that people watching these repositories are just more eager to step up and file an issue instead of sulking in silence (a trait of programming culture surrounding these languages)?
And so on, and so on.
Or it could be one in the case of C++, another in the case of Rust, since they differ in so many respects.
Wide field for wildest speculations, but no meaningful correlations identified.
 - Breaking changes have simmered down a lot in the last few weeks. We still have one more semi-big one in front of us, but hopefully smooth sailing from there...
For practical or business purposes, this is a nice bit of incomplete information to help make a decision. I want to take a serious, time-invested dive into a new statically compiled language, but which one should I pick? An old die-hard or the new-hotness? I could make a guess from reading the docs and such, but I'd also want to know community activity and support. This is a handy chart for getting a sense of that.
Or, I'm a business owner who just hired my first engineer. He's saying the backend should definitely be built in Groovy, or maybe he'd be willing to do Scala or something else, but definitely Groovy, yeah, Groovy man. I might be able to get a better idea of which would be beneficial for my long-term business prospects (hiring more engineers, etc) by checking out a chart like this as I might not have time to do real in-depth research.
As a scientist you require complete, sound and accurate statistical data. As a business practice (this site is about start-ups, no?) you need to be comfortable making serious and important decisions based off of incomplete and possibly inaccurate information because making fast decisions is often paramount. You can and will always make other fast decisions later and decide whether it's worth the effort to course-correct if you need to due to new and more accurate information.
This is maybe too deep an analysis of a fun little infographic, but as a former professional poker player who made a living off of incomplete information you got my cockles up.
No, not really, because you've no idea what assumptions are baked into the data. For some decisions, you can make fast, gut-based ones. For others, you need to take a much more considered and scientific approach. The difference can be defined by the ability to course-correct after-the-fact (the harder to course-correct, the more stringent the decision-making process). There's an entire academic (and military) discipline around decision-making processes and with good reason. People want to make good decisions as well as quick ones.
Anyone making business critical decisions based on this chart, without doing the extra work to understand the data, is basically lying to themselves. That's why vanity metrics and data-porn should be handled with extreme caution.
The entire chart? Wouldn't the first column be sufficient? Number of repositories gives you some idea about language popularity.
Well, kind of: there's a hype bias here. Obviously the choices behind open-sourced projects on GitHub aren't representative of the industry. It's the software world's avant-garde, if anything.
And even so, that's just one parameter out of five, and it can very well be considered in isolation from all the rest.
I wouldn't make business decisions based, for instance, on the average number of open issues. Because it's an outcome of many different variables. So how would you know what it means? Is high good? Is high bad?
Interrelations between data - shown by this clever chart - are even more mysterious.
TeX has a very high number of pushes per repository (second best), while there's fairly few repositories, and they are rarely forked.
At the same time R has a low number of pushes (second lowest of all), whereas it wins in the "new forks per repository" category (#1).
What do you make of that, businesswise?
Titbits of incomplete information are often placed as a result of publicity campaigns. In the specific case of the github source info for this graphic, the languages near the bottom of the list can easily have that information manipulated by their backers scamming the stats. All you need is one change to be pushed during the measured period for a project to be registered as active, a data point which I know is being scammed for at least one language near the bottom of the list.
Is it? Does the fact that people open lots of issues in C++, Rust and Scala make you more or less inclined to pick one of those for your new project? Why?
I'm all for making the best use we can out of incomplete or noisy data, but that stat doesn't tell you anything, it's just a number.
I have seen "There are two thousand open issues, do they ever fix any bugs?" on a few occasions.
Having said that, I do think the visualisation is beautiful and there are definitely useful things that could be drawn from the data if someone were willing to do the extra work. However, I'm not sure I have much faith in the data quality e.g. some of my repos are considered 'CSS' just because I've added some boilerplate from elsewhere.
> Does it indicate that they're more tricky than others and hence more bug reports?
They're all static languages. In fact, so are C# and Java (the next two of the top 5).
Two of these (C++ and Rust) are also in the top 5 for pushes per repo. C++ being top overall.
IMHO, some of these stats need to be scaled with respect to LOC. The push/issue ratio might be good as well.
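To make the scaling idea concrete, here is a rough sketch in Python. The function names and the per-thousand-lines unit are my own choices, not anything GitHut computes:

```python
def issues_per_kloc(open_issues: int, lines_of_code: int) -> float:
    """Open issues per 1,000 lines of code (hypothetical normalisation)."""
    if lines_of_code <= 0:
        raise ValueError("lines_of_code must be positive")
    return open_issues / (lines_of_code / 1000)


def push_issue_ratio(pushes: int, open_issues: int) -> float:
    """Pushes per open issue; infinite when there are no open issues."""
    if open_issues == 0:
        return float("inf")
    return pushes / open_issues
```

With numbers like these, a large codebase with many issues and a tiny one with a handful can at least be put on the same scale before anyone compares languages.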
Or that these projects are actually used.
The question "What is ranked above Ruby for New Watchers Per Repository?" seems to be a question this dataset should be answering, but it is enormously difficult to parse here.
- Ruby (that was a bit surprising)
- Dart (I guess the lack of native browser support is the killer here)
- Typescript (I'm surprised this didn't take off)
- Puppet (Interesting.)
- ActionScript (obvious now that Flash is dead)
- Common Lisp
- Logos (huh?)
(I know near flat is subjective, but still these are the languages that are not seeing much growth in 2014, and what likely isn't growing strong in 2014, is likely to continue that trend in 2015.)
And totally agree that Ruby is surprising. I'm a Pythonista myself, but always thought Ruby was fairly comparable if having a different approach. I don't have enough experience with it though to understand the possible reasons for the drop.
Also, the tendency for many small Rubygems (and Bundler's support for installing gems from git) meant you had many more repos than you would for languages like Java, where it's pretty common to build multiple jars out of a single repo. The npm community seems to be if anything even more prolific in producing large quantities of very narrowly tailored libraries.
I think this is a case where the pie has just gotten bigger, rather than anyone's piece getting smaller.
Based on what?
- Make sure you add Racket to the Scheme total.
- Interest in Common Lisp has tended to come and go over the years.
In addition, some of the repos that have OCaml code may not be recorded as such. Repos where the 'brains' is in an expressive language might be overshadowed by boilerplate from elsewhere.
I think the adoption problem for OCaml is compounded because it suffers from a lack of Stack Overflow hits for any given errors that you might encounter or any given queries you might have. Searching for something as mundane as "how to read large files in OCaml" leads to just a single hit (Streams at OCaml.org).
Also, OCaml needs a "recipes/patterns" book on how to get some of these things done the right way in OCaml.
 BTW, a big fan of your work.
However if you find OCaml in the tiny graph on the bottom you'll see that it's steadily increasing at least. Up about 50% in active repos from 2013 to 2014.
Err... based on what exactly? I've been working in the OCaml community for several years now and it's going from strength to strength. Do we have different definitions of 'strength' or something?
Also, another thing that's peculiar with OCaml is that a lot of libraries are LGPL 3+, which makes it that much harder for corporates to adopt. And sometimes, alternatives to certain libraries are either hard to find or not actively maintained. It could also be that I have been looking in all the wrong places.
I'm a big fan of the parallel lines chart, and this one is well executed. The data labels are unobtrusive, appearing on hover to let you dive in. The data set is coordinated with navigation on the timeline above using the principle of object constancy. I really like that you can click a language to pin it; you can focus on a few languages and watch their evolution over time by scrubbing the line chart. (I don't like that if a pinned language falls off the chart at one point in time it isn't restored when you go back to a time that it's on the chart.)
I like the idea with the small multiples below, but I wish there was less wasted space; it's hard to see very many at one time. There's not really a need for full-blown small multiples here - vertically-aligned sparklines would be more effective. If they were in the same table as the parallel lines it would allow a deeper exploration of the data.
I've been trying to learn Rust myself by spending ~30 mins every day on it for the past 2 months. It's strange how simple it is to make something, but it's hard if you have no experience in functional programming.
It seems that everybody speaks about these languages but then they don't use them.
If that's true then large bodies of application specific code would exist off the GitHut radar, I suspect.
Not sure about Erlang.
CSS: has 80% more pushes than C++ WOW :O
Safe Languages: are probably not as safe as we think
- Misleading: this isn't talking in any way about the industry itself, only about the LOVE given to each programming language, for the following reasons:
a) Developers in general contribute to open-source projects with the same attitude the GCC devs had when they said, as I understood it, "we're compiling GCC as C++; we are the ones writing the code, so if you want it as C, do so yourself".
b) Interest, Time and Location on GitHub diverge from reality:
Interest: developers are interested in doing new things when it comes to open source, so this may affect the numbers a lot.
Time: time changes everything.
It's interesting that strong static languages have more issues open (top 5) - easier to spot them?
for reference: http://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=0...
edit: the use of "small multiples" is superb as well
I'd love to see open issues / LoC / repo for each language.
I very much like how GitHut used issues/commits. In my interpretation:
(1) If your project has a lot of commits and few issues it has a very high quality.
(2) If your project has few commits and many issues it has either very low quality or is not being developed.
(3) Having a lot of commits and a lot of issues and vice versa is kinda expected, since new features (commits) often introduce new bugs and small projects often have few of both.
When you cross that with popularity (new forks, new watchers) over the years you can narrow (2) with some confidence.
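The three cases above can be sketched roughly in Python. The ratio thresholds are arbitrary placeholders I picked for illustration, not values GitHut uses, and the popularity cross-check is left out because the narrowing rule isn't spelled out:

```python
def interpret(commits: int, issues: int,
              high_ratio: float = 10.0, low_ratio: float = 0.1) -> str:
    """Map a project's commit and issue counts onto cases (1)-(3).

    high_ratio and low_ratio are hypothetical cut-offs for
    "a lot of commits per issue" and "very few commits per issue".
    """
    # Treat zero issues as one to avoid dividing by zero.
    ratio = commits / max(issues, 1)
    if ratio >= high_ratio:
        return "(1) many commits, few issues: likely high quality"
    if ratio <= low_ratio:
        return "(2) few commits, many issues: low quality or not developed"
    return "(3) commits and issues roughly in step: expected"
```

For example, `interpret(500, 10)` lands in case (1), `interpret(5, 200)` in case (2), and both `interpret(100, 80)` and `interpret(3, 2)` in case (3).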
Using that approach is trickier when it comes to comparing languages, but the data GitHut gives seems to be in line with common knowledge, at least when it comes down to open source software and when you compare the most popular languages.
Hard to say much for sure without breaking down the details, who's discovering the issues, how many are real, how many are serious/blockers versus minor annoyances with workarounds or feature requests, are the commits new features, bug fixes, refactoring, etc.
This is very interesting in my opinion: whenever someone asks why one of those languages doesn't do $featureA "like Java", you can just reply: "because Java wasn't a thing back then".
Unfortunately, someone who became a "despot" of the project at its repository (Codehaus) on 4 May 2004 started referring to himself as Groovy's creator in publicity articles about a year ago. A few months ago, someone even tried deleting the Wikipedia link to James Strachan's webpage announcing the Groovy Language.
In Ruby's case, the total number of repositories on Github has continually increased -- it's just that since it was such a huge part of Github's early user base (the Rails community was probably the first big adopter of Github, which makes sense since Github itself is written in Rails) percentage-wise it has dropped significantly as more communities adopt Github.
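A quick bit of arithmetic shows how both things can be true at once. The repository counts below are invented purely for illustration and are not GitHub's real numbers:

```python
def share(language_repos: int, total_repos: int) -> float:
    """Fraction of all repositories that belong to one language."""
    if total_repos <= 0:
        raise ValueError("total_repos must be positive")
    return language_repos / total_repos


# Hypothetical counts: the language grows by half in absolute terms...
early = share(2_000, 5_000)
later = share(3_000, 20_000)

# ...yet its share falls, because the whole site grew much faster.
print(f"early share: {early:.0%}, later share: {later:.0%}")
```

So a falling line on a percentage chart is compatible with a community that is still growing in absolute numbers.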
But I'd like to learn more about those kinds of graph if I'm wrong :)
It doesn't show up much on Github since its main repositories are hosted on SourceForge and Fossil. That includes the core language and most of the major extensions. Check the wiki for details.