
How The Rise Of The "R" Computer Language Is Bringing Open Source To Science - mathattack
http://www.fastcolabs.com/3028381/how-the-rise-of-the-r-computer-language-is-bringing-open-source-to-science
======
ACow_Adonis
Allow me space for an editorial and downvote magnets :P

I'm actually a little concerned by the "phenomenon" of R. I say this as
someone whose workplace uses both SAS and R. I'm also familiar with python,
and wish we could use it more at work, but it doesn't have the "cult-
following/network effect" amongst statisticians that R has.

The "problem" I speak of is that R is very popular for people applying a quick
little stats script for a package they've downloaded using a technique they
don't understand with output they haven't verified on a tiny problem that
won't scale. And 95%+ of users are just doing it by rote, and now they're
trying to apply it to problems outside of its domain.

But ACow_Adonis you say, doesn't that just describe everyone with every
programming language ever?

Yes. But you see, R seems almost designed (or not designed) as a language of
unseen problems. It is several multiples slower than regular python (if you
thought that possible), and several HUNDRED times worse than other compiled
languages. It has no un-boxed primitive numbers. Let me just say that again. A
language for numbers that doesn't have primitive unboxed numbers. It is the
poster boy of Wirth's law.

But not only that, I said it's basically been designed for "dodgy results".
Watch how its attempt at lexical scope combines with lazy evaluation for
ridiculous fun. Bizarre, automatic and random conversions behind the scenes.
1-indexing of arrays...but 0-indexing doesn't throw an error. Automatic
repetition of values in smaller arrays when combined with arrays of larger
size. Internal functions of one letter names in a language with KIND OF one
name-space for people dealing with MATH with a long history of using these
individual letters for other things!
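
To make the recycling complaint concrete, here is a minimal Python sketch that emulates the behaviour described above (a shorter vector silently repeated to match a longer one). `recycle_add` is a hypothetical helper written for illustration, not part of R or any Python library. Real R also emits a warning when the longer length is not a multiple of the shorter one, which this sketch leaves out.

```python
from itertools import cycle, islice


def recycle_add(xs, ys):
    """Add two sequences element-wise, repeating the shorter one as needed.

    This imitates R's recycling rule for illustration only.
    """
    longer, shorter = (xs, ys) if len(xs) >= len(ys) else (ys, xs)
    if not shorter:
        # In R, combining with a zero-length vector yields a zero-length result.
        return []
    # Repeat the shorter sequence until it is as long as the longer one.
    repeated = islice(cycle(shorter), len(longer))
    return [a + b for a, b in zip(longer, repeated)]


# The two-element sequence is reused three times, with no error raised.
result = recycle_add([1, 2, 3, 4, 5, 6], [10, 20])
print(result)
```

NumPy, by contrast, refuses to add arrays of shapes (6,) and (2,) and raises a broadcasting error, so a length mismatch surfaces immediately instead of producing a plausible-looking result.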

So combine these "features" of the language with people implementing things by
rote, not checking their results, returning results without error
messages/warnings...

A SAS marketing person made the comment once of “We have customers who build
engines for aircraft. I am happy they are not using freeware when I get on a
jet.” and we all piled-on the hate, and rightly so.

But after using R, what scares me more is the thought that professional stats
people ARE using it when I get on a jet :(

~~~
ACow_Adonis
Also, allow me a quick addendum on a point people probably aren't addressing
in the comments below. I assure you banks and the like using R aren't passing
round all their R/python/C code either.

R didn't bring open source to science. It brought free (as in price) software
to stats, and that, along with its script-like ability to apply formulas
quickly, its vast library, and its universal teaching in stats courses in
universities, is the reason for its popularity. I'd even go so far as to say
that python has had more of an influence outside of stats/bio/pharma.

But spend a bit of time in the SAS community to which it is commonly
contrasted and you'll see massive amounts of code sharing, examples and how-
tos. The interesting thing is to observe how these things play together. I
argue that sharing of source has less value if the run-time on which it
operates is not available to you.

Of course, SAS is so widespread in big business that you might point out that
it quite clearly is available to a lot of people, and that it quite clearly is
valuable to them, its just not available if you can't pay or aren't in a
connected uni/job. I know my SAS and I can do several things in it that whip R
and python's butt if the task is that which SAS is good for. It has its own
separate and relevant issues in terms of design abd implementation. I can rail
against all the tools i use :P

But the high-entry-cost of the software itself is the prime reason I'm trying
to turn my back on it (because I don't plan on being employed by a big company
or being locked into a software vendor forever, and subsequently it is not
available for me to use for my own projects, which are often more
valuable/complex than the ones I'm writing for employment...). I imagine there
are a large number of other programmers/hackers feeling the same way, and you
might even say that it's evidenced by the parallel (as in along-side SAS, not
multithreading) success of R. Perhaps there is a symbiotic relationship,
subsequently, between free as in price vs open source code. Who knows. I need
a "free as in beer"....

~~~
mathattack
_R didn't bring open source to science. It brought free (as in price)
software to stats_

This is a very powerful statement. Never underestimate the power of free. It
is very hard to compete with. Look at the browsers.

------
KingMob
As a former cognitive neuroscientist, I pray for the day Matlab is displaced.
Given the generally low level of programming ability in the sciences, I'm
personally rooting for Python to win, but I'll take what I can get.

Unfortunately, the dominant EEG and fMRI packages (Fieldtrip and SPM) were
written in Matlab, and my labs standardized on them. Plus, when I was in
school, R was unable to handle the multi-GB data sets that result from
neuroimaging.

~~~
sjtrny
Other than MATLAB being closed and very expensive, what is wrong with it?

~~~
bambam12897
Both issues are solved by Octave... which doesn't require the bifurcation of
the community of people writing new code.

~~~
rcxdude
Octave is hampered by the fact that it's still basically the same language as
MATLAB without any of the reasons you put up with using MATLAB (the
toolboxes).

~~~
bambam12897
While the language definitely has its problems (I'm not a fan of the syntax
at all), I think dropping it for something marginally better like R is a
little silly. So many man hours have been put into writing MATLAB/Octave code
- redoing it seems like mostly a waste.

I don't know if you have similar experiences, but I often find that I want to
use X feature in MATLAB in combination with Y feature in R and there isn't any
easy way to do it. The bifurcation of coding efforts is vastly more
frustrating than some bad/inconsistent syntax.

The toolboxes are great. I haven't used them much, but I feel like a lot of
the time you can get away without using them. If you really need them, then
it's not unreasonable to pay a license for the documentation and robustness -
which you won't get in open source most of the time.

But like.. that's just my opinion man =)

------
aaron-lebo
I'm in political science, and I'm pretty surprised how aware of open source
tech some of my professors are, R especially. But I've even heard from a few
of them a desire to pick up Python or C++ for other data work, and at least
one of them knows emacs.

Proprietary software like Stata still gets used as much as or more than R, but
hopefully R continues to pick up steam. RStudio in particular is a pretty
compelling environment.

~~~
goldfeld
Interesting, that's a domain I'm just getting into, can you point me towards
projects and research at the intersection of political science and data
science/programming?

~~~
aaron-lebo
I tried to do some searching for some specific projects that I've heard of but
I'm coming up blank.

Really any decent quantitative study that isn't just an absolute basic
regression is going to have some degree of data processing done to it. Not
exciting, but it is there and on a large level.

The other more interesting projects are doing stuff like scraping news
sources, constitutions, etc, using natural language processing to pick out
relevant parts and then matching those to some kind of database in order to
code the necessary data.

Here is an example of this:
[https://github.com/openeventdata/phoenix_pipeline](https://github.com/openeventdata/phoenix_pipeline)

Then you've got something like Nate Silver's analysis and predictions of
recent elections which is dealing with popular political issues + data.

------
emhartEco
As one of the people interviewed in the article I feel somewhat compelled to
explicate a bit further. I'd be the first to admit that R is good for some
things and bad for others. It's full of quirky parts that make a user coming
from any more standard type of scripting language (e.g. python) want to pull
their hair out. That said, in the world I come from (EEB, ecology and
evolutionary biology), it's by far the most popular language. At
rOpenSci, we develop tools in R because that is the language our audience
works in. I think the mistaken assumption of many commenters is that R users
are actual programmers. Most EEB scientists I know don't want to get bogged
down in learning multiple languages. They want to learn something that will
make doing their science easier. R provides that. For all the credit that
SciPy and NumPy deservedly get, they still are way behind when it comes to
certain statistical tasks. For instance there are whole books written on doing
mixed effects models in R, but you can't get those in python yet (I know
statsmodels is coming along but it's nowhere near where lme4 is). Yes, if
you're a python programmer you could just call that one R routine from python
and go on your merry way, but that's _you_, not the average ecology
graduate student. Also, Matplotlib is just not on par with the capabilities of
ggplot2 and other R graphing libraries (although there is a ggplot2 port to
python that is being developed).

The other important component that I think is missing from the discussion
about R's merits is that it's facilitating open science. We're talking about
fields that are moving away from SAS, Matlab, JMP, etc., and towards the
creation of totally reproducible documents and experiments with tools like
Sweave. Is it going to provide the fastest environment for running regression
trees on a dataset with 10 million rows? No. But is it a powerful scripting
language with well developed tools for manipulating data (plyr), visualization
(ggplot2, lattice), doing GIS (rgdal, sp), getting data from APIs (httr,
jsonlite, anything rOpenSci does :) ), writing reproducible documents (knitr)
and doing
complex statistics (lme4, nlme, gam)? Yes. It allows scientists to learn one
language to be able to accomplish 99% of the analytical tasks they want to be
able to. I think that's the point of the article. Yes FOSS has been part of
science for a long time, yes R is not the best language for many things, but
there's a culture at play where it's been adopted and extended by many
scientists to accomplish a lot of valuable science, and brought FOSS, openness
and reproducibility to a vast number of scientists that probably wouldn't
otherwise have adopted those practices.

------
cwal37
Depends on the field of course. I'm in environmental science/energy economics
so python is kind of a no-brainer if you want to go open source (and we do).

However, my significant other is working on a physics PhD and everything she
does is in C or C++ with CERN ROOT. I used to use Matlab, and she thought it
was adorably weak. I get a little more respect using Python now at least.

~~~
jedrek
Why do you say that it's a no-brainer?

~~~
cwal37
If you want to be completely open source. There seem to be a lot more
libraries and general capability with the specific license we need. I really
should have clarified better, I seem to have lost a word or two in there.

What I'm working on needs to interface with many different existing modeling
and optimization efforts at some point, and of the options out there Python
seems to be the most understood by the largest group of people
(statisticians/scientists/programmers). With python we can keep everything
100% open and available to the largest number of people.

------
anonu
I work at a big bank in quant research. I can easily say that open source
tools are favored here over their more expensive counterparts. Furthermore,
over the last few years I've definitely seen a shift away from R and toward
Python. The NumPy, SciPy, and Pandas libraries in Python are all excellent (and
way faster than equivalent options in R).

------
vezzy-fnord
Science has been benefiting from open source for quite a while now?

~~~
jychang
Yeah, I'm not sure what this is saying.

Python has been the lingua franca of choice for most sciencey things for a
while now.

~~~
ovis
I wish that were true. I'm sure it depends on the field, but in my experience
(more physics/natural science), MATLAB still leads for interactive scientific
programming. I think Python is at this point recognized as a legitimate
alternative, rather than the lingua franca.

------
Stubb
Mis-quoting Churchill: R is the worst numerics software, except for all those
others I've tried from time to time.

I've used R extensively for analyzing network simulation results databases
that run in the tens or hundreds of MB. One can find well-documented libraries
that work for interfacing with nearly everything. In my case, it's pulling
data from MySQL or SQLite databases, performing graph-theory analysis using
Boost Graph Library, and generating output with Graphviz and other plotting
tools. It's a solid toolchain, and R's inherent slowness is somewhat
manageable via the parallel flavors of apply.

The main problem for me has been the lack of a clean analog to namespaces or
utility classes. Environments sort of do the same thing but are ugly
syntactically.

I'm hopeful about Julia, but there are a couple showstoppers for me presently.
Maybe in a few years.

------
AdmiralAsshat
R is rising now? I thought R was in decline because it was being displaced by
Python!

[https://news.ycombinator.com/item?id=6808127](https://news.ycombinator.com/item?id=6808127)

------
gms7777
> But the ballooning cost of the software and dwindling research budgets have
> prompted scientists to turn to R instead.

I know some people using R, though at least in my field (Computer
Science/Bioinformatics), Python seems to be more popular. Both of which happen
to be free. That said, I don't know any research groups that chose R or Python
specifically because they were free.

~~~
LanceH
Being able to have everyone install it on every computer, without any thought
of licensing definitely gets it in the door for some people.

The interactive nature of it is handy compared to SAS even when that is also
available. I've known people to use R first, make a plan, then go back to
programming in SAS on the massive data sets that R might not handle as well.

------
transfire
Julia is another such language.

~~~
otoburb
Julia will ride R's coattails. The Julia story arrived at a good time and
seems to be slowly gaining traction in various niches. This is based purely on
reading the mailing list and scanning relevant HN headlines.

------
ISL
Don't forget Octave ( www.octave.org ), which belongs on any list of open-
source analysis suites.

~~~
bambam12897
My general opinion has settled on:

Doing linear algebra -> Matlab/octave

Twiddling data tables around and making plots -> R

~~~
runarberg
Doing linear algebra -> Julia

twiddling data tables making plots -> python/numpy/matplotlib

abstract math/calculus -> python/sympy

statistics/quick data analysis -> R

------
gkya
Off-topic, but have to state: With this website open, the cpu usage of xorg
flew up to 65%. Congrats to fastcolabs.com for making such a slick, html5,
javascript asynchronous streaming website.

~~~
Yetanfou
xorg takes 2% here, not by the grace of a superfast computer - far from it -
but by the simple expedience of having both Noscript and RequestPolicy
installed. I can still read the article so I don't know what I'm missing by
not allowing all that JS and external content to run/load. Not much, I
assume...

~~~
gkya
Thanks! Your comment has encouraged me to test out NoScript. Too bad I hadn't
checked this out before. I had this superstition that with NoScript I'd have
to spare a lot of time configuring to just be able to reach a comfortable
level, but I've been able to get going in a minute. Massively useful.

~~~
jjgreen
If a page is hosed under NoScript then just temporarily allow all scripts. You
can spend a few hours building up a whitelist of hosts for your most visited
sites if you can be bothered, but that's not needed for it to be useful.

~~~
gkya
Actually, while using NoScript, I've found that there is only a very small set
of websites I visit often enough to bother adding exceptions for: HN, github,
hurriyet (a news website), tumblr, duckduckgo and maybe a couple others I can't
recall.

------
jareds
What advantages does Python have over R for basic statistics work? I’m trying
to play with data for fantasy sports and was planning on getting the data into
a MySQL database then using R to look for patterns. Is R the right choice for
this, or is it a matter of both Python and R being able to do the same thing in
different ways so there is no wrong choice? Given the fact that it will be a
fairly small dataset I’m not overly worried about performance.

~~~
dlib
No wrong choice I think, mostly personal preference. If you know Python I'd
stick with that, IPython notebooks with pandas etc. is a solid choice. R might
be a little harder to start with, sapply/lapply can be confusing but there's
plenty of info and libraries on the web to make your life easier. For
plotting, ggplot still wins over matplotlib in my opinion but Python has other
strengths.
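
For a small dataset like the one described, a sketch along these lines might be all that's needed to get started. It uses Python's built-in `sqlite3` as a stand-in for MySQL (an assumption, so it runs without a server) and the standard `statistics` module rather than pandas. The `scores` table, its columns, the `summarize` helper and the sample rows are all made up for illustration.

```python
import sqlite3
from statistics import mean, stdev

# In-memory database standing in for the MySQL instance (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (player TEXT, week INTEGER, points REAL)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?, ?)",
    [
        ("alice", 1, 12.0), ("alice", 2, 18.0), ("alice", 3, 15.0),
        ("bob", 1, 20.0), ("bob", 2, 5.0), ("bob", 3, 11.0),
    ],
)


def summarize(connection):
    """Return {player: (mean points, sample standard deviation)}."""
    by_player = {}
    query = "SELECT player, points FROM scores ORDER BY player, week"
    for player, points in connection.execute(query):
        by_player.setdefault(player, []).append(points)
    return {p: (mean(v), stdev(v)) for p, v in by_player.items()}


summary = summarize(conn)
for player, (avg, spread) in summary.items():
    print(f"{player}: mean={avg:.1f} stdev={spread:.1f}")
```

A player with a similar mean but a much larger spread is the kind of pattern this surfaces. Either R or pandas would do the same grouping in a line or two once the data is loaded.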

~~~
keypusher
[https://pypi.python.org/pypi/ggplot/0.4.7](https://pypi.python.org/pypi/ggplot/0.4.7)

------
jdoggy64
The good thing about R is that it has forced me to learn statistics. Python
has never done that for me. Use R where it works, don't use R where it
doesn't. Call R from python or python from R, see if I care, as one would go
to C or Fortran anyway. And why bother with python when you have Julia?

So what is all the guff about? It does remind me of the ongoing religious
war between frequentists and Bayesians ...

------
jedanbik
Python is doing the same thing. It's great!

------
lmartel
Honest question: why is Python so dramatically more popular than Ruby for
scientific computing?

~~~
dagw
Libraries, libraries, libraries and history. The first serious numeric library
for python was released in 1995 or 1996 and things have just grown and grown
since then. Ruby is far behind in its offerings in this space.

------
gberger
, z

------
mylons
Uh -- Python? R is the biggest prima donna (yes, I just used the Italian, deal
with it xenophobes ( yes, I used x to start a word, FUCK OFF) ) in data
science. It is also a very obfuscated and poor performing language overall.
Statisticians shouldn't be allowed to drive languages into popularity.

