
Data Analysts Captivated by Power of R - kalvin
http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html
======
scott_s
I am baffled that a _programming language_ is getting coverage in the NYT. I'd
like to figure out what, exactly, it is about this one that merits mainstream
coverage, but I'm afraid I've been at the office too long and I think my
perspective on the matter is permanently skewed.

As an aside, I used R briefly for an econometrics/philosophy course I took. I
recognized immediately it was a powerful, functional language. What I wonder,
though, is if the scientific programming libraries in Python might eventually
be a better environment. Surely there must be some R users here who can
comment.

~~~
brent
There are a number of packages for relatively obscure statistical techniques
(say, nonmetric multidimensional scaling) that are available for R, but not
for python (that I have seen).

I also believe R is easier for statisticians who cannot and do not want to
know how to program. There are many advanced techniques that are one-liners.

~~~
gaius
You've hit the nail on the head there. A programmer looks at R (or MATLAB) and
says "this sucks as a programming language". An engineer or a scientist or a
statistician says "so what, I'm not a programmer, nor do I want to be one,
I've got actual work to do".

------
wesm
Having programmed in R on the job for some heavy statistics, I will say this:
it is good for quick analyses but burdened by legacy functionality from the S+ days.
I switched to Python/NumPy and rewrote all the R code I had, could not be
happier with the results. Of course, you have to create your own data
structures if you want something like R's data frame, but at least you have a
rich language to do that with.
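
For readers wondering what "create your own data structures" might involve, here is a minimal sketch of a data-frame-like table in plain Python. It is not wesm's code: the `Frame` class, its methods, and the sample data are all hypothetical, and it covers only column storage and row filtering.

```python
from typing import Any, Callable, Dict, List


class Frame:
    """Hypothetical column-oriented table: named columns of equal length."""

    def __init__(self, columns: Dict[str, List[Any]]) -> None:
        lengths = {len(values) for values in columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        # Copy the lists so later changes by the caller do not leak in.
        self.columns = {name: list(values) for name, values in columns.items()}

    def __len__(self) -> int:
        # Number of rows (0 for a frame with no columns).
        return len(next(iter(self.columns.values()), []))

    def row(self, i: int) -> Dict[str, Any]:
        # One row as a column-name -> value mapping.
        return {name: values[i] for name, values in self.columns.items()}

    def filter(self, keep: Callable[[Dict[str, Any]], bool]) -> "Frame":
        # Return a new Frame with only the rows the predicate accepts.
        kept = [i for i in range(len(self)) if keep(self.row(i))]
        return Frame({name: [values[i] for i in kept]
                      for name, values in self.columns.items()})


# Made-up sample data, for illustration only.
prices = Frame({"ticker": ["AAA", "BBB", "CCC"], "close": [10.5, 3.2, 7.9]})
expensive = prices.filter(lambda r: r["close"] > 5.0)
print(expensive.columns)
```

Storing columns rather than rows mirrors how R's data frame is a list of equal-length vectors, which keeps whole-column operations cheap.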

However, if you need to do anything systematic, do NOT use R: bugs are elusive
and extremely tedious to track down.

~~~
yters
I have used R for a number of minor projects. I really like its functional
programming design, but its syntax can be obscure. It's preferable to MATLAB,
IMHO. What do you consider to be the legacy aspects that weigh it down? What
do you like better about NumPy? While I'm asking questions, have you tried
Sage? If so, what do you think of it as a meta package of mathematical
software?

Also, here are interesting links I've found comparing R and similar
statistical programs. Sorry, don't know much about Numpy's capabilities.

R equivalent in speed to Matlab:

<http://www.sciviews.org/benchmark/index.html>

Someone's research for a good data analysis language:

<http://www.cs.ubc.ca/~murphyk/Software/which_language.html>

~~~
wesm
One of R's big benefits is the huge amount of statistical functionality, for
example, numerous different quantile algorithms.
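
To illustrate why the choice of quantile algorithm matters, here is a small sketch using Python's standard library, which offers two methods (R's `quantile` offers nine via its `type` argument). The sample data is made up for illustration.

```python
import statistics

# Ten made-up observations.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# "exclusive" treats the data as a sample from a larger population;
# "inclusive" treats the data as the whole population.
exclusive = statistics.quantiles(data, n=4, method="exclusive")
inclusive = statistics.quantiles(data, n=4, method="inclusive")

print("exclusive quartiles:", exclusive)
print("inclusive quartiles:", inclusive)
```

The two methods agree on the median here but place the first and third quartiles differently, so results that depend on quartiles can change with the definition chosen.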

If you are working with a lot of heterogeneous data in R it becomes a real
headache. Merging data frames seems like it should just work, but if one of
your sets of keys (strings) is stored as 'factors' (what I am calling 'legacy
S+ functionality'; I'm sure they're useful for many algorithms), you'll end up
with garbage. There's a hack you can put in your code
('options(stringsAsFactors=FALSE)') which alleviates some of this, but in
general I found aligning data to be a huge pain. If you're running regressions
this is pretty important.

Haven't tried Sage but have heard good things. NumPy is a good alternative
because it's extremely well implemented and has consistent behavior across the
board. Extensibility (with Fortran, Cython/Pyrex, C/C++) is clean and easy.
Never thought I'd write Fortran 77 code being born quite a few years after '77
but it's an easy way to speed up simple procedural algorithms 50x or more.

~~~
earl
factors are nothing but enums and are used to shrink data and speed
processing. Plus that matches what you typically want to happen in
regressions: strings turn into (n-1) indicator variables. Otherwise, what is
the meaning of using a string as an explanatory variable in a regression?
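
The "enum" view of a factor and the (n-1) indicator variables can be sketched in plain Python. This is an illustration of the idea, not R's implementation: `encode_factor` and `indicator_columns` are hypothetical helpers, the data is made up, and the codes are 0-based here whereas R's are 1-based.

```python
from typing import List, Tuple


def encode_factor(values: List[str]) -> Tuple[List[str], List[int]]:
    """Return (levels, codes): sorted unique strings plus one integer code per value."""
    levels = sorted(set(values))
    index = {level: code for code, level in enumerate(levels)}
    return levels, [index[v] for v in values]


def indicator_columns(levels: List[str], codes: List[int]) -> List[List[int]]:
    """Build n-1 indicator (dummy) columns; the first level is the baseline."""
    return [[1 if code == k else 0 for code in codes]
            for k in range(1, len(levels))]


region = ["east", "west", "north", "east", "west"]
levels, codes = encode_factor(region)
dummies = indicator_columns(levels, codes)

print(levels)   # the distinct strings, stored once
print(codes)    # one small integer per observation
print(dummies)  # one 0/1 column per non-baseline level
```

Dropping the baseline level avoids a redundant column: a row with zeros in every indicator is understood to belong to the baseline.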

If you want to merge data frames that were created w/ different factors,
perhaps the easiest thing to do is turn your factors into strings?

If d is your data frame, then:

d$factorVar <- as.character( d$factorVar )

merge your two data frames, then

merged$factorVar <- as.factor( merged$factorVar )

should set you right...

earl
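
The reasoning behind the convert-merge-convert recipe can be sketched in plain Python. This is not a description of what R's `merge` does internally; it only shows, with made-up data and hypothetical helpers (`decode`, `inner_join`), why integer codes from two separately built factors are not comparable while the underlying strings are.

```python
from typing import Dict, Hashable, List, Tuple


def decode(levels: List[str], codes: List[int]) -> List[str]:
    """Turn integer codes back into their strings (like as.character)."""
    return [levels[c] for c in codes]


def inner_join(left: Dict[Hashable, float],
               right: Dict[Hashable, float]) -> Dict[Hashable, Tuple[float, float]]:
    """Keep only keys present on both sides, pairing their values."""
    return {k: (left[k], right[k]) for k in left if k in right}


# Two key columns, each encoded independently (hypothetical data).
left_levels, left_codes = ["AAA", "BBB"], [0, 1]
right_levels, right_codes = ["BBB", "CCC"], [0, 1]
left_values = [1.0, 2.0]
right_values = [20.0, 30.0]

# Joining on raw codes pairs "AAA" with "BBB": code 0 means different things.
by_code = inner_join(dict(zip(left_codes, left_values)),
                     dict(zip(right_codes, right_values)))

# Joining on decoded strings pairs only the key both sides really share.
by_key = inner_join(dict(zip(decode(left_levels, left_codes), left_values)),
                    dict(zip(decode(right_levels, right_codes), right_values)))

# Re-encode the merged keys afterwards (like as.factor).
merged_levels = sorted(by_key)
merged_codes = [merged_levels.index(k) for k in by_key]

print(by_code)
print(by_key)
```

Re-encoding only after the merge means the final codes are built from one shared set of levels.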

------
zandorg
"We have customers who build engines for aircraft. I am happy they are not
using freeware when I get on a jet." - Ugh, SAS isn't about science, it's
about administration.

~~~
nailer
My day job manages $USD 68B in funds.

Our largest and most profitable fund (around $US 25B), which includes
some of our brightest people and uses the Linux server infrastructure I
design, also uses R and Python.

Trades are made based on the models created in these languages by our research
teams. Our traders simply execute what our researchers models tell them to,
when the model tells them to.

So both R and Python are fundamental parts of our business without which our
best products could not function.

We don't use freeware either, as neither R, Python, nor Linux is freeware.
They are licensed software, with OSD-compatible licenses.

~~~
kirubakaran
_> Our traders simply execute what our researchers models tell them to_

Can the traders be eliminated? Why pay them if all they do is carry out
orders? (I am asking sincerely)

~~~
nailer
It's a fair question. Everyone is aware the traders, which are normally a big
deal, are slaves to the research gents and their algorithms.

My guess is largely, they can be. The actual trades could be automated (this
would become increasingly necessary with rapid-fire [millisecond] trading,
which we don't do now but could in future). The meatware oversight could be
consolidated to a smaller group of individuals.

------
gruseom
Meanwhile the creator of R wants to return to a Lisp-based statistics
environment:

[http://books.google.com/books?id=8Cf16JkKz30C&pg=PA21...](http://books.google.com/books?id=8Cf16JkKz30C&pg=PA21&lpg=PA21)

~~~
jderick
Another interesting paper hidden behind the academic firewall.

~~~
Anon84
At least in the physics world people just post everything to arXiv and their
personal websites. This way you only need to pay attention to a couple of
places to keep up to date.

~~~
scott_s
As I said in another thread, most computer science papers are on authors'
webpages. Then this one comes along - but it looks like the authors are stat
people. I don't know what their culture is - and what kind of copyright
agreements they sign.

I can access the paper because my school subscribes. If anyone wants to read
it, figure out how to email me and I'll send a copy.

While searching, I did find a brief email from Ihaka talking about the jump to
Lisp: <https://stat.ethz.ch/pipermail/r-devel/2008-May/049501.html>

~~~
gaius
Thanks :-)

------
bbgm
In the life science space, R dominates research informatics. A large chunk of
molecular profiling methods and techniques use R, or quite often the
Bioconductor package, <http://www.bioconductor.org/> (from Gentleman's group).
Most commercial bioinformatics apps also implement a number of methods using R
and provide ability to implement R-based classifiers, etc.

In the clinical space, it's all SAS. Pretty much the de facto standard.

~~~
stcredzero
My girlfriend was using Stata. She's an epidemiologist.

------
manny
I can't believe nobody here has mentioned PDL, the Perl Data Language:
<http://pdl.perl.org>

Admittedly, I think R and PDL do different things... (I have never played with
R).

~~~
draegtun
Looks like PDL == NumPy (<http://news.ycombinator.com/item?id=363159>)

There are CPAN modules to directly use R from Perl (for eg... R::* &
Statistics::R).

I have a stats friend who's been singing the praises of R & PDL for donkey
years.

/I3az/

------
asnyder
Is it just me, or was this article not very well written? I felt it was all
over the place. It brings up S, then mentions that S isn't open source. It
mentions open source and brings up things like apache, the web, and Microsoft,
I don't see how it relates much to R. Though, I'm probably spoiled due to the
usual quality of news I get here.

------
tokenadult
"The co-creators of R express satisfaction that such companies profit from the
fruits of their labor and that of hundreds of volunteers."

That's an interesting reaction from the first designers of the program.

------
jessep
Interesting. My girlfriend is a statistician for the WHO and they definitely
still use SAS, at least in her area (calculating global burden of disease).
I'm going to ask her if anyone there uses R.

~~~
jessep
She says apparently there's a big debate going on at the WHO about whether
everything should be done in R. Currently they use R, Stata, and SAS.

------
rdixit
My 2 cents: NumPy+SciPy+Matplotlib and other packages, which you can download
together in a convenient package at Enthought. That Enthought distribution
also comes with IPython, which is REALLY nice. I checked out R and Sage, and am
still sometimes forced into MATLAB, but you just can't beat a programming
language (Python) which can be used OUTSIDE of whatever problem space you
happen to be working on.

------
waldrews
At least it's a functional language, and you can do things like manipulating
code symbolically, showing its lisp heritage.

The tooling, library integration, and debuggers aren't as good as, say,
Python, though.

------
mojonixon
"But R has also quickly found a following because statisticians, engineers and
scientists without computer programming skills find it easy to use."

whaaaa?

------
Prrometheus
Could someone explain how this differs from Matlab, which is the most popular
language for statistics and machine learning at my university?

~~~
aposteriori
As earl said, it's free! But that's not the most important aspect (although
nice for a student).

It's an actual programming language. Like Python, you can use both non-OOP and
OOP styles. You can define your own packages and namespaces (unlike MATLAB
where there's just one big namespace). There are hundreds of contributed
packages, you don't need to buy separate toolboxes. Not to forget, it's GPL,
you can look at the source and learn a few things from people who know what
they're doing (to name a few: Brian Ripley, Terry Therneau, Douglas Bates,
Bill Venables).

On the other hand, it can be a bit of a steep learning curve at the beginning,
but I feel it's definitely worth it. It's not perfect, I stub my toe on
obscure language features from time to time, but to paraphrase Winston
Churchill, R is the worst statistical language except for all those others
that have been tried from time to time...

I think that Andrew Robinson's introduction is pretty decent
(<http://cran.r-project.org/doc/contrib/Robinson-icebreaker.pdf>), but there
are many others at <http://cran.r-project.org/other-docs.html>

------
earl
The power of R (speaking as a very heavy user who has deployed it in multiple
production environments and been using it for 5 years) is that it makes it
very fast, easy, and natural to do statistics. It also has the nicest data
structure I've ever seen for manipulating table data, called a data frame --
I'll elaborate, if anybody cares. In addition, it encourages people to create
packages to extend the functionality. There are extant packages to do almost
every analysis you can think of -- time series, kmeans, other clustering
techniques, Box-Cox style analyses, regular maximum likelihood style GLM,
hierarchical regression, HB, etc. Further, the amount of knowledge and the
open source nature of the language, base, and packages encourage additional
development and widespread adoption. See <http://cran.r-project.org/> and the
task views at <http://cran.r-project.org/web/views/>. Explore it -- it's
well worth your time.

The downsides are, well, it's slow for large data sets and debugging can be
difficult. But as a desktop / rapid development platform for statistics it is
without peer, IMO.

ps -- unlike Matlab, which often costs thousands of dollars, and the
Statistics Toolbox, more thousands, R is free. This is pretty important on its
own -- instead of $5k per server and workstation and home pc, install it on
any linux, Mac, or windows box you have and get to work for $0.00.

