I'm actually a little concerned by the "phenomenon" of R. I say this as someone whose workplace uses both SAS and R. I'm also familiar with python, and wish we could use it more at work, but it doesn't have the "cult-following/network effect" amongst statisticians that R has.
The "problem" I speak of is that R is very popular with people running a quick little stats script from a package they've downloaded, using a technique they don't understand, with output they haven't verified, on a tiny problem that won't scale. And 95%+ of users are just doing it by rote, and now they're trying to apply it to problems outside of its domain.
But ACow_Adonis, you say, doesn't that just describe everyone with every programming language ever?
Yes. But you see, R seems almost designed (or not designed) as a language of unseen problems. It is several multiples slower than regular python (if you thought that possible), and several HUNDRED times worse than other compiled languages. It has no un-boxed primitive numbers. Let me just say that again. A language for numbers that doesn't have primitive unboxed numbers. It is the poster boy of Wirth's law.
But not only that, I said it's basically been designed for "dodgy results". Watch how its attempt at lexical scope combines with lazy evaluation for ridiculous fun. Bizarre, automatic and seemingly random conversions behind the scenes. 1-indexing of arrays... but 0-indexing doesn't throw an error. Automatic recycling of values in smaller arrays when combined with larger arrays. Built-in functions with one-letter names, in a language with KIND OF one namespace, aimed at people doing MATH who have a long history of using those individual letters for other things!
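To make these concrete, here's a minimal base-R session (no packages) showing a few of the behaviors above; the comments note what each line silently does:

```r
x <- c(10, 20, 30)
x[0]                  # 0-indexing is legal: returns numeric(0), no error

c(1, 2, 3, 4) + c(10, 20)   # silent recycling: gives 11 22 13 24

"3" == 3              # TRUE -- automatic conversion behind the scenes

t <- 5                # shadows the built-in transpose function t()...
t(matrix(1:4, 2))     # ...yet this still works, because R resolves
                      # function calls and plain variables separately
```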
So combine these "features" of the language with people implementing things by rote, not checking their results, returning results without error messages/warnings...
A SAS marketing person once commented, “We have customers who build engines for aircraft. I am happy they are not using freeware when I get on a jet.” and we all piled on the hate, and rightly so.
But after using R, what scares me more is the thought that professional stats people ARE using it when I get on a jet :(
...but if you know it, it lets you tap into this huge collection of statistical libraries with relatively little effort. So you're right: most people are using it as a glorified graphing calculator. But given that the alternative is (usually) implementing complicated, error-prone algorithms in another language, I'm still glad it exists.
R didn't bring open source to science. It brought free (as in price) software to stats, and that, along with its script-like ability to apply formulas quickly, its vast library, and its universal teaching in stats courses in universities, is the reason for its popularity. I'd even go so far as to say that python has had more of an influence outside of stats/bio/pharma.
But spend a bit of time in the SAS community to which it is commonly contrasted and you'll see massive amounts of code sharing, examples and how-tos. The interesting thing is to observe how these things play together. I argue that sharing of source has less value if the run-time on which it operates is not available to you.
Of course, SAS is so widespread in big business that you might point out that it quite clearly is available to a lot of people, and that it quite clearly is valuable to them; it's just not available if you can't pay or aren't in a connected uni/job. I know my SAS, and I can do several things in it that whip R's and python's butts if the task is one SAS is good for. It has its own separate and relevant issues in terms of design and implementation. I can rail against all the tools I use :P
But the high entry cost of the software itself is the prime reason I'm trying to turn my back on it (because I don't plan on being employed by a big company or being locked into a software vendor forever, and consequently it is not available for me to use on my own projects, which are often more valuable/complex than the ones I'm writing for employment...). I imagine there are a large number of other programmers/hackers feeling the same way, and you might even say that's evidenced by the parallel (as in alongside SAS, not multithreading) success of R. Perhaps there is, then, a symbiotic relationship between free-as-in-price and open-source code. Who knows. I need a "free as in beer"....
This is a very powerful statement. Never underestimate the power of free. It is very hard to compete with. Look at the browsers.
qplot(hp, mpg, data = subset(mtcars, mpg < 30))
mtbadcars <- subset(mtcars, mtcars$mpg < 30)
Oh wait, you were trying to convince me that was a bad thing?
Hadley doesn't write everything, you know.
Not sure what you're getting at.
> A lot of these "quirks" of R are nice for end users when they're implemented well, but are unintuitive to program
It's not a quirk, it's a core feature. It's used consistently and to great effect. Notice how the "subset" function takes advantage of the same flexibility. I'm about 95% sure neither the subset function nor the other standard library functions that use this "trick" were written by Hadley. The "trick" was expected to make expressions significantly easier to read and easier to write from the very beginning.
It might surprise a few people who come from another language and think they've seen it all, but once they figure out what's going on (which should happen on the first tutorial or 2nd or 3rd copypaste) it'll be a pleasant surprise. Unless they kneejerk and hate on it because it's unusual among languages.
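The mechanism is easy to demonstrate. This is only a sketch, not the real base-R source of subset(), but it shows how lazy evaluation lets a function evaluate its condition argument inside the data frame:

```r
my_subset <- function(df, cond) {
  # cond arrives unevaluated; evaluate it with the data frame's columns
  # in scope, falling back to the caller's environment for everything else
  keep <- eval(substitute(cond), df, parent.frame())
  df[keep, , drop = FALSE]
}

my_subset(mtcars, mpg < 30)   # mpg resolves to mtcars$mpg, not a global
```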
The reason I use R over python most of the time is because, despite some amazing improvements in this area by the python community, there's no better tool for fluidly interacting with data that offers the same power.
That said, R is not for writing software systems. People used to refer to many interpreted languages as "scripting languages", and while this is clearly not the case for Python and Ruby, this is exactly what R is. There's a good reason RStudio says "New File > R Script". The limit of using R is when you have a bunch of scripts that interact with each other to create a bunch of visualizations/reports. If your system gets more complicated than that, write it in something else.
Of the many language/environment combos I've used, I don't think I've found one better for rapid prototyping than R, and following from that R has no place near anything that would be called "production". I also happen to think, if used properly, this is a good thing since it means your "prototype" never accidentally creeps into suddenly being your production system.
What you're leaving out of this makes me think that you don't understand the field at all.
Everything is a vector. There's no need for an 'unboxed' number when you have vectors. If you're doing computation thinking of operating on individual data points rather than vectors, matrices, and multi-dimensional arrays of datapoints, you're doing it wrong. R is doing it right.
R, like APL, is much better in its intended domain for being a vector-based language rather than a scalar-based language.
If the application domain is a scalar-based domain, then R is probably the wrong tool for the job. If somebody doesn't understand the cases where a vector-based language is better, they've probably never encountered the right application domain for R.
Unfortunately "vector" operations typically cause a lot of temporary intermediate vectors that you never see, which is why a "scalar" language like Julia or C can provide such performance improvements when they handle the whole algorithm without the unnecessary intermediates having to be allocated and filled.
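A quick NumPy illustration of the hidden intermediates (the same point applies to R's vector arithmetic): each operator in a vectorized expression allocates a full temporary array, whereas explicit output buffers approximate the single fused loop a compiler for a "scalar" language would generate.

```python
import numpy as np

a = np.arange(1_000.0)
b = np.full_like(a, 2.0)

# Vectorized expression: each operator allocates a hidden temporary
# (one for a * b, another for the final sum).
vec = a * b + a

# Same computation with explicit output buffers and no throwaway
# intermediates -- roughly the single fused loop C or Julia would emit.
out = np.empty_like(a)
np.multiply(a, b, out=out)
np.add(out, a, out=out)

assert np.array_equal(vec, out)
```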
I think the vector operations do provide a certain brevity, though. Whether that is an advantage is very subjective. For myself, I prefer having anonymous functions and operations like "map" for transforming collections (including matrices and vectors) element-wise, instead of having every operation that makes sense on numbers automatically also operate element-wise on vectors and matrices. (There are operations on vectors and matrices as a whole, sometimes with the same name as a scalar function, which can lead to confusion; exp is one example and multiplication another.) But I can understand opinions differ in that regard, and a lot depends on how "general purpose" the surrounding programming context needs to be.
Julia seems headed in the right direction and I am very excited about it. I really wish they had a mechanism to desugar vectorization syntax into dumb loops which could then be JITed. Vector syntax can be really expressive, aligned with the problem domain, and succinct; I would hate to let go of that in the interest of speed. It's strange that many correlate verbosity with clarity/readability.
@ihnorton sadly I can only offer a single upvote
(not 100% general yet AFAIK, but already very useful. The code is a great read too)
It's easy to imagine situations where languages that directly support vector types could have performance benefits, especially on HW that supports vector instruction sets. On the other hand, languages that don't manipulate numbers in the native format of the CPU(s) will perform poorly on large datasets.
R is basically Lisp with syntactic sugar for BLAS/ATLAS and incredibly easy bindings for FORTRAN, C, and C++. If this description doesn't sound like an amazing combination for an application domain, then it's not in R's sweet spot.
Of course you want to be working with vectors for these kinds of problems. But the ability to work with vectors of primitives rather than vectors of boxed types is about enormous gains in efficiency and memory usage. This is one of the reasons that R is so slow compared to other tools.
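The memory cost of boxing is easy to measure. Here's a CPython/NumPy comparison: a list of individually boxed floats versus a contiguous array of raw 8-byte doubles (the ratio varies by interpreter, but the boxed version is always several times larger):

```python
import sys
import numpy as np

n = 10_000

# A Python list holds pointers to individually boxed float objects...
boxed = [float(i) for i in range(n)]
boxed_bytes = sys.getsizeof(boxed) + sum(sys.getsizeof(v) for v in boxed)

# ...while a NumPy array stores raw unboxed 8-byte doubles contiguously.
unboxed = np.arange(n, dtype=np.float64)
unboxed_bytes = unboxed.nbytes   # exactly n * 8

print(boxed_bytes / unboxed_bytes)   # typically around 4x on CPython
```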
R is easy to implement small things, but gluing them all together is awful. Additionally, the debugging features are not very good, especially if you created a package. Everything using Rcpp basically requires you to make a package and believe it or not the C/C++ is easier to debug than the R. I do believe that going the other way (Rinside) or using R in C++ is a much better solution in terms of efficiency, and getting the actual results you expect.
I am, however, impressed with the amount and, in general, quality of R packages.
As a side project at a client of mine, they wanted me to expose some R reports via a web interface. The reports themselves were incredibly slow - and did the most horrendous SQL queries. The reports could have been achieved in several other languages, and in much more performant ways, but it absolutely had to be R because that's all the analyst in question knew.
From what I've seen of Julia it could be a massive contender for this kind of usage.
Instead of being positive and considering the community, you just bring a negative and unrelated message (that silly "airplane anecdote"). So yes, sad.
(nascent bio community with some good people onboard already; there are a handful of other bio-related things in the Julia pkg repository that have not migrated under the BioJulia umbrella yet)
That said, it sure is a bitch of a language to try to develop for (since the consensus seems to be: write everything in C) and CRAN is a ghetto.
If business users want a free, plug-and-play statistics package they can throw into their analytics stack, R is not the right tool for that job.
Unfortunately, the dominant EEG and fMRI packages (Fieldtrip and SPM) were written in Matlab, and my labs standardized on them. Plus, when I was in school, R was unable to handle the multi-GB data sets that result from neuroimaging.
You look around and find the strcat function. Nice, this thing should just concatenate whatever stringy thing I throw into it. Well, sorta, kinda: http://www.mathworks.es/es/help/matlab/ref/strcat.html
> combinedStr = strcat(s1,s2,...,sN) horizontally concatenates strings in arrays. Inputs can be combinations of single strings, strings in scalar cells, character arrays with the same number of rows, and same-sized cell arrays of strings.
So you can throw a bunch of different string-like stuff and everything will be concatenated. A bit strange but okay. However, there's more:
> If any input is a cell array, combinedStr is a cell array of strings. Otherwise, combinedStr is a character array.
Ouch. So you can build your program, test it only with non-cell-array arguments, and later on someone throws in an extra thing to concatenate (or uses a cell array to define the output separator)... and that changes the output type of the function!
But that's not the only side effect! It turns out that
> For character array inputs, strcat removes trailing ASCII white-space characters: space, tab, vertical tab, newline, carriage return, and form-feed. For cell array inputs, strcat does not remove trailing white space.
Oh yeah, you also get different "concatenation" rules when these types change. And that's even before discussing why on earth would a "strcat" function remove trailing spaces from within the stuff you tell it to concatenate...
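Both quirks fit in two lines (expected values are per the documentation quoted above, not something I'd want to rely on):

```matlab
strcat('foo ', 'bar')     % character inputs: trailing space stripped,
                          % result is the char array 'foobar'
strcat({'foo '}, 'bar')   % one cell input: the space survives, and the
                          % result silently becomes the cell {'foo bar'}
```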
TL;DR: What is wrong with matlab is that it is designed for writing one-liners that probably do what you want. This is achieved by an endless stream of tweaks to the basic language's functions that automagically try to do "what you probably want". As a result, it is an extremely compact, easy-to-write language when the tweaks work, but an utterly terrible experience when the magic doesn't, one that makes you feel like you're walking through a minefield.
I don't know if you have similar experiences, but I often find that I want to use X feature in MATLAB in combination with Y feature in R and there isn't any easy way to do it. The bifurcation of coding efforts is vastly more frustrating than some bad/inconsistent syntax.
The toolboxes are great. I haven't used them much, but I feel like a lot of the time you can get away without using them. If you really need them, then it's not unreasonable to pay a license for the documentation and robustness - which you won't get in open source most of the time.
But like.. that's just my opinion man =)
Proprietary software like Stata still gets used as much as or more than R, but hopefully R continues to pick up steam. RStudio in particular is a pretty compelling environment.
Really, any decent quantitative study that isn't just an absolutely basic regression is going to have some degree of data processing done to it. Not exciting, but it is there, and on a large scale.
The other more interesting projects are doing stuff like scraping news sources, constitutions, etc, using natural language processing to pick out relevant parts and then matching those to some kind of database in order to code the necessary data.
Here is an example of this:
Then you've got something like Nate Silver's analysis and predictions of recent elections which is dealing with popular political issues + data.
The other important component that I think is missing from the discussion of R's merits is that it's facilitating open science. We're talking about fields that are moving from SAS/Matlab/JMP, etc., to the creation of totally reproducible documents and experiments with tools like Sweave. Is it going to provide the fastest environment for running regression trees on a dataset with 10 million rows? No. But is it a powerful scripting language with well-developed tools for manipulating data (plyr), visualization (ggplot2, lattice), doing GIS (rgdal, sp), getting data from APIs (httr, jsonlite, anything rOpenSci does :) ), writing reproducible documents (knitr) and doing complex statistics (lme4, nlme, gam)? Yes. It allows scientists to learn one language and accomplish 99% of the analytical tasks they want to. I think that's the point of the article. Yes, FOSS has been part of science for a long time, and yes, R is not the best language for many things, but there's a culture at play where it's been adopted and extended by many scientists to accomplish a lot of valuable science, and it has brought FOSS, openness and reproducibility to a vast number of scientists who probably wouldn't otherwise have adopted those practices.
However, my significant other is working on a physics PhD and everything she does is in C or C++ with CERN ROOT. I used to use Matlab, and she thought it was adorably weak. I get a little more respect using Python now at least.
There are of course the C/C++/FORTRAN gurus who write LAPACK/BLAS/OpenCV etc., but they're kinda in their own world. MATLAB is the de-facto wrapper for these libraries, and all prototyping is done in it.
What I'm working on needs to interface with many different existing modeling and optimization efforts at some point, and of the options out there Python seems to be the most understood by the largest group of people (statisticians/scientists/programmers). With python we can keep everything 100% open and available to the largest number of people.
I have to think a large part of the open source ethos comes from the academic setting at MIT and other institutions.
Python has been the lingua franca of choice for most sciencey things for a while now.
I've used R extensively for analyzing network simulation results databases that run in the tens or hundreds of MB. One can find well-documented libraries that work for interfacing with nearly everything. In my case, it's pulling data from MySQL or SQLite databases, performing graph-theory analysis using Boost Graph Library, and generating output with Graphviz and other plotting tools. It's a solid toolchain, and R's inherent slowness is somewhat manageable via the parallel flavors of apply.
The main problem for me has been the lack of a clean analog to namespaces or utility classes. Environments sort of do the same thing but are ugly syntactically.
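For reference, the workaround looks something like this (the names are made up, but the pattern is standard R):

```r
# an environment doubling as a namespace / utility "class"
stats_utils <- new.env()
stats_utils$zscore <- function(x) (x - mean(x)) / sd(x)

stats_utils$zscore(c(1, 2, 3))         # -1 0 1
with(stats_utils, zscore(c(1, 2, 3)))  # same thing, slightly less ugly
```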
I'm hopeful about Julia, but there are a couple showstoppers for me presently. Maybe in a few years.
I know some people using R, though at least in my field (Computer Science/Bioinformatics), Python seems to be more popular. Both of which happen to be free. That said, I don't know any research groups that chose R or Python specifically because they were free.
The interactive nature of it is handy compared to SAS even when SAS is also available. I've known people to use R first, make a plan, then go back and program it in SAS for the massive data sets that R might not handle as well.
Doing linear algebra -> Matlab/octave
Twiddling data tables around and making plots -> R
Twiddling data tables and making plots -> python/numpy/matplotlib
abstract math/calculus -> python/sympy
statistics/quick data analysis -> R
The advantage of Python is that it's a nice, well designed general purpose programming language. But in basic statistics work you don't need to develop large, well organized programs, so in practice I'd say there is no advantage.
R provides a larger collection of all kind of statistical routines out of the box.
So what is all the guff about? It does remind me of the ongoing religious war between frequentists and Bayesians...