Allow me space for an editorial and downvote magnets :P
I'm actually a little concerned by the "phenomenon" of R. I say this as someone whose workplace uses both SAS and R. I'm also familiar with python, and wish we could use it more at work, but it doesn't have the "cult-following/network effect" amongst statisticians that R has.
The "problem" I speak of is that R is very popular with people applying a quick little stats script from a package they've downloaded, using a technique they don't understand, with output they haven't verified, on a tiny problem that won't scale. And 95%+ of users are just doing it by rote, and now they're trying to apply it to problems outside of its domain.
But ACow_Adonis you say, doesn't that just describe everyone with every programming language ever?
Yes. But you see, R seems almost designed (or rather, not designed) as a language of unseen problems. It is several multiples slower than regular Python (if you thought that possible), and several HUNDRED times slower than compiled languages. It has no unboxed primitive numbers. Let me just say that again: a language for numbers that doesn't have primitive unboxed numbers. It is the poster child of Wirth's law.
But not only that, I said it's basically been designed for "dodgy results". Watch how its attempt at lexical scope combines with lazy evaluation for ridiculous fun. Bizarre, automatic and random conversions behind the scenes. 1-indexing of arrays... but 0-indexing doesn't throw an error. Automatic recycling of values in smaller arrays when combined with larger ones. Internal functions with one-letter names, in a language with KIND OF one namespace, for people dealing with MATH who have a long history of using those individual letters for other things!
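For readers who haven't run into R's recycling rule, here's a rough Python emulation of what R does silently when vector lengths differ (the function name is mine, not any real API):

```python
from itertools import cycle, islice

def r_style_add(x, y):
    """Emulate R's recycling: the shorter vector is repeated until it
    matches the longer one. R does this silently, with no warning at
    all when the longer length is a multiple of the shorter."""
    n = max(len(x), len(y))
    return [a + b for a, b in islice(zip(cycle(x), cycle(y)), n)]

# Mirrors R's c(1, 2, 3, 4) + c(10, 20):
r_style_add([1, 2, 3, 4], [10, 20])  # [11, 22, 13, 24]
```

In most other languages a length mismatch like this is an error; in R it's a feature, which is exactly the kind of thing that lets a silent bug survive into a published result.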
So combine these "features" of the language with people implementing things by rote, not checking their results, returning results without error messages/warnings...
A SAS marketing person once commented, "We have customers who build engines for aircraft. I am happy they are not using freeware when I get on a jet." And we all piled on the hate, and rightly so.
But after using R, what scares me more is the thought that professional stats people ARE using it when I get on a jet :(
Also, allow me just a quick add-on to a point people probably aren't addressing in the comments below: I assure you banks and the like using R aren't passing around all their R/python/C code either.
R didn't bring open source to science. It brought free (as in price) software to stats, and that, along with its script-like ability to apply formulas quickly, its vast library, and its universal teaching in stats courses in universities, is the reason for its popularity. I'd even go so far as to say that python has had more of an influence outside of stats/bio/pharma.
But spend a bit of time in the SAS community to which it is commonly contrasted and you'll see massive amounts of code sharing, examples and how-tos. The interesting thing is to observe how these things play together. I argue that sharing of source has less value if the run-time on which it operates is not available to you.
Of course, SAS is so widespread in big business that you might point out that it quite clearly is available to a lot of people, and that it quite clearly is valuable to them; it's just not available if you can't pay or aren't in a connected uni/job. I know my SAS and I can do several things in it that whip R's and Python's butts if the task is what SAS is good for. It has its own separate and relevant issues in terms of design and implementation. I can rail against all the tools I use :P
But the high entry cost of the software itself is the prime reason I'm trying to turn my back on it (because I don't plan on being employed by a big company or being locked into a software vendor forever, and consequently it is not available for me to use on my own projects, which are often more valuable/complex than the ones I'm writing for employment...). I imagine there are a large number of other programmers/hackers feeling the same way, and you might even say that it's evidenced by the parallel (as in alongside SAS, not multithreading) success of R. Perhaps there is a symbiotic relationship, then, between free-as-in-price and open source code. Who knows. I need a "free as in beer"....
Amen. I've been using R since ~2000, and it's a terrible hack of a language. The syntax is quirky, the speed is underwhelming, and the memory usage makes it unsuitable for all but the smallest data analysis problems. It's a programming language designed by people who don't really know how to program.
...but if you know it, it lets you tap into this huge collection of statistical libraries with relatively little effort. So you're right: most people are using it as a glorified graphing calculator. But given that the alternative is (usually) implementing complicated, error-prone algorithms in another language, I'm still glad it exists.
Why thanks, I do find it ridiculously fun to use short, readable names while maintaining encapsulation, so that I can 1) keep multiple versions of a dataset around and 2) not prefix every reference to a column with the dataset name, which clutters the argument list and requires 20 replacements or a kludgy temporary variable every time I want to plot a different subset.
Oh wait, you were trying to convince me that was a bad thing?
Did you write qplot? (rhetorical question) A lot of these "quirks" of R are nice for end users when they're implemented well, but are unintuitive to program and, as a consequence, are inconsistently implemented across packages.
> A lot of these "quirks" of R are nice for end users when they're implemented well, but are unintuitive to program
It's not a quirk, it's a core feature. It's used consistently and to great effect. Notice how the "subset" function takes advantage of the same flexibility. I'm about 95% sure neither the subset function nor the other standard library functions that use this "trick" were written by Hadley. The "trick" was expected to make expressions significantly easier to read and easier to write from the very beginning.
It might surprise a few people who come from another language and think they've seen it all, but once they figure out what's going on (which should happen in the first tutorial, or by the 2nd or 3rd copy-paste) it'll be a pleasant surprise. Unless they knee-jerk and hate on it because it's unusual among languages.
It's used consistently in the core language and in well written packages but very inconsistently across the ecosystem. I certainly don't use it in packages I write for my own use and it's pretty unused in most of the packages I download.
The thing with R that I think is important to note is that you don't have interactivity to support code (e.g. in Ruby and Python the huge advantage of the REPL is to interact with the code you are writing while you're writing it); rather, your code is a way to make the interactive experience better. R is an amazingly advanced calculator.
The reason I use R over python most of the time is because, despite some amazing improvements in this area by the python community, there's no better tool for fluidly interacting with data that offers the same power.
That said, R is not for writing software systems. People used to refer to many interpreted languages as "scripting languages", and while this is clearly not the case for Python and Ruby, this is exactly what R is. There's a good reason in RStudio it says "New File > R Script". The limit of using R is when you have a bunch of scripts that interact with each other to create a bunch of visualizations/reports. If your system gets more complicated than that, write it in something else.
Of the many language/environment combos I've used, I don't think I've found one better for rapid prototyping than R, and following from that R has no place near anything that would be called "production". I also happen to think, if used properly, this is a good thing since it means your "prototype" never accidentally creeps into suddenly being your production system.
Has there been any thought of making a Python package that emulates R's ease of use and immediately-available statistical methods? I imagine it would mostly be a renaming wrapper around NumPy and SciPy.
> It has no un-boxed primitive numbers. Let me just say that again. A language for numbers that doesn't have primitive unboxed numbers.
What you're leaving out of this makes me think that you don't understand the field at all.
Everything is a vector. There's no need for an 'unboxed' number when you have vectors. If you're doing computation thinking of operating on individual data points rather than vectors, matrices, and multi-dimensional arrays of datapoints, you're doing it wrong. R is doing it right.
If so, I have absolutely no clue what the point was. He referred to Wirth's Law, but the way to make software faster as the hardware gets faster is to abandon the idea of individual unboxed numbers, and move to vectors, as that's what the hardware is using.
R, like APL, is much better in its intended domain for being a vector-based language rather than a scalar based language.
If the application domain is a scalar based domain, then R is probably the wrong tool for the job. If somebody doesn't understand the cases where vector-based language is better, they've probably never encountered the right application domain for R.
It almost has to be a vector based language given its speed shortcomings, because that's the easiest way to package the canned C routines in manageable chunks to speed things up. Which C routines are just loops that manipulate scalars, btw.
Unfortunately "vector" operations typically cause a lot of temporary intermediate vectors that you never see, which is why a "scalar" language like Julia or C can provide such performance improvements when they handle the whole algorithm without the unnecessary intermediates having to be allocated and filled.
I think the vector operations do provide a certain brevity though. Whether this is an advantage or not is very subjective. For myself, I prefer having anonymous functions and operations like "map" for transforming collections (including matrices and vectors) element wise, instead of having all operations that make sense on numbers automatically also operate element wise on vectors and matrices. (Because there are operations on vectors and matrices as a whole, sometimes with the same name as a scalar function, which can lead to confusion - exp being one example and multiplication being another). But I can understand opinions differ in that regard and a lot depends on how "general purpose" the surrounding programming context needs to be.
Spot on. One major reason why environments such as NumPy are slow compared to what they are implemented in is precisely what you said: gratuitous creation and destruction of costly temporaries. In the NumPy world this can be mitigated somewhat with tools like numexpr. R is lazy, so I expected it to be faster than NumPy and was surprised that it is so much slower.
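The hidden-temporaries point can be sketched in plain Python (standing in for what a vector runtime does under the hood; the function names are illustrative):

```python
# "Vector" style: each whole-array operation materializes a temporary,
# so computing a*x + y makes two full passes and one throwaway list.
def axpy_vectorized(a, x, y):
    t = [a * xi for xi in x]                  # temporary vector for a*x
    return [ti + yi for ti, yi in zip(t, y)]  # second pass for the add

# Fused "scalar" style: one pass, no intermediate vector. This is the
# shape of loop a compiler (or Julia's JIT) can emit directly.
def axpy_fused(a, x, y):
    return [a * xi + yi for xi, yi in zip(x, y)]
```

Both return the same result; the difference is allocation and memory traffic, which is exactly where fused scalar loops win on large arrays.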
Julia seems headed in the right direction and I am very excited about it. I really wish they had a mechanism to desugar vectorization syntax into dumb loops which could then be JITed. Vector syntax can be really expressive, aligned with the problem domain and succinct; I would hate to let go of that in the interest of speed. It's strange that many correlate verbosity with clarity/readability.
It's easy to imagine situations where languages that directly support vector types could have performance benefits, especially on hardware that supports vector instruction sets. On the other hand, languages that don't manipulate numbers in the native format of the CPU(s) will perform poorly for large datasets.
I still don't know what the OP's point is, then. There are vector types and there are list/object types in R, and vectors have clear advantages over scalars in this field. Having to special-case vectors of length one in the language would be a disadvantage when all modern workstation architectures are vector-based for computation.
R is basically Lisp with syntactic sugar for BLAS/ATLAS and incredibly easy bindings for FORTRAN, C, and C++. If this description doesn't sound like an amazing combination for an application domain, then it's not in R's sweet spot.
Vectors vs scalars is a different issue than boxed/primitive types.
Of course you want to be working with vectors for these kinds of problems. But the ability to work with vectors of primitives rather than vectors of boxed types is about enormous gains in efficiency and memory usage. This is one of the reasons that R is so slow compared to other tools.
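The boxed/unboxed distinction is easy to see with Python's standard library (a rough analogy for R's situation: a list boxes every element as a full heap object, while `array` stores raw C doubles):

```python
import sys
from array import array

n = 10_000
boxed = [float(i) for i in range(n)]   # list of pointers to float objects
unboxed = array('d', range(n))         # one contiguous buffer of C doubles

# The boxed version pays for a pointer per slot PLUS a full float
# object (~24 bytes on CPython) per element; the unboxed version
# pays roughly 8 bytes per element.
boxed_bytes = sys.getsizeof(boxed) + sum(sys.getsizeof(x) for x in boxed)
unboxed_bytes = sys.getsizeof(unboxed)
```

The unboxed buffer is several times smaller, and the same layout difference is what lets vectorized C kernels stream through memory without chasing pointers.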
Now I am confused. My understanding is that when I apply a function on a vector etc. in R, I call some compiled C code in the background that translates this into a straightforward for loop. The hardware would not work on vectors but on scalars in this case and vectors are just a more condensed and more math-y interface for users. You're saying that R actually never works on scalars in the background? Maybe I'm missing something here.
I am using R for a fairly large project (because I am working with other people who are using R) and I really agree with this.
It's easy to implement small things in R, but gluing them all together is awful. Additionally, the debugging features are not very good, especially if you've created a package. Everything using Rcpp basically requires you to make a package, and believe it or not the C/C++ is easier to debug than the R. I do believe that going the other way (RInside), i.e. embedding R in C++, is a much better solution in terms of efficiency and getting the actual results you expect.
I am, however, impressed with the amount and, in general, quality of R packages.
I used to think R's debugging was awful. Then someone told me about options(error = recover), and I get a nice Lisp/Matlab-like stack when something bad happens. It doesn't help with the C/C++ FFI piece, but that's not usually what is needed.
There's a lot of little doodads like that in R. The real problem is there is no good book teaching people who need to write complex things how to do it properly.
As a side project at a client of mine, they wanted me to expose some R reports via a web interface. The reports themselves were incredibly slow - and did the most horrendous SQL queries. The reports could have been achieved in several other languages, and in much more performant ways, but it absolutely had to be R because that's all the analyst in question knew.
From what I've seen of Julia it could be a massive contender for this kind of usage.
I use R almost every day. I'm tempted to start porting Bioconductor packages to Julia, so I can go back to working in languages that don't make my brain melt. It's scary when you realize you actually got used to working around the hacks/quirks of R.
I more or less agree with your specific points, but in the larger scheme of things I'm more concerned that these hypothetical people don't understand the stats than that the tool won't scale or is badly implemented.
That said, it sure is a bitch of a language to try to develop for (since the consensus seems to be: write everything in C) and cran is a ghetto.
Well yes, a dynamic interpreted scripting language will have all the drawbacks of a dynamic interpreted scripting language. If you are trying to build a large scale application that requires a lot of debugging using R, you have chosen the wrong tool for the job. R is not Python or C++, it's intended for writing data analysis and visualization scripts.
As one of the people interviewed in the article I feel somewhat compelled to explicate a bit further. I'd be the first to admit that R is good for some things and bad for others. It's full of quirky parts that make users coming from any more standard scripting language (e.g. Python) want to pull their hair out. That said, in the world I come from (EEB, ecology and evolutionary biology), it's by far the most popular language. At rOpenSci, we develop tools in R because that is the language our audience works in. I think the mistaken assumption of many commenters is that R users are actual programmers. Most EEB scientists I know don't want to get bogged down in learning multiple languages. They want to learn something that will make doing their science easier. R provides that. For all the credit that SciPy and NumPy deservedly get, they still are way behind when it comes to certain statistical tasks. For instance there are whole books written on doing mixed effects models in R, but you can't get those in Python yet (I know statsmodels is coming along but it's nowhere near where lme4 is). Yes, if you're a Python programmer you could just call that one R routine from Python and go back on your merry way, but that's you, not the average ecology graduate student. Also, Matplotlib is just not on par with the capabilities of ggplot2 and other R graphing libraries (although there is a ggplot2 port to Python being developed).
The other important component that I think is missing from the discussion about R's merits is that it's facilitating open science. We're talking about fields that are moving from SAS/Matlab/JMP, etc. toward the creation of totally reproducible documents and experiments with tools like Sweave. Is it going to provide the fastest environment for running regression trees on a dataset with 10 million rows? No. But is it a powerful scripting language with well developed tools for manipulating data (plyr), visualization (ggplot2, lattice), doing GIS (rgdal, sp), getting data from APIs (httr, jsonlite, anything rOpenSci does :) ), writing reproducible documents (knitr) and doing complex statistics (lme4, nlme, gam)? Yes. It allows scientists to learn one language to be able to accomplish 99% of the analytical tasks they want to. I think that's the point of the article. Yes FOSS has been part of science for a long time, yes R is not the best language for many things, but there's a culture at play where it's been adopted and extended by many scientists to accomplish a lot of valuable science, and it has brought FOSS, openness and reproducibility to a vast number of scientists who probably wouldn't otherwise have adopted those practices.
As a former cognitive neuroscientist, I pray for the day Matlab is displaced. Given the generally low level of programming ability in the sciences, I'm personally rooting for Python to win, but I'll take what I can get.
Unfortunately, the dominant EEG and fMRI packages (Fieldtrip and SPM) were written in Matlab, and my labs standardized on them. Plus, when I was in school, R was unable to handle the multi-GB data sets that result from neuroimaging.
> combinedStr = strcat(s1,s2,...,sN) horizontally concatenates strings in arrays. Inputs can be combinations of single strings, strings in scalar cells, character arrays with the same number of rows, and same-sized cell arrays of strings.
So you can throw a bunch of different string-like stuff and everything will be concatenated. A bit strange but okay. However, there's more:
> If any input is a cell array, combinedStr is a cell array of strings. Otherwise, combinedStr is a character array.
Ouch. So you can build your program, test it only with non-cell-array arguments, and later on someone throws in an extra thing to concatenate (or uses a cell array to define the output separator)... and that changes the output type of the function!
But that's not the only side effect! It turns out that
> For character array inputs, strcat removes trailing ASCII white-space characters: space, tab, vertical tab, newline, carriage return, and form-feed. For cell array inputs, strcat does not remove trailing white space.
Oh yeah, you also get different "concatenation" rules when these types change. And that's even before discussing why on earth a "strcat" function would remove trailing spaces from the stuff you tell it to concatenate...
TL;DR: What is wrong with MATLAB is that it is designed for writing one-liners that probably perform what you want. This is achieved by an endless stream of tweaks to the basic language's functions that automagically try to do "what you probably want". As a result, it is an extremely compact, easy-to-write language when the tweaks work, but an utterly terrible experience when the magic doesn't, one that makes you feel like you're walking through a minefield.
While the language definitely has its problems (I'm not a fan of the syntax at all), I think dropping it for something marginally better like R is a little silly. So many man-hours have been put into writing MATLAB/Octave code that redoing it seems like mostly a waste.
I don't know if you have similar experiences, but I often find that I want to use X feature in MATLAB in combination with Y feature in R and there isn't any easy way to do it. The bifurcation of coding efforts is vastly more frustrating than some bad/inconsistent syntax.
The toolboxes are great. I haven't used them much, but I feel like a lot of the time you can get away without using them. If you really need them, then it's not unreasonable to pay a license for the documentation and robustness - which you won't get in open source most of the time.
I'm in political science, and I'm pretty surprised how aware of open source tech some of my professors are, R especially. But I've even heard from a few of them a desire to pick up Python or C++ for other data work, and at least one of them knows emacs.
Proprietary software like Stata still gets used as much as or more than R, but hopefully R continues to pick up steam. RStudio in particular is a pretty compelling environment.
I tried to do some searching for some specific projects that I've heard of but I'm coming up blank.
Really, any decent quantitative study that isn't just an absolutely basic regression is going to have a degree of data processing done to it. Not exciting, but it is there, and at a large scale.
The other more interesting projects are doing stuff like scraping news sources, constitutions, etc, using natural language processing to pick out relevant parts and then matching those to some kind of database in order to code the necessary data.
Depends on the field of course. I'm in environmental science/energy economics so python is kind of a no-brainer if you want to go open source (and we do).
However, my significant other is working on a physics PhD and everything she does is in C or C++ with CERN ROOT. I used to use Matlab, and she thought it was adorably weak. I get a little more respect using Python now at least.
It's funny how domain-dependent this is. In controls/filters/vision/applied math, no one takes anything other than MATLAB seriously.
There are of course the C/C++/FORTRAN gurus that write LAPACK/BLAS/OpenCV etc. , but they're kinda in their own world. MATLAB is the de-facto wrapper for these libraries, and all prototyping is done in it.
If you want to be completely open source. There seem to be a lot more libraries and general capability with the specific license we need. I really should have clarified better, I seem to have lost a word or two in there.
What I'm working on needs to interface with many different existing modeling and optimization efforts at some point, and of the options out there Python seems to be the most understood by the largest group of people (statisticians/scientists/programmers). With python we can keep everything 100% open and available to the largest number of people.
I work at a big bank in quant research. I can easily say that open source tools are favored here over their more expensive counterparts. Furthermore, over the last few years I've definitely seen a shift away from R and toward Python. The NumPy, SciPy and pandas libraries in Python are all excellent, and way faster than the equivalent options in R.
Mis-quoting Churchill: R is the worst numerics software, except for all those others I've tried from time to time.
I've used R extensively for analyzing network simulation results databases that run in the tens or hundreds of MB. One can find well-documented libraries that work for interfacing with nearly everything. In my case, it's pulling data from MySQL or SQLite databases, performing graph-theory analysis using Boost Graph Library, and generating output with Graphviz and other plotting tools. It's a solid toolchain, and R's inherent slowness is somewhat manageable via the parallel flavors of apply.
The main problem for me has been the lack of a clean analog to namespaces or utility classes. Environments sort of do the same thing but are ugly syntactically.
I'm hopeful about Julia, but there are a couple showstoppers for me presently. Maybe in a few years.
> But the ballooning cost of the software and dwindling research budgets have prompted scientists to turn to R instead.
I know some people using R, though at least in my field (Computer Science/Bioinformatics), Python seems to be more popular. Both of which happen to be free. That said, I don't know any research groups that chose R or Python specifically because they were free.
Being able to have everyone install it on every computer, without any thought of licensing definitely gets it in the door for some people.
The interactive nature of it is handy compared to SAS, even when SAS is also available. I've known people to use R first, make a plan, then go back and program it in SAS for the massive data sets that R might not handle as well.
Julia will ride R's coattails. The Julia story arrived at a good time and seems to be slowly gaining traction in various niches. This is based purely on reading the mailing list and scanning relevant HN headlines.
xorg takes 2% here, not by the grace of a superfast computer - far from it - but by the simple expedience of having both Noscript and RequestPolicy installed. I can still read the article so I don't know what I'm missing by not allowing all that JS and external content to run/load. Not much, I assume...
Thanks! Your comment has encouraged me to try out NoScript. Too bad I hadn't checked it out before. I had this superstition that with NoScript I'd have to spend a lot of time configuring just to reach a comfortable level, but I've been able to get going in a minute. Massively useful.
If a page is hosed under NoScript then just temporarily allow all scripts. You can spend a few hours building up a whitelist of hosts for your most visited sites if you can be bothered, but that's not needed for it to be useful.
Actually, while using NoScript, I've found that there is only a very small set of websites I visit often enough to bother adding exceptions for: HN, GitHub, hurriyet (a news website), tumblr, duckduckgo and maybe a couple others I can't recall.
What advantages does Python have over R for basic statistics work? I'm trying to play with data for fantasy sports and was planning on getting the data into a MySQL database, then using R to look for patterns. Is R the right choice for this, or is it a matter of Python and R being able to do the same thing in different ways, so there is no wrong choice? Given that it will be a fairly small dataset, I'm not overly worried about performance.
One of the big differences is that Python is a general purpose programming language which just happens to have great support for statistics through packages. That means interfacing with a database is a very common workflow, and there are solid tools to do so like sqlalchemy. Now let's say you want to calculate all this data and serve it on a webpage. R kind of falls over at this point. Python? No problem, you can start up a Flask webserver in half a dozen lines of code and have your data visible to all your fantasy team friends. Oh, you want to pull down the live data from the web, parse it from html/xml/csv, analyze it, put it into the database, and spit out the new analysis in realtime on your webpage? Python has you covered. R, not so much. To me Python also feels much more robust and well thought out as a language, but that could just be personal prejudice.
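For the small fantasy-sports dataset the question describes, even Python's standard library covers the basics before you reach for pandas or SciPy (the numbers below are made up for illustration):

```python
import statistics

# Hypothetical weekly fantasy point totals for one player
weekly_points = [12.5, 18.0, 9.5, 22.0, 15.5]

avg = statistics.mean(weekly_points)    # central tendency
med = statistics.median(weekly_points)  # robust to one blowout week
sd = statistics.stdev(weekly_points)    # sample standard deviation

print(avg, med, round(sd, 2))
```

Real work here would more likely pull the MySQL table into a pandas DataFrame (e.g. via read_sql) and group/aggregate from there, but the point stands: the stats live in libraries, while the surrounding language stays general-purpose.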
> "What advantages does Python have over R for basic statistics work?"
The advantage of Python is that it's a nice, well designed general purpose programming language. But in basic statistics work you don't need to develop large, well organized programs, so in practice I'd say there is no advantage.
R provides a larger collection of all kind of statistical routines out of the box.
No wrong choice I think, mostly personal preference. If you know Python I'd stick with that, iPython notebooks with pandas etc. is a solid choice.
R might be a little harder to start with, sapply/lapply can be confusing but there's plenty of info and libraries on the web to make your life easier. For plotting, ggplot still wins over matplotlib in my opinion but Python has other strengths.
The good thing about R is that it has forced me to learn statistics. Python has never done that for me. Use R where it works, don't use R where it doesn't. Call R from python or python from R, see if I care, as one would go to C or Fortran anyway. And why bother with python when you have Julia?
So what is all the guff about? It does remind me of the ongoing religious war between frequentists and Bayesians...
I wish that were true. I'm sure it depends on the field, but in my experience (more physics/natural science), MATLAB still leads for interactive scientific programming. I think Python is at this point recognized as a legitimate alternative, rather than the lingua franca.
Seems like Python has been gaining a lot of ground in areas formerly ruled by Fortran. A lot of those folks made forays into Java and Python seems like a breath of fresh air to a lot of the academics I know.
Libraries, libraries, libraries and history. The first serious numeric library for python was released in 1995 or 1996 and things have just grown and grown since then. Ruby is far behind in its offerings in this space.
Uh, Python? R is the biggest prima donna in data science. It is also a very obfuscated and poor-performing language overall. Statisticians shouldn't be allowed to drive languages into popularity.