
Are data hackers gravitating to R? Given that Oracle and (surprisingly to me) SAS now both support R in their offerings, it seems that at least the enterprise will be taking up more R for analytics.



I've been using R because I have an amateur's interest in statistics.

As a programmer, it seems like an awkward language, although no more awkward than SAS, SPSS, etc. And as I do more analysis the language makes more sense. It's a special-purpose language made to do a specific task.

The general workflow for doing data analysis is 1) import the data 2) clean it and format it properly as input for a pre-built package that does the actual analysis 3) feed it to the package and 4) interpret the results.

To that end, R programs are typically short and fairly declarative. R packages contain C or FORTRAN extensions that do all the heavy lifting. A substantial amount of imperative R code is going to be slow. For instance, looping over a vector is almost always slower than applying a vectorized transformation, and R provides a rich set of transformations for all its data types.

R has gotten popular because the proprietary guys dropped the ball at the universities. I recall reading a posting by one researcher who said he switched to R because his students could only use SAS at the school's stats lab, whereas they could run R on their computers at home.

Once researchers switched to R, they started publishing their work with code meant to be run with R. The cutting edge is important in stats, so people want a short lead time between when a new test or model is published and when it's available. SAS's "cathedral" can't really keep up. Combine that with SAS's licensing costs (both arms and a kidney too) as well as its overall "mainframey" feel, and you can see why R is winning.

EDIT: Another big win for R that I forgot to mention is its support for visualizations. A step that should perhaps come after importing the data above is investigating it with various diagnostic charts (scatter plots, box plots, etc.). These are all just function calls in R. In addition, R has a powerful graphics engine, and there are a huge number of packages available to create more sophisticated visualizations: http://addictedtor.free.fr/graphiques/

-----


This is a great description of R usage (import, clean, fit models), but I think it gives a slightly erroneous explanation of R's history.

John Chambers created "S" at Bell Labs. S was a programming language designed for interactive statistical analysis. Much like gcc and icc are implementations of C compilers, R and S-PLUS are implementations of S. S-PLUS was/is the primary proprietary implementation of the S language, whereas R is the primary free one (also sometimes called GNU S). (SAS and SPSS are completely different languages/systems as far as I know.) I think that statisticians at some point made a conscious effort to publish their work in R, rather than S-PLUS (or any other statistical system like SAS), because it was more widely available. That in turn made R a viable competitor to S-PLUS (and other systems) because it had a vast number of recent statistical libraries, often implemented by the people who developed the techniques. That said, SAS and SPSS seem to pretty much still have social science students locked up --- the market for R is probably statisticians who are also excellent functional programmers.

This history is in really marked contrast to MATLAB and its corresponding free version Octave, where computer scientists pretty much refuse to use Octave, despite MATLAB's massive price tag to pretty much everyone involved (even with 90% discounts).

(That said, if anyone lived through the change over from S-PLUS to R, I'd love to hear if this history is wrong!)

-----


I think that the Bioconductor project (http://www.bioconductor.org/) has also been a big part of R adoption, as it has produced a core of well and consistently documented libraries for importing, managing and analyzing biological data that is not really matched anywhere else. R co-creator Robert Gentleman was/is a big driving force in that, so of course it is in R.

-----


> This history is in really marked contrast to MATLAB and its corresponding free version Octave, where computer scientists pretty much refuse to use Octave, despite MATLAB's massive price tag to pretty much everyone involved (even with 90% discounts).

Do you have any insights as to why Octave does not have higher adoption?

-----


I've always been a bit sad about it, but everyone involved is probably a rational actor.

Computer science professors probably view a couple hundred dollars per MATLAB network license as a tiny expense on a $1m+ grant (whereas statistics grants are apparently often smaller), and they may be charged for it in departmental overhead anyway (removing the incentive to cut costs).

The type of people who could contribute either core code or toolbox type code to Octave often have an extremely rare quantitative skill set that is worth hundreds of dollars an hour, so there is a huge incentive to get paid to do similar work instead. There probably isn't much community recognition (to balance things out) for implementing a library in Octave. (Though, in the R world there are certain recognizable superstars like Hadley Wickham.)

Graduate students (who might work for cheap on these problems) are probably more focused on publications and networking.

As long as all of this is the case, Octave will always kind of just be a worse MATLAB that happens to be open source, so a new user choosing between them will probably just choose MATLAB by default.

-----


Octave core developer here.

It is true that we have a lot of trouble attracting new contributors. Most of our users keep demanding features that seem unimportant to us but mean the world to them: a GUI ("whatever for?", we think. "Use a real text editor!"), a JIT compiler ("here's a nickel, get better vectorised code, kid"), perfect Matlab compatibility (a never-ending chase, not very fun, in which we must always be behind).

We're finally, slowly, listening to our users on these. Two of our current three GSoC students are working on a GUI and a JIT compiler, respectively. I have wild hopes that this will attract more users and developers. I'm also hosting an Octave conference in a few days toward this goal:

    http://www.octave.org/wiki/index.php?title=OctConf_2012

By the way, Octave is GNU (so is R, supposedly), so we're not really open source; we're free. ;-)

I don't know why Octave hasn't been able to replicate R's success. I don't know if R's being GNU in name only has something to do with it (R developers routinely try to find new ways to get around the GPL and link R to non-free code, and I don't doubt that this linking to Oracle's database is another example of that). I don't know if it's just that a lot of people with big money care more about statistics and R than they care about Octave (banks and brokers for R, electrical and civil engineers for Octave). Maybe our code sucks more than R's.

Do you have any suggestions for how to make Octave the standard instead of Matlab? The recent gratis classes that emerged from Stanford gave Octave a lot of publicity. Do you have any suggestions for what else we might do?

-----


You're probably in a much better position to evaluate than I am! My guess is that more Octave-based classes would translate into more users and more code written for Octave down the line, but I'm not sure how to encourage more use of Octave in the classroom in the first place.

-----


Matlab is truly the RAD tool of choice for numerical programming and has a solid grip in universities combined with enterprise-level support.

I do not think Octave ever tried to replicate Matlab's workflow (which is not at all centered on general programmers) or its domain-specific documentation. It merely focused on compatibility with the underlying language, which is really the least important part of Matlab.

On top of that, I seem to recall, Matlab was one of the first of the specialist programming toolsets to offer a very competitive "Student Edition". This was a godsend for schools and universities before the Internet took off.

In short, Octave was too little, too late. NumPy/SciPy, while catching up fast, has its supporting tools spread all over the place and is geared more toward general programmers who want access to convenient numerics than toward numerical modellers and engineers who want a RAD tool.

NumPy/SciPy etc. may well overtake Matlab eventually, but that will be purely a function of its infrastructure, not of something as mundane as the language, nice as it is (and admittedly the language was its initial driving force). At least in this respect, it has done a lot better than Octave in much less time.

-----


Actually, there are tons of people who want to run Matlab code freely. The code is already written. They need to run it in clusters, or they need to run it at home.

This is why we are doing Octave.

-----


MATLAB is commonly used in introductory CS courses for engineering majors, since it's useful for a lot of general tasks and is pretty forgiving.

MATLAB has a nice GUI and IDE. It also generates good graphs with minimal effort.

Octave has a command-line REPL.

-----


I've looked into it, but went with Python instead, since it allows easier access to get through proprietary single-sign-on stuff that sits in the way of getting data. Plus I don't need to worry about when I need to drop down to screen scraping, or build up into a web interface—much nicer to just share modules between everything.

-----


It is also supported in SAP's HANA (in-memory database).

http://en.wikipedia.org/wiki/SAP_HANA#R_integration

-----


The HANA integration is more marketing blurb than a true integration at this point. It's meant to draw in the stats guys so other SAP products can be sold to them.

-----



