Not having used R, and being no more than passingly familiar with it, I'm wondering if anyone could shed some light on what this is about? I notice that on the Revolution Analytics website they are selling an "enterprise" version of R which they claim has a number of advantages over the mainstream form of R.
How exactly is a proprietary form of R even able to exist if R's codebase is GPL'd?
I'm sort of torn on this issue because, setting aside the terms under which RA releases their libraries, what they produce comes with great documentation and tends to be pretty useful. Using foreach and doMC I was able to cut a calculation that normally takes 8 hours down to 3 hours on a 6-core machine.
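For the curious, a minimal sketch of the pattern I mean; param_list and slow_fit are made-up placeholders for my actual job:
library(foreach)
library(doMC)
registerDoMC(cores = 6)                 # one worker per core
# run the (hypothetical) slow_fit() over each parameter set in parallel;
# swapping %dopar% for %do% gives back the original serial behaviour
results <- foreach(p = param_list) %dopar% slow_fit(p)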
On the other hand, I strongly believe that proprietary code should be avoided when doing scientific research because it inhibits peer review and makes it harder for others to replicate your work. As Warren DeLano said: "The only way to publish software in a scientifically robust manner is to share source code, and that means publishing via the internet in an open-access/open-source fashion."
And from their FAQ:
Yes. CRAN R is licensed under the GNU General Public License (version 2). As permitted under that license, Revolution R is based on a modified version of CRAN R. The source code to that modified version (including a list of changes) is available for download when you download the binary version of Revolution R, as required by the GPL.
As an open-source company, Revolution Analytics respects the Free and Open Source Software philosophy and respects all free-software licenses. Revolution Analytics is also a direct supporter of the open-source R project: financially (as a benefactor of the R Foundation), organizationally (by sponsoring events and user group meetings), and technically (by contributing modifications to CRAN R back to the community).
RA claims to have adopted this "arm's length" approach, releasing their direct extensions to the community. They ask for identifying information before letting me download the software; I don't know whether that's compatible with the GPL. Critically, their enterprise product is pitched as a collection of separate tools: IDE, debugger, analysis tool, etc.
There are a few other ways that people have made GPL'd code proprietary: (1) providing a hosted service instead of distributing software, or (2) distributing only a hardware appliance that contains inextricable software. The latter is controversial with respect to GPL2; GPL3 more clearly prohibits it.
R's aggregate data types are: vector, matrix, array, data frame, and list. The semantics of these types, and the relationships between them, are extremely confusing. I wish I had gathered examples of this so I could be more specific, but I have basically come to the conclusion that I will never get familiar enough with them to do much better than guessing at random until something works. And I've written somewhat in-depth analyses in R.
list => basically a hash, or an array that can hold objects of mixed types
vector, matrix, array => essentially the same thing. They are what most computer languages call arrays, and can hold only one type. The difference between the three is just the number of dimensions (vector: 1, matrix: 2, array: 3+).
dataframe I will concede is a little more complex, and I still have some problems with it. But I basically think of it as a table, where each row represents a variable (say temperature) and each column a different measurement. So, for example:
rows => temperature, humidity, hours of light, peak UV
columns => Day1, Day2, Day3, Day4, ...
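A quick illustrative session for the vector/matrix/array/list part (the values are arbitrary):
v <- 1:6                            # vector: 1 dimension, a single type
m <- matrix(1:6, nrow = 2)          # matrix: the same data with 2 dimensions
a <- array(1:24, dim = c(2, 3, 4))  # array: 3 (or more) dimensions
l <- list(1, "two", TRUE)           # list: mixed types are fine
dim(m)      # 2 3
length(l)   # 3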
Hope that helps.
> all.equal(1:10, matrix(1:10, ncol = 1))
 "Attributes: < target is NULL, current is list >"
 "target is numeric, current is matrix"
> all.equal(matrix(1:10, ncol = 1), array(1:10, c(10, 1)))
[1] TRUE
Vectors, matrices, and arrays are atomic/homogeneous objects and differ only in their dimensionality: vectors are 1d, matrices 2d, and arrays 3d or higher. Calling a 2d homogeneous structure a matrix is just a convention; a matrix is identical to a 2d array in every important way.
Lists and data frames are heterogeneous/recursive. Lists are 1d, and data frames are (essentially) 2d: each column is homogeneous, but different columns can hold different types.
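A short session illustrating the atomic/recursive split (base R only, nothing package-specific):
is.atomic(1:10)               # TRUE - a vector holds a single type
is.atomic(matrix(1:10, 2))    # TRUE - a matrix is just a vector with a dim attribute
is.recursive(list(1, "a"))    # TRUE - lists can hold anything, including other lists
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
is.list(df)                   # TRUE - a data frame is a list of equal-length columns
sapply(df, class)             # one class per column (integer and character/factor)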
But I wonder why R actually needs to exist as its own language. It seems it could be recast in, say, Ruby or one of the newer functional languages.
So I am kind of pleased by this news... if R is going to need its own language, speed seems to be the most important distinguishing feature. A bit like Fortran, which is still used in science.
(incidentally... don't let people tell you otherwise: Fortran (90+) is a very nice language... much more pleasant to use than C, and it gives you better performance (unless you know a lot about compilers and compiler flags... but most scientists don't ^_^))
I agree if we're talking about defining a new R (like the blog post discusses), but it makes sense to me for the existing R to exist as its own language. It wasn't really invented from scratch gratuitously; it began as an open-source reimplementation of the Bell Labs "S" language, which had already become close to a de facto standard in the statistics community. After 30 years of statisticians writing S and R code, it's going to be a big uphill effort to convince them to read and write Ruby instead. You'd also lose the ability to run snippets of code from the thousands of existing papers that include R/S code in an appendix.
One in-between possibility could be to retain the standard syntax/semantics but target an existing VM with a bigger development community. Ihaka seems to think that's impossible (he briefly dismisses attempts to compile R as futile), but lots of weird, highly dynamic languages now have more efficient implementations than most people would've thought possible 10 years ago.
I think that's a great move for a number of languages as they lose popularity over time. The Scheme that ran on Lisp machines was outpaced by compiled versions on general-purpose machines, and those are the versions we use today.
Indeed - it would be a shame for them to start over from scratch and end up with a brand new language, brand new syntax, brand new quirks, brand new performance problems, etc., when they could simply search around a bit for something that's already mature, somewhat optimized, and suited to their needs.
If they wanted to add to or modify an existing language (for instance, to provide more concise syntax for some of the things that are more important in statistics than in general-purpose programming), that would be just fine; it could become a dialect of some other language. But starting fresh seems like an awful waste of energy...
Something that ran on the JVM would be awesome; they'd have no trouble at all rebuilding the massive library of contributions.
I've heard bad things about the JVM for tightly coupled jobs on HPC (though I know there's been some improvement: e.g. a lot of work done by EPCC in Edinburgh). Does Clojure manage to offer a good parallel implementation on top of the JVM, or has no work been done in this area?
R of all things was running faster.
I reimplemented it in Java and it takes < 1 second.
Not sure what that could be, though; even low-level algorithms usually run 10-100x slower than C speed if naively coded in plain Python, so a 30000x slowdown using a specialized library sounds rather odd. I assume you checked for memory leaks and swapping. Did you do any profiling?
In general, numpy/scipy is quite a bit faster than R: it does not have pass-by-value semantics, for one. I am also skeptical about writing a "new" R: the main value of R is in the R packages, and any new language would throw that away.
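To make the pass-by-value point concrete, here's a toy example (strictly speaking R is copy-on-modify, but the visible effect is the same):
x <- matrix(1:4, nrow = 2)
f <- function(m) { m[1, 1] <- 0; m }  # "modifies" its argument...
f(x)
x[1, 1]                               # ...but the caller's x is untouched: still 1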
(I forked your gist so you know how to contact me)
http://gist.github.com/578226#file_gistfile1.py (sorry, that's an editable link, so please be nice)
Anyway -- the machine is not swapping -- Python grabs ~20GB of RAM and there is another ~100GB available. It pegs one of the cores. Nothing else was running during this test, so there was no competition for the FSB. This is on a recent-gen 16-core Xeon server. The bit with the pipe and the popen is just me writing the header for the Matrix Market file in a separate file, because it's simpler to dump it from Hadoop that way. My time measurements above did not include file-reading time -- just the time to run the norms code (which is essentially just a dot product of each column with itself). The matrix file is 2GB uncompressed on disk; dimensions are [5e5, 13e7] with 1.2e8 nz.
I rewrote it as a sparse, column-major matrix in Java, running on the same machine, with a sparse dot-product implementation. I was off a little before -- the time to stripe the entire dataset is ~2.7 seconds, averaged over 1000 tries. I was confused because I'm spawning 5 threads, which gets the time to compute the dot product of one column against every other column down to an average of 1 pass through the data per second. If you contact me, I'm happy to share the code, but it's rather more lengthy.
One more edit -- the Java code is not particularly optimized and doesn't batch columns. It just spawns 5 threads that bang on my giant array. The data is stored as a column-major array of sparse vectors, where each sparse vector is an array of integer indices and an array of integer values.
Indeed, you write
c = mat2.getcol(j)
norms[0, j] = scipy.linalg.norm(c.A)
which means (i) extract a sparse column vector, (ii) convert it to a dense vector, and (iii) compute the norm. That should explain the speed difference: the dense column has 5e5 entries, while the average column has only about 1.2e8 / 1.3e7 ≈ 9 nonzeros, so the dense norm can take up to a factor of 5e5 / 9 ≈ 54000 longer :)
The main issue here is that the linear algebra stuff under `scipy.linalg` doesn't know about sparse matrices, and tends to convert everything to dense first. You'd need to muck around with `m2.data` to go faster.
I'd actually guessed that it might be making the columns dense, but I'd expected to see a step-ladder up-and-down memory pattern as vectors were allocated, gc was triggered, more vectors were allocated, etc. I didn't observe such a pattern; memory usage was almost constant.
Anyway, thanks again for your help -- I'd offer via email to buy you a beer if you're ever in SF, but no email, so...
For the sparse SVD I implemented in scipy, even a fast dot product does not seem to make that much of a difference (it would be interesting to do real benchmarks, though).
It is somewhat confusing that Python base types and numpy differ in behavior, for instance when dealing with inf or divide-by-zero exceptions. I think this gets to hadley's point that it will be hard to bolt R onto an existing language.
As for Python vs numpy differences: yes, those can be confusing, and that's inherent to the fact that we use a "real" language with a library on top of it instead of a language designed around the domain. If you want to do numerical computation, you do want the behavior of numpy in most cases, I think. There is the issue of "easiness" vs what scientists need. You regularly have people who complain about various float issues, and people with little numerical-computation knowledge advising them to use decimal, etc., unaware of the issues. Also, Python wants to "hide" the complexities, whereas numpy is much less forgiving.
As for the special case of divide-by-zero or inf, note that you can get behavior quite similar to Python floats. You can control how FPU exceptions are raised with numpy.seterr:
import numpy as np
a = np.random.randn(4)
a / 0                      # prints a warning, gives an array of +/- inf
np.seterr(divide='raise')  # make divide-by-zero raise instead of warn
a / 0                      # now raises a divide-by-zero FloatingPointError
Of course, if you're always careful about how NA values are encoded as numbers in other languages, you can avoid this kind of error, but R's first-class notion of NA makes it much easier.
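A small R illustration of what I mean:
x <- c(1, 2, NA, 4)       # NA is a first-class missing value, not a sentinel number
mean(x)                   # NA - missingness propagates by default
mean(x, na.rm = TRUE)     # 2.333333 - or is handled explicitly
is.na(x)                  # FALSE FALSE TRUE FALSE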
# read a csv file with headers into ram
data <- read.csv(file='blah.csv', header=T)
# compute a linear model with dependent variable income, explanatory variables gender, education, and ethnicity, with automatic creation of (n-1) indicator variables as appropriate for categorical data
model <- lm(income ~ gender + education + ethnicity + age, data=data)
# if instead, I want to use a glm family of models, this is also trivial..
# note this is nonsensical statistically with respect to my data, but I just thought of this off my head, and it demonstrates how easy it is to do sophisticated things in R
model.glm <- glm(income ~ gender + education + ethnicity + age, family=binomial(link='cloglog'), data=data)
Also, the data frame is the single best data structure I've ever used for manipulating tabular data. Finally, the excellent REPL makes interactive work and exploratory analysis in R absolutely awesome.
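For instance, a couple of one-liners of the kind I mean, reusing the columns from the example above (and assuming gender is coded "F"/"M"):
# rows and columns selected by name, no index bookkeeping
women <- data[data$gender == "F", c("income", "education")]
# mean income by education level, in one call
tapply(data$income, data$education, mean)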
(incidentally, if you are an MPAA operative, give your blockbusters just one very generic name... a lot easier than suing half the world, and just as annoying ;) )
"Incanter is a Clojure-based, R-like platform for statistical computing and graphics."
I think essentially no one using R professionally would accept Haskell or Clojure as a substitute. Having spoken to a few: Lisp-like syntax is "unreadable", and Haskell is too much effort.
Python has RPy2, so you can still access R's statistics libraries, plus http://scikits.appspot.com/statsmodels and http://pandas.sourceforge.net/ (data frames)
Plus, scipy has a decent stats library as well (random variables, etc.).
Still rough around the edges, but a good solution in my opinion.