

R for the JVM: 60% complete, help wanted! - bedatadriven
http://code.google.com/p/renjin/wiki/PlanOfAttack

======
epistasis
The primary reason to use R is the large number of stats packages, which are
written in a mixture of C, Fortran, and R. I don't see mention of JNI
anywhere, but hooking into Fortran and C seems like the most difficult but
most essential piece of making this useful.

~~~
Wilduck
I agree. The reason I'll drop into R instead of python is often that I need to
do some complicated analysis like adaptive multivariate integration over
hypercubes (<http://cran.r-project.org/web/packages/cubature/index.html>) or
general maximum pseudolikelihood estimation for multistage stratified,
cluster-sampled, unequally weighted survey samples
(<http://cran.r-project.org/web/packages/survey/index.html>).

There are currently > 3500 packages in CRAN, many of which can be found
nowhere else. Being able to access java libraries is great, but I can access
java libraries from all sorts of different languages. Until it is seamless to
install and use these CRAN packages, I'm sticking with the standard
interpreter.

~~~
bedatadriven
Yeah, there are some killer packages that won't be able to run on Renjin
without porting their C sources to java/R. That's a bummer.

But others -- like the survey package -- are pure R and run well on renjin.
(Just do library(survey, lib.loc='/path/to/R/library') )

The central goal at this point is to support embedding packages like 'survey'
in web apps or larger java apps. If you're looking for a seamless user
experience for ad-hoc analysis, we're still quite a ways off!

~~~
aardvark179
Looking at that project page, is there a reason you went for an interpreter
rather than compiling to JVM byte code directly? Obviously it's a little
harder if you're targeting JVMs <v7 as you don't have invokeDynamic to play
with, but most of that work will be common to an interpreter or compiler. On
the project I'm working on we haven't even implemented an interpreter, the
REPL simply compiles to class files that are then loaded to evaluate them.

~~~
bedatadriven
Mostly inexperience :-)

There are a few features of the R-language that make direct translation
into byte code daunting for a muggle like myself:

1. Computing-on-the-language: R code expects to be able to access and modify
the AST and frame of itself, its caller, and other closures.
2. Impure call-by-need argument-passing semantics.
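To make the second point concrete, here is a minimal Java sketch of call-by-need with memoisation. It is only an illustration, not Renjin's implementation: `Promise`, `CallByNeedDemo`, and `pick` are hypothetical names, and the sketch assumes Java 8+ for lambdas. The counter shows the "impure" part: the side effect of an argument happens only if and when the callee forces it, and at most once.

```java
import java.util.function.Supplier;

/** Hypothetical promise: a delayed argument that is evaluated at most once. */
final class Promise<T> {
    private Supplier<T> thunk; // the unevaluated expression
    private T value;           // memoised result once forced
    private boolean forced;

    Promise(Supplier<T> thunk) {
        this.thunk = thunk;
    }

    T force() {
        if (!forced) {
            value = thunk.get(); // side effects of the argument happen here
            forced = true;
            thunk = null;        // drop the captured environment
        }
        return value;
    }

    boolean isForced() {
        return forced;
    }
}

final class CallByNeedDemo {
    static int evaluations = 0;

    // The callee, not the caller, decides which arguments get evaluated.
    static int pick(boolean useFirst, Promise<Integer> a, Promise<Integer> b) {
        return useFirst ? a.force() + a.force() : b.force();
    }

    public static void main(String[] args) {
        Promise<Integer> a = new Promise<>(() -> { evaluations++; return 21; });
        Promise<Integer> b = new Promise<>(() -> {
            throw new IllegalStateException("never forced");
        });
        System.out.println(pick(true, a, b)); // a is forced twice, evaluated once
        System.out.println(evaluations);
    }
}
```

A compiler has to preserve exactly this ordering, which is why the arguments cannot simply be evaluated eagerly at the call site.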

The compiler that's in the trunk is experimental but evolving fast. I think
the next steps will probably be to start compiling simple but
performance-critical basic blocks to byte code at runtime, and then slowly
expand the scope of language that can be handled from there... (Expert advice
welcome!!)

~~~
aardvark179
Ah right, those are going to make things fun. :-) I think in your position I'd
write a compiler, but keep the AST associated with the byte code and back off
to an interpreter if the AST is modified, maybe recompile after enough calls
without further change. The call-by-need arguments don't sound too bad, but
could take some thought on the memoisation strategy, I think I'd do it at the
caller and pass those structures through, but I'd want to think about it.
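A rough Java sketch of that "compile, but back off to the interpreter when the AST changes" idea, under heavy assumptions: the "AST" is reduced to a single mutable constant, the "compiled code" is a lambda standing in for generated byte code, and `AdaptiveFunction` and `RECOMPILE_THRESHOLD` are made-up names.

```java
import java.util.function.IntUnaryOperator;

/** Hypothetical function whose body is the toy AST "x + constant". */
final class AdaptiveFunction {
    static final int RECOMPILE_THRESHOLD = 3; // interpreted calls before recompiling

    private int constant;              // the (mutable) AST
    private IntUnaryOperator compiled; // stand-in for byte code; null when stale
    private int callsSinceChange;
    int compilations = 0;

    AdaptiveFunction(int constant) {
        this.constant = constant;
        compile();
    }

    /** Computing on the language: modifying the AST invalidates compiled code. */
    void setConstant(int newConstant) {
        constant = newConstant;
        compiled = null;
        callsSinceChange = 0;
    }

    int call(int x) {
        if (compiled != null) {
            return compiled.applyAsInt(x); // fast path
        }
        int result = interpret(x);         // slow path: walk the AST
        if (++callsSinceChange >= RECOMPILE_THRESHOLD) {
            compile();                     // AST has been stable long enough
        }
        return result;
    }

    boolean isCompiled() {
        return compiled != null;
    }

    private int interpret(int x) {
        return x + constant;
    }

    private void compile() {
        final int c = constant; // snapshot of the AST at compile time
        compiled = v -> v + c;
        compilations++;
    }
}
```

The point of the sketch is only the state machine: compiled code is trusted until the AST is touched, and recompilation is deferred until the AST has stayed unchanged for a few calls.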

I'd hardly count myself as an expert, but I think the best win we've had is in
thinking carefully about callsite caching strategies and having a eureka
moment about just how insanely powerful MethodHandles.exactInvoker can be.
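For readers who haven't met it, a small sketch of what `MethodHandles.exactInvoker` gives you, assuming Java 7+. `ExactInvokerDemo`, `lookupTarget`, `showNumber`, and `showText` are hypothetical names for illustration. The invoker takes a method handle as its first argument and calls it; folding a "selector" into it yields a handle that picks its target at run time. Caching is deliberately left out here: a real call site would typically install the selected target behind a `guardWithTest` in a `MutableCallSite` rather than look it up on every call.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

final class ExactInvokerDemo {
    // Every target at this call site has the shape (Object) -> String.
    static final MethodType CALL_TYPE = MethodType.methodType(String.class, Object.class);
    static int lookups = 0;

    static String showNumber(Object x) { return "number:" + x; }
    static String showText(Object x) { return "text:" + x; }

    // Slow path: choose a target from the runtime class of the argument.
    static MethodHandle lookupTarget(Object x) throws ReflectiveOperationException {
        lookups++;
        String name = (x instanceof Number) ? "showNumber" : "showText";
        return MethodHandles.lookup().findStatic(ExactInvokerDemo.class, name, CALL_TYPE);
    }

    static MethodHandle buildDispatcher() {
        try {
            // invoker has type (MethodHandle, Object) -> String
            MethodHandle invoker = MethodHandles.exactInvoker(CALL_TYPE);
            // selector has type (Object) -> MethodHandle
            MethodHandle selector = MethodHandles.lookup().findStatic(
                    ExactInvokerDemo.class, "lookupTarget",
                    MethodType.methodType(MethodHandle.class, Object.class));
            // result has type (Object) -> String: select a target, then invoke it
            return MethodHandles.foldArguments(invoker, selector);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    static String call(MethodHandle dispatcher, Object arg) {
        try {
            return (String) dispatcher.invokeExact(arg);
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }

    public static void main(String[] args) {
        MethodHandle dispatcher = buildDispatcher();
        System.out.println(call(dispatcher, 42));
        System.out.println(call(dispatcher, "hi"));
    }
}
```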

------
jacobquick
I'm not looking to move things to the jvm anymore, I can't trust oracle with
the future of stuff I may need to update and maintain.

------
Nrsolis
What about Incanter? It seems like a perfect opportunity to gather two
languages trying to do the same thing into a single JVM-focused project.

~~~
S4M
Incanter is quite nice and I really like Clojure (disclaimer: I am still a
beginner), but it has far fewer libraries than R. I also think the creators of
R did a really good job in making seamless the installation of a package
(install.packages(...)) and having lots of functions pretty well documented,
so a statistician who is not a programmer can easily do his work and quickly
come up with results. AFAIK this is still unmatched anywhere else.

~~~
_delirium
Perhaps as importantly, R has significant buy-in in the statistics community,
so a paper on a new technique will often be accompanied by an R package
implementing it; in fact several journals explicitly prefer R packages for
accompanying code, because the reviewers are likely to be familiar with how to
use it. That's partly due to its semi-continuity with Bell Labs S
(<http://en.wikipedia.org/wiki/S_(programming_language)>), I believe, which
was a language designed by-statisticians-for-statisticians. It's fairly hard
to replicate that; it would require considerable effort to migrate the whole
community to a new consensus environment.

------
bigbird
Looks like a very interesting project.

Does the JVM environment have any tools that would facilitate building a
better R debugger? That's one area of the R ecosystem that could use a serious
upgrade IMO.

~~~
bedatadriven
Eclipse is a pretty good framework for building IDEs; there's already a set of
plugins for R (StatET) that support debugging with the original interpreter.

One of the projects for 2012 is to integrate Renjin into StatET, including a
line-by-line debugger. Any takers?

------
zeratul
bedatadriven: R native code is usually slow and always memory hungry.
Nonetheless, running R on Google AppEngine is very tempting. Could you give us
some idea of what the memory usage looks like with Renjin compared to any
other R distribution? Here is an example of how to measure memory:

<http://heuristically.wordpress.com/2010/01/04/r-memory-usage-statistics-variable/>

~~~
bedatadriven
Re: performance of R language code, Renjin is a bit faster there than R2.X,
and should get faster the more we get into byte code. (Though renjin is
currently slower in other areas like giant matrix manipulation)

As for memory usage, I believe object.size() will double-count your input data
when it is referenced by the resulting model objects. Better to check
memory.profile()

At present, Renjin benefits from the JVM's state-of-the-art garbage
collection, so you may already see some improvements, but I expect the
big difference will be once we roll out non-memory-backed stores for R
Vectors. Then your input data could be stored in a database and only partially
loaded into memory as needed.
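
To illustrate what a non-memory-backed vector could look like, here is a minimal Java sketch. It is an assumption-laden illustration, not Renjin's design: `DoubleVector`, `ChunkedVector`, and the chunk `loader` function are hypothetical, and the loader stands in for whatever would actually fetch a block of values from a database.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

/** Hypothetical read-only view of a numeric vector. */
interface DoubleVector {
    int length();
    double get(int index);
}

/** Loads fixed-size chunks on demand instead of holding all data in memory. */
final class ChunkedVector implements DoubleVector {
    private final int length;
    private final int chunkSize;
    private final IntFunction<double[]> loader; // fetches chunk k from the store
    private final Map<Integer, double[]> cache = new HashMap<>();
    int loads = 0;

    ChunkedVector(int length, int chunkSize, IntFunction<double[]> loader) {
        this.length = length;
        this.chunkSize = chunkSize;
        this.loader = loader;
    }

    @Override
    public int length() {
        return length;
    }

    @Override
    public double get(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        int chunk = index / chunkSize;
        double[] data = cache.get(chunk);
        if (data == null) {
            data = loader.apply(chunk); // only now is this chunk read
            loads++;
            cache.put(chunk, data);
        }
        return data[index % chunkSize];
    }
}
```

Code that only touches part of the vector then pays only for the chunks it reads. A real store would also need an eviction policy for the cache, which this sketch omits.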

