
Faster R with FastR - nirvdrum
https://medium.com/graalvm/faster-r-with-fastr-4b8db0e0dceb
======
WhompingWindows
"Moreover, support for dplyr and data.table are on the way. "

Well, I can't really use it in my day-to-day work yet, since that almost always
involves cleaning and munging via one of those two packages. And it's not like
ggplot2 is where my R code is most delayed; usually I'm plotting aggregate data
or a much smaller analytical dataset, which needs far less speed. My hang-ups
are in the initial munging phases, where the data is still very large, which
often calls for data.table over dplyr due to the latter's much slower
performance.

~~~
ekianjo
Yeah, data.table already provides a significant speedup over dplyr - so much so
that dplyr's "better" syntax no longer makes sense when you have to deal with
very large datasets. But maybe FastR can somewhat change that?

~~~
baldfat
Wait, so the time difference in running your code outweighs the benefit of a
"better" syntax?

I spend hours cleaning up data and only have to run the code once (I normally
save the output to a feather file and then work with a separate file from there).

I still believe that the 'tidyverse' is hands down the best thing that has
happened to R and is the whole reason why R has grown so fast.

~~~
WhompingWindows
Sometimes it can take 12 or more hours to run the code on millions of
observations. There's also competition from other researchers for computational
resources, which can mean I have to leave something running for hours because
the server is heavily queried. My workflow also doesn't allow easy interruption
of the execution; sometimes it has to run to completion, even with a mistake in
it, before I can fix an error or change a parameter.

~~~
semi-extrinsic
I dunno how large your data set is, but I just set up a 16-core Threadripper
workstation for work with 32 GB RAM and a 1 TB M.2 SSD for approx. $2500. If it
can regularly save you hours or days of waiting, getting something equivalent
should be a no-brainer.

------
claytonjy
Maybe 3-4 years ago there was a big push to speed up R by replacing the
runtime; at least 3 competing replacements were talked about pretty actively.
None of them achieved much mindshare. R trades runtime speed for dev speed,
and we juice performance by writing the slow stuff in C++ and linking against
Intel's MKL. The RStudio folks are also making the low-level stuff faster and
more consistent through the r-lib family of packages, which are awesome.

Big barriers to adoption here: not a truly drop-in replacement, R people have
an aversion to Java (we've all spent hours debugging rJava; luckily most of
those packages have been rewritten in C++ now), and nobody likes Oracle.

I think the best-case scenario here is that progress on FastR pushes the
R-Core team to improve GNU-R.

~~~
truculent
I never fail to be amazed at all the work the RStudio team et al. do to push R
towards the wonderful programming language/environment it could be, rather
than what it has been.

~~~
digitalzombie
They recently added a terminal to RStudio. I'm so happy not to be switching
between two apps, iTerm2 and RStudio.

~~~
truculent
Yep. The Python support is starting to get pretty decent as well. I much
prefer Rmarkdown for R and Python (or both at the same time!), for example.

------
ellisv
This article compares FastR to GNU-R v3.4.0 -- but there were some important
changes in v3.5.0 (see
[http://blog.revolutionanalytics.com/2018/04/r-350.html](http://blog.revolutionanalytics.com/2018/04/r-350.html)).

I'm not even sure GNU-R is the most important comparison (although it is _an_
important comparison). How does it compare to R with Intel MKL? How does it
compare to other (faster) languages?

~~~
steve_s
FastR also uses native BLAS and LAPACK libraries. It should be possible to
link it with Intel MKL as well.

We didn't want to include a comparison to R-3.5.x, because FastR itself is
based on the base library of 3.4.0, but the results for GNU-R 3.5.1 are almost
the same as for R-3.4.0.

AFAIK ALTREP is not used that much yet inside GNU-R itself. They can now do
efficient integer sequences (i.e. 1:1000 does not allocate 1000 integers
unless necessary), which would save a little bit of memory in this example,
but that's about it. FastR also plans to implement the ALTREP interface for
packages. Internally, we've already been using things like compact sequences.
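
For a rough illustration of the compact representation in GNU-R 3.5+, here is a minimal sketch using the internal inspect helper:

    # In GNU-R 3.5+, 1:1000 is stored as a compact ALTREP sequence,
    # not as 1000 materialized integers.
    x <- 1:1000
    .Internal(inspect(x))    # reported as a compact integer sequence

    # Writing into the vector forces it to materialize as a full array.
    x[1] <- 5L
    .Internal(inspect(x))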

~~~
ellisv
This post does a comparison to 3.5.x (and Julia).

[https://nextjournal.com/sdanisch/fastr-benchmark](https://nextjournal.com/sdanisch/fastr-benchmark)

------
droelf
There is also the xtensor initiative, which aims to provide a unified backend
for array / statistical computations in C++ and makes it pretty easy to create
bindings for all the data science languages (R, Julia, and of course Python).
Usually, going to C++ provides a pretty sizeable speedup.

[https://github.com/QuantStack/xtensor-r](https://github.com/QuantStack/xtensor-r)
[https://github.com/QuantStack/xtensor](https://github.com/QuantStack/xtensor)

Disclaimer: I'm one of the core devs.

~~~
claytonjy
This is very interesting! Have you gotten any buy-in from the wider R
community? Is anyone rewriting their packages atop xtensor? Do R 3.5 and
ALTREP make such a transition any easier?

~~~
droelf
I actually can't tell, but it has not yet been significant. It takes quite a
bit of time to really get a library like this started. So far we've mostly
dealt with people who are using xtensor from C++ or binding it to Python.

We've mainly gone through Rcpp for the R language, and that has been working
great. I don't know about the changes in R 3.5 or ALTREP. Is there something
we should know/change for it?

------
lottin
I recommend watching this video - Making R run fast

[https://www.youtube.com/watch?v=HStF1RJOyxI](https://www.youtube.com/watch?v=HStF1RJOyxI)

It's a little disappointing, because the conclusion is that R will probably
never "run fast", but very interesting nonetheless.

~~~
nerdponx
Great talk, thank you.

------
truculent
At this point, the tidyverse packages probably cover >90% of my data analysis
workflow, so it'd be great to see all of those compatible with FastR. I'd
guess tidyr and dplyr would be the trickiest, and dplyr is already being
worked on!

Great work, thank you for sharing.

~~~
steve_s
FastR can actually run all the tests of the development version of dplyr with
a simple patch. We're working on removing the need for that patch altogether.

data.table is a different beast, and we will probably provide and maintain a
patched version for FastR. It does things like casting data from an internal R
structure to a byte array and then memcpy'ing it into another R structure.
This is very tricky to emulate if your data structures actually live on the
Java side and you're handing out only handles to the native code.

~~~
truculent
That's awesome! Personally, I don't use data.table much/at all, so (selfishly)
that's not an issue for me.

------
tofflos

      // Build a GraalVM polyglot context and evaluate an R function from Java.
      Context context = Context.newBuilder("R").allowAllAccess(true).build();
      Value rFunction = context.eval("R",
              "function(table) { " +
              "  table <- as.data.frame(table);" +
              "  cat('The whole data frame printed in R:\n');" +
              "  print(table);" +
              "  cat('---------\n\n');" +
              "  cat('Filter out users with ID > 2:\n');" +
              "  print(table[table$id > 2,]);" +
              "}");
      // Java objects can be passed straight into the R function.
      User[] data = getUsers();
      rFunction.execute(new UsersTable(data));

The example above combined with "JEP 326: Raw String Literals" and an IDE that
understands Java with embedded R code would be cool to play with.

------
ubiyubix
The thing I miss most in R is 64-bit integers. I am aware of the bit64
package, but I would prefer native support.

~~~
amelius
Can't you use floats with a large mantissa instead?

~~~
chrisseaton
That's going to be less than 64 bits of usable space, isn't it? I think the
largest integer you can fit in a float precisely is 53 bits.
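
A quick base-R check of that limit (and of what the bit64 package mentioned above gives you instead), as a minimal sketch:

    .Machine$double.digits                     # 53-bit significand for doubles
    2^53 == 2^53 + 1                           # TRUE: 2^53 + 1 is not representable

    # bit64 stores genuine 64-bit integers, so this parses exactly:
    bit64::as.integer64("9007199254740993")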

~~~
amelius
Yeah, but it's still better than a 32 bit integer, I suppose.

------
simondanisch
If anyone wants to reproduce the benchmarks, I put them into a reproducible
article and added a Julia baseline:
[https://nextjournal.com/sdanisch/fastr-benchmark](https://nextjournal.com/sdanisch/fastr-benchmark)

------
shelajev
The last graph is a bit hard to read with the log scale. It's 10x improvement
from GNU-R to FastR+rJava and another 10x with the native GraalVM interop.

------
lliamander
I've actually tried porting some existing R applications that are currently
run with RApache to Graal to try and get simpler deployment and better/more
consistent operational support. Unfortunately at the time the gsub() function
was broken, and that broke some of our core logic.

Hm... looks like the issue may have been fixed. I'll have to try again.

~~~
steve_s
Please open an issue on GitHub if you encounter any more problems with gsub or
anything else.

~~~
lliamander
Next time I try it, if it's still an issue I will report it.

Thanks!

------
nerdponx
It'd be great to have something like Numba for R, where you can write a
restricted subset of R and have it JIT-compiled to native code.

That, or something like Cython, where instead of writing inline C++ you
translate a restricted subset of R to C, which is then compiled.
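
For contrast, the inline-C++ escape hatch we already have via Rcpp looks roughly like this minimal sketch:

    # Inline C++ compiled on the fly and exposed to R as a function.
    library(Rcpp)

    cppFunction('
      double sum_squares(NumericVector x) {
        double total = 0;
        for (int i = 0; i < x.size(); ++i) total += x[i] * x[i];
        return total;
      }
    ')

    sum_squares(rnorm(1e6))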

~~~
thanatropism
That R is still around while not enjoying the wide array of benefits of
general-purpose programming languages is impressive. It must truly have pluses
that Python users don't even dream about.

E.g. can you quickly spin up a REST-like HTTP interface for your goods?

~~~
nerdponx
_E.g. can you quickly spin up a REST-like HTTP interface for your goods?_

On the contrary, it started life as a Bell Labs project called S, more or less
a math/stats DSL. R began as the GNU implementation of S, and it became one of
many competing "stats packages" you may or may not be familiar with: SAS,
Stata, SPSS, etc.

While it can be used for general purpose programming, its main advantage is
that it is still primarily a math, statistics, and data analysis DSL at heart.
The concept of a "data frame" (which you are familiar with if you've used
Pandas) as a data structure originated, as far as I can tell, in R. Data
frames are built into the language, and the language offers custom syntax
support for them.

Also, the standard library is full of high-quality statistics tools. Fitted
model objects have handsome, human-readable string representations. The
formula DSL is elegant and convenient. Manipulating data (replacing missing
values, etc.) is easy and relatively concise. Math and linear algebra are
similarly concise, and R links against BLAS so it's pretty fast. Plotting is
built into the language and it's pretty intuitive, even if the defaults aren't
that pretty. The language is also fully homoiconic and wildly dynamic,
allowing you to introspect and modify pretty much any chunk of code.
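
A tiny base-R sketch of those points (data-frame syntax, the formula DSL, readable model summaries, built-in plotting), with made-up variables just for illustration:

    # Data frames and the formula DSL are part of the base language.
    df <- data.frame(x = rnorm(100), group = sample(c("a", "b"), 100, replace = TRUE))
    df$y <- 2 * df$x + ifelse(df$group == "b", 1, 0) + rnorm(100)

    fit <- lm(y ~ x + group, data = df)   # formula DSL
    summary(fit)                          # human-readable fitted-model output

    plot(y ~ x, data = df)                # built-in plotting, formula syntax again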

And all that's just in the standard library. The package ecosystem is
downright enormous. You can write R packages in C/C++ just like in Python if
you need something to go fast, aided by Rcpp. There's Shiny, a self-contained
HTTP server for data-driven web applications. ggplot2 was a minor revolution
in elegant data visualization. The tidyverse package collection was similarly
mold-breaking by letting users write organic "data pipelines" instead of
imperative code. caret is at least as good as scikit-learn for general-purpose
machine learning. xts takes the pain out of time series manipulation and
modeling. data.table can efficiently join and subset billion-row datasets in
memory using indexes. The list goes on.
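
And a tiny sketch of that pipeline style (dplyr with the built-in mtcars dataset):

    library(dplyr)

    # A declarative "data pipeline" instead of imperative loops and temporaries.
    mtcars %>%
      filter(cyl == 4) %>%
      group_by(gear) %>%
      summarise(mean_mpg = mean(mpg))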

Long story short:

- domain-specific niceties
- batteries-included standard library that mimics features found in big monolithic stats packages
- has general-purpose programming capability
- extensible in C for speed
- built-in plotting that's not perfect but it's pretty good
- huge package ecosystem.

~~~
thanatropism
That's interesting. I used to be a professional user of Stata, really day-to-
day stuff, but I never saw R positioned as an alternative to Stata.

~~~
nerdponx
I only used Stata in school but that's how it turned out for me. "Why learn
Stata, SAS, or SPSS when I can just use R?" It made no sense to me (and still
doesn't, honestly).

------
ufo
Is there any information about how Graal+FastR do right now with respect to
memory usage and warm-up speed? Are these benchmarks for total wall time or
just post-warm-up speed?

~~~
steve_s
There is a plot of warm-up curves for this specific example. Search for "To
make the analysis of that benchmark complete, here is a plot with warm-up
curves".

However, it is true that warm-up and memory usage are something we need to
improve. We're working on providing a native image [1] of FastR. With that,
both warm-up and memory usage should get close to GNU-R.

[1] [https://www.graalvm.org/docs/reference-manual/aot-compilation/](https://www.graalvm.org/docs/reference-manual/aot-compilation/)

