Faster R with FastR (medium.com)
162 points by nirvdrum 10 days ago | 78 comments





"Moreover, support for dplyr and data.table are on the way. "

Well, I can't really use it in my day-to-day work, since that almost always involves cleaning and munging via one of those two packages. And it's not like ggplot2 is where my R code is most delayed; usually I'm working on aggregate data or a much smaller analytical dataset, which requires much less speed for plotting. My hang-ups are in the initial munging phases where the data is still very large, which often calls for data.table over dplyr due to the latter's much slower performance.


Yeah, data.table already provides a significant speedup over dplyr - so much so that the "better" syntax of dplyr no longer makes sense when you have to deal with very large datasets. But maybe FastR can change that somewhat?

Wait, so the difference in runtime is larger than the time you save by working with a "better" syntax?

I spend hours cleaning up data and only have to run the code once (I normally save the output to a feather and then work with a separate file from there).

I still believe that the 'tidyverse' is hands down the best thing that has happened to R and is the whole reason why R has grown so fast.


Sometimes it can take 12 or more hours to run the code on the millions of observations. There's also competition from other researchers for computational resources, which can mean I have to leave something running for hours because the server is heavily queried. My workflow also doesn't allow easy interruption of execution; sometimes a run has to finish completely, even with a mistake in it, before I can fix an error or change a parameter.

I would then say you're using the wrong tool for your problem? I can't imagine 12-hour runs. I would imagine Spark is a better bet, or is that not an option?

https://spark.apache.org/docs/latest/sparkr.html


I dunno how large your data set is, but I just set up a 16-core Threadripper workstation for work with 32 GB RAM and a 1 TB M.2 SSD for approx. $2500. If it can regularly save you hours or days of waiting, getting something equivalent should be a no-brainer.

How large are we talking? I haven't had any problems with dplyr performance as long as my data fits in main memory. (I have 16GB, so that means single digit GB data frames at most - I realize that doesn't qualify as "very large".) It does slow down considerably for larger data sets, but I assumed that that was because it was hitting the pagefile.

I have had the same experience with dplyr.

In the event that the data doesn't fit into memory, it's better to preprocess w/ SQL at the data-store level. There hasn't been a case where I'd need to feed massive amounts of data into a ggplot2 visualization unaggregated.
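
For what it's worth, dbplyr makes that pattern painless: the dplyr verbs are translated to SQL and only the aggregated result comes back into R. A minimal sketch (the connection and table names are made up):

  library(DBI)
  library(dplyr)
  library(dbplyr)

  # Hypothetical connection; swap in your own driver and credentials.
  con <- dbConnect(RPostgres::Postgres(), dbname = "analytics")

  # The group_by/summarise is translated to SQL and runs in the database;
  # only the small aggregated table is pulled into R for plotting.
  daily <- tbl(con, "events") %>%
    group_by(day, event_type) %>%
    summarise(n = n()) %>%
    collect()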


FastR doesn't alter the semantics of R, so when dplyr copies a vector in GNU-R then FastR has to copy it too. However, FastR does use reference counting (not sure if that's turned on in GNU-R 3.5.1 now) so it may avoid some unnecessary copies.

You can use dplyr syntax on data tables, usually with data table speed, especially if you load dtplyr.
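
Roughly, with dtplyr's lazy_dt() interface it looks like this (a sketch with made-up data; older dtplyr versions let you apply dplyr verbs to a data.table directly):

  library(data.table)
  library(dtplyr)
  library(dplyr)

  dt <- data.table(id = 1:5, value = rnorm(5))   # made-up data

  lazy_dt(dt) %>%                  # dplyr verbs get translated to data.table calls
    filter(id > 2) %>%
    summarise(mean_value = mean(value)) %>%
    as.data.table()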

How is dtplyr on the memory aspect? Doesn't it force dplyr-style deep copies?

dplyr does not do deep copies. See discussion in https://adv-r.hadley.nz/names-values.html#df-modify

Yeppo. Also, for myself at least in the geospatial realm, I need raster, rgdal, sp, sf, and parallel. The primary allure of R (imo) is the thousands of packages that allow you to quickly and easily implement whatever you want to do. Combine those with data.table and parLapply, and you're off to the races.
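
For example, that combination ends up looking something like this (the paths and the per-file summary are made up):

  library(parallel)

  cl <- makeCluster(detectCores() - 1)
  files <- list.files("rasters", pattern = "\\.tif$", full.names = TRUE)

  # Hypothetical per-raster summary, farmed out across cores
  means <- parLapply(cl, files, function(f) {
    r <- raster::raster(f)
    raster::cellStats(r, mean)
  })
  stopCluster(cl)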

Maybe 3-4 years ago there was a big push to speed up R by replacing the runtime; at least 3 competing replacements were talked about pretty actively. None of them achieved much mindshare. R trades runtime speed for dev speed, and we juice performance by writing the slow stuff in C++ and linking Intel's MKL. The RStudio folks are also making the low-level stuff faster and more consistent through the r-lib family of packages, which are awesome.

Big barriers to adoption here: not a truly drop-in replacement, R people have an aversion to Java (we've all spent hours debugging rJava; luckily most of those packages have been rewritten in C++ now), and nobody likes Oracle.

I think the best-case scenario here is that progress on FastR pushes the R-Core team to improve GNU-R.


I never fail to be amazed at all the work RStudio et al. do to push R towards the wonderful programming language/environment it could be, rather than what it has been.

They recently added a terminal to RStudio. I'm so happy not to have to switch between two apps, iTerm2 and RStudio.

Yep. The Python support is starting to get pretty decent as well. I much prefer R Markdown for R and Python (or both at the same time!).

I'm in the same boat, and would have gladly left R years ago if not for all their efforts

> R trades runtime speed for dev speed

This claim is made about a lot of things: Ruby, Python, etc. I think the important point is that there is no trade going on. It's just that these things are all slower / less efficient than they need to be.


Maybe that's true, but I think Julia is the first effort to prove that out in the numerical/statistical world, and while lovely, its ecosystem is far behind because of how much newer it is.

javascript showed that dynamically typed languages can be jitted well. It is just hard, and we spread our efforts over so many languages they don't all have the resources to do it.

What Julia showed is that if you carefully design the language with JIT in mind, the task is MUCH easier.

Julia gets very good performance without the massive manpower that has gone into JavaScript VMs.


SELF and Dylan were there first.

sure, but there's plenty of other reasons why JS isn't a contender in this interactive-data-analysis space

oh for sure, but for Python/R the barrier to speed isn't any of their important productivity features (as far as I know) but just a high quality compiler/JIT

If I was Lord Of Computing I wouldn't let languages out of beta until they had a high quality compiler or JIT. Turns out I am not though.


Is PyPy not high quality?

There's also Microsoft R Open (https://mran.microsoft.com/download), which I've found is faster than out-of-the-box R since it supports better multi-threading of commands.

IIRC most of that is because they use Intel's MKL and a better BLAS; if you like Docker, the Rocker containers use the better BLAS, and I think adding MKL isn't too hard either.
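
If you want to check what your build is actually using, recent R versions report the linked BLAS/LAPACK in sessionInfo(), and a big matrix multiply is a quick smoke test (a rough sketch, not a real benchmark):

  sessionInfo()            # shows the BLAS/LAPACK libraries in use (R >= 3.4)

  m <- matrix(rnorm(2000 * 2000), nrow = 2000)
  system.time(m %*% m)     # noticeably faster with MKL/OpenBLAS than the reference BLAS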

This article compares FastR to GNU-R v3.4.0 -- but there were some important changes in v3.5.0 (see http://blog.revolutionanalytics.com/2018/04/r-350.html).

I'm not even sure GNU-R is the most important comparison (although it is an important comparison). How does it compare to R with Intel MKL? How does it compare to other (faster) languages?


FastR also uses native BLAS and LAPACK libraries. It should be possible to link it with Intel MKL as well.

We didn't want to include a comparison to R 3.5.x, because FastR itself is based on the base library of 3.4.0, but the results for GNU-R 3.5.1 are almost the same as for R 3.4.0.

AFAIK ALTREP is not used that much yet inside GNU-R itself. They can now do efficient integer sequences (i.e. 1:1000 does not allocate 1000 integers unless necessary), which would save a little bit of memory in this example, but that's about it. FastR also plans to implement the ALTREP interface for packages. Internally, we've already been using things like compact sequences.
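
For the curious, you can see the compact representation in GNU-R 3.5 via the internal inspect utility (not part of the public API); something along these lines:

  x <- 1:1000
  .Internal(inspect(x))   # should show something like "1 : 1000 (compact)" -- no 1000 ints allocated

  x[500] <- 0L            # modifying forces the sequence to be materialised
  .Internal(inspect(x))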


This post does a comparison to 3.5.x (and Julia).

https://nextjournal.com/sdanisch/fastr-benchmark


There is also the xtensor initiative which aims to provide a unified backend for array / statistical computations in C++ and then makes it pretty easy to create bindings to all the data science languages (R, Julia and of course Python). Usually, going to C++ provides a pretty sizeable speedup.

https://github.com/QuantStack/xtensor-r https://github.com/QuantStack/xtensor

Disclaimer: I'm one of the core devs.


This is very interesting! Have you gotten any buy-in from the wider R community, is anyone rewriting their packages atop xtensor? Does R 3.5 and ALTREP make such a transition any easier?

I actually can't tell, but it has not yet been significant. It takes quite a bit of time to really get a library like this started. So far we've mostly dealt with people who are using xtensor from C++ or bind it to Python.

We've mainly gone through Rcpp for the R language, and that has been working great. I don't know about changes in R 3.5 or ALTREP. Is there something we should know/change for it?


I recommend watching this video - Making R run fast

https://www.youtube.com/watch?v=HStF1RJOyxI

It's a little disappointing, because the conclusion is that R will probably never "run fast", but very interesting nonetheless.


Great talk, thank you.

At this point, the tidyverse packages probably cover >90% of my data analysis workflow, so it'd be great to see all of those compatible with FastR. I'd guess tidyr and dplyr would be the trickiest, and dplyr is already being worked on!

Great work, thank you for sharing.


FastR can actually run all tests of the development version of dplyr with a simple patch. We're working on removing the need for that patch altogether.

data.table is a different beast and we will probably provide and maintain a patched version for FastR. They do things like casting data of an internal R structure to a byte array and then memcpy-ing it to another R structure. This is very tricky to emulate if your data structures actually live on the Java side and you're handing out only some handles to the native code.


That's awesome! Personally, I don't use data.table much/at all, so (selfishly) that's not an issue for me.

  // import org.graalvm.polyglot.Context;
  // import org.graalvm.polyglot.Value;
  Context ctx = Context.newBuilder("R").allowAllAccess(true).build();
  Value rFunction = ctx.eval("R",
          "function(table) { " +
          "  table <- as.data.frame(table);" +
          "  cat('The whole data frame printed in R:\\n');" +
          "  print(table);" +
          "  cat('---------\\n\\n');" +
          "  cat('Filter out users with ID>2:\\n');" +
          "  print(table[table$id > 2,]);" +
          "}");
  User[] data = getUsers();
  rFunction.execute(new UsersTable(data));
The example above combined with "JEP 326: Raw String Literals" and an IDE that understands Java with embedded R code would be cool to play with.

If anyone wants to reproduce the benchmarks, I put them into a reproducible article and added a Julia baseline: https://nextjournal.com/sdanisch/fastr-benchmark

The thing I miss most in R are 64 bit integers. I am aware of the bit64 package, but I would prefer native support.

This is true. Even if you manage to build 2-billion-plus matrices with bit64, I don't know of any modeling packages that can handle those objects.

Can't you use floats with a large mantissa instead?

That's going to be less than 64 bits of usable space, isn't it? The largest integer you can represent exactly in a double is 2^53 (53 bits of mantissa).

Yeah, but it's still better than a 32 bit integer, I suppose.
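
For reference, a quick sketch of where doubles stop being exact and what bit64 provides:

  .Machine$integer.max        # 2147483647, R's native 32-bit integer limit
  2^53 == 2^53 + 1            # TRUE: doubles can't distinguish integers beyond 2^53

  library(bit64)
  as.integer64("9007199254740993")   # 2^53 + 1, stored exactly as a 64-bit integer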

The last graph is a bit hard to read with the log scale. It's 10x improvement from GNU-R to FastR+rJava and another 10x with the native GraalVM interop.

I've actually tried porting some existing R applications that are currently run with RApache to Graal to try and get simpler deployment and better/more consistent operational support. Unfortunately at the time the gsub() function was broken, and that broke some of our core logic.

Hm... looks like the issue may have been fixed. I'll have to try again.


Please open an issue on GitHub if you encounter any more problems with gsub or anything else.

Next time I try it, if it's still an issue I will report it.

Thanks!


It'd be great to have something like Numba for R, where you can write a restricted subset of R and have it JIT compiled to native code.

That, or something like Cython where, instead of writing inline C++, you translate a restricted subset of R to C, which is then compiled.
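
For comparison, the closest existing workflow is writing the hot loop in C++ by hand via Rcpp; a tiny sketch (the function is just illustrative):

  library(Rcpp)

  # Inline C++ compiled on the fly -- the manual alternative to a Numba-style JIT
  cppFunction('
    double sum_sq(NumericVector x) {
      double total = 0;
      for (int i = 0; i < x.size(); ++i) total += x[i] * x[i];
      return total;
    }
  ')

  sum_sq(rnorm(1e6))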


I think you could get a lot by chopping out R's non-standard evaluation. It's described pretty well here:

http://adv-r.had.co.nz/Computing-on-the-language.html

Functions in R are not referentially transparent, so replacing an argument with its value does not necessarily give the same result. That is a clear restriction on optimizations. If you wanted to choose a restricted subset of R to speed up, this would be a good candidate to cut, since the standard unit of compilation is the function (Numba, Cython, and Julia all compile at the function level).


I'm not sure this is right; the NSE stuff tends to be at the shell, the user-facing API. The workhorse functions generally are referentially transparent, and writing pure functions is both natural and recommended in R. The slow parts are deeper than the NSE, so removing NSE wouldn't open up much room to optimize.

I suspect pass-by-value is a much bigger barrier to speed in R than non-standard evaluation.


Oh yes, I forgot about its pass-by-value. Removing pass-by-value is a double-edged sword though. I generally dislike it, but you have to admit that having everything pass-by-value is much simpler for a non-programmer. If you chop that out then the "fast R subset" suddenly can act very differently. In order to really write efficient code you'd want to start making use of mutation in this fast part. This means throwing a macro on some array-based R code won't really be automatic: it would need a bit of a rewrite for full speed, but the rewritten version would be incompatible with pass-by-value semantics. This is quite an interesting and tough problem to solve. I think it might be better to keep things pass-by-value and try to optimize pure functions.
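
For anyone curious, base R's tracemem() makes the copy-on-modify behaviour visible (a quick sketch):

  x <- runif(1e6)
  y <- x          # no copy yet: x and y point at the same vector
  tracemem(x)
  y[1] <- 0       # the first modification triggers a copy, and tracemem reports it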

What about copy-on-write semantics? Or is that not a big deal (since you can just "not do it").

That R is still around while not enjoying the wide array of benefits of general-purpose programming languages is impressive. It must truly have pluses that Python users don't even dream about.

E.g. can you quickly spin up a REST-like HTTP interface for your goods?


RStudio is pretty amazing for interactive statistical work. Also, a lot of open source developers tend to ignore Windows, but the less technical users are on Windows, so proper Windows support is a key win. R's CRAN has a very clean documentation system, and the setup for packages ensures that most things work on Windows (Windows CI is required). Also, its non-standard evaluation and associated metaprogramming is very integrated into the language, so you can build very intuitive APIs. Most users wouldn't know how to program what you just did, but that doesn't matter, since the workflow for the average R user is "package user", not "package developer". So while R does have quite a few downsides, there's a lot that other general-purpose programming languages can pull from it.
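
To give a flavour of that NSE style (a tiny sketch; columns are referenced as bare names and evaluated inside the data frame):

  subset(mtcars, mpg > 30, select = c(mpg, wt))    # base R

  library(dplyr)
  mtcars %>% filter(mpg > 30) %>% select(mpg, wt)  # same idea in the tidyverse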

> E.g. can you quickly spin up a REST-like HTTP interface for your goods?

On the contrary, it started life as a Bell Labs project called S, more or less a math/stats DSL. It was reimplemented as GNU R, and R became one of many competing "stats packages" you may or may not be familiar with: SAS, Stata, SPSS, etc.

While it can be used for general purpose programming, its main advantage is that it is still primarily a math, statistics, and data analysis DSL at heart. The concept of a "data frame" (which you are familiar with if you've used Pandas) as a data structure originated, as far as I can tell, in R. Data frames are built into the language, and the language offers custom syntax support for them.

Also, the standard library is full of high-quality statistics tools. Fitted model objects have handsome, human-readable string representations. The formula DSL is elegant and convenient. Manipulating data (replacing missing values, etc.) is easy and relatively concise. Math and linear algebra are similarly concise, and R links to BLAS, so it's pretty fast. Plotting is built into the language and it's pretty intuitive, even if the defaults aren't that pretty. The language is also fully homoiconic and wildly dynamic, allowing you to introspect and modify pretty much any chunk of code.
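
For instance, the everyday modeling workflow straight out of the standard library looks roughly like this (a small sketch):

  fit <- lm(mpg ~ wt + hp, data = mtcars)   # the formula DSL
  summary(fit)                              # readable coefficient table, R^2, etc.
  plot(mtcars$wt, mtcars$mpg)               # built-in plotting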

And all that's just in the standard library. The package ecosystem is downright enormous. You can write R packages in C/C++ just like in Python if you need something to go fast, aided by Rcpp. There's Shiny, which is a self-contained HTTP server for data-driven web applications. ggplot2 was a minor revolution in elegant data visualization. The tidyverse package collection was similarly mold-breaking by letting users write organic "data pipelines" instead of imperative code. Caret is at least as good as Scikit-learn for general-purpose machine learning. xts takes the pain out of time series manipulation and modeling. data.table can efficiently join and subset billion-row datasets in memory using indexes. The list goes on.

Long story short:

    - domain-specific niceties
    - batteries-included standard library that mimics features found in big monolithic stats packages
    - has general-purpose programming capability
    - extensible in C for speed
    - built-in plotting that's not perfect but it's pretty good
    - huge package ecosystem.

> Caret is at least as good as Scikit-learn for general-purpose machine learning

Oh how I wish this was true! Luckily RStudio hired the author of Caret to develop a family of smaller tidy modeling packages (https://github.com/tidymodels), and with recipes we're finally close to having something like sklearn's Pipelines, which IMO is one of the best parts of sklearn.


True, the pipeline is a great feature. I haven't used tidymodels yet but it looks like the start of a great ecosystem. I do remember seeing Broom at a talk a couple years ago and thought it was a nice idea.

That's interesting. I used to be a professional user of Stata, really day-to-day stuff; but I never saw R positioned as an alternative to Stata.

I only used Stata in school but that's how it turned out for me. "Why learn Stata, SAS, or SPSS when I can just use R?" It made no sense to me (and still doesn't, honestly).

Tons of former Stata users are now R users, especially over the last decade. Stata pretty much lives in Econ departments now.

> E.g. can you quickly spin up a REST-like HTTP interface for your goods?

With R? Why would you want to do that with R? R is not suitable as a web server. Maybe you can write a package for that using C. There are 13170 packages for R. In fact, 99% of R consists of packages. You don't sit down and write a web server in R.

R is used for statistical data analyses. I was using R to get the most frequently occurring error in Apache/PHP error logs, with only 2 lines of R code. https://cran.r-project.org/web/packages/ApacheLogProcessor/i...
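
I can't vouch for the ApacheLogProcessor API, but even in base R that kind of tally is only a few lines (the path and regex are illustrative):

  lines <- readLines("/var/log/apache2/error.log")
  msgs  <- sub("^(\\[[^]]*\\] )+", "", lines)       # strip the leading bracketed fields
  head(sort(table(msgs), decreasing = TRUE), 5)     # most common error messages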


I dunno, I was able to cobble together a time series forecast API using the plumber and forecast packages in an afternoon that a product team was then able to work against to create demos for customers. Yeah, they'd probably eventually want to rewrite the API to be "production ready." But on the other hand, for prototyping and getting to show something real to prospective customers? Pure dynamite.

Even then, if the stats being done in the background were hard to reimplement, I suppose plumber & R could still work with the right cloud / load balancing infra. Might end up being more expensive than it needs to be in the final iteration, but in the meantime money could be flowing in and customers gettin' happy.


The big pluses are the huge range of libraries that make developing analyses easier, faster, and more reproducible. Python has some fine libraries, but it's leagues behind what's available in R.

I use R like I use bash for neuroimaging analysis: I utilize a whole lot of powerful/specialized command-line tools (e.g. R's lmer, or neuroimaging's AFNI), the outputs and inputs of which I link together into a pipeline using R/Bash utilities.

Admittedly there are tools like nipype that use Python to create an interface for those different neuroimaging tools, but most of the time bash scripting works perfectly reasonably for this.

The article mentions that FastR supports GraalVM's polyglot mechanism. One possible option for your task is to do your data analysis with FastR and render it with Node on Graal.js or Sinatra on TruffleRuby. At first blush this might not sound all that different from CGI of yore, but the key thing is that all Truffle-based languages can optimize with one another. So, when your web server endpoint gets hot, Truffle's partial evaluation can inline nodes from FastR and JIT the whole thing with Graal.

You get to use the best language for the task at hand and don't have to worry about performance penalties for doing so.


In answer to your question -- my sense is that you can spin up super nice dashboards using shiny, and those will be opinionated HTTP interfaces. If you want to combine the flexibility of a bonafide web framework, and R shiny dashboards, you're going to have a rough time. R shiny itself has a pretty rough HTTP implementation built in.

So I'd say the answer is yes, and you'll have a good time as long as you only need the HTTP interface to do certain things (responsive dashboards, and do them well!).

Webserver implementations exist in R, but don't have near the time / attention put into them as with Python.


Yes, R has the now-RStudio-supported plumber package (https://github.com/trestletech/plumber), roughly flask for R.
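
A minimal plumber endpoint, for the curious (the file name and route are made up):

  # api.R
  #* Return the mean of a comma-separated list of numbers
  #* @get /mean
  function(values = "") {
    mean(as.numeric(strsplit(values, ",")[[1]]))
  }

  # in an R session:
  # plumber::plumb("api.R")$run(port = 8000)
  # curl "http://localhost:8000/mean?values=1,2,3"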

There's also opencpu (https://github.com/opencpu/opencpu), though the pros/cons of one vs the other has never been clear to me.


> roughly flask for R

Unfortunately, not even close to flask. Plumber doesn't even handle concurrency...

https://www.rplumber.io/docs/runtime.html#performance-reques...


FWIW this is on their roadmap, using the future-backed promises package which also powers async in Shiny: https://github.com/trestletech/plumber/pull/248

That's fair--I'm definitely looking forward to when it's added, since it will make putting up quick prototypes with R very convenient!

I've used plumber, and it's pretty easy to get started, though doesn't feel very polished. Handling multipart form data took some hackarounds with the underlying "Rook" package.

I'm curious to see how https://github.com/thomasp85/fiery performs and if anyone has used that. May be higher performance than plumber (re: concurrency) because I get the sense from the docs that it's closer to libuv.


Shameless plug:

Doing something like that is definitely possible; all the parts are there and work well. Shiny gives you a lot out of the box, is great for prototyping, and can be customized. I've been working on a less opinionated package that isn't ready for anything, but gives an idea of what would be possible:

https://github.com/bobjansen/mattR


Is there any information about how Graal+FastR are right now with respect to memory usage and warm-up speeds? Are these benchmarks for total wall time or just the post-warm-up speed?

There is a plot of warm-up curves for this specific example. Search for "To make the analysis of that benchmark complete, here is a plot with warm-up curves".

However, it is true that the warm-up and memory usage are something we need to improve. We're working on providing a native image [1] of FastR. With that, both the warm-up and memory usage should get close to GNU-R.

[1] https://www.graalvm.org/docs/reference-manual/aot-compilatio...



