
Boost Your Data Munging with R - michaelsbradley
http://jangorecki.github.io/blog/2016-06-30/Boost-Your-Data-Munging-with-R.html
======
minimaxir
While data.table does have impressive benchmarks, the rise of dplyr in the
years since, with its _significantly_ more readable and concise syntax, makes
life a lot easier for data sets that are not super large.
([https://cran.rstudio.com/web/packages/dplyr/vignettes/introd...](https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html))

The vignette linked above describes some compatibility between dplyr syntax
and data.table, but I admit I have not tested that.
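A taste of that conciseness, in the style of the vignette's nycflights13
examples (the specific aggregation here is illustrative):

```r
library(dplyr)
library(nycflights13)

## Mean departure delay per carrier in January, as one readable pipeline.
flights %>%
  filter(month == 1) %>%
  group_by(carrier) %>%
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(desc(mean_delay))
```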

Nowadays, if you are working with multi-gigabyte datasets, it may be worth
looking into Spark/SparkR
([https://spark.apache.org/docs/latest/sparkr.html](https://spark.apache.org/docs/latest/sparkr.html))
for manipulating big data in a scalable manner instead; there is feature
parity with data.table, plus analytical tools (e.g. GLMs) as a bonus. (The
syntax is still messy, to my annoyance.)
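For a taste of that syntax, a grouped count in the style of the linked SparkR
docs (the `faithful` example is theirs; assumes a running Spark 2.x session):

```r
library(SparkR)
sparkR.session()              # start a local Spark session

df <- as.DataFrame(faithful)  # promote a local data.frame to a SparkDataFrame

## Grouped count, SparkR-style: note the verbose, non-piped call structure.
head(summarize(groupBy(df, df$waiting), count = n(df$waiting)))
```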

~~~
localhost
You can rent an r3.8xlarge instance with 244GB RAM on AWS for $2.66/hr. If
that isn't enough, an x1.32xlarge instance with ~2TB RAM is $13.34/hr.
Assuming your data is already in AWS, of course. Those machines also have a
crap ton of CPU power (I'm explicitly not saying cores because I'm mystified
by what vCPU really means), which you will be hard pressed to take advantage
of, but if it's RAM you need they're super cheap.

If you prefer Azure, a G5 instance with 448GB RAM is available for $9.65/hr.

Not a lot of need for Spark/SparkR for the vast majority of data sets given
the cheapness of compute these days.

~~~
michaelsbradley
The CPU power would not be too hard to utilize, if one's task is amenable to
fairly coarse-grained parallelization, which is rather easy to implement in R
scripts with the help of foreach[1] and doFuture[2] (see the sketch after the
references below).

Such an approach worked to great effect for me recently, when I needed to
perform a `zoo::rollapply`[3] across a time series with tens of millions of
rows. The speedup from throwing more cores (7, on my laptop) at it was roughly
linear. If I ever need to scale the analysis up to hundreds of millions of
rows, the 128 vCPUs of an x1 EC2 instance would be well worth the $$/hour.

[1]
[https://cran.r-project.org/web/packages/foreach/index.html](https://cran.r-project.org/web/packages/foreach/index.html)

[2]
[https://cran.r-project.org/web/packages/doFuture/index.html](https://cran.r-project.org/web/packages/doFuture/index.html)

[3]
[http://www.rdocumentation.org/topics/1505094](http://www.rdocumentation.org/topics/1505094)
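A minimal sketch of that pattern (not the commenter's actual code; the data,
window width, and chunking scheme are assumptions, with a right-aligned window
so the chunks stay independent):

```r
library(foreach)
library(doFuture)
library(zoo)

registerDoFuture()                 # use futures as the foreach backend
plan(multisession, workers = 7)    # e.g. 7 cores on a laptop

x     <- rnorm(1e6)                # stand-in for a long time series
width <- 50                        # rolling window width
f     <- mean                      # stand-in for a custom window function

## Split the output positions (width..length(x)) into one chunk per worker.
starts <- seq(width, length(x), by = ceiling((length(x) - width + 1) / 7))
ends   <- c(starts[-1] - 1, length(x))

res <- foreach(s = starts, e = ends, .combine = c) %dopar% {
  ## Each worker gets its slice plus the (width - 1) points of left
  ## context that its first window needs.
  rollapply(x[(s - width + 1):e], width, f, align = "right")
}
```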

~~~
huac
Shameless self-promotion: my `slide_apply` function shows a 5-10x improvement
over `zoo::rollapply` (despite making no Rcpp call, somehow...)

[https://gist.github.com/stillmatic/fadfd3269b900e1fd7ee](https://gist.github.com/stillmatic/fadfd3269b900e1fd7ee)

If your function has a well-defined rolling form, e.g. a rolling mean or
standard deviation, you should use a pre-optimized function for it. The
`caTools` package has a bunch of useful ones.
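For instance (a sketch; the data and window size are hypothetical):

```r
library(caTools)

x <- rnorm(1e6)

## Specialized, C-backed rolling statistics: much faster than a
## generic rollapply for these common cases.
rm50 <- runmean(x, k = 50)   # rolling mean over a window of 50
rs50 <- runsd(x, k = 50)     # rolling standard deviation
```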

~~~
michaelsbradley
Thanks! As it happens, my function doesn't conform to any of the pre-optimized
facilities, though I wish it did. At some point, I may look into writing an
optimized variant with the help of Rcpp, or even using the (inspiring!)
Fortran approach of quantmod/xts.

------
stewbrew
Do any of these support out-of-memory data? IIRC dplyr can use a database as a
backend, but I could be wrong. Anyway, I wonder why not just use a DB and SQL
for more demanding data munging?
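For the record, dplyr's database backend looks like this (a sketch with a
hypothetical SQLite file and table/column names):

```r
library(DBI)
library(dplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), "data.sqlite")  # hypothetical file

## tbl() builds a lazy reference; the pipeline below is translated to SQL
## and runs inside the database. Only collect() pulls results into memory.
tbl(con, "measurements") %>%
  group_by(sensor) %>%
  summarise(avg = mean(value, na.rm = TRUE)) %>%
  collect()
```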

~~~
gnufx
Presumably that depends at least on the type of data and whether you have a
suitable database to use. As another approach, for R there's an interface to
parallel NetCDF in the pbdR system that you might use for distributed
processing on an HPC system. There doesn't seem to be anything like pbdR for
Python, for instance.

------
hackaflocka
Understanding how to convert wide-format data to long format (a process called
'reshaping') is very important. Certain statistical tests cannot be performed
unless the data are in long format.

The article mentions reshape2. I also love tidyr. It's much easier to use than
reshape2 when dealing with a large number of variables (i.e., columns).

~~~
nerdponx
I find tidyr more frustrating than useful a lot of the time.

~~~
hackaflocka
Interesting, I'll have to watch out for it then. I found it more helpful than
reshape2 when converting a dataset with 55 variables from wide to long format,
where 8 of those variables needed to be collapsed into one. With reshape2, I
had to identify all the columns I didn't want in the final result; with tidyr,
I just needed to identify the ones I wanted. Hence it was easier with tidyr.
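A sketch of that difference on toy data (the column names are hypothetical,
not the actual 55-variable dataset):

```r
library(reshape2)
library(tidyr)

## Two columns to keep, three to collapse.
wide <- data.frame(id = 1:3, region = c("N", "S", "N"),
                   q1 = c(5, 3, 4), q2 = c(2, 4, 5), q3 = c(1, 1, 2))

## reshape2: enumerate every column you want to KEEP as id.vars ...
long_r2 <- melt(wide, id.vars = c("id", "region"),
                variable.name = "question", value.name = "score")

## ... tidyr: name only the columns you want to COLLAPSE.
long_tidy <- gather(wide, key = "question", value = "score", q1:q3)
```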

------
mikeskim
I went from barely using data.table to using it for basically everything
within a few years. I think this is the trend, given it's faster than
basically everything else:
[https://github.com/Rdatatable/data.table/wiki/Benchmarks-%3A...](https://github.com/Rdatatable/data.table/wiki/Benchmarks-%3A-Grouping)
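The flavor of the grouped aggregation those benchmarks measure (toy data for
illustration):

```r
library(data.table)

dt <- data.table(id = sample(1e5, 1e7, replace = TRUE), v = rnorm(1e7))

## Grouped mean and count in one pass; .N is the per-group row count.
dt[, .(mean_v = mean(v), n = .N), by = id]
```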

------
denzil_correa
In Python you have pandas, and Python is great for text processing in general.
Is there anything more that R provides for data munging?

~~~
sgt101
It's a bit lame, but I am so in love with RStudio... I can't think of a Python
IDE that comes close for data science. Simple things like being able to see
all the entities you've created in your session, and being able to jump back
and forth through plots, are really nice.

~~~
danielecook
Rodeo is a good start:

[http://blog.yhat.com/posts/rodeo-native.html](http://blog.yhat.com/posts/rodeo-native.html)

~~~
rz2k
Also a silly reason: for me, the text rendering in Rodeo on a Mac without a
Retina display is a lot worse than the text rendering in RStudio.

------
elchief
The only good solution to ETL/munging/wrangling is streaming. Loading
everything into memory is not a good idea.
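One way to get that streaming behavior in R is readr's chunked reader (a
sketch; the file name, columns, and aggregation are hypothetical):

```r
library(readr)
library(dplyr)

## Process a large CSV in 100k-row chunks, keeping only per-chunk
## aggregates in memory; readr row-binds the callback's results.
partials <- read_csv_chunked(
  "big.csv",                                   # hypothetical file
  DataFrameCallback$new(function(chunk, pos) {
    chunk %>% group_by(key) %>% summarise(n = n(), total = sum(value))
  }),
  chunk_size = 1e5
)

## Combine the per-chunk partial aggregates into final totals.
totals <- partials %>%
  group_by(key) %>%
  summarise(n = sum(n), total = sum(total))
```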

