
Ask HN: Big Data Performance Issues: R vs. Stata? - WhompingWindows
I have a healthcare data set with 3 million rows and 150 columns that I am analyzing partially in R (for descriptive stats, difference testing, table creation, visualizations) and partially in Stata (for bivariate and multivariate fixed-effects logit models).

My question is about the relative performance of R and Stata for logit models. Given the size of the data set, I experience quite slow computation times in both languages.

- R: I am using the tableone package, which I suspect isn't particularly fast, and I use dplyr and data.table for munging. If I wrote my own function for making tables, do you think it would be any faster? Could I also utilize data.table for a higher percentage of the coding process?

- Stata: Do I really need to do my regressions in Stata? I have heard they are much slower in R, especially given that our fixed-effects model has around 150 hospital sites to adjust each model for. Ideally, couldn't I do this within R so I don't have to pipe the data from Stata back into R? Or is there an easy way to export Stata output into R to then munge into tables?

Thanks for your help. I am stressed about this project and want to level up the automation. I just spent six hours hand-filling cells for a table and vowed never to do so again!
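A minimal sketch of scripting such a table end to end with tableone, assuming a data frame dat and hypothetical variable names:

    library(tableone)

    tab1 <- CreateTableOne(vars   = c("age", "sex", "length_of_stay"),
                           strata = "treatment_group",
                           data   = dat)

    # print() with printToggle = FALSE returns a character matrix that
    # can be written out directly instead of hand-filled:
    write.csv(print(tab1, printToggle = FALSE), "table1.csv")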
======
uptownfunk
Don’t know what your hardware is, but there are libraries that can do some
computations in parallel.

You’ll want to look into using a parallel backend in R. It will depend on the
system architecture.

As for libraries that may help: speedglm and biglm.

There are also some options from Bioconductor:
[https://bioconductor.org/packages/release/data/experiment/vi...](https://bioconductor.org/packages/release/data/experiment/vignettes/RegParallel/inst/doc/RegParallel.html)

You could also try glmnet with a parallel backend.
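A minimal sketch of that combination, assuming a data frame dat with hypothetical columns outcome, age, and site:

    # Register a parallel backend first; core count depends on your machine.
    library(doParallel)
    registerDoParallel(cores = 4)

    # speedglm fits the same logit as glm() but with faster linear algebra;
    # factor(site) adds the ~150 hospital-site dummies.
    library(speedglm)
    fit <- speedglm(outcome ~ age + factor(site),
                    data = dat, family = binomial(link = "logit"))

    # cv.glmnet can spread its cross-validation folds over the backend:
    library(glmnet)
    x <- model.matrix(outcome ~ age + factor(site), data = dat)[, -1]
    cvfit <- cv.glmnet(x, dat$outcome, family = "binomial", parallel = TRUE)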

------
butteredpopcorn
If you can spin up an open-source relational database like Postgres or MySQL
and load your data set, you could do everything but the models and viz
themselves in SQL. SQL will handle 3 million rows in a snap. If you are
transforming the contents of the cells, then R is potentially nicer, but if
you are mainly summarizing, grouping, subsetting, etc., SQL is fast and
readable, and you can “dump” a CSV that’s fast to read in R or Stata. Your R
or Stata code will be really short: just a load step, a viz / pretty-tables
step, and then a model-run step. It’ll be easy to swap out languages according
to the team/prof/project requirements.
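A minimal sketch of that workflow from the R side, assuming the data already sits in a Postgres table (all names hypothetical):

    library(DBI)

    con <- dbConnect(RPostgres::Postgres(), dbname = "healthcare")

    # Let the database do the grouping/summarizing; pull back only the
    # small result for tables and plots in R:
    site_summary <- dbGetQuery(con, "
      SELECT site, COUNT(*) AS n, AVG(age) AS mean_age
      FROM encounters
      GROUP BY site
    ")

    dbDisconnect(con)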

------
bckygldstn
data.table is typically much faster than dplyr or custom code for large
datasets. Figure out which parts of your preprocessing are taking the most
time, and try to rewrite those using data.table.
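A rough sketch of the pattern, with made-up column names:

    library(data.table)
    setDT(dat)  # converts in place, so the 3M-row table isn't copied

    # Group-wise summaries like this are where data.table tends to win:
    site_stats <- dat[, .(n = .N, mean_los = mean(length_of_stay)), by = site]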

R has some non-default packages for fast linear regression [0]. If your use
case is supported by one of those, then I'd imagine it would be faster than
Stata.

[0] https://stackoverflow.com/questions/25416413/is-there-a-faster-lm-function
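For instance, one alternative discussed in that thread is RcppEigen's fastLm; a hedged sketch with an illustrative formula:

    # fastLm solves the same least-squares problem as lm() via Eigen,
    # typically much faster on wide model matrices.
    library(RcppEigen)
    fit <- fastLm(outcome ~ age + factor(site), data = dat)
    summary(fit)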

