
R: the good parts - urlwolf
http://hackerretreat.com/r-good-parts/
======
x0x0
I love R, and I think that the insight people often overlook for R's success
is pretty simple: the easy things are easy. Doing hard things in R can be very
hard, but the easy things are easy. Eg loading a csv full of data and running
a regression on it are two lines of code that are pretty easy to explain to
people:

    
    
       $ R
       data <- read.csv(file='some_file', header=T, sep=',')
       model <- lm(Y ~ COL1 + COL2 + COL3, data=data)
    

and if you want to use glm -- logistic regression, etc -- it's a trivial
change:

    
    
       model <- glm(Y ~ COL1 + COL2 + COL3, family=binomial, data=data)
    

It really allows people to do quite powerful statistical analyses very simply.
Note that they built a DSL for specifying regression equations -- and you
don't have to bother with bullshit like quoting column names; quoting
requirements are often hard to explain to new computer users.

R's other key feature is that it includes a full SQL-like data manipulation
language for tabular data; it's so good that every other language that does
stats copied it. If df is a dataframe, I can issue predicates on the rows
before the comma and on the columns after the comma, eg

    
    
       df[ df$col1 < 3 & df$col2 > 7, 'col4']
    

that takes my dataframe and subsets it: the row predicates before the comma
keep only rows where col1 is less than 3 and col2 is greater than 7, and the
column selection after the comma returns just col4 from that subset. It's
incredibly powerful and fast.
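For readers who prefer named arguments, base R's subset() does the same thing. A small sketch with a made-up data frame:

```r
# Toy data frame, invented for illustration.
df <- data.frame(col1 = c(1, 2, 5),
                 col2 = c(9, 8, 1),
                 col4 = c("a", "b", "c"))

# Bracket form from above: row predicates before the comma, columns after.
df[df$col1 < 3 & df$col2 > 7, "col4"]

# subset() form -- same filter, unquoted column names here too.
subset(df, col1 < 3 & col2 > 7, select = col4)
```

One wrinkle: selecting a single column with brackets drops the result to a vector, while subset() keeps a one-column data frame.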

~~~
showerst
I think of this as the statistics version of PHP's

<?

$paragraph = 'hello world';

echo '<p>'.$paragraph.'</p>';

?>

It makes basic tasks super easy, which allows people who don't know what
they're doing to make mistakes. It's just a different philosophy, one that has
some negatives.

~~~
guard-of-terra
The difference, perhaps, is that PHP code faces the whole hostile internet,
which is eagerly waiting for you to make a single mistake so it can own you.

R code usually isn't.

------
urlwolf
There are people who think that there's elegance in R's design. Remember that
it is as old as C (if not older!), but it still feels like a modern language
(warts and all). You wouldn't compare C to, say, Clojure on language features;
together they show the advances in language design over the years.

R says: 'everything is a vector, and vectors can have missing values'. This is
profound. It was only recently that other matrix-oriented language extensions
(say, pandas) got missing values, even though they are meat-and-potatoes for
data analysis.
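A quick base-R sketch of what that buys you: NA propagates by default and is handled explicitly when you ask.

```r
x <- c(1, 2, NA, 4)

mean(x)                # NA -- a missing value poisons the result by default
mean(x, na.rm = TRUE)  # mean of the observed values only
is.na(x)               # logical mask: FALSE FALSE TRUE FALSE
```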

~~~
bokchoi
SAS has missing values as well and goes further -- there are multiple types of
missing values (ie, "not available", "doesn't pass quality control review",
etc.) We added missing values to our product to interop with both R and SAS:

[https://www.labkey.org/wiki/home/LabKey%20Server%20Documenta...](https://www.labkey.org/wiki/home/LabKey%20Server%20Documentation/page.view?name=manageMissing)

(Pardon the extremely old screenshots.)

~~~
hadley
In SAS, beware that missing values are treated as the smallest possible value
(e.g. -Inf). This means that a statement like x < 10 returns true if x is
missing.
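R takes the opposite stance: a comparison against a missing value is itself missing, so nothing silently passes a filter. A minimal sketch:

```r
NA < 10                          # NA, not TRUE -- the answer is unknown
sum(c(1, NA) < 10)               # NA: the total is unknown too
sum(c(1, NA) < 10, na.rm = TRUE) # 1: drop the missing value explicitly
```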

~~~
lsiebert
Oh god yes, and it will bite you if you don't know it. That takes me back to
cleaning SAS data.

------
JasonCEC
My startup [1] does flavor profiling and statistical quality control for beer
and bourbon producers - it's a fun job!

Our entire back-end is built in R, mostly within the Hadley-verse, and we use
Shiny [2] as our web framework.

Our team works a bit differently than most, I suspect; our data-scientists
build features and analysis directly in R, and then add the functionality to
our Shiny Platform. Our "real devs" are all server + DB, or Android guys. This
has created a great development system where all of the "cool findings" and
'awesome visualizations' are immediately implemented in our system, and made
available for our clients!

[1] www.Gastrograph.com [2]
[http://shiny.rstudio.com/](http://shiny.rstudio.com/)

---- EDIT ----

Edited to add: R is a great language and is 100% suitable for production
systems. It's older than Python(!) and, with some experience, can be made
into high-performance code.

~~~
reyan
How do you deal with interoperability? Do your Android clients use your Shiny
frontend over HTTP? Have you used Rserve?

~~~
JasonCEC
All of our APIs are built in R using Rserve, and we share code and functions
through an internal package hosted on Bitbucket.

You can call R functions on a server from Android and return the results in a
list or array - makes for some really cool internal API magic between our devs
and data-scientists :)

------
Fomite
While I'm not as fond of ggplot2 as the author is, and actually prefer base
graphics when making things for publication, I think he hits on a lot of
strong points.

I'm rather fond of R as a language, and hop between it and Python as my
preferred tool for a given task. I think the package ecosystem is its biggest
plus - for statistical work, Python _might_ have a package to do something;
R almost certainly will.

~~~
bertil
I think opinion on ggplot2 is unanimous: you can do so much more with it --
but Lord Almighty, what an expense of time it is to do anything specific!
People like it or not depending on how they value their free time.

I always check in detail what newbies are trying to do before I mention that
name (it used to be surprisingly hard to come across it randomly), because
once I have, it's a rabbit hole -- and they generally have weird ideas that
need no more than several single-dimension graphs, easily done with hist()
and plot(); however, anything a little subtle benefits so much from that
flexibility.

I still don’t understand why bucketed log-scale for histogram and properly
typed percentage scales (i.e. “10%” and not “0.1”) are so hard to do, but I
love impressing the one guy who tried by showing those casually.

~~~
bosie
isn't it simply scale_y_log10(labels=percent_format())?
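For the common case it is; a minimal sketch (the data is made up, and it assumes the ggplot2 and scales packages are installed):

```r
library(ggplot2)
library(scales)

# Hypothetical data: a share measured at ten points.
d <- data.frame(x = 1:10, share = (1:10) / 100)

ggplot(d, aes(x, share)) +
  geom_line() +
  scale_y_log10(labels = percent_format())  # log axis, "10%"-style labels
```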

~~~
bertil
That’s one case; it doesn’t work well with user-defined buckets, has an
unexplainable tendency to shift to ‘10.00%‘ when there is no room to do so,
and works with only some of ggplot2’s many wonderful graphs… But yes, that
one, when it works, is generally great.

~~~
hadley
It should work with any plot that has a y range greater than 0. ggplot2 can't
do anything to fix the fact that log(x) for x <= 0 is undefined.

------
sveme
Great intro, just a minor nitpick to not spread confusion: Julia does actually
have named arguments:

[http://docs.julialang.org/en/latest/manual/functions/#keyword-arguments](http://docs.julialang.org/en/latest/manual/functions/#keyword-arguments)

------
minimaxir
I hadn't used data.table or plyr because the native R functions were giving me
good performance even at tens of thousands of rows.

But now that I'm doing analysis on hundreds of thousands of rows, aggregation
takes a while. This article convinced me to give those packages a try. If
data.table and plyr aggregate functions are indeed parallelizable, that's a
big deal, especially when implementing bootstrap resampling.

~~~
jasonpbecker
You should know that dplyr, plyr's replacement, is already fairly stable and
worth using. For basic tasks it's as fast as or faster than data.table in my
experience, with the caveat that it is more likely to copy for some methods
than data.table, which is very strict about this.

~~~
peatmoss
And being able to combine it with an RDBMS means that you can potentially do
plyr-ish things on datasets that can't fit in memory. I think the SQL
generated by dplyr tries to be smart / efficient too.
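A sketch of what that looks like, assuming an SQLite file flights.db containing a flights table (both names hypothetical); dplyr builds the SQL lazily and only fetches results when you collect():

```r
library(dplyr)

db      <- src_sqlite("flights.db")  # connect to an on-disk database
flights <- tbl(db, "flights")        # a lazy reference, not an in-memory copy

flights %>%
  group_by(carrier) %>%
  summarise(n = n(), mean_delay = mean(dep_delay)) %>%
  collect()  # only the small summary crosses into R's memory
```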

------
baldfat
I still think that R has more users than Python with pandas, BUT the
perception is that pandas is bigger and better.

I started with pandas and learned R. I find that R is just better, and if R
isn't right then Julia or Clojure will do the work.

The tools in R are just better and more varied.

~~~
jimmar
I've done some social science statistical analysis in R. I tried to reproduce
the analysis using Python over the weekend, but the tools just aren't there.
For example, doing a within-subjects ANOVA in R is maybe 3 lines of code. No
native functions exist to do it in Python. Structural equation modelling? It
doesn't exist in Python. I'm fairly sure that going forward, people will
implement common social science statistical procedures in Python, but I need
them today.
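The within-subjects ANOVA really is about three lines of base R; a sketch with made-up data, one row per subject-by-condition measurement:

```r
# Hypothetical repeated-measures data: 10 subjects, 3 conditions each.
d <- data.frame(
  subject   = factor(rep(1:10, each = 3)),
  condition = factor(rep(c("a", "b", "c"), times = 10)),
  score     = rnorm(30)
)

# Error(subject/condition) declares the within-subjects error strata.
fit <- aov(score ~ condition + Error(subject/condition), data = d)
summary(fit)
```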

~~~
peatmoss
I think Python may be closer to ready if you're doing Bayesian stats. Going
through some quantitative coursework for a phd in a social science, my
experience has been that for frequentist stats, R has everything and Python
has a tiny fraction.

On the other hand, I'm taking a Bayesian course now, and am thinking that I
could probably do the whole works in Python with little effort. That said, I'm
not sure doing things in Python would actually buy me anything over R. If I
were to do something other than R it would probably be something like Julia or
Clojure, but that would also be more for my amusement rather than for any
practical reason.

------
Myrmornis
Random R gripe: it's hard to reuse code cleanly because it lacks a nice
import system like Python's or Haskell's. Related: making a package is
complicated (or was, last time I looked).

------
bernardom
Does anyone know of a good explanation of how plyr, dplyr, data.table and
*apply functions differ? I'd love to read an in-depth analysis of each and
make an informed decision on which one to use going forward.

My current m.o. is to use data.frames as needed and plyr if I need to do any
serious manipulation (which means that every time I use plyr, I need to read
the docs). There's a lot of benefit to picking one direction and sticking with
it...

~~~
x0x0
In R, the SQL GROUP BY operation -- ie take a group of things, split them on
an identifier, run a function on each group with a distinct identifier, and
collect the results -- is called tapply. tapply has the signature tapply(X,
INDEX, FUN, ...). Here X is the data to be split, INDEX is the identifier,
and FUN is the function. There are some limitations, including that X must be
a vector.
a code example:

    
    
       > df <- data.frame(id=c('a','a','a', 'b', 'b'), vals=rnorm(5,10))
       > df
         id     vals
       1  a 10.86507
       2  a 10.71303
       3  a 11.15321
       4  b 10.78187
       5  b 10.80042
       
       > # calculate a mean on vals grouped by id
       > tapply(df$vals, df$id, mean)
              a        b 
       10.91044 10.79114
       > # similarly a median -- both mean and median are built in functions 
       > tapply(df$vals, df$id, median)
              a        b 
       10.86507 10.79114 
       > 
       > # now let's build our own function; I'm going to build a function that drops outliers
       > f <- function(xs){ qs <- quantile(xs, probs=c(0.025, 0.975)); mean( xs[xs >= qs[1] & xs <= qs[2]])}
       > tapply(df$vals, df$id, f)
              a        b 
       10.86507      NaN 
       > 
       > # well, this is just a demo and both of b's values got trimmed as outliers, so it got NaN, but you see the idea
       >
       > # and plyr
       > library(plyr)
       > f2 <- function(dfs){ qs <- quantile(dfs$vals, probs=c(0.025, 0.975)); mean(dfs[ dfs$vals >= qs[1] & dfs$vals <= qs[2], 'vals'])}
       > ddply(df, .(id), f2)
         id       V1
       1  a 10.86507
       2  b      NaN
    
    
    

plyr relaxes that limitation, that is, X can be a data frame itself -- which
brings the huge benefit that your group by logic can operate over more than
one column. The function signature changes, but that's the basic innovation.
The first letter indicates what X is, and the second letter indicates the
output. Thus ddply runs an enhanced tapply over a data frame input (first d)
and collects the output into a data frame (the second d). It also offers a
bunch of nice enhancements; it's really a solid bit of work.

data.table, otoh, removes some of the speed problems with the built in
dataframes. It offers keys/indices for quick lookup.
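A small sketch of the key mechanism (column names invented; requires the data.table package):

```r
library(data.table)

dt <- data.table(id = c("a", "b", "c"), val = 1:3)
setkey(dt, id)            # sorts by id and marks it as the key

dt["b"]                   # keyed lookup: binary search, not a full scan
dt[, mean(val), by = id]  # grouped aggregation on the same structure
```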

I haven't spent much time with dplyr, but I think it does a couple of things:
(1) it moves the plyr code from R to C for performance reasons; and (2) it
lets you write plyr operations (with all code written in R) and then
translates most of that to SQL that can run against remote dbs (for the
obvious reason that plyr/group by produce summary stats, which can be orders
of magnitude smaller than the source data, so pulling all of it into R only
to immediately discard most of it sucks).

~~~
craigching
> I haven't spent much time with dplyr, but I think it does a couple things

One of the cooler features in dplyr is the '%.%' operator, which allows you to
chain operations. So you can write something like this in dplyr:

    
    
      Batting %.%
        group_by(playerID) %.%
        summarise(total = sum(G)) %.%
        arrange(desc(total)) %.%
        head(5)
    

which is very readable. That example stolen shamelessly from [1] ;)

[1] -- [http://blog.rstudio.org/2014/01/17/introducing-dplyr/](http://blog.rstudio.org/2014/01/17/introducing-dplyr/)

------
Malarkey73
I agree with most of what they say here - except about data.table. I much
prefer using data.table structures with dplyr, a far simpler, more familiar
syntax. Indeed data.table - the object implementation - is brilliant and
should just replace data.frame... if that were possible.

~~~
oddthink
The real problem with data.table is that it has too much magic. It's great
for interactive use, but it's hard to program against.

If, for example, you want to aggregate a table using a list of supplied
variable names, like if you wanted to abstract out some aggregation code, you
need to descend into horrible quote/substitute/etc. hackery to make it work.
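One partial escape hatch: for the common aggregate-by-named-columns case, a character by= plus .SDcols avoids the quote()/substitute() dance. A sketch (the function and column names are made up; requires data.table):

```r
library(data.table)

# Aggregate value_col by group_cols, both passed as plain strings.
agg_by <- function(dt, group_cols, value_col) {
  dt[, lapply(.SD, mean), by = group_cols, .SDcols = value_col]
}

dt <- data.table(g = c("a", "a", "b"), x = c(1, 2, 10))
agg_by(dt, "g", "x")
```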

------
kachnuv_ocasek
Does anyone else have issues viewing the page in Chrome 33? This is what I
see: [http://puu.sh/7ZAfb.png](http://puu.sh/7ZAfb.png)

~~~
urlwolf
Wow! Sorry about that! Could you give more details about your setup? I tested
it under Chromium 31 and Chrome Canary 36.0. Anything helps.

