
The R learning curve - sonabinu
http://datagrad.blogspot.com/2013/02/the-r-learning-curve.html
======
abraxasz
I've had to learn R during my master's (in statistics), and found the process
very unpleasant.

There are different ways in which a language can be difficult to learn.
Haskell has a steep learning curve because it has a lot of unusual concepts
(for those who come from an imperative background) and requires a shift in the
way we think about code. J is also difficult to learn because the syntax is
crazy, and again, the language is very different from what we're used to.

I found R difficult to learn because it seems inconsistent to me. Now one
reason might be that I picked up bits of R here and there without taking the
time to learn the syntax from a book (unlike what I did for other languages),
but it took me quite some time to be able to make sense of the different
structures (vector, list, matrix, dataframe), their differences, and in
particular how functions operate differently on these structures. I also have
a deep hatred for the system of attributes (why would anyone want to give
attributes to a vector...), and find the indexing system (especially for
lists) to be nonsensical. In fact, I think that lists themselves are terrible
to work with.
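The list and matrix indexing quirks described above can be seen in a few lines of base R (a minimal illustration with made-up toy objects, nothing library-specific):

```r
# Single vs. double brackets on a list: one returns a sub-list,
# the other returns the element itself.
l <- list(a = 1, b = "x")
class(l[1])    # "list"
class(l[[1]])  # "numeric"

# Matrix subsetting silently drops dimensions unless told not to.
m <- matrix(1:6, nrow = 2)
dim(m[1, ])               # NULL -- a single row collapses to a plain vector
dim(m[1, , drop = FALSE]) # 1 3  -- keeping the matrix shape needs an extra flag
```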

The general impression I've had learning R is that the language is not
coherent and systematic in its design in the way other languages are (I know
python, c, common lisp, ...), and I find myself spending a lot of time in the
interpreter simply trying out things because I'm not sure that it will return
what I want, in the format that I want (which never happens to me in any other
language).

Now don't get me wrong, I use R daily and it's a useful tool. It has lots of
great libraries, including the fantastic ggplot2, and the remarkable Rcpp
(best C++ interface I've ever seen... ok, I haven't seen any other, but this one
is really great). But learning it was no fun, and if the statistics community
decided to move to a cleaner language, I'd definitely be running ahead.

~~~
EzGraphs
In R, I can often write a dozen-line program that will do something
incredible, but it might take an hour to hammer out those lines, and I cannot
reconstruct them from memory because of inconsistencies in the syntax. It sort
of feels like a statistics DSL built on Lisp.

~~~
pseut
R is _exactly like_ a statistics DSL built on Lisp. The first sentences of a
talk by Ihaka [1] are:

> _R began as an experiment in trying to use the methods of Lisp implementors
> to build a small testbed which could be used to trial some ideas on how a
> statistical environment might be built. Early on, the decision was made to
> use an S-like syntax. Once that decision was made, the move toward being
> more and more like S has been irresistible._

Since I find it basically impossible to remember how 'eval', 'quote',
'substitute', etc. work in R, I suspect that the Lispers are onto something
when they say that the lack of syntax in Lisp is an important feature.
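For what it's worth, the basics of those three can be shown in a few lines (a sketch of base-R behaviour, not of any particular use case):

```r
# quote() captures an expression without evaluating it; eval() runs it later.
e <- quote(x + y)
x <- 1; y <- 2
eval(e)  # 3

# substitute() inside a function grabs the expression the caller passed in,
# which is how R functions get away with unquoted arguments.
f <- function(arg) substitute(arg)
f(a * b)  # the unevaluated call a * b, even though a and b don't exist
```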

[1]
[http://www.stat.auckland.ac.nz/%7Eihaka/downloads/Interface9...](http://www.stat.auckland.ac.nz/%7Eihaka/downloads/Interface98.pdf)

------
RockyMcNuts
A few resources that might speed the learning curve

Survival guide

<http://www.win-vector.com/blog/2009/09/survive-r/>

Tutorials

<http://www.statmethods.net/index.html>

<http://heather.cs.ucdavis.edu/~matloff/r.old.html>

<http://www.twotorials.com/>

<http://en.wikibooks.org/wiki/R_Programming>

<http://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf>

Docs

<http://cran.r-project.org/manuals.html>

StackOverflow

[http://stackoverflow.com/questions/tagged/r?sort=votes&p...](http://stackoverflow.com/questions/tagged/r?sort=votes&pagesize=15)

Cheat sheets

<http://cran.r-project.org/doc/contrib/Short-refcard.pdf>

<http://bit.ly/VHPA3G>

R Journal <http://journal.r-project.org/>

R News (predecessor to R Journal) <http://cran.r-project.org/doc/Rnews/>

Rseek search engine <http://rseek.org/>

R could use something like the Python ecosystem intro, explaining where to
find stuff, not sure if any of the tutorials are as canonical as for instance
Dive Into Python or Learn Python The Hard Way.

~~~
pseut
> _R could use something like the Python ecosystem intro, explaining where to
> find stuff, not sure if any of the tutorials are as canonical as for
> instance Dive Into Python or Learn Python The Hard Way._

It only became popular recently; the expectation used to be that you'd learn R
while enrolled in a statistics class. The canonical references are books:
Modern Applied Statistics with S, R Graphics, etc. I assume that will change
soon.

~~~
RockyMcNuts
Would be great if the CRAN package repositories had ratings and download
counts.

It can be hard to find the 'right way' to do something.

------
billwilliams
R is my mistress and my wife.

Hadley Wickham held the ceremony.

Data.Table is the glue that holds together our tenuous marriage.

cats[color=="brown",summary(tooth_length),by=list(breed,age)]. This line of
code will efficiently grab all the brown cats in your data and summarize their
tooth length by mean, median and other quartiles, then break it out by the
cat's breed and age.

Every time I use a language where I can't lapply, I despise the language and
all those involved.

R is hugely inconsistent and terribly inefficient. But it's also the most
irresistible blend of lispy and object oriented for when you actually need to
get data analysis done.

------
EzGraphs
I spent about a year blogging about R as I learned it
(<http://www.r-chart.com/>). It is not a general purpose sort of language but
is great for math / charting tasks. A couple of ways that software-developer
types can minimize the time it takes to get up and running:

1) Check out <http://www.r-bloggers.com/>. The site gives you a good sense of
the state of R and Tal (who runs the site) has done a great job of promoting R
and encouraging the community.

2) Pick one graphics package and stick with it. The standard functions are
sufficient but ggplot2 has a more finished appearance.

3) Review available libraries. If you want to do something, someone else
likely already has and has posted a package to CRAN.

4) One way for SQL developers to limit the need to learn the idiosyncrasies
of the language is to use the sqldf package to manipulate data frames and use
ggplot2 (which generally takes data frames as arguments) to display charts.
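Tip 4 in practice might look something like this (a hypothetical sketch assuming the sqldf and ggplot2 packages are installed; mtcars is one of R's built-in demo data frames):

```r
library(sqldf)
library(ggplot2)

# Aggregate with plain SQL instead of R's apply/aggregate idioms...
agg <- sqldf("SELECT cyl, AVG(mpg) AS avg_mpg FROM mtcars GROUP BY cyl")

# ...then hand the resulting data frame straight to ggplot2.
ggplot(agg, aes(x = factor(cyl), y = avg_mpg)) +
  geom_bar(stat = "identity")
```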

------
Pitarou
This sounds like the experience of a beginning programmer learning pretty-much
any popular scripting language.

> _I found that regular expressions ... are a great way to isolate the string
> that one is looking for..._

Well I never!

I'm not criticising the guy who wrote this, but I don't know why this has been
voted up on Hacker News.

~~~
Schwolop
To be honest, I found the entire coursera course on programming for data
analytics to be in the same vein. So much time was spent explaining what each
optional parameter to a function did that I skipped at least 50% of each
lecture.

~~~
bcbrown
Same here; I gave up. I'd rather read about the syntax. Save the lectures for
concepts that are hard to get across in text. I thought part of the problem
was that the course was geared towards grad students for whom this might be
their first programming language, as opposed to programmers who already know a
couple of languages.

~~~
groovy2shoes
I've been considering dropping the class as well. The course definitely seems
like an R class for statisticians, whereas I am a programmer hoping to learn
more about statistics. I'm trying to stick with it but I don't feel as though
the course has been beneficial thus far.

~~~
Schwolop
I just skimmed through all the videos and then moved on to
<https://class.coursera.org/dataanalysis-001> instead. It's from the same
university, but it's much more "here are the statistical concepts, and I'll
show you an implementation in R" rather than "here's the syntax for the R
command that solves this particular riddle." Much better value for time!

------
zmmmmm
R is probably the most frustrating language I've ever had to learn. For me, R
doesn't so much have a learning "curve" as a learning "gradient". After a
long, long time the rate of frustration is not really leveling off much. Each
time I have to negotiate a new library or module that I'm not familiar with,
I'm guaranteed hours of frustration as I try to grapple with what all the data
types are, how they are subtly overridden to behave differently from the
normal data types they inherit from, and what things are going to be magically
coerced into what other things without me realising it.
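The silent coercion being described is easy to trip over even in base R (toy vectors, nothing library-specific):

```r
# c() quietly promotes everything to the most general type present.
v <- c(1, 2, TRUE)     # the logical TRUE becomes numeric 1
w <- c(1, 2, "three")  # the numbers become character strings
class(v)  # "numeric"
class(w)  # "character"
```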

Even after 2 years of involvement with it, I regularly meet problems that
frustrate me for hours, because they are hard to express in R's functional
style. Even when I figure out the final solution, it often performs very
poorly. In between these frustrations R behaves almost magically. It
manipulates huge data sets with an ease and simplicity that defies logic. I've
eventually come to the conclusion that an important aspect of R is knowing
when NOT to use it. If your task is inherently stateful, involves random
access to lots of sparse relational data that is associated in complex ways
and with complicated logic -- R is going to make your life hell. Write a quick
script to pre-process your data, get things into something more like a straight
data-table form, and then proceed with R to analyse and visualize it. This is my
experience, anyway!

------
banachtarski
3 packages:

plyr, ggplot2, reshape

Learn those 3 and use them for everything you do in R.

~~~
jasonpbecker
plyr, while it has awesome syntax, is really slow for anything beyond even a
modest dataset. Try using it with 50k rows, not to mention true "big data".
data".

data.table is much faster. It's an extension to data.frames that adds some
additional constraints/rules that allow for much faster operations, including
aggregating, subsetting, and merging data.
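A sketch of the style being described (assumes the data.table package is installed; the built-in mtcars data frame is used as a stand-in dataset):

```r
library(data.table)
dt <- as.data.table(mtcars)

# Subset, aggregate, and group in a single bracket call --
# the indexed internals are what keep this fast as the data grows.
res <- dt[cyl > 4, list(mean_mpg = mean(mpg)), by = gear]
res
```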

Cleaning data, however, does not have a steep learning curve or high
difficulty level in R-- it has a steep learning curve and high difficulty
level period. Implementing good procedures for data munging is 80% of the job.

~~~
EzGraphs
Yeah - for small data sets, Ruby or Python is often easier to use
to whip a data set into a form that can simply be slurped into R as a
dataframe.

~~~
aheilbut
With pandas / statsmodels / patsy, it's getting easier to stick with python
for everything.

~~~
GrumpySimon
Don't forget Bokeh[1] - a replacement for ggplot2

[1] <https://github.com/ContinuumIO/Bokeh>

------
mkhattab
> Cleaning the data - This takes time and it can be annoying.

Have you tried using Google Refine[1]? It's an excellent tool for cleaning
datasets and a useful addition to your workflow. I was miffed I didn't know
about it when I had to collect and analyze data from surveys a while ago. I was
using Python for cleanup (Google Refine supports Jython).

[1]: <http://code.google.com/p/google-refine/>

~~~
jasonpbecker
Google Refine does some really cool stuff, but one of the great developments
in the last couple of years within the R community is a big push toward
reproducible research and building a powerful toolchain toward that end.

The problem with Google Refine is that it is very hard to recreate precisely
the same steps to get from raw to processed data.

------
xijuan
I'm taking the exact same course and going through the same process of learning!

------
pvaldes
It is useful, and yes, when something doesn't work in your code it can be very
frustrating... but no more so than in other languages. It is frustrating in its
own complex ways.

The documentation, for example, is often very abstract. Not to forget the
difficulty of the underlying statistics for a lot of people. Some good books
can help with this.

The wow factor of R is very high in any case. There are a lot of things
that I can currently do only with R, so... keep hammering _and_ look at as
much R code as you can

------
dnc
Maybe it's just me, but I find it difficult not to make analogies between R
and JavaScript. Both are dynamic "script" languages with C-based syntax and
a Lisp-like nature lurking underneath. Both have functions as first-class
citizens. Both are inconsistent, in different ways. Just to name a few
... I think these analogies reduced the frustration I had felt about R
once I "discovered" them, and helped me adopt and learn R.

------
mbq
My personal observation (after 6 years of using and 3 years of teaching R) is
that people misunderstand R by skipping the array paradigm part. If you don't
get it, others' code looks like magic and either your code is slow and ugly or
you seek rescue in "more-magic" wrappers like sqldf, plyr or data.table.
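A toy illustration of that paradigm (summing the even numbers in a vector, written both ways):

```r
x <- 1:100

# Loop style -- works, but slow and un-idiomatic in R:
s <- 0
for (v in x) if (v %% 2 == 0) s <- s + v

# Array style -- a logical mask selects the elements, sum() does the rest:
s2 <- sum(x[x %% 2 == 0])
s == s2  # TRUE, and the one-liner is far faster on large vectors
```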

It's a bit like JavaScript; if you treat JS like "Java script" or "C without
types", you'll suffer. Exploit its true nature and it will flourish.

~~~
shared4you
I disagree. I've used Matlab for 4 years and switched to Numpy/Scipy 2 years
ago -- so I understand the array paradigm part. The problem with R is the
inconsistency and verbosity in syntax and semantics of the array language
underlying it -- as other comments have pointed out.

Some differences in ease of use:

[1] Construct a matrix

MATLAB: y = [1 1; 2 2; 6 6];

R: y <- matrix(c(1, 2, 6, 1, 2, 6), 3)

[2] Insert a new row r = [3 3];

MATLAB: x = [y; r];

R: x <- rbind(y, c(r));

Which is intuitive and concise? ;)

~~~
mbq
So you also think that

    
    
      mean(iris(iris(:,5)==2,2)) 
    

is more intuitive than

    
    
      mean(iris[iris$Species=="versicolor","Sepal.Width"])?
    

Or that constructing matrix by mapping f on 1:10 like

    
    
      M=[];
      for not_i=1:10
        M(:,not_i)=f(not_i);
      end
    

is more concise than sapply(1:10,f)? (I know there is arrayfun, but I have
never seen it used except in wow-MATLAB-is-functional blog posts)

~~~
psb217
Or, you could just write:

    
    
        M = repmat(f(1:10), n, 1);
    

where 'n' gives the number of rows you want in 'M' and 'f' is written in
proper "Matlab" style (i.e. behaves reasonably when given an array as input).
Or, to be more in the spirit of linear algebra, you could write:

    
    
        M = ones(n,1) * f(1:10);
    

And, if you only want one row-wise copy, you could (succinctly) write:

    
    
        M = f(1:10);
    

Or, as you suggested, you could write something like:

    
    
        M = repmat(arrayfun(@(x)f(x), 1:10), n, 1);
    

Or, getting more silly, and using the handy bsxfun, you could write:

    
    
        M = bsxfun(@times, ones(n,10), f(1:10));
    

If you don't feel like implementing 'f' so as to permit array inputs, you
could modify this to:

    
    
        M = bsxfun(@times, ones(n,10), arrayfun(@(x)f(x), 1:10));
    

Anyhow, Matlab is very productive if you can effectively wield its powerful
built-ins.

~~~
mbq
You didn't get my example -- it was about how to make a matrix from the
results of a non-vectorised function that takes one number and returns a
vector (say, it performs a complex simulation).

The problem you've solved has an equally simple implementation in R: f(1:10)
for a single copy, matrix(f(1:10),10,n) for n columns,
matrix(f(1:10),n,10,byrow=T) for n rows, etc.
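To make the earlier sapply(1:10, f) example concrete (f here is a made-up stand-in for the "complex simulation"):

```r
# A non-vectorised function that takes one number and returns a vector.
f <- function(i) c(i, i^2, i^3)

# sapply maps f over 1:10 and, because every result has the same length,
# assembles the results into a matrix, one column per input.
M <- sapply(1:10, f)
dim(M)  # 3 10
```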

------
wiradikusuma
Don't forget the free <http://tryr.codeschool.com/> from Code School and
O'Reilly.

------
grn
Wouldn't it be better to have a Python statistics library instead of creating
a new language? When I looked at the documentation of R, it looked very
convoluted, and it seems my feeling is confirmed by other comments in this
thread.

~~~
shared4you
Of course, these days Pandas is _the_ Python library for statistics. There is
also scikit-learn.

<http://pandas.pydata.org/>

~~~
grn
I haven't used R or Pandas. How do the two compare? Why do people use R and
not Python?

~~~
reyan
> Why do people use R and not Python?

R-style DataFrames were the raison d'être of Pandas in the first place.
(<http://pandas.pydata.org/#why-not-r>)

------
gklitt
I'm currently learning R and was hoping for a more detailed blog post. I found
this to be a pretty bland report that would have been made much more
interesting and useful if the author had included some concrete examples.

------
sonabinu
I've compiled all the resources from the comments here
[http://datagrad.blogspot.com/2013/02/some-learning-
resources...](http://datagrad.blogspot.com/2013/02/some-learning-resources-
for-r.html)

------
andrewcooke
for learning more about plotting, the ggplot2 book is a tad expensive, but
worthwhile. <http://www.amazon.com/gp/product/0387981403> (it helped me
understand the structure behind the library, which i found confusing from
cookbooks etc)

~~~
jasonpbecker
I have found this site (<http://www.cookbook-r.com/>) to be more helpful
(plus the online documentation for ggplot2, which is really good).

Hadley's book is really good and a solid treatment (from the author of so many
clear, powerful packages, it's not a surprise), but there have been a lot of
changes to ggplot2 since 2009. I have personally found ggplot2 to be so
powerful because it lends itself very well to actually learning through using
Cookbook-style examples.

~~~
hadley
The cookbook is now an O'Reilly book: <http://amzn.com/1449316956>

------
capkutay
I don't know why a programmer would limit themselves to static data tools
when d3 and processing provide beautiful, dynamic, and interactive solutions
with a similar learning curve. The only difference is you can use the
JavaScript chops you pick up learning d3 in any front end work you do.

~~~
pseut
Sounds great! What's the MCMC package for javascript?

~~~
disgruntledphd2
I worked with a computer scientist once, and he tested out lots of languages
to replace his VB setup (I know, I was surprised too). JS turned out to be
the fastest language for a lot of his tests. If the libraries ever get there,
it will be lethal. Presumably there might be a way to compile Java libraries
to JS using Rhino, but it's not something I've ever investigated.

