The R learning curve (datagrad.blogspot.com)
58 points by sonabinu on Feb 8, 2013 | 71 comments



I've had to learn R during my master's (in statistics), and found the process very unpleasant.

There are different ways in which a language can be difficult to learn. Haskell has a steep learning curve because it has a lot of unusual concepts (for those who come from an imperative background) and requires a shift in the way we think about code. J is also difficult to learn because the syntax is crazy, and again, the language is very different from what we're used to.

I found R difficult to learn because it seems inconsistent to me. Now, one reason might be that I picked up bits of R here and there without taking the time to learn the syntax from a book (unlike what I did for other languages), but it took me quite some time to be able to make sense of the different structures (vector, list, matrix, dataframe), their differences, and in particular how functions operate differently on these structures. I also have a deep hatred for the system of attributes (why would anyone want to give attributes to a vector...), and find the indexing system (especially for lists) to be nonsensical. In fact, I think that lists themselves are terrible to work with.
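
To make that concrete, here is the kind of thing I mean (a quick sketch; all standard behaviour you can check at the prompt):

    x <- list(a = 1:3, b = "hello")
    x[1]     # a list of length one containing the vector 1:3
    x[[1]]   # the vector 1:3 itself
    x$a      # same as x[[1]], but only with a literal name

    m <- matrix(1:6, nrow = 2)
    m[1, ]                 # first row, silently dropped to a plain vector
    m[1, , drop = FALSE]   # first row kept as a 1x3 matrix

    d <- data.frame(a = 1:3, b = 4:6)
    d[1]     # a one-column data frame (list-style indexing)
    d[, 1]   # the first column as a plain vector (matrix-style indexing)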

The general impression that I've had learning R is that the language is not coherent and systematic in its design in the way other languages are (I know Python, C, Common Lisp, ...), and I find myself spending a lot of time in the interpreter simply trying things out because I'm not sure it will return what I want, in the format that I want (which never happens to me in any other language).

Now don't get me wrong, I use R daily and it's a useful tool. It has lots of great libraries, including the fantastic ggplot2, and the remarkable Rcpp (best C++ interface I've ever seen.. ok I haven't seen any other, but this one is really great). But learning it was no fun, and if the statistics community decided to move to a cleaner language, I'd definitely be running ahead..


In R, I can often write a dozen-line program that will do something incredible, but it might take an hour to hammer out those lines, and I cannot reconstruct them from memory because of inconsistencies in the syntax. It sort of feels like a statistics DSL built on Lisp.


R is exactly like a statistics DSL built on Lisp. The first sentences of a talk by Ihaka [1] are:

> R began as an experiment in trying to use the methods of Lisp implementors to build a small testbed which could be used to trial some ideas on how a statistical environment might be built. Early on, the decision was made to use an S-like syntax. Once that decision was made, the move toward being more and more like S has been irresistible.

Since I find it basically impossible to remember how 'eval', 'quote', 'substitute', etc. work in R, I suspect that the Lispers are onto something when they say that the lack of syntax in Lisp is an important feature.
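
(For the record, the basics go something like this; it's their interaction in real code that never sticks:)

    e <- quote(x + y)             # an unevaluated call
    eval(e, list(x = 1, y = 2))   # evaluates to 3

    f <- function(arg) substitute(arg)   # captures the caller's expression
    f(a * b)                             # returns the call a * b, unevaluated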

[1] http://www.stat.auckland.ac.nz/%7Eihaka/downloads/Interface9...


R makes the hard things easy and the easy things hard.


> I found R difficult to learn because it seems inconsistent to me. Now, one reason might be that I picked up bits of R here and there without taking the time to learn the syntax from a book (unlike what I did for other languages), but it took me quite some time to be able to make sense of the different structures (vector, list, matrix, dataframe), their differences, and in particular how functions operate differently on these structures.

If it makes you feel any better, I've read several R books and they normally present it in an inconsistent manner as well. I've written programs in some 30 languages over the last 13 years, and I've yet to encounter one that's as difficult to pick up as R. Materials that purport to teach R are, in my experience, almost always presented as a set of recipes for very specific statistical methods. That approach stands in stark contrast to traditional programming introductions that attempt to teach general concepts rather than specific how-tos.

I agree, though, that despite the learning curve, R is rather useful.


I'd appreciate any comments on https://github.com/hadley/devtools/wiki - it's my attempt to teach R like a programming language, focussing on cross-cutting concerns and general concepts. It's still a work in progress, so your feedback can help make it better.


I read it cover to cover (so to speak) about two weeks ago, and although still a work in progress, it is indeed the best written guide to R that I've read to date. It appealed to me because it insisted on the functional programming side of R, and explained some of the intricacies and advanced concepts in R. It's pretty dense though and will certainly require a number of re-reads before I internalize all of the material.

One thing I would find very useful is a case study on how to design more complex packages. In particular, I would love to have an executive summary of the inner workings of ggplot2 and/or ddply: how can I pass formulas as arguments to my functions, how can I achieve the pseudo-DSL effect of ggplot2, etc. I know, I could read the source on github, but ggplot2 is pretty big, so a summary would be helpful.
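
For the formula part, I at least understand the basic mechanism -- formulas are ordinary objects that you can pass around and hand to model.frame (a rough sketch with a hypothetical helper):

    # mean of the left-hand-side variable, grouped by the right-hand side
    summarise_by <- function(fml, data) {
        mf <- model.frame(fml, data)
        tapply(mf[[1]], mf[-1], mean)
    }
    summarise_by(Sepal.Width ~ Species, iris)

It's the larger architecture -- lazily built plot objects, the overloaded + operator in ggplot2 -- that I'd love to see summarized.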


I think as I flesh out the functional programming and computing-on-the-language sections, those techniques underlying ggplot2 and plyr should become more obvious. I like the idea of case studies on more complex packages, but I don't know if I can bring myself to try and describe how they work :/ Part of the problem is that I didn't understand the techniques terribly well myself when I wrote ggplot2 and plyr, so they don't have particularly clean implementations.


I have to second the usefulness of devtools/wiki. It's wonderfully clear and concise, though it does take some getting one's head around. And I write R every day, have read all of Venables and Ripley (more than once), and R still trips me up quite a lot. I tend to understand the error messages now, which is certainly progress.


I've been working with R 90% of my time over the last two years and I agree it's a freakishly inconsistent language.

The team I work with and I are still finding weird, completely illogical errors that affect the language, but I think I've somehow started loving it despite all its flaws, and I did actually semi-enjoy the process of figuring out all those little wtf issues.

For those starting out with it, reading this book + the R Inferno helps a lot with understanding the underlying weirdness and inconsistencies of the language: http://www.amazon.co.uk/The-Art-Programming-Statistical-Soft...


A few resources that might speed the learning curve

Survival guide

http://www.win-vector.com/blog/2009/09/survive-r/

Tutorials

http://www.statmethods.net/index.html

http://heather.cs.ucdavis.edu/~matloff/r.old.html

http://www.twotorials.com/

http://en.wikibooks.org/wiki/R_Programming

http://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf

Docs

http://cran.r-project.org/manuals.html

StackOverflow

http://stackoverflow.com/questions/tagged/r?sort=votes&p...

Cheat sheets

http://cran.r-project.org/doc/contrib/Short-refcard.pdf

http://bit.ly/VHPA3G

R Journal http://journal.r-project.org/

R News (predecessor to R Journal) http://cran.r-project.org/doc/Rnews/

Rseek search engine http://rseek.org/

R could use something like the Python ecosystem intro, explaining where to find stuff; I'm not sure any of the tutorials are as canonical as, for instance, Dive Into Python or Learn Python The Hard Way.


> R could use something like the Python ecosystem intro, explaining where to find stuff; I'm not sure any of the tutorials are as canonical as, for instance, Dive Into Python or Learn Python The Hard Way.

It only became popular recently; the expectation used to be that you'd learn R while enrolled in a statistics class. The canonical references are books: Modern Applied Statistics with S, R Graphics, etc. I assume that will change soon.


Would be great if the CRAN package repositories had ratings and download counts.

It can be hard to find the 'right way' to do something.



R is my mistress and my wife.

Hadley Wickham held the ceremony.

data.table is the glue that holds together our tenuous marriage.

    cats[color == "brown", summary(tooth_length), by = list(breed, age)]

This line of code will efficiently grab all the brown cats in your data, summarize their tooth length (mean, median, and the other quartiles), and break it out by the cats' breed and age.

Every time I use a language in which I can't lapply, I despise the language and all those involved.

R is hugely inconsistent and terribly inefficient. But it's also the most irresistible blend of lispy and object-oriented for when you actually need to get data analysis done.


I spent about a year blogging about R as I learned it (http://www.r-chart.com/). It is not a general purpose sort of language but is great for math / charting tasks. A couple of ways that software developer types can minimize time to get up and running:

1) Check out http://www.r-bloggers.com/. The site gives you a good sense of the state of R and Tal (who runs the site) has done a great job of promoting R and encouraging the community.

2) Pick one graphics package and stick with it. The standard functions are sufficient but ggplot2 has a more finished appearance.

3) Review available libraries. If you want to do something, someone else likely already has and has posted a package to CRAN.

4) One way for SQL developers to limit the need to learn the idiosyncrasies of the language is to use the sqldf package to manipulate data frames and use ggplot2 (which generally takes data frames as arguments) to display charts, as in the sketch below.
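
For instance, something like this (a minimal sketch using the built-in mtcars data):

    library(sqldf)
    library(ggplot2)

    # aggregate with SQL, then plot the resulting data frame
    avg <- sqldf("SELECT cyl, AVG(mpg) AS mpg FROM mtcars GROUP BY cyl")
    ggplot(avg, aes(x = factor(cyl), y = mpg)) + geom_bar(stat = "identity")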


This sounds like the experience of a beginning programmer learning pretty-much any popular scripting language.

> I found that regular expressions ... are a great way to isolate the string that one is looking for...

Well I never!

I'm not criticising the guy who wrote this, but I don't know why this has been voted up on Hacker News.


To be honest, I found the entire Coursera course on programming for data analytics to be in the same vein. So much time was spent explaining what each optional parameter to a function did that I skipped at least 50% of each lecture.


Same here, I gave up. I'd rather read about the syntax. Save the lectures for concepts that are hard to get across in text. I thought part of the problem was that the course was geared towards grad students for whom this might be their first programming language, as opposed to programmers who already know a couple of languages.


I've been considering dropping the class as well. The course definitely seems like an R class for statisticians, whereas I am a programmer hoping to learn more about statistics. I'm trying to stick with it but I don't feel as though the course has been beneficial thus far.


I just skimmed through all the videos and then moved on to https://class.coursera.org/dataanalysis-001 instead. It's from the same university, but it's much more "here are the statistical concepts, and I'll show you an implementation in R" rather than "here's the syntax for the R command that solves this particular riddle." Much better value for time!


Not convinced, the non-programmers on the discussion boards were having real problems...

Certainly wasn't the best course I've done on Coursera, although it did "force" me to learn R, so in that sense it achieved its aim.


I have to ask, why would a programmer want to learn R as a language? Why not just learn statistics?


I'm also reading statistics texts. I've had colleagues use R for visualizations, and another tool in the toolbox is always good. But some of the design decisions made me recoil, and I think I'll skip it until I have a need for it at work.


A big problem with R is that much too much semantics gets buried in critical optional parameters to functions.


This is a good place for 'And I am no man' from Lord of the Rings. The author appears to be female, not a guy.


R is probably the most frustrating language I've ever had to learn. For me, R doesn't so much have a learning "curve" as a learning "gradient". After a long, long time the rate of frustrations is not really leveling off much. Each time I have to negotiate a new library or module that I'm not familiar with I'm guaranteed hours of frustration as I try to grapple with what all the data types are, how they are subtly overridden to behave differently from the normal data types they inherit from, what things are going to be magically coerced into what other things without me realising it.

Even after 2 years of involvement with it, I regularly meet problems that frustrate me for hours, because they are hard to express in R's functional style. Even when I figure out the final solution, it often performs very poorly. In between these frustrations R behaves almost magically. It manipulates huge data sets with an ease and simplicity that defies logic. I've eventually come to the conclusion that an important aspect of R is knowing when NOT to use it. If your task is inherently stateful, involves random access to lots of sparse relational data that is associated in complex ways and with complicated logic - R is going to make your life hell. Write a quick script to pre-process your data and get things into more like a straight data table form and then proceed with R to analyse and visualize it. This is my experience, anyway!


3 packages:

plyr, ggplot2, reshape

Learn those 3 and use them for everything you do in R.


plyr, while having awesome syntax, is really slow for anything beyond even the most modest dataset. Try using it with 50k rows, not to mention true "big data".

data.table is much faster. It's an extension to data.frames that adds some additional constraints/rules that allow for much faster operations, including aggregating, subsetting, and merging data.
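
For example, a grouped aggregation in data.table looks roughly like this (a sketch with made-up data):

    library(data.table)
    dt <- data.table(group = sample(letters, 1e6, replace = TRUE),
                     value = rnorm(1e6))
    setkey(dt, group)                                  # sort/index by group
    dt[, list(mean = mean(value), n = .N), by = group]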

Cleaning data, however, does not have a steep learning curve or high difficulty level in R-- it has a steep learning curve and high difficulty level period. Implementing good procedures for data munging is 80% of the job.


Yeah - for small data sets, Ruby or Python is often easier to use to whip a data set into a form that can be simply slurped into R as a dataframe.


Has to be said I cheated a bit on the Coursera course, I used Ruby to clean up some of the data. Seemed a lot easier, and it's a lot more likely I'd use some kind of scripting language (be it Perl, Ruby, Grep, ...) to do a first clean up than head directly to R to do the job.


With pandas / statsmodels / patsy, it's getting easier to stick with python for everything.


Don't forget Bokeh[1] - a replacement for ggplot2

[1] https://github.com/ContinuumIO/Bokeh


This is a good point. Usually, I use Haskell or something to do large-scale transformations prior to a plot. However, I still use plyr from time to time to do quick one-offs on smaller samples.


With 50,000 rows plyr should be fine, and I've tested it up to 1,000,000 rows and it's reasonably fast. The problem is more that it doesn't scale terribly well in the number of groups.


I really like ggplot2. I always had a hard time getting on board with the other packages. sqldf allows SQL queries against dataframes. Perhaps plyr and reshape feel familiar to folks from other backgrounds (science/stats), but SQL is the lingua franca among a good portion of the developer world (and explains innovations like LINQ in .NET).


> Cleaning the data - This takes time and it can be annoying.

Have you tried using Google Refine[1]? It's an excellent tool for cleaning datasets and a useful addition to your workflow. I was miffed I didn't know about it when I had to collect and analyze data from surveys a while ago. I was using Python for cleanup (Google Refine supports Jython).

[1]: http://code.google.com/p/google-refine/


Google Refine does some really cool stuff, but one of the great developments in the last couple of years within the R community is a big push toward reproducible research and building a powerful toolchain to that end.

The problem with Google Refine is that it is very hard to recreate precisely the same steps to get from raw to processed data.


Taking the exact same course and going through the same learning process!


It is useful, and yes, when something doesn't go right in your code it can be very frustrating... but no more than in other languages. It is frustrating in its own complex ways.

The documentation, for example, is often very abstract. Not to forget the difficulty of the underlying statistics for a lot of people. Some good books can help with this.

The wow factor of R is very high in any case. There are a lot of things that I can currently do only with R, so... keep hammering _and_ look at as much R code as you can.


Maybe it's just me, but I find it difficult not to make analogies between R and JavaScript. Both are dynamic "script" languages with C-based syntax and a Lisp-like nature lurking underneath. Both have functions as first-class citizens. Both are inconsistent in different ways. Just to name a few... I think these analogies reduced the frustration I had felt about R once I "discovered" them, and helped me to adopt and learn R.


My personal observation (after 6 years of using and 3 years of teaching R) is that people misunderstand R by skipping the array paradigm part. If you don't get it, others' code looks like magic and either your code is slow and ugly or you seek rescue in "more-magic" wrappers like sqldf, plyr or data.table.
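
By "the array paradigm" I mean thinking in whole-vector operations instead of element-by-element loops, e.g. (a trivial sketch):

    x <- rnorm(1e6)

    # element by element, the slow and ugly way:
    s <- 0
    for (xi in x) if (xi > 0) s <- s + xi

    # vectorised, the R way:
    s <- sum(x[x > 0])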

It's a bit like JavaScript; if you treat JS like "Java script" or "C without types", you'll suffer. Exploit its true nature and it will flourish.


I disagree. I've used Matlab for 4 years and switched to Numpy/Scipy 2 years ago -- so I understand the array paradigm part. The problem with R is the inconsistency and verbosity in syntax and semantics of the array language underlying it -- as other comments have pointed out.

Some differences in ease of use:

[1] Construct a matrix:

    MATLAB: y = [1 1; 2 2; 6 6];
    R:      y <- matrix(c(1, 2, 6, 1, 2, 6), 3)

[2] Insert a new row r = [3 3]:

    MATLAB: x = [y; r];
    R:      x <- rbind(y, r)

Which is intuitive and concise? ;)


There's an important difference between matlab and R: in matlab matrices & arrays are the most important data structure, while in R data frames are the most important. There is no "array language" underlying R - working with arrays and matrices in R is usually painful, and your life is much easier if you stick with data frames. (This is something that could be fixed in R by a package, but no one has done so yet)


> This is something that could be fixed in R by a package, but no one has done so yet

I'm curious about what you mean by this. How would a single package fix that? And does "fix" mean to make matrices easier or to make dataframes more broadly effective?


The problem isn't with the underlying data structures, it's with the methods that have been implemented for them. A package would fix the problem by fleshing out R with a decent set of consistently named and parameterised matrix-manipulation functions.


I don't agree; the core of the array paradigm is vectorisation, as started in APL and continued in Fortran, J, K, Lush and R. The idea that it has something to do with matrix algebra is wrong -- MATrix LABoratory is simply an orthogonal story.


Huh? I don't understand your point, e.g. I didn't mention matrix algebra.


So you also think that

  mean(iris(iris(:,5)==2,2)) 
is more intuitive than

  mean(iris[iris$Species=="versicolor","Sepal.Width"])?
Or that constructing a matrix by mapping f over 1:10, like

    M = []
    for not_i = 1:10
      M(:,not_i) = f(not_i)
    end

is more concise than sapply(1:10, f)? (I know there is arrayfun, but I have never seen it used except in wow-MATLAB-is-functional blog posts)


Or, you could just write:

    M = repmat(f(1:10), n, 1);
where 'n' gives the number of rows you want in 'M' and 'f' is written in proper "Matlab" style (i.e. behaves reasonably when given an array as input). Or, to be more in the spirit of linear algebra, you could write:

    M = ones(n,1) * f(1:10);
And, if you only want one row-wise copy, you could (succinctly) write:

    M = f(1:10);
Or, as you suggested, you could write something like:

    M = repmat(arrayfun(@(x)f(x), 1:10), n, 1);
Or, getting more silly, and using the handy bsxfun, you could write:

    M = bsxfun(@times, ones(n,10), f(1:10));
If you don't feel like implementing 'f' so as to permit array inputs, you could modify this to:

    M = bsxfun(@times, ones(n,10), arrayfun(@(x)f(x), 1:10));
Anyhow, Matlab is very productive if you can effectively wield its powerful built-ins.


You didn't get my example -- it was about how to make a matrix from the results of a non-vectorised function that takes one number and returns a vector (say, one that performs a complex simulation).

The problem you've solved has equally simple implementations in R: f(1:10) for a single copy, matrix(f(1:10), 10, n) for n columns, matrix(f(1:10), n, 10, byrow=T) for n rows, etc.


Ironically, your "iris" object can't be a matrix the way it's written. I've had lots of students get confused about the distinction between matrices and data frames. And sapply will sometimes give (IIRC) matrices of lists with a single element if f is constructed incorrectly (requiring an explicit 'unlist' in the return statement of f). So I generally agree with you, but I think you made shared4you's point.


Nope; the entity you describe is not a matrix of lists but a single list with dimension, and sapply creates it when the mapped function returns a list of the same length for all iterations (which is perfectly consistent with how sapply treats vector output).

The confusion comes from the fact that people think that matrices (or data frames) are R's base types -- they are not; only vectors and lists are. A matrix is just something with a dim attribute; a data frame is a list of equal-length elements with the proper class.
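
To demonstrate (all standard behaviour, easy to check at the prompt):

    v <- 1:6
    dim(v) <- c(2, 3)   # the plain vector is now a 2x3 matrix
    is.matrix(v)        # TRUE

    l <- as.list(1:6)
    dim(l) <- c(2, 3)   # a list can carry a dim attribute too
    l[[1, 2]]           # matrix-style indexing of a list: returns 3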


Cool, I didn't realize that lists could have dimension too. Thanks for the correction.


Don't forget the free http://tryr.codeschool.com/ from Code School and O'Reilly.


Wouldn't it be better to have a Python statistics library instead of creating a new language? When I looked at the documentation of R it looked very convoluted and it seems that my feeling is confirmed by other comments in this thread.


Python is inherently not conducive to data analysis in the way R is. Python is not a Lisp. R is more lispy, and its functional attributes make it incredibly useful for producing data-analysis code. For fully production-ready systems R is not the ideal choice; for off-the-cuff analysis, Python is a meager second.


Of course, these days Pandas is _the_ Python library for statistics. There is also scikit-learn.

http://pandas.pydata.org/


I haven't used R or Pandas. How do the two compare? Why do people use R and not Python?


> Why do people use R and not Python?

R-style DataFrames were the raison d'être of Pandas in the first place. (http://pandas.pydata.org/#why-not-r)


People started using R about the same time they started using Python (that is, the first half of the 1990s).

Since then, R has gained a huge number of statistics packages. Python has fewer of them.


I'm currently learning R and was hoping for a more detailed blog post. I found this to be a pretty bland report that would have been made much more interesting and useful if the author had included some concrete examples.


I've compiled all the resources from the comments here http://datagrad.blogspot.com/2013/02/some-learning-resources...


For learning more about plotting, the ggplot2 book is a tad expensive, but worthwhile. http://www.amazon.com/gp/product/0387981403 (it helped me understand the structure behind the library, which I found confusing from cookbooks, etc.)


I have found this site (http://www.cookbook-r.com/) to be more helpful (plus the online documentation for ggplot2, which is really good).

Hadley's book is really good and a solid treatment (from the author of so many clear, powerful packages, it's not a surprise), but there have been a lot of changes to ggplot2 since 2009. I have personally found ggplot2 to be so powerful because it lends itself very well to learning by using cookbook-style examples.


The cookbook is now an O'Reilly book: http://amzn.com/1449316956


I don't know why a programmer would limit themselves to static data tools when d3 and Processing provide beautiful, dynamic, and interactive solutions with a similar learning curve. The only difference is you can use the JavaScript chops you pick up learning d3 in any front-end work you do.


People use R for much more than data visualization; doing regressions, fitting SVMs, decision trees, GARCH and ARMA time-series models isn't really trivial in JavaScript.

Also, ggplot, while more limited, is way easier and faster than d3 for the specific things that ggplot is good at. But for statistics, that's usually what you want.


Sounds great! What's the MCMC package for JavaScript?


I worked with a computer scientist once, and he tested out lots of languages for replacing his VB setup (I know, I was surprised too). JS turned out to be the fastest language for a lot of his tests. If the libraries ever get there, it will be lethal. Presumably there might be a way to compile Java libraries to JS using Rhino, but it's not something I've ever investigated.




