Advanced R programming (had.co.nz)
173 points by iamtechaddict on Nov 17, 2013 | 40 comments

> Although R has its quirks, I truly believe that at its heart it is an elegant and beautiful language. While R is a fairly mature language, we are still learning how to craft elegant R code: much code seen in the wild is written in haste to solve a pressing problem, and has not been rewritten to aid understanding.

First of all, what a great endeavor by Hadley...if "all" he had done was produce ggplot2 (and write a great book about it), that's enough to cement his elite status. However, what I don't get is...why R? After a few days of hacking, I was able to produce some nice graphics with ggplot2, but I have to say that it was by far the hardest high-level language I've had to learn as a programmer...I haven't used it enough to love, so I'm not at the stage that I am with JavaScript. That is, I know of JavaScript's problems but know of the strengths that sometimes derive from its weirdness...and of course, JS is too ubiquitous to just ignore. However, with R, it just seems some of its quirks are just bad.

I guess my question is aimed more at the angle of: how does R do the things it does so well? ggplot2 alone is great enough to justify learning R. And some of the data munging methods, such as `melt`, don't seem to have a well-supported port in all the other popular languages. I know that Python's pandas has one...Ruby does not. Is there something about R the language that makes it especially good at its data and statistical methods (in the way Matlab is geared toward matrix manipulation)? Or is it just that R was so heavily adopted by the stats community that, if they had picked another language, that language would have functionality just as great as R's?
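For readers who haven't met `melt`: it reshapes a "wide" table (one column per measurement) into a "long" one (one row per measurement). Below is a minimal plain-Python sketch of the idea only. It is not R's or pandas' implementation, and the function signature and sample data are made up for illustration.

```python
# Minimal wide-to-long reshape, in the spirit of `melt`.
# The signature and the sample data are illustrative assumptions.

def melt(rows, id_vars, var_name="variable", value_name="value"):
    """Emit one output row per (input row, non-id column) pair."""
    long_rows = []
    for row in rows:
        # Columns that identify the observation are copied to every output row.
        ids = {key: row[key] for key in id_vars}
        for key, value in row.items():
            if key in id_vars:
                continue
            long_rows.append({**ids, var_name: key, value_name: value})
    return long_rows


wide = [
    {"city": "Oslo", "2012": 10, "2013": 12},
    {"city": "Lima", "2012": 20, "2013": 21},
]

long_rows = melt(wide, id_vars=["city"])
for record in long_rows:
    print(record)
```

Each of the two wide rows becomes two long rows, one per year column, which is the shape that plotting and grouping tools usually want.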

Note: I suffer from selection bias, though...a lot of the people I chat with are data scientists, where R is so ubiquitous. It may be that Python pandas is just as good as the R libraries, but I just know more R-users than Python-users.

Python's NumPy, SciPy, Pandas, Matplotlib, SciKits, and StatsModels are very formidable, and have most of the good stuff R has, plus Python itself has a lot more good stuff (from Boost Python to really basic stuff like argparse), minus some horrible stuff that R has (such as the affinity for global functions like `rm()` which seem to be named like Unix tools but which do other things, or the `c()` function which is impossible to Google for, or the abysmal default error reporting, or the use of dots in variable names).

But R has some things going for it. There are some algorithms and tools which exist in R but nowhere in Python (this set seems to both shrink and grow over time as both languages add more stuff). R's overly-terse syntax for some things is annoying for maintainers of R code, but R hackers enjoy it because they tend to be all about banging out piles of stuff quickly.

R also comes with a lot of stuff included that in the Python world would fall under many different umbrellas (see the several names I mentioned at the beginning--those are just some of the basics). Whether it's true or not, R users perceive Python as being relatively balkanized, with that long list of packages just to get started, and with the Python 2 vs. 3 divide which has plagued it for years and will continue for a while still.

How is rm different from what you'd expect? R also has head, tail, grep, ls... all likewise.

And why would you need to google the function c? I don't think there's ever been anything more I've wanted to know about it than is written on ?c.

But your second paragraph makes a good point. For any given big csv of numbers it's a whole lot faster and fewer LOC to clean, organise and plot in R than in Python, even with Python's ever-growing list of imports.

rm() in R is like unset in Unix rather than rm. And ls() in R is like set in Unix. File operations in R have other names.

My experience with R is about 2 years old but your comments are spot on. I selected R initially because it had the only autoregressive-moving-average (ARMA) calculation that was both good and fast, which my users had requested for some data extrapolation. I could see its promise but I'll be damned if it wasn't the most annoying language to use for general things like accessing a database to get the data. I eventually got everything to work but it was not easy to automate and deploy.

Ultimately the ARMA calc didn't do what they wanted, mostly because ARMA was the wrong thing to use on the dataset in the first place, IMNSHO. This could be my general lack of experience with R, but I've been programming for 15+ years and it was one of the rougher languages to work with.

Anyway, I ported the code to Python, NumPy, SciPy and SciKits (most significantly the time series stuff), and it was much easier to pull in the data, apply smoothing filters and do some general data cleanup. But ARMA was nowhere to be seen, so I settled for simple linear and quadratic fits, which I think did a better job of forecasting. I really liked some things that R did automatically, like adding confidence intervals to the forecasts when trending data. I was actually tempted to port the ARMA libraries to Python over this but didn't want to dedicate the time to debug and validate it. R was really good for interactive manipulation but Python was better for actual deployment.

Connecting to databases in R is way harder than it should be. It's something I want to work on in the future.

This is what's really weird to me about every conversation that pops up with people complaining about R. I've been using R daily for nearly 8 years now, and there are plenty of things that I could complain about.

But other people always seem to have big problems with things that never even occurred to me.

In this case, I've been using R to pull data out of SQLite, SQL Server and Oracle db's every single day, for years. And I've never had any problems at all. It wouldn't even occur to me to think that R's ability to get data out of a db was anything other than "just fine".

Yes, database access is probably not the strongest side of R.

I think part of the issue is that the typical use cases of Python and R are a bit different, so a lot of functionality that in Python's case comes in well-debugged and well-documented standard libraries comes, in R's case, in relatively little-supported user packages.

Also, the standard package documentation system in R is absolutely atrocious; I am convinced that R would have been far better off without any package documentation standards at all.

R is also considerably better than Python at distributing Windows binaries. About 70% of R users are on Windows, and many statistics packages have some C/Fortran code, so this is really important in terms of putting the tools in the hands of users.

That may have been the case, but not recently. Here's a one-stop shop for all Windows binaries: http://www.lfd.uci.edu/~gohlke/pythonlibs/ . Not to mention, there are Python distributions, such as Anaconda, that come with the necessary stats/numerical packages. Cloud-based services like wakari.io also make it really easy to get up and running.

Maybe it's just me, but I find the number of ways to get python libraries to be very confusing. Do you use an egg? distutils? pip? easy_install?

I just skimmed the first few google results for "install python module windows" and none seemed particularly helpful. The page you point to says "The files are unofficial (meaning: informal, unrecognized, personal, unsupported) and made available for testing and evaluation purposes." Anaconda looks appealing, but wants my email address (and automatically checks the bother me box), ugh.

Hey Hadley, you can get anaconda without giving us any contact information.


I don't think you can judge a language completely in isolation from its community. There are features of R that make it particularly well suited for statistical computing (vector-oriented, missing values at a fundamental level, ...). Those features influenced early statistical adopters, which in turn led to a virtuous cycle: the more statistical/data analysis functionality was available in R, the more obvious a first choice it became for statistics/data analysis.

It's also worth bearing in mind that python has only become a reasonable competitor to R (for statistics) in the last couple of years. Without pandas and IPython, python is a much less compelling option, especially given that most R users are not programmers and just want to figure out what's going on in their data.

(And thanks for the kind words :)

This. "Why R" is because context matters when picking a programming language. Python might have libraries to work on the models I need. R will have them, unless they're really, really obscure.

"how does R do the things it does so well?"

R is a vector/array-based language which fits the problems of its domain in a natural way. On the other hand, you wouldn't really want to use such a language for anything other than data munging.

The language has its flaws but those are well described, e.g., in "The R Inferno" (http://www.burns-stat.com/documents/books/the-r-inferno/).

R was, apparently, inspired by S (a stats language) and Scheme. Scheme is an extremely elegant language, which also served in part as the inspiration for JavaScript. It's somewhat amusing that two such poorly designed languages have been inspired by one of the best.

As the above suggests I don't hold R the language in much esteem. What it had was familiarity for people who had used S, and now, years and years of accumulated libraries. I don't believe there were any technical features in R that led to success. It was purely social: a free and open source language that was close enough to an already familiar tool.

I really don't think R the language is poorly designed. Sure, the implementation isn't great, and the standard library is patchy and inconsistent, but the core of the language is elegant and well-suited to its domain (which is not just programming, but also interactive data analysis).

I'm interested to hear dissenting opinions, but you'll need to back them up with specifics.

Ok, here is one I remember from my R days: the save and load functions.

First some preamble. One of the fundamental design features of Scheme is lexical scoping, at the time a relatively unusual feature. Lexical scoping greatly simplifies program comprehension and compilation. In a lexically scoped language, a binding -- that is, an association between a name and a value -- is only visible in the scope in which it is defined, and any scopes within that scope. This means the textual source of the language determines which bindings are visible -- simple and no surprises.
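A small sketch of that rule, written in Python (which is also lexically scoped) since it is compact to show. The function names are made up for illustration. Which `count` the inner function sees is decided by where it is written in the source, not by who calls it.

```python
def make_counter():
    count = 0  # binding defined in make_counter's scope

    def increment():
        # Resolved by looking outward through the enclosing source text,
        # so this is make_counter's `count`, never the caller's.
        nonlocal count
        count += 1
        return count

    return increment


def caller():
    count = 100  # a separate binding with the same name; invisible to increment()
    counter = make_counter()
    counter()
    return counter(), count


result = caller()
print(result)
```

The caller's `count` stays at 100 while the counter's own binding advances, because the two names live in different textual scopes.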

The save function in R doesn't save a value, it saves a binding. When you call load you actually add a binding into your scope (local or global scope? I'm not sure and the docs don't say.) This is absolute madness. It means you need to know the name that was bound to the value when it was saved, coupling the code that uses the saved value to the code that produces it. Imagine programmer A writes the code that calls save and programmer B loads values. They agree on a name, but then programmer A changes that name ... and breaks programmer B's code!
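A rough Python analogy of that coupling, using only the standard library. This is not R code and not how R's `save` and `load` are implemented; the helper names (`save_binding`, `load_bindings`, `save_value`, `load_value`) and the sample data are hypothetical, invented only to contrast "persist a name plus a value" with "persist just a value".

```python
import io
import pickle


def save_binding(buffer, name, value):
    """Persist the name together with the value (the style being criticized)."""
    pickle.dump({name: value}, buffer)


def load_bindings(buffer, namespace):
    """Inject whatever names were saved into the given namespace."""
    buffer.seek(0)
    bindings = pickle.load(buffer)
    namespace.update(bindings)
    return list(bindings)


def save_value(buffer, value):
    """Persist only the value; no name travels with it."""
    pickle.dump(value, buffer)


def load_value(buffer):
    """Return the value so the reader picks its own name."""
    buffer.seek(0)
    return pickle.load(buffer)


# Programmer A used to save under "model" and has renamed it to "fit".
binding_file = io.BytesIO()
save_binding(binding_file, "fit", [1.5, 2.5])

# Programmer B still expects "model" to appear after loading.
consumer_scope = {}
loaded_names = load_bindings(binding_file, consumer_scope)
consumer_broken = "model" not in consumer_scope

# With value-only persistence, the rename on A's side cannot matter.
value_file = io.BytesIO()
save_value(value_file, [1.5, 2.5])
model = load_value(value_file)

print(loaded_names, consumer_broken, model)
```

In the first style the reader's code depends on a string chosen by the writer; in the second the only contract is the value itself.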

Now you might argue this is a standard library issue, not a language issue, but I argue the two are so tightly coupled you can't consider one in isolation.

Yes, save and load suck, and I never use them (and encourage others to avoid them too). Use saveRDS and readRDS instead (and yes, those are their totally inconsistent names). Also, the documentation for load does tell you where the objects get bound.

Language and standard library are tightly coupled, but problems with the standard library are _much_ easier to fix than problems with the language.

Alternately, load() to an environment other than the global one and ls() to figure out what you just loaded... But since attending one of your talks that mentioned saveRDS/readRDS in passing, I've switched over entirely to those.

Also load returns (invisibly) the names of the objects it loaded, so `(load(...))` will print them out.

Interesting. What features of R did you particularly dislike, and what was your background before then?

I'm asking because S-PLUS (subsequently mostly replaced by R) was one of the first statistical languages I learned, and I have learned and actively used many other languages since, but I never before (or after) had this feeling that "this is the most convenient language in existence, and it does everything precisely the way I want and expect".

I don't know Ruby (the snippets I've seen do look very nice, but it's mostly used in a very different domain), but Python does not come anywhere close (e.g. compare the treatment of default parameter values!), nor does Matlab (one function per file? wtf?), or C++, or Gauss, or really anything reasonably high level that I can think of. SAS and Stata might have a slight edge over R in very specific use cases, but outside of those, there is absolutely no comparison. Julia has a lot of potential, but imo it's not quite there yet. Also, R is ridiculously easy to incorporate C/C++/Fortran/etc. code into, and S3 is a really wonderful OOP/abstraction system.

It's definitely not perfect, and today R does not quite evoke the same sentiment as S+ did back ~15 years ago, as R did away with some of my favorite S features and introduced a lot of complexity (although, it's also possible that my typical use cases became more complex, so I have to deal with internals more often). Also there are features like the apply() family of functions that I've seen done better, and some R features apparently make it hard to optimize code. But there is very little in the language that I could honestly say I seriously dislike.

S was developed by statisticians, for statistics — so it's really no surprise that web programmers come along and say "ew that's weird, why does it do that!".

I wish it was developed by Computer Scientists for statisticians, for statistics.

Having seen how this occasionally works out, I don't.

"I haven't used it enough to love, so I'm not at the stage that I am with JavaScript. That is, I know of JavaScript's problems but know of the strengths that sometimes derive from its weirdness."

I'll take a risk and ask: what are these strengths you're talking about that JS has?

I really like the prototypal inheritance -- it's a feature few modern languages use, and yet it's very flexible.

I think when it comes to inheritance and object-oriented programming, R and JavaScript share the problem that there is not one standard way of doing it. Libraries can and do implement different systems. In JavaScript prototypal inheritance is built-in, but there are also libraries that implement their own way of doing OO. In R you have the S3, S4 and Reference class systems, and there are also packages that implement other approaches.

For prototypal inheritance in R you can try the proto package [1].

[1]: http://cran.r-project.org/web/packages/proto/index.html

Interesting that R has a proto package, but to be honest, if I get to the point where I need classes or prototypal inheritance in R, I reach for another tool. R is excellent for exploratory data analysis, but in my (admittedly limited) experience with R, it makes for a poor choice for anything more complex than a small script.

Besides the Stockholm Syndrome?...I think you have me there. It's not that JS doesn't have strengths; the question is whether some of its widely derided flaws resulted in happy tradeoffs. I'm banging my head trying to think of something specific (and widely agreed on) but it's early. I guess the same question could be asked of JavaScript...do its great exclusive libraries, such as D3, exist in JS purely because JS is the most popular language for interactivity, rather than because JS was well-suited for such libraries? Probably the former.

I agree with Crockford's sentiment that all the parts of JavaScript that are like Scheme can be considered good.

Wasn't there a time when biologists used Perl a lot and thought it was great? It's all about being the first to market (while not totally sucking). And S is so old I'm not even sure there were any scripting languages with operator overloading back then.

That still happens. Anecdotally, a recent blog poll found Perl still being voted the third "best" language for a bioinformatician, after R and Python [1].

[1] http://computationalproteomic.blogspot.co.uk/2013/10/which-a...

Wow! Terrific. We've needed a resource like this for a long time in the R community, and Hadley is the one to write it!

This is a tremendous resource on the level of John Chambers's book Software for Data Analysis.

I always look at a language's error handling. First piece I see is 'There are three ways that a function can fail' followed by a six item list.

No one expects the exception.

That chapter (like the entire book) is still a work in progress and I'll hopefully fix the most egregious errors before publication ;)

Is the whole book available in one page somewhere?

No, because it will be for sale eventually, and that's the deal I struck with my publisher. But if you dig around in https://github.com/hadley/adv-r you can find a script to make a single pdf...

This is a great contribution to the community, thanks so much. I'm sure it will make writing R code even more enjoyable.
