
The R language, for programmers - tacon
http://www.johndcook.com/blog/r_language_for_programmers/
======
plinkplonk
An "R for programmers" style book is Hadley Wickham's glorious "Advanced R"
(available at [http://adv-r.had.co.nz/](http://adv-r.had.co.nz/)). R never
made much sense to me until I read through this. Recommended.

------
jmount
To my mind, some of R's pluses are data frames and the ability to indicate
missing values in vectors of any type. Some of the weird stuff is the lazy
evaluation of arguments, the ability to know the names of variables bound to
function arguments, and the ability to snoop up the environment stack. Some
distinct minuses are the changing of types (dropping of dimensions on select),
semi-reserved terms, and c()'s squashing of complex types. One of my articles
on the topic: Survive R
[http://www.win-vector.com/blog/2009/09/survive-r/](http://www.win-vector.com/blog/2009/09/survive-r/).
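A base-R sketch of a couple of the points above, using hypothetical toy values: dimension dropping on select, c()'s type squashing, and the NA-in-any-vector plus.

```r
# Dimension dropping: selecting one row of a matrix yields a plain vector.
m <- matrix(1:6, nrow = 2)
r1 <- m[1, ]                  # dims dropped: a length-3 vector
r2 <- m[1, , drop = FALSE]    # keep the 1x3 matrix shape explicitly

# c() squashing: mixed types are silently coerced to the richest type.
v <- c(1, "2")                # both elements become character: "1" "2"

# The plus side: NA works in vectors of any type.
x <- c(1.5, NA, 3)
mean(x, na.rm = TRUE)         # 2.25
```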

~~~
craigching
And plotting, especially ggplot2!

------
minimaxir
The biggest "gotcha" when learning R as a programmer is that R interprets the
character columns of data frames as factor vectors _by default_ , which will
usually break something in your code.

If you're learning R, learn to use dplyr for data manipulation and ggplot2 for
plotting. Both will save you a _lot_ of time.
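The factor gotcha can be demonstrated in a few lines of base R. Note this is the pre-4.0 default; since R 4.0, `stringsAsFactors` defaults to FALSE, so the flag is set explicitly here to show the old behavior.

```r
# Character columns silently become factors when stringsAsFactors = TRUE
# (the default before R 4.0).
df <- data.frame(id = c("10", "20"), stringsAsFactors = TRUE)

class(df$id)          # "factor", not "character"
as.numeric(df$id)     # 1 2 -- the factor codes, NOT 10 and 20!

# The safe round-trip goes through character first:
as.numeric(as.character(df$id))   # 10 20
```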

~~~
craigching
> If you're learning R, learn to use dplyr for data manipulation

I had been learning data.table, but I really like dplyr's %>% operator and the
compositional functions better. I think I'm going to make the move to dplyr.
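A sketch of the compositional, piped style dplyr encourages, assuming the dplyr package is installed; `mtcars` ships with R.

```r
library(dplyr)

# Each step reads left to right instead of inside out:
result <- mtcars %>%
  filter(cyl == 6) %>%            # keep 6-cylinder cars
  group_by(gear) %>%              # one group per gear count
  summarise(avg_mpg = mean(mpg))

# Equivalent nested call, for contrast:
# summarise(group_by(filter(mtcars, cyl == 6), gear), avg_mpg = mean(mpg))
```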

~~~
hadley
FWIW the performance difference is insignificant unless you're working with
tens of millions of rows.

~~~
craigching
I thought dplyr (note, not plyr; dplyr is the next iteration of plyr,
implemented mostly in native code) was pretty much as fast as data.table.

------
mjt0229
R is one of those languages that looks like it was designed in a vacuum by a
very smart person. It has many common, modern PL constructs, but they're
expressed syntactically in a way that in no way resembles any other language
I've seen. The entire syntactic legacy of Algol, Pascal, C, etc. is
cast aside. Familiarity with any of those syntaxes felt to me like
more of a liability than a help. That's not to say that the concepts don't
apply, just the syntax.

~~~
claytonjy
I agree, and I think that's exactly why an article like this exists. The R
learning curve seems to be much gentler on people without too much serious
programming experience in another language.

Have you looked at Julia at all? I'm only mildly familiar, but it looks super
promising and I'm curious if the syntax there seems more normal or predictable
for an experienced dev.

~~~
mjt0229
Yeah, I should have mentioned that - R for Programmers is exactly the kind of
thing I'd need, even if it's not useful for my friends and family (largely
scientists rather than programmers for whom the legacy of programming language
syntax is completely unknown).

Julia looks cool; I think the syntax is meant to look familiar to people
who've used Matlab or Octave extensively. I don't do tons of scientific
computing, but Julia is on my list of tools to learn.

------
Alupis
I actually quite like the R language. A buddy of mine is in his University's
PoliSci program and one of the requirements is to learn R for statistical and
trend analysis. He could not stop complaining until I offered to help him
learn it by learning it with him. After doing his first assignment, we were
both impressed with what could be easily done in R to visualize data. I think
he now realizes how useful of a tool R can be in his future career.

~~~
epistasis
I actually really like it too. The programming language features of it are
quite different from what's going on in a Java/C++/C# world, but they are
super convenient.

Argument matching is really amazing and useful for prototyping. No doubt
there's a penalty, but it's exactly the type of power that's needed to build
expressive and useful reusable components with rapidly changing designs. And
pattern matching like that really helps at the REPL because it allows far
faster exploration with fewer keystrokes. Best programming practice in library
code, however, would be to have things more fully fleshed out.
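A base-R sketch of that matching behavior, with a hypothetical function: arguments match by exact name, then partial name, then position.

```r
f <- function(alpha = 1, beta = 2) alpha + 10 * beta

f()             # defaults: 1 + 20 = 21
f(b = 5)        # partial match on 'beta': 1 + 50 = 51
f(3, al = 4)    # 'al' grabs alpha; the positional 3 falls to beta: 4 + 30 = 34
```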

~~~
jghn
I've said for several years now that most complaints about R's syntax and
idioms really boil down to "this isn't doing things the way I'm used to doing
them, i.e. the C/C++ branch of the language family tree."

------
chuckcode
I find R to be a great language for exploring a data set and doing some
prototyping. There are a lot of wonderful statistical tools available through
the core packages and even more through the various community extensions. It
does have some significant issues that I've found limit its usefulness outside
of prototyping:

\- pass by value only means code tends to end up as monolithic functions

\- very slow in loops, so there's a lot of contorting to move things to matrix
operations

\- they just last year got a version out that starts supporting vectors and
matrices with > 2^31 - 1 elements, which limits larger data applications.
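The slow-loop point can be sketched in base R; the two function names below are hypothetical, and both compute the same thing.

```r
# Element-by-element loop: every iteration runs through the interpreter.
square_loop <- function(v) {
  out <- numeric(length(v))
  for (i in seq_along(v)) out[i] <- v[i]^2
  out
}

# Vectorized form: one call that runs in compiled code.
square_vec <- function(v) v^2

x <- runif(1e5)
# system.time(square_loop(x)) vs system.time(square_vec(x)) shows the gap.
```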

I find the plotting with ggplot and statistical functionality to be second to
none though.

~~~
rm999
> pass by value only means code tends to end up as monolithic functions

I've actually found R works very well as a functional language with very lean
functions. It's perhaps worth noting that R doesn't copy a dataframe in a
function call if you don't modify it, which is a very common use-case for me.
(I'm not sure if this extends to other datatypes)
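That copy-on-modify behavior can be sketched in base R (function names hypothetical):

```r
big <- data.frame(x = 1:5)

reads_only <- function(d) nrow(d)           # no modification: no copy is made
modifies   <- function(d) { d$x <- 0; d }   # the first write triggers a copy

n   <- reads_only(big)   # 5; 'big' was never duplicated
out <- modifies(big)     # the callee got its own copy...
big$x[1]                 # ...so the original is untouched: still 1
```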

> very slow in loops so lot contorting to move things to matrix operations

This is a fair criticism, I think more modern languages like Julia will win
out here. That said, R has huge library support, I've often found there are
compiled versions for a lot of what I want to do.

> they just last year got a version out that starts support for vectors and
> matrices with > 2^31 -1 elements which limits larger data applications

Again, a fair criticism. I've never considered R a "big data" tool, my
workflow is usually a funnel where each step involves reducing data size by
1-3 orders of magnitude. For example, I may have 1 PB of transactional data,
aggregate it in Hadoop to 20 TB of daily aggregated data, run a query that
filters and aggregates it further, and then run my analysis in R on final
data. In the end I may end up with 20 GB of data, which R can very easily
handle.

~~~
chuckcode
Python has better and better support for R via Rpy2, and R-like data frames
via pandas, which is helping me take advantage of the incredibly useful
analysis libraries in R.

Also note that loops are slow enough that it is really worth learning the
*apply() functions in R instead of iterating over collections explicitly. For
a relatively in-depth explanation, check out Hadley Wickham's book
[http://adv-r.had.co.nz/Functionals.html](http://adv-r.had.co.nz/Functionals.html)
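A base-R sketch of the *apply style versus an explicit loop, with toy data:

```r
vals <- list(a = 1:3, b = 4:6)

# Loop version: preallocate, index, assign.
means_loop <- numeric(length(vals))
for (i in seq_along(vals)) means_loop[i] <- mean(vals[[i]])

# Functional version: one expression, with the output type declared up front.
means_apply <- vapply(vals, mean, numeric(1))   # named: a = 2, b = 5
```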

~~~
mbq
*apply functions are loops underneath -- they only look better and save you the time possibly wasted on growing a dynamically sized output structure. The way to solve a slow loop in R is to find a package which implements it in C/Fortran (or write your own in case there is none).

~~~
chuckcode
It's actually a little complicated, but if you're interested in the details
check out this Stack Overflow thread [1]. The high-level summary is that lapply()
and functions built on top of it do some work in native C and so are generally
faster, but not all of the *apply() functions are.

[1] [http://stackoverflow.com/questions/2275896/is-rs-apply-family-more-than-syntactic-sugar](http://stackoverflow.com/questions/2275896/is-rs-apply-family-more-than-syntactic-sugar)

~~~
mbq
The problem here is not the for loop itself but the time the R runtime spends
executing the mapped function, multiplied by the number of iterations (this,
by the way, is the main source of advantage for dynamic, GCed, but JITed
languages like JS or Julia).

------
capnrefsmmat
I'd like to see a detailed explanation of R's scoping. It's not just lexical
scoping; callees can deliberately manipulate the scope their arguments are
evaluated in, for example. So you can call a function and pass arguments that
are available in local scope, but the arguments are lazily evaluated, and the
callee might evaluate them in an entirely different scope.

Typically this is done for manipulating datasets. You might have a data frame
with columns Width and Height, and so you want to be able to call
do.stuff(Width, Height, data=foo), and have Width and Height automatically
taken from within foo. But sometimes it crops up in unexpected places.
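A base-R sketch of the do.stuff example above, using substitute() and lazy arguments so the callee evaluates Width and Height inside the data frame rather than the caller's scope:

```r
foo <- data.frame(Width = c(2, 3), Height = c(4, 5))

do.stuff <- function(w, h, data) {
  # The arguments arrive unevaluated; capture the expressions...
  w_expr <- substitute(w)
  h_expr <- substitute(h)
  # ...and evaluate them with 'data' as the scope.
  eval(w_expr, data) * eval(h_expr, data)
}

# Width and Height need not exist in the caller's environment:
do.stuff(Width, Height, data = foo)   # 8 15
```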

~~~
claytonjy
Are you familiar with Hadley's Advanced-R book? You can buy a hardcopy, but
it's free online: [http://adv-r.had.co.nz](http://adv-r.had.co.nz) There's a
section on lexical scoping, and lots of other non-basic stuff that is hard to
find covered elsewhere at all, much less well. From what I've seen, this is
absolutely the best reference for deep R stuff that exists.

~~~
capnrefsmmat
I've seen it but haven't read in depth. Now that I see the scoping section
I'll have to read through it.

~~~
claytonjy
Now that it's in print I suspect updates are less frequent, but until a few
months ago sections were being added and rewritten pretty frequently, so it
might have things now that it didn't when you looked last.

------
canjobear
Why no discussion of data frames? I find these to be the most useful aspect of
R, and the thing I miss most in other languages.
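For readers coming from other languages, a base-R sketch of what a data frame buys you out of the box (toy data):

```r
# Heterogeneous, named columns in one tabular object.
df <- data.frame(city = c("A", "A", "B"), temp = c(10, 14, 21))

df[df$temp > 12, ]                 # vectorized row filtering
aggregate(temp ~ city, df, mean)   # grouped summary: A -> 12, B -> 21
```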

~~~
eric_bullington
If you use Python, have you checked out pandas' DataFrames? It's not quite the
R experience, but pretty close, plus you get all the benefits of the Python
ecosystem.

------
elliott34
I love pandas much more than R but GOD I love Rstudio. Such a great IDE.
Rstudio server, actually. My equivalent is running an IPython notebook on an EC2
instance, which... is fine, but involves a lot of scrolling.

~~~
claytonjy
As an every-day R user but only-occasional Python user, every time I do a
Python project I spend some time looking for a comparable IDE. The closest I
found was Spyder, but random lockups made it unusable. Back to terminal +
IPython and Sublime. Sublime REPL + IPython doesn't cut it either.

What do you love about pandas: is it performance, syntax, access to other
Python modules? If performance, take a look at R's data.table package: almost
any manipulation can be done by reference.
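A sketch of what "by reference" means here, assuming the data.table package is installed:

```r
library(data.table)

dt <- data.table(x = 1:3)
dt[, y := x * 2]   # := adds the column in place; dt is not copied

# Contrast with base R, where df$y <- ... conceptually builds a new
# data frame and rebinds the name.
```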

~~~
elliott34
Now if we can only get sublime text 2 within an ipython notebook on the
cloud.....

For me, R was my first language, and then I learned Python, and beautiful
things like list comprehensions, and it just clicks with my brain a bit more.

In pandas, a group by operation is beautiful

dataset2 = dataset1.groupby([stuff], as_index=False).mean()

Same with pivot table...

When I did this with dplyr my workflow would be a few more steps, creating
the "summarise" object and so forth, which seems like more work.

------
sytelus
To me, writing tutorials for teaching R these days is like writing tutorials
for Fortran (and I'm sure Fortran still has some nice goodies not available
elsewhere). It misdirects people eager to learn something toward the wrong
thing. As you can see in this article, every third section of an R book or
tutorial is often dedicated to gotchas to deal with. We have IPython Notebook,
scikit-learn, NumPy, etc., and a massive number of R packages have already been
migrated. I hope there is little need for most newcomers to go to the trouble
of learning R.

~~~
minopret
In the Python portfolio I'd mention matplotlib.

Then Sage (sagemath.org) just dazzles me. It's a grand integrated environment
using Python with lots of math/stat software built in (including NumPy and R)
and lots more optional (including Matlab). You can just go see it and try it
at cloud.sagemath.com. If you like it, you can continue to use it there, or
you can download it - it's free open-source software.

------
nkurz
I've been looking at a variety of R packages, mostly for the purposes of
rewriting them in C++ for greater speed, and my assessment is that most of
them are of very low code quality. I don't mean that they don't work (they
usually do), or that they are too slow (they usually are, but this is
explained by selection bias, given the reasons I'm looking at them), but that
there is little standardization even within a given package, and the
'foundations' seem weak.

Variable names are a hodgepodge, from unhelpful single-letter abbreviations to
theSecondArgumentToTheFunction; functions alternate between camelCase, dots,
and underscores; and any form of architecture seems at best an afterthought.
It seems like the base language encourages this, or at least does nothing to
prevent it. It's commonplace to pick on Perl, but the overall quality of
popular packages seems considerably lower on CRAN than CPAN. Perhaps this is
because Perl is so conscious of its reputation at this point that the
remaining programmers take great pain to write clear code?

I feel like R is currently in the stage Perl and PHP were at just when the
internet started to explode. The first-to-market
CGI scripts and libraries, often written by domain expert non-programmers,
became the default choices which the rest of the infrastructure was built on.
At some point, the weight became too great for the shoddy[1] construction, and
most people moved on to languages with better attention to maintainability and
foundational detail (Python, Ruby).

Those who remained with the language evolved it in similar directions, by
replacing the earlier libraries with better designed ones and by setting a
higher standard for community norms. I'm not sure about PHP, but contrary to
reputation, modern Perl is often a really clean and consistent language. Julia
seems to be playing a parallel role for R, although the new-found strength of
Python in the data analysis space complicates the analogy.

But I wonder: is R undergoing (or about to undergo) a similar renaissance? Are
there already examples of "Modern R" out there to serve as templates for the
future direction of the language? Or is R happy where it is?

[1] Did you know that 'shoddy' was originally a legitimate but low grade of
wool, and wasn't necessarily pejorative?

~~~
jghn
I'll admit that, as someone who has a package on CRAN, has been using R since
~2001, and is a regular software developer in my day-to-day job, the
lack of standardization is something I'm guilty of.

For me, what happened was that my thoughts on appropriate naming, structure,
etc. have evolved over the 6 (I think?) years of the package's existence, but I
simply haven't had the time to make the wholesale changes necessary. It's on
my todo list, but frankly things like "fix actual bugs" have been sitting on
that list for a very long time as well.

In general though, I've always found that most packages are pretty crappy and
not just for code quality. With relatively few exceptions, what I
found over the years was that if you needed to do something, it was almost
always better to write it yourself than shoehorn someone else's junk
into your system. There's an exception with Bioconductor, particularly the
packages created and maintained by the core devs.

And on your point about the renaissance, yes I believe that has been
happening, largely driven by Hadley Wickham.

~~~
nkurz
Could you suggest some examples of packages using current best practices that
I could try to pattern mine after?

~~~
jghn
I should be clear that there's not yet a One True Way in terms of coding
standards and such, but things are improving.

Anyways, a good place to start would be the Hadleyverse:
[https://github.com/hadley](https://github.com/hadley)

One could do a _lot_ worse than following his lead.

------
chappi42
The language might be hitting 'peak R' right now.

'Badass' statistics packages, but R always felt a bit 'hacked together'. With
Julia, on the other hand, I get the impression that the developers in
charge have a deep understanding of programming languages and
computer science. It's (too) early days for Julia, but I wouldn't be surprised
if in two years many users will (partly) switch.

------
kyberias
When reading this article, I started to wonder whether it would be plausible
to create a REPL for, or a compiler from, some "real" programming language
(like, I don't know, C#, C++ or Python) to R, to utilize R's statistical
libraries without going insane. This might be a fun exercise as an LLVM
backend. :)

------
aagha
I'm curious for others thoughts on what to use for complex statistics if you
needed high performance/speed.

Running an R script on a server to process data isn't efficient, but does that
mean you have to roll your own stats package if you want to have a Java (for
example) back-end?

~~~
scroy
Not sure how complex your use case is, but I've found Pandas (on Python) to be
just as powerful and much more performant than R for working with scientific
data. It's built on NumPy, so you can use SciPy's statistical functions with it
seamlessly.

~~~
hadley
Recent benchmarks show the performance of pandas, data.table and dplyr to be
pretty similar, with data.table usually being the fastest.

------
lottin
For me the main problem with R is that I find it hard to break lines nicely. I
tend to end up with massive one-liners, and no matter where I break them, it
never looks quite right.

------
daveloyall
TL;DR: Tragically, R has a lot in common with old-skool PHP and MS Excel, at
the same time.

~~~
kyberias
I don't know why exactly you're getting downvoted. I've tried learning R many
times, but the resemblance to PHP problems is just too much.

~~~
daveloyall
Probably there aren't legitimate semantic or syntactic similarities between
the three.

But for purposes of a TL;DR, I believe a qualitative description of the
situation should suffice.

I'll try again:

TL;DR: The language R lacks the quality without a name.

Or:

Jeeze, now I know why none of my previous attempts to learn a little R were
fruitful.

Or:

The language R seems to have been developed in isolation and thus it fails to
adhere to any particular convention--it is its own beast. Further, it
sometimes lacks self-referential integrity and coherence.

Or:

haha cf. PHP or Excel.

:)

~~~
nkurz
I voted you up here because you were below zero and I was also about to post a
comment comparing R to PHP, but I'll mention that I frequently downvote one
line posts that start with "TL;DR". The concept of "too long; didn't read"
implies (to me, and probably others) that the article isn't worth
reading. If I think the article is worth reading, and the short comment isn't
incredibly insightful, I'd usually prefer such comments to be at the bottom of
the page and grayed out.

~~~
daveloyall
Oh, I hadn't considered "too long, DON'T read". Hm. This article is worth
reading if you're interested in R but not yet well acquainted with it.

------
xname
One thing I don't like about R is the OOP part of R.

I don't see any reason for R to include OOP in its design, and sometimes it
just creates confusion.

~~~
jghn
R didn't include OOP in its design; it was all grafted on later, and there
are multiple systems.

There's the S3 OOP system, the S4 OOP system, reference classes, and at least
one add-on package on CRAN which does something different.

So which one are you complaining about? :)
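A base-R sketch contrasting two of those systems (class and function names hypothetical):

```r
# S3: a class is just an attribute; dispatch is by naming convention.
circle <- structure(list(r = 2), class = "circle")
area <- function(shape) UseMethod("area")
area.circle <- function(shape) pi * shape$r^2
area(circle)                       # 12.566...

# S4: formal class and method definitions via the methods package.
setClass("Circle", representation(r = "numeric"))
setGeneric("area4", function(shape) standardGeneric("area4"))
setMethod("area4", "Circle", function(shape) pi * shape@r^2)
area4(new("Circle", r = 2))        # same answer, very different machinery
```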

------
fallat
Haskell is better than R, for both programmers and data analysts.

~~~
michaeltoth
It's going to be harder to do just about anything related to data analysis in
Haskell than in R. In R I can load a dataset, do some formatting, and produce
a well designed plot in fewer than 5 lines of code. Haskell might run faster,
and it's certainly more versatile, but for data analysis it isn't even
comparable.
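The "fewer than 5 lines" claim holds even in base R, without ggplot2; `mtcars` ships with R.

```r
data(mtcars)                                   # load a dataset
mtcars$cyl <- factor(mtcars$cyl)               # a little formatting
boxplot(mpg ~ cyl, data = mtcars,              # a presentable plot
        xlab = "Cylinders", ylab = "Miles per gallon",
        main = "Fuel economy by cylinder count")
```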

