

The Zen of R - mariuskempe
https://gist.github.com/2053607

======
steve19
I love R, but it can be frustrating to code in and to learn. Some of its
datatypes are immensely powerful but work in mysterious ways. It allows you to
manipulate expressions in a LISP-macro-like fashion which, when used badly by
library authors, can make many things appear magical (and inconsistent). There
are many inconsistencies in the standard library because of the different
programming paradigms in use (for example, (s|m|t|r)apply() vs. Map(), and
filtering with df$col[] vs. subset() vs. Filter()).
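
For instance, the same two tasks can each be written several ways (a toy
sketch with a made-up vector, just to show the parallel idioms):

```r
v <- c(-2, 1, 0, 3)

# Element-wise application, two paradigms:
sapply(v, function(x) x^2)       # apply-family style
unlist(Map(function(x) x^2, v))  # Map returns a list; unlist flattens it

# Keeping the positive values, three idioms:
v[v > 0]                       # logical indexing
subset(v, v > 0)               # subset() also works on plain vectors
Filter(function(x) x > 0, v)   # higher-order functional style
```

All three filtering forms return the same vector; which one you see depends
mostly on which paradigm the author learned first.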

Yet I love writing code in it. So much can be done in so little code. I am
always amazed at how little code I write to accomplish a task.

The RStudio IDE ( <http://rstudio.org/> ) is a very pleasant environment to
write code in.

~~~
disgruntledphd2
Map and subset would be discouraged; they are just wrappers around the apply
functions and [] anyway.

~~~
wildanimal
I wouldn't discourage them unequivocally - for instance, 'subset' is an idiom
often used in the context of data/relational tables. That matters a little
(enough to justify a small penalty in raw performance), I would think.

------
jacobolus
In some ways this is better, and in others it is much worse. Using named
constants instead of magic numbers, and breaking up complex logic into simpler
named parts is a huge win for readability/maintainability, as any programmer
quickly learns. The big one liner would be a lot nicer in about 3-4 chunks.

I’d rewrite this example as something like:

    
    
      num.steps <- 1000
      num.walks <- 100
      step.std.dev <- 0.03
      start.value <- 15
    
      rand.row <- function() rnorm(num.steps, 1, step.std.dev)
      walk <- function () cumprod(rand.row()) * start.value
      all.walks <- t(replicate(num.walks, walk()))
      plot(colMeans(all.walks), type = "l")
    

[I’m not an R guy, so that might not be the most typical style ever, but you
get the idea...]

~~~
jacobolus
Aside: One of the things that I really don’t like about R coming from a
background in other programming languages is that a function is evaluated
before its arguments; i.e. an expression passed as a parameter to a function
is not a value but is an implicit lambda, to be evaluated multiple times
somewhere inside the function.

For example:

    
    
      replicate(2, rnorm(2))
    

creates a 2x2 matrix of independent random values. Whereas:

    
    
      x <- rnorm(2)
      replicate(2, x)
    

Instead creates a 2x2 matrix with a single random value for each row, repeated
across all columns. And so if you want to decompose it, you need to do:

    
    
      x <- function() rnorm(2)
      replicate(2, x())
    

Which will re-call x twice inside the replicate function.

~~~
bedatadriven
R does have call-by-need semantics, like (an impure) Haskell, but that's not
why replicate's second argument gets evaluated multiple times:

    
    
      replicate <- function(n, expr, simplify = TRUE)
          sapply(integer(n),
                 eval.parent(substitute(function(...) expr)),
                 simplify = simplify)
    

The replicate function uses R's compute-on-the-language to construct a _new_
function that has expr as its body.

A closure's argument will be evaluated _at most_ once.
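
A quick sketch of the distinction (hypothetical counters, not from the
original thread): an ordinary argument is a promise, forced once and cached,
while replicate's rebuilt function body re-evaluates every iteration.

```r
# A promise is evaluated at most once, even if the parameter is used twice:
count <- 0
double <- function(x) x + x                 # uses x twice
res <- double({ count <<- count + 1; 21 })
res    # 42
count  # 1 -- the side effect fired once; the result was cached

# By contrast, replicate() substitutes the expression into a new function
# body, so it re-evaluates on every iteration:
count2 <- 0
replicate(3, { count2 <<- count2 + 1; 0 })
count2  # 3
```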

------
svdad
I've often felt that there is a world of terse, clear code waiting to be
unlocked in R, but haven't been able to find it myself. This is a great
example.

------
danielharan
Had the same experience recently in Octave. Those moments of clarity are a big
part of why I write software.

------
Estragon
Slightly OT: I'd be really interested to hear a list of specific advantages to
using R over using scipy. I have been holding off on learning R for years
because scipy has served my purposes pretty well so far, but sometimes I
wonder what I'm missing.

~~~
jacobolus
The extent of my experience with R is a couple of somewhat introductory
statistics courses, so I’m probably not the best person to answer.

But I like dealing with numpy/scipy _much_ , _much_ more than R. Python as a
language is I think much better designed, and numpy is a really nice tool for
interacting with multidimensional arrays. When I write Python code, or read
Python code written by anyone competent, I find the program’s intent very easy
to follow. Most of the R code I’ve seen “in the wild” is kind of a mess,
because it is written by non-programmers many of whom have little experience
or concept of code style. Additionally, as soon as a program has to do
anything other than statistical analysis (examples: text munging, internet
scraping, network communication, dealing with file formats, user interaction,
etc.) Python is miles ahead.

The big advantages of R that I saw: (1) it has become the tool of choice in
the academic statistics community, meaning that there is quite a lot of
existing code for doing various sophisticated things, some of which you might
have to implement yourself in Python, (2) it has some really nice graphing
tools, (3) there seemed to be a few examples where a particular few lines of R
code were more compact and clearer than the equivalent Python (can’t think of
anything off-hand though).

------
choffstein
Zen? This is madness!

In my mind, Matlab's (Octave's) array based programming makes sense. This?
This does nothing that I expected it to do!

replicate seems to pretty randomly take a function. Is that an R thing? I
think what confuses me most is that there is nothing about this syntax that
tells me that "cumprod(rnorm(1000, 1, 0.03))" hasn't already been evaluated!
I could not, for the life of me, figure out why replicate didn't just create
100 exact copies. Why does replicate(...) evaluate, but its internals don't?
This is driving me crazy!

------
radikalus
This is slick...but:

How often do you want to generate random walks of this type where the variance
of the process isn't dependent on its current level?

Observe that, as you increase the standard deviation of the random normal (to
even .1), your "random walk" always walks to zero.

I don't mean to be a beady-eyed pterodactyl but, as cool as clever one-liners
sometimes are, they often solve toy problems. I say this as someone who loves
R, uses it every day, and is constantly forced to brute-force with ugly
inlined Rcpp code. (Which is fine.)

~~~
mariuskempe
That's an interesting point. In fact, the expectation of the random walk is
always the same as its starting value, because, although most of the walks go
to 0 like you observed, there are very occasionally walks that drift upward to
astronomical values.
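
A quick numerical check of that claim (made-up parameters, and fewer steps
than the original one-liner, just for speed): each step is independent with
mean 1, so the expectation of the product stays at the starting value, while
the median drifts below it.

```r
set.seed(1)
start.value <- 15
# Final value of 20,000 multiplicative walks of 50 steps each:
finals <- replicate(2e4, prod(rnorm(50, 1, 0.03)) * start.value)
mean(finals)    # close to 15: the expectation is preserved
median(finals)  # below 15: most walks drift down, a few drift far up
```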

As for the usefulness of this kind of walk: the process we're modelling is an
evolutionary one, where the rate of change is fixed (in this case within the
species) and we'd like to detect 'random' (non-selected) evolutionary paths by
comparing simulations to historical data.

------
psb217
In Matlab, where one is generally compelled (for better or worse) to think in
terms of array operations, the obvious code would be:

plot(mean(cumprod(1 + 0.03 .* randn(100, 1000), 2) .* 15))

Well, that is assuming one wants a line plot of the mean value of the
"location" at each time point across the population of walks. Personally, I
find R's rather idiosyncratic approaches to data handling and function
wrangling a bit hard to digest.
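
For comparison, roughly the same array-first shape can be written in R too (a
sketch with the same made-up dimensions; R's cumprod has no dimension
argument, so apply() supplies the per-row scan):

```r
# Generate all 100 x 1000 steps at once, mean 1 and sd 0.03 as in the
# original one-liner, then take cumulative products along each row:
steps <- matrix(rnorm(100 * 1000, mean = 1, sd = 0.03), nrow = 100)
walks <- t(apply(steps, 1, cumprod)) * 15   # 100 walks x 1000 time points
plot(colMeans(walks), type = "l")           # mean location at each time point
```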

------
mbq
Most of this Zen comes from the array paradigm, not the functional one; still,
it is nice to see another soul on the path to Renlightenment (-;

~~~
psykotic
That's a matter of perspective. Folds and scans exist for all regular data
types, not only arrays, so I consider this a part of functional programming
and type theory.

Here he is computing the history of states of a random walk as scan (*)
initial updates. If you wanted to simulate branching random walks, you'd work
over the data type of multiway trees instead of arrays, but the algorithm to
compute the history of intermediate states would otherwise be identical, using
the scan function for multiway trees.
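
In R terms, cumprod is exactly that scan of (*); a small sketch making the
initial value explicit with Reduce (toy step values):

```r
steps <- c(1.02, 0.97, 1.05)
# "scan (*) initial updates": the full history of states, starting from 15.
history <- Reduce(`*`, steps, init = 15, accumulate = TRUE)
history
# 15.00000 15.30000 14.84100 15.58305
# cumprod gives the same path, minus the initial point:
cumprod(steps) * 15
```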

------
swah
Those moments... it's been a while since I've had one of those. (Must open
compiler book.)

------
the_cat_kittles
How did that take a couple of months? Even with R's slightly awkward syntax,
random walks shouldn't be very difficult! I'm baffled; I must be missing
something.

~~~
snprbob86
Writing the code did not take him several months; that's just how long he
used it. During that time, he tweaked it for evolving requirements. What took
a couple of months was the path to his epiphany.

