
The Grammar of Data Science - stared
http://technology.stitchfix.com/blog/2015/03/17/grammar-of-data-science/
======
stared
Speaking as a Pythonist (but one who is in love with ggplot2 and dplyr), the
wonderful thing about IPython Notebook is that it's possible to inline R code
with no more fuss than adding "%%R" in a cell:

[http://nbviewer.ipython.org/github/davidrpugh/cookbook-code/...](http://nbviewer.ipython.org/github/davidrpugh/cookbook-code/blob/master/notebooks/chapter07_stats/08_r.ipynb)

BTW, for a pandas-dplyr dictionary:
[http://nbviewer.ipython.org/gist/TomAugspurger/6e052140eaa5f...](http://nbviewer.ipython.org/gist/TomAugspurger/6e052140eaa5fdb6e8c0)

------
j2kun
> The language is byzantine and weird

This is my biggest beef with R. It is constantly changing the dimensions and
types of your data without telling you. Want to grab some subset of the rows
of a matrix? Better add some extra post-processing in case there's only one
row that satisfies your query, or else R will change its type!

The solution is not to make the programmer memorize obscure edge cases.

~~~
jghn
You know that you can tell it not to do that, right? drop=FALSE

~~~
j2kun
I think this only reinforces my point: this is a ridiculous default. But that
is news to me :)

~~~
andy_wrote
In light of this issue, dplyr's tbl_df structure (a light but helpful wrapper
around data.frame) actually has different drop defaults, for example

    
    
      > x <- data.frame(foo=1:5, bar=1:5, baz=1:5)
      > dim(x[,'foo'])
      NULL
      > dim(x[,c('foo','bar')])
      [1] 5 2
      > dim(x[,'foo',drop=FALSE])
      [1] 5 1
    

compared to

    
    
      > x <- dplyr::data_frame(foo=1:5, bar=1:5, baz=1:5)
      > dim(x[,'foo'])
      [1] 5 1
    

Although I think these are more reasonable (I've got multiple commits at work
with messages bemoaning drop=FALSE), this can ironically also mess you up if
you got used to the old defaults :)
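
For Python readers, the shape-changing behavior discussed in this subthread can be imitated with a toy function. This is only a rough sketch in plain Python, not R or pandas code: `select_rows`, its `keep` argument, and its `drop` flag are hypothetical names made up for illustration, and the function only mimics the row-subsetting case.

```python
def select_rows(matrix, keep, drop=True):
    """Return the rows of `matrix` (a list of lists) for which keep(row) is true.

    With drop=True, a single matching row comes back as a flat list instead
    of a one-row matrix, imitating the default complained about above.
    With drop=False, the result is always a list of rows.
    """
    rows = [row for row in matrix if keep(row)]
    if drop and len(rows) == 1:
        return rows[0]  # the result silently changes from 2-D to 1-D
    return rows


m = [[1, 10], [2, 20], [3, 30]]

many = select_rows(m, lambda row: row[0] >= 2)      # [[2, 20], [3, 30]]
one = select_rows(m, lambda row: row[0] >= 3)       # [3, 30]
one_kept = select_rows(m, lambda row: row[0] >= 3, drop=False)  # [[3, 30]]

print(many, one, one_kept)
```

Code that later does `len(result)` or loops over rows behaves differently for `one` than for `many`, which is the edge case that needs the extra post-processing unless the caller remembers `drop=False`.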

------
canjobear
I am torn, because Hadley Wickham's tools are truly wonderful, but the
underlying R language is such a mess. For example, R has lazy evaluation
despite being an imperative stateful language.

I wish Hadley had developed these tools for some other language, such as
Python, or in a language-agnostic way. Hopefully that is the direction things
will go in the future.

~~~
x0x0
It's python that's really a mess for data science; you can't avoid it being a
programming language first and a tool for data science a distant tenth. Syntax
that only a programmer would like is necessary, and quite a bit of it at that.
R is a much better fit for people who want to do statistics first, and as
little programming as possible.

Things like function parameters being promises make it far easier to deal with
functions like optimizers where there really are 10+ tuning parameters or
things you may want to tweak. Iterative languages are far easier to understand
for people who don't want to be programmers.

You cannot develop plyr or ggplot in a language-agnostic way, because they
need the purpose-built syntax R has. Contrast that with, e.g., the fight in
python to get an infix matrix multiplication operator.

~~~
j2kun
I don't see how not having innate language support for an infix matrix
multiplication matters. In R all "infix" operators are really functions
anyway[1] so you could write your library that way. Alternatively, you could
use python's overloading for infix operators. (Also, since when was Python
considered not iterative? And for that matter, doesn't R's widespread use of
*apply make it more functional anyway?)

[1]: Here is a simple example of that for +, note the very strange overloading
of quotes.

    
    
        R version 2.15.2
        > "+"(7,5)
        [1] 12
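
A rough Python counterpart, as a sketch: the standard library's `operator.add` is the function spelling of `+`, and a class can hook into the infix form by defining `__add__`. The `Vec2` class below is a hypothetical type invented only for this illustration.

```python
import operator
from dataclasses import dataclass

# Function spelling of +, roughly analogous to R's "+"(7, 5).
total = operator.add(7, 5)


@dataclass(frozen=True)
class Vec2:
    """Hypothetical 2-D vector, used only to illustrate overloading."""
    x: float
    y: float

    def __add__(self, other):
        # Python calls this method to evaluate `self + other`.
        return Vec2(self.x + other.x, self.y + other.y)


v = Vec2(1, 2) + Vec2(3, 4)                # infix spelling
w = operator.add(Vec2(1, 2), Vec2(3, 4))   # function spelling, same result

print(total, v, w)
```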

~~~
x0x0
I well understand R's operators, but why on earth is that relevant?

math:

    
    
       S = ( H β − r )^T * ( H V H^T )^{-1} * ( H β − r )
    

python, ugly mess

    
    
       S = (H.dot(beta) - r).T.dot(inv(H.dot(V).dot(H.T))).dot(H.dot(beta) - r)
    

python, better: (although @ is an ugly matrix operator)

    
    
       S = (H @ beta - r).T @ inv(H @ V @ H.T) @ (H @ beta - r)
    

The latter is an order of magnitude easier to understand, and looks just like
the math. Having one layer of indirection (math to code) is far better than
two (math to code to obfuscated code) because you won't create infix operators.

Edit: examples stolen from the matrix operator PEP (PEP 465)

~~~
j2kun
The problem is that you want to do everything in one line. The proper way to
do this would be to save things like H.dot(beta) - r in a separate variable
and compute it just once.

Moreover, if it's hidden in a library then why does the user care if it's
ugly? It's the library designer's job to test it and make sure it's right.
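
The three spellings argued over in this exchange (chained `.dot` calls, the `@` operator, and a named intermediate) can be compared side by side. This is a minimal sketch with no NumPy dependency: the `Matrix` class and the `inv` helper are hypothetical stand-ins written for this example, `inv` handles only 2x2 matrices, and the numbers are made up.

```python
from fractions import Fraction


class Matrix:
    """Hypothetical minimal matrix type; a stand-in for a NumPy array."""

    def __init__(self, rows):
        self.rows = [[Fraction(x) for x in row] for row in rows]

    @property
    def T(self):
        # Transpose: rows become columns.
        return Matrix([list(col) for col in zip(*self.rows)])

    def dot(self, other):
        # Method-call spelling of matrix multiplication.
        cols = list(zip(*other.rows))
        return Matrix([[sum(a * b for a, b in zip(row, col)) for col in cols]
                       for row in self.rows])

    # Defining __matmul__ is what allows the infix form `A @ B` (Python 3.5+).
    __matmul__ = dot

    def __sub__(self, other):
        return Matrix([[a - b for a, b in zip(r1, r2)]
                       for r1, r2 in zip(self.rows, other.rows)])


def inv(m):
    """Inverse of a 2x2 matrix only, which is all this sketch needs."""
    (a, b), (c, d) = m.rows
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return Matrix([[d / det, -b / det], [-c / det, a / det]])


# Made-up inputs: H is 2x3, beta is 3x1, r is 2x1, V is 3x3.
H = Matrix([[1, 0, 0], [0, 1, 0]])
beta = Matrix([[3], [5], [7]])
r = Matrix([[1], [1]])
V = Matrix([[2, 0, 0], [0, 4, 0], [0, 0, 1]])

# 1. Chained method calls.
S_dot = (H.dot(beta) - r).T.dot(inv(H.dot(V).dot(H.T))).dot(H.dot(beta) - r)

# 2. Infix operator.
S_infix = (H @ beta - r).T @ inv(H @ V @ H.T) @ (H @ beta - r)

# 3. Infix operator with the residual computed once and named.
resid = H @ beta - r
S_named = resid.T @ inv(H @ V @ H.T) @ resid

print(S_dot.rows, S_infix.rows, S_named.rows)
```

All three produce the same 1x1 result; the disagreement in the thread is only about which spelling is easiest to read and check against the formula.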

~~~
x0x0
No, I simply want my math to look like math instead of a bunch of code that,
after careful reading and some notes, implements some math. It's not a library
function, it's the code I write. R, matlab, julia all allow users to write
something that looks very close to the actual math. Python doesn't see that as
a priority.

------
andy_wrote
Wickham et al.'s R packages are great, especially dplyr, and I think they
should be taught to new R users pretty much right off the bat. I find R's
syntax to be a big hangup for new learners, especially on indexing and
apply-to-each (sapply, mapply, just plain apply...), but dplyr really makes
life much easier.

The %>% operator alone (which to be fair was originally from magrittr) is a
great help. Not sure if this is just my personal bias, but I always find it
easier to read calls chained postfix-style.

~~~
craigching
I'm actually reviewing a book due out this summer called "Data Computing" that
introduces the "Hadley stack" as the way of getting started in data analysis
and statistics. It's by a professor here in Minneapolis at Macalester College.

I agree with you about dplyr + ggplot and was pretty much gobsmacked at the
obviousness of "this is the way it should be taught" and am glad I'm in the
position to help review such a text!

I wonder if eventually this is the future of standard R.

~~~
hudibras
The book sounds great; can you give any more details?

------
chillacy
It's often a tradeoff between conciseness in one domain and generality in
others. A similar story: Matlab is great for doing math and plotting, but I
hated my life when I was developing a GUI in it. I later ported this project
to Python, which was great for the GUI (relatively so) and a little less
concise for the math. I find that tradeoff to be okay.

------
gweinberg
I don't understand why in this example python allows carat weight to go below
zero, but surely that's not a problem with python as such.

