
One Page R: A Survival Guide to Data Science with R - sytelus
http://onepager.togaware.com/
======
Permit
I was surprised at the difficulties I had using R at my previous job. It's
simultaneously a language I enjoyed and one I'd hesitate to recommend to a
friend, because of the flood of complaints I'd have directed at me as they
learned it.

I love the concept of vectorized operations, but God almighty did I struggle
with the difference between indexing a list with list[1] vs list[[1]]. Other
hangups included the differences between lapply, sapply, and apply, and the
difficulty of applying operations to the rows of a data frame. Other
frustrations, like the inconsistent naming conventions, are apparently the
result of legacy code.
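For anyone hitting the same wall, a minimal illustration of both hangups, using toy data (none of it from the thread):

```r
x <- list(a = 1:3, b = "hello")

x[1]    # single brackets return a *list* containing the selected element
x[[1]]  # double brackets return the element itself (here, the vector 1:3)

class(x[1])    # "list"
class(x[[1]])  # "integer"

# lapply always returns a list; sapply tries to simplify the result
lapply(1:3, function(i) i^2)  # list(1, 4, 9)
sapply(1:3, function(i) i^2)  # c(1, 4, 9)

# apply works over margins of a matrix: 1 = rows, 2 = columns.
# Applied to a data frame's rows, it first coerces everything to a single
# type, which is part of why row-wise operations feel so awkward.
m <- matrix(1:6, nrow = 2)
apply(m, 1, sum)  # row sums: c(9, 12)
```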

I'd strongly recommend the book The R Inferno[1] by Patrick Burns. It eases
you through the nine circles of R hell that most newcomers seem to endure.

------
pvnick
Does anybody with experience doing analysis in both R and python have any
insight as to whether one can replace the other, or do they significantly
supplement each other? I get the impression that python with pandas/scikit-
learn/scipy/numpy/matplotlib can be used almost completely in lieu of both R
and matlab, but I don't feel that I have enough experience in R or matlab to
make such a claim.

~~~
chubot
I am a pretty die hard Python person, having used it as my main language for
over 10 years. I think there is a tendency to make that claim because we want
it to be true :)

Python is such a nice language that we want to use it for everything. But I
found I became more productive once I got out of that mindset. Right now, I
have a very multilingual workflow including Python, R, shell scripts, C++, and
a handful of other DSLs (various sql dialects, html templating, etc.)

I feel like I am finally productive working with data; I was always astounded
by how MUCH code you have to write to work with data. There is a pretty big
cost to learning all of that, and I won't lie and say I learned R quickly, but
it was worth it.

I use Python to interface with MANY systems, often to generate cleanly
formatted CSV files. R reads the CSV files and does data slicing and dicing
(using the data.table package, an enhanced data frame). And then ggplot2 for
plotting. And C++ for the big data stuff, and shell scripts to glue it all
together (with concurrent processes, importantly).
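A sketch of the R end of that pipeline; the file name and columns are made up for illustration, and the writeLines call stands in for the CSV an upstream Python step would emit:

```r
library(data.table)  # fread() and the enhanced data frame
library(ggplot2)

# Stand-in for a CSV produced by an upstream Python step (hypothetical data)
writeLines("latency_ms\n42\n120\n87\n310", "cleaned.csv")

dt <- fread("cleaned.csv")     # data.table's fast CSV reader

# Slice and dice with data.table syntax: filter rows in the first argument
slow <- dt[latency_ms > 100]

# Then hand off to ggplot2 for plotting
p <- ggplot(slow, aes(x = latency_ms)) + geom_histogram(bins = 30)
```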

I find that this Unix-style architecture is actually more maintainable in the
long run. Parts of your analysis and data cleaning become more modular and
reusable. There are a lot of reasons why a boatload of Python packages aren't
very cleanly reusable. You also end up with MUCH less code using the multi-
language strategy vs. trying to do it all in one language (I've seen people
try to do that both with Python and R).

ggplot2 is unrivaled; Python and Julia people are busily trying to copy it.
Python and Julia are both also copying R's data frame structure. The area is
moving extremely quickly now, so I'll be interested to see what progress they
make. But there are actually areas where R the language is better suited to
data analysis (it has more of a Scheme core than Python, more appropriate data
structures, lazy evaluation of function arguments, etc.)
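On the lazy-evaluation point, a two-line illustration: R function arguments are promises, evaluated only when they are actually used.

```r
f <- function(x, y) x          # y is never touched in the body
f(1, stop("never evaluated"))  # returns 1; the stop() call is never run
```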

In the end I am using R for the same reason I chose to use Python over C/C++
-- because it enables you to write 5x less code (in the domain of data
analysis, compared with Python). And once you use other tools, you see
Python's flaws, like bad package management (R's is actually better), and a
relatively bad REPL. And Python's C++-inspired class system is often a
hindrance for data analysis; R is a more data-oriented language.

R definitely has its problems, including a lot of horrible R code out there.
But it also has a lot more books and so forth. IMO data.table + ggplot2 alone
make it worth it. I recommend reading Hadley Wickham's papers on "tidy data"
and the "split-apply-combine" style of analysis. In that sense, learning R
actually helps you learn how to do data analysis properly.
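The split-apply-combine pattern those papers describe comes out as a single expression in data.table (toy data, made up for illustration):

```r
library(data.table)

# "Tidy" data: one row per observation
dt <- data.table(
  group = c("a", "a", "b", "b", "b"),
  value = c(1, 3, 2, 4, 6)
)

# Split the rows by group, apply mean() and a count (.N), and combine the
# results into a new table -- all via data.table's DT[i, j, by] form
dt[, .(mean_value = mean(value), n = .N), by = group]
# group a: mean 2, n 2; group b: mean 4, n 3
```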

~~~
craigching
Thanks for the "tidy data" reference, getting that now. Have you explored
dplyr yet, since you mentioned "split-apply-combine" and you like data.tables?

~~~
chubot
I'm familiar with dplyr but haven't really used it. I prefer using stuff
that's mature, and I'm sure that dplyr is going to undergo a lot of evolution
in its early days.

I independently came to the same conclusion as this guy:
http://blog.datascienceretreat.com/post/69789735503/r-the-good-parts

That is, "use everything from Hadley, except use data.table instead of plyr".
This was before dplyr came out, so maybe that could change. But I kind of like
the syntax of data.table, although I don't understand all of it.

Of course plyr is ridiculously slow and can't be used for even modest-sized
data sets.

~~~
craigching
> Of course plyr is ridiculously slow and can't be used for even modest-sized
> data sets.

Right and that's what dplyr is supposed to fix. The benchmarks so far mark it
at roughly the same performance as data.table. But, as you said, it's early
days for dplyr ;) Thanks for your comments!

------
craigching
Wow, this is a really nice resource, especially for someone like me just
getting started. The PDFs appear, at first glance, to be quite in-depth.
Wondering if they include using tools like plyr/dplyr.

------
austinl
This is definitely a good starting resource. Some of the tutorials on R can be
a bit too "academic" for some tastes, so it's helpful to have a guide on
working with specific data sets.

