
Starting data analysis with R: Things I wish I'd been told - ingve
http://reganmian.net/blog/2014/10/14/starting-data-analysiswrangling-with-r-things-i-wish-id-been-told/
======
scishop
Things that are likely to trip up new R programmers:

* Function arguments are effectively passed by value: if you modify an argument inside a function, R copies the object, and the caller's version is unchanged.

* Function arguments are lazily evaluated: an argument expression isn't evaluated until (and unless) the function actually uses it.

* Watch out for automatic factor conversion when importing data. R will display your string data as text, but behind the scenes it stores it as a factor: integer codes plus a table of levels.

* R is slow. Really, really slow. All your intensive calculations should be handled by libraries written in C, Fortran or some other compiled language. Your R code should mostly be for gluing things together.
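The first three gotchas are easy to demonstrate at the console. A small sketch (note that since R 4.0 `data.frame` defaults to `stringsAsFactors = FALSE`, so the factor trap mostly bites with older code or an explicit `TRUE`):

```r
# Copy-on-modify: modifying an argument does not touch the caller's object
f <- function(x) { x[1] <- 99; x }
v <- c(1, 2, 3)
f(v)       # 99  2  3
v          #  1  2  3  -- unchanged

# Lazy evaluation: an argument is only evaluated if it is actually used
g <- function(a, b) a
g(1, stop("never evaluated"))  # returns 1, no error raised

# Factor trap: strings stored as integer codes behind the scenes
df <- data.frame(id = c("10", "20"), stringsAsFactors = TRUE)
as.numeric(df$id)                # 1 2   -- the factor codes, not the numbers
as.numeric(as.character(df$id))  # 10 20 -- convert via character first
```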

~~~
bachmeier
In my experience teaching R, the single biggest issue is that it is a dynamic
language that is in love with silent casting [or really returning objects with
unexpected types]. Although other bugs might be more common, none will
frustrate the user more than the type of a variable changing without warning,
and most of the time the error message refers to something else.

A couple of examples are stripping the time series attributes of a ts object
and default conversion of a row of a matrix to a vector. These are the cases
where they come to my office and say they have no idea what's going on. After
using R for a decade, I know the language well enough that these are about the
only errors I get.
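Both cases are quick to reproduce at the console:

```r
# A matrix row silently becomes a plain vector...
m <- matrix(1:6, nrow = 2)
dim(m[1, ])                 # NULL -- the matrix dimensions are gone
dim(m[1, , drop = FALSE])   # 1 3  -- drop = FALSE keeps it a matrix

# ...and subsetting a ts object strips its time-series attributes
x <- ts(1:12, frequency = 12)
class(x)          # "ts"
class(x[x > 3])   # "integer" -- no longer a time series
```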

------
stdbrouw
Also, read "R language for programmers"
[http://www.johndcook.com/blog/r_language_for_programmers/](http://www.johndcook.com/blog/r_language_for_programmers/)

It explains a lot of the quirks.

~~~
hadley
Or try "Advanced R", [http://adv-r.had.co.nz](http://adv-r.had.co.nz). It's my
attempt to explain the language in such a way that programmers from other
languages can appreciate the beauty and elegance behind the quirkiness.

~~~
earino
As a programmer, I can enthusiastically endorse this book. R "as a language"
is hard to grok when you come to it with the mentality of "a programmer", due
to its history as an interactive environment for data analysis and an
interface to algorithms first, and a programming environment second.

Hadley's book is a solid introduction when you are trying to map concepts from
other, more traditional, programming environments to the R way of thinking.

~~~
jghn
R "as a language" is _much_ easier if you don't come in with the standard
C-ish language blinders on.

~~~
earino
There was a great keynote at this last useR! conference by John Chambers about
how R's predecessor, S, was designed as an interface language
([https://www.youtube.com/watch?v=_hcpuRB5nGs](https://www.youtube.com/watch?v=_hcpuRB5nGs)).
He discusses how, as part of its fundamental cognitive model, it was there to
provide a facile interactive "interface" to best-of-breed algorithms. Very few
other programming languages have that pedigree; most were designed to
architect systems.

------
spo81rty
I just recently started playing with R via Azure's new Machine Learning
service. From what I have seen, I am really impressed. It helps with the inputs
and outputs of the data and makes it easy to turn your process into a web
service that can be easily consumed. I had planned on setting up some Linux
boxes to host R, but now I can just use Azure ML and not have to jack with it.

------
leeber
He forgot the last one: "use Python instead"

That was my attempt at a joke. But seriously, I love Python when it comes to
manipulating data and doing anything statistical. You've got numpy, scipy,
scikit-learn, etc.

~~~
Gatsky
How do you feel about reproducible computing in python? R is very well set up
to A) get it running on any platform easily B) report the crucial parts of the
environment. I know that if I grab someone else's (published) code written in
R, I'm pretty confident I can make it work. Part of this is the great package
management through CRAN or Bioconductor, and also because often important
reference data for bioinformatics is actually available through the package
manager.
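For reference, the usual way to report "the crucial parts of the environment" in R is `sessionInfo()`, which is often appended to published analyses:

```r
# Prints the R version, OS/platform, locale, and the exact versions of
# all attached and loaded packages
sessionInfo()
```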

I haven't done much with Python, but I don't quite get the same feeling (happy
to be told that the reality is otherwise!). For example, the opening line of
the installation guide for Pandas doesn't inspire great confidence in me: "The
easiest way for the majority of users to install pandas is to install it as
part of the Anaconda distribution, a cross platform distribution for data
analysis and scientific computing."[1] Do I really need to install the HDF5
package so I can split a concatenated variable into two columns??

[1] [http://pandas.pydata.org/pandas-docs/stable/install.html](http://pandas.pydata.org/pandas-docs/stable/install.html)

~~~
vegabook
Python's pip is pretty good, though not quite as polished as CRAN. I have had
few problems running complex code from third-party sources, though one always
has to be aware of the Python 2 vs 3 "problem" (though it is diminishing now,
with most things available on 3). If you get pip up and running on a new
Python installation you can avoid Anaconda/Canopy if you want a clean
installation, and I have installed fairly complex Python setups in multiple
locations without too much trouble.

Let's be fair, R can also be tough if it calls a lot of third-party libraries.
Just try to get rJava working properly, for example, if the local R and Java
installations are not both 32- or 64-bit. It can be a complete mess to
disentangle this sort of stuff in R. Or, for example, running code that uses
Cairo on a Mac. My experience is that Python's poor package-management
reputation is not really deserved anymore.

Python's virtualenv also allows you to hermetically seal away an entire Python
environment, including its libraries, so that it will not conflict with other
Python environments that might have different versions of the interpreter
and/or libraries. I am not aware of anything this robust in R.

Reproducible computing? The ipython notebook is awesome, though I am not sure
if there is anything as good as knitr if your workflow is LaTeX oriented.

R "hands" will usually find Python a backward step when it comes to vectorized
data manipulation, but it's a forward leap if your data becomes too big, or if
you have to step out of the comfy environment of exploratory analysis into any
form of (even trivial) production setting.

And no you definitely do not need HDF5 to effectively use Pandas.

~~~
hadley
The closest equivalent to virtualenv for R is packrat:
[http://rstudio.github.io/packrat/](http://rstudio.github.io/packrat/). It
doesn't (yet) support different R versions for different projects, but that's
on the roadmap.
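A minimal sketch of the packrat workflow, per its documentation (not run here, since each step hits CRAN):

```r
install.packages("packrat")

packrat::init("~/my-project")  # give the project its own private library
install.packages("dplyr")      # now installs into the project library

packrat::snapshot()  # record exact package versions in packrat.lock
packrat::restore()   # on another machine: rebuild the library from the lockfile
```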

~~~
mrgordon
Yeah packrat is great! It is a really important package which has greatly
increased my willingness to use R in production.

------
sytelus
R is a very acquired taste - or rather, you struggle through until you can
memorize all the gotchas. For people who are not doing data analysis as a
full-time (and perhaps only) job, remembering the intricacies of R is going to
be a deal breaker. Personally I would shift towards Python unless there are
exclusive packages in R that you have to have - and that's becoming rarer
by the day. Combined with the iPython Notebook and the recent wave of package
migrations to Python, I see little need to deal with R on a regular basis.

~~~
micheljansen
Thanks for introducing me to iPython Notebook. I had never heard of this, but
it looks very promising. The browser-based interface does not look as polished
as RStudio, and I'm a bit confused why this functionality is not in the Qt
console, but I'll check it out nonetheless.

~~~
afarrell
Much of Python's power for data science comes from the scipy/numpy libraries
and the tools built on top of them. Unfortunately, installing these requires
first installing a bunch of Fortran libraries. Fortunately, there is a free
distribution of Python that comes with them, and with a package manager
you can use to install Python libraries that have C extensions. You can also
install R packages and use them from the iPython notebook.
[https://store.continuum.io/cshop/anaconda/](https://store.continuum.io/cshop/anaconda/)

------
greenleafjacob
Something I've found useful is the sqldf package [0], which lets you use data
frames as tables in SQL with all the power of joins and so on.

[0] [https://code.google.com/p/sqldf/](https://code.google.com/p/sqldf/)
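For the unfamiliar, a small sketch of what that looks like (assuming the sqldf package is installed; data frames in the calling environment are visible as SQL tables):

```r
library(sqldf)

orders   <- data.frame(id = c(1, 2, 3), customer = c("ann", "bob", "cat"))
payments <- data.frame(order_id = c(2, 3), amount = c(10, 25))

# An ordinary SQL join over two in-memory data frames
sqldf("SELECT o.customer, p.amount
       FROM orders o JOIN payments p ON o.id = p.order_id")
```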

~~~
hadley
NB: sqldf works by copying your data frame into a temporary SQLite database,
running the query and then copying the data back. So it's not that fast.

Instead, learn a native R package like dplyr or data.table that supports all
the power of SQL, and is v. v. fast.
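For comparison, the same kind of query stays in native R with dplyr (a sketch, assuming the package is installed; the column names are made up for illustration):

```r
library(dplyr)

flights <- data.frame(carrier = c("AA", "AA", "UA"),
                      delay   = c(5, 15, 2))

# Equivalent of: SELECT carrier, AVG(delay) FROM flights GROUP BY carrier
flights %>%
  group_by(carrier) %>%
  summarise(mean_delay = mean(delay))
```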

~~~
ggrothendieck
(1) sqldf works with H2, MySQL and PostgreSQL too - it's not limited to
SQLite.

(2) Most alternatives to SQL available in R don't work well with complex
multi-way joins. You wind up materializing the full outer join first, which
can be quite problematic.

(3) sqldf is fast enough, and R is slow enough, that quite often sqldf is
faster than base R - so if R is fast enough for you, then it's likely that
sqldf is too.

(4) Speed is often the last thing you need to worry about. The ability to
express a query in a familiar way (if you know SQL) is often a more important
consideration than having to learn a new system.

------
psychometry
Strange that someone who knows about plyr (and is clearly not a beginning R
user) is still using "for" loops.

~~~
houshuang
There are lots of ways of doing things in R, and I am always learning. I
always use vectorized functions when I do something to every row in a
data.frame, but when I want to do something to every column, like here, I
sometimes use a for loop. How would you use something like plyr here, and
would it be as easy to understand?

    
    
    library(plyr)  # for revalue()

    likertcat <- c("1"="Not at all", "2"="To a small extent", "3"="To some extent",
      "4"="To a moderate extent", "5"="To a large extent")

    for(e in names(db[,9:44])) {
      db[[e]] <- revalue(db[[e]], likertcat)
      db[[e]] <- ordered(db[[e]], levels= c("Not at all","To a small extent",
        "To some extent","To a moderate extent","To a large extent"))
    }

~~~
nograpes
Like this:

    
    
    db <- data.frame(replicate(44, 1:5))
    likert_levels <- c("Not at all", "To a small extent", "To some extent",
                       "To a moderate extent", "To a large extent")
    # ordered() with explicit levels keeps the Likert order;
    # as.ordered() would sort the levels alphabetically
    likertcat <- ordered(likert_levels, levels = likert_levels)
    db[ ,9:44] <- lapply(db[,9:44], function(x) likertcat[x])

~~~
houshuang
Thank you. As I said, there are always many ways of doing things in R.

