I've written a few hundred lines of R sporadically over the last several years. The absolute worst thing about it in my opinion is the type system. It does not matter how many times I use R, I cannot for the life of me remember or understand the difference between vectors, arrays, lists, data frames, and matrices. A list is sort of like a mix between an array and a map, a matrix is sorta like a 2d vector but can have row/column names, an array is like a matrix but different, and a data frame is like a heterogeneous matrix. And converting between them is always tricky.
As much as R may be capable of, I just can't get past how inconsistent and complicated its basic types are.
The terminology is weird. I'm not an R expert, but here's how I think of it:
vector: this one is clear based on the name; it's a homogeneous sequence (with very aggressive type coercion). A sequence of strings, a sequence of numerics, etc. One thing worth knowing is that there are no scalar types, so c(1) == 1. That is, the value 1 is identical to the singleton vector containing 1. Also, the empty vector c() is identical to NULL! is.null(c()) == TRUE. Weird.
list: the name is confusing, but I think of it basically like a dict in Python. And the syntax is the same: list(a=1, b=2) vs dict(a=1, b=2). I think you can use it like a sequence as you are saying, but I never use them that way. Lists are for ad hoc composite types -- if I want to return 2 values from a function, I return a list() of them. I think you can convert lists to environments easily, or they are the same -- also similar to Python's dicts.
data frame: This is the core type AFAICT: it is basically a collection of named column vectors of the same length, e.g. data.frame(name=c("a", "b", "c"), value=c(1,2,3)). This seems pretty intuitive. A row has different types (like a DB relation) but each column has a single type, since a column is a vector.
matrix: I don't use these too much, but it basically seems like a homogeneous type like a vector, except you specify the dimensions.
array: I don't use this, but the R documentation says "A 2-dimensional array is the same thing as a matrix". So I think I am confused, and what I typed above is an "array", and a matrix is the special 2D case. Yes, the names are bad. I think of a matrix as having an arbitrary number of dimensions (as in Matlab).
I think where it gets confusing is that there are all these arbitrary conversions, and you can use things in more than the prescribed ways, so you might stumble across code that uses them "wrong". But after a fair amount of R programming, that is my mental model, whether right or wrong :)
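To make that concrete, here's a quick tour of those five types at the R prompt (a sketch, not exhaustive; results shown as comments):

    # vector: homogeneous, aggressively coerced; there are no scalars
    identical(c(1), 1)    # TRUE -- 1 is already a length-1 vector
    is.null(c())          # TRUE -- the empty vector really is NULL
    c(1, "a")             # coerced to character: "1" "a"

    # list: heterogeneous and named, much like a Python dict
    l <- list(a = 1, b = "two")
    l$a                   # 1
    e <- list2env(l, envir = new.env())  # lists convert to environments

    # data frame: named column vectors, all the same length
    df <- data.frame(name = c("a", "b", "c"), value = c(1, 2, 3))
    df$value              # a plain numeric vector: 1 2 3

    # matrix: a homogeneous vector with a 2-d dim attribute
    m <- matrix(1:6, nrow = 2)
    dim(m)                # 2 3

    # array: the n-dimensional generalization; a 2-d array IS a matrix
    a <- array(1:24, dim = c(2, 3, 4))
    identical(array(1:6, dim = c(2, 3)), matrix(1:6, nrow = 2))  # TRUE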
I think a lot of the mess comes from the fact that dealing with real data is just messy. R takes the mess and makes the common case convenient, and people like that. But it's like Perl in that it's a "Do what I mean" language and tries to guess a lot, rather than "Do what I say" like Python. And when it's guessing your intent wrong it can leave you very frustrated, as with Perl.
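For example, two classic guess-wrong moments at the prompt (outputs in comments):

    c(1, 2, 3, 4) + c(10, 20)   # shorter vector silently recycled: 11 22 13 24
    m <- matrix(1:6, nrow = 2)
    m[1, ]                      # a one-row slice silently drops to a plain vector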
I have come to love R (for what I use it for), but reading this makes me realize how unusual my R workflow must be, because most of the 'advantages' of Julia over R don't really come up in my daily workflow anymore - likely because I've adapted to the shortcomings of R and have twisted other tools to my needs. I'll add Julia to my list of languages to check out in more detail, because perhaps Julia could replace my need for this rather esoteric workflow that I've developed out of sheer necessity.
I use Python (NumPy/SciPy) for most of the data preprocessing, and perhaps that's why. I used to do this in R, and I realized that it's just a lot easier to get done in Python (and it ends up being faster anyway). The problem is that Python/NumPy/SciPy still doesn't lend itself quite as well as R does to certain aspects of the statistician's use case. It's possible that things have changed since the last time I evaluated the two, but I still find it easier to prototype various models in R, even if I do all of the prerequisite data munging in a different environment.
I understand that R, like Perl, is 'blessed' (pun intended) with two different, incompatible object systems (S3 and S4) - in fact, this is the reason I avoid using R's object system, and whenever I'm advising newcomers, I recommend the same. I don't write statistical packages, so this doesn't come up often, but when I find myself needing to write a method in R, I ask myself whether it would actually be easier done another way. Generally, I find the answer is 'yes, yes it would'.
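For anyone who hasn't run into them, here's a minimal side-by-side sketch of the two systems (the class names are made up for illustration):

    # S3: informal; "class" is just an attribute, dispatch is by naming convention
    p <- list(x = 1, y = 2)
    class(p) <- "point"
    print.point <- function(x, ...) cat("point(", x$x, ",", x$y, ")\n")
    print(p)                              # dispatches to print.point

    # S4: formal; classes, generics, and methods are declared explicitly
    setClass("Point", representation(x = "numeric", y = "numeric"))
    setGeneric("describe", function(obj) standardGeneric("describe"))
    setMethod("describe", "Point",
              function(obj) cat("Point at", obj@x, obj@y, "\n"))
    describe(new("Point", x = 1, y = 2))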
I really do think the problem is the type system. The kind of type system that lends itself well to data manipulation is not the same one that lends itself well to model manipulation - when I think about it, I've unconsciously segregated my workflow into two parts, doing everything naturally done with Python's type system in Python, and likewise for R. Maybe that's just the way I happen to approach data manipulation, but I think it's non-coincidental. R's relative homoiconicity (compared to Python) makes it really nice for some things, but there are other warts with its typing that are just too annoying to work around when a Python shell is only a few keystrokes away.
I guess the answer is (as always!) to use a purely homoiconic Lisp dialect, so you get the best of both worlds - but that's asking a lot of statisticians.
I really have come to love R for what it does do, though. Of all the statistical software packages I've seen (comparable: SAS, SPSS, Stata, MATLAB), it's far and away the best (and the GNU license makes it very, very attractive to broke students looking to avoid the still-absurdly-priced student licenses for the alternatives). That said, I still sigh every time I realize that I'm essentially gluing together two separate runtime environments for something that should really be easily integrated. I do what I do now because it ends up being faster than using either Python or R for everything, but it still strikes me as weird that a language so perfect for munging data (Python) can still be so awkward for analyzing it, and vice versa.
I find that I do the same thing, except with Ruby for data processing instead of Python. It may be that I just don't know R all that well, but there are so many tasks that are incredibly awkward in R (often requiring a third-party library like plyr) yet easily expressed in a language with more "normal" semantics.
An example, from this week: I have a bunch of CSV data files from various trials of an experiment. I want to combine them into one data frame with a new column that holds an id for the trial. This took me about half an hour to figure out in R, and five minutes to write in Ruby.
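For reference, here's one way to do it in base R (the file names are hypothetical; this assumes one CSV per trial, all with the same columns):

    files <- list.files(pattern = "\\.csv$")  # e.g. trial1.csv, trial2.csv, ...
    frames <- lapply(seq_along(files), function(i) {
      df <- read.csv(files[i])
      df$trial <- i                           # tag each row with its trial id
      df
    })
    combined <- do.call(rbind, frames)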
I think the main problem with R is that there's a different way to do everything. It feels like a language that was not so much designed as gradually evolved. In a functional-ish language like Ruby or Python you have a few workhorse data manipulation tools: map, fold, etc. But in R everything is different depending on whether you're dealing with row vectors, column vectors, data frames, or arrays. That makes it hard to generalize over slightly different problems to find common solutions.
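To illustrate: where Ruby has map and reduce, R splits the job across a family of functions that each expect a particular shape:

    lapply(list(1, 2, 3), sqrt)               # lists              -> lapply / sapply
    apply(matrix(1:6, 2), 1, sum)             # matrices           -> apply, with a margin
    mapply(`+`, 1:3, 4:6)                     # parallel sequences -> mapply
    tapply(c(1, 2, 3, 4), c("a", "a", "b", "b"), mean)  # grouped vectors -> tapply
    Reduce(`+`, 1:5)                          # and the fold hides under Reduce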
Julia looks really awesome, though, and I'm excited to see something that might be able to replace R and bring all of this comfortably into one language.
> I think the main problem with R is that there's a different way to do everything. It feels like a language that was not so much designed as gradually evolved.
I don't know how much you know about the history of R, but you're spot on about that.
> I guess the answer is (as always!) to use a purely homoiconic Lisp dialect, so you get the best of both worlds but that's asking a lot of statisticians.
For what it's worth, Julia is homoiconic, and underneath the Matlab-like syntactic exterior it's quite a lot like Scheme.
That is very interesting. I use Python to pre-process data for Matlab, and have been giving serious thought lately to learning R for its free license and easy(?) integration with Hadoop. Can you briefly comment on the advantages of R over Matlab, aside from licensing?
If you're already used to Matlab, you may not find my comments as relevant. If you were proficient in both, they'd be interchangeable for many tasks (which is in fact why I always recommend learning R over learning Matlab).
Licensing isn't just a minor thing - getting Matlab to run on non-Debian Linux is a painful ordeal. I never actually got it working, because I never bothered to debug its cryptic error messages, and since it's distributed as a precompiled binary, I wasn't going to sit around trying to patch it. A corollary is that R is easier to integrate into other toolkits, and there are a ridiculous number of freely available R libraries that make your life easier.
My issues with Matlab may be things that someone familiar with the language would care less about. That said, I find Matlab to be incredibly, incredibly irritating, and I think that's because its design is tailored toward people with minimal experience of other programming languages (like research scientists), whereas R's design is based on S - so I find R violates the Principle of Least Surprise less. Matlab is not like Lisp or Haskell (where the journey of understanding the language is valuable in itself) - it's really just a means to an end (number crunching), so the POLS is especially important.
R, unlike Matlab, imposes almost no restrictions on the structure of a program. The way I see it, Matlab makes Java's broken one-class-per-file model even worse, by imposing more filesystem-level restrictions on my program.
R, unlike Matlab, uses a type system that's more familiar to someone used to programming with multiple datatypes, as opposed to someone used to thinking in strictly numerical structures. I never got the hang of when I should index with () or {} or [] in Matlab... I'd have to look it up to tell you. R, on the other hand, is more like Python in this regard - even if it's not quite as clean as Python, it makes basic things like importing/manipulating CSVs much easier than Python (or even Excel, which is designed around that exact purpose). See the short snippet after this list.
R, unlike Matlab, returns the value of the last expression evaluated, rather than whatever values happen to be bound to the declared output-variable names.
R, unlike Matlab, uses a more intuitive (to me) definition of dimensions (and of row- vs. column-vectors). I spent 80% of my time in Matlab figuring out how to get dimensions to match in a robust manner, and I've never had to do that in R.
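A couple of lines showing what I mean about indexing and CSVs (the file and column names here are made up):

    df <- read.csv("results.csv")   # CSV import is a one-liner
    df[1, ]                         # first row of the data frame
    df[, "value"]                   # a column, by name
    v <- c(10, 20, 30)
    v[v > 15]                       # same single-bracket syntax on a plain vector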
You get the idea - my frustrations with the language itself are mostly with the fact that it's so unlike most other languages, and it's too much of a hassle to learn. My frustrations with the language environment are that the free alternative (R) is much easier to work with, and much more cross-platform.
Thank you, that's a good list! Dealing with cell arrays and text vs. numerics is why I do my pre-processing in Python. Matlab's job is to read in the data, run it through various algorithms, collect accuracy statistics, and show plots.
For us the issue is not so much Matlab as a programming language, but rather the availability of new algorithms and the ease of parallel processing. The licensing issues involved in getting the parallel toolbox running on multiple workstations seem like a headache, which is part of what is motivating us to look at R.
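For what it's worth, recent versions of base R ship a parallel package with no per-node licensing. A minimal sketch:

    library(parallel)
    # fork-based on Unix; on Windows use makeCluster() + parLapply() instead
    results <- mclapply(1:8, function(i) sum(rnorm(1e6)), mc.cores = 4)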
One thing is probably the huge number of statistical packages it has (see http://cran.r-project.org/web/views/), including (static) graphics (e.g. ggplot2, lattice, and the base graphics).
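For instance, a scatter plot in ggplot2 is a one-liner (using the built-in mtcars dataset):

    library(ggplot2)
    ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()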
Hi chimeracoder,
I am very curious to better understand how you find Python better for the data pre-processing stage.
I use only R, and would love to know what "I am missing" here.
Any simple example will also make it easier to understand.
Well explained by chimeracoder. Data-table centric operations are much more natural in R while sequential objects (lists, tuples, and strings) are quickly manipulated in Python (there are more string/regex methods there).
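To make the string/regex point concrete: the base-R equivalent of Python's re.findall is a noticeably more roundabout two-step (example data made up):

    x <- c("id=42 and id=7", "id=99")
    regmatches(x, gregexpr("[0-9]+", x))   # list(c("42", "7"), "99")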
I am a heavy Python user, but when I use NumPy/SciPy I don't feel like I'm using Python much anymore, so at that point I either switch to R (or Fortran)... though I'm quite optimistic that at some point the pandas DataFrame can become my default storage structure, from which I can hand off R tasks through RPy, SQLite, HDF5, or possibly Redis.
matplotlib is very verbose, though; I almost prefer Matlab's graphics model... though less so than R's base and lattice graphics.
I haven't used it - I'll check it out. Though the more I think about it, the more I think that my issues stem from the fact that I need two fundamentally different ways of looking at types in each part of the workflow, so it may be difficult to simulate that within Python - we'll see.