
R, the master troll of statistical languages (2012) - Goldenromeo
http://www.talyarkoni.org/blog/2012/06/08/r-the-master-troll-of-statistical-languages/
======
mziel
The problem is people using R without trying to learn about the language
itself, just assuming it works like their favourite language.

For example, complaining that R is slow and then writing an iterative solution
instead of using vectorization. When I saw the example the author gave, my
first thought was "sapply/lapply". lapply is essential to R use, and is
taught early on in every book/course on R I've ever seen.
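A minimal sketch of the contrast (my own toy example, not the author's code):

```r
x <- 1:1000

# the slow, iterative style: grows the result vector on every pass
squares_loop <- c()
for (i in x) squares_loop <- c(squares_loop, i^2)

# the idiomatic alternatives: sapply, or plain vectorized arithmetic
squares_sapply <- sapply(x, function(i) i^2)
squares_vec    <- x^2
```

All three produce the same result; the difference is that the loop copies the growing vector on every iteration.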

"In 2012, I’m the kind of person who uses apply() a dozen times a day, and is
vaguely aware that R has a million related built-in functions like sapply(),
tapply(), lapply(), and vapply(), yet still has absolutely no idea what all of
those actually do. "

~~~
todd8
It's been a few years since I really looked at R, but I don't think the
problems with R are simply that people don't learn the language. Some
languages are simply not as good as others. We can all learn more about the
tools we use when programming, I know that I certainly could. But this doesn't
make it our fault that a language is tricky or hard to debug or hard to
understand. If we worked at it, I suppose we could all write more efficient
programs by using assembler, but that doesn't mean that assembler is the best
possible programming language for, say, statistical programming.

Ross Ihaka, someone who knows a thing or two about R, wrote a short post 6
years ago saying "simply start over and build something better". Take a
look:

[http://www.r-bloggers.com/“simply-start-over-and-build-
somet...](http://www.r-bloggers.com/“simply-start-over-and-build-something-
better”/)

My hope is that Julia will eventually be adopted as a basis for a future
statistical programming language.

~~~
sdenton4
From your link, hilariously relevant to the blog post at hand:

"First, scalar computations in R are very slow. This in part because the R
interpreter is very slow, but also because there are no scalar types. By
introducing scalars and using compilation it looks like its possible to get a
speedup by a factor of several hundred for scalar computations. This is
important because it means that many ghastly uses of array operations and the
apply functions could be replaced by simple loops. The cost of these
improvements is that scope declarations become mandatory and (optional) type
declarations are necessary to help the compiler."

------
wanderfowl
I'm a Post-Doc in a small social sciences department in a major university,
and am probably the department's ranking R-geek. I did my dissertation, and
much of my current work, doing modeling, analysis, and even machine learning
in R.

In many ways, I owe much of my success to the power that R has allowed me to
wield. Multicore lapplys and ggplot2 are my life these days. But even with
this, R drives me absolutely batty, and the documentation, even battier.

I may be competent relative to most, but R feels so taped-together and
idiosyncratic that even on my best days, I just feel like a newbie who's built
up an army of ugly hacks.

Someday, I'll learn more about the python stats tools and do my stats there.
But for now, R it is. Troll on, you crazy bastard.

~~~
avs733
I am in an almost identical situation, just still at the PhD stage.

I moved from Matlab (originally a mechanical engineer) and the biggest shock
was the documentation and the internal help system in general. The help files
regularly require you to understand how something works in order to understand
the thing explaining how it works.

I am often just shocked at the little quirks I find trying to do things in R;
not that it is worse than Stata or SAS, but the goal was to be better. I am all
for FOSS, and R provides many extensive capabilities not available in Matlab,
but in terms of being user friendly Matlab is so superior it is honestly sad.
Oh, and '<-' just drives me nuts... I will never understand the choice of two
characters where one is entirely sufficient.

~~~
jghn
IIRC it comes from the old APL keyboard

~~~
avs733
well thank you, I always appreciate someone teaching me something :)

~~~
jghn
np. As was mentioned elsewhere in the thread, S is a pretty ancient language
so R brings along some baggage with it :)

------
louden
R is a language with a lot of gotchas. I usually get burned by characters
being converted to factors in read.csv() and by converting factors to numeric
(it works, but not how you intend). The R Inferno
([http://www.burns-stat.com/documents/books/the-r-inferno/](http://www.burns-stat.com/documents/books/the-r-inferno/))
has a lot of other gotchas and is worth a read for people who use the language.

That said, the power, flexibility and user community make it my go-to for any
first crack at an analysis of data.
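The factor-to-numeric gotcha mentioned above is easy to demonstrate (a minimal sketch):

```r
f <- factor(c("10", "40", "20"))

as.numeric(f)                # 1 3 2 -- the internal level codes, not the values
as.numeric(as.character(f))  # 10 40 20 -- the conversion you actually intended
```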

~~~
mziel
Factors are one of the worst things in the R world. I don't recall ever
needing factors, yet they creep in via many functions (read.csv, cut).

Btw, there's a nice readr package (from the Hadleyverse) with a read_csv
function that does away with factors by default.

~~~
stewbrew
How would you represent categorical data then? R's primary use case isn't text
processing. And HW isn't always right.

~~~
klmr
As character, for instance (in particular, characters can do everything
factors can when used in conjunction with `unique`, and ordered factors can be
represented as a combination of characters and numerics). Factors work better,
but only barely. In particular, they are nowadays no more efficient than
using character (!). They used to be, which is why they are used so liberally
throughout R's base libraries.
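For illustration, a factor is just an integer vector of level codes plus a character vector of levels, which is why characters plus `unique`/`match` can stand in for it:

```r
f <- factor(c("b", "a", "b"))
unclass(f)    # integer codes 2 1 2, with a "levels" attribute of "a" "b"
levels(f)     # "a" "b"

# the equivalent character-based representation
v     <- c("b", "a", "b")
lev   <- sort(unique(v))  # "a" "b"
codes <- match(v, lev)    # 2 1 2
```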

~~~
stewbrew
"In particular, they are nowadays not any more efficient than using character"

How could a comparison of two strings of unknown size be as efficient as
comparing two integers? I'm curious to learn something new.

~~~
hadley
R uses a global string cache so any string comparison is just comparing two
pointers.

------
nathell
I have a love-hate relationship with R, being a predominantly Clojure (and
Ruby these days) programmer who only occasionally dabbles in data crunching.

The apply/sapply dichotomy that the article mentions (actually a hexachotomy:
there are also lapply, mapply, tapply and vapply) is one example of the
gazillion warts that the language has.

Another random one: R has a useful function, paste, that concatenates strings
together. Only it takes varargs, not a character vector, so if you have a
vector v of strings, you have to use do.call(paste, v). Only not, because
do.call insists that its second argument be a list, not a vector, so you do
do.call(paste, as.list(v)). And if you want to separate the strings, say, by
commas, you have to affix the named argument sep, obtaining do.call(paste,
c(as.list(v), sep=",")).
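The chain described above, as runnable code; note that base `paste` also has a `collapse` argument that handles this case directly:

```r
v <- c("a", "b", "c")

paste(v)                                  # "a" "b" "c" -- vectorized, not concatenated
do.call(paste, as.list(v))                # "a b c"
do.call(paste, c(as.list(v), sep = ","))  # "a,b,c"

# the direct route
paste(v, collapse = ",")                  # "a,b,c"
```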

And R's three mutually incompatible object systems. And so on and so on and so
on.

There are things to love. The packaging system works really well. I like the
focus that R puts on documentation: hardly anywhere is it so comprehensive,
with vignettes and all. There are things plainly inspired by Lisp (R is just
about the only non-Lisp I know that has a condition and restart system akin to
CL). And ggplot2 is one hell of a gem of API.

In many ways, R is the PHP of data science. (Though the core language's still
nowhere near as abysmal as PHP.) Despite all the warts, there are all sorts of
statistical analyses that are just an install.packages() away. Put another way,
R is to data science what LaTeX is to typesetting. It's a heavy pile of
duct tape, but it's here to stay because it's just so damn useful.

~~~
sooheon
I wonder what disagreements or counterarguments downvoters might have, and if
they might share them.

For the parent, given what you've observed, do you still go to R for data
crunching, or have you found anything in Clojure land that measures up?

~~~
nathell
R. Or a mixed approach, with Clojure for data preprocessing and R for the
analysis proper. Case in point: I wrote the Clojure scraping library,
Skyscraper [1], and made it output CSV by default so as to be able to easily
drop the resulting files to R.

For statistics, Clojure has Incanter, but it's very basic in comparison. There
are easily usable Java libraries for certain tasks (MALLET comes to mind), but
these are few and far between.

[1]:
[https://github.com/nathell/skyscraper/](https://github.com/nathell/skyscraper/)

------
capnrefsmmat
My biggest complaint about R isn't the inconsistency and obtuseness -- I've
been using it long enough to get familiar with the documentation and the
zillions of varieties of apply. My problem is the data structures.

R has only a few core data structures: vectors, lists, arrays, and matrices.
Data frames are built on top of lists, and admittedly data frames are
incredibly useful for statistics -- there's a reason pandas exists, and a
reason data analysis is much more tedious in other languages.

But there are no hash maps or sets (lists have named elements, but with O(n)
indexing; the only hash tables available use environments and accept limited
types of keys), no tuples, no structural or record types, stacks and queues
only recently became available on CRAN (through C), and so on.
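The environment-based hash table alluded to above looks like this; the limitation is that keys must be character strings:

```r
h <- new.env(hash = TRUE)
assign("alpha", 42, envir = h)
assign("beta", "hello", envir = h)

get("alpha", envir = h)                       # 42
exists("gamma", envir = h, inherits = FALSE)  # FALSE (don't search parent scopes)
ls(h)                                         # "alpha" "beta"
```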

This leads to the folk belief that the only way to optimize R is to vectorize
code or to write it in C or C++ (with Rcpp, for instance). No statistical
programmer ever thinks about choosing the right data structure for the job,
since you basically only ever use lists and data frames. Fast operations on
data structures (like graph algorithms) have to be written in C. There's just
no way to do it in R.

When I co-taught a statistical computing course, covering the basics of data
structures and algorithms, I included some homework assignments where the
difference between a fast and a slow algorithm was the choice of data
structure. R users struggled because they had very little available to them.
If their code wasn't fast because they were doing O(n) list lookups in a loop,
there wasn't anything they could do to fix it.

I hope Python and Julia can eat R's lunch. Some day I'll have to get around to
trying Julia for a serious project...

~~~
hadley
The lack of data structures in R is a totally fixable problem.

~~~
capnrefsmmat
Sure, through packages, but you'd need to adapt the entire standard library to
take advantage of them, so you could pass new data structures to built-in
functions and get meaningful results.

Generic iterators would also be extremely useful to build in, so it's easy to
work with a wide variety of structures.

~~~
hadley
And generic functions allow you to fill in those missing pieces from a package
too.

------
noelwelsh
I think the general consensus is that R is a terrible language with a lot of
useful libraries. I especially like that R claims to be inspired by Scheme,
but the memo seems to have been "Make sure we f*ck this all up" taped to
the front of the "Lambda the Ultimate" papers[1]. In particular, lexical
scoping was one of the key innovations in Scheme and R has pervasively
buggered up their implementation, from not distinguishing between defining and
mutating a variable to making the default save/load procedures mutate the
environment. OMG does R drive me insane (as a programming language person.)

[Saying "there is package on CRAN that fixes this" is not a solution. A
language shouldn't require extensive knowledge of the ecosystem to get the
basics working properly.]

[1] Scheme was introduced to the world in the "Lambda the Ultimate" series of
papers. See
[http://library.readscheme.org/page1.html](http://library.readscheme.org/page1.html)

~~~
hadley
No, that is not the consensus. R is (at its heart) a beautiful language that
is extremely well suited to its domain. Most people who use R are not
professional programmers (or even identify as programmers) so it's not
surprising that there's a lot of bad code written in R.

------
tmalsburg2
Writing a variant of this article has become a rite of passage for all serious
users of R. There are two issues that contribute to the difficulties people
experience with R. First, yes, R can be confusing at times. Tal explains this
really well, but only scratches the surface. There is so much more confusing
and counter-intuitive stuff, for example with regard to factors, which only
very few people seem to understand fully. However, there is a second issue,
and this is less often acknowledged: People expect R to be immensely powerful
and at the same time easy to use, which is really not a very reasonable thing
to expect. This attitude is fairly specific to the R community. No C++
developer would dare to write a long rant about the shortcomings of C++ while
at the same time nonchalantly admitting that they never made a serious attempt
to learn it. One symptom of this problem is that hardly any self-proclaimed R
hacker has read Matloff's book "The Art of R Programming" which was for a long
time, and perhaps still is, the only book on R programming. The mere fact that
there is (or was) only one such book speaks volumes.

~~~
hudibras
I agree with everything here, including the praise for Matloff's book; it
should be the very first book any serious R user picks up.

But Matloff is no longer alone: Hadley Wickham's _Advanced R_ is now also a
must-read for R programmers.

[http://www.amazon.com/Advanced-Chapman-Hall-CRC-
Series/dp/14...](http://www.amazon.com/Advanced-Chapman-Hall-CRC-
Series/dp/1466586966)

~~~
hadley
And Advanced R is also available for free at
[http://adv-r.had.co.nz/](http://adv-r.had.co.nz/)

------
justin_oaks
Way too many languages are trolls, or have troll features. Or in other words,
too many languages have features that don't do what a reasonable person would
expect them to do.

I've long considered implicit type conversion to be a troll feature,
especially how Javascript does it. Another one is how differently Java treats
primitives and Object types. Oracle databases treat nulls and empty strings
the same.

At times like this, all I can do is lament and search in vain for a language
with no troll features.

~~~
mackwic
You can take a look at Rust. The authors are really careful to design an
elegant C++ replacement, with no troll features and zero-cost abstractions.

------
chollida1
I think this article nails exactly what's right and wrong with R.

This in particular sums up the learning curve of R.

> Thankfully, I’m long past the point where R syntax is perpetually confusing.
> I’m now well into the phase where it’s only frequently confusing, and I even
> have high hopes of one day making it to the point where it barely confuses
> me at all.

Warning personal opinion ahead...

R, the language, can get you up and running a lot faster than other languages
for statistics, like, say, Python with pandas or scipy, but even people who
use it on a daily basis will curse the language's "quirks". I find most of the
confusion comes from R trying to be too friendly to the user via type
conversions. The ease with which R's type system will convert values has
probably caused me more grief when first learning the language than any other
issue I ran into.

And this illustrates the downside of using R:

> library(Hmisc)
> apply(ice.cream, 2, all.is.numeric)

> …which had the desirable property of actually working. But it still wasn’t
> very satisfactory, because it requires loading a pretty large library
> (Hmisc) with a bunch of dependencies just to do something very simple that
> should really be doable in the base R distribution.
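For the record, a base-R stand-in for that check is not hard to write (a sketch; `all.is.numeric` itself has more options, and `ice.cream` here is a stand-in for the article's data frame):

```r
# TRUE if every element of x can be coerced to a number without producing NA
is_all_numeric <- function(x) {
  !any(is.na(suppressWarnings(as.numeric(as.character(x)))))
}

ice.cream <- data.frame(flavor = c("vanilla", "mint"), rating = c("4", "5"))
sapply(ice.cream, is_all_numeric)  # flavor: FALSE, rating: TRUE
```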

Since R is rarely a programmer's most-used language, I find there tends to be
an above-average amount of google-and-paste code that pulls in 50 different
packages, each of which is used on 1-2 lines of a 1000-line script. Perhaps
this is just a function of most programmers not really understanding the
mathematical domain, so they slowly google and iterate their way towards a
solution.

Often I'll see people pull in 5 different time series libraries just because
each of them operates on a ts object (so they can all work on the same object)
and each one provides one additional method the others don't that the
programmer needs to create their solution.

You'll hear people talk about writing R in the Hadley universe or the base R
universe, but there isn't much talk about what a canonical R solution looks
like. R is a great language in the sense that Perl and C++ are: it allows you
to do anything, but there often isn't an agreed-upon way of writing it, and
two different programmers can come up with wildly different but valid
solutions to the same problem.

~~~
mziel
I think it's also due to the fact that installing an R package is just one
command away: install.packages("abc"), then library(abc), and you're done.
It's a blessing, but it encourages loading swaths of libraries.

------
evandev
My thoughts (a bit of a rant) on R, as the lead engineer at a data science
focused company: R is a great statistical language, but a poor programming
language. I use the term programming language for a language that is versatile
across a variety of needs (web apps, command-line apps), such as Python or
Ruby. R can be used as a programming language the way a climbing rope can be
used as a belt. It can, but it shouldn't, for the reasons below.

It is great for exploratory analysis, as it is forgiving and easy to use in
the console for testing things; but once it needs to be put into practice, it
has issues. For a non-programmer, grasping R isn't too hard thanks to some
great developers in the community.

There is a lot of good in the R community, but production use is where it
falls down. Just look at deploying R into production: that can be a nightmare.
I've spent days looking over code to figure out where an error in production
lies. One of the errors was in a dependency of a dependency that had been
updated for the first time in years; that package depended on another package,
which my package called through a function that called the first one.
Basically, it was a mess of dependencies. And there are some misconceptions:
while doing engineering work in R, I learned not to use for loops. Then one
day I timed it, and the for loop was 10x+ faster than any apply/plyr function,
including one using a GPU.

What separates a programming language from a statistical language is that a
programming language has more than one of these:

* Good dependency management

* Easy deployment into production environment

* A clear way to setup environment (e.g. naming, folder conventions)

* Ability to do most of the things you want with the base packages

* Good documentation about the above.

Basically, I believe a good data scientist is someone who can use R (or
something else) to explore data and then implement the algorithm in a compiled
language to be put into production. And for someone who just needs to produce
an analysis for research or a paper, R is the perfect fit. R is an excellent
language for its use cases; just don't think about using it for general
programming. That has caused a lot of extra dev hours working on issues with it.

Little plug, we wrote a piece on hiring data scientists.[0]

[0]: [https://gastrograph.com/blogs/gastronexus/interviewing-
data-...](https://gastrograph.com/blogs/gastronexus/interviewing-data-science-
interns.html)

~~~
JasonCEC
To contrast this - I'm the lead data scientist at the same company, and head
over heels in love with R....

It is the only language I can quickly and efficiently jump from algebraic
topology for novel pre-processing, straight into model building and validation
- with just about every potential variation of every major algorithm freely
available and packaged on a well curated package manager (CRAN), and then
ensemble them.

I _agree_ that it's a bit difficult to use in production, and that dependency
management needs work (Packrat is trying to do that), and that blindly
trusting packages on CRAN can cause errors - but 98% of the time - it just
works. Graphics, models, crazy niche things that are currently only used by
one post-doc locked away in a top secret research lab... it all just works.

Of course, take this with a grain of salt: this is coming from a guy who's
built web-servers (HTTP responses and all) in R.

~~~
makeset
R's server sockets can't select(), so unless you've reimplemented that as
well, don't count on handling concurrent requests with that web server.

------
th0ma5
My own personal rant, I think the specific feeling I get is the conceptual
idea of R has long since outpaced the reality of R.

People like to fetishize data, and R sure lets you do that. The data science
landscape, however, is growing such that R is really just a one-trick pony,
though that one trick is, for better or worse, being the gold standard of
statistics and modeling, somehow.

But everything else wants to sugar coat the software surrounding the
statistics, and leaves you no room to grow.

This is a very over-simplified example, but you sort of can't learn much about
graphic design or good communication skills by using ggplot2. You can make
something look very, very nice, hopefully, in the general case, sure. And you
can definitely do all kinds of hacks and crazy code to make it do whatever you
want, but by doing that you produce ever more fragile and environment-dependent
code. You'd be better off learning just about anything else for graphics
(straight SVG, D3, Processing, Cairo directly, etc.): it is of course a bit
more of a problem starting up, but it's a generalized skill set that could
allow you to grow.

You also learn pretty much nothing about web development from Shiny. Shiny is
a wonderful idea, but ultimately prevents a statistician from implementing
what it promises, which is an analytic application. At some point, you have to
ditch it and learn more traditional web stacks. It is also something of a
sales funnel into a server solution that's a DDOS or security nightmare just
waiting to happen.

So instead of just griping, I guess I have some ideas... it would be nice to
have a Ruby/JS/Java/Python service generator. It would be nice to have a
D3/React/whatever based generator. It would be nice for there to be a data
munging solution (or even whole models, more like PMML-type stuff) that can be
generalized into something that could be compiled, or that generates
Python/Java/Bash/JS/whatever code.

Ultimately you start thinking along those lines, and you realize that the
promises R is making about empowering the analyst are just teasing them rather
than helping.

R could do with less magic and more concentration on being simply a great
statistics engine that integrates better. I guess it is that to some degree,
but it sure fails the rest of the technology world that tries to live with it.

~~~
JasonCEC
I disagree on multiple fronts!

1) ggplot does exactly what it is supposed to do: create data visualizations.
It made no promises for interactivity or display, and in fact, it was
originally designed for creating publication quality charts, which it
continues to do well.

1.5) ggvis is a D3 API wrapper over ggplot and allows for interactive graphics.
Do you want to pay your data scientists to create production-ready graphics,
or let them focus on what they're best at?

2) R has been growing. Outside of neural networks (where R needs to catch up),
R gets almost every pre-processing and modeling algorithm first, and
distributes it for free. Furthermore, it has better sampling options, metric
options, augmentation options, and model ensembling tools (stacking or
meta-models) than any other language or framework. It is the gold standard.

3) I don't think there's any "magic" in R. It's just a language with a
learning curve and lack of opinions.

4) Last point: R is really not built for the web (it's older than Python!);
it's built for data science. There's no reason you need to run your modeling
stack in the same language as your application server. R is perfectly capable
of writing to databases or sending API responses in JSON or PMML.

/endrant

Not trying to start a flame-war - but this type of difference in opinion is
important to see when thinking about hiring data scientists or deploying
models.

~~~
th0ma5
I sincerely appreciate your thorough and well reasoned response, and thank you
for taking the time with it.

1) I agree; perhaps I was trying to allude to visualizations being more than
charts. ggplot's charts are absolutely gorgeous and simpler to make than I
even remember them being in Lotus 1-2-3 for DOS. However, there are some
things in the periodic table of visualizations
([http://www.visual-literacy.org/periodic_table/periodic_table...](http://www.visual-literacy.org/periodic_table/periodic_table.html))
that it can't do. And what about hybrid combinations in the same chart? Could
I have a bar chart where the bars are also mini-spectrograms? I can instantly
think of how to do this in SVG or Processing, but I'm not sure where to begin
with ggplot... maybe it is possible. Of course, why would you?

1.5) I guess I don't want to pay someone else to do the custom thing in D3
that the statistician can _almost_ do with their code, or try to get a regular
web stack developer familiar with D3 to actually get it working the way the
statistician says.

2) Yup, totally agree!

3) I think Shiny takes a lot of liberties and makes a lot of assumptions whose
importance its users can't even articulate to me, because using Shiny
completely hides the underlying concepts of how it is implemented. I would
definitely call that magic. You're right that there's quite a lack of magic in
most of the language and packages, however.

4) I think data science can/should/does embrace the web. I think the modeling
stack shouldn't be on the application server, but a trained model perhaps
should be? I also wouldn't trust the stability of R for performance critical
API calls without a lot of redundant instances and a lot of load balancing.

Anyway... the real problem is that you're also absolutely right. There is
quite a difference in opinion between the tooling of an analysis effort and
the robustness expected by IT.

Thanks again!

------
DangerousPie
I have my fair share of problems with R, but that first example (4 ways to
select a column) seems a bit silly. Just off the top of my head, I could think
of plenty of ways to do the same thing in Python/pandas:

    
    
        ice_cream.icol(0)
        ice_cream['col']
        ice_cream.iloc[:, 0]
        ice_cream.loc[:, 'col']
        ice_cream.ix[:, 'col']
    

And if you wanted to make things more convoluted, you could also wrap things
into lists like the author did in the R example. So this is definitely not a
problem that is unique to R or any reasonably flexible language.

~~~
mziel
For R, at least some of it is due to R's flexibility.

x$name: this syntax stems from data frames really being lists in disguise.

x[["name"]]: ditto, plus it's useful to be able to access by string (see
reflection in other languages).

x[,"name"] and x[,1]: these work because we can also apply matrix syntax to
data frames.
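All four forms side by side (with `stringsAsFactors = FALSE` to sidestep the factor conversion discussed elsewhere in the thread):

```r
df <- data.frame(name = c("x", "y"), val = c(1, 2), stringsAsFactors = FALSE)

df$name       # list-style access: a data frame is a list of columns
df[["name"]]  # list-style by string, handy when the column name is computed
df[, "name"]  # matrix-style by column name
df[, 1]       # matrix-style by position
```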

------
Gatsky
Ok, but this seems pretty trivial compared to the many exclusive advantages R
has. I've had minimal problems using and extending other people's software
packages written in R (for bioinformatics). This has definitely not been the
case with Java or Perl, where just installing said software package is often
unusually painful or impossible.

I think R is a prime example of how useful a domain-specific language can be.
As such, I see Julia as the most viable replacement, although that will take a
long, long time.

------
stevetrewick
So, as a code person rather than a stats one, my first reaction was that in
the first example there is in fact only a single way to access a column but
multiple ways to specify which one, all of which made immediate intuitive
sense to me.

So I wonder if this is less about R specifically and more a feature of people
approaching a language (any language) without that code-geek intuition for the
underlying affordances?

------
pak
There are certain languages that are good for a first-time programmer.

R, despite being one of the first languages a budding "data scientist" might
want to use, is probably not one of them for the many reasons given, among
them:

- there are way too many ways to do everything

- implicit iteration (although great for statistics) makes performance issues
hard to spot

- the data structures are a bit _too_ flexible (it is Lisp-y in places), and
you really need to understand them all to deploy the *apply and plyr functions
effectively

- 3+ object-oriented programming systems

- non-standard evaluation. It's all over popular libraries like ggplot2,
because it increases terseness, but it just looks like magic to beginners.

Basically, all the chapters listed here [0] -- which happens to be a great
guide for experienced programmers to really understand R as a language --
happen to be the same reasons beginners give up too quickly.

Python, although it nags me with its one-way-to-do-it motto and its many warts
[1] enough that I don't want to use it regularly, is just well-rounded enough
that it is a much better language for beginners. With Anaconda and IPython
installed, I've found that a total programming beginner can actually get
productive pretty quickly, even on stats and math problems.

[0]: [http://adv-r.had.co.nz/](http://adv-r.had.co.nz/)

[1]:
[https://wiki.python.org/moin/PythonWarts](https://wiki.python.org/moin/PythonWarts)

------
mikeskim
This could change your life:
[http://adv-r.had.co.nz/](http://adv-r.had.co.nz/)

------
dmlorenzetti
_The upshot is that unless you carefully read the apply() documentation...,
you’re hosed._

One thing that jumps out at me, having returned to R after several years in
the Python world, is how obtuse its documentation can be.

The standard format for R documentation does a few things that I find impede
understanding. First, the help pages are organized into sections giving the
high-level description, the arguments, the details, and the results
("values"). The "details" generally are organized by argument keyword, and the
arguments section draws on the language laid down-- usually in a vague, high-
level way-- by the description section. Finally, the practical effects of the
details are deferred to the results section. That means unless you already
know what's going on, you end up having to jump around among sections, trying
to synthesize everything.

This is particularly a problem for those help pages-- and there are a lot of
them-- that describe a raft of related functions all at the same time.
Describing a bunch of related functions in the same place sounds like a good
idea (it should help you figure out `apply` vs `sapply`, right?). Yet this is
exactly when the documentation organization results in the most scattershot
reading, because in addition to having to synthesize between sections, you
have to mentally prune away text that, for one reason or another, doesn't
apply to your particular case (for example, because different functions don't
all share the same arguments, or because you want to read about the values for
just one variation on the function).

Another idiom I dislike in the standard R documentation is how the examples
don't actually show any sample output. There are generally some attempts at
comments to explain what the sample code should or shouldn't do, but they are
very much written in the style of programmer's comments, not in the style of
documentation or learning points. So you end up having to run the code, and
sometimes puzzle over the results for a while.

Here's an example, from the help page that I happen to have open right now,
`help(sample)`:

    
    
         # sample()'s surprise -- example
         x <- 1:10
         sample(x[x >  8]) # length 2
         sample(x[x >  9]) # oops -- length 10!
         sample(x[x > 10]) # length 0
    

The comments alert me that there's a "surprise" in store, and they even allude
to the (apparently surprising) fact that the second line produces a 10-vector.
Notably lacking is any explanation of what's meant to be surprising here, how
that relates to the internal logic of `sample`, or how to avoid falling into
the trap.
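For anyone else puzzling over it: the "surprise" is that when `sample()` gets a single positive number n, it treats it as shorthand for `1:n`. A sketch of the trap and a safe wrapper (the help page itself suggests a `resample` idiom along these lines):

```r
x <- 1:10
x[x > 9]            # a single value: 10
sample(x[x > 9])    # sample() sees the lone number 10 and permutes
                    # 1:10 instead -- hence the surprising length 10

# Safer: index into x explicitly, so a length-1 input stays length 1.
resample <- function(x, ...) x[sample.int(length(x), ...)]
resample(x[x > 9])   # length 1: just 10
resample(x[x > 10])  # length 0, as you'd expect
```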

Overall, I feel like R's documentation is a bit like a conversation among
experts, with a rather sink-or-swim attitude towards newcomers.

Documentation is far from the first thing that stands out about R vs Python,
but it's the most salient, I think, in the context of the original article.

~~~
masklinn
> This is particularly a problem for those help pages-- and there are a lot of
> them-- that describe a raft of related functions all at the same time.
> Describing a bunch of related functions in the same place sounds like a good
> idea (it should help you figure out `apply` vs `sapply`, right?). Yet this
> is exactly when the documentation organization results in the most
> scattershot reading, because in addition to having to synthesize between
> sections, you have to mentally prune away text that, for one reason or
> another, doesn't apply to your particular case (for example, because
> different functions don't all share the same arguments, or because you want
> to read about the values for just one variation on the function).

This reminds me very strongly of man pages. Man pages group either similar
(man 3 printf) or closely related (man 3 malloc) functions, and intersperse
bits about each of the functions documented by the page, which makes them
range from difficult to read to mind-boggling (when you have half a dozen
near-identical functions being documented at the same time). Reading an
lapply documentation page[0], it looks very similar in organisation, and
similarly difficult to parse and use.

> The comments alert me that there's a "surprise" in store, and they even
> allude to the (apparently surprising) fact that the second line produces a
> 10-vector. Notably lacking is any explanation of what's meant to be
> surprising here, how that relates to the internal logic of `sample`, or how
> to avoid falling into the trap.

On
[http://www.inside-r.org/r-doc/base/sample](http://www.inside-r.org/r-doc/base/sample)
the surprise is explained in the first paragraph of the details, with one
hell of an understatement -- "this convenience feature may lead to undesired
behaviour" -- but without the big red blinking box it definitely deserves.

[0]
[https://stat.ethz.ch/R-manual/R-devel/library/base/html/lapp...](https://stat.ethz.ch/R-manual/R-devel/library/base/html/lapply.html)

------
elcapitan
Not intending to start a language war here, but if somebody who has experience
with both R and Python/pandas/etc could answer - how's the current state of
the emerging Python data/statistics ecosystem compared to R? (not counting all
the other differences like R being allegedly weird or Python more general
purpose and so on).

~~~
TheLogothete
We did an evaluation recently. Not even close.

~~~
elcapitan
Is that evaluation somehow public, or could you share some details?

------
joelberman
I think if you are a programmer or have some programming language experience,
R is not very weird. But if you are a financial analyst, a social scientist,
or a statistician who only wants to get your work done, it depends on your
first programming language. If it was S3, you're golden. If it was Basic,
you're not so golden. Mine was LISP.

------
superuser2
Most of my university classmates' first exposure to programming is using R in
a statistics class. It's awful. I wish they'd make Python or something a
prerequisite, so that giant swaths of people don't get turned off of computing
or start with the strange ideas it teaches.

~~~
bachmeier
It sounds like you are arguing for imperative programming over functional
programming.

~~~
superuser2
R is a fine tool, but (like Java or C) not the best window for a beginner into
the joy that programming can be. The syntax is pretty weird and its semantics
don't align that neatly with broadly-useful ideas for reasoning about
programs.

We run 3 different intro sequences in Python (for non-majors), Scheme (for
most majors), and Haskell (for those who are already strong imperative
programmers). They're all great.

------
Bluestrike2
I just introduced a friend of mine to R. He's working on his PhD in
microbiology and was beside himself once he started working with R.
Personally, I can't believe he hadn't used it before. It really is a
beautiful language to work with once you get a handle on it.

------
stevehiehn
The only issue I have with R is that it's not great when exposed as a web
service. For example, you'll need one container for DeployR and another for
your web service. It's not the end of the world, but more moving parts means
more problems.

------
rcthompson
Maybe R should have optional "training wheels" that produce a warning every
time an implicit conversion happens. In the OP's case, it would warn that a
data frame was implicitly being converted to a matrix, and maybe also warn
that the numeric vectors within were being converted to character in order to
get slotted into that matrix.

------
haddr
R is a great language, but at the same time it can be a real pain.

Sometimes I imagine some very wise person designing a language much more
concise and coherent, one that could still take advantage of the huge number
of existing libraries written in R and C++... Maybe it's a dream, but so many
times I wonder whether that would even be possible.

------
numlocked
At my previous job we used to play "Guess what R does" over lunch. Someone
would write a few R statements and we'd have to guess the output. Extremely
difficult!

    
        > a = c(1,2,3,4)
        > b = c(1,2)
        > a + b
    

Any guesses?

~~~
epistasis
`+` is a vector operation and works elementwise. R recycles the shorter
vector. You get a warning if the longer length isn't a multiple of the
shorter one. This is something you should know within the first 5 minutes of
learning the language, hopefully?
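Spelling out the recycling for the example upthread (my annotation, not from any manual):

```r
a <- c(1, 2, 3, 4)
b <- c(1, 2)

# b is recycled to match a's length:
# (1+1, 2+2, 3+1, 4+2) -> 2 4 4 6
a + b

# Lengths 4 and 3 still recycle, but since 4 is not a multiple
# of 3, R emits the "longer object length is not a multiple of
# shorter object length" warning.
a + c(1, 2, 3)
```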

------
makeset
Yes, I'm well aware of R's many faults, I have my own long list of R caveats I
hand to new hires, but not bothering to learn the damn language is no reason
to complain about it. First, RTFM.

~~~
avs733
Look, I am all for RTFM, but my experience with R has been a different case.
Can I make it do powerful things? Yes. Can I take code that won't run, copy
it to a new file, and have it run? Yes.

1) There are quirks in the implementation that are just nonintuitive. I
constantly find myself doing things I would do in other languages, for simple
tasks, which fail for no good reason -- albeit an obvious one once I find the
right documentation.

2) The manual you describe is flat out unhelpful in many cases. The
suggestions that constantly come up -- "check Stack Overflow / Google" -- are
suggestive of just how poor that documentation is.

~~~
mikeskim
You wouldn't find it intuitive if your first language was Scheme.

~~~
avs733
absolutely fair enough

------
gradstudent
ITT: people who don't understand R complain about R.

------
tempodox
Ah, the joys of a dynamic language and its implicit conversions.

------
misiti3780
previous discussion:
[https://news.ycombinator.com/item?id=5450097](https://news.ycombinator.com/item?id=5450097)

