

R beats Python, R beats Julia, Anyone else wanna challenge R? - jordigh
http://matloff.wordpress.com/2014/05/21/r-beats-python-r-beats-julia-anyone-else-wanna-challenge-r/

======
srean
I read the post with great interest, but it turned out to be provocative yet
shallow trolling that can be summarized as, "R is better because it is better,
LOL!" I was expecting something more illuminating.

It says or shows nothing new, and does not address the well-known complaints
that users have against R: _performance per resources used_ and _ease of
achieving correctness_.

"...built by statisticians, for statisticians" can be a dangerous place to
take comfort in when there are well-aired problems with _scalability_ ,
_running costs_ , and _maintenance costs_.

If one computes stats on 600 data points with 10 dimensions and feels king of
the hill, please continue, but there is a good chance that someone else will
eat your lunch and you will be left behind. Quite sadly, this has already
happened, and it is quite evident if one steps out of the stats bubble.
Statistics could have been what machine learning and data mining are now: the
main driving force, the owner of the initiative. Instead, other communities
are using statistics- and probability-motivated approaches but engineering
them well enough to grab (funding) attention, well deserved in my opinion. It
is they who are pushing the frontier of influence.

I have not taken a wholesale plunge into Julia yet, but to me one of its most
significant aspects, one that I think does not get enough attention among its
many nice features, is the break away from the "vectorization" paradigm.

Complain about speed to a Numpy or Octave person and a common stock response
is "but it dispatches to high-speed precompiled C loops and so is just as
fast." This ignores the fact that these precompiled loops are typically too
general: they target the worst possible data patterns and have to be written
defensively. Thus they do more work than the corresponding loop over the array
that a C, C++, or Fortran programmer would have written. Further, the
vectorization approach typically requires more loops and more memory: not only
the temporaries that are created within the vectorized operations, but also
the extra objects required to write the operation as a vector expression in
the first place.
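To make the temporaries concrete, here is a hedged NumPy sketch (the array names and the expression are invented for illustration): the vectorized form runs three separate precompiled loops and materializes throwaway arrays, while explicit `out=` buffers do the same work with a single scratch array, closer to what a hand-written C loop would allocate.

```python
import numpy as np

n = 1_000_000
a = np.random.rand(n)
b = np.random.rand(n)

# Vectorized form: NumPy evaluates this as three separate precompiled
# loops (2*a, 3*b, then +), each materializing a full temporary array.
out1 = 2 * a + 3 * b

# Same computation with explicit out= buffers: still three passes over
# the data, but one reusable scratch array instead of discarded
# temporaries.
out2 = np.empty(n)
np.multiply(a, 2.0, out=out2)   # out2 = 2*a
tmp = np.multiply(b, 3.0)       # one scratch allocation
np.add(out2, tmp, out=out2)     # out2 += 3*b

assert np.allclose(out1, out2)
```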

Numpy's broadcasting does eliminate a lot of extra memory allocation, at the
cost of some more indexing work. Matlab did not have this for the longest
time. Another great Numpy tool is numexpr
[http://code.google.com/p/numexpr/](http://code.google.com/p/numexpr/). So
much so that if a colleague complains about numpy being slow, I playfully
refuse to take a look until whatever can be numexpr'ed has been numexpr'ed. It
tries to elide temporaries and parallelize operations. The general bottom line
remains: (i) individual primitives have to do more work, (ii) the primitives
have to be called more often, and (iii) more memory copies are needed.
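As a small illustration of the broadcasting point (a sketch with invented data, not from the thread): centering the columns of a matrix needs only the small vector of means, not a full-size matrix of repeated means, at the price of a bit of extra index arithmetic inside the precompiled loop.

```python
import numpy as np

# Broadcasting: subtract per-column means from a matrix without
# materializing a full-size matrix of repeated means.
x = np.arange(12, dtype=float).reshape(3, 4)
col_means = x.mean(axis=0)   # shape (4,)
centered = x - col_means     # (3, 4) - (4,) broadcasts across rows

# Each column of the result now has zero mean.
assert np.allclose(centered.mean(axis=0), 0.0)
```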

Julia's approach is quite fresh in this regard, as are Haskell's loop-fusion-
based array primitives. The problem with Haskell is that these optimizations
can be very opaque unless you are an expert: a small change can make your code
go 30 times faster, and an equally small change can make it 30 times slower.

An aspect of vectorization that I do like a lot is its terse expressiveness.
Code is a lot shorter, and once you are used to it, such code is easy to read.
So I thought I would miss this aspect in Julia, but then I was pointed to
Devectorize.jl on HN, and it is aimed at addressing this very issue.

------
StefanKarpinski
It's hard to evaluate this comparison because the original Julia code isn't
posted – who knows if it was well written or not? However, the same
vectorization approach presented in R can also be used in Julia (or Python or
Matlab):

    julia> rw(n) = cumsum(2randbool(n) .- 1)
    
    julia> @time rw(1000000)
    elapsed time: 0.02370671 seconds (25125400 bytes allocated)
    1000000-element Array{Int64,1}:
        1
        2
        1
        ⋮
     -314
     -313
     -314

Since there's no timings or comparison code in the article, we can't really
compare performance beyond this. There is, however, this statement:

> This vectorized R code turned out to be much faster than the Julia code –
> more than 1000 times faster, in fact, in the case of simulation 1000000
> steps. For 100000000 steps, Julia actually is much faster than R, but the
> point is that the claims made about Julia’s speed advantage are really
> overblown.

It's a little odd to argue that "it’s very unlikely that ... Julia will become
more popular than R among data scientists" while completely dismissing Julia
being "much faster than R" for bigger problems. This only makes sense if one
assumes that addressing bigger problems is not of increasing importance in
data science – which seems counterfactual, to say the least.

"R’s speedy vectorization features" mentioned in the article are not actually
features – they are limitations. In R (and Python and Matlab), you _have_ to
write vectorized code because using a for loop is slow. In Julia, you can
choose which approach is better for you and for the problem at hand. Want to
write vectorized code? Not a problem. Want to use a for loop? Also not a
problem. Want to use recursion? Yes, _even_ that is ok.

I also find it interesting that the poster of this article is self-described
as an "octave core dev" ;-)

~~~
StefanKarpinski
I decided to do some timing comparisons against the given R implementation on
my system so that we have some numbers. The minimum time rw(1000000) takes in
R is 0.02 seconds – sometimes it's 2x or 3x that, probably because GC kicks
in. The vectorized Julia version above takes 0.01 seconds minimum – twice as
fast as R – but often it's 2.5x slower than that, also because of GC. Here's a
stab at what an obvious iterative random walk implementation might look like
in Julia:

    function rwi(n)
        a = Array(Int, n)
        s = 0
        for i = 1:n
            s += ifelse(randbool(), -1, 1)
            a[i] = s
        end
        return a
    end

rwi(1000000) takes 0.005 seconds. That's twice as fast as the vectorized Julia
version and 4x faster than R. It also allocates much less memory than either
one – just the output array. For 100000000 steps, R takes 2.717 seconds, the
vectorized Julia version takes 1.52 seconds, and the iterative Julia version
takes 0.6 seconds.

------
influx
Python threads are real OS threads that are scheduled by the kernel. Yes,
there is a GIL, but it is released for blocking IO. Granted, CPU-heavy loads
in threads that are pure Python are not great, but there is a subtle
difference here that most people miss.

The multiprocessing module solves this and has almost exactly the same
interface as the threading module; I'm confused why the author paints it as
more complicated.

That said I have nothing against R, or the use case the author presents.

~~~
bwood
Using multiprocessing in Python is more complicated because each process has a
separate memory space and interprocess communication can be quite expensive.
If you don't carefully design your communication patterns, any performance you
expect to gain by using multiple cores can easily be lost.
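A hedged sketch of the point (the worker function and the numbers are invented): with `multiprocessing.Pool`, every argument and result is pickled and shipped between processes, so batching work items with `chunksize` is one common way to keep that communication cost from swamping the computation.

```python
import multiprocessing as mp

def square(x):
    # Runs in a worker process: x arrives pickled, the result is
    # pickled and sent back. That round trip is the IPC cost.
    return x * x

if __name__ == "__main__":
    with mp.Pool(processes=2) as pool:
        # chunksize=25 ships 25 items per pickle round trip instead
        # of paying the overhead once per item.
        results = pool.map(square, range(100), chunksize=25)
    assert results[:4] == [0, 1, 4, 9]
```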

~~~
influx
You'd be mistaken to think that just because you share a memory space you
don't have to think about or design around the same issues with threads. The
main issue I've run into with multiprocessing is objects that are
unpicklable; otherwise, it's usually pretty transparent.

~~~
bwood
That's interesting, what sort of communication issues have you run into with
threading? I've always found it to be mostly a non-issue since you can use
global variables or just pass along a reference.

------
gmac
Inflammatory title! R gets a lot right, but a lot wrong too. In my experience
— as an experienced programmer but a mediocre statistician — it's often not
obvious how to make it work quickly or cope with large data sets, there's a
profusion of similar but distinct data types (vector, matrix, list, data
frame, ...), endless cryptic function names (c, lapply, ...), confusingly
flexible subscripting, and a community where discussions usually end up in
someone getting told what they did was _obviously wrong_. I'd love to see a
good alternative emerge (and Julia seems promising).

~~~
eshvk
> community where discussions usually end up in someone getting told what they
> did was obviously wrong.

Fascinating how that problem _never_ appears in the programming community. How
we never have a situation where a user is told that they are _obviously_
wrong. /end snark.

Having said that, I sympathize with where you are coming from. However, you
have to realize a few things:

1\. R comes from a community designed to create things that are correct, a
community where data is typically clean and small enough that sophisticated
mathematical models exist. Most problems involving large data sets plus large
sophisticated models can be broken down into: prototype and test in R, then
write C++ code.

2\. On the cryptic function names: there is an element of inspiration from
mathematical symbology, which makes it easier to take a proof and turn it
into code.

3\. Yeah, I've got nothing on the data types. They confuse the fuck out of me
as well.

4\. R has ggplot2. This is amazing. I am a firm believer that a data scientist
is nothing if they can't visualize their data, and nothing comes close to
ggplot2. This comes from a guy who will grab an ML engineer and talk their
head off about the joys of d3.

Now, having said that, I do hope for another great language to emerge. The
closest for me has been Matlab, with numpy second best. Somehow paid software
really fucking incentivizes people to clean their shit up and make a solid
product (well, solid enough for my purposes).

~~~
StefanKarpinski
> Fascinating how that problem never appears in the programming community. How
> we never have a situation where a user is told that they are obviously
> wrong. /end snark.

This is a huge problem all over programming communities. However, R is reputed
to be significantly worse than most. We're trying very hard to innovate not
only with Julia's technology, but also with its community. There is no reason
for open source programming projects to be "jerkdoms" – we're professionals
and our behavior should be professional, civil, helpful, and respectful. Even
if we weren't professionals, that's still just decent behavior. If you need
evidence of a civil, non-snarky, supportive community look no further than
this:

[https://github.com/JuliaLang/julia/issues/6829](https://github.com/JuliaLang/julia/issues/6829)

A more typical issue interaction is this:

[https://github.com/JuliaLang/julia/issues/6769](https://github.com/JuliaLang/julia/issues/6769)

This is now the longest issue discussion Julia has ever had, and a fairly
divisive one (although a bit obscure), but entirely civil, polite, and
respectful.

> nothing comes close to ggplot2.

IMO, Gadfly does:
[http://dcjones.github.io/Gadfly.jl/](http://dcjones.github.io/Gadfly.jl/) –
of course, it's very heavily inspired by ggplot2 and uses D3 for plotting in
the browser.

------
izyda
I think this article hits the nail on the head with the statement "built by
statisticians, for statisticians". As a statistics student, I have found that
the majority of proponents of Python or Julia as R replacements are
developers who complain either about R's speed or about problems such as type
checking when deploying R applications. On the other hand, most statisticians
I have spoken with seem to prefer R to anything else - CRAN has a lot to do
with that, I think.

~~~
atmosx
Seriously????

A language whose motto is "The R Project for Statistical Computing" is
_better than anything else_ when it comes to statistics?!

Who would have thought, right?! ...

------
bendmorris
>For the same reason, I don’t see Python or Julia building up a huge code
repository comparable to CRAN.

CRAN has 5,566 packages. PyPI has 44,024.

I recognize that the author is trying to make a point about statistics
packages specifically, but the R community is simply dwarfed by Python's, so
I don't think it's so far-fetched that Python could overtake R even in its
own little niche. It's also much easier to add your package to PyPI than to
CRAN, which is curated by a small and sometimes opinionated team.

------
esbranson
R is to Python as Latin is to English. Sooner or later you will have to
communicate with the rest of the world, and you're going to find your choice
of Latin, while having a substantial use case, to not be sufficient for the
purpose.

~~~
patricklynch
I think you overestimate the difficulty.

The Python programmers can probably step through R source code and figure out
what's going on without too much trouble.

The other statisticians probably already know and use R.

The non-programmer, non-statistician, business types probably aren't
interested in your source code--be it Python or R--and will want you to make
pretty graphs and give presentations anyways.

~~~
esbranson
Probably. I think my complaint boils down to a complaint that every new
language is so intent on using an uncommon syntax style for little if any
benefit. To me, R is amazingly cryptic for what little it is trying to
achieve.

------
doug1001
i doubt there is any personal bias (from the author of the blog post). The OP
has been an active and prolific member of both the Python and R communities
for over a decade (don't know what his interest, if any, is in Julia).

i have never met Dr Matloff (who i believe is a professor at UC Davis), but i
have read/studied a fair amount of his work over the past 5-7 years, which
includes excellent extended tutorials in python on various topics such as
coroutines, discrete-event simulation, and simulation using simpy. He is also
the author of a book i highly recommend, "The Art of R Programming".

~~~
fredliu
Prof. Matloff teaches almost all of his classes using his own material, e.g.
discrete event simulation, statistical analysis, stochastic process, etc. (at
least for advanced undergraduate and graduate level courses, not sure about
intro level undergraduate courses). I found his approach to teaching very
intuitive; it really helps you understand the topic. Even now that I've
graduated, I still find myself checking his materials every now and then.

------
singingfish
I'm sick of this perl bashing from pythonistas. Why can't people just accept
that they do the same job equally well in slightly different ways?

~~~
esbranson
Because one of the jobs is to communicate the code's purpose and method
clearly to the reader. And in this respect, Python and Perl are leagues apart
in making this easy and natural for the author. Perl is widely known for
commonly being overly cryptic, with no small part of the blame owing to the
design of the language.

~~~
singingfish
See, more pointless perl bashing from a pythonista. Writing good perl is like
writing good english: easy to learn the basics, tricky to master (but aren't
all programming languages?), yet able to express good ideas fast and clearly.
Python trades off expressiveness for greater uniformity. Both languages have
their (different) warts. I'd probably recommend python for people who
prioritise uniformity over expressiveness, and perl for those who want to try
to realise their potential more. But at the end of the day they both do the
same job and do it well.

------
terranstyler
I personally do very well with Clojure / Incanter

While I must admit I have never worked with R (so I don't know what Incanter
is lacking), I do know that whatever problem needs solving, it won't be hard
to solve in Clojure. Two specific advantages of Clojure / Incanter are the
Clojure-inherent concurrency stuff plus access to all Java libs.

~~~
ska
One of his main points is about specific statistical codes being done
properly.

In other words, yes, it will be hard to solve in Clojure. It was hard to solve
in other lisps, it was hard to solve in fortran, and it will be hard to solve
in any other language you pick... unless someone else has already done the
heavy lifting.

This isn't the sort of thing you whip up in any language; good numerical
analysis libraries typically represent decades of work by people with
specialized backgrounds.

~~~
shoo
when i was using R a few years ago, a number of the packages i was most fond
of were themselves at least partially implemented in C++ / Fortran.

e.g.

`gbm` is pretty much a C++ project.

`randomForest` now appears to contain a fair bit of R code, but it still has
the (original?) fortran routines for tree construction, as well as C code
wrapping the fortran.

[https://github.com/harrysouthworth/gbm](https://github.com/harrysouthworth/gbm)
[http://cran.r-project.org/web/packages/randomForest/index.ht...](http://cran.r-project.org/web/packages/randomForest/index.html)

~~~
ska
Interesting. I'm not familiar with how much of the R library is implemented
in R; I expect some of it is tried-and-tested fortran, because it's a waste
of time to reimplement.

My comment was really addressing "it won't be hard to solve in Clojure.". Some
things are just hard to get right, period.

------
kiyoto
As I understand it, Matloff's point is "to each his own" and "understand
relative strengths and weaknesses of open source projects, but be very careful
about making direct comparisons."

Especially in the open source world, people like to compare competing
solutions/software. While constructive comparisons help make software better,
it is hardly ever productive to talk down on other open source projects.

As far as "data science" environments go, I really think it comes down to your
preference/needs, and a good understanding of the data/related concepts often
outweighs the differences among various tools. Look no further than John
Foreman's "Data Smart": he does a beautiful job dissecting and analyzing a
wide range of datasets with...Excel.

~~~
jnbiche
>While constructive comparisons help make software better, it is hardly ever
productive to talk down on other open source projects.

Are you saying this in support of a blog post with the title, "R beats Python,
R beats Julia, Anyone else wanna challenge R? (matloff.wordpress.com)"?

~~~
kiyoto
Absolutely not. But it is also true that we no longer read anything unless the
title is reasonably tabloid-y or it's got enough upvotes from other people.

------
yetfeo
This is a pet peeve of mine but...is 'wanna' a commonly used word in the USA?
It grates whenever I read it as it's rarely used where I am.

~~~
esbranson
Does this seem like a natural way of speaking: "I am grated when ever it is
twelve of the clock and I can not fall on sleep."? If you want an example
lesson in language drift, go to Rome and try and talk to people in Latin.

"Wanna" is called a "contraction" or "slang", probably originating from "want
to" and "want a", both pronounced like "want'ta". (For example, I am from
Sacramento, which is often pronounced as "Sacra-minnow". Also see "shoulda
woulda coulda".) Quite common in Northern California at least. It's probably
best described as Internet English. The real question is: Does it make it
harder to understand? Does it obstruct the goal of communicating information?
Or do you normally speak like a 17th Century aristocrat, using archaic words
that very few understand, but are probably technically on point? (Sorry, that
are _veritably apposite_.)

~~~
yetfeo
The way I speak and the way I write are different. Avoiding slang like 'wanna'
when writing is good.

~~~
djur
Not if you're trying to write in a conversational tone, which is common for
many blog posts.

'Wanna' has a particular confrontational connotation (possibly a joking one)
that I believe was desired here. 'Want to' wouldn't have had the same effect.

------
NamTaf
I would love to see them stack Matlab up against R, particularly with respect
to the vectorised computation. I suspect Matlab would put up a lot more
challenge to R in that regard. Never mind introducing GPGPU computation.

When it comes to breadth of capability, R would not even give Matlab with its
toolboxes a challenge. Simulink alone is huge in that regard. You pay through
the nose for it, though.

------
genofon
I'm not gonna read this..

R is a bit faster? So what? If I was really concerned about performance, I
wouldn't use R in the first place.

R is relatively fast when it uses libraries written in C or Fortran, just
like Python.

------
RA_Fisher
Not surprising since many functions in R are Fortran or C++ calls.

------
smegel
Who would have thought a specialized tool would be better in its own domain
than a general-purpose programming language...

