
R beats Python, R beats Julia, Anyone else wanna challenge R? (2014) - mindcrime
https://matloff.wordpress.com/2014/05/21/r-beats-python-r-beats-julia-anyone-else-wanna-challenge-r/
======
gaze
"In their trial run, Julia was much faster than R. But I objected, because
random walk is a sum. Thus one can generate the entire process in R as vector
calls, one to generate the steps and then a call to cumsum(), e.g."

You missed the point. Everything is fast when you export what you're doing to
a library. Julia is fast when you do, and when you DON'T.
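For readers unfamiliar with the vectorized idiom the article leans on: the same trick can be sketched in numpy terms (a hypothetical translation of the article's R code, not the original benchmark):

```python
import numpy as np

def random_walk(n, seed=0):
    """Simulate an n-step +/-1 random walk without an explicit loop.

    Mirrors the article's R idiom: one call generates every step,
    then cumsum() turns the steps into positions.
    """
    rng = np.random.default_rng(seed)
    steps = rng.choice([-1, 1], size=n)  # all n steps in one call
    return np.cumsum(steps)              # position after each step

walk = random_walk(1_000_000)
```

The whole simulation is two library calls, which is exactly gaze's point: the speed lives in the library, not the host language.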

~~~
epistasis
But not using the language's idioms is not a fair comparison.

In modern times, vector-based thinking like that present in R or numpy is the
only sane way to program, because these are not "library calls"; they are the
primitives of modern architectures.

~~~
idunning
The problem is that you may have to stray outside that vectorized bubble, and
at that point you enter a two-language situation if you are using R (e.g.
Rcpp) or Python (Cython?) - which was one of the motivations for the whole
Julia language: to avoid needing to do that.
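A concrete (hypothetical) example of code that falls outside the vectorized bubble is any recurrence where each output depends on the previous one, e.g. an exponentially weighted moving average:

```python
def ewma(xs, alpha=0.1):
    """Exponentially weighted moving average.

    Each output depends on the previous output, so the loop can't be
    replaced by independent elementwise vector operations. This is the
    kind of loop that pushes R users to Rcpp and Python users to
    Cython; Julia's pitch is that the plain loop is already fast.
    """
    out = []
    prev = xs[0]
    for x in xs:
        prev = alpha * x + (1 - alpha) * prev
        out.append(prev)
    return out

smoothed = ewma([1.0, 2.0, 3.0, 4.0])
```

(Some such recurrences can still be vectorized with clever transformations, which is part of the counterargument that truly loop-bound cases are rarer than they look.)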

~~~
epistasis
But that's no excuse for a bad benchmark. If it actually were necessary to
stray out of vectorization, then the Julia benchmark would have a point. But
those cases are few and far between in what I have experienced, and apparently
in what the benchmark authors have experienced.

------
msellout
The author seems to have a very narrow view of "data science". In the comments
he criticizes all of machine learning as simply an unprincipled revival of
nonparametric curve estimation. While there's much truth to the "unprincipled
revival" accusation, some aspects of machine learning arose from good
practicality-beats-purity software engineering rather than from simple
ignorance.

My interpretation of the term "data science" is much more in line with
practicality of engineering rather than the purity of traditional statistics.
To that end, Python is leagues ahead.

~~~
otabdeveloper
> he criticizes all of machine learning as simply unprincipled revival of
> nonparametric curve estimation

He's right. 'Machine learning' is just statistics with better, cleaner names
for things. The underlying theory is the same.

(The commenter below says that 'statistics explains, ML predicts', but this
isn't really true. Both statistics and ML build models; whether you use those
models for prediction or explanation is up to you.)

One problem is that 'statistics' as usually taught in a college course or
explained in a textbook comes from a much earlier age, before supercomputers
were available to the average person; in essence, it's a kind of machine
learning done with paper and pencil. In contrast, 'ML' assumes from the start
that computing resources will be available.

~~~
msellout
> In contrast, 'ML' assumes from the start that computing resources will be
> available.

Right, but I'm not sure that "better, cleaner names for things" actually
follows. Instead, I find that the ML folks hacked their way to results similar
to those of traditional statistics, but in many cases were comfortable treating
the algorithms as "black boxes" rather than having a clear understanding of
_why_ the algorithms worked. In that sense, the author's "unprincipled"
criticism is valid. This is less true today, but the recent research on
convolutional neural nets shows how ML starts by hacking on things until they
produce practical results and then backs into the theory of why they work.
This habit has resulted in much duplication of effort and of naming schemes.
My ML prof at Georgia Tech (Isbell was awesome!
[http://www.cc.gatech.edu/~isbell/](http://www.cc.gatech.edu/~isbell/))
constantly trashed "genetic algorithms" for being a silly form of randomized
hill-climbing.

The beneficial side of these less-principled techniques is that they happen to
work on larger scale datasets. It turns out approximate results are more
scalable than exact results.

------
dalke
(NB: the article was written in May 2014. There are many comments at the
bottom of the page.)

"This has been the subject of huge controversy over the years, so Guido van
Rossum, inventor of the language, added a multiprocessing module."

I think that assigns too much agency to van Rossum. PEP 371 describes the
justification for bringing pyProcessing into the standard library. The primary
developer was Richard Oudkerk, who along with Jesse Noller volunteered to
maintain it in the standard library.

While it's true that van Rossum accepted the PEP (on Thu, Jun 5, 2008 at 1:22
PM according to the mailing list), that's not quite the same thing as adding
it.

For that matter, the version control logs point out who originally added the
code, in its most literal sense:

    
    
        user:        Benjamin Peterson <benjamin@python.org>
        date:        Wed Jun 11 02:40:25 2008 +0000
        summary:     add the multiprocessing package to fulfill PEP 371

------
gtrubetskoy
"First, R is written by statisticians, for statisticians." To see the fallacy
of this statement, replace "statistician" with any other profession, e.g.
farmer, philosopher, doctor. Good software is written by _programmers_.

~~~
benhamner
Have you used R?

As both a heavy R user and a software engineer, I can promise you that one of
the quintessential aspects of R is that it's actually "written by
statisticians, for statisticians."

You can't accuse R of having great code and language design. Or good code and
language design. Or even mediocre code and language design.

Imagine what you would get if you got a million monkeys drunk, put them on a
roller coaster with laptops, and had them bang keys while they were upside
down on loops. And then the result suddenly, miraculously runs and produces
output. Now you understand R's software design.

~~~
jackmaney
Yep. The idea that I should have to code defensively to make sure that _the
matrices I'm multiplying are of the correct size_ [1] is complete bullshit.

    
    
        cat(paste("That", "and", "string", "manipulation", "in", "R", "is", "a", "pain", "in", "the", "ass", sep=" "))
    

[1]: If A is n by m and B is p by q, and m is a multiple of p, then R will
SILENTLY concatenate copies of B to itself to form an m by q matrix, and then
do the multiplication.
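For contrast, a (hypothetical) numpy sketch of the behavior jackmaney would prefer: the library refuses a non-conformable product outright instead of recycling anything:

```python
import numpy as np

A = np.ones((2, 4))  # 2 x 4
B = np.ones((2, 3))  # 2 x 3: inner dimensions (4 vs. 2) don't match

# numpy raises an error rather than silently recycling B to fit.
try:
    A @ B
    silently_recycled = True
except ValueError:
    silently_recycled = False
```

No defensive size-checking is needed here; the mismatch fails loudly at the call site.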

~~~
cwyers
R seems to have taken Perl's "there's more than one way to do it" as a credo.
All of the official documentation I've seen seems to be an exercise in trying
to say as little as possible with as many words as can be managed. All of the
syntax seems to be deliberately designed to make no sense whatsoever. And I
can give it a task that JMP (no speed demon itself) can finish in under an
hour and not get results back after letting it churn for nearly a day.
Everything about R makes me wonder what I've done with my life to end up using
it.

~~~
ekianjo
> and I can give it a task that JMP (no speed demon it) can finish in under an
> hour and not get results back after letting it churn for nearly a day

Example, please. I'd really like to see what you are talking about here.

And btw, good luck treating any kind of Big Data with JMP.

~~~
cwyers
> And btw, good luck treating any kind of Big Data with JMP.

Oh I'm definitely not intending to. (To the commenter below, it really doesn't
matter how you define Big Data for that statement.) That much I had already
figured out, which was why I was evaluating R with an eye towards pitching my
boss on it.

The test case I'm referring to here was a pretty simple neural net to my mind
-- roughly 350,000 rows of data, six predictor variables, one hidden layer
with 30 nodes. I can verify that the neural net code ran, because if I
truncated the data set down to 1,000 rows I got a result back. But the full
dataset just chugged for hours and hours without stopping.

------
59nadir
> This vectorized R code turned out to be much faster than the Julia code–more
> than 1000 times faster, in fact, in the case of simulating 1000000 steps.
> For 100000000 steps, Julia actually is much faster than R, but the point is
> that the claims made about Julia’s speed advantage are really overblown.

What are we supposed to glean from this paragraph?

~~~
wmt
R beats Julia if you ignore all the cases where Julia beats R!

------
n0us
Actually the GIL is not necessarily an issue for NumPy, since C extensions can
release the GIL. I don't see why this needs to be a winner-take-all
competition between languages for statistics. It's worthwhile to have
different approaches, and just because R is better than Python and Julia in
the author's opinion doesn't mean that it is always the better option no
matter what.
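A sketch of what "C extensions can release the GIL" buys in practice (assuming numpy; the sizes and thread count here are arbitrary):

```python
import threading
import numpy as np

# NumPy's compiled kernels release the GIL while they run, so two
# threads doing large matrix multiplies can overlap on separate
# cores even though Python bytecode itself is serialized by the GIL.
def work(results, i, a):
    results[i] = float((a @ a).sum())

a = np.ones((200, 200))
results = [None, None]
threads = [threading.Thread(target=work, args=(results, i, a))
           for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The Python-level thread management is serialized, but the matrix multiplies themselves are not.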

------
leeleelee
I find python's multiprocessing module simple to use. I would be interested in
hearing the author elaborate on why he finds it clunky and why he has not had
a good experience with it.
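For reference, the minimal use of the module looks like this (a generic sketch, not the author's code):

```python
from multiprocessing import Pool

def square(x):
    # Worker functions must be defined at module level so the
    # child processes can import them.
    return x * x

if __name__ == "__main__":
    # Each worker is a separate process with its own interpreter,
    # which is how the module sidesteps the GIL entirely.
    with Pool(processes=2) as pool:
        squares = pool.map(square, range(5))
    print(squares)  # [0, 1, 4, 9, 16]
```

The `__main__` guard is required on platforms that spawn rather than fork, which is the main bit of ceremony the API imposes.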

------
geomark
These days I use both R and Python regularly. I don't have a strong opinion
about one being better than the other, and I feel pretty good about that.

------
sgt101
" My impression of Julia’s parallel computation facilities so far, admittedly
limited, is similar."

1\. It's not good to make assertions that you immediately undermine by
admitting you don't know what you are talking about.

2\. I find @spawn and @fetch quite nice; in fact, I think that Julia was built
from the ground up around notions of parallelism (i.e. multiple dispatch).

------
waitingkuo
I've tried to learn R several times, but its indexing starting at 1 always
stops me from learning it.

~~~
Redoubts
I've always wondered why Julia went this route too. They more or less said
"because MATLAB does it".

~~~
idunning
Source? One of the best arguments, and one that is personally relevant to the
work I do, is that indexing in math often starts at 1 (e.g. the indexing of
rows and columns of a matrix). Translating this to code results in the mental
overhead of mapping the math index to the code index, a cost I don't pay in
Julia.
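The overhead described here is the perpetual off-by-one translation; a toy (hypothetical) sketch of it in a 0-based language:

```python
# In math, matrix entries are written a_ij with i and j starting at 1.
# In a 0-based language, "the entry in row 2, column 3" requires
# subtracting 1 from each index -- the translation Julia avoids.

M = [
    [11, 12, 13],
    [21, 22, 23],
    [31, 32, 33],
]

def entry(matrix, i, j):
    """Look up a_ij using the 1-based convention from the math."""
    return matrix[i - 1][j - 1]

a_23 = entry(M, 2, 3)  # the math's a_23
```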

~~~
Redoubts
[https://github.com/JuliaLang/julia/issues/558](https://github.com/JuliaLang/julia/issues/558)

    
    
      > There is a huge discussion about this on the mailing list; please see
      > that. If 0 is mathematically "better", then why does the field of
      > mathematics itself start indexes at 1? We have chosen 1 to be more
      > similar to existing math software.

A post above also appeals to Mathematica. It's been a while since I've had to
internalize the "cost" of zero indexing, so this never felt very compelling.

