
Python vs. R: The battle for data scientist mind share - okket
http://www.infoworld.com/article/3187550/data-science/python-vs-r-the-battle-for-data-scientist-mind-share.html
======
geebee
Wow, people here are being pretty hard on this article. I'm sure everyone here
on HN has read their share of programming language flamewars, and this one
doesn't come close. It's pretty mild.

I also think it gets something right - it does feel different to use Python
and R, and the reason may be rooted in how these languages arrived at data
science. Python, as the article points out repeatedly, is a general
programming language that scientists liked using for numerical computing, so
slowly it acquired a billion libraries for data analysis, stats, machine
learning, and so forth. R was created by scientists and statisticians
specifically to do stats and analysis, but in order to be useful, it needed to
acquire the full capabilities of a programming language.

What feels most natural to you often depends on what direction you come from.
If you're a programmer taking on data science, you may gravitate toward
Python. If you're a statistician getting deeper into code, R may feel more
natural. It would behoove you to learn both.

What on earth is wrong with that? I suppose that for a few people on HN, this
might be a bit repetitive. But otherwise, I'd recommend this article as a
relatively non-combative bit on explaining the different languages, especially
for someone getting started.

~~~
geokon
It's been a few years since I've done some real R work, but my general
impression of things was that "core R" - i.e. the R that is explained in the R
manual - is actually a bit deprecated. It's not the correct way to use R. The
correct way is to use the Hadley tools (ggplot, dplyr, etc.)

These tools are grafted onto R - but seem to have a completely different
design philosophy. I actually don't know why they're in R and not Python or
C++ or whatever other language - but they form a set that is very easy to work
with and produce results really quickly (especially in combination with
RStudio).

So the design principles behind R (or I guess the S language) kinda become
irrelevant.

~~~
apathy
R is explicitly designed around S-expressions and as such lends itself to
domain-specific languages like these. The choice was not an accident, either by
the original Rs (Robert Gentleman and Ross Ihaka) or by Hadley.

Guido has explicitly stated that he does not want Python to be "more lispy"
e.g. in regards to lambdas (asterisk). Thus I've seen many people even at,
say, Stanford, Harvard, and Cambridge going back to R from Python. Sometimes
there does not exist a language that best suits a workflow, and a DSL works
better. That is where lispy languages hold an advantage.

Use the right tool for the job, imho, but I fucking hate people that mix
the two within a project intended for wide public release. Worst of both
worlds, again imho.

(Asterisk) apparently functional data structures such as iterators and
generators are OK though. Wtf guido

------
minimaxir
Ugh, not again.

R is good for tabular data and Python is good for text/image/nontabular data.
_And there's nothing wrong with knowing and using both languages._

Likewise, the world will not end if you use Python pandas for tabular
manipulation or various bespoke R packages for nontabular manipulation.

This isn't the battle that people should be fighting. It's not even a
_religious_ argument like web development stacks where a language can eke out
better benchmarks. And as others note, _this very article_ concludes that they
both have their advantages.

~~~
BeetleB
>R is good for tabular data and Python is good for text/image/nontabular data.

Everyone I know who's used both prefers Pandas to R for tabular data.

~~~
minimaxir
More specifically, tidyverse/dplyr > pandas, in my opinion. Base R
manipulation is not good.

~~~
lottin
It's a matter of taste. I prefer base R over tidyverse.

------
peatmoss
Hadley Wickham and Wes McKinney take issue with the gladiatorial angle of this
article:
[https://twitter.com/wesmckinn/status/850353331172249600](https://twitter.com/wesmckinn/status/850353331172249600)

~~~
apathy
Both are nice and pleasant guys to work with, unlike a lot of the drooling
idiots that act like this "rivalry" is some sort of football game. Funny how
master craftsmen don't often blame their tools, they sharpen them instead.

Any time someone blames their tools for their own inadequacies, show them this
video of Kelly Slater surfing better on an overturned table than most of us
can surf on a 7 foot three fin board:
[https://m.youtube.com/watch?v=XQ4owd3yQ_4](https://m.youtube.com/watch?v=XQ4owd3yQ_4).
Up your game instead!

Edit: hacker news doesn't use that part of markdown

~~~
peatmoss
I agree—it's about not blaming the tools, but also following Vonnegut's rule:
Goddamnit, you've got to be kind.

Some of our other FOSS luminaries have chosen a different interaction model.
Sometimes that other approach to technical project leadership is touted as
necessary. I'm sure Hadley and Wes suffer their fair share of fools, and they
generally seem to do so with kindness.

------
cwyers
> Yes, Python makes preprocessing easy, but that doesn’t mean you can’t use R
> if you need to clean up your data. You can use any language. In fact, in
> many cases, it’s structurally unsound to mix up your data purification
> routines with your analysis routines. It’s better to separate them. And if
> you’re going to separate them, why not use any language you like? That may
> indeed be Python, but it could be Java, C, or even assembly code. Or maybe
> even you want to preprocess your data within the database or some other
> storage layer. R doesn’t care.

The hell does this even mean?

~~~
MR4D
Yowza.

I'm guessing it means clean your data first; then process your cleaned data;
and if you're doing that, why not try another language.

I think the author has a forest and trees issue here. He's definitely missing
the point of Python for his use case.

------
padthai
All these articles always say that Python has better preprocessing compared
with R. Where? I find tidyverse/data.table much more elegant than pandas,
scipy et al. The only thing that I like more in Python is how it handles
streams/generators.
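
For what it's worth, a minimal sketch of the stream/generator style mentioned here. The record format and the function names (`parse_rows`, `keep_valid`, `running_mean`) are made up for illustration. The point is only that each stage pulls one item at a time, so nothing has to be loaded into memory in full.

```python
from typing import Iterable, Iterator, Tuple


def parse_rows(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Lazily split 'name,value' lines into (name, raw_value) pairs."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, _, raw = line.partition(",")
        yield name.strip(), raw.strip()


def keep_valid(rows: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, float]]:
    """Drop rows whose value is not a number; convert the rest to float."""
    for name, raw in rows:
        try:
            yield name, float(raw)
        except ValueError:
            continue


def running_mean(rows: Iterable[Tuple[str, float]]) -> Iterator[float]:
    """Yield the mean of all values seen so far, one per input row."""
    total = 0.0
    for count, (_, value) in enumerate(rows, start=1):
        total += value
        yield total / count


# Hypothetical input; in practice this could be an open file or a socket.
raw_lines = ["a, 1.0", "b, n/a", "", "c, 3.0", "d, 5.0"]
pipeline = running_mean(keep_valid(parse_rows(raw_lines)))
means = list(pipeline)
```

Nothing runs until `list(pipeline)` starts pulling values, which is roughly what makes this style pleasant for preprocessing large inputs.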

~~~
itschekkers
Ya, completely agree with this (having also used both extensively). dplyr can
also connect to remote SQL server so data don't have to be local. Maybe pandas
does this now too but in my experience SQL connections were generally more
painful in python

~~~
Analog24
This is what SQLAlchemy is for. It's one of the best ORMs out there and makes
interacting with a SQL server a breeze.
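
A rough sketch of the "let the database do the work" idea from this subthread. It deliberately uses the standard-library `sqlite3` module with an in-memory database as a stand-in for a remote SQL server, not SQLAlchemy or dplyr. The table name `readings` and its columns are invented for the example.

```python
import sqlite3

# In-memory database standing in for a remote server (assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings (sensor, value) VALUES (?, ?)",
    [("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", 20.0), ("b", 30.0)],
)

# Aggregate inside the database so only the summary rows come back to Python.
query = """
    SELECT sensor, COUNT(*) AS n, AVG(value) AS mean_value
    FROM readings
    WHERE value >= ?
    GROUP BY sensor
    ORDER BY sensor
"""
summary = conn.execute(query, (1.0,)).fetchall()
conn.close()

for sensor, n, mean_value in summary:
    print(f"{sensor}: n={n}, mean={mean_value:.1f}")
```

Only the grouped rows cross the connection, so the raw data never has to be local. An ORM or dplyr's database backend would presumably generate similar SQL for you.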

------
AndyMcConachie
Having recently completed a data analysis project I'd say the biggest thing
that made me choose R was its ability to make really pretty graphs easily.

I have a lot of experience with Python doing all sorts of programming. This
was my first R program and I don't regret it. The libraries in R(ggplot2) for
making pretty graphs are much better than anything I could find for Python.

~~~
RobinL
Having spent lots of time recently looking at vis in both Python and R (I use
both), I'm starting to think Vega lite may well be the future.

In Python, you can use Altair [0] and in R you can use the vegalite package
[1]. Note also that R's ggvis uses vega under the hood.

[0] [https://github.com/altair-viz/altair](https://github.com/altair-viz/altair)

[1]
[https://github.com/hrbrmstr/vegalite](https://github.com/hrbrmstr/vegalite)

------
kensai
"The first stage of data aggregation can be accomplished with Python. Then the
data is fed into R, which applies the well-tested, optimized statistical
analysis routines built into the language. It’s as if R is a library for
Python. Or maybe Python is a preprocessing library for R."

I really like this approach, actually. Taking advantage of the strengths of
both languages.

~~~
fleetingmoments
We often take this approach at my company. The heavy lifting of feature
extraction from raw data (wearables in our case) is done by python/numpy
models. The population level stuff is then often handled in R by data
scientists with more of a maths/stats background than an engineering one.
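
One way the handoff described here could look, as a hedged sketch: Python reduces raw samples to one row of features per subject and writes a CSV, which the R side could then load with `read.csv`. The sample data, the feature names and the `extract_features` helper are all hypothetical, and a real pipeline would presumably use numpy and actual sensor files rather than the standard library.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def extract_features(samples):
    """Collapse (subject_id, heart_rate) samples into per-subject features."""
    by_subject = defaultdict(list)
    for subject_id, heart_rate in samples:
        by_subject[subject_id].append(heart_rate)
    return [
        {
            "subject_id": subject_id,
            "n_samples": len(values),
            "mean_hr": mean(values),
            "max_hr": max(values),
        }
        for subject_id, values in sorted(by_subject.items())
    ]


# Hypothetical raw wearable samples: (subject_id, heart_rate).
raw_samples = [("s1", 60), ("s1", 70), ("s2", 80), ("s2", 90), ("s2", 100)]
features = extract_features(raw_samples)

# Write to an in-memory buffer here; a real job would write a file for R.
fieldnames = ["subject_id", "n_samples", "mean_hr", "max_hr"]
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(features)
csv_text = buffer.getvalue()
```

A flat file is the least clever bridge between the two languages, but it keeps the engineering side and the stats side decoupled.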

------
ianamartin
This is one of the worst articles I've ever read. It's literally (figurative
literal here) creating a shitshow out of a mountain made of non-combative
people who get along with each other mostly, but get a little pissy if you
bone one of their wives, which is totally theoretical because none of them
actually have wives.

But you get the point: this article has no point.

~~~
talloaktrees
I agree. I also take issue with it's content, I think all the points made are
very shallow and not necessarily correct.

~~~
vixen99
its content

------
glup
In the "both of these are awesome, thx for your input infoworld" camp, does
anyone know of an equivalent of purrr for pandas/ python? I've been digging
the pandas/scikit/numpy/numba stack recently but a friend was showing me the
most beautiful data manipulation R code the other day, written in purrr.

------
memracom
Wise advice.

Use both because the real tool that you are using for data analysis is the
computer, not the programming language. R and Python are both just parts of
the toolset.

Nowadays there are lots of ways to combine the two from Rpy2 to Orange3 to
Jupyter and the Beaker Notebook. Notably the last two let you use
Groovy,Java,Scala and a host of other languages as well. Apache Taverna also
plays in this space of integrating multiple tools with different strengths to
do a job.

R will likely never be eclipsed by anything because it has such a broad and
deep collection of statistical libraries. But Python won't go away because it
is a great tool for general purpose computing and even hardcore stats heads
have a lot of general purpose computing problems to deal with.

It is sad to read R code that copies files, gets data from S3 buckets, runs
SQL queries, and so on. So much of it is crudely hacked together and even the
libraries that support this are shoddily built. The best of both worlds is to
use Python for pre and post processing, but R for the stats libraries (CRAN,
BioConductor).

For lots of S3 wrangling the best tool is a Java library called JetS3t, and
using a language like Groovy or Scala makes it easy to tame. And Groovy is
integrated deeply into Jenkins which has evolved beyond a CI tool into a
general purpose dashboard for managing and running "jobs". Works great for big
data stuff that is not purely Map/Reduce.

Beaker Notebook is leading the charge by integrating seamless conversion of
data frames between languages so that you can write a script in two or three
languages at the same time, building on the strengths of each one.

If you stick with just one language then expect the next generation of data
scientists to leap far beyond you in a few years. A sea change is coming.

~~~
vorg
> the last two let you use Groovy,Java,Scala and a host of other languages as
> well

Neither Python nor R run on the JVM, so if you end up using
Java,Scala,Kotlin,etc then you've decided to open that JVM can of worms which
is another huge pile of tradeoffs.

> JetS3t, and using a language like Groovy or Scala makes it easy to tame.
> And Groovy is integrated deeply into Jenkins which has evolved beyond a CI
> tool into a general purpose dashboard

If you end up there, know that only a subset of Apache Groovy is used by
Jenkins, e.g. Groovy collections methods aren't supported. Each step along the
"native Python or R" -> "Java on JVM" -> "Scala or Groovy" -> "Jenkins as
dashboard" decision process entails some cost-benefit tradeoffs which need to
be assessed.

------
flohrian
Future data scientists probably also should consider Julia.

~~~
digitalzombie
Data Scientist here.

Only after they learn a boring language, either R or Python.

Don't bet your money on Julia, it's only at 0.6 so the API ain't even stable
yet.

Devs promise no changes to the language when it hits 1.0.

~~~
kem
I've been using R intensively for almost two decades, and Python for about
half that time. I enjoy using them both and think they're great languages. I
don't think it's an either-or thing, because they both have something to
offer.

At the same time, they also have a lot of weaknesses, most of which are
summarized by the Julia benchmarks
([https://julialang.org/benchmarks/](https://julialang.org/benchmarks/)). You
can criticize these particular benchmarks, but similar patterns emerge in lots
of other benchmarks.

R was never meant to do the heavy lifting it's doing today. Ihaka sort of
lamented this fact for a while, and then got ignored as people went on to use
it anyway.

Sure, you can wrap things around low-level C/C++/Fortran in either language,
but eventually if you find yourself getting into nitty-gritty stuff, the
computation and/or memory use of R and Python becomes a problem. It also
complicates a task to rely on juggling two platforms at the same time.

Julia is new, but it reminds me a _lot_ of R in its early stages. I started
using R when it was in beta because it offered something new, and Julia has a
similar feel at the moment. Maybe Julia will die away but it doesn't seem that
way to me at the moment. I've seen lots of prospects come and go, and none of
them had the same traction as Julia.

If anything will stem the growth of Julia, it probably will be Python.
Javascript saw a lot of performance gains after Google and other players
invested heavily into it as part of the mobile ecosystem. It seems like Python
is getting similar investments now with ML/DL, and I wouldn't be surprised if
Google, et al. started dumping tons of resources into PyPy or something in the
same way you saw javascript implementations getting that investment. At the
same time, if you look at benchmarks of PyPy, it seems like you might get to
the same level as javascript, which isn't the same as Julia (or C++, which is
maintaining its relevance, or Rust or Go, which are growing and relevant).

I guess my point is if a student asked me, sure, I'd recommend they prioritize
R or Python first, but I would also explain Julia to them and recommend they
become familiar with that as well.

------
hzhou321
Maybe I am the only one, but I find numpy is like alcohol. You may feel
exhilarated in writing numpy code, but the resulting code is often very
difficult to read.

~~~
tnecniv
In my experience, this is just mathematical code in general.

------
tiatia
I have not looked at the link. But would Sagemath not be both? Python AND R?

[http://doc.sagemath.org/html/en/reference/interfaces/sage/in...](http://doc.sagemath.org/html/en/reference/interfaces/sage/interfaces/r.html)

------
rs86
Awful article. Badly researched and badly written. Just a journalist wanting
to write about something he thinks people want to read

~~~
glup
7 ways to spice up your relationship... with data science!

------
lobo_tuerto
Wow, I really liked the introductory paragraphs, gets the point across very
well for slightly non-technical people.

------
jbmorgado
Oh God the new age politically correct "battles" where in the end you all hug
and sit down to sing the kumbaya and try to please all readers by claiming
everyone is actually a winner in this "battle"...

If you are too afraid to actually analyze a situation and give your opinion,
then just don't write about it and spare us all the time it takes us to read
it.

~~~
ygaf
Perfectly put. See also
[https://news.ycombinator.com/item?id=14067039](https://news.ycombinator.com/item?id=14067039)

------
digitalzombie
> Indeed, it is a variant of S with lexical scoping to make large code bases
> cleaner

I have no clue where he's getting this.

R has 3-4 ways to make a class btw. The code base isn't cleaner. Most
packages that need speed are coded in faster languages. So R is gluey for
packages.

The code base is decent overall but I think Python is much better.

> Python does everything any language can do

I want it to preemptively stop processes like Erlang but it can't. So this is
wrong.

> The Python world has been trying to catch up lately by working with existing
> IDEs like Eclipse or Visual Studio.

It has an RStudio equivalent, it's Rodeo.... this guy. He mentioned it later on
for some reason, which contradicts his previous statement.

I think the article is an unorganized brain dump. Maybe he just needs to
reorganize his thoughts.

~~~
sampo
> > Indeed, it is a variant of S with lexical scoping to make large code bases
> cleaner

> I have no clue where he's getting this.

The original S language from the Bell Labs (and the commercial version S-PLUS)
used dynamic scoping (like Emacs lisp).

[https://www.stat.auckland.ac.nz/~ihaka/downloads/lexical.pdf](https://www.stat.auckland.ac.nz/~ihaka/downloads/lexical.pdf)
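
To make the scoping difference concrete, here is a small sketch in Python, which (like R) is lexically scoped. All names are made up. Under lexical scoping a function looks up its free variables where it was defined. Under dynamic scoping, the behaviour this comment attributes to S, it would look them up in whoever called it. Python has no dynamic-scoping mode, so the second half emulates that rule by hand with an explicit stack of bindings.

```python
x = "global"


def show_x():
    # Lexical scoping: the free variable x resolves to the module-level x,
    # because that is where show_x was defined.
    return x


def caller():
    x = "caller-local"  # invisible to show_x under lexical scoping
    return show_x()


lexical_result = caller()

# Hand-rolled emulation of dynamic scoping: a stack of binding frames that is
# searched from the most recent call outward.
dynamic_frames = [{"x": "global"}]


def lookup(name):
    for frame in reversed(dynamic_frames):
        if name in frame:
            return frame[name]
    raise NameError(name)


def show_x_dynamic():
    return lookup("x")


def caller_dynamic():
    dynamic_frames.append({"x": "caller-local"})
    try:
        return show_x_dynamic()
    finally:
        dynamic_frames.pop()


dynamic_result = caller_dynamic()
```

With the lexical rule the caller's local `x` cannot leak into `show_x`, which is arguably the "cleaner large code bases" argument the article was gesturing at.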

