
R vs. Python for Data Science - Rincevent
https://github.com/matloff/R-vs.-Python-for-Data-Science
======
fantispug
This is missing the most important difference - deployability. R was built as
a language to use interactively and does things like raise warnings for things
that should be errors, requires an external package (packrat) for reproducible
package management, and in general is foreign to most developers running
operations. Python has good error handling, scripting and logging out of the
box and manageable package management, and is familiar to most developers and
operations. Python has much better libraries for building general purpose
tools (but fewer libraries for complex statistics).
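
To make the warnings-versus-errors point concrete on the Python side, here is a minimal sketch using only the standard library. The helper `parse_ratio` and the logger name `pipeline` are made up for illustration; the idea is simply that a warning can be promoted to an exception with `warnings.simplefilter("error")` and the failure recorded with the built-in `logging` module.

```python
import logging
import warnings

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s: %(message)s")
log = logging.getLogger("pipeline")  # hypothetical logger name


def parse_ratio(numerator: float, denominator: float) -> float:
    """Hypothetical helper that only warns on a suspicious input."""
    if denominator == 0:
        warnings.warn("denominator is zero, returning 0.0", RuntimeWarning)
        return 0.0
    return numerator / denominator


def strict_parse_ratio(numerator: float, denominator: float) -> float:
    """Run parse_ratio with every warning promoted to an exception."""
    with warnings.catch_warnings():
        warnings.simplefilter("error")  # warnings now raise instead of printing
        return parse_ratio(numerator, denominator)


if __name__ == "__main__":
    try:
        strict_parse_ratio(1.0, 0.0)
    except RuntimeWarning as exc:
        log.error("aborting: %s", exc)
```

Whether a given warning should be fatal is a per-project decision; the filter can also be limited to specific warning categories.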

I disagree with the "learning curve"; if you've learned other programming
languages Python has a pretty simple and familiar core, and Pandas (while the
API is an inconsistent mess) is well documented. Base R is quirky compared
with modern programming languages, and the API is pretty inconsistent.

I also strongly disagree with the Tidyverse bashing. I'd say it has the
shortest learning curve (especially for someone familiar with SQL), and is one
of the main reasons I still use R today outside of deep learning - I find it
_much_ more friendly to work with than any alternative.

~~~
bionsystem
> This is missing the most important difference - deployability.

I've deployed both R and Python for a team of completely junior data scientists, on
top of a poorly managed infrastructure. I'd say they both have pros and cons
and are actually both pretty bad. But R's packrat makes it slightly better
than python. Python is a mess when you want to reproduce a working
environment. Conda and pip both have huge issues. R's package management is
pretty poor too with completely misleading errors, but at least it's unique
and once you know your way around the most common errors you can build and run
different projects quite consistently.
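
As a rough illustration of what "reproducing a working environment" involves on the Python side, the sketch below lists every installed distribution as a pinned `name==version` line, using only the standard library. The function name `pinned_requirements` is made up, and this covers only Python packages: it says nothing about system libraries or the interpreter version, which is where much of the pain described above tends to come from.

```python
from importlib import metadata
from typing import List


def pinned_requirements() -> List[str]:
    """Return sorted 'name==version' lines for installed distributions."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata has no name
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


if __name__ == "__main__":
    # Similar in spirit to the requirements-style list that `pip freeze` prints.
    print("\n".join(pinned_requirements()))
```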

I've managed both RStudio+Shiny for R and Jupyter for python and overall my
experience is better with the R stuff too. Things look a bit standardized
while Jupyter needs tons of dependencies and (I felt) lacks a clear
opinionated way of doing things.

I have 0 opinion on the actual languages though, as I'm not a developer.

~~~
bayesian_horse
At least in my experience it has been pretty simple to deploy Python software.

~~~
hjk05
Pretty simple is relative. I deploy python applications to cloud instances
using docker through a git push based ci/cd setup. It works great, and I think
it’s simple. But if I have to explain to an analyst how to use 3 different
platforms and 5 or so tools to replicate what he currently gets by clicking
“publish to RStudio connector” in the top right of his code, it seems obvious
that’s not even close to being comparable.

------
geertj
I’m going out on a limb here and may get downvoted. But I’ll say it anyway. This
battle is over, and Python is now the standard for data science, just like git
is the standard for version control. The mindshare, number of tools, and
number of people that know and use Python are an order of magnitude higher
than those for R. As a new data scientist it does not make sense to start with
anything but Python.

~~~
deng
Well, I'd say you're on a machine learning limb there, and "data science" is -
at least IMHO - much more than that. As the article says, statisticians
usually much prefer R over Python.

~~~
bayesian_horse
My feeling is that statisticians' preference for R is a historical artifact
from when Python lacked clear ways to do things R documents nicely. And it
seems to be changing.

------
gloflo
That's written by an opinionated R person/book-writer and stays on a very
basic level of anecdotes and hearsay. Some comparisons are super
short-sighted. For example, the author seems to consider searching PyPI for
keywords a reasonable way of finding functions for both quite specific and
totally unspecific terms ("spatial data"...).

You don't miss anything by skipping this.

------
rat9988
After reading this paragraph, I started seriously questioning the
objectiveness of the comparison.

> By contrast, just now I tried to find nearest-neighbor code for Python and at
> least with my cursory search, came up empty-handed; there was just one
> implementation that described itself as simple and straightforward, nothing
> fast.

> The following searches in PyPI turned up nothing: log-linear model; Poisson
> regression; instrumental variables; spatial data; familywise error rate;
> etc.

This is not how you search for things. Usually I search on google for "poisson
regression scipy" if I'm looking for poisson regression.

~~~
patrick5415
How you search for things and You search for things are not the same thing.

~~~
gloflo
A basic level of adjusting to a language's jargon and ecosystem is mandatory.
On the other side there is 'tidyverse', whatever that might be.

------
IanCal
"RStudio is to be commended for developing the reticulate package, to serve as
a bridge between Python and R. It's an outstanding effort, and works well for
pure computation. But as far as I can tell, it does not solve the knotty
problems that arise in Python, e.g. virtual environments and the like."

I'm not sure I follow this, you can just set the interpreter.

I've started using rMarkdown more heavily, with reticulate & python for most
data munging and r for plotting. Partly because I already know how to solve
the problems I have in python more quickly than in R. The only thing I have
against it at the moment is the debugging story isn't very nice by default,
though I've not looked into how to improve this.

edit - if you've not looked into rmarkdown, I heartily recommend it. It is to
me what the final output of notebooks _should_ be. I can easily interleave
code and descriptions, hide what I want, _run it from scratch entirely as a
default_ , and produce a range of outputs including interactive static
webpages. Once web packaging is finally sorted, it'll be near perfect.

------
hackerlurker
Python is the second-best language for everything.

~~~
yesforwhat
2nd best over x categories is pretty good.

~~~
jacobush
Where x is large. Which it is for Python. Which is probably why I use it so
often.

------
teekert
I'm a Python user (and I bet the author is primarily an R user), and probably
I am biased but these two conclusions I find strange:

* R has Better statistical correctness based on "some dude"?

* R has better OO programming because you can print functions to the command line?

In my workplace R and Python are both well represented and I always hear from
the R users that there is no "real support" for classes in R as there is in
Python and that they miss it. I can't judge for myself though.

------
MayeulC
For what it's worth, I think GNU Octave could put up a good fight here as
well.

I'd be interested in seeing it included in the comparison, though (I'm afraid
it would lose on multiple points if only due to lack of funding, but it is
very usable).

I currently have a fairly big project written in Octave, but will likely
rewrite it in python for maintainability (and would rewrite it again in
something else if it grew too much for Python).

There's also Sage, which is an interesting contender, but I do not know enough
about it to know how it compares. Arguably, though, Sage and Octave are more
geared towards numerical computing than data science, and I think that's where
they shine. So, depending on your data and the processing you need, those
could be more adequate.

~~~
dagw
_Arguably, though, Sage and Octave are more geared towards numerical computing
than data science_

Sage is not really aimed at numerical computing, even though it can be used
for that. Its primary use case is more towards computational algebra and
number theory (and related areas) and is generally more focused on features
needed by researchers and academics.

------
tpetry
There's one reason I was learning some R: the charting capabilities of ggplot2
are awesome. I have never produced such good-looking graphics before.

~~~
gloflo
Try [https://plotnine.readthedocs.io/](https://plotnine.readthedocs.io/) or
[https://seaborn.pydata.org/](https://seaborn.pydata.org/)

~~~
gerty
ggplot2 is not just a package, it's an implementation of a grammar language
and this is what makes it so hard to substitute. plotnine tries to mimic it
and is good for easy stuff but still far behind what can be easily done with
ggplot2.

------
haddr
One important part that was not mentioned is the ability to deploy
and operationalize the model. I think Python has a slight advantage in this
area. Especially when focusing on operationalisation and integration with
other systems and flows.

~~~
hjk05
It sounds like you don’t actually know R. Deployment in R is a one-click
thing. In Python it’s a complex path through virtual environments, multiple
incompatible packaging tools and dependency managers, at times compiling some
C dependencies yourself, etc. Anaconda originally gained popularity because it
was the only way to get a student started with python-numpy-scipy that didn’t
require extensive prior technical capabilities and tons of up front investment
in reading through guides on how to configure and set everything up.

I mostly use python and prefer it to R, but putting things into production is
not a strength of python and R wins that comparison a thousand times over.

------
randomvectors
Having learned and used both, I disagree with most of his points:

\- Learning curve

\- Machine Learning

\- Parallel computation

\- C/C++ interface

\- Object orientation

All of these are wins for Python, some of them like the learning curve are
wins by a huge margin. I should probably do a point by point rebuttal later
but so many of his points are incorrect and/or poorly justified.

------
deng
He couldn't find a nearest neighbor searcher in Python? There are several in
scikit-learn and scipy has cKDTree. Those are really not hard to find.
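
For reference, a nearest-neighbour lookup with SciPy's `cKDTree` might look like the sketch below. The points are made up purely for illustration, and the snippet assumes NumPy and SciPy are installed; scikit-learn's `NearestNeighbors` class is another option along the same lines.

```python
import numpy as np
from scipy.spatial import cKDTree

# Made-up 2-D points, purely for illustration.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
tree = cKDTree(points)

# Ask for the two nearest neighbours of a new point.
distances, indices = tree.query([0.9, 0.1], k=2)

print(indices)    # row numbers into `points`, closest first
print(distances)  # Euclidean distances, in the same order
```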

~~~
Liquid_Fire
Similarly,

> For instance, though functions are objects in both languages, R takes that
> more seriously than does Python. Whenever I work in Python, I'm annoyed by
> the fact that I cannot print a function to the terminal, which I do a lot in
> R.

I assume "print a function to the terminal" means print its source? If that is
the main complaint, it is available out of the box in IPython (%psource),
which if you are doing data science you are probably using already.

~~~
dagw
There is also the inspect module that lets you retrieve both the source code
and anything else you want to know about a function.
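
A small sketch of the `inspect` approach, using a made-up function `scale`. Note the caveat in the comments: `inspect.getsource` needs to be able to locate the file that defines the function, so it may not work for code typed straight into a bare interpreter.

```python
import inspect


def scale(values, factor=2.0):
    """Multiply every value by factor (hypothetical example function)."""
    return [v * factor for v in values]


# Signature and docstring are available for any pure-Python function.
print(inspect.signature(scale))
print(inspect.getdoc(scale))

# The source text is only available when Python can find the defining file,
# so guard against the OSError raised when it cannot.
try:
    print(inspect.getsource(scale))
except OSError:
    print("source not available in this session")
```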

------
physicsguy
I think it's a lot less common for authors of packages to put them on to PyPI
vs CRAN (mostly due to the absolute mess of Python packaging) so it doesn't
surprise me that his searches turned up not much there. On GitHub there are
those packages:

[https://github.com/search?q=poisson+regression](https://github.com/search?q=poisson+regression)

I've not used Rcpp, but Pybind11 is pretty mature, and works well, so I'm not
sure why he's saying it's under development; by the same measure there was an
update to Rcpp last week, so that is too. He mentions Cython which allows you
to compile Python code, but my main use case for that is exactly what he says
- wrapping up C/C++ libraries, which is very easy in it.

"Python is currently undergoing a transition from version 2.7 to 3.x. This
will cause some disruption, but nothing too elaborate."

This is pretty out of date; in the JetBrains survey, 84% of devs had
transitioned.

------
rizoic
In the course of work I end up using both R and Python. I think both have
their own use cases. Some of the observations I have had are:-

R

\- The tidyverse ecosystem has given a huge boost to R. It has brought
intuitiveness and consistency to R which was much required especially if you
are a programmer coming from other languages. Also there are other ecosystems
like Bioconductor which are also very mature.

\- Rstudio and especially Rmarkdown notebooks are much better for reproducible
analysis than Jupyter.

\- It is very difficult though to develop standalone tools with R. For example
it doesn't have a good argument parser.

Python

\- The language is much more intuitive and more ideal for developing
standalone tools.

\- The ecosystem is in many cases very fragmented though with a lot of libraries
doing similar things.

\- It lacks a good plotting system. Matplotlib is very powerful but has a very
steep learning curve. In comparison ggplot2 in R is very intuitive.
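
On the standalone-tools point, Python ships an argument parser (`argparse`) in the standard library. The sketch below is a hypothetical command-line tool named `summarize`; the tool name, option names and behaviour are all invented for illustration.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="summarize",  # hypothetical tool name
        description="Print the mean of a list of numbers.",
    )
    parser.add_argument("values", nargs="+", type=float, help="numbers to average")
    parser.add_argument("--digits", type=int, default=2, help="decimal places to print")
    return parser


def main(argv: Optional[List[str]] = None) -> str:
    # With argv=None, argparse falls back to the real command line (sys.argv).
    args = build_parser().parse_args(argv)
    mean = sum(args.values) / len(args.values)
    return f"{mean:.{args.digits}f}"


if __name__ == "__main__":
    # Demo with explicit arguments; a real script would call main() instead.
    print(main(["3", "4", "--digits", "1"]))
```

Passing the argument list explicitly, as in the demo call, also makes the tool easy to exercise from tests without touching the real command line.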

~~~
randomvectors
Tidyverse is so overrated that I don't even know where to start... sure dplyr
is nice to use if you're working interactively and ggplot2 has good first
principles as the basis of its design, but that's about as far as the
tidyverse niceness goes. The apis, the documentation, the inconsistencies (on
a language level, between tidy packages and within each package), the problems
with backwards compatibility, the evaluation (standard vs non-standard vs
tidyeval) - it's a big mess.

If you're writing code that's deployed in any way, it's best to avoid the
tidyverse as much as possible. This is also acknowledged to some extent by the
main developer -
[https://www.tidyverse.org/articles/2018/06/tidyverse-not-for...](https://www.tidyverse.org/articles/2018/06/tidyverse-not-for-packages/)

~~~
Tazinho
The statement in the link means that one should avoid including the term
tidyverse as a pkg dependency. Instead one should name the specific pkgs from
the tidyverse individually. This makes sense, as the tidyverse is a collection
of pkgs and referring to it as a whole just blows up the pkg. However, there
is no indication to not use tidyverse pkgs (dplyr, ggplot2, stringr,...) in
packages or production.

------
pletnes
There is always Rpy for calling R functions from python, if you have to write
your program in python but need a stats function from R. I guess python is
always the choice if you’re writing software, although for interactive use
YMMV.

------
scottlocklin
Funny, he's dead right about tidyverse, but I've learned to ignore Hadley's
packages for not playing well with ESS. Snake case is moronic anyway.

Someone noted below that python is better for deploys: true. R is better for
interactive use, and has a better package universe, though quality control for
packages is vastly lower than something like scikit learn. R also completely
dominates in classical stats, which is generally bread and butter compared to
having the latest goofy neural thing. Assuming you actually do data science.

------
Gatsky
Software engineers doing data science like Python. Everyone else doing data
science likes R.

~~~
Annatar
So true. Business users love R and most of those using it worked in academia
previously.

------
thom
I think on the whole we're going to care less and less about the distinction
between language bindings in data science as time goes on. If something like
Apache Arrow takes off and we end up with a decent standard representation for
dataframes (either in memory or distributed), and most of the heavyweight
processing (e.g. XGBoost, TensorFlow etc) is written in C/C++ anyway, then I
don't massively care what languages people are using to express themselves
(and I personally think tidyverse on the ingestion side and ggplot2 on the
output side win here).

------
bayesian_horse
Scipy.spatial has everything I ever needed in terms of spatial lookups.

There are lots of regression options. Scipy, Scikit learn, Pymc3, PyStan.

Metaprogramming in Python is easier than in R, and arguably more predictable
and consistent.
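
As a small, hedged illustration of what everyday metaprogramming in Python tends to look like: a decorator that wraps a function, and a class created at runtime with `type()`. The names `logged_calls`, `add` and `Point` are made up for this sketch.

```python
import functools


def logged_calls(func):
    """Decorator: record the arguments of every call on the wrapper itself."""
    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        wrapper.calls.append((args, kwargs))
        return func(*args, **kwargs)
    wrapper.calls = []
    return wrapper


@logged_calls
def add(a, b):
    return a + b


# Classes can also be built at runtime with the three-argument form of type().
Point = type("Point", (), {"x": 0, "y": 0})
```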

------
Petefine
For me the readability of tidyverse code is crucial. I like pandas and use it
daily, but often it requires deciphering to understand what is happening,
especially regarding indexing. But tidyverse code can be easily read, and that
has been a big help in enabling collaboration amongst our data team.

------
Macuyiko
It's strange how many managers and project owners have asked me in the past
whether to go for R or Python in their environment, as if the choice for a
programming language will break or make your data science initiative. (Even
more fun: it's not uncommon to find organizations where IT has finally
accepted to provide Python, but without any access to a package repository,
with some people being surprised that Python alone is not enough).

In any case, I've worked extensively in both environments and I don't think
the author has considered every aspect. Below are my two cents.

\- Elegance: slightly disagree. R might look more concise, but the language
comes with many strange aspects (quoting, non standard evaluation) that can
put a wedge between novice and experienced team members. Python is more verbose,
perhaps, but cleaner overall

\- Learning curve: disagree. Even when working in R, modern practice would ask
you to learn the tidyverse or data.table first instead of sticking with base
R. Good tutorials are available for both

\- Libraries: depends, the notion of "libraries" is too broad anyway, better
to split it up according to the subcategories below. Both come with lots of
packages, so I'd agree with it being a tie

\- Statistics: agree with R. R is still the statisticians language, and many
implementations of some more obscure techniques are only available in R. This
being said, most ML shops today would be more interested in e.g. a good GBM
implementation or deep learning rather than some robust statistics package. In
R: think regression, ANOVA, significance tests, time series and niche
subfields like bioengineering. In Python: think RF, GBM, t-SNE, deep learning

\- Parallel computation: I'd say both are lacking, and you'd need to look more
towards tooling such as Spark anyway. I'd also say out of memory computing
becomes your first concern more often. Dask and Pandas on Ray are very nice on
Python

\- Foreign interface: kind of disagree. I think Python has matured better here

\- Object oriented programming: disagree. The problem with R is in fact that
it has about 4 (or more) OOP ways

\- Interop: agree that you should avoid it, at the moment, it will only make
deployment more cumbersome

Some other concerns I'd consider.

\- Pipeline approach to ML ("model dev / model run"): better in Python. E.g.
the clear approach of scikit learn to consider both preprocessing and the model
itself as part of the fit-transform-predict pipeline with clear methods is way
better than R. I've seen many novice R users fall into the trap of
preprocessing a data set before splitting in train/test, for example. This has
been one of the biggest drivers to push me towards Python coming from R. Most
established libraries in Python commit to a shared, best-practice way of
thinking whereas every package in R seems to come with its own ideas in terms
of pipeline and usage

\- Deployment: also a win for Python. Better package management /
reproducibility, though it is possible in R as well

\- Data exploration: I find this easier in R. Packages like dplyr help a lot
here. Pandas' API is somewhat cumbersome

\- Charts / visualizations: ggplot2 in R is still a champion, though good
dashboarding tools exist for Python as well. Still, I find this easier to use
in R

\- Spatial analysis: both come with very solid libraries, though I find
whipping up a quick visualization easier in R

\- Deep learning: clear win for Python. Tensorflow, PyTorch and even Keras are
not fun to use in R

\- Reports authoring: possible in both, though R's markdown functionality
combined with RStudio is fantastic. Nevertheless, Jupyter notebooks can be
made to act as a reporting tool for both languages
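
The train/test trap mentioned in the pipeline point above can be sketched without any library. The `Standardizer` class below is hypothetical and only mimics a fit/transform convention; it is not scikit-learn's implementation (scikit-learn's `Pipeline` packages the same idea). The point is that preprocessing statistics are learned from the training split only and then reused unchanged on the held-out split.

```python
import random
import statistics
from typing import List


class Standardizer:
    """Hypothetical scaler mimicking a fit/transform convention."""

    def fit(self, values: List[float]) -> "Standardizer":
        self.mean_ = statistics.fmean(values)
        self.std_ = statistics.pstdev(values)
        return self

    def transform(self, values: List[float]) -> List[float]:
        return [(v - self.mean_) / self.std_ for v in values]


random.seed(0)
data = [random.gauss(10.0, 2.0) for _ in range(100)]  # made-up data
train, test = data[:80], data[80:]

# Correct: the mean and spread are learned from the training split only ...
scaler = Standardizer().fit(train)
train_scaled = scaler.transform(train)
# ... and reused unchanged on the held-out split.
test_scaled = scaler.transform(test)

# Leaky: fitting on all rows lets the test split influence the preprocessing.
leaky_scaler = Standardizer().fit(data)
```

Any model evaluated on data scaled with `leaky_scaler` has indirectly seen the test rows, which is the mistake described above.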

~~~
Tazinho
I like your comparison. So it sounds like one may draw the conclusion that it
doesn’t matter that much which language one chooses. In the end both languages
provide typically required functionalities for data science and it’s probably
better to find out for yourself which language feels more intuitive/effective
to use than listen to an overheated discussion on this topic.

------
jhanschoo
> Python has just one OOP paradigm. In R, you have your choice of several,
> though some may debate that this is a good thing.

> Given R's magic metaprogramming features (code that produces code), computer
> scientists ought to be drooling over R.

The benefit of this... is very debatable.

------
kome
Stata and SPSS are so much easier and faster...

~~~
jacobush
SPSS is pretty expensive

~~~
kome
true, SPSS is way too expensive. This will ultimately kill it. But big corps
like IBM just don't care.

And it's a pity. Like a lot of "old" technology and "old" languages, SPSS
is also far easier to use for normal people. As usual, IBM is just wasting
potential.

------
wtdata
Sincerely, the present question for me is if I should start investing heavily
in Julia. Eventually Python (like every other language) will be superseded by
some new and better suited language for a set of problems. The question is if
it will be Julia now, or some other in the future.

~~~
williamstein
I remember thinking very hard about exactly this question in 2004, and landing
on Python (as the foundation for Sage). It can be unclear how LONG until a
language will be superseded.

