
R vs. Python for Data Science [video] - roseway4
https://blog.dominodatalab.com/video-huge-debate-r-vs-python-data-science/
======
danso
As someone who teaches introductory-level programming in the data
science/visualization domain, I think R is undoubtedly easier for getting non-
programmers from CSV/Excel data to great visualizations via ggplot2. In fact,
I keep R up-to-date on my own machine for when I need to do visualizations
that I can't easily hack in Matplotlib+seaborn.

But I stick to teaching Python because my priority is that students become
general-purpose programmer/hackers, and data science is a very small part of
what they might use programming for in their careers. Python's pandas is
definitely not as elegant as R's tidyverse, where the conventions and data
structures are more baked into the language. But numpy+pandas is manageable
enough for novices (and just fine for experienced programmers). It's a small
tradeoff for all the things you can do in Python, whether that is building and
deploying a web app with Flask/Django/etc, integrating with AWS via boto, or
more specific things like manipulating video with moviepy.

That said, obviously not every stats-focused person wants to or needs to
become an all-purpose engineer. I think R has basically become the lingua
franca for folks in poli sci and similar departments who want to do
statistical analysis, and it seems to be a huge step up from doing things in
SPSS or Excel.

~~~
larrydag
I'm the Dallas R Users Group organizer and I could not agree more. I've had
similar experience.

Here is my opinion of the Python vs R "debate" in a nutshell. Programmers
prefer Python because data science is just another hack for them to accomplish
as they make their applications. Statisticians/Data Scientists prefer R because
it is a standalone mathematical suite. I believe what drives a person to use R
or Python is what kind of tool they want to use in their toolbox to accomplish
their respective task at hand.

~~~
bsg75
This is an excellent summary, and it covers the cases for those of us who use
both and when we choose one vs the other.

------
earino
Hello everyone, I'm the guy in the video!

SPOILER ALERT

If you watch you will notice I do a really sneaky thing where it's possible
that I'm comparing the two in order to show the folly of saying that it's R
_vs_ Python... :) If you have any questions, I'll do my best to be around!
Cheers.

------
photoJ
Doesn't seem like a huge debate to me. R is written in a way that encourages
thinking like a statistician. The Python environment enables simpler ways to
"productionalize" your code and makes it easier to "think like an engineer".
Both can be powerful, and each has some corner cases the other doesn't. If you
know why you need to use one, use it. If you don't, choose Python if you're more
of a coder and R if you're more of a mathematician. Then switch when you get
bored! :)

~~~
tnecniv
What benefits does R have other than vectors as native citizens?

I mostly use MATLAB (ugh).

~~~
glial
In my mind one of the primary benefits of R over Matlab is the 'data frame'
data structure and the ecosystem that's built up around that, for
slicing/dicing/plotting/modeling/etc. Also, having factors as first-class
citizens is super nice.

------
cannonpr
I work in a DevOps capacity at a data science company that uses both. We
actually have a fairly hard policy against R in production or in client
deliverable code. That's mostly because of the difficulty in 'productionizing'
R code, and the relative immaturity of several common R libraries. That said,
nearly all of our junior data scientists from a non-software-engineering
background do pick up R faster and feel more comfortable in it, so we mostly
use it for rapid prototyping.

~~~
dpitkin
I am a big R user and, when possible, an evangelist. At the same time, I agree
with your company's and team's production policy, and the specific
"immaturity" I would point to is the utterly incomprehensible error messages
that I get from R on a daily basis. I think fixing this is an opportunity, or a
choice, for the R core team if they want to transition from a stats
domain-specific language to a general-purpose language.

~~~
stewbrew
What kind of incomprehensible error messages do you get from R?

~~~
rabboRubble
Try installing the RODBC package on Mac OS X El Capitan.

Dollars to donuts it won't go smoothly. Let me know how long it takes you to
successfully install it.

~~~
stewbrew
To be fair, these are error messages from one nonstandard library (which works
great under Windows BTW) on an OS that doesn't natively use ODBC (does it?) --
and I guess some other 3rd party shared lib not under the control of the
authors of RODBC is involved.

~~~
rabboRubble
Depends on the OSX version on the Mac. Apple used to support RODBC's required
libraries until recent years. Part of the problem with that RODBC package was
that it was written prior to the cessation of OSX library support and
continuity of RODBC package support was (is?) spotty.

------
joelgrus
So "political detox week" lasted like an hour?

~~~
lochland
The introduction to the article is the second best introduction to any article
ever. The first, which is in a totally unrelated field, is:

"THEY do a lot of things wrong in the United States, but they do a lot of
things right, too.

The NFL draft is at, or at least near, the top of the good list."

(see [http://www.afl.com.au/news/2013-04-30/uncle-sams-draft-week-...](http://www.afl.com.au/news/2013-04-30/uncle-sams-draft-week-extravaganza))

------
hcarvalhoalves
R (especially w/ RStudio) is effectively a better Excel.

The problem it creates is that someone, somewhere, will eventually present
some projection they put together quickly in R, and that will lead
stakeholders who know nothing about software to start demanding that the
solution scale indefinitely, compute in realtime, or be used inside any
kind of production system really.

It doesn't mean the same can't happen w/ Python, but it at least offers some
migration path to more scalable / hardened solutions if you're careful.

~~~
photoJ
I don't know any research statisticians that produce code for Excel. R is THE
place to see their most recent advances. And sadly, despite the algos being
ported to Python, sometimes R still has the "best" implementations.

~~~
hcarvalhoalves
> I don't know any research statisticians that produce code for Excel.

Never said that.

What I said is that R, nowadays, creates the same kinds of issues / friction
Excel created in companies back in the '90s - you end up w/ someone, in some
corner of the company, creating solutions on top of it that can't scale. In
comparison, if the original work is in Python, you usually have more
alternatives when this prototype lands in the hands of a software engineer.
That is the main, and only, difference that matters, IMO.

As a tool for research, I agree it's completely fine.

Also, I too feel the pain of some algos, sadly, being available only in R (had
to write my own wrappers to estimate some models on top of an R script and
export the coefficients to be used in Python's sklearn).
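
A minimal sketch of that kind of hand-off, using only the Python standard library rather than sklearn. The CSV layout, the column names `term` and `estimate`, and the numbers are all assumptions made up for illustration; a real export would depend on how the R script writes its output.

```python
import csv
import io

# Hypothetical export from an R script: one row per model term.
# The column names ("term", "estimate") and the values are invented
# for illustration only.
R_EXPORT = """term,estimate
(Intercept),1.5
x1,2.0
x2,-0.5
"""


def load_coefficients(text):
    """Parse the exported CSV into an intercept and a {feature: weight} dict."""
    rows = csv.DictReader(io.StringIO(text))
    coefs = {row["term"]: float(row["estimate"]) for row in rows}
    intercept = coefs.pop("(Intercept)", 0.0)
    return intercept, coefs


def predict(intercept, coefs, features):
    """Linear predictor: intercept + sum(weight * feature value)."""
    return intercept + sum(w * features[name] for name, w in coefs.items())


intercept, coefs = load_coefficients(R_EXPORT)
prediction = predict(intercept, coefs, {"x1": 3.0, "x2": 4.0})
```

This only covers a plain linear predictor; anything with link functions, factor encodings, or interaction terms would need the Python side to reproduce R's design matrix as well.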

~~~
photoJ
Sure, I can see that coming from your perspective. Matlab can do similar
things: a researcher creates a model that is totally detached from the
production environment, works beautifully and leaves a lot of engineering
required of the production team. I wonder how probabilistic programming will
fit over the next 10(?) years. I'm not a fan of the "I programmed this, now
you implement it" dev cycle, and I think that will decrease as Data Science
matures. Sadly you are at the friction point for now.

~~~
tnecniv
The plus side with MATLAB is that you can auto-generate code if you are
willing to shell out for the add-on.

------
yomritoyj
R's unusual calling convention makes it very easy to write little DSLs and
access program text at runtime. This I think makes for much more expressive
APIs like base R's formulas or dplyr's data manipulation facilities. In all
other languages I know, formulas or data queries would need to be
(quasi-)quoted or terms appearing in them would have to be declared to have a
special type, spoiling the cleanness of the code.
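
To illustrate the quoting point from the Python side: a function there cannot receive `mpg ~ wt + hp` unevaluated, so formula-style APIs (statsmodels' formula interface, for example) take the formula as a string and parse it themselves. A toy sketch of that pattern follows; `parse_formula` is a hypothetical helper written for this example, not a function from any library, and it handles only `+`-separated terms.

```python
def parse_formula(formula):
    """Split a quoted formula such as "y ~ x1 + x2" into (response, predictors).

    Toy parser for illustration: it understands only '+'-separated terms.
    """
    response, _, rhs = formula.partition("~")
    predictors = [term.strip() for term in rhs.split("+") if term.strip()]
    return response.strip(), predictors


# The formula has to be quoted; the callee sees a string, not program text.
response, predictors = parse_formula("mpg ~ wt + hp")
```

In R the callee can capture the unevaluated expression directly, which is why the equivalent call needs no quotes.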

------
minimaxir
Here's a similar discussion of the merits of R vs. Python from months ago:
[https://news.ycombinator.com/item?id=11867268](https://news.ycombinator.com/item?id=11867268)

My comment from the thread: At the end of the day, for machine learning
applications, your data is in a tabular format (in Python, a pandas data
frame). Yes, Python has a few tricks like list comprehensions for speeding up
data processing into that analyzable form. R has a few tricks for processing
tabular data as well (e.g. dplyr). There are tradeoffs and the skill is
finding which works best. Using a single programming language is a bad
philosophy even for non-statistical developers.
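
A small sketch of the list-comprehension step, in plain Python with made-up sample records (the field names and values are assumptions for illustration):

```python
# Raw, untyped records as they might arrive from a log or text file.
raw_lines = ["alice,34,NY", "bob,27,TX", "carol,41,NY"]
columns = ["name", "age", "state"]

# One comprehension turns each line into a row keyed by column name.
rows = [dict(zip(columns, line.split(","))) for line in raw_lines]

# A second comprehension filters and converts in a single pass.
ages_ny = [int(row["age"]) for row in rows if row["state"] == "NY"]
```

A list of dicts like `rows` is one of the shapes `pandas.DataFrame` accepts directly, so this is a common last step before moving into a data frame.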

------
patrick_99
I've always used R for analysis, but using Python opens up the world of
PySpark and more scalable ML environments. I think long term Python will
become more prevalent in data science for this reason.

~~~
falaki
Have you looked into SparkR?

~~~
hadley
Or sparklyr

------
sandGorgon
Pandas with the Rodeo IDE is pretty much R.

R's advantage is singular and simple - it's not that the language is better,
but rather that it has existed in its niche for much longer, so it has a much
larger set of libraries that exist in the stats space.

in other words, CRAN.

------
baldfat
Huge Debate = Click Bait

Spoiler: R and Python (Pandas) are both great tools

R = Domain Language

Python = General Purpose

Some things in R are worth it for many people.

~~~
dpitkin
Great comment, once you have lived the difference between R and Python error
messages you will unlock the next level of the data science maze.

------
gravelc
I don't know about too many others in the bioinformatics field, but I reckon I
write about 80% of my code in Python and 20% in R. R for stats and plotting;
Python for wrangling data. Having said that, they're the languages I learned
in my studies (along with Java) and I haven't had the time and motivation to
move beyond them since.

------
transfire
Julia?

~~~
Recurecur
I think Julia will eventually eclipse both R and Python as a math/science
language - and possibly even a general purpose language.

~~~
stewbrew
I'd rather bet on Scala or F#.

------
iaw
Any time I see a dominodatalab.com article I discount it severely.
Historically they've been less useful and more biased towards selling their
services.

Being a company's blog doesn't by itself make the info bad, but my experience
with that particular blog makes me hesitant.

------
lottin
Learn both and decide for yourself.

~~~
IndianAstronaut
The most sensible approach. Anyone can jump back and forth between the two
without much problem since a lot of the syntax is so similar. It is better to
have more tools in your arsenal than fewer.

------
gorbachev
Clickbait title.

I think the consensus is to let the data scientists use whatever tools they
feel most productive with, whatever those tools are.

------
narutouzumaki
For some reason the video stops playing for me about a minute in, in
Chrome (same in incognito)

------
asimjalis
Incanter/Clojure?

~~~
peatmoss
How's the health of the ecosystem doing? Is it being actively developed these
days? Last GitHub activity I see was 5 months ago.

