R vs. Python for Data Science [video] (dominodatalab.com)
159 points by roseway4 on Dec 5, 2016 | 62 comments

As someone who teaches introductory-level programming in the data science/visualization domain, I think R is undoubtedly easier for getting non-programmers from CSV/Excel data to great visualizations via ggplot2. In fact, I keep R up-to-date on my own machine for when I need to do visualizations that I can't easily hack in Matplotlib+seaborn.

But I stick to teaching Python because my priority is that students become general-purpose programmer/hackers, and data science is a very small part of what they might use programming for in their careers. Python's pandas is definitely not as elegant as R's tidyverse, where the conventions and data structures are more baked into the language. But numpy+pandas is manageable enough for novices (and just fine for experienced programmers). It's a small tradeoff for all the things you can do in Python, whether it is build and deploy a web app with Flask/Django/etc, or integrate with AWS via boto, or more specific things like manipulate video with moviepy.

That said, obviously not every stats-focused person wants to or needs to become an all-purpose engineer. I think R has basically become the lingua franca for folks in poli sci and similar departments who want to do statistical analysis, and it seems to be a huge step up from doing things in SPSS or Excel.

I'm the Dallas R Users Group organizer and I could not agree more. I've had similar experience.

Here is my opinion of the Python vs R "debate" in a nutshell. Programmers prefer Python because data science is just another hack for them to accomplish as they make their applications. Statisticians/Data Scientists prefer R because it is a standalone mathematical suite. I believe what drives a person to use R or Python is what kind of tool they want in their toolbox to accomplish the task at hand.

This is an excellent summary, and covers the cases for those of us who use both and when they choose one vs the other.

R does some things really well, and there are some advanced methods available, but let's be clear: some functions in SPSS are quite advanced. For example, SPSS mixed models are more advanced than lme4.

What is more advanced about SPSS mixed models (not accusatory, just ignorant)? It looks like there are more covariance options, but is the estimation/implementation better/more robust or faster?

Well, I may be ignorant also, but as far as I can tell lme4/lmer doesn't have ANY options to specify the variance-covariance matrix, while SPSS does. Also, when the number of within-subject levels is small, I often find that the classic repeated-measures approach, estimated using ML/REML (to handle random missing data) on a specified (or unstructured) variance-covariance matrix, is more powerful and/or more likely to converge while accounting for important dependencies than mixed models using random effects... SPSS does this kind of analysis easily, but I haven't found any equivalent package in R.


I agree. I'm glad I took time to take quite a few statistics courses using R though. I composed the sentence deliberately: Just a course "learning R" as we have for all other languages IMHO is pretty useless, unless it is for people already statistics-savvy. What I learned from the experience was a focus on the data, not the code. I can't explain it, but to me it seems that when I code(d) in regular programming languages my focus is different, much more on the code. When doing something in R my focus was much more on the data at hand. That's probably also because when I wrote/write normal software I do it generically, when working on an R problem I already had the concrete data set. Maybe that's different for people writing more advanced code in R like packages, then it may be closer to "normal programming", don't know. I assume most people using R are not doing it to program, but to get something from concrete data, and only a minority of R users do much programming. Clarification - I mean code like this: https://rpubs.com/Noseshine/74191 It's code, sure, but I just wanted to get something from the data, no thoughts of writing a general purpose program. R programming felt very "hackish" to me, but I also feel it's fine for such concrete-data-oriented purposes as in my example link. In that case it's like Excel but more editable, processable and reproducible since you've got linear source code, and not a path of clicks through lots of complicated dialogues.

Hello everyone, I'm the guy in the video!


If you watch you will notice I do a really sneaky thing where it's possible that I'm comparing the two in order to show the folly of saying that it's R vs Python... :) If you have any questions, I'll do my best to be around! Cheers.

Doesn't seem like a huge debate to me. R is written in a way that encourages thinking like a statistician. The Python environment enables simpler ways to "productionalize" your code and makes it easier to "think like an engineer". Both can be powerful, and each has some corner cases the other doesn't. If you know why you need to use one, use it. If you don't, choose Python if you're more of a coder and R if you're more of a mathematician. Then switch when you get bored! :)

I agree with a lot of what you said. I do a lot of maths and development, and for me it boils down to using R for analysis, anything statistics-based, and general prototyping, then switching to Python when I have to build an app. For some problem domains, e.g. image processing or NLP, I will just start out in Python because I prefer the libs.

As a side note on Julia: the only reasons I haven't started using it on the daily (and I've tried it and liked much of what I've seen) are that the module system doesn't let you build modular production code and apps like Python does, and that R is just so good for what I use it for that there's no reason to look for a replacement.

I am always amazed when I work in R how some ideas are just effortless, and you can ignore the syntax and just think about the concept. I really enjoy that.

"the module system doesn't let you build modular production code and apps like Python does"

Can you elaborate on this?


In my opinion this is where the difference lies: with Python you can easily build bigger systems with more flexibility, where your data science code is an important piece. R shines in statistics, and you can also produce an end-to-end system with it, provided that your project fits well within the R ecosystem and its constraints.

I'd like to know more about end-to-end systems in R. I'm not familiar with serious production environments using such tools. Any pointers?

By end-to-end I mean that you can: (1) read data from a SQL or NoSQL DB, Excel file, CSV, JSON through some REST service, etc. (2) process it in R (3) display it to the user through some interactive web interface, e.g. using Shiny

It is relatively easy if your app fits the whole loop. I would recommend starting with a Shiny tutorial: http://shiny.rstudio.com/tutorial/

But if you need e.g. some stream processing or more complex GUIs, then R might not be enough, I guess.
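The read/process/display loop described above is not specific to R. As a rough, language-neutral illustration, here is a minimal Python sketch of the same three steps. Everything in it is an assumption for illustration: the inline CSV stands in for a database or REST source, the column names are made up, and a plain-text table stands in for an interactive Shiny page.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Step 1: read. A made-up CSV stands in for a database, file, or REST response.
RAW = """region,sales
north,10
south,4
north,14
south,6
"""


def read_rows(text):
    """Parse CSV text into a list of dicts, one per data row."""
    return list(csv.DictReader(io.StringIO(text)))


# Step 2: process. Average sales per region.
def summarize(rows):
    groups = defaultdict(list)
    for row in rows:
        groups[row["region"]].append(float(row["sales"]))
    return {region: mean(values) for region, values in sorted(groups.items())}


# Step 3: display. A plain-text table stands in for an interactive web page.
def render(summary):
    lines = ["region  mean_sales"]
    lines += [f"{region:<7} {value:.1f}" for region, value in summary.items()]
    return "\n".join(lines)


report = render(summarize(read_rows(RAW)))
print(report)
```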

Cool! Yeah, I know of that workflow. I've seen some neat stuff done with that. Thanks for the follow up!

Use Scala or some other statically typed language for building bigger stuff. Python might be nice for the first 3 lines but you will be convinced once you need to do some serious refactoring.

What benefits does R have other than vectors as native citizens?

I mostly use MATLAB (ugh).

In my mind the primary benefits of R over MATLAB are the 'data frame' data structure and the ecosystem that's built up around it, for slicing/dicing/plotting/modeling/etc. Also, having factors as first-class citizens is super nice.

The biggest benefit of R is a huge set of statistical libraries. If there is any sort of stats function you might want, it is available in R.

The other benefit R has is the incredibly productive and thoughtful Hadley Wickham. He's completely changed the way I write R over the last five years, and I'm a great deal more productive as a result (as is my code).

R's Scheme-like lexical scoping, lazy evaluation, and reflective abilities make it very extensible if you know what you are doing.

I work at a data science company that uses both, in a DevOps capacity. We actually have a fairly hard policy against R in production or in client deliverable code. That's mostly because of the difficulty in 'productionizing' R code, and the relative immaturity of several common R libraries. However that having been said, nearly all of our junior Data scientists from a non software engineering background do pick up R faster and feel more comfortable in it, so we mostly use it for rapid prototyping.

I am a big R user and evangelist when possible. At the same time, I agree with your company's production policy, and the specific "immaturity" I run into is the utterly incomprehensible error messages that I get from R on a daily basis. I think this is an opportunity, or a choice, for the core R team if they want to transition from a stats domain-specific language to a general-purpose language.

What kind of incomprehensible error messages do you get from R?

Try installing the RODBC package on Mac OS X el Capitan.

Dollars to donuts it won't go smoothly. Let me know how long it takes you to successfully install it.

To be fair, these are error messages from one nonstandard library (which works great under Windows BTW) on an OS that doesn't natively use ODBC (does it?) -- and I guess some other third-party shared lib not under the control of the authors of RODBC is involved.

Depends on the OSX version on the Mac. Apple used to support RODBC's required libraries until recent years. Part of the problem with that RODBC package was that it was written prior to the cessation of OSX library support and continuity of RODBC package support was (is?) spotty.

So "political detox week" lasted like an hour?

The introduction to the article is the second best introduction to any article ever. The first, which is in a totally unrelated field, is:

"THEY do a lot of things wrong in the United States, but they do a lot of things right, too.

The NFL draft is at, or at least near, the top of the good list."

(see http://www.afl.com.au/news/2013-04-30/uncle-sams-draft-week-...)

Eh, I don't expect this to be as heated as the "R or SAS" debates that many statisticians have...

R (especially w/ RStudio) is effectively a better Excel.

The problem it creates is that someone, somewhere, will eventually present some projection they put together quickly in R, and that will set expectations on stakeholders that know nothing about software to start demanding that solution to scale indefinitely, compute in realtime, or be used inside any kind of production system really.

It doesn't mean the same can't happen w/ Python, but it at least offers some migration path to more scalable / hardened solutions if you're careful.

I don't know any research statisticians that produce code for Excel. R is THE place to see their most recent advances. And sadly, despite the algos being ported to Python, sometimes R still has the "best" implementations.

> I don't know any research statisticians that produce code for excel.

Never said that.

What I said is that R, nowadays, creates the same kinds of issues / attrition that Excel created in companies back in the '90s: you end up w/ someone, in some corner of the company, creating solutions on top of it that can't scale. In comparison, if the original work is in Python, you usually have more alternatives when the prototype lands in the hands of a software eng. That is the main, and only, difference that matters, IMO.

As a tool for research, I agree it's completely fine.

Also, I too feel the pain of some algos, sadly, being available only on R (had to write my own wrappers to estimate some models on top of R script and export the coefficients to be used in Python's sklearn).
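A minimal sketch of the kind of wrapper described above, under stated assumptions: the model is taken to be a logistic regression fitted in R, with its coefficients exported as JSON. The JSON layout, the feature names, and the helpers `load_model` and `predict_proba` are all hypothetical, and the sketch scores rows in plain Python rather than going through sklearn.

```python
import json
import math

# Hypothetical export from an R script, e.g. the coefficients of a fitted
# glm() written out as JSON. Names and values are made up for illustration.
EXPORTED = '{"intercept": -1.5, "coefficients": {"age": 0.04, "income": 0.00001}}'


def load_model(payload):
    """Parse the exported JSON into (intercept, {feature: weight})."""
    data = json.loads(payload)
    weights = {name: float(w) for name, w in data["coefficients"].items()}
    return float(data["intercept"]), weights


def predict_proba(model, row):
    """Apply the logistic link to the linear predictor for one row."""
    intercept, weights = model
    linear = intercept + sum(w * row[name] for name, w in weights.items())
    return 1.0 / (1.0 + math.exp(-linear))


model = load_model(EXPORTED)
p = predict_proba(model, {"age": 40, "income": 50000})
print(round(p, 4))
```

The point of the design is that only the fitted numbers cross the language boundary, so the Python side needs no R runtime at prediction time.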

Sure, I can see that coming from your perspective. MATLAB can do similar things: a researcher creates a model that is totally detached from the production environment, works beautifully, and leaves A LOT of engineering required of the production team. I wonder how probabilistic programming will fit in over the next 10(?) years. I'm not a fan of the "I programmed this, now you implement it" dev cycle, and I think that will decrease as data science matures. Sadly, you are at the friction point for now.

The plus side with MATLAB is that you can auto-generate code if you are willing to shell out for the add-on.

This. I work with both R and Python (I'm at Microsoft, doing ML for customers) and even with vastly improved R performance, scaling R is not so much about the performance of the engine but how you design the whole solution.

That's the case with most languages.

R's unusual calling convention makes it very easy to write little DSLs and access program text at runtime. This, I think, makes for much more expressive APIs, like base R's formulas or dplyr's data manipulation facilities. In all other languages I know, formulas or data queries would need to be (quasi-)quoted, or terms appearing in them would have to be declared to have a special type, spoiling the cleanness of the code.
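To illustrate the contrast, here is a rough Python sketch in which a model formula has to be passed as a quoted string and parsed by hand, where R would accept `mpg ~ wt + hp` as unquoted code. The helper `parse_formula` is hypothetical and handles only a single response with `+`-separated terms; it is not how any particular library does it.

```python
def parse_formula(formula):
    """Split a string like 'y ~ x1 + x2' into (response, [terms]).

    Only handles a single response and '+'-separated terms.
    """
    if formula.count("~") != 1:
        raise ValueError("formula must contain exactly one '~'")
    lhs, rhs = formula.split("~")
    response = lhs.strip()
    terms = [t.strip() for t in rhs.split("+") if t.strip()]
    if not response or not terms:
        raise ValueError("formula needs a response and at least one term")
    return response, terms


# The formula must be quoted: Python would otherwise try to evaluate
# the names mpg, wt, and hp as ordinary variables.
response, terms = parse_formula("mpg ~ wt + hp")
print(response, terms)
```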

Here's a similar discussion between the merits of R vs. Python months ago: https://news.ycombinator.com/item?id=11867268

My comment from the thread: At the end of the day, for machine learning applications, your data is in a tabular format. (in Python, a pandas data frame) Yes, Python has a few tricks like list comprehensions for speeding up data processing into that analyzable form. R has a few tricks for processing tabular data as well. (e.g. dplyr). There are tradeoffs and the skill is finding which works best. Using a single programming language is a bad philosophy even for non-statistical developers.
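As a small illustration of the list-comprehension point, assuming some nested records (the field names here are made up), a comprehension can flatten them into one row per observation, ready for a table or data frame:

```python
# Hypothetical nested records, e.g. parsed from JSON.
records = [
    {"user": "a", "events": [{"kind": "click", "ms": 120}, {"kind": "view", "ms": 40}]},
    {"user": "b", "events": [{"kind": "click", "ms": 95}]},
]

# Flatten to one row per event: the tabular shape most ML tooling expects.
rows = [
    {"user": r["user"], "kind": e["kind"], "ms": e["ms"]}
    for r in records
    for e in r["events"]
]

# Column-wise view of the same data.
columns = {key: [row[key] for row in rows] for key in ("user", "kind", "ms")}
print(columns)
```

A list of dicts like `rows` is a shape that `pandas.DataFrame` accepts directly.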

I've always used R for analysis, but using Python opens up the world of PySpark and more scalable ML environments. I think long term Python will become more prevalent in Data science for this reason.

It's somehow funny to mention a Scala-based solution as a strength of Python. Why not use Java/Scala directly along with R?

Have you looked into SparkR?

Or sparklyr

Pandas with the Rodeo ide is pretty much R.

R's advantage is singular and simple. It's not that the language is better, but rather that it has existed in its niche for much longer, so it has a much larger set of libraries in the stats space.

in other words, CRAN.

Huge Debate = Click Bait

Spoiler: R and Python (pandas) are both great tools

R = Domain Language

Python = General Purpose

Some things in R are worth it for many people.

Great comment, once you have lived the difference between R and Python error messages you will unlock the next level of the data science maze.

I don't know about too many others in the bioinformatics field, but I reckon I write about 80% of my code in Python and 20% in R. R for stats and plotting; Python for wrangling data. Having said that, they're the languages I learned in my studies (along with Java) and I haven't had the time and motivation to move beyond them since.


I think Julia will eventually eclipse both R and Python as a math/science language - and possibly even a general purpose language.

I'd rather bet on Scala or F#.

Premature, but scheduled to hit 1.0 in ~1yr. Really promising right now, though.

Yeah, not to mention all the DL frameworks.

Any time I see a dominodatalab.com article I discount it severely. Historically they've been less useful and more biased towards selling their services.

Just because it's a company's blog doesn't make the info bad; my experience with that particular blog makes me hesitant.

Learn both and decide for yourself.

The most sensible approach. Anyone can jump back and forth between the two without much problem since a lot of the syntax is so similar. It is better to have more tools in your arsenal than fewer.

Clickbait title.

I think the consensus is to let the data scientists use whatever tools they feel most productive with whatever that tool is.

For some reason the video stops playing for me about a minute in, in Chrome (same in incognito).


How's the health of the ecosystem doing? Is it being actively developed these days? Last github activity I see was 5 months ago.


No comments like this, please.

