Hacker News
Comparison – R vs. Python: head to head data analysis (dataquest.io)
283 points by emre on Oct 14, 2015 | 195 comments

This is interesting, but not really an R vs. Python comparison. It's an R vs. Pandas/Numpy comparison. For basic (or even advanced) stats, R wins hands down. And it's really hard to beat ggplot. And CRAN is much better for finding other statistical or data analysis packages.

But when you start having to massage the data in the language (database lookups, integrating datasets, more complicated logic), Python is the better "general-purpose" language. It is a pretty steep learning curve to grok the R internal data representations and how things work.

The better part of this comparison, in my opinion, is how to perform similar tasks in each language. It would be more beneficial to have a comparison of here is where Python/Pandas is good, here is where R is better, and how to switch between them. Another way of saying this is figuring out when something is too hard in R and it's time to flip to Python for a while...

totally agree and that's why we made Beaker: http://beakernotebook.com/

you can code in multiple languages in your notebook, and they can all communicate, making it easy to go from Python to R to JavaScript, seamlessly.

we just released v1.4 with all kinds of new features, check it out: https://github.com/twosigma/beaker-notebook/releases/tag/1.4...

I tried to install this the other day.

I didn't get it working on my Linux machine, but you will definitely see some pull requests once I have time to fiddle with it. The electron version is a nice idea but I would prefer better instructions for installing the normal version. "This script will do it all" is not always helpful.

Thanks for the report. Yeah, it is not easy to install on Linux unless you use the Docker version. There are many dependencies and PPAs required in the script because it does everything.

We are working on better Linux packaging and distribution (see our issue tracker), but it is not easy to do right, and it will take a while.

PRs very welcome!

FYI - I tried one of the Mac all-in-one downloads, and it looks promising. However, all I get are status messages saying that it is waiting for Python or R to initialize...

Thanks. We don't have an all-in-one download; you have to install Python or R separately. But if you already have them, it should just work if they are on your PATH - is that PATH set up by your .bash_profile? Did you install the required R packages? Do you have IPython (not just Python)? We can probably debug this better by email or as a GitHub issue than in this forum.

Sorry, I meant the Electron version...

OK, well, it loads the backends the same way. Please raise it by email or on GitHub.

> And it's really hard to beat ggplot.

To be honest, matplotlib seems a good contender to me (http://matplotlib.org/).

Also, what's wrong with comparing R to Pandas/Numpy ? They can only be used from within Python, right?

Edit: just realised from another comment that Pandas/Numpy can be accessed from R, too.

> > And it's really hard to beat ggplot.

> To be honest, matplotlib seems a good contender to me (http://matplotlib.org/).

They're quite different, though, and I can see why many prefer ggplot. It's a declarative, domain-specific language that implements a "grammar of graphics" (hence the gg- in the name; see section 1.3 of [1], and [2,3]) for very fast and convenient interactive plotting, whereas matplotlib is essentially a clone of MATLAB's procedural plotting API.

[1] http://www.amazon.com/ggplot2-Elegant-Graphics-Data-Analysis...

[2] http://www.amazon.com/The-Grammar-Graphics-Statistics-Comput...

[3] http://vita.had.co.nz/papers/layered-grammar.html

"matplotlib seems a good contender to me"

I've waxed lyrical about Python all over this thread, but here you have to give the medal to R. Matplotlib is one of my least favourite libraries to use; I've been using it for almost two years, and I still spend half my time buried in the documentation trying to figure out how I'm supposed to move the legend slightly to the right or whatever.

ggplot probably has slightly less flexibility overall (mpl is monolithic), but for just doing easy things that you need 99% of the time, ggplot is king.

There is a ggplot clone in Python. Also, bokeh is starting to develop a grammar-of-graphics interface. Then there are seaborn and mbplot. Lots of stuff besides matplotlib.

I must grant you that: after a few years of using it, I still have to look up the documentation for elementary things.

I am not familiar with ggplot, so I wasn't comparing them on ease of use; but looking at some ggplot examples, they looked like things you could do with matplotlib too, so I pointed that option out as well.

I couldn't agree more - the API seems very confusing, and the examples provided are shitty, in my opinion

> what's wrong with comparing R to Pandas/Numpy

Absolutely nothing.

I was referring to the article's title saying it was an R vs. Python comparison. Python is so much more than R as a general-purpose language. Similarly, R is much more than Python in terms of (built-in) stats. I just thought it would be more accurate to call the article an R vs. Pandas/NumPy comparison.

Even though both of them need an extra plotting library to make publication-quality plots. Matplotlib isn't bad by any means - and it's gotten better over the years. But R/ggplot2 produces nicer plots (IMO). I'm not sure that I'd export data from Python into R just for ggplot, but I might.

Thanks for the clarification. I am sorry I took your comment the wrong way.

I am not that familiar with ggplot myself, but I'll give it a go as soon as I have the chance.

> matplotlib seems a good contender to me

On paper perhaps, less so in application. Sure, you can probably make matplotlib do everything ggplot does with enough work, but working with ggplot is just so much quicker, easier, and more fun.

And I say that as someone who does all his data analysis in Python.

I have rewritten the Python ggplot to put it on par with ggplot2.

You can try out my dev version [1] (rewrite branch). It will be nearly API compatible.

[1] https://github.com/has2k1/ggplot

Please write a blog post when you are done?! This will be huge :)

Even regular R plotting is still far easier and more intuitive than matplotlib, not just ggplot.

I completely agree. ggplot is the only reason why I sometimes use R.

I don't have a lot of experience with either, but I was close to really digging in and learning R just for the ease of use of ggplot.

I tried the ggplot for Python (ggplot.yhathq.com/) but eventually settled on seaborn (http://stanford.edu/~mwaskom/software/seaborn/). It is really quite easy to get most of the common plots I wanted, and it hasn't let me down yet. The standard plots look SO much better than the standard MPL plots, without a lot of customization.

ggplot for Python is almost done. There is an active dev branch.

matplotlib allows you to create almost any chart you want. However, it is very low-level.

On the other hand, with ggplot you can create a good-enough chart in a couple of lines of code for almost any data.

BTW, there's a ggplot port for Python: http://ggplot.yhathq.com/

Matplotlib can produce high quality plots. But it requires lots of code, and hours of digging around the API docs and tweaking subclasses.

Happily, a Python port of ggplot is underway [0], although it's still very much a work in progress.

[0] https://github.com/yhat/ggplot/

...with stalled progress. Right now I'd rather inline R code (with %%R in IPython Notebook) and use the real ggplot2.

There's a dev branch that is being actively developed.

Well, for scientists wanting to publish, ggplot is quite impractical. Most of the time we have to publish in B&W journals, and ggplot simply lacks the capabilities to do so properly (for instance, B&W fill patterns).

Matplotlib with some good settings ends up providing much better results and nicer-looking B&W plots, contrary to what people normally think.

... and I remembered why I don't use ggplot at all, thanks. After lots and lots of plots done with R, I was starting to feel a bit weird reading the comments.

> It's an R vs. Pandas/Numpy comparison.

And yet, you go right on in the next sentence to make it a Python/Pandas/Numpy vs. R/everything in CRAN comparison. Libraries count.

mbreese's point was not that that is wrong or misguided, just that it was happening.

R has the equivalents of pandas/numpy/scipy integrated into the language (for the most-used features, at least), but that doesn't make much of a difference, because anyone who wants to use these tools will do a quick "pip install" to grab them (which is pretty fast with the new wheels system).

Out of curiosity, why do you consider CRAN to be much better than PyPI?

I'm only thinking about CRAN > PyPI in terms of statistical packages. CRAN is where new statistical analysis techniques / packages are initially published. If you're lucky they might get ported to Python after the fact. I didn't even mention Bioconductor, which is another beast entirely. There isn't an equivalent of Bioconductor for Python at all.

And the last time I checked, "pip install numpy" could be quite a pain, especially if you needed to compile dependencies. RStudio makes it ridiculously easy to install R and add packages.

However - for all other types of packages, PyPI is obviously superior. The breadth of packages on PyPI is much better than CRAN.

It's about choosing the right tool for the job.

The best way to get the entire numpy/scipy/numba stack is Anaconda [1]

[1] https://www.continuum.io/downloads

R is certainly a unique language, but when it comes to statistics I haven't seen anything else that compares. Often I see this R vs. Python comparison made (not that this particular article has that slant) as a "come drink the Python Kool-Aid; it tastes better" pitch.

Yes; Python is a better general purpose language. It is inferior though when it comes specifically to statistical analysis. Personally I don't even try to use R as a general purpose language. I use it for data processing, statistics, and static visualizations. If I want dynamic visualizations I process in R then typically do a hand off to JavaScript and use D3.

Another clear advantage of R is that it is embedded into so many other tools. Ruby, C++, Java, Postgres, SQL Server (2016); I'm sure there are others.

> R is certainly a unique language

I'd say R is a _terrible_ language. Its types are just really different from every major programming language, and it's horrible for an experienced programmer to use.

I totally agree that R has fantastic libraries, but I'd like to see people focus on improving libraries for Python rather than sticking with R, which as a language is less well-designed than Python.

[I use R for most of my stats, I also use Matlab and Python]

I think you're wrong. R is an excellent language, targeted specifically around the problems you commonly see when doing data analysis. On the whole the standard libraries aren't particularly good, but I think the language is good.

That said, the language is often taught poorly. Here's my attempt to do better: http://adv-r.had.co.nz

Well, time to bring out my favorite dead horse to beat:

   - http://stackoverflow.com/questions/1815606/rscript-determine-path-of-the-executing-script
   - http://stackoverflow.com/questions/3452086/getting-path-of-an-r-script
(where you already commented, so it's not like this is something new...)

I would say that any language that does not have a facility to get the path of the current file is not 'excellent' under the criteria an experienced programmer would use for assessing it.

Now, I very well know that those criteria are different from what scientists use, but still...

I think R is a great language for certain applications - namely statistics and some data analysis. Your work has certainly made it better.

However, from a computer language design point of view, it leaves a lot to be desired. Its type system seems very complicated, and while the language tries to do what it thinks you want, it's not always clear what is going on (are you working on a matrix, or a dataframe that has been cast into a matrix?).

For me, R is one of those languages that is good in a certain domain, but once you get out of that domain, it makes things more complicated than they need to be. It just isn't a general-purpose language. By far the biggest problems I've seen have been people who only know R (mainly stats people or biologists) trying to do something in R that would be a quick 10-line Python/Perl/Ruby/whatever script.

Normally, in language design, you aim to make easy things easy and difficult things possible. R seems to make difficult things easy and easy things difficult. Maybe that's the tradeoff that was needed. :)

That said - please keep doing what you're doing. You've made my R work vastly easier.


Thank you for all of your hard work! Keep on keeping on; your contributions have been phenomenal!

> as it explains some of R’s quirks and shows how some parts that seem horrible do have a positive side.

That sounds promising, I'll check it out, thanks.

I think R is a great tool, but I maintain that it is not a well-designed language by modern standards.

Could you give a couple of examples, where R is substantially superior to Python?

I'm not qualified to comment on how good or bad a language R is. But it is maddening how package developers don't follow any convention for naming functions. I load a package that I haven't used recently, and I know the function I want but can't remember if it's called my_function, myFunction, my.function, or MyFunction. Google published an R style guide, https://google-styleguide.googlecode.com/svn/trunk/Rguide.xm.... Does anybody follow it?

Definitely with you there. Even Perl has more consistency. And thanks for the guide link! :)

Hmm what do you mean about the types being different?

My experience was exactly the opposite -- first time I saw R syntax (actually, it was S-Plus back then...) , I thought it was the most intuitive and powerful system I've ever seen -- this was after fairly extensive experience in C and C++, as well as a few others.

Now, I don't quite think so any more, because there are many rather tricky things buried under the surface (e.g. how many people really understand how exactly environments work?) -- but the majority of R programmers will never have to deal with them in their code...

Also, I have definitely done general-purpose coding in R -- for a lot of things it is completely adequate. Python has more general-purpose functions and libraries of course, similarly to how R has more statistical ones.

I've used Python for years, and decided to teach myself R for a masters class I'm taking.

I have to disagree. Its main model is generic-function method dispatch. It can feel odd at first to someone coming from the C++ style of OO, where objects own methods rather than methods owning objects. But it's a legitimate OO style with its own advantages. [1]

I've found that the more I use R, the more intuitive a lot of its operations are. It's relatively easy to "guess" what you ought to do to accomplish what you want. More so than other languages I've learned.

1. https://en.wikipedia.org/wiki/Dynamic_dispatch

When people argue that R is a terrific language, I remind them that it has four object systems which differ from each other in subtle ways. It's a programmer's nightmare.

It's not the worst language in the world, but it isn't a terrific language either.

I'd also say the CRAN repository is awful: it discourages collaboration, and packages are typically written by small groups of academics who write the worst documentation I have ever seen.

I blame the R documentation standards. They force a package author to produce a useless alphabetically-listed pdf, and many people just stop at that point.

Without any standards at all, people would have at least produced a readme.txt, which would have been a huge improvement - e.g. I much prefer working with unfamiliar user-written MATLAB packages :)

I don't know why so many people complain about R documentation, I think it's pretty good. The PDFs are useless for sure, but you don't have to use that. Emacs displays documentation pages in a split window. Or you can use a web browser.


Function docs are fine, but they are not really that helpful in figuring out how to use a new package.

It seems that you are in fact looking for vignettes - worked examples of use.

I am (sort of); but most packages don't have vignettes. Zoo and ggplot2 (and a few other major packages) have great documentation, but they are an exception.

It certainly needs a PyPI-like rating or popularity system.

As an experienced programmer who started using R in the early days, I feel that dealing with its quirks got me ready for today's modern languages.

Just to toss another name into the ring, I'd say that Fortran is pretty suitable for numeric calculations of all sorts.

I like R as a higher level language (or I guess tools like SPSS or preferably PSPP for even higher level stuff). These days I do most of my academia stuff with R (mostly hypothesis and equivalence testing and the things related to it like power analysis etc.)

I've never really looked into Python which is strange because I use it as a "glue language" quite often. I think I'll investigate Python a bit more next time I have to actually collect and clean up the data before using it. Right now I'm more of a consumer (mostly using data from our experiments that are turned into CSV)

Absolutely; modern Fortran is great and is syntactically rather close to Matlab (and to an extent R as well).

The main difficulty with Fortran is IMO the lack of an extensive standard library -- sure, you can find code out there to do almost anything, but then you need to figure out linking/calling conventions/possibly incompatible data models for each new library you bring in...

But, as another poster mentioned, it is quite straightforward to call Fortran from R :)

MATLAB was originally released as a Fortran library (pre-1.0), so it keeps a lot of that heritage even though it's probably C/C++ now: http://www.mathworks.com/company/newsletters/articles/the-or...

Yeah -- although doesn't it actually pre-date Fortran 90? I wonder which direction the influence went :)

> Just to toss another name into the ring, I'd say that Fortran is pretty suitable for numeric calculations of all sorts.

> I like R as a higher level language (or I guess tools like SPSS or preferably PSPP for even higher level stuff). These days I do most of my academia stuff with R (mostly hypothesis and equivalence testing and the things related to it like power analysis etc.)

You can see R as a sort of glue language around libraries written in lower-level languages like C++, C, or Fortran (I believe a large part, if not all, of the matrix-operation functionality R uses for linear regressions and statistical analysis (PCA) is written in Fortran).

Fortran code runs much faster, but you don't want to use it to do exploratory analysis ("I have those data about people, what if I filter out the people earning more than X before checking if there is a correlation between the average age where men get married and their incomes?").

> I'd say that Fortran is pretty suitable for numeric calculations of all sorts.

It is indeed. And R works with Fortran quite easily.

Could you provide an example in statistical analysis where Python is clearly inferior? In the article, R seems to have the advantage of having many useful stat functions baked in, versus having to import specific modules in Python. I'm wondering if your proficiency in R is being weighed in your evaluation - maybe Python's statistical analysis tools have much to offer, but you are more aware of R's toolsets.

I'm primarily a Python user and can say that there's no contest that R has many packages that Python does not have an equivalent of yet. This includes stats stuff and especially finance/trading. Definitely not a showstopper for me but if I were to recommend one or the other to people at work with no programming skills, I would have to choose R for the breadth of existing packages.

>>but if I were to recommend one or the other to people at work with no programming skills, I would have to choose R for the breadth of existing packages.

My 2 cents: If someone has no programming background, then building a foundation in Python will allow them to do much, much more than building a foundation in R - unless, of course, they only care about statistical analysis and have no inclination to code more generally. I learned both at the same time even though I had no use for Python at the time (was and still am a professor), but I use it almost every day now and very much enjoy it!

Agree completely. I should have qualified that: most people in my industry (finance) would be using it as an Excel replacement and just want to get things done; hence the value of existing packages.

thank you for your reply

Also, ML academics tend towards R for reference implementations of novel algorithms. They are often available in R first. This cuts both ways; sometimes the Python implementation that comes later misses some subtleties of the R implementation that the original authors nailed, and other times the R implementation is a proof of concept, while a later implementation is more real-world ready. But the latest and greatest tends to be available in R long before it has made its way into e.g. SciPy.

I can't easily do GAMs or SEM in Python.

Great comparison. However, I find R's syntax obtuse and baroque. Like a shovel with a compartment that carries tweezers. Advocates tend to argue that for moving dirt, this 'R' shovel is far more precise than an ordinary 'Python' shovel. But Python is in fact more like the toolshed in which both tools are housed, plus a whole lot more.

I think the new packages from Hadley Wickham are beautiful and straightforward.


An example using an airline arrivals and departures dataset:

  flights %>%
    group_by(year, month, day) %>%
    select(arr_delay, dep_delay) %>%
    summarise(
      arr = mean(arr_delay, na.rm = TRUE),
      dep = mean(dep_delay, na.rm = TRUE)
    ) %>%
    filter(arr > 30 | dep > 30)
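
For comparison, here is a rough pandas sketch of the same pipeline (the flights data below is made up to keep the example self-contained; column names mirror the R example):

```python
import pandas as pd

# A tiny stand-in for the flights dataset from the dplyr example above.
flights = pd.DataFrame({
    "year":      [2013, 2013, 2013, 2013],
    "month":     [1, 1, 2, 2],
    "day":       [1, 2, 1, 2],
    "arr_delay": [40.0, 10.0, 5.0, 50.0],
    "dep_delay": [35.0, 5.0, 45.0, 20.0],
})

# group_by + select + summarise, pandas style: per-day mean delays.
daily = (
    flights
    .groupby(["year", "month", "day"], as_index=False)[["arr_delay", "dep_delay"]]
    .mean()
    .rename(columns={"arr_delay": "arr", "dep_delay": "dep"})
)

# filter(arr > 30 | dep > 30)
late = daily[(daily["arr"] > 30) | (daily["dep"] > 30)]
```

Both versions read as a linear pipeline; the dplyr one is arguably terser because the grammar (group_by/summarise/filter) is baked into the language rather than into method chaining.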

Well yeah, and I use them, but they're a band-aid over the fundamental problem that, just like in Perl, in R TIMTOWDI ("there is more than one way to do it"). It's the classic 'we have 12 standards, time to make a unifying one - now we have 13' problem. I've sort of gotten used to it now, but it was majorly difficult at first (after having programmed for nearly 20 years) to get used to the concept that any task can be done in 20 different ways, each one just as 'valid' or 'easy' or 'maintainable' as the others. At least in C++ there are 20 bad ways to do something and one good one - the way that Sutter covered in his columns. I know it's not quite fair to compare 'just' the C++ programming language to R and all its packages, but still.

Just curious, what in particular did you find obtuse?

It's not like R does not have obtuse and baroque parts, it certainly does, and their obtus-ity is rather high, but IMO they are not parts of the language a casual user would likely encounter...

On the other hand, Python has quite a few pitfalls itself -- but I suspect a casual user would, for example, run into Python default arguments a bit sooner than she would run into R environments :)
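
To make that default-arguments pitfall concrete - a minimal sketch (function names made up for illustration): Python evaluates default values once, at function definition time, so a mutable default is shared across every call.

```python
# The pitfall: the default list is created once and reused on every call.
def append_item(item, bucket=[]):
    bucket.append(item)
    return bucket

first = append_item(1)
second = append_item(2)
# second is [1, 2] - and so is first, since both calls mutated the same list.

# The usual fix: use None as a sentinel and create a fresh list per call.
def append_item_fixed(item, bucket=None):
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket
```

R avoids this particular trap because default arguments are lazily evaluated on each call.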

Using R from within Python works pretty well for all those unique R packages which don't have a Python equivalent.

Thanks; I suspected support for embedding R within Python already existed as well, but I wasn't sure about that one.

Rpy2 is the python library you probably want: http://rpy2.readthedocs.org/

R is a wonderful language if you chose to get used to it. I love it. I've even used R in production quality assurance to check for regressions in data (not the statistical regressions). I see countless R posts where people try to compare it to Python to find the one true language for working with data. Article after article, there clearly isn't a winner. People like R and Python for different reasons. I think it's actually quite intuitive to think about everything in terms of vectors with R. I like the functional aspects of R. I wish R was a bit faster but I am pretty sure the people who maintain R are working on that. You can't beat the enormous library that R has.

I also LOVE R. Plus, the fact that Microsoft and other corporations are supporting R will help more and more. With Hadley Wickham's universe it is a great place to do all your work.

Yup. R is supported by MS, Oracle, IBM and others, and companies like Twitter and even the Python shop that is Google use it.

I spent a few weeks a few months ago learning R. It's not a bad language, and yes, the plotting is currently second-to-none, at least based on my limited experience with matplotlib and seaborn.

There are scant few articles on going from Python to R... and I think that has given me a lot of reason to hesitate. One of the big assets of R is Hadley Wickham... the amount and variety of work he has contributed is prodigious (not just ggplot2, but everything from data cleaning, web scraping, dev tools, and time-handling a la moment.js, to books). But that's not just evidence of how generous and talented Wickham is, but of how relatively little dev support there is in R. If something breaks in ggplot2 - or any of the many libraries he's involved in - he's often the one to respond to the ticket. He's only one person. There are many talented developers in R, but it's not quite a deep open-source ecosystem and community yet.

Also, a word of warning: ggplot2 (as of 2014 [1]) is in maintenance mode, and Wickham is focused on ggvis, which will be a web visualization library. I don't know if there has been much talk about non-Hadley-Wickham people taking over ggplot2 and expanding it... it seems more that people are content to follow him into ggvis, even though a static viz library is still very valuable.

[1] https://groups.google.com/forum/#!topic/ggplot2/SSxt8B8QLfo/...

Hadley is actively working on ggplot2. In fact, he just tweeted a list of improvements - https://twitter.com/hadleywickham/status/654283936755904512


Thanks... I didn't know that (though I had been paying attention to bug fixes)... but that's my point exactly: he's prodigious, so maybe "maintenance mode" to him is "major features every 3 months instead of 2" :)

Also worth pointing out: he's actively working on a new book for ggplot2, which, AFAICT, he's providing for free (you just have to run the build tools).


I think if someone were to run an analysis of Wickham's Github activity, it would produce a freakishly busy chart.

Agreed about Hadley's prolific work.

I used to work a lot with R many years ago. I was shocked to find how bad the documentation was, and worse how rude and unfriendly the "community" of grumpy professors was. I shudder to think of the horrible meanness towards beginners asking questions on the mailing list.

I got so fed up I even wrote a book about R data visualisation. But this was all just around the time ggplot2 came out. Unfortunately I stopped using R soon after, but since then Hadley has single-handedly done more good for the language than anyone else.

I don't know what the R community is like now, and whether people like Hadley have made it friendlier, but it's clearly one reason Python is superior.

I'm a late arrival to the language and have interacted with it almost exclusively through StackOverflow and GitHub. I've been astonished not just by how friendly people are, but by how quickly I can get a helpful response to even what I feel are pretty esoteric (and dumb) questions... again, one of the problems of coming into R is that, because of the relatively small community, there aren't as many references or easily Googlable answers compared to Python... but getting answers to the questions you ask is very easy, and I think that's a credit to the community.

On the other hand, there seem to be a lot of useful libraries that haven't been ported over to GitHub or aren't otherwise easily accessible beyond CRAN... Many of them probably don't get as much exposure as they would if they were more easily discoverable... and I honestly don't even know where, in those cases, to start the bug reporting/patching process. That's obviously the fault of my being spoiled by GitHub... but that's kind of the point: there's a bit more friction in contributing to R than you might find in Python/Ruby/etc.

Yeah there's a lot more R stuff on SO now than when I was using it. The mailing lists were more active so that's what I had to use to ask for help.


Thanks :)

The caveat on the ggplot2 book is that building it seems to be really hard because of the nightmare of cross-platform latex. But there will be a physical book out early next year.

Also, RStudio is growing, so I'm hoping I will have some full-time engineers working with me in the not-too-distant future.

> There are many talented developers in R but it's not quite a deep open-source ecosystem and community yet.

Every language has third party packages that are primarily the work of one person.

I'm sure your statement is true for some definition of deep but I don't agree.

Does every language have many of its main third-party packages heavily influenced by the work of one person? Wickham is to R as John Resig is to JavaScript - if Resig had also created and primarily maintained D3, moment.js, and Grunt... Wickham not only steers the libraries that define how a growing majority of R users do data manipulation (dplyr) and visualization, he's also building the tools he needs to maintain and publish them (devtools).

This isn't to say that there aren't other programmers doing brilliant work in R (also, R is just a smaller community overall), but he's devoting significant time to building out support tools and frameworks...this suggests that he is a total mensch, but also that there was a significant need that hadn't yet been addressed.

It does help that I'm one of the few people who are paid to work full-time on nothing but open-source R packages that are designed to broadly aid data analysis.

I would argue that most of the scientific & statistical packages for most languages are driven by at most a handful of people, yes.

Another interpretation is that R is an incredibly productive language for this sort of programming, otherwise one person couldn't write so much useful code. ;)

This is just a series of incredibly generic operations on an already-cleaned dataset in CSV format. In reality, you probably need to retrieve and clean the dataset yourself from, say, a database, and you may well need to do something non-standard with the data, which requires an external library with good documentation. Python is better equipped in both regards. Not to mention, if you're building this into any sort of product rather than just exploring, R is a bad choice. Disclaimer: I learned R before Python, and won't go back.

Exploring the data is maybe 99% of what data analysis is about. It's very much a trial and error process that can't be planned in advance, and R is in my opinion much better suited for that, with a better interactive interface, plotting system and statistical libraries.

On the other hand, if you know the exact calculations that you need to do and the results you're gonna get, then Python might be a better tool.

Personally I learned R after Python, and I use both languages, but I prefer R for anything involving statistics.

"R is in my opinion much better suited for that, with a better interactive interface"

Have you tried IPython/Jupyter?

Yes, I used IPython a lot.

What I meant by better interactive interface is that the language itself is designed with interactive use in mind.

For instance compare

  func(x$a, x$b)
  func(a, b, data=x)
  func(a=1, b=2)

  func(x['a'], x['b'])
  func('a', 'b', data=x)
  func([1, 2], ['a', 'b'])
The R versions are easier to type and read.

If the Python version is Pandas you could replace the brackets with dot notation. (x.a, x.b)
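A small sketch of that shorthand, assuming the column names are valid Python identifiers and don't collide with existing DataFrame attributes (in which case you must fall back to brackets):

```python
import pandas as pd

x = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# x.a is the same Series as x["a"], so func(x.a, x.b) mirrors R's func(x$a, x$b)
print(x.a.sum(), x.b.sum())  # 6 15
```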

the last one is, of course, a result of the atrocious handling of the default arguments in Python :)
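Presumably a nod to the classic gotcha: Python evaluates default argument values once, at function definition time, so a mutable default is shared across every call:

```python
def append_to(item, bucket=[]):  # the default list is created once and shared
    bucket.append(item)
    return bucket

print(append_to(1))  # [1]
print(append_to(2))  # [1, 2] -- the same list again, not a fresh one
```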

I think there are lots of good R libraries for getting data from various places: DBI (databases), haven (SPSS, Stata, SAS), readxl (xls & xlsx), httr (web apis), readr/data.table (flat files). (Disclaimer: I wrote/contributed to a lot of those).

I think tidyr also currently has the edge over pandas for making [tidy data](http://vita.had.co.nz/papers/tidy-data.html).

I'm currently using both R and Python, having previously only used Python. At first I didn't like R for general purpose data munging and web scraping. That was before I discovered a few R packages that make it a breeze. And now it's a toss up for me. If it's an interactive data product that I'm building I probably go with R. If I need data from an API and the supplier gives me only a Python sample script for accessing it I'll go with Python.

Could you list some of these packages?

Recently started using rvest for web scraping. Sweet bejeezus that's a pleasure. I would've never considered R for scraping before. It was always Python with BeautifulSoup.

I would also check out dplyr for data munging. Since most of the code in dplyr is written in C++ it is much faster than the munging capabilities you probably used when you were using R years ago.

Indeed. I use dplyr frequently.

data.table is another handy one here :)

Sweezyjeezy asks which are breezy.

Sorry, couldn't resist!

Hmm I am curious, how would you do data cleaning without doing data exploration first -- and in what way do you find Python superior to R for that purpose?

Also I assume that by "something non-standard" you mean something other than a way to analyze it? Because there is really no comparison wrt available analysis packages between the two...

Not trying to say that R is perfect and great for everything, definitely not, I just have a hard time imagining a data-processing task for which I would choose Python over R (I might pick SAS over either one of them though...)

How does R compare to SAS? I work in engineering and we use SAS pretty heavily for a lot of stuff (simple modelling, time series forecasting, multiple regressions, that type of thing). One thing I really like is how well integrated SQL is. Does R have something similar to PROC SQL? That is really the killer feature of SAS for me.

I use SAS professionally at my job, and R in all my academic/hobby work. R has a couple packages that give similar functionality as PROC SQL (about 95% of my SAS workflow, since it's far nicer than data steps for a lot of things). There's an ODBC package (RODBC), as well as SQLDF, which allows you to use SQL queries to manipulate data frames in R.

While there is (almost?) always a way to do a SQL query using idiomatic R, I have to admit that sometimes my brain thinks up a solution in SQL faster (a product of upbringing).
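For reference, the Python-side analogue of sqldf is only a few lines with the stdlib's sqlite3 plus pandas (the table and column names here are invented for illustration):

```python
import sqlite3
import pandas as pd

# Push a DataFrame into an in-memory SQLite database and query it with SQL,
# much like sqldf lets you query a data frame in R.
df = pd.DataFrame({"name": ["ast", "fg", "trb"], "val": [7.2, 4.1, 9.8]})

con = sqlite3.connect(":memory:")
df.to_sql("df", con, index=False)
out = pd.read_sql("SELECT name FROM df WHERE val > 5 ORDER BY val", con)
print(out["name"].tolist())  # ['ast', 'trb']
```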

I agree. Once you incorporate the other necessary work and preparation, a well-documented, object oriented language is a better way to go.

I have to agree that Python is more powerful, and I am indeed doing more and more in Python. Python was my first language, before R.

However when the dataset is medium-sized (i.e., fits into half your computer's memory) R crushes Python (and Pandas) for the 80% of the time you'll be spending wrangling. The reason is that R is vector-based from the ground up. Pandas does everything that R does, but does it in a less-consistent, grafted-on way, whereas the experienced R person who "thinks vectors" is way ahead of the Python guy before the analysis has even started (i.e., most of the work). I know both really well. I use Python when I want to "get (semi) serious" production-wise (I qualify with "semi" because if you're really serious about production, you're probably going to go to Scala).

But when it comes to taking a big chunk of untidy data and bashing it around till it's clean and cube-shaped, will parse, and has no obvious errors, R is miles ahead of Python. R is where you do your discovering. Python can do it too, but I would estimate the cognitive overhead as double.

By the way, that's why people who "think time series" all day long (i.e., vectors, not objects), and who want to implement their algos, not think CS, will first typically build it in R, which is why CRAN beats Python all the time and every time for off-the-shelf data analysis packages. Data people go to R, computer-people go to Python (schematizing).

R is slow. That's its main problem. And that's saying something when comparing it to Python! But the gem of vector-everything makes it a much more satisfying language than imperative, OO, Python, when it comes to the world of data first, code second.

Finally I'd add that Python 3.x is arguably distancing itself from the pragmatism which data science requires, and 2.x provided, towards a world of CS purity. It's not moving in a direction which is data science friendly. It's moving towards a world of competition with Golang and Javascript, and Java itself.

If you haven't already, you might want to take a look at Julia. It's extremely fast, and has more native support for vectors than Python. It's still immature, but I think it has great potential as the truly great language for scientific/data computing.

I heard that the vector operations were very slow though. Has this changed?

It seems that, though vectorized code in Julia is typically slower than non-vectorized, it is still faster than in R [1] / Python [2].

[1] http://www.johnmyleswhite.com/notebook/2013/12/22/the-relati...

[2] http://blog.rawrjustin.com/blog/2014/03/18/julia-vs-python-m...

Vector operations are not slow - they are basically the same as python/R (compiled down to C).

However, devectorization (i.e. replacing vector ops with a for-loop) is sometimes a performance improvement because Julia can usually provide C-like speeds in for-loops and avoid creating intermediate arrays.

Julia's for loops are comparable to C in performance, and its vectorized operations are comparable to Numpy/R, although some cases can be optimized using https://github.com/lindahua/Devectorize.jl (see the benchmarks table)

Okay. This post:


had worried me a couple of years ago. JMW showed that vectorized code was also much slower in Julia (though both were still faster than R, but that's not difficult).

Glad to see Julia is very fast in both cases, though it's still somewhat perplexing the extent to which vectorized code is necessarily slower. I'm thinking that the future of GPU enabled languages will mean vectorized code will be faster, so I prefer languages with a bias towards vectorisation.

it's still somewhat perplexing the extent to which vectorized code is necessarily slower

The vectorized code typically allocates all kinds of intermediate results (more GC, more memory accesses). Apparently, turning it into loops is less trivial than it seems.

I'm thinking that the future of GPU enabled languages will mean vectorized code will be faster, so I prefer languages with a bias towards vectorisation.

I share that concern. Julia has some libraries to support GPU programming, but I don't know of any plans to have the core compiler take advantage of it.

I think you may have misinterpreted that post. Look at the table under "Comparing Performance in R and Julia" again.

Algo people use R because it's faster, nothing to do with being 'data people'.

I am a data person, and I have to deal with a lot of text in my job. If I had to do it in R, I would quit.

Can you explain why you think it is easier to wrangle data in R? My experience is the opposite.

Do you mean they use Python because it's faster? yes sure. But then, just use scala. 10x faster again. With a REPL.

Perhaps I should clarify: I'm talking mainly about time series and/or data which is vectorizable. Python is better if you're scraping the web, or if there's a lot of if/else going on, i.e., imperative programming.

R's native functional aspects (all the apply family) and multilevel vector/matrix hierarchical indexing is better built from the ground up for large wrangling of multivariate datasets, in my opinion.

Working with text data in R is painful, but it's not due to limitations of the language.

I agree with your critiques of Python... Could you please post some example of code/operations which are very natural in R but unnatural in Python/Pandas? I'm curious to see what I'm missing out on.

Well, I use both, and I can do everything in Python that I can do in R. However here are some things which will give you a flavour of R's more consistent, data-first nature:

  > rollapply(some1000x10matrix, 200, function(x) eigen(cov(x))$values[1], by.column = FALSE) # get the first eigenvalue rolling 200x10 window. 
  >>> # impossible in Python unless using ultra-complex Numpy stride tricks.

  > dim(someMatrix)
  >>> someMatrix.shape
  > head(someMatrix)
  >>> someMatrix.head() # notice consistent function application in R, whereas in Python, mixed attribute / function? So we're on OO land and I must know if it's an attribute or a function.... 
  > rollapply(some1000x2matrix, 200, function(x) {linmod <- lm(x[, 1] ~ x[, 2]); last(linmod$residuals) / sd(linmod$residuals)}, by.column = FALSE) # get the z score in one multi-step function. 
  >>> Impossible in python without For loop as lambdas cannot be multi-statement. 

  > native indexing using [] brackets by index number, or index value, or boolean. All vectors.
  >>> pandas loc/iloc/ix mess.

  > ordered lists (python dict) by default, so boolean or index subsection easy even when data is hierarchical, not tabular
  >>> easy bugs due to unordered nature of dicts; must import some different module and then still can't vector index it. 
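For the record, the "stride tricks" route is less ultra-complex than it once was: NumPy (1.20+) wraps it in sliding_window_view. A sketch of the rolling first-eigenvalue example in Python, with random data for illustration:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

x = np.random.default_rng(0).normal(size=(1000, 10))

# Each window has shape (10, 200): 10 variables, 200 observations,
# so np.cov gives the 10x10 covariance, as cov() does on a 200x10 matrix in R.
windows = sliding_window_view(x, window_shape=200, axis=0)
top_eigs = [np.linalg.eigvalsh(np.cov(w))[-1] for w in windows]  # largest eigenvalue per window
print(len(top_eigs))  # 801 windows (1000 - 200 + 1)
```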
It's all summed up by this:

  > c(1, 2, 3) * 3
  [1] 3 6 9

  >>> [1, 2, 3] * 3
  [1, 2, 3, 1, 2, 3, 1, 2, 3] # wrong! Need rescuing by Numpy!
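For completeness, the NumPy rescue the parent alludes to; element-wise semantics return once the list becomes an array:

```python
import numpy as np

print([1, 2, 3] * 3)            # [1, 2, 3, 1, 2, 3, 1, 2, 3] -- list repetition
print(np.array([1, 2, 3]) * 3)  # [3 6 9] -- element-wise, like R's c(1, 2, 3) * 3
```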

And then there's CRAN. Just last night someone told me about "nowcasting" which uses "MIDAS regression". A relatively new technique. Google it for R (full package available), Google it for Python (Matlab comes up ;-).

And I'm not even going to start on graphics. Seaborn and bokeh are valiant efforts, but they're still 80% of what ggplot and base graphics can do, especially at the multidimensional scale. That last 20% is often all the difference between meh and wow. That said, I do appreciate Matplotlib's auto-rescaling of axes when adding data. Python charts aren't as pretty nor capable of complexity (for similar effort), but they're arguably more dynamic.

Now don't get me wrong. The converse list for Python would be much longer, because it's more general purpose, and it kills R outside of data science. I wrote 10k loc in R for a semi-production and it was horrible because it does not have the CS tools for managing code complexity, and it really is slow at certain things. R is more focused on iterative, exploratory data science, where it excels.

I think this numpy successor may put some weight in favor of python: https://speakerdeck.com/izaid/dynd

R _is_ object oriented. But it uses generic function style of OO, rather than message passing, which you're probably more familiar with. (Interestingly Julia also uses generic function style OO)

The reason I like R - it just makes data exploration and analysis too damn easy.

You've got R Studio, which is one of the best environments ever for exploring and visualising data; it manages all your R packages, projects, and version control effortlessly.

Then you've got the plethora of packages - if you're any of the following fields: statistics, finance, economics, bioinformatics, and probably a few others, there's packages that instantly make your life easier.

The environment is perfect for data exploration - it saves all the data in your 'environment', allows you to define multiple environments, and your project can be saved at any point, with all the global data intact.

If I want some extra speed, I can create C++ modules from within R Studio, compile and link them, as easily as simply creating a new R script. Fortran is a tiny bit more work, still easy enough however.

Want multicore, or to spread tasks over a cluster? R has built-in functions that do that for you. As easy as calling mclapply, parApply, or clusterApply. Heck, you can even write your function in another language, then R handles applying it over however many cores you want.
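The closest stdlib Python analogue is multiprocessing.Pool; a minimal sketch (the function and worker count are just illustrative):

```python
from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == "__main__":
    # Roughly mclapply(0:7, function(x) x^2, mc.cores = 4) in R
    with Pool(processes=4) as pool:
        print(pool.map(square, range(8)))  # [0, 1, 4, 9, 16, 25, 36, 49]
```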

Want to install and manage packages, update them, create them, etc...? All can be done from R Studio's interface.

Knitr can create markdown/HTML/pdf/MS Word files from R markdown, or you can simply compile everything to a 'notebook' style HTML page.

And all this is done incredibly easily, all from a single package (R Studio) which itself is easy to get and install.

Oh yeah, visualisation, nothing really beats R.

And while there are quirks to the language, for non-programmers this isn't really an obstacle, since they aren't already used to any particular paradigm.

As for Python, I'm sure it's great (I've used it a little), but I really don't see how it can compare. R's entire environment is geared towards data analysis and exploration, towards interfacing with the compiled languages most used for HPC, and running tasks over the hardware you will most likely be using.

I like Python better as a language, but Python's libraries take more work to understand and the APIs aren't very unified. R is much more regular and the documentation is better. Even complicated and obscure machine learning tasks have good support in R. BUT the performance for R can be very, very annoying. Assignment is slow as all hell and it can often take work to figure out how to rephrase complicated functions in a way that R can figure out how to do efficiently. I think being much more functional than Python works well for data. I mean the L in LISP stands for list! Visualizations are also easier and more intuitive in R, too, IMO. Especially since half the time you can just wrap some data in "plot" and R will figure out which plot it should use.

I think the conclusion of the article is correct. R is more pleasant for mathier type stuff, while Python is the better general-purpose language. If your job involves showing people powerpoint presentations of the mathematical analysis you've done, you'd probably want to use R. If, on the other hand, you're prototyping data-driven applications, Python would probably be better.

That said, I really like Julia, but can't justify really diving into it at this point. :\

> prototyping data-driven applications, Python would probably be better

I would disagree. Python's libraries are really reimplementing R in Python (Mainly Pandas). I find R to be very flexible and especially in the last 5 years with Hadley Wickham's libraries things are concise and very powerful.

Please look at dplyr and see how this new way fo doing R works. Especially with piping with %>%. https://cran.rstudio.com/web/packages/dplyr/vignettes/introd...

Code in R can look like this beautiful code (even if you don't code in R, I would expect anyone can see what is happening). This is why I disagree that prototyping in Python would be better:

  flights %>%
    group_by(year, month, day) %>%
    select(arr_delay, dep_delay) %>%
    summarise(
      arr = mean(arr_delay, na.rm = TRUE),
      dep = mean(dep_delay, na.rm = TRUE)) %>%
    filter(arr > 30 | dep > 30)

Python has .pipe, but I find it strange that the chain goes to a new line before each item.

Python code:

  (df.pipe(h)
     .pipe(g, arg1=a)
     .pipe((f, 'arg2'), arg1=a, arg3=c)
  )

I find the following Pandas code pretty easy to read:

   (df
    .groupby(['a', 'b', 'c'], as_index=False)
    .agg({'d': sum, 'e': np.mean, 'f': np.std})
    .assign(g=lambda x: x.a / x.c)
    .query("g > 0.05")
    .merge(df2, on='a'))
There are now methods in pandas to do pretty much anything, so you can chain them together into one easy-to-read manipulation without lots of intermediate variables.
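A self-contained toy version of such a chain, with invented column names and threshold:

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["x", "x", "y", "y"],
    "d": [1, 2, 3, 4],
    "e": [10.0, 20.0, 50.0, 90.0],
})

# group, aggregate, derive a column, then filter -- all in one pipeline
out = (df
       .groupby("a", as_index=False)
       .agg({"d": "sum", "e": "mean"})
       .assign(g=lambda t: t.d / t.e)
       .query("g > 0.15"))
print(out)  # only group "x" survives: d=3, e=15.0, g=0.2
```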

> R is much more regular

Compare scikit-learn to the large number of R libraries with incompatible interfaces. In this respect Python is more regular.

If you only have time to learn one language, learn Python, because it's better for non-statistical purposes (I don't think that's very controversial).

If you need cutting-edge or esoteric statistics, use R. If it exists, there is an R implementation, but the major Python packages really only cover the most popular techniques.

If neither of those apply, it's mostly a matter of taste which one you use, and they interact pretty well with each other anyway.

I'd say, if most of your job is analyzing the data yourself and trying to make sense of it, R wins hands down. Particularly if statistical graphics or advanced statistical methods may be needed, but it's still the case even if they won't.

If most of your job is going to be implementing data analysis techniques that you or someone else has done earlier and putting things into production, then Python will quite possibly be more suitable.

"If you only have time to learn one language, learn Python, because it's better for non-statistical purposes (I don't think that's very controversial)."

Actually, it is. When someone has only 3 or 4 years to finish their thesis and learning how to program is secondary at best, and they have to do it in a math-heavy department or field, there is no time or use to learn Python.

R does not mean only esoteric statistics. You have many more utilities in the R packages to diagnose and select models. Fitting a model is like 1% of the work; diagnostics are the more important part, and R has much more to offer there than Python ever will.

Statsmodels has tons of model diagnostics... and there is no R equivalent to PyMC3 (Stan has less capability and a worse API).

I have always considered R the best tool for both simple and complex analytics. But, it should not go unmentioned that the features responsible for R's usability often manifest as poor performance. As a result, I have some experience rewriting the underlying C code in other languages. What one finds under the hood is not often pretty. It would be interesting to see a performance comparison between Python and R.

Given that R folks are porting it to the JVM, I guess performance on the R side will improve thanks to Hotspot and Graal/Truffle.



Then there is PyPy as well.

I also think they should probably add Julia and Wolfram/Mathematica to these comparisons.

I would say they're both as limited as Python, Julia far more so. R's stats packages get ported to Julia faster, though. Mathematica still can't do mixed generalized linear modeling, and no other language (other than SAS and Stata) has a package for analyzing simple effects within them.

Thanks for the overview, I don't use them. It is more my language geek side speaking louder. :)

I have found Renjin quite useful in the past, and I love the motivation behind the project. I know that the guys at Bedatadriven hope to improve upon its performance, however it does not always (or often, depending on how you use R) outperform GNU R. Some great changes have been made lately (http://www.renjin.org/blog/2015-06-28-renjin-at-rsummit-2015...), so I hope to see Renjin's performance progress beyond GNU R across the board. I actually contributed Renjin's current PRNG – a Java translation of GNU R's – which was my first experience getting under R's hood.

The Purdue project you linked looks quite interesting. Unfortunately, development appears to have stagnated: https://github.com/allr/purdue-fastr

[edit] Another important aspect that Renjin contributes is the packages ecosystem: http://packages.renjin.org/

R being single-threaded internally may also result in performance hits.

R also has tools to spread tasks over multiple cores or over a cluster quite effortlessly. In practice, I can create a Fortran or C++ module, then use R to apply it over multiple cores, and get fantastic performance for certain tasks.

The one thing that sometimes gets overlooked when people decide whether to use R or Python is how robust the language and libraries are. I've programmed professionally in both, and R is really bad for production environments. The packages (and even language internals sometimes) break fairly often for certain use cases, and doing regression testing on R is not as easy as Python. If you're doing one-off analyses, R is great -- for anything else I'd recommend Python/Pandas/Scikit.

Packrat is good for making production "packages" that need specific library versions etc. https://rstudio.github.io/packrat/

or Scala, Clojure, or indeed C.

R's great strength is finding the interesting bits of the data. Testing the Algo. Doing the R&D basically. Better than Python.

Once that's done, why stop at Python? If your game is production, Python will do it, but others will do it so much better, faster, more efficiently.

One nice thing about Python is that you can make a piecewise transition from Python -> C, as it is fairly trivial to wrap C code for use in Python. On the other hand, Java's C interface system JNI is pretty much universally reviled.

The same can be said about R. Rcpp makes it super easy for you to drop right into C++ for bits of code that need that level of performance.

You can beat scala and approach c in python, with python syntax, using numba. It compiles numerical python code.

Good point, but personally I am thinking about the future of clustered data analysis, and this seems to be a JVM world and Scala seems to be the language of choice. Flink / Storm / Spark etc.

Dask has that, and scikit learn is moving that way also. It even beats spark for out of core work on a single machine

Yes Dask looks good! It's definitely featuring in my "must consider" list, but I must also, for reasons of responsible planning, give a lot of weight to the JVM technologies, with all their corporate backing etc.

I'd love to hear what precise production problems that you're seeing. I know people are successfully deploying R in production, but I'd like to hear more about the challenges.

First let me say thank you for your work on R packages, you've helped a lot of people accomplish some great things!

Unfortunately I can't go into specific details without potentially divulging proprietary information, but broadly most of the issues I've seen in production with R are corner cases involving multithreading with large amounts of allocated RAM (over 100GB), and corner cases involving the data.table package. I've also seen packages that update and break backwards compatibility, although that's less of an issue. The biggest concern we have with R, however, is that the documentation and coding practices for most R packages make small bug fixes difficult without having extensive knowledge of the package code. This is not always true, but it's true enough of the time that we can't afford to maintain much production R code.

For R:

  (1) instead of `sapply(nba, mean, na.rm = TRUE)` use `colMeans(nba, na.rm = TRUE)`
  (2) instead of `nba[, c("ast", "fg", "trb")]` use `nba[c("ast", "fg", "trb")]`
  (3) instead of `sum(is.na(col)) == 0` use `!anyNA(col)`
  (4) instead of `sample(1:nrow(nba), trainRowCount)` use `sample(nrow(nba), trainRowCount)`
  (5) instead of tons of code use `library(XML); readHTMLTable(url, stringsAsFactors = FALSE)`

The "cheat sheet" comparison between R and Python is helpful. The presentation is well done.

The conclusions state what we already know: Python is object oriented; R is functional.

The Last Word appropriately tells us your opinion that Python is stronger in more areas.

Python's main problem is that it's moving in a CS direction and not a data science direction.

The "weekend hack" that was Python, a philosophy carried into 2.x, made it a supremely pragmatic language, which the data scientists love. They want to think algorithms and maths. The language must not get in the way.

3.x is wanting to be serious. It wants to take on Golang. Javascript, Java. It wants to be taken seriously. Enterprise and Web. There is nothing in 3.x for data scientists other than the fig leaf of the @ operator. It's more complicated to do simple stuff in 3.x. It's more robust from a theoretical point of view, maybe, but it also imposes a cognitive overhead for those people whose minds are already FULL of their algo problems and just want to get from a -> b as easily as possible, without CS purity or implementation elegance putting up barriers to pragmatism (I give you Unicode v Ascii, print() v print, xrange v range, 01 v 1 (the first is an error in 3.x. Why exactly?), focus on concurrency not raw parallelism, the list goes on).

R wants to get things done, and is vectors first. Vectors are what big data typically is all about (if not matrices and tensors). It's an order of magnitude higher dimensionality in the default, canonical data structure. Applies and indexing in R, vector-wise, feels natural. Numpy makes a good effort, but must still operate in a scalar/OO world of its host language, and inconsistencies inevitably creep in, even in Pandas.

As a final point, I'll suggest that R is much closer to the vectorised future, and that even if it is tragically slow, it will train your mind in the first steps towards "thinking parallel".

"Data analysis" means something different in R and in Python. In R, it means all kinds of statistical analyses. In Python, it means basic statistical analysis plus data mining. Too many statistical analyses exist only in R.

I work with biologists. They seem to take to R, which seems strange to me. I think some of it is Rstudio, the IDE, which shows variables in memory in the sidebar; you can click to see them. It makes everything really accessible for those who aren't programmers. It seems to replace Excel for generating plots.

I've grown to appreciate R, especially its plotting ability (ggplot).

Rstudio is R for a lot of people. I'm a computational biologist in a group. Our PI is trying to get the postdocs to learn R themselves, but it's an uphill battle. I eventually warmed up to it - primarily for the plotting.

But a few weeks back he asked me how to do some kind of data sorting / manipulation in R. My answer was that it was a 10 line Python script and I gave him the code. Alas, he couldn't figure out how to save the script and run it from a command-line.

You shouldn't underestimate how important Rstudio is to the popularity of R for non-programmers.

I think some of it is Rstudio the ide, which shows variables in memory on the side bar, you can click to see them

This. Most programming IDEs show the code but hide the data. Excel shows the data but hides the code. RStudio is awesome because it shows both the code and the data.

It amazes me how few biologists / bioinformaticians use an IDE.

Language comparisons are equiv. to religion comparisons...you aren't going to find a universal answer or truth, it's an individual/faith sort of thing.

That being said - all the serious math/data people I know love both R and Python...R for the heavy math, Python for the simplicity, glue, and organization.

This is not just interesting as a comparison; it's also interesting for people who know R/Python to see how to go from one to the other.

Kind of, but the R code is written a little oddly to my eye.

Me too. Why, for example, did they use sapply for column means when they could have just used colMeans with na.rm=T?

That is a major difference between these two languages.

Python: There should be one, and preferably only one, obvious way to do things (though it may not be obvious at first).

R: Every author has a different style of doing things, reflecting in the code.

As for the comparison in general: You can call R from within Python. So Python is at least as powerful as R. The rest (BeautifulSoup, Compression, Game development etc.) is icing on the cake.

How so? As someone familiar with Python but not R, I've always been hesitant to jump in. This code was very readable and made me think that it might be a far more accessible language than I'd previously assumed.

One example in the section titled "Split into training and testing sets" would be to use the createDataPartition() function from the caret package for creating training and testing sets.

He says "In R, there are packages to make sampling simpler, but aren’t much more concise than using the built-in sample function" but using caret is more concise.

Added: Later in the section on random forests he says "With R, there are many smaller packages containing individual algorithms, often with inconsistent ways to access them." Which is why you want to use the caret package as it makes accessing many machine learning packages consistent and easy.

It would be nice to compare JuliaStats and Clojure based Incanter with Python Pandas/NumPy/SciPy. http://juliastats.github.io/

Very picky, but beware constantly using "set.seed" throughout your R scripts. Always using the same random seed is not necessarily helpful for stats, and it makes the R code look a lot trickier than it needs to be.

I hope you all know that the people who have invested most in actually building this software care the least about this discussion.

I see Hadley Wickham commenting here, so yeah...

And now the creator of pandas -- whom you just replied to -- is here. It's officially now a party :)

In manufacturing Minitab and JMP are used for data analysis (histograms, control charts, DOE analysis, etc.) They are much easier to use and provide helpful tutorials on the actual analysis.

What features or workflow do R or Pandas/Numpy offer to manufacturing that Minitab & JMP can't?

R, Numpy, and Pandas are all FOSS. Probably not much of a practical concern, but it might be preferable in some cases.

I don't know anything about Minitab/JMP scripting myself, but my understanding is that R is generally the most intuitive of all the aforementioned (although that would basically boil down to individual preference).

Here's a review including Minitab and R that might be of interest: http://www.prostatservices.com/statistical-consulting/articl...

The comparison is R to Python+pandas.

The equivalent comparison should be R+dplyr to Python+pandas.

Base R is quite verbose and convoluted compared to using dplyr. Likewise data analysis in Python is painful compared to using pandas.

The rvest implementation was the main thing that seemed like an R port of the python implementation rather than best use of rvest.

An alternate (simpler) implementation of the rvest web scraping example is at https://gist.github.com/jimhester/01087e190618cc91a213

It would be even simpler, but basketball-reference designs its tables for humans rather than for easy scraping.

>seemed like an R port of the python implementation

At the end of the GitHub README for rvest:


    Python: Robobrowser, beautiful soup.

Really, syntax "nba.head(1)" is not any more "object-oriented" than "head(nba, 1)" -- it's just syntax, and the R statement is in fact an application of R's object system (there are several of them).

IMO, R's system is actually more powerful and intuitive -- e.g. it is fairly straightforward to write a generic function dosomething(x,y) that would dispatch specific code depending on classes of both x and y.

Single-dispatch generic functions are easy in python too: https://www.python.org/dev/peps/pep-0443/
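A sketch of functools.singledispatch, the mechanism PEP 443 standardized; it dispatches on the type of the first argument, much as S3 dispatches on class:

```python
from functools import singledispatch

@singledispatch
def describe(x):              # default method, like print.default in R's S3 system
    return "something"

@describe.register(int)
def _(x):
    return "an integer"

@describe.register(list)
def _(x):
    return f"a list of {len(x)}"

print(describe(42))        # an integer
print(describe([1, 2]))    # a list of 2
print(describe(3.14))      # something
```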

That's good to know, thanks :) Although, for single dispatch, the S3 system of R is kinda hard to beat -- you just name your function print.myclass and you are done :)

In general, if I have to choose between two languages, one of which was designed specifically for statistics, and one that was more general, I will choose the more general one.

R's value is in the implementation of its libraries, but there is no technical reason a sufficiently meticulous person couldn't implement libraries of the same quality in Python.

It would be nice to also have some notes about performance of both languages for each of the tasks compared. I believe pandas would be faster due to its implementation in C. The last time I checked, R's interpreter was written in C, but large parts of its standard library are written in R itself.

And like pandas, many of the performance bottlenecks in R have been re-written in C. See dplyr and data.table for packages that solve a similar problem to pandas with similar speed (and for some scenarios they're actually faster!)

Looks interesting! Thanks for the information.

Caret is a great package for a lot of utility functions and tuning in R. For example, the sampling example can be done using Caret's createDataPartition which maintains the relative distributions of the target classes and is more 'terse'.

    > library(caret)
    > data(iris)
    > idx <- caret::createDataPartition(iris$Species, p = 0.7, list = F)
    > summary(iris$Species)
        setosa versicolor  virginica
            50         50         50
    > summary(iris[idx,]$Species)
        setosa versicolor  virginica
            35         35         35

If you do your stuff in R, how do you move it into production? Or do you not need to?

There are packages for that (web servers and such). Or you can call it from Java/Python/whatever.

Most R tasks people run are batch jobs that simply exit when done. A typical data science task is: gather data, apply an operation over said data, analyse the results.

  python < world > csv
  R < csv > analysis
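The first stage of that pipeline can be as small as a stdlib filter; a sketch, assuming a hypothetical "name points" line format for the raw input:

```python
import csv
import io

def gather(raw, out):
    """Read messy input records, keep the fields R will need, write CSV."""
    writer = csv.writer(out)
    writer.writerow(["player", "pts"])
    for line in raw.splitlines():
        # e.g. "LeBron James 27" -> ("LeBron James", 27)
        name, pts = line.rsplit(" ", 1)
        writer.writerow([name, int(pts)])

buf = io.StringIO()
gather("LeBron James 27\nStephen Curry 30", buf)
print(buf.getvalue())
```

R then picks up the resulting CSV with a plain read.csv() and does the analysis.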

I tried to help my wife, who used R in school, only to get quickly lost. I also attended a ~1 hour R course at university.

To me, R was a waste of time and I really don't understand why it's so popular in academia. If you already have some programming knowledge, go with Python + SciPy instead.

EDIT: R is even more useless without RStudio, http://www.rstudio.com/. And NO, don't go build a website in R!

Maybe you didn't mean it this way, but to me your comment reads as, basically, "I tried R for an hour and didn't immediately grok it, therefore it is a waste of time."

That may not be what you meant, so I haven't downvoted yet, but it doesn't seem to be an attitude that is helpful for the conversation.

Thanks for your explanation. It seems my ability to communicate is getting worse every year :-/.

What I meant to say was that I helped my wife during her master thesis (~6 months) with R, in addition to spending an hour in one of the classes.

Her teachers were also novices at both R and Excel, and we ran into issues with everything from how R parses CSV files to just figuring out the proper syntax to get R to do what we wanted.

Sorry if my comment wasn't helpful; I was merely attempting to add some reflections from personal experience to the discussion.

I disagree that R is more useless without RStudio. I'm not a fan of R overall, but I run everything in tmux+vim, and R fits that workflow just as well; I prefer it to RStudio. R is popular because it makes a few choices that differ from many programming languages, geared towards writing scripts for statistics (e.g. 1-based indexing, the <- assignment operator).

I'll second the utility of alternative environments to RStudio. For me, I love RStudio, but I spend too much time in Python (and occasionally dabbling in others) to use it all the time. So, for me it's Emacs Speaks Statistics, which is fantastic.

As a side benefit, the first time I tried dabbling in Julia, I was pleasantly surprised to have a familiar mature environment work with it out of the box.
