I think Python is the biggest hidden gem in statistics. It's had a tremendous impact on machine learning and algorithm development, yet traditional statisticians still rely on SAS/R/Stata/MATLAB.
All of these languages have libraries that produce the same results, the difficulty is mangling the data into the correct input format. Python's list comprehensions are much, much easier to use than MATLAB matrices, R's data frames, Java's ArrayLists, etc. I'd advise any new graduate student to learn how to plug data into traditional programs, but save yourself a headache and perform your data manipulation in Python. Eventually you can take the leap and do the analysis in Python as well.
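For anyone who hasn't seen the kind of data manipulation meant here, a minimal sketch follows. The records, the field names, and the `is_number` helper are all made up for illustration; the point is only that filtering, type conversion, and regrouping each fit in one comprehension.

```python
# Hypothetical raw records, e.g. rows from a messy spreadsheet export.
raw_rows = [
    {"site": "A", "reading": " 12.5 "},
    {"site": "B", "reading": "n/a"},
    {"site": "A", "reading": "14.0"},
    {"site": "C", "reading": ""},
]


def is_number(text):
    """Return True if text parses as a float (illustrative helper)."""
    try:
        float(text)
        return True
    except ValueError:
        return False


# One list comprehension drops unusable rows and converts the rest.
clean = [
    (row["site"], float(row["reading"]))
    for row in raw_rows
    if is_number(row["reading"].strip())
]

# A dict comprehension regroups the cleaned values by site.
by_site = {
    site: [value for s, value in clean if s == site]
    for site in {s for s, _ in clean}
}
```

From there, `clean` or `by_site` can be written out in whatever input format the traditional package expects.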
> All of these languages have libraries that produce the same results, the difficulty is mangling the data into the correct input format. Python's list comprehensions are much, much easier to use than MATLAB matrices, R's data frames, Java's ArrayLists, etc.
At the end of the day, for machine learning applications, your data is in a tabular format (in Python, a pandas data frame). Yes, Python has a few tricks like list comprehensions for speeding up data processing into that analyzable form. R has a few tricks for processing tabular data as well (e.g. dplyr).
There are tradeoffs and the skill is finding which works best. Using a single programming language is a bad philosophy even for non-statistical developers.
The even bigger advantage of python is everything else: web scraping, interfacing with weird APIs, hitting the database, consuming and emitting obscure formats, parsing text, calling operating system services, and a million other things. That and much much better abstractions for building larger systems out of reusable components and much better tooling for serious software engineering.
R or Matlab can be fine at exploratory data analysis, making charts, running miscellaneous bits of non-programmer-grad-student code found on the internet, etc., but as soon as you want to do anything other than data analysis, they quickly become annoying.
If you haven't looked at the R ecosystem in a while, there is a package for each of those use cases (scraping/API: rvest; database: dplyr; parsing text: stringr; etc.).
Yes, Hadley Wickham is primarily responsible for the popularity of R.
We’re talking about an order of magnitude difference in number of packages (82096 on PyPI vs. 8551 on CRAN) and their maturity, and such a naïve metric probably undersells the difference in variety of use cases.
If you picked 100 random production python projects out of a hat, no more than a small handful of them would be remotely appropriate to build using R.
And that's entirely fine. R is great at being a quick and dirty statistical analysis language for people doing data exploration.
I suspect that the utility of more packages increases only logarithmically. Having 10x more packages doesn't mean it's 10 times more useful. If your obscure need isn't in the first 8,000 packages, it probably won't be in the next 80,000. That's just how power laws work. And any common task you can think of will probably be in the top 8,000.
Pandas for Python is almost a re-imagining of R's data frames (base 0 instead of base 1, UGH) and the "common tasks" match closely, but I still use R over Python.
I think R's data frame is by far the most intuitive abstraction. There's a reason why it was duplicated in Python, Spark and Julia. And R has a lot of utilities for converting input strings, dates, JSON etc.
I'm glad I read this comment. After checking some of the docs I think I will have a go at Python for data wrangling. List comprehensions look... friendly.
R still rules for plotting and running canned statistical procedures but sometimes I feel like if I stop programming R for a week I forget how to use it effectively... e.g. forgetting to add stringsAsFactors=FALSE to everything, forgetting rbind() can overwrite column names, forgetting I have to define my own string concatenation operator in every script.
If Python can save me some of the frustration involved in manipulating data frames that will be nice.
A reply from the man himself! Thanks for the link. I'll have a go.
I do like the look of the dplyr library a lot. Combining functions like select and group_by with the pipe operators creates code that is reminiscent of SQL, which is very nice for readability.
I think this thread illustrates a kind of tension between those coming from an IT/big-data/web oriented background and the more traditional statistics/science/engineering side.
The IT side brings a lot of very powerful and scalable tools to the table. However there are aspects of traditional work which I suspect are lost on some big-data people.
For example, in my line of work (physical asset mgmt) we deal with a lot of very small datasets, very poor quality datasets (e.g. some guy's favourite spreadsheet) and also cultural issues (some engineers are inherently averse to changing systems, and spending decisions are inherently political). In this situation, there is a limit to the benefit of more powerful/scalable tools, and it is advantageous to use tools which are considered high quality and vetted by the community.
R is in a good position here as it has the pedigree of being accepted by the academic stats community, as well as actually being a great tool.
I'm primarily a python user and agree with most of your post but I do find myself going back to R for many of the more esoteric statistical methods. I.e., if I want a specific sort of penalized regression it may not yet be implemented in python.
I use both Python and R for analytics projects. Python falls quite short on statistics and time series, especially if one is doing exotic/advanced stuff.
At the end of the day I just want to get my job done, and I select the best tool for the job. I have gained a lot by working with both languages.
On a sidenote, one cannot talk about the success of R without mentioning RStudio (amazing IDE for working with data)
I know little about other languages.
But as data sizes grow, I wonder whether Python is at an efficiency disadvantage compared to the others because it's interpreted?
I really would be interested because I started in Python and R and now use R exclusively due to manipulation. R (dplyr) really is the best manipulation system I have ever used. Perhaps people try to do loops in R and other non-R-like ways and give up. R is fundamentally a functional language, and people try to bang a square peg into a round hole and give up?
I think you got downvoted because people didn't know what you meant. You're right though. Although R is laughably inferior to python as a programming language, it is vastly more work to try to do statistical data analysis in python than in R. I recommend using both languages and using csv or whatever format to exchange data sets.
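For what it's worth, the Python half of that CSV handoff needs nothing beyond the standard library. A rough sketch, with made-up column names and an in-memory buffer standing in for the file you would hand to R's read.csv:

```python
import csv
import io

# Hypothetical analysis-ready records produced on the Python side.
rows = [
    {"subject": "s01", "dose_mg": 10, "response": 0.42},
    {"subject": "s02", "dose_mg": 20, "response": 0.57},
]


def to_csv_text(records, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = to_csv_text(rows, ["subject", "dose_mg", "response"])

# Reading the text back shows what the other side of the interface sees:
# every value arrives as a string and has to be re-typed by the reader.
round_trip = list(csv.DictReader(io.StringIO(csv_text)))
```

Note that `round_trip[0]["dose_mg"]` comes back as the string "10", not an integer. CSV carries no type information, so whichever side reads the file has to infer or restore the types.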
> using csv or whatever format to exchange data sets
Have you used Pandas in the past year or two? I'm curious why you would exchange data sets using csv or another format between Python and R when you could easily call the R function of interest from within Pandas (using rpy2) and not even worry about data interchange.
It's definitely not vastly more work to do statistical analysis using Python Pandas than in R anymore, perhaps it was several years ago.
EDIT: And carlmcqueen mentioned Feather in response, which is a collaboration between the developers of Pandas and R to create interoperable on-disk data frames for both languages. Point being, between rpy2 and Feather (and probably other projects), you definitely don't need to use intermediary csv files anymore to move data back and forth between R and Python.
> It's definitely not vastly more work to do statistical analysis using Python Pandas than in R anymore, perhaps it was several years ago.
Haha, what? What statistics can you do in pandas? You can do some statistics in python by cobbling together stuff from scipy and statsmodels (maybe I'm out of date, is there more?). I see a few modules for regression and stuff in pandas but they are marked as deprecated. I think perhaps you and I mean different things by "statistics". R provides a vast ecosystem covering, for example
- Gold standard implementations of simulation, PDFs, quantiles of any probability distribution you can mention (in python you can find some of this in scipy; not pandas. But scipy is a real mess compared to R and not as comprehensive.)
- Gold standard implementations of any classical hypothesis test you can mention
- Gold standard implementations of computational methods for fitting generalized linear models, mixed models, frameworks for MCMC samplers, graphical models, HMMs, and a vast amount of other stuff I'm not clever enough to name right now let alone understand.
Really any statistical procedure -- whether "classical" or "modern"/"computational statistics" -- in R you will find it, and furthermore it will be basically the reference implementation / gold standard.
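To make the first bullet concrete for the normal distribution only: since Python 3.8 the standard library's `statistics.NormalDist` covers the density, CDF, quantile, and simulation cases, roughly analogous to R's dnorm, pnorm, qnorm and rnorm. This is a small sketch of that one distribution, not a claim about coverage; scipy.stats has many more distributions, and the comparison with R's breadth above still stands.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Density at a point (R analogue: dnorm).
density_at_zero = standard_normal.pdf(0.0)

# Cumulative probability below a point (R analogue: pnorm).
prob_below_1_96 = standard_normal.cdf(1.96)

# Quantile, i.e. inverse CDF (R analogue: qnorm).
quantile_975 = standard_normal.inv_cdf(0.975)

# Simulation with a fixed seed for reproducibility (R analogue: rnorm).
draws = standard_normal.samples(5, seed=42)
```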
That's not mentioning the plotting tools and the numerical computing and clean linear algebra syntax. But that's it, no more: the people who go further and suggest using R for building a web server or web scraping or something mostly haven't used real programming languages.
You're missing the point. The python ecosystem can't compete with R on the statistics front -- it would be crazy to try. That's certainly not the aim of pandas.
> you definitely don't need to use intermediary csv files anymore to move data back and forth between R and Python.
Perhaps not, but doesn't it please you to have a well-defined interface (a serialization format) between the two languages? I haven't tried Rpy2 for years. I don't like to have two different languages get their tentacles into each other like that if I can avoid it, but I'm sure it's a good project which has its use cases.
EDIT: thanks, I hadn't seen feather. That looks like the thing to use.
Have you even looked at pandas? I get the impression you haven't. Pandas has most of the statistical utility functions that R does, and for those few that it lacks, Python/Pandas also has available an easy-to-use FFI interface to R via rpy2.
Have you even looked at statistics? I get the impression you haven't. Pandas has aggregation and that's it. I don't really like to start those language wars, but what the heck. Your precious pandas is actually worse than the R alternatives in every way. It is slower than data.table while having comparable syntax. Dplyr is comparable in speed while having WAY better syntax.
I have heard of people that could not get the same results in Python as they get in R.
Have you looked at R's source? Some common functions contain dozens of unsourced magic numbers (and the comments indicate they have been modified over time).
Plenty of free high quality documentation and learning materials around R (just read anything by Hadley)
Package manager. Super easy to find, install, and start using packages.
Open source / Free
Large community of users
Extensive usage by the stats community. (If a new algorithm comes out, chances R there will be an R implementation)
Easy to build and share your own packages via Github.
Easy to link C++ code to your packages.
----------------------------------------------
I love R, but something about how the language feels syntactically, it's not as pleasurable programming wise compared to something like the Python data stack. But with all of the above advantages, I don't see myself switching to anything else in the future for my data science work, unless I have a really pressing need to. The other thing is that the language is so damn popular that the useR conference was sold out in pre-reg. rounds.. Seriously guys, stop using and learning about R so I can get in the conference....
A big problem with R is that it's just stats. The other day I wanted to do a simple loan amortization (simple PMT/IPMT in Excel). People say 'use R over Excel!'. Right. There are some clunky barely-working packages in R that do half of what you need and some stack overflow posts that mostly show how to do the other half, but that's no basis to build on.
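For reference, the calculation in question is small. Here is a sketch in Python using the standard annuity formula, payment = P * r / (1 - (1 + r)^-n), where r is the per-period rate and n the number of payments. The function names and the example loan are my own, and the sign convention differs from Excel, whose PMT returns the payment as a negative cash flow.

```python
def payment(principal, annual_rate, years, periods_per_year=12):
    """Fixed payment per period for a fully amortizing loan."""
    rate = annual_rate / periods_per_year
    n_payments = years * periods_per_year
    if rate == 0:
        return principal / n_payments
    return principal * rate / (1 - (1 + rate) ** -n_payments)


def amortization_schedule(principal, annual_rate, years, periods_per_year=12):
    """Return (period, payment, interest, principal_part, balance) rows."""
    rate = annual_rate / periods_per_year
    pmt = payment(principal, annual_rate, years, periods_per_year)
    balance = principal
    schedule = []
    for period in range(1, years * periods_per_year + 1):
        interest = balance * rate        # interest portion of this payment
        principal_part = pmt - interest  # remainder pays down the balance
        balance -= principal_part
        schedule.append((period, pmt, interest, principal_part, balance))
    return schedule


# Hypothetical example: 100,000 borrowed at 6% a year for 30 years.
monthly = payment(100_000, 0.06, 30)
schedule = amortization_schedule(100_000, 0.06, 30)
```

The interest column plays the role of IPMT: in the first month it is the opening balance times the monthly rate, and it shrinks as the balance is paid down.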
And don't get me started on string handling in R, or that there's no way to get the path of the currently running script, or a dozen other things that are trivial in a general purpose language but are a major pain in R. R is not 'general purpose' enough, and it doesn't have to be useful to write both kernel drivers and database REST frontends, but being able to do things that are math-related and not purely stats - that's not too much to ask for I'd say. Especially because it's not reasonable to ask people whose main job is not writing software to learn multiple languages/tools.
(Other recent example I remember: how unintuitive I found it to plot a sine wave and its first and second derivative. My Mathematica-oriented colleague did it in 2 minutes.)
R is only recently getting involved in the financial world. Most of R development has been in academia focusing on biostatistics, clinical trials and such. There is an R in Finance conference every year. Also there are a lot of good packages for securities, investing, and risk management.
My main problem was the derivative, not so much the plotting (or maybe it was 'plotting an arbitrary function'); but I looked it up and it seems I slightly misremembered what it was I wanted to do. I wanted to draw a cubic spline, not a sine. What I ended up doing was
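library(ggplot2)

# Knots of the curve
spline_x <- 1:6
spline_y <- c(0, 0.5, 2, 2, 0.5, 0)
# Interpolating function through the knots
spl_fun <- splinefun(spline_x, spline_y)
p <- ggplot(data.frame(x = spline_x, y = spline_y), aes(x, y))
# The spline itself
p <- p + stat_function(fun = spl_fun)
# Its second derivative
p <- p + stat_function(fun = spl_fun, args = list(deriv = 2))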
I still don't quite understand how that derivative works - ?list doesn't mention anything about 'deriv', and there's a function called 'deriv' but I'm not sure how that's being interpreted in the code above.
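On the question of how the derivative works: as far as I can tell, `deriv` has nothing to do with list() or with the deriv() function. The function returned by splinefun takes two arguments, x and deriv (defaulting to 0), and stat_function passes the elements of its args list on to that function as named arguments. A toy analogue of the mechanism in Python, with a plain cubic standing in for the spline and `evaluate_on_grid` as a made-up stand-in for stat_function (this is not how ggplot2 is implemented):

```python
def make_cubic(a, b, c, d):
    """Return f(x, deriv=0) for a*x**3 + b*x**2 + c*x + d.

    Mimics the shape of what R's splinefun returns: a function whose
    optional deriv argument selects which derivative to evaluate.
    """
    def cubic(x, deriv=0):
        if deriv == 0:
            return a * x**3 + b * x**2 + c * x + d
        if deriv == 1:
            return 3 * a * x**2 + 2 * b * x + c
        if deriv == 2:
            return 6 * a * x + 2 * b
        if deriv == 3:
            return 6 * a
        raise ValueError("deriv must be 0, 1, 2 or 3")
    return cubic


def evaluate_on_grid(fun, xs, args=None):
    """Call fun(x, **args) at each x, forwarding args as named arguments."""
    args = args or {}
    return [fun(x, **args) for x in xs]


curve = make_cubic(1, 0, 0, 0)                           # f(x) = x**3
values = evaluate_on_grid(curve, [0, 1, 2])              # f itself
second = evaluate_on_grid(curve, [0, 1, 2], {"deriv": 2})  # f''(x) = 6x
```

So `list(deriv = 2)` in the R code is just a named list, and its single element ends up as the `deriv` argument of `spl_fun`.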
Also it seems recent versions of ggplot2 have geom_xspline() which does what I need (I'm told) but that wasn't in the release version when I was doing it.
It took me 5 minutes to figure out how that actually did work! That is rather esoteric code!
(FWIW the reason that there's no native support in ggplot2 for this sort of smoothing is that I think it's a really bad idea as it tends to distort the underlying data)
Basically this. Certainly not the worst language (SAS), but intangibly less pleasurable than python or most other common languages I've used. Maybe that's because I'm not a 'real' stats person though and ~ notation still takes me a minute to grok. And there is no challenger on the horizon for its dominance in available packages. And at least it's not SAS.
My theories:
A) Stat people use R because they don't know any better
B) Stat people use R because code monkeys use python
C) Stat people use R because stat people use R (probably this)
No. R is a language that is beautifully well suited to data analysis and interactive computing. Stat people don't use it simply because we don't know better.
As a language, one problem is that it is many people's first programming language, and it's a very bad introduction. In particular it doesn't teach people to think in terms of standard data structures like hash maps / dicts. It's bad for the job prospects of lots of grad students who might hope to leave academia.
EDIT: I am corrected in regards to the SAS routines statement; see the reply.
A few comments. I worked in pharma and the FDA specifically requires a number of SAS routines- specific function calls- to be used when doing drug studies/clinical trials. R can't replace SAS in those cases without massive effort because the FDA is slow and conservative and people like to have validated results.
The SAS spokesperson said: """Anne H. Milley, director of technology product marketing at SAS. She adds, “We have customers who build engines for aircraft. I am happy they are not using freeware when I get on a jet.”"""
to which a senior employee of Boeing pointed out that every jet they build uses R as an integral part of the design process. I think that had to be an "oh shit" moment for SAS, where they realized their strong position in stats was going to start to erode.
That is not true. The FDA uses R internally, and there is no requirement that you must use any specific software tool. See https://www.r-project.org/doc/R-FDA.pdf for more details
"""Despite some mistaken conceptions in the pharmaceutical industry, SAS is not required to be used for clinical trials. This origin of this fallacy is probably related to the fact that data must be submitted in the XPT "transport format" (which was originally created by SAS). This data format is now an open standard: XPT files can be read into R with the standard read.xport function, and exported from R with the write.xport function in the SASxport package. (And if you have legacy data in other SAS formats, there's a handy SAS macro to export XPT files.)"""
I'm not totally sure whether this analysis captures the true extent to which R vs SAS vs SPSS is used.
If I use R for a plot, or a simple bit of regression, or anova, or even cross-validation, I don't reference it in a paper. I only cite it if there is a package designed for a particular type of data (e.g. a Bioconductor package) or something a bit more esoteric (e.g. apcluster). About 95% of the work is data munging and - sorry Hadley - I don't cite dplyr, purrr, magrittr etc...
However I have noticed that in clinical trial or small social science papers simple analyses of this type are often cited as being done in SPSS or SAS. I think this just reflects the fact that non-specialist data analysts are more likely to cite SAS or SPSS for simple procedures such as graphs or anova as an appeal to authority.
So I reckon the data may reflect a trend but tells us little about the true levels.
Source code really should be available, but it almost never is. The peer review process is, in my opinion, quite flawed. While your paper's high level content gets reviewed, no one actually looks at your code and data to ensure that you didn't forget to carry the one. Your analysis could be totally wrong, but reviewers only review what you say you did, not what you actually did.
Any paper I peer review better have source code available or they will hear about it. That said, yes, there isn't time (or funding) to actually re-run the entire analysis.
Nobody cites every package they use, it's not feasible. I use a lot of packages, and some journals have a limit on the number of citations you can have. I only cite packages when it provides specialized statistical functionality.
For example, I do a lot of work with data from complex surveys, and I always cite Lumley's survey package because without it I wouldn't be able to do the work. On the flip side, I use Hadley's readr package extensively because I think his I/O functions are more sane than the defaults. I'm not going to cite readr in every paper I write just because I'm too lazy to type stringsAsFactors = FALSE when I read a csv file.
Our genomics workflows use dozens of packages even before I get the data and start really doing analysis, statistics, and plots. It's just not feasible to cite every bit of code we use (though we certainly point people to the higher level routines, which they can use to see what was run/how to reproduce the results).
The success of R in statistics (compared with Python, etc.) is that it was designed from the beginning with statisticians and their specific needs and approaches in mind. As much as I appreciate Python, it is a general purpose programming language adapted to statisticians' needs, not the other way around.
R has many issues, but if you speak to statisticians you will hear that it's the closest thing they have to their own way of doing things.
It's great for bleeding edge scientific research. The results of many languages don't always match for advanced algorithms, but the open source nature of R makes it easier to identify the problem areas.
The R-core interpreter does have a number of deficiencies. (R is based on S-language specification that left wiggle room from the 70s.) General purpose programming and data wrangling/engineering is best handled in other programming idioms.
For example, reshaping JSON to the format an intricate R function expects. Appreciate the great work with (d)plyr and similar packages, but it's still work and overhead. Combined with some inefficiencies/quirks in base R functions (does ifelse() still evaluate twice?) it's easier to go with a widely used and respected package in a general purpose language; Nokogiri for example. For data engineering, consider there is not a maintained R package for a web-client, and asynchronous programming is weak.
JSON is often a pain because it's so hierarchical and un-dataframe like. I have a few notes on working with it here: http://r4ds.had.co.nz/hierarchy.html.
`ifelse()` is a nightmare of a function but I don't think double-evaluation is ever a problem.
There are two maintained web-clients: curl (low-level) and httr (high-level). And I think rvest does everything that nokogiri does.
Thanks for the link on JSON and your packages, great work as always. I should clarify: when I said web-client, I meant a websockets client to consume a feed. The last time I tried, the only R package (r-websockets) just crashed my Linux box, and it had not been maintained for several years. httr doesn't do websockets, as I understand. Seems like a fundamental way to engineer/wrangle data into R.
You can do it with httpuv, but it might be a bit clunky. I think better websockets support, and better async generally, is on the roadmap for the next year.
People say that, but I'd prefer the actual LISP syntax then (being a fan of xlispstat back in the day). I'm surprised nobody has created a "Lisp-flavored R" analogous to Erlang's LFE or Python's Hy.
I'd prefer something more like TypeScript for R, where you can gradually move over but you get better tooling. I'd also ask for a new standard library but I think Hadley is basically doing that.
Maybe in 30 years they will also learn a true programming language and stop producing undocumented, unusable, unportable, underdeveloped libraries for research level tools and technologies.
Outside the world of neural networks it is a complete disaster, and the NN landscape is at an acceptable level only because of big companies, surely not thanks to the researchers. And the reason, of course, is that most researchers refuse to think of themselves as "software developers" and use these arcane languages which might be good for prototyping but lack power when it comes to shipping a real product (which might also be a tool for other researchers to use).
At least they're not using Matlab where everything breaks as soon as you change machine.
I mean, I won't argue against having better code and documentation, but it's not really our job to ship a real product. Shipping well documented, easily usable, ultra portable, well developed libraries takes a fuckload of time, resources, and expertise that we don't have. Our primary job is to ship ideas.
It would be awesome if every project I did ended up with a nice, polished piece of software, but that's not what I get paid to do. I would be fired if I tried to do that.
Fair enough. But, speaking as a researcher, I often find myself reading through hundreds of lines of code and rebuilding routines from scratch in order to reproduce and expand on what others did in their works.
However, I was very harsh, and of course I wouldn't find it viable to expect production ready code, but something moderately portable could come in handy. Of course, as you said, a researcher doesn't have the time to build a well developed library. As a solution, my University is considering the idea of hiring a dedicated developer whose job would be to maintain libraries. I really hope this happens.
For simulation it's not even close. Open source can't do user interfaces. Not a problem for programming languages, but for simulation at least, there's a slew of powerful but unusable open source software made by professors and then there's expensive proprietary ones with nice front ends that save enormous amount of time for the users.
No. Some have the benefit of proprietary modules (FPGA toolchains), some have large libraries of pre-entered and organized data (Mathematica), and some have early access to hardware (LabView, CUDA).
Programming languages, perhaps, are less vulnerable to these issues. And perhaps open source could beat these applications eventually, given perfect competition. But we're not in that world, unfortunately.
I think that open source programming languages will always win in the long run, since the target customer base knows how to program and extend the tools.
I'm not sure that logically follows. What %age of Python users actually know the underlying C well enough to make changes to the language? Even the number who know how to write bindings is tiny overall.
There is an over-representation of open source in the amateur and student communities because open source is usually free while commercial products are very expensive.
It's true however that there is a tendency where open source is displacing more and more commercial products even in commercial settings. I, for example, prefer using python over matlab, even if matlab were free, but I'm not very representative since I actually love programming and programming languages and don't mind working with virtualenvs and configuring emacs to my needs. Most engineers and scientists don't have the time or the interest in learning to do so and prefer a suboptimal (in my opinion) language and more polished tools that Just Work. It's sad that to this day there are incredibly cool libraries like TensorFlow for python and yet there is nothing as easy to use for debugging or profiling as the matlab editor. I know there are alternatives, for example PyCharm, but I assure you that most non-software engineers are not willing to use "such complex" tools.
One thing that interests me is language power, vs experience. Let's say you had 1 year experience in language X. Language Y comes along that is better in some way. In another year, would you be happier and more productive with 2 years experience of X, or one year of Y?
I sometimes think with the churn of languages, no-one really gets deeply enough into one to really leverage it.
The way I use Python in machine learning is quite different from how many others in competitive ML use Python. I use pure Python 2.7 with PyPy and try not to touch or use numpy, scipy, pandas, etc. R's data.table is possibly faster than Python's numpy/scipy/pandas. I think anyone choosing Python because of numpy/scipy/pandas is really being misled. You should be using Python in spite of the need to rely upon numpy/scipy/pandas. If you really need numpy/scipy/pandas just use R and data.table, which is amazingly fast. I think Python is really great because of PyPy and the strength of the standard Python library.
If you need to call one of the built-in pieces of Magic (TM) then Mathematica is OK, but if you want to build something new that needs to interface with literally anything outside of Mathematica, then Mathematica is a PITA.
Actually, the whole Mathematica kernel is exposed via a C API. I wrote a Python-Mathematica bridge based on this and it was wonderful. You could sit in Python, and send Python expressions with variables, etc, to Mathematica for evaluation, and get the results back as Python objects.
I've worked with these types of bridges before. They are terrible if you want to keep the program running and intermittently call Mathematica throughout the course of a multi-hour session.
If you have a small script that makes a single one-off call to Mathematica, and the interface already exists for your language, which it probably doesn't, even if you're using an extremely popular language, and even though you're paying hundreds of dollars a year just for PERSONAL use, then things can be ok. But if you want to make a bunch of calls and keep the program running reliably then you're SOL.
Oh, and don't even think about deploying. It will cost you so much that it's more cost-effective to just rewrite the thing or do the work of switching out with a different library/tool.
I don't understand why you consider it a problem to intermittently call mathematica's kernel in a multi-hour session. There's nothing that would make this not work. The mathematica C interface launches a copy of the kernel and communicates with it over a straightforward protocol.
Yes, it's extremely easy to link to packages in R, and even to include C++ programs to make your R package run faster. Mathematica, on the other hand, is a large proprietary package that involves a bit of effort to install and use. I'm not sure about what statistical features it has, but perhaps they're not as developed as R's.
Mathematica is better represented in mathematics than in statistics-focused fields. Honestly, for stats stuff, mathematica isn't even on the radar.
It's been about 10 years since I looked at mathematica but at that time it put the emphasis on symbolic manipulation of equations using its own internal magic while R (and matlab and numpy) focus on more traditional numerical computation, eg array operations via BLAS/LAPACK which is much more practical for statistics.
Shameless self promotion, but not every package increases the price. We maintain a free (GPLv3) library for JS/SAS that lets you build nice user interfaces to your programs/workflows using modern frameworks like Angular or React.
Check out github.com/boemska/h54s. I wouldn't normally post it like this, but SAS comes up so rarely that I figured if you're on HN and you use SAS, then you'll probably be interested. The more the merrier.
Just noticed that this was posted by my old (half) boss!
Finally! This is very encouraging, that such an excellent free software package is in such high demand. From what I've used it for, it worked very well. It's great for quickly creating nice-looking graphs and plots.
To 2: When I look for just "R" even in an anonymous window (so it should not use my history) I get as the first suggestion a link to https://www.r-project.org/ - the home of R. What else is there for that letter - that is equally popular? "R" is "hip" and trending. Microsoft not too long ago started a big push into the R space and now regularly generates headlines around the system, accelerating the trend even more.
Even weirder, a Google search for "xlispstat" seems to bring up more R hits that don't even mention xlispstat than actual xlispstat ones. Some weird algorithm is associating R and xlispstat as relating to statistics and because R is much more popular these days, prioritizing R over xlispstat.
It's not the search engine's fault, it's the contents. R is associated with xlispstat as the credible alternative. E.g. Jan de Leeuw JStat's paper. Also the author of xlispstat is one of the main contributors to R.
>Note that the decline in the number of articles that used SPSS or SAS is not balanced by the increase in the other software shown in this particular graph.