All of these languages have libraries that produce the same results; the difficulty is mangling the data into the correct input format. Python's list comprehensions are much, much easier to use than MATLAB matrices, R's data frames, Java's ArrayLists, etc. I'd advise any new graduate student to learn how to plug data into traditional programs, but save yourself a headache and perform your data manipulation in Python. Eventually you can take the leap and do the analysis in Python as well.
At the end of the day, for machine learning applications, your data ends up in a tabular format (in Python, a pandas data frame). Yes, Python has a few tricks like list comprehensions for speeding up data processing into that analyzable form. R has a few tricks for processing tabular data as well (e.g. dplyr).
There are tradeoffs, and the skill is in finding which tool works best. Using a single programming language is a bad philosophy even for non-statistical developers.
R or Matlab can be fine at exploratory data analysis, making charts, running miscellaneous bits of non-programmer-grad-student code found on the internet, etc., but as soon as you want to do anything other than data analysis, they quickly become annoying.
Yes, Hadley Wickham is primarily responsible for the popularity of R.
If you picked 100 random production python projects out of a hat, no more than a small handful of them would be remotely appropriate to build using R.
And that's entirely fine. R is great at being a quick and dirty statistical analysis language for people doing data exploration.
Plus dplyr essentially allows you to treat tables in a database as if they are data.frames.
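For example, a minimal sketch (assuming dplyr's database backend, now the dbplyr package, and an in-memory SQLite connection):

library(dplyr)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars)                 # load a data.frame into the database
tbl(con, "mtcars") %>%               # a remote table, used like a data.frame
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg)) %>% # the verbs are translated to SQL
  collect()                          # collect() pulls the result back into R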
Have you used Pandas in the past year or two? I'm curious why you would exchange data sets using csv or another format between Python and R when you could easily call the R function of interest from within Pandas (using rpy2) and not even worry about data interchange.
It's definitely not vastly more work to do statistical analysis using Python Pandas than in R anymore; perhaps it was several years ago.
EDIT: And carlmcqueen mentioned feather in response, which is a collaboration between the developers of Pandas and R to create interoperable on-disk data frames for both languages. Point being, between rpy2 and feather (and probably other projects), you definitely don't need to use intermediary csv files anymore to move data back and forth between R and Python.
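For what it's worth, the R side of a feather round-trip is just a couple of calls (a sketch assuming the feather package; the Python side would use pandas.read_feather / DataFrame.to_feather):

library(feather)
df <- data.frame(id = 1:3, value = c(0.1, 0.2, 0.3))
write_feather(df, "df.feather")   # binary, typed, fast -- no csv round-trip
df2 <- read_feather("df.feather")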
Haha, what? What statistics can you do in pandas? You can do some statistics in Python by cobbling together stuff from scipy and statsmodels (maybe I'm out of date; is there more?). I see a few modules for regression and the like in pandas, but they are marked as deprecated. I think perhaps you and I mean different things by "statistics". R provides a vast ecosystem covering, for example:
- Gold standard implementations of simulation, PDFs, and quantiles of any probability distribution you can mention (in Python you can find some of this in scipy, not pandas, but scipy is a real mess compared to R and not as comprehensive)
- Gold standard implementations of any classical hypothesis test you can mention
- Gold standard implementations of computational methods for fitting generalized linear models, mixed models, frameworks for MCMC samplers, graphical models, HMMs, and a vast amount of other stuff I'm not clever enough to name right now let alone understand.
Really, any statistical procedure -- whether "classical" or "modern"/"computational statistics" -- you will find it in R, and furthermore it will be basically the reference implementation / gold standard.
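To make that concrete, here's a minimal sketch using only base R's stats package (the data are made up for illustration):

qnorm(0.975)                        # quantiles via the uniform d/p/q/r naming convention (~1.96)
pgamma(2, shape = 3, rate = 1)      # CDF of a Gamma(3, 1) at 2
x <- rnorm(30); y <- rnorm(30, mean = 0.5)
t.test(x, y)                        # a classical hypothesis test as a one-liner
fit <- glm(vs ~ mpg + wt, data = mtcars, family = binomial)  # a GLM, straight from core stats
summary(fit)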
That's not mentioning the plotting tools and the numerical computing and clean linear algebra syntax. But that's it, no more: the people who go further and suggest using R for building a web server or web scraping or something mostly haven't used real programming languages.
You're missing the point. The python ecosystem can't compete with R on the statistics front -- it would be crazy to try. That's certainly not the aim of pandas.
> you definitely don't need to use intermediary csv files anymore to move data back and forth between R and Python.
Perhaps not, but doesn't it please you to have a well-defined interface (a serialization format) between the two languages? I haven't tried rpy2 for years. I don't like to have two different languages get their tentacles into each other like that if I can avoid it, but I'm sure it's a good project which has its use cases.
EDIT: thanks, I hadn't seen feather. That looks like the thing to use.
Take a look: http://pandas.pydata.org/pandas-docs/stable/
EDIT: Specified name of interface (rpy2).
Pandas is not shipping with most of the stuff you can get on CRAN.
xts is quite useful as well, when you need to do modeling/analysis in an especially time-oriented manner:
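(The example seems to have been dropped; here's a minimal sketch of the style, assuming only the xts package:)

library(xts)
prices <- xts(cumsum(rnorm(100)), order.by = as.Date("2016-01-01") + 0:99)
head(lag.xts(prices))          # time-aware lagging
apply.weekly(prices, mean)     # aggregate by calendar period
prices["2016-02"]              # subset by date string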
R still rules for plotting and running canned statistical procedures, but sometimes I feel like if I stop programming R for a week I forget how to use it effectively... e.g. forgetting to add stringsAsFactors=FALSE to everything, forgetting rbind() can overwrite column names, forgetting I have to define my own string concatenation operator in every script.
If Python can save me some of the frustration involved in manipulating data frames that will be nice.
(Except for infix string concatenation - I've never really understood why people prefer that to paste(). Maybe if you're not thinking in vectors?)
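(For reference, the boilerplate in question is a one-liner, which is probably why people keep redefining it:)

`%+%` <- function(a, b) paste0(a, b)  # hand-rolled infix concatenation
"foo" %+% "bar"                       # "foobar"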
I do like the look of the dplyr library a lot. Combining functions like select and group_by with the pipe operators creates code that is reminiscent of SQL- very nice for readability.
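Something like this, using the built-in mtcars data (a sketch, not from the original comment):

library(dplyr)
mtcars %>%
  select(cyl, mpg, hp) %>%            # SELECT
  filter(hp > 100) %>%                # WHERE
  group_by(cyl) %>%                   # GROUP BY
  summarise(mean_mpg = mean(mpg)) %>%
  arrange(desc(mean_mpg))             # ORDER BY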
I think this thread illustrates a kind of tension between those coming from an IT/big-data/web oriented background and the more traditional statistics/science/engineering side.
The IT side bring a lot of very powerful and scalable tools to the table. However there are aspects of traditional work which I suspect are lost on some big-data people.
For example, in my line of work (physical asset mgmt) we deal with a lot of very small datasets, very poor quality datasets (e.g. some guy's favourite spreadsheet) and also cultural issues (some engineers are inherently averse to changing systems, and spending decisions are inherently political). In this situation, there is a limit to the benefit of more powerful/scalable tools, and it is advantageous to use tools which are considered high quality and vetted by the community.
R is in a good position here as it has the pedigree of being accepted by the academic stats community, as well as actually being a great tool.
Personally, I think anyone who works with data extensively should be familiar with both Python and R.
At the end of the day I just want to get my job done, and I select the best tool for the job. I have gained a lot by working with both languages.
On a side note, one cannot talk about the success of R without mentioning RStudio (an amazing IDE for working with data).
R has a wonderful ecosystem that does amazing data manipulation (dplyr, for example).
Python is a good choice but it isn't "clearly better."
Have you looked at R's source? Some common functions contain dozens of unsourced magic numbers (and the comments indicate they have been modified over time).
- Plenty of free, high quality documentation and learning materials around R (just read anything by Hadley)
- Package manager: super easy to find, install, and start using packages
- Open source / free
- Large community of users
- Extensive usage by the stats community (if a new algorithm comes out, chances R there will be an R implementation)
- Easy to build and share your own packages via GitHub
- Easy to link C++ code to your packages (see the Rcpp sketch below)
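On that last point, Rcpp is the usual route; a minimal sketch (sum_sq is just an illustrative function):

library(Rcpp)
cppFunction('
  double sum_sq(NumericVector x) {
    double total = 0;
    for (int i = 0; i < x.size(); ++i) total += x[i] * x[i];
    return total;
  }
')
sum_sq(c(1, 2, 3))  # 14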
I love R, but something about how the language feels syntactically makes it less pleasurable, programming-wise, than something like the Python data stack. But with all of the above advantages, I don't see myself switching to anything else for my data science work unless I have a really pressing need to. The other thing is that the language is so damn popular that the useR conference sold out in the pre-registration rounds. Seriously, guys, stop using and learning about R so I can get into the conference...
And don't get me started on string handling in R, or the fact that there's no way to get the path of the currently running script, or a dozen other things that are trivial in a general-purpose language but are a major pain in R. R is not 'general purpose' enough. It doesn't have to be useful for writing both kernel drivers and database REST frontends, but being able to do things that are math-related and not purely stats -- that's not too much to ask for, I'd say. Especially because it's not reasonable to ask people whose main job is not writing software to learn multiple languages/tools.
(Other recent example I remember: how unintuitive I found it to plot a sine wave and its first and second derivative. My Mathematica-oriented colleague did it in 2 minutes.)
Is this the way you did it? It seems pretty intuitive...
x <- seq(0, 2 * pi, length.out = 200)  # grid to plot over
a <- 2                                 # frequency parameter
plot(x, sin(a * x), type = "l")        # the sine wave
plot(x, a * cos(a * x), type = "l")    # first derivative
plot(x, -a^2 * sin(a * x), type = "l") # second derivative
My main problem was the derivative, not so much the plotting (or maybe it was 'plotting an arbitrary function'); but I looked it up and it seems I slightly misremembered what it was I wanted to do. I wanted to draw a cubic spline, not a sine. What I ended up doing was
library(ggplot2)
spline_x <- 1:6
spline_y <- c(0, 0.5, 2, 2, 0.5, 0)
spl_fun <- splinefun(spline_x, spline_y)  # stats::splinefun returns an interpolating function
p <- ggplot(data.frame(x = spline_x, y = spline_y), aes(x, y))
p <- p + stat_function(fun = spl_fun)
p <- p + stat_function(fun = spl_fun, args = list(deriv = 2))  # args (not arg) passes deriv through
p
Also it seems there's now a geom_xspline() (in the ggalt extension to ggplot2) which does what I need (I'm told), but that wasn't in a release version when I was doing it.
(FWIW the reason that there's no native support in ggplot2 for this sort of smoothing is that I think it's a really bad idea as it tends to distort the underlying data)
(Programming from memory on my phone so might be slightly off)
A) Stat people use R because because they don't know any better
B) Stat people use R because code monkeys use python
C) Stat people use R because stat people use R (probably this)
A few comments. I worked in pharma, and the FDA specifically requires a number of SAS routines (specific function calls) to be used when doing drug studies/clinical trials. R can't replace SAS in those cases without massive effort, because the FDA is slow and conservative and people like to have validated results.
I think the writing was on the wall for SAS when this article came out: http://www.nytimes.com/2009/01/07/technology/business-comput...
The SAS spokesperson was Anne H. Milley, director of technology product marketing at SAS: "We have customers who build engines for aircraft. I am happy they are not using freeware when I get on a jet."
To which a senior employee of Boeing pointed out that every jet they build uses R as an integral part of the design process. I think that had to be an "oh shit" moment for SAS, where they realized their strong position in stats was going to start to erode.
"""Despite some mistaken conceptions in the pharmaceutical industry, SAS is not required to be used for clinical trials. This origin of this fallacy is probably related to the fact that data must be submitted in the XPT "transport format" (which was originally created by SAS). This data format is now an open standard: XPT files can be read into R with the standard read.xport function, and exported from R with the write.xport function in the SASxport package. (And if you have legacy data in other SAS formats, there's a handy SAS macro to export XPT files.)"""
Thanks for clearing that up.
If I use R for a plot, or a simple bit of regression, or anova, or even cross-validation, I don't reference it in a paper. I only cite it if there is a package designed for a particular type of data (e.g. a Bioconductor package) or something a bit more esoteric (e.g. apcluster). About 95% of the work is data munging and - sorry Hadley - I don't cite dplyr, purrr, magrittr etc...
However I have noticed that in clinical trial or small social science papers, simple analyses of this type are often cited as being done in SPSS or SAS. I think this just reflects the fact that non-specialist data analysts are more likely to cite SAS or SPSS for simple procedures such as graphs or anova as an appeal to authority.
So I reckon the data may reflect a trend but tells us little about the true levels.
For example, I do a lot of work with data from complex surveys, and I always cite Lumley's survey package because without it I wouldn't be able to do the work. On the flip side, I use Hadley's readr package extensively because I think his I/O functions are more sane than the defaults. I'm not going to cite readr in every paper I write just because I'm too lazy to type stringsAsFactors = FALSE when I read a csv file.
R has many issues, but if you speak to statisticians you will hear that it's the closest thing they have to their own way of doing things.
It's great for bleeding-edge scientific research. The results of many languages don't always match for advanced algorithms, but the open source nature of R makes it easier to identify the problem areas.
The R-core interpreter does have a number of deficiencies. (R is based on the S language specification from the 70s, which left a lot of wiggle room.) General-purpose programming and data wrangling/engineering are best handled in other programming idioms.
`ifelse()` is a nightmare of a function but I don't think double-evaluation is ever a problem.
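For anyone who hasn't hit them, the classic ifelse() gotchas look like this:

d <- as.Date("2016-01-01")
ifelse(TRUE, d, d)      # 16801 -- attributes are dropped, so the Date comes back as a bare number
ifelse(FALSE, 1:10, 0)  # 0 -- the result takes the shape of the test, not of the branches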
There are two maintained web-clients: curl (low-level) and httr (high-level). And I think rvest does everything that nokogiri does.
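A minimal sketch of the nokogiri-style workflow in rvest (example.com as a stand-in URL):

library(rvest)
page <- read_html("https://example.com")
html_text(html_nodes(page, "h1"))         # extract text via CSS selectors, as in nokogiri
html_attr(html_nodes(page, "a"), "href")  # pull attributes off matched nodes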
Looking forward to the improvements. Appreciate all your work in the field.
* Rant mode: On
Maybe in 30 years they will also learn a true programming language and stop producing undocumented, unusable, unportable, underdeveloped libraries for research level tools and technologies.
Outside the world of neural networks it is a complete disaster, and the NN landscape is at an acceptable level only because of big companies, surely not thanks to the researchers. And the reason, of course, is that most researchers refuse to think of themselves as "software developers" and use these arcane languages, which might be good for prototyping but lack power when it comes to shipping a real product (which might also be a tool for other researchers to use).
At least they're not using Matlab where everything breaks as soon as you change machine.
* Rant mode: Off
It would be awesome if every project I did ended up with a nice, polished piece of software, but that's not what I get paid to do. I would be fired if I tried to do that.
However, I was very harsh, and of course I wouldn't find it viable to expect production-ready code, but something moderately portable would come in handy. Of course, as you said, a researcher doesn't have the time to build a well-developed library. As a solution, my university is considering the idea of hiring a dedicated developer whose job would be to maintain libraries. I really hope this happens.
Programming languages, perhaps, are less vulnerable to these issues. And perhaps open source could beat these applications eventually, given perfect competition. But we're not in that world, unfortunately.
In any event, the percentage of users skilled in software development must be higher among programming language users than among users of tools not related to software development.
I sometimes think with the churn of languages, no-one really gets deeply enough into one to really leverage it.
(Someone may have said this already, but there is no way I'm reading through all the "Python vs R" BS to find out)
It doesn't seem to have a package manager. Could it be that simple?
If you need to call one of the built-in pieces of Magic (TM) then Mathematica is OK, but if you want to build something new that needs to interface with literally anything outside of Mathematica, then Mathematica is a PITA.
The interface was trivial.
If you have a small script that makes a single one-off call to Mathematica, and the interface already exists for your language, which it probably doesn't, even if you're using an extremely popular language, and even though you're paying hundreds of dollars a year just for PERSONAL use, then things can be ok. But if you want to make a bunch of calls and keep the program running reliably then you're SOL.
Oh, and don't even think about deploying. It will cost you so much that it's more cost-effective to just rewrite the thing or do the work of switching out with a different library/tool.
In this case, I wrote that interface and open sourced it. http://library.wolfram.com/infocenter/MathSource/585/
I have no idea how it compares price-wise to SAS, though.
Check out github.com/boemska/h54s. I wouldn't normally post it like this, but SAS comes up so rarely that I figured if you're on HN and you use SAS, then you'll probably be interested. The more the merrier.
Finally! This is very encouraging, that such an excellent free software package is in such high demand. From what I've used it for, it worked very well. It's great for quickly creating nice-looking graphs and plots.
2. How do the authors unambiguously search for 'R'? Monocharacter language names are difficult search keys (C, B, S, R).
2. The author posted the exact search terms used for all languages in an earlier post.
Is Machine Learning cooling down?