
R Passes SAS in Scholarly Use - sndean
http://r4stats.com/2016/06/08/r-passes-sas-in-scholarly-use-finally/
======
aabajian
I think Python is the biggest hidden gem in statistics. It's had a tremendous
impact on machine learning and algorithm development, yet traditional
statisticians still rely on SAS/R/Stata/MATLAB.

All of these languages have libraries that produce the same results; the
difficulty is mangling the data into the correct input format. Python's list
comprehensions are much, much easier to use than MATLAB matrices, R's data
frames, Java's ArrayLists, etc. I'd advise any new graduate student to learn
how to plug data into traditional programs, but save yourself a headache and
perform your data manipulation in Python. Eventually you can take the leap and
do the analysis in Python as well.

~~~
ldp01
I'm glad I read this comment. After checking some of the docs I think I will
have a go at Python for data wrangling. List comprehensions look... friendly.

R still rules for plotting and running canned statistical procedures, but
sometimes I feel like if I stop programming in R for a week I forget how to
use it effectively... e.g. forgetting to add stringsAsFactors = FALSE to
everything, forgetting that rbind() can overwrite column names, forgetting
that I have to define my own string concatenation operator in every script.

If Python can save me some of the frustration involved in manipulating data
frames that will be nice.
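For the record, the workarounds I keep forgetting look something like this (a
base-R sketch; the `%+%` name is just my own convention, not anything built
in):

    # Infix string concatenation: not built in, so each script defines its own.
    `%+%` <- function(a, b) paste0(a, b)
    "data_" %+% "2016.csv"   # "data_2016.csv"

    # Keep strings as strings when building data frames
    # (before R 4.0 the default was stringsAsFactors = TRUE).
    df <- data.frame(id = 1:2, name = c("a", "b"), stringsAsFactors = FALSE)
    class(df$name)   # "character", not "factor"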

~~~
hadley
I'd recommend reading [http://r4ds.had.co.nz](http://r4ds.had.co.nz) for an R
workflow that eliminates a lot of those pain points.

(Except for infix string concatenation - I've never really understood why
people prefer that to paste(). Maybe if you're not thinking in vectors?)
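A minimal illustration of the vectorized style: one paste() call builds a
whole vector of strings, recycling shorter arguments as needed.

    paste0("sample_", 1:3)                      # "sample_1" "sample_2" "sample_3"
    paste(c("a", "b"), c("x", "y"), sep = "-")  # "a-x" "b-y"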

~~~
ldp01
A reply from the man himself! Thanks for the link. I'll have a go.

I do like the look of the dplyr library a lot. Combining functions like
select and group_by with the pipe operator creates code that is reminiscent
of SQL - very nice for readability.
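Something like this (a sketch assuming the dplyr package; the data and column
names are made up) reads almost like SQL:

    library(dplyr)

    # Made-up example data.
    flights <- data.frame(carrier = c("AA", "AA", "UA"),
                          delay   = c(10, 30, 5))

    flights %>%
      group_by(carrier) %>%                    # like SQL GROUP BY
      summarise(mean_delay = mean(delay)) %>%  # like an aggregate in SELECT
      arrange(desc(mean_delay))                # like ORDER BY ... DESC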

~~~
pedrosorio
You're going to love Spark if you haven't tried it yet.

~~~
ldp01
Another powerful tool I will keep in mind!

I think this thread illustrates a kind of tension between those coming from an
IT/big-data/web oriented background and the more traditional
statistics/science/engineering side.

The IT side brings a lot of very powerful and scalable tools to the table.
However, there are aspects of traditional work which I suspect are lost on
some big-data people.

For example, in my line of work (physical asset mgmt) we deal with a lot of
very small datasets, very poor quality datasets (e.g. some guy's favourite
spreadsheet) and also cultural issues (some engineers are inherently averse to
changing systems, and spending decisions are inherently political). In this
situation, there is a limit to the benefit of more powerful/scalable tools,
and it is advantageous to use tools which are considered high quality and
vetted by the community.

R is in a good position here as it has the pedigree of being accepted by the
academic stats community, as well as actually being a great tool.

------
uptownfunk
Some reasons I love / use R:

Plenty of _free_ high quality documentation and learning materials around R
(just read anything by Hadley)

Package manager. Super easy to find, install, and start using packages.

Open source / Free

Large community of users

Extensive usage by the stats community. (If a new algorithm comes out,
chances are there will be an R implementation.)

Easy to build and share your own packages via Github.

Easy to link C++ code to your packages.
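On the C++ point, the usual route is Rcpp; a minimal sketch (assumes the Rcpp
package and a working compiler toolchain; the function is just an example):

    library(Rcpp)

    # Compile a C++ function and bind it into the current R session.
    cppFunction('
      double sumSquares(NumericVector x) {
        double total = 0;
        for (int i = 0; i < x.size(); ++i) total += x[i] * x[i];
        return total;
      }
    ')

    sumSquares(c(1, 2, 3))   # 14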

----------------------------------------------

I love R, but something about how the language feels syntactically makes it
less pleasurable to program in than something like the Python data stack. But
with all of the above advantages, I don't see myself switching to anything
else in the future for my data science work, unless I have a really pressing
need to. The other thing is that the language is so damn popular that the
useR conference sold out in the pre-registration rounds... Seriously guys,
stop using and learning about R so I can get into the conference....

~~~
roel_v
A big problem with R is that it's _just_ stats. The other day I wanted to do a
simple loan amortization (simple PMT/IPMT in Excel). People say 'use R over
Excel!'. Right. There are some clunky barely-working packages in R that do
half of what you need and some stack overflow posts that mostly show how to do
the other half, but that's no basis to build on.

And don't get me started on string handling in R, or that there's no way to
get the path of the currently running script, or a dozen other things that are
trivial in a general purpose language but are a major pain in R. R is not
'general purpose' enough, and it doesn't have to be useful to write both
kernel drivers and database REST frontends, but being able to do things that
are math-related and not purely stats - that's not too much to ask for I'd
say. Especially because it's not reasonable to ask people whose main job is
not writing software to learn multiple languages/tools.

(Other recent example I remember: how unintuitive I found it to plot a sine
wave and its first and second derivative. My Mathematica-oriented colleague
did it in 2 minutes.)

~~~
ldp01
There are many things wrong with R but basic plotting functions are one of its
strengths.

Is this the way you did it? It seems pretty intuitive...

    a <- pi/180
    x <- 1:360
    plot(x, sin(a * x))
    plot(x, a * cos(a * x))
    plot(x, -a^2 * sin(a * x))

~~~
roel_v
But that just draws 3 separate plots.

My main problem was the derivative, not so much the plotting (or maybe it was
'plotting an arbitrary function'); but I looked it up and it seems I slightly
misremembered what it was I wanted to do. I wanted to draw a cubic _spline_,
not a _sine_. What I ended up doing was

    
    
        spline_x <- 1:6
        spline_y <- c(0, 0.5, 2, 2, 0.5, 0)
        spl_fun <- splinefun(spline_x, spline_y)
        p <- ggplot(data.frame(x=spline_x, y=spline_y), aes(x, y))
        p <- p + stat_function(fun = spl_fun)
        p <- p + stat_function(fun = spl_fun, arg = list(deriv = 2))
    

I still don't quite understand how that derivative works - ?list doesn't
mention anything about 'deriv', and there's a function called 'deriv' but I'm
not sure how that's being interpreted in the code above.

Also it seems recent versions of ggplot2 have geom_xspline() which does what I
need (I'm told) but that wasn't in the release version when I was doing it.

~~~
hadley
It took me 5 minutes to figure out how that actually did work! That is rather
esoteric code!
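The mechanism, once you see it: splinefun() returns a function with signature
function(x, deriv = 0), and stat_function() forwards the list given in its
args argument (which `arg` partially matches) as extra arguments to fun. A
base-R sketch of the same thing, without ggplot2:

    spline_x <- 1:6
    spline_y <- c(0, 0.5, 2, 2, 0.5, 0)
    spl_fun <- splinefun(spline_x, spline_y)

    spl_fun(3)             # the interpolating spline passes through (3, 2)
    spl_fun(3, deriv = 2)  # second derivative of the spline at x = 3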

(FWIW the reason that there's no native support in ggplot2 for this sort of
smoothing is that I think it's a really bad idea as it tends to distort the
underlying data)

------
dekhn
EDIT: I am corrected in regards to the SAS routines statement; see the reply.

A few comments. I worked in pharma, and the FDA specifically requires a
number of SAS routines - specific function calls - to be used when doing drug
studies/clinical trials. R can't replace SAS in those cases without massive
effort, because the FDA is slow and conservative and people like to have
validated results.

I think the writing was on the wall for SAS when this article came out:
[http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?pagewanted=all](http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?pagewanted=all)

The SAS spokesperson was Anne H. Milley, director of technology product
marketing at SAS: """We have customers who build engines for aircraft. I am
happy they are not using freeware when I get on a jet."""

to which a senior employee of Boeing pointed out that every jet they build
uses R as an integral part of the design process. I think that had to be an
"oh shit" moment for SAS, where they realized their strong position in stats
was going to start to erode.

~~~
hadley
That is not true. The FDA uses R internally, and there is no requirement that
you must use any specific software tool. See
[https://www.r-project.org/doc/R-FDA.pdf](https://www.r-project.org/doc/R-FDA.pdf)
for more details.

~~~
dekhn
OK, so my employers were wrong!
[http://blog.revolutionanalytics.com/2012/06/fda-r-ok.html](http://blog.revolutionanalytics.com/2012/06/fda-r-ok.html)

"""Despite some mistaken conceptions in the pharmaceutical industry, SAS is
not required to be used for clinical trials. This origin of this fallacy is
probably related to the fact that data must be submitted in the XPT "transport
format" (which was originally created by SAS). This data format is now an open
standard: XPT files can be read into R with the standard read.xport function,
and exported from R with the write.xport function in the SASxport package.
(And if you have legacy data in other SAS formats, there's a handy SAS macro
to export XPT files.)"""
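In R terms, the round trip described there looks roughly like this (a sketch;
`trial.xpt` is a hypothetical file name, read.xport() lives in the foreign
package that ships with R, and write.xport() needs the SASxport package):

    library(foreign)  # ships with R; provides read.xport()

    # Hypothetical path -- substitute your own transport file.
    path <- "trial.xpt"
    if (file.exists(path)) {
      trial <- read.xport(path)  # returns a data frame per data set
      str(trial)
    }

    # The reverse direction needs the SASxport package:
    # library(SASxport); write.xport(trial, file = "trial_out.xpt")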

Thanks for clearing that up.

------
Malarkey73
I'm not totally sure whether this analysis captures the true extent to which
R vs SAS vs SPSS is used.

If I use R for a plot, or a simple bit of regression, or ANOVA, or even
cross-validation, I don't reference it in a paper. I only cite it if there is
a package designed for a particular type of data (e.g. a Bioconductor
package) or something a bit more esoteric (e.g. apcluster). About 95% of the
work is data munging and - sorry Hadley - I don't cite dplyr, purrr,
magrittr, etc...

However, I have noticed that in clinical trial or small social science
papers, simple analyses of this type are often cited as being done in SPSS or
SAS. I think this just reflects the fact that non-specialist data analysts
are more likely to cite SAS or SPSS for simple procedures such as graphs or
ANOVA as an appeal to authority.

So I reckon the data may reflect a trend but tells us little about the true
levels.

~~~
hooloovoo_zoo
Why don't you cite the packages you use?

~~~
noelsusman
Nobody cites every package they use; it's not feasible. I use a lot of
packages, and some journals have a limit on the number of citations you can
have. I only cite packages when they provide specialized statistical
functionality.

For example, I do a lot of work with data from complex surveys, and I always
cite Lumley's survey package because without it I wouldn't be able to do the
work. On the flip side, I use Hadley's readr package extensively because I
think his I/O functions are more sane than the defaults. I'm not going to cite
readr in every paper I write just because I'm too lazy to type
stringsAsFactors = FALSE when I read a csv file.
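The difference in question, as a self-contained base-R sketch (note that
since R 4.0, read.csv() itself defaults to stringsAsFactors = FALSE; at the
time of this thread the default was TRUE):

    tmp <- tempfile(fileext = ".csv")
    write.csv(data.frame(id = 1:2, name = c("a", "b")), tmp, row.names = FALSE)

    old <- read.csv(tmp, stringsAsFactors = TRUE)   # the old (pre-R 4.0) default
    new <- read.csv(tmp, stringsAsFactors = FALSE)  # what readr-style I/O gives you

    class(old$name)  # "factor"
    class(new$name)  # "character"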

~~~
hadley
To me, citing readr feels like citing the company that made your pipettes.
It's just useful infrastructure.

------
kensai
The success of R in statistics (relative to Python, etc.) is that it was
designed from the beginning with statisticians and their specific needs and
approaches in mind. As much as I appreciate Python, it is a general-purpose
programming language adapted to statisticians' needs, not the other way
around.

R has many issues, but if you speak to statisticians you will hear that it's
the closest thing they have to their own way of doing things.

~~~
lottin
This is absolutely true, and actually learning R is an excellent way of
learning statistics.

~~~
Myrmornis
and a very inadvisable way to learn programming. (But I agree with what you
said.)

------
pollitos
R is really LISP with syntactic sugar and bindings to well respected high-
performance FORTRAN matrix and math optimization codes.

[http://librestats.com/2011/08/27/how-much-of-r-is-written-in-r/](http://librestats.com/2011/08/27/how-much-of-r-is-written-in-r/)

It's great for bleeding-edge scientific research. The results of many
languages don't always match for advanced algorithms, but the open source
nature of R makes it easier to identify the problem areas.

The R-core interpreter does have a number of deficiencies. (R is based on the
S language specification from the 70s, which left a lot of wiggle room.)
General-purpose programming and data wrangling/engineering are best handled
in other programming idioms.

~~~
hadley
I would love to know what you mean by data wrangling because I think R has a
lot of good tools for it.

~~~
pollitos
For example, reshaping JSON to the format an intricate R function expects. I
appreciate the great work with (d)plyr and similar packages, but it's still
work and overhead. Combined with some inefficiencies/quirks in base R
functions (does ifelse() still evaluate twice?), it's easier to go with a
widely used and respected package in a general purpose language; Nokogiri,
for example. For data engineering, consider that there is no maintained R
package for a web client, and asynchronous programming is weak.

~~~
hadley
JSON is often a pain because it's so hierarchical and un-dataframe like. I
have a few notes on working with it here:
[http://r4ds.had.co.nz/hierarchy.html](http://r4ds.had.co.nz/hierarchy.html).
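The usual rectangling step, sketched with the jsonlite package (the JSON here
is made up):

    library(jsonlite)

    json <- '[
      {"name": "a", "scores": {"math": 90, "stats": 85}},
      {"name": "b", "scores": {"math": 70, "stats": 95}}
    ]'

    # fromJSON() simplifies an array of objects into a data frame;
    # flatten() pulls nested objects up into ordinary columns.
    df <- flatten(fromJSON(json))
    names(df)  # "name" "scores.math" "scores.stats"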

`ifelse()` is a nightmare of a function but I don't think double-evaluation is
ever a problem.

There are two maintained web-clients: curl (low-level) and httr (high-level).
And I think rvest does everything that nokogiri does.

~~~
pollitos
Thanks for the link on JSON and your packages, great work as always. I should
clarify: when I said web client, I meant a websockets client to consume
feeds. The last time I tried, the only R package (r-websockets) just crashed
my Linux box and hadn't been maintained for several years. httr doesn't do
websockets, as I understand it. It seems like a fundamental way to
engineer/wrangle data into R.

~~~
hadley
You can do it with httpuv, but it might be a bit clunky. I think better
websockets support, and better async generally, is on the roadmap for the
next year.

~~~
pollitos
For httpuv, the author Joe Cheng has advised that it cannot act as a
websocket client:
[https://stackoverflow.com/questions/28120307/how-to-interact-with-websocket-from-within-r](https://stackoverflow.com/questions/28120307/how-to-interact-with-websocket-from-within-r)

Looking forward to the improvements. Appreciate all your work in the field.

------
jiiam
Good.

* Rant mode: On

Maybe in 30 years they will also learn a true programming language and stop
producing undocumented, unusable, unportable, underdeveloped libraries for
research-level tools and technologies.

Outside the world of Neural Networks it is a complete disaster, and the NN
landscape is at an acceptable level only because of big companies, surely not
thanks to the researchers. And the reason, of course, is that most
researchers refuse to think of themselves as "software developers" and use
these arcane languages, which might be good for prototyping but lack power
when it comes to shipping a real product (which might also be a tool for
other researchers to use).

At least they're not using Matlab where everything breaks as soon as you
change machine.

* Rant mode: Off

~~~
noelsusman
I mean, I won't argue against having better code and documentation, but it's
not really our job to ship a real product. Shipping well documented, easily
usable, ultra portable, well developed libraries takes a fuckload of time,
resources, and expertise that we don't have. Our primary job is to ship ideas.

It would be awesome if every project I did ended up with a nice, polished
piece of software, but that's not what I get paid to do. I would be fired if I
tried to do that.

~~~
jiiam
Fair enough. But, speaking as a researcher, I often find myself reading
through hundreds of lines of code and rebuilding routines from scratch in
order to reproduce and expand on what others did in their work.

However, I was very harsh, and of course I wouldn't find it viable to expect
production-ready code, but something moderately portable would come in handy.
Of course, as you said, a researcher doesn't have the time to build a
well-developed library. As a solution, my university is considering the idea
of hiring a dedicated developer whose job would be to maintain libraries. I
really hope this happens.

------
rsrsrs86
Am I right to believe that there is no way that proprietary scientific
software can keep up with open source?

~~~
LeifCarrotson
No. Some have the benefit of proprietary modules (FPGA toolchains), some have
large libraries of pre-entered and organized data (Mathematica), and some have
early access to hardware (LabView, CUDA).

Programming languages, perhaps, are less vulnerable to these issues. And
perhaps open source could beat these applications eventually, given perfect
competition. But we're not in that world, unfortunately.

~~~
kpil
I think that open source _programming languages_ will always win in the long
run, since the target customer base knows how to program and extend the tools.

~~~
gaius
I'm not sure that logically follows. What percentage of Python users actually
know the underlying C well enough to make changes to the language? Even the
number who know how to write bindings is tiny overall.

~~~
kpil
I'm fairly sure that most of them would be able to wing it if they had to. C
isn't exactly that special.

In any event, the percentage of users skilled in software development must be
higher than among users of tools not related to software development.

------
gaius
One thing that interests me is language power, vs experience. Let's say you
had 1 year experience in language X. Language Y comes along that is better in
some way. In another year, would you be happier and more productive with 2
years experience of X, or one year of Y?

I sometimes think with the churn of languages, no-one really gets deeply
enough into one to really leverage it.

------
mikeskim
The way I use Python in machine learning is quite different from how many
others in competitive ML use Python. I use pure Python 2.7 with PyPy and try
not to touch numpy, scipy, pandas, etc. R's data.table is possibly faster
than Python's numpy/scipy/pandas. I think anyone choosing Python because of
numpy/scipy/pandas is really being misled. You should be using Python in
spite of the need to rely upon numpy/scipy/pandas. If you really need
numpy/scipy/pandas, just use R and data.table, which is amazingly fast. I
think Python is really great because of PyPy and the strength of the standard
Python library.
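For reference, the data.table idiom in question (a sketch assuming the
data.table package; grouped aggregation happens inside the `[` call):

    library(data.table)

    dt <- data.table(carrier = c("AA", "AA", "UA"),
                     delay   = c(10, 30, 5))

    # The j-expression is computed once per group, in optimized C code paths.
    dt[, .(mean_delay = mean(delay)), by = carrier]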

------
mdjt
This is great news! It excites me to see the continued support of open-source
and well-maintained programming languages in academics.

(Someone may have said this already, but there is no way I'm reading through
all the "Python vs R" BS to find out)

------
sillysaurus3
Mathematica isn't being used at all? That's surprising. Mathematica is
wonderful. I wonder what's holding it back?

It doesn't seem to have a package manager. Could it be that simple?

~~~
throwawaysocks
The interfaces are terrible. The price is high.

If you need to call one of the built-in pieces of Magic (TM) then Mathematica
is OK, but if you want to build something new that needs to interface with
literally anything outside of Mathematica, then Mathematica is a PITA.

~~~
dekhn
Actually, the whole Mathematica kernel is exposed via a C API. I wrote a
Python-Mathematica bridge based on this and it was wonderful. You could sit in
Python, and send Python expressions with variables, etc, to Mathematica for
evaluation, and get the results back as Python objects.

The interface was trivial.

~~~
throwawaysocks
I've worked with these types of bridges before. They are terrible if you want
to keep the program running and intermittently call Mathematica throughout the
course of a multi-hour session.

If you have a small script that makes a single one-off call to Mathematica,
_and the interface already exists for your language, which it probably
doesn't, even if you're using an extremely popular language, and even though
you're paying hundreds of dollars a year just for PERSONAL use_, then things
can be ok. But if you want to make a bunch of calls and keep the program
running reliably then you're SOL.

Oh, and don't even _think about_ deploying. It will cost you so much that it's
more cost-effective to just rewrite the thing or do the work of switching out
with a different library/tool.

~~~
dekhn
I don't understand why you consider it a problem to intermittently call
Mathematica's kernel in a multi-hour session. There's nothing that would make
this not work. The Mathematica C interface launches a copy of the kernel and
communicates with it over a straightforward protocol.

In this case, I _wrote_ that interface and open sourced it.
[http://library.wolfram.com/infocenter/MathSource/585/](http://library.wolfram.com/infocenter/MathSource/585/)

------
benbenolson
Just noticed that this was posted by my old (half) boss!

Finally! This is very encouraging, that such an excellent free software
package is in such high demand. From what I've used it for, it worked very
well. It's great for quickly creating nice-looking graphs and plots.

------
dredmorbius
1. This is a good place for use of log or semi-log plots.

2. How do the authors unambiguously search for 'R'? Monocharacter language
names are difficult search keys. (C, B, S, R)

~~~
LionessLover
To 2: When I search for just "R", even in an anonymous window (so it should
not use my history), I get as the first suggestion a link to
[https://www.r-project.org/](https://www.r-project.org/) - the home of R.
What else is there for that letter that is equally popular? "R" is "hip" and
trending. Microsoft not too long ago started a big push into the R space and
now regularly generates headlines around the system, accelerating the trend
even more.

~~~
jhbadger
Even weirder, a Google search for "xlispstat" seems to bring up more R hits
that don't even mention xlispstat than actual xlispstat ones. Some weird
algorithm is associating R and xlispstat as both relating to statistics and,
because R is much more popular these days, prioritizing R over xlispstat.

~~~
yoplait_
It's not the search engine's fault; it's the content. R is associated with
xlispstat as the credible alternative, e.g. in Jan de Leeuw's JStatSoft
paper. Also, the author of xlispstat is one of the main contributors to R.

------
daix
>Note that the decline in the number of articles that used SPSS or SAS is not
balanced by the increase in the other software shown in this particular graph.

Is Machine Learning cooling down?

------
thefastlane
no mention of Incanter... is it really that niche?

~~~
jhbadger
Is it still alive? I remember being excited about it when it came out
initially but then things seemed to slow and then stop.

------
erik_landerholm
R? Everyone doing most anything interesting has been using Python for a while.

~~~
Jweb_Guru
I suppose there is no interesting science outside of computers?

