
R Passes SAS, but Python Leaves Them Both Behind - sonabinu
http://r4stats.com/2017/02/28/r-passes-sas/
======
rm999
I'm glad to see the quick convergence on Python. I've evaluated Python every
~2 years since 2006 for "data science" tasks (machine learning, statistics,
data munging, and visualization). I'd argue that Python only properly covered
this full data science stack 1.5-2 years ago. R covered this stack adequately
probably around 2011-2012. Matlab had this before 2005.

What makes Python a superior language to Matlab and R is the ease of software
development. It's an easy, pleasant language to work in, and I trust it for
production tasks (I've written production R code and it's fairly hard to read
and fragile).

What's even better is data science is moving 100% into Python 3 (from 2) by
2020:

[http://www.python3statement.org/](http://www.python3statement.org/)

~~~
fnord123
If you work in a team mixing data engineers and data scientists then python
has been a superior choice for a decade as the different team members are all
using the same language to build the platform and to use it.

We've had a lot of success weaning people off R and MATLAB in finance (so
replace "data scientist" with "quant"). Of course if you work in a field that
doesn't have the libraries and can't build them in house then your mileage
will certainly vary.

~~~
thearn4
Off-topic, but what is a "data engineer"?

~~~
nerdponx
I'm sure someone can come along with a better description, but it's kinda like
devops/sysadmin but specifically for data storage and access.

You could also look it up...

~~~
sdenton4
At some point, people realized that 90% of data science work was building
pipes and keeping them clean. And then data engineering was invented so that
the data scientists wouldn't have to get their hands as dirty. :)

------
NumberSix
There is a question regarding the definition of "data science jobs." The
author explains his methodology in a lengthy report
[http://r4stats.com/articles/how-to-search-for-data-
science-j...](http://r4stats.com/articles/how-to-search-for-data-science-
jobs/)

The issue is what is "data science" really? In what respect is it different
from traditional statistics and data analysis and not just a new buzzphrase?

Probably many jobs using SAS could be considered "data science" but don't use
the specific buzz words and phrases that the author specifies in his
methodology to identify "data science" jobs. Thus, the headline that "R Passes
SAS" could be inaccurate, except in the sense that R is more popular among
statistics and data analysis jobs that use "data science" buzzwords and
phrases.

~~~
jl6
SAS is also a huge iceberg, without much in the way of open source culture
that lends itself to visibility. A lot of SAS happens behind closed doors at
megacorps.

~~~
R_haterade
Also notable that so many new shops are forgoing it entirely. SAS, and its
price tag, is a holdover from the days when 'analytics' was an afterthought
for companies looking to maximize profit.

Now that much less mature companies are realizing the value of 'analytics' (I
hate that word) SAS's cost doesn't really make sense.

~~~
halflings
What word do you prefer to 'analytics', if we are talking about taking data
and analysing it to extract insights?

This does not have to involve machine learning, and "pattern recognition" is
an academic term (and might be confusing for laymen).

~~~
xapata
Why not just "analysis"?

Business people seem determined to invent new jargon when our current
vocabulary is sufficient.

~~~
R_haterade
The euphemism treadmill makes my raises bigger.

------
Xcelerate
I recently finished grad school and accepted a data scientist job a couple
weeks ago in part because the job description mentioned expertise with Julia
as one of the preferred qualifications (it's rare to see a listing that
mentions Julia). It's the language I used for the bulk of my research over the
last four years, and has been improving rapidly since it was released. I like
Julia a lot more than Python, and I hope it continues rising in popularity. I
think once it hits v1.0, we'll begin to see a lot more companies adopting its
usage for data science, statistics, and machine learning.

~~~
fnord123
> I think once it hits v1.0, we'll begin to see a lot more companies adopting
> its usage for data science, statistics, and machine learning.

I don't. It doesn't have the incumbency of R, the use in other areas of
programming of Python, or a company actively marketing it like MATLAB. It's
not 5x or 10x (or whatever it takes) better than the alternatives, which it
would need to be to assert itself in the playing field.

If it means you get your work done using it, by all means use it. But I think
it will stay around clojure levels of use in data science, statistics, and
machine learning.

~~~
whyrt12
Once you can compile a Julia app (front end, back end, and probabilistic
programming/ML) to WebAssembly and have it run in the browser and on mobile,
it will skyrocket in popularity.

~~~
fnord123
Why? The only current benefit of Julia afaict is the tracing jit. If you run
the tracing jit on web assembly then it's giving up most of its performance
benefits. And Python could be built on web assembly as well.

But who knows. Weird things seem to become popular despite all the negative
points.

~~~
whyrt12
It does not have a tracing JIT, and its speed is by far not the only benefit.
See link below

It can precompile very fast code before runtime.

Python will require an interpreter and/or a hefty runtime.

[https://discourse.julialang.org/t/julia-motivation-why-
weren...](https://discourse.julialang.org/t/julia-motivation-why-werent-numpy-
scipy-numba-good-enough/2236/10?u=mikeinnes)

~~~
fnord123
Forgive me, it's not a tracing JIT but just LLVM's JIT.

Precompiling in Julia is extremely not-straight-forward. You would think you
just use --compile and it would work; but it doesn't at all.

Also, at ~850kb, Python's runtime is not that hefty. It's intended to be
embedded, and while it's quite a bit larger than Lua's 200kb, it's smaller
than libjulia's 16mb.

~~~
statsmatscats
Right, it can currently precompile to some extent, but full source-to-binary-
blob-compilation is on the roadmap. See here:
[http://juliacomputing.com/blog/2016/02/09/static-
julia.html](http://juliacomputing.com/blog/2016/02/09/static-julia.html)

Julia's runtime includes its compiler and its huge standard lib, both of which
are eventually going to be split off, IIUC.

The former because of static compilation potential and the latter into modules
that can be included piecemeal.

------
sixhobbits
I'm surprised that the author combines "the C languages" saying that most
adverts that mention any of C/C++/C# mention all three. In my experience there
is a large difference between companies searching for C# developers and those
searching for C/C++. After blurring this distinction he concludes that R and
Python are "very different languages" while I consider them to be largely
overlapping.

~~~
brogrammernot
Agreed.

So far in my search, C# leans toward Microsoft shops seeking C#/.NET whereas
C/C++ has been companies searching for embedded software roles.

~~~
metaobject
Mostly agree, but I've also seen a fair bit of C/C++ skills associated with
jobs involving *nix development environments. I've rarely seen C++ associated
with embedded jobs (even though I've read that it can certainly be used if
care is taken to avoid things like dynamic dispatch, etc)

~~~
charles-salvia
C++ is very big in finance and game development.

------
cwyers
I hate these sort of comparisons. You've got R, Python, SAS... okay, those are
sort of similar. Then you've got Java and "C, C++ or C#," and man, including
C# with C/C++ is... fraught. Then you've got Hadoop, Spark, Hive... okay,
those are all kind of different from what we've had so far. Now you've got
Tableau and RapidMiner. Uh. In the second chart, "Microsoft" is included as a
keyword. Okay. It's just... comparing apples, oranges, bananas, grapes and
pears. What's it supposed to tell us?

~~~
dredmorbius
"Microsoft was a difficult search since it appears in data science ads that
mention other Microsoft products such as Windows or SQL Server. To eliminate
such over-counting, I treated Microsoft different from the rest by including
product names such as Azure Machine Learning and Microsoft Cognitive Toolkit.
So there’s a good chance I went from over-emphasizing Microsoft to under-
emphasizing it with only 157 jobs."

Read the methodology, Luke.

~~~
cwyers
Okay, but what does that mean? Microsoft Cognitive Toolkit is like Tensorflow,
which is included as an item on the list. Azure Machine Learning is something
you can script with R or Python, and allows you to create APIs for predictive
use on Azure. What good does lumping those together do? And what does it tell
us that some things that are basically an R/Python library are less popular
than R or Python themselves? It's this weird, uneven mix of things. Some are
programming languages, some are libraries or frameworks, some are end-user
products like Tableau.

~~~
nl
This is pretty much what data science in the real world is like.

Define the question you are interested in (in this case, a somewhat reasonable
attempt to compare R/Python/SAS) and then put other things in blobs with a note
that says _this is what this blob is. Enjoy_.

~~~
dredmorbius
Bingo. Thanks. I was losing (patience|interest).

------
tom_b
Nice data munging out of Indeed.com here. The article author gives a detailed
description of searching Indeed in a write-up linked from the original article
as well ([http://r4stats.com/articles/how-to-search-for-data-
science-j...](http://r4stats.com/articles/how-to-search-for-data-science-
jobs/)).

Just playing around with the search terms from that second linked article is
also interesting - it would appear that many terms ("machine learning", "data
science", "predictive modeling", some others) show that Amazon has the largest
number of job listings from a single company - for "machine learning" Amazon
shows 1706 listings out of 12499 or almost 14% of _all_ listings . . . The way
Amazon also pops out in other data science term searches is also interesting -
at least in their job listings, Amazon seems to really be attempting to slurp
up candidates with deeper data and stats skills.
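
To make the "almost 14%" figure concrete, here is a minimal sketch of the arithmetic using the two counts quoted above. The function name `share_percent` is made up for illustration; it is not something from the article or from Indeed.

```python
def share_percent(part, total):
    """Return `part` as a percentage of `total`."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * part / total


# Counts quoted above for the "machine learning" search:
# 1706 Amazon listings out of 12499 listings overall.
amazon_share = share_percent(1706, 12499)
print(f"{amazon_share:.2f}%")  # prints 13.65%
```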

For some time I have been somewhat cynical about data science. My impression
has been that much of what has been pushed as data science jobs is thinly
veiled data reporting gigs (just plain old business intelligence). While I
still think data science is over-hyped, I think I need to reconsider just how
critical it will be as a knowledge base or skill set. While there may not be a
large number of deep learning jobs out there, the expectation that a data
hacker can perform a linear or logistic regression against a
set of gathered and cleaned data may be closer to fizz buzz than I previously
assumed.
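
For a sense of how small that baseline task is, here is a rough sketch of a one-variable linear regression (ordinary least squares) in plain Python. The function name `fit_line` and the sample data are hypothetical, chosen only for illustration; real work would more likely use a library, and logistic regression needs an iterative fit that this sketch does not cover.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up, already-cleaned data lying exactly on y = 2x + 1.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)
```

The closed-form solution works here because there is a single predictor; with several predictors the same idea becomes a matrix problem.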

I am teaching an introductory programming class (using Python) this semester
and students are definitely focused on data science as a career track.

------
IndianAstronaut
R's data analysis libraries are still far ahead of Python's: dplyr, Shiny, etc.
So much stuff is still built into R.

Places where I still use R are its easy-to-use statistical functions, handling
large amounts of missing data, etc.

------
blauditore
I once worked on a piece of software communicating with a SAS instance (doing
live decision management), but everything about it seemed sketchy. No one
using it really liked it, and its internals always looked like a black box to
me. Also, either it is terribly engineered or the devs working with it were
just bad - we wanted a JSON-based REST API, but they said that it's "not
possible with SAS", so we fell back to badly organized HTTP calls with XML.

Does anybody have some insights about internal quality and code "health" in
SAS?

~~~
R_haterade
Anecdotal, but our SAS decision support environment went down Friday. I can
report back after the post-mortem if you like.

~~~
blauditore
Yeah, would surely be interesting.

~~~
R_haterade
Nothing exciting to report. It was a resource-sizing problem and they didn't
adequately separate prod and dev environments. Someone choked the dev and the
prod went down with it.

~~~
blauditore
Thanks anyway for the insight!

------
confounded
This is a sample of non-data-science job postings, which mention buzzwords
like "big data".

Java is not a popular language for data analysis.

------
robertk
Python suuuucks.

~~~
llukas
This attempt to eli5 could be improved.

