
The Risky Eclipse of Statisticians - rvivek
http://blog.hackerrank.com/the-risky-eclipse-of-statisticians/
======
dworin
In my experience, Data Science isn't replacing statistics, it's replacing
Business Intelligence. And a lot of that is driven by PR and marketing
departments that saw that business intelligence was boring and trapped them in
the IT department, so they needed new buzz words. You can see the relationship
on Google Trends as well, and it's stronger than comparing it to
'statistician', which looks flat more than declining:

[https://www.google.com/trends/explore#q=data%20science%2C%20...](https://www.google.com/trends/explore#q=data%20science%2C%20business%20intelligence&cmpt=q&tz=Etc%2FGMT%2B4)

~~~
exelius
This is spot on. It's another name for BI, which was another name for
reporting, albeit with fancier tools. But fundamentally the job is about
querying, aggregating and representing data -- the methods by which these are
done may be more complex, but the goals of the data are the same. It's just
that with so much more data available, there need to be a lot more people to
build the reporting systems.

But yeah, you're right, nobody wants to do reporting because it's tedious and
doesn't pay particularly well. And really, most suits are just looking for
data that fits the story they want to tell, so the quality of the analysis
that many "data scientists" do is irrelevant. This isn't the case at every
company, but I've seen it happen enough to know it's not a rare occurrence.

------
mswen
Initially I was reluctant to associate myself with "data
science" because it might be just a passing fad. It still might be, but for
now I have started using that label.

I started in industry as "an applied or pragmatic statistician", that is,
someone trained in social science research with a strong quantitative
methodology bias. As I went along I added focus group moderation, in-depth
interviewing, competitive analysis, ROI analysis and strategy consulting... so
I started calling what I do "Research-based Consulting."

But that label doesn't seem to quite capture building taxonomies and text
indexing systems or doing latent semantic analysis. Nor does "Research-based
Consulting" capture teaching myself web development in order to create
data-focused web applications. And what about all the database work that I do
in operational systems? Or how do I fit in things like managing and
validating data collection and aggregation systems that track prices for ~10K
SKUs across multiple retail websites, combine them in a weighted algorithm
that reflects my client's business priorities and drives thousands of
automated transactions every day?

So even though I came from a background with a lot of grad-level statistical
training and even at one point somewhat identified as a statistician, it feels
like current definitions of "data scientist" capture more of what I actually
do. So I have come to be at peace with the term.

I totally agree with the points in the article about a multi-disciplinary team.
I would love to recruit people who are better than me at each sub-discipline
and figure out how to help them work together.

~~~
dworin
A large number of my clients have been research companies and consulting
firms, and one thing I've seen is that 'research based consulting' is a phrase
that research companies frequently use to sell themselves as more than
researchers, but that is completely unimportant to clients. On the client
side, it's assumed that most consultants will use research in some form or
another.

~~~
mswen
It seems that there is a class of consultants that are essentially gurus.
Their advice doesn't come out of fresh primary research on the client
situation and current market dynamics. Instead they speak from ongoing
expertise and keeping their finger on the pulse of a particular market. I know
that I sometimes used the "Research-based Consulting" phrase to set apart my
methodology: that is, the client is paying for fresh research customized to
their unique situation, with consultative interpretation and advice to
tie it all together at the end. But as I was saying, even that phrase doesn't
really capture stuff like text analysis, web scraping, web development and
operations analysis - some of these are very much development and systems
oriented.

Data science isn't a perfect label but it seems to be currently defined in a
way that fits pretty well with the mix of things that I do - so I am willing
to use it.

------
davidw
The joke that I think I saw here first:

A data scientist is a statistician who lives in San Francisco.

~~~
twit22
"I heard a couple of definitions: a data scientist is 1) a data analyst in
California or 2) a statistician under 35."

[http://blogs.gartner.com/svetlana-sicular/data-scientist-mys...](http://blogs.gartner.com/svetlana-sicular/data-scientist-mystified/)

------
washedup
It's just a name. "Data science" still requires statistics, and anyone who
calls themselves a data scientist was (hopefully) trained in statistics at
one point or another, whether formally or not. For those who don't use
statistics properly, well, they aren't doing data science properly either.
It's all statistics.

~~~
wfo
I think the name means something; data science to me means you're more focused
on the tools and less on the theory. And that's a VERY dangerous attitude to
take in statistics. It's really, really easy to apply tools and get a model
that looks quantitative and explanatory and good but is actually trash. It's
really hard to tell the difference between good work and bad work. Statistics
is very, very complicated and finicky in theory and practice and I think we'll
need a better way to tell the charlatans from the experts. Maybe the title
will work as a general rule of thumb as people self-select -- data scientist =
knows how to load a data set into R and type 'lm', statistician = actual
expert
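
To make the "looks good but is actually trash" failure concrete, here is a minimal sketch in Python (standard library only). It is not taken from the thread: the setup (pure-noise data, 200 candidate predictors, 20 training rows) and the names `pearson` and `best_noise_predictor` are illustrative assumptions. It screens many unrelated predictors, keeps the one that correlates best with the outcome, and then checks that same predictor on fresh rows.

```python
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def best_noise_predictor(n_train=20, n_holdout=2000, n_candidates=200, seed=0):
    """Pick the noise column that best 'explains' y in-sample, then re-check it.

    Every column is independent of y by construction, so any apparent
    relationship is an artifact of searching over many candidates.
    """
    rng = random.Random(seed)
    n = n_train + n_holdout
    y = [rng.gauss(0, 1) for _ in range(n)]
    candidates = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n_candidates)]

    # Selection uses the training rows only.
    best = max(candidates, key=lambda col: abs(pearson(col[:n_train], y[:n_train])))

    in_sample = pearson(best[:n_train], y[:n_train])
    out_of_sample = pearson(best[n_train:], y[n_train:])
    return in_sample, out_of_sample


in_sample_r, holdout_r = best_noise_predictor()
print(f"in-sample r = {in_sample_r:+.2f}, holdout r = {holdout_r:+.2f}")
```

The in-sample correlation of the selected predictor should look substantial, while its correlation on the held-out rows should sit near zero. Nothing in the in-sample number alone reveals that the predictor is noise.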

~~~
washedup
People who attempt data science without validating the statistics should
eventually end up with a pretty bad track record and leave the "data
scientist" job market. My perception of what data scientist currently means is
a statistician who knows how to program machine learning algorithms and manage
large data sets. Based on what you are saying, there are a lot of impostors
out there, which is most likely true. However, every time I see a job posting
for a data scientist, it almost always requires some formal math or statistics
background.

------
fgimenez
Rob Tibshirani at Stanford had a funny slide about this phenomenon when
all the ML researchers started getting more grant funding than statisticians:

[http://statweb.stanford.edu/~tibs/stat315a/glossary.pdf](http://statweb.stanford.edu/~tibs/stat315a/glossary.pdf)

------
patio11
What does the job title actually matter where the t-test hits the SQL query?
Nobody is morally opposed to well-educated professionals who can actually code
to save their lives.

------
huac
The guy quoted as saying "massive datasets ... has not yet become part of
mainstream statistical science" has been teaching a statistical computing
class[1] for PhDs for about 10 years now.

It's worth reading the president of the American Statistical Association's
take[2] on the stats - data science divide as well.

[1]:
[http://stat.wharton.upenn.edu/~buja/STAT-541/](http://stat.wharton.upenn.edu/~buja/STAT-541/)
[2]:
[http://magazine.amstat.org/blog/2013/07/01/datascience/](http://magazine.amstat.org/blog/2013/07/01/datascience/)

------
digitalzombie
\>___> I'm going for a master's in statistics and thought the premise in the
beginning was weak at best.

And then toward the middle of it, it basically said data science is a marriage
between stat and comp sci.

Congrats I have comp sci as an undergraduate.

I think it's weak at best. Data science is a jack of all trades and master of
none.

I disagree that it's just cs and stats. I know it might be pedantic, but it's
also ML, which involves math that is a bit more than stats. In math it's either
right or wrong. In stats you can kinda bend that such that it's close enough.

Some may ask what the difference is. This article doesn't address the
difference, but from what I've gathered a neural network will categorize stuff
for you, likewise with KNN, but you won't gain insight into WHY it is
categorized that way, and this is where statistics can tell you why. From
lurking in the subreddit /r/statistics, Bayesian methods will tell you why but
NN will not.

You still need statistics. It's just that this is a new field and many people
don't have the depth to grasp what's important.

It's like hyping up a NoSQL database, promising many things and getting people
to adopt it. Eventually they'll realize that it's just broken promises and
they're stuck with it. In this case, the industry can just get smarter and
have a better idea of what it really needs.

~~~
techbio
I was hoping your comment would address GIGO

------
compbio
> Why Didn’t Statisticians Own Big Data?

Because they could not translate their sense of entitlement to actual results.

> pure statisticians often scoff at the hype surrounding the rise of data
> scientists in the industry

> some statisticians simply have no interest in carrying out scientific
> methods for business-oriented data science

Statisticians are often too careful. They let tests decide if they should
continue on a certain path. Machine learning researchers run blindfolded and
trust cross-validation. The latter, though reckless, gets more impressive
results.
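
For readers unfamiliar with what "trusting cross-validation" involves, the following is a minimal sketch of k-fold cross-validation in Python (standard library only). The synthetic data, the one-variable least-squares model, and the names `fit_line` and `k_fold_mse` are assumptions chosen for illustration, not anything described in the article or this comment.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def k_fold_mse(xs, ys, k=5, seed=0):
    """Estimate out-of-sample mean squared error with k-fold cross-validation."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint held-out sets

    squared_error = 0.0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        # Fit only on the rows outside the current fold.
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        # Score only on the rows the model has not seen.
        squared_error += sum(
            (ys[i] - (slope * xs[i] + intercept)) ** 2 for i in held_out
        )
    return squared_error / len(xs)


# Synthetic example: y = 2x + 1 plus Gaussian noise with standard deviation 0.5.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [2 * x + 1 + rng.gauss(0, 0.5) for x in xs]
cv_mse = k_fold_mse(xs, ys)
print(f"5-fold CV mean squared error: {cv_mse:.3f}")
```

Because the model is correctly specified here, the cross-validated error should land close to the noise variance (0.25). The estimate only measures predictive error on data resembling the sample; it says nothing about whether the fitted relationship is causal or whether the sample is representative.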

You can perfectly well be a data scientist coming from a statistics or physics
background. Adapt to it and use your knowledge to your advantage. You can't
keep calling yourself a statistician and own data science at the same time.
Start automating yourselves, like the rest of us are.

------
kasperset
Reminds me of this talk at SciPy 2015:
[https://www.youtube.com/watch?v=TGGGDpb04Yc](https://www.youtube.com/watch?v=TGGGDpb04Yc)

~~~
twit22
Really interesting video. Thanks for sharing.

------
daveloyall
There's a mistake in the infographic-like table named "Big Data Quantified".

It says "72 hours of new video uploaded to YouTube every _day_ ".

Actually, "300 hours of video are uploaded to YouTube every _minute_ ".[1]

The source of the infographic got the "minute" part right. (And probably the
matter of 72-->300 is growth since the original infographic was produced...
Web citations are hard!)

    
    
[1]: http://www.youtube.com/yt/press/statistics.html
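
To put the unit mix-up in perspective, here is a small arithmetic sketch in Python that converts the per-minute figures quoted above into hours per day. Only the two quoted numbers (72 and 300) come from the comment; the helper name `hours_per_day` is illustrative.

```python
MINUTES_PER_DAY = 24 * 60  # 1440


def hours_per_day(hours_per_minute):
    """Convert an upload rate in hours of video per minute to hours per day."""
    return hours_per_minute * MINUTES_PER_DAY


older_rate = hours_per_day(72)    # the rate the infographic's source reported
newer_rate = hours_per_day(300)   # the rate quoted from the statistics page

print(older_rate)   # 103680 hours of video per day
print(newer_rate)   # 432000 hours of video per day
```

Writing "per day" where the source said "per minute" therefore understates the figure by a factor of 1440, independently of the later growth from 72 to 300.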

~~~
Nicholas_C
>Actually, "300 hours of video are uploaded to YouTube every minute".[1]

That's simply an incredible amount of data. The storage for all that must be
huge.

------
vcdimension
Data science needs statisticians. Big data and machine learning courses do not
teach about such things as endogeneity, confounding variables, sampling bias
etc. and how to deal with them. Data scientists that do not understand these
things could end up making big mistakes:
[https://youtu.be/0cizsKDn3TI](https://youtu.be/0cizsKDn3TI)
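
As one illustration of the confounding problem mentioned above, here is a minimal simulation sketch in Python (standard library only). The data-generating process is an assumption made for the example: a hidden variable `z` drives both `x` and `y`, while `x` has no effect on `y` at all. The function names are likewise illustrative.

```python
import random


def slope(xs, ys):
    """Least-squares slope from regressing ys on xs (with an intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def residuals(xs, ys):
    """What is left of ys after removing its linear dependence on xs."""
    b = slope(xs, ys)
    a = sum(ys) / len(ys) - b * sum(xs) / len(xs)
    return [y - (a + b * x) for x, y in zip(xs, ys)]


def simulate(n=5000, seed=0):
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]          # confounder
    x = [zi + rng.gauss(0, 1) for zi in z]           # driven by z
    y = [2.0 * zi + rng.gauss(0, 1) for zi in z]     # driven by z, NOT by x

    naive = slope(x, y)
    # Adjust for z by regressing the z-residuals of y on the z-residuals of x
    # (equivalent to including z as a second regressor).
    adjusted = slope(residuals(z, x), residuals(z, y))
    return naive, adjusted


naive_slope, adjusted_slope = simulate()
print(f"naive slope of y on x:   {naive_slope:.2f}")
print(f"slope after adjusting z: {adjusted_slope:.2f}")
```

Under these assumptions the naive regression should report a slope near 1 even though `x` has no causal effect, and the slope should drop to roughly 0 once `z` is accounted for. In real data the confounder is often unmeasured, which is where the statistical training the comment refers to becomes necessary.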

