The Risky Eclipse of Statisticians (hackerrank.com)
43 points by rvivek on July 20, 2015 | 24 comments



In my experience, Data Science isn't replacing statistics, it's replacing Business Intelligence. A lot of that is driven by PR and marketing departments that saw that business intelligence was boring and trapped them in the IT department, so they needed new buzzwords. You can see the relationship on Google Trends as well, and it's stronger than the relationship with 'statistician', which looks flat rather than declining:

https://www.google.com/trends/explore#q=data%20science%2C%20...


This is spot on. It's another name for BI, which was another name for reporting, albeit with fancier tools. But fundamentally the job is about querying, aggregating and representing data -- the methods by which these are done may be more complex, but the goals of the data are the same. It's just that with so much more data available, there need to be a lot more people to build the reporting systems.

But yeah, you're right, nobody wants to do reporting because it's tedious and doesn't pay particularly well. And really, most suits are just looking for data that fits the story they want to tell, so the quality of the analysis that many "data scientists" do is irrelevant. This isn't the case at every company, but I've seen it happen enough to know it's not a rare occurrence.


Initially I was reluctant to associate myself with "data science" because it might be just a passing fad. It still might be, but for now I have started using that label.

I started in industry as "an applied or pragmatic statistician", that is, someone trained in social science research with a strong quantitative methodology bias. As I went along I added focus group moderation, in-depth interviewing, competitive analysis, ROI analysis and strategy consulting... so I started calling what I do "Research-based Consulting."

But that label doesn't quite capture building taxonomies and text indexing systems or doing latent semantic analysis. Nor does "Research-based Consulting" capture teaching myself web development in order to create data-focused web applications. And what about all the database work that I do in operational systems? Or how do I fit in things like managing and validating data collection and aggregation systems that track prices for ~10K SKUs across multiple retail websites, combine them in a weighted algorithm that reflects my client's business priorities, and drive thousands of automated transactions every day?

So even though I came from a background with a lot of grad-level statistical training, and even at one point somewhat identified as a statistician, it feels like current definitions of "data scientist" capture more of what I actually do. So I have come to be at peace with the term.

I totally agree with the points in the article about a multi-disciplinary team. I would love to recruit people who are better than me at each sub-discipline and figure out how to help them work together.


A large number of my clients have been research companies and consulting firms, and one thing I've seen is that 'research based consulting' is a phrase that research companies frequently use to sell themselves as more than researchers, but that is completely unimportant to clients. On the client side, it's assumed that most consultants will use research in some form or another.


It seems that there is a class of consultants that are essentially gurus. Their advice doesn't come out of fresh primary research on the client situation and current market dynamics. Instead they speak from ongoing expertise and keeping their finger on the pulse of a particular market. I know that I sometimes used the "Research-based Consulting" phrase to set apart my methodology: the client is paying for fresh research customized to their unique situation, with a consultative interpretation and advice to tie it all together at the end. But as I was saying, even that phrase doesn't really capture stuff like text analysis, web scraping, web development and operations analysis. Some of these are very much development and systems oriented.

Data science isn't a perfect label but it seems to be currently defined in a way that fits pretty well with the mix of things that I do - so I am willing to use it.


The joke that I think I saw here first:

A data scientist is a statistician who lives in San Francisco.


"I heard a couple of definitions: a data scientist is 1) a data analyst in California or 2) a statistician under 35."

http://blogs.gartner.com/svetlana-sicular/data-scientist-mys...


It's just a name. "Data science" still requires statistics, and anyone who calls themselves a data scientist was (hopefully) trained in statistics at one point or another, whether formally or not. For those who don't use statistics properly, well, they aren't doing data science properly either. It's all statistics.


Definitely agree.

Part of the problem is that data science doesn't have nearly the same formalism in its definition that statistics does. What's the difference between BI analysts, Data Miners, Data Analysts, Data Scientists, etc.? The tools used to arrive at conclusions (R vs. Python vs. SAS vs. Tableau/Excel/SPSS) don't seem like a good way of differentiating the roles.

A more useful discriminator would be the application of statistics (BI vs. Biostatistician, for instance), the depth and complexity of the statistical algorithms used, and whether the main use is stat inference or prediction (machine learning doesn't seem to focus on inference a whole lot, for example).


> For those who don't use statistics properly, well, they aren't doing data science properly either. It's all statistics.

I agree wholeheartedly. It's not just data science, but all science that requires good statistics.

EDIT: IANADS, but data science seems tightly intertwined with statistics, to such a degree that I've had to double-check the difference multiple times. (Seems to be mostly terminology-based, tbh.)


I think the name means something. Data science, to me, means you're more focused on the tools and less on the theory. And that's a VERY dangerous attitude to take in statistics. It's really, really easy to apply tools and get a model that looks quantitative and explanatory and good but is actually trash. It's really hard to tell the difference between good work and bad work. Statistics is very, very complicated and finicky in theory and practice, and I think we'll need a better way to tell the charlatans from the experts. Maybe the title will work as a general rule of thumb as people self-select -- data scientist = knows how to load a data set into R and type 'lm', statistician = actual expert.
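
To make the "looks good but is actually trash" point concrete, here is a minimal sketch in Python (my own illustration, not something from the article or from R's 'lm'). It assumes a made-up setup: 40 observations, 30 predictors, and a response that is pure noise, fitted by ordinary least squares with NumPy. All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_predictors = 40, 200, 30  # hypothetical sizes


def make_noise_data(n):
    # Intercept column plus predictors that have nothing to do with y.
    X = np.column_stack([np.ones(n), rng.normal(size=(n, n_predictors))])
    y = rng.normal(size=n)  # response is pure noise
    return X, y


def r_squared(X, y, coef):
    # 1 - SS_residual / SS_total
    resid = y - X @ coef
    centered = y - y.mean()
    return 1.0 - (resid @ resid) / (centered @ centered)


X_train, y_train = make_noise_data(n_train)
X_test, y_test = make_noise_data(n_test)

# Ordinary least squares fit, the same kind of model 'lm' would fit.
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

r2_in_sample = r_squared(X_train, y_train, coef)
r2_held_out = r_squared(X_test, y_test, coef)

print(f"in-sample R^2: {r2_in_sample:.2f}")
print(f"held-out R^2:  {r2_held_out:.2f}")
```

With this many predictors relative to observations, the in-sample fit should look impressive even though there is no signal at all, while the held-out fit should be worse than just predicting the mean. Checking on data the model never saw is one cheap way to tell the two kinds of work apart.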


People who attempt data science without validating the statistics should eventually end up with a pretty bad track record and leave the "data scientist" job market. My perception of what data scientist currently means is a statistician who knows how to program machine learning algorithms and manage large data sets. Based on what you are saying, there are a lot of impostors out there, which is most likely true. However, every time I see a job posting for a data scientist, it almost always requires some formal math or statistics background.


Rob Tibshirani at Stanford had a funny slide about this phenomenon when all the ML researchers started getting more grant funding than statisticians:

http://statweb.stanford.edu/~tibs/stat315a/glossary.pdf


What does the job title actually matter where the t-test hits the SQL query? Nobody is morally opposed to well-educated professionals who can actually code to save their lives.


The guy quoted as saying "massive datasets ... has not yet become part of mainstream statistical science" has been teaching a statistical computing class[1] for PhD's for about 10 years now.

It's worth reading the president of the American Statistical Association's take[2] on the stats - data science divide as well.

[1]: http://stat.wharton.upenn.edu/~buja/STAT-541/

[2]: http://magazine.amstat.org/blog/2013/07/01/datascience/


>___> I'm going for a master's in statistics and thought the premise at the beginning was weak at best.

And then toward the middle, it basically said data science is a marriage between stats and comp sci.

Congrats, I have comp sci as an undergraduate degree.

I think it's weak at best. Data science is a jack of all trades and master of none.

I disagree that it's just CS and stats. I know it might be pedantic, but it's also ML, which involves math that is a bit more than stats. Math is either right or wrong. With stats you can kind of bend that, so that it's close enough.

Some may ask what the difference is. This article doesn't address the difference, but from what I've gathered, a neural network will categorize stuff for you, likewise with KNN, but you won't gain insight into WHY it is categorized that way, and this is where statistics can tell you why. From lurking in the subreddit /r/statistic, Bayesian methods will tell you why but NNs will not.

You still need statistics. It's just that this is a new field and many people don't have the depth to grasp what's important.

It's like hyping up a NoSQL database, promising many things, and getting people to adopt it. Eventually they'll realize that it's just broken promises and they're stuck with it. In this case, the industry can just get smarter and have a better idea of what it really needs.


I was hoping your comment would address GIGO


> Why Didn’t Statisticians Own Big Data?

Because they could not translate their sense of entitlement to actual results.

> pure statisticians often scoff at the hype surrounding the rise of data scientists in the industry

> some statisticians simply have no interest in carrying out scientific methods for business-oriented data science

Statisticians are often too careful. They let tests decide if they should continue on a certain path. Machine learning researchers run blindfolded and trust cross-validation. The latter, though reckless, gets more impressive results.

You can perfectly well be a data scientist coming from a statistics or physics background. Adapt to it and use your knowledge to your advantage. You can't keep calling yourself a statistician and own data science at the same time. Start automating yourselves, like the rest of us are.
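
For anyone unsure what "trust cross-validation" means in practice, here is a rough sketch of k-fold cross-validation in Python. It is only an illustration under assumptions of my own: a simulated data set with one real predictor (true slope 2, noise standard deviation 1), a plain least-squares model, and a hypothetical helper called `k_fold_mse`. Real ML pipelines would normally use a library for this.

```python
import numpy as np


def k_fold_mse(X, y, k=5, seed=0):
    """Average held-out mean squared error of least squares over k folds."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(y))
    folds = np.array_split(shuffled, k)
    fold_errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit on k-1 folds, score on the fold that was left out.
        coef, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        resid = y[test_idx] - X[test_idx] @ coef
        fold_errors.append(np.mean(resid ** 2))
    return float(np.mean(fold_errors))


# Simulated data (assumption): y = 2 * x + noise with unit variance.
rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 2.0 * x + rng.normal(size=n)

cv_mse = k_fold_mse(X, y, k=5)
print(f"5-fold CV mean squared error: {cv_mse:.2f}")
```

The model is judged only on rows it did not see during fitting, so no distributional test has to pass first. That is the "blindfolded" part: it tells you how well the model predicts, not whether its assumptions or coefficients make sense.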


Reminds me of this talk at SciPy 2015: https://www.youtube.com/watch?v=TGGGDpb04Yc


Really interesting video. Thanks for sharing.


There's a mistake in the infographic-like table named "Big Data Quantified".

It says "72 hours of new video uploaded to YouTube every day".

Actually, "300 hours of video are uploaded to YouTube every minute".[1]

The source of the infographic got the "minute" part right. (And probably the matter of 72-->300 is growth since the original infographic was produced... Web citations are hard!)

  1. http://www.youtube.com/yt/press/statistics.html
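
To show how far off the "every day" wording is, here is the arithmetic as a small Python sketch. It uses only the two per-minute figures quoted above (72 and 300); the function and variable names are my own.

```python
MINUTES_PER_DAY = 60 * 24  # 1,440 minutes in a day


def hours_uploaded_per_day(hours_per_minute):
    # Convert an upload rate in hours-per-minute to hours-per-day.
    return hours_per_minute * MINUTES_PER_DAY


older_figure_per_day = hours_uploaded_per_day(72)     # 103,680 hours per day
current_figure_per_day = hours_uploaded_per_day(300)  # 432,000 hours per day

# Writing "per day" instead of "per minute" understates the rate
# by a factor equal to the number of minutes in a day.
understatement_factor = older_figure_per_day / 72

# The same daily volume expressed as days of continuous video.
days_of_video_per_day = current_figure_per_day / 24  # 18,000 days

print(older_figure_per_day, current_figure_per_day)
print(understatement_factor, days_of_video_per_day)
```

So the infographic's wording is off by a factor of 1,440, before even accounting for the growth from 72 to 300 hours per minute.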


>Actually, "300 hours of video are uploaded to YouTube every minute".[1]

That's simply an incredible amount of data. The storage for all that must be huge.


Data science needs statisticians. Big data and machine learning courses do not teach about such things as endogeneity, confounding variables, sampling bias etc. and how to deal with them. Data scientists that do not understand these things could end up making big mistakes: https://youtu.be/0cizsKDn3TI
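
A minimal sketch of the confounding problem, in Python with NumPy. This is a simulation under assumptions of my own, not an example from the linked video: a hidden variable z drives both x and y, and x has no effect on y at all. The variable names and coefficients are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# Assumed data-generating process: z is the confounder.
z = rng.normal(size=n)
x = z + rng.normal(size=n)        # x depends on z
y = 2.0 * z + rng.normal(size=n)  # y depends on z only, never on x

ones = np.ones(n)

# Naive model: regress y on x alone.
naive_coef, *_ = np.linalg.lstsq(np.column_stack([ones, x]), y, rcond=None)
naive_slope = naive_coef[1]

# Adjusted model: regress y on x and the confounder z.
adj_coef, *_ = np.linalg.lstsq(np.column_stack([ones, x, z]), y, rcond=None)
adjusted_slope = adj_coef[1]

print(f"naive slope on x:    {naive_slope:.2f}")
print(f"adjusted slope on x: {adjusted_slope:.2f}")
```

Under these assumptions the naive slope should come out near 1 (the theoretical value is cov(x, y) / var(x) = 2 / 2), which looks like a solid effect, while the slope after adjusting for z should be near 0. Both models would predict y reasonably well, so a pure prediction metric would not flag the problem. In real data the confounder is often unmeasured, which is where the statistical training comes in.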



