Thesis of the article: "Insightful statisticians have for at least 50 years been laying the groundwork for constructing that would-be entity as an enlargement of traditional academic statistics. This would-be notion of Data Science is not the same as the Data Science being touted today, although there is significant overlap. The would-be notion responds to a different set of urgent trends - intellectual rather than commercial. Facing the intellectual trends needs many of the same skills as facing the commercial ones and seems just as likely to match future student training demand and future research funding trends."
"Data Science" is currently a field where corporations are winning the fight to define what activity universities should support. This article is a nice pull back in the other direction, focusing on the intellectual rather than the commercial elements.
Unfortunately, if I look at the (sometimes very commendable!) public data releases from, e.g., governments (data.gov, data.gov.uk, etc.), there are always some shortfalls. They are either too flat -- a classic census table, which is of limited interest because you can deal with it without any innovative data analysis techniques -- or just not large or deep enough to give you the big picture. For instance, in human mobility, a government will provide you with a mobility survey: a sample of self-reported trips recalled from memory for only one target day. But phone call records can cover everyone, over a long period of time, without relying on people's faulty memory of their movements. If you want to understand, e.g., how disease spreads in a city, or how people are cut off by public transportation and infrastructure shortcomings, the survey results are necessarily going to be worse than the ones you'd get with call metadata.
For more on the history of data science, here are references from a similar talk by Chris Wiggins: http://bitly.com/icerm
I don't remember ever hearing the terms "Data Science" or "Big Data," but I do recall taking Department of Statistics courses with titles like "Data Mining," "Statistical & Machine Learning," and "Statistical Computing." We even sort of worked with what is now called "Big Data," by learning how to run large calculations in parallel using R's cluster computing packages.
As fancy and interesting as those courses were, I would only have a superficial understanding had I not also been exposed to more foundational/theoretical courses/topics like "Probability," "Measure Theory," "Mathematical Statistics," "Linear Regression," "Time Series Analysis," "Applied Stochastic Processes," etc.
When it comes to real-life practical implementation of all these ideas, it is necessary to have a pretty strong background in computer science. It's not enough to be able to do a couple of runs in R or Hadoop. What is really called for here is at least an undergraduate level of knowledge in all the traditional areas of computer science, like programming languages, databases, computer systems, algorithms & data structures, etc.
Finally, the fourth ingredient is experience. The only way to really learn data analysis is through practice. It takes hundreds of hours of staring at data and code, struggling hard to find the relevant patterns in your data and improve the predictive performance of models. I guess this is the main advantage of pursuing a modern "Data Science" graduate degree: presumably you will spend a lot of time practicing data analysis on "real" data.
What I think irks people about the Data Science trend is that there seem to be a lot of people out there saying that you don't really need to be educated in mathematics and statistics. It's like saying that a mechanical engineer just has to know how to use 3D modeling computer software and doesn't need to know physics or mathematics ... that would lead to disastrous outcomes.
It's also like saying a Software Engineer just needs to know how to "code" and doesn't need to know math or physics. The cornerstone of any Engineering profession is deep knowledge of the underlying science and mathematics and their applications to your specific discipline. A paper similar to this one, offering commentary on the current trend of conflating programming with Software Engineering, is sorely needed.