
50 years of Data Science [pdf] - revorad
https://dl.dropboxusercontent.com/u/23421017/50YearsDataScience.pdf
======
rubidium
A great read.

Thesis of the article: "Insightful statisticians have for at least 50 years
been laying the groundwork for constructing that would-be entity as an
enlargement of traditional academic statistics. This would-be notion of Data
Science is not the same as the Data Science being touted today, although there
is significant overlap. The would-be notion responds to a different set of
urgent trends - intellectual rather than commercial. Facing the intellectual
trends needs many of the same skills as facing the commercial ones and seems
just as likely to match future student training demand and future research
funding trends"

"Data Science" is currently a field where corporations are winning in defining
what activity academic universities should support. This article is a nice
pull back in the other direction, and focusing on the intellectual rather than
commercial elements.

~~~
mikk14
Not sure if I completely agree. The problem, as I see it, is that corporations
are among the most important data collectors out there, and sometimes the most
accessible. Personally, I am doing research with data from supermarket chains,
credit card companies, and telecommunication operators. However, I always
strive to make the paper mainly about an intellectual challenge (central place
theory, modeling inclusive economic growth, studying the effects of the
dynamics of human mobility, ...) and use the commercial application only as a
side note. So even if the commercial intent is there and the corporation's
interest is visible in the paper, the attempt is to treat it as just a "price
to pay" for increasing "more general" knowledge.

Unfortunately, if I look at the (sometimes very commendable!) efforts at
public data release from, e.g., governments (data.gov, data.gov.uk, etc.),
there are always some shortfalls. The data are either too flat -- a classic
census table, which is of limited interest because you can deal with it
without needing innovative data analysis techniques -- or just not large or
deep enough to get a big picture from. For instance, in human mobility, a
government will provide you a mobility survey: a sample of self-reported trips
recalled from memory for only one target day. But a phone call record can give
you _all_ people, for a _long_ period of time, _without_ relying on people's
faulty memory of their movements. If you want to understand, e.g., how disease
spreads in a city, or how people are cut off by public transportation and
infrastructure shortcomings, the survey results are necessarily going to be
worse than the ones you'd get with call metadata.

------
chestervonwinch
This is a very nice read. Updating my worldview ... I had no idea that John
Tukey (co-inventor of the fast Fourier transform) was such a large figure in
the history of data analysis. In fact, his paper on exploratory data analysis
has more citations than his paper on the FFT, according to Google Scholar. I
was also surprised to see that he was a co-author of the projection pursuit
algorithm. Wow!

------
zatkin
According to a recent report by Glassdoor, Data Scientist is ranked #1 for "25
Best Jobs For Work-Life Balance" [1].

[1] [http://www.glassdoor.com/blog/25-jobs-worklife-balance-2015/](http://www.glassdoor.com/blog/25-jobs-worklife-balance-2015/)

------
maximz
I saw David Donoho give this talk live in September at Princeton's Tukey
Centennial conference -- fantastic, and well worth a read. IIRC, it gives a
good history of data analysis, ways to think about the different definitions
of and roles for data science, and an introduction to Tukey's work.

For more on the history of data science, here are references from a similar
talk by Chris Wiggins: [http://bitly.com/icerm](http://bitly.com/icerm)

------
jayvanguard
Great summary, and much needed at this time to make sense of a number of
trends. A minor nit: I think it would be better if he didn't overload the
long-ago-claimed term "Data Modeling" and instead specifically called it
Generative and Predictive Modeling.

------
n00b101
This is a very good read, but I have to say that I managed to learn all of
this 12 years ago, before the term "Data Science" existed. It was quite easy,
as I was a statistics, comp sci, and applied math "triple major."

I don't remember ever hearing the terms "Data Science" or "Big Data," but I do
recall taking Department of Statistics courses with titles like "Data Mining,"
"Statistical & Machine Learning," and "Statistical Computing." We even sort of
worked with what is now called "Big Data," by learning how to run large
calculations in parallel using R's cluster computing packages.
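
The pattern described here -- farming independent statistical computations out
to worker processes, as R's cluster computing packages do -- can be sketched in
a few lines. This is an illustrative analogue in Python (not the original R
coursework), using a hypothetical bootstrap-the-mean task:

```python
from multiprocessing import Pool
import random

def bootstrap_mean(seed):
    # One bootstrap resample of a small fixed dataset (made-up numbers);
    # each replicate is independent, so replicates can run in parallel.
    rng = random.Random(seed)
    data = [2.0, 3.5, 1.0, 4.2, 5.1, 2.8]
    resample = [rng.choice(data) for _ in data]
    return sum(resample) / len(resample)

if __name__ == "__main__":
    # Distribute 1000 bootstrap replicates across 4 worker processes,
    # then summarize the sampling distribution of the mean.
    with Pool(processes=4) as pool:
        means = pool.map(bootstrap_mean, range(1000))
    print(f"bootstrap mean range: {min(means):.2f} to {max(means):.2f}")
```

The point is structural: because each replicate depends only on its seed, the
same code scales from one machine to a cluster by swapping the pool backend,
which is essentially what R's `parallel`-style packages offer.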

As fancy and interesting as those courses were, I would only have a
superficial understanding had I not also been exposed to more
foundational/theoretical courses/topics like "Probability," "Measure Theory,"
"Mathematical Statistics," "Linear Regression," "Time Series Analysis,"
"Applied Stochastic Processes," etc.

When it comes to real-life practical implementation of all these ideas, it is
necessary to have a pretty deep background in computer science. It's not
enough to be able to do a couple of runs in R or Hadoop. What is really called
for here is at least an undergraduate level of knowledge in all the
traditional areas of computer science: programming languages, databases,
computer systems, algorithms & data structures, etc.

Finally, the fourth ingredient is experience. The only way to really learn
data analysis is through practice. It takes hundreds of hours of staring at
data and code, struggling hard to find the relevant patterns in your data and
improve the predictive performance of models. I guess this is the main
advantage of pursuing a modern "Data Science" graduate degree: presumably you
will spend a lot of time practicing data analysis on "real" data.

What I think irks people about the Data Science trend is that there seem to be
a lot of people out there saying that you don't really need to be educated in
mathematics and statistics. It's like saying that a mechanical engineer just
has to know how to use 3D modeling computer software and doesn't need to know
physics or mathematics ... that would lead to disastrous outcomes.

~~~
thinkmoore
I think you've nailed it precisely. What is irksome (scary?) isn't the sudden
upsurge in interest in data analysis (great, let's do science!), but the
complete lack of acknowledgement of the importance of statistical theory and
knowledge. As the paper mentions, almost all of the new "Data Science"
academic programs being created have very little communication with, let alone
integration with, real statisticians. Garbage in, garbage out, no matter how
much data or how fancy a program.

~~~
east2west
I submit that computer science departments have enough expertise to operate a
graduate data science program. Maybe not every CS department, but enough. It
is interesting that the actual stats departments do not encompass all
statistical knowledge and expertise. There are more specialized experts in
econ, finance, and many fields of social science, not to speak of all the
engineering fields, computer science, and applied math. Data science predates
modern statistics; Laplace used least squares while ignoring "the God
hypothesis"; the function-approximation view taken in many engineering fields
is just as powerful as the statistical approach.
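
The least-squares fitting invoked here fits in a few lines. As a generic
sketch (not anything from the article), here is the closed-form solution for a
one-predictor line in plain Python:

```python
def least_squares_line(xs, ys):
    # Ordinary least squares for y = a + b*x, via the closed-form
    # normal-equation solution for the single-predictor case.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx               # slope
    a = mean_y - b * mean_x     # intercept
    return a, b

# Noiseless points on y = 1 + 2x are recovered exactly.
a, b = least_squares_line([0, 1, 2, 3], [1, 3, 5, 7])
```

Whether you read (a, b) as parameter estimates under a noise model or simply
as the best approximating line is exactly the statistics-versus-
function-approximation distinction the comment draws.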

------
erikb
Now it's suddenly 50 years? When I started reading HN in 2010 (or '09), Data
Science didn't exist yet. We had statistics and IT back then. I saw your
birth, dude! All the fluids and the screaming.

