
The key word in “Data Science” is not Data, it is Science - aaronjg
http://simplystatistics.org/2013/12/12/the-key-word-in-data-science-is-not-data-it-is-science/
======
joe_the_user
What does "science" mean in this context anyway? An astronomer, say, can work
with vast, vast swaths of data but that doesn't make the astronomer a "data
scientist". Statistics has some fairly generic tools but it doesn't seem to me
these add up to science in the sense that fields like chemistry, biology,
geology and physics are sciences.

My modest exposure to machine learning at the professional level gives me the
impression that the "real experts" combine a strong mathematical
understanding, long experience and some good rules of thumb to perform better
than a grad student shooting in the dark, _if_ they happen to perform better.

Oddly enough, all the articles about how hard it is to become a "real data
scientist" give the impression that however much expertise is involved, that
expertise isn't the codified understanding that is "real science" - even a
physics undergraduate does real physics because physicists have codified
their methods.

Maybe "data science" can become science. But I suspect that what will become
scientific is the understanding of whatever entity is producing the data. Which
isn't to discount the learning of experts here but simply to note that
compendiums of rules-of-thumb and feelings indicate what Thomas Kuhn might call
a pre-scientific field.

~~~
stiff
I think a great many astronomers were in fact "data scientists". For
example, Kepler's life's work in the end amounts to finding the simplest
possible model that could plausibly account for Tycho Brahe's data. In
fact, many of the most basic statistical modelling methods originated in the
process of trying to find explanations for astronomical observations, like
least squares:

[http://en.wikipedia.org/wiki/Least_squares#History](http://en.wikipedia.org/wiki/Least_squares#History)

In one sense science is just the outcome of correctly applying the scientific
method to a given problem, and in that sense data science is perhaps a
meta-science, in that it focuses on the part of the scientific method concerned
with making inferences from data, and tries to improve its effectiveness,
using a process that is itself scientific to some degree (applying mathematical
reasoning to find more reliable ways of inference, introducing additional
empirical experiments like cross-validation into the inference process itself
for confirmation). I think comparing physics to statistics is a category
error, and at the same time it's worth keeping in mind that modern physics,
biology and chemistry certainly rely very heavily on probability theory and
mathematical statistics, so those disciplines always brought up as "properly
scientific" are in many respects only rigorous to the extent statistics is.
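
To make the two techniques named above concrete, here is a minimal sketch of a
least-squares line fit checked with k-fold cross-validation. The data is
synthetic and made up for illustration, and the helper names `fit_line` and
`k_fold_mse` are hypothetical, not taken from any library or from the linked
article.

```python
import random


def fit_line(xs, ys):
    """Closed-form ordinary least squares for the model y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def k_fold_mse(xs, ys, k=5, seed=0):
    """Mean squared error on held-out folds, averaged over k folds."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    fold_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        # Fit only on the training part, then score on the unseen fold.
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errs = [(ys[i] - (a + b * xs[i])) ** 2 for i in held_out]
        fold_errors.append(sum(errs) / len(errs))
    return sum(fold_errors) / k


# Synthetic data (an assumption for illustration): y = 2 + 3x plus noise.
rng = random.Random(42)
xs = [i / 10 for i in range(50)]
ys = [2 + 3 * x + rng.gauss(0, 0.5) for x in xs]

intercept, slope = fit_line(xs, ys)
cv_error = k_fold_mse(xs, ys, k=5)
print(f"intercept={intercept:.2f} slope={slope:.2f} cv_mse={cv_error:.3f}")
```

The fit is the inference step; the held-out error is the extra empirical
check on whether that inference generalizes beyond the points it was fit to.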

~~~
kyzyl
> I think a great many astronomers were in fact "data scientists"

I think the data component is a feature of applied physics in general. Physics
problems have had a tendency to be extremely data driven, because when you're
dealing with things out in the cosmos, or trying to model nuclear structure
for the first time, or shed light on the Higgs mechanism, the data is _all you
have_. You often can't see it with your eye, or a microscope. The effect
you're observing only lasts a femtosecond, so you conceive of extremely fancy
contraptions that measure the secondary effects en masse (if you'll allow the
pun), and you end up with fantastically large data sets.

That said, when I look at the scientists I knew/know, including astronomers,
their solutions often take a different approach than what I see today in the
"data science" blogosphere. There, the approach seems highly centered on
scaling various methods to work with huge amounts of data, whereas most
scientists seem to do just the opposite. They go to great lengths--such as
building massive integrating antennae the size of football fields--to make
sure they collect precisely the correct data, and nothing more. They then take
that data and apply the technique that they know _must_ work, if their
hypothesis is correct.

------
netcan
Data science is one of those words that are like a semantic mine, which HN
seems to be especially susceptible to. These words (a few years ago it was
"cloud" or "Web 2.0") catch on because they capture a bunch of things that
have changed which seem related. But, they aren't clearly definable. This
traps people (I think HN people especially).

Web 2.0 was a trend in prominent personality types, types of websites,
business models, increased scale of online interaction, use of real names, web
design styles, programming languages. It wouldn't have been absurd to see a
guy with a specific look and a specific laptop and say "that's so web 2.0". To
add insult to injury, pretty much all the stuff that gets captured in a word
like web 2.0 has predecessors. Crowd sourcing? What about Wikipedia?

I think a good analogy is "movement" in art, philosophy & culture. Modernist
is a word that encompasses Frank Lloyd Wright, Pablo Picasso, James Joyce, Ayn
Rand & Karl Marx. It applies to paintings, manifestos, econometrics and
buildings with straight lines.

That's the kind of word that 'data science' is. We found ourselves recording a
lot of data as a sort of side effect of digitization. It's growing. Then we
start to try and get some value from that data. Some new stuff is possible
with that volume of data. Some new people are now interested in data. A lot of
the tools people were using to collect and analyze data don't work at that
volume, so we start using new tools. We end up with a word that includes
astronomers, Netflix, medical researchers, self-driving cars, R, statistical
theories etc.

Data science doesn't mean anything that specific yet. It's best not to steer
the discussion (as I am doing right now) into a debate about the word and what
qualifies as data science.

~~~
kyzyl
Yes, thank you. I like to distill things a bit:

1\. There is science, and there is data. Science has data, and is a thing you
can do. Data is just data. "Data Science" as a description of an activity is
like saying "Words Writing" or "Money Banking". As a whole the term holds no
concrete meaning, thus it is contorted to whatever the current topic of
conversation is (this is exactly what you've just said, I think).

2\. There _has_ been a massive change in the volume of data and the tools we
use to analyze it. This has _nothing_ to do with what you're doing, what field
you're in, or whether you know what you're doing. Again, data is just data.
(Also what you said).

3\. If you want to use the term in your conversations, blog posts or--if you
must--publications, fine. Just make sure YOU have already set in stone what it
is you're talking about. It shouldn't take others probing your sentence
structure, nomenclature and argumentation for you to come up with a precise
definition of what you mean. That's just bad SCIENCE, no matter how much DATA
you have.

~~~
netcan
It's bad science, but I don't think it's bad conversationally. If I say that I
think the hardest part of a good streaming media service is getting the data
science right, we both kind of know what I mean. It might even be useful that
I could mean the choice of database or the choice of algorithm. Even though the
things encompassed by the term aren't necessarily related, they usually are.
Google was obsessed with big data before it was cool and they were/are the
best on most fronts: collecting the most data, getting between the user and
the data (query->result) fastest, and getting the most correct answers.

The "cloud" buzzword really pissed people off. I thought that was an
overreaction too. If you say that enterprise software companies need to deal
with these new cloud competitors, you mean a whole cluster of things. Web
software, SaaS pricing models and sales cycles, smaller companies that don't
run their own infrastructure, etc.

------
ibsathish
Data offers a perspective only when there is a problem to solve and you need
support from the data. Problem solving is more about the knowledge in the
vertical, the approach to the problem and your capability to think from
multiple angles, which does not have to do with science.

Good article, nevertheless.

------
mathattack
I really love all these articles about Data Science, but it's a lot more than
statistics. It's programming, it's domain knowledge, and yes, it's a lot about
thinking of the meaning, format, and pliability of the data.

~~~
gaius
In other words, being a business analyst, but without the experience to know
that that's already a job.

~~~
saraid216
And possibly applicable to things that aren't related to making money.

~~~
gaius
I know several BAs who work for charities. Any large organization needs them
to operate efficiently.

~~~
mathattack
One of the most important uses of data science is to determine the impact of
charitable giving. Is it a more efficient use of money to give people cash, or
pay for infrastructure in their town? Is it more efficient to invest in K-5
education, 6-8, high school or college? There are lots of ideological answers
to these questions, but data science is great for supplying empirical
evidence.

~~~
gaius
Actually civil servants have been churning this stuff out since the 50s, if
not before. Politicians ignore it of course, but if you think this is a new
thing, then you have fallen into the common data science trap of not knowing
there's nothing new under the sun. Which is odd really since anyone calling
themselves a scientist of any sort should be able to do some research and not
reinvent the wheel.

------
platz
By what process can I "understand whether these correlations matter for
specific, interesting questions"?

