

What is data science? - helwr
http://radar.oreilly.com/2010/06/what-is-data-science.html

======
hnote
The misnaming of fields of study is so common as to lead to what might be
general systems laws.

For example, Frank Harary once suggested the law that any field that had the
word "science" in its name was guaranteed thereby not to be a science. He
would cite as examples Military Science, Library Science, Political Science,
Homemaking Science, Social Science, and Computer Science.

Discuss the generality of this law, and possible reasons for its predictive
power.

\-- Gerald Weinberg, "An Introduction to General Systems Thinking."

~~~
ced
Counterexample: Neuroscience. But that's the only one I can find.

~~~
anamax
Materials science is my counterexample.

~~~
billswift
Materials science isn't any more (or less) science than computer science is.
It is applied chemistry.

------
Tichy
Hm, what is the scale of the chart at the end (Cassandra Jobs)? My guess is
that it went from 0 to 3 jobs...

Not that I dislike Cassandra, just that interesting jobs tend to be rare in my
experience.

------
greenlblue
I disagree with the whole premise that data science is the wave of the future.
As soon as you collect a piece of information it is already out of date and is
of very limited use. Weather scientists have been collecting data since the
beginning of the century and they are still no better at predicting the
weather than they were at the beginning of the century. Economists are in the
same boat. They have tons of data on markets but they still can't figure out
what makes markets tick. What we need is not more data or data-centric
thinking. We need generative models that explain how the data is being
generated and why it is being generated in a certain way. There is only so
much you can squeeze out of raw data by computing numbers from it and so far I
don't think the results have been that impressive.

~~~
akshayubhat

         already out of date and is of very limited use

Depends on the type of information. News: yes. Information about biology,
geographic information, or encyclopedic knowledge: not so fast. On the other
hand, we now have access to real-time data.

    
    
         Weather scientists have been collecting data since the beginning of the century and they are still no better at predicting the weather than they were at the beginning of the century.

Predicting Weather is near impossible in theory forget the practice. It's a
chaotic system. Generally in data science we are trying to predict things
which we know can be predicted, such as determining whether an article is
relevant to a topic. Humans can do that very well, but for machines to do it
you need more data.

    
    
         They have tons of data on markets but they still can't figure out what makes markets tick. 

Again, markets are chaotic and are affected by things like low-probability
events. You can predict some patterns using information asymmetry; that's
what all those traders at Goldman Sachs and other firms do.

    
    
        We need generative models that explain how the data is being generated and why it is being generated in a certain way.

The systems that we are interested in have complex models, and generating them
from first principles isn't always easy. Additionally, even if you do generate
them, you need to test them against real-world data. Rather than using the
normal hypothesis-experiment cycle, it makes more sense to look for
predictable patterns in data.

    
    
        I don't think the results have been that impressive.

That's because you are out of touch with the field. Look at Google's
statistical translation results, for example: they beat every generative model
in town. Or the Netflix Prize for recommendation systems.

As someone interested in data science who has been quite involved with it for
a few years as a student and an intern, let me try to explain why data science
is now becoming an important area:

1\. We now have a lot of data in a machine readable format:

Unlike a few years ago, we now have huge datasets publicly available. They
range from topic-specific datasets such as GeoNames and LinkedGeoData to much
wider encyclopedic datasets such as the Wikipedia data dump, Freebase, OpenCyc,
and DBpedia.

We also have a huge amount of user-generated data. E.g. I have on my computer
right now a huge chunk of Twitter's follower network consisting of 35 million
users (I am writing an open source whom-to-follow system). Additionally, I also
have 100 million tweets.

2\. Beyond static data, we now have access to real-time data:

You can access the public Twitter timeline using their streaming API. There
are also quite a few PubSubHubbub systems out there which combine information
from disparate sources and provide you with a unified source.

3\. Moreover, we have tools to handle the deluge [well, sort of]:

Thanks to Google, Apache, Yahoo, and Facebook we now have Hadoop, MapReduce,
Pig, and other tools which make the job of parallelizing data processing easier.

4\. We have a scalable on-demand infrastructure in place:

Using AWS we can buy processing power as needed, which would have been
impossible a few years ago. I recently bought a high-memory instance with
17 GB of RAM for 50 cents an hour for 10 hours to run some jobs. We can now
also deploy web apps very easily using Google App Engine without paying a
penny, which enables us to create nice interfaces for visualizing and querying
the data at a low initial cost.
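
For readers who have not used the map/reduce model named in point 3, here is a
minimal single-machine sketch of the idea, counting words across a few lines
of text. It is plain Python, not the Hadoop or Pig API; the function names
(`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input
are assumptions made up for illustration. A real framework runs the map and
reduce steps on many machines and does the grouping step for you.

```python
from collections import defaultdict


def map_phase(record):
    """Emit (word, 1) pairs for one input record (here, one line of text)."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle, and reduce in sequence on an iterable of records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical sample input, standing in for a large tweet dump.
    sample = ["data science is hot", "data is everywhere"]
    print(word_count(sample))
```

Because each call to `map_phase` looks at one record only, and each call to
`reduce_phase` looks at one key only, both steps can be split across machines,
which is what makes the model a good fit for rented on-demand hardware.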

Finally we are slowly building infrastructure to sell datasets or custom apps.
E.g. Amazon Dev Pay or Infochimps.

The name data science can be misleading, though. If you are a student, it
would mean taking courses in Machine Learning, Data Mining, Information
Retrieval, Statistics, Distributed Computing, and Databases.

~~~
pvg
_Predicting Weather is near impossible in theory forget the practice._

Are you really saying that everyone from prehistoric hunter-gatherer
societies tracking the seasons to modern meteorologists with their
supercomputers and satellites has been engaging in something theoretically
and practically impossible? That would surely be staggering news to everyone
involved.

~~~
btilly
The details of which days are going to be sunny a month out, which will be
rainy, and when storms will arrive are chaotic and impossible even in theory
to predict more than a few weeks away. Practice is worse: there we manage no
more than a few days, and this has been true for several decades.

That said, there are larger trends that can be predicted. For instance the
seasons which come from astronomical facts. Or the several year El Niño/La
Niña oscillation. Not to mention relatively slow moving Rossby waves in the
jet stream. (One of which is bringing hot weather to Russia and monsoons to
Pakistan right now.) These give useful information about what is likely to
happen and keep happening over periods of weeks, months, and to some extent
years.

But _none_ of this brings us any closer to the idea of being able to give an
exact weather forecast for a day 6 months in the future. That goal is
impossible. And has been known to be impossible for several decades.
Furthermore I assure you that this fact is well-known to every competent
modern meteorologist. (The word "competent" does not necessarily cover people
chosen primarily for their appearance to deliver the weather report for local
TV stations.)
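
To make the sensitivity argument concrete, here is a minimal sketch using the
logistic map, a standard toy chaotic system. It is not a weather model; the
function names, the starting value 0.2, and the perturbation of 1e-10 are
assumptions chosen for illustration. Two orbits that start almost identically
are iterated side by side, and the gap between them is printed at a few steps.

```python
def logistic_orbit(x0, steps, r=4.0):
    """Iterate x -> r * x * (1 - x) starting from x0; return all visited values."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


def divergence(x0, eps, steps):
    """Absolute gap at each step between orbits started at x0 and x0 + eps."""
    a = logistic_orbit(x0, steps)
    b = logistic_orbit(x0 + eps, steps)
    return [abs(p - q) for p, q in zip(a, b)]


if __name__ == "__main__":
    # Hypothetical initial condition and measurement error.
    gaps = divergence(0.2, 1e-10, 60)
    for n in (0, 10, 20, 30, 40, 50, 60):
        print(n, gaps[n])
```

The gap stays tiny for the first few steps and then grows until it is as large
as the values themselves, after which the two orbits are unrelated. Better
initial measurements push that point further out but cannot remove it, which
is the sense in which detailed long-range forecasts are impossible even in
theory.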

~~~
mturmon
According to this ECMWF planning document (page 14, figure 4):

http://www.ecmwf.int/about/programmatic/strategy/strategy.pdf

the forecast skill for ECMWF and NOAA has been improving pretty steadily over
the last 15 years. Basically, we're seeing two days farther into the future
now than 20 years ago.

I agree that chaotic dynamics and various noise sources limit the time horizon
for weather predictions to perhaps 2 weeks.

