

In Media, Big Data Is Booming but Big Results Are Lacking - carlosgg
http://allthingsd.com/20130520/in-media-big-data-is-booming-but-big-results-are-lacking/?mod=thisweek

======
ryanlchan
I think a major hurdle we have to overcome with big data is separating
causation from correlation. As the data set scales, we gain ever-increasing
confidence in the correlation, but face an ever more complex set of possible
causes.
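
To make that point concrete, here is a rough Python sketch (my own toy setup, not anything from the article). It assumes two measured variables that are both driven by a hidden common cause, and uses the standard Fisher z-transform to get an approximate 95% interval for the correlation. The function names and the simulated data are illustrative assumptions only.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for r via the Fisher z-transform."""
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


def simulate(n, seed=0):
    """Toy data: x and y never influence each other; both depend on `hidden`."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        hidden = rng.gauss(0, 1)  # unobserved common cause (assumption)
        xs.append(hidden + rng.gauss(0, 1))
        ys.append(hidden + rng.gauss(0, 1))
    return xs, ys


def ci_width(n, seed=0):
    """Return the sample correlation and the width of its 95% interval."""
    xs, ys = simulate(n, seed)
    r = pearson(xs, ys)
    low, high = fisher_ci(r, n)
    return r, high - low


if __name__ == "__main__":
    for n in (100, 10_000, 100_000):
        r, width = ci_width(n)
        print(f"n={n}: r={r:.3f}, 95% interval width={width:.4f}")
```

The interval around the correlation tightens as n grows, yet nothing in the output reveals that x does not cause y. More rows sharpen the "what" without touching the "why".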

Take their House of Cards example. Netflix saw a strong correlation between
David Fincher, Political Thrillers, and Kevin Spacey. Fantastic. But why? What
did people like about these things? Why did this 'work'?

Let's try to replicate this decision: take great directors (the Wachowski
siblings), a strong cast (Emile Hirsch, John Goodman, Susan Sarandon), and a
nearly unlimited budget ($200m) to reboot an existing, well-received
franchise. Should be a hit, right? Wrong - it's a complete and utter failure
known as 2008's Speed Racer.

When we say we want to be data-driven we actually mean we want to be
insights-driven. We want to understand the "Why?" behind the data's "What";
it's the "Why" that lets us know how to react next. It's easy to confuse the
data's specificity with insight's certainty, but they are distinctly not the
same: we can pinpoint conversions down to 6 digits of significance without
having a clue why they occur.

What we really need is Big Insight, but that's a significantly harder problem,
not because we don't have the technology to create a solution, but because
we don't even know what the right questions are.

I'm optimistic about the possibilities of a system like IBM's Watson in
helping solve this, but as it stands, Big Data's utility is giving us 99.755%
certainty that we have no idea what is going on.

~~~
joe_the_user
What I would like to know is where this leaves the "big data will give us AI"
school of thought, the Nate Silver school of thought.

------
karterk
Engineers are better equipped to collect and store massive amounts of data
than to analyze it, draw meaningful conclusions from it, or use it to drive
decisions. It takes business acumen to ask the right questions of the data.

The way "big data" is stored today also poses problems with respect to
queryability. NoSQL systems are great at storing huge amounts of data
efficiently but don't help us slice and dice the data easily. Having to write
MapReduce jobs for every "query" is painful and time-consuming. Tools like
Hive, Pig and Cascading help in writing MR jobs succinctly but are still very
slow when someone wants to quickly filter and explore the data.
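
As a rough illustration of the overhead, here is a toy, single-process Python sketch of a "plays per title" count written in map/shuffle/reduce style. It is not Hadoop code, and the record fields (`event`, `title`) and function names are hypothetical, chosen only for the example.

```python
from collections import defaultdict
from functools import reduce


def map_phase(records):
    """Emit (title, 1) for every record that is a 'play' event."""
    for rec in records:
        if rec["event"] == "play":
            yield rec["title"], 1


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values collected for each key."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}


def plays_per_title(records):
    """Run the three phases end to end on an in-memory list of records."""
    return reduce_phase(shuffle(map_phase(records)))


if __name__ == "__main__":
    sample = [
        {"event": "play", "title": "House of Cards"},
        {"event": "pause", "title": "House of Cards"},
        {"event": "play", "title": "Arrested Development"},
        {"event": "play", "title": "House of Cards"},
    ]
    print(plays_per_title(sample))
```

In SQL the same question is a single `SELECT title, COUNT(*) ... GROUP BY title`, and that gap is what makes ad hoc exploration feel so heavy.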

~~~
izendejas
You should look into Spark: <http://spark-project.org/> and Shark (Spark +
Hive).

I personally find that using the Spark repl is both efficient and flexible;
and being able to cache data in memory makes for very fast iterations,
especially when doing something more sophisticated.

There's also Cloudera Impala, but I've never used it, nor do I think I'll ever
have to given what the AMPLab folks are doing.

------
stfu
I am just hoping to hear one day the "real" story of how "House of Cards" came
about. Somehow my gut tells me that using this as the poster child of big
data is only a half-truth. I for one believe that it was just a lucky guess,
i.e. it is not really surprising that some quality stories work with a paying
audience. Something along the lines of Boston Legal would probably have worked
just as well for the professional, 30-plus, male, metro-area audience.

Unless we now see a consistent streak of at least 5-10 original series that
become hits, the whole big data thing is based on lucky guessing (original
programming, not recycled material: after all, House of Cards was already a
success in the UK, and next in line, Arrested Development, already has a
hardcore fan base).

Btw, does anyone have numbers on how "House of Cards" is performing in
comparison to original HBO programming?

------
kaa2102
Data is data. The most important data is the data that impacts the
strategic objectives of your business or organization. There is zero value in
boiling the ocean.

------
tytung2020
Is anyone using evolutionary algorithms to sort this data? I am no expert in
this, but I did some research in AL (Artificial Life) back in undergrad. I
think evolutionary algorithms will beat any AI at sorting and finding
connections.

