
Big data: are we making a big mistake? - pietro
http://www.ft.com/intl/cms/s/2/21a6e7d8-b479-11e3-a09a-00144feabdc0.html#axzz2xS1VXiUc
======
nemesisj
Another conclusion to draw from this article (which I really enjoyed, by the
way) is that Big Data has been turned into one of the most abstract buzzwords
ever. You thought "cloud" was bad? "Big Data" is far worse in its lack of
specificity.

I can't count the number of times I'll be talking to some sales rep and
they'll describe how they scan the data within whatever application they're
demoing and "suggest" items using "big data techniques". In almost all cases
they're talking about a few thousand or hundred thousand records, tops.

I've found that when non-hardcore techies talk about Big Data, what they
really mean is "they have some data" vs before, when they had zero data.

From the article:

 _" Consultants urge the data-naive to wise up to the potential of big data. A
recent report from the McKinsey Global Institute reckoned that the US
healthcare system could save $300bn a year – $1,000 per American – through
better integration and analysis of the data produced by everything from
clinical trials to health insurance transactions to smart running shoes. _

What these consultants mean is that by having just some data, compared to the
siloed data that is the norm in US healthcare, they could save a lot, and
they're right. My previous company had a large data set (20+ million patients)
and we'd find millions of dollars of savings opportunities for every hospital
we implemented in, but that's because we had the data, not because we were
running some kind of non-causal correlation analysis like the article
references. It was just because we could actually run queries on a data set.

\-----

Off Topic - how annoying is it that when you copy & paste from the FT, they
preface your copy with the following text?

 _High quality global journalism requires investment. Please share this
article with others using the link below, do not cut & paste the article. See
our Ts&Cs and Copyright Policy for more detail. Email ftsales.support@ft.com
to buy additional rights.
[http://www.ft.com/cms/s/2/21a6e7d8-b479-11e3-a09a-00144feabd...](http://www.ft.com/cms/s/2/21a6e7d8-b479-11e3-a09a-00144feabdc0.html#ixzz2xSKoQYaW)

~~~
JonLim
Out of curiosity, when does it effectively become "big data"?

I ask not to be snarky, but it might be the case that it's "big data" to
someone else, but not necessarily to you. I figured it was a relative term for
your industry/business, but the hacker crowd definitely seems to peg that
amount in the millions of data points before calling it big data at all.

Seems fair, but I'd rather clarify.

~~~
twic
I usually follow DevOps Borat's definition [1]:

"Big Data is any thing which is crash Excel."

Many a true word spoken in jest.

[1]
[https://twitter.com/DEVOPS_BORAT/status/288698056470315008](https://twitter.com/DEVOPS_BORAT/status/288698056470315008)

~~~
xroche
This is very inaccurate/misleading IMHO. Big Data is anything that does not
fit on a regular machine for a given operation. You can sort billions of
records on an iPhone, for example. You can grep for a string in a terabyte
file on a single personal computer, and I am not convinced you'd go faster
with a distributed system (reading the file from cold storage will be the
limiting factor). People claiming to do "big data" in these situations
generally do not understand the underlying concepts.

~~~
dannypgh
With a distributed storage system you should be able to read said terabyte
file using far more disk heads.

It would also be easier to engineer it so the terabyte file was entirely in
RAM, by distributing it across multiple machines (although single machines
with TB RAM capacity are no doubt continuing to become more common).

Sure, store it on a single tape or disk and distributing the computation won't
help. You need distributed storage to properly leverage distributed
computation for otherwise I/O-bound processes.
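A back-of-envelope sketch of that point in Python (the 150 MB/s disk
throughput and the assumption of perfectly parallel reads are illustrative,
not measurements):

```python
def scan_seconds(data_bytes, disk_bytes_per_s, n_disks=1):
    # Sequential-scan time, assuming the file is split evenly across
    # n_disks and all reads proceed in parallel (toy model only).
    return data_bytes / n_disks / disk_bytes_per_s

ONE_TB = 10**12
single = scan_seconds(ONE_TB, 150 * 10**6)        # one spinning disk
cluster = scan_seconds(ONE_TB, 150 * 10**6, 100)  # 100 disk heads
```

With one disk the scan takes close to two hours; with a hundred disk heads it
drops to about a minute, which is the whole argument for distributed storage
on I/O-bound scans.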

------
amirmc
_" But while big data promise much to scientists, entrepreneurs and
governments, they are doomed to disappoint us if we ignore some very familiar
statistical lessons.

“There are a lot of small data problems that occur in big data,” says
Spiegelhalter. “They don’t disappear because you’ve got lots of the stuff.
They get worse.”"_

This should be the main learning point. Humans can be astonishingly bad at
dealing with stats and biases, which can lead to erroneous decisions being made.
If you want an example where such decisions by very smart people can have
catastrophic consequences, look up the Challenger disaster [1].

I rarely see people stating their assumptions upfront, which doesn't help the
problem (I guess it's not cool to admit potential weaknesses). The more
people/companies that get into 'big data' (without adequate training) the more
false positives we're going to see.

[1]
[http://en.wikipedia.org/wiki/Space_Shuttle_Challenger_disast...](http://en.wikipedia.org/wiki/Space_Shuttle_Challenger_disaster)

------
sitkack
This article reminds me of the argument [0] between Noam Chomsky [1] and Peter
Norvig [2]. TL;DR (paraphrased with hyperbole) Chomsky claims the statistical
AI of Norvig is a fancy sideshow that doesn't understand _why_ it is doing a
thing. It just throws gigabytes of data at an ensemble and comes out with an
answer.

[0] -
[http://www.theatlantic.com/technology/archive/2012/11/noam-c...](http://www.theatlantic.com/technology/archive/2012/11/noam-
chomsky-on-where-artificial-intelligence-went-wrong/261637/)

[1] -
[http://en.wikipedia.org/wiki/Noam_Chomsky](http://en.wikipedia.org/wiki/Noam_Chomsky)

[2] -
[http://en.wikipedia.org/wiki/Peter_Norvig](http://en.wikipedia.org/wiki/Peter_Norvig)

\----

Norvig's rebuttal:
[http://norvig.com/chomsky.html](http://norvig.com/chomsky.html)

~~~
shas3
Also relevant to this discussion is Douglas Hofstadter's solitary pursuit of
'thinking machines', outlined recently in this Atlantic profile:
[http://www.theatlantic.com/magazine/archive/2013/11/the-
man-...](http://www.theatlantic.com/magazine/archive/2013/11/the-man-who-
would-teach-machines-to-think/309529/)

This analogy is particularly illuminating,

"“The quest for ‘artificial flight’ succeeded when the Wright brothers and
others stopped imitating birds and started … learning about aerodynamics,”
Stuart Russell and Peter Norvig write in their leading textbook, Artificial
Intelligence: A Modern Approach. AI started working when it ditched humans as
a model, because it ditched them. That’s the thrust of the analogy: Airplanes
don’t flap their wings; why should computers think?"

While the Norvig-Chomsky debate is about the philosophy of the science of AI,
it has practical implications to practitioners who tend to apply statistical
techniques as if they are popping a pill. Engineers applying statistical
learning, etc. should understand the limitations of the techniques, as
outlined by Chomsky in the debate. The outcome of the Chomsky-Norvig (or
Hofstadter vs. everyone else in CS) debate is less important than the
arguments put forth by both the groups.

~~~
joe_the_user
The problem with the analogical comparison between the tuples [birds,
airplanes, flight] and [humans, AI-machines, intelligence] is that flight is a
clear and unambiguous achievement, whereas intelligence is something we
haven't fully defined and for which we humans are ourselves the only accepted
model (and self-interrogation is an activity that can feel easy, but in which
we find many subtle and non-obvious problems).

------
SixSigma
> a provocative essay published in Wired in 2008, “with enough data, the
> numbers speak for themselves”

I think that's indicative of Wired's breathless enthusiasm for technology,
which turned me off buying the print version many years ago.

Scrape away some of the hyperbole and it is true that data driven management
has made many companies more competitive and, if I dare mention the hobgoblin,
efficient.

Hunches and ideas can only get you so far. It is important to visit the data
gemba and do the genchi genbutsu.

[http://en.wikipedia.org/wiki/Gemba](http://en.wikipedia.org/wiki/Gemba)

[http://en.wikipedia.org/wiki/Gembutsu](http://en.wikipedia.org/wiki/Gembutsu)

~~~
sireat
I have some Wired issues from mid 90s in the bathroom and the tone is the
same.

It seems pretty much everything they write about is supposed to change the
world in a major paradigm shift.

~~~
SixSigma
It delights in the techno-utopia envisaged by Nicholas Negroponte, personally
I just can't be doing with it.

[http://en.wikipedia.org/wiki/Nicholas_Negroponte](http://en.wikipedia.org/wiki/Nicholas_Negroponte)

------
RA_Fisher
I'm much more impressed when someone can squeeze information out of small
data. W.S. Gosset was extracting tons of information from as little as two
observations. I'm very grateful that my advisor guided my cohort to work with
two-observation MLE in many contexts. This type of practice focuses the
analyst on squeezing out as much information as possible. When applied to big
data, this approach can be very useful. Big data comes with data wrangling
challenges, but if you don't carefully squeeze out information, you'll be
leaving tons and tons on the table.

------
hawkharris
The misconceptions about big data are similar to those surrounding the word
science.

Many people associate "science" with things: cells, microscopes, the inner
workings of the body. But science isn't a set of things; it's a process, a
method of thinking, that can be applied to any facet of life.

Big data is similar, in my opinion. It's not so much about the stuff — the
size or diversity of a company's datasets. It has more to do with the types of
observations you're making and the statistical methods involved.

This distinction is important for two reasons:

1\. If Big Data is recognized as a process rather than a circumstance,
businesses will be more deliberate in deciding whether to use the methods.
They will weigh the benefits of, say, MapReduce against other approaches.

2\. The idea that "Big Data" techniques have everything to do with size is
somewhat misleading. A comprehensive query of a 50,000-user dataset can be
more computationally expensive than a simple operation on a 100,000-record
dataset.
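For readers unfamiliar with the MapReduce pattern mentioned above, here is a
minimal single-machine sketch of the idea, using word count, the canonical
example (a real framework distributes each phase across machines; this just
shows the map/shuffle/reduce shape):

```python
from collections import defaultdict
from itertools import chain

def map_phase(docs):
    # Map: emit a (word, 1) pair for every word in every document.
    return chain.from_iterable(((w, 1) for w in d.split()) for d in docs)

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine each key's grouped values into a single result.
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data is a process", "data is big"]
counts = reduce_phase(shuffle(map_phase(docs)))
# counts -> {"big": 2, "data": 2, "is": 2, "a": 1, "process": 1}
```

Seeing the pattern this small makes the weighing easier: if your data fits in
one process's memory, a dictionary does the same job with far less machinery.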

~~~
mxfh
It's the misconception that measurable observations equal the real
distribution of the underlying events. Even professional data people often get
that wrong, and it's not strictly limited to big data.

One of the most obvious examples was this one: a data set of all known
meteorite landings [1] turns into "Every meteorite fall on Earth mapped" [2],
which looks like a world population map sprinkled with some deserts known for
their meteorite-hunter tourism. The actual distribution can theoretically be
described as a curve falling towards the poles. [3]

While this example is pretty obvious, one could expect similar observation
biases in other data sources. A danger lies where data analysts do not bother
to investigate what their data actually represents and then go on to present
their conclusions as if they were some kind of universal truth.

[1][http://visualizing.org/datasets/meteorite-
landings](http://visualizing.org/datasets/meteorite-landings)

[2][http://www.theguardian.com/news/datablog/interactive/2013/fe...](http://www.theguardian.com/news/datablog/interactive/2013/feb/15/meteorite-
fall-map)

[3][http://articles.adsabs.harvard.edu//full/1964Metic...2..271H...](http://articles.adsabs.harvard.edu//full/1964Metic...2..271H/0000276.000.html)

previous discussion of this:
[https://news.ycombinator.com/item?id=5240782](https://news.ycombinator.com/item?id=5240782)
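The observation-bias point can be made concrete with a toy simulation (the
regions and detection rates below are invented purely for illustration):

```python
def observed_counts(true_counts, detection_rate):
    # What a data set records is the true occurrence *times* the chance
    # anyone was there to observe it, so observations mirror the observers.
    return {region: true_counts[region] * detection_rate[region]
            for region in true_counts}

true_counts = {"desert": 100, "city": 100, "ocean": 100}  # uniform in reality
detection = {"desert": 0.30, "city": 0.90, "ocean": 0.01}
obs = observed_counts(true_counts, detection)
```

Even though the true event rate is identical everywhere, the recorded data
shows cities swamping deserts and the ocean nearly empty, exactly the
meteorite-map effect.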

~~~
greenyoda
You have the same problem with historical global temperature data: weather
stations tend to be in or near populated areas, which excludes oceans (70% of
the earth's surface area) and huge, sparsely populated regions like the
Arctic, the Antarctic, deserts, rain forests, remote mountain ranges like the
Himalayas and Andes, etc.

------
nobbyclark
I get the impression from looking at local "big data" events that the
enterprise software crowd has tuned into big data.

I fear that now that SOAP and enterprise buses have had their day, they are
looking for a new buzzword to sell. More solutions looking for problems...

------
hibikir
I find it amusing that the article talks about big mistakes in polling data,
when the clear winner of the last two US elections is one Nate Silver, who
aggregated polls to get predictions so close to the actual results, one
wonders why people actually vote anymore.

Now, just like with every other technological solution, we only learn about
the limits of its use by overuse. There's plenty of people out there storing
large amounts of data and getting no valuable conclusions out of it. But the
fact that many people will fail doesn't mean the concept is not worth
pursuing.

Chasing what is cool is a pretty dangerous impulse. The trick is to be able to
tell when it can pay off, and to quickly learn when it will not, and cut your
losses. Maybe you don't need big data, just like maybe your shiny cutting edge
library might not be ready for production.

~~~
jaravis
Nate's approach is based on evaluating the quality of the various polls -
which is the thrust of the FT article. In fact he actively weighted each of
the polls & corrected for known biases.
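A heavily simplified sketch of weighted poll aggregation (the estimates and
weights below are made up; Silver's actual model derives weights from sample
size, recency, and pollster house effects, and is far more sophisticated):

```python
def weighted_poll_average(polls):
    # polls: (estimate, weight) pairs. Higher-quality polls get more
    # weight, so one bad pollster can't drag the aggregate around.
    total = sum(w for _, w in polls)
    return sum(est * w for est, w in polls) / total

polls = [(0.52, 3.0), (0.49, 1.0), (0.51, 2.0)]
avg = weighted_poll_average(polls)
```

The design choice is the point: evaluating and weighting the inputs is what
separates aggregation from merely having "more data".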

------
emiliobumachar
Great article. I think the brightest gem here is the Multiple comparisons
problem:

[http://en.wikipedia.org/wiki/Multiple_comparisons](http://en.wikipedia.org/wiki/Multiple_comparisons)
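A quick toy simulation of the problem: run many pure-noise "studies"
(fair-coin experiments, tested with a normal approximation to the binomial)
and count how many come out "significant" at p < 0.05 anyway:

```python
import random

random.seed(42)

def coin_experiment(n_flips=100):
    # One "study": count heads in n_flips fair coin tosses.
    return sum(random.random() < 0.5 for _ in range(n_flips))

def significant(heads, n_flips=100, z_crit=1.96):
    # Two-sided normal approximation to a binomial test at p < 0.05.
    mean, sd = n_flips / 2, (n_flips * 0.25) ** 0.5
    return abs(heads - mean) / sd > z_crit

hits = sum(significant(coin_experiment()) for _ in range(1000))
# With 1000 tests on pure noise, roughly 5% come out "significant".
```

Scale the number of comparisons up to big-data size and the false discoveries
scale with it, which is exactly the trap the article describes.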

~~~
sitkack
If we aren't careful the singularity AI will believe in God, and not
necessarily us.

~~~
emiliobumachar
I didn't get it. Would you please elaborate?

------
akadien
This is my favorite line and the one that damns so many "big data" efforts:

"They cared about ­correlation rather than causation."

Analytics are a tool to help find correlations and patterns so that humans can
do the hard work of determining and testing for causation. Computers are doing
their jobs; humans aren't.

------
dj-wonk
The “with enough data, the numbers speak for themselves” statement has several
meanings.

In one sense, if you can observe real phenomena, you don't have to guess at
what is happening. Businesses that collect troves of data may need statistics
'less' because the sample size may approach the population size.

But calculating basic (mean, standard deviation, etc.) statistics is hardly
the most interesting part. Inferential statistics is often more useful: how
does one variable affect another?

As the article points out, the "... the numbers speak for themselves"
statement may also be interpreted as "traditional statistical methods (which
you might call theory-driven) are less important as you get more data". I
don't want to wade into the theory-driven vs. exploratory argument, because I
think they both have their places. Both are important, and anyone who says
that only one is important is half blind.

Here is my main point: data -- in the senses that many people care about, e.g.
prediction, intuition, or causation -- does _not_ speak for itself. The
difficult task of thinking and reasoning about data is, by definition, driven
by both the data and the reasoning. So I'm a big proponent of (1) making your
model clear and (2) sharing your model along with your interpretations. (This
is analogous to sharing your logic when you make a conclusion; hardly a
controversial claim.)

------
stillsut
What executives say it does...

"Facebook’s mission is to give people the power to share and make the world
more open and connected."

What it actually does... (that will be left to the reader.)

"Big Data" is often sold as one thing by enterprise software folks. But what
value the data, or the processing of it, actually has usually depends much
more on the user and their context (like FB!), and usually doesn't fit as
nicely onto a PPT slide.

Articles like this usually confuse the PR definition and the analyst
definition.

------
MCarusi
A few other comments have raised this point, but Big Data is basically the new
Web 2.0. Aside from being a buzzword, as a term it's so nebulous that half of
the articles about it don't really define what it is. When does "data" become
"big data"?

------
sam_sach
Conclusion: "Big Data" is a stupid buzzword and it makes me cringe every time
I'm forced to say it to sell some new solution or frame something in a way
someone who barely knows anything about computer science can understand.

It's nebulous. I've seen it applied to machine learning, data management, data
transfer, etc. These are all things that existed long before the term, but
bloggers just won't STFU about it. Businesses, systems, etc. generate data. If
you don't analyze that data to test your hypotheses and theories, at the end
of the day, you don't understand your own business and are relying on
intuition for decision making.

------
bsbechtel
There is definitely value to big data, but isn't it also a form of
legitimizing stereotypes, at least in some cases? I mean, the general premise
of big data, is to glean conclusions and new knowledge of the world from
billions of records. When humans are the source of the data that is being
extracted and analyzed, are the conclusions not stereotypes of those
individuals, unless the correlation is 100%? This might be ok, and even
useful, when trying to optimize clicks on ads, but what about when the
government uses it to make policy decisions?

------
SworDsy
If I work for Facebook and I want to figure out something about my users,
isn't it safe to say N = All, since the data I'm accessing is all user data
from FB? It's easy to go wrong with big data, and although the article glossed
over some fairly important things (assuming the people who work on these
datasets are much dumber than they are in reality), it's right on about the
idea that the scope and scale of what big data promises may be too grandiose
for its capabilities.

~~~
mrow84
Whilst, in the example you provide, it might be the case that "N = all", the
cautionary tale offered in the article is that you always need to make sure
you are asking the right question, and it is pretty easy to confuse yourself.

So you said "if i work for facebook and i want to figure out something about
my users", and for whatever you were doing, looking at your existing user base
might be the right thing to do. Perhaps, though, you actually want to know
something about all your potential users, not just the users you happen to
have right now. Whether or not your current user base offers a good model for
your potential user base would then be a pretty important question, and one
that almost certainly isn't answered by "big data".

I think that, as with most of statistics, the key point is "think about your
problem", and that focusing on a set of solutions rather than the problems
themselves can get in the way of that.

------
linuxhansl
Any either-or discussion is doomed to fail. Saying that BigData is the end of
theory is clearly nonsense.

BigData vs. Theory, Java vs. C++, Capitalism vs. Socialism, Industry vs.
Nature, Good vs. Bad, etc.

BigData allows you to store a lot of data and provides a means to run some
computation on that data. Not more, and not less.

------
pella
_" Big data can tell us what’s wrong, not what’s right."_

from: [http://www.wired.com/2013/02/big-data-means-big-errors-
peopl...](http://www.wired.com/2013/02/big-data-means-big-errors-people/)

------
Sami_Lehtinen
I think this site is really relevant to this topic, even if it doesn't use the
term 'big data'.
[http://www.statisticsdonewrong.com/](http://www.statisticsdonewrong.com/)

------
dreamfactory2
Reminds me of chartism vs [http://en.wikipedia.org/wiki/Efficient-
market_hypothesis](http://en.wikipedia.org/wiki/Efficient-market_hypothesis)

------
ddmma
I like to think that every object or living being in this world has properties
and methods, as in programming... This is the source of data, small or big
depending on actions or complexity.

------
instaheat
Well, I was considering a career as a data scientist, having a strong interest
in this sort of thing as a poker player.

This just kills my vibe, man.

------
wglb
Excellent article.

New favorite phrases "data exhaust" and "digital exhaust".

~~~
tomaskazemekas
TL;DR. Too much 'data exhaust' can cause 'data vomit'.

------
kushti
We're making a big mistake with every big thing; that's the way we handle
buzzwords.

------
Houshalter
Nonsense. Google Flu was not "Big" data; they had only a few years' worth of
data at best. Additionally, when combined with current CDC data, its
predictions were better than models based on CDC data alone. And in all
likelihood they can improve it with better methods.

