
How Big Data Creates False Confidence - ezhil
http://nautil.us/blog/how-big-data-creates-false-confidence
======
DrNuke
It's a tool, not a magic wand, but it's difficult to keep a cool head when
everybody and his dog is competing for contracts, jobs, and market attention.
It's the 2010s gold rush.

------
pessimizer
The same way false precision creates false confidence. Ten vague, radically
differing estimates off the top of ten people's heads? Rubbish. An average of
those estimates taken out to 4 decimal places? Science.
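
To put numbers on the joke (the guesses below are completely made up, so this
is only a sketch): average ten wildly differing estimates, print the mean to
four decimal places, and compare it with the standard error those decimals
quietly ignore.

    import numpy as np

    # Ten vague, radically differing off-the-top-of-the-head estimates
    # (made-up numbers for illustration).
    guesses = np.array([5, 12, 40, 7, 90, 25, 3, 60, 15, 33], dtype=float)

    mean = guesses.mean()
    sem = guesses.std(ddof=1) / np.sqrt(len(guesses))  # standard error of the mean

    print(f"'Scientific' average: {mean:.4f}")  # 29.0000 -- four decimal places!
    print(f"Standard error:       {sem:.1f}")   # ~8.9 -- the decimals mean nothing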

------
unabst
The more data we have, the more creative we can get with it. Gather enough
statistics and you're bound to find something that backs your theory.

Traditional intuition insists this isn't supposed to happen. But that's why
statistics isn't physics. Data does not have to be botched or erroneous to get
creative. It can all be true, and the backing of your theory may also be
valid. The issue is whether your theory itself holds any weight or precedence
in light of all other possible theories. So when taking big data and
statistics into account, "all possible theories" is the big data picture, not
any specific theory. Searching for one theory is already misguided, because
information chaos/noise mounts with scale, as other data scientists will
consistently tell you.

But if we consider theories as abstractions of evidence, then this should all
make intuitive sense. A shitload of theories should emerge from a shitload of
evidence.
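
A toy version of this with purely synthetic data (so just a sketch): correlate
one random outcome against a thousand unrelated random predictors and count how
many come out "significant" at p < 0.05 by chance alone.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_obs, n_predictors = 100, 1000

    outcome = rng.normal(size=n_obs)                     # the "theory" we want to back
    predictors = rng.normal(size=(n_predictors, n_obs))  # pure noise, no real relationship

    # Correlate every predictor with the outcome and collect the p-values.
    pvals = np.array([stats.pearsonr(x, outcome)[1] for x in predictors])

    print(f"'Significant' at p < 0.05: {np.sum(pvals < 0.05)} of {n_predictors}")
    # Expect roughly 50 spurious hits purely by chance.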

------
cataflam
Concise article, good examples. It doesn't go into much depth after presenting
the problematic examples, but a recommended read.

------
gravypod
My boss often talks about some of the accidental misrepresentations of data
that occur in big data/statistics in the academic world. I started becoming
interested in this during the p-value debacle.

There is a lot of work to be done to reverse the clinical misunderstanding and
misuse of the tools we have at hand because, to be frank, I'd say none of us
understands them.

Can someone point me to a piece of reading so I can learn about the correct
way to do this kind of statistical analysis?

------
essayist
This suggests that in many "Big Data" analyses, validation is an afterthought,
because the interpretation of results is "obvious".

------
jkraker
These problems definitely are not specific to big data. They apply to
statistics in general. The author is right, though--many people are too ready
to draw overconfident conclusions from their analysis of big data just
because... big data.

~~~
kyrra
[https://en.wikipedia.org/wiki/Lies,_damned_lies,_and_statist...](https://en.wikipedia.org/wiki/Lies,_damned_lies,_and_statistics)

Statistics are there to answer a specific question, and even then the answer is
going to be wrong when your data is incomplete or when you ask a question of
your data that it can't properly answer.

------
askyourmother
Most of the recent "big data" projects we have been involved with will fail due
to a lack of basic direction from the client, right from the beginning.

While we try to explain how they failed to find the twenty needles in the three
pieces of straw, they now want to roll forward to a barn full of hay bales to
try and find even fewer needles!

~~~
collyw
Assuming that you are some sort of big data consultancy, shouldn't it be your
job to explain things to the client so they have a better understanding?

------
anotherhacker
Mo' data, mo' problems

As your data set grows, unbounded variance grows nonlinearly compared to the
valid data. As variance increases, deviations grow larger and occur more
frequently. This causes spurious relationships to grow much faster than
authentic ones. The noise becomes the signal.

Related: Overfitting:
[https://en.wikipedia.org/wiki/Overfitting](https://en.wikipedia.org/wiki/Overfitting)

Overfitting happens when you add too many variables to your training data.
This happens because people think that by adding more data (variables), they
can remove bias. What they end up doing is getting better at describing the
data they have, but not the underlying phenomenon.

It's counterintuitive but mathematically true.
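
A rough sketch of that failure mode with synthetic data and plain least squares
(nothing here comes from a real dataset): hold the number of observations fixed
and keep adding pure-noise features, and the fit on the training data keeps
improving while the fit on held-out data falls apart.

    import numpy as np

    rng = np.random.default_rng(1)
    n_train, n_test = 50, 50

    def r_squared(y_true, y_hat):
        return 1 - np.sum((y_true - y_hat) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

    # The true relationship depends on a single feature; everything else is noise.
    def make_data(n):
        x_real = rng.normal(size=n)
        y = 2.0 * x_real + rng.normal(size=n)
        return x_real, y

    x_tr, y_tr = make_data(n_train)
    x_te, y_te = make_data(n_test)

    for n_noise in (0, 10, 30, 45):
        # Design matrix: the one real feature plus n_noise junk features.
        X_tr = np.column_stack([x_tr] + [rng.normal(size=n_train) for _ in range(n_noise)])
        X_te = np.column_stack([x_te] + [rng.normal(size=n_test) for _ in range(n_noise)])
        coef, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
        print(n_noise, "noise features:",
              f"train R^2={r_squared(y_tr, X_tr @ coef):.2f}",
              f"test R^2={r_squared(y_te, X_te @ coef):.2f}")
    # Training R^2 climbs toward 1 as junk features are added; test R^2 drops.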

~~~
mikeskim
This is incorrect in almost every way. When you have 2^m independent
observations that you can use to cross-validate (where m is very large),
overfitting is exceptionally difficult almost regardless of the number of
features you have. Overfitting typically occurs when the number of data points
is small overall and small compared to the number of features, and when the
observations are not iid.
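
A quick synthetic sketch of that point (same caveats as the snippet above):
with enough iid rows, even a few hundred junk features barely separate the
training fit from the held-out fit.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 100_000        # lots of iid observations
    n_noise = 500      # still plenty of junk features

    x_real = rng.normal(size=n)
    y = 2.0 * x_real + rng.normal(size=n)
    X = np.column_stack([x_real, rng.normal(size=(n, n_noise))])

    def r_squared(y_true, y_hat):
        return 1 - np.sum((y_true - y_hat) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

    # Simple holdout validation: fit on one half, score on the other half.
    half = n // 2
    coef, *_ = np.linalg.lstsq(X[:half], y[:half], rcond=None)

    print("train R^2:", round(r_squared(y[:half], X[:half] @ coef), 3))
    print("test  R^2:", round(r_squared(y[half:], X[half:] @ coef), 3))
    # With 50,000 rows per fold and ~500 features, the two numbers stay close.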

~~~
mebassett
I think he's talking about growth in the features (dependent variables) of
your dataset while keeping the number of independent observations constant,
not growth in the dataset due to new independent observations.

I think he's right to bring it up - I find folks propose new features far
more frequently than new observations become available.

~~~
Fede_V
When people talk about big data, they usually mean datasets with lots of
observations (i.e., rows), not datasets with lots of features but few rows -
those are far more common in fields like genetics or the omics sciences in
general.

------
rm999
I think "big" data is being confused with "ubiquitous" data in this article.
Larger datasets will always lead to more statistical confidence in a
conclusion you make from the data. The article does a great job of explaining
the caveats to this (data skews, misuse of stats, external factors), but those
issues exist with small data too. In other words, this isn't about the volume
of data. I think there are two real and different issues at play here, and
both come instead from the ubiquitousness of data nowadays:

1. It's become easier to do more experiments, so even experts are more likely
to produce some bad conclusions.

2. Data has become much more accessible, so people without rigorous stats
backgrounds have an easier time abusing the hell out of stats on datasets.

~~~
Obi_Juan_Kenobi
> Larger datasets will always lead to more statistical confidence

I think I understand what you're trying to say, but I don't agree with how
you've phrased it. More data can lower p-values or increase power for a given
analysis, but the assumptions that go into using data don't change when you
simply have more of it. And those assumptions are everything in statistics.

In fact, I think the temptation of 'more is better' leads to more use of
whatever is easily available, which can be highly biased. You also get many
more 'significant' results, which are more tempting to believe. It's harder to
shy away from a very low p-value, even when you know the sampling may not be
appropriate.

I'm a biologist, and as a field we have a lot of issues to address with
bioinformatics. Over-eager investigators will pull out interesting tidbits
from datasets without considering how problematic that sort of hypothesis
generation can be. Fortunately the grant reviewers seem to be well aware of
this for the most part. I've heard of many people having grants rejected
because they thought they found a 'one in a million' phenomenon, but it turns
out they looked a million times to find it :) Good bioinformatics is still
firmly grounded in genetics.
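
The 'looked a million times' arithmetic is easy to check (back-of-the-envelope
only, not any particular dataset):

    p = 1e-6       # a "one in a million" phenomenon
    looks = 10**6  # how many places we went looking for it
    at_least_once = 1 - (1 - p) ** looks
    print(f"chance of 'finding' it at least once: {at_least_once:.3f}")  # ~0.632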

~~~
rm999
Your comment is mostly addressing what I referred to as "data skews, misuse of
stats, and external factors", which is what the article is about. Yes, those
are huge issues, but they have little to do with data size.

I'm confused about why you don't agree with my phrasing. To be clear, and from
your comment I think you already know this, the p-value is _not_ a measure of
confidence. The p-value is a terrible metric that jumbles confidence together
with effect size. As you get more data, you gain the ability to properly
detect small effect sizes with confidence; hence the part of my comment you
quoted and seem to disagree with. But people who don't have experience working
with large datasets see a low p-value and often think they have a big effect
size with reasonable confidence instead of a small effect size with very high
confidence. Chalk this up to a "misuse of stats" :)
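
A simulated example of that distinction (synthetic data, not from any real
study): a mean shift of one hundredth of a standard deviation, measured on a
couple of million samples per group, gives a vanishingly small p-value while
the effect size stays negligible.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    n = 2_000_000

    a = rng.normal(loc=0.00, scale=1.0, size=n)
    b = rng.normal(loc=0.01, scale=1.0, size=n)  # true effect: 1% of a standard deviation

    t, p = stats.ttest_ind(a, b)
    print(f"p-value: {p:.2e}")                               # tiny -- "highly significant"
    print(f"observed mean difference: {b.mean() - a.mean():.4f}")  # still ~0.01 SD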

