

The backlash against big data - rrtwo
http://www.economist.com/blogs/economist-explains/2014/04/economist-explains-10?fsrc=scn/tw_ec/the_backlash_against_big_data

======
j2kun
> Third, the problem of spurious correlations—associations that are
> statistically robust but only happen by chance—increases with more data.

Can someone explain why this is the case? I don't see how spurious
correlations wouldn't appear in small data just as well as big data.

~~~
ZenPro
Probability theory.

If I flip _one_ coin 100 times, the outcome is likely to be close to the
50/50 split you would expect.

If I flip _8 million_ coins simultaneously for 100 iterations, it is likely
that a number of the coins will display marked outlier behaviour (70 or more
heads out of 100, etc.).
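
To put rough numbers on the coin example, here is a minimal sketch that computes the exact binomial tail for a fair coin and scales it by the number of coins. The function names and the list of thresholds are illustrative assumptions, not something from the article.

```python
from math import comb

def tail_prob(n_flips: int, min_heads: int) -> float:
    """P(at least min_heads heads in n_flips flips of a fair coin), exactly."""
    favourable = sum(comb(n_flips, k) for k in range(min_heads, n_flips + 1))
    return favourable / 2 ** n_flips

def expected_outliers(n_coins: int, n_flips: int, min_heads: int) -> float:
    """Expected number of coins (out of n_coins) reaching min_heads heads."""
    return n_coins * tail_prob(n_flips, min_heads)

if __name__ == "__main__":
    # Thresholds are arbitrary illustrations of "outlier" behaviour.
    for threshold in (60, 70, 75, 98):
        print(threshold, expected_outliers(8_000_000, 100, threshold))
```

How extreme an outlier to expect depends heavily on the threshold: even with 8 million coins, something as lopsided as 98 heads stays astronomically unlikely, while milder thresholds such as 70 heads are expected to turn up on some coins.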

~~~
j2kun
Now I'm confused what is meant by spurious correlation. You're talking about
an event that is known to happen, and happens infrequently, and if you have
enough samples it's guaranteed to happen at least once with high probability.

Spurious correlation seems to be related to the use of ratios of independent
variables making correlation "out of thin air." The word "spurious" also seems
to be used in relation to hidden causation.

Neither of these two uses is related to this example, nor does the problem
(in the sense in the previous paragraph) "increase" with more data.

~~~
ZenPro
OK, my example was not perfect. I was just trying to make it accessible, to
show why spurious correlations are more likely in larger datasets, especially
very large ones. The outlier sequence of flips was meant as an analogy for the
correlation.

Given enough data, such a correlation will appear, but it cannot be trusted as
any sort of indicator.

Probably best just to read
[http://en.wikipedia.org/wiki/Spurious_relationship](http://en.wikipedia.org/wiki/Spurious_relationship)

~~~
j2kun
Yeah, I read that as well as the Wikipedia article on "Spurious correlation"
proper (which appears to be a different concept) [1]. Neither one seems to be
related to the amount of data being processed, nor to the description the
author gives of unlikely but statistically robust events.

Without an explanation I can only conclude that the author misused the term.
It's a shame when he's (she's?) writing to inform.

[1]:
[http://en.wikipedia.org/wiki/Spurious_correlation](http://en.wikipedia.org/wiki/Spurious_correlation)

~~~
ZenPro
If you do not personally understand something you automatically assume that an
author is incorrect?

[http://thefederalist.com/2014/01/17/the-death-of-expertise/](http://thefederalist.com/2014/01/17/the-death-of-expertise/)

That is your choice, of course, but it is classic Dunning-Kruger. You are not
required to form an opinion about concepts you are not _au fait_ with and to
do so reveals more about you than it does about the author.

I would read the Tom Nichols article above; it changed the way I approach
certain topics.

In all honesty, though, I am struggling to see what further explanation you
require.

In larger datasets the likelihood of spurious correlation increases. This is
because in large data sets, large deviations are vastly more attributable to
variance (or noise) than to information (or signal).

[http://valbonneconsulting.files.wordpress.com/2013/10/traged...](http://valbonneconsulting.files.wordpress.com/2013/10/tragedy.jpg?w=640)

 _Figure 18 shows the swelling number of potential spurious relationships. The
idea is as follows. If I have a set of 200 random variables, completely
unrelated to each other, then it would be near impossible not to find in it a
high correlation of sorts, say 30 percent, but that is entirely spurious.
There are techniques to control the cherry-picking (one of which is known as
the Bonferroni adjustment), but even then they don’t catch the culprits, much as
regulation doesn’t stop insiders from gaming the system._

[http://www.wired.com/2013/02/big-data-means-big-errors-peopl...](http://www.wired.com/2013/02/big-data-means-big-errors-people/)
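
The "200 unrelated random variables" claim in the quote can be checked with a small simulation. This is only a sketch under stated assumptions: the quote does not say how many observations each variable has, so `n_obs=50` is a guess, the variables are taken to be independent Gaussians, and the function names are made up for illustration.

```python
import random
from itertools import combinations
from math import sqrt

def standardise(xs):
    """Shift to mean 0 and scale to unit length, so a dot product is Pearson's r."""
    mean = sum(xs) / len(xs)
    centred = [x - mean for x in xs]
    norm = sqrt(sum(c * c for c in centred))
    return [c / norm for c in centred]

def max_spurious_correlation(n_vars, n_obs, seed=0):
    """Largest |r| over all pairs of n_vars independent Gaussian variables."""
    rng = random.Random(seed)
    columns = [standardise([rng.gauss(0, 1) for _ in range(n_obs)])
               for _ in range(n_vars)]
    return max(abs(sum(a * b for a, b in zip(u, v)))
               for u, v in combinations(columns, 2))

if __name__ == "__main__":
    # n_obs is an assumption; the quoted passage gives no sample size.
    print(max_spurious_correlation(n_vars=200, n_obs=50))
```

With 200 variables there are 19,900 pairs to compare, so the strongest of them can look impressive even though every variable was generated independently of every other.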

~~~
j2kun
Thanks, the linked article gives a much better explanation.

It's not about my personal understanding of something, but about the apparent
contradiction with other sources I trust and the unnecessary vagueness about a
subject that requires precision in terminology. "Statistically robust but
unlikely" has many meanings, but "spurious correlation" has only one. I am
training to be an expert in mathematics, so I do feel somewhat entitled to
discuss the finer (though perhaps overly pedantic) points. It's literally my
job to rigorously reason about these things. So I notice that the word "large"
means something very specific here (many independent variables), while you
appear to still mean "lots of data," which can happen with few variables just
as well. "Big data" encompasses both (volume and variety), which is why it's
confusing. So my need for a further explanation is because "in large data sets
large deviations happen more often" does not appear to address my question if
I'm specifically asking why volume would cause more spurious correlations (in
fact, it does not). My questions are admittedly vague as well, but this should
clear up my confusion.
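
The volume-versus-variety distinction above can be sketched in a few lines. This is an illustrative simulation, not anything from the article: the variables are assumed to be independent Gaussians, and the sizes (10 variables, 20 observations, and so on) are arbitrary choices.

```python
import random
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def largest_abs_r(n_vars, n_obs, rng):
    """Largest |r| among all pairs of n_vars independent variables."""
    data = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
    return max(abs(pearson(u, v)) for u, v in combinations(data, 2))

if __name__ == "__main__":
    rng = random.Random(1)
    # More volume: same 10 variables' worth of columns, more rows each time.
    print("more volume :", [round(largest_abs_r(10, n, rng), 2) for n in (20, 200, 2000)])
    # More variety: same 20 rows, more columns each time.
    print("more variety:", [round(largest_abs_r(k, 20, rng), 2) for k in (10, 40, 160)])
```

Under these assumptions, adding observations tends to shrink the largest chance correlation, while adding variables tends to grow it, which is the sense in which "large" has to mean many variables rather than many rows.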

I'm not saying the author doesn't know what spurious correlation is, just that
the author used the term incorrectly, or too vaguely to be called correct, in
that instance. There are many reasons to do this intentionally, I'm sure, but
as a consequence I will ask questions with trivial but technical answers. I
ask a hundred such questions every day, most of which I can answer myself, but
I'll continue to ask them even when (if ever) I'm considered an expert in any
topic.

