
Torture your data long enough and it'll tell you anything - DanielRibeiro
http://www.businessweek.com/magazine/correlation-or-causation-12012011-gfx.html?
======
refurb
As a scientist who has moved into the business world, it amazes me how
statistics are abused.

When I was conducting scientific research, the goal was to come up with an
air-tight (or as air-tight as possible) case for your hypothesis. If you
presented your findings at a meeting, you better be prepared for the onslaught
of questions like "Did you consider X?" and "What about Y?".

Then I moved to the business side and holy crap are the standards lower. Of
course it's easier to prove something in a lab than in the real world, but in
so many cases I've seen somebody say "If you do X, you will get Y result,
based on the data I analyzed". Then I raise my hand and say "But what about Z?
That could explain your results." and all I get is blank stares like I just
solved a differential wave function in my head.

------
rexf
Google Correlate could be a fun tool to find arbitrary correlation

<http://www.google.com/trends/correlate/draw>

~~~
alexchamberlain
That is a really cool tool. I will definitely be using it to _prove_ some
crazy statements.

------
DanielBMarkham
This subject must be in the air. This article came out, I just blogged on the
same general topic
([http://www.whattofix.com/blog/archives/2011/12/management-
by...](http://www.whattofix.com/blog/archives/2011/12/management-by-s.php)),
and just a minute ago I got through reading another article on cognitive bias.
[http://www.american.com/archive/2011/december/the-
political-...](http://www.american.com/archive/2011/december/the-political-
implications-of-ignoring-our-own-ignorance)

It's a great topic, especially for businesses and startups. I think the
problem is much, much deeper than correlation != causation. The basic problem
is that we don't understand how to deal with statistics, especially aggregate
numbers. This is a funny way to make a point, but the problem is waaaay deeper
than just confusion about correlation. The errors in scientific studies, for
instance, are just one example of the harm caused by these kinds of cognitive
blind spots. (I say blind spot instead of lack of math education because I
don't believe the root problem is a lack of understanding math. In my opinion,
something else is going on.)

------
easyfrag
A particular favorite of mine: <http://pubs.acs.org/doi/abs/10.1021/ci700332k>

~~~
kd1220
There is a logical explanation. Drivers are spending more time sorting through
lemons at the grocery store and end up driving less, thus traffic fatalities
fall.

------
mwexler
All these comments are true. But let's not make the similar error of saying
that correlations are bad.

Sometimes, knowing what caused something is the necessary answer, and for
those, a root cause analysis and proper experimental design for validation are
important. But sometimes, in business and in life, just knowing that things
hang together can be pretty handy.

Correlations are important clues. The entire "recommendation" world, from
Amazon's collaborative filtering to Hunch's "everything you might be
interested in" are all predicated on correlations.

No argument, saying correlation implies causation is bad. But it's just as bad
to say "therefore, correlation is bad". DanielBMarkham's article and this
BW.com post both show that it comes down to interpretation of what the data
says. It's understanding the limitations of what a number, or a trend, or even
a distribution can reveal. It's understanding what regression to the mean
actually means, or why we consider a distribution "normal"... and that
outliers actually can be profitable.

And it's a recognition that with the democratization of big data, it will get
worse before it gets better... but it will get better. 40 years ago, no one
ever saw the stock market on the news, or had access to its ups and downs
every second. We now all have a better understanding of stocks (well, ok,
that's a bit of a stretch, but you get my drift), and their dangers.
Similarly, as we get used to seeing lots more data, and discovering that if
you interpret it wrong, bad things happen... well, I expect more folks to ask
that next level of questions. Not all, and not much past that... but it will
be a start.

------
forrestthewoods
"There are three kinds of lies: lies, damned lies, and statistics."

I'd go so far as to say the problem is 100x more complicated than "Correlation
!= Causation". Given a set of factual statistics it's not terribly difficult
to present them in a truthful, reasonable manner that supports any side of a
given argument.

~~~
klodolph
Well, most people seem to forget that if you are looking for correlations
among N variables, you can't compare each pair with the same standards as if
you only had 2 variables. (Remember the recent article about neuroscience
papers? Same thing.)

So the damned lie of statistics is pretty subtle, you just have to omit the
number of variables you actually looked at when you present your data.
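
A minimal sketch of that effect, assuming the variables are nothing but
independent Gaussian noise and using the Fisher z-transform as an approximate
significance test (the function names, series count, and seed are illustrative
and not from the thread). It counts how many pairs look "significant" at the
usual per-pair threshold, and how many survive a Bonferroni correction that
divides the threshold by the number of pairs examined:

```python
import itertools
import math
import random
from statistics import NormalDist


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def p_value(r, n):
    """Approximate two-sided p-value for r using the Fisher z-transform."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def count_hits(num_series=40, length=20, alpha=0.05, seed=0):
    """Return (pairs tested, naive hits, Bonferroni-corrected hits)."""
    rng = random.Random(seed)
    # Every series is independent noise, so any "hit" is a false positive.
    series = [[rng.gauss(0, 1) for _ in range(length)]
              for _ in range(num_series)]
    pairs = list(itertools.combinations(range(num_series), 2))
    pvals = [p_value(pearson(series[i], series[j]), length) for i, j in pairs]
    naive = sum(p < alpha for p in pvals)
    corrected = sum(p < alpha / len(pairs) for p in pvals)
    return len(pairs), naive, corrected


if __name__ == "__main__":
    pairs, naive, corrected = count_hits()
    print(f"{pairs} pairs, {naive} naive hits, {corrected} after Bonferroni")
```

With 40 variables there are 780 pairs, so a 5% per-pair threshold should flag
dozens of pairs by chance alone. Bonferroni is only one (conservative) way to
account for that.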

~~~
haliax
What standards/techniques do you use?

~~~
CognitiveLens
The field of statistics is very concerned with mis-representing data, and has
developed a huge array of methods for dealing with uncertainty. It is
important for people to appreciate that there are some minimum steps that must
be taken for statistical analysis to claim validity. One of the first, which
is almost universally ignored in pop-stats like the OP, is to _state your
assumptions_, then justify them.

Stats are open to interpretation, which is why academia favors peer review,
where faulty underlying assumptions can be checked.

------
Sukotto
I wish schools taught math leading to statistics and probability instead of
leading to calculus. I believe that would be much more useful for the average
citizen.

~~~
dxbydt
> wish schools taught math leading to statistics and probability instead of
> leading to calculus

This is silly. All cumulative distribution functions are cadlag, so how can
you even teach probability without the notion of right-continuous with left
limits? That means you have to resort to limits & derivatives => Calc.

Actually, the argument for combining Calc & Stats is very compelling, because
there is so much synergy. How can you teach a continuous probability
distribution like, say, the Gaussian without teaching how to integrate under
the curve for the cumulative distribution function, or obtaining the
probability density function via the derivative, or obtaining the variance,
aka the second central moment, via the moment generating function? That means
you now have to teach at least some Fourier transforms, which again means
Calculus. At both UChicago & Stanford, where I learnt all of my probability,
calculus was quite intertwined with the teaching of probability. I believe
it's the same case in most other schools as well.

Without calc in probability, you can do "lame" stuff like discrete
distributions (Binomial, Poisson etc.), but even there, the key insight is to
show how the CDFs of the discrete distributions, which will generally have
terribly complicated formulae with giant factorial expressions, can be very
nicely approximated by the continuous distributions for large n, small p etc.
(see the continuity correction:
<http://en.wikipedia.org/wiki/Continuity_correction>). So for a large number
of coin-flip trials, you use a Normal to approximate the CDF because otherwise
the original binomial CDF is too hard to compute with your TI-84s (because you
have one giant factorial divided by another giant factorial and the numerical
overflows will kill the computation unless you are very careful about how you
go about computing the result).
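
As a rough illustration of both points, here is a sketch that computes the
binomial CDF exactly by working with log-factorials (one way of being "very
careful" about overflow) and compares it with the Normal approximation, with
and without the +0.5 continuity correction. The function names and the
`continuity` flag are made up for this example:

```python
import math
from statistics import NormalDist


def binom_cdf_exact(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with 0 < p < 1.

    Each term is built in log space via lgamma, so the giant factorials
    never appear as actual numbers and cannot overflow.
    """
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k + 1):
        log_coeff = (math.lgamma(n + 1) - math.lgamma(i + 1)
                     - math.lgamma(n - i + 1))
        total += math.exp(log_coeff + i * log_p + (n - i) * log_q)
    return min(total, 1.0)


def binom_cdf_normal(k, n, p, continuity=True):
    """Normal approximation to P(X <= k), optionally continuity-corrected."""
    mu = n * p
    sigma = math.sqrt(n * p * (1 - p))
    x = k + 0.5 if continuity else k
    return NormalDist(mu, sigma).cdf(x)


if __name__ == "__main__":
    n, p, k = 1000, 0.5, 510  # 1000 fair coin flips, at most 510 heads
    print("exact              ", binom_cdf_exact(k, n, p))
    print("normal, corrected  ", binom_cdf_normal(k, n, p))
    print("normal, uncorrected", binom_cdf_normal(k, n, p, continuity=False))
```

For a fair coin and large n the corrected approximation should land very close
to the exact value; the agreement gets worse when p is near 0 or 1 or n is
small.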

My favorite go-to guide remains the excellent Calc & Stat Dover book (
[http://www.amazon.com/Calculus-Statistics-Dover-Books-
Mathem...](http://www.amazon.com/Calculus-Statistics-Dover-Books-
Mathematics/dp/0486449939/ref=ntt_at_ep_dpt_7) ), which combines Calc & Stats
from page 1. There is simply no better way to learn stats than via calc.

~~~
timwiseman
You are right in the sense you absolutely cannot get a deep understanding of
Statistics without Calculus.

But with a mere background of high school algebra, you can learn more about
Stats than most college graduates have, and that knowledge is far more
relevant to the day-to-day lives of the average person in America than
Calculus is.

~~~
comicjk
True, because the average person in America has a pretty crappy job.

Those who get a chance to follow through with calculus and apply it do much,
much better.

------
stygianguest
In fact, this is the origin of the term data-mining: to find whatever you need
in data. Funny how it changed from a pejorative to a respected --or at least
well-paid-- practice.

~~~
CognitiveLens
You are over-simplifying and therefore trivializing what data-mining is.
Data-mining is about deriving fact-based conclusions from complex information as an
alternative to making decisions based on intuition or ignorance. Like almost
anything complex, it can be done very poorly (as in the OP), or it can be done
well. That doesn't mean that it originates in mis-representing information for
the sake of 'finding whatever you need'.

------
ben_h
The frame showing Facebook vs Greek debt is particularly good. We ran a piece
in a similar vein last week: [http://theconversation.edu.au/how-david-beckham-
caused-globa...](http://theconversation.edu.au/how-david-beckham-caused-
global-warming-the-man-u-climate-model-4548)

------
tatsuke95
This is a massive exaggeration. Of course you can find correlation over
specific periods between random series. But when you're doing real analysis,
the series you use aren't random (like the shape of a mountain). The idea is
to draw an inference first, then see what the associated data says.
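
For what it's worth, the "random series correlate" effect is easy to
reproduce. The sketch below assumes the series are independent Gaussian random
walks (an assumption for illustration, not something the article states) and
compares the average absolute correlation of the raw levels with that of the
step-to-step changes. All names, the trial count, and the seed are
hypothetical:

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def random_walk(rng, steps):
    """Cumulative sum of independent standard normal steps."""
    pos, path = 0.0, []
    for _ in range(steps):
        pos += rng.gauss(0, 1)
        path.append(pos)
    return path


def diffs(path):
    """Step-to-step changes of a series."""
    return [b - a for a, b in zip(path, path[1:])]


def mean_abs_corr(trials=200, steps=100, seed=1):
    """Average |r| between independent walks: (levels, changes)."""
    rng = random.Random(seed)
    level_total = diff_total = 0.0
    for _ in range(trials):
        a, b = random_walk(rng, steps), random_walk(rng, steps)
        level_total += abs(pearson(a, b))
        diff_total += abs(pearson(diffs(a), diffs(b)))
    return level_total / trials, diff_total / trials


if __name__ == "__main__":
    levels, changes = mean_abs_corr()
    print(f"mean |r| of levels:  {levels:.2f}")
    print(f"mean |r| of changes: {changes:.2f}")
```

Trending series tend to look correlated even when they are unrelated, while
their changes do not, which is one reason to start from a hypothesis rather
than from whichever pair of curves happens to line up.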

Of course, anyone beyond the base level of wisdom in this field understands
this. It just annoys me that people attempt to diminish the value of
statistics with an argument like this.

~~~
timwiseman
I don't think they are trying to diminish the value of statistics at all, but
rather point out that it is easy to misunderstand or even deliberately abuse
them.

This is more a warning to people without an understanding of statistics,
because most people out there do not have a deep grasp of the fact that
correlation does not imply causation.

------
CognitiveLens
Articles like this tend to elicit an interesting response from people. On one
hand, many seem to believe that statistics = deceptive manipulation. On the
other, many call for better statistics education in schools. Sometimes, both
claims are made by the same individuals. So it seems that statistics education
has at least two goals: explain what statistics actually is, and then explain
how to do it correctly.

------
tomelders
Malcolm Gladwell take note.

------
iqster
Minor footnote: I thought the saying was "if you torture your data long
enough, it will CONFESS to anything."

------
beza1e1
Of course, xkcd has an img for that: <http://xkcd.com/552/>

~~~
ctdonath
And on a related note: <http://xkcd.com/982/>

------
_delirium
Fig. 6 is awesome. Clearly too striking a correlation to be merely
coincidence.

~~~
wtvanhest
I wasn't sure at first, but the gun illustration cemented it as fact.

------
tintin
<http://www.google.com/trends/correlate/comic?p=6> The bottom row explains it
all.

------
alphamale3000
The core idea is true and should be spread around, but their examples lack
refinement and subtlety.

------
unixIKnowThat
Splunk

