
Correlation does not even imply correlation - luu
http://andrewgelman.com/2014/08/04/correlation-even-imply-correlation/
======
jawns
I wrote "Correlated: Surprising Connections Between Seemingly Unrelated
Things" ([http://www.correlated.org](http://www.correlated.org)), and one of
the things I run into a lot when promoting the book is the "correlation
doesn't imply causation" line.

Obviously, the statistics I present in the book and on the website are tongue-
in-cheek, but I like to take issue with that line, because typically, people
don't fully understand what it means. I use a variation of the line in the
OP's post -- that "nothing other than correlation implies causation."

~~~
micro_cam
You're joking right? Uncorrelated things can easily be causal.
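(Not the commenter's code — just an illustrative Python sketch.) A relationship can be fully deterministic, i.e. as causal as it gets, and still have exactly zero linear correlation:

```python
# y is completely determined by x (a perfect causal link),
# yet the linear (Pearson) correlation is exactly 0.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]  # y = x^2: symmetric about 0, so cov(x, y) = 0

print(pearson(xs, ys))  # 0.0
```

The symmetry does all the work here: every positive contribution to the covariance is cancelled by a mirror-image negative one.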

~~~
kretor
The right way to disprove "nothing other than correlation implies causation"
would be to find one causation whose existence we know from something other
than correlation.

But you won't find it. Think about any causation. Now ask yourself how we know
that it exists. The answer will always be correlation.

~~~
micro_cam
Causal calculus, mutual information, random forest importance scores, various
hypothesis tests and other methods can all imply causation as well as or
better than correlation (especially in the case of non-linear or multivariate
association). All these methods and more are widely used in the literature.
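To make the mutual-information point concrete, here is one simple sketch (a plug-in estimator for discrete samples, chosen for illustration — not necessarily what the commenter has in mind). It picks up the same y = x² dependence that Pearson correlation misses entirely:

```python
from collections import Counter
from math import log2

# Plug-in estimate of mutual information I(X; Y) for discrete samples.
def mutual_information(xs, ys):
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]  # zero Pearson correlation, but clearly dependent

print(mutual_information(xs, ys))  # ~1.52 bits: the dependence is detected
```

Mutual information is zero iff the variables are independent, so unlike Pearson's r it cannot be blind to a purely non-linear dependence (though plug-in estimates are biased on small samples).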

~~~
kretor
Those methods still rely on correlation. To be clear: we are not talking only
about linear correlation here, which is just one of several kinds of
correlation.

~~~
micro_cam
No they don't. They work directly with underlying estimates of probability
distributions, entropy or impurity decrease in machine learning models.

Another example: mendelian inheritance patterns in a pedigree study.

If you know of a good measure of non-linear correlation, please let me know.
And publish a paper in Science or Nature like the MIC/MINE people did (a
measure that has issues in practice).

~~~
kretor
To estimate probability distributions, you need data that is non-random. Non-
random means there is a pattern, and that is another word for correlation.

In using those methods you may never calculate correlation as a number, but
when those methods find something, you are still relying on the fact that
there is a correlation.

------
bluthru
[http://www.slate.com/articles/health_and_science/science/201...](http://www.slate.com/articles/health_and_science/science/2012/10/correlation_does_not_imply_causation_how_the_internet_fell_in_love_with_a_stats_class_clich_.html)

------
erikb
Isn't this statement taking it too far? As far as I understand, "implies X"
means that there is a chance >0% that X might be true. Therefore "does not
imply X" would mean that the chance is 0%. But when you have an unknown
dataset showing X (with X being "A correlates with B", for instance), then the
chance of X being true is in fact greater than 0%. It might be so negligibly
above 0% that you could conclude it is basically 0%, but it is still >0% in
the mathematical sense. Therefore seeing correlation should imply correlation.

~~~
j2kun
"Imply X" means "one can logically _prove_ X follows as a consequence." When
you say correlation does not imply causation you're saying "there are examples
of things which correlate but have no casual relationship." Proving such a
claim (by finding an example) is a logical dis _proof_ of the claim that
correlation implies causation. Logical implication has nothing to do with
probabilities in this sense.

The point of the article is that correlation (of observed data) does not
logically imply correlation (of the underlying phenomena, usually in a more
general setting than the data allows).

~~~
erikb
I see, then I misunderstood the definition of "imply X". Thanks :)

------
snowwrestler
The key point is that there is more to science than statistics. To reword:
statistics alone are not sufficient to create scientific knowledge.

To create scientific knowledge you need to make a prediction (a.k.a. a
hypothesis) and test it. To make a hypothesis you need some notion of a causal
mechanism; to run the experiment, you interfere with that mechanism and
observe what happens.

If all you have is a statistical correlation, and you haven't identified or
altered a mechanism...you really don't have much. That is what this author is
getting at.

------
cashoil
Statistical correlation (revealed by some statistical test) is one thing.

The common meaning of correlation is a close relationship embedding some
causation in it (A implies B, or B implies A, or the same thing is the root
cause of both A and B).

You can find that two series have high correlation (statistically) even when
this happens just by chance.

OK.

------
ninkendo
I may be thick but I don't get the tagline at all. How can A not imply A for
any value of A? (A = correlation in this case.)

Or maybe this is one of those clickbait titles that state something logically
contradictory so that you'll get miffed and read the rest.

~~~
Double_Cast
Correlation (in sample) != Correlation (in population)

Also, apophenia. E.g. consider constellations. Stars are distributed randomly,
but that doesn't stop us from making patterns out of them anyway.
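A quick sketch of that sample-vs-population gap (Python, with made-up illustrative numbers): two streams that are independent by construction, sampled in small batches, routinely produce sizable sample correlations.

```python
import random

# Pearson correlation, computed from scratch for self-containedness.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(0)
big = 0
for _ in range(1000):
    xs = [random.random() for _ in range(10)]  # independent by construction,
    ys = [random.random() for _ in range(10)]  # so the true correlation is 0
    if abs(pearson(xs, ys)) > 0.5:
        big += 1

print(big)  # a sizable fraction of the 1000 small samples show |r| > 0.5
```

With only 10 points per sample, the null distribution of r is wide, so in-sample "patterns" like this are expected even when the population correlation is exactly zero.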

~~~
jerf
In other words, it equivocates:
[http://en.wikipedia.org/wiki/Equivocation](http://en.wikipedia.org/wiki/Equivocation)
Wikipedia concentrates on the fallacy angle... in this case it isn't a
fallacy, it's sloganeering, since the point is to encourage the viewer to be
confused and dig into the two different meanings being used. I think it's not
likely to work too well, though.

------
theophrastus
Non-correlation is so correlated with non-causation that it requires it. And
that forms the basis of a lot of "you accept the data you're dealt" science.

~~~
micro_cam
Not true. It is easy to find or construct a causal non-linear relationship
that won't show up in correlation tests. Correlation really isn't that great a
measure.

~~~
theophrastus
You are correct if what you're implying is that correlation isn't a robust
measure. Non-correlation, on the other hand, can be quite robust if your test
isn't constructed badly (or, as you imply, 'constructed' badly). It's
important to consider the meaning of such measures without regard to the
quality of the test, which can always be faulty.

~~~
micro_cam
That is my point exactly, though I am extremely skeptical of any test for non-
correlation. Gelman actually has some other articles worth reading on how
dangerous it can be to make policy decisions based on such tests, with real-
world examples including traffic laws.

------
porter
Man, for all the critics out there quick to point out that correlation is not
equal to causation, nobody ever seems to explain what needs to be in place to
actually show causation.

To establish causation you need 3 things:

1) Correlation

2) Temporal precedence. That is, you have to show that the cause occurs in
time before the effect

3) A lack of other plausible explanations

If more people knew the above, the world would be a better place. Even the
Udacity statistics course failed to mention it, even though they hammered home
that correlation does not imply causation.

~~~
czr80
When I get a cold I pray to Zeus. It always clears up in a few days. Can't
think of any other reason for it getting better, therefore by your 3 criteria
I'm justified in thinking Zeus cures my colds.

Actually, to establish causation, in as much as this is possible at all [1],
you need a predictive model and controlled experiments.

[1] following Hume, you can never definitively show causation.
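A toy simulation of the rebuttal (illustrative, all probabilities invented): recovery depends only on a hidden variable, yet the "treatment" looks effective in observational data because it tracks that hidden variable. Randomizing the assignment makes the apparent effect vanish.

```python
import random

random.seed(1)

# Toy model: recovery depends only on a hidden variable C ("rest"),
# never on the treatment A ("praying to Zeus").
def observed_effect(randomized, n=10000):
    groups = {True: [], False: []}
    for _ in range(n):
        c = random.random() < 0.5                 # hidden cause
        a = (random.random() < 0.5) if randomized else c  # observational: A tracks C
        recovered = random.random() < (0.9 if c else 0.5)  # only C matters
        groups[a].append(recovered)
    return (sum(groups[True]) / len(groups[True])
            - sum(groups[False]) / len(groups[False]))

print(observed_effect(randomized=False))  # ~0.4: A looks like it "works"
print(observed_effect(randomized=True))   # ~0.0: randomization kills the effect
```

This is exactly why the controlled experiment matters: randomization severs the link between the treatment and whatever hidden variables drive the outcome.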

~~~
tel
What's Hume's objection to proving causality? (Not that I intend to challenge
it, I would just like to hear it characterized).

~~~
snowwrestler
You can never rule out hidden variables.

Edit to add: because they are hidden.

------
coldcode
Wouldn't this make political polling pointless?

~~~
Dwolb
No. Model-building relies on a given set of assumptions that may or may not be
true. Once you've agreed to a set of assumptions, the data becomes meaningful
within the framework. The framework can be bent and shaped as the assumptions
evolve.

If we can assume that, on average, a random set of people who are randomly
asked about a political topic is a reflection of the population as a whole, we
can start to draw meaningful conclusions about that topic.
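Under that random-sampling assumption, the standard back-of-the-envelope calculation looks like this (textbook normal approximation; the poll numbers are invented for illustration):

```python
from math import sqrt

# 95% margin of error for a simple random sample: 1.96 * sqrt(p(1-p)/n)
def margin_of_error(p, n):
    return 1.96 * sqrt(p * (1 - p) / n)

# A poll of 1000 people showing 52% support really says 52% +/- ~3.1%.
print(round(margin_of_error(0.52, 1000), 3))  # 0.031
```

The formula is only as good as the assumption: it quantifies sampling noise, not bias from a non-representative sample.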

The OP article points to the concept of spurious correlation [1] which is a
danger if you have very little domain expertise in the data that you're
working with. e.g. If your regression shows US GDP is statistically
significantly affected by Bangladesh butter production, you may want to
discuss the results with domain experts about why the result may or may not be
spurious.

[1]
[http://en.wikipedia.org/wiki/Spurious_correlation](http://en.wikipedia.org/wiki/Spurious_correlation)

------
waps
Assuming:

1) every variable you calculate with obeys the law of large numbers (this
means, among other things, that if you find a correlation you can't tell
anyone involved in the variables, even yourself, because you'll act on it and
change it, at which point the correlation won't hold; it also won't work for
chaotic variables, like pretty much anything involving human actions)

2) the correlation remains intact over longer time periods. The variables must
be sampled without bias. You have properly separated your concepts and made
sure they actually represent what you want to be looking at. Etc. etc. (this
is essentially saying: don't fuck up the math before you calculate
correlation, and obviously don't fuck it up afterwards either; in other words,
correctly check for statistical significance)

3) You have to check with every time offset, and even with variable time
offsets. This is too complex to go into.
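The time-offset point can be sketched in a few lines (Python, synthetic data, invented lag): scan correlations at several offsets rather than only at lag zero.

```python
import random

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(2)
n = 500
a = [random.gauss(0, 1) for _ in range(n)]
# b copies a with a 2-step delay (plus a little noise); first two values are noise.
b = [random.gauss(0, 1) for _ in range(2)] + \
    [a[t - 2] + 0.1 * random.gauss(0, 1) for t in range(2, n)]

def lagged_r(lag):
    return pearson(a[:n - lag], b[lag:])  # correlate a[t] with b[t + lag]

for lag in range(4):
    print(lag, round(lagged_r(lag), 2))  # only lag 2 lights up
```

A real analysis would also correct for the multiple comparisons introduced by scanning many offsets, which is part of why this gets complex.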

Then correlation implies a "causal relationship", meaning corr(A, B) > 0 iff:

1) A causes B

2) B causes A

3) there is some external factor C that causes both A and B
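The third case is easy to simulate (Python sketch, invented numbers): a hidden C drives both A and B, which then correlate despite having no direct link.

```python
import random

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(3)
c = [random.gauss(0, 1) for _ in range(2000)]   # external factor C
a = [x + random.gauss(0, 1) for x in c]         # C causes A
b = [x + random.gauss(0, 1) for x in c]         # C causes B

# Neither A causes B nor B causes A, yet they correlate (~0.5 here,
# since C contributes half of each variable's variance).
print(round(pearson(a, b), 2))
```

Intervening on A here would change nothing about B, which is the practical difference between cases 1/2 and case 3.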

Now keep in mind that there is no such thing as a root cause. There is simply
a chain of events, and if the next event wouldn't have happened without the
previous one, the first event is said to be the cause of the second. In "I
shot him because his bees annoyed me", this "algorithm" would find the bees as
guilty as the perpetrator: both are causes of the death.

Also keep in mind that this only works AS LONG AS YOU LEAVE THE CAUSAL CHAIN
ALONE. If you calculate correlation and find that, say, "BAC" always goes up 2
days before "JPM" goes up, and then you proceed to buy JPM when BAC goes up,
you've just invalidated your conclusion (translation: this works better the
smaller an investor you are).

There is also absolutely no guarantee that killing off one chain of events
won't simply lead to another. Say you have a harbor with 2 entrances, and
because one is wider, every ship uses it. Therefore "getting to that harbor"
correlates perfectly with "going through entrance A". Obviously blocking
entrance A won't lead to no more ships in the harbor.

So this DOES NOT match the human/legal idea of "cause", and can never yield
useful actions to take. It is nevertheless a useful metric.

