My favourite counter-example is the one with a thermostat.
If someone adjusts the thermostat so that the temperature in the room stays constant no matter the temperature outside, and you then record the indoor temperature and the thermostat setting over many days, the two series will be completely uncorrelated. But you will find a perfect correlation between the thermostat setting and the outside temperature.
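A quick toy simulation makes this concrete. This is just a minimal sketch in Python; the outside temperatures, the linear controller rule, and the tiny sensor noise are all invented for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    outside = rng.normal(10.0, 8.0, size=365)        # made-up daily outside temps
    setting = 21.0 - 0.5 * (outside - 21.0)          # dial goes up when it gets cold out
    inside = 21.0 + rng.normal(0.0, 0.01, size=365)  # held essentially constant

    print(np.corrcoef(setting, inside)[0, 1])   # ~0: setting vs. inside temperature
    print(np.corrcoef(setting, outside)[0, 1])  # -1.0: perfect (anti)correlation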
That's a nice example! Usually people's intuition about causation and correlation gets thrown off because they forget the impact of other variables in the causal network. Normally I've got examples for "correlation -/-> causation", but this one is great for the other direction.
When the temperature outside goes down, the thermostat goes up, and vice versa. They are perfectly (negatively) correlated, even though you can't really say one causes the other (outside temperature does, in a very roundabout way, cause the thermostat setting).
I like that example because it counters both "causation implies correlation" and "correlation implies causation".
I'm confused. Don't you simply set a thermostat to the desired indoor temperature and leave it? Assuming the desired indoor temperature is constant, I would expect the thermostat setting to be constant, the internal temperature to be nearly constant, the outdoor temperature to be uncorrelated, and the energy usage of the heating/cooling system to be correlated with the outdoor temperature (namely, the difference between outdoor temperature and desired internal temperature).
I think the confusion arises from the term "thermostat". The example works much better with a "heating power dial", not a "desired indoor temperature dial".
I think he means not the internal temperature but the differential the thermostat sends to the heating, or something like that. I kind of get the analogy, but it's a little tortured if people can't pick it up immediately.
Here's a harder question: is it ever possible to deduce causation based on data alone? We all know that correlation does not imply causation, but is there anything that does?
If not: prove it. And if yes, under what conditions exactly?
The do-calculus was developed in 1995 to facilitate the identification of causal effects in non-parametric models. The completeness proofs of [Huang and Valtorta, 2006] and [Shpitser and Pearl, 2006] and the graphical criteria of [Tian and Shpitser, 2010] have laid this identification problem to rest. Recent explorations unveil the usefulness of the do-calculus in three additional areas: mediation analysis [Pearl, 2012], transportability [Pearl and Bareinboim, 2011] and meta-synthesis. Meta-synthesis (freshly coined) is the task of fusing empirical results from several diverse studies, conducted on heterogeneous populations and under different conditions, so as to synthesize an estimate of a causal relation in some target environment, potentially different from those under study. The talk surveys these results with emphasis on the challenges posed by meta-synthesis. For background material, see 〈http://bayes.cs.ucla.edu/csl_papers.html〉
This is heavily studied by Judea Pearl. I highly recommend his papers and books. The high-level summary is that you need to specify your causal assumptions and then use that information to justify or disprove experiments and tests as being sufficient to make causal claims.
The reason that's a hard question is that while most people have what seems to be an "obvious" definition of "causation" in their head, it turns out to be fairly difficult to define in precise (and especially testable/decidable/falsifiable) terms.
In general, no. Simple counterexample: every day it rains my grass is wet, and every day it doesn't rain the grass is dry. Does the rain cause the grass to be wet? Well, probably. But maybe my grass is in a greenhouse, and I only turn on my sprinkler when it rains so the grass gets a "natural" amount of water. In this case the wet grass is not causally linked to the rain, even though it looks exactly like the situation where it is. My action of turning on the sprinkler is known as a "confounding factor".
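To see why the data alone can't settle it, here's a toy sketch of the two worlds (everything here is invented); the recorded (rain, wet) pairs come out identical in both:

    import numpy as np

    rng = np.random.default_rng(1)
    rain = rng.integers(0, 2, size=1000)  # 1 = it rained that day

    # World A: the rain falls on the grass and wets it directly.
    wet_a = rain.copy()

    # World B: greenhouse; I run the sprinkler exactly when it rains.
    sprinkler = rain.copy()
    wet_b = sprinkler

    # Identical data, different causal structure.
    print(np.array_equal(np.c_[rain, wet_a], np.c_[rain, wet_b]))  # True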
You need to make assumptions to be able to draw a causation from data. This is what people are doing when they design controlled experiments.
You have to assume the data you have is all that there is, and sometimes the math will only give a class of causal graphs instead of a single causal graph.
From what I can tell, that presentation is assuming a certain class of graph structures in its DACB example - for example that there isn't an element E that can causally link the other nodes, emulating a causal relationship without there being one. Assuming a class of structures is not a general approach; in my example this would be the equivalent of assuming there is no sprinkler, but how can you make that assumption from the data alone?
There's another issue: their approach is statistical, so even if you assume a class of graph structures you can only conclude that the causal structure "almost surely" exists, not that it does exist. What this means is that even with infinite data, you can't prove that the other structures are impossible. If I turn on my sprinkler when it rains the first {very large number} times, you may conclude that there is almost surely a causal link between the rain and the wet grass, but there is no reason I have to keep doing it. You aren't left with a proof.
What you're saying implies that I must turn on the sprinkler when it rains, which is not true. Just because I have my whole life doesn't mean I will for eternity. It's simply a correlation - I tend to turn on the sprinkler when it rains. The only causation is that the grass becomes wet when the sprinkler turns on.
The only way this would have meaning for me would be if causation has a special, jargon meaning in this context that specifically means direct, deterministic causation.
This is very much the field of Econometrics, perhaps the subset of Mathematics with the weakest theoretical footing. The short answer is no, not with 100% certainty, because you can never fully prove there isn't some other, bigger cause that you just don't see. [1] Once you accept that, there are ways to test theories on the data (it requires some knowledge of the subject matter) to tease it out. In this case you're not going to get an absolute law of science, but you might get a clue for a policy decision.
James Heckman [0] is one of several who make this their life's work. He has applied this analysis to fields like early childhood education.
Yes, under some assumptions. What you need, though, is free randomness: random information that isn't caused by anything. You perform a number of experiments, using the free randomness to set the independent variable. By observing what happens to the dependent variable, you can deduce causation.
Now, the challenge is that free randomness doesn't exist. But with a good approximation (i.e. something that has little common cause with the dependent variable and is itself influenced very little by it - a dice roll would in many cases be a very good approximation), we can get a result with high confidence.
This can be done based on data alone, if someone else has performed the experiments, or through a "natural experiment".
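Here's a minimal sketch of that idea (all numbers invented: the coin flip stands in for free randomness, the true effect is fixed at 2.0, and a lurking confounder biases the purely observational comparison):

    import numpy as np

    rng = np.random.default_rng(2)
    n = 100_000
    confounder = rng.normal(size=n)

    # Randomized: treatment set by a "free" coin flip, sharing no
    # common cause with the outcome.
    treated = rng.integers(0, 2, size=n)
    outcome = 2.0 * treated + confounder + rng.normal(size=n)
    print(outcome[treated == 1].mean()
          - outcome[treated == 0].mean())  # ~2.0: the true causal effect

    # Observational: treatment chosen based on the confounder instead.
    treated_obs = (confounder + rng.normal(size=n) > 0).astype(int)
    outcome_obs = 2.0 * treated_obs + confounder + rng.normal(size=n)
    print(outcome_obs[treated_obs == 1].mean()
          - outcome_obs[treated_obs == 0].mean())  # biased well above 2.0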
It's very difficult to prove, but easy to disprove (a change in one thing does not produce the expected change in the other).
Sometimes disproving the causation is enough to gain valuable knowledge.
I was wondering the same thing.
Imagine we run a study on programmers and observe that 80% of them like science. Then we meet a new programmer that wasn't part of the study; can we say he has an 80% chance of liking science?
Intuitively I would say yes, but I think that's wrong. Can someone help?
I don't know much more than you, but I'll give it a shot:
We can say whatever we like, really. If you mean "is it true", then yes: if our sample was representative of all programmers and not biased in some way, the new programmer will have an 80% chance of liking science.
In reality, though, "80% chance of liking science" is just a best guess. Since all we have to go with is the sample we already studied, we assume it's representative (or try to correct for the bias we know) and make a guess based on that.
I can clarify this. What I outline below is a little situation-dependent (it matches this hypothetical), but the intuition is generally correct and consistent with yours.
One way to see all the pieces is to recognize that at least in theory there are real answers to all of these questions which we are approximating. We could literally round up every "programmer" and ask them whether or not they "like science" and then get a completely accurate statistic that, say, "92% of programmers like science".
But that's infeasible (even buying that my quoted words could make literal sense), so instead we run a finite study. This is called taking a subpopulation of a superpopulation and drawing the statistic from that subpopulation. Now we see that in our subpopulation of N participant-programmers, 80% of them liked science.
Now let's bring in a new guy. Let's say we brought him in by taking every programmer in the superpopulation (replacing our sampled subpopulation back into the global pool) and picking one guy in a lottery. Provided that was a truly random lottery, we can use the 92% number from before and know that there's a 92% chance that our lottery picked a science-liker.
But we don't actually have that information. Instead, we only have an estimate of that 92% figure: our subpopulation statistic of 80%. If our error bars are fine with a 12% discrepancy, though, we're still golden.
But there's one final element: we want to be able to talk about our sampling process and the risks of using it. The only parameter of this process is the size of the subpopulation chosen. For instance, if we had sampled only 5 people then there's about a 1/4 chance that we'd arrive at an estimate of 80% and roughly a 2/3 chance we'd get 100%. This still leaves around 5% of the possible subpopulations we could choose that would tell us something really misleading, like 60% or fewer of programmers liking science.
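Those figures drop straight out of the binomial distribution, assuming the hypothetical 92% true rate and a sample of 5:

    from math import comb

    p, n = 0.92, 5                        # hypothetical true rate, sample size
    p4 = comb(n, 4) * p**4 * (1 - p)      # sample estimate comes out 80%
    p5 = p**n                             # sample estimate comes out 100%
    print(p4)           # ~0.287, about 1/4
    print(p5)           # ~0.659, roughly 2/3
    print(1 - p4 - p5)  # ~0.054, the misleading tail (estimates of 60% or less)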
This could be called sample statistic stability, or just straight up estimation error. It's rarely taken into account when people say things like "80% of programmers like science" because the usual assumption is that the sampled subpopulation is "big enough" to minimize this error sufficiently. In truth, though, it's hard to know exactly what the possible error of your sampling method really is.
So finally we want to know: given that we saw 80% of our subpopulation like science, what's the chance that a new, randomly drawn programmer also likes science? The honest truth is that there's a 92% chance, so we're ignoring some amount of error by taking our study on faith and saying that there's an 80% chance. However, given literally no other information, it's still our best bet to assume there's an 80% chance—it's better than picking any other number based on the data we've seen. [0] Finally, if we were to repeat this whole experiment over and over again, starting with fresh, random draws of the subpopulation, then we'd be, on average, correct in our guesses about the ratio in the superpopulation. [1]
[0] Bayesians here would say that we should mix in other information we might have to form a better estimate—I think that makes a lot of sense in a situation like this, especially if our subpopulation is tiny, so I'm trying to be really clear that we have no other information. Let's not talk non-informative priors.
[1] i.e. if we keep repeatedly re-estimating our "80% statistic" and using it honestly to make our guesses then it would bounce around based on our subpopulation, sometimes 80%, sometimes 100%, sometimes 20%, etc. If we multiply these estimates by how often they occur and sum it all up then we'll get exactly 92%. We'd get there faster just by drawing a bigger subpopulation, though.
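That averaging claim in [1] is easy to check by simulation, again assuming the hypothetical 92% rate and repeated 5-person studies:

    import numpy as np

    rng = np.random.default_rng(3)
    # A million repeated 5-person studies, each yielding an estimate in {0, 0.2, ..., 1}.
    estimates = rng.binomial(5, 0.92, size=1_000_000) / 5
    print(estimates.mean())  # ~0.92: the estimates average out to the true rate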
I responded at length to this elsewhere, but quite simply this isn't causation yet. This statement is entirely about the information in our head, not the effect of actions on the world.
The causal statement would be "training to become a programmer makes you like science more" and it already begins to indicate that we'd want to observe people who both do and do not decide to become programmers and question their preferences over time in order to have relevant information.
If by correlation we mean Pearson correlation (linearly related) the answer is no.
If by correlation we mean some hand-waving association between A and B then yes. Since if A causes B there is some hand-waving association between them, namely causation.
I suggest we reserve 'correlation' for linear relationships and stop using it in the second sense. It's unhelpful and confusing.
Two problems. First, a causal association may be invisible to even hand-waving criteria: think of good crypto PRNG output, which is fully determined by the seed and a serial number, but whose parameters are practically unfittable from the data because the function is so chaotic. Second, the more hand-waving you allow, the more totally incidental relations will look "true". With modern amounts of hypotheses, even Pearson correlation often becomes useless in this manner.
Note that I am criticising the idea of a "general" or "intuitive" correlation; actually useful correlation measures (like Pearson or Spearman) will always be tied to some model of dependence, and so can never directly capture dependence in the absolute, model-free sense.
But it's highly non-linear. They have an expected Pearson correlation of zero.
Note: there are simple counterexamples of variables that have a Pearson correlation of 0 yet a non-linear dependency structure, as in the 3rd row of this plot:
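A concrete check of the same phenomenon, as a minimal sketch: y below is fully determined by x, yet the Pearson correlation comes out essentially zero because the relationship is symmetric and non-linear:

    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.uniform(-1.0, 1.0, size=100_000)
    y = x**2  # y is entirely caused by x

    print(np.corrcoef(x, y)[0, 1])  # ~0: Pearson misses the dependence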
If the question is "does causation imply linear correlation", then the answer is obviously and trivially "no".
But we don't need a PRNG to see that. Take a series of chutes whose entrances are lined up in a row, and whose exits are arranged in a circle. Now start dropping balls in the entrances and see where they end up.
Obviously there is a causal relationship between where we drop the ball and where it lands, but it's not a linear correlation. There's still a correlation, though.
I guess an even more obvious case would be a cryptographic hash, with the variables being the input and the output.
But in this case there is still a correlation. In fact, the existence of the correlation is trivial to see: after a known transformation (the PRNG/hash itself), the correlation becomes linear.
Probably. One of the answers on there also has a discussion of referring to it as 'correlation', which has a specific statistical meaning that is often ignored when people trot out "Correlation != Causation".
A better wordsmithed version of the question would probably be "Does Association Suggest Causation?"
Correlation does not always mean Pearson's r, which is the thing with a specific statistical meaning. There are other specific correlation measures, and there is general correlation.