Hacker News
Correlation is usually not causation. But why not? (gwern.net)
114 points by Amarok on Sept 7, 2014 | 71 comments



All I know about causation and correlation I learnt hunting bugs in large legacy software systems. In that environment I got the impression that correlation almost never equalled causation, but that's only because the hardest bugs, the ones I remembered, were hard because the obvious correlations did not help identify the root cause. A similar argument might be made for scientific studies: most of the easy causes that can be identified from correlation have already been found, leaving the majority of new studies with correlations that don't easily establish causation.


I think this low-hanging-fruit idea is generally true for all of science: after centuries of science as a profession, pretty much all the easy stuff has already been done.

It's actually an argument for more cojones in science-- being willing to do bold stuff and explore "crazy" hypotheses. If all the easy stuff is done already, then picking methodically and timidly among the dregs is unlikely to ever yield anything.


An alternative viewpoint: everything's easy after it's been done, but it's hard up until then.

> after centuries of science as a profession, pretty much all the easy stuff has already been done.

But with the benefit of all that has already been done, shouldn't we be able to do things now that were previously impossible? In other words, "easy in 2014" != "easy in 1200".

> It's actually an argument for more cojones in science-- being willing to do bold stuff and explore "crazy" hypotheses.

Maybe doing whatever's easy is the most efficient (by time, money) way to make progress?


I'm not sure... seems like what you say may hold for a while, but eventually you start hitting a more objective sort of hard: things that are hard for human beings to comprehend due to the limitations of our intelligence itself. Beyond that there's probably an even harder hard-- when you start actually running out of new things to discover. Can there really be an infinite number of physical laws, principles, and useful relations? Or at some point have you actually found most of basic physics?

Once you start hitting those, you've either entered a permanent era of diminishing returns or one where you can only really make progress by radically redefining problems, making leaps, or trying wild and crazy ideas in the hopes of unlocking some isolated seam of high-value research that isn't connected to the others in the fitness/value state space graph.


Sure, we can't rule out those possibilities ... but to me it just sounds like wild speculation.


Yeah, debugging is great training in the scientific method.


This question is bound to be a can of worms. There has been a great deal written about the matter, particularly in regard to observational studies.

Drawing causal inference is overwhelmingly likely to be wrong when there is a good chance that unknown variables are influencing the correlates observed. In health-related sciences that more often than not is the case.

A few years ago there was a study correlating hours of TV watched and ADHD symptoms in children. The news media picked up on these "findings" and of course the causal influence of TV on ADHD was reported.

It was obvious that saying watching TV caused ADHD was absurd, and that other variables weren't taken into account; e.g., perhaps some other characteristic of ADHD kids prompted them to watch TV more than other kids.

There was a great article published in PLOS several years ago (ATM I don't have the link) showing mathematically that the odds were about 1 in a million that an observational study like the above would turn out to be a "true" causal relationship, and the author concluded most published studies were junk.

In experimental studies, variables are limited and controlled as well as possible. With fewer and known variables, correlations would have a greater chance of revealing a reproducible causal relationship among events.

The discussion gets tripped up when it comes to defining "cause" or "causal relationship". However controlled an experiment may be, there's a possibility that unknown variables were present and affected the phenomena occurring in it. Conclusions can't be absolute, but only true to some probability.

I think the history of science over the last 100 years or so has something to say about the nature of "causal relationships".


> There was a great article published in PLOS several years ago (ATM I don't have the link) showing mathematically that the odds were about 1 in a million that an observational study like the above would turn out to be a "true" causal relationship, and the author concluded most published studies were junk.

I very much doubt the second part of that, that most published studies are junk. The idea that not showing a causal link and only a correlational one makes a study junk is not held by anyone in the field who garners a whole lot of respect or notoriety. I'm not saying the quality is equal whatsoever, simply that correlation studies are not inherently junk - some are fantastic and some are quite the opposite.

There is no question that journals in general are pumping out a significant amount of junk (in studies of all types), but I speculate that the root cause has more to do with the rise of "publish or perish" and significant increases in grad school enrollment. Even worse is the notion that for a grad student, the number of publications bearing their name matters more for employment after graduation than the content they publish. So people have more pressure to publish than ever before [0], there is more competition for scarce funding so studies are vastly underfunded, and studies are rushed so another can begin to add to the resume.

That doesn't mean there is less good science being done either! There is probably more good science being published now than ever before; the problem is the signal-to-noise ratio has gone down, making it harder for good studies to get media attention and easier for the media to latch on to whatever story they think will get viewers.

[0]: Sidenote: this also makes it increasingly challenging not only to produce high-quality research in general, but to go beyond merely reporting a correlation and actually provide evidence for a causal link.


This, I think, is the study the first poster meant, Ioannidis 2005, 'Why Most Published Research Findings Are False':

http://www.plosmedicine.org/article/info%3Adoi%2F10.1371%2Fj...

It's quite a famous paper. Many rebuttals, comments, blog posts, etc. have been published; have a look at the comments on the PLOS site. I especially like this blog post summarizing more research:

http://simplystatistics.org/2013/09/25/is-most-science-false...

From the same author, here's a small overview of the discussion at the end of 2013:

http://simplystatistics.org/2013/12/16/a-summary-of-the-evid...

Quote on the Ioannidis paper:

>Under assumptions about the way people perform these tests and report them it is possible to construct a universe where most published findings are false positive results. Important drawback: The paper contains no real data, it is purely based on conjecture and simulation.
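The arithmetic behind that kind of conclusion is simple enough to sketch. The numbers below are illustrative assumptions, not figures from the paper: suppose only 1 in 1000 tested relationships is genuinely true, tests are run at alpha = 0.05, and power is 0.8.

```python
# Hypothetical inputs, chosen for illustration only.
alpha, power = 0.05, 0.8
prior = 1 / 1000          # fraction of tested hypotheses that are actually true

true_positives = prior * power            # real effects that reach significance
false_positives = (1 - prior) * alpha     # null effects that reach significance

ppv = true_positives / (true_positives + false_positives)
print(f"P(finding is real | significant) ≈ {ppv:.3f}")  # ≈ 0.016
```

Under these (pessimistic) assumptions, roughly 98% of "significant" findings would be false positives, which is the shape of the argument, even if the exact prior is debatable.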


Because you can find things that are correlated, but have no practical relationship - many (frequently humorous) examples here:

http://tylervigen.com


Datamining correlations like that confuses two issues: sampling error and 'everything is correlated'. When you find a correlation like that, it's unclear whether it's a spurious correlation which will disappear if you just collect more data (and caused by slicing the data in a thousand ways without any compensation by this) or whether there genuinely is some sort of opaque correlation which will persist no matter how much data you collect (maybe 'US spending on science, space, and technology' really does correlate with 'Suicides by hanging, strangulation and suffocation' through something like government stimuluses and economic growth).
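A quick simulation makes the "slicing the data in a thousand ways" point concrete: among enough pairs of completely independent random series, some pair will correlate strongly by pure chance. A minimal stdlib-only sketch:

```python
import itertools
import random
import statistics

random.seed(0)
# 100 independent random series of 20 points each -- no causal links anywhere
series = [[random.gauss(0, 1) for _ in range(20)] for _ in range(100)]

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# 100 series give 4950 pairs; the best-looking pair is impressive by chance
best = max(abs(pearson(a, b)) for a, b in itertools.combinations(series, 2))
print(f"strongest correlation among 4950 independent pairs: {best:.2f}")
```

Collecting more data makes this kind of spurious correlation evaporate, whereas a genuine (if opaque) common-cause correlation persists, which is exactly the distinction being drawn above.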


Excellent link, thanks for posting :)


Pot smoking kills bees is my favorite... ought to make it into a t-shirt.


Bad winter weather can cause auto accidents, and we expect a positive correlation between bad winter storms and winter auto accidents. Okay, but in the northern hemisphere, living in more northern latitudes also correlates with winter auto accidents but does not cause them.

For heart disease, we know that the main causes have to do with aging. Well, then, since now the audience for TV news is comparatively old, we can expect that watching TV news has positive correlation with heart disease. Still watching TV news does not cause heart disease.

In the US NE, hurricanes are positively correlated with pretty leaves on the trees, but the leaves do not cause the hurricanes, and the hurricanes do not cause the colors in the leaves. Instead, hurricanes are caused by the surface waters of the Atlantic Ocean, still hot from the summer sun, with cooler air on top, and that situation is caused by the fall weather, which also causes the colored leaves. So, colored leaves have a spurious correlation with hurricanes but do not cause them.

Correlation is much more common and much easier to establish than causality. Usually the convincing evidence of causality is some basic physical connection.


Your first example is wrong. If you move to a more northerly latitude then it will increase the risk of accidents for you (i.e. there is causation). It is indirect causation, but then ultimately (almost?) all causation is going to be indirect if you take the reduction far enough. That is bad weather doesn't directly cause accidents, but ice on the roads does ... etc.


No: the bad winter weather, say, ice on the roads, is the cause of the accidents. If we happen to have a winter with little or no snow or ice, then the number of auto accidents will go down no matter what the latitude, so latitude can't be the cause; ice and snow on the roads is. And if we live at a lower latitude, closer to the equator, and get a freak winter storm that puts a lot of ice and snow on the roads, then we will see auto accidents in spite of the latitude.

For indirect causation, I just said winter weather to try to be brief. But, sure, the cause is the ice and snow on the roads, not just the weather in some vague sense. I was trying to keep the explanation simple, avoid discussing indirect causation, and concentrate on the OP issue of causation versus correlation.


>Well, then, since now the audience for TV news is comparatively old, we can expect that watching TV news has positive correlation with heart disease. Still watching TV news does not cause heart disease.

Well, I'm not so sure. All this sitting to watch TV news, plus all the stress from bad news and fear-mongering...


Watching (bad) fear-mongering news doesn't cause stress. Rather, the story you tell yourself about what the news means causes the stress. One person (say, me for example) can watch the (bad) news and have a good laugh, while another (one of my housemates, for example) will watch the same thing and have a negative response to it.


Well, that's like "guns don't kill people, bullets (or holes in vital organs) kill people".

Technically correct, but doesn't really take away from the first thing.


Normally, a correlation admits a few confounders for causation, such as:

- inverse causation
- a common cause
- random correlation

The first two should be relatively stable and reproducible, and we could then proceed to find out about causation with an intervention study (e.g. if "good education" directly causes "good job prospects", will it help if we give people good education that wouldn't normally get one? Or does it improve people's educational achievement when we give them better access to jobs?)

The third isn't reproducible, but should normally be relatively rare. The reason we're seeing more and more non-reproducible results is usually people fishing around in data sets for correlations, or tweaking experiments until they find one.

Because of this, we end up with a great deal of correlations that have nothing to do with causation.

For a related but different perspective, see this article ("Language is never, ever random", Adam Kilgariff) http://www.kilgarriff.co.uk/Publications/2005-K-lineer.pdf


The article is a little dense for me, and presumes knowledge about Probabilistic Graphical Models (PGMs) and directed acyclic graphs / causal Bayesian networks (DAGs) without introducing the background knowledge - so, I expect my reading of it missed a lot of the details.

But the one thing I came out wondering is this: suppose you do a proper randomized, double-blinded study with a large sample, say, taking 1000 people with a particular illness, randomizing them completely, and giving 500 of them a particular treatment and 500 a placebo indistinguishable from the treatment. If an unbiased third party then observes the groups (still without knowing which group had which treatment), and one group shows markedly different results (say, 490/500 of the treatment group recover, and only 10/500 of the placebo group recover), is it fair to say that in this particular scenario the correlation of recovery with the treatment implies that the treatment caused the recovery?


Yes. In this case, you have two outcomes: Recovering and Not Recovering. You also have two starting states: Drug and Placebo (which both came off of the original starting state, Participant in Study, with a 50% probability).

So your Drug -> Recover has a 98% probability, vs. Drug -> Not Recover of 2%. Likewise, Placebo -> Recover has only a 2% probability, vs. Placebo -> Not Recover of 98%. In this case, you know with very high probability that the fork you stuck in the road caused two very clear, disparate outcomes. Especially because you actually began the "give drug" phase.

However, usually medical trials have results more along the lines of: Placebo group of 35 participants: 10 recovered in < 5 days, 12 recovered in < 10 days, and 13 failed to recover. Drug group of 35 participants: 17 recovered in < 5 days, 13 recovered in < 10 days, and 5 failed to recover. With data like that, clearly SOMETHING causes people to recover, and clearly your drug is not the underlying cause of recovery. It may be supporting whatever process is causing recovery, but it's very unclear whether that's even the case.
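For what it's worth, the gap between those two kinds of results can be put in numbers with a plain two-proportion z-test. This is a normal-approximation sketch for illustration, not a substitute for a proper trial analysis; the second call groups the messier trial as "recovered at all" (30/35 vs 22/35).

```python
from math import erf, sqrt

def two_prop_z(success_a, n_a, success_b, n_b):
    """Two-proportion z-test (pooled, normal approximation)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_two_sided = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_two_sided

# The clear-cut trial above: 490/500 vs 10/500
print(two_prop_z(490, 500, 10, 500))  # z ≈ 30: effectively impossible by chance
# The messier trial: 30/35 recovered at all vs 22/35
print(two_prop_z(30, 35, 22, 35))     # z ≈ 2.2, p ≈ 0.03: suggestive, not decisive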


Well, it's fair to say that experiment shows causation, but no, I don't think you can say the correlation shows it works. I think that goes against the definition of the word.

I found the article too dense to get as well, but I do think 'perhaps' we could use correlation more to assume causation. I think we are too cautious and certainly the haters always bring in correlation to stop science articles they think don't follow their beliefs. Not sure if that's compatible with the OP or not, it hurt my head.


> it's fair to say that experiment shows causation

No, that experiment doesn't show causation.

> I do think 'perhaps' we could use correlation more to assume causation. I think we are too cautious

Why do you think that? Is it because you think that science is moving too slowly (thus wasting time and money)? Is it because you think that good results are mistakenly discarded?

"using correlation more to assume causation" should produce results more quickly, both correct and incorrect. Do you understand the costs of incorrect results? First, incorrect results can have negative consequences for the consumers: for example, drugs that have harmful impacts and no beneficial ones for the people taking them (in addition to the cost of the drug, and the opportunity cost of not doing something else known to be beneficial). Second, time is wasted following up incorrect results and work based on incorrect results would be worthless. Third, the incorrect results would have to be overturned (and this communicated to everybody who had taken them as correct).

I think deciding whether to "use correlation more to assume causation" needs to take into account the costs and benefits of incorrect results, as well as the costs and benefits of correct results. Do you have any data on this? (unfortunately, I do not)

> certainly the haters always bring in correlation to stop science articles they think don't follow their beliefs

Can you support this accusation with examples of it happening to articles that actually do a good job of showing their result is something more than a correlation? Or are you saying that this happened to articles that just present a correlation?


Correlation vs causation is not actually a complex mathematical problem. The issue is more philosophical.

Bayesian networks are a highly sophisticated and flexible framework for thinking about causation. And yet the essence of Bayesian networks can be captured in much simpler methods like Instrumental Variables or simply regressions with controls.

In all cases, the true distribution of observables (which we can estimate from samples) combined with some assumptions about the possible nature of causality (either very sophisticated in the case of Bayesian networks, or very simple in the case of instrumental variables) lead to the actual causal relationships.

Where do these assumptions come from? Consider a randomized trial. Even though physically, it is possible that some hidden cause influenced both the random number generator which selected patients into a trial, and also whether patients will get better. And yet people universally believe that there is no such mechanism, and so they believe that randomized trials prove the causal relationship between taking a drug and getting better.

No physicist ever wrote down an equation proving this. It is simply something we deduced from our human understanding of how nature works.

Put simply, causality is not a physical notion, neither is it a statistical notion. It is a part of our intuitive understanding of the physical world, in which a higher level notion of causality exists, beyond that described by special relativity.


To add to your point, we cannot prove or disprove the existence of causation (try to conceive of a falsifiable experiment about causation and I would argue you end up in a metaphysical crisis). The problem David Hume raised, that just because something 'is' a certain way does not tell us how it 'ought' to be or whether it will continue to be that way, has highlighted for hundreds of years how difficult this problem is to think about. Another way to look at it is to ask: who or what guarantees the laws of physics will remain the same tomorrow? What is to stop the speed of light changing to 1 mile per hour?

This isn't to say the article posted is of no use. Having a 'graph-like' mental model of how things work is incredibly useful, as most education simplifies real-world problems into a few key issues. Although most non-computer-science issues can be reduced successfully using the 80/20 rule in real life, some problems require us to look at the hundreds of contributing factors in order to solve the problem we're facing properly.

The more acceptable 'graph-like' problem solving becomes as a way to tackle issues, the better off everyone will be.


The difference between correlation and causation is merely conventional. Does the striking of a match cause it to ignite? There is no way to prove that it does, only the correlation between the striking and the ignition makes us say they're causal. Correlation is the only way to determine what's causal and what's not.

On the other hand, if you want to look at it philosophically, then the only sensible definition of "causation" is "anything necessary for a thing to exist." So what's necessary for a thing to exist? Every other thing in the universe that is not that thing! Pluto causes us to exist right now because it hasn't turned into a giant space goat and swallowed the Earth. Yes, the fact that something DIDN'T prevent the existence of a thing is also ultimately a cause of that thing's existence. Causation in its purest form can help us understand the nature of reality, but it can't help us predict anything. It's only when we draw a line between causes that we can control and causes that we can't that it becomes a practical tool (not that understanding reality isn't practical).

So you can see, this whole correlation != causation discussion is pure nonsense. Get some philosophical skill before you try to discuss philosophical issues and stop embarrassing yourselves. Sheesh, it's ridiculous.


I agree with others. Your comment is nonsense. But the match question is more complicated than it looks.

Striking a match causes ignition because other humans have already made a long list of correlations and found them reliable enough to build a thing called a 'match'.

This is one of the differences between science and technology. Technology is practical causality. You know enough about phenomena to be reasonably confident that you can assemble them into configurations that do useful things in a reliable way.

Science is about not being sure if correlations exist until you check to find out. Science is also about building abstract models that can reliably predict unexpected correlations.

I'm not sure there's more to the idea of causation than reliable correlation backed up by elegant and efficient abstraction with predictive power. (There probably is, but it's not something I've looked into enough.)

Quantum theory has a habit of pushing this in unexpected directions. It seems to be about manipulating probabilities rather than objects. The idea that you can causally manipulate a probability distribution is a deeply weird one if you stop to wonder what exactly is being manipulated in physical terms, and how.


Apart from semantic acrobatics, there is a practical need for defining causation in science and technology. The cause has to have a temporal precedence and be necessary.

> Causation in its purest form can help us understand the nature of reality, but it can't help us predict anything

huh?


The arrow of time. Randomised experiments.


What? Your comment is pure nonsense. Of course you can prove that striking a match causes it to ignite. There's a long chain of events, each of which trivially physically verifiable, which starts with you moving the match head while applying a given amount of pressure against the material lining the matchbox, which then, due to friction, flakes off (both head and lining), the two react together to produce a spark and a flame starts. You can cut the chain down to micro-events and you'll find each one to be consistently verifiable.

Or do an experiment - grab a mug and lift it. I claim that your arm motion while your hand had grabbed the mug caused it to ascend. We can again examine the basic physical events, cutting them as fine as you'd like, and we'll see that it really was you with your arm, who caused the mug to ascend.

This is a cause-and-effect link -- when you can show how event A led to event B via a chain of events, each of which directly causes the next. At the least you need to show that there is a sensible progression from the cause to the effect in order to claim causation.

Correlation on the other hand is a description of past observations. You can observe that each time it snows, people increase their energy expenditure in order to heat their homes. Does the increase in heating cause the snow? That doesn't seem likely, knowing how heating works. It could be that increased energy demand leads to more coal being burned, leading to larger clouds being formed and lowering the outside temperature. However, even though increased coal burning does lead to such effects, they're pretty small from just increased heating, so we can discount this hypothesis.

On the other hand, you can show that cold weather causes snow. Water that falls in cold air crystallises and becomes snow. So there is a definite causative relationship between cold weather and snow. Could there also be a causative relationship between cold weather and heating? Well, assuming that people wish to live at a constant surrounding air temperature of about 20-30 degrees, that seems very likely. There are of course individuals who don't fit that profile, but they are very few (I can't actually provide a study which supports that claim, but just assume it for the purposes of this demonstration).

In the end, we can say that while snow and heating are correlated, heating does not cause snow. They simply have a common cause: cold weather.

And finally, if you keep your home at a constant temperature, the setting on your heater does not correlate with the temperature inside your home, because the temperature is constant, while you have to change the setting up and down to counteract the more or less cold weather outside. It does correlate perfectly with the outside temperature, but I'll leave proving that putting your heater on high doesn't make the temperature outside drop by 10 degrees as an exercise.


Any fact that depends on observation is uncertain, no matter how small or trivial it may seem. This whole physical world could be an incredibly elaborate computer simulation. We can never know the true nature of our observations, so nothing we observe can ever prove anything.


In reality, causal mechanisms for the phenomena we most care about (economic and social phenomena particularly) are so fantastically complicated that you never really "figure it out". Nevertheless, if you can exploit the hypothesized causation to optimize some function, you've won, regardless of the metaphysics.

Even in the context of physical processes, like the biological processes Gwern mentions, the notion of "establishing causality" is much more of a regulatory artifact than anything else.

You might call this the machine learning approach (optimize some objective function, regardless of mechanism) as opposed to the statistics approach (generate some measured claim about a process itself).


Statistics serves as a tool to overcome our cognitive biases. But what if these biases are at the center of learning?

Take for example the Gambler's Fallacy where a player believes she can predict the outcome of a coin toss with greater certainty than is possible. Obviously she cannot. But if I had to design a Machine Learning algorithm, I would certainly want it to always assume that a pattern existed. That way, if the data were predictable, the algorithm would be able to take advantage of it.
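As a toy illustration of that design choice, here's a predictor that always assumes a bias exists, guessing whichever side it has seen more often. This is a hypothetical sketch, not any standard algorithm:

```python
import random

random.seed(0)

def run(p_heads, n=2000):
    """Always-assume-a-pattern learner: guess the side observed more often."""
    h = t = correct = 0
    for _ in range(n):
        guess = "H" if h >= t else "T"
        flip = "H" if random.random() < p_heads else "T"
        correct += guess == flip
        h += flip == "H"
        t += flip == "T"
    return correct / n

print(run(0.7))  # biased coin: the assumed pattern pays off, accuracy ~0.7
print(run(0.5))  # fair coin: the assumption costs nothing on average, ~0.5
```

When the data really is predictable the assumption is exploited, and when it isn't, majority-guessing a fair coin does no worse than chance, so the bias toward pattern-finding is cheap here.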


A smarter algorithm would refrain from betting on the coin.


That's great if you know a priori that you cannot predict the outcome of what you are betting with.


I don't think it would be necessary to know a priori that something is unpredictable; failing to find a pattern after some number of observations should allow an algorithm to determine that the event is not predictable by that algorithm.


The trouble is that you can't reliably distinguish random data (or data determined by factors you're not considering) from patterned data.


How do you think random number generators are tested?


A smarter algorithm would consider the payoff as well as the odds. Then it may bet or refrain.
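That weighing of payoff against odds is just an expected-value check. A minimal sketch (hypothetical helper, not from any library):

```python
def should_bet(p_win, payout, stake):
    """Bet only when the expected gain exceeds the expected loss."""
    return p_win * payout > (1 - p_win) * stake

print(should_bet(0.5, 1.0, 1.0))  # fair coin at even money: False, no edge
print(should_bet(0.5, 1.5, 1.0))  # fair coin at 3:2 payout: True, positive EV
```

So even an unpredictable fair coin can be the right game to play, if the payout structure is generous enough.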


What if the coin had the best odds compared to all other games?


Correlation is due to one of the 3 C's, coincidence, confounding, or causality. I just like the 3 C's.


Another cracking article from gwern.

For those wondering, that snazzy looking `choose` combinatorial function is from clojure, and I assume other lisps:

http://clojuredocs.org/incanter/incanter.core/choose


No, it's just binomial coefficient. It's common to say "10 choose 2", it's the standard way of reading binomial coefficients out loud. See http://www.wolframalpha.com/input/?i=10+choose+2.
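In Python, for instance, the same thing is available in the standard library as `math.comb`:

```python
from math import comb, factorial

# "10 choose 2": the number of ways to pick 2 items from 10,
# i.e. n! / (k! * (n - k)!)
print(comb(10, 2))  # 45
assert comb(10, 2) == factorial(10) // (factorial(2) * factorial(8))
```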


The problem is that causation is an oxymoron.

The measurement problem is the same as the problem of induction.

Max Planck understood this, read his quotes on matter.

No amount of correlation increases the probability of one event followed by another. Therefore, all scientific analysis is unverifiable. Knowledge of the world is completely unjustified. Not to mention our immense presumption of consistency in world phenomena when we really have no basis for asserting uniformity of nature. Read Hume on causality.

"Belief in the Causal Nexus is superstition" - Wittgenstein

If knowledge requires certainty, and our method of determining certainty is an infinite regress of reduction, it is like a recursive function which never returns a value.

Logic is inherently a comparison operator, and if logic is the only mechanism available to us, we are trapped in a purely relational analysis and it is impossible for the mind to conceive of certainty beyond a non-reasonable emotional preordained state of knowing. A feeling so powerful it is personally indubitable.

The premise of a singularity is invalid since it is inconceivable and the idea that causation is at any time inferred via comparison is fundamentally incomprehensible. If the premise is unclear then the argument is invalid.

If I say that an airplane functions on fairy dust and you make an argument about propulsion and lift, and you claim that my assertion is wrong because I have not properly performed a rational reduction and that a detail I do not understand could invalidate my entire theory, well guess what neither of us is technically correct to any degree since the test for invalidity applies equally to both of our assumptions. The concept of proof is also an oxymoron.

Hence, the very concept of phenomenal causation is an oxymoron. The notion of knowledge about the world is a disguise for a coping mechanism based on hope and desire.


> If I say that an airplane functions on fairy dust and you make an argument about propulsion and lift, and you claim that my assertion is wrong because ...

The only reason "science" would claim that your assertion is wrong is if your explanation doesn't agree with reality.

However, if you collect a bunch of data that you say supports your theory, but your experimental technique or data analysis is not good, then it's perfectly reasonable to point out the problems and say that your result is not supported by your procedure. This is not the same as saying that your assertion is wrong.

Another important concept that you're missing is usefulness: your assertion is unlikely to be useful (this is totally different from whether it's correct). Let's say you want to ensure that planes don't fall out of the air: how would you take advantage of your assertion to do this? (My proxy for usefulness is falsifiability -- un-falsifiable theories often are useless)


"The only reason "science" would claim that your assertion is wrong is if your explanation doesn't agree with reality."

Your statement is a circular reference to the problem of induction. You arbitrarily claim to be able to make accurate assertions of reality while at the same time agreeing that the essence of reality is unknowable.

This reply of yours is full of circular reasoning. Stop using the word good without making your case for righteousness. You can't substitute it with practical or useful either, both infer benefit at the personal degree.

Just because an expectation proved to be useful once, remember that the past does not predict the future.

"Belief in the Causal Nexus is superstition"


Oops, instead of "reality" I meant "experimental observations". Good catch.

> You arbitrarily claim to be able to make accurate assertions of reality

Nope.

> while at the same time agreeing that the essence of reality is unknowable.

Yep. I guess? All I think is that it's okay if we don't know the true essence; science is still useful[0].

> Stop using the word good without making your case for righteousness.

Stop hijacking common words.

> You can't substitute it with practical or useful either; both imply benefit at a personal level.

I don't know what that means.

> Just because an expectation proved to be useful once, remember that the past does not predict the future.

Depends on what you mean by "predict the future". Absolute truth? No. Useful[0] forecasts? Yes.

----

[0] dictionary.reference.com:

> useful: being of use or service; serving some purpose; advantageous, helpful, or of good effect


I do agree with you that science doesn't show causation, but I think your interpretation of science is incorrect:

> Therefore, all scientific analysis is unverifiable.

I disagree. It's verified by experiment. (Here I'm using "verify" to mean that contradictions have not (yet) been found by experimental observation).

> Knowledge of the world is completely unjustified.

I disagree. Again, it's justified by experiment. If that's not enough for you, too bad, because that's all we have.

> Our immense presumption of consistency in world phenomena when we really have no basis for asserting uniformity of nature.

I disagree. This is only asserted insofar as it's 1) useful, and 2) justified by observation.

I don't think you understand the significance of scientific results. They don't say "at a fundamental, irreducible level, this is how <some system> operates." Rather, they're saying something more along the lines of "as far as we can tell, this is a description of how <some system> operates, but it's entirely possible that we're wrong. However, we don't have any data that shows that we're wrong (yet), and because this description is useful for understanding <some system>, we're going to use it."

We don't need to know how a system works at a fundamental level, if understanding it at a less-detailed level is useful. I do agree with you that we can't determine "actual causality", but we don't need to, if we can instead find a bunch of really good correlations. The problem is that people find lots of crappy correlations, and/or can't tell the difference between good and crappy correlations.


I agree that we do not disagree.

Your post is simply applying a different definition to the terms I used.

Your reply is a case for righteousness.

I don't think you have considered the implications of the use of such words as "good". I agree with you that experimentation is futile in the absence of ethics.

Why are you presuming common application of the words good and useful? For example, the science behind the atomic bomb and its usage: was it useful? To whom was it useful, those devastated by the blast or those who set it off?

I urge you to reevaluate your basis for righteousness.

Don't you understand that morality requires certainty?


At no point did I touch on morality or ethics. I do not know why you think that I did so.

> I agree with you that experimentation is futile in the absence of ethics.

I do not agree with that.

--------

Dictionary.reference.com:

> useful: being of use or service; serving some purpose; advantageous, helpful, or of good effect

> good: having admirable, pleasing, superior, or positive qualities; not negative, bad or mediocre

I urge you to reevaluate your understanding of the words "good" and "useful". Don't you understand that I'm not talking about morality?


Philosophically speaking, you can never verify anything through experimentation: in a controlled experiment you rely on instruments and tools you have to trust (how do you know the microscope isn't lying?), and if you keep chasing down how each piece is confirmed, eventually you arrive at the axioms of the scientific field, which are never verified (just assumed true).

You can prove negatives though, so that's something :).


> Philosophically speaking you can never verify anything through experimentation

This depends on what you mean by "verify". You seem to mean "prove that something is absolutely true".

That is not what I meant. I should have clarified that I meant something along the lines of "we've collected a bunch of data that don't contradict <something>".

I've edited my meaning into the post because I don't want my stupid equivocation to distract from more significant issues.


Go further: reproducible experiment can tease out details to many decimal places of likelihood. Something can be known so thoroughly that we can make statements like "this is certain to behave as predicted millions or billions of times more often than not", which is pretty close to certain knowledge.


I'm not sure that I agree. I think that once you get into the realm of "absolute truth" (which is what I'm interpreting your post as saying -- apologies if I'm mistaken), you've left science behind. IMHO, science cannot (and does not aim to) deliver certain knowledge. Instead, it produces useful approximations.


I agree with Matt.

There is no quantity of correlation that promotes one iota of certainty or probability.

The issue, Matt, is that many people, like this gentleman here, actually believe that scientific experimentation offers explanations.

How many times have we heard "there must be a rational explanation", when in fact a rational explanation has never been provided for any phenomenon?

We can't invoke degrees when the extreme principles imply that no such claims can be made. There is an unknown amount of probability given any proposition.

The only form of falsifiability we're capable of is in whether or not a person is conforming to the traditional use of language. If I say that 2 + 2 = 5, then I am wrong, since the rules for mathematical language are understood with certainty based on our tradition.

If I claimed that a plane IS powered by fairy dust and yet it is NOT powered by fairy dust, then I am technically wrong since I abused the use of language.


I couldn't edit, so I'll reply, but this mind-bending quote really sums up the limitations of logic and observation.

"Language disguises the thought; so that from the external form of the clothes one cannot infer the form of the thought they clothe, because the external form of the clothes is constructed with quite another object than to let the form of the body be recognized." - Wittgenstein


Hey Amarok, Image Occlusion addon author here. I'd like to meet you in person (can't reply to the comment in which you mentioned me). My email is in my profile.


Perceived correlation is usually not actual causation. That should help explain why not.


Disagree with the title. Correlation does imply causation a lot of the time (especially for simple systems). But not always. Therefore, the caution is not to assume it a priori, but to pursue further investigation to confirm or reject it. Even when it is rejected, a lot of those cases result in a third variable being the cause behind the correlated "effect" variables.


"Disagree with the title."

Perhaps you should read the article before posting your disagreement. (As well as the several other people who appear to have paragraph-sized responses to 8 words, rather than the actual article.)


The problem with using "correlation does not imply causation" outside of a mathematical context is that in math, saying A implies B means that if condition A is true, then condition B is true. In common vernacular, implies means to suggest, hence the confusion.
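The distinction the parent draws can be made concrete with a quick truth table (my own illustration, not from the thread) of material implication, the mathematical sense of "implies": it is false only when A holds and B fails, which is much stronger than the everyday "suggests".

```python
# Material implication: "A implies B" is false only in the case
# where A is true and B is false.

def implies(a, b):
    """Material implication: A -> B."""
    return (not a) or b

for a in (False, True):
    for b in (False, True):
        print(f"A={a!s:5} B={b!s:5} A->B={implies(a, b)}")
# Only the row A=True, B=False comes out False.
```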


12 days late but still:

I just realized "Correlation does imply causation a lot of the time" is a completely meaningless statement that I made (if correlation is not because of causation even some of the time, then that's simply 'correlation doesn't imply causation').

What I really had in mind (which I think is correct) is that 'correlation ends up being due to causation a lot of the time'.


This article starts off with a common mistake.

It is ok to create the 3 categories:

> If I suspect that A→B, and I collect data and establish beyond doubt that A&B correlates r=0.7, how much evidence do I have that A→B?

> you can divvy up the possibilities as: 1. A causes B 2. B causes A 3. both A and B are caused by a C

So far so good, but here is the problem:

> Even if we were guessing at random, you’d expect us to be right at least 33% of the time...

No. It is not valid to assume each possibility is equally likely. If you do so, you are bringing your own assumptions to the problem.

If you ever find yourself assuming a distribution, pause and consider testing your assumption.


First, keep reading.

Second, when gwern says "you'd expect 33%", he [1] does not mean "the abstract 'we' mathematically expect 33%", but indeed "the generic person-on-the-street has an intuitive belief that we should get 33%". If you check the context I think you'll see this fits.

[1] So far as I know, anyhow.


@jerf Why did you assume I did not read the article? I read it, and I find the writing to be unclear. Maybe "you'd expect" means what you think; maybe not.

Independent of my or your opinion, (or precisely because reasonable people may disagree over something so simple) the writing is unnecessarily unclear. It would be simple to add a quick note saying, more-or-less, that the author will revisit the point later.

I also disagree with another commenter who says that the later writing clears up this issue. The point about 33% not being a fair assumption isn't addressed head on in the way that I think matters most.

Here is my main point, which was not received in the spirit I intended. Many smart people, even programmers, sometimes incorrectly assume a discrete uniform distribution (e.g. 1/3, 1/3, 1/3 in the above example), probably because they think it makes the fewest assumptions.

I'll put it another way with a related example. Alexa gives Bart a two-sided coin, tells him it may or may not be fair, and asks for the expected probability of getting heads. Bart reasons as follows: "I know that the coin may be unfair, ranging from, say, 0%/100% to 50%/50% to 100%/0%. By symmetry, for each 1%/99% coin there is a 99%/1% coin. So, in the end, the coin will average out to 50/50, so the best answer is 50%."

Chuck comes along and says, "Bart got it wrong. Any point estimate must be incorrect, because we don't have enough information. The correct answer must be a distribution. Since we know nothing, the best answer is a flat (uniform) distribution ranging from 0% to 100%."

Danielle chimes in and adds "Yes, Bart was wrong, and Chuck improved on the analysis, but Chuck did not go far enough. Both Bart and Chuck assumed symmetry, without justification. Chuck's act of assuming a uniform distribution is tantamount to knowing the process that generated the coin. Does Chuck really know the chances of a 0-10% coin are the same as a 10-20% coin, and so on? No, he does not. Therefore, even answering the question with a distribution is incorrect. The only correct answer is that the probability of heads ranges from 0 to 100%, but no distribution can be given."

Danielle is correct. The author may well know this, but it is not communicated clearly in the article.

I would be willing to wager that many people, when confronted with Alexa's question, would answer like Bart or Chuck did.
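The coin example can be made concrete with a toy Bayesian sketch (my own illustration, not from the thread; the prior parameters are arbitrary). Two priors can both be symmetric, so both reproduce Bart's "50%" before any data, yet disagree as soon as you update on flips -- which is Danielle's point that picking a distribution is itself a substantive assumption.

```python
# Two symmetric Beta priors over a coin's heads-probability.
# Both have mean 0.5, but they encode different assumptions and so
# give different answers once data arrives.

def posterior_mean(a, b, heads, flips):
    """Mean of a Beta(a, b) prior updated on `heads` heads out of `flips` flips."""
    return (a + heads) / (a + b + flips)

# Chuck's flat prior is Beta(1, 1); a U-shaped Beta(0.1, 0.1) is equally
# symmetric but says "the coin is probably heavily biased one way or the other".
flat, ushaped = (1, 1), (0.1, 0.1)

# Before any data, both give Bart's 0.5:
print(posterior_mean(*flat, 0, 0), posterior_mean(*ushaped, 0, 0))

# After observing 3 heads in 3 flips, they disagree noticeably:
print(posterior_mean(*flat, 3, 3))     # 4/5 = 0.8
print(posterior_mean(*ushaped, 3, 3))  # 3.1/3.2, roughly 0.97
```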


> I also disagree with another commenter who says that the later writing clears up this issue. The point about 33% not being a fair assumption isn't addressed head on in the way that I think matters most.

How would you want it addressed? A large part of the article is exactly about how the (to some) "intuitive" idea of 3 evenly split categories is incorrect. E.g. later in the article he writes:

"It turns out, we weren’t supposed to be reasoning ‘there are 3 categories of possible relationships, so we start with 33%’, but rather: ‘there is only one explanation “A causes B”, only one explanation “B causes A”, but there are many explanations of the form “C1 causes A and B”, “C2 causes A and B”, “C3 causes A and B”…’, and the more nodes in a field’s true causal networks (psychology or biology vs physics, say), the bigger this last category will be."

Presumably this is why two people separately assumed you did not read the whole article.
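The counting argument in the quoted passage can be sketched numerically (a toy model of my own, not from the article, and note it is exactly the kind of bin-counting the parent commenter objects to): there are only 2 "direct" explanations for a correlation between A and B, but one confounding explanation per additional variable, so the naive base rate for "A causes B" shrinks as the causal network grows.

```python
# Naive base rate for "A causes B" if every structural explanation of an
# observed A&B correlation were treated as equally likely.

def naive_base_rate(num_other_vars):
    """P(A causes B) under a uniform split over structural explanations."""
    direct = 2                    # A -> B and B -> A
    confounded = num_other_vars   # one "C_i causes both A and B" story per C_i
    return 1 / (direct + confounded)

for k in (1, 10, 100):
    print(k, naive_base_rate(k))
# k=1 recovers the familiar 33% figure; in a field with 100 plausible
# confounders, the same reasoning gives under 1%.
```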


Ok, to move this forward, let's move past the who-has-read-what part, which doesn't seem to be helping. Anyhow, my point is relatively subtle, I think, and getting lost in communication somehow. I get what the article is saying, but I'm not sure that everyone here is getting what I'm saying.

Let me try to elaborate. My points are:

1. The article does indeed criticize the 1/3, 1/3, 1/3 split, but not for the same reasons that I am.

2. I would prefer that the 1/3, 1/3, 1/3 split not be used as a prior at all. (Yes, I get that Bayesian priors don't matter that much if you have enough data to update them. The article, on my reading, does not offer a way to get the needed observations, so I think the priors dominate.)

3. Yes, later, the article dismisses the even split as naive but does not go far enough when suggesting a better way. You see, the "better way" still boils down to the naive assumption of counting bins: this is to say that if we have more bins, then we should have a higher expected probability. In my opinion, the "better way" is still falling into a conceptual trap: this is not a combinatorics problem. For combinatorics problems to work they have to rely on real, countable things, not arbitrary notions of how you slice and dice a problem. Yes, I get that as you add more variables to the DAG, you can certainly say that there are more possible paths through the graph. But we cannot assume that just adding more nodes in the DAG will change reality. You see, the DAG is a mental construct. We should not assume that adding more mental bins (an arbitrary way of slicing reality) will affect the distribution.

Here is an example. Francis tells Greg there are three types of marbles in a jar (green, red, blue) but does not reveal the distribution, so Greg cannot rightly assume a 1/3, 1/3, 1/3 distribution. But let's say Greg does anyway. Next, Francis tells Greg, "Sorry, I was oversimplifying before; really there are four types of marbles in the jar (forest green, lime green, red, blue)." What should Greg do? Update the prior to 1/4, 1/4, 1/4, 1/4? Well, that would be inconsistent, since 1/4 (forest green) + 1/4 (lime green) does not equal 1/3 (green). Should Greg suggest a 1/6 (forest), 1/6 (lime), 1/3 (red), 1/3 (blue) split? Nope. My point is this: down that road lies arbitrariness and contradictions.
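Greg's dilemma checks out arithmetically; a quick sketch (mine, not the commenter's) using exact fractions: refining a category and re-applying "uniform over categories" contradicts the probabilities you already committed to, so the prior depends on an arbitrary choice of bins.

```python
from fractions import Fraction

# Three categories, uniform prior:
p_green = Fraction(1, 3)

# Francis splits green into forest green and lime green; a fresh uniform
# prior over the four categories gives:
p_forest = p_lime = Fraction(1, 4)

# The two priors contradict each other: 1/4 + 1/4 = 1/2, not 1/3.
print(p_forest + p_lime)  # 1/2
print(p_green)            # 1/3
assert p_forest + p_lime != p_green
```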

Hopefully this helps explain what I'm talking about. I like the article, but I don't want people to get into the habit of assuming a distribution. I also don't like people counting up concepts that are arbitrary and turning that into a distribution.


If you read the whole article, he addresses this about halfway down.


@Jemaclus I'm not sure why you assumed that I did not read the article. Maybe you disagreed with my conclusion and assumed I had not read the article?

To your question, I would ask: where, exactly, does the author "address this"? (Maybe we are talking about different meanings of "this"?) I explain the fallacy I'm talking about in a longer comment on a sister thread (by "sister" I mean up one level, over one, down one).

Did you mean this part?

> It turns out, we weren’t supposed to be reasoning ‘there are 3 categories of possible relationships, so we start with 33%’, but rather: ‘there is only one explanation “A causes B”, only one explanation “B causes A”, but there are many explanations of the form “C1 causes A and B”, “C2 causes A and B”, “C3 causes A and B”…’, and the more nodes in a field’s true causal networks (psychology or biology vs physics, say), the bigger this last category will be.

This is also fallacious, in my opinion. See my longer comment. In short, if you "count up" categories in this way, you are assuming information that you don't have; doing so is a matter of belief, not analysis.

Or did you mean this part?

> they might not be reasoning in a causal-net framework at all, but starting from the naive 33% base-rate you get when you treat all 3 kinds of causal relationships equally.

> This could be shown by eliciting estimates and seeing whether the estimates tend to look like base rates of 33% and modifications thereof.

This chunk of text does not debunk the claim that a 33% base rate is reasonable as a starting point. My point is that the 33% starting point was not reasonable in the first place.


[deleted]


The author more than understands this. In the first two paragraphs he addresses the much more interesting question of whether, in realistic causal DAGs, potential correlations grow at a faster rate than actual causal links—the idea being that if they did, it would undermine the notion that seeing a correlation improves the hope of causation due to pigeonholing.

The author doesn't assert that this is true, but it definitely means he's thinking about this quite hard.



