
Correlation, Causation, and Confusion - gwern
http://www.thenewatlantis.com/publications/correlation-causation-and-confusion
======
Houshalter
I think the statistics used in a lot of actual science are too primitive. Not
that the methods are simple or old by any means, or that making them more
complicated is desirable in itself; it's just that there are better ways of
inferring structure from data.

Just correlating two variables will often be misleading, e.g. the finding that
coffee drinkers live longer. But if you collect a ton of other information,
you will find many other relevant variables. And once you build a model from
all of that information, you may find that coffee no longer predicts mortality.
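A toy simulation makes the coffee point concrete (all variable names and numbers here are made up for illustration): if a confounder like smoking drives both coffee drinking and mortality, the raw correlation looks substantial, but coffee's coefficient vanishes once the confounder enters the model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical confounder: "smoking" drives both coffee drinking and mortality.
smoking = rng.normal(size=n)
coffee = 0.8 * smoking + rng.normal(size=n)     # coffee tracks smoking
mortality = 1.5 * smoking + rng.normal(size=n)  # smoking raises mortality; coffee has no real effect

# Naive analysis: coffee "predicts" mortality.
naive_corr = np.corrcoef(coffee, mortality)[0, 1]

# Regression that also includes the confounder: coffee's coefficient goes to ~0.
X = np.column_stack([np.ones(n), coffee, smoking])
beta, *_ = np.linalg.lstsq(X, mortality, rcond=None)

print(f"naive corr(coffee, mortality) = {naive_corr:.2f}")  # around 0.5
print(f"coffee coefficient given smoking = {beta[1]:.2f}")  # near zero
```

Of course this only works because the simulation happens to include the one confounder that matters; with real data you never know whether you measured them all.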

Of course there is a limit to the amount of causal structure you can infer
from data. But that limit is well above what we are currently doing. And the
more relevant data you collect, the higher that limit is.

This is basically what the field of machine learning is: we want computers to
automatically model our data and make the most accurate predictions for us.
Tons of methods are invented and tested against each other on different
datasets. There are websites like Kaggle where users compete to find the best
model for a dataset. But somehow little of this innovation flows back into
mainstream statistics or gets used in scientific studies to build better
models.

And machine learning sucks too. A lot of methods use maximum likelihood with
some tricks to keep it from overfitting, rather than ideal Bayesian methods.
And the models are often quite simple, like decision trees or linear models.
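The "tricks to keep it from overfitting" and the Bayesian view are often the same math in disguise. A minimal sketch (toy data, illustrative numbers): ridge regression, i.e. L2-regularized maximum likelihood, gives exactly the MAP estimate under a Gaussian prior on the weights.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.5 * rng.normal(size=n)

lam = 1.0  # regularization strength

# Maximum likelihood plus an overfitting "trick": ridge regression.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# The Bayesian view: the same answer is the MAP estimate under Gaussian
# noise (variance sigma2) and a Gaussian prior w ~ N(0, tau2 * I),
# with lam = sigma2 / tau2.
sigma2, tau2 = 1.0, 1.0 / lam
w_map = np.linalg.solve(X.T @ X + (sigma2 / tau2) * np.eye(d), X.T @ y)

print(np.allclose(w_ridge, w_map))  # True
```

So the complaint is less that regularization is wrong and more that it is a point estimate: a fully Bayesian method would keep the whole posterior over weights instead.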

I think in the future we will use something like Bayesian neural networks,
which can model very complicated structures from little data. And we won't
arbitrarily select some of the variables to be inputs and some to be outputs:
all of the observed variables are outputs of some unobserved function with
unobserved inputs, and we need to model that.

Then you have a fully general statistics algorithm that you can plug any
dataset into. You can then ask it questions like "if I observed that this
person drank coffee, but everything else I know about them is the same, what
is their mortality?" And then it will give you the most accurate possible
answer.
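That kind of query can be sketched with even a plain linear model (hypothetical variable names, simulated data): fit on all the variables, then toggle coffee while holding everything else about the "person" fixed and compare the predictions.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000

# Toy world where only exercise actually affects the mortality score,
# but coffee correlates with it through exercise.
exercise = rng.normal(size=n)
coffee = 0.5 * exercise + rng.normal(size=n)
mortality = -2.0 * exercise + rng.normal(size=n)

X = np.column_stack([np.ones(n), coffee, exercise])
beta, *_ = np.linalg.lstsq(X, mortality, rcond=None)

def predict(coffee_val, exercise_val):
    """Predicted mortality score for a person with these feature values."""
    return beta @ np.array([1.0, coffee_val, exercise_val])

# "If this person drank coffee, but everything else I know about them is
# the same": toggle coffee, hold the rest fixed.
same_person_no_coffee = predict(coffee_val=0.0, exercise_val=1.0)
same_person_coffee = predict(coffee_val=2.0, exercise_val=1.0)
print(f"predicted change from coffee alone: "
      f"{same_person_coffee - same_person_no_coffee:.2f}")  # near zero
```

Note this is still a conditional prediction, not an intervention; it only answers the causal question to the extent that the model includes the right variables.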

Is that proof of causation? No. But it's far more likely that coffee causes
cancer if it accurately predicts it even when you control for all other
variables than if you just see that it happens to correlate with cancer.

~~~
le0n
I don't think machine learning vs Bayes vs sampling theory has much to do with
the content of the article, which is more about causality than interpretations
of probability.

> it's far more likely that coffee causes cancer if it can accurately predict
> it, even when you control for all other variables

I don't know about this: prediction is not equivalent to explanation in
general. The "all the other variables" bit is also a bit of a kicker (what
counts as "all"?) -- hence randomization, and, well, pretty much everything
else the article discusses.

------
mizzao
As someone who studies causality via experiments for a living, this article is
a remarkably accessible synthesis given that it covers so much from different
disciplines.

Definitely bookmarked as a juicy morsel to send to anyone who asks, "why
should we care about causality?"

------
markcmyers
This is a masterpiece.

~~~
le0n
+1. A fantastic article. Definitely one of the best popular-level stats pieces
I've read.

