
Correlation is usually not causation. But why not? - Amarok
http://www.gwern.net/Causality
======
dusklight
All I know about causation and correlation I learnt hunting bugs in large
legacy software systems. In that environment I got the impression that
correlation almost never equalled causation, but that's only because the
hardest bugs, the ones I remembered, were hard because the obvious
correlations did not help identify the root cause. A similar argument might be
made for scientific studies: most of the easy causes that can be identified
from correlation have already been found, leaving the majority of new studies
with correlations that don't easily establish causation.

~~~
api
I think this low-hanging-fruit idea is generally true for all of science:
after centuries of science as a profession, pretty much all the easy stuff has
already been done.

It's actually an argument for more cojones in science-- being willing to do
bold stuff and explore "crazy" hypotheses. If all the easy stuff is done
already, then picking methodically and timidly among the dregs is unlikely to
ever yield anything.

~~~
mattfenwick
An alternative viewpoint: everything's easy after it's been done, but it's
hard up until then.

> after centuries of science as a profession, pretty much all the easy stuff
> has already been done.

But with the benefit of all that has already been done, shouldn't we be able
to do things now that were previously impossible? In other words, "easy in
2014" != "easy in 1200".

> It's actually an argument for more cojones in science-- being willing to do
> bold stuff and explore "crazy" hypotheses.

Maybe doing whatever's easy is the most efficient (by time, money) way to make
progress?

~~~
api
I'm not sure... seems like what you say may hold for a while, but eventually
you start hitting a more objective sort of hard: things that are hard for
human beings to comprehend due to the limitations of our intelligence itself.
Beyond that there's probably an even harder hard-- when you start actually
running out of new things to discover. Can there really be an infinite number
of physical laws, principles, and useful relations? Or at some point have you
actually found most of basic physics?

Once you start hitting those, you've either entered a permanent era of
diminishing returns or one where you can only really make progress by
radically redefining problems, making leaps, or trying wild and crazy ideas in
the hopes of unlocking some isolated seam of high-value research that isn't
connected to the others in the fitness/value state space graph.

~~~
mattfenwick
Sure, we can't rule out those possibilities ... but to me it just sounds like
wild speculation.

------
jrapdx3
This question is bound to be a can of worms. There has been a great deal
written about the matter, particularly in regard to observational studies.

Drawing a causal inference is overwhelmingly likely to be wrong when there is
a good chance that unknown variables are influencing the observed correlates.
In the health-related sciences, that is more often than not the case.

A few years ago there was a study correlating hours of TV watched and ADHD
symptoms in children. The news media picked up on these "findings" and of
course the causal influence of TV on ADHD was reported.

It was obvious that saying watching TV caused ADHD was absurd, and that other
variables weren't taken into account, e.g., that some other characteristic of
ADHD kids prompted them to watch TV more than other kids.

There was a great article published in PLOS several years ago (ATM I don't
have the link) showing mathematically that the odds were about 1 in a million
that an observational study like the above would turn out to be a "true"
causal relationship, and the author concluded most published studies were
junk.

In experimental studies, variables are limited and controlled as well as
possible. With fewer and known variables, correlations would have a greater
chance of revealing a reproducible causal relationship among events.

The discussion gets tripped up when it comes to defining "cause" or "causal
relationship". However controlled an experiment may be, there's a possibility
that unknown variables were present and affected the phenomena occurring in
the experiment. Conclusions can't be absolute, but only true to some
probability.

I think the history of science over the last 100 years or so has something to
say about the nature of "causal relationships".

~~~
nmjohn
> There was a great article published in PLOS several years ago (ATM I don't
> have the link) showing mathematically that the odds were about 1 in a
> million that an observational study like the above would turn out to be a
> "true" causal relationship, and the author concluded most published studies
> were junk.

I very much doubt the second part of that, that most published studies are
junk. The idea that not showing a causal link and only a correlational one
makes a study junk is not held by anyone in the field who garners a whole lot
of respect or notoriety. I'm not saying the quality is equal whatsoever,
simply that correlation studies are not inherently junk - some are fantastic
and some are quite the opposite.

There is no question that journals in general are pumping out a significant
amount of junk (in studies of all types), but I speculate the root cause of
that has more to do with the rise of "publish or perish" and significant
increases in grad school enrollment. Even worse is the notion that, for a grad
student, the number of publications their name is attached to matters more for
employment after graduation than the content they publish. So we have a
situation where people are under more pressure to publish than ever before [0],
there is more competition for scarce funding, studies are vastly underfunded,
and studies are rushed so that the next one can begin and pad the resume.

That doesn't mean there is less good science being done either! There probably
is more good science being published now than ever before; the problem is that
the signal-to-noise ratio has gone down, making it harder for good studies to
get media attention and easier for the media to latch on to whatever story
they think will get viewers.

[0]: Sidenote: this also makes it increasingly challenging not only to produce
high-quality research in general, but also research that goes beyond reporting
a correlation and provides evidence for a possible causal link.

~~~
a_bonobo
This, I think, is the study the first poster meant, Ioannidis 2005, 'Why Most
Published Research Findings Are False':

[http://www.plosmedicine.org/article/info%3Adoi%2F10.1371%2Fj...](http://www.plosmedicine.org/article/info%3Adoi%2F10.1371%2Fjournal.pmed.0020124)

It's quite a famous paper. Many rebuttals, comments, blog posts, etc. have
been published; have a look at the comments on the PLOS site. I especially
like this blog post summarizing more research:

[http://simplystatistics.org/2013/09/25/is-most-science-false...](http://simplystatistics.org/2013/09/25/is-most-science-false-the-titans-weigh-in/)

From the same author, here's a small overview of the discussion at the end of
2013:

[http://simplystatistics.org/2013/12/16/a-summary-of-the-evid...](http://simplystatistics.org/2013/12/16/a-summary-of-the-evidence-that-most-published-research-is-false/)

Quote on the Ioannidis paper:

> Under assumptions about the way people perform these tests and report them
> it is possible to construct a universe where most published findings are
> false positive results. Important drawback: The paper contains no real data,
> it is purely based on conjecture and simulation.

------
zdw
Because you can find things that are correlated, but have no practical
relationship - many (frequently humorous) examples here:

[http://tylervigen.com](http://tylervigen.com)

~~~
gwern
Datamining correlations like that confuses two issues: sampling error and
'everything is correlated'. When you find a correlation like that, it's
unclear whether it's a spurious correlation which will disappear if you just
collect more data (caused by slicing the data in a thousand ways without any
correction for doing so) or whether there genuinely is some sort of opaque
correlation which will persist no matter how much data you collect (maybe 'US
spending on science, space, and technology' _really does_ correlate with
'Suicides by hanging, strangulation and suffocation' through something like
government stimulus spending and economic growth).
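The slicing-the-data-a-thousand-ways point is easy to demonstrate: correlate enough genuinely independent series and a "strong" correlation always turns up by chance. A minimal sketch with made-up data (numpy assumed; the series stand in for unrelated statistics like spending and suicide rates):

```python
import numpy as np

rng = np.random.default_rng(0)

# 50 completely independent random "time series", 30 observations each.
n_series, n_obs = 50, 30
data = rng.standard_normal((n_series, n_obs))

# Correlate every distinct pair and keep the strongest correlation found.
corr = np.corrcoef(data)             # n_series x n_series correlation matrix
iu = np.triu_indices(n_series, k=1)  # upper triangle = the distinct pairs
best = np.abs(corr[iu]).max()

print(f"pairs tested: {len(iu[0])}, strongest |r| found: {best:.2f}")
```

With 1225 pairs tested and no multiple-comparison correction, a correlation well above what any single pre-registered test would call "significant" shows up even though every series is pure noise.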

------
graycat
Bad winter weather can cause auto accidents, and we expect a positive
correlation between bad winter storms and winter auto accidents. Okay, but in
the northern hemisphere, living in more northern latitudes also correlates
with winter auto accidents but does not cause them.

For heart disease, we know that the main causes have to do with aging. Well,
then, since the audience for TV news is now comparatively old, we can expect
watching TV news to have a positive correlation with heart disease. Still,
watching TV news does not cause heart disease.

In the US NE, hurricanes are positively correlated with pretty leaves on the
trees, but the leaves do not cause the hurricanes, and the hurricanes do not
cause the colors in the leaves. Instead, hurricanes are caused by the surface
waters of the Atlantic Ocean, heated by the summer sun, with cooler air on
top, and that situation is caused by the fall weather, which also causes the
colored leaves. So, colored leaves have a _spurious_ correlation with hurricanes but
do not cause them.

Correlation is much more common and much easier to establish than causality.
Usually the convincing evidence of causality is some basic physical
connection.

~~~
lsd5you
Your first example is wrong. If you move to a more northerly latitude then it
will increase the risk of accidents for you (i.e. there is causation). It is
indirect causation, but then ultimately (almost?) all causation is going to be
indirect if you take the reduction far enough. That is bad weather doesn't
directly cause accidents, but ice on the roads does ... etc.

~~~
graycat
No: The bad winter weather, say, ice on the roads, is the _cause_ of the
accidents. If we happen to have a winter with little or no snow or ice, then
the number of auto accidents will go down no matter what the latitude. So,
latitude can't be the cause; ice and snow on the roads is the cause. And if we
live at a lower latitude, say, closer to the equator, and get a freak winter
snow storm with a lot of ice and snow on the roads, then we will see auto
accidents in spite of the latitude.

For indirect causation, I just said winter weather to try to be brief. But,
sure, the cause is the ice and snow on the roads, not just the _weather_ in
some vague sense. I was trying to keep the explanation simple, avoid
discussing indirect causation, and concentrate on the OP issue of causation
versus correlation.

------
sqrt17
Normally, inferring causation from a correlation is confounded by a few
alternatives, such as

\- inverse causation

\- common cause

\- random correlations

The first two should be relatively stable and reproducible, and we could then
proceed to find out about causation with an intervention study (e.g. if "good
education" directly causes "good job prospects", will it help if we give
people good education that wouldn't normally get one? Or does it improve
people's educational achievement when we give them better access to jobs?)

The third isn't reproducible, but should normally be relatively rare. The
reason we're seeing more and more non-reproducible results is usually

\- People fishing around in data sets for correlations, or tweaking
experiments until they find a correlation

Because of this, we end up with a great deal of correlations that have nothing
to do with causation.

For a related but different perspective, see this article ("Language is never,
ever random", Adam Kilgarriff):
[http://www.kilgarriff.co.uk/Publications/2005-K-lineer.pdf](http://www.kilgarriff.co.uk/Publications/2005-K-lineer.pdf)

------
ghshephard
The article is a little dense for me, and presumes knowledge about
Probabilistic Graphical Models (PGMs) and directed acyclic graphs / causal
Bayesian networks (DAGs) without introducing the background knowledge - so, I
expect my reading of it missed a lot of the details.

But the one thing I came out wondering is: suppose you do a proper randomized
double-blinded study with a large population sample, say, taking 1000 people
with a particular illness, completely randomizing them, and then giving 500 of
the sample a particular treatment and 500 of the sample a placebo that is
indistinguishable from the treatment. If an unbiased third party then observes
the groups (still without knowledge of which group had which treatment), and
one group shows markedly different results (say, 490/500 of the treatment
group recover, and only 10/500 of the placebo group recover), is it fair to
say that, in this particular scenario, the correlation of recovery with the
treatment implies that the treatment caused the recovery?
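For what it's worth, the arithmetic behind this hypothetical scenario can be checked directly. A sketch of an exact hypergeometric tail test (using the 490/500 vs 10/500 numbers above): under the null hypothesis that the treatment does nothing, the 500 recoveries are spread at random over the 1000 randomized patients, and we ask how often the treatment arm would end up with 490 or more of them.

```python
from math import comb

N, K, n = 1000, 500, 500  # patients, total recoveries, treatment-arm size

# Probability that, by chance alone, at least 490 of the 500 recoveries
# land in the treatment arm (exact hypergeometric tail).
p_tail = sum(comb(K, k) * comb(N - K, n - k)
             for k in range(490, n + 1)) / comb(N, n)

print(f"P(treatment arm gets >= 490 recoveries by chance) = {p_tail:.3g}")
```

The probability is astronomically small, which is why a result like this from a properly randomized trial is taken as strong evidence of causation.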

~~~
aaron695
Well, it's fair to say that the experiment shows causation, but no, I don't
think you can say the correlation shows it works. I think that goes against
the definition of the word...

I found the article too dense to get as well, but I do think 'perhaps' we
could use correlation more to assume causation. I think we are too cautious,
and certainly the haters always bring in correlation to stop science articles
they think don't follow their beliefs. Not sure if that's compatible with the
OP or not; it hurt my head.

~~~
mattfenwick
> it's fair to say that experiment shows causation

No, that experiment doesn't show causation.

> I do think 'perhaps' we could use correlation more to assume causation. I
> think we are too cautious

Why do you think that? Is it because you think that science is moving too
slowly (thus wasting time and money)? Is it because you think that good
results are mistakenly discarded?

"using correlation more to assume causation" should produce results more
quickly, both correct and incorrect. Do you understand the costs of incorrect
results? First, incorrect results can have negative consequences for the
consumers: for example, drugs that have harmful impacts and no beneficial ones
for the people taking them (in addition to the cost of the drug, and the
opportunity cost of not doing something else known to be beneficial). Second,
time is wasted following up incorrect results and work based on incorrect
results would be worthless. Third, the incorrect results would have to be
overturned (and this communicated to everybody who had taken them as correct).

I think deciding whether to "use correlation more to assume causation" needs
to take into account the costs and benefits of incorrect results, as well as
the costs and benefits of correct results. Do you have any data on this?
(unfortunately, I do not)

> certainly the haters always bring in correlation to stop science articles
> they think don't follow their beliefs

Can you support this accusation with examples of it happening to articles that
actually do a good job of showing their result is something more than a
correlation? Or are you saying that this happened to articles that just
present a correlation?

------
QuantumChaos
Correlation vs causation is not actually a complex mathematical problem. The
issue is more philosophical.

Bayesian networks are a highly sophisticated and flexible framework for
thinking about causation. And yet the essence of Bayesian networks can be
captured in much simpler methods like Instrumental Variables or simply
regressions with controls.

In all cases, the true distribution of observables (which we can estimate from
samples) combined with some assumptions about the possible nature of causality
(either very sophisticated in the case of Bayesian networks, or very simple in
the case of instrumental variables) lead to the actual causal relationships.

Where do these assumptions come from? Consider a randomized trial. Physically,
it is possible that some hidden cause influenced both the random number
generator which selected patients into a trial and also whether patients would
get better. And yet people universally believe that there is no such
mechanism, and so they believe that randomized trials prove the causal
relationship between taking a drug and getting better.

No physicist ever wrote down an equation proving this. It is simply something
we deduced from our human understanding of how nature works.

Put simply, causality is neither a physical notion nor a statistical notion.
It is part of our intuitive understanding of the physical world, in
which a higher level notion of causality exists, beyond that described by
special relativity.

~~~
pasharayan
To add to your point, we cannot prove or disprove the existence of causation
(try to conceive of a falsifiable experiment about causation and I would argue
you would end up with a metaphysical crisis). The 'is-ought' problem David
Hume identified, where just because something 'is' a certain way does not mean
it 'ought' to be, or will continue to be, that way, has highlighted how
difficult this problem has been to think about for hundreds of years. Another
way to look at it is to ask: "Who or what is to guarantee the laws of physics
will remain the same tomorrow? What is to stop the speed of light changing to
1 mile per hour?"

This isn't to say the article posted is of no use. Having a 'graph-like'
mental model of how things work is incredibly useful, as most education
simplifies real-world problems into a few key issues. Although most
non-computer-science issues can be reduced successfully using the 80/20 rule
in real life, sometimes a problem requires us to look at the hundreds of
contributing factors to solve it properly.

The more acceptable 'graph-like' problem solving becomes as a way to solve
issues, the better off everyone will be.

------
mgregory22
The difference between correlation and causation is merely conventional. Does
the striking of a match cause it to ignite? There is no way to prove that it
does, only the correlation between the striking and the ignition makes us say
they're causal. Correlation is the only way to determine what's causal and
what's not.

On the other hand, if you want to look at it philosophically, then the only
sensible definition of "causation" is "anything necessary for a thing to
exist." So what's necessary for a thing to exist? Every other thing in the
universe that is not that thing! Pluto causes us to exist right now because it
hasn't turned into a giant space goat and swallowed the Earth. Yes, the fact
that something DIDN'T prevent the existence of a thing is also ultimately a
cause of that thing's existence. Causation in its purest form can help us
understand the nature of reality, but it can't help us predict anything. It's
only when we draw a line between causes that we can control and causes that we
can't that it becomes a practical tool (not that understanding reality isn't
practical).

So you can see, this whole correlation != causation discussion is pure
nonsense. Get some philosophical skill before you try to discuss philosophical
issues and stop embarrassing yourselves. Sheesh, it's ridiculous.

~~~
anon4
What? Your comment is pure nonsense. Of course you can prove that striking a
match causes it to ignite. There's a long chain of events, each of which is
trivially physically verifiable, which starts with you moving the match head
while applying a given amount of pressure against the material lining the
matchbox; due to friction, material flakes off (both head and lining), the two
react together to produce a spark, and a flame starts. You can cut the chain
down to micro-events and you'll find each one to be consistently verifiable.

Or do an experiment - grab a mug and lift it. I claim that your arm motion
while your hand had grabbed the mug caused it to ascend. We can again examine
the basic physical events, cutting them as fine as you'd like, and we'll see
that it really was you with your arm, who caused the mug to ascend.

This is a cause-and-effect link -- when you can show how event A led to event
B via a chain of events, each of which directly causes the next. At the least
you need to show that there is a sensible progression from the cause to the
effect, in order to claim a causation.

Correlation on the other hand is a description of past observations. You can
observe that each time it snows people increase their energy expenditure in
order to heat their homes. Does the increase in heating cause the snow? That
doesn't seem likely, knowing how heating works. It could be that increased
energy demand leads to more coal being burned, leading to larger clouds being
formed and lowering the outside temperature. However, even though increased
coal burning does lead to such effects, they're pretty small from just
increased heating, so we can discount this hypothesis. On the other hand, you
can show that cold weather causes snow. Water that falls in cold air
crystallises and becomes snow. So there is a definite causative relationship
between cold weather and snow. Could there also be a causative relationship
between cold weather and heating? Well, assuming that people wish to live at a
constant surrounding air temperature of about 20-30 degrees, that seems very
likely. There are of course individuals who don't fit that profile, but they
are very few (I can't actually provide a study which supports that claim, but
just assume it for the purposes of this demonstration). In the end, we can say
that while snow and heating are correlated, heating does not cause snow. They
simply have a common cause - cold weather.

And finally, if you keep your home at a constant temperature, the setting on
your heater does not correlate with the temperature inside your home, because
the temperature is constant, while you have to change the setting up and down
to counteract the more or less cold weather outside. It does correlate
perfectly with the outside temperature, but I'll leave proving that putting
your heater on high doesn't make the temperature outside drop by 10 degrees as
an exercise.
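That last thermostat point can be sketched in a few lines (a toy model with made-up numbers, not real thermostat data): the heater setting tracks the outside temperature in order to cancel it out, so the inside temperature stays near the target and carries almost no signal about the setting.

```python
import numpy as np

rng = np.random.default_rng(42)

days = 365
outside = rng.uniform(-10, 10, days)        # daily outside temperature
setting = 21 - outside                      # heater works harder when colder
inside = 21 + rng.normal(0, 0.1, days)      # thermostat holds ~21 C (tiny noise)

r_out = np.corrcoef(setting, outside)[0, 1]
r_in = np.corrcoef(setting, inside)[0, 1]
print(f"corr(setting, outside) = {r_out:.2f}, corr(setting, inside) = {r_in:.2f}")
```

The setting correlates (here, perfectly negatively) with the outside temperature and essentially not at all with the inside temperature it actually controls -- a controller hides its own causal effect from naive correlation.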

~~~
mgregory22
Any fact that depends on observation is uncertain, no matter how small or
trivial it may seem. This whole physical world could be an incredibly
elaborate computer simulation. We can never know the true nature of our
observations, so nothing we observe can ever prove anything.

------
fiatmoney
In reality, causal mechanisms for the phenomena we most care about (economic
and social phenomena particularly) are so fantastically complicated that you
never really "figure it out". Nevertheless, if you can exploit the
hypothesized causation to optimize some function, you've won, regardless of
the metaphysics.

Even in the context of physical processes, like the biological processes Gwern
mentions, the notion of "establishing causality" is much more of a regulatory
artifact than anything else.

You might call this the machine learning approach (optimize some objective
function, regardless of mechanism) as opposed to the statistics approach
(generate some measured claim about a process itself).

------
jostmey
Statistics serves as a tool to overcome our cognitive biases. But what if
these biases are at the center of learning?

Take for example the Gambler's Fallacy where a player believes she can predict
the outcome of a coin toss with greater certainty than is possible. Obviously
she cannot. But if I had to design a Machine Learning algorithm, I would
certainly want it to always assume that a pattern existed. That way, if the
data were predictable, the algorithm would be able to take advantage of it.

~~~
eternauta3k
A smarter algorithm would refrain from betting on the coin.

~~~
jostmey
That's great if you know a priori that you cannot predict the outcome of what
you are betting on.

~~~
nitrogen
I don't think it would be necessary to know _a priori_ that something is
unpredictable; failing to find a pattern after some number of observations
should allow an algorithm to determine that the event is not predictable by
that algorithm.

~~~
baddox
The trouble is that you can't reliably distinguish random data (or data
determined by factors you're not considering) from patterned data.

~~~
thaumasiotes
How do you think random number generators are tested?
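(One common answer: with batteries of statistical tests. As an illustration, here is a minimal sketch of one classic test, a chi-squared goodness-of-fit check that digits 0-9 come up equally often, run on Python's stdlib generator. A generator with a detectable pattern in its digit frequencies would blow past the critical value.)

```python
import random
from collections import Counter

random.seed(7)
n = 100_000
counts = Counter(random.randrange(10) for _ in range(n))

# Chi-squared statistic against the uniform expectation of n/10 per digit.
expected = n / 10
chi2 = sum((counts[d] - expected) ** 2 / expected for d in range(10))

# 16.92 is the 5% critical value for chi-squared with 9 degrees of freedom.
print(f"chi-squared statistic: {chi2:.2f} (5% critical value: 16.92)")
```

Real test suites (Diehard, TestU01, NIST SP 800-22) run many such tests, and a good generator is expected to fail a few of them occasionally, at exactly the advertised false-positive rate.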

------
hessenwolf
Correlation is due to one of the 3 C's: coincidence, confounding, or
causality. I just like the 3 C's.

------
JonnieCache
Another cracking article from gwern.

For those wondering, that snazzy looking `choose` combinatorial function is
from clojure, and I assume other lisps:

[http://clojuredocs.org/incanter/incanter.core/choose](http://clojuredocs.org/incanter/incanter.core/choose)

~~~
TeMPOraL
No, it's just the binomial coefficient. It's common to say "10 choose 2";
that's the standard way of reading binomial coefficients out loud. See
[http://www.wolframalpha.com/input/?i=10+choose+2](http://www.wolframalpha.com/input/?i=10+choose+2).
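For reference, Python's standard library exposes the same binomial coefficient directly as `math.comb`:

```python
from math import comb

# "10 choose 2": the number of ways to pick 2 items from 10,
# i.e. C(10, 2) = 10! / (2! * 8!) = 45.
print(comb(10, 2))  # → 45
```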

------
pleb
The problem is that causation is an oxymoron.

The measurement problem is the same as the problem of induction.

Max Planck understood this, read his quotes on matter.

No amount of correlation increases the probability of one event followed by
another. Therefore, all scientific analysis is unverifiable. Knowledge of the
world is completely unjustified. Not to mention our immense presumption of
consistency in world phenomena when we really have no basis for asserting
uniformity of nature. Read Hume on causality.

"Belief in the Causal Nexus is superstition" \- Wittgenstein

If knowledge requires certainty, and our method of determining certainty is an
infinite regress of reduction, it is like a recursive function which never
returns a value.

Logic is inherently a comparison operator, and if logic is the only mechanism
available to us, we are trapped in a purely relational analysis and it is
impossible for the mind to conceive of certainty beyond a non-reasonable
emotional preordained state of knowing. A feeling so powerful it is personally
indubitable.

The premise of a singularity is invalid since it is inconceivable and the idea
that causation is at any time inferred via comparison is fundamentally
incomprehensible. If the premise is unclear then the argument is invalid.

If I say that an airplane functions on fairy dust and you make an argument
about propulsion and lift, and you claim that my assertion is wrong because I
have not properly performed a rational reduction and that a detail I do not
understand could invalidate my entire theory, well guess what neither of us is
technically correct to any degree since the test for invalidity applies
equally to both of our assumptions. The concept of proof is also an oxymoron.

Hence, the very concept of phenomenal causation is an oxymoron. The notion of
knowledge about the world is a disguise for a coping mechanism based on hope
and desire.

~~~
mattfenwick
I do agree with you that science doesn't show causation, but I think your
interpretation of science is incorrect:

> Therefore, all scientific analysis is unverifiable.

I disagree. It's verified by experiment. (Here I'm using "verify" to mean that
contradictions have not (yet) been found by experimental observation).

> Knowledge of the world is completely unjustified.

I disagree. Again, it's justified by experiment. If that's not enough for you,
too bad, because that's all we have.

> Our immense presumption of consistency in world phenomena when we really
> have no basis for asserting uniformity of nature.

I disagree. This is only asserted insofar as it's 1) useful, and 2) justified
by observation.

I don't think you understand the significance of scientific results. They
don't say "at a fundamental, irreducible level, this _is_ how <some system>
operates." Rather, they're saying something more along the lines of "as far as
we can tell, this is a description of how <some system> operates, but it's
entirely possible that we're wrong. However, we don't have any data that shows
that we're wrong (yet), and because this description is useful for
understanding <some system>, we're going to use it."

We don't need to know how a system works at a fundamental level, if
understanding it at a less-detailed level is useful. I do agree with you that
we can't determine "actual causality", but we don't need to, if we can instead
find a bunch of really good correlations. The problem is that people find lots
of crappy correlations, and/or can't tell the difference between good and
crappy correlations.

~~~
pleb
I agree that we do not disagree.

Your post is simply applying a different definition to the terms I used.

Your reply is a case for righteousness.

I don't think you have considered the implications of the use of such words as
"good". I agree with you that experimentation is futile in the absence of
ethics.

Why are you presuming a common application of the words "good" and "useful"?
For example, the science behind the atomic bomb and its usage: was it useful?
To whom was it useful, those devastated by the blast or those who set it off?

I urge you to reevaluate your basis for righteousness.

Don't you understand that morality requires certainty?

~~~
mattfenwick
At no point did I touch on morality or ethics. I do not know why you think
that I did so.

> I agree with you that experimentation is futile in the absence of ethics.

I do not agree with that.

\--------

Dictionary.reference.com:

> useful: being of use or service; serving some purpose; advantageous,
> helpful, or of good effect

> good: having admirable, pleasing, superior, or positive qualities; not
> negative, bad or mediocre

I urge you to reevaluate your understanding of the words "good" and "useful".
Don't you understand that I'm not talking about morality?

------
tmbb
Hey Amarok, Image Occlusion addon author here. I'd like to meet you in person
(can't reply to the comment in which you mentioned me). My email is in my
profile.

------
snowwrestler
_Perceived_ correlation is usually not _actual_ causation. That should help
explain why not.

------
fizixer
Disagree with the title. Correlation does imply causation a lot of the time
(especially for simple systems). But not always. Therefore, the caution is not
to assume it a priori, but to pursue further investigation to confirm or
reject it. Even when it is rejected, a lot of those cases result in a third
variable being the cause behind the correlated "effect" variables.

~~~
lotsofpulp
The problem with using "correlation does not imply causation" outside of a
mathematical context is that in math, saying A implies B means that if
condition A is true, then condition B is true. In common vernacular, "implies"
means "suggests"; hence the confusion.

~~~
fizixer
12 days late but still:

I just realized "Correlation does imply causation a lot of the time" is a
completely meaningless statement that I made (if correlation is not due to
causation even some of the time, then that's simply 'correlation doesn't imply
causation').

What I really had in mind (which I think is correct) is that 'correlation ends
up being due to causation a lot of the time'.

------
dj-wonk
This article starts off with a common mistake.

It is ok to create the 3 categories:

> If I suspect that A→B, and I collect data and establish beyond doubt that
> A&B correlates r=0.7, how much evidence do I have that A→B?

> you can divvy up the possibilities as: 1. A causes B 2. B causes A 3. both A
> and B are caused by a C

So far so good, but here is the problem:

> Even if we were guessing at random, you’d expect us to be right at least
> 33% of the time...

No. It is not valid to assume each possibility is equally likely. If you do
so, you are bringing your own assumptions to the problem.

If you ever find yourself assuming a distribution, pause and consider testing
your assumption.

~~~
jerf
First, keep reading.

Second, when gwern says "you'd expect 33%", he [1] does not mean "the abstract
'we' mathematically expect 33%", but indeed "the generic person-on-the-street
has an intuitive belief that we should get 33%". If you check the context I
think you'll see this fits.

[1] So far as I know, anyhow.

~~~
dj-wonk
@jerf Why did you assume I did not read the article? I read it, and I find the
writing to be unclear. Maybe "you'd expect" means what you think; maybe not.

Independent of my or your opinion, (or precisely because reasonable people may
disagree over something so simple) the writing is unnecessarily unclear. It
would be simple to add a quick note saying, more-or-less, that the author will
revisit the point later.

I also disagree with another commenter who says that the later writing clears
up this issue. The point about 33% not being a fair assumption isn't addressed
head on in the way that I think matters most.

Here is my main point, which was not received in the spirit I intended. Many
smart people, even programmers, sometimes incorrectly assume a discrete
uniform distribution (e.g. 1/3, 1/3, 1/3 in the above example), probably
because they think it makes the fewest assumptions.

I'll put it another way with a related example. Alexa gives Bart a two-sided
coin, tells him it may or may not be fair, and asks for the expected probability of
getting heads. Bart reasons as follows: "I know that the coin may be unfair,
ranging from, say, 0%/100% to 50%/50% to 100%/0%. By symmetry, for each 1%/99%
coin there is a 99%/1% coin. So, in the end, the coin will average out to
50/50, so the best answer is 50%."

Chuck comes along and says, "Bart got it wrong. Any point estimate must be
incorrect, because we don't have enough information. The correct answer must
be a distribution. Since we know nothing, the best answer is a flat (uniform)
distribution ranging from 0% to 100%."

Danielle chimes in and adds "Yes, Bart was wrong, and Chuck improved on the
analysis, but Chuck did not go far enough. Both Bart and Chuck assumed
symmetry, without justification. Chuck's act of assuming a uniform
distribution is tantamount to knowing the process that generated the coin.
Does Chuck really know the chances of a 0-10% coin are the same as a 10-20%
coin, and so on? No, he does not. Therefore, even answering the question with
a distribution is incorrect. The only correct answer is that the probability
of heads ranges from 0 to 100%, but no distribution can be given."

Danielle is correct. The author may well know this, but it is not communicated
clearly in the article.

I would be willing to wager that many people, when confronted with Alexa's
question, would answer like Bart or Chuck did.
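To make the coin example concrete (the priors below are my own hypothetical choices, not anything anyone in the thread committed to): the "expected" chance of heads is whatever the assumed prior over the coin's bias says it is, which is exactly why Chuck's flat prior smuggles in knowledge he doesn't have.

```python
# E[p] under a given (unnormalized) prior density over the coin's bias p,
# computed by a simple midpoint-rule integration over [0, 1].
def predictive_heads(prior, steps=100_000):
    total = weight = 0.0
    for i in range(steps):
        p = (i + 0.5) / steps  # midpoint grid over [0, 1]
        w = prior(p)
        total += w * p
        weight += w
    return total / weight

uniform = lambda p: 1.0                   # Chuck's flat prior
tails_heavy = lambda p: 3 * (1 - p) ** 2  # a prior favoring tails-biased coins

print(predictive_heads(uniform))      # ~0.5, by symmetry
print(predictive_heads(tails_heavy))  # ~0.25: same ignorance, different answer
```

Two states of "knowing nothing" about the coin yield different answers; the number comes from the prior, not from the coin.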

~~~
vidarh
> I also disagree with another commenter who says that the later writing
> clears up this issue. The point about 33% not being a fair assumption isn't
> addressed head on in the way that I think matters most.

How would you want it addressed? A large part of the article is exactly about
how the - to some - "intuitive" idea of 3 evenly split categories is
incorrect. E.g. later in the article he writes:

"It turns out, we weren’t supposed to be reasoning ‘there are 3 categories of
possible relationships, so we start with 33%’, but rather: ‘there is only one
explanation “A causes B”, only one explanation “B causes A”, but there are
many explanations of the form “C1 causes A and B”, “C2 causes A and B”, “C3
causes A and B”…’, and the more nodes in a field’s true causal networks
(psychology or biology vs physics, say), the bigger this last category will
be."

Presumably this is why two people separately assumed you did not read the
whole article.
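The quoted counting argument can be sketched in a few lines (the confounder names and counts here are made up for illustration):

```python
# For one observed correlation A~B there are two direct-causation
# explanations, but one confounding explanation per plausible common cause.
def explanations(confounders):
    exps = ["A causes B", "B causes A"]
    exps += [f"{c} causes both A and B" for c in confounders]
    return exps

few = explanations(["C1"])                            # sparse network (physics-like)
many = explanations([f"C{i}" for i in range(1, 21)])  # dense network (biology-like)

print(len(few), len(many))  # 3 22
print(2 / len(many))        # the direct-causation share shrinks as nodes grow
```

This is the article's point in miniature: the more candidate common causes a field's causal networks contain, the smaller the fraction of explanations of the form "A causes B".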

~~~
dj-wonk
Ok, to move this forward, let's move past the who has read what part, which
doesn't seem to be helping. Anyhow, my point is relatively subtle, I think,
and getting lost in communication somehow. I get what the article is saying,
but I'm not sure that everyone here is getting what I'm saying.

Let me try to elaborate. My points are:

1. The article does indeed criticize the 1/3, 1/3, 1/3 split, but not for the
same reasons that I am.

2. I would prefer that the 1/3, 1/3, 1/3 split not be used as a prior at all.
(Yes, I get that Bayesian priors don't matter that much _if_ you have enough
data to update them. The article, on my reading, does not offer a way to get
the needed observations, so I think the priors dominate.)

3. Yes, the article later dismisses the even split as naive, but it does not
go far enough when suggesting a better way. The "better way" still boils down
to the naive assumption of counting bins: if we have more bins, then we should
assign a higher expected probability. In my opinion, the "better way" is still
falling into a conceptual trap: this is not a combinatorics problem. For
combinatorics to work, it has to count real, countable things, not arbitrary
notions of how you slice and dice a problem. Yes, I get that as you add more
variables to the DAG, you can certainly say that there are more possible paths
through the graph. But we cannot assume that just adding more nodes to the DAG
will change reality. The DAG is a mental construct. We should not assume that
adding more mental bins (an arbitrary way of slicing reality) will affect the
distribution.

Here is an example. Francis tells Greg there are three marbles in a jar
(green, red, blue) but does not reveal the distribution, so Greg cannot
rightly assume a 1/3, 1/3, 1/3 distribution. But let's say Greg does anyway.
Next, Francis tells Greg, "Sorry, I was oversimplifying before; really there
are 4 types of marbles in a jar (forest green, lime green, red, blue)." What
should Greg do? Update the prior to 1/4, 1/4, 1/4, 1/4? Well, that would be
inconsistent, since 1/4 (forest green) + 1/4 (lime green) does not equal 1/3
(green). Should Greg suggest a 1/6 (forest), 1/6 (lime), 1/3 (red), 1/3 (blue)
split? Nope. My point is this: down that road lies arbitrariness and
contradictions.
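Greg's dilemma can be checked with exact arithmetic (a hypothetical sketch of the marble example above): a policy of "assign uniform probability over whatever bins I'm currently told about" contradicts itself as soon as a bin is split.

```python
from fractions import Fraction

# Uniform prior over the coarse description (green, red, blue).
coarse = {"green": Fraction(1, 3), "red": Fraction(1, 3), "blue": Fraction(1, 3)}

# Re-applying "uniform over the named bins" after green is split in two.
fine = {c: Fraction(1, 4)
        for c in ("forest green", "lime green", "red", "blue")}

# Consistency would require P(forest green) + P(lime green) == P(green):
print(fine["forest green"] + fine["lime green"] == coarse["green"])  # False
```

The two uniform assignments cannot both hold (1/4 + 1/4 = 1/2 ≠ 1/3), which is the arbitrariness the comment is pointing at: the "uniform" answer depends on how the bins happen to be described.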

Hopefully this helps explain what I'm talking about. I like the article, but I
don't want people to get into the habit of assuming a distribution. I also
don't like people counting up concepts that are arbitrary and turning that
into a distribution.

