
Simpson’s Paradox (2016) - mromnia
https://www.forrestthewoods.com/blog/my_favorite_paradox/
======
mcguire
I'd like to say that the author has been reading _The Book of Why_ , but it
seems that he hasn't because he missed the punch line of the section on the
paradox: you need a causal model to separate the two branches of the paradox.
It's as easy to construct examples where the overall view is correct as it is
to construct examples where the separate views are.

~~~
jonahx
I'm unclear: what was the incorrect claim you're saying the author made?

~~~
whatshisface
The parent is not saying the author made an incorrect claim. They are saying
that the author did not carry the argument through to the conclusion that
someone else had already reached: that causal models are what tells you when
you can combine datasets and when you can't.

~~~
chii
> causal models are what tells you when you can combine datasets and when you
> can't.

But then the causal model is subjective, right? What if there are two
different causal models, and it cannot be known a priori which is the "true"
one?

Can the selection of the causal model be used to justify the dataset, in order
to push a particular agenda?

~~~
whatshisface
Your job when analysing data is simply to enumerate the possibilities and
assign likelihoods to them if possible. If two models fit equally well, you're
supposed to write them both down in the hope that someone will collect further
data to distinguish between them.

If you're cutting holes in your report for political reasons, that's just not
doing the job. That's what pundits are paid to do, not (ideally at least)
scientists. Fraud is easy to commit, and the fact that it's possible is not
that hard of a philosophical issue.

~~~
chii
How do you tell whether a paper whose conclusions support an agenda was
written with proper scientific rigor, rather than fraudulently? Using Simpson's
paradox, one can obscure one's biases by making the desired conclusion drop
out of the data.

------
knappa
The sex-discrimination lawsuit against UC Berkeley seems to be a kind of
academic urban myth; the administration was apparently afraid of such a
lawsuit, and the study was done in response to those administrative fears.

~~~
techbio
I.e. [https://www.refsmmat.com/posts/2016-05-08-simpsons-paradox-berkeley.html](https://www.refsmmat.com/posts/2016-05-08-simpsons-paradox-berkeley.html)

------
gok
The last example, software optimization causing a mean slowdown because users
actually use the software more, is so true. Another example I've seen is
better ML models causing accuracy to go down: users try harder things.

------
freddex
I like the way this is written. Very clear and to the point, with a tone of
"Hey, check out this cool thing".

~~~
oneeyedpigeon
Very accessible, essentially making just one strong point with excellent
examples and an easy-to-understand explanation. It does leave me with
questions (doesn't the number of trials in e.g. the kidney stone example
count, as well as the relative success rate?), but that can only be a good
thing!

------
IngoBlechschmid
An explorable explanation of Simpson's Paradox, neatly complementing the
article, is here:
[https://pwacker.com/simpson.html](https://pwacker.com/simpson.html)

------
currymj
Judea Pearl’s explanations of this in terms of causality are the only way it
really makes sense, in my view.

[https://ftp.cs.ucla.edu/pub/stat_ser/r414.pdf](https://ftp.cs.ucla.edu/pub/stat_ser/r414.pdf)

~~~
Gibbon1
Unless I'm misreading, the takeaway is that a failure to appreciate
graph/network theory is behind Simpson's paradox, and, I think, a lot of
broken 20th century 'science', because the theory was based on simplistic
statistical analysis of processes with strong path dependence.

------
YeGoblynQueenne
So, this is the data that the wikipedia page on Simpson's Paradox cites for
the Berkeley study, and that the author of the article has quoted:

    
    
                         Men              Women
        Department Applied  Admitted Applied  Admitted
        A          [825]    62%      108      [82%]
        B          [560]    63%      25       [68%]
        C          325      [37%]    [593]    34%
        D          [417]    33%      375      [35%]
        E          191      [28%]    [393]    24%
        F          [373]    6%       341      [7%]
    
    

Above, I've bracketed in each pair of columns a) the sex with the more
applicants and b) the sex with the higher admission rate in each department.
If that data is really the Berkeley data, then it's clear that the bias is
against the sex with the more applicants, rather than against either men or
women as such.

I can propose a mechanism for this kind of (with some abuse of terminology)
selection bias. A department accepts some applications, then realises it has
admitted too many applicants of one sex and starts rejecting applicants from
the dominant sex in an attempt to redress the balance. It makes a mess of it
and ends up biased too far in the opposite direction from where it started.

Also note that in 4 out of 6 departments, more men applied than women,
explaining why more departments appear biased against men (provided my
observation holds).

However, I can't be sure whether this is actually the original data because
it's nowhere to be found in my pdf copy of the study ("Sex Bias in Graduate
Admissions"), which I believe I got from here:
[https://homepage.stat.uiowa.edu/~mbognar/1030/Bickel-Berkeley.pdf](https://homepage.stat.uiowa.edu/~mbognar/1030/Bickel-Berkeley.pdf).
If anyone knows where this data actually comes from, I'd welcome a pointer.
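In the meantime, treating the table's percentages as exact (they're rounded,
so this is only approximate), you can recompute the aggregate admission rate
per sex from it:

    # Rough recomputation of aggregate rates from the table above:
    # (applicants, admission rate) per department. Rates are rounded in the
    # table, so the aggregates are approximate.
    men   = [(825, 0.62), (560, 0.63), (325, 0.37), (417, 0.33), (191, 0.28), (373, 0.06)]
    women = [(108, 0.82), (25, 0.68), (593, 0.34), (375, 0.35), (393, 0.24), (341, 0.07)]

    def aggregate(rows):
        admitted = sum(n * rate for n, rate in rows)
        applied = sum(n for n, _ in rows)
        return admitted / applied

    print(f"Men overall:   {aggregate(men):.1%}")    # -> 44.5%
    print(f"Women overall: {aggregate(women):.1%}")  # -> 30.3%

So even though women have the higher rate in four of the six departments,
their aggregate rate is far lower, because they applied disproportionately to
the departments that admit very few applicants of either sex.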

~~~
YeGoblynQueenne
As a separate comment, which might be controversial, I would like to call
bullshit on the entire claim _of the Berkeley study_ in particular (and not
about Simpson's Paradox in general). In the "Berkeley data" (if that's what it
is), it's clear again that men applied to most departments in larger numbers
than women. The claim based on the Berkeley data is that because women were
admitted at higher rates on a per-department basis, more departments were
biased against men than against women.

Now, picture this. Alice and Bob share a pizza. Alice takes 7 pieces and Bob
takes 3 (he's on an intermittent fasting diet so he only eats every other
slice). Alice eats 4 of her slices, Bob eats 3 of his. At the end, Alice turns
to Bob and says "boy, you're such a glutton! You scoffed down all of your
slices, but I still have 3 left".

Is that a fair comparison? Well, no. Alice starts out with almost twice as
many slices as Bob. Bob eats _less_ than Alice, but he's accused of stuffing
his face because he eats a larger proportion of his smaller share.

Same with the Berkeley data. If that _is_ the Berkeley data.

~~~
roenxi
I'm not quite sure I follow your complaint, but I think I might be disagreeing
with you. A key lesson of Simpson's Paradox is you can't read stories into
data without having a causal model derived from outside the data.

I can comfortably invent stories that are not inconsistent with the data for a
wide range of scenarios:

1) Only the most capable women are applying to Dept A due to discrimination,
so the data is evidence of discrimination.

2) Dept A is discriminating in favour of women (self-evident: roughly 80% vs
60% admission rates).

3) Dept A is completely non-discriminatory and the assessors are unaware of
the gender of applicants; the differences are due to personal choices w.r.t.
education and social networks turning out to be proxies for gender.

No study of this sort of data can detect gender bias. It can be used as
evidence in a broader study that comes up with a causal model for how the
admissions process works; but there is no getting around interviews and field
observations.

~~~
YeGoblynQueenne
I'm not challenging Simpson's paradox, only the conclusion quoted with respect
to the data in the above table (I'm still not sure where it came from).

------
srean
Simpson's Paradox is one of the many phenomena that show how different applied
ML is from regular software engineering. Another one is feedback loops between
decomposed subproblems.

In ML, encapsulation, the shielding away of inner details, often does not
work. One needs to know what is happening on the other side of the abstraction
boundary. This is a problem for managers and PMs coming to ML from a purely
software engineering background. They are used to encapsulation and
decomposition serving them well, and they expect the same.

~~~
brians
You’re right about ML. But you’re mistaken about software engineering—though
in good company with most software engineers.

Of denotation, cache access, confidentiality, authentication, integrity, non-
repudiability, performance, thread safety, memory overhead—only denotation and
some parts of memory overhead allow composition of abstractions.

~~~
jonahx
Would you mind elaborating on this?

------
jzl
Observation #2: the paradox is essentially describing statistical
gerrymandering. :)

~~~
gdne
Came here to say this. Simpson’s paradox is exactly how gerrymandering works.
It’s all about how the data is grouped.

------
esquire_900
This is the exact feeling I've been having for years, nicely described in
easy-to-understand language. At least in data science and (god forbid)
behavioral psychology, you can answer any question any way you like, in a
statistically valid way, by slightly shifting the level of focus (as described
here), the definitions, or the angle of attack. The more data, the easier.

Thanks for putting it in such a clear way :)

------
throway88989898
Neatly phrased:

Trends which appear in slices of data may disappear or reverse when the groups
are combined.

~~~
naasking
Or perhaps even more succinctly: slicing data can introduce bias.

~~~
Matumio
This is less accurate, because not slicing data can also lead to bias.

~~~
naasking
Except the original statement didn't make any claim about "not slicing", so
neither does mine.

~~~
throway88989898
Not slicing is nevertheless slicing: the trivial selection. As in the Rush
song, "If you choose not to decide, you still have made a choice".

------
sopooneo
In simple cases at least, such as with the kidney stones, can we reduce our
risk of reaching wrong conclusions by increasing our sample size of patients
and randomizing which receive each treatment?

~~~
rwilson4
Yes absolutely! Random assignment, along with statistical power and
significance considerations, does indeed allow one to draw causal conclusions.
It's the gold standard for causal inference.
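A quick simulation shows why. This is only a sketch with made-up effect sizes:
treatment A is genuinely better for everyone, severe cases have lower success
rates, and under "doctor's choice" the severe cases are steered toward A:

    import random

    random.seed(0)

    def trial(n, randomized):
        # Treatment A is truly better (+5 points of success probability), but
        # severe cases are harder and, without randomization, mostly get A.
        results = {"A": [0, 0], "B": [0, 0]}  # [successes, patients]
        for _ in range(n):
            severe = random.random() < 0.5
            if randomized:
                treatment = random.choice("AB")
            else:
                treatment = "A" if random.random() < (0.8 if severe else 0.2) else "B"
            p_success = (0.65 if severe else 0.90) + (0.05 if treatment == "A" else 0.0)
            results[treatment][0] += random.random() < p_success
            results[treatment][1] += 1
        return {t: round(s / c, 3) for t, (s, c) in results.items()}

    print("observational:", trial(100_000, randomized=False))  # B looks better
    print("randomized:   ", trial(100_000, randomized=True))   # A correctly wins

With the confounded assignment, B looks better overall even though A is better
for every patient; randomizing the assignment makes the comparison reflect the
true effect.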

~~~
AnthonyMouse
> Yes absolutely!

The problem with these cases is generally that people want to use data that
didn't come from a controlled experiment to begin with. You have a nice, fat
data set of all the people who have been treated for kidney stones -- you
could never afford to do a controlled experiment at that scale. But because
the treatments weren't randomized (and neither was anything else), the
conclusions are erroneous.

This has been a huge problem in social sciences, where you can't do the
controlled experiment _at all_ , even at a smaller scale, because there is no
way to randomize the choices individuals make. All you can do is try to
control for the divergence statistically -- but there isn't one confounder in
real data, there are thousands or more, and each one you want to control for
multiplies the measurement error (because the measurement error in the primary
factor combines with the measurement error in the control factor).

~~~
rwilson4
You're right, and in some instances it is possible to draw causal conclusions
from observational data. See [0] and [1] for two pretty different
perspectives. But for this to work, you need a lot of data: both lots of units
(e.g. people), and a lot of information about each individual unit.

[0] Causality, Judea Pearl

[1] Causal Inference for Statistics, Social, and Biomedical Sciences: An
Introduction, Guido Imbens and Donald Rubin

~~~
AnthonyMouse
The trouble is you can't fix large numbers of statistical confounders with
more data, because there is a limit to how many factors you can control for
before the measurement error overwhelms the signal.

To do statistical controls, you essentially sort the data by category, so that
you're not just comparing black people with white people, you're comparing
middle class 18 year old black female college applicants with college educated
parents to middle class 18 year old white female college applicants with
college educated parents.

But every one of those factors is a chance to have measured something wrong.
Your group of middle class 18 year old black female college applicants with
college educated parents will have a couple of people who were misidentified
as middle class, a couple of people who were misidentified as black, a couple
of people who were misidentified as female, a couple of people who were
misidentified as 18 and a couple of people who were misidentified as having
college educated parents. And they don't cancel out exactly because the
original correlations with the primary factor existed to begin with, so the
measurement error compounds in proportion to the strength of the correlation
of the primary factor with each confounder.

Meanwhile the size of each subcategory shrinks each time you bisect it
further. So the more things you try to control for, the higher the percentage
of each subcategory that is measurement error.
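Here's a back-of-the-envelope sketch of those two effects, assuming some
per-field misrecording rate (the 2% figure below is made up):

    # With k binary control factors, subgroups shrink like N / 2**k, while the
    # chance that a record has at least one misrecorded control field grows
    # like 1 - (1 - e)**k. N and e are assumed, not taken from any real study.
    N = 100_000   # sample size
    e = 0.02      # assumed probability that any single field is recorded wrongly

    for k in range(0, 11, 2):
        subgroup = N / 2 ** k
        contaminated = 1 - (1 - e) ** k
        print(f"{k:2d} controls: ~{subgroup:8.0f} per subgroup, "
              f"{contaminated:.1%} of records have a bad control field")

And, as noted above, those bad records don't wash out, because the
misclassification is correlated with the primary factor you're trying to
measure.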

~~~
apathy
I hate to take “both sides” but in the absence of confounding by indication,
you can often use propensity scoring within robust models to decrease these
impacts.

Mind you, the problem with non-random and undetected sampling bias is that it
can be subtle. See for example
[https://www.nytimes.com/2018/08/06/upshot/employer-wellness-programs-randomized-trials.html](https://www.nytimes.com/2018/08/06/upshot/employer-wellness-programs-randomized-trials.html)
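For anyone who hasn't seen it, here is a minimal sketch of propensity scoring
in its simplest inverse-probability-weighting form (not a doubly robust
model); the data, effect sizes, and single confounder are all made up:

    # Inverse-propensity-weighting sketch on synthetic confounded data.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n = 50_000
    x = rng.normal(size=n)                                      # confounder
    t = (rng.random(n) < 1 / (1 + np.exp(-2 * x))).astype(int)  # treatment depends on x
    y = 1.0 * t + 2.0 * x + rng.normal(size=n)                  # true treatment effect = 1.0

    naive = y[t == 1].mean() - y[t == 0].mean()

    # Estimate propensity scores and reweight each group by 1/p or 1/(1-p).
    p = LogisticRegression().fit(x.reshape(-1, 1), t).predict_proba(x.reshape(-1, 1))[:, 1]
    w = np.where(t == 1, 1 / p, 1 / (1 - p))
    ipw = (np.average(y[t == 1], weights=w[t == 1])
           - np.average(y[t == 0], weights=w[t == 0]))

    print(f"naive difference in means: {naive:.2f}")  # badly biased by the confounder
    print(f"IPW estimate:              {ipw:.2f}")    # close to the true effect of 1.0

But this only helps to the extent that the confounders are actually measured
and the propensity model is reasonable, which is exactly the worry about
measurement error and confounding by indication.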

~~~
AnthonyMouse
Propensity scoring is a method of applying statistical controls. How does it
address the issue of controls compounding measurement error?

~~~
apathy
That’s the whole point of doubly robust models. However, in the event of
confounding by indication or sampling misspecification, my experience is that
nothing can save you.

I am a rather strong proponent of randomized trials for this exact reason.
(They can also have sampling bias, but some degree of noise is inevitable.)

------
jzl
Cool article. My knowledge of statistics is really rusty, but isn't this
another way of approaching the topic of "Bayesian thinking"? If you think
about the scenarios in the article from the standpoint of _predicting_ any
given outcome in advance, male vs. female and hard department vs. easy
department should be treated as "priors". Or to put it another way, Bayesian
thinking means asking the question "What is the chance of X happening _given
Y_?"

A nice intro to the topic: [https://betterexplained.com/articles/an-intuitive-and-short-explanation-of-bayes-theorem/](https://betterexplained.com/articles/an-intuitive-and-short-explanation-of-bayes-theorem/)

Which explains why a positive test on a mammogram means you only have an 8%
chance of having breast cancer:

 _> The chance of getting a real, positive result is .008. The chance of
getting any type of positive result is the chance of a true positive plus the
chance of a false positive (.008 + 0.09504 = .10304)._

 _> So, our chance of cancer is .008/.10304 = 0.0776, or about 7.8%._

 _> Interesting — a positive mammogram only means you have a 7.8% chance of
cancer, rather than 80% (the supposed accuracy of the test). It might seem
strange at first but it makes sense: the test gives a false positive 9.6% of
the time (quite high), so there will be many false positives in a given
population. For a rare disease, most of the positive test results will be
wrong._
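The quoted arithmetic is just Bayes' theorem; spelling it out with the linked
article's assumptions (1% prevalence, 80% sensitivity, 9.6% false-positive
rate):

    # P(cancer | positive test) via Bayes' theorem, with the quoted numbers.
    prevalence = 0.01
    sensitivity = 0.80        # P(positive | cancer)
    false_positive = 0.096    # P(positive | no cancer)

    p_true_pos = prevalence * sensitivity                        # 0.008
    p_any_pos = p_true_pos + (1 - prevalence) * false_positive   # 0.10304

    print(f"P(cancer | positive) = {p_true_pos / p_any_pos:.1%}")  # 7.8%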

~~~
currymj
This is actually a case that shows the limits of Bayesian thinking.

The power of probability is that it can work in two directions. You can use it
to make predictions, from causes to effects, from past to future. Or you can
use it to reason diagnostically, from effects to causes, like deducing what
must have happened in the past to produce the current observation. Thinking
probabilistically, these two cases are treated the same: they're both just
conditioning on evidence, which is really elegant.

The problem is that when the two cases really need to be treated differently,
probability can't distinguish between them. For example, asking about the
probability of hypothetical situations, or predicting the results of
interventions. You need to know which variables are causes and which are
effects, but this is outside the scope of probability.

Simpson's paradox is something that only shows up when the variables involved
have certain cause-effect structures. If you think in terms of these
structures, it stops being counterintuitive.

------
TicklishTiger
That is not a paradox. It's just the fact that a theory about something might
not hold when you take a closer look at that something.

In the article's example, the admission rates of a university seemed to
indicate that there is a bias against women.

Zooming in and looking at the admission rates of the individual departments
seems to indicate that there is a bias against men.

The article makes it sound like the first theory was wrong. And the second
theory - the bias against men - is the real truth.

Zooming in further might indicate the opposite again.

Take two boxers. So far, one of them has won 86% of his fights and the other
one has won 100%. According to the article, "The data is clear".

Now we add more data:

One fighter is Mike Tyson. He won 50 of his 58 fights. The other one is me. I
did one fight in kindergarten and won it. But to be honest: I would not want
to fight Tyson. As paradoxical as that sounds.

~~~
sopooneo
I don't know, but at some point, aren't we just running up against the
definition of "probability"?

~~~
TicklishTiger
Probably.

------
air7
This is one of my favorite paradoxes too. Here's why:

"... given the same table, one should sometimes follow the partitioned and
sometimes the aggregated data, depending on the story behind the data, with
each story dictating its own choice. Pearl considers this to be the real
paradox behind Simpson's reversal." [0]

[0][https://en.wikipedia.org/wiki/Simpson%27s_paradox](https://en.wikipedia.org/wiki/Simpson%27s_paradox)

~~~
emmelaich
Not really a paradox, but you will like
[https://en.wikipedia.org/wiki/Anscombe%27s_quartet](https://en.wikipedia.org/wiki/Anscombe%27s_quartet)

(if you've not encountered it before, which I suspect is unlikely!)

------
emmelaich
I sometimes wonder why people expect there to be any fixed, categorical
semantic relationship between any set of numbers and any set of natural
language statements.

Very rarely do the words or the numbers cover even a tiny amount of the
possible interpretations.

------
_bxg1
This is basically how gerrymandering works, isn't it?

------
jdhzzz
I am reminded of this XKCD comic
[https://xkcd.com/2080/](https://xkcd.com/2080/).

------
clircle
IIRC, you can guard against Simpson's paradox by designing/collecting balanced
data.

~~~
GolDDranks
I thought the same; at least in the kidney stone story, the data wasn't
balanced: treatment A was assigned a lot more of the "harder cases". Either
the trial wasn't randomized or the data set size wasn't big enough.

~~~
unparagoned
Unless you are God, you will never be able to properly know what to factor in.
Actually doing the experiment and analysis is exponentially harder. It's like
saying, who cares about P=NP; if you want to decrypt AES without the key, just
build a super fast computer.

------
lettergram
I don't know why the 2016 needs to be in the title here. I understand it for
date-relevant content, but this is not.

~~~
city41
Another reason to put the year is it helps people decide if they’ve read it
before.

~~~
lettergram
I doubt that helps. For instance, this is from 2016; had you read it before?

I’m simply suggesting it because I don’t think it adds anything to the
conversation. In addition, I’ve seen this being added more often lately and I
worry it makes people think it’s date relevant (as I did) or that it somehow
provides less value due to some time delay.

~~~
city41
Another reason is it’s possible the same author writes an update or new
article on the same topic. The year helps disambiguate that.

