Simpson’s Paradox (2016) (forrestthewoods.com)
370 points by mromnia 48 days ago | hide | past | web | favorite | 82 comments

I'd like to say that the author has been reading The Book of Why, but it seems that he hasn't, because he missed the punch line of the section on the paradox: you need a causal model to separate the two branches of the paradox. It's as easy to construct examples where the overall view is correct as it is to construct examples where the separate views are.

I'm unclear: what was the incorrect claim you're saying the author made?

The parent is not saying the author made an incorrect claim. They are saying that the author did not continue his argument to the conclusion that someone else had already reached: that causal models are what tell you when you can combine datasets and when you can't.

> causal models are what tells you when you can combine datasets and when you can't.

but then the causal model is subjective, right? What if there are two different causal models, and it cannot be known a priori which is the "true" one?

Can the selection of the causal model be used to justify the dataset, in order to push a particular agenda?

Your job when analysing data is simply to enumerate the possibilities and assign likelihoods to them if possible. If two models fit equally well, you're supposed to write them both down in the hope that someone will collect further data to distinguish between them.

If you're cutting holes in your report for political reasons, that's just not doing the job. That's what pundits are paid to do, not (ideally at least) scientists. Fraud is easy to commit, and the fact that it's possible is not that hard of a philosophical issue.

How do you tell that a paper containing conclusions to support an agenda is written with correct scientific rigor, rather than fraud? Using Simpson's paradox, one can obfuscate their biases by making the desired conclusion drop out of the data.

Simpson's paradox is about a conflict between an overall view of the data and a more specific view. For example, in the kidney stone scenario, treatment A is more successful overall, while treatment B is more successful at treating both small stones and big stones when the data is broken down that way. The article indicates that the specific view is always correct, so treatment B should be used in the future, whereas the commenter is saying that context is important to determine which treatment should be used.

Exactly. With a causal model (which can be validated independently) you have a principled reason for choosing which variables to control for.

Post author here. Can confirm I’ve not read The Book of Why!

I’ll add it to my reading list.

A warning: it's seriously self-congratulatory. But I don't know of anything better.

The sex-discrimination lawsuit against UC Berkeley seems to be a kind of academic urban myth; the administration was apparently afraid of such a lawsuit, and the study was done in response to those administrative fears.

Some people would relish the chance to disregard a narrative that fails to align with their ideology. An advantage is obtained through selective acknowledgement of reality.

Now, how to go about the rationalization of ignoring it?

The last example of software optimization causing mean slowdown because users actually use the software is so true. Another example I've seen is better ML models causing accuracy to go down; users try harder things.

I like the way this is written. Very clear and to the point, with a tone of "Hey, check out this cool thing".

Very accessible, essentially making just one strong point with excellent examples and an easy-to-understand explanation. It does leave me with questions (doesn't the number of trials in e.g. the kidney stone example count, as well as the relative success rate?), but that can only be a good thing!

An explorable explanation of Simpson's Paradox, neatly complementing the article, is here: https://pwacker.com/simpson.html

Judea Pearl’s explanations of this in terms of causality are the only way it really makes sense, in my view.


Unless I'm misreading, the takeaway is that a failure to appreciate graph/network theory is behind Simpson's paradox, and, I think, a lot of broken 20th century 'science', because theory was based on simplistic statistical analysis of processes with strong path dependence.

So, this is the data that the wikipedia page on Simpson's Paradox cites for the Berkeley study, and that the author of the article has quoted:

                     Men              Women
    Department Applied  Admitted Applied  Admitted
    A          [825]    62%      108      [82%]
    B          [560]    63%      25       [68%]
    C          325      [37%]    [593]    34%
    D          [417]    33%      375      [35%]
    E          191      [28%]    [393]    24%
    F          [373]    6%       341      [7%]

Above, I've bracketed in each pair of columns a) the sex with the most applicants and b) the sex with the most admissions, in a department. If that data is really the Berkeley data, then it's clear that the bias is against the sex with the most applicants, rather than either men or women.
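For anyone who wants to see the reversal arithmetically, here's a short sketch that aggregates the per-department rates quoted above. The admit counts are reconstructed from the rounded percentages, so the totals are approximate:

```python
# Per-department (applicants, admission rate), from the table above.
men   = {'A': (825, 0.62), 'B': (560, 0.63), 'C': (325, 0.37),
         'D': (417, 0.33), 'E': (191, 0.28), 'F': (373, 0.06)}
women = {'A': (108, 0.82), 'B': ( 25, 0.68), 'C': (593, 0.34),
         'D': (375, 0.35), 'E': (393, 0.24), 'F': (341, 0.07)}

def overall_rate(groups):
    """Pooled admission rate across all departments."""
    applied  = sum(n for n, _ in groups.values())
    admitted = sum(n * r for n, r in groups.values())
    return admitted / applied

print(f"men:   {overall_rate(men):.1%}")    # ~44.5%
print(f"women: {overall_rate(women):.1%}")  # ~30.3%
```

So even though women have the higher rate in four of the six departments, the pooled numbers favour men by about 14 points, because women applied disproportionately to the low-admission departments.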

I can propose a mechanism for this kind of (with some abuse of terminology) selection bias. A department accepts some applications, then realises they've admitted too many applicants of one sex and start rejecting applicants from the dominant sex in an attempt to redress the balance. They make a mess of it and end up biased too far in the opposite direction than they originally started.

Also note that in 4 out of 6 departments, more men applied than women, explaining why more departments appear biased against men (provided my observation holds).

However, I can't be sure whether this is actually the original data because it's nowhere to be found on my pdf copy of the study (Sex bias in graduate admission) which I believe I got from here: https://homepage.stat.uiowa.edu/~mbognar/1030/Bickel-Berkele.... If anyone knows where this data actually comes from, I'd welcome a pointer.

As a separate comment, which might be controversial, I would like to call bullshit on the entire claim of the Berkeley study in particular (and not about Simpson's Paradox in general). In the "Berkeley data" (if that's what it is), it's clear again that men applied to most departments in larger numbers than women. The Berkeley data claims that because more women were admitted on a per-department basis, more departments were biased against men.

Now, picture this. Alice and Bob share a pizza. Alice takes 7 pieces and Bob takes 3 (he's on an intermittent fasting diet so he only eats every other slice). Alice eats 4 of her slices, Bob eats 3 of his. At the end, Alice turns to Bob and says "boy, you're such a glutton! You scoffed down all of your slices, but I still have 3 left".

Is that a fair comparison? Well, no. Alice starts out with almost double the slices than Bob. Bob eats less than Alice, but he's accused of stuffing his face because he eats a larger proportion of his smaller share.

Same with the Berkeley data. If that is the Berkeley data.

I'm not quite sure I follow your complaint, but I think I might be disagreeing with you. A key lesson of Simpson's Paradox is you can't read stories into data without having a causal model derived from outside the data.

I can comfortably invent stories that are not inconsistent with the data for a wide range of scenarios:

1) Only the most capable women are applying to Dept A due to discrimination, so the data is evidence of discrimination.

2) Dept A is discriminating towards women (self evident, 80% vs 60% admissions).

3) Dept A is completely non-discriminatory and the assessors are unaware of the gender of applicants; the differences are due to personal choices w.r.t. education and social networks turning out to be proxies for gender.

No study of this sort of data can detect gender bias. It can be used as evidence in a broader study that comes up with a causal model for how the admissions process works, but there is no getting around interviews and field observations.

I'm not challenging Simpson's paradox, only the conclusion quoted in respect with the data in the above table (I'm still not sure where it came from).

You need to look at the figures. The differences that support your argument are minor and within the margin of error. You could similarly conclude that women are just smarter across the board.

I'm sorry, I don't understand your comment. What difference is minor? What is the margin for error? And how would I conclude what you say?

Men are only favourites by 1-2%. That's within the margin of error. Women are favourites by, say, 10% plus. The comment treats them the same and bases its theory on a binary concept. It's just bad logic and may even be a version of Simpson's paradox.

Women are the favorite by 10%+ only for a single department. This is a _different_ fallacy, now...

I still do not understand. How are men "favourites by 1%-2%" and women "by 10% plus"? Favourites, for what?

And how did you calculate the margin of error for this study?

First, for each subject, you compare the chance of admission. For men, even in their most advantaged subject, their chance of admission is only 4% higher. For women it is 20% higher. You can't say that they are equivalent in the least. In terms of error margins, a few percent is common, from experience. You could do a standard 95% confidence interval calculation.

You're talking about the difference between the percentages of applicants of each sex that were admitted. I tabulate:

                  Men              Women              % Difference
    Department Applied  Admitted Applied  Admitted    Men     Women
    A          [825]    62%      108      [82%]               +20%
    B          [560]    63%      25       [68%]               +5%
    C          325      [37%]    [593]    34%         +3%
    D          [417]    33%      375      [35%]               +2%
    E          191      [28%]    [393]    24%         +4%
    F          [373]    6%       341      [7%]                +1%

So, there's a 20% difference for one department that is a clear outlier, and then everything is within a couple of percentage points of difference. In fact, the average difference is higher for men (3.5) than for women (2.67), ignoring the outlier, since it's an outlier.

However, I'm really not sure that taking the difference between proportions of different wholes is meaningful. The numbers don't add up to 100, so what does the difference mean, exactly?

I don't know what "a stats 95 confidence style" is, or how it is related to a margin of error, so please do that calculation and post your results.
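Since you asked: here's a rough version of what I assume was meant, a 95% margin of error for the difference of two independent proportions using the normal approximation (this is my assumption about the intended calculation, not anything from the study itself):

```python
from math import sqrt

def margin_95(n1, p1, n2, p2):
    """Approximate 95% margin of error for the difference between two
    independent proportions (normal approximation)."""
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return 1.96 * se

# Department A: 825 men at 62% vs 108 women at 82%
print(f"{margin_95(825, 0.62, 108, 0.82):.3f}")  # ~0.080

# Department E: 191 men at 28% vs 393 women at 24%
print(f"{margin_95(191, 0.28, 393, 0.24):.3f}")  # ~0.076
```

On this calculation the 20-point gap in department A is well outside the margin, while the few-point gaps elsewhere are within it, which is roughly the claim being made above.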

Simpson's Paradox is one of the many phenomena that shows how different applied ML is from regular software engineering. Another one is feedback loops between decomposed subproblems.

In ML, encapsulation, the shielding away of inner details, often does not work. One needs to know what is happening on the other side of the abstraction boundary. This is a problem for managers and PMs coming to ML from a purely software engineering background. They are used to encapsulation and decomposition serving them well, and they expect the same.

You’re right about ML. But you’re mistaken about software engineering—though in good company with most software engineers.

Of denotation, cache access, confidentiality, authentication, integrity, non-repudiability, performance, thread safety, memory overhead—only denotation and some parts of memory overhead allow composition of abstractions.

Would you mind elaborating on this?

>> In ML encapsulation, shielding away of inner details often does not work.

I call bs on this. It’s just that we haven’t yet invented a consistent type theory on top of ML.


This would be like saying “it’s just that we haven’t proven P!=NP” in CS. Best of luck.

Meanwhile applied people will deal with the problem by model diagnostics and sensitivity analysis as has been done for decades. I can’t wait for the next AI winter to come. So tired of this handwaving by people who don’t seem to have practical experience.

Whoah! You have quite a treasure trove in your favorites. The possibility of getting some work done vanished as soon as I found that.

Always happy to be a bad influence.

Yeah, but you just reduced the parent comment to ML being the same as CS. And the parent is saying the opposite: that they differ from each other.

So meanwhile, speaking of applied knowledge... I believe you didn't even read what you're replying to.

Observation #2: the paradox is essentially describing statistical gerrymandering. :)

Came here to say this. Simpson’s paradox is exactly how gerrymandering works. It’s all about how the data is grouped.

This is the exact feeling I've been having for years, nicely described in an easy to understand language. At least in data science and (god forbid) behavioral psychology, you can answer any question any way you like - statistically valid - by slightly shifting the level of focus (as described here), definitions or angle of attack. The more data, the easier.

Thanks for putting it in such a clear way :)

Neatly phrased:

Trends which appear in slices of data may disappear or reverse when the groups are combined.

Or perhaps even more succinctly: slicing data can introduce bias.

This is less accurate, because not slicing data can also lead to bias.

Except the original statement didn't make any claim about "not slicing", so neither does mine.

Not slicing is nevertheless a slicing: the trivial selection. As the Rush song goes, "If you choose not to decide, you still have made a choice."

In simple cases at least, such as with the kidney stones, can we reduce our risk of reaching wrong conclusions by increasing our sample size of patients and randomizing which receive each treatment?

Yes, but it won't help with other problems like measuring the wrong metric.

For example, the YouTube latency example linked at the bottom was a randomized A/B test ("launched an opt-in to a fraction of our traffic"), but it was measuring per-user latency metrics when the distribution of 'user' had changed radically thanks to the improvements; for this, he would've needed to instead be monitoring some more global long-term effect like user retention or total traffic (then he would've seen a result like 'latency got a lot worse, but we're getting a ton more users and they're coming back much more frequently, so, that's good overall but why is latency up and who are all these new users...? aha!'). You have a Simpson's paradox on the level of metrics here, instead of individuals.

Yes, absolutely! Random assignment, along with statistical power and significance considerations, does indeed allow one to draw causal conclusions. It's the gold standard for causal inference.
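A toy simulation makes the point concrete (the success rates here are invented for illustration): when the better treatment is preferentially given to hard cases, it looks worse in aggregate; randomizing the assignment removes the reversal.

```python
import random

random.seed(0)

# Invented success rates: treatment A beats B for both hard and easy cases.
SUCCESS = {('A', 'hard'): 0.73, ('A', 'easy'): 0.93,
           ('B', 'hard'): 0.69, ('B', 'easy'): 0.87}

def assign(case, randomize):
    if randomize:
        return random.choice('AB')
    # Confounded: hard cases usually get A, easy cases usually get B.
    favoured = 'A' if case == 'hard' else 'B'
    other = 'B' if favoured == 'A' else 'A'
    return favoured if random.random() < 0.75 else other

def trial(randomize, n=100_000):
    counts = {'A': [0, 0], 'B': [0, 0]}  # [successes, patients]
    for _ in range(n):
        case = random.choice(['hard', 'easy'])
        t = assign(case, randomize)
        counts[t][1] += 1
        counts[t][0] += random.random() < SUCCESS[(t, case)]
    return {t: s / total for t, (s, total) in counts.items()}

print(trial(randomize=False))  # A ~78%, B ~82%: A looks worse overall
print(trial(randomize=True))   # A ~83%, B ~78%: A's true advantage shows
```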

> Yes absolutely!

The problem with these cases is generally that people want to use data that didn't come from a controlled experiment to begin with. You have a nice, fat data set of all the people who have been treated for kidney stones -- you could never afford to do a controlled experiment at that scale. But because the treatments weren't randomized (and neither was anything else), the conclusions are erroneous.

This has been a huge problem in social sciences, where you can't do the controlled experiment at all, even at a smaller scale, because there is no way to randomize the choices individuals make. All you can do is try to control for the divergence statistically -- but there isn't one confounder in real data, there are thousands or more, and each one you want to control for multiplies the measurement error (because the measurement error in the primary factor combines with the measurement error in the control factor).

You're right, and in some instances it is possible to draw causal conclusions from observational data. See [0] and [1] for two pretty different perspectives. But for this to work, you need a lot of data: both lots of units (e.g. people), and a lot of information about each individual unit.

[0] Causality, Judea Pearl

[1] Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction, Guido Imbens and Donald Rubin

The trouble is you can't fix large numbers of statistical confounders with more data because there is a limit for how many factors you can control for before the measurement error overwhelms the signal.

To do statistical controls, you essentially sort the data by category, so that you're not just comparing black people with white people, you're comparing middle class 18 year old black female college applicants with college educated parents to middle class 18 year old white female college applicants with college educated parents.

But every one of those factors is a chance to have measured something wrong. Your group of middle class 18 year old black female college applicants with college educated parents will have a couple of people who were misidentified as middle class, a couple of people who were misidentified as black, a couple of people who were misidentified as female, a couple of people who were misidentified as 18 and a couple of people who were misidentified as having college educated parents. And they don't cancel out exactly because the original correlations with the primary factor existed to begin with, so the measurement error compounds in proportion to the strength of the correlation of the primary factor with each confounder.

Meanwhile the size of each subcategory shrinks each time you bisect it further. So the more things you try to control for, the higher the percentage of the sample in each subcategory is measurement error.

I hate to take “both sides” but in the absence of confounding by indication, you can often use propensity scoring within robust models to decrease these impacts.

Mind you, the problem with non random and undetected sampling bias is that it can be subtle. See for example https://www.nytimes.com/2018/08/06/upshot/employer-wellness-...

Propensity scoring is a method of applying statistical controls. How does it address the issue of controls compounding measurement error?

That’s the whole point of doubly robust models. However, in the event of confounding by indication or sampling misspecification, my experience is that nothing can save you.

I am a rather strong proponent of randomized trials for this exact reason. (They can also have sampling bias, but some degree of noise is inevitable)

Even if you had infinite data, you are not allowed to just control for everything you measured. You still need to bring in your causal knowledge. E.g. you probably shouldn't control for body weight if it was measured a month after the treatment.

The point is, if you use your causal knowledge in a smart way, you can also draw strong conclusions from just observational data.

Lots of practical challenges for sure!

I <3 this reply. So, so good. Sneaky way of introducing the RCT.

You’re doing it right.

Cool article. My knowledge of statistics is really rusty, but isn't this another way of approaching the topic of "Bayesian thinking"? If you think about the scenarios in the article from the standpoint of predicting any given outcome in advance, male vs. female and hard department vs. easy department should be treated as "priors". Or to put it another way, Bayesian thinking means asking the question "What is the chance of X happening given Y?"

A nice intro to the topic: https://betterexplained.com/articles/an-intuitive-and-short-...

Which explains why a positive test on a mammogram means you only have an 8% chance of having breast cancer:

>The chance of getting a real, positive result is .008. The chance of getting any type of positive result is the chance of a true positive plus the chance of a false positive (.008 + 0.09504 = .10304).

>So, our chance of cancer is .008/.10304 = 0.0776, or about 7.8%.

>Interesting — a positive mammogram only means you have a 7.8% chance of cancer, rather than 80% (the supposed accuracy of the test). It might seem strange at first but it makes sense: the test gives a false positive 9.6% of the time (quite high), so there will be many false positives in a given population. For a rare disease, most of the positive test results will be wrong.
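The quoted numbers fall straight out of Bayes' rule; a minimal sketch, using the figures from the linked article (1% prevalence, 80% sensitivity, 9.6% false-positive rate):

```python
# Bayes' rule applied to the quoted mammogram figures.
prevalence  = 0.01   # P(cancer)
sensitivity = 0.80   # P(positive | cancer)
false_pos   = 0.096  # P(positive | no cancer)

p_true_pos = prevalence * sensitivity                    # 0.008
p_any_pos  = p_true_pos + (1 - prevalence) * false_pos   # 0.10304
posterior  = p_true_pos / p_any_pos                      # P(cancer | positive)

print(f"{posterior:.1%}")  # ~7.8%
```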

This is actually a case that shows the limits of Bayesian thinking.

The power of probability is that it can work in two directions. You can use it to make predictions, from causes to effects, from past to future. Or you can use it to reason diagnostically, from effects to causes, like deducing what must have happened in the past to produce the current observation. Thinking probabilistically, these two cases are treated the same: they're both just conditioning on evidence, which is really elegant.

The problem is that when the two cases really need to be treated differently, probability can't distinguish between them. For example, asking about the probability of hypothetical situations, or predicting the results of interventions. You need to know which variables are causes and which are effects, but this is outside the scope of probability.

Simpson's paradox is something that only shows up when the variables involved have certain cause-effect structures. If you think in terms of these structures, it stops being counterintuitive.

This is more about knowing what the right question to ask is, which is trickier than expected. In the classic example, the people who brought the lawsuit asked “what are the odds of getting into Berkeley if you are a woman?” However, if people don’t apply to “Berkeley” but instead to “Berkeley’s College of Engineering”, then the right question is “what are the odds of getting into Berkeley’s college of engineering if you’re a woman”. The paradox is due to the fact that we expect the answers to be the same.

And all of this, of course, ignores sampling bias…

That is not a paradox. It's just the fact that a theory about something might not hold when you take a closer look at that something.

In the article's example, the admission rates of a university seemed to indicate that there is a bias against women.

Zooming in and looking at the admission rates of the individual departments seems to indicate that there is a bias against men.

The article makes it sound like the first theory was wrong. And the second theory - the bias against men - is the real truth.

Zooming in further might indicate the opposite again.

Take two boxers. So far, one of them has won 86% of his fights and the other one has won 100%. According to the article, "The data is clear".

Now we add more data:

One fighter is Mike Tyson. He won 50 of his 58 fights. The other one is me. I did one fight in kindergarten and won it. But to be honest: I would not want to fight Tyson. As paradoxical as it sounds.

It is a paradox. In common usage, a paradox is an apparent absurdity which nevertheless holds up upon deeper investigation. In this case the apparent absurdity is e.g. "Treatment A is better at treating kidney stones despite performing worse in both trials".

Sometimes the word paradox has a slightly different meaning. For example, Russell's paradox in mathematics is the opposite; it takes something apparently well-founded and shows that it is absurd.

It’s a paradox because many people find it counterintuitive. It’s the mathematical statement of why correlation does not imply causation. The existence of a confounding variable correlated both with the purported cause (eg gender) and the purported effect (school admissions) can lead to reversals in observed association when grouped or broken out. Thus it is challenging to draw causal conclusions from observational data.

That a pair of attributes doesn't necessarily exhibit independence within the universe at large, even if it exhibits independence within each sub-universe is a powerful observation, and it's a troubling one to anyone who has attempted to design a sales and marketing strategy, a drug trial, or frameworks to encourage social equality: To have it suggested I can say nothing less about these thousand students other than a thousand different things, just sounds so absurd, and yet here it is true.

Sometimes people use the term "paradox" simply for a contradictory-sounding statement which upon investigation turns out to be true. In that sense, "Simpson's Paradox" is absolutely a paradox.

I don't know, but at some point, aren't we just running up against the definition of "probability"?


> By doing so, the article makes the exact same mistake

Read further, the article talks about this

True. Shame on me. Removed this line from my otherwise wonderful comment :)

This is one of my favorite paradoxes too. Here's why:

"... given the same table, one should sometimes follow the partitioned and sometimes the aggregated data, depending on the story behind the data, with each story dictating its own choice. Pearl considers this to be the real paradox behind Simpson's reversal." [0]


Not really a paradox, but you will like https://en.wikipedia.org/wiki/Anscombe%27s_quartet

(if you've not encountered it before, which I suspect is unlikely!)

I sometimes wonder why people expect there to be any fixed, categorical semantic relationship between any set of numbers and set of natural language statements.

Very rarely do the words or the numbers cover even a tiny amount of the possible interpretations.

This is basically how gerrymandering works, isn't it?

I am reminded of this XKCD comic https://xkcd.com/2080/.

Iirc, you can guard against Simpson's paradox by designing/collecting balanced data.

I thought the same; at least in the kidney stone story, the data wasn't balanced: treatment A was assigned a lot more "harder cases". Either the trial wasn't randomized or the data set size wasn't big enough.
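The numbers usually quoted for this example (from Charig et al.'s 1986 study; I'm assuming these are the figures the article uses) show exactly that imbalance: treatment A got far more of the hard large-stone cases, which drags its aggregate rate down even though it wins in both strata.

```python
# (successes, patients) per stone size, per treatment.
a = {'small': ( 81,  87), 'large': (192, 263)}
b = {'small': (234, 270), 'large': ( 55,  80)}

for stone in ('small', 'large'):
    ra = a[stone][0] / a[stone][1]
    rb = b[stone][0] / b[stone][1]
    print(f"{stone}: A={ra:.0%} B={rb:.0%}")  # A better in both strata

ra_all = sum(s for s, _ in a.values()) / sum(n for _, n in a.values())
rb_all = sum(s for s, _ in b.values()) / sum(n for _, n in b.values())
print(f"overall: A={ra_all:.0%} B={rb_all:.0%}")  # yet B better overall
```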

Unless you are God, you will never be able to properly know what to factor in. Actually doing the experiment and analysis is exponentially harder. It's like saying "well, who cares about P=NP; if you want to decrypt AES without the key, just make a super fast computer."

Idk why the 2016 needs to be in the title here. I understand for date relevant content, but this is not.

It’s not uncommon for something clear and expository to be invalidated and putting the date in the title may cause someone who knows the domain to say “oh, this must be from before this was all invalidated” and post a useful reference as a comment.

Well, it's a convention, and some conventions (like this one) are better applied uniformly than allowing for accidental editorialization.

Another reason to put the year is it helps people decide if they’ve read it before.

I doubt that helps. For instance, this is from 2016, had you read it before?

I’m simply suggesting it because I don’t think it adds anything to the conversation. In addition, I’ve seen this being added more often lately and I worry it makes people think it’s date relevant (as I did) or that it somehow provides less value due to some time delay.

Another reason is it’s possible the same author writes an update or new article on the same topic. The year helps disambiguate that.
