But then the causal model is subjective, right? What if there are two different causal models, and it cannot be known a priori which is the "true" one?
Can the selection of the causal model be used to justify the dataset, in order to push a particular agenda?
If you're cutting holes in your report for political reasons, that's just not doing the job. That's what pundits are paid to do, not (ideally at least) scientists. Fraud is easy to commit, and the fact that it's possible is not that hard of a philosophical issue.
I’ll add it to my reading list.
Now, how does one go about rationalizing ignoring it?
Department   Men applied   Men admitted   Women applied   Women admitted
A                825           62%             108             82%
B                560           63%              25             68%
C                325           37%             593             34%
D                417           33%             375             35%
E                191           28%             393             24%
F                373            6%             341              7%
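Taken at face value, the table already contains the reversal. A minimal sketch (Python; the applicant counts are as reconstructed above, and admitted counts are rounded back out of the rates, so the totals are approximate):

    # Rates and applicant counts from the table above; admits are rounded
    # from the percentages, so the aggregate figures are approximate.
    table = {
        # dept: (men_applied, men_rate, women_applied, women_rate)
        "A": (825, 0.62, 108, 0.82), "B": (560, 0.63, 25, 0.68),
        "C": (325, 0.37, 593, 0.34), "D": (417, 0.33, 375, 0.35),
        "E": (191, 0.28, 393, 0.24), "F": (373, 0.06, 341, 0.07),
    }

    men_apps = sum(m for m, _, _, _ in table.values())
    men_adm = sum(round(m * r) for m, r, _, _ in table.values())
    women_apps = sum(w for _, _, w, _ in table.values())
    women_adm = sum(round(w * r) for _, _, w, r in table.values())

    print(f"men:   {men_adm}/{men_apps} = {men_adm / men_apps:.1%}")          # ~44.5%
    print(f"women: {women_adm}/{women_apps} = {women_adm / women_apps:.1%}")  # ~30.4%

Women do better in four of the six departments, yet worse in aggregate, because most women applied to the departments (C through F) that reject almost everyone.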
I can propose a mechanism for this kind of (with some abuse of terminology) selection bias. A department accepts some applications, then realises it has admitted too many applicants of one sex and starts rejecting applicants of the dominant sex in an attempt to redress the balance. It makes a mess of it and ends up biased too far in the opposite direction from where it started.
Also note that in 4 out of 6 departments, more men applied than women, explaining why more departments appear biased against men (provided my observation holds).
However, I can't be sure whether this is actually the original data, because it's nowhere to be found in my pdf copy of the study ("Sex Bias in Graduate Admissions"), which I believe I got from here: https://homepage.stat.uiowa.edu/~mbognar/1030/Bickel-Berkele.... If anyone knows where this data actually comes from, I'd welcome a pointer.
Now, picture this. Alice and Bob share a pizza. Alice takes 7 pieces and Bob takes 3 (he's on an intermittent fasting diet so he only eats every other slice). Alice eats 4 of her slices, Bob eats 3 of his. At the end, Alice turns to Bob and says "boy, you're such a glutton! You scoffed down all of your slices, but I still have 3 left".
Is that a fair comparison? Well, no. Alice starts out with more than double the slices Bob has. Bob eats fewer slices than Alice (3 vs 4), but he's accused of stuffing his face because he eats a larger proportion of his smaller share: 3/3 = 100% for Bob against 4/7 ≈ 57% for Alice.
Same with the Berkeley data. If that is the Berkeley data.
I can comfortably invent stories that are not inconsistent with the data for a wide range of scenarios:
1) Only the most capable women are applying to Dept A due to discrimination, so the data is evidence of discrimination.
2) Dept A is discriminating in favour of women (self-evident: 82% vs 62% admission rates).
3) Dept A is completely non-discriminatory and the assessors are unaware of the gender of applicants; the differences are due to personal choices w.r.t. education and social networks turning out to be proxies for gender.
No study of this sort of data can detect gender bias. It can be used as evidence in a broader study that comes up with a causal model for how the admissions process works; but there is no getting around interviews and field observations.
And how did you calculate the margin of error for this study?
Department   Men applied   Men admitted   Women applied   Women admitted   Difference
A                825           62%             108             82%         women +20%
B                560           63%              25             68%         women +5%
C                325           37%             593             34%         men +3%
D                417           33%             375             35%         women +2%
E                191           28%             393             24%         men +4%
F                373            6%             341              7%         women +1%
However, I'm really not sure that taking the difference between proportions of different wholes is meaningful. The numbers don't add up to 100, so what does the difference mean, exactly?
I don't know what "a stats 95 confidence style" is, or how it is related to a margin of error, so please do that calculation and post your results.
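For reference, one standard calculation here is a 95% confidence interval for the difference of two independent proportions (normal approximation). A minimal sketch using Dept A from the table above; the applicant counts are reconstructed, so treat the numbers as illustrative:

    import math

    def diff_ci(p1, n1, p2, n2, z=1.96):
        # Normal-approximation 95% CI for the difference of two proportions.
        d = p1 - p2
        se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
        return d - z * se, d + z * se

    # Dept A: women admitted at 82% of 108 applicants, men at 62% of 825.
    lo, hi = diff_ci(0.82, 108, 0.62, 825)
    print(f"women minus men: ({lo:.1%}, {hi:.1%})")  # roughly (12.0%, 28.0%)

The interval only says the gap in Dept A is too large to be sampling noise; whether the difference means anything causally is a separate question, as discussed above.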
In ML, encapsulation, the shielding away of inner details, often does not work. One needs to know what is happening on the other side of the abstraction boundary. This is a problem for managers and PMs coming to ML from a purely software engineering background: they are used to encapsulation and decomposition serving them well, and they expect the same here.
Of denotation, cache access, confidentiality, authentication, integrity, non-repudiability, performance, thread safety, memory overhead—only denotation and some parts of memory overhead allow composition of abstractions.
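To make the thread-safety point concrete, here is the classic toy failure (a Python sketch with a hypothetical shared counter): two operations that are each safe in isolation do not compose into a safe whole.

    import threading

    counts = {}  # each individual dict read or write is fine on its own...

    def record(key):
        for _ in range(100_000):
            if key not in counts:          # ...but this check...
                counts[key] = 0
            counts[key] = counts[key] + 1  # ...and this read-modify-write
                                           # can interleave across threads

    threads = [threading.Thread(target=record, args=("hits",)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(counts["hits"])  # often less than 400000: updates can be lost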
I call bs on this. It’s just that we haven’t yet invented a consistent type theory on top of ML.
This would be like saying “it’s just that we haven’t proven P!=NP” in CS. Best of luck.
Meanwhile applied people will deal with the problem by model diagnostics and sensitivity analysis as has been done for decades. I can’t wait for the next AI winter to come. So tired of this handwaving by people who don’t seem to have practical experience.
So meanwhile, speaking of applied knowledge... I believe you didn't even read what you're replying to.
Thanks for putting it in such a clear way :)
Trends which appear in slices of data may disappear or reverse when the groups are combined.
For example, the YouTube latency example linked at the bottom was a randomized A/B test ("launched an opt-in to a fraction of our traffic"), but it was measuring per-user latency metrics when the distribution of 'user' had changed radically thanks to the improvements; for this, he would've needed to instead be monitoring some more global long-term effect like user retention or total traffic (then he would've seen a result like 'latency got a lot worse, but we're getting a ton more users and they're coming back much more frequently, so, that's good overall but why is latency up and who are all these new users...? aha!'). You have a Simpson's paradox on the level of metrics here, instead of individuals.
The problem with these cases is generally that people want to use data that didn't come from a controlled experiment to begin with. You have a nice, fat data set of all the people who have been treated for kidney stones -- you could never afford to do a controlled experiment at that scale. But because the treatments weren't randomized (and neither was anything else), the conclusions are erroneous.
This has been a huge problem in social sciences, where you can't do the controlled experiment at all, even at a smaller scale, because there is no way to randomize the choices individuals make. All you can do is try to control for the divergence statistically -- but there isn't one confounder in real data, there are thousands or more, and each one you want to control for multiplies the measurement error (because the measurement error in the primary factor combines with the measurement error in the control factor).
 Causality, Judea Pearl
 Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction, Guido Imbens and Donald Rubin
To do statistical controls, you essentially sort the data by category, so that you're not just comparing black people with white people, you're comparing middle class 18 year old black female college applicants with college educated parents to middle class 18 year old white female college applicants with college educated parents.
But every one of those factors is a chance to have measured something wrong. Your group of middle class 18 year old black female college applicants with college educated parents will have a couple of people who were misidentified as middle class, a couple of people who were misidentified as black, a couple of people who were misidentified as female, a couple of people who were misidentified as 18 and a couple of people who were misidentified as having college educated parents. And they don't cancel out exactly because the original correlations with the primary factor existed to begin with, so the measurement error compounds in proportion to the strength of the correlation of the primary factor with each confounder.
Meanwhile the size of each subcategory shrinks each time you bisect it further. So the more things you try to control for, the higher the percentage of the sample in each subcategory is measurement error.
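A minimal simulation of both effects at once (made-up sizes and a made-up 2% mislabel rate per factor): each added control roughly halves the subgroup while raising the share of mislabeled people inside it.

    import random

    random.seed(0)
    N = 100_000
    MISLABEL = 0.02  # assume 2% of the labels for each factor are wrong

    # Five binary control factors per person (class, age, race, sex, parents).
    truth = [[random.random() < 0.5 for _ in range(5)] for _ in range(N)]
    # Observed labels: each true label flips with probability MISLABEL.
    seen = [[t != (random.random() < MISLABEL) for t in person] for person in truth]

    for k in range(1, 6):
        # The subgroup whose first k *observed* labels all match the target.
        subgroup = [i for i in range(N) if all(seen[i][:k])]
        wrong = sum(1 for i in subgroup if seen[i][:k] != truth[i][:k])
        print(f"{k} controls: n={len(subgroup):6}, mislabeled={wrong / len(subgroup):.1%}")

With these toy numbers the subgroup shrinks from about 50,000 to about 3,000 while the mislabeled share climbs from about 2% to about 10%; real controls are worse, because as noted above the errors correlate with the primary factor instead of cancelling.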
Mind you, the problem with non-random and undetected sampling bias is that it can be subtle. See, for example, https://www.nytimes.com/2018/08/06/upshot/employer-wellness-...
I am a rather strong proponent of randomized trials for this exact reason. (They can also have sampling bias, but some degree of noise is inevitable)
The point is, if you use your causal knowledge in a smart way, you can also draw strong conclusions from just observational data.
You’re doing it right.
A nice intro to the topic: https://betterexplained.com/articles/an-intuitive-and-short-...
Which explains why a positive mammogram means you have only about an 8% chance of having breast cancer:
>The chance of getting a real, positive result is .008. The chance of getting any type of positive result is the chance of a true positive plus the chance of a false positive (.008 + 0.09504 = .10304).
>So, our chance of cancer is .008/.10304 = 0.0776, or about 7.8%.
>Interesting — a positive mammogram only means you have a 7.8% chance of cancer, rather than 80% (the supposed accuracy of the test). It might seem strange at first but it makes sense: the test gives a false positive 9.6% of the time (quite high), so there will be many false positives in a given population. For a rare disease, most of the positive test results will be wrong.
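The whole calculation is three lines of arithmetic. A sketch with the article's numbers:

    # Numbers from the quoted article: 1% prevalence, 80% sensitivity,
    # 9.6% false-positive rate.
    prevalence = 0.01
    sensitivity = 0.80
    false_positive_rate = 0.096

    p_true_pos = prevalence * sensitivity                            # 0.008
    p_any_pos = p_true_pos + (1 - prevalence) * false_positive_rate  # 0.10304
    print(p_true_pos / p_any_pos)  # 0.0776..., i.e. about 7.8%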
The power of probability is that it can work in two directions. You can use it to make predictions, from causes to effects, from past to future. Or you can use it to reason diagnostically, from effects to causes, like deducing what must have happened in the past to produce the current observation. Thinking probabilistically, these two cases are treated the same: they're both just conditioning on evidence, which is really elegant.
The problem is that when the two cases really need to be treated differently, probability can't distinguish between them. For example, asking about the probability of hypothetical situations, or predicting the results of interventions. You need to know which variables are causes and which are effects, but this is outside the scope of probability.
Simpson's paradox is something that only shows up when the variables involved have certain cause-effect structures. If you think in terms of these structures, it stops being counterintuitive.
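To make the intervention point concrete, here is a sketch using the usual textbook kidney-stone figures (the example mentioned upthread). Conditioning on the treatment favours B, but the backdoor adjustment over stone size, which is what the intervention question asks, favours A:

    # Z = stone size (confounder), X = treatment, Y = recovery.
    # (treatment, stone size): (patients, recovered) -- the textbook figures.
    data = {
        ("A", "small"): (87, 81),   ("A", "large"): (263, 192),
        ("B", "small"): (270, 234), ("B", "large"): (80, 55),
    }

    def p_recover_given(x):
        # Observational P(Y=1 | X=x): pool all patients who got treatment x.
        n = sum(v[0] for (t, _), v in data.items() if t == x)
        r = sum(v[1] for (t, _), v in data.items() if t == x)
        return r / n

    def p_recover_do(x):
        # Interventional P(Y=1 | do(X=x)): average the per-stratum recovery
        # rate, weighted by how common each stone size is overall.
        total = sum(v[0] for v in data.values())
        acc = 0.0
        for z in ("small", "large"):
            p_z = sum(v[0] for (_, s), v in data.items() if s == z) / total
            n, r = data[(x, z)]
            acc += (r / n) * p_z
        return acc

    for x in ("A", "B"):
        print(x, f"P(Y|X)={p_recover_given(x):.2f}", f"P(Y|do(X))={p_recover_do(x):.2f}")

Same joint distribution, two different questions. The causal structure (stone size influences both the choice of treatment and the recovery) is what tells you the adjusted number is the one that answers the intervention question.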
And all of this, of course, ignores sampling bias…
In the article's example, the admission rates of the university as a whole seemed to indicate that there is a bias against women.
Zooming in and looking at the admission rates of the individual departments seems to indicate that there is a bias against men.
The article makes it sound like the first theory was wrong, and the second theory, the bias against men, is the real truth.
Zooming in further might indicate the opposite again.
Take two boxers. So far, one of them has won 86% of his fights and the other one has won 100%. According to the article, "The data is clear".
Now we add more data:
One fighter is Mike Tyson. He won 50 of his 58 fights. The other one is me. I did one fight in kindergarten and won it. But to be honest: I would not want to fight Tyson. As paradoxical as it sounds.
Sometimes the word paradox has a slightly different meaning. For example, Russell's paradox in mathematics is the opposite; it takes something apparently well-founded and shows that it is absurd.
Sometimes people apply the term "paradox" simply to a seemingly contradictory statement which, upon investigation, turns out to be true. In that sense, "Simpson's Paradox" is absolutely a paradox.
Read further; the article talks about this:
"... given the same table, one should sometimes follow the partitioned and sometimes the aggregated data, depending on the story behind the data, with each story dictating its own choice. Pearl considers this to be the real paradox behind Simpson's reversal." 
(if you've not encountered it before, which I suspect is unlikely!)
Very rarely do the words or the numbers cover even a tiny fraction of the possible interpretations.
Well, it’s a convention, and some conventions (like this one) are better applied uniformly than left open to accidental editorialization.
I’m simply suggesting it because I don’t think it adds anything to the conversation. In addition, I’ve seen this being added more often lately, and I worry it makes people think the date is relevant (as I did) or that the article somehow provides less value due to some time delay.