I think one misconception is that to do stats right, you just need to do the math right. But what analysis is done in the first place, the overall methodology, and how the data is interpreted are more often where we go wrong.
At Geckoboard we've been trying to raise awareness of some of these issues. Here's a poster we put together: https://www.geckoboard.com/learn/data-literacy/statistical-f...
E.g. for the mammogram:
P(cancer) = 0.8%
P(~cancer) = 1 - P(cancer) = 99.2%
P(positive_mammogram | cancer) = 90%
P(positive_mammogram | ~cancer) = 7%
P(cancer | positive_mammogram)
  = P(positive_mammogram | cancer) * P(cancer)
      / (P(positive_mammogram | cancer) * P(cancer)
         + P(positive_mammogram | ~cancer) * P(~cancer))
  = (90% * 0.8%) / (90% * 0.8% + 7% * 99.2%)
  = 0.72% / 7.664%
  = 9.39457%
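The same calculation as a runnable snippet (a minimal sketch; the numbers are the ones above):

    # Posterior probability of cancer given a positive mammogram, via Bayes' rule
    p_cancer = 0.008            # P(cancer)
    p_pos_given_cancer = 0.90   # P(positive_mammogram | cancer), i.e. sensitivity
    p_pos_given_healthy = 0.07  # P(positive_mammogram | ~cancer), false positive rate

    # Total probability of a positive mammogram
    p_pos = (p_pos_given_cancer * p_cancer
             + p_pos_given_healthy * (1 - p_cancer))

    p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
    print(f"{p_cancer_given_pos:.5%}")  # -> 9.39457%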
I have a hunch that determining a correct model is like a science all on its own. Are there any good books/blog posts/etc on that?
The book explains it with prose: Imagine 1,000 randomly selected women who choose to get mammograms. Eight of them (0.8%) have breast cancer. The mammogram correctly detects 90% of breast cancer cases, so about seven of the eight women will have their cancer discovered. However, there are 992 women without breast cancer, and 7% will get a false positive reading on their mammograms, giving us 70 women incorrectly told they have cancer. In total, we have 77 women with positive mammograms, 7 of whom actually have breast cancer. Only 9% of women with positive mammograms have breast cancer.
That's the biggest issue. It's very unlikely that your model family actually contains the data-generating distribution.
In particular, the issue raised here is about comparing estimates when the sample sizes are very different. Hierarchical partial pooling can be used to accurately compare groups when the sample sizes vary widely, even when some groups have zero observations. Here is an example talking about baseball batting averages.
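For a flavor of what partial pooling buys you, here's a minimal empirical-Bayes sketch (a simpler cousin of a full hierarchical model; the prior parameters and player numbers are made up for illustration):

    # Empirical-Bayes shrinkage of batting averages toward the league mean.
    # The Beta(alpha, beta) prior plays the role of the shared "population"
    # distribution in a hierarchical model.
    players = {           # hypothetical (hits, at_bats)
        "A": (2, 5),      # tiny sample: raw average 0.400
        "B": (150, 500),  # large sample: raw average 0.300
        "C": (0, 0),      # no observations at all
    }

    alpha, beta = 79.0, 229.0  # prior, assumed fit to league-wide averages

    for name, (hits, at_bats) in players.items():
        raw = hits / at_bats if at_bats else float("nan")
        shrunk = (hits + alpha) / (at_bats + alpha + beta)
        print(f"{name}: raw={raw:.3f}  shrunk={shrunk:.3f}")

The tiny sample gets pulled hard toward the league mean, the large sample barely moves, and the group with zero observations still gets a sensible estimate (the prior mean).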
Sorry if I've got this wrong, but does this line say that we're assuming the probability of someone having cancer given a positive mammogram is the same as the probability of them having a positive mammogram given that they have cancer?
As far as I know, the first one would be the Positive Predictive Value (PPV) of the mammogram, the second one would be the Sensitivity of the mammogram.
They're related but not usually the same. The PPV would change depending on the prevalence of the disease (go up as the prevalence goes up), but the sensitivity would remain the same.
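A quick sketch of that relationship (illustrative numbers, with specificity fixed so the false positive rate stays at 7%):

    # PPV depends on prevalence; sensitivity does not.
    sensitivity = 0.90
    specificity = 0.93  # false positive rate = 7%

    def ppv(prevalence):
        true_pos = sensitivity * prevalence
        false_pos = (1 - specificity) * (1 - prevalence)
        return true_pos / (true_pos + false_pos)

    for prev in (0.008, 0.05, 0.20):
        print(f"prevalence={prev:.1%}  PPV={ppv(prev):.1%}")
    # prevalence=0.8%   PPV=9.4%
    # prevalence=5.0%   PPV=40.4%
    # prevalence=20.0%  PPV=76.3%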
HN really needs better code formatting.
Basically, most of them boil down to:
- mistakes that can be tackled if you write it down explicitly
- hidden assumptions that can be discovered (and made explicit or modified)
There is some philosophical difference between frequentist and Bayesian probability (and for some reason, I only know of people moving one way). As someone put it:
"Frequentist probability is Bayesian probability, where priors are flat, hidden, and considered taboo".
BTW: Frequentists vs. Bayesians https://xkcd.com/1132/ (there is never too much of xkcd!)
The Bayesian is correct to offer the bet.
Who is correct about the sun exploding is actually irrelevant to that; all that's needed here is the conditional probability of the bet being collectable if the sun has exploded vs. if it has not. You would care about the probability that the sun actually had exploded if one of those weren't zero, but it is, so it doesn't matter.
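Spelled out, using the same notation as above (a sketch; the $50 stake is from the comic):

    EV(offering the bet)
      = P(exploded) * (-$50) * P(collectable | exploded)
        + P(~exploded) * (+$50) * P(collectable | ~exploded)
      = P(exploded) * (-$50) * 0 + P(~exploded) * (+$50) * 1
      = P(~exploded) * $50  >  0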
Hearing it from the side of probability theory (measure spaces etc.), to me it sounds very much like "tomayto" vs. "tomahto"?
In the comic, the frequentist is asking: "the machine said yes, what is the probability of that if the Sun hasn't exploded?" Since that probability (the p-value) is less than 0.05, the frequentist concludes the Sun has exploded, which illustrates a common error: mistaking statistical significance for truth.
Bayesians, in contrast, interpret probabilities as beliefs about the world and use experiments to update those beliefs, in accordance with Bayes' rule.
In the comic, the Bayesian has presumably started with a strong belief that the Sun has not exploded, and the evidence that the machine says "Yes" one time slightly reduces their certainty but isn't strong enough to convert that belief to "the Sun has (probably) exploded".
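For instance, with an assumed (tiny) prior and the comic's 1-in-36 chance that the machine lies, a quick sketch of the update:

    # Bayesian update for the xkcd detector: machine says "yes" once.
    p_exploded = 1e-6           # assumed prior; pick your own
    p_yes_if_exploded = 35/36   # machine tells the truth unless double sixes
    p_yes_if_not = 1/36         # machine lies on double sixes

    posterior = (p_yes_if_exploded * p_exploded) / (
        p_yes_if_exploded * p_exploded + p_yes_if_not * (1 - p_exploded))
    print(posterior)  # ~3.5e-5: belief nudged up, but still essentially zero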
Most people (even frequentist statisticians) actually interpret statistics that way, but Bayesian statistics formalizes it mathematically.
So the difference is only philosophical, plus that frequentists, according to Bayesians, tend to make errors more often in setting up their models?
The result is a different way of practicing statistics, not merely a difference in interpretation.
1 - Monitoring tests on an ongoing basis and then calling them as soon as they hit some confidence threshold (like 95%) will give you biased results. It's important to determine your sample size up front and then let the test run all the way through, or at least be aware that the results are less reliable if you stop early (see the simulation sketch after this list).
2 - Testing for multiple metrics requires a much larger sample. If you run a test and then compare conversion rate, purchase amount, pageviews per session, retention, etc. etc., you'll have a much higher error rate since the more things you measure, the more likely you are to get an outlier. You either need to run a separate test for each metric or increase your sample size a lot to account for this effect (iirc the math for exactly how much is in the book).
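To see how bad the peeking problem from point 1 can get, here's a minimal simulation under the null (no real effect, so every "significant" result is a false positive); the test and check interval are illustrative:

    # How often does "stop as soon as p < 0.05" reject a true null?
    import math, random

    def z_significant(successes, n, p0=0.5):
        # Two-sided z-test for a proportion against p0
        if n == 0:
            return False
        se = math.sqrt(p0 * (1 - p0) / n)
        z = (successes / n - p0) / se
        return abs(z) > 1.96

    def run_experiment(n_max=5000, check_every=100):
        successes = 0
        for i in range(1, n_max + 1):
            successes += random.random() < 0.5  # null: no effect
            if i % check_every == 0 and z_significant(successes, i):
                return True  # "significant!" -- stopped early
        return z_significant(successes, n_max)

    trials = 2000
    false_pos = sum(run_experiment() for _ in range(trials)) / trials
    print(f"False positive rate with peeking: {false_pos:.1%}")  # well above 5%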
Regarding AB testing, you might be interested in this recent research, which uses real data from Optimizely to estimate how often people get AB test false positives because they stopped as soon as they hit significance: https://ssrn.com/abstract=3204791
> Specifically, about 73% of experimenters stop the experiment just when a positive effect reaches 90% confidence. Also, approximately 75% of the effects are truly null. Improper optional stopping increases the false discovery rate (FDR) from 33% to 40% among experiments p-hacked at 90% confidence
The Bayesian method doesn't really solve this so much as it answers a fundamentally different question -- modeling how your personal belief changes. Those posteriors typically will not have an interpretation as a normalized long-term frequency. As long as the Bayesian posteriors are not interpreted as frequentist probabilities, they are perfectly acceptable.
That said, this 'peeking' problem can be easily resolved in the frequentist setting, and this is well known in the stats, probability, and hopefully ML literature. The core results are really old; in fact, this was classified information during World War II. If you are interested, search for sequential hypothesis testing. Sequential tests are actually more efficient than their batch cousins.
Bayesian vs. frequentist is an orthogonal axis from sequential/online vs. batch. Think of a 2x2 box; you can choose to be in any quadrant you want.
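For the curious, a bare-bones sketch of Wald's sequential probability ratio test for a Bernoulli rate (hypotheses and error rates are illustrative):

    # Sequential probability ratio test: H0: p = 0.5 vs H1: p = 0.6.
    import math, random

    alpha, beta = 0.05, 0.05  # target type I / type II error rates
    p0, p1 = 0.5, 0.6
    upper = math.log((1 - beta) / alpha)  # accept H1 above this
    lower = math.log(beta / (1 - alpha))  # accept H0 below this

    llr = 0.0  # running log-likelihood ratio
    n = 0
    while lower < llr < upper:
        x = random.random() < 0.6  # simulate one observation from H1
        llr += math.log((p1 if x else 1 - p1) / (p0 if x else 1 - p0))
        n += 1

    print("accept H1" if llr >= upper else "accept H0", f"after {n} samples")

Instead of fixing the sample size up front, you keep sampling until the evidence crosses a threshold, and on average you stop much sooner than the equivalent fixed-size test.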
How to Lie with Statistics by Darrell Huff
I'd put it in a class with Fred Brooks; dead obvious, but somehow needing to be constantly re-explained to a new group of people (or the same people at a later date).
Because it's not all 'dead obvious', maybe?
I always kind of wonder when I see a medical doctor explaining statistics to a general audience: do they really have this right?
For most mammographers, practicing medicine is not deliberate practice, according to Ericsson. It’s more like putting into a tin cup than working with a coach. That’s because mammographers usually only find out if they missed a tumor months or years later, if at all, at which point they’ve probably forgotten the details of the case and can no longer learn from their successes and mistakes.
One field of medicine in which this is definitively not the case is surgery. Unlike mammographers, surgeons tend to get better with time. What makes surgeons different from mammographers, according to Ericsson, is that the outcome of most surgeries is usually immediately apparent—the patient either gets better or doesn’t—which means that surgeons are constantly receiving feedback on their performance.
I wonder: if you made mammographers spend a day a week analyzing mammograms from 5 years ago and then showed them the outcomes, would they get more accurate?
Mammographers would still be missing the motivation surgeons get when they see a patient die hours or days later. The emotional kick in the pants is harder to provide.
My wife is in her last year of medical school. I ask myself the same question frequently when talking with her peers and seniors. They certainly get taught the basics of interpreting statistics, but I don't really see the "average" doc discussing things in a critical manner.
The system seems deeply focused on the results without critically thinking about the context and appropriateness of the studies.
Honest mistakes occur because what looks like a simple problem cannot always be analysed very easily (or in an obvious way).
Abuse occurs because it's quite easy to fool others that an analysis is sound - most people aren't sophisticated enough to identify problems, even if they are given the data!
I am reading Sex by Numbers, which I enjoy a lot. It's a touchy subject with a lot of data of varying quality. A naive approach would be to take all of it; a dogmatic one, to set an arbitrary (and subjective!) threshold separating "good" from "bad" data. I love the way it is done there, i.e. by grading sources:
4: numbers that we can believe (e.g. births and deaths)
3: numbers that are reasonably accurate (e.g. well-designed and conducted surveys, such as the Natsal report)
2: numbers that could be out by quite a long way (e.g. non-uniform sampling, as in the Kinsey report)
1: numbers that are unreliable (e.g. surveys from newspapers, even with huge sample sizes)
0: numbers that have just been made up (e.g. "men think of sex every 7 seconds")
Just have a peek at the first chapter, which is freely accessible and is exactly about data reporting, data reliability, and dealing with subjective questions. A lot of thought is given to the possible biases (e.g. people who are less likely to respond, or who would like to downplay or exaggerate some things) and to the consistency of measurements.
So, in short: I started reading it for curious facts about sex and ended up recommending it to my data science students and mentees. (The vast majority of problems start with how you collect data, how you interpret it, and how well you are aware of its shortcomings.)