Statistics Done Wrong (statisticsdonewrong.com)
547 points by lawlorino 18 days ago | 54 comments



I really rate this book so it’s great to see it getting some publicity!

I think one common misconception is that to do stats right you just need to do the math right. But what analysis is done in the first place, the overall methodology, and how the data is interpreted are more often where we go wrong.

At Geckoboard we’ve been trying to raise awareness of some of these issues. Here’s a poster we put together: https://www.geckoboard.com/learn/data-literacy/statistical-f...


Isn't most or all of this avoided by explicitly using Bayes' theorem along with a correct formalization of the domain?

E.g. for the mammogram:

P(cancer) = 0.8%

P(~cancer) = 1 - P(cancer) = 99.2%

P(positive_mammogram | cancer) = 90%

P(positive_mammogram | ~cancer) = 7%

P(cancer | positive_mammogram) = P(positive_mammogram | cancer) P(cancer) / (P(positive_mammogram | cancer) P(cancer) + P(positive_mammogram | ~cancer) (1 - P(cancer))) = 90% * 0.8% / (90% * 0.8% + 7% * 99.2%) = 9.39457%
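For those who prefer code, the arithmetic above can be checked in a few lines (a sketch that just restates the parent's numbers):

```python
# The parent comment's mammogram numbers, plugged straight into Bayes' theorem.
p_cancer = 0.008                # prior: P(cancer)
p_pos_given_cancer = 0.90       # sensitivity: P(positive_mammogram | cancer)
p_pos_given_no_cancer = 0.07    # false positive rate: P(positive_mammogram | ~cancer)

# Total probability of a positive mammogram (law of total probability).
p_pos = (p_pos_given_cancer * p_cancer
         + p_pos_given_no_cancer * (1 - p_cancer))

# Posterior: P(cancer | positive_mammogram).
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
print(f"{p_cancer_given_pos:.4%}")  # -> 9.3946%
```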


In short, not entirely. Bayes' theorem helps when you have a correct model. People don't have access to truth or correct models. Generally, people applying Bayesian methods are using cookie-cutter formulas. These can't protect you from the many, many ways one can muck up the data, especially if you want your decisions and density intervals to be close to "right".


> when you have a correct model

I have a hunch that determining a correct model is like a science all on its own. Are there any good books/blog posts/etc on that?


i think the answer is yes, it is avoided in the way that you demonstrate. i guess it's a matter of using an approach that reaches your audience.

the book explains it with prose: Imagine 1,000 randomly selected women who choose to get mammograms. Eight of them (0.8%) have breast cancer. The mammogram correctly detects 90% of breast cancer cases, so about seven of the eight women will have their cancer discovered. However, there are 992 women without breast cancer, and 7% will get a false positive reading on their mammograms, giving us 70 women incorrectly told they have cancer. In total, we have 77 women with positive mammograms, 7 of whom actually have breast cancer. Only 9% of women with positive mammograms have breast cancer.


I know this is an example, but this assumes false positives and false negatives are equally likely, and I now am wondering if that is true in real life.


It doesn't:

P(positive_mammogram | cancer) = 90%

P(positive_mammogram | ~cancer) = 7%


To be a little clearer, the false negative rate is 10% and the false positive rate is 7%.


> along with a correct formalization of the domain?

That's the biggest issue. It's very unlikely that your model family actually contains the data-generating distribution.


Direct application of Bayes' theorem, or more generally Bayesian inference (made easy now through probabilistic programming languages like Stan or PyMC3), can solve all of the issues that I read on this site.

In particular, the issue raised here[1] is about comparing estimates when the sample sizes are very different. Hierarchical partial pooling can be used to accurately compare groups when the sample sizes vary widely, even when some groups have zero observations. Here is an example using baseball batting averages[2].

[1] https://www.statisticsdonewrong.com/regression.html#little-e... [2] https://docs.pymc.io/notebooks/hierarchical_partial_pooling....
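To give a flavor of the shrinkage idea behind partial pooling without the full hierarchical model from the PyMC3 notebook, here is a simplified sketch with a fixed Beta prior over batting averages (all numbers here are made up for illustration):

```python
# Beta-Binomial shrinkage: each player's estimate is pulled toward the
# prior mean, with the amount of shrinkage shrinking as data accumulates.
alpha, beta = 80.0, 220.0   # hypothetical prior, mean = 80/300 ~ 0.267

def posterior_mean(hits, at_bats):
    """Posterior mean batting average under a Beta-Binomial model."""
    return (alpha + hits) / (alpha + beta + at_bats)

# A player with zero observations falls back to the prior mean...
print(posterior_mean(0, 0))      # 80/300 ~ 0.267
# ...a 4-for-10 player is shrunk heavily toward the prior...
print(posterior_mean(4, 10))
# ...while a 160-for-400 player keeps much more of the raw 0.400 signal.
print(posterior_mean(160, 400))
```

The full hierarchical version additionally learns alpha and beta from the data, rather than fixing them up front.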


> P(cancer | positive_mammogram) = P(positive_mammogram | cancer)

Sorry if I've got this wrong, but does this line say that we're assuming the probability of someone having cancer if they've got a positive mammogram, is the same as the probability of them having a positive mammogram if they've got cancer?

As far as I know, the first one would be the Positive Predictive Value (PPV) of the mammogram, the second one would be the Sensitivity of the mammogram.

They're related but not usually the same. The PPV would change depending on the prevalence of the disease (go up as the prevalence goes up), but the sensitivity would remain the same.


You left out most of the equation. The whole equation as posted is just an application of Bayes' rule: P(A|B) = P(B|A)P(A)/P(B). Your excerpt leaves out everything after P(B|A).


If you narrow your window enough to get the mobile (responsive) view it looks like separate lines.

HN really needs better code formatting.


Ha! That's exactly what I did sorry! I had the browser just taking up half the screen, and they split across a few lines.


Yes, most of the problems (with conditional probability, statistical tests, significance, etc.) disappear once you express them in a Bayesian way (it's not only Bayes' formula - it's explicitly creating a Bayesian model).

Basically, most of them boil down to:

- mistakes that can be tackled if you write it down explicitly

- hidden assumptions that can be discovered (and made explicit or modified)

There is some philosophical difference between frequentist and Bayesian probability (and for some reason, I know of people moving in only one direction). As the quip goes:

"Frequentist probability is Bayesian probability, where priors are flat, hidden, and considered taboo".

BTW: Frequentists vs. Bayesians https://xkcd.com/1132/ (there is never too much of xkcd!)


That is fascinating. Which one is correct? Is Bayesian using more assumptions than Frequentist, namely the fact that repeated queries that haven't been done yet will show that the machine's answer is NO most of the time?


> Which one is correct?

The Bayesian is correct to offer the bet.

Who is correct about the sun exploding is actually irrelevant to that; only the conditional probability of the bet being collectable if the sun has exploded vs. if it has not is needed here. You would care about the probability that the sun actually had exploded if one of those weren't zero, but it is, so it doesn't matter.


Could you give a brief explanation what or who "frequentists" and "Bayesians" are?

Hearing it from the side of probability (measure spaces etc.) to me it sounds very much like "tomato" vs "tomato"?


Frequentist statistics is "classical" statistics, the kind most people know. It's often used to control the chance of reporting an effect when there's no real effect (the p-value); in many fields 5% is accepted, which, of course, leaves a lot of false positives among millions of studies.

In the comic, the frequentist is asking: "the machine said yes, what is the probability of that if the Sun hasn't exploded?" Since that probability (the p-value) is less than 0.05, the frequentist concludes the Sun has exploded, which illustrates a common error: mistaking statistical significance for truth.

Bayesians, in contrast, interpret probabilities as beliefs about the world and use experiments to update those beliefs, in accordance with Bayes' rule.

In the comic, the Bayesian has presumably started with a strong belief that the Sun has not exploded, and the evidence that the machine says "Yes" one time slightly reduces their certainty but isn't strong enough to convert that belief to "the Sun has (probably) exploded".

Most people (even frequentist statisticians) actually interpret statistics that way, but Bayesian statistics formalizes it mathematically.
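The update in the comic can be sketched numerically. The prior below is a made-up, deliberately tiny number; the detector lies when both dice come up six, i.e. with probability 1/36:

```python
# Bayesian update for the xkcd 1132 detector: one "yes" barely moves
# a strong prior that the sun has not exploded.
prior = 1e-9                      # hypothetical P(sun exploded)
p_yes_given_exploded = 35 / 36    # detector tells the truth
p_yes_given_fine = 1 / 36         # detector rolls double sixes and lies

posterior = (p_yes_given_exploded * prior) / (
    p_yes_given_exploded * prior + p_yes_given_fine * (1 - prior))
print(posterior)  # larger than the prior, but still vanishingly small
```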


> Most people (even frequentist statisticians) actually interpret statistics that way, but Bayesian statistics formalizes it mathematically.

So the difference is only philosophical and that frequentists, according to Bayesians, tend to make errors more often in setting up their models?


The core difference is philosophical, but building on that, Bayesians have built a new set of mathematical tools involving things like conjugate priors, posterior probabilities, credible intervals, and Bayes factors.

The result is a different way of practicing statistics, not merely a difference in interpretation.


This is a great book. I read it a couple years ago and I remember a couple takeaways that apply well to AB testing:

1 - Monitoring tests on an ongoing basis and then calling them as soon as they hit some confidence threshold (like 95%) will give you biased results. It's important to determine your sample size up front and then let the test run all the way through, or at least be aware that the results are less reliable if you stop early.

2 - Testing for multiple metrics requires a much larger sample. If you run a test and then compare conversion rate, purchase amount, pageviews per session, retention, etc. etc., you'll have a much higher error rate since the more things you measure, the more likely you are to get an outlier. You either need to run a separate test for each metric or increase your sample size a lot to account for this effect (iirc the math for exactly how much is in the book).
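Point 1 can be demonstrated with a toy simulation (hypothetical parameters throughout): run many A/A "experiments" where both variants share the same true conversion rate, and compare the false positive rate of a fixed-horizon test against a test that stops the moment any interim peek looks significant.

```python
import math
import random

random.seed(0)

def z_stat(conv_a, n_a, conv_b, n_b):
    """Two-proportion z statistic with a pooled variance estimate."""
    p = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0
    return (conv_a / n_a - conv_b / n_b) / se

def run(peek_every, n_total, rate=0.1):
    """One A/A test; return (significant at end, significant at any peek)."""
    conv_a = conv_b = 0
    hit_on_peek = False
    for i in range(1, n_total + 1):
        conv_a += random.random() < rate
        conv_b += random.random() < rate
        if i % peek_every == 0 and abs(z_stat(conv_a, i, conv_b, i)) > 1.96:
            hit_on_peek = True
    at_end = abs(z_stat(conv_a, n_total, conv_b, n_total)) > 1.96
    return at_end, hit_on_peek

trials = 500
end_fp = peek_fp = 0
for _ in range(trials):
    at_end, on_peek = run(peek_every=100, n_total=2000)
    end_fp += at_end
    peek_fp += on_peek

print("fixed-horizon false positive rate:", end_fp / trials)       # ~5%
print("stop-at-first-peek false positive rate:", peek_fp / trials)  # noticeably higher
```

Since both variants are identical, every "significant" result is a false positive; checking twenty times per experiment gives twenty chances to cross the threshold by luck.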


Thanks, I'm glad you enjoyed the book! (Author here -- the website got its first publicity here on HN.)

Regarding AB testing, you might be interested in this recent research, which uses real data from Optimizely to estimate how often people get AB test false positives because they stopped as soon as they hit significance: https://ssrn.com/abstract=3204791

> Specifically, about 73% of experimenters stop the experiment just when a positive effect reaches 90% confidence. Also, approximately 75% of the effects are truly null. Improper optional stopping increases the false discovery rate (FDR) from 33% to 40% among experiments p-hacked at 90% confidence


While it may be possible to take the frequentist approach to AB testing, Bayesian inference is becoming the way to go with this.[1] Instead of directly setting up a yes-no hypothesis test with the nearly-impossible-to-use-correctly p-values[2], Bayesian approaches aim to directly estimate whatever quantity you want. With the Bayesian approach, you get estimates for A, B, and A-B (or whatever combination you want, e.g. (A-B)/A). Each of those estimates is properly called a posterior probability distribution and describes the range of possible values. The end result is that instead of saying "A is better than B (p < 0.05)" you get a probability distribution of A-B. From that probability distribution you can answer any question you want: the most likely difference between A and B (the average), the probability that A is better than B (just integrate the area above 0), or whatever is needed to make a decision.

[1] https://conversionxl.com/blog/bayesian-frequentist-ab-testin...

[2] https://www.amstat.org/asa/files/pdfs/p-valuestatement.pdf
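A minimal sketch of this approach, with made-up conversion data: Beta posteriors for each variant's rate (uniform Beta(1, 1) priors) and Monte Carlo draws to estimate P(A > B) and the lift distribution.

```python
import random

random.seed(1)

conv_a, n_a = 120, 1000   # hypothetical conversions / visitors for A
conv_b, n_b = 100, 1000   # ...and for B

draws = 20000
a_wins = 0
diffs = []
for _ in range(draws):
    # Posterior for a Bernoulli rate with a Beta(1, 1) prior is
    # Beta(1 + conversions, 1 + non-conversions).
    ra = random.betavariate(1 + conv_a, 1 + n_a - conv_a)
    rb = random.betavariate(1 + conv_b, 1 + n_b - conv_b)
    diffs.append(ra - rb)
    a_wins += ra > rb

print("P(A better than B) ~", a_wins / draws)
print("mean lift A - B ~", sum(diffs) / draws)
```

The `diffs` list is exactly the posterior distribution of A-B described above; any decision quantity can be read off it directly.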


> Monitoring tests on an ongoing basis and then calling them as soon as they hit some confidence threshold (like 95%) will give you biased results ...

The Bayesian method doesn't really solve this so much as it answers a fundamentally different question -- modeling how your personal belief changes. These posteriors typically will not have an interpretation as a normalized long-term frequency. As long as the Bayesian posteriors are not interpreted as frequentist probabilities, they are perfectly acceptable.

That said, this 'peeking' problem can be easily resolved in the frequentist setting, and this is well known in the stats, probability, and hopefully ML literature. The core results are really old; in fact, this was classified information during World War II. If you are interested, search for sequential hypothesis tests. They are actually more efficient than their batch cousins.

Bayesian vs Frequentist is an orthogonal axis from sequential/online vs batch. Think of a 2 X 2 box, you can choose to be in any quadrant you want.
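For the curious, here is a sketch of the classic sequential test alluded to above, Wald's sequential probability ratio test (SPRT) for a Bernoulli parameter. All numbers are illustrative: H0: p = 0.5 vs H1: p = 0.6, with error targets alpha = beta = 0.05.

```python
import math
import random

random.seed(2)
p0, p1 = 0.5, 0.6
alpha = beta = 0.05
upper = math.log((1 - beta) / alpha)   # cross this: accept H1
lower = math.log(beta / (1 - alpha))   # cross this: accept H0

def sprt(sample):
    """Feed observations one at a time; stop as soon as a boundary is crossed."""
    llr = 0.0  # running log-likelihood ratio
    for n, x in enumerate(sample, 1):
        llr += math.log((p1 if x else 1 - p1) / (p0 if x else 1 - p0))
        if llr >= upper:
            return "accept H1", n
        if llr <= lower:
            return "accept H0", n
    return "undecided", len(sample)

# Data actually generated with p = 0.6: the test usually stops early,
# well before the full sample is used.
data = [random.random() < 0.6 for _ in range(5000)]
decision, n_used = sprt(data)
print(decision, "after", n_used, "observations")
```

Unlike naive peeking, the boundaries are chosen up front so the sequential stopping rule itself respects the stated error rates.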


Hi, I didn't see anything on the page about "intended audience" - would you say this is appropriate for someone who has done the basic classes of statistics in uni but is now pretty rusty or would you need a more solid foundation to be able to grasp the content fully?



Those interested in How to Lie with Statistics may also enjoy Huff's other book, How to Lie with Smoking Statistics, commissioned by the tobacco industry in the 60s to fight the growing evidence that smoking causes cancer. It was never published, but I compiled the surviving manuscript and wrote about it: https://www.refsmmat.com/articles/smoking-statistics.html


I don't remember the attribution, but one of my favorite quotes goes something like: "There are lies, damned lies, and statistics." I think I first remember seeing it in the foreword of a chapter in the book "Against the Gods: the Remarkable Story of Risk" by Peter Bernstein


It seems that no one knows exactly where it's from. It was in use in the 1890s.

https://en.wikipedia.org/wiki/Lies,_damned_lies,_and_statist...


Love the Huff read. +1 to that.


I'm fairly certain that most of my opening salvos in evaluating the quality of someone's charts come to me from people regurgitating Huff's book.

I'd put it in a class with Fred Brooks; dead obvious, but somehow needing to be constantly re-explained to a new group of people (or the same people at a later date).


>dead obvious, but somehow needing to be constantly re-explained

Because it's not all 'dead obvious', maybe?


"Science is made up of so many things that appear obvious after they are explained." ~ Pardot Kynes, Dune


There's a great book about this, Everything is Obvious (Once You Know The Answer), about how many scientific studies come up with obvious-seeming findings, but if they'd found the opposite, that would have seemed obvious to us too.


Highly recommend this book. I would consider myself experienced with many statistical methods but this book was still chock full of brilliant examples that let me look at things with fresh eyes. It was helpful also in giving me language to explain technical concepts to less technical folks.


yeah. the "base rate fallacy" example sheds some light on the pros/cons of mammogram results. one interesting thing about that section of the book is that it says medical doctors fall prey to the fallacy more often than not.

i always kind of wonder when i see a medical doctor explaining statistics to a general audience: do they really have this right?


I also found this passage about mammographers from "Moonwalking with Einstein" interesting:

For most mammographers, practicing medicine is not deliberate practice, according to Ericsson. It’s more like putting into a tin cup than working with a coach. That’s because mammographers usually only find out if they missed a tumor months or years later, if at all, at which point they’ve probably forgotten the details of the case and can no longer learn from their successes and mistakes.

One field of medicine in which this is definitively not the case is surgery. Unlike mammographers, surgeons tend to get better with time. What makes surgeons different from mammographers, according to Ericsson, is that the outcome of most surgeries is usually immediately apparent—the patient either gets better or doesn’t—which means that surgeons are constantly receiving feedback on their performance.


Many of us in the software community know that if the feedback loop isn't fast enough, it doesn't work.

I wonder if mammographers would get more accurate if you made them spend a day a week analyzing mammograms from 5 years ago and then showed them the outcomes.


Well, that’s pretty much like training a neural network, except that the integration of the matrix coefficients happens not from the immediate output data, but from data which was output by the network millions of epochs ago.


Giving them the information could certainly help, and it is a good first step.

Mammographers would still be missing the motivation surgeons get when they see a patient die hours or days later. The emotional kick in the pants is harder to provide.


> i always kind of wonder when i see a medical doctor explaining statistics to a general audience: do they really have this right?

My wife is in her last year of medical school. I wonder the same question frequently when talking with her peers and seniors. They certainly get taught the basics of interpreting statistics, but I don't really see the "average" doc discussing things in a critical manner.

The system seems deeply focused on the results without critically thinking about the context and appropriateness of the studies.


I can answer that clearly for you. 99+% of docs don't have the slightest clue about how statistics are done, or even about what they mean (am MD with special interest in stats).


Along similar lines, Dr Ben Goldacre's Bad Science is an excellent beginner's introduction to the scientific method. If you understand how quacks bamboozle unwitting journalists, you gain a key insight into what good science looks like.

https://www.amazon.com/Bad-Science-Quacks-Pharma-Flacks/dp/0...


I don't think I've ever been in a developed field as error-prone as statistics, both with intentional errors (fraud, p-value hacking, etc.) and honest mistakes.


Statistics might be seen as the process of rooting out errors and falsehoods. To come into contact with them is the goal (to eliminate them). :)


Indeed. It's the good statisticians who use good statistical methodology and theory to identify errors and falsehoods.

Honest mistakes occur because what looks like a simple problem cannot always be analysed very easily (or in an obvious way).

Abuse occurs because it's quite easy to fool others that an analysis is sound - most people aren't sophisticated enough to identify problems, even if they are given the data!


finance?


For practical statistic done well, I recommend "Sex by Numbers" by David Spiegelhalter (https://www.amazon.com/Sex-Numbers-Wellcome-David-Spiegelhal...).

I am reading Sex by Numbers, which I enjoy a lot. It's a touchy subject with a lot of data of varying quality. A naive approach would be to take all of it; a dogmatic one, to set an arbitrary (and subjective!) threshold separating "good" from "bad" data. I love the way it is done there, i.e. by grading sources:

4: numbers that we can believe (e.g. births and deaths)

3: numbers that are reasonably accurate (e.g. well-designed & conducted surveys, e.g. the Natsal report)

2: numbers that could be out by quite a long way (e.g. non-uniform sampling, the Kinsey report)

1: numbers that are unreliable (e.g. surveys from newspapers, even with huge sample sizes)

0: numbers that have just been made up (e.g. "men think of sex every 7 seconds")

Just have a peek at the first chapter, which is freely accessible, and is exactly about data reporting, data reliability, and dealing with subjective questions. A lot of thought is given to the possible biases (e.g. people who are less likely to respond, or who would like to downplay or exaggerate some things) and the consistency of measurements.

So, in short: I started reading to get curious facts about sex and ended up recommending it to my data science students and mentees. (The vast majority of problems start with how you collect data, how you interpret it, and how well you are aware of its shortcomings.)



I thought this was going to a be a collection of statistics people really did wrong.


I really loved xkcd 2059.


You make it sound like some type of ISO standard. :P


Might as well link directly

https://xkcd.com/2059/

And mobile

https://m.xkcd.com/2059/



