Statistical Mistakes and How to Avoid Them 292 points by ingve on Nov 23, 2016 | 76 comments

 This is the insight that made statistics "click" for me many years ago: a statistical test answers one central question: what are the odds that the results you observed could have arisen by chance? If those odds are low, then you are justified in concluding that the results probably did not arise by chance, and so there must be some other explanation (usually, but not always, the causal hypothesis you are advancing). One consequence of this is that it is crucial that you advance your hypothesis before you collect (or at least look at) the data, because the odds of something arising by chance change depending on whether you predict or postdict the results. Also, the more data you have, the more likely you are to find something in there that looks like a signal but is in fact just a coincidence. Many a day-trading fortune has been lost to this one mistake.
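 To put a rough number on the "looks like a signal but is just a coincidence" point, here is a minimal sketch. It assumes every pattern you check is an independent test at the same significance level and that no real effect exists anywhere; the function name `chance_of_spurious_hit` is made up for illustration.

```python
def chance_of_spurious_hit(num_patterns: int, alpha: float = 0.05) -> float:
    """Probability that at least one of `num_patterns` independent tests
    looks "significant" at level `alpha` when nothing real is going on."""
    return 1.0 - (1.0 - alpha) ** num_patterns


if __name__ == "__main__":
    for k in (1, 10, 50, 100):
        rate = chance_of_spurious_hit(k)
        print(f"{k:>3} patterns checked -> {rate:.1%} chance of at least one false signal")
```

 Real searches through data are rarely independent tests, so treat this as an illustration of the trend rather than an exact rate.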
 Unfortunately no. Very much no, even though it's widely believed that that is a good definition/intuition (and used in many places). It's the odds of having those results due to chance, if the null hypothesis is true[0]. That latter part might sound pedantic, but the whole point is that we don't know how likely the null hypothesis is. If I test whether the sun has just died[1] and get a p-value of 0.01, it's still very likely that this result is due to chance (surely more than 1%)! We need a prior probability (i.e. Bayesian statistics) to calculate the probability that the result was due to chance; that is why that partial definition is incomplete and actually very misleading. This point is subtle, but very important to really understand p-values. Another way to look at it is: if we knew the probability that the result was due to chance, we could also just take 1-p and have the probability of there actually being some effect, a probability that hypothesis testing cannot give us. There is one nice property that hypothesis testing does have (and presumably why it's so widely used): if the idea you are testing is wrong (which actually means "null hypothesis true") you will most likely (with probability 1 minus your significance threshold) not find any positive results. This is good: it means that if the sun in fact did not die, and you use 0.01 as your threshold, 99% of the experiments will conclude that there is no reason to believe the sun has died. So hypothesis testing does limit the number of false positive findings.
The xkcd comic is a bit misleading in this regard. Yes, it does highlight the limitations of frequentist hypothesis testing, but the scenario depicted is a very unlikely one; in 99% of the cases there would have been a boring and reasonable "No, the sun hasn't died". For an incredibly interesting article about the difficulty of concluding anything definitive from scientific results I highly recommend "The Control Group is out of Control" at slatestarcodex[2]. [0] To be even more pedantic you would have to add "equal or more extreme", and "under a given model", but "if the null hypothesis is true" is by far the most important piece often missing.
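 The "limits the number of false positives" property can be checked with a small simulation. This is a sketch under stated assumptions: both groups are drawn from the same standard normal distribution (so the null hypothesis is true by construction), the standard deviation is treated as known, and a two-sided z-test is used. The function name and all parameters are illustrative.

```python
import random
from statistics import NormalDist, fmean


def false_positive_rate(experiments=2000, n=30, alpha=0.05, seed=0):
    """Fraction of experiments that reject a true null hypothesis."""
    rng = random.Random(seed)
    critical = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided cutoff
    rejections = 0
    for _ in range(experiments):
        a = [rng.gauss(0, 1) for _ in range(n)]
        b = [rng.gauss(0, 1) for _ in range(n)]
        # Known sigma = 1 in both groups, so Var(mean_a - mean_b) = 2 / n.
        z = (fmean(a) - fmean(b)) / (2 / n) ** 0.5
        if abs(z) > critical:
            rejections += 1
    return rejections / experiments


if __name__ == "__main__":
    print(f"rejected a true null in {false_positive_rate():.1%} of experiments")
```

 The fraction should land near alpha. Note that this says nothing about how often a rejection corresponds to a real effect; that depends on how often the null is true in the first place.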
 > It's the odds of having those results due to chance, if the null hypothesis is true[0].
 Yes, that's right. I don't know why you think this is at odds with what I said. In fact, I clarified this myself a few hours ago in a sibling comment:
 "what are the odds that the results you observed could have arisen by chance?"
 If you say it like this it will very easily be misinterpreted. Once your results are in there are two cases: (1) either the null hypothesis is true and you got those results due to chance, or (2) the null hypothesis is false and there was some actual effect outside of the null hypothesis that helped you get the results. Due to this it is very easy to interpret your statement as referring to the probability of (1). The two following definitions of p-values sound similar but are not:
 [Correct] The probability of getting the results by chance if the null hypothesis is true: P(Results|H0)
 [Wrong] The probability that you got the results by chance and thus the null hypothesis was actually true: P(H0|Results)
 I'm not saying you didn't get it, but somebody reading what you wrote can very easily be fooled. And there are a lot of dead wrong definitions on the web[0][1][2][3].
 The OP is referring to Fisher's p-value rather than the more common Neyman-Pearson method that you refer to. Fisher's method doesn't have the concept of the null hypothesis. The difference is fascinating and I do believe the Fisherian method is superior if you can't easily replicate.
 The definition of p-value is the same independent of method. As far as I can tell, the only real difference is that by Neyman–Pearson you just look at whether the p-value is below a threshold, and Fisher looks at the p-value as "strength of evidence", valuable in itself. It's still not the probability that your result was due to chance; it's the probability that under the null hypothesis (and you will definitely need one) you would get that value (or more extreme) by chance.
 > a statistical test answers one central question: what are the odds that the results you observed could have arisen by chance
 Well, no, that'd be very interesting but unfortunately what a statistical test really says is the probability of the results you observed (or more extreme) given chance. P(data|model) and not P(model|data).
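 The gap between P(data|model) and P(model|data) can be made concrete with Bayes' rule. A minimal sketch, using the xkcd 1132 setup in which the detector lies when two dice both come up six (probability 1/36); the prior of one in a billion for "the sun just exploded" is an assumption chosen purely for illustration, as are the function and parameter names.

```python
def posterior_effect(prior_effect, p_data_given_null, p_data_given_effect):
    """P(effect | data) from Bayes' rule, with two competing hypotheses."""
    evidence = (p_data_given_effect * prior_effect
                + p_data_given_null * (1.0 - prior_effect))
    return p_data_given_effect * prior_effect / evidence


if __name__ == "__main__":
    p_yes_given_no_explosion = 1 / 36   # detector lies: this is P(data | H0)
    p_yes_given_explosion = 35 / 36     # detector tells the truth
    assumed_prior = 1e-9                # illustrative assumption, not a measured value

    post = posterior_effect(assumed_prior, p_yes_given_no_explosion, p_yes_given_explosion)
    print(f"P(data | H0)      = {p_yes_given_no_explosion:.4f}")
    print(f"P(exploded | data) = {post:.2e}")
    print(f"P(H0 | data)       = {1 - post:.10f}")
```

 With that prior, P(data|H0) is below the usual 0.05 threshold while P(H0|data) stays very close to 1, which is the distinction being drawn here.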
 > what are the odds that the results you observed could have arisen by chance?
 > the probability of the results you observed (or more extreme) given chance
 I think he said the same thing as you. He didn't talk about "the odds that the results you observed have arisen by chance" but the odds that the results you have observed (the "or more extreme" is implicit) could have arisen by chance (i.e. the probability of happening by chance alone, given that only chance was involved).
 lisper on Nov 24, 2016 [–] It's actually P(data|null-hypothesis).
 How about you guys are both right, sort of. There are both Bayesian and Frequentist approaches in statistics! They represent very different approaches to statistics, but they are also quite similar. My apologies, I couldn't find one link that gave a good description of Bayesian vs. Frequentist. Here are a couple of links to get started: https://xkcd.com/1132/ http://jakevdp.github.io/blog/2014/03/11/frequentism-and-bay... If anyone comes across a link that describes Bayesian and Frequentist clearly, please share it if you don't mind.
 There are some cases when the difference between bayesian and frequentist statistics is just a matter of interpretation, but not in this case: the mathematics for a bayesian statistical test are different from the mathematics of a frequentist statistical test and they will produce different results, because the former accounts for prior probabilities and the latter does not. The article talks about frequentist tests, as does almost everyone else who mentions statistical tests without further specification.
 aristus on Nov 24, 2016 [–] Frequentists count their chickens after they've hatched. Bayesians bet the chickens that they will.
 lisper on Nov 24, 2016 [–] That cartoon is actually a pretty good illustration of why frequentists are wrong.
 Do you mean probably wrong? Either way I disagree. Both approaches have pros and cons depending on the type of analysis. At the very least, just the presence of competing approaches in the field has pushed statisticians to have more rigour and do way more double checking than they might have, out of fear of the other side actively looking to poke holes. It's easy to lie with statistics and the only people who can call statisticians out on their BS are other statisticians... Need more of this.
 > Do you mean probably wrong?
 Nope.
 > Both approaches have pros and cons
 What are the pros of the frequentist approach?
 In other words: the advantage of frequentism is that it gives you better plausible deniability when you get the wrong answer. Did I get that right?
 No, and I don't know where you're getting your certainty that the answer is wrong from. How about this: find me a published scientific paper with only Bayesian results in it, no frequentist statistics at all. Argue for its superiority all you want, but I don't think anyone does it. It would be seen as a stunt.
 Don't know about a paper, but here's a book:
 jcahill on Nov 25, 2016 [–]
 > find me a published scientific paper with only Bayesian results in it
 Trivial¹, surely?
 > [B]ut I don't think anyone does it. It would be seen as a stunt.
 What did you mean by "scientific"? You'd have to hunt for a narrow reading s.t. the above holds; extant counterexamples aren't limited to any one branch. Check it out² for yourself.
 Further: cogsci's Bayesian adoption is rapidly accelerating. The replication crisis is brutalizing huge swathes of psych:
 • loss of confidence in NHST is becoming near-total for many
 • journals are purging in turn, e.g. BASP's p-value ban³⁴
 • others are overhauling stats-in-psych entirely
 J Math Psych alone has two recent special issues⁵⁶ on this.
 I mean:
 > no frequentist stats at all
 isn't even a strawman lately, let alone an absurdity. In some fields, it's a battle-cry.
 All that being said… re:
 > Argue for its superiority all you want
 I wouldn't even go that far. Don't give em that. Probability interpretation fundies
 • are all wrong,
 • narrow minds and waste lifespans with the cultism, and
 • should at least learn of the other interpretations.
 Pitching probability as a 1v1 isn't merely wrong-prime⁷; it's doublepluswrong″⁸. It's pseudofundamentalism: fundies uphold foundations. Flamewars predicated on ignorance of the same gotta go, no matter how fashionable they may be.
 ____________________________________________________________
 [3] http://doi.org/4z8 | BASP 37
 [4] http://doi.org/34p | Nature re: BASP
 [5] http://doi.org/btqx | J Math Psych 72
 [6] http://doi.org/btqz | J Math Psych 74
 The main advantage of the frequentist approach is that you can do the calculations much more easily. Bayesian statistics is great, but often the calculations are much more difficult because of your distribution of priors. You can make up simplified priors to ease the calculations, but then you run into some of the same problems as frequentist statistics. Here is a simple example: let's say you flip a coin 10 times and get 8 heads; what is the probability that the coin is not fair? In frequentist statistics you only have to calculate the likelihood of the results for a null hypothesis, and then you use a p-value. While this approach is flawed, at least you can quickly do the calculation and get an approximate answer. In Bayesian statistics you have to specify the prior distribution, and calculate the likelihood of your results under every possible hypothesis. Correctly specifying this prior distribution and calculating the results is quite challenging, especially if you want to use a realistic prior (not just uniform). This is a pretty simple example; you can imagine how much more challenging this becomes in real-world problems. On the other hand, it is true that the frequentist approach doesn't really answer the question asked, so it is misleading (especially if you choose a p-value threshold that isn't specific to the problem). If you choose p-value thresholds based on prior knowledge, then the differences between frequentist and Bayesian are less extreme.
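 The coin example (8 heads in 10 flips) is small enough to work through both ways. This is a sketch, not a recommendation of either method: the frequentist side is an exact two-sided binomial test against a fair coin, and the Bayesian side assumes the simplest possible prior (uniform over the heads probability) evaluated on a grid. The function names are made up for illustration.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k heads in n flips with heads probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def two_sided_p_value(heads, flips):
    """P(a result at least as far from flips/2 as observed | fair coin)."""
    distance = abs(heads - flips / 2)
    return sum(binom_pmf(k, flips, 0.5)
               for k in range(flips + 1)
               if abs(k - flips / 2) >= distance)


def posterior_prob_heads_biased(heads, flips, grid=10_001):
    """P(theta > 0.5 | data) under a uniform prior, approximated on a grid."""
    thetas = [i / (grid - 1) for i in range(grid)]
    # Flat prior, so the posterior weight is proportional to the likelihood.
    weights = [t**heads * (1 - t) ** (flips - heads) for t in thetas]
    total = sum(weights)
    return sum(w for t, w in zip(thetas, weights) if t > 0.5) / total


if __name__ == "__main__":
    print("two-sided p-value:", two_sided_p_value(8, 10))
    print("P(theta > 0.5 | data), uniform prior:", posterior_prob_heads_biased(8, 10))
```

 The two numbers answer different questions: the p-value is about the data given a fair coin, while the posterior is about the coin given the data and the assumed prior. A more realistic prior (most coins are very close to fair) would change the second number considerably, which is the difficulty described above.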
 A man was walking down a city street when he saw another man wandering around a lamp post looking at the sidewalk. "What are you doing?" the first man asked. "Looking for my keys," said the second man. "Oh, did you lose them around here?" asked the first man. "No," the second man replied, "but the light is better here."
 Often times getting close to an answer that is approximately correct is better than trying to find the perfect solution. Most real world problems, especially in analyzing data, don't have perfect answers. For example approximations are made all the time in physics, because without these approximations the calculations can't be done. Knowing when to make approximations and what approximations to make is an essential skill for analyzing data.
 ska on Nov 24, 2016 [–]
 > Bayesian statistics is great, but often the calculations are much more difficult because of your distribution of priors.
 This is often a feature, forcing you to actually look at the complexity head on before you sweep it under a rug.
 Simple maths that can be done without a computer. Useful in the past - not so useful today.
 jcahill on Nov 24, 2016 [–] All models are wrong. That much is entailed by 'models'. There is no Zuul.
 The model is the null hypothesis here... I don't think the parent comment was wrong w.r.t. this.
 No, there's a difference. The model is something like, "This drug attaches itself selectively to cancer cells and kills them." The null hypothesis is, "This drug has no effect." So you conduct a double-blind study, measure the effect of the drug on cancer cells, collect some data and compute that P(data|null-hypothesis) is 1%. It is not the case that there is a 99% chance that your model is correct and that the drug does in fact attach itself to cancer cells and kill them. Because there are other possible models, e.g.: the drug redirects your Chi and channels it into your chakras, which kills the cancer cells. Statistics alone cannot tell you which of those two models is correct.
 >"Statistics alone cannot tell you which of those two models is correct."
 Statistics can tell you whether a model is consistent with the data. But you need to deduce the null hypothesis from your model rather than use the default "no difference" (of course, sometimes no difference is deduced from a real model, but not often; in that case: great!). In fact, that is the proper use of statistics. I would guess >99.99% of current usage is incorrect (i.e. pseudoscience) and amounts to a waste of time at best. The usual usage turns scientific reasoning on its head, and has led to a (literally for most people) unbelievable amount of trouble. This was pointed out most aptly by Paul Meehl long, long ago: http://www.fisme.science.uu.nl/staff/christianb/downloads/me...
 > Statistics can tell you whether a model is consistent with the data.
 Yes, that's true, but it badly misses the point. The power of statistics is to tell you when a model (the null hypothesis) is (most likely) inconsistent with the data so that you can confidently rule it out. Any finite data set is consistent with an infinite number of models, so knowing that a model and the data are consistent tells you absolutely nothing about whether or not that model has any relationship with reality (which, at the risk of stating the obvious, is what science actually cares about). This is the reason that rejecting the null hypothesis is considered a positive result.
 Except such a data set is also inconsistent with an infinite number of models, so ruling one out via rejecting a null also provides practically no information value and moves us no closer to understanding. /devil's advocate
 In practical terms, we're not interested in true models, but useful ones, so the description of a model's consistency with observed data is often a more useful metric in practice than rejecting nulls :/ especially in applications where you can't set up repeated experiments. OK, I realise it's more nuanced than that too, but given how many papers and practitioners seem incapable of understanding that evidence against the null is not explicit evidence for an arbitrary alternative, practically and consequentially I don't think that's how we should be working...
 > Except such a data set is also inconsistent with an infinite number of models, so ruling one out via rejecting a null also provides practically no information value and moves us no closer to understanding.
 No, that's not true, because experiments are not done in a (figurative) vacuum. They are done in the context of an explanatory theory that has already gone through a vigorous filter and been shown to be consistent with all prior experimental data and to have better explanatory power than all of its competitors. It is only when more than one theory survives this filter that an experiment is done, and the experiment is designed specifically to distinguish between the surviving theories. So while it is true that an experiment allows you to eliminate an infinite number of theories, it's irrelevant, because by the time the experiment is done nearly all of those theories have already been eliminated anyway.
 nonbel on Nov 24, 2016 [–]
 >"The power of statistics is to tell you when a model (the null hypothesis) is (most likely) inconsistent with the data so that you can confidently rule it out"
 If you know whether your model is consistent with the data, you know whether it is inconsistent... I think you are talking about some other issue than I am. The point about deducing the null hypothesis from your explanatory model is that the null hypothesis is precise. In that case you will get a strong test of the model, and it will get stronger as more data gets collected. Using a default null and a vague alternative is the exact opposite (check the Meehl 1967 paper I linked earlier in this thread).
 > If you know whether your model is consistent with the data, you know whether it is inconsistent...
 No, that's not true. And in fact I got it wrong earlier when I agreed with you that statistics can tell you when data is consistent with a model. They can't. At best they can tell you whether the data are not inconsistent. That sounds like the same thing, but it isn't. It's like the distinction between "not guilty" and "actually innocent." At best (or at worst, depending on how you look at it) a statistical test can tell you, "This theory cannot be confidently ruled out on the basis of that data under the following assumptions..."
 > I think you are talking about some other issue than I am.
 That is quite possible.
 [UPDATE:] BTW, I read the Meehl paper, and I completely agree with what he says. So you and I may be in "violent agreement" here.
 fluxion on Nov 24, 2016 [–] I think this is just an argument over semantics, the "null hypothesis" is a perfectly valid model using the definition of the statistics community [0] (i.e. a collection of probability distributions over some sample space).
 The word "model" is being used here in two mutually incompatible ways. Yes, the null hypothesis is a model, but it is not an explanatory model. A scientific hypothesis has to meet two tests to be considered a valid theory. It has to be consistent with the data, and it has to have explanatory power. The theory that cancer drugs work by aligning a patient's chi with their chakras is rejected not because it is inconsistent with the data (it's not) but because it lacks the explanatory power of alternative theories based on molecular biology. The null hypothesis never has explanatory power. The null hypothesis is always a statement of the form, "The explanatory hypothesis under test is wrong for some unknown reason." This is why rejecting the null hypothesis, i.e. showing that the data are (with high probability) inconsistent with the null hypothesis, is considered a positive result.
 stdbrouw on Nov 24, 2016 [–] Yup, and the null hypothesis is a reduced model. Different words, same thing.
 I have no idea what you mean by "reduced" but they are not the same thing. The difference is in their explanatory power. A (scientific) hypothesis explains things. A null hypothesis simply says, "This hypothesis is wrong" but doesn't say why. It's a crucial distinction.
 I'm not a statistician but even so I think this article makes assumptions that may not hold up for computer science. The first thing to do is plot your data. If it doesn't look like a bell curve, it's unlikely that common statistical calculations (which assume something close to Gaussian) apply here. If you're doing benchmarking, another common model is a peak at a minimum value (when everything goes right) and a long tail, due to various events like cache misses that always slow things down, but don't happen in every test run. On a system with multiple programs running (a typical desktop), taking the mean is meaningless: this just adds noise due to activity unrelated to your program. You'd be better off taking the minimum, which with enough test runs should capture all the events that happen every time and none of the events that don't. The median or 95th percentile might also be useful if you're investigating events that don't happen every time. But if you want to know about cold start performance (for example), maybe the best thing to do would be to flush your caches before every test run, so the events you're interested in are events that happen every time.
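 A small sketch of the "peak at the minimum plus a long tail" picture, reporting the summaries mentioned above side by side. The timings are hypothetical numbers invented for illustration, the helper name `summarize` is made up, and the 95th percentile uses the simple nearest-rank definition.

```python
import statistics


def summarize(timings):
    """Return min, median, nearest-rank 95th percentile, and mean."""
    ordered = sorted(timings)
    n = len(ordered)
    p95_rank = (95 * n + 99) // 100          # ceil(0.95 * n) in integer arithmetic
    return {
        "min": ordered[0],
        "median": statistics.median(ordered),
        "p95": ordered[p95_rank - 1],
        "mean": statistics.fmean(ordered),
    }


# Hypothetical benchmark timings in milliseconds: most runs hit the fast
# path, a few are slowed down by unrelated events (cache misses, other jobs).
timings_ms = [10.0] * 17 + [10.4, 14.0, 31.0]

if __name__ == "__main__":
    for name, value in summarize(timings_ms).items():
        print(f"{name:>6}: {value:.2f} ms")
```

 In this made-up data the minimum and median agree on the fast path while the mean is pulled upward by three slow runs, which is why plotting the raw timings first is worth the effort.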
 > If it doesn't look like a bell curve, it's unlikely that common statistical calculations (which assume something close to gaussian) apply here.
 The key word in there is common. There is an entire industry of statistical techniques that do not require a Gaussian assumption or, for that matter, any parametric assumption. I strongly feel it is time to retire the Gaussian distribution from the space it occupies. Discovering and studying the Gaussian distribution and the bog standard central limit theorem should be considered one of mankind's crowning achievements. They deserve to be put on a pedestal to appreciate their elegance, but when rubber meets the road one has to open one's mind to look beyond. Appearance of the Gaussian distribution is rarely as normal as many expect/claim it to be (I blame the stats education machinery for this), nor was it invented by Gauss. In fact Gauss used it as a post-hoc justification for backing the least-squares method. His original motivation for least squares was simplicity and convenience, not the normal distribution or CLT or, for that matter, the Gauss-Markov theorem.
 The Gaussian distribution is central in continuous-time models for different reasons: https://almostsure.wordpress.com/2010/04/13/levys-characteri... Basically any reasonable* stochastic continuous process is driven by a Brownian motion. Also: discontinuous processes are more or less* the sum of a Brownian motion and a Poisson-type process. https://en.wikipedia.org/wiki/L%C3%A9vy_process#L.C3.A9vy.E2... (* Many details about filtrations, Banach spaces, yadda yadda, omitted)
 Yes indeed, but the Brownian motion story is weaker than the CLT story. A lot more conditions are required: you have to look at it at the right scale, in the right way ... then stochastic processes look very much like a Brownian motion. Well, technically the bog standard CLT is a special case of this, hence has a simpler story.
 I am a huge fan of nonparametric statistics, but lots of non-Gaussian distributions are often 'good enough' for normality assumptions to be ok.
 "it’s telling you that there’s at most an alpha chance that the difference arose from random chance. In 95 out of 100 parallel universes, your paper found a difference that actually exists. I’d take that bet."
 This is wrong. It’s telling you that there’s at most an alpha chance that a difference like that (or more) would have arisen from random chance if the quantities are actually equal. And if the quantities are equal, 95 out of 100 parallel universes would not be able to reject the null hypothesis. Is he saying that he would take the xkcd bet[0] on the frequentist side?
 The t-test assumes a normal distribution, which is rarely true, especially when the number of runs is under 100. A better test is the Mann-Whitney U test, which is applicable to a wider category of distributions.
 I think the t-test is conceptually easier to understand, which is important since the target audience for this article is people who know next to nothing about statistics. The t-test might not be the best test for every situation, but if the alternative is no test at all, I'll take it.
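 For readers who have not met the Mann-Whitney U test, here is a minimal sketch of the idea: count, over all pairs, how often a value from one sample beats a value from the other, then compare that count with what equal distributions would give. The p-value below uses the large-sample normal approximation with no tie or continuity correction, which is a simplifying assumption; for real work a maintained implementation such as `scipy.stats.mannwhitneyu` is the safer choice.

```python
from statistics import NormalDist


def mann_whitney_u(x, y):
    """Return (U statistic for x, approximate two-sided p-value)."""
    # U counts the pairs where x beats y; ties count as half a win.
    u = sum((xi > yj) + 0.5 * (xi == yj) for xi in x for yj in y)
    n1, n2 = len(x), len(y)
    mean_u = n1 * n2 / 2
    sd_u = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    z = (u - mean_u) / sd_u
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return u, p_value


if __name__ == "__main__":
    # Hypothetical run times (ms) for two implementations.
    old = [12.1, 12.4, 12.2, 15.9, 12.3, 12.6, 30.2, 12.5]
    new = [11.2, 11.5, 11.1, 11.4, 19.8, 11.3, 11.6, 11.0]
    print(mann_whitney_u(old, new))
```

 Because the test only uses the ordering of the values, the two slow outliers in the made-up data have far less influence than they would on a comparison of means.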
 Central limit theorem means there are lots of cases where normal distributions are directly applicable.
 The central limit theorem applies to independent variables only. If you are not sure your variables are independent you cannot rely on that assumption.
 That's not 100% true. There are lots of different theorems that are "central limit theorems", and that work across different cases. You can have CLTs with non-iid variables (either they aren't identically distributed or aren't independent). The math just becomes much harder, and you have to assume specific dependence structures.
 Thanks for the reference - can you provide an example where one would use the Martingale CLT, if you are aware of any?
 Look for any estimation or inference problems in the context of stochastic processes. For a simpler example you can take a look at sequential test of hypothesis. It is quite ubiquitous, but not always called out by its name.
 platz on Nov 24, 2016 [–] Good to know!
 srean on Nov 24, 2016 [–] Someone ought to formulate the Godwin's law analogue for CLT.
 Hint 1: what happens when variance does not exist or is too large to be considered finite for any practical purposes.
 Hint 2: Levy processes, stable distributions
 Too cryptic to be helpful, sorry..
 Ah I see, my apologies, I assumed Googling those keywords would be sufficient to connect the dots. The main thing that I wanted to convey is that the consequences of the CLT do not come for free. It is not remotely as widely applicable as it is made out to be. The CLT is also not so narrow that you need IID random variables, as is often claimed. Those assumptions can be relaxed substantially. What gets in the way of obtaining a Gaussian distribution in the limit is the requirement that the original distribution(s) have a finite variance. The (averaging) process may still converge to some limiting distribution, but it would not be a Gaussian one. Levy measures, more precisely stable measures, are that class, and the Gaussian is but one member of that class, the only one that has finite variance. It is slowly getting acknowledged that many natural processes in fact do not have a finite variance. Fractals are one such process.
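 The finite-variance requirement can be seen in a quick simulation. This sketch compares how the spread of a sample mean behaves for a standard normal (finite variance) and a standard Cauchy (a stable distribution with no finite variance). Spread is measured by the interquartile range because the variance of the Cauchy means does not exist. The sample sizes, trial count, seed, and helper names are arbitrary choices for illustration.

```python
import math
import random
from statistics import fmean, quantiles

rng = random.Random(0)


def draw_normal():
    return rng.gauss(0, 1)


def draw_cauchy():
    # Inverse-CDF sampling of a standard Cauchy variate.
    return math.tan(math.pi * (rng.random() - 0.5))


def iqr_of_sample_means(draw, n, trials=500):
    """Interquartile range of the mean of n draws, over many repetitions."""
    means = [fmean(draw() for _ in range(n)) for _ in range(trials)]
    q1, _, q3 = quantiles(means, n=4)
    return q3 - q1


spread = {
    (name, n): iqr_of_sample_means(draw, n)
    for name, draw in (("normal", draw_normal), ("cauchy", draw_cauchy))
    for n in (10, 500)
}

if __name__ == "__main__":
    for (name, n), value in spread.items():
        print(f"{name:>6}, n={n:>3}: IQR of sample mean = {value:.3f}")
```

 For the normal draws the spread of the mean shrinks roughly like one over the square root of n; for the Cauchy draws averaging does not help, because the mean of n standard Cauchy variates is again standard Cauchy.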
 I wasn't aware of the idea that natural processes might not have a finite variance, or what implications that would have. Thanks for clarifying!
 I've seen variations of this comment in multiple threads, and have to ask: is there a good paper or textbook that spells this out?
 I would say go with a book on (i) heavy-tailed distributions and another on (ii) stable distributions. Communities that have quickly wised up to the deficiencies of the Gaussian assumption are (a) mathematical finance and (b) statistical analysis of network packets. Following that literature might also be useful.
 I don't like how the article tries to push statistics on the reader. If a CS paper compares a pair of averages, then that gives certain information. If statistics can add to that, and make the results a little more precise, then that is nice. But by no means is it absolutely necessary. And statistics will not give a conclusive result either. I think that authors should use statistics when they see fit, and when it does not distract too much from the original subject of the paper.
 Needless to say, I disagree. It can be straight-up misleading to report means without including a more nuanced view of the distribution. You don't need to use a bunch of fancy statistics, but you do need to consider whether your results could have arisen by random chance. That's not a distraction; it's accurately reporting what you found. Here's one frightening example of spurious performance results in CS: https://www.cis.upenn.edu/~cis501/papers/producing-wrong-dat...
 > It can be straight-up misleading
 It is only misleading if the reader doesn't understand statistics. There is, imho, nothing wrong with putting all your focus on the subject matter, and skipping the statistics while being frank about it. Also, if you need statistics to show that your method is better than other methods, then perhaps your method is not really that much better.
 If your analysis involves data, you can't skip the statistics. If there is no analysis of data then go ahead and skip the stats all you want. You need to try and determine the uncertainty in your results, both systematic and statistical.
 Any time you compare two averages, you are doing statistics, whether or not you report the result in statistical terms. It's not something optional that you can "add on" to provide extra information. If you don't provide some measure of significance, I'm not going to trust that your result has any chance of being real. At best, you don't really know whether the result is real because you ignored the statistics; at worst, you ran the statistics and know it's not real, and you're hoping I won't notice.
 IMO it's better to just plot the two distributions that you want to compare so that people can eyeball the difference (or run the statistical tests themselves should they wish to do so). For one thing not all distributions are Gaussian. And the t-test only answers one specific question (i.e., with one given p-value). Then there's people who misinterpret the result of the t-test. Or people who mess with the p-value till they get what they need.
 This thinking is part of the problem. While an individual average may just be "a fact", you cannot meaningfully compare two averages without knowing more than their values. Pretending you can has led to a lot of muddled thinking.
 Assuming distributions that don't correspond to reality can also lead to a lot of muddled thinking.
 That's true, but it's a related yet separate issue. For example: I compute the arithmetic mean of two data sets A, B, yielding means a, b. I can tell you what the difference |a-b| is without any other information, but I simply can't tell you if it is significant or not. This has nothing to do with distribution. Given the variances, I can tell you something about the significance (at least in some senses). But without knowing the distribution I can't tell you at all how to interpret the variance. The point is, as soon as you compute that mean, you are doing statistics. If you do it carefully, you will be able to define what the numbers mean, and what they do not. The fact that many people don't do it well does not change this. There is no possibility of improving the situation by ignoring how the numbers were arrived at and what that actually means. Sometimes the best thing to come out of it is that you are simply calculating the wrong thing for what you want to learn.
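 A minimal sketch of why |a-b| alone is not enough: the two pairs of data sets below have the same difference in means, but once the variances and sample sizes enter (here through Welch's t statistic) they tell very different stories. The numbers are invented for illustration, and turning the statistic into a p-value would additionally require a distributional assumption, which is the caveat raised above.

```python
from statistics import fmean, variance


def welch_t(a, b):
    """Difference in means divided by its estimated standard error."""
    standard_error = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (fmean(a) - fmean(b)) / standard_error


# Hypothetical measurements: both pairs differ in mean by about 1.0.
tight_a, tight_b = [10.0, 10.1, 9.9, 10.0], [11.0, 11.1, 10.9, 11.0]
noisy_a, noisy_b = [5.0, 15.0, 8.0, 12.0], [6.0, 16.0, 9.0, 13.0]

if __name__ == "__main__":
    for label, a, b in (("tight", tight_a, tight_b), ("noisy", noisy_a, noisy_b)):
        print(f"{label}: |a-b| = {abs(fmean(a) - fmean(b)):.2f}, "
              f"Welch t = {welch_t(a, b):.2f}")
```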
 imh on Nov 24, 2016 [–] You can do quite a bit without making strong distributional assumptions. There are nonparametric tests for all sorts of things (and the basic ones are covered in good first intro to stats books too).
 Eh. Let's see how this goes. Profession A has a mean salary 20% higher than that of Profession B. Yet people who are in profession A are much more likely to be in poverty than in profession B. Yet almost any time someone compares two means, they never seem to come to this conclusion - or even consider it a possibility. Comparing two means without other details is rarely illuminating, and often leads to wrong conclusions (which are worse than no conclusions with no data).
 The idea that you should 'plot the error bars' ahead of, well, looking at the data seems a bit premature. As many other comments have stated, looking at the data first is critical. It drives me up the wall: we have 1200dpi printers, retina displays, and so on, and yet somehow people feel the need to collapse everything they've done to these giant finger-painting quality bar charts. Statistical tests are well and good, but I'm amazed at the extent to which smart people will happily plug data which they have never actually seen into statistical metrics. So a mean might be derived from 9 reasonable results and a howlingly off factor-of-2 outlier, and you can dutifully plug this series into a bunch of standard tests and speak confidently about p-values.
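 The salary scenario is easy to construct. The figures below are entirely made up to show that it is possible, not drawn from any real data: profession A has a few very high earners and many low ones, profession B pays everyone the same, and the poverty line is an arbitrary assumed value.

```python
from statistics import fmean, median

POVERTY_LINE = 20_000  # assumed threshold, for illustration only

# Hypothetical annual salaries.
profession_a = [15_000] * 8 + [300_000] * 2
profession_b = [60_000] * 10


def share_below(salaries, threshold):
    """Fraction of people earning less than `threshold`."""
    return sum(s < threshold for s in salaries) / len(salaries)


if __name__ == "__main__":
    for name, salaries in (("A", profession_a), ("B", profession_b)):
        print(f"profession {name}: mean={fmean(salaries):,.0f} "
              f"median={median(salaries):,.0f} "
              f"below poverty line={share_below(salaries, POVERTY_LINE):.0%}")
```

 Reporting the median or the share below a threshold next to the mean is one cheap way to expose this kind of shape.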
 That's a good article but pretty short. There would be a lot more ground to cover.
 Is there a good resource to learn the underpinnings of p-values and t-tests? I feel like everybody says these are important, shows a formula, and then arguments ensue about what p=0.95 means, and nobody seems to know this.
 I think any intro stats book should do the trick. As far as I know, the material in a first stats course is pretty homogeneous. I'm not a biostatistician, but I happen to like this book [0] for introductory stuff. Amazon says you can get it used for $26.
 I took intro to stats at a business school and switched to computational linguistics and honestly, I have only been met with the "It is something that you do" in regards to p-values. I'll try and look at an introductory book again and see if it satisfies my curiosity.