One consequence of this is that it is crucial to advance your hypothesis before you collect (or at least look at) the data, because the odds of something arising by chance change depending on whether you predict or postdict the results. Also, the more data you have, the more likely you are to find something in there that looks like a signal but is in fact just a coincidence. Many a day-trading fortune has been lost to this one mistake.
It's the odds of getting those results due to chance, if the null hypothesis is true. That latter part might sound pedantic, but the whole point is that we don't know how likely the null hypothesis is. If I test whether the sun has just died and get a p-value of 0.01, it's still very likely that this result is due to chance (surely more than 1%)! We need a prior probability (i.e. Bayesian statistics) to calculate the probability that the result was due to chance, which is why that partial definition is incomplete and actually very misleading. This point is subtle, but very important for really understanding p-values.
Another way to look at it is: if we knew the probability that the result was due to chance, we could also just take 1-p and have the probability of there actually being some effect, a probability that hypothesis testing cannot give us.
There is one nice property that hypothesis testing does have (and presumably why it's so widely used): if the idea you are testing is wrong (which actually means "null hypothesis true") you will most likely (1-p) not find any positive results. This is good: it means that if the sun in fact did not die and you use 0.01 as your threshold, 99% of the experiments will conclude that there is no reason to believe the sun has died. So hypothesis testing does limit the number of false positive findings. The xkcd comic is a bit misleading in this regard; yes, it does highlight the limitations of frequentist hypothesis testing, but the scenario depicted is a very unlikely one. In 99% of the cases there would have been a boring and reasonable "No, the sun hasn't died".
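To see why the prior matters so much here, Bayes' rule on the xkcd detector can be sketched directly (the prior below is a made-up illustrative number; the 1/36 comes from the comic's two dice):

```python
# The xkcd detector lies only when both dice come up six (probability 1/36).
prior_sun_died = 1e-9          # made-up prior: the sun almost surely hasn't died
p_yes_given_died = 35 / 36     # detector truthfully answers YES
p_yes_given_alive = 1 / 36     # detector lies: rolled double sixes

p_yes = (p_yes_given_died * prior_sun_died
         + p_yes_given_alive * (1 - prior_sun_died))
posterior = p_yes_given_died * prior_sun_died / p_yes
print(f"P(sun died | detector says YES) = {posterior:.2e}")
```

Even though the detector's "p-value" is below 0.05, the posterior probability that the sun died stays vanishingly small.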
For an incredibly interesting article about the difficulty of concluding anything definitive from scientific results I highly recommend "The Control Group is out of Control" at slatestarcodex.
To be even more pedantic you would have to add "equal to or more extreme" and "under a given model", but "if the null hypothesis is true" is by far the most important piece often missing.
Yes, that's right. I don't know why you think this is at odds with what I said. In fact, I clarified this myself a few hours ago in a sibling comment:
If you say it like this it will very easily be misinterpreted. Once your results are in, there are two cases: (1) either the null hypothesis is true and you got those results due to chance, or (2) the null hypothesis is false and there was some actual effect outside of the null hypothesis that helped you get the results.
Due to this it is very easy to interpret your statement as referring to the probability of (1).
The two following definitions of p-values sound similar but are not:
[Correct] The probability of getting the results by chance if the null hypothesis is true P(Results|H0)
[Wrong] The probability that you got the results by chance and thus the null hypothesis was actually true P(H0|Results)
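The difference is easy to see in a simulation; the base rate and power below are invented for illustration:

```python
import random

random.seed(7)

N = 100_000      # simulated studies
base_rate = 0.1  # only 10% of tested hypotheses are real effects (invented)
power = 0.8      # P(significant | real effect) (invented)
alpha = 0.05     # P(significant | H0 true), i.e. the p-value threshold

sig_null = sig_real = 0
for _ in range(N):
    real_effect = random.random() < base_rate
    significant = random.random() < (power if real_effect else alpha)
    if significant:
        if real_effect:
            sig_real += 1
        else:
            sig_null += 1

# P(H0 | significant result) is far larger than alpha would suggest
fdr = sig_null / (sig_null + sig_real)
print(f"P(H0 true | significant) ≈ {fdr:.2f}")   # ≈ 0.36, not 0.05
```

Every significant result here had P(Results|H0) below 0.05, yet about a third of them come from a true null, because P(H0|Results) depends on the base rate that frequentist testing never sees.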
I'm not saying you didn't get it, but somebody reading what you wrote can very easily be fooled. And there are a lot of dead wrong definitions on the web.
Well, no, that'd be very interesting but unfortunately what a statistical test really says is the probability of the results you observed (or more extreme) given chance. P(data|model) and not P(model|data).
> the probability of the results you observed (or more extreme) given chance
I think he said the same thing as you. He didn't talk about "the odds that the results you observed have arisen by chance" but the odds that the results you have observed (the "or more extreme" is implicit) could have arisen by chance (i.e. the probability of happening by chance alone, given that only chance was involved).
There are both Bayesian and Frequentist approaches in statistics!
They represent very different approaches to statistics, but they are also quite similar. My apologies, I couldn't find one link that gave a good description of Bayesian vs. Frequentist. Here are a couple of links to get started:
If anyone comes across a link that describes Bayesian and Frequentist clearly, please share it if you don't mind.
Either way I disagree. Both approaches have pros and cons depending on the type of analysis.
At the very least, just the presence of competing approaches in the field has pushed statisticians to have more rigour and do way more double-checking than they might have, out of fear of the other side actively looking to poke holes. It's easy to lie with statistics, and the only people who can call statisticians out on their BS are other statisticians... We need more of this.
> Both approaches have pros and cons
What are the pros of the frequentist approach?
Now. You might read this as me saying that Bayesian statistics is all made up, and that's not what I'm saying.
I'm saying that if you go Bayesian all the way, and your result is even remotely controversial, someone could easily challenge you by saying "this result would have been different with different priors, and why won't they just come out and say p < 0.01? What are they hiding?"
When you do frequentist statistics, the equivalent of priors were, instead, made up for you by long scientific tradition. They're not very good priors, as the relevant xkcd illustrates. But at least it's not you making them up.
In a more complex model, there's also the thing where you can't calculate the result exactly, you have to approximate it with Markov Chain Monte Carlo or something, leading to another way to doubt your results (did the MCMC converge correctly?)
So instead you do frequentist statistics. You use your favorite stats package and it spits out a nice comforting p-value that will satisfy the reviewers. You have tons of guidance about how to do things. It's not a great thing, but it is a definite advantage of using frequentist statistics.
How about this: find me a published scientific paper with only Bayesian results in it, no frequentist statistics at all. Argue for its superiority all you want, but I don't think anyone does it. It would be seen as a stunt.
> [B]ut I don't think anyone does it. It would be seen as a stunt.
What did you mean by "scientific"?
You'd have to hunt for a narrow reading s.t. the above holds — extant counterexamples aren't limited to any one branch.
Check it out² for yourself.
Further: cogsci's bayesian adoption is rapidly accelerating.
The replication crisis is brutalizing huge swathes of psych:
• loss of confidence in NHST is becoming near-total for many
• journals are purging in turn — e.g. BASP's p-value ban³⁴
• others are overhauling stats-in-psych entirely
J Math Psych alone has two recent special issues⁵⁶ on this.
> no frequentist stats at all
isn't even a strawman lately, let alone an absurdity.
In some fields, it's a battle-cry.
All that being said… re:
> Argue for its superiority all you want
I wouldn't even go that far. Don't give em that.
Probability interpretation fundies
• are all wrong,
• narrow minds and waste lifespans with the cultism, and
• should at least learn of the other interpretations.
Pitching probability as a 1v1 isn't merely wrong-prime⁷ — it's doublepluswrong″⁸.
It's pseudofundamentalism: fundies uphold foundations.
Flamewars predicated on ignorance of the same gotta go, no matter how fashionable they may be.
 http://doi.org/4z8 | BASP 37
 http://doi.org/34p | Nature re: BASP
 http://doi.org/btqx | J Math Psych 72
 http://doi.org/btqz | J Math Psych 74
Here is a simple example: let's say you flip a coin 10 times and get 8 heads. What is the probability that the coin is not fair? In frequentist statistics you only have to calculate the likelihood of the results under a null hypothesis, and then you use a p-value. While this approach is flawed, at least you can quickly do the calculation and get an approximate answer. In Bayesian statistics you have to specify the prior distribution and calculate the likelihood of your results under every possible hypothesis. Correctly specifying this prior distribution and calculating the results is quite challenging, especially if you want to use a realistic prior (not just uniform). This is a pretty simple example; you can imagine how much more challenging this becomes in real-world problems. On the other hand, it is true that the frequentist approach doesn't really answer the question asked, so it is misleading (especially if you choose a p-value threshold that isn't specific to the problem). If you choose p-value thresholds based on prior knowledge, then the differences between frequentist and Bayesian are less extreme.
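A rough sketch of both calculations for this coin example, taking the "easy" uniform Beta(1,1) prior for the Bayesian side (the frequentist side is one-sided for simplicity):

```python
from math import comb, gamma

# Frequentist: p-value for seeing >= 8 heads in 10 flips of a fair coin
n, k = 10, 8
p_value = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
print(f"one-sided p-value: {p_value:.4f}")  # 56/1024 ≈ 0.0547

# Bayesian: with a uniform Beta(1,1) prior on the heads probability,
# the posterior after 8 heads / 2 tails is Beta(9, 3).
def beta_pdf(x, a, b):
    return gamma(a + b) / (gamma(a) * gamma(b)) * x**(a - 1) * (1 - x)**(b - 1)

# midpoint-rule integration of the posterior over (0.5, 1)
steps = 10_000
dx = 0.5 / steps
p_biased = sum(beta_pdf(0.5 + (i + 0.5) * dx, 9, 3) for i in range(steps)) * dx
print(f"P(heads probability > 0.5 | data) ≈ {p_biased:.3f}")  # ≈ 0.967
```

Note that only the Bayesian number directly answers "is the coin biased toward heads?", and it is only as good as the uniform prior that went into it.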
This is often a feature, forcing you to actually look at the complexity head on before you sweep it under a rug.
There is no Zuul.
Statistics can tell you whether a model is consistent with the data. But you need to deduce the null hypothesis from your model rather than use the default "no difference" (of course, sometimes no difference is deduced from a real model, but not often, in that case: great!).
In fact, that is the proper use of statistics. I would guess >99.99% of current usage is incorrect (i.e. pseudoscience) and amounts to a waste of time at best. The usual usage turns scientific reasoning on its head, and has led to a (literally, for most people) unbelievable amount of trouble.
This was pointed out most aptly by Paul Meehl long, long ago:
Yes, that's true, but it badly misses the point. The power of statistics is to tell you when a model (the null hypothesis) is (most likely) inconsistent with the data so that you can confidently rule it out. Any finite data set is consistent with an infinite number of models, so knowing that a model and the data are consistent tells you absolutely nothing about whether or not that model has any relationship with reality (which, at the risk of stating the obvious, is what science actually cares about). This is the reason that rejecting the null hypothesis is considered a positive result.
In practical terms, we're not interested in true models, but useful ones, so the description of a model's consistency with observed data is often the more useful metric in practice than rejecting nulls :/ especially in applications where you can't set up repeated experiments.
OK, I realise it's more nuanced than that too, but given how many papers and practitioners seem incapable of understanding that evidence against the null is not explicit evidence for an arbitrary alternative, practically and consequentially I don't think that's how we should be working...
No, that's not true, because experiments are not done in a (figurative) vacuum. They are done in the context of an explanatory theory that has already gone through a vigorous filter and been shown to be consistent with all prior experimental data and to have better explanatory power than all of its competitors. It is only when more than one theory survives this filter that an experiment is done, and the experiment is designed specifically to distinguish between the surviving theories.
So while it is true that an experiment allows you to eliminate an infinite number of theories, it's irrelevant, because by the time the experiment is done nearly all of those theories have already been eliminated anyway.
If you know whether your model is consistent with the data, you know whether it is inconsistent... I think you are talking about some other issue than I am.
The point about deducing the null hypothesis from your explanatory model is that the null hypothesis is precise. In that case you will get a strong test of the model, and it will get stronger as more data gets collected. Using a default null and a vague alternative is the exact opposite. (Check the Meehl 1967 paper I linked earlier in this thread.)
No, that's not true. And in fact I got it wrong earlier when I agreed with you that statistics can tell you when data is consistent with a model. They can't. At best they can tell you whether the data are not inconsistent. That sounds like the same thing, but it isn't. It's like the distinction between "not guilty" and "actually innocent." At best (or at worst depending on how you look at it) a statistical test can tell you, "This theory cannot be confidently ruled out on the basis of that data under the following assumptions..."
> I think you are talking about some other issue than I am.
That is quite possible.
[UPDATE:] BTW, I read the Meehl paper, and I completely agree with what he says. So you and I may be in "violent agreement" here.
The null hypothesis never has explanatory power. The null hypothesis is always a statement of the form, "The explanatory hypothesis under test is wrong for some unknown reason." This is why rejecting the null hypothesis, i.e. showing that the data are (with high probability) inconsistent with the null hypothesis, is considered a positive result.
If you're doing benchmarking, another common model is a peak at a minimum value (when everything goes right) and a long tail, due to various events like cache misses that always slow things down, but don't happen in every test run.
On a system with multiple programs running (a typical desktop), taking the mean is meaningless - this just adds noise due to activity unrelated to your program. You'd be better off taking the minimum, which with enough test runs should capture all the events that happen every time and none of the events that don't.
The median or 95th percentile might also be useful if you're investigating events that don't happen every time. But if you want to know about cold start performance (for example), maybe the best thing to do would be to flush your caches before every test run, so the events you're interested in are events that happen every time.
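This model is easy to play with in a simulation (the base cost and the noise process below are invented stand-ins):

```python
import random
import statistics

random.seed(42)
base_ms = 10.0  # invented "everything went right" runtime

# Each run pays the base cost; ~30% of runs also hit extra slowdown noise
# (a stand-in for cache misses, scheduling, etc.) that only ever adds time.
runs = [base_ms + (random.expovariate(1.0) if random.random() < 0.3 else 0.0)
        for _ in range(1000)]

print(f"mean:   {statistics.mean(runs):.2f} ms")  # inflated by the noise
print(f"median: {statistics.median(runs):.2f} ms")
print(f"min:    {min(runs):.2f} ms")              # recovers the base cost
p95 = sorted(runs)[int(0.95 * len(runs))]
print(f"p95:    {p95:.2f} ms")                    # captures the slow-path runs
```

With enough runs the minimum lands exactly on the base cost, while the mean drifts upward with whatever unrelated noise the machine produced that day.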
The key word in there is common. There is an entire industry of statistical techniques that do not require Gaussian assumption or for that matter any parametric assumption.
I strongly feel it is time to retire the Gaussian distribution from the space it occupies. Discovering and studying the Gaussian distribution and the bog-standard central limit theorem should be considered one of mankind's crowning achievements. They deserve to be put on a pedestal to appreciate their elegance, but when the rubber meets the road one has to open one's mind and look beyond. The appearance of the Gaussian distribution is rarely as normal as many expect/claim it to be (I blame the stats education machinery for this), nor was it invented by Gauss. In fact Gauss used it as a post-hoc justification for backing the least-squares method. His original motivation for least squares was simplicity and convenience, not the normal distribution or the CLT or, for that matter, the Gauss-Markov theorem.
Basically any reasonable* stochastic continuous process is driven by a brownian motion. Also: discontinuous processes are more or less* the sum of a brownian motion and a poisson-type process.
(* Much details about filtrations, Banach spaces yadda yadda omitted)
This is wrong. It's telling you that there's at most an alpha chance that a difference like that (or larger) would have arisen from random chance if the quantities are actually equal. And if the quantities are equal, 95 out of 100 parallel universes would not be able to reject the null hypothesis.
Is he saying that he would take the xkcd bet on the frequentist side?
The t-test might not be the best test for every situation, but if the alternative is no test at all, I'll take it.
You can have CLTs with non-iid variables (either they aren't identically distributed or they aren't independent). The math just becomes much harder, and you have to assume specific dependence structures.
Hint 1: what happens when variance does not exist or is too large to be considered finite for any practical purposes.
Hint 2: Levy processes, stable distributions
The main thing that I wanted to convey is that the consequences of the CLT do not come for free. It is not remotely as widely applicable as it is made out to be. The CLT is also not so narrow that you need IID random variables, as is often claimed; those assumptions can be relaxed substantially. What gets in the way of obtaining a Gaussian distribution in the limit is the requirement that the original distribution(s) have a finite variance. The (averaging) process may still converge to some limiting distribution, but it would not be a Gaussian one. Lévy measures, more precisely stable measures, are that class, and the Gaussian is but one member of it, the only one with finite variance. It is slowly getting acknowledged that many natural processes in fact do not have a finite variance. Fractals are one such example.
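A quick way to see averaging fail is the standard Cauchy distribution, the textbook stable law with no finite variance (a pure-Python sketch):

```python
import random
import statistics

random.seed(3)

def cauchy():
    # ratio of two independent standard normals is standard Cauchy
    return random.gauss(0, 1) / random.gauss(0, 1)

def iqr_of_sample_means(n, trials=500):
    # spread (interquartile range) of the mean of n Cauchy draws
    means = sorted(statistics.mean(cauchy() for _ in range(n))
                   for _ in range(trials))
    return means[int(0.75 * trials)] - means[int(0.25 * trials)]

# For finite variance the IQR would shrink like 1/sqrt(n); here it doesn't:
# the mean of n standard Cauchy draws is itself standard Cauchy (IQR = 2).
for n in (1, 100):
    print(f"n={n:>4}: IQR of sample means ≈ {iqr_of_sample_means(n):.2f}")
```

Both spreads hover around 2: collecting a hundred times more data buys you nothing, which is exactly what the loss of the CLT looks like in practice.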
I think that authors should use statistics when they see fit, and when it does not distract too much from the original subject of the paper.
Here's one frightening example of spurious performance results in CS: https://www.cis.upenn.edu/~cis501/papers/producing-wrong-dat...
It is only misleading if the reader doesn't understand statistics. There is, imho, nothing wrong with putting all your focus on the subject matter, and skipping the statistics while being frank about it.
Also, if you need statistics to show that your method is better than other methods, then perhaps your method is not really that much better.
Pretending you can has led to a lot of muddled thinking.
For example: I compute an arithmetic mean of two data sets A, B, yielding means a, b.
I can tell you what the difference |a-b| is without any other information, but I simply can't tell you if it is significant or not.
This has nothing to do with distribution. Given the variances, I can tell you something about the significance (at least in some senses). But without knowing the distribution I can't tell you at all how to interpret the variance.
The point is, as soon as you compute that mean, you are doing statistics. If you do it carefully, you will be able to define what the numbers mean, and what they do not. The fact that many people don't do it well does not change this.
There is no possibility of improving the situation by ignoring how the numbers were arrived at and what that actually means. Sometimes the best thing to come out of it is that you are simply calculating the wrong thing for what you want to learn.
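As a sketch, a hand-rolled equal-variance two-sample t statistic on invented data shows the same mean difference carrying wildly different weight depending on the spread:

```python
import math
import statistics

def t_stat(x, y):
    # classic pooled-variance two-sample t statistic
    nx, ny = len(x), len(y)
    pooled = ((nx - 1) * statistics.variance(x)
              + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    return (statistics.mean(x) - statistics.mean(y)) / math.sqrt(pooled * (1/nx + 1/ny))

tight_a = [10.0, 10.1, 9.9, 10.0, 10.1]
tight_b = [11.0, 11.1, 10.9, 11.0, 11.1]   # same shape, shifted by 1.0
noisy_a = [5.0, 15.1, 9.9, 12.0, 8.1]      # roughly the same means, huge spread
noisy_b = [12.0, 6.1, 15.9, 11.0, 10.2]

print(f"tight: |a-b| = 1.00, |t| = {abs(t_stat(tight_a, tight_b)):.1f}")
print(f"noisy: |a-b| ≈ 1.02, |t| = {abs(t_stat(noisy_a, noisy_b)):.2f}")
```

The difference of means is essentially identical in both pairs, but only the tight one is anywhere near significant, and even then interpreting the t value still leans on a distributional assumption.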
Profession A has a mean salary 20% higher than that of Profession B.
Yet people who are in profession A are much more likely to be in poverty than in profession B.
Yet almost any time someone compares two means, they never seem to come to this conclusion - or even consider it a possibility.
Comparing two means without other details is rarely illuminating, and often leads to wrong conclusions (which are worse than no conclusions with no data).
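A sketch with invented distributions: give Profession A a heavily skewed (lognormal) salary distribution and B a tight one, and both statements can hold at once:

```python
import random
import statistics

random.seed(0)
POVERTY_LINE = 25_000  # invented threshold

# Profession A: heavily skewed; most earn little, a few earn fortunes
a = [random.lognormvariate(10.2, 1.2) for _ in range(10_000)]
# Profession B: tightly clustered around a modest salary
b = [random.gauss(45_000, 8_000) for _ in range(10_000)]

mean_a, mean_b = statistics.mean(a), statistics.mean(b)
pov_a = sum(x < POVERTY_LINE for x in a) / len(a)
pov_b = sum(x < POVERTY_LINE for x in b) / len(b)

print(f"mean A: {mean_a:,.0f}   mean B: {mean_b:,.0f}")   # A's mean is higher...
print(f"poverty A: {pov_a:.1%}  poverty B: {pov_b:.1%}")  # ...and so is A's poverty rate
```

A's handful of enormous salaries drags its mean above B's even though roughly half of A sits below the poverty line, which is exactly the conclusion a bare comparison of means hides.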
It drives me up the wall: we have 1200dpi printers, retina displays, and so on, and yet somehow people feel the need to collapse everything they've done to these giant finger-painting quality bar charts. Statistical tests are well and good, but I'm amazed at the extent to which smart people will happily plug data which they have never actually seen into statistical metrics. So a mean might be derived from 9 reasonable results and a howlingly off factor-of-2 outlier, and you can dutifully plug this series into a bunch of standard tests and speak confidently about p-values.
I feel like everybody says these are important, show a formula and then arguments ensue about what p=0.95 means, and nobody seems to know this.
I'll try to look at an introductory book again and see if it satisfies my curiosity.
As a concrete example, I might ask you for the distribution of the mean of N samples given that they come from the standard normal distribution (mean zero, variance 1). That's easy. The sample mean, which is itself a random variable, also is normally distributed with a mean of zero and a variance of 1/N. On the other hand, if I ask you about the mean, but the only info you have is that your data isn't from a standard normal, then it could be anything! There's no objective way to say how the sample mean is distributed, given that one crappy piece of info.
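The first half of that is easy to verify with a quick simulation (nothing here beyond the standard library):

```python
import random
import statistics

random.seed(1)
N = 25            # samples per experiment
TRIALS = 20_000   # repeated experiments

# Repeatedly draw N standard-normal samples and record each sample mean.
means = [statistics.mean(random.gauss(0, 1) for _ in range(N))
         for _ in range(TRIALS)]

print(f"mean of sample means:     {statistics.mean(means):.3f}")      # ≈ 0
print(f"variance of sample means: {statistics.variance(means):.4f}")  # ≈ 1/N = 0.04
```

The empirical variance of the sample mean comes out close to 1/N, matching the stated distribution; without the "standard normal" assumption there is no such answer to converge to.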
The most basic thing you can do then, is to assume that your model is true and see if your data is plausible. If I have a hypothesis that I'm flipping a fair coin and I get all heads on 10 flips, I'm going to start doubting my hypothesis. The probability of all heads or all tails with a fair coin is only 1/512=0.002. P values formalize that notion. We call the hypothesis we can model our "null hypothesis", and see if we get data that makes sense with it. If your observations are some of the most unlikely ones according to your null model, let's start doubting the model. That's it.
The benefit and trouble are both that we dodged the entire question of what an alternative to our model could be, and how the data looks under those alternatives. Ignoring that incredibly important question can give rise to a weird way of thinking, and opens the door to some conceptually mind bending mistakes, but it all comes from a simple interpretation of a p value. How unlikely is your data given your null hypothesis (given the model you're trying to test). Formally, this tends to be "what is the probability that some statistic is this unlikely or worse."
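That last phrase can be written down directly for the coin model; in this sketch, "this unlikely or worse" means every outcome whose probability under the null is at most that of the observed one:

```python
from math import comb

def p_value(n, heads_observed, p_null=0.5):
    # probability of each possible head count under the null hypothesis
    probs = [comb(n, k) * p_null**k * (1 - p_null)**(n - k) for k in range(n + 1)]
    observed = probs[heads_observed]
    # sum over every outcome at least as improbable as the observed one
    return sum(p for p in probs if p <= observed + 1e-12)

print(p_value(10, 10))  # 2/1024 ≈ 0.002: the "all heads or all tails" case above
print(p_value(10, 8))   # ≈ 0.109: 8 heads alone is not that damning
```

Note that nothing in this calculation ever mentions an alternative hypothesis; that omission is exactly the dodge described above.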
Does that make any sense?
Edit: I now see that this was mentioned elsewhere here. Good!