Heh, you'd be surprised! Most people I met would interpret a 95% CI by saying that there is a 95% chance that it contains the true mean.
The thing that makes it hurt is that what seems like a minor rewording of that statement, "We're 95% confident that the range we calculated contains the true population mean," would be correct. (It's not the definition of a CI, but it is implied by the definition of a CI.) The reason why one is true and the other is false comes down to the technical distinction that the true population mean is a fixed value and not a draw from a random variable, so, in a strictly mathematical sense, it is nonsensical to apply probabilities to it. By the same token, an even more subtle rewording gets us back to falsehood: "There's a 95% chance that the range we calculated contains the true population mean."
Delving into that level of hair-splitting makes for interesting math, but it also leaves me with the opinion that, sure, the most common intuitive interpretation is wrong, but it's wrong in a way that isn't really of much practical importance.
By contrast, the most common intuitive interpretation of the p-value is disastrously wrong.
Another discrepancy between frequentist statistics and the article is that, yes, the values at the boundary of your interval are as credible as those in the center.
Another note: 'confidence interval' typically refers to the frequentist meaning, whereas 'credibility interval' is used in the Bayesian setting, when describing an interval of the posterior with 95% probability (which is arguably more interpretable). The usages of the two terms do not seem to generally be strict, however.
What would that mean in the frequentist framework?
To provide an extreme case, during the Iraq war, epidemiologists did a survey and came up with an estimated number of deaths. The point value was 100K, and that's what all the newspapers ran with. But the actual journal paper had a CI of (8K, 194K). There's no reason to believe the true value is closer to 100K than it is to 10K. Or to 190K.
But neither can we say that the true value is equally likely to be closer to 21 than to 3.
The point is that, from the frequentist definition of a confidence interval, there is nothing at all that we can say about how likely the true value is to be here or there.
It could be 3, 21, or 666 and there is nothing that can be said about the likelihood of each value (unless we go beyond the frequentist framework and introduce prior probabilities).
Yes - sorry if I wasn't clear. I did not mean to imply that each value in the interval is equally likely (and looking over my comments, I do not think I did imply that).
The complaint is that the article is stating otherwise as fact.
>One practical way to do so is to rename confidence intervals as ‘compatibility intervals’
>The point estimate is the most compatible, and values near it are more compatible than those near the limits.
They simply are not in a frequentist model (which is the model most social scientists use). I agree with the main thrust of the article in that there are many problems with P values. But I am surprised that a journal like Nature is allowing clearly problematic statements like these.
I don't know enough about the Bayesian world to be able to state if his statement is wrong there as well, but if it is correct there, it is problematic that the authors did not state clearly that they are referring to the Bayesian model and not the frequentist one.
(Not to get into a B vs F war here, but I remember a nice joke amongst statisticians. There are 2 types of statisticians: Those who practice Bayesian statistics, and those who practice both).
When you said that "the values at the boundary of your interval are as credible as in the center" you kind of implied that, which is why I asked.
I won't defend the article being discussed, but you opposed their statement that "the values in the center are more compatible than the values at the boundary" with an equally ill-defined "the values at the boundary are as credible as in the center".
What I meant was "there is no reason to prefer values at the center more than values at the boundary" based on the CI (there may be external reasons, though). To me, this is equivalent to your:
>there is nothing that can be said about the likelihood of each value
I find this very silly. Suppose we ditch the arbitrary 0.95 and go with a 0.999... confidence interval of [-998, 1040], for example. How can one say that one cannot tell which value is more likely, 21 or 1040?
If this is an actual limitation of the frequentist model like you said, everybody should be a Bayesian thinker then. And the "confidence interval" is just a quick way to communicate how wide the posterior bell curve is and where it sits.
The difference is hugely important: the central limit theorem, and the normal distribution it implies, commonly applies to sample means.
You can calculate confidence intervals for most (not all) other statistics, like point estimates, but the distributions might not be normal.
> Edit, November 2014: ... Had we used, say, the Maximum Likelihood estimator or a sufficient estimator like min(x), our initial misinterpretation of the confidence interval would not have been as obviously wrong, and may even have fooled us into thinking we were right. But this does not change our central argument, which involves the question frequentism asks. Regardless of the estimator, if we try to use frequentism to ask about parameter values given observed data, we are making a mistake.
I agree that credible intervals and confidence intervals answer different questions. I don't think that it's obvious that the confidence interval approach is wrong, and the example in the blog post is definitely not evidence towards this.
> A 95% confidence level does not mean that for a given realized interval there is a 95% probability that the population parameter lies within the interval (i.e., a 95% probability that the interval covers the population parameter). According to the strict frequentist interpretation, once an interval is calculated, this interval either covers the parameter value or it does not; it is no longer a matter of probability.
> For a single confidence interval, you have either captured the mean in your confidence interval, or you've not -- there's no probability about it.
Isn't there? The underlying truth is that you either definitely have or have not captured the population mean in any specific confidence interval. But you can't know this truth. In the long run, if "a 95% confidence interval contains the true mean 95% of the time across an infinite number of replications of the experiment/study," then isn't it true that any single specific experiment's CI has a 95% probability of containing the true value?!
In my untrained mind, this is exactly equivalent to flipping an unfair coin with a 95% chance of heads. Sure, before flipping, the outcome of heads has a 95% probability. After flipping, you either get heads or tails. But if you flip a coin and hide the outcome without looking at it, doesn't it still have a 95% chance of being heads as far as the experimenter can tell?
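The long-run claim can be checked with a quick simulation. This is a hedged sketch, not a rigorous treatment: it assumes normal data with a known sigma and uses the textbook z = 1.96 interval.

```python
import random
import statistics

# Simulate many replications of the same experiment and count how often
# the standard 95% CI for the mean covers the true value.
# Assumptions: normal data, known sigma, z = 1.96.
random.seed(0)
TRUE_MEAN, SIGMA, N = 10.0, 2.0, 30
reps = 10_000
covered = 0
for _ in range(reps):
    sample = [random.gauss(TRUE_MEAN, SIGMA) for _ in range(N)]
    m = statistics.fmean(sample)
    half_width = 1.96 * SIGMA / N ** 0.5
    if m - half_width <= TRUE_MEAN <= m + half_width:
        covered += 1
print(covered / reps)  # close to 0.95
```

Before you look at any particular interval, this is exactly the unfair-coin situation: about 95% of the replications produce an interval that captures the truth.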
For example, if I perform an experiment on a light source and find a 95% CI of 400-1200 lumens for the brightness, the actual probability of this being true is much higher if the light source is a 60W incandescent bulb than if the light source is the sun.
Instead, if you state that "the luminescence OF THE 60W INCANDESCENT LIGHTBULB is 400-1200 lumens with 95% CI", then that's the useful information that lets you set the right expectations when designing lighting for your new house, for example.
Just like a large number of people misinterpret a P value of .01 to mean a 1% chance of the results being due to chance, CIs can be similarly misinterpreted.
1: A .01 P value actually means that if the null hypothesis is true, then you would get the result 1% of the time. The analogy to my above example would be that if I run an experiment and get a result that "the sun is less bright than a 60W light-bulb" with a P value of .01, it's almost certainly not true that the sun is less bright than the light-bulb, since the prior probability of the sun being less bright than a 60W light bulb is many orders of magnitude smaller than 1%.
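The sun-versus-lightbulb point can be made concrete with a couple of lines of Bayes arithmetic. All the numbers below are illustrative assumptions, not measurements:

```python
# Posterior probability of the hypothesis ("the sun is dimmer than a
# 60W bulb") given a significant result at alpha = 0.01.
prior = 1e-12   # assumed prior probability of the hypothesis
alpha = 0.01    # rate of "significant" results when the hypothesis is false
power = 0.99    # assumed rate of "significant" results when it is true
posterior = prior * power / (prior * power + (1 - prior) * alpha)
print(posterior)  # around 1e-10: the tiny prior still dominates
```

Even after a p = 0.01 result, the hypothesis remains absurdly improbable, because the posterior is driven by the prior, not by alpha alone.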
I don't know why you get the idea that all 100 will be non-overlapping. That's simply false.
And yes, if your assumptions were correct, regular (i.e. frequentist) statistics will state that roughly 95% of the CIs will contain the true mean. There is nothing absurd about it.
It's imperative to understand that these definitions are written not for the average user of statistics, but for a trained statistician. Unfortunately, the average stat consumer vastly outnumbers the professional. Papers are littered with statements like "the p value proves H0" or "proves H1". I have had numerous conversations with scientists (not statisticians, but pharma/epidem/engg people who show up to the stat lab for a consult) about how their p value doesn't prove H0 or H1. "What do you mean you can't prove H1? Oh you mean it only rejects H0? Ok but isn't that the same as proving H1? It isn't?! Well in my field if I just state it rejects H0 it won't be well understood so I am going to instead say H1 has been proved!" So there's little the statistician can do.
Regards overlap, I meant total/exact overlap, as in no two CIs will be identical on any conti dist.
And as a TA, I hope you are marking them wrong! My stats professor went through great pains to point it out, as does my stats textbook.
>but then nobody ever conducts 100 experiments, so from their pov its an absurd definition.
The definition is not absurd. It's just a definition. You can argue that CI's are being abused and are not as helpful as people think they are and I'll probably agree with you. But that's no excuse for people to use it and get a pass for not even knowing what it means!
I contend that it was defined as such because at the time no one had anything better. I suspect that much is true even now. I'm not aware of any obviously better alternatives, and articles like these even suggest there aren't any - just that we need to be mindful that the CI alone doesn't allow for reliable conclusions.
At the end of the day, this is not a problem with the CI definition. It's not a statistics/technical problem. It's a social/cultural one. As such, the solution isn't to change statistics, but to fix the cultural problem: Why do we keep letting people get away with such analyses? Are there any journals that have a clear policy on these analyses? Are referees rejecting papers because of an over reliance on p-values?
Let's not change basic statistics definitions and concepts because the majority of non-statisticians don't understand them. When the majority of the public can't understand basic rules of logic (like A => B does not mean that (not A => not B)), we don't argue for a change to the discipline of logic. When huge numbers of people violate algebra (e.g. sqrt(a^2+b^2) = a + b), we don't blame mathematics for their sins. (I know I'm picking extreme cases for illustration, but the principle is the same). I had only one stats class in my curriculum. If you want people to perform correct statistics as part of the profession, make sure it is fundamental to much of their work. It would have been trivial to add a statistics component to most of my engineering classes, and that would hammer in the correct interpretation. Yet while we were required to know calculus and diff eq for most of our classes, none required any statistics beyond the notion of the mean (and very occasionally, some probability).
Statistics is a tool. It will always be the responsibility of the person invoking the tool to get it right.
>Regards overlap, I meant total/exact overlap, as in no two CIs will be identical on any conti dist.
Isn't that the whole point of inferential statistics? You have a population with a true mean. You cannot poll the whole population. Hence you take a sample. This is inherently random. There is variance in your estimate (obviously). What should be clear is that the CI should move with your point estimate. Furthermore, you never know the true stddev, so you estimate the stddev from your sample. Now both your center and the width of the CI will vary with each sample. I can't comprehend how you could hope to get the same interval from different samples, given that it is quite possible to get all your points below the mean in one sample and above the mean in another.
I think people are bashing statistics because it isn't helping them come to a clear conclusion (which is fair). But as I said, all the proposals I've seen appear to be as problematic.
You could possibly enforce that when publishing in journals, but there is no way in hell "science journalism" would follow suit, so the general public would still be just as hopelessly misled, if not more so if we account for the general statistical incompetence of the typical science journalist.
Writing a paper, you need to support your conclusion; removing the p-value doesn't remove that need for support, it just means finding something different, hopefully better.
Protecting against misinterpretations by outsiders is not something that scientific research papers should worry about.
That's the gist of the crisis science has been having of late: The popular meme for a long time was that it was just non-scientists who would misinterpret the research. Recently, it has become clear that misinterpreting papers is also rampant among trained scientists, to the extent that entire fields are being shown to be houses of cards.
There are plenty of personalities and pundits that love to say "look at the data" or "the data is clear".
It bugs me when data is weaponized as truth to prove a conjecture, especially in the social sciences where studies are routinely difficult to replicate with consistent results.
By that measure, I find this entire idea rather idiotic.
You have given no reason to believe this.
I don't see why a scientist at a conference who's saying that two groups are the same has to be heard as claiming, "we have measured every electron in their bodies and found that they have the same mass, forget about six sigma, we did it to infinity." Instead they could simply be understood to be saying that the two groups must be similar enough to not have ruled out the null hypothesis in their study.
Edit: remove "interpret" from last sentence to clarify
Read it in full here: https://www.statisticsdonewrong.com/power.html#the-wrong-tur...
Not really. A low p-value says that it was surprising to get the result that you got, assuming that the null hypothesis is true. And if the null hypothesis is true it would be surprising to get the same result again (i.e. a result as extreme). If the null hypothesis is not true, the result would not be so surprising (or maybe more so, if the true effect is in the “wrong” direction).
The result we got gives some evidence for the null hypothesis being false, but if the null hypothesis was very very likely to be true before it may still be very likely to be true afterwards. In that case it wouldn’t be surprising to get a different result if the experiment is performed again.
Illustration: I roll a die three times. I get three ones. P<0.01 (for the null hypothesis of a fair die and the two-tailed test on the average). This is not simply saying that if I roll the die three times again it would be surprising to get something other than ones.
Hmm. At a glance, that doesn't seem right. Yes, the chances of rolling 3 1's is 1/(6^3), but if we only rolled once and got a single 1, we wouldn't have any reason to suspect that the die was unfair. So maybe we should only consider the second two repetitions, and conclude with p ~ .03 that the die is unfair? Otherwise, consider the case that we rolled a 1, 5, 2 --- certainly we shouldn't use this series of non-repeated outcomes as p < .01 evidence of an unfair die?
The sampling distribution for the average can be calculated and for three rolls the extreme values are 1 (three ones) and 6 (three sixes) which happen with probability 1/216 each. Getting three ones or three sixes is then a p=0.0093 result.
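With only 216 equally likely outcomes, that sampling distribution can be enumerated exhaustively; a quick sketch:

```python
from itertools import product

# All 6^3 = 216 equally likely outcomes of three fair-die rolls.
outcomes = list(product(range(1, 7), repeat=3))
means = [sum(o) / 3 for o in outcomes]

observed = 1.0  # mean of (1, 1, 1)
# Two-tailed test on the average: count means at least as far from the
# expected value of 3.5 as the observed mean.
extreme = sum(1 for m in means if abs(m - 3.5) >= abs(observed - 3.5))
p = extreme / len(means)
print(p)  # 2/216, about 0.0093: only (1,1,1) and (6,6,6) qualify
```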
You raise a valid point. This is clearly not the best test for detecting unfair dice, because for a die which has only two equally probable values 3 and 4 we would reject the null hypothesis even less often than for a fair die! (In that case, the power would be below alpha, which is obviously pretty bad.)
P values don't tell you the chances of getting the same result, only the chances of getting the same result by chance.
If we repeat the experiment and get a different result, then we need to be looking into confounding variables and testing methodology. Just because the P-value is low doesn't mean there's no fundamental flaw in the experiment used to find the p-value.
You're only allowed to say either that nothing happened or that it's unlikely nothing happened, but it doesn't say what "unlikely nothing happened" means.
I would say it’s the other way around.
The p-value says something about the result (how likely this result would be if the null hypothesis were true).
It doesn’t say anything about the probability of the null hypothesis being true.
It's how likely the result is to occur by chance if the null hypothesis is true. A positive result can occur for lots of other reasons even if the null hypothesis is true, and the p-value doesn't tell you anything about how likely you are to get a certain result if the null hypothesis is true (or false).
I don’t think your comment makes sense.
Given a parametric model and a particular value of the parameter (i.e. the null hypothesis) one can calculate the sampling distribution of the data.
Therefore under the null hypothesis the model gives a well-defined probability distribution for the data and you can tell how likely you are to get a certain result.
There is no room for “other reasons”.
Looking at the p-value of a study doesn't tell you anything about how likely the study was to have been based on fabricated data.
I agree that if the data is made up the results of the study and the statistical analysis based on the results will have no relation whatsoever with the fact that the null hypothesis was or wasn’t true.
The p-value tells you just how likely you are to get a certain (or more extreme) result if the data generating model is indeed correct and the null hypothesis is true.
We agree that the p-value doesn’t tell you anything about how likely it is that the study was based on fabricated data, or how likely it is that the model is correct or how likely it is that the null hypothesis is true.
The p-value doesn’t tell us anything about the real world. It’s a probability conditional on a hypothetical model.
That interpretation subtly suggests that the P-value gives an estimate of the likelihood of the alternative hypothesis being true.
Any observation is consistent with an infinite number of models. E.g., your sight is defective, or, in many practical cases, your sample is biased, not big enough, etc.
And that A correlates with B, or fails to, to "some significance" is consistent with any causal relationship between A and B.
> Surveys of hundreds of articles have found that statistically non-significant results are interpreted as indicating ‘no difference’ or ‘no effect’ in around half (see ‘Wrong interpretations’ and Supplementary Information).
You can't expect a statistical test to comb through your experimental procedure to check whether or not you did it right. (Least of all, to comb through your personal medical history. ;) )
>Any observation is consistent with an infinite number of models.
I'm afraid that slips below the "get out of bed in the morning" pragmatism bar. Although it's a provocative idea, concluding that conclusions are impossible because you could be making a mistake or hallucinating would lead to, literally, not getting out of bed in the morning, if you took it seriously.
At the end of the day you just have to hope that the human beings reading your paper will be able to tell whether or not you did the experiment right. Any statistical test just walls off one way (or a few ways) in which you could fool yourself.
>Surveys of hundreds of articles have found that statistically non-significant results are interpreted as indicating ‘no difference’ or ‘no effect’ in around half (see ‘Wrong interpretations’ and Supplementary Information).
What we really want to avoid here is requiring every paper to have a twelve paragraph Sophist disclaimer about how knowledge is impossible. ;) Just because the text doesn't say "oh and by the way there could be gnomes in the LHC that are fooling us to entertain themselves," doesn't mean that the author wouldn't begrudgingly acknowledge it as an epistemic risk.
It isnt a trivial sceptical point.
Does CO2 cause climate change, or climate change cause CO2, or are they both caused by a hidden factor (one of many), or do they co-vary accidentally (i.e., no joint cause)?
No "statistical significance" of any kind can decide between them, yet this is how it is used. There is no way to do so with statistics; it requires repeated experimentation and refutation of all plausible causal models until only one remains standing as "the most plausible".
"Statistics" is being used in place of the scientific method which it cannot replace. Covariation is not causation: the covarying of any number of variables is consistent with any number of causal models. No "test" decides.
The point is for people to make a decision using probability as one available decision aid, and then deciding whether to place a bet.
Not if you don't have free will, which is apparently the conclusion that a lot of scientists seem to be coming to recently.
When I look around me I see people doing this every day.
Emphasis added (obviously). This word was missing from the analysis of whatshisface (GGP). While all of our models are probabilistic, and it would be silly to constantly describe everything this way in normal parlance, a discussion of how we want to go about building and interpreting models of the world is exactly the domain where this needs to be explicitly stated. You're not wrong here, but neither is GP. GGP is wrong, though not provably so (again, obviously).
> In effect, we can make a statement like "at least 90% of studies like ours will detect an elephant larger than 20 microns if it is actually there on the table".
The "in effect" disclaimer makes this statement arguable, but fundamentally we still can't say that. We can only say whether or not an elephant was detected, since detection (in this sense) is subjective. We can't say whether elephants actually exist or not. Perhaps all tables have invisible elephants on them, or perhaps all elephants are hallucinations shared by multiple humans through crazy coincidence. If 50% of elephants that actually exist on tables are invisible, and 10% of visible elephants are not detected due to error, then only 45% of studies will detect an elephant even though it will still seem like 90% to the best of our knowledge. Since we don't know what percentage of elephants are currently detectable we need to be even more ridiculous and say something like "we guesstimate that there is a 99ish% chance that at least 90% of studies like ours will detect an elephant larger than 20 microns if it is actually there on the table".
I have seen outrageous examples of "accepting the null hypothesis" many times, but many negative result studies have great value and even a single negative study can provide evidence against a very large effect.
True, but this crucial assumption needs to be kept in mind when using the study to inform yourself of the state of the world. Too often, it's forgotten that there are two questions that need answering for the study to be relevant, and only one of those has a number associated with it.
In that sense, Bayesian statistics, where this is explicit, are less misleading because they actually draw attention to the fact that we don't know that the model is correct.
Interesting point about Bayesian methods: whenever there are potential flaws or additional uncertainty, it's better if they are made explicit to prompt thoughtful interpretation.
If you're willing to treat probability as a measure of uncertainty, then you'd be right at home as a Bayesian.
But successfully ruling out an elephant is uninteresting if you didn't expect an elephant. The problem is that "statistically significant" sounds impressive, but we shouldn't be impressed.
I guess we need a less impressive term for this? Maybe something like "may have avoided statistical blindness."
The groups need not be similar to fail to rule out the null. It can also be that the measurements are too noisy and too few.
Also on the flip side, if you do reject the null, it doesn't mean that the groups are different. It could also be that you have so many measurements that you are picking up tiny biases in your experiment or instruments.
Null hypothesis testing is almost always too weak of a test to be useful.
A tiny effect is very easy to reject (p<0.0000...000001). I can tell you before running the experiment that any two objects close enough for long enough to be in each other's light cone have at least a tiny effect on each other.
The null hypothesis is hereby rejected for most anything relevant to humans. No need to calculate p-values. Just reference this comment.
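To make the sample-size point concrete, here is a sketch of a two-sided one-sample z-test (sigma = 1 assumed): for a fixed tiny effect, the p-value shrinks without bound as n grows.

```python
import math

def two_sided_p(effect, n, sigma=1.0):
    """Two-sided p-value for a one-sample z-test against a mean of zero."""
    z = abs(effect) * math.sqrt(n) / sigma
    return math.erfc(z / math.sqrt(2))  # normal tail probability, both sides

# A 0.001-unit effect: irrelevant in practice, yet "significant" for large n.
for n in (10**2, 10**6, 10**10):
    print(n, two_sided_p(0.001, n))
```

At n = 100 the p-value is near 1; by n = 10^10 it underflows past any threshold anyone would ever set.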
The problem is that with all its problems, statistical significance provides one major advantage over more meaningful methods: it provides pre-canned tests and a number (.05, .01, etc) that you need to 'beat'. The pre-canned-ness/standardization provides benchmarks for publication.
I once worked in a computational genomics lab. We got a paper into PNAS by running a Fisher exact test on a huge (N=100000+) dataset, ranking the p-values, taking the lowest ones, and reporting those as findings. There's so much wrong with that procedure it's not even funny.
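The core problem is easy to demonstrate with pure noise. A sketch, assuming every null hypothesis is true, so every p-value is uniform on [0, 1]:

```python
import random

random.seed(0)
n_tests = 100_000

# Under the null, p-values are uniformly distributed; these "tests" use
# no data at all, so every finding below is a false positive.
pvals = sorted(random.random() for _ in range(n_tests))

print(pvals[:3])                     # the "top findings": tiny p-values
print(sum(p < 0.05 for p in pvals))  # roughly 5,000 pass the 0.05 bar
```

Ranking raw p-values over a dataset this size guarantees impressive-looking results; this is why multiple-testing corrections like Bonferroni or Benjamini-Hochberg exist.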
I hope we aren't worse at reform than they were in the 1800s.
I suspect this will be the eventual revision that's adopted in most domains, since some sort of binary test will still be demanded by researchers. Nobody wants to get mired in a long debate about possible confounding variables and statistical power in every paper they publish. As scientists they want to focus on the design of the experiment and results, not the methodological subtleties of experimental assessment.
p<0.01 is good when we have a good model of reality that generally works. When we have no good model, there is no good value for p. The trouble is that all the hypotheses are lies; they are false. We need more data to find good hypotheses. And we think like "there are useful data and there are useless data; we need to collect the useful while rejecting the useless". But we do not know which data is useful while we have no theory.
There is an example from physics I like: static electricity. Researchers described in their works what causes static electricity. There was a lot of empirical data. But all that data was useless, because the most important part of it didn't get recorded. The most important part was the temporality of the phenomenon: static electricity worked for some time after charging and then discharged. Why? Because no material is a perfect insulator, there was a process of electrical discharge; there was voltage and current. It was a link to all the other known electrical phenomena. But physicists missed it because they had no theory; they didn't know what was important and what was not. They chased what was shiny, like the sparks from static electricity, not the lack of sparks after some time.
We are modern people. We are clever. We are using statistics to decide what is important and what is not. Maybe it is a key, but we need to remember that it is not a perfect key.
And this article emphasises that huge numbers of scientists and their audiences cannot correctly interpret negatives (false or otherwise).
The negative means: "Our data does not appear to show any correlation that met the criteria we chose to determine statistical significance"
This negative is frequently misinterpreted as: “We have shown with acceptable confidence that there is no correlation". (which does not inherently follow).
Therefore, adjusting the common p value threshold to P<0.01 while not correcting this widely-held cognitive error could (potentially) even worsen the problem, because people will encounter proportionally even more of the 'negative' results that they are already well-established as poor interpreters of.
The problem with p values is the opposite: in a complex system, the null hypothesis is almost never true. Everything affects everything else. It's the magnitude of the effect that is important, because it can be too small to have practical implications.
If you do enough studies, eventually you'll find a result with a p value below 0.05 that overestimates the magnitude of the effect by a lot, and you'll publish it.
For example, global weather is a hugely complex system which is far from fully understood. Still, take an incredibly simple predictive model: predict the current temperature by just predicting the average temperature for the current season. That's not going to be very accurate, but if you run it for a year it will almost certainly be p<.01.
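A sketch of that claim with synthetic data (sinusoidal seasonality and daily noise assumed), using a permutation test so no distributional machinery is needed:

```python
import math
import random

random.seed(1)

# Synthetic year of weather: temperature = seasonal cycle + daily noise.
season = [10 * math.sin(2 * math.pi * d / 365) for d in range(365)]
actual = [s + random.gauss(0, 3) for s in season]

def pearson(x, y):
    """Pearson correlation coefficient, computed from scratch."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

r_obs = pearson(season, actual)

# Permutation test: how often does a shuffled predictor correlate as well?
exceed = 0
shuffled = actual[:]
for _ in range(999):
    random.shuffle(shuffled)
    if abs(pearson(season, shuffled)) >= abs(r_obs):
        exceed += 1
p = (exceed + 1) / 1000
print(round(r_obs, 2), p)  # strong correlation, p at the 0.001 floor
```

The model is "significant" at any conventional threshold, yet it tells you nothing you didn't already know about the weather.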
Not really. You raise the probability of an inconclusive result when you could otherwise have gotten a positive. If you interpret p > threshold as “null hypothesis is true”, then you are doing the statistics wrong.
In most cases, I think a better model would be to extract an effect size such that an effect larger than the size is ruled out by the study to some degree of confidence. Currently, I read about studies that conclude that “such-and-such had no significant effect detected by this study.” Concretely, this looks like “vaccines had no significant effect on autism risk.” This may be accurate, but it’s lousy. How about “vaccines caused no more than an 0.01% increase in autism, and a bigger study could have set an even tighter limit.”
Physicists regularly do this. For example, we know that the universe has no overall curvature to a very good approximation.
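A minimal sketch of reporting an explicit bound instead of "no significant effect". All numbers are invented for illustration, and a normal approximation with the one-sided 95% quantile z = 1.645 is assumed:

```python
# Suppose a study estimates the difference in autism rates between
# vaccinated and unvaccinated groups (numbers invented for illustration).
diff = 0.0001   # observed difference in rates
se = 0.00005    # standard error of that difference

# One-sided 95% upper confidence bound on the effect size.
upper_95 = diff + 1.645 * se
print(f"effect no larger than {upper_95:.5f} at 95% confidence")
```

This says something actionable ("any effect is at most about 0.02 percentage points") where "not significant" says almost nothing, and a larger study would shrink the bound further.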
> For example, consider a series of analyses of unintended effects of anti-inflammatory drugs2. Because their results were statistically non-significant, one set of researchers concluded that exposure to the drugs was “not associated” with new-onset atrial fibrillation (the most common disturbance to heart rhythm) and that the results stood in contrast to those from an earlier study with a statistically significant outcome.
> Now, let’s look at the actual data. The researchers describing their statistically non-significant results found a risk ratio of 1.2 (that is, a 20% greater risk in exposed patients relative to unexposed ones). They also found a 95% confidence interval that spanned everything from a trifling risk decrease of 3% to a considerable risk increase of 48% (P = 0.091; our calculation). The researchers from the earlier, statistically significant, study found the exact same risk ratio of 1.2. That study was simply more precise, with an interval spanning from 9% to 33% greater risk (P = 0.0003; our calculation).
I know, right? God forbid that we take a close look at the ways we might be fooling ourselves. Sounds like hard work.
I suspect just changing the threshold (especially as a new universal threshold, rather than related to the nature of the experiment) wouldn't even strike the authors as an improvement.
> Third, like the 0.05 threshold from which it came, the default 95% used to compute intervals is itself an arbitrary convention. It is based on the false idea that there is a 95% chance that the computed interval itself contains the true value, coupled with the vague feeling that this is a basis for a confident decision.
I'm surprised the authors don't talk about 'practical significance' vs 'statistical significance'. Statistical significance can be easily gamed, especially if you're relying on one study. I think the real problem is the reliance on one study to make broad generalizations. The 'replication crisis' is everywhere.
A compatibility interval (at an agreed upon arbitrary level) communicates the magnitude of the difference as well as the uncertainty, which makes it much better for comparing options.
If my medication alleviates symptoms for 94-95% of patients above the current gold standard of 92-93% you could say it's "statistically significantly better", but the marginal improvement may not be worth the investment. Conversely if my medication alleviates symptoms for 50-80% of patients and the gold standard is 45-55% it would at least warrant future research (and if my medication has fewer severe adverse events it might be a better bet overall).
But this is just one small part of the whole picture: ideally we'd have preregistered experiments, experimental data published (or available to researchers where not possible due to confidentiality) and incentives for replication. Maybe this is too much for every field of science, but for ones where a wrong decision could have a severely detrimental impact they would create much more value than moving the P value.
This is literally every paper I publish.
Also, tightening the p-value criterion has problems of its own - as mentioned, it boosts the false negative rate, which is not a consequence free act. In the work I do, it's also a largely arbitrary threshold I can meet if I give the computer enough time.
Indeed. P-hacking is so much less fun when you don't have a P to hack.
This is why these calls to eliminate significance testing always seem really naive and short-sighted to me. P-values are abused, and people confuse p-values and effect size, but there will always be a need to focus on 0 as that supremely-important number. ε can be judged on its practical significance but 0 is always less.
Anyway, I agree with you but wanted to point out that there's two sides to the coin, and both lead in the same direction.
Differences between the means of any two groups (e.g. treatment and control) on any outcome will tend to be non-zero. Interpreting this sample difference as a population difference without considering the confidence interval seems risky.
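A minimal sketch of that point, with entirely hypothetical summary statistics and a large-sample (Welch-style) interval: the raw difference looks non-zero, but the interval says the data are compatible with no difference at all.

```python
from math import sqrt

def diff_ci_95(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Large-sample (Welch-style) 95% CI for a difference of group means."""
    diff = mean_a - mean_b
    se = sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
    return diff - 1.96 * se, diff + 1.96 * se

# Hypothetical treatment vs control summary statistics
lo, hi = diff_ci_95(10.4, 3.0, 25, 9.8, 3.2, 25)
print(round(lo, 2), round(hi, 2))  # the interval comfortably straddles zero
```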
I think that funding and publishing pressure turns the already hard problem of doing good science into a Pareto optimisation between doing science, publishing, and getting funding. The result is partially coerced results, stronger-than-justified conclusions, lenient interpretations, and, of course, funding.
It’ll go this way whatever the metric. I see the same sort of crap in ML papers where the authors report far more and lie far more at the same time.
The authors are spot on that the problem is not p-values per se but dichotomous thinking. People want a magic truth box that doesn’t exist. Unfortunately there are a ton of people in the world who continue to make money off of pretending that it does.
To me the issue stems from an even deeper fear of uncertainty. It takes a rock-solid psyche to be comfortable with the idea that we know little and a lot of our science would have a tough time standing up to statistical analyses.
If the kurtosis is high, p-values are over-stated. If fat-tailed then p-values are understated.
Why? Because the sampling distribution underlying your p-value isn't guaranteed to be normal.
Normal is a nice assumption but asymptotic can take a long time to kick in. The CLT is beautiful analytically, but fortunes are made from people who assume it.
Honestly...? You're screwed. At least in Bio, where most researchers haven't taken calculus, most folks will screw up the t-test or their ANOVA if you are not super careful. For non-Gaussian data you'd better pray it's Poisson or has some other exotic name that you can at least google.
Especially with low N, you just kinda pray it's normal and then you go and try and get grant funding with those results.
Cynically, in the end, it barely matters. It's all about that grant money. Whatever way you can tease that data to get more grants, you just do that. No-one ever checks anyway (Statcheck excepted)
I'm frankly tired of seeing executives go on stage trying to show some numbers and graphs to prove a point about some variables. You see this in board meetings too. The sample sizes are too small to conclude anything significant!
I'm an Economics PhD (And former professor) and if someone were to say those lines at an academic conference there is a high likelihood that they would be literally laughed at.
Maybe it is because of my background in a quantitative field where we place a huge emphasis on statistical rigor, but t-tests were pretty much dismissed by anyone serious 20+ years ago. Seems like the issue stems from disciplines without a stats/math background just pointing to t-stats. My wife reads medical literature for her work and I gag pretty much every time she asks me to look at the results.
There is not, I suspect, any other solution but that we must train a whole lot more statisticians. This means we will need to give more credit, and authority, and probably pay, to people who choose to pursue this field of study.
This comment is not meant to disparage anyone who considers themselves a data scientist. However, I have advanced degrees in both statistics and computer science, and employers and recruiters outside of R&D roles have shown very little interest in my statistical background aside from my machine learning experience. My experimental design, data handling (not just data cleaning, but data collection) skills, and theoretical understanding are rarely discussed. Statisticians are compensated much less than programmers--maybe deservedly so--but to the extent that I'm compensated the same as a "data scientist" who only studied computer science and didn't study any statistics, I feel like many employers, even those who benefit heavily from statisticians, don't properly compensate them.
At my alma mater, the poor wages have really hurt the statistics program as more students have decided to enroll in the "data science" (typically housed within business or computer science departments) programs. I think this is a really unfortunate trend because while those programs teach you how to implement gradient descent, do basic data wrangling in Python and R, make data visualizations, etc. they don't teach experimental design or the statistical theory that drives applied statistics. Perhaps as the supply of well-trained statisticians decreases and demand increases there will be upward wage pressure, but I think it's more likely that unqualified and inexperienced "data scientists" will continue to be shoehorned into these empty roles instead.
What do you think practicing data scientists should learn to be more effective? Experimental design? Something else?
I don't believe you need a PhD to call yourself a scientist, but I do think one common trait most scientists share is curiosity. To that end, I would encourage practicing data scientists who might not have a formal statistics education to not shy away from statistical theory. A solid theoretical understanding is what guides you when the questions and answers aren't clear--and I think the fundamental shortcoming of many data science programs is that they prepare their students for extremely simplified (and therefore unrealistic) questions with easily obtainable answers relative to what will be encountered in the real world.
I apologize for giving you an answer that isn't as coherent as I would like it to be. I tried not to be too verbose, but I think I failed at that anyway. I have a lot of feelings on this topic that I haven't fully articulated to myself yet. I could answer your first question if you're still interested, but as an addendum to my original comment: even though my statistics degree has gained me nothing in terms of career advancement or an increase in salary or opportunities, it's been truly invaluable to me as a programmer and public speaker for the critical thinking skills it taught me. I hope others continue to recognize the value of statistics in academia and remain curious about it.
But I don't think it's just a statistician shortage. It's that most researchers are rushing to get as many papers as possible published (which, yeah, is arguably an 'economic' _incentive_ to do statistics sloppily; but mostly I mean they don't feel they have _time_ to do it right), and most universities are trying to cut internally funded research budgets (meaning no money to pay all these extra statisticians, even if they were trained).
It's a "market" problem with how the work of science is actually materially rewarded and sustained. Papers, papers, papers. (Which for that matter -- a good properly trained statistician ought to have the same academic status as the researchers, but you don't get tenure by helping someone else analyze their research...)
Change needs to start at the top, otherwise the reviewers will not understand what they are reviewing, as silly as that sounds.
Couple this with the scientist going to the statistican basically saying "what can I claim with some kind of plausibility", rather than being indifferent to whether the result is interesting or groundbreaking or would make a good clickbait headline, and it's hard to figure out what's true.
It was clear to me that to do the job properly I would have to worry about every thing. And I couldn't trust any of the statisticians to do that well for me, as none of them were aware of all the things.
I suspect this is only done well in extremely narrow domains - maybe nuclear physics (e.g. CERN)? Where everyone present is extremely well educated about the statistics - not as a separate discipline, but as necessary background understanding to do non-statistical jobs too.
We need to start publishing with transparent and reproducible code from raw data to figure. Show me the data and let me make my own conclusions.
It's not too hard. I'm writing my PhD thesis, and every figure is produced from scratch and placed in the final document by a compilation script. My Jupyter notebooks are then compiled to PDF and attached to the thesis document as well. Isn't this a better way of doing the "methods" section?
Unless you have implemented some new method, I don't see why the code would be of any interest.
Instead of saying "we normalized the counts", I can show you EXACTLY what that was that I did.
If I can't see your code I don't trust you.
I think it's cleaner, more honest, more reproducible, and it helps teach younger researchers.
Huge amounts of "human" data are normally public and available for anyone to work with; it's only specific subsets that need to be private.
Granger and Newbold's paper on spurious regressions was written over 45 years ago. Granger is rolling over in his grave every time someone "discovers" a magical relationship between two time series. In all honesty, statistics is hard, and it's something you need to practice on a regular basis.
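The spurious-relationship trap is easy to demonstrate in a few lines: correlate pairs of completely independent random walks and large "relationships" appear constantly. A rough simulation sketch:

```python
import random

random.seed(1)

def random_walk(n):
    """A driftless Gaussian random walk of length n."""
    x, path = 0.0, []
    for _ in range(n):
        x += random.gauss(0, 1)
        path.append(x)
    return path

def corr(a, b):
    """Pearson correlation between two equal-length series."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

# Correlate 200 pairs of completely independent random walks
rs = sorted(abs(corr(random_walk(500), random_walk(500))) for _ in range(200))
print("median |r|:", round(rs[100], 2))  # far from the ~0 you'd expect for independent series
```

Naive significance tests on such series reject the null of no relationship far more often than the nominal rate, which is exactly the "magical relationship" problem.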
So the answer is more likely "statistical significance and more" rather than "ditch statistical significance".
When we're talking about how to take data as implying X, what is needed is: [logical reason to believe position, how the data was chosen to not bias the whole process, etc] + [data above threshold].
The data that a scientist gets "lives" inside one or another experimental box, some area. But unless the scientists also takes into account how that box and that data came to be, the scientist cannot make any definitive statement based on the properties of just the data.
The statement "Correlation does not [automatically] imply causation" and "Extraordinary claims require extraordinary evidence" both reflect this.
Or are you just summarising?
"Please don't insinuate that someone hasn't read an article. "Did you even read the article? It mentions that" can be shortened to "The article mentions that.""
- The basis of a p-value is very much aligned with the scientific process, in that you aren't trying to prove something is true; rather, you're trying to prove something false. Rejection of p-values / hypothesis testing is a bit like rejecting the scientific method. I am lucky enough to be friends with one of the physicists who worked on finding the Higgs Boson, and he hammered it into my head that their work was to go out of their way to prove the Higgs Boson was a fluke - a statistical anomaly - sheer randomness. This is a very different mentality from trying to prove your new wunder-drug is effective - especially when those pesky confidence intervals get in the way of a promotion or a new grant. It's much easier to say p-values are at fault.
- Underpinning every p-value is a distributional assumption, and that assumption needs to match the process you're actually testing, or the p-values become less meaningful.
- The 5% threshold is far too lax. It means at least 5% of published papers are reporting nonsense and nothing but dumb luck (even if they got lucky with the distribution). If the distributional assumptions aren't met, then it's even higher. Why are we choosing a 5% threshold for a process/drug that can have serious side-effects?
- p-value hacking. So many sneaky ways to find significance here. Taleb goes into some detail into the problem of p-values here https://www.youtube.com/watch?v=8qrfSh07rT0 and in a similar vein here https://www.youtube.com/watch?v=D6CxfBMUf1o.
Doing stats well is hard and open to wilful and naive abuse. The solution is not to misuse or throw away these tools but to understand them properly. If you're in research, you should think of stats as part of your education, not just a tickbox used to validate whatever experiment you're doing.
Significance requirements should be approached differently depending on the use case. The above are two extreme cases: the FDA authorizing a new drug, where significance guarantees should be rigorously obtained beforehand, and, at the other extreme, exploratory data analysis inside a private company, where data scientists may use fancy priors or unproven techniques to fish for potential discoveries in the data.
Now how much significance guarantee should be required from a lab scientist is unclear to me. Why not let lab scientists publish their lab notebook with all experiments/remarks/conjectures without any significance requirement? The current situation looks pretty much like this anyway with many papers with significance claims that are not reproducible.
We should ask the question of how much the requirement of statistical significance hinders the exploratory process of science. Maybe the current situation is fine, maybe we need new journals for "lab notebooks" with no significance requirements, etc.
On the other hand, in the mathematical literature, wrong claims are published often; see  for some examples. But mathematicians do not seem to be as critical of this as the public is of non-reproducible papers in the life sciences. Wrong mathematical proofs can be fixed, and wrong proofs that can't be fixed sometimes still contain a fruitful argument that could be helpful elsewhere. More importantly, the most difficult task is coming up with what to prove; if the proof is wrong or lacks an argument, the statement can still be pretty useful.
If "statistical significance" is just an empty phrase used to dismiss or prove something somewhat arbitrarily, then isn't the same person writing the same study likely to be just as arbitrary in declaring what is or isn't significant anyway?
It can't stop you from lying, cheating, or stealing. Nobody has ever not been able to commit fraud because of p-values. There is no formula in the world that can notice when you are cheating in what numbers you plug in...
However, if you know what you're doing, and you're honest, it's really important to know whether you are seeing shapes in clouds or actual patterns. That's what statistical significance is about.
the problem is people leaning on it too heavily and especially obsessing over the 0.05 threshold: so that p=0.049 means your result is statistically significant, you get a paper published and a press release and tenure, while p=0.051 means failure and penury.
the article is arguing specifically against this latter practice, of "bucketing" things according to some cutoff which was literally made up arbitrarily by Ronald Fisher.
"I measured no significant difference."
"There's no difference."
"I couldn't measure precisely enough to see what difference there is, if any."
You need these ingredients:
- The scientists polled for whether they're rising up need to be randomly sampled from the population of 7M
- A threshold for the fraction of scientists rising up to consider an uprising to be occurring (e.g. 90% of scientists rising up indicates an uprising)
Then you can poll your scientists and feed the results into the Binomial test to find out whether there is statistical evidence to support the hypothesis that an uprising is occurring.
Note that the sample size to do this might be quite small. In general, the required sample size is not dependent on the size of the population, but rather on how close the actual fraction of scientists rising up is to your chosen threshold. If the true fraction of scientists rising up is much smaller than your threshold, say 10%, and your threshold is 90%, it might only take a random sample of 10-20 scientists to be confident there's no uprising. But if the true fraction is 89.99999% it would take a huge sample to be confident that no uprising is occurring.
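Under those assumptions (hypothetical numbers throughout), the one-sided Binomial test is only a few lines:

```python
from math import comb

def binom_p_leq(k, n, p):
    """One-sided binomial test: P(X <= k) for X ~ Binomial(n, p),
    i.e. the chance of seeing k or fewer 'rising up' answers in a
    random sample of n if the true fraction were the threshold p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

# Hypothetical poll: 15 randomly sampled scientists, only 2 say they're rising up.
# Test against the 90% threshold for "an uprising is occurring":
print(binom_p_leq(2, 15, 0.90))  # astronomically small: safely conclude no uprising
```

Note how the population of 7M never enters the calculation, only the sample size and the threshold, which is the point about sampling error being independent of population size.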
Sampling error is independent of population size, so, no.
Here’s something you might credibly calculate: if you ask n scientists a yes/no question, but actually only half of scientists worldwide think “yes”, then what’s the chance that m or more of them in your sample say “yes”? The answer is the p value, and it’s not very hard to calculate.
The point of this article is that many, many papers do this calculation and ascribe totally inappropriate meaning to it. And, of course, when you clearly state what you actually calculated, it’s pretty obvious that it’s a terrible way to draw conclusions about what fraction of scientists think “yes”. This is the point of the Nature paper.
P.S. I don’t know what you’re trying to calculate, but I think you did it wrong. If you want your calculation checked, can you clarify your question?
Hence rising up against statistical significance.
Without context on the 800 scientists that signed, we don’t know whether it’s statistically significant or not. 800 Nobel prize winners would certainly mean more than 800 graduate students. The discipline they were in could matter. I didn’t check your math, and I’m sure it’s fine. Raw statistics cannot reject the hypothesis (or the null).
Also, it sounds like the authors didn’t sample scientists at all. They solicited signatures with some minimal qualifications. The 800 number wasn’t claimed to prove anything.
Second, the problem with significance is it's pinning "not just some random noise" to an arbitrary threshold. Results with a p-value of 0.051 and those with a p-value of 0.049 are effectively the same, and yet are treated as if they're worlds apart.
First, the authors are well-known in the area. Sander Greenland is a major figure in epidemiology methods, etc.
Second, there's been a push there, historically, to get rid of p-values because the field is primarily more focused on effect estimation, rather than pure hypothesis testing, so p-values are particularly poorly suited.
that have led to various national health systems refusing to pay for homeopathic treatment.
The studies are almost universally p-value based.
They find that when you perform an action, there is an EEG spike in the motor cortex well before you consciously decide to perform the action. The experiment is conducted with a dot running around a circle, and the subject has to report when (by noting where the dot was) they decided to act. The EEG potential is seen prior to that reported decision moment.
This is related to free will: it is as if the decision to act comes not from your conscious self but from a deeper layer.
 Libet experiment: https://www.youtube.com/watch?v=OjCt-L0Ph5o
 Bereitschaftspotential: https://en.wikipedia.org/wiki/Bereitschaftspotential
I read (in a book sitting in a science museum giftshop, so take with the appropriate grain of salt) that most of Freud's theories have been shown to be bullshit in the century and a half since he lived, but his enduring influence was showing that the vast majority of human behavior is unconscious. Instead of thinking of your consciousness as the primary actor that determines what you do, you have to think of it as a vague overseer that occasionally notices the body doing something and can intervene with enough time, exposure, and practice. And whole industries have been built upon that principle - advertising, mass media, propaganda, behavioral finance, therapy, coaching, education, gaming, gambling, cigarettes, coffee, travel, social media.
There was a HN thread 2 days ago about the rise and fall of scientific authority and how to bring it back. I cynically commented that the rise of the physical sciences came from the ability to win wars with them. You could look at the rise of the psychological and social sciences as coming from the ability to make money with them.
The consciousness is super slow. It doesn't make sense to have it "do" anything. But it's good for making executive decisions. The CEO doesn't make the product.
I'm not sure philosophically whether our inability to understand our decisions undermines our free will, but it certainly undermines any ability to consciously prime ourselves to make certain decisions - hard to have that feedback loop when you don't even know what decision you made!
I personally would confidently guess there are unconscious faculties in the mind, but I don't see how this experiment is remarkable in proving this; see here(1). Is it not an equally likely conclusion that the brain takes a few hundredths of a second to develop a conscious decision? Actually, the inverse of that is what would be remarkable.
I should have phrased the original as "it is a relatively weak example of this specific set of experimental results".
No, that's not what they find. What they find is that the time of the change in EEG (it's not really a "spike", it's more like the leading edge of an increased action potential that lasts for a significant time) is a few tenths of a second before the time that the subject reports as the time they "made the decision". But you can't assume that the time the subject reports is "the time they made the decision", because the process of generating the conclusion "this is when I decided to act" also takes time--and that time was not measured. All you can really conclude from this experiment is that people are not consciously aware of all of the neural processes that actually go into their making a conscious decision.
Basically, it feels like the decision was made a tiny bit earlier than "I" made it.
Or maybe you overcompensate when calculating where you thought the dot was.
Most people think of their consciousness as "That part of my mind which integrates all my sensory information into a coherent experience of reality at this moment, and then decides what to do within that reality." You could redefine free will as the EEG potentials themselves operating within your brain - but that's not how most people, subjectively, experience it, particularly because they are not generally aware of most of these EEG potentials directly.
Note: full video is behind YT Premium but the preview is enough to understand the experiment.
What's strange to me, is that my interpretation of the results of such an experiment wouldn't even lead to your professor's conclusion. The takeaway being the fallibility of sensory perception, where I might then prompt the class for a discussion of their intuitive refutations of empiricism before diving into the literature.
Unfortunately, being a philosophy major myself, I know all too well that a crap teacher can totally ruin a philosophy topic (let alone a topic of any subject). From my 4 years in philosophy classes of varying levels of difficulty, the common denominator between a fruitful time spent in class has been the willingness of the professor to engage with their students. Whether it's logic, metaphysics, epistemology, ontology, &c, the principal property of a quality professor is his/her dialectical ability.
Hell, that's how philosophy & theology was taught in the first universities! The professor would profess and then the students would engage their master in the subject at hand.
1. claim that one has (a quality or feeling), especially when this is not the case.
"he had professed his love for her only to walk away"
synonyms: declare, announce, proclaim, assert, state, affirm, avow, maintain, protest, aver, vow
2. affirm one's faith in or allegiance to (a religion or set of beliefs).
"a people professing Christianity"
synonyms: state/affirm one's faith in, affirm one's allegiance to, make a public declaration of, declare publicly, avow, confess, acknowledge publicly
"in 325 the Emperor himself professed Christianity"
From Latin "profiteri" (participial stem "profess-", "declared publicly"), via Latin "professor", into Late Middle English as "professor".
So a professor's practice is probably closer to definition 2: "make a public declaration of" whatever one's skill or knowledge of a particular art might be.
Furthermore, it seems strange for a professor of philosophy to so easily dismiss criticism out of hand. Of all subjects, a philosophy professor has a pedagogical imperative to entertain contradictory positions and explain why or why not one ought to follow a line of reasoning. In addition, the question about the merit of a small sample size could itself serve as a valuable aside in teaching fundamental notions in the philosophy of science.
Note: This is from the perspective of Western analytic philosophy, but the spirit of debate and discussion is no less integral to the continental tradition.
Personally, I side with Dennett (as always).
Also, many studies have n=12 and still have statistical power.
If signals like touch, smell, sight, etc. arrived at the input to your perceptual system immediately, you'd get a bunch of inputs for the same event at different times, and it would probably be difficult for your perceptual system to make sense of them.
I think the only thing the spinning wheel experiment shows is that the brain has some sort of perceptual delay/compensation mechanism that's probably there to account for these differences in input processing times. And it probably backdates the "timestamp" of the event so that things that rely on short time intervals (e.g. control tasks like balancing) still work reliably.
I don't know why anybody would think this says anything about free will. Philosophers are weird.
If some decisions that we think are made consciously are actually unconscious, then how do we know that any decisions we make are really conscious decisions?
What if our consciousness makes no decisions, and is just a figment of our imagination.
I'm not sure how this test would have proven anything, especially with such a small sample size. As another poster mentioned, this very much sounds like a test for sensory motor latency, which is absolutely a nontrivial thing for this sort of test.
If the test you described was done against trained shooters, I'd imagine the conclusion would likely be the opposite.
Results like those are what make me question free will (My mind "weighted" my answers in ways I can't control).
For example, it's common to have the intuition that the choice when picking a number from 1 to 10 is completely unbiased and independent. That would imply a perfectly uniform distribution.
Of course that doesn't match reality, and it raises more questions for the intellectually honest thinker, but most people don't think about it beyond the simple intuition.
Untrained people just don't know, for example, that long streaks of heads and tails are actually quite likely in a sequence of random coin tosses. If they did know the likelihood, and they were tasked with writing down a sequence of random coin tosses, they would probably do a much better job.
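That intuition gap is easy to quantify with a quick simulation: the chance that 100 fair tosses contain a run of at least six identical outcomes somewhere.

```python
import random

random.seed(42)

def longest_run(n_tosses):
    """Length of the longest run of identical faces in n fair coin tosses."""
    tosses = [random.randrange(2) for _ in range(n_tosses)]
    best = cur = 1
    for prev, nxt in zip(tosses, tosses[1:]):
        cur = cur + 1 if prev == nxt else 1
        best = max(best, cur)
    return best

# Chance that 100 tosses contain a run of at least 6 heads or 6 tails
trials = 10_000
hits = sum(longest_run(100) >= 6 for _ in range(trials))
print(hits / trials)  # roughly 0.8 -- far higher than most people's guess
```

A hand-written "random" sequence almost never contains runs that long, which is exactly how fabricated coin-toss data gets caught.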
If you take this to the limit, a human could learn to compute a pseudorandom generator in their head (at least in theory, although it may be very slow going) or perhaps figure out an effective way to gather entropy from the environment and turn it into a truly random sequence.
To me that shows us how outside forces influence our thoughts. Did I want to buy this garbage bag because it's a good value, has a good design, or because I've heard "DONT GET MAD, GET GLAD!!" 1000 times? Every thought we have is affected by our experiences, and many experiences can be controlled - they call these advertisements.
I find this topic fascinating. I suffer from depression, and my thoughts during a major depressive episode can be the stark opposite from when I feel well (Let's say it's about suicide). How can I truly have "free will" if my decisions are so dependent on external influences?
You don't have actually free will, if by free you mean "uninfluenced by anything".
There's no evidence that such a thing exists, and anyone who claims such a thing invariably appeals to supernatural or hand-wavy explanations.
Note that many thinkers who claim free will exists are actually saying "what most people mean by 'free will' does exist", and usually that means "an agent capable of making choices without outside influence". This idea is compatible with determinism, so these people are called compatibilists.
For most people, "outside influence" means, "at this point in time, only the contents of my brain affect the choice I make" and "no force or agent outside myself is compelling me to make a different choice than the one I'd make otherwise".
It's a word game of course. Even if you try to be extremely precise with your definitions, to avoid this kind of thing, most people won't quite follow.
Compatibilists believe what you believe. They just agree to use the popular, vague definition of free will instead of the strict one you're using.
because you can recognize this as a pattern, make a supposition that this is related to previous conditioning, and retrain yourself at picking numbers randomly to eliminate the previous bias that you yourself recognized as 'problematic' here.
The statistics just don't mean what most scientists think/write they mean. They're using em wrong. As the OP explains some manners of, in fairly technical language.
I still find Bayes to be more grounded and less “pie in the sky” than frequentists.
But seriously, you can have the most pathological prior distribution, so you can then stick with it forever. (Let's say your prior predestines you to always find that whatever the new piece of data you get is so unlikely that it's more likely that it's an error/conspiracy than a real piece of data that you have to do belief update on.)
So, instead of coming up with a significance level, you have to estimate the chance of observing a null result, which determines how much new data moves your posterior distribution. The "advantage" of the Bayesian approach is that - in theory - you can incorporate every tiny little bit of data into your model (distribution). The disadvantage is that it's very susceptible to various biases (through a biased prior).
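The pathological-prior point shows up even in the simplest conjugate model, a Beta prior on a success probability (all numbers here hypothetical):

```python
def posterior_mean(prior_a, prior_b, successes, trials):
    """Beta-Binomial conjugate update: a Beta(a, b) prior plus k successes
    in n trials gives a Beta(a + k, b + n - k) posterior."""
    return (prior_a + successes) / (prior_a + prior_b + trials)

# A weak Beta(1, 1) prior moves readily after seeing 9 successes in 10 trials...
print(round(posterior_mean(1, 1, 9, 10), 3))       # 0.833
# ...while an absurdly confident Beta(1000, 1000) prior barely budges:
print(round(posterior_mean(1000, 1000, 9, 10), 3))  # 0.502
```

The second prior encodes so much fake certainty that real evidence is effectively ignored, which is the Bayesian analogue of the biases discussed above.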
IIRC, this interview with neuroscientist Dale Stevens dives into the claims (similar to your example) from the past few years where scientists try to disprove free will: https://youtube.com/watch?v=X6VtwHpZ1BM
I should have had him solve a captcha for me, still not convinced that it wasn't just a reverse turing test.
I can't stand when people give flippant responses to fair questions - I can totally imagine your frustration in your case.
Then they just said the first thing that came to their mind rather than that they don't know.
I'm not sure how well a difference in nomenclature can fix such serious misunderstandings, but I do like the "compatibility" suggestions and the way they talk about the point estimate and endpoints of the confidence interval.